# Manage Flight Data

This notebook is for exploring, cleansing, and modeling the FAA flights dataset.

- Load and inspect the data from `./data/flights.csv`
- Cleanse the data (replace nulls with zero)
- Build a model to predict the chance of delay for a given day and airport
- Save the model for external use
- Create a CSV of all airports and their IDs

In [12]:
# Step 1: Load the data and display the first few rows
import pandas as pd
df = pd.read_csv('./data/flights.csv')
df.head()




| | Year | Month | DayofMonth | DayOfWeek | Carrier | OriginAirportID | OriginAirportName | OriginCity | OriginState | DestAirportID | DestAirportName | DestCity | DestState | CRSDepTime | DepDelay | DepDel15 | CRSArrTime | ArrDelay | ArrDel15 | Cancelled |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2013 | 9 | 16 | 1 | DL | 15304 | Tampa International | Tampa | FL | 12478 | John F. Kennedy International | New York | NY | 1539 | 4 | 0.0 | 1824 | 13 | 0 | 0 |
| 1 | 2013 | 9 | 23 | 1 | WN | 14122 | Pittsburgh International | Pittsburgh | PA | 13232 | Chicago Midway International | Chicago | IL | 710 | 3 | 0.0 | 740 | 22 | 1 | 0 |
| 2 | 2013 | 9 | 7 | 6 | AS | 14747 | Seattle/Tacoma International | Seattle | WA | 11278 | Ronald Reagan Washington National | Washington | DC | 810 | -3 | 0.0 | 1614 | -7 | 0 | 0 |
| 3 | 2013 | 7 | 22 | 1 | OO | 13930 | Chicago O'Hare International | Chicago | IL | 11042 | Cleveland-Hopkins International | Cleveland | OH | 804 | 35 | 1.0 | 1027 | 33 | 1 | 0 |
| 4 | 2013 | 5 | 16 | 4 | DL | 13931 | Norfolk International | Norfolk | VA | 10397 | Hartsfield-Jackson Atlanta International | Atlanta | GA | 545 | -1 | 0.0 | 728 | -9 | 0 | 0 |


In [13]:
# Display the OriginAirportID and OriginAirportName columns
# (df was already loaded in the previous cell, so it is not read again here)
df[['OriginAirportID', 'OriginAirportName']].head()


| | OriginAirportID | OriginAirportName |
|---|---|---|
| 0 | 15304 | Tampa International |
| 1 | 14122 | Pittsburgh International |
| 2 | 14747 | Seattle/Tacoma International |
| 3 | 13930 | Chicago O'Hare International |
| 4 | 13931 | Norfolk International |


In [14]:
# Cleanse the data: replace null values with zero, then confirm that no nulls remain
df.fillna(0, inplace=True)
print(df.isnull().sum())


Year                 0
Month                0
DayofMonth           0
DayOfWeek            0
Carrier              0
OriginAirportID      0
OriginAirportName    0
OriginCity           0
OriginState          0
DestAirportID        0
DestAirportName      0
DestCity             0
DestState            0
CRSDepTime           0
DepDelay             0
DepDel15             0
CRSArrTime           0
ArrDelay             0
ArrDel15             0
Cancelled            0
dtype: int64
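
Filling every null with zero works, but it hides which columns were affected. A minimal sketch of counting nulls per column before filling is shown below. The rows are made up for illustration; only the column names come from this notebook. It also assumes `DepDel15` is the column with gaps, which is an inference from its float values (`0.0`, `1.0`) in the preview, not something the output above confirms.

```python
import numpy as np
import pandas as pd

# Made-up rows standing in for flights.csv; only the column names come from the notebook.
sample = pd.DataFrame({
    "DepDelay": [4, 3, -3, 35],
    "DepDel15": [0.0, np.nan, 0.0, 1.0],  # assumed to be the column with missing values
    "ArrDel15": [0, 1, 0, 1],
})

# Count nulls per column before filling, so it is clear what fillna will change.
null_counts = sample.isnull().sum()
print(null_counts[null_counts > 0])

# Fill only the affected column and restore an integer dtype for the 0/1 flag.
sample["DepDel15"] = sample["DepDel15"].fillna(0).astype(int)
remaining_nulls = int(sample.isnull().sum().sum())
print("Nulls remaining:", remaining_nulls)
```

Running the count first on the real `df` would show whether a blanket `fillna(0)` touches more than the delay flag.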


# Step 2: Build a Model to Predict Flight Delays
Now, let's build a model to predict the chance a flight will be delayed by more than 15 minutes for a given day and airport.

In [None]:
# Install scikit-learn if it is not already available in this environment
%pip install scikit-learn

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Binary target: ArrDel15 is the flag column (1 if delayed > 15 min, 0 otherwise)
df['Delayed'] = df['ArrDel15']

# Features: day of the week and origin airport.
# Note: OriginAirportID is an identifier, but it is passed to the model as a plain number here.
X = df[['DayOfWeek', 'OriginAirportID']]
y = df['Delayed']

# Hold out 20% of the rows for testing; the fixed seed makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train a logistic regression model
model = LogisticRegression()
model.fit(X_train, y_train)

# Evaluate: score() reports accuracy on the held-out test set
score = model.score(X_test, y_test)
print(f"Model accuracy: {score:.2f}")
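
The cell above feeds `OriginAirportID` to the model as a number, so the model treats airport 15304 as "larger" than airport 13930. It also reports accuracy, not the chance of delay itself. One possible alternative, sketched below on made-up rows, one-hot encodes both features and reads the probability from `predict_proba`. The toy data and the names `toy`, `pipeline`, and `query` are assumptions for illustration, not part of the original notebook.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Made-up training rows; only the column names come from the notebook.
toy = pd.DataFrame({
    "DayOfWeek":       [1, 1, 6, 1, 4, 6, 4, 1],
    "OriginAirportID": [15304, 14122, 14747, 13930, 13931, 15304, 13930, 13930],
    "ArrDel15":        [0, 1, 0, 1, 0, 0, 1, 1],
})
features = ["DayOfWeek", "OriginAirportID"]

# One-hot encode both columns so each weekday and each airport gets its own coefficient.
# handle_unknown="ignore" lets the model score airports it never saw during training.
encoder = ColumnTransformer(
    [("onehot", OneHotEncoder(handle_unknown="ignore"), features)]
)
pipeline = Pipeline([
    ("encode", encoder),
    ("classify", LogisticRegression(max_iter=1000)),
])
pipeline.fit(toy[features], toy["ArrDel15"])

# Estimated probability of the delayed class (label 1) for one day and airport.
query = pd.DataFrame({"DayOfWeek": [1], "OriginAirportID": [13930]})
delay_probability = pipeline.predict_proba(query)[0, 1]
print(f"Estimated chance of delay: {delay_probability:.2f}")
```

If most flights in the real data are on time, comparing the reported accuracy with the share of on-time flights is a quick check that the model does better than always predicting "no delay".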


# Step 3: Save the Model for External Use
We'll save the trained model to a file using joblib so it can be used in other applications.

In [None]:
import joblib
joblib.dump(model, './data/flight_delay_model.pkl')
print("Model saved to ./data/flight_delay_model.pkl")
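
To illustrate the "external use" side, here is a minimal round-trip sketch: dump a model with `joblib.dump`, load it back with `joblib.load`, and check that the predictions match. It trains a throwaway model on made-up rows and writes to a temporary directory so it runs on its own. The notebook's real file lives at `./data/flight_delay_model.pkl`.

```python
import os
import tempfile

import joblib
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Made-up rows so the sketch runs on its own; column names follow the notebook.
toy = pd.DataFrame({
    "DayOfWeek":       [1, 1, 6, 1, 4, 6],
    "OriginAirportID": [15304, 14122, 14747, 13930, 13931, 13930],
    "ArrDel15":        [0, 1, 0, 1, 0, 1],
})
X = toy[["DayOfWeek", "OriginAirportID"]]
trained = LogisticRegression(max_iter=1000).fit(X, toy["ArrDel15"])

# Write to a temporary directory here; the notebook writes to ./data instead.
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "flight_delay_model.pkl")
    joblib.dump(trained, model_path)
    # In the consuming application: load the file and use the estimator as usual.
    restored = joblib.load(model_path)

same_predictions = bool((restored.predict(X) == trained.predict(X)).all())
print("Restored model matches original:", same_predictions)
```

Files written by joblib are pickles, so they should only be loaded from trusted sources, ideally with the same scikit-learn version that produced them.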


# Step 4: Export Unique Airport IDs and Names
Extract the unique origin airport IDs and names from the flight data and save them to a new CSV file.

In [None]:
# Extract unique origin airport IDs and names
airports = df[['OriginAirportID', 'OriginAirportName']].drop_duplicates()
airports.to_csv('./data/airports.csv', index=False)
print("Airport list saved to ./data/airports.csv")
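
The cell above only looks at origin airports, so an airport that appears solely as a destination would be missing from the CSV. A possible way to cover both sides is sketched below on made-up rows. The output column names `AirportID` and `AirportName` are hypothetical choices, not names used elsewhere in this notebook.

```python
import pandas as pd

# Made-up rows; only the source column names come from the notebook.
flights = pd.DataFrame({
    "OriginAirportID":   [15304, 14122, 15304],
    "OriginAirportName": ["Tampa International", "Pittsburgh International",
                          "Tampa International"],
    "DestAirportID":     [12478, 13232, 14122],
    "DestAirportName":   ["John F. Kennedy International", "Chicago Midway International",
                          "Pittsburgh International"],
})

# Rename both sides to shared (hypothetical) column names so they can be stacked.
origins = flights[["OriginAirportID", "OriginAirportName"]].rename(
    columns={"OriginAirportID": "AirportID", "OriginAirportName": "AirportName"}
)
destinations = flights[["DestAirportID", "DestAirportName"]].rename(
    columns={"DestAirportID": "AirportID", "DestAirportName": "AirportName"}
)

# Stack origins and destinations, drop repeated airports, and sort by ID.
all_airports = (
    pd.concat([origins, destinations], ignore_index=True)
    .drop_duplicates()
    .sort_values("AirportID")
    .reset_index(drop=True)
)
print(all_airports)
```

The combined frame can be written out with `to_csv(..., index=False)` in the same way as the origin-only list.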
