Data Collection & Preparation:

Collect the dataset:
Identify the data sources: Determine which airlines, airports, and weather agencies you need to gather data from to create your dataset.

Gather historical flight data: You can obtain historical flight data from various sources, such as government agencies, airline websites, and flight tracking websites. Some examples of data you may want to collect include flight number, origin and destination airport, scheduled departure and arrival times, actual departure and arrival times, and whether the flight was delayed or not.

Gather weather data: You can obtain weather data from weather agencies, such as the National Oceanic and Atmospheric Administration (NOAA). Some examples of weather data you may want to collect include temperature, wind speed, visibility, precipitation, and cloud cover.

Preprocess the data: Once you have gathered the data, you will need to clean and preprocess it to ensure it is in a usable format for machine learning algorithms. This may involve handling missing values, converting categorical variables to numerical, and scaling or normalizing the data.

Gather real-time data: In addition to historical data, you may also want to collect real-time data for current flights, such as weather conditions and flight information. You can obtain this data using APIs or web scraping techniques.

Combine the data: Combine the historical and real-time data to create a single dataset that you can use to train your machine learning model.

Label the data: To train a machine learning model to predict flight delays, you will need to label the dataset with whether or not each flight was delayed.

Split the dataset: Split the dataset into training and testing sets so you can evaluate the performance of your machine learning model.
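
The labeling and splitting steps above can be sketched as follows. This is a minimal illustration on made-up rows, not the notebook's actual pipeline: the `DELAYED` column name, the 15-minute threshold, and the 80/20 split are assumptions chosen for the example.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data standing in for real flight records (values are made up).
flights = pd.DataFrame({
    "MONTH": [1, 1, 2, 2, 3, 3, 4, 4, 5, 5],
    "ARR_DELAY": [-5.0, 20.0, 3.0, 45.0, 0.0, 16.0, -12.0, 30.0, 7.0, 60.0],
})

# Label (assumed rule): 1 if the flight arrived 15 or more minutes late, else 0.
flights["DELAYED"] = (flights["ARR_DELAY"] >= 15).astype(int)

features = flights[["MONTH"]]
labels = flights["DELAYED"]

# Hold out 20% for testing; stratify keeps the delayed/on-time ratio
# similar in the training and testing sets.
X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.2, random_state=0, stratify=labels
)
print(X_train.shape, X_test.shape)
```

Stratifying on the label is optional, but it tends to help when delayed flights are much rarer than on-time ones.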

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


Activity 1.1: Importing the libraries:

In [None]:

import pandas as pd
import numpy as np
import pickle
import matplotlib.pyplot as plt
%matplotlib inline

import seaborn as sns
import sklearn
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import RandomizedSearchCV
import imblearn
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix, f1_score




Activity 1.2: Read the Dataset:

In [None]:
dataset = pd.read_csv("/content/flightdata.csv")

dataset.head()

Unnamed: 0,YEAR,QUARTER,MONTH,DAY_OF_MONTH,DAY_OF_WEEK,UNIQUE_CARRIER,TAIL_NUM,FL_NUM,ORIGIN_AIRPORT_ID,ORIGIN,...,CRS_ARR_TIME,ARR_TIME,ARR_DELAY,ARR_DEL15,CANCELLED,DIVERTED,CRS_ELAPSED_TIME,ACTUAL_ELAPSED_TIME,DISTANCE,Unnamed: 25
0,2016,1,1,1,5,DL,N836DN,1399,10397,ATL,...,2143,2102.0,-41.0,0.0,0.0,0.0,338.0,295.0,2182.0,
1,2016,1,1,1,5,DL,N964DN,1476,11433,DTW,...,1435,1439.0,4.0,0.0,0.0,0.0,110.0,115.0,528.0,
2,2016,1,1,1,5,DL,N813DN,1597,10397,ATL,...,1215,1142.0,-33.0,0.0,0.0,0.0,335.0,300.0,2182.0,
3,2016,1,1,1,5,DL,N587NW,1768,14747,SEA,...,1335,1345.0,10.0,0.0,0.0,0.0,196.0,205.0,1399.0,
4,2016,1,1,1,5,DL,N836DN,1823,14747,SEA,...,607,615.0,8.0,0.0,0.0,0.0,247.0,259.0,1927.0,


Activity 2: Data Preparation:

Now that we have seen what the data looks like, let's pre-process it.
The downloaded dataset is not suitable for training a machine learning model
as is: it contains missing values and non-numeric columns, so we need to
clean it properly in order to get good results. This activity includes the
following steps.
● Handling missing values
● Handling categorical data
Note: These are the general steps of pre-processing the data before using it
for machine learning. Depending on the condition of your dataset, you may or
may not have to go through all these steps.

Activity 2.1: Handling missing values:

In [None]:
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11231 entries, 0 to 11230
Data columns (total 26 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   YEAR                 11231 non-null  int64  
 1   QUARTER              11231 non-null  int64  
 2   MONTH                11231 non-null  int64  
 3   DAY_OF_MONTH         11231 non-null  int64  
 4   DAY_OF_WEEK          11231 non-null  int64  
 5   UNIQUE_CARRIER       11231 non-null  object 
 6   TAIL_NUM             11231 non-null  object 
 7   FL_NUM               11231 non-null  int64  
 8   ORIGIN_AIRPORT_ID    11231 non-null  int64  
 9   ORIGIN               11231 non-null  object 
 10  DEST_AIRPORT_ID      11231 non-null  int64  
 11  DEST                 11231 non-null  object 
 12  CRS_DEP_TIME         11231 non-null  int64  
 13  DEP_TIME             11124 non-null  float64
 14  DEP_DELAY            11124 non-null  float64
 15  DEP_DEL15            11124 non-null 

In [None]:
dataset = dataset.drop('Unnamed: 25', axis=1)
dataset.isnull().sum()

YEAR                     0
QUARTER                  0
MONTH                    0
DAY_OF_MONTH             0
DAY_OF_WEEK              0
UNIQUE_CARRIER           0
TAIL_NUM                 0
FL_NUM                   0
ORIGIN_AIRPORT_ID        0
ORIGIN                   0
DEST_AIRPORT_ID          0
DEST                     0
CRS_DEP_TIME             0
DEP_TIME               107
DEP_DELAY              107
DEP_DEL15              107
CRS_ARR_TIME             0
ARR_TIME               115
ARR_DELAY              188
ARR_DEL15              188
CANCELLED                0
DIVERTED                 0
CRS_ELAPSED_TIME         0
ACTUAL_ELAPSED_TIME    188
DISTANCE                 0
dtype: int64
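
Raw null counts are easier to judge next to the share of rows they represent. The sketch below shows one way to summarize both on a small made-up frame; the `toy` and `missing` names and all values are hypothetical, and only the column names are borrowed from the dataset.

```python
import numpy as np
import pandas as pd

# Made-up rows; only the column names mirror the flight dataset.
toy = pd.DataFrame({
    "DEP_DEL15": [0.0, np.nan, 1.0, 0.0],
    "ARR_DEL15": [0.0, np.nan, np.nan, 1.0],
    "DISTANCE": [528.0, 2182.0, 1399.0, 1927.0],
})

# Count of nulls and percentage of rows affected, per column.
missing = pd.DataFrame({
    "count": toy.isnull().sum(),
    "percent": toy.isnull().mean() * 100,
})

# Keep only columns that actually have gaps, worst first.
missing = missing[missing["count"] > 0].sort_values("count", ascending=False)
print(missing)
```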

In [None]:
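# Keep only the feature columns and the target; copy() avoids SettingWithCopyWarning on later assignments
dataset = dataset[["FL_NUM", "MONTH", "DAY_OF_MONTH", "DAY_OF_WEEK", "ORIGIN", "DEST", "CRS_ARR_TIME", "DEP_DEL15", "ARR_DEL15"]].copy()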
dataset.isnull().sum()

FL_NUM            0
MONTH             0
DAY_OF_MONTH      0
DAY_OF_WEEK       0
ORIGIN            0
DEST              0
CRS_ARR_TIME      0
DEP_DEL15       107
ARR_DEL15       188
dtype: int64

In [None]:
dataset[dataset.isnull().any(axis=1)].head(10)

Unnamed: 0,FL_NUM,MONTH,DAY_OF_MONTH,DAY_OF_WEEK,ORIGIN,DEST,CRS_ARR_TIME,DEP_DEL15,ARR_DEL15
177,2834,1,9,6,MSP,SEA,852,0.0,
179,86,1,10,7,MSP,DTW,1632,,
184,557,1,10,7,MSP,DTW,912,0.0,
210,1096,1,10,7,DTW,MSP,1303,,
478,1542,1,22,5,SEA,JFK,723,,
481,1795,1,22,5,ATL,JFK,2014,,
491,2312,1,22,5,MSP,JFK,2149,,
499,423,1,23,6,JFK,ATL,1600,,
500,425,1,23,6,JFK,ATL,1827,,
501,427,1,23,6,JFK,SEA,1053,,


In [None]:
dataset['DEP_DEL15'].mode()

0    0.0
Name: DEP_DEL15, dtype: float64

In [None]:
# Fill missing arrival flags with 1 (delayed) and missing departure flags with 0, the mode found above
dataset = dataset.fillna({'ARR_DEL15': 1, 'DEP_DEL15': 0})
dataset.iloc[177:185]

Unnamed: 0,FL_NUM,MONTH,DAY_OF_MONTH,DAY_OF_WEEK,ORIGIN,DEST,CRS_ARR_TIME,DEP_DEL15,ARR_DEL15
177,2834,1,9,6,MSP,SEA,852,0.0,1.0
178,2839,1,9,6,DTW,JFK,1724,0.0,0.0
179,86,1,10,7,MSP,DTW,1632,0.0,1.0
180,87,1,10,7,DTW,MSP,1649,1.0,0.0
181,423,1,10,7,JFK,ATL,1600,0.0,0.0
182,440,1,10,7,JFK,ATL,849,0.0,0.0
183,485,1,10,7,JFK,SEA,1945,1.0,0.0
184,557,1,10,7,MSP,DTW,912,0.0,1.0
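
The cell above hard-codes the fill values. As an alternative sketch, the mode can be computed and applied in one step so the fill value follows the data; the `toy` frame and `most_common` variable below are hypothetical and the values are made up.

```python
import numpy as np
import pandas as pd

# Made-up departure-delay flags with two gaps.
toy = pd.DataFrame({"DEP_DEL15": [0.0, 0.0, 1.0, np.nan, 0.0, np.nan]})

# mode() ignores NaN and returns a Series (there can be ties); take the first entry.
most_common = toy["DEP_DEL15"].mode()[0]

# Replace the gaps with the most frequent value.
toy["DEP_DEL15"] = toy["DEP_DEL15"].fillna(most_common)
print(toy["DEP_DEL15"].tolist())
```

Mode imputation is a simple default for binary flags; whether it suits a given column depends on why the values are missing.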


In [None]:
# Keep only the hour of the scheduled arrival time (e.g. 2143 -> 21).
# Run this cell once: re-running divides by 100 again and collapses every value to 0.
dataset['CRS_ARR_TIME'] = dataset['CRS_ARR_TIME'] // 100
dataset.head()

Unnamed: 0,FL_NUM,MONTH,DAY_OF_MONTH,DAY_OF_WEEK,ORIGIN,DEST,CRS_ARR_TIME,DEP_DEL15,ARR_DEL15
0,1399,1,1,5,ATL,SEA,0,0.0,0.0
1,1476,1,1,5,DTW,MSP,0,0.0,0.0
2,1597,1,1,5,ATL,SEA,0,0.0,0.0
3,1768,1,1,5,SEA,MSP,0,0.0,0.0
4,1823,1,1,5,SEA,DTW,0,0.0,0.0
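
Because the cell above overwrites CRS_ARR_TIME in place, running it a second time divides the hours by 100 again, which would produce an all-zero column like the one displayed. A minimal sketch of a re-run-safe variant writes the hour to a separate column instead; `ARR_HOUR` is a hypothetical name, and the sample times are the first five CRS_ARR_TIME values shown earlier.

```python
import pandas as pd

# Scheduled arrival times in HHMM form (first five rows of the dataset).
toy = pd.DataFrame({"CRS_ARR_TIME": [2143, 1435, 1215, 1335, 607]})

# Floor-divide by 100 to drop the minutes. Writing to a new column leaves
# the source column untouched, so the cell can be re-run safely.
toy["ARR_HOUR"] = toy["CRS_ARR_TIME"] // 100
print(toy)
```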


In [None]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
dataset['DEST'] = le.fit_transform(dataset['DEST'])
dataset['ORIGIN'] = le.fit_transform(dataset['ORIGIN'])
dataset.head(5)

Unnamed: 0,FL_NUM,MONTH,DAY_OF_MONTH,DAY_OF_WEEK,ORIGIN,DEST,CRS_ARR_TIME,DEP_DEL15,ARR_DEL15
0,1399,1,1,5,0,4,0,0.0,0.0
1,1476,1,1,5,1,3,0,0.0,0.0
2,1597,1,1,5,0,4,0,0.0,0.0
3,1768,1,1,5,4,3,0,0.0,0.0
4,1823,1,1,5,4,1,0,0.0,0.0


In [None]:
dataset['ORIGIN'].unique()

array([0, 1, 4, 3, 2])
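
The integer codes are only meaningful together with the mapping that produced them. The sketch below shows how a fitted LabelEncoder exposes that mapping through `classes_` and `inverse_transform`; the `airports` list, `encoder`, and `mapping` names are made up for the example, and it uses a dedicated encoder since refitting a shared one discards the previous column's mapping.

```python
from sklearn.preprocessing import LabelEncoder

# Made-up sample of airport codes.
airports = ["ATL", "DTW", "ATL", "SEA", "SEA", "MSP", "JFK"]

encoder = LabelEncoder()
codes = encoder.fit_transform(airports)

# classes_ is sorted, so each label's position is its integer code.
mapping = {str(label): int(code) for code, label in enumerate(encoder.classes_)}
print(mapping)

# Codes can be turned back into the original labels.
print(encoder.inverse_transform([0, 4]))
```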

In [None]:
x = dataset.iloc[:, 0:8].values
y = dataset.iloc[:, 8:9].values
x

array([[1.399e+03, 1.000e+00, 1.000e+00, ..., 0.000e+00, 0.000e+00,
        1.000e+00],
       [1.476e+03, 1.000e+00, 1.000e+00, ..., 0.000e+00, 0.000e+00,
        0.000e+00],
       [1.597e+03, 1.000e+00, 1.000e+00, ..., 0.000e+00, 0.000e+00,
        1.000e+00],
       ...,
       [1.823e+03, 1.200e+01, 3.000e+01, ..., 0.000e+00, 0.000e+00,
        0.000e+00],
       [1.901e+03, 1.200e+01, 3.000e+01, ..., 0.000e+00, 0.000e+00,
        1.000e+00],
       [2.005e+03, 1.200e+01, 3.000e+01, ..., 0.000e+00, 0.000e+00,
        1.000e+00]])

In [None]:
from sklearn.preprocessing import OneHotEncoder
oh = OneHotEncoder()
# One-hot encode ORIGIN (column 4 of x) and DEST (column 5 of x)
z = oh.fit_transform(x[:, 4:5]).toarray()
t = oh.fit_transform(x[:, 5:6]).toarray()

#x=np.delete(x,[4,7],axis=1)


In [None]:
z

array([[1.],
       [1.],
       [1.],
       ...,
       [1.],
       [1.],
       [1.]])

In [None]:
t

array([[1., 0.],
       [1., 0.],
       [1., 0.],
       ...,
       [1., 0.],
       [1., 0.],
       [1., 0.]])
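
One-hot columns are usually swapped in for the integer columns they came from. The sketch below illustrates that on a small made-up matrix, not the notebook's `x`; the column layout, the `x_toy` name, and the use of one encoder per column are assumptions for the example.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Made-up rows. Assumed column order: FL_NUM, ORIGIN code, DEST code, DEP_DEL15.
x_toy = np.array([
    [1399, 0, 4, 0],
    [1476, 1, 3, 0],
    [1768, 4, 3, 1],
    [1823, 4, 1, 0],
], dtype=float)

# One encoder per column, so each fitted mapping stays available afterwards.
origin_enc = OneHotEncoder()
dest_enc = OneHotEncoder()
origin_oh = origin_enc.fit_transform(x_toy[:, 1:2]).toarray()
dest_oh = dest_enc.fit_transform(x_toy[:, 2:3]).toarray()

# Drop the integer-coded columns and put the one-hot columns in their place.
x_rest = np.delete(x_toy, [1, 2], axis=1)
x_encoded = np.concatenate([origin_oh, dest_oh, x_rest], axis=1)
print(x_encoded.shape)
```

Each one-hot block has one column per distinct value seen during fitting, so the final width depends on how many airports appear in the data.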