### Which flights count as delayed?
    A flight is considered delayed if it did not land on time.
   
    A flight that departed late but arrived on time, or before its estimated landing time, is not considered a delayed flight.
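
A minimal sketch of this rule, assuming the arrival delay is expressed in minutes and that a positive value means the flight landed late. The function name `is_delayed` is hypothetical and only used for illustration.

```python
def is_delayed(arr_delay_minutes: float) -> bool:
    """Return True only when the flight landed later than scheduled."""
    return arr_delay_minutes > 0

# Departed late but landed 5 minutes early: not a delayed flight.
late_departure_early_arrival = is_delayed(-5.0)
# Landed 14 minutes late: a delayed flight.
late_arrival = is_delayed(14.0)
print(late_departure_early_arrival, late_arrival)
```

Note that the departure delay never enters the check; only the arrival matters.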


### Libraries

In [91]:
import pandas as pd
import numpy as np
import datetime

In [92]:
dataset = pd.read_csv("data.csv")
dataset.head(5)

Unnamed: 0,FL_DATE,OP_CARRIER,OP_CARRIER_FL_NUM,ORIGIN,DEST,CRS_DEP_TIME,DEP_TIME,DEP_DELAY,TAXI_OUT,WHEELS_OFF,...,CRS_ELAPSED_TIME,ACTUAL_ELAPSED_TIME,AIR_TIME,DISTANCE,CARRIER_DELAY,WEATHER_DELAY,NAS_DELAY,SECURITY_DELAY,LATE_AIRCRAFT_DELAY,Unnamed: 27
0,2018-01-01,UA,2429,EWR,DEN,1517,1512.0,-5.0,15.0,1527.0,...,268.0,250.0,225.0,1605.0,,,,,,
1,2018-01-01,UA,2427,LAS,SFO,1115,1107.0,-8.0,11.0,1118.0,...,99.0,83.0,65.0,414.0,,,,,,
2,2018-01-01,UA,2426,SNA,DEN,1335,1330.0,-5.0,15.0,1345.0,...,134.0,126.0,106.0,846.0,,,,,,
3,2018-01-01,UA,2425,RSW,ORD,1546,1552.0,6.0,19.0,1611.0,...,190.0,182.0,157.0,1120.0,,,,,,
4,2018-01-01,UA,2424,ORD,ALB,630,650.0,20.0,13.0,703.0,...,112.0,106.0,83.0,723.0,,,,,,


    As you can see, the dataset is too large to use as is.
    In this notebook we clean the data and keep only the columns that are relevant to the model.
    In the project proposal, we highlighted the features we would like to work with:
  
       Date of the Flight:	FL_DATE
       Starting Airport Code:	ORIGIN
       Destination Airport Code:	DEST
       Actual Departure Time:	DEP_TIME
       Actual Arrival Time:	ARR_TIME
       Time between wheels off and wheels on:	AIR_TIME
       Distance between the two airports:	DISTANCE
       Arrival delay, used to build the target label:	ARR_DELAY


In [93]:
dataset = dataset[['FL_DATE' , 'ORIGIN','DEST','DEP_TIME','ARR_TIME','AIR_TIME','DISTANCE','ARR_DELAY']]
dataset.head(5)

Unnamed: 0,FL_DATE,ORIGIN,DEST,DEP_TIME,ARR_TIME,AIR_TIME,DISTANCE,ARR_DELAY
0,2018-01-01,EWR,DEN,1512.0,1722.0,225.0,1605.0,-23.0
1,2018-01-01,LAS,SFO,1107.0,1230.0,65.0,414.0,-24.0
2,2018-01-01,SNA,DEN,1330.0,1636.0,106.0,846.0,-13.0
3,2018-01-01,RSW,ORD,1552.0,1754.0,157.0,1120.0,-2.0
4,2018-01-01,ORD,ALB,650.0,936.0,83.0,723.0,14.0
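
Because the full file is large, one option is to skip the unused columns while reading instead of dropping them afterwards. The sketch below is an illustration, not the notebook's method: it uses a small in-memory stand-in for `data.csv` (built from rows shown above) and the `usecols` parameter of `pd.read_csv`.

```python
import io
import pandas as pd

# Hypothetical three-row stand-in for data.csv, with one extra column (OP_CARRIER).
csv_text = """FL_DATE,OP_CARRIER,ORIGIN,DEST,DEP_TIME,ARR_TIME,AIR_TIME,DISTANCE,ARR_DELAY
2018-01-01,UA,EWR,DEN,1512.0,1722.0,225.0,1605.0,-23.0
2018-01-01,UA,LAS,SFO,1107.0,1230.0,65.0,414.0,-24.0
2018-01-01,UA,ORD,ALB,650.0,936.0,83.0,723.0,14.0
"""

wanted = ['FL_DATE', 'ORIGIN', 'DEST', 'DEP_TIME', 'ARR_TIME',
          'AIR_TIME', 'DISTANCE', 'ARR_DELAY']

# usecols makes read_csv parse only the listed columns.
small = pd.read_csv(io.StringIO(csv_text), usecols=wanted)
print(small.shape)
```

With the real file, the same call would take the file path in place of the `io.StringIO` object.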


### Dealing with Missing Values

In [94]:
print(dataset.isna().sum())
dataset.shape

FL_DATE           0
ORIGIN            0
DEST              0
DEP_TIME     112317
ARR_TIME     119245
AIR_TIME     134442
DISTANCE          0
ARR_DELAY    137040
dtype: int64


(7213446, 8)

In [95]:
dataset = dataset.dropna(axis = 0)
print(dataset.isna().sum())
dataset.shape

FL_DATE      0
ORIGIN       0
DEST         0
DEP_TIME     0
ARR_TIME     0
AIR_TIME     0
DISTANCE     0
ARR_DELAY    0
dtype: int64


(7076405, 8)
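
To put the cost of `dropna` in perspective, the share of rows removed can be worked out from the two shapes printed above. This is plain arithmetic on those numbers; the variable names are illustrative.

```python
rows_before = 7213446  # dataset.shape[0] before dropna
rows_after = 7076405   # dataset.shape[0] after dropna

rows_dropped = rows_before - rows_after
share_dropped = rows_dropped / rows_before
print(rows_dropped, f"{share_dropped:.1%}")
```

Roughly 1.9% of the rows are lost, so dropping them leaves the bulk of the data intact.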

### Change the date format

In [96]:
dataset['YEAR'] = pd.DatetimeIndex(dataset['FL_DATE']).year
dataset['DAY'] = pd.DatetimeIndex(dataset['FL_DATE']).day
dataset['MONTH'] = pd.DatetimeIndex(dataset['FL_DATE']).month
dataset = dataset[['YEAR','MONTH','DAY', 'ORIGIN','DEST','DEP_TIME','ARR_TIME','AIR_TIME','DISTANCE','ARR_DELAY']]
dataset.head(5)

Unnamed: 0,YEAR,MONTH,DAY,ORIGIN,DEST,DEP_TIME,ARR_TIME,AIR_TIME,DISTANCE,ARR_DELAY
0,2018,1,1,EWR,DEN,1512.0,1722.0,225.0,1605.0,-23.0
1,2018,1,1,LAS,SFO,1107.0,1230.0,65.0,414.0,-24.0
2,2018,1,1,SNA,DEN,1330.0,1636.0,106.0,846.0,-13.0
3,2018,1,1,RSW,ORD,1552.0,1754.0,157.0,1120.0,-2.0
4,2018,1,1,ORD,ALB,650.0,936.0,83.0,723.0,14.0
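
An equivalent way to split the date, sketched here on a tiny illustrative frame, is to parse `FL_DATE` once with `pd.to_datetime` and read the components through the `.dt` accessor. This is an alternative to the `pd.DatetimeIndex` calls above, not a change to them.

```python
import pandas as pd

# Tiny illustrative frame; only FL_DATE matters here.
frame = pd.DataFrame({'FL_DATE': ['2018-01-01', '2018-06-08', '2018-05-28']})

# Parse the strings once, then extract each component from the parsed Series.
parsed = pd.to_datetime(frame['FL_DATE'])
frame['YEAR'] = parsed.dt.year
frame['MONTH'] = parsed.dt.month
frame['DAY'] = parsed.dt.day
frame = frame.drop(columns='FL_DATE')
print(frame)
```

Parsing once avoids converting the same column three times, which can matter on a dataset with millions of rows.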


### Binary Classification
    This is a binary classification problem: "0" corresponds to a flight that arrived on time, and "1" to a flight that was delayed.

In [97]:
# change ARR_DELAY column: "0" will correspond to a flight being on time, and a "1" to a flight being delayed.
status = []
for value in dataset['ARR_DELAY']:
    if value <= 0:
        status.append(0)
    else:
        status.append(1)
dataset['ARR_DELAY'] = status
dataset.head(10)

Unnamed: 0,YEAR,MONTH,DAY,ORIGIN,DEST,DEP_TIME,ARR_TIME,AIR_TIME,DISTANCE,ARR_DELAY
0,2018,1,1,EWR,DEN,1512.0,1722.0,225.0,1605.0,0
1,2018,1,1,LAS,SFO,1107.0,1230.0,65.0,414.0,0
2,2018,1,1,SNA,DEN,1330.0,1636.0,106.0,846.0,0
3,2018,1,1,RSW,ORD,1552.0,1754.0,157.0,1120.0,0
4,2018,1,1,ORD,ALB,650.0,936.0,83.0,723.0,1
5,2018,1,1,ORD,OMA,2244.0,3.0,62.0,416.0,0
6,2018,1,1,IAH,LAS,747.0,900.0,173.0,1222.0,0
7,2018,1,1,DEN,CID,1318.0,1600.0,85.0,692.0,0
8,2018,1,1,SMF,EWR,2237.0,636.0,280.0,2500.0,0
9,2018,1,1,RIC,DEN,1559.0,1756.0,217.0,1482.0,0
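
The same labelling can be written without an explicit loop. The sketch below applies the rule used above (a delay of zero or less counts as on time) to a few illustrative values with a vectorized comparison.

```python
import pandas as pd

# Arrival delays for a few illustrative flights.
arr_delay = pd.Series([-23.0, -2.0, 0.0, 14.0])

# The comparison yields booleans; astype(int) maps False -> 0 and True -> 1.
labels = (arr_delay > 0).astype(int)
print(labels.tolist())
```

On a column with millions of rows, the vectorized form is usually much faster than appending to a Python list.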


In [98]:
dataset['ARR_DELAY'].value_counts()

0    4560356
1    2516049
Name: ARR_DELAY, dtype: int64

    The data is imbalanced: only about 35.6% of the rows have the value 1 (delayed flight).
    We will drop a large number of rows where the target variable is 0 (no delay), so that both classes have the same number of rows.
    As a result, the size of our data will be: (5032098, 10)
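
The class share and the resulting row count can be checked directly from the class counts. This is a small sketch that copies the two counts from the `value_counts()` output above into a Series.

```python
import pandas as pd

# Class counts copied from the value_counts() output above.
counts = pd.Series({0: 4560356, 1: 2516049})

# Fraction of rows in each class.
share = counts / counts.sum()
# Keeping every delayed row plus an equal number of on-time rows.
balanced_rows = 2 * int(counts.min())
print(share.round(3).to_dict(), balanced_rows)
```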

In [99]:
# Split the data into positive and negative
positive_rows = dataset.ARR_DELAY == 1.0
data_pos = dataset.loc[positive_rows]
data_neg = dataset.loc[~positive_rows]

# Merge the balanced data
dataset = pd.concat([data_pos, data_neg.sample(n = len(data_pos))], axis = 0)
# Shuffle the order of data
dataset = dataset.sample(n = len(dataset)).reset_index(drop = True)
dataset.head(5)

Unnamed: 0,YEAR,MONTH,DAY,ORIGIN,DEST,DEP_TIME,ARR_TIME,AIR_TIME,DISTANCE,ARR_DELAY
0,2018,1,6,CLE,BNA,1514.0,1532.0,67.0,448.0,0
1,2018,6,8,ORD,DEN,2144.0,2317.0,117.0,888.0,1
2,2018,5,28,JFK,MCO,802.0,1114.0,144.0,944.0,1
3,2018,3,5,LAS,RDU,2326.0,609.0,203.0,2026.0,0
4,2018,1,25,RDU,ATL,1712.0,1835.0,61.0,356.0,0


In [102]:
dataset['ARR_DELAY'].value_counts()

0    2516049
1    2516049
Name: ARR_DELAY, dtype: int64
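
The random draws above change on every run. A reproducible variant could look like the sketch below, which wraps the same undersampling idea in a helper and fixes the seed through `random_state`. The helper name `undersample_majority` and the toy frame are hypothetical.

```python
import pandas as pd

def undersample_majority(frame, target, seed=0):
    """Sample every class down to the size of the smallest class, then shuffle."""
    counts = frame[target].value_counts()
    n_min = int(counts.min())
    parts = [frame[frame[target] == cls].sample(n=n_min, random_state=seed)
             for cls in counts.index]
    # frac=1 shuffles all rows; reset_index(drop=True) renumbers them from 0.
    return pd.concat(parts).sample(frac=1, random_state=seed).reset_index(drop=True)

# Toy frame with 3 delayed and 7 on-time flights.
toy = pd.DataFrame({'DISTANCE': range(10),
                    'ARR_DELAY': [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]})
balanced = undersample_majority(toy, 'ARR_DELAY')
print(balanced['ARR_DELAY'].value_counts().to_dict())
```

With a fixed seed, rerunning the notebook selects the same on-time rows, which makes later results easier to compare.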

### One Hot encoding
       Logistic Regression cannot use non-numeric data,
       so we use one-hot encoding from the pandas library to convert the non-numerical columns to numerical values.
       It is **important** to mention that the encoded dataset will be significantly wider (5032098, 724).
       Thus, the model will work with a random sample that is a tenth of the size of the encoded dataset.

In [104]:
dataset = pd.get_dummies(dataset,columns = ['ORIGIN','DEST'])
# Move the target column to the end of the frame
y_label = dataset.pop('ARR_DELAY')
dataset = dataset.join(y_label)
dataset.head(5)

Unnamed: 0,YEAR,MONTH,DAY,DEP_TIME,ARR_TIME,AIR_TIME,DISTANCE,ORIGIN_ABE,ORIGIN_ABI,ORIGIN_ABQ,...,DEST_VEL,DEST_VLD,DEST_VPS,DEST_WRG,DEST_WYS,DEST_XNA,DEST_YAK,DEST_YNG,DEST_YUM,ARR_DELAY
0,2018,1,6,1514.0,1532.0,67.0,448.0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,2018,6,8,2144.0,2317.0,117.0,888.0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
2,2018,5,28,802.0,1114.0,144.0,944.0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
3,2018,3,5,2326.0,609.0,203.0,2026.0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,2018,1,25,1712.0,1835.0,61.0,356.0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
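
To see why the column count grows, here is one-hot encoding on a toy frame with three origins and two destinations. The frame is illustrative; `dtype=int` is passed because recent pandas versions return boolean columns by default, while the output above shows 0/1 values.

```python
import pandas as pd

toy = pd.DataFrame({'ORIGIN': ['EWR', 'LAS', 'ORD'],
                    'DEST': ['DEN', 'SFO', 'DEN'],
                    'DISTANCE': [1605.0, 414.0, 723.0]})

# Each distinct airport code becomes its own 0/1 column.
encoded = pd.get_dummies(toy, columns=['ORIGIN', 'DEST'], dtype=int)
print(list(encoded.columns))
```

Three columns become six here: one untouched numeric column, three `ORIGIN_*` columns, and two `DEST_*` columns. The same mechanism, applied to hundreds of airport codes, produces the wide frame shown above.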


In [110]:
newdataset = dataset.sample(frac = 0.1)
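# reset_index() returns a new DataFrame; its result is not assigned here, so newdataset keeps the sampled index labels
newdataset.reset_index()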
newdataset.head(5)

Unnamed: 0,YEAR,MONTH,DAY,DEP_TIME,ARR_TIME,AIR_TIME,DISTANCE,ORIGIN_ABE,ORIGIN_ABI,ORIGIN_ABQ,...,DEST_VEL,DEST_VLD,DEST_VPS,DEST_WRG,DEST_WYS,DEST_XNA,DEST_YAK,DEST_YNG,DEST_YUM,ARR_DELAY
2502416,2018,1,12,1316.0,1738.0,178.0,1303.0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
3428305,2018,6,19,1758.0,1938.0,260.0,1999.0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
140995,2018,10,10,905.0,1747.0,321.0,2548.0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
57505,2018,6,22,1107.0,1314.0,156.0,1014.0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
6726,2018,1,12,558.0,836.0,138.0,919.0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
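
If a reproducible sample with a clean 0..n-1 index is preferred, one option is sketched below on a toy frame. `random_state` fixes the draw, and the result of `reset_index(drop=True)` is assigned back because the method returns a new DataFrame rather than modifying the original.

```python
import pandas as pd

# Toy frame standing in for the encoded dataset.
toy = pd.DataFrame({'DISTANCE': range(100)})

# Draw 10% of the rows with a fixed seed, then renumber the index from 0.
subset = toy.sample(frac=0.1, random_state=0).reset_index(drop=True)
print(len(subset), list(subset.index))
```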
