# Predicting Flight Delays

This example uses classification models to predict whether a flight is cancelled, using the `Cancelled` column of the flights dataset as the label.
The original example can be found [here](https://github.com/frenchlam/dask_CDSW/blob/master/03_Dask_ML-LargeDS.ipynb) (the dataset is [here](https://github.com/frenchlam/dask_CDSW/blob/master/data/1988.csv.bz2)).

### Verifying your setup
Run the following code to verify that your Bodo environment is set up correctly:

In [2]:
import bodo

@bodo.jit
def hello():
    print("Hello World from rank", bodo.get_rank(), " Total ranks =", bodo.get_size())

hello()

Hello World from rank 8  Total ranks = 10
Hello World from rank 1  Total ranks = 10
Hello World from rank 0  Total ranks = 10
Hello World from rank 7  Total ranks = 10
Hello World from rank 2  Total ranks = 10
Hello World from rank 9  Total ranks = 10
Hello World from rank 4  Total ranks = 10
Hello World from rank 3  Total ranks = 10
Hello World from rank 6  Total ranks = 10
Hello World from rank 5  Total ranks = 10


## Importing the Packages

These are the main packages we are going to work with:
 - Bodo to parallelize Python code automatically
 - Pandas to work with data
 - Scikit-learn to build and evaluate classification models

In [3]:
import pandas as pd
import time
import numpy as np

## Part 1: Pre-processing in Pandas

### 1. Read flights dataset

In [4]:
@bodo.jit(cache=True)
def read_flights(input_file):
    flight_df = pd.read_csv(input_file, sep=',', header=0,
        usecols=['Month', 'DayofMonth', 'DayOfWeek', 'CRSDepTime', 'CRSArrTime', 'UniqueCarrier', 'FlightNum', 'Origin', 'Dest','Cancelled'])    
    print(flight_df.head())    
    return flight_df

input_file = "s3://bodo-example-data/flights/1988.csv.bz2"
flight_df = read_flights(input_file)


   Month  DayofMonth  DayOfWeek  CRSDepTime  ...  FlightNum Origin  Dest Cancelled
0      1           9          6        1331  ...        942    SYR   BWI         0
1      1          10          7        1331  ...        942    SYR   BWI         0
2      1          11          1        1331  ...        942    SYR   BWI         0
3      1          12          2        1331  ...        942    SYR   BWI         0
4      1          13          3        1331  ...        942    SYR   BWI         0

[5 rows x 10 columns]
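
The `usecols` argument limits parsing to the listed columns, which is why only 10 columns appear above. A minimal pandas-only sketch of the same idea on an in-memory CSV follows; the sample rows and the extra `Distance` column are made up for illustration and are not taken from the real file.

```python
import io
import pandas as pd

# Hypothetical three-row sample; values are illustrative only.
csv_text = """Month,DayofMonth,DayOfWeek,CRSDepTime,Origin,Dest,Distance,Cancelled
1,9,6,1331,SYR,BWI,273,0
1,10,7,1331,SYR,BWI,273,0
1,11,1,1331,SYR,BWI,273,1
"""

wanted = ["Month", "DayofMonth", "Origin", "Dest", "Cancelled"]
# Only the columns named in `usecols` are parsed and kept.
sample_df = pd.read_csv(io.StringIO(csv_text), sep=",", header=0, usecols=wanted)

print(sample_df.columns.tolist())
print(sample_df.shape)
```

Columns that are not listed (here `DayOfWeek`, `CRSDepTime` and `Distance`) never reach the DataFrame.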


### 2. Feature Engineering
1. Create routes from origin and destination

In [5]:
@bodo.jit(cache=True)
def create_routes(flight_df):
    flight_df['route'] = flight_df['Origin'] + "_" + flight_df['Dest']
    # show the top 10 routes, as defined by number of flights
    top_routes = flight_df['route'].value_counts(ascending=False)
    print(top_routes.head(10))
    # focus on the 50 biggest routes, as defined by number of flights
    route_lst=top_routes.head(50)
    flight_df = flight_df[flight_df['route'].isin(route_lst.index)]
    return flight_df

flight_df = create_routes(flight_df)

route
LAX_SFO    20750
SFO_LAX    20658
LAX_PHX    13461
PHX_LAX    13273
LAX_LAS    12175
LGA_BOS    12027
LAS_LAX    11801
SJC_LAX    11535
LAX_SJC    11292
BOS_LGA    11141
Name: count, dtype: int64
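
The route step can be tried without Bodo on a toy frame. This is a sketch in plain pandas, with made-up rows and a hypothetical `top_n` of 2 instead of the 50 used above:

```python
import pandas as pd

# Toy flights; the rows are illustrative only.
toy = pd.DataFrame({
    "Origin": ["LAX", "LAX", "SFO", "LAX", "BOS", "SFO"],
    "Dest":   ["SFO", "SFO", "LAX", "PHX", "LGA", "LAX"],
})

# A route is the ordered origin/destination pair.
toy["route"] = toy["Origin"] + "_" + toy["Dest"]

# value_counts sorts by frequency, most frequent first, by default.
route_counts = toy["route"].value_counts()

top_n = 2  # hypothetical; the notebook keeps the 50 biggest routes
top_routes = route_counts.head(top_n).index
filtered = toy[toy["route"].isin(top_routes)]

print(route_counts)
print(len(filtered))
```

Note that `LAX_SFO` and `SFO_LAX` count as different routes because the pair is ordered.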


2. Look at cancellations on these routes

In [6]:
@bodo.jit(cache=True)
def check_cancelations(flight_df):
    res = flight_df[['route', 'Cancelled', 'Month']].groupby(by='route')\
         .agg({'Month':'size', 'Cancelled':'sum'})\
        .rename(columns={'Month':'count','Cancelled':'nb_cancelled'}) \
        .reset_index()\
        .sort_values(['count'], ascending=False)
    print(res.head(10))

check_cancelations(flight_df)

      route  count  nb_cancelled
7   LAX_SFO  20750           228
45  SFO_LAX  20658           206
13  LAX_PHX  13461            78
8   PHX_LAX  13273            71
0   LAX_LAS  12175            58
43  LGA_BOS  12027           287
      route  count  nb_cancelled
41  LAS_LAX  11801            47
44  SJC_LAX  11535            71
36  LAX_SJC  11292            71
12  BOS_LGA  11141           243
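
Raw counts hide the fact that routes differ in how often flights are cancelled: in the output above, LGA_BOS has 287 cancellations in 12027 flights (about 2.4%), against 228 in 20750 (about 1.1%) for LAX_SFO. A possible way to add a rate column is sketched below in plain pandas on toy data; the `cancel_rate` column name is an assumption, not part of the original notebook.

```python
import pandas as pd

# Toy data; values are illustrative only.
toy = pd.DataFrame({
    "route": ["LAX_SFO", "LAX_SFO", "LAX_SFO", "LAX_SFO", "LGA_BOS", "LGA_BOS"],
    "Cancelled": [0, 1, 0, 0, 1, 0],
})

# Named aggregation: one output column per (input column, function) pair.
summary = (
    toy.groupby("route")
       .agg(count=("Cancelled", "size"), nb_cancelled=("Cancelled", "sum"))
       .reset_index()
)
# Hypothetical extra column: share of flights cancelled on each route.
summary["cancel_rate"] = summary["nb_cancelled"] / summary["count"]
summary = summary.sort_values("count", ascending=False)

print(summary)
```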


Bodo automatically distributes the data across the worker processes. You can view this distribution by running the simple JIT-compiled function below, which prints the shape of the DataFrame as seen by each process.

In [7]:
@bodo.jit
def print_info(flight_df):
    with bodo.objmode:
        print(flight_df.shape)
print_info(flight_df)

(51016, 11)
(46886, 11)
(52800, 11)
(50445, 11)
(53258, 11)
(47820, 11)
(49994, 11)
(41944, 11)
(51026, 11)
(42064, 11)
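
Assuming each line above is the shape of one rank's local chunk, the row counts should add up to the size of the filtered dataset. A quick plain-Python check of the total and of how evenly the rows are spread:

```python
# Row counts per rank, copied from the shapes printed above.
rows_per_rank = [51016, 46886, 52800, 50445, 53258,
                 47820, 49994, 41944, 51026, 42064]

total_rows = sum(rows_per_rank)
mean_rows = total_rows / len(rows_per_rank)
# Ratio of the largest chunk to the average: 1.0 means perfectly balanced.
imbalance = max(rows_per_rank) / mean_rows

print(total_rows)  # 487253
print(round(imbalance, 3))
```

The chunks are uneven because the filter on the top 50 routes removes a different number of rows on each rank.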


3. Quick sanity check: count the number of null values in each column

In [8]:
@bodo.jit
def check_count(flight_df):
    print(flight_df.isnull().sum())
    
check_count(flight_df)

Month            0
DayofMonth       0
DayOfWeek        0
CRSDepTime       0
CRSArrTime       0
UniqueCarrier    0
FlightNum        0
Origin           0
Dest             0
Cancelled        0
route            0
dtype: int64


### 3. Feature and label encoding

#### 1. Encode Labels using Cancelled column

In [9]:
@bodo.jit(cache=True)
def encode_labels(flight_df):
    flight_df['Cancelled'] = pd.Categorical(flight_df["Cancelled"])
    flight_df['Label'] = flight_df['Cancelled'].cat.codes
    flight_df.drop(['Cancelled'], axis=1, inplace=True)
    return flight_df

flight_df = encode_labels(flight_df)
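
`cat.codes` replaces each value with its position in the sorted list of categories. Assuming `Cancelled` only holds 0 and 1 (as the sample rows suggest), the codes equal the original values. A small pandas-only sketch:

```python
import pandas as pd

# Toy column; values are illustrative only.
cancelled = pd.Series([0, 1, 0, 0, 1])

# Categories are the sorted unique values, here [0, 1].
as_cat = pd.Series(pd.Categorical(cancelled))
codes = as_cat.cat.codes

print(list(as_cat.cat.categories))
print(codes.tolist())
```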

#### 2. Feature Encoding

This is needed because the scikit-learn models used below only accept numerical values.

a. Get the unique airport values

b. Encode origin, destination, and route features

In [10]:
import numpy as np

@bodo.jit(cache=True)
def get_airport_list(flight_df):
    airport_list = np.sort((pd.concat((flight_df['Origin'], flight_df['Dest']))).unique())
    return airport_list

airport_list = get_airport_list(flight_df)

In [11]:
from sklearn.preprocessing import LabelEncoder
@bodo.jit(cache=True)
def encode_features(flight_df, airport_list):
    t1 = time.time()    
    # encode airlines 
    le_carrier = LabelEncoder()
    flight_df['Carrier_encoded'] = pd.Series(le_carrier.fit_transform(flight_df['UniqueCarrier'].values))
    # Encode airports : Using same encoder for both origin and dest ( consistent encoding of airports )
    le_airport = LabelEncoder()
    le_airport.fit(airport_list)
    flight_df['Origin_encoded'] = pd.Series(le_airport.transform(flight_df['Origin']))
    flight_df['Dest_encoded'] = pd.Series(le_airport.transform(flight_df['Dest']))
    # Encode routes 
    le_route = LabelEncoder()
    flight_df['route_encoded'] = pd.Series(le_route.fit_transform(flight_df['route'].values))
    print("Encoding time: ", (time.time()-t1), " sec")
    return flight_df

flight_df = encode_features(flight_df, airport_list)

Encoding time:  2.776745000000119  sec
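
Fitting one encoder on the combined list of airports means an airport gets the same integer whether it appears as an origin or as a destination. A minimal sketch without Bodo, on a made-up three-row frame:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy flights; the rows are illustrative only.
toy = pd.DataFrame({
    "Origin": ["SFO", "LAX", "BOS"],
    "Dest":   ["LAX", "SFO", "LGA"],
})

# Sorted union of all airports seen in either column.
airports = np.sort(pd.concat((toy["Origin"], toy["Dest"])).unique())

# One encoder, fitted once, reused for both columns.
le_airport = LabelEncoder().fit(airports)
toy["Origin_encoded"] = le_airport.transform(toy["Origin"])
toy["Dest_encoded"] = le_airport.transform(toy["Dest"])

print(list(le_airport.classes_))
print(toy)
```

Here `LGA` only appears as a destination, yet it still has a code because the encoder saw both columns. Fitting separate encoders per column would assign unrelated integers to the same airport.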


In [12]:
@bodo.jit(cache=True)
def sample(flight_df):
    print(flight_df[['UniqueCarrier','Carrier_encoded','Origin','Origin_encoded',
           'Dest', 'Dest_encoded', 'route', 'route_encoded' ]].sample(10))
    
sample(flight_df)

       UniqueCarrier  Carrier_encoded  ...    route  route_encoded
49335             UA               11  ...  ORD_DTW             29
234763            CO                2  ...  LAX_SJC             21
90259             WN               13  ...  HOU_DAL             12
141524            NW                6  ...  ORD_MSP             32

[4 rows x 8 columns]
       UniqueCarrier  Carrier_encoded  ...    route  route_encoded
551512            EA                4  ...  DCA_LGA              6

[1 rows x 8 columns]
        UniqueCarrier  Carrier_encoded  ...    route  route_encoded
5157376            AS                1  ...  SEA_SFO             44

[1 rows x 8 columns]
        UniqueCarrier  Carrier_encoded  ...    route  route_encoded
1761593            UA               11  ...  LGA_ORD             24

[1 rows x 8 columns]
        UniqueCarrier  Carrier_encoded  ...    route  route_encoded
2480363            EA                4  ...  DCA_BOS              5

[1 rows x 8 columns]

In [13]:
from sklearn.model_selection import train_test_split
@bodo.jit(cache=True)
def split_data(flight_df):
    t1 = time.time()
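    # Note: 'Label' is not in the drop list, so it stays in the feature frame as well as being the target
    X_train, X_test, y_train, y_test = train_test_split(flight_df.drop(['UniqueCarrier','Origin','Dest','route'],axis=1),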
                                                    flight_df['Label'], 
                                                    test_size=0.3, train_size=0.7,
                                                    random_state=100)
    print("Data splitting time: ", (time.time()-t1), " sec")    

    return X_train, X_test, y_train, y_test

X_train, X_test, y_train, y_test = split_data(flight_df)

Data splitting time:  4.1812700000000405  sec
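
Note that `split_data` drops the string columns but not `Label`, so the target column is still part of the feature frame. One possible variant that keeps the target out of the features is sketched below on toy data in plain scikit-learn; the `features` and `target` names are illustrative.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame; values are illustrative only.
toy = pd.DataFrame({
    "Month": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    "Carrier_encoded": [0, 1, 0, 1, 0, 1, 0, 1, 0, 1],
    "Label": [0, 0, 0, 1, 0, 0, 1, 0, 0, 0],
})

features = toy.drop(["Label"], axis=1)  # keep the target out of X
target = toy["Label"]

X_tr, X_te, y_tr, y_te = train_test_split(
    features, target, test_size=0.3, random_state=100
)

print(X_tr.shape, X_te.shape)
```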


## Part 2: Model Training - Using Scikit-learn

### 1. RandomForestClassifier

In [14]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score # evaluation metric
@bodo.jit(cache=True)
def rf_model(X_train, X_test, y_train, y_test):
    start = time.time()
    rf = RandomForestClassifier()
    rf.fit(X_train.to_numpy(), y_train.values)
    y_pred = rf.predict(X_test)
    print("RandomForestClassifier fit and predict time: ", time.time()-start)    
    print('Accuracy score {}'.format(accuracy_score(y_test, y_pred)))

rf_model(X_train, X_test, y_train, y_test)

RandomForestClassifier fit and predict time:  3.7298070000001644
Accuracy score 1.0
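
An accuracy of exactly 1.0 is often a sign that the label has leaked into the features, and here the feature frame produced by `split_data` still contains the `Label` column. The sketch below uses synthetic data (all names and numbers are hypothetical) to show how a leaked label column alone can produce a perfect score, even when the other features carry no signal:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 200
noise = rng.normal(size=(n, 3))    # features unrelated to the target
y = rng.integers(0, 2, size=n)     # random binary labels

X_clean = noise
X_leaky = np.column_stack([noise, y])  # the label sneaks in as a 4th feature

def holdout_accuracy(X, y):
    """Fit a small random forest and return accuracy on a 30% hold-out."""
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=100)
    # max_features=None lets every split consider every column.
    rf = RandomForestClassifier(n_estimators=20, max_features=None, random_state=0)
    rf.fit(X_tr, y_tr)
    return accuracy_score(y_te, rf.predict(X_te))

clean_acc = holdout_accuracy(X_clean, y)
leaky_acc = holdout_accuracy(X_leaky, y)

print(clean_acc, leaky_acc)
```

With the leaked column every tree can split directly on the label, so the hold-out accuracy is perfect; without it the model is left with noise.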




### 2. Logistic Regression

In [15]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score  # evaluation metric
@bodo.jit(cache=True)
def lr_model(X_train, X_test, y_train, y_test):
    start = time.time()
    lr = LogisticRegression()
    lr.fit(X_train, y_train.values)
    y_pred = lr.predict(X_test)
    print("Logistic Regression fit and predict time: ", time.time()-start)    
    print('Accuracy score {}'.format(accuracy_score(y_test, y_pred)))

lr_model(X_train, X_test, y_train, y_test)

Logistic Regression fit and predict time:  0.8788690000001225
Accuracy score 0.9815770030647986
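
Cancellations are rare in this dataset (for example 228 of 20750 flights on LAX_SFO, about 1.1%), so accuracy on its own can look high even for a model that never predicts a cancellation. A small sketch with synthetic labels (the 2% rate is an illustrative assumption) compares accuracy with recall for such a baseline:

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Synthetic labels: 2 cancellations in 100 flights (illustrative numbers).
y_true = np.array([1] * 2 + [0] * 98)
# A "model" that never predicts a cancellation.
y_always_on_time = np.zeros_like(y_true)

baseline_acc = accuracy_score(y_true, y_always_on_time)
baseline_recall = recall_score(y_true, y_always_on_time)

print(baseline_acc, baseline_recall)
```

Reporting recall or a confusion matrix next to accuracy makes it easier to tell a useful classifier from this kind of baseline.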


