# <font color='#8735fb'> **RAPIDS Single-CPU Workflow - XGBoost @ Airline Delays** </font> 

<img src='https://raw.githubusercontent.com/rapidsai/cloud-ml-examples/main/aws/img/airline_dataset.png' width='1250px'>

> **1. Mount S3 Dataset**

> **2. Data Ingestion**

> **3. ETL**
-> handle missing -> split

> **4. Train Classifier**
-> XGBoost

> **5. Inference**
-> FIL

In [None]:
import glob
import os
import time

import joblib
import pandas
import xgboost

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

### <font color='#8735fb'> **Mount S3 Dataset** </font>

In [None]:
!wget https://sagemaker-rapids-hpo-us-west-2.s3-us-west-2.amazonaws.com/2_year_2020.tar.gz
!tar xvzf 2_year_2020.tar.gz

### <font color='#8735fb'> **Ingest Parquet Data** </font>

At the heart of our analysis is domestic carrier on-time reporting data, which the U.S. Bureau of Transportation Statistics has kept for decades.

This rich source of data allows us to scale: in this notebook (ML_100.ipynb) we only use 1 GPU and 1 year of data, while in the next notebook (ML200.ipynb) we'll use 10 years of data and multiple GPUs.

> **Dataset**: [US.DoT - Reporting Carrier On-Time Performance, 1987-Present](https://www.transtats.bts.gov/DL_SelectFields.asp?Table_ID=236)

The public dataset contains logs/features about flights in the United States (17 airlines) including:

* locations and distance  ( `Origin`, `Dest`, `Distance` )
* airline / carrier ( `Reporting_Airline` )
* scheduled departure and arrival times ( `CRSDepTime` and `CRSArrTime` )
* actual departure and arrival times ( `DepTime` and `ArrTime` )
* difference between scheduled & actual times ( `ArrDelay` and `DepDelay` )
* binary-encoded lateness flag, which is our target variable ( `ArrDel15` )
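
To make the target concrete, here is a minimal sketch of how a 15-minute lateness flag relates to the raw delay. It assumes the flag is 1 when `ArrDelay` is 15 minutes or more; the toy values and the `derived_flag` column name are made up for illustration and are not part of the dataset.

```python
import pandas

# Toy arrival delays in minutes (negative means early); illustrative values only.
toy = pandas.DataFrame({'ArrDelay': [-5.0, 0.0, 14.0, 15.0, 42.0]})

# Assumption: a flight counts as late when it arrives 15 or more minutes behind schedule.
toy['derived_flag'] = (toy['ArrDelay'] >= 15).astype('float32')

print(toy)
```

On the real data the same comparison could be checked against the `ArrDel15` column after missing values are removed.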

In [None]:
airline_feature_columns = ['Year', 'Quarter', 'Month', 'DayOfWeek', 
                           'Flight_Number_Reporting_Airline', 'DOT_ID_Reporting_Airline',
                           'OriginCityMarketID', 'DestCityMarketID',
                           'DepTime', 'DepDelay', 'DepDel15', 'ArrDel15',
                           'AirTime', 'Distance']

airline_label_column = 'ArrDel15'

In [None]:
file_list = '2_year_2020/'

In [None]:
data = pandas.read_parquet(file_list, columns=airline_feature_columns)

In [None]:
data

### <font color='#8735fb'> **Handle Missing** </font>

In [None]:
%%time
data = data.dropna()
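
Before dropping rows it can help to see how much data is affected. The sketch below uses a small made-up frame (not the airline data) to count missing values per column and to compare row counts before and after `dropna()`; all names other than the column labels are hypothetical.

```python
import numpy
import pandas

# Made-up rows; two of the four contain a missing value.
toy = pandas.DataFrame({
    'DepDelay': [3.0, numpy.nan, -2.0, 20.0],
    'ArrDel15': [0.0, 0.0, numpy.nan, 1.0],
    'Distance': [300.0, 1200.0, 450.0, 800.0],
})

# Number of missing entries in each column.
missing_per_column = toy.isna().sum()

# dropna() with default arguments removes every row that has at least one NaN.
rows_before = len(toy)
cleaned = toy.dropna()
rows_after = len(cleaned)

print(missing_per_column)
print(f'dropped {rows_before - rows_after} of {rows_before} rows')
```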

### <font color='#8735fb'> **Split** </font>

In [None]:
label_column = airline_label_column

train, test = train_test_split(data, random_state=0) 

# build X [ features ], y [ labels ] for the train and test subsets
y_train = train[label_column]
X_train = train.drop(label_column, axis=1)
y_test = test[label_column]
X_test = test.drop(label_column, axis=1)

In [None]:
X_train
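
`train_test_split` is called above with only `random_state`, so scikit-learn's defaults apply: the rows are shuffled and 25% of them go to the test set. The sketch below shows this on made-up data, and also shows the optional `stratify` argument (not used in this notebook) as one way to keep the share of delayed flights similar in both subsets.

```python
import pandas
from sklearn.model_selection import train_test_split

# Made-up frame with 100 rows, 20% of them flagged as delayed.
toy = pandas.DataFrame({
    'Distance': range(100),
    'ArrDel15': [1.0] * 20 + [0.0] * 80,
})

# Default split: 75% train, 25% test, shuffled with a fixed seed.
train_default, test_default = train_test_split(toy, random_state=0)
print(len(train_default), len(test_default))

# Stratified split: the label distribution is preserved in each subset.
train_strat, test_strat = train_test_split(
    toy, random_state=0, stratify=toy['ArrDel15'])
print(train_strat['ArrDel15'].mean(), test_strat['ArrDel15'].mean())
```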

### <font color='#8735fb'> **Train/Fit** </font>

In [None]:
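model_params = {
    'max_depth': 10,
    'num_boost_round': 300,  # read back below and passed to xgboost.train()
    'learning_rate': .25,
    'gamma': 0,
    'lambda': 1,
    'random_state': 0,
    'verbosity': 0,
    'seed': 0,
    'objective': 'binary:logistic',
    'tree_method': 'hist',  # histogram-based tree construction
    'nthread': os.cpu_count()  # number of CPU threads used for training
}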

In [None]:
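%%time
dtrain = xgboost.DMatrix(X_train, label=y_train)
# num_boost_round is an argument of xgboost.train(), not a booster parameter
booster_params = {k: v for k, v in model_params.items() if k != 'num_boost_round'}
trained_model = xgboost.train(booster_params, dtrain,
                              num_boost_round=model_params['num_boost_round'])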

### <font color='#8735fb'> **Predict & Score** </font>

### <font color='#8735fb'> **XGBoost Native Predict & Score** </font>

In [None]:
threshold = 0.5
dtest = xgboost.DMatrix(X_test)

In [None]:
%%time
predictions = trained_model.predict(dtest)
predictions = (predictions > threshold) * 1.0
score = accuracy_score(y_test.astype('float32'),
                       predictions.astype('float32'))

print(f'score = {score}')
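
With the `binary:logistic` objective, `predict` returns one probability per row, which is why the cell above thresholds the output before scoring. The sketch below reproduces that step on made-up numbers and adds a majority-class baseline for context; the baseline and all `toy_*` names are additions for illustration, not part of the original workflow.

```python
import numpy
from sklearn.metrics import accuracy_score

# Made-up probabilities and ground-truth labels.
toy_probabilities = numpy.array([0.10, 0.80, 0.40, 0.45, 0.05, 0.30])
toy_labels = numpy.array([0.0, 1.0, 1.0, 0.0, 0.0, 0.0])
toy_threshold = 0.5

# Convert probabilities to hard 0/1 predictions, as in the cell above.
toy_predictions = (toy_probabilities > toy_threshold).astype('float32')
toy_score = accuracy_score(toy_labels, toy_predictions)

# Accuracy obtained by always predicting the most common class.
positive_share = toy_labels.mean()
majority_baseline = max(positive_share, 1.0 - positive_share)

print(f'accuracy = {toy_score:.3f}, majority baseline = {majority_baseline:.3f}')
```

Comparing against such a baseline is useful when one class dominates, because a high accuracy alone may not indicate a useful model.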

In [None]:
model_filename = 'trained-model.xgb'
trained_model.save_model(model_filename)

### <font color='#8735fb'> **ForestInference Predict & Score** </font>

In [None]:
from cuml import ForestInference

In [None]:
reloaded_model = ForestInference.load(model_filename)

In [None]:
%%time 
fil_predictions = reloaded_model.predict(X_test)
fil_predictions = (fil_predictions > threshold) * 1.0
score = accuracy_score(y_test.astype('float32'),
                       fil_predictions.astype('float32'))

print(f'fil score = {score}')
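
Because FIL and native XGBoost score the same saved model, one would expect their thresholded predictions to agree on nearly every row. A small hypothetical helper for checking this is sketched below; `agreement_rate` is not part of cuML or XGBoost, and it is demonstrated here on made-up arrays. In the notebook it could be called with `predictions` and `fil_predictions`, assuming both are first available as NumPy-compatible arrays.

```python
import numpy

def agreement_rate(first, second):
    """Return the fraction of positions where two label arrays are equal."""
    first = numpy.asarray(first)
    second = numpy.asarray(second)
    if first.shape != second.shape:
        raise ValueError('prediction arrays must have the same shape')
    return float(numpy.mean(first == second))

# Made-up hard predictions from two inference paths.
native_toy = numpy.array([0.0, 1.0, 1.0, 0.0])
fil_toy = numpy.array([0.0, 1.0, 0.0, 0.0])

print(f'agreement = {agreement_rate(native_toy, fil_toy)}')
```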

### <font color='#8735fb'> **Additional References** </font>