# <font color='#8735fb'> **Dask Multi-CPU Workflow - XGBoost @ Airline Delays** </font> 

<img src='https://raw.githubusercontent.com/rapidsai/cloud-ml-examples/main/aws/img/airline_dataset.png' width='1250px'>

> **1. Mount S3 Dataset**

> **2. Data Ingestion**

> **3. ETL**
-> handle missing -> split

> **4. Train Classifier**
-> XGBoost

> **5. Inference**
-> FIL

In [None]:
import os
import dask
import dask.dataframe  # explicit import so dask.dataframe.read_parquet is always available
from dask.distributed import Client, LocalCluster, wait

from dask_ml.model_selection import train_test_split
from cuml.dask.common.utils import persist_across_workers

import xgboost
from sklearn.metrics import accuracy_score

try:
    from cuml import ForestInference
except Exception as error:
    print(error)
    
import glob

### <font color='#8735fb'> **Mount S3 Dataset** </font>

In [None]:
!wget https://sagemaker-rapids-hpo-us-west-2.s3-us-west-2.amazonaws.com/2_year_2020.tar.gz
!tar xvzf 2_year_2020.tar.gz

### <font color='#8735fb'> **Create Cluster** </font>

In [None]:
cluster = LocalCluster(n_workers=os.cpu_count())
client = Client(cluster)
client

### <font color='#8735fb'> **Ingest Parquet Data** </font>

At the heart of our analysis is domestic carrier on-time reporting data that the U.S. Bureau of Transportation Statistics has kept for decades.

This rich source of data allows us to scale, so while in this notebook (ML_100.ipynb) we only use 1 GPU and 1 year of data, in the next notebook (ML200.ipynb) we'll use 10 years of data and multiple GPUs.

> **Dataset**: [US.DoT - Reporting Carrier On-Time Performance, 1987-Present](https://www.transtats.bts.gov/DL_SelectFields.asp?Table_ID=236)

The public dataset contains logs/features about flights in the United States (17 airlines) including:

* locations and distance  ( `Origin`, `Dest`, `Distance` )
* airline / carrier ( `Reporting_Airline` )
* scheduled departure and arrival times ( `CRSDepTime` and `CRSArrTime` )
* actual departure and arrival times ( `DepTime` and `ArrTime` )
* difference between scheduled & actual times ( `ArrDelay` and `DepDelay` )
* binary encoded version of late, aka our target variable ( `ArrDel15` )
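
To make the target concrete: `ArrDel15` is an indicator that is 1 when a flight arrives 15 or more minutes late. The following is a minimal sketch of that encoding in plain Python; the helper name `arr_del15` and the sample delays are hypothetical and exist only for illustration.

```python
from typing import Optional


def arr_del15(arr_delay_minutes: Optional[float]) -> Optional[float]:
    """Return 1.0 if the arrival delay is 15 minutes or more, else 0.0.

    Missing delays stay missing (None); rows with missing values
    are dropped later in the ETL step.
    """
    if arr_delay_minutes is None:
        return None
    return 1.0 if arr_delay_minutes >= 15 else 0.0


# hypothetical arrival delays in minutes (negative means early)
sample_delays = [-5.0, 0.0, 14.0, 15.0, 42.0, None]
labels = [arr_del15(d) for d in sample_delays]
print(labels)
```

Note that a delay of exactly 15 minutes already counts as late.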

In [None]:
airline_feature_columns = ['Year', 'Quarter', 'Month', 'DayOfWeek', 
                           'Flight_Number_Reporting_Airline', 'DOT_ID_Reporting_Airline',
                           'OriginCityMarketID', 'DestCityMarketID',
                           'DepTime', 'DepDelay', 'DepDel15', 'ArrDel15',
                           'AirTime', 'Distance']
airline_label_column = 'ArrDel15'

In [None]:
data_target = "2_year_2020/*.parquet"

In [None]:
glob.glob(data_target)

In [None]:
%%time
data = dask.dataframe.read_parquet(data_target,  
                                   columns=airline_feature_columns)

In [None]:
# dask lesson : data is a future/lazy dataframe
data

In [None]:
# dask lesson : data is also a compute graph
data.visualize()
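
The "compute graph" idea can be illustrated without a cluster. Dask describes work as a dictionary that maps keys to either values or tasks, where a task is a tuple of a function followed by its arguments. The sketch below is a toy evaluator for such a dictionary, not Dask's scheduler: the function `get` and the graph contents are hypothetical, and the real schedulers add caching, parallelism, and data movement on top.

```python
from operator import add, mul


def get(graph, key):
    """Recursively evaluate `key` in a Dask-style graph dictionary."""
    task = graph[key]
    # a task is a tuple whose first element is callable
    if isinstance(task, tuple) and callable(task[0]):
        func, *args = task
        # string arguments that name another key are evaluated first
        resolved = [
            get(graph, a) if isinstance(a, str) and a in graph else a
            for a in args
        ]
        return func(*resolved)
    # anything else is a plain value
    return task


# building the graph does no work; it only records what to do
graph = {
    "x": 1,
    "y": (add, "x", 10),
    "z": (mul, "y", 2),
}

# work happens only when a result is requested
result = get(graph, "z")
print(result)
```

Here nothing is computed while `graph` is being built, which mirrors why `data` above is only a description of the work until computation is triggered.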

### <font color='#8735fb'> **Handle Missing** [ ETL ] </font>

In [None]:
%%time
data = data.dropna()

In [None]:
# dask lesson : note that now our compute graph has been extended
data.visualize()

In [None]:
# dask lesson: by adding operations we increase the complexity of the graph
data.sum().visualize()

In [None]:
# dask lesson: graphs can become really intricate but we'll let dask worry about that
data.mean().visualize()

### <font color='#8735fb'> **Split** </font>

In [None]:
label_column = airline_label_column

train, test = train_test_split(data, random_state=0, shuffle=True)

# build X [ features ], y [ labels ] for the train and test subsets
y_train = train[label_column]
X_train = train.drop(label_column, axis=1)

y_test = test[label_column]
X_test = test.drop(label_column, axis=1)
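
As a rough picture of what the split does, here is a small pure-Python sketch of a seeded, shuffled split followed by separating the label from the features. It is not how `train_test_split` is implemented; the helper names, the toy rows, and the 10% test fraction are all assumptions chosen for illustration.

```python
import random


def split_rows(rows, test_fraction=0.1, seed=0):
    """Shuffle rows with a fixed seed and hold out a fraction for testing."""
    rng = random.Random(seed)
    shuffled = rows[:]  # copy so the input order is left untouched
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def split_xy(rows, label):
    """Separate the label column from the feature columns."""
    X = [{k: v for k, v in row.items() if k != label} for row in rows]
    y = [row[label] for row in rows]
    return X, y


# hypothetical toy rows using a few of the notebook's column names
rows = [
    {"Distance": 100 * i, "DepDelay": i, "ArrDel15": float(i % 2)}
    for i in range(20)
]

train_rows, test_rows = split_rows(rows)
X_train_demo, y_train_demo = split_xy(train_rows, "ArrDel15")
print(len(train_rows), len(test_rows))
```

Fixing the seed plays the same role as `random_state=0` above: repeated runs produce the same partition.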

In [None]:
# dask lesson: let's check in on our compute graph so far [ nothing has been evaluated yet ]
X_train.visualize()

### <font color='#8735fb'> **Persist** </font>

In [None]:
# force execution
X_train, y_train, X_test, y_test = client.persist([X_train, y_train, X_test, y_test])
wait([X_train, y_train, X_test, y_test]);

In [None]:
# dask lesson: once we trigger computation via persist, the graph collapses to fully realized results
X_train.visualize()

### <font color='#8735fb'> **Train/Fit** </font>

In [None]:
model_params = {
    'max_depth': 10,
    'num_boost_round': 300,  # not a booster parameter; passed to the train call separately
    'learning_rate': .25,
    'gamma': 0,
    'lambda': 1,
    'random_state': 0,
    'verbosity': 0,
    'seed': 0,
    'objective': 'binary:logistic',
    'tree_method': 'hist',
    'nthread': os.cpu_count()  # XGBoost's parameter name is 'nthread'
}

In [None]:
%%time
dtrain = xgboost.dask.DaskDMatrix(client, X_train, y_train)
xgboost_output = xgboost.dask.train(client, model_params, dtrain, 
                                    num_boost_round = model_params['num_boost_round'])
trained_model = xgboost_output['booster']
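
With the `binary:logistic` objective, the booster's raw margin is passed through a sigmoid to produce a probability, and training minimizes the logistic (log) loss. The sketch below shows both pieces in plain Python; the function names and sample values are hypothetical and this is not XGBoost's internal code.

```python
import math


def sigmoid(margin: float) -> float:
    """Map a raw margin to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-margin))


def log_loss(y_true, probs, eps=1e-15):
    """Mean logistic loss over paired labels and predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, probs):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)


# hypothetical raw margins for three flights
margins = [-2.0, 0.0, 3.0]
probs = [sigmoid(m) for m in margins]
loss = log_loss([0, 1, 1], probs)
print(probs, loss)
```

A margin of zero corresponds to a probability of exactly 0.5, which is why 0.5 is a natural decision threshold later on.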

### <font color='#8735fb'> **Predict & Score** </font>

In [None]:
%%time
# ensure that the inference target data (X_test, y_test) is computed [ i.e., concrete values in local memory ]
y_test_computed = y_test.compute()
X_test_computed = X_test.compute()

### <font color='#8735fb'> **XGBoost Native Predict & Score** </font>

In [None]:
threshold = 0.5
dtest = xgboost.dask.DaskDMatrix(client, X_test)

In [None]:
%%time
predictions = xgboost.dask.predict(client, trained_model, dtest).compute()
predictions = (predictions > threshold) * 1.0
score = accuracy_score(y_test_computed.astype('float32'),
                       predictions.astype('float32'))

print(f'score = {score}')
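
The scoring step above boils down to two operations: turning probabilities into hard labels with a threshold, and counting the fraction of labels that match. A dependency-free sketch follows; the helper names and sample numbers are hypothetical, and the notebook itself relies on array comparison and `accuracy_score` instead.

```python
def to_labels(probabilities, threshold=0.5):
    """Convert probabilities to 0.0/1.0 labels using a strict > comparison."""
    return [1.0 if p > threshold else 0.0 for p in probabilities]


def accuracy(y_true, y_pred):
    """Fraction of positions where the predicted label equals the true label."""
    matches = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return matches / len(y_true)


# hypothetical predicted probabilities and true labels
demo_probs = [0.10, 0.50, 0.51, 0.90]
demo_true = [0.0, 1.0, 1.0, 0.0]

demo_pred = to_labels(demo_probs)
demo_score = accuracy(demo_true, demo_pred)
print(demo_pred, demo_score)
```

Because the comparison is strict, a probability of exactly 0.5 is labeled as not delayed.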

In [None]:
model_filename = 'trained-model.xgb'
trained_model.save_model(model_filename)

### <font color='#8735fb'> **ForestInference Predict & Score** </font>

In [None]:
reloaded_model = ForestInference.load(model_filename)

In [None]:
%%time 
fil_predictions = reloaded_model.predict(X_test_computed)
fil_predictions = (fil_predictions > threshold) * 1.0
score = accuracy_score(y_test_computed.astype('float32'),
                       fil_predictions.astype('float32'))
print(f'fil score = {score}')

### <font color='#8735fb'> **Additional References** </font>