# DSTS Assignment 2
## On-Cloud Notebook 

### Alan Gaugler
### U885853
### November 3, 2023

# Problem: Predicting Airplane Delays

The goals of this notebook are:
- Process and create a dataset from downloaded ZIP files
- Exploratory data analysis (EDA)
- Establish a baseline model and improve it

## Introduction to business scenario
You work for a travel booking website that is working to improve the customer experience for flights that were delayed. The company wants to create a feature to let customers know if the flight will be delayed due to weather when the customers are booking the flight to or from the busiest airports for domestic travel in the US. 

You are tasked with solving part of this problem by using machine learning to identify whether a flight will be delayed due to weather. You have been given access to a dataset of on-time performance of domestic flights operated by large air carriers. You can use this data to train a machine learning model to predict whether a flight to or from the busiest airports will be delayed.

### Dataset
The provided dataset contains scheduled and actual departure and arrival times reported by certified US air carriers that account for at least 1 percent of domestic scheduled passenger revenues. The data was collected by the Office of Airline Information, Bureau of Transportation Statistics (BTS). The dataset contains the date, time, origin, destination, airline, distance, and delay status of flights between 2014 and 2018.
The data are in 60 compressed files, each containing a CSV of the flight details for one month of the five years (2014 to 2018). The data can be downloaded from this link: [https://ucstaff-my.sharepoint.com/:f:/g/personal/ibrahim_radwan_canberra_edu_au/Er0nVreXmihEmtMz5qC5kVIB81-ugSusExPYdcyQTglfLg?e=bNO312]. Please download the data files and place them in a path relative to this notebook. Dataset(s) used in this assignment were compiled by the Office of Airline Information, Bureau of Transportation Statistics (BTS), Airline On-Time Performance Data, available with the following link: [https://www.transtats.bts.gov/Fields.asp?gnoyr_VQ=FGJ]. 

# Step 1: Prepare the environment 

Use one of the labs in which we practised with Amazon SageMaker and perform the following steps:
1. Start a lab. - **Done**
2. Create a notebook instance and name it "oncloudproject". - **Done**
3. Increase the used memory to 25 GB from the additional configurations. - **Done**
4. Open Jupyter Lab and upload this notebook into it. - **Done**
5. Upload the two combined CSV files (combined_csv_v1.csv and combined_csv_v2.csv), which you created in Part A of this project. - **See below**

The two CSV files were zipped, uploaded into the working directory and then unzipped. They are read in as CSV files as shown below.
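
The unzip step described above could also be run from within the notebook. The following is a minimal sketch, assuming the uploaded archive is named `combined_csv.zip` (a hypothetical name; the actual archive name is not given in this notebook) and sits in the working directory. `extract_csvs` is an illustrative helper, not part of the original code.

```python
import zipfile
from pathlib import Path

def extract_csvs(archive_path, dest_dir="."):
    """Extract only the CSV members of a ZIP archive and return their names."""
    with zipfile.ZipFile(archive_path) as archive:
        # Keep only members whose names end in .csv
        csv_members = [m for m in archive.namelist() if m.lower().endswith(".csv")]
        archive.extractall(path=dest_dir, members=csv_members)
    return csv_members

# Hypothetical archive name; adjust it to the file that was uploaded
if Path("combined_csv.zip").exists():
    print(extract_csvs("combined_csv.zip"))
```

Restricting the extraction to `.csv` members avoids unpacking any other files that happen to be in the archive.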

In [None]:
# Import the required libraries
import warnings, requests, zipfile, io
warnings.simplefilter('ignore')
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
from scipy.io import arff

import os
import boto3
import sagemaker
from sagemaker.image_uris import retrieve
from sklearn.model_selection import train_test_split

In [None]:
# Define a prefix.  This is effectively a directory in S3
prefix = 'sagemaker/flight-delay-prediction'

# Get the default bucket for SageMaker in the current region
bucket = sagemaker.Session().default_bucket()

print(f"Data will be uploaded to: s3://{bucket}/{prefix}")

**Load the files**

In [None]:
# Load the 2 CSV files
df1 = pd.read_csv('combined_csv_v1.csv')
df2 = pd.read_csv('combined_csv_v2.csv')

**Verify that the files are loaded**

In [None]:
# Check the header of df1
df1.head()

In [None]:
# Examine the target value counts
df1['target'].value_counts()

In [None]:
# Examine the target value counts as percentages
df1['target'].value_counts(1)

In [None]:
# Check the header of df2
df2.head()

In [None]:
# Examine the target value counts
df2['target'].value_counts()

In [None]:
# Examine the target value counts as percentages
df2['target'].value_counts(1)

The files have been successfully loaded.

### Create functions to calculate the model metrics

In [None]:
# Import the necessary libraries
from sklearn import metrics
from sklearn.metrics import (classification_report, confusion_matrix, roc_curve, auc,
                             accuracy_score, precision_score, recall_score, f1_score)

In [None]:
# Define the class labels
class_labels = ['No Delay', 'Delay']

In [None]:
# Display the confusion matrix
def plot_confusion_matrix(y_test, predicted_labels, class_labels):
    cm1 = confusion_matrix(y_test, predicted_labels)
    plt.figure(figsize=(3.5,3.5))
    sns.heatmap(cm1, annot=True, fmt='g', cbar=False,
    xticklabels=class_labels,
    yticklabels=class_labels)
    plt.xlabel('Predicted')
    plt.ylabel('Actual')
    plt.title('Confusion Matrix')
    plt.show()
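
For reference, scikit-learn's `confusion_matrix` puts the actual classes on the rows and the predicted classes on the columns, so for labels 0 and 1 the heatmap reads `[[TN, FP], [FN, TP]]`. The sketch below reproduces those four counts in plain Python on toy labels; `confusion_counts` is an illustrative helper and the labels are made up.

```python
def confusion_counts(actual, predicted):
    """Return (tn, fp, fn, tp) for binary labels where 1 is the positive class."""
    tn = fp = fn = tp = 0
    for a, p in zip(actual, predicted):
        if a == 1 and p == 1:
            tp += 1          # delayed flight correctly flagged
        elif a == 1 and p == 0:
            fn += 1          # delayed flight missed
        elif a == 0 and p == 1:
            fp += 1          # on-time flight wrongly flagged
        else:
            tn += 1          # on-time flight correctly passed
    return tn, fp, fn, tp

# Toy labels, for illustration only
actual = [0, 0, 0, 1, 1]
predicted = [0, 1, 0, 1, 0]
print(confusion_counts(actual, predicted))  # (2, 1, 1, 1)
```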

In [None]:
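# Function to display the ROC Curve
def plot_roc(y_test, predicted_labels):
    # Note: with hard 0/1 predictions the curve has a single interior point;
    # pass probability scores instead to obtain the full curve
    # Determine the false positive rate, true positive rate and thresholds
    fpr, tpr, thresholds = metrics.roc_curve(y_test, predicted_labels)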
    
    # Calculate the area under the curve
    roc_auc = auc(fpr, tpr)
    
    plt.figure()
    # Plot the ROC
    plt.plot(fpr, tpr, color='orange', lw=2, label=f'ROC curve, AUC = {round(roc_auc,2)}')
    # Plot the line of no discrimination (45 degree angle)
    plt.plot([0, 1], [0, 1], color='blue', lw=2, linestyle='--')
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title('Receiver Operating Characteristic Curve')
    plt.legend(loc="lower right")
    plt.show()

In [None]:
# Function to print the performance metrics
def plot_metrics(y_test, predicted_labels):
    # Calculate true/false positives/negatives
    tn, fp, fn, tp = confusion_matrix(y_test, predicted_labels).ravel()
    # Calculate the specificity
    specificity = tn / (tn + fp)
    
    # Print the evaluation metrics (computed from the arguments, not from globals)
    print('Evaluation Metrics')
    print('------------------')
    print('Accuracy: {:.5f}'.format(accuracy_score(y_test, predicted_labels)))
    print('Precision: {:.5f}'.format(precision_score(y_test, predicted_labels)))
    print('Recall (Sensitivity): {:.5f}'.format(recall_score(y_test, predicted_labels)))
    print(f'Specificity: {specificity:.5f}')
    print('F1-score: {:.5f}'.format(f1_score(y_test, predicted_labels)))
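
The metrics printed by this function all follow from the four confusion-matrix counts. As a minimal sketch of the formulas (the helper name `metrics_from_counts` and the toy counts are illustrative, not taken from the model results):

```python
def metrics_from_counts(tn, fp, fn, tp):
    """Compute the standard binary classification metrics from raw counts."""
    total = tn + fp + fn + tp
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    # F1 is the harmonic mean of precision and recall
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "f1": f1,
    }

# Toy counts, for illustration only
print(metrics_from_counts(tn=2, fp=1, fn=1, tp=1))
```

The guards against empty denominators matter for a model like the ones below, which may predict almost no positives.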

In [None]:
# This function converts the input value into a binary value
# according to the threshold, which defaults to 0.5.
def binary_convert(x, threshold=0.5):
    # Values strictly above the threshold map to 1, all others to 0
    return 1 if x > threshold else 0
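
The threshold determines how readily a flight is labelled 'Delay', so it is worth seeing how the labels change when it moves. The sketch below applies the same strict greater-than rule to a list of made-up scores; `to_binary` is an illustrative helper, not part of the notebook.

```python
def to_binary(scores, threshold=0.5):
    """Map each score to 1 if it is strictly above the threshold, else 0."""
    return [1 if s > threshold else 0 for s in scores]

# Made-up scores, for illustration only
scores = [0.10, 0.35, 0.50, 0.80]
print(to_binary(scores))        # [0, 0, 0, 1]
print(to_binary(scores, 0.3))   # [0, 1, 1, 1]
```

Note that a score exactly equal to the threshold is mapped to 0, matching the `>` comparison used in `binary_convert`.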

In [None]:
import re

# This function uses a regular expression to extract the probability score
# (the predicted probability of class 1) from a Linear Learner output string.
# It returns NaN if no score is found.
def get_prob_score(value):
    match = re.search(r"score:([\d\.]+)\}", value)
    if match:
        return float(match.group(1))
    return np.nan
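
The regular expression above only matches scores written in plain decimal notation, so a score in scientific notation (for example `1.2e-05`) would fall through to NaN. As an alternative sketch, assuming each raw line of the Linear Learner batch output is a JSON object with a `score` field (an assumption about the output format that should be checked against the actual `.out` file), the score could be read with the `json` module. `parse_score` is a hypothetical helper.

```python
import json
import math

def parse_score(line):
    """Return the 'score' value from one JSON output line, or NaN if absent."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError:
        return math.nan
    if not isinstance(record, dict):
        return math.nan
    score = record.get("score")
    return float(score) if score is not None else math.nan

# Made-up output lines, assuming a JSON-lines layout
print(parse_score('{"predicted_label": 0, "score": 0.0123}'))   # 0.0123
print(parse_score('{"predicted_label": 0, "score": 1.2e-05}'))  # 1.2e-05
```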

# Step 2: Build and evaluate simple models

Write code to perform the following steps:
1. Split data into training, validation and testing sets (70% - 15% - 15%).
2. Use linear learner estimator to build a classification model.
3. Host the model on another instance
4. Perform batch transform to evaluate the model on testing data
5. Report the performance metrics that you think best test the model performance 

Note: You are required to perform the above steps on the two combined datasets separately and to comment on the difference.

### <span style="color:darkblue">combined_csv_v1.csv</span>

#### 1. Split data into training, validation and testing sets (70% - 15% - 15%).

**Check the target column position** 

The dataframe must have the target value in the first column.

Confirm if it is in the first column.

In [None]:
df1.head(2)

The target variable is indeed in the first position.

In [None]:
# Separate the train set. Stratify on the target variable to preserve the class proportions.
train, test_and_validate = train_test_split(df1, test_size=0.3,
                            random_state=12, stratify=df1['target'])

# Split the remainder further into the test and validation sets
test, validate = train_test_split(test_and_validate, test_size=0.5,
                              random_state=12, stratify=test_and_validate['target'])

In [None]:
# Examine the shape of the three datasets
print(train.shape)
print(test.shape)
print(validate.shape)
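
The two-stage split yields 70/15/15 because the second call halves the 30% hold-out: 0.3 × 0.5 = 0.15. The hypothetical helper below sketches that arithmetic; the exact row counts returned by `train_test_split` may differ by a row because of its own rounding.

```python
def split_sizes(n_rows, holdout_frac=0.3, test_frac_of_holdout=0.5):
    """Approximate (train, test, validate) sizes for a two-stage split."""
    holdout = round(n_rows * holdout_frac)        # rows set aside in the first split
    test = round(holdout * test_frac_of_holdout)  # half of the hold-out becomes the test set
    validate = holdout - test                     # the rest becomes the validation set
    train = n_rows - holdout
    return train, test, validate

print(split_sizes(1000))  # (700, 150, 150)
```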

In [None]:
# Check the distribution of the classes
print(train['target'].value_counts())
print(test['target'].value_counts())
print(validate['target'].value_counts())
print()
# As percentages
print(train['target'].value_counts(1))
print(test['target'].value_counts(1))
print(validate['target'].value_counts(1))

The class proportions are the same across the three sets, as expected with stratified splitting.

#### 2. Use linear learner estimator to build a classification model.

In [None]:
# Import AWS Linear Learner for binary classification
from sagemaker import image_uris

In [None]:
# Set a prefix for the S3 bucket directories
# (this replaces the prefix defined at the start of the notebook)
prefix='lab3'

# Define the train test and validation file names
train_file='train.csv'
test_file='test.csv'
validate_file='validate.csv'

# Initialise a connection to the S3 bucket using boto3
s3_resource = boto3.Session().resource('s3')

In [None]:
# Create a function to upload a given dataframe as a CSV file to a S3 bucket
def upload_s3_csv(filename, folder, dataframe):
    # Create a text stream in memory
    csv_buffer = io.StringIO()
    # Save the df as a CSV file and store in the buffer
    dataframe.to_csv(csv_buffer, header=False, index=False )
    # Upload the content of the CSV buffer to the desired S3 location
    s3_resource.Bucket(bucket).Object(os.path.join(prefix, folder, filename)).put(Body=csv_buffer.getvalue())
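
`os.path.join` builds the key with the operating system's path separator, which is `/` on the Linux-based notebook instance but would be `\` on Windows. A minimal sketch of a separator-independent alternative follows, using hypothetical helpers `s3_key` and `s3_uri` and a placeholder bucket name.

```python
import posixpath

def s3_key(*parts):
    """Join key components with forward slashes regardless of the host OS."""
    return posixpath.join(*parts)

def s3_uri(bucket_name, *parts):
    """Build an s3:// URI from a bucket name and key components."""
    return "s3://{}/{}".format(bucket_name, s3_key(*parts))

print(s3_key("lab3", "train", "train.csv"))        # lab3/train/train.csv
print(s3_uri("example-bucket", "lab3", "train"))   # s3://example-bucket/lab3/train
```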

In [None]:
# Upload the three datasets to S3
upload_s3_csv(train_file, 'train', train)
upload_s3_csv(test_file, 'test', test)
upload_s3_csv(validate_file, 'validate', validate)

In [None]:
# Retrieve the Linear Learner image
container = image_uris.retrieve('linear-learner', boto3.Session().region_name,'1.0-1')

# Set the hyperparameters for the LL model
hyperparams = {
    "feature_dim": train.shape[1] - 1, # Exclude the target column
    "predictor_type": "binary_classifier",
    "mini_batch_size": 1000} # A larger batch size of 1000 decreases training time

# Define the S3 location to save model outputs
s3_output_location="s3://{}/{}/output/".format(bucket,prefix)

# Initialize the Linear Learner estimator using SageMaker's estimator API
ll_1=sagemaker.estimator.Estimator(container, # Container image defined above
                                       sagemaker.get_execution_role(),
                                       instance_count=1, # One training instance
                                       instance_type='ml.c5.9xlarge', # Compute optimized instance is set
                                       output_path=s3_output_location, # Output path defined above
                                        hyperparameters=hyperparams, # Hyperparams defined above
                                        sagemaker_session=sagemaker.Session())

#### 3. Host the model on another instance

The model is trained on an 'ml.c5.9xlarge' instance, as defined in the estimator above. For inference, the transformer object below provisions a separate instance of the same type to host the model.

In [None]:
# Set the training data location and the content type
train_channel = sagemaker.inputs.TrainingInput(
    "s3://{}/{}/train/".format(bucket,prefix),
    content_type='text/csv')

# Set the validation data location and the content type
validate_channel = sagemaker.inputs.TrainingInput(
    "s3://{}/{}/validate/".format(bucket,prefix),
    content_type='text/csv')

# Create a dictionary to hold the training and validation data channels
data_channels = {'train': train_channel, 'validation': validate_channel}

In [None]:
# Train the model using the defined data channels. 
# logs are disabled as they produce a large output
ll_1.fit(inputs=data_channels, logs=False)

#### 4. Perform batch transform to evaluate the model on testing data

In [None]:
# Extract all columns from the test dataset except the target variable (1st column)
batch_X = test.iloc[:,1:]

# Set the filename for the batch input data to be uploaded to S3
batch_X_file='batch-in.csv'

# Upload the batch input data to S3 using the previously created function
upload_s3_csv(batch_X_file, 'batch-in', batch_X)

In [None]:
# Set the S3 path to save the batch transform output
batch_output = "s3://{}/{}/batch-out/".format(bucket,prefix)

# Set the S3 path for the batch input data
batch_input = "s3://{}/{}/batch-in/{}".format(bucket,prefix,batch_X_file)

# Initialize the transformer object
ll_1_transformer = ll_1.transformer(instance_count=1,
                                       instance_type='ml.c5.9xlarge', # Instance type is set
                                       strategy='MultiRecord',
                                       assemble_with='Line', # Line up the results
                                       output_path=batch_output)

In [None]:
# Execute the batch transform with the initialized transformer object
ll_1_transformer.transform(data=batch_input,
                         data_type='S3Prefix',
                         content_type='text/csv',
                         split_type='Line')

# Wait for the batch transform job to finish processing
ll_1_transformer.wait()

#### Evaluate the results

In [None]:
# Initialise the S3 client
s3 = boto3.client('s3')

# Obtain the output results of the batch transform job from S3
obj = s3.get_object(Bucket=bucket, Key="{}/batch-out/{}".format(prefix,'batch-in.csv.out'))

# Read the stored results into a dataframe
target_predicted = pd.read_csv(io.BytesIO(obj['Body'].read()),names=['target'])
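
Batch transform writes one output object per input object, named after the input file with `.out` appended, which is why the key above ends in `batch-in.csv.out`. The hypothetical helper below sketches how that key could be derived from the input file name rather than typed by hand.

```python
def batch_output_key(key_prefix, input_file, out_folder="batch-out"):
    """Return the S3 key of the batch transform output for a given input file."""
    return "{}/{}/{}.out".format(key_prefix, out_folder, input_file)

print(batch_output_key("lab3", "batch-in.csv"))  # lab3/batch-out/batch-in.csv.out
```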

In [None]:
# Call the get probability score function to get the probability of the target variable
# ranging between 0 to 1.
target_predicted['target'] = target_predicted['target'].apply(get_prob_score)

# Print the header
print(target_predicted.head(5))

In [None]:
# Convert the 'target' column to a float before the next step
target_predicted['target'] = target_predicted['target'].astype(float)

# Convert the predicted target values into binary values using the binary_convert function
target_predicted_binary = target_predicted['target'].apply(binary_convert)

# Display a sample of the binary predictions
print(target_predicted_binary.head(5))

In [None]:
# Display the header of the test set
test.head(5)

#### 5. Report the performance metrics that you see. Test the model performance 

In [None]:
# Display a sample of the test labels
test_labels = test.iloc[:,0]
test_labels.head()

**Test Set**

**Confusion Matrix**

In [None]:
# Plot the confusion matrix on the test set
plot_confusion_matrix(test_labels, target_predicted_binary, class_labels)

**ROC Curve and Evaluation Metrics**

In [None]:
# Plot the ROC curve on the test set
plot_roc(test_labels, target_predicted_binary)

In [None]:
# Plot the performance metrics
plot_metrics(test_labels, target_predicted_binary)

**Classification Report**

In [None]:
# Print the classification report
print("Classification Report")
print("---------------------")
print(classification_report(test_labels, target_predicted_binary))

**Summary of Linear Learner Model 1 using Dataset combined_csv_v1.csv**

The linear learner models took a long time to train in my environment; however, increasing the mini-batch size from 200 to 1000 decreased the processing time significantly. The results of the linear learner model on the combined_csv_v1 test set are quite similar to those of logistic regression Model 1, which was trained on the same dataset. In the confusion matrix, very few 'Delay' cases were predicted. The model is heavily biased towards 'No Delay', the majority class. The extremely low recall of 0.13% and high specificity of 99.97% reflect the confusion matrix. The F1-score is 0.25% and the AUC is 0.5, which is no better than random guessing. This is not a good model for predicting flight delays.

The classification report shows an overall accuracy of 79%, which is simply the proportion of the majority class. After rounding, the recall is 100% for 'No Delay' (class 0) and 0% for 'Delay' (class 1). These results will be compared to Model 2.
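
Because `plot_roc` is given thresholded 0/1 predictions rather than probability scores, the ROC curve has a single interior point, and the area under it reduces to the mean of recall and specificity. A short worked check using the recall and specificity reported above (the function name is illustrative):

```python
def auc_from_hard_labels(recall, specificity):
    """Trapezoidal area under the ROC through (0,0), (FPR,TPR) and (1,1)."""
    fpr, tpr = 1.0 - specificity, recall
    left = 0.5 * fpr * tpr                    # triangle from (0,0) to (FPR,TPR)
    right = 0.5 * (1.0 - fpr) * (tpr + 1.0)   # trapezoid from (FPR,TPR) to (1,1)
    return left + right

# Recall 0.13% and specificity 99.97%, as reported for Linear Learner Model 1
print(round(auc_from_hard_labels(0.0013, 0.9997), 4))  # 0.5005
```

An AUC computed from the probability scores themselves could differ, since it would account for the ranking of all predictions rather than a single threshold.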

### <span style="color:darkblue">combined_csv_v2.csv</span>

#### 1. Split data into training, validation and testing sets (70% - 15% - 15%).

**Check the target column position** 

The dataframe must have the target value in the first column.

Confirm if it is in the first column.

In [None]:
df2.head(2)

The target variable is indeed in the first position.

In [None]:
# Separate the train set. Stratify on the target variable of df2 to preserve the class proportions.
train, test_and_validate = train_test_split(df2, test_size=0.3,
                            random_state=12, stratify=df2['target'])

# Split the remainder further into the test and validation sets
test, validate = train_test_split(test_and_validate, test_size=0.5,
                              random_state=12, stratify=test_and_validate['target'])

In [None]:
# Examine the shape of the three datasets
print(train.shape)
print(test.shape)
print(validate.shape)

In [None]:
# Check the distribution of the classes
print(train['target'].value_counts())
print(test['target'].value_counts())
print(validate['target'].value_counts())
print()
# As percentages
print(train['target'].value_counts(1))
print(test['target'].value_counts(1))
print(validate['target'].value_counts(1))

The class proportions are the same across the three sets, as expected with stratified splitting.

#### 2. Use linear learner estimator to build a classification model.

In [None]:
# Set a prefix for the S3 bucket directories
prefix='lab3'

# Define the train test and validation file names
train_file='train.csv'
test_file='test.csv'
validate_file='validate.csv'

# Initialise a connection to the S3 bucket using boto3
s3_resource = boto3.Session().resource('s3')

In [None]:
# Upload the three datasets to S3 by calling the upload_s3_csv function
upload_s3_csv(train_file, 'train', train)
upload_s3_csv(test_file, 'test', test)
upload_s3_csv(validate_file, 'validate', validate)

In [None]:
# Retrieve the Linear Learner image
container = image_uris.retrieve('linear-learner', boto3.Session().region_name,'1.0-1')

# Set the hyperparameters for the LL model
hyperparams = {
    "feature_dim": train.shape[1] - 1, # Exclude the target column
    "predictor_type": "binary_classifier",
    "mini_batch_size": 1000} # A larger batch size of 1000 decreases training time

# Define the S3 location to save model outputs
s3_output_location="s3://{}/{}/output/".format(bucket,prefix)

# Initialize the Linear Learner estimator using SageMaker's estimator API
ll_2=sagemaker.estimator.Estimator(container, # Container image defined above
                                       sagemaker.get_execution_role(),
                                       instance_count=1, # One training instance
                                       instance_type='ml.c5.9xlarge', # Instance type is set
                                       output_path=s3_output_location, # Output path defined above
                                        hyperparameters=hyperparams, # Hyperparams defined above
                                        sagemaker_session=sagemaker.Session())

#### 3. Host the model on another instance

The model is trained on an 'ml.c5.9xlarge' instance, as defined in the estimator above. For inference, the transformer object below provisions a separate instance of the same type to host the model.

In [None]:
# Set the training data location and the content type
train_channel = sagemaker.inputs.TrainingInput(
    "s3://{}/{}/train/".format(bucket,prefix),
    content_type='text/csv')

# Set the validation data location and the content type
validate_channel = sagemaker.inputs.TrainingInput(
    "s3://{}/{}/validate/".format(bucket,prefix),
    content_type='text/csv')

# Create a dictionary to hold the training and validation data channels
data_channels = {'train': train_channel, 'validation': validate_channel}

In [None]:
# Train the model using the defined data channels. 
# logs are disabled as they produce a large output
ll_2.fit(inputs=data_channels, logs=False)

#### 4. Perform batch transform to evaluate the model on testing data

In [None]:
# Extract all columns from the test dataset except the target variable (1st column)
batch_X = test.iloc[:,1:]

# Set the filename for the batch input data to be uploaded to S3
batch_X_file='batch-in.csv'

# Upload the batch input data to S3 using the previously created function
upload_s3_csv(batch_X_file, 'batch-in', batch_X)

In [None]:
# Set the S3 path to save the batch transform output
batch_output = "s3://{}/{}/batch-out/".format(bucket,prefix)

# Set the S3 path for the batch input data
batch_input = "s3://{}/{}/batch-in/{}".format(bucket,prefix,batch_X_file)

# Initialize the transformer object
ll_2_transformer = ll_2.transformer(instance_count=1,
                                       instance_type='ml.c5.9xlarge', # Instance type is set
                                       strategy='MultiRecord',
                                       assemble_with='Line', # Line up the results
                                       output_path=batch_output)

In [None]:
# Execute the batch transform with the initialized transformer object
ll_2_transformer.transform(data=batch_input,
                         data_type='S3Prefix',
                         content_type='text/csv',
                         split_type='Line')

# Wait for the batch transform job to finish processing
ll_2_transformer.wait()

#### Evaluate the results

In [None]:
# Initialise the S3 client
s3 = boto3.client('s3')

# Obtain the output results of the batch transform job from S3
obj = s3.get_object(Bucket=bucket, Key="{}/batch-out/{}".format(prefix,'batch-in.csv.out'))

# Read the stored results into a dataframe
target_predicted = pd.read_csv(io.BytesIO(obj['Body'].read()),names=['target'])

In [None]:
# Call the get probability score function to get the probability of the target variable
# ranging between 0 to 1.
target_predicted['target'] = target_predicted['target'].apply(get_prob_score)

# Print the header
print(target_predicted.head(5))

In [None]:
# Convert the 'target' column to a float before the next step
target_predicted['target'] = target_predicted['target'].astype(float)

# Convert the predicted target values into binary values using the binary_convert function
target_predicted_binary = target_predicted['target'].apply(binary_convert)

# Display a sample of the binary predictions
print(target_predicted_binary.head(5))

In [None]:
# Display the header of the test set
test.head(5)

#### 5. Report the performance metrics that you see. Test the model performance 

In [None]:
# Display a sample of the test labels
test_labels = test.iloc[:,0]
test_labels.head()

**Test Set**

**Confusion Matrix**

In [None]:
# Plot the confusion matrix on the test set
plot_confusion_matrix(test_labels, target_predicted_binary, class_labels)

**ROC Curve and Evaluation Metrics**

In [None]:
# Plot the ROC curve on the test set
plot_roc(test_labels, target_predicted_binary)

In [None]:
# Plot the performance metrics
plot_metrics(test_labels, target_predicted_binary)

**Classification Report**

In [None]:
# Print the classification report
print("Classification Report")
print("---------------------")
print(classification_report(test_labels, target_predicted_binary))

**Summary of Linear Learner Model 2 using Dataset combined_csv_v2.csv**

The results of linear learner Model 2 on the combined_csv_v2 test set are quite similar to those of logistic regression Model 2, which was trained on the same dataset, and perhaps slightly better. The numbers are difficult to compare directly because the test set size differs from that of the on-premises model. In the confusion matrix, 3028 'Delay' cases were predicted, a big improvement over the 65 of Model 1. The model is still biased towards 'No Delay', the majority class. The low recall of 5.88% and high specificity of 98.86% reflect the confusion matrix. The F1-score is 10.67% and the AUC is 0.52, both improvements over Linear Learner Model 1. This is still not a good model for predicting flight delays.

Comparing these results to on-premises Model 2, the AUC is equal at 0.52, but all other metrics have improved. This is the best model trained so far.
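
The reported figures can be cross-checked against each other. With hard 0/1 predictions, the AUC equals the mean of recall and specificity, and the precision implied by a given F1-score and recall follows from rearranging F1 = 2PR / (P + R). The sketch below uses the rounded values quoted above, so the implied precision is only approximate; `implied_precision` is an illustrative helper.

```python
def implied_precision(f1, recall):
    """Solve F1 = 2PR / (P + R) for the precision P."""
    return f1 * recall / (2.0 * recall - f1)

# Rounded values reported for Linear Learner Model 2
recall, specificity, f1 = 0.0588, 0.9886, 0.1067

# AUC from a single operating point: mean of recall and specificity
print(round((recall + specificity) / 2, 2))      # 0.52

# Implied precision, sensitive to the rounding of the inputs
print(round(implied_precision(f1, recall), 2))   # about 0.58
```

If the implied precision holds, a little over half of the flights flagged as 'Delay' were actually delayed, even though most delayed flights were missed.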

# Step 3: Build and evaluate ensemble models

Write code to perform the following steps:
1. Split data into training, validation and testing sets (70% - 15% - 15%).
2. Use xgboost estimator to build a classification model.
3. Host the model on another instance
4. Perform batch transform to evaluate the model on testing data
5. Report the performance metrics that you think best test the model performance 
6. Write down your observations on the difference between the performance of the simple and ensemble models.

Note: You are required to perform the above steps on the two combined datasets separately.

### <span style="color:darkblue">combined_csv_v1.csv</span>

#### 1. Split data into training, validation and testing sets (70% - 15% - 15%).

**Check the target column position**  
The dataframe must have the target value in the first column.

Confirm if it is in the first position.

In [None]:
df1.head(2)

The target variable is indeed in the first position.

In [None]:
# Separate the train set. Stratify on the target variable to preserve the class proportions.
train, test_and_validate = train_test_split(df1, test_size=0.3,
                            random_state=12, stratify=df1['target'])

# Split the remainder further into the test and validation sets
test, validate = train_test_split(test_and_validate, test_size=0.5,
                              random_state=12, stratify=test_and_validate['target'])

In [None]:
# Examine the shape of the three datasets
print(train.shape)
print(test.shape)
print(validate.shape)

In [None]:
# Check the distribution of the classes
print(train['target'].value_counts())
print(test['target'].value_counts())
print(validate['target'].value_counts())
print()
# As percentages
print(train['target'].value_counts(1))
print(test['target'].value_counts(1))
print(validate['target'].value_counts(1))

The class proportions are the same across the three sets, as expected with stratified splitting.

#### 2. Use xgboost estimator to build a classification model.

In [None]:
# Set a prefix for the S3 bucket directories
prefix='lab3'

# Define the train test and validation file names
train_file='train.csv'
test_file='test.csv'
validate_file='validate.csv'

# Initialise a connection to the S3 bucket using boto3
s3_resource = boto3.Session().resource('s3')

In [None]:
# Upload the three datasets to S3
upload_s3_csv(train_file, 'train', train)
upload_s3_csv(test_file, 'test', test)
upload_s3_csv(validate_file, 'validate', validate)

In [None]:
# Retrieve the container image for XGBoost from SageMaker's repository
container = retrieve('xgboost',boto3.Session().region_name,'1.0-1')

# Set the hyperparameters for the XGBoost model
hyperparams={"num_round":"42", # Set the number of boosting rounds
             "eval_metric": "auc", # Area under the curve is used in validation
             "objective": "binary:logistic"} # This is used for binary classification

# Define the S3 location to save model outputs
s3_output_location="s3://{}/{}/output/".format(bucket,prefix)

# Initialize the XGBoost estimator using SageMaker's estimator API
xgb_1=sagemaker.estimator.Estimator(container, # Container image defined above
                                       sagemaker.get_execution_role(),
                                       instance_count=1, # One training instance
                                       instance_type='ml.c5.9xlarge', # Instance type is set
                                       output_path=s3_output_location, # Output path defined above
                                        hyperparameters=hyperparams, # Hyperparams defined above
                                        sagemaker_session=sagemaker.Session())
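
The hyperparameters above leave the class imbalance untreated. One option that could be explored, and which is not used in this notebook, is XGBoost's `scale_pos_weight` hyperparameter, commonly set to the ratio of negative to positive examples. A minimal sketch follows, assuming roughly 79% 'No Delay' flights as suggested by the accuracy reported for Linear Learner Model 1; the helper name and the `hyperparams_weighted` dictionary are hypothetical.

```python
def scale_pos_weight(negative_share):
    """Ratio of negative to positive examples, given the negative class share."""
    return negative_share / (1.0 - negative_share)

# Assumed class share: about 79% 'No Delay' (class 0)
weight = scale_pos_weight(0.79)

# Same settings as above, plus the weighting (an untested suggestion)
hyperparams_weighted = {"num_round": "42",
                        "eval_metric": "auc",
                        "objective": "binary:logistic",
                        "scale_pos_weight": str(round(weight, 2))}
print(hyperparams_weighted["scale_pos_weight"])  # 3.76
```

Whether this improves recall without an unacceptable loss of precision would need to be verified on the validation set.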

#### 3. Host the model on another instance

The model is trained on an 'ml.c5.9xlarge' instance, as defined in the estimator above. For inference, the transformer object below provisions a separate instance of the same type to host the model.

In [None]:
# Set the training data location and the content type
train_channel = sagemaker.inputs.TrainingInput(
    "s3://{}/{}/train/".format(bucket,prefix),
    content_type='text/csv')

# Set the validation data location and the content type
validate_channel = sagemaker.inputs.TrainingInput(
    "s3://{}/{}/validate/".format(bucket,prefix),
    content_type='text/csv')

# Create a dictionary to hold the training and validation data channels
data_channels = {'train': train_channel, 'validation': validate_channel}

In [None]:
# Train the model using the defined data channels. 
# logs are disabled as they produce a large output
xgb_1.fit(inputs=data_channels, logs=False)

#### 4. Perform batch transform to evaluate the model on testing data

In [None]:
# Extract all columns from the test dataset except the target variable (1st column)
batch_X = test.iloc[:,1:]

# Set the filename for the batch input data to be uploaded to S3
batch_X_file='batch-in.csv'

# Upload the batch input data to S3 using the previously created function
upload_s3_csv(batch_X_file, 'batch-in', batch_X)

In [None]:
# Set the S3 path to save the batch transform output
batch_output = "s3://{}/{}/batch-out/".format(bucket,prefix)

# Set the S3 path for the batch input data
batch_input = "s3://{}/{}/batch-in/{}".format(bucket,prefix,batch_X_file)

# Initialize the transformer object
xgb_1_transformer = xgb_1.transformer(instance_count=1,
                                       instance_type='ml.c5.9xlarge', # Instance type is set
                                       strategy='MultiRecord',
                                       assemble_with='Line', # Line up the results
                                       output_path=batch_output)

In [None]:
# Execute the batch transform with the initialized transformer object
xgb_1_transformer.transform(data=batch_input,
                         data_type='S3Prefix',
                         content_type='text/csv',
                         split_type='Line')

# Wait for the batch transform job to finish processing
xgb_1_transformer.wait()

#### Evaluate the results

In [None]:
# Initialise the S3 client
s3 = boto3.client('s3')

# Obtain the output results of the batch transform job from S3
obj = s3.get_object(Bucket=bucket, Key="{}/batch-out/{}".format(prefix,'batch-in.csv.out'))

# Read the stored results into a dataframe
target_predicted = pd.read_csv(io.BytesIO(obj['Body'].read()),names=['target'])

In [None]:
# Convert the predicted target values into binary values using the binary_convert function
target_predicted_binary = target_predicted['target'].apply(binary_convert)

# Display a sample of the binary predictions
print(target_predicted_binary.head(5))
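
No regular expression is needed here because, with the `binary:logistic` objective, each line of the XGBoost batch output is read directly as a probability. A minimal sketch of the same parse-and-threshold step in plain Python follows, using a made-up three-line output body for illustration; `parse_probabilities` is a hypothetical helper.

```python
def parse_probabilities(body_text):
    """Parse one probability per non-empty line of the output body."""
    return [float(line) for line in body_text.splitlines() if line.strip()]

# Made-up output body, for illustration only
sample_body = "0.12\n0.71\n0.50\n"
probs = parse_probabilities(sample_body)

# Apply the same strict 0.5 threshold as binary_convert
labels = [1 if p > 0.5 else 0 for p in probs]
print(probs)   # [0.12, 0.71, 0.5]
print(labels)  # [0, 1, 0]
```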

In [None]:
# Display the header of the test set
test.head(5)

#### 5. Report the performance metrics that you think best test the model performance  
The metrics for this model will be summarized in the conclusion.

In [None]:
# Display a sample of the test labels
test_labels = test.iloc[:,0]
test_labels.head()

**Test Set**

In [None]:
# Plot the confusion matrix on the test set
plot_confusion_matrix(test_labels, target_predicted_binary, class_labels)

In [None]:
test_labels.value_counts(1)

In [None]:
# Plot the ROC curve on the test set
plot_roc(test_labels, target_predicted_binary)

In [None]:
# Plot the performance metrics
plot_metrics(test_labels, target_predicted_binary)

In [None]:
# Print the classification report
print("Classification Report")
print("---------------------")
print(classification_report(test_labels, target_predicted_binary))

**Summary of XGBoost Model 1 using Dataset combined_csv_v1.csv**

The results of the confusion matrix, evaluation metrics, ROC curve and classification report for model XGB_1 will be compared with Model XGB_2 and discussed in more detail at the end of this notebook.

In brief, XGB_1 predicts slightly more accurately on dataset combined_csv_v1.csv than the linear learner. XGB_1 scored better on most evaluation metrics, including the AUC. In particular, the number of correctly predicted 'Delay' flights rose substantially from 65 to 859. The only exception is specificity, a decline that is understandable when recall improves.

### <span style="color:darkblue">combined_csv_v2.csv</span>

#### 1. Split data into training, validation and testing sets (70% - 15% - 15%).

**Check the target column position**  
The dataframe must have the target value in the first column.  

Confirm if it is in the first position.

In [None]:
df2.head(2)

The target variable is indeed in the first position.

In [None]:
# Separate the train set
train, test_and_validate = train_test_split(df2, test_size=0.3,
                            random_state=12, stratify=df2['target'])

# Split the remainder further into the test and validation sets
test, validate = train_test_split(test_and_validate, test_size=0.5,
                              random_state=12, stratify=test_and_validate['target'])

#### 2. Use xgboost estimator to build a classification model.

In [None]:
# Set a prefix for the S3 bucket directories
prefix='lab3'

# Define the train test and validation file names
train_file='train.csv'
test_file='test.csv'
validate_file='validate.csv'

# Initialise a connection to the S3 bucket using boto3
s3_resource = boto3.Session().resource('s3')

# Upload the three datasets to S3 by calling the previously defined function
upload_s3_csv(train_file, 'train', train)
upload_s3_csv(test_file, 'test', test)
upload_s3_csv(validate_file, 'validate', validate)

In [None]:
# Retrieve the container image for XGBoost from SageMaker's repository
container = retrieve('xgboost',boto3.Session().region_name,'1.0-1')

# Set the hyperparameters for the XGBoost model
hyperparams={"num_round":"42", # Set the number of boosting rounds
             "eval_metric": "auc", # Area under the curve is used in validation
             "objective": "binary:logistic"} # This is used for binary classification

# Define the S3 location to save model outputs
s3_output_location="s3://{}/{}/output/".format(bucket,prefix)

# Initialize the XGBoost estimator using SageMaker's estimator API
xgb_2=sagemaker.estimator.Estimator(container, # Container image defined above
                                       sagemaker.get_execution_role(),
                                       instance_count=1, # One training instance
                                       instance_type='ml.c5.9xlarge', # Instance type is set
                                       output_path=s3_output_location, # Output path defined above
                                        hyperparameters=hyperparams, # Hyperparams defined above
                                        sagemaker_session=sagemaker.Session())

#### 3. Host the model on another instance

The model is set up to be hosted on instance_type 'ml.c5.9xlarge' as defined in the estimator above and in the transformer object below.

In [None]:
# Set the training data location (an S3 prefix) and the content type
train_channel = sagemaker.inputs.TrainingInput(
    "s3://{}/{}/train/".format(bucket, prefix),
    content_type='text/csv')

# Set the validation data location (an S3 prefix) and the content type
validate_channel = sagemaker.inputs.TrainingInput(
    "s3://{}/{}/validate/".format(bucket, prefix),
    content_type='text/csv')

# Create a dictionary to hold the training and validation data channels
data_channels = {'train': train_channel, 'validation': validate_channel}

In [None]:
# Train the model using the defined data channels 
# logs are disabled as they produce a large output
xgb_2.fit(inputs=data_channels, logs=False)

#### 4. Perform batch transform to evaluate the model on testing data

In [None]:
# Extract all columns from the test dataset except the target variable (1st column)
batch_X = test.iloc[:, 1:]

# Set the filename for the batch input data to be uploaded to S3
batch_X_file='batch-in.csv'

# Upload the batch input data to S3 using the previously created function
upload_s3_csv(batch_X_file, 'batch-in', batch_X)

In [None]:
# Set the S3 path to save the batch transform output
batch_output = "s3://{}/{}/batch-out/".format(bucket,prefix)

# Set the S3 path for the batch input data
batch_input = "s3://{}/{}/batch-in/{}".format(bucket,prefix,batch_X_file)

# Initialize the transformer object
xgb_2_transformer = xgb_2.transformer(instance_count=1,
                                       instance_type='ml.c5.9xlarge', # Instance type is set
                                       strategy='MultiRecord',
                                       assemble_with='Line', # Join the output records with newlines
                                       output_path=batch_output)

In [None]:
# Execute the batch transform with the initialized transformer object
xgb_2_transformer.transform(data=batch_input,
                         data_type='S3Prefix',
                         content_type='text/csv',
                         split_type='Line')

# Wait for the batch transform job to finish processing
xgb_2_transformer.wait()

#### Evaluate the results

In [None]:
# Initialise the S3 client
s3 = boto3.client('s3')

# Obtain the output results of the batch transform job from S3
obj = s3.get_object(Bucket=bucket, Key="{}/batch-out/{}".format(prefix,'batch-in.csv.out'))

# Read the stored results into a dataframe
target_predicted = pd.read_csv(io.BytesIO(obj['Body'].read()),names=['target'])

In [None]:
# Convert the predicted target values into binary values using the binary_convert function
target_predicted_binary = target_predicted['target'].apply(binary_convert)

# Display a sample of the binary predictions
print(target_predicted_binary.head(5))

In [None]:
# Display the header of the test set
test.head(5)

#### 5. Report the performance metrics that you see better test the model performance  
The metrics for this model will be summarized in the conclusion.

In [None]:
# Display a sample of the test labels
test_labels = test.iloc[:,0]
test_labels.head()

**Test Set**

In [None]:
# Plot the confusion matrix on the test set
plot_confusion_matrix(test_labels, target_predicted_binary, class_labels)

In [None]:
# Show the class proportions of the test labels
test_labels.value_counts(normalize=True)

In [None]:
# Plot the ROC curve on the test set
plot_roc(test_labels, target_predicted_binary)

In [None]:
# Plot the performance metrics
plot_metrics(test_labels, target_predicted_binary)

In [None]:
# Print the classification report
print("Classification Report")
print("---------------------")
print(classification_report(test_labels, target_predicted_binary))

The results of the confusion matrix, evaluation metrics, ROC curve and classification report for model 2 will be compared with model 1 and discussed at the end of this notebook.

### <span style="color:darkblue">combined_csv_v2.csv</span>
### <span style="color:darkblue">Change the Binary Convert Threshold from 0.5 to 0.3</span>  

This section examines how lowering the binary convert threshold from 0.5 to 0.3 affects the metrics. Comments are provided in the conclusion.
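
The threshold could also be passed as an argument instead of being hard-coded, so that both settings can be compared without redefining the function. The following is only a minimal sketch: `binary_convert_at` and the sample scores are hypothetical and are not part of this notebook.

```python
def binary_convert_at(x, threshold=0.5):
    """Return 1 ('Delay') if the score exceeds the threshold, otherwise 0."""
    return 1 if x > threshold else 0


# Made-up probability scores, for illustration only
scores = [0.10, 0.35, 0.50, 0.72]

# Apply the default threshold (0.5) and the lowered threshold (0.3)
at_default = [binary_convert_at(s) for s in scores]
at_lowered = [binary_convert_at(s, threshold=0.3) for s in scores]

print(at_default)
print(at_lowered)
```

With a pandas Series, the extra argument can be forwarded through `apply`, for example `target_predicted['target'].apply(binary_convert_at, threshold=0.3)`.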

In [None]:
# This function converts the input values into a binary value
# according to the threshold, lowered here from the default of 0.5 to 0.3.
def binary_convert(x):
    threshold = 0.3
    if x > threshold:
        return 1
    else:
        return 0

In [None]:
# Convert the predicted target values into binary values using the binary_convert function
target_predicted_binary = target_predicted['target'].apply(binary_convert)

In [None]:
# Plot the confusion matrix on the test set
plot_confusion_matrix(test_labels, target_predicted_binary, class_labels)

In [None]:
# Plot the ROC curve on the test set
plot_roc(test_labels, target_predicted_binary)

In [None]:
# Plot the performance metrics
plot_metrics(test_labels, target_predicted_binary)

In [None]:
# Print the classification report
print("Classification Report")
print("---------------------")
print(classification_report(test_labels, target_predicted_binary))

**Summary of XGBoost Model 2 using Dataset combined_csv_v2.csv**

The results of the confusion matrix, evaluation metrics, ROC curve and classification report for model XGB_2 show it is the best performing model evaluated in this project (not including the random forest which is not part of the evaluation).

Its results are discussed in more detail below in the final comments.

#### 6. Write down your observation on the difference between the performance of using the simple and ensemble models.

**Final Comments**

There are some notable differences between the linear and the ensemble methods. The linear model took more time to process than the XGBoost model, even though its pipeline is set up the same way as the XGBoost model's, apart from the model definition. I increased the mini-batch size from 200 to 1000, which improved the processing speed considerably without any loss in accuracy.

The linear model performed similarly to the on-premises logistic regression model. In the confusion matrix, very few ‘Delay’ classes were predicted (either correctly or incorrectly). The overall accuracy was very close to 79%, which is the percentage of ‘No Delay’ values in the target variable. The recall was very low and the AUC of the ROC curve was very close to 0.5, which is no better than random guessing.
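
To illustrate why an accuracy near 79% is uninformative here, the sketch below scores a classifier that never predicts ‘Delay’. The labels are made up to mirror the roughly 79% ‘No Delay’ share mentioned above; they are not taken from the dataset.

```python
# Hypothetical labels: 79 'No Delay' (0) and 21 'Delay' (1)
labels = [0] * 79 + [1] * 21

# A trivial model that always predicts 'No Delay'
predictions = [0] * len(labels)

# Count the confusion matrix cells
tp = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 1)
tn = sum(1 for y, p in zip(labels, predictions) if y == 0 and p == 0)
fp = sum(1 for y, p in zip(labels, predictions) if y == 0 and p == 1)
fn = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 0)

accuracy = (tp + tn) / len(labels)
recall = tp / (tp + fn)
specificity = tn / (tn + fp)

# Mean of recall and specificity; 0.5 corresponds to no discriminating power
balanced_accuracy = (recall + specificity) / 2

print(accuracy, recall, specificity, balanced_accuracy)
```

The trivial model matches the majority-class share in accuracy while detecting no delays at all, which is why recall and AUC are the more telling metrics for this problem.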

The ensemble XGBoost model performed considerably better than the linear learner on both datasets. The logical setting for the binary convert threshold is 0.5, which is what I set it to by default, meaning that if the output probability is less than 0.5 the predicted value is set to 0 or ‘no delay’. If it is greater than 0.5 then it is predicted as a 1 or a ‘delay’.

At a threshold of 0.5, the model trained on the combined_csv_v2.csv dataset performed better than the one trained on v1. As was explained in the on-premises notebook, the extra features incorporated in v2, i.e., the holidays and the weather data (in particular heavy snow, rain or wind), combined well to improve the accuracy of the second model. This was reflected in the metrics of the two models.

Model XGB_1 only predicted 859 delays correctly, whereas XGB_2 predicted 6162, which is reflected in the recall rising from 1.67% to 11.97%. The overall accuracy and precision also improved slightly with the v2 dataset. The specificity declined slightly from 99.74% to 98.41%, but the F1-score increased significantly from 3.25% to 20.29%. The AUC also increased from 0.51 to 0.55. In spite of the improved metrics, I would still consider this model to be quite poor. In the classification report, the recall of the majority class is good at 98% but it is still poor for the minority class 'Delay' at 12%.

Many improvements could be tried, given more time. I have described them in the conclusion of the on-premises notebook, so I won’t repeat them here in detail. One simple option to improve the performance of both the linear learner and the XGBoost model is hyperparameter tuning with a grid search. This would, however, increase processing time significantly. A session is limited to only 2 hours, which is not enough time for a proper grid search.
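
As a rough illustration of how quickly a grid search grows, the sketch below counts the configurations in a small grid and compares the total training time with the 2-hour session limit. The grid values and the time per training job are assumptions chosen for illustration, not measurements from this notebook.

```python
from itertools import product

# Hypothetical search grid over three XGBoost hyperparameters
grid = {
    "max_depth": [4, 6, 8],
    "eta": [0.1, 0.2, 0.3],
    "num_round": [42, 100, 200],
}

# Every combination of the listed values is one training job
combinations = list(product(*grid.values()))

minutes_per_job = 10          # assumed training time per configuration
session_limit_minutes = 120   # the 2-hour session limit mentioned above

total_minutes = len(combinations) * minutes_per_job
exceeds_session = total_minutes > session_limit_minutes

print(len(combinations), total_minutes, exceeds_session)
```

Even three values for each of three hyperparameters gives 27 training jobs, so under these assumptions a sequential search would not fit in a single session.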

Another option, also mentioned in the on-premises solution, is adjusting the binary convert threshold. As mentioned above, I set this to 0.5 by default. In the last section of code, I changed it to 0.3, so that any output probability greater than 0.3 is predicted as a 1 or a ‘delay’ class. This manipulates the output to increase the number of predictions of the minority class. Comparing the confusion matrices for thresholds of 0.5 and 0.3, the number of correctly predicted delays increased starkly from 6162 to 20871, although this came at the cost of fewer ‘no delays’ being correctly predicted (190764 down to 169036). The overall accuracy fell from 80.27% to 77.41%, as did the precision (67.72% to 45.70%) and the specificity (98.41% to 87.21%); however, the recall rose sharply (11.97% to 40.53%), as did the F1-score (20.29% to 42.96%). The AUC of the ROC curve also improved significantly from 0.55 to 0.64.
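
A note on the AUC figures: assuming `plot_roc` computes the AUC from the binary predictions it is passed (as the calls above suggest), the ROC curve has a single operating point and its area reduces to the mean of recall and specificity. That would explain why the AUC changes with the threshold; an AUC computed from the raw probabilities in `target_predicted['target']` would not depend on the threshold. The sketch below checks this against the figures reported above, with hypothetical helper names.

```python
def single_point_auc(recall, specificity):
    """Area under the ROC polyline (0,0) -> (FPR, TPR) -> (1,1)."""
    return (recall + specificity) / 2


def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Figures reported above for XGB_2, written as fractions
auc_at_05 = single_point_auc(recall=0.1197, specificity=0.9841)
auc_at_03 = single_point_auc(recall=0.4053, specificity=0.8721)
f1_at_03 = f1_from(precision=0.4570, recall=0.4053)

print(round(auc_at_05, 2), round(auc_at_03, 2), round(f1_at_03, 4))
```

The two areas round to the reported 0.55 and 0.64, and the F1 value rounds to the reported 42.96%.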

Manipulating this threshold is a trade-off between recall on the one hand and precision and overall accuracy on the other. The right setting depends on how critical predicting the minority class is. In this case, flight delays of 15 minutes or more are not very critical in my opinion, compared to, say, detecting cancer in a patient. More investigation would be needed to determine the threshold's importance in the business model and to decide on the optimal setting.
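
The trade-off can be sketched on a handful of made-up scores. The `(score, label)` pairs and the helper `precision_recall` below are hypothetical and only illustrate the direction of the effect, not the notebook's actual results.

```python
# Made-up (score, label) pairs for illustration only; 1 = 'Delay'
scored = [(0.05, 0), (0.12, 0), (0.20, 0), (0.28, 1), (0.33, 0),
          (0.41, 1), (0.47, 0), (0.55, 1), (0.62, 0), (0.80, 1)]


def precision_recall(pairs, threshold):
    """Compute precision and recall when scores above the threshold count as 'Delay'."""
    tp = sum(1 for s, y in pairs if s > threshold and y == 1)
    fp = sum(1 for s, y in pairs if s > threshold and y == 0)
    fn = sum(1 for s, y in pairs if s <= threshold and y == 1)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Compare the lowered and the default thresholds
for threshold in (0.3, 0.5):
    p, r = precision_recall(scored, threshold)
    print(f"threshold={threshold}: precision={p:.2f}, recall={r:.2f}")
```

On this toy data, lowering the threshold from 0.5 to 0.3 raises recall and lowers precision, the same direction of change observed for XGB_2.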

To conclude, the XGBoost model is better than the linear learner (on-cloud) and the logistic regression models used in the on-premises notebook. It outperforms both on nearly all metrics, specificity being the exception noted earlier. The enhanced v2 dataset, with the added weather and holiday information, also significantly improves the model’s accuracy in predicting flight delays.
