# Problem: Predicting Airplane Delays

The goals of this notebook are:
- Process and create a dataset from downloaded ZIP files
- Exploratory data analysis (EDA)
- Establish a baseline model and improve it

## Introduction to business scenario
You work for a travel booking website that is working to improve the customer experience for flights that were delayed. The company wants to create a feature to let customers know if the flight will be delayed due to weather when the customers are booking the flight to or from the busiest airports for domestic travel in the US. 

You are tasked with solving part of this problem by using machine learning to identify whether a flight will be delayed due to weather. You have been given access to a dataset of on-time performance of domestic flights operated by large air carriers. You can use this data to train a machine learning model to predict whether a flight to or from the busiest airports is going to be delayed.

### Dataset
The provided dataset contains scheduled and actual departure and arrival times reported by certified US air carriers that account for at least 1 percent of domestic scheduled passenger revenues. The data was collected by the Office of Airline Information, Bureau of Transportation Statistics (BTS). The dataset contains the date, time, origin, destination, airline, distance, and delay status of flights between 2014 and 2018.
The data are in 60 compressed files, one for each month of the five years from 2014 to 2018, and each file contains a CSV with the flight details for that month. The data can be downloaded from this link: [https://ucstaff-my.sharepoint.com/:f:/g/personal/ibrahim_radwan_canberra_edu_au/Er0nVreXmihEmtMz5qC5kVIB81-ugSusExPYdcyQTglfLg?e=bNO312]. Please download the data files and place them on a relative path. The dataset(s) used in this assignment were compiled by the Office of Airline Information, Bureau of Transportation Statistics (BTS), Airline On-Time Performance Data, available at the following link: [https://www.transtats.bts.gov/Fields.asp?gnoyr_VQ=FGJ].

# Step 1: Prepare the environment 

Use one of the Amazon SageMaker labs that we have practised with and perform the following steps:
1. Start a lab.
2. Create a notebook instance and name it "oncloudproject".
3. Increase the memory (volume size) to 25 GB in the additional configuration settings.
4. Open JupyterLab and upload this notebook into it.
5. Upload the two combined CSV files (combined_csv_v1.csv and combined_csv_v2.csv), which you created in Part A of this project.

# Step 2: Build and evaluate simple models

Write code to perform the following steps:
1. Split the data into training, validation, and testing sets (70% - 15% - 15%).
2. Use the linear learner estimator to build a classification model.
3. Host the model on another instance.
4. Perform a batch transform to evaluate the model on the testing data.
5. Report the performance metrics that you think best test the model's performance.

Note: You are required to perform the above steps on the two combined datasets separately and to comment on the difference.
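
For item 5, accuracy alone can be misleading if delayed flights are much rarer than on-time flights, so it helps to report precision, recall, and F1 as well. The sketch below computes these from plain Python lists. The function name `binary_metrics` and the toy labels are illustrative assumptions, not part of the assignment; inside the notebook, `sklearn.metrics.classification_report` reports the same quantities.

```python
def binary_metrics(actual, predicted):
    """Return confusion counts and common metrics for 0/1 labels."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)  # delays caught
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)  # on-time, predicted on-time
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)  # false alarms
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)  # missed delays

    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)

    return {"tp": tp, "tn": tn, "fp": fp, "fn": fn,
            "accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Toy example (made-up labels): 8 flights, 2 of them delayed (label 1)
actual = [0, 0, 0, 0, 0, 0, 1, 1]
predicted = [0, 0, 0, 0, 0, 1, 1, 0]
metrics = binary_metrics(actual, predicted)
print(metrics)
```

In this toy case the accuracy looks reasonable while only half of the delayed flights are caught, which is the kind of gap that recall makes visible.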

In [1]:
# Write the code here and add cells as you need

# Import libraries
import io
import os
import warnings

import pandas as pd

# Silence library warnings before the remaining imports run
warnings.simplefilter('ignore')

import boto3
import sagemaker
from sagemaker.image_uris import retrieve
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split


sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /home/ec2-user/.config/sagemaker/config.yaml


In [2]:
# Bucket Name

bucket_name = 'c94466a2114434l5161811t1w516189673394-labbucket-1qvf0d3q2y24g'

# Initialize a Boto3 S3 client
s3 = boto3.client('s3')

# Use the head_bucket method to check if the bucket exists
try:
    s3.head_bucket(Bucket=bucket_name)
    print(f"The bucket {bucket_name} exists and is accessible.")
except Exception as e:
    print(f"The bucket {bucket_name} does not exist or is not accessible. Error: {str(e)}")



The bucket c94466a2114434l5161811t1w516189673394-labbucket-1qvf0d3q2y24g exists and is accessible.


In [3]:
# Load the datasets
data_v1 = pd.read_csv("combined_csv_v1.csv")

data_v2 = pd.read_csv("combined_csv_v2.csv")

In [4]:
def split_data(data):
    train, test_validate = train_test_split(data, test_size=0.3, random_state=42)
    validate, test = train_test_split(test_validate, test_size=0.5, random_state=42)
    return train, validate, test

v1_train_data, v1_validate_data, v1_test_data = split_data(data_v1)
v2_train_data, v2_validate_data, v2_test_data = split_data(data_v2)
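
The two-stage split above first holds out 30% of the rows and then halves that holdout, which gives 70% / 15% / 15%. A minimal standard-library sketch of the same arithmetic follows. `three_way_split` is a hypothetical helper used only for illustration; the notebook itself relies on `train_test_split`.

```python
import random


def three_way_split(rows, holdout=0.3, seed=42):
    """Shuffle rows, hold out a fraction, then halve the holdout."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)

    n_holdout = round(len(shuffled) * holdout)  # 30% of the rows
    n_train = len(shuffled) - n_holdout         # remaining 70%
    n_validate = n_holdout // 2                 # half of the holdout

    train = shuffled[:n_train]
    validate = shuffled[n_train:n_train + n_validate]
    test = shuffled[n_train + n_validate:]
    return train, validate, test


rows = list(range(1000))
train, validate, test = three_way_split(rows)
print(len(train), len(validate), len(test))
```

Every row lands in exactly one of the three sets, so the test set stays unseen during training and validation.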

In [6]:

# Use linear learner estimator to build a classification model
container = retrieve('linear-learner', boto3.Session().region_name)

# Upload the data to S3
s3_resource = boto3.Session().resource('s3')

def upload_s3_csv(filename, folder, dataframe, data_prefix):
    csv_buffer = io.StringIO()
    dataframe.to_csv(csv_buffer, header=False, index=False)
    s3_resource.Bucket(bucket_name).Object(os.path.join(data_prefix, folder, filename)).put(Body=csv_buffer.getvalue())
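
`upload_s3_csv` writes each frame with `header=False, index=False` because the SageMaker built-in algorithms used here read CSV training data without a header row and with the label in the first column. The function assumes the target is already the first column of the frame. The sketch below shows that layout using only the standard library; the helper name `to_training_csv`, the column names, and the sample rows are invented for illustration.

```python
import csv
import io


def to_training_csv(records, label_key):
    """Serialize dict records as header-less CSV with the label first."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for record in records:
        # Keep the feature order of the record, minus the label
        features = [value for key, value in record.items() if key != label_key]
        writer.writerow([record[label_key]] + features)
    return buffer.getvalue()


# Hypothetical rows; the real columns come from the combined CSV files
sample = [
    {"Distance": 689.0, "target": 0, "DepHourofDay": 21},
    {"Distance": 731.0, "target": 1, "DepHourofDay": 9},
]
body = to_training_csv(sample, label_key="target")
print(body)
```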
    

In [None]:
# Upload the train, test, and validation splits of combined_csv_v1.csv (`data_v1`) to S3

# For data_v1
v1_data_prefix = 'data_v1'
v1_train_file = 'v1_train.csv'
v1_test_file = 'v1_test.csv'
v1_validate_file = 'v1_validate.csv'

upload_s3_csv(v1_train_file, 'train', v1_train_data,v1_data_prefix)
upload_s3_csv(v1_test_file, 'test', v1_test_data, v1_data_prefix)
upload_s3_csv(v1_validate_file, 'validate', v1_validate_data, v1_data_prefix)

In [7]:
# Upload the train, test, and validation splits of combined_csv_v2.csv (`data_v2`) to S3

# For data_v2
v2_data_prefix = 'data_v2'
v2_train_file = 'v2_train.csv'
v2_test_file = 'v2_test.csv'
v2_validate_file = 'v2_validate.csv'

upload_s3_csv(v2_train_file, 'train', v2_train_data,v2_data_prefix)
upload_s3_csv(v2_test_file, 'test', v2_test_data, v2_data_prefix)
upload_s3_csv(v2_validate_file, 'validate', v2_validate_data, v2_data_prefix)

In [None]:
# Create SageMaker Estimator for data_v1

v1_estimator = sagemaker.estimator.Estimator(
    image_uri=container,
    role=sagemaker.get_execution_role(),
    instance_count=1,
    instance_type='ml.m4.xlarge',
    output_path="s3://{}/{}/output/data_v1".format(bucket_name, v1_data_prefix),
    sagemaker_session=sagemaker.Session()
)

# Set hyperparameters for data_v1
v1_estimator.set_hyperparameters(
    predictor_type="binary_classifier",
    mini_batch_size=64,
)

# Train the data_v1 model
train_channel_v1 = sagemaker.inputs.TrainingInput(
    "s3://{}/{}/train/{}".format(bucket_name, v1_data_prefix, v1_train_file),
    content_type='text/csv'
)

validate_channel_v1 = sagemaker.inputs.TrainingInput(
    "s3://{}/{}/validate/{}".format(bucket_name, v1_data_prefix, v1_validate_file),
    content_type='text/csv'
)

data_channels_v1 = {'train': train_channel_v1, 'validation': validate_channel_v1}

v1_estimator.fit(inputs=data_channels_v1)




INFO:sagemaker:Creating training-job with name: linear-learner-2023-11-02-17-38-47-326


2023-11-02 17:38:47 Starting - Starting the training job...
2023-11-02 17:39:14 Starting - Preparing the instances for training.........
2023-11-02 17:40:42 Downloading - Downloading input data......
2023-11-02 17:41:23 Training - Downloading the training image......
2023-11-02 17:42:28 Training - Training image download completed. Training in progress..
Docker entrypoint called with argument(s): train
Running default environment configuration script
[11/02/2023 17:42:49 INFO 140434838796096] Reading default configuration from /opt/amazon/lib/python3.8/site-packages/algorithm/resources/default-input.json: {'mini_batch_size': '1000', 'epochs': '15', 'feature_dim': 'auto', 'use_bias': 'true', 'binary_classifier_model_selection_criteria': 'accuracy', 'f_beta': '1.0', 'target_recall': '0.8', 'target_precision': '0.8', 'num_models': 'auto', 'num_calibration_samples': '10000000', 'init_method': 'uniform', 'init_scale': '0.07', 'init_sigma': '0.01', 'init_bias': '0.0', '

In [None]:
# Create SageMaker Estimator for data_v2 
v2_estimator = sagemaker.estimator.Estimator(
    image_uri=container,
    role=sagemaker.get_execution_role(),
    instance_count=1,
    instance_type='ml.m4.xlarge',
    output_path="s3://{}/{}/output/data_v2".format(bucket_name, v2_data_prefix),
    sagemaker_session=sagemaker.Session()
)

# Set hyperparameters for data_v2
v2_estimator.set_hyperparameters(
    predictor_type="binary_classifier",
    mini_batch_size=64,
)

# Train the data_v2 model 
train_channel_v2 = sagemaker.inputs.TrainingInput(
    "s3://{}/{}/train/{}".format(bucket_name, v2_data_prefix, v2_train_file),
    content_type='text/csv'
)

validate_channel_v2 = sagemaker.inputs.TrainingInput(
    "s3://{}/{}/validate/{}".format(bucket_name, v2_data_prefix, v2_validate_file),
    content_type='text/csv'
)

data_channels_v2 = {'train': train_channel_v2, 'validation': validate_channel_v2}

v2_estimator.fit(inputs=data_channels_v2)

In [None]:
# Host the model for data_v1
v1_predictor = v1_estimator.deploy(
    initial_instance_count=1,
    instance_type="ml.m4.xlarge"
)

# Host the model for data_v2
v2_predictor = v2_estimator.deploy(
    initial_instance_count=1,
    instance_type="ml.m4.xlarge"
)

# Batch transform for data_v1
v1_batch_input = "s3://{}/{}/test/{}".format(bucket_name, v1_data_prefix, v1_test_file)
v1_batch_output = "s3://{}/{}/output/batch/data_v1".format(bucket_name, v1_data_prefix)

v1_transformer = v1_estimator.transformer(
    instance_count=1,
    instance_type="ml.m4.xlarge",
    strategy="SingleRecord",
    output_path=v1_batch_output,
)

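# The uploaded test file still has the label in its first column; input_filter drops it before inference
v1_transformer.transform(data=v1_batch_input, content_type='text/csv', split_type='Line', input_filter='$[1:]')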
v1_transformer.wait()

# Batch transform for data_v2
v2_batch_input = "s3://{}/{}/test/{}".format(bucket_name, v2_data_prefix, v2_test_file)
v2_batch_output = "s3://{}/{}/output/batch/data_v2".format(bucket_name, v2_data_prefix)

v2_transformer = v2_estimator.transformer(
    instance_count=1,
    instance_type="ml.m4.xlarge",
    strategy="SingleRecord",
    output_path=v2_batch_output,
)

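# The uploaded test file still has the label in its first column; input_filter drops it before inference
v2_transformer.transform(data=v2_batch_input, content_type='text/csv', split_type='Line', input_filter='$[1:]')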
v2_transformer.wait()

In [None]:
# Download the batch transform results for data_v1
# download_file expects an object key (no "s3://bucket/" prefix); batch transform
# names its output after the input file with ".out" appended
v1_batch_output_key = "{}/output/batch/data_v1/{}.out".format(v1_data_prefix, v1_test_file)
v1_batch_output_file = "data_v1_batch_results.csv"

s3.download_file(bucket_name, v1_batch_output_key, v1_batch_output_file)
# The output file has no header row
v1_batch_results = pd.read_csv(v1_batch_output_file, header=None)

In [None]:
# Download the batch transform results for data_v2
# download_file expects an object key (no "s3://bucket/" prefix); batch transform
# names its output after the input file with ".out" appended
v2_batch_output_key = "{}/output/batch/data_v2/{}.out".format(v2_data_prefix, v2_test_file)
v2_batch_output_file = "data_v2_batch_results.csv"

s3.download_file(bucket_name, v2_batch_output_key, v2_batch_output_file)
# The output file has no header row
v2_batch_results = pd.read_csv(v2_batch_output_file, header=None)

# Step 3: Build and evaluate ensemble models

Write code to perform the following steps:
1. Split the data into training, validation, and testing sets (70% - 15% - 15%).
2. Use the XGBoost estimator to build a classification model.
3. Host the model on another instance.
4. Perform a batch transform to evaluate the model on the testing data.
5. Report the performance metrics that you think best test the model's performance.
6. Write down your observations on the difference between the performance of the simple and ensemble models.

Note: You are required to perform the above steps on the two combined datasets separately.

In [4]:
# Reuse the data splits and S3 file names created in Step 2 above

# Define a function to create and train an XGBoost model
def train_xgboost_model(data_prefix, train_file, validate_file, test_data):
    
    # Use XGBoost container
    container = retrieve('xgboost', boto3.Session().region_name, '1.0-1')
    
    # Define hyperparameters
    hyperparams = {
        "num_round": "100",
        "eval_metric": "auc",
        "objective": "binary:logistic"
    }
    
    
    # Set S3 output location
    s3_output_location = f"s3://{bucket_name}/{data_prefix}/output/"
    
    # Create a SageMaker Estimator
    xgb_model = sagemaker.estimator.Estimator(
        container,
        sagemaker.get_execution_role(),
        instance_count=1,
        instance_type='ml.m4.xlarge',
        output_path=s3_output_location,
        hyperparameters=hyperparams,
        sagemaker_session=sagemaker.Session()
    )
    
    # Define training and validation channels
    train_channel = sagemaker.inputs.TrainingInput(
        f"s3://{bucket_name}/{data_prefix}/train/{train_file}",
        content_type='text/csv'
    )

    validate_channel = sagemaker.inputs.TrainingInput(
        f"s3://{bucket_name}/{data_prefix}/validate/{validate_file}",
        content_type='text/csv'
    )
    
    data_channels = {'train': train_channel, 'validation': validate_channel}
    
    # Train the model
    xgb_model.fit(inputs=data_channels, logs=False)
    
    # Create batch transform jobs and evaluate performance
    batch_X = test_data.iloc[:, 1:]
    
    batch_X_file = 'batch-in.csv'
    
    # upload_s3_csv takes the data prefix as its fourth argument
    upload_s3_csv(batch_X_file, f'batch-in/{data_prefix}', batch_X, data_prefix)
    
    batch_output = f"s3://{bucket_name}/{data_prefix}/batch-out/"
    batch_input = f"s3://{bucket_name}/{data_prefix}/batch-in/{data_prefix}/{batch_X_file}"

    xgb_transformer = xgb_model.transformer(
        instance_count=1,
        instance_type='ml.m4.xlarge',
        strategy='MultiRecord',
        assemble_with='Line',
        output_path=batch_output
    )
    
    xgb_transformer.transform(data=batch_input, data_type='S3Prefix', content_type='text/csv', split_type='Line')
    xgb_transformer.wait()
    
    s3 = boto3.client('s3')
    # The output object sits directly under the batch-out prefix, named after the input file plus ".out"
    obj = s3.get_object(Bucket=bucket_name, Key=f"{data_prefix}/batch-out/{batch_X_file}.out")
    target_predicted = pd.read_csv(io.BytesIO(obj['Body'].read()), names=['target'])
    
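    # Evaluate model performance
    # The target is the first column, matching the features taken from columns 1 onward
    actual = test_data.iloc[:, 0].values
    # binary:logistic returns probabilities, so threshold at 0.5 to get 0/1 labels
    predicted = (target_predicted['target'] > 0.5).astype(int).values
    accuracy = accuracy_score(actual, predicted)
    confusion = confusion_matrix(actual, predicted)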
    
    return accuracy, confusion

# Train and evaluate the XGBoost model for dataset 1
v1_accuracy, v1_confusion = train_xgboost_model(v1_data_prefix, v1_train_file, v1_validate_file, v1_test_data)

# Train and evaluate the XGBoost model for data_v2
v2_accuracy, v2_confusion = train_xgboost_model(v2_data_prefix, v2_train_file, v2_validate_file, v2_test_data)

# Print the results
print(f"Dataset 1 Accuracy: {v1_accuracy}")
print(f"Dataset 1 Confusion Matrix:\n{v1_confusion}")

# Print the results for data_v2
print(f"Dataset 2 Accuracy: {v2_accuracy}")
print(f"Dataset 2 Confusion Matrix:\n{v2_confusion}")
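
With `objective` set to `binary:logistic`, XGBoost returns a probability per row rather than a class label, so the predictions have to be thresholded before they can be compared with the 0/1 targets. The following sketch shows that step and how the cutoff shifts the balance between false alarms and missed delays. `to_labels`, `count_errors`, the scores, and the cutoffs are illustrative values, not results from this notebook.

```python
def to_labels(scores, threshold=0.5):
    """Convert predicted probabilities to 0/1 labels at a given cutoff."""
    return [1 if score > threshold else 0 for score in scores]


def count_errors(actual, labels):
    """Return (false positives, false negatives) for 0/1 labels."""
    false_pos = sum(1 for a, p in zip(actual, labels) if a == 0 and p == 1)
    false_neg = sum(1 for a, p in zip(actual, labels) if a == 1 and p == 0)
    return false_pos, false_neg


# Made-up targets and model scores for six flights
actual = [0, 0, 1, 0, 1, 1]
scores = [0.10, 0.45, 0.40, 0.70, 0.55, 0.90]

for threshold in (0.3, 0.5, 0.7):
    labels = to_labels(scores, threshold)
    fp, fn = count_errors(actual, labels)
    print(f"threshold={threshold}: false positives={fp}, false negatives={fn}")
```

A lower cutoff flags more flights as delayed and misses fewer real delays, at the cost of more false alarms; which trade-off is preferable depends on how the booking site intends to use the prediction.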

# Comment

The final observation (the comparison between the simple and ensemble models) is difficult to complete because the lab sessions are not consistent and processing the modeling data takes a long time. This makes it hard to carry out a thorough analysis and reach solid conclusions.