# Problem: Predicting Airplane Delays

The goals of this notebook are:
- Process and create a dataset from downloaded ZIP files
- Exploratory data analysis (EDA)
- Establish a baseline model and improve it

## Introduction to business scenario
You work for a travel booking website that is working to improve the customer experience for flights that were delayed. The company wants to create a feature to let customers know if the flight will be delayed due to weather when the customers are booking the flight to or from the busiest airports for domestic travel in the US. 

You are tasked with solving part of this problem by using machine learning to identify whether a flight will be delayed due to weather. You have been given access to a dataset of on-time performance of domestic flights operated by large air carriers. You can use this data to train a machine learning model that predicts whether a flight to or from the busiest airports will be delayed.

### Dataset
The provided dataset contains scheduled and actual departure and arrival times reported by certified US air carriers that account for at least 1 percent of domestic scheduled passenger revenues. The data were collected by the Office of Airline Information, Bureau of Transportation Statistics (BTS). The dataset contains the date, time, origin, destination, airline, distance, and delay status of flights between 2014 and 2018.
The data are in 60 compressed files, where each file contains a CSV with the flight details for one month of the five years (2014 to 2018). The data can be downloaded from this [link](https://ucstaff-my.sharepoint.com/:f:/g/personal/ibrahim_radwan_canberra_edu_au/Er0nVreXmihEmtMz5qC5kVIB81-ugSusExPYdcyQTglfLg?e=bNO312). Please download the data files and place them on a relative path. The dataset(s) used in this assignment were compiled by the Office of Airline Information, Bureau of Transportation Statistics (BTS), Airline On-Time Performance Data, available at the following [link](https://www.transtats.bts.gov/Fields.asp?gnoyr_VQ=FGJ).
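
As a rough illustration of how monthly ZIP archives like these could be combined into one table, the sketch below reads the first CSV found in each archive and stacks the rows. The helper name `read_zipped_csvs`, the archive names, and the column names are all made up for the demonstration; the real files may be named and structured differently.

```python
import tempfile
import zipfile
from pathlib import Path

import pandas as pd


def read_zipped_csvs(zip_dir, usecols=None):
    """Read the first CSV inside every ZIP in zip_dir and stack the rows."""
    frames = []
    for zip_path in sorted(Path(zip_dir).glob("*.zip")):
        with zipfile.ZipFile(zip_path) as archive:
            csv_names = [n for n in archive.namelist() if n.lower().endswith(".csv")]
            if not csv_names:
                continue  # skip archives that contain no CSV member
            with archive.open(csv_names[0]) as handle:
                frames.append(pd.read_csv(handle, usecols=usecols))
    return pd.concat(frames, ignore_index=True)


# Demonstration with two tiny, made-up monthly archives (hypothetical names and columns)
with tempfile.TemporaryDirectory() as tmp:
    samples = [
        ("2014_1", "Year,Month,ArrDel15\n2014,1,0\n2014,1,1\n"),
        ("2014_2", "Year,Month,ArrDel15\n2014,2,0\n"),
    ]
    for month, rows in samples:
        with zipfile.ZipFile(Path(tmp) / f"flights_{month}.zip", "w") as archive:
            archive.writestr(f"flights_{month}.csv", rows)
    combined = read_zipped_csvs(tmp)

print(combined.shape)
```

Passing `usecols` keeps memory use down when only a subset of the columns is needed.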

# Step 1: Prepare the environment 

Use one of the Amazon SageMaker labs that we have practised with and perform the following steps:
1. Start a lab.
2. Create a notebook instance and name it "oncloudproject".
3. Increase the used memory to 25 GB from the additional configurations.
4. Open Jupyter Lab and upload this notebook into it.
5. Upload the two combined CSV files (combined_csv_v1.csv and combined_csv_v2.csv), which you created in Part A of this project.

# Step 2: Build and evaluate simple models

Write code to perform the following steps:
1. Split the data into training, validation and testing sets (70% - 15% - 15%).
2. Use the linear learner estimator to build a classification model.
3. Host the model on another instance.
4. Perform batch transform to evaluate the model on the testing data.
5. Report the performance metrics that you consider best suited to testing the model performance.

Note: You are required to perform the above steps on the two combined datasets separately and to comment on the difference.

In [1]:
import pandas as pd
import numpy as np
import os

In [2]:
df_2 = pd.read_csv("data/combined_csv_v2.csv")
df_2.head(5)

Unnamed: 0,target,Distance,DepHourofDay,AWND_O,PRCP_O,TAVG_O,AWND_D,PRCP_D,TAVG_D,SNOW_O,...,Origin_PHX,Origin_SFO,Dest_CLT,Dest_DEN,Dest_DFW,Dest_IAH,Dest_LAX,Dest_ORD,Dest_PHX,Dest_SFO
0,0.0,1464.0,7,57,0,281.0,25,0,209.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
1,1.0,1464.0,7,72,81,284.0,11,0,226.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
2,0.0,1464.0,7,49,0,219.0,24,0,244.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
3,0.0,1464.0,7,29,0,182.0,30,0,247.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
4,0.0,1464.0,7,52,0,214.0,41,0,220.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0


In [4]:
print("\nInfo for combined_csv_v2:")
print(df_2.info())


Info for combined_csv_v2:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1635590 entries, 0 to 1635589
Data columns (total 85 columns):
 #   Column                Non-Null Count    Dtype  
---  ------                --------------    -----  
 0   target                1635590 non-null  float64
 1   Distance              1635590 non-null  float64
 2   DepHourofDay          1635590 non-null  int64  
 3   AWND_O                1635590 non-null  int64  
 4   PRCP_O                1635590 non-null  int64  
 5   TAVG_O                1635590 non-null  float64
 6   AWND_D                1635590 non-null  int64  
 7   PRCP_D                1635590 non-null  int64  
 8   TAVG_D                1635590 non-null  float64
 9   SNOW_O                1635590 non-null  float64
 10  SNOW_D                1635590 non-null  float64
 11  Year_2015             1635590 non-null  float64
 12  Year_2016             1635590 non-null  float64
 13  Year_2017             1635590 non-null  float64
 14  Year_20

In [5]:
# Check for missing values
print("\nMissing values in combined_csv_v2:")
print(df_2.isnull().sum())


Missing values in combined_csv_v2:
target          0
Distance        0
DepHourofDay    0
AWND_O          0
PRCP_O          0
               ..
Dest_IAH        0
Dest_LAX        0
Dest_ORD        0
Dest_PHX        0
Dest_SFO        0
Length: 85, dtype: int64


## File combined_csv_v2

In [6]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

In [7]:
X_2=df_2.drop('target', axis=1)
y_2=df_2['target']

In [8]:
# Split data into training (70%), validation (15%), and testing (15%)
X_train_2, X_temp_2, y_train_2, y_temp_2 = train_test_split(X_2, y_2, test_size=0.3, random_state=42)
X_valid_2, X_test_2, y_valid_2, y_test_2 = train_test_split(X_temp_2, y_temp_2, test_size=0.5, random_state=42)

# Convert to NumPy arrays
X_train_2 = X_train_2.to_numpy()
y_train_2 = y_train_2.to_numpy()
X_valid_2 = X_valid_2.to_numpy()
y_valid_2 = y_valid_2.to_numpy()
X_test_2 = X_test_2.to_numpy()
y_test_2 = y_test_2.to_numpy()
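
With an imbalanced target, a plain random split can leave the three sets with slightly different shares of delayed flights. One option, not used in the cell above, is to pass `stratify` to `train_test_split`. The sketch below shows the same 70/15/15 scheme on synthetic data; the array names and the 20% positive rate are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 3))          # synthetic features
y_demo = np.array([1] * 200 + [0] * 800)     # synthetic labels, 20% positive

# First split off 30%, keeping the class ratio the same in both parts
X_tr, X_tmp, y_tr, y_tmp = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo)

# Then halve the 30% into validation and test, again stratified
X_va, X_te, y_va, y_te = train_test_split(
    X_tmp, y_tmp, test_size=0.5, random_state=42, stratify=y_tmp)

for name, part in [("train", y_tr), ("valid", y_va), ("test", y_te)]:
    print(name, len(part), part.mean())
```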

In [9]:
# Create local directories for the files that will be uploaded to SageMaker
os.makedirs('data/train', exist_ok=True)
os.makedirs('data/validation', exist_ok=True)
os.makedirs('data/test', exist_ok=True)
os.makedirs('data/batch', exist_ok=True)

In [10]:
# Save the data as CSV files
train_data_2 = pd.DataFrame(np.column_stack((y_train_2, X_train_2)))
train_data_2.to_csv('data/train/train_2.csv', header=False, index=False)

valid_data_2 = pd.DataFrame(np.column_stack((y_valid_2, X_valid_2)))
valid_data_2.to_csv('data/validation/validation_2.csv', header=False, index=False)

test_data_2 = pd.DataFrame(np.column_stack((y_test_2, X_test_2)))
test_data_2.to_csv('data/test/test_2.csv', header=False, index=False)
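
Because these files are written with the label in the first column and no header row, they need `header=None` when read back with pandas. Otherwise the first data row is consumed as column names. The sketch below demonstrates the difference on three made-up rows written to a temporary file.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Three made-up rows: label in column 0, two features after it
demo_labels = np.array([0.0, 1.0, 0.0])
demo_features = np.array([[1464.0, 7.0], [1464.0, 9.0], [802.0, 13.0]])

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.csv")
    pd.DataFrame(np.column_stack((demo_labels, demo_features))).to_csv(
        path, header=False, index=False)

    with_default = pd.read_csv(path)                 # first data row becomes the header
    with_no_header = pd.read_csv(path, header=None)  # every row is kept as data

print(len(with_default), len(with_no_header))
```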

In [11]:
import boto3
import sagemaker
from sagemaker import image_uris
from sagemaker import Session
from sagemaker.estimator import Estimator

sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /home/ec2-user/.config/sagemaker/config.yaml


In [12]:
# Set up SageMaker
sess = sagemaker.Session()
bucket = sess.default_bucket()
prefix = 'sagemaker/linear-v2'
region = boto3.Session().region_name

# Get the Linear Learner container image
linear_learner_container = image_uris.retrieve("linear-learner", region)

In [13]:
# Upload the training and validation data to S3
s3 = boto3.client('s3')

train_path_2 = sess.upload_data(path='data/train/train_2.csv', key_prefix=f'{prefix}/input/training')
valid_path_2 = sess.upload_data(path='data/validation/validation_2.csv', key_prefix=f'{prefix}/input/validation')
test_path_2 = sess.upload_data(path='data/test/test_2.csv', key_prefix=f'{prefix}/input/test')

In [14]:
# Print S3 paths
print(f'Training data uploaded to: {train_path_2}')
print(f'Validation data uploaded to: {valid_path_2}')
print(f'Test data uploaded to: {test_path_2}')

Training data uploaded to: s3://sagemaker-us-east-1-807330624080/sagemaker/linear-v2/input/training/train_2.csv
Validation data uploaded to: s3://sagemaker-us-east-1-807330624080/sagemaker/linear-v2/input/validation/validation_2.csv
Test data uploaded to: s3://sagemaker-us-east-1-807330624080/sagemaker/linear-v2/input/test/test_2.csv


In [15]:
# Define the Linear Learner Estimator
linear_estimator = Estimator(
    linear_learner_container,
    role=sagemaker.get_execution_role(),
    instance_count=1,
    instance_type='ml.m5.xlarge',
    output_path='s3://{}/{}/output'.format(bucket,prefix))

# Setting mini_batch_size to 100 since the dataset is large
linear_estimator.set_hyperparameters(predictor_type='binary_classifier',mini_batch_size=100, epochs=10)


In [16]:
# Define the data channels with the correct content type
training_data_channel = sagemaker.TrainingInput(s3_data=train_path_2, content_type='text/csv')
validation_data_channel = sagemaker.TrainingInput(s3_data=valid_path_2, content_type='text/csv')


# Training Phase
### Fitting the model

In [17]:
# Fit the model with the correct training and validation data channels
linear_estimator.fit({'train': training_data_channel, 'validation': validation_data_channel})

INFO:sagemaker:Creating training-job with name: linear-learner-2024-10-30-23-32-38-721


2024-10-30 23:32:40 Starting - Starting the training job...
2024-10-30 23:32:55 Starting - Preparing the instances for training...
2024-10-30 23:33:17 Downloading - Downloading input data...
2024-10-30 23:33:47 Downloading - Downloading the training image......
2024-10-30 23:35:03 Training - Training image download completed. Training in progress..
Docker entrypoint called with argument(s): train
Running default environment configuration script
[10/30/2024 23:35:12 INFO 140661206583104] Reading default configuration from /opt/amazon/lib/python3.8/site-packages/algorithm/resources/default-input.json: {'mini_batch_size': '1000', 'epochs': '15', 'feature_dim': 'auto', 'use_bias': 'true', 'binary_classifier_model_selection_criteria': 'accuracy', 'f_beta': '1.0', 'target_recall': '0.8', 'target_precision': '0.8', 'num_models': 'auto', 'num_calibration_samples': '10000000', 'init_method': 'uniform', 'init_scale': '0.07', 'init_sigma': '0.01', 'init_bias': '0.0', 'optimi

# Batch Transform Deployment

In [18]:
transformer = linear_estimator.transformer(instance_count=1, 
                                           instance_type="ml.m4.xlarge", 
                                           assemble_with="Line", 
                                           output_path=f"s3://{bucket}/{prefix}/batch_output")

INFO:sagemaker:Creating model with name: linear-learner-2024-10-31-00-33-46-371


In [19]:
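# The test file was written without a header row, so read it with header=None
test_df_2 = pd.read_csv("data/test/test_2.csv", header=None)
# Drop the label (first column): batch transform expects features only
test_2_batch = test_df_2[test_df_2.columns[1:]]
test_2_batch.to_csv("data/batch/batch_input_2.csv", index = False, header = False)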

In [20]:
batch_test_path = sess.upload_data(path="data/batch/batch_input_2.csv", key_prefix=prefix + "/batch_input")

In [21]:
transformer.transform(batch_test_path, content_type = "text/csv", split_type="Line")
transformer.wait()

INFO:sagemaker:Creating transform job with name: linear-learner-2024-10-31-00-33-57-006


..................................................
Docker entrypoint called with argument(s): serve
Docker entrypoint called with argument(s): serve
Running default environment configuration script
[10/31/2024 00:42:15 INFO 140274911864640] Memory profiler is not enabled by the environment variable ENABLE_PROFILER.
Running default environment configuration script
[10/31/2024 00:42:15 INFO 140274911864640] Memory profiler is not enabled by the environment variable ENABLE_PROFILER.
  if num_device is 1 and 'dist' not in kvstore:
  if num_device is 1 and 'dist' not in kvstore:
  if cons['type'] is 'ineq':
  if len(self.X_min) is not 0:
  if cons['type'] is 'ineq':
  if len(self.X_min) is not 0:
[10/31/2024 00:42:19 INFO 140274911864640] loaded entry point class algorithm.serve.server_config:config_api
[10/31/2024 00:42:19 INFO 140274911864640] loading entry points
[10/31/2024 00:42:19 INFO 

## Testing performance

In [26]:
import io

y_file = boto3.client("s3").get_object(Bucket = bucket, Key = f"{prefix}/batch_output/batch_input_2.csv.out")
y_pred = pd.read_csv(io.BytesIO(y_file["Body"].read()), header = None, names = ["Predicted"])

In [27]:
y_pred["target"] = y_pred.index
y_pred

Unnamed: 0,Predicted,target
"{""predicted_label"":0",score:0.126093283295631},"{""predicted_label"":0"
"{""predicted_label"":0",score:0.151854246854782},"{""predicted_label"":0"
"{""predicted_label"":0",score:0.066909521818161},"{""predicted_label"":0"
"{""predicted_label"":0",score:0.122791454195976},"{""predicted_label"":0"
"{""predicted_label"":0",score:0.112313963472843},"{""predicted_label"":0"
...,...,...
"{""predicted_label"":0",score:0.159813627600669},"{""predicted_label"":0"
"{""predicted_label"":0",score:0.490087538957595},"{""predicted_label"":0"
"{""predicted_label"":0",score:0.324230462312698},"{""predicted_label"":0"
"{""predicted_label"":0",score:0.278674840927124},"{""predicted_label"":0"


In [28]:
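# The 'target' column holds strings such as '{"predicted_label":0'; comparing them with
# the integer 1 is always False, so extract the digit after the colon instead
y_pred = y_pred['target'].str.split(':').str[-1].astype(int)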

y_true = test_df_2.iloc[:, 0]
accuracy = accuracy_score(y_true, y_pred)
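
The transform output displayed above suggests that each line is a JSON object with `predicted_label` and `score` fields. Under that assumption, the lines can be parsed with the standard `json` module rather than a CSV reader, which splits each object at the comma. The sketch below uses two sample lines reconstructed from the displayed values; the exact line format is an assumption.

```python
import json

# Two sample lines in the assumed output format (values taken from the table above)
raw_output = (
    '{"predicted_label":0,"score":0.126093283295631}\n'
    '{"predicted_label":0,"score":0.490087538957595}\n'
)

# One JSON object per non-empty line
records = [json.loads(line) for line in raw_output.splitlines() if line.strip()]
predicted_labels = [int(r["predicted_label"]) for r in records]
scores = [float(r["score"]) for r in records]

print(predicted_labels)
print(scores)
```

For the real job, `raw_output` would be the decoded body of the `.out` object read from S3.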

In [29]:
print(f"Accuracy: {accuracy:.4f}")

Accuracy: 0.7908


Even though both models reach the same accuracy of 0.7908, the precision and recall of model 2 are higher than those of model 1. Adding two key groups of features, holiday dates and weather data, improves the model's performance and increases its recall. This highlights the importance of relevant feature engineering in improving model performance.
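
Precision and recall are not computed in the cells shown here. As a minimal sketch of how they could be obtained alongside accuracy, the example below applies scikit-learn's metric functions to two short, made-up label lists (`demo_true` and `demo_pred` are illustrative, not model output).

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

# Made-up labels for illustration only
demo_true = [0, 0, 0, 0, 1, 1, 1, 0, 1, 0]
demo_pred = [0, 0, 1, 0, 1, 0, 1, 0, 0, 0]

# For binary labels, ravel() yields the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(demo_true, demo_pred).ravel()

demo_metrics = {
    "accuracy": accuracy_score(demo_true, demo_pred),
    "precision": precision_score(demo_true, demo_pred),  # tp / (tp + fp)
    "recall": recall_score(demo_true, demo_pred),        # tp / (tp + fn)
    "f1": f1_score(demo_true, demo_pred),
}

print(tn, fp, fn, tp)
print(demo_metrics)
```

On an imbalanced target, recall and precision for the delayed class are usually more informative than accuracy alone.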

# Step 3: Build and evaluate ensemble models

Write code to perform the following steps:
1. Split the data into training, validation and testing sets (70% - 15% - 15%).
2. Use the XGBoost estimator to build a classification model.
3. Host the model on another instance.
4. Perform batch transform to evaluate the model on the testing data.
5. Report the performance metrics that you consider best suited to testing the model performance.
6. Write down your observations on the difference between the performance of the simple and ensemble models.

Note: You are required to perform the above steps on the two combined datasets separately.

In [19]:
import pandas as pd

df_2 = pd.read_csv('data/combined_csv_v2.csv')

In [3]:
import boto3
import sagemaker
from sagemaker import image_uris
from sagemaker import Session
from sagemaker.estimator import Estimator

sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /home/ec2-user/.config/sagemaker/config.yaml


In [35]:
sess = sagemaker.Session()
bucket = sess.default_bucket()

region = boto3.Session().region_name
container = image_uris.retrieve('xgboost', region,version='latest')

INFO:sagemaker.image_uris:Ignoring unnecessary instance type: None.


In [30]:
from sklearn.model_selection import train_test_split

# 70% training, then split the remaining 30% evenly into validation and test
train, validation = train_test_split(df_2, test_size=0.3)
validation, test = train_test_split(validation, test_size = 0.5)

In [31]:
train.to_csv("data/train/train.csv", index = False, header = False)
validation.to_csv("data/validation/validation.csv", index = False, header = False)

In [32]:
test_X = test.drop(test.columns[0], axis=1)
test_y = test[test.columns[0]]
test_X.to_csv("data/test/test.csv", index = False, header = False)

In [33]:
prefix = 'sagemaker/xgboost-v2'
# Separate the prefix from the rest of the key with "/" so the objects land under the prefix
train_path = sess.upload_data(path="data/train/train.csv", key_prefix=f"{prefix}/input/training")
valid_path = sess.upload_data(path="data/validation/validation.csv", key_prefix=f"{prefix}/input/validation")
test_X_path = sess.upload_data(path="data/test/test.csv", key_prefix=f"{prefix}/input/test")
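
S3 object keys are plain strings, so a prefix and the rest of the key need an explicit `/` between them. The sketch below shows the difference using a small helper; `s3_key` is a hypothetical convenience function, not part of boto3 or the SageMaker SDK.

```python
def s3_key(*parts):
    """Join key segments with single slashes, ignoring stray leading or trailing ones."""
    return "/".join(p.strip("/") for p in parts if p.strip("/"))


demo_prefix = "sagemaker/xgboost-v2"

# Plain concatenation fuses the prefix with the next segment
print(demo_prefix + "input/training")         # sagemaker/xgboost-v2input/training

# Joining with a separator keeps the segments distinct
print(s3_key(demo_prefix, "input/training"))  # sagemaker/xgboost-v2/input/training
```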

In [36]:
xgboost_estimator = Estimator(container,
                              role=sagemaker.get_execution_role(),
                              instance_count=1,
                              instance_type='ml.m4.xlarge',
                              output_path='s3://{}/{}/output'.format(bucket,prefix))

In [37]:
xgboost_estimator.set_hyperparameters(objective='binary:logistic', num_round=20)

In [38]:
training_data_channel = sagemaker.TrainingInput(s3_data=train_path, content_type='text/csv')
validation_data_channel = sagemaker.TrainingInput(s3_data=valid_path, content_type='text/csv')

In [39]:
xgboost_estimator.fit({'train': training_data_channel,'validation': validation_data_channel})

INFO:sagemaker:Creating training-job with name: xgboost-2024-10-31-00-48-10-400


2024-10-31 00:48:11 Starting - Starting the training job...
2024-10-31 00:48:27 Starting - Preparing the instances for training...
2024-10-31 00:48:54 Downloading - Downloading input data...
2024-10-31 00:49:39 Downloading - Downloading the training image......
2024-10-31 00:50:20 Training - Training image download completed. Training in progress.
Arguments: train
[2024-10-31:00:50:31:INFO] Running standalone xgboost training.
[2024-10-31:00:50:31:INFO] File size need to be processed in the node: 449.96mb. Available memory size in the node: 8450.46mb
[2024-10-31:00:50:31:INFO] Determined delimiter of CSV input is ','
[00:50:31] S3DistributionType set as FullyReplicated
[00:50:34] 1144913x84 matrix with 96172692 entries loaded from /opt/ml/input/data/train?format=csv&label_column=0&delimiter=,
[2024-10-31:00:50:34:INFO] Determined delimiter of CSV input is ','
[00:50:34] S3DistributionType set as FullyReplicated

In [40]:
xgboost_transformer = xgboost_estimator.transformer(
    instance_count=1,
    instance_type="ml.m4.xlarge",
    strategy="MultiRecord",
    assemble_with="Line",
    output_path="s3://{}/{}/output".format(bucket, prefix),
)

INFO:sagemaker:Creating model with name: xgboost-2024-10-31-00-51-58-078


In [41]:
xgboost_transformer.transform(test_X_path, content_type="text/csv", split_type="Line")
xgboost_transformer.wait()

INFO:sagemaker:Creating transform job with name: xgboost-2024-10-31-00-51-58-668


........................................
Arguments: serve
Arguments: serve
[2024-10-31 00:58:41 +0000] [1] [INFO] Starting gunicorn 19.9.0
[2024-10-31 00:58:41 +0000] [1] [INFO] Listening at: http://0.0.0.0:8080 (1)
[2024-10-31 00:58:41 +0000] [1] [INFO] Using worker: gevent
[2024-10-31 00:58:41 +0000] [21] [INFO] Booting worker with pid: 21
[2024-10-31 00:58:41 +0000] [22] [INFO] Booting worker with pid: 22
[2024-10-31 00:58:41 +0000] [23] [INFO] Booting worker with pid: 23
[2024-10-31 00:58:41 +0000] [24] [INFO] Booting worker with pid: 24
  monkey.patch_all(subprocess=True)
[2024-10-31:00:58:41:INFO] Model loaded successfully for worker : 21
  monkey.patch_all(subprocess=True)
[2024-10-31 00:58:41 +0000] [1] [INFO] Starting gunicorn 19.9.0
[2024-10-31 00:58:41 +0000] [1] [INFO] Listening at: http://0.0.0.0:8080 (1)
[2024-10-31 00:58:41 +0000] [1] [INFO] Using worke

In [42]:
s3 = boto3.client("s3")
response = s3.list_objects_v2(Bucket=bucket, Prefix=f"{prefix}/output/")
for obj in response.get('Contents', []):
    print(obj['Key'])

sagemaker/xgboost-v1/output/test.csv.out
sagemaker/xgboost-v1/output/xgboost-2024-10-31-00-48-10-400/debug-output/training_job_end.ts
sagemaker/xgboost-v1/output/xgboost-2024-10-31-00-48-10-400/output/model.tar.gz
sagemaker/xgboost-v1/output/xgboost-2024-10-31-00-48-10-400/profiler-output/framework/training_job_end.ts
sagemaker/xgboost-v1/output/xgboost-2024-10-31-00-48-10-400/profiler-output/system/incremental/2024103100/1730335680.algo-1.json
sagemaker/xgboost-v1/output/xgboost-2024-10-31-00-48-10-400/profiler-output/system/incremental/2024103100/1730335740.algo-1.json
sagemaker/xgboost-v1/output/xgboost-2024-10-31-00-48-10-400/profiler-output/system/incremental/2024103100/1730335800.algo-1.json
sagemaker/xgboost-v1/output/xgboost-2024-10-31-00-48-10-400/profiler-output/system/incremental/2024103100/1730335860.algo-1.json
sagemaker/xgboost-v1/output/xgboost-2024-10-31-00-48-10-400/profiler-output/system/training_job_end.ts


In [44]:
import io
# Read the transform output from the same prefix that was used for the transformer's output_path
y_file = boto3.client("s3").get_object(Bucket = bucket, Key = f"{prefix}/output/test.csv.out")
y_pred = pd.read_csv(io.BytesIO(y_file["Body"].read()), header = None, names = ["Predicted"])

In [45]:
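# Threshold the predicted probability at 0.5; despite its name, the "actual" column holds the predicted class
y_pred["actual"] = y_pred["Predicted"].apply(lambda x : 1 if x > 0.5 else 0)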
y_pred

Unnamed: 0,Predicted,actual
0,0.144983,0
1,0.411593,0
2,0.685295,1
3,0.227151,0
4,0.141333,0
...,...,...
245334,0.439213,0
245335,0.195322,0
245336,0.217292,0
245337,0.117663,0


In [46]:
from sklearn.metrics import accuracy_score
accuracy = accuracy_score(test_y, y_pred["actual"])
print("Accuracy:", accuracy)

Accuracy: 0.8006594956366497


The results show a difference between the simple linear model and the XGBoost model. The linear model achieved an accuracy of 0.7908, while XGBoost reached approximately 0.8007, an improvement of about one percentage point. This is consistent with the ensemble model's ability to capture more complex patterns in the data, such as interactions between features. The holiday and weather features give XGBoost additional context to draw on when making predictions. Overall, the result suggests that more advanced modelling techniques are worth exploring, especially when they are combined with relevant features.
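
To put accuracy figures like these in context, it can help to compare them with a trivial baseline that always predicts the most frequent class. The sketch below computes that baseline for a made-up label list; `majority_baseline_accuracy` and `demo_labels` are illustrative names, and the 80/20 class mix is an assumption rather than the actual distribution in the test set.

```python
from collections import Counter


def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most frequent label."""
    counts = Counter(labels)
    most_common_count = counts.most_common(1)[0][1]
    return most_common_count / len(labels)


# Made-up labels: 8 on-time flights (0) and 2 delayed flights (1)
demo_labels = [0] * 8 + [1] * 2
print(majority_baseline_accuracy(demo_labels))  # 0.8
```

A model whose accuracy is close to this baseline has learned little beyond the class frequencies, so the gap between the two is often a more useful number than the accuracy itself.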