# Problem: Predicting Airplane Delays

The goals of this notebook are:
- Process and create a dataset from downloaded ZIP files
- Exploratory data analysis (EDA)
- Establish a baseline model and improve it

## Introduction to business scenario
You work for a travel booking website that is working to improve the customer experience for flights that were delayed. The company wants to create a feature to let customers know if the flight will be delayed due to weather when the customers are booking the flight to or from the busiest airports for domestic travel in the US. 

You are tasked with solving part of this problem by leveraging machine learning to identify whether the flight will be delayed due to weather. You have been given access to the a dataset of on-time performance of domestic flights operated by large air carriers. You can use this data to train a machine learning model to predict if the flight is going to be delayed for the busiest airports.

### Dataset
The provided dataset contains scheduled and actual departure and arrival times reported by certified US air carriers that account for at least 1 percent of domestic scheduled passenger revenues. The data was collected by the Office of Airline Information, Bureau of Transportation Statistics (BTS). The dataset contains date, time, origin, destination, airline, distance, and delay status of flights for flights between 2014 and 2018.
The data are in 60 compressed files, where each file contains a CSV for the flight details in a month for the five years (from 2014 - 2018). The data can be downloaded from this [link](https://ucstaff-my.sharepoint.com/:f:/g/personal/ibrahim_radwan_canberra_edu_au/Er0nVreXmihEmtMz5qC5kVIB81-ugSusExPYdcyQTglfLg?e=bNO312). Please download the data files and place them on a relative path. Dataset(s) used in this assignment were compiled by the Office of Airline Information, Bureau of Transportation Statistics (BTS), Airline On-Time Performance Data, available with the following [link](https://www.transtats.bts.gov/Fields.asp?gnoyr_VQ=FGJ). 

# Step 1: Prepare the environment 

Use one of the labs which we have practised on with the Amazon Sagemakers where you perform the following steps:
1. Start a lab.
2. Create a notebook instance and name it "oncloudproject".
3. Increase the used memory to 25 GB from the additional configurations.
4. Open Jupyter Lab and upload this notebook into it.
5. Upload the two combined CVS files (combined_csv_v1.csv and combined_csv_v2.csv), which you created in Part A of this project.

# Step 3: Build and evaluate ensembe models

Write code to perform the follwoing steps:
1. Split data into training, validation and testing sets (70% - 15% - 15%).
2. Use xgboost estimator to build a classifcation model.
3. Host the model on another instance
4. Perform batch transform to evaluate the model on testing data
5. Report the performance metrics that you see better test the model performance 
6. write down your observation on the difference between the performance of using the simple and ensemble models.
Note: You are required to perform the above steps on the two combined datasets separatey.

## SageMaker Model Training and Evaluation Pipeline  
Set up, train, and evaluate a machine learning model using SageMaker, with essential libraries for data processing and metrics evaluation.

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
import sagemaker
from sagemaker.inputs import TrainingInput
from sagemaker.estimator import Estimator
from sagemaker.transformer import Transformer
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score
import numpy as np


sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /home/ec2-user/.config/sagemaker/config.yaml


# Setting Up the SageMaker Environment and Preparing Data  
Initialize the SageMaker session and prepare your dataset for training and evaluation.

In [3]:
# Set up SageMaker session and role
sagemaker_session = sagemaker.Session()
role = "Your-SageMaker-Execution-Role-ARN"  # Replace with your SageMaker role ARN

# Step 1: Load and Split the Data
data = pd.read_csv("combined_csv_v2.csv")  # Load the dataset

  data = pd.read_csv("combined_csv_v2.csv")  # Load the dataset


# Inspecting Dataset Columns  
View the column names in the dataset for data preparation and analysis.

In [5]:
print(data.columns)


Index(['target', 'CRSDepTime', 'Cancelled', 'Diverted', 'Distance',
       'DistanceGroup', 'ArrDelay', 'ArrDelayMinutes', 'target.1', 'AirTime',
       ...
       'OriginState_IL', 'OriginState_NC', 'OriginState_TX', 'DestState_CA',
       'DestState_CO', 'DestState_GA', 'DestState_IL', 'DestState_NC',
       'DestState_TX', 'isHoliday_True'],
      dtype='object', length=113)


# Feature Engineering and Data Splitting  
Define feature and target columns, split data into training, validation, and test sets, and save the sets for model training.

In [6]:
# Define features and target using the correct column name
X = data.drop(columns=['target'])  # Assuming 'target' is the correct target column
y = data['target']

# Split into training (70%), validation (15%), and test (15%) sets
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.3, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)

# Save training and testing datasets
train_data = pd.concat([y_train, X_train], axis=1)
train_data.to_csv("train.csv", index=False, header=False)
test_data = pd.concat([y_test, X_test], axis=1)
test_data.to_csv("test.csv", index=False, header=False)


  train_data = pd.concat([y_train, X_train], axis=1)
  test_data = pd.concat([y_test, X_test], axis=1)


# Upload Training Data to S3  
Store your prepared training dataset in Amazon S3 for easy access during model training in SageMaker.

In [9]:
# Use SageMaker's default bucket
s3_bucket_name = sagemaker_session.default_bucket()

# Upload training data to the default S3 bucket
s3_input_train = sagemaker_session.upload_data("train.csv", bucket=s3_bucket_name, key_prefix="xgboost/train")


# Define and Train an XGBoost Model on SageMaker  
Configure an XGBoost estimator with hyperparameters and initiate model training using your prepared data in Amazon S3.

In [12]:
import sagemaker
from sagemaker.inputs import TrainingInput
from sagemaker.estimator import Estimator

# Initialize SageMaker session
sagemaker_session = sagemaker.Session()

# Automatically retrieve the SageMaker execution role
role = sagemaker.get_execution_role()

# Use the default S3 bucket for SageMaker
s3_bucket_name = sagemaker_session.default_bucket()
output_path = f"s3://{s3_bucket_name}/xgboost/output"

# Step 2: Define and Train XGBoost Model
xgb_estimator = Estimator(
    image_uri=sagemaker.image_uris.retrieve("xgboost", sagemaker_session.boto_region_name, "1.2-1"),
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
    output_path=output_path,
    sagemaker_session=sagemaker_session
)

# Set hyperparameters
xgb_estimator.set_hyperparameters(
    objective="binary:logistic",
    num_round=100,
    max_depth=5,
    eta=0.2,
    subsample=0.8,
    colsample_bytree=0.8,
    eval_metric="auc"
)

# Train the model using the training data in S3
xgb_estimator.fit({"train": TrainingInput(s3_input_train, content_type="csv")})


INFO:sagemaker.image_uris:Ignoring unnecessary instance type: None.
INFO:sagemaker:Creating training-job with name: sagemaker-xgboost-2024-11-01-09-34-02-262


2024-11-01 09:34:07 Starting - Starting the training job...
2024-11-01 09:34:21 Starting - Preparing the instances for training...
2024-11-01 09:34:44 Downloading - Downloading input data...
2024-11-01 09:35:19 Downloading - Downloading the training image...
2024-11-01 09:35:55 Training - Training image download completed. Training in progress..[34m[2024-11-01 09:35:59.832 ip-10-0-229-74.ec2.internal:7 INFO utils.py:27] RULE_JOB_STOP_SIGNAL_FILENAME: None[0m
[34mINFO:sagemaker-containers:Imported framework sagemaker_xgboost_container.training[0m
[34mINFO:sagemaker-containers:Failed to parse hyperparameter eval_metric value auc to Json.[0m
[34mReturning the value itself[0m
[34mINFO:sagemaker-containers:Failed to parse hyperparameter objective value binary:logistic to Json.[0m
[34mReturning the value itself[0m
[34mINFO:sagemaker-containers:No GPUs detected (normal if no gpus installed)[0m
[34mINFO:sagemaker_xgboost_container.training:Running XGBoost Sagemaker in algorithm 

# Deploy the Trained XGBoost Model  
Deploy the trained XGBoost model to a SageMaker endpoint for real-time inference.

In [13]:
# Step 3: Deploy the Model
predictor = xgb_estimator.deploy(initial_instance_count=1, instance_type="ml.m5.xlarge")


INFO:sagemaker:Creating model with name: sagemaker-xgboost-2024-11-01-09-36-50-451
INFO:sagemaker:Creating endpoint-config with name sagemaker-xgboost-2024-11-01-09-36-50-451
INFO:sagemaker:Creating endpoint with name sagemaker-xgboost-2024-11-01-09-36-50-451


------!

# Data Preparation for Batch Transformation
This section involves loading, cleaning, and saving the test dataset for use in the model evaluation.


In [29]:
import pandas as pd
import sagemaker
from sagemaker.transformer import Transformer

# Step 1: Load and Clean the Data
# Load the original test data file
file_path = 'test.csv'  # Replace with the actual file path if needed
test_data = pd.read_csv(file_path)

# Replace boolean strings 'True'/'False' with 1 and 0 across the entire DataFrame
test_data.replace({True: 1, False: 0, 'True': 1, 'False': 0}, inplace=True)

# Verify that all values are numeric
test_data = test_data.apply(pd.to_numeric, errors='coerce')

# Drop any rows containing NaN values introduced during conversion
test_data_cleaned = test_data.dropna()

# Save the cleaned data to a new CSV file
cleaned_file_path = 'test_cleaned.csv'
test_data_cleaned.to_csv(cleaned_file_path, index=False, header=False)
print(f"Cleaned data saved to {cleaned_file_path}")

Cleaned data saved to test_cleaned.csv


## Feature Count Validation and Adjustment
This section ensures that the test data has the correct number of features expected by the model, adjusting as necessary.


In [31]:
# Ensure we have the correct number of features by checking against the training feature count
expected_feature_count = 112  # The feature count expected by the model

# Adjust columns if necessary
if test_data.shape[1] > expected_feature_count:
    test_data = test_data.iloc[:, :expected_feature_count]  # Keep only the first 112 columns
elif test_data.shape[1] < expected_feature_count:
    print(f"Warning: Test data has fewer columns ({test_data.shape[1]}) than expected ({expected_feature_count}).")

# Convert 'True'/'False' strings to 1/0 if they still exist
test_data.replace({'True': 1, 'False': 0}, inplace=True)

# Save the adjusted test data
cleaned_file_path = 'test_cleaned.csv'
test_data.to_csv(cleaned_file_path, index=False, header=False)
print(f"Adjusted test data saved to {cleaned_file_path}")

Adjusted test data saved to test_cleaned.csv


## Batch Transformation of Cleaned Test Data
Proceeding with batch transformation using the adjusted test data uploaded to S3.


In [32]:
# Now proceed with batch transform using this adjusted data
import sagemaker
from sagemaker.transformer import Transformer

# Initialize SageMaker session
sagemaker_session = sagemaker.Session()

# Define the default bucket and upload cleaned test data to S3
s3_bucket_name = sagemaker_session.default_bucket()
s3_input_test = sagemaker_session.upload_data(cleaned_file_path, bucket=s3_bucket_name, key_prefix="xgboost/test")

# Assuming `predictor.endpoint_name` is the model endpoint created during training
transformer = Transformer(
    model_name=predictor.endpoint_name,
    instance_count=1,
    instance_type="ml.m5.xlarge",
    output_path=f"s3://{s3_bucket_name}/xgboost/output"  # Specify output location in S3
)

# Perform batch transform
transformer.transform(data=s3_input_test, content_type="text/csv", split_type="Line")
transformer.wait()

print(f"Batch transform output saved to: s3://{s3_bucket_name}/xgboost/output")

INFO:sagemaker:Creating transform job with name: sagemaker-xgboost-2024-11-01-10-40-44-856


.............................[34m[2024-11-01:10:45:38:INFO] No GPUs detected (normal if no gpus installed)[0m
[34m[2024-11-01:10:45:38:INFO] No GPUs detected (normal if no gpus installed)[0m
[34m[2024-11-01:10:45:38:INFO] nginx config: [0m
[34mworker_processes auto;[0m
[34mdaemon off;[0m
[34mpid /tmp/nginx.pid;[0m
[34merror_log  /dev/stderr;[0m
[34mworker_rlimit_nofile 4096;[0m
[34mevents {
  worker_connections 2048;[0m
[34m}[0m
[34mhttp {
  include /etc/nginx/mime.types;
  default_type application/octet-stream;
  access_log /dev/stdout combined;
  upstream gunicorn {
    server unix:/tmp/gunicorn.sock;
  }
  server {
    listen 8080 deferred;
    client_max_body_size 0;
    keepalive_timeout 3;
    location ~ ^/(ping|invocations|execution-parameters) {
      proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
      proxy_set_header Host $http_host;
      proxy_redirect off;
      proxy_read_timeout 60s;
      proxy_pass http://gunicorn;
    }
    location

## Clean Up Predictions File or Directory
This section ensures that any existing `predictions.csv` file or directory is removed before proceeding with the download of the new predictions file.


In [42]:
import os
import shutil

# Check if 'predictions.csv' exists and is a directory
if os.path.exists("predictions.csv"):
    if os.path.isdir("predictions.csv"):
        # List contents of the directory for inspection
        print("Contents of 'predictions.csv' directory:", os.listdir("predictions.csv"))
        
        # Remove the directory and all its contents
        shutil.rmtree("predictions.csv")
        print("'predictions.csv' directory and its contents have been removed.")
    else:
        # If it's a file, remove it directly
        os.remove("predictions.csv")
        print("'predictions.csv' file removed.")

# Proceed with the previous code to download the predictions file


Contents of 'predictions.csv' directory: ['claim.smd', 'test_cleaned.csv.out']
'predictions.csv' directory and its contents have been removed.


## Load and Evaluate Model Predictions
In this section, we will download the predictions from S3, clean up any existing files, and calculate various performance metrics based on the model's predictions.


In [47]:
import boto3
import numpy as np
import os
import shutil
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score

# Initialize Boto3 S3 client
s3_client = boto3.client('s3')

# Define S3 bucket and output prefix
s3_bucket_name = sagemaker_session.default_bucket()
output_prefix = "xgboost/output"

# List files in the output directory to find the batch transform output file (.out)
response = s3_client.list_objects_v2(Bucket=s3_bucket_name, Prefix=output_prefix)
output_files = [content['Key'] for content in response.get('Contents', [])]
print("Files in output path:", output_files)

# Locate the file ending in '.out' (the predictions file)
predictions_file_key = next((file for file in output_files if file.endswith('.out')), None)

# Clean up any pre-existing 'predictions.csv' directory or file to avoid conflicts
if os.path.exists("predictions.csv"):
    if os.path.isdir("predictions.csv"):
        shutil.rmtree("predictions.csv")  # Recursively remove the directory
    else:
        os.remove("predictions.csv")  # Remove the file if it exists

if predictions_file_key:
    # Use a unique temporary name for download
    temp_file = "temp_predictions.csv"
    s3_client.download_file(s3_bucket_name, predictions_file_key, temp_file)
    
    # Rename the temporary file to 'predictions.csv'
    os.rename(temp_file, "predictions.csv")

    # Verify that 'predictions.csv' was created successfully
    if os.path.isfile("predictions.csv"):
        try:
            # Load predictions, skip empty lines
            y_pred = np.loadtxt("predictions.csv", delimiter=",")
            y_pred = y_pred.flatten()

            # Ensure lengths match between y_pred and y_test
            if len(y_pred) != len(y_test):
                print(f"Inconsistent lengths detected: y_pred ({len(y_pred)}) vs y_test ({len(y_test)})")
                if len(y_pred) > len(y_test):
                    y_pred = y_pred[:len(y_test)]
                else:
                    y_pred = np.pad(y_pred, (0, len(y_test) - len(y_pred)), 'constant', constant_values=0)

            # Print out the prediction values to understand the distribution
            print("Predictions distribution (first 10 values):", y_pred[:10])
            
            # Dynamically find a threshold if needed, but start with a default of 0.5
            threshold = 0.5
            y_pred_binary = np.where(y_pred >= threshold, 1, 0)

            # Check distribution of binary predictions
            print("Binary Predictions distribution:", np.bincount(y_pred_binary))

            # Define a function to calculate and print evaluation metrics
            def print_metrics(y_true, y_pred_binary):
                print(f"Accuracy: {accuracy_score(y_true, y_pred_binary):.2f}")
                print(f"Precision: {precision_score(y_true, y_pred_binary):.2f}")
                print(f"Recall: {recall_score(y_true, y_pred_binary):.2f}")
                print(f"F1 Score: {f1_score(y_true, y_pred_binary):.2f}")
                print(f"AUC: {roc_auc_score(y_true, y_pred):.2f}")  # Use continuous y_pred for AUC

            # Display performance metrics for the model on the test dataset
            print("Performance metrics for Dataset V2:")
            print_metrics(y_test, y_pred_binary)
        
        except ValueError as e:
            print(f"Error loading predictions from 'predictions.csv': {e}")
    else:
        print("Download failed or predictions.csv does not exist.")
else:
    print("Prediction file (.out) not found. Please check the batch transform job for errors.")


Files in output path: ['xgboost/output/sagemaker-xgboost-2024-11-01-09-34-02-262/debug-output/claim.smd', 'xgboost/output/sagemaker-xgboost-2024-11-01-09-34-02-262/debug-output/collections/000000000/worker_0_collections.json', 'xgboost/output/sagemaker-xgboost-2024-11-01-09-34-02-262/debug-output/events/000000000000/000000000000_worker_0.tfevents', 'xgboost/output/sagemaker-xgboost-2024-11-01-09-34-02-262/debug-output/events/000000000010/000000000010_worker_0.tfevents', 'xgboost/output/sagemaker-xgboost-2024-11-01-09-34-02-262/debug-output/events/000000000020/000000000020_worker_0.tfevents', 'xgboost/output/sagemaker-xgboost-2024-11-01-09-34-02-262/debug-output/events/000000000030/000000000030_worker_0.tfevents', 'xgboost/output/sagemaker-xgboost-2024-11-01-09-34-02-262/debug-output/events/000000000040/000000000040_worker_0.tfevents', 'xgboost/output/sagemaker-xgboost-2024-11-01-09-34-02-262/debug-output/events/000000000050/000000000050_worker_0.tfevents', 'xgboost/output/sagemaker-xgb

  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))


The performance metrics for Dataset V2 indicate a concerning model evaluation. While the accuracy stands at 0.76, which might suggest that a significant portion of predictions are correct, the precision, recall, and F1 score all show values of 0.00. This is alarming, as it suggests that while some predictions may be correct, the model fails to effectively identify positive cases. Precision indicates the number of true positive predictions made out of all positive predictions, and a score of 0.00 signifies that no true positive cases were found. Similarly, the recall metric, which measures the model's ability to identify actual positive cases, also being 0.00 indicates a complete failure in detecting positive instances.

The F1 score, which combines precision and recall into a single metric, is also 0.00, underscoring the model's ineffectiveness. The AUC (Area Under the Curve) score of 0.50 implies that the model's performance is equivalent to random guessing, meaning it lacks discriminative power. This set of metrics calls for an immediate review of the model's training process, feature selection, and potentially its architecture, as the current state does not adequately serve the classification task at hand. Steps should be taken to enhance the model's ability to generalize and identify relevant patterns in the data.

## Clean Up the SageMaker Endpoint
In this step, we will delete the SageMaker endpoint to release resources and avoid incurring unnecessary charges.


In [48]:
# Step 6: Clean up the endpoint
predictor.delete_endpoint()

INFO:sagemaker:Deleting endpoint configuration with name: sagemaker-xgboost-2024-11-01-09-36-50-451
INFO:sagemaker:Deleting endpoint with name: sagemaker-xgboost-2024-11-01-09-36-50-451


# Final Comments
In comparing the two models—Linear Learner and XGBoost—used to predict flight delays, we observe distinctive benefits and limitations in each approach. These differences underscore trade-offs in computational resources, accuracy, interpretability, and deployment suitability, each of which influences the practical choice of model.

### Resource Efficiency and Computational Complexity
The Linear Learner model offers a lightweight, computationally efficient option, ideal for scenarios where resources or processing power are limited. This model is fast to train and deploy, making it an appropriate choice when time constraints or low computational resources are a concern. However, its simplicity limits its ability to capture the non-linear, intricate relationships typical of flight delay data. For instance, flight delays often involve complex factors like weather patterns, seasonal trends, and airport congestion. While Linear Learner can serve as a reliable baseline, it may struggle to identify these nuanced relationships, potentially leading to reduced predictive accuracy and underfitting.

On the other hand, XGBoost, an ensemble model, requires more computational resources and time due to its gradient-boosting approach, which iteratively builds decision trees and corrects errors from previous trees. This process enables XGBoost to capture complex data relationships and interactions, making it effective for high-dimensional datasets like flight delay data. Despite the computational cost, XGBoost’s robustness in handling complex datasets often translates into higher predictive accuracy. Nevertheless, this model’s resource demands may limit its use in scenarios with tight computational constraints or where rapid deployment is critical.

### Accuracy and Model Performance
In terms of accuracy, XGBoost generally outperforms Linear Learner due to its ability to model non-linear relationships and interactions among features. Flight delay prediction relies on understanding various interdependent factors, and XGBoost’s boosting technique allows it to iteratively improve its predictions by correcting errors, resulting in a model that can capture these dynamics effectively. However, the complexity of XGBoost introduces the risk of overfitting, where the model performs well on training data but fails to generalize to unseen data. Addressing this requires careful tuning and regularization techniques to maintain a balance between accuracy and generalizability.

Linear Learner, while faster and simpler, might underfit, especially if the relationships in the data are complex. Its limitations become apparent when multiple factors interact, such as weather conditions varying by season or airport traffic affecting delay likelihood. Nevertheless, Linear Learner’s speed and ease of implementation make it valuable as a quick, interpretable model for initial evaluations or in scenarios where predictive accuracy is less critical.

### Interpretability and Practical Application
Interpretability is often a priority in predictive modeling, especially in high-stakes fields like air travel, where insights into predictions can influence significant decisions. Linear Learner excels here, as it provides clear, interpretable coefficients that indicate the relationship between each feature and flight delay probability. This transparency allows stakeholders to understand which factors, such as weather or airport congestion, are contributing to delays.

Conversely, XGBoost, while powerful, lacks inherent interpretability due to its complex, tree-based structure. While tools like SHAP values can aid in interpreting XGBoost’s outputs, these additional interpretability steps introduce complexity, which may be a drawback in applications requiring straightforward, accessible explanations.

### Conclusion
In summary, Linear Learner’s simplicity, speed, and interpretability make it a viable choice for baseline modeling or scenarios with limited resources. It offers practical insights into delay factors, though at the expense of predictive accuracy on complex datasets. XGBoost, with its superior accuracy and robustness, is well-suited for applications prioritizing accuracy and capable of supporting computationally intensive models. The choice between these models depends on the specific application needs, balancing interpretability, resource constraints, and accuracy requirements. Both models have valuable roles, with the ideal model determined by project priorities and the demands of the deployment environment.