# Problem: Predicting Airplane Delays

The goals of this notebook are:
- Process and create a dataset from downloaded ZIP files
- Exploratory data analysis (EDA)
- Establish a baseline model and improve it

## Introduction to business scenario
You work for a travel booking website that is working to improve the customer experience for flights that were delayed. The company wants to create a feature to let customers know if the flight will be delayed due to weather when the customers are booking the flight to or from the busiest airports for domestic travel in the US. 

You are tasked with solving part of this problem by leveraging machine learning to identify whether the flight will be delayed due to weather. You have been given access to the a dataset of on-time performance of domestic flights operated by large air carriers. You can use this data to train a machine learning model to predict if the flight is going to be delayed for the busiest airports.

### Dataset
The provided dataset contains scheduled and actual departure and arrival times reported by certified US air carriers that account for at least 1 percent of domestic scheduled passenger revenues. The data was collected by the Office of Airline Information, Bureau of Transportation Statistics (BTS). The dataset contains date, time, origin, destination, airline, distance, and delay status of flights for flights between 2014 and 2018.
The data are in 60 compressed files, where each file contains a CSV for the flight details in a month for the five years (from 2014 - 2018). The data can be downloaded from this [link](https://ucstaff-my.sharepoint.com/:f:/g/personal/ibrahim_radwan_canberra_edu_au/Er0nVreXmihEmtMz5qC5kVIB81-ugSusExPYdcyQTglfLg?e=bNO312). Please download the data files and place them on a relative path. Dataset(s) used in this assignment were compiled by the Office of Airline Information, Bureau of Transportation Statistics (BTS), Airline On-Time Performance Data, available with the following [link](https://www.transtats.bts.gov/Fields.asp?gnoyr_VQ=FGJ). 

# Step 1: Prepare the environment 

Use one of the labs which we have practised on with the Amazon Sagemakers where you perform the following steps:
1. Start a lab.
2. Create a notebook instance and name it "oncloudproject".
3. Increase the used memory to 25 GB from the additional configurations.
4. Open Jupyter Lab and upload this notebook into it.
5. Upload the two combined CVS files (combined_csv_v1.csv and combined_csv_v2.csv), which you created in Part A of this project.

# Step 3: Build and evaluate ensembe models

Write code to perform the follwoing steps:
1. Split data into training, validation and testing sets (70% - 15% - 15%).
2. Use xgboost estimator to build a classifcation model.
3. Host the model on another instance
4. Perform batch transform to evaluate the model on testing data
5. Report the performance metrics that you see better test the model performance 
6. write down your observation on the difference between the performance of using the simple and ensemble models.
Note: You are required to perform the above steps on the two combined datasets separatey.

# Importing Necessary Libraries for Data Preparation and Model Training
# This section includes libraries for data manipulation, machine learning, and SageMaker integration.


In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
import sagemaker
from sagemaker.inputs import TrainingInput
from sagemaker.estimator import Estimator
from sagemaker.transformer import Transformer
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score
import numpy as np


sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /home/ec2-user/.config/sagemaker/config.yaml


# Initializing SageMaker Session and Loading Data  
# Setting Up Environment and Preparing the Dataset for Training


In [2]:
# Set up SageMaker session and role
sagemaker_session = sagemaker.Session()
role = "Your-SageMaker-Execution-Role-ARN"  # Replace with your SageMaker role ARN

# Step 1: Load and Split the Data
data = pd.read_csv("combined_csv_v1.csv")  # Load the dataset

  data = pd.read_csv("combined_csv_v1.csv")  # Load the dataset


# Displaying Column Names
## This section outputs the names of all columns in the dataset.


In [3]:
print(data.columns)


Index(['target', 'Distance', 'Quarter_2', 'Quarter_3', 'Quarter_4', 'Month_2',
       'Month_3', 'Month_4', 'Month_5', 'Month_6', 'Month_7', 'Month_8',
       'Month_9', 'Month_10', 'Month_11', 'Month_12', 'DayofMonth_2',
       'DayofMonth_3', 'DayofMonth_4', 'DayofMonth_5', 'DayofMonth_6',
       'DayofMonth_7', 'DayofMonth_8', 'DayofMonth_9', 'DayofMonth_10',
       'DayofMonth_11', 'DayofMonth_12', 'DayofMonth_13', 'DayofMonth_14',
       'DayofMonth_15', 'DayofMonth_16', 'DayofMonth_17', 'DayofMonth_18',
       'DayofMonth_19', 'DayofMonth_20', 'DayofMonth_21', 'DayofMonth_22',
       'DayofMonth_23', 'DayofMonth_24', 'DayofMonth_25', 'DayofMonth_26',
       'DayofMonth_27', 'DayofMonth_28', 'DayofMonth_29', 'DayofMonth_30',
       'DayofMonth_31', 'DayOfWeek_2', 'DayOfWeek_3', 'DayOfWeek_4',
       'DayOfWeek_5', 'DayOfWeek_6', 'DayOfWeek_7', 'Reporting_Airline_DL',
       'Reporting_Airline_OO', 'Reporting_Airline_UA', 'Reporting_Airline_WN',
       'Origin_CLT', 'Origin_DEN', '

## Preparing the Dataset for Training and Testing
This section defines the features and target variable, splits the dataset into training, validation, and test sets, and saves them to CSV files.


In [4]:
# Define features and target using the correct column name
X = data.drop(columns=['target'])  # Assuming 'target' is the correct target column
y = data['target']

# Split into training (70%), validation (15%), and test (15%) sets
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.3, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)

# Save training and testing datasets
train_data = pd.concat([y_train, X_train], axis=1)
train_data.to_csv("train.csv", index=False, header=False)
test_data = pd.concat([y_test, X_test], axis=1)
test_data.to_csv("test.csv", index=False, header=False)


## Uploading Training Data to Amazon S3
This section utilizes SageMaker's default bucket to upload the prepared training data for model training.


In [5]:
# Use SageMaker's default bucket
s3_bucket_name = sagemaker_session.default_bucket()

# Upload training data to the default S3 bucket
s3_input_train = sagemaker_session.upload_data("train.csv", bucket=s3_bucket_name, key_prefix="xgboost/train")


## Defining and Training the XGBoost Model
In this section, we configure and initiate the training process for the XGBoost model using SageMaker's Estimator. This involves specifying the model parameters and feeding the training data from S3.


In [6]:
import sagemaker
from sagemaker.inputs import TrainingInput
from sagemaker.estimator import Estimator

# Initialize SageMaker session
sagemaker_session = sagemaker.Session()

# Automatically retrieve the SageMaker execution role
role = sagemaker.get_execution_role()

# Use the default S3 bucket for SageMaker
s3_bucket_name = sagemaker_session.default_bucket()
output_path = f"s3://{s3_bucket_name}/xgboost/output"

# Step 2: Define and Train XGBoost Model
xgb_estimator = Estimator(
    image_uri=sagemaker.image_uris.retrieve("xgboost", sagemaker_session.boto_region_name, "1.2-1"),
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
    output_path=output_path,
    sagemaker_session=sagemaker_session
)

# Set hyperparameters
xgb_estimator.set_hyperparameters(
    objective="binary:logistic",
    num_round=100,
    max_depth=5,
    eta=0.2,
    subsample=0.8,
    colsample_bytree=0.8,
    eval_metric="auc"
)

# Train the model using the training data in S3
xgb_estimator.fit({"train": TrainingInput(s3_input_train, content_type="csv")})


INFO:sagemaker:Creating training-job with name: sagemaker-xgboost-2024-11-01-11-29-21-563


2024-11-01 11:29:22 Starting - Starting the training job...
2024-11-01 11:29:36 Starting - Preparing the instances for training...
2024-11-01 11:30:07 Downloading - Downloading input data...
2024-11-01 11:30:28 Downloading - Downloading the training image...
2024-11-01 11:31:19 Training - Training image download completed. Training in progress..[34m[2024-11-01 11:31:23.808 ip-10-2-126-124.ec2.internal:7 INFO utils.py:27] RULE_JOB_STOP_SIGNAL_FILENAME: None[0m
[34mINFO:sagemaker-containers:Imported framework sagemaker_xgboost_container.training[0m
[34mINFO:sagemaker-containers:Failed to parse hyperparameter eval_metric value auc to Json.[0m
[34mReturning the value itself[0m
[34mINFO:sagemaker-containers:Failed to parse hyperparameter objective value binary:logistic to Json.[0m
[34mReturning the value itself[0m
[34mINFO:sagemaker-containers:No GPUs detected (normal if no gpus installed)[0m
[34mINFO:sagemaker_xgboost_container.training:Running XGBoost Sagemaker in algorithm

## Deploying the XGBoost Model
In this step, we deploy the trained XGBoost model to a SageMaker endpoint, allowing for real-time predictions on incoming data using the specified instance type.


In [7]:
# Step 3: Deploy the Model
predictor = xgb_estimator.deploy(initial_instance_count=1, instance_type="ml.m5.xlarge")


INFO:sagemaker:Creating model with name: sagemaker-xgboost-2024-11-01-11-32-08-606
INFO:sagemaker:Creating endpoint-config with name sagemaker-xgboost-2024-11-01-11-32-08-606
INFO:sagemaker:Creating endpoint with name sagemaker-xgboost-2024-11-01-11-32-08-606


------!

## Loading and Cleaning the Test Data
This step involves loading the original test data from a CSV file, cleaning it by converting boolean values to integers, ensuring all values are numeric, and dropping any rows with NaN values. The cleaned dataset is then saved to a new CSV file for further processing.


In [8]:
import pandas as pd
import sagemaker
from sagemaker.transformer import Transformer

# Step 1: Load and Clean the Data
# Load the original test data file
file_path = 'test.csv'  # Replace with the actual file path if needed
test_data = pd.read_csv(file_path)

# Replace boolean strings 'True'/'False' with 1 and 0 across the entire DataFrame
test_data.replace({True: 1, False: 0, 'True': 1, 'False': 0}, inplace=True)

# Verify that all values are numeric
test_data = test_data.apply(pd.to_numeric, errors='coerce')

# Drop any rows containing NaN values introduced during conversion
test_data_cleaned = test_data.dropna()

# Save the cleaned data to a new CSV file
cleaned_file_path = 'test_cleaned.csv'
test_data_cleaned.to_csv(cleaned_file_path, index=False, header=False)
print(f"Cleaned data saved to {cleaned_file_path}")

Cleaned data saved to test_cleaned.csv


## Ensuring the Correct Number of Features in the Test Data
In this section, we verify that the test dataset contains the expected number of features required by the model. If the test data has more columns than expected, we retain only the relevant features. If it has fewer columns, a warning is printed. Additionally, any remaining boolean strings are converted to integers, and the adjusted test data is saved to a new CSV file.


In [9]:
# Ensure we have the correct number of features by checking against the training feature count
expected_feature_count = 94  # The feature count expected by the model

# Adjust columns if necessary
if test_data.shape[1] > expected_feature_count:
    test_data = test_data.iloc[:, :expected_feature_count]  # Keep only the first 112 columns
elif test_data.shape[1] < expected_feature_count:
    print(f"Warning: Test data has fewer columns ({test_data.shape[1]}) than expected ({expected_feature_count}).")

# Convert 'True'/'False' strings to 1/0 if they still exist
test_data.replace({'True': 1, 'False': 0}, inplace=True)

# Save the adjusted test data
cleaned_file_path = 'test_cleaned.csv'
test_data.to_csv(cleaned_file_path, index=False, header=False)
print(f"Adjusted test data saved to {cleaned_file_path}")

Adjusted test data saved to test_cleaned.csv


## Displaying Feature Names for Training and Test Datasets
In this section, we print the names of the features used in the training dataset as well as those present in the test dataset. This helps in verifying the consistency of features across both datasets, ensuring that the model can be applied correctly to the test data.


In [12]:
print("Training features:", X_train.columns)
print("Test features:", test_data.columns)


Training features: Index(['Distance', 'Quarter_2', 'Quarter_3', 'Quarter_4', 'Month_2', 'Month_3',
       'Month_4', 'Month_5', 'Month_6', 'Month_7', 'Month_8', 'Month_9',
       'Month_10', 'Month_11', 'Month_12', 'DayofMonth_2', 'DayofMonth_3',
       'DayofMonth_4', 'DayofMonth_5', 'DayofMonth_6', 'DayofMonth_7',
       'DayofMonth_8', 'DayofMonth_9', 'DayofMonth_10', 'DayofMonth_11',
       'DayofMonth_12', 'DayofMonth_13', 'DayofMonth_14', 'DayofMonth_15',
       'DayofMonth_16', 'DayofMonth_17', 'DayofMonth_18', 'DayofMonth_19',
       'DayofMonth_20', 'DayofMonth_21', 'DayofMonth_22', 'DayofMonth_23',
       'DayofMonth_24', 'DayofMonth_25', 'DayofMonth_26', 'DayofMonth_27',
       'DayofMonth_28', 'DayofMonth_29', 'DayofMonth_30', 'DayofMonth_31',
       'DayOfWeek_2', 'DayOfWeek_3', 'DayOfWeek_4', 'DayOfWeek_5',
       'DayOfWeek_6', 'DayOfWeek_7', 'Reporting_Airline_DL',
       'Reporting_Airline_OO', 'Reporting_Airline_UA', 'Reporting_Airline_WN',
       'Origin_CLT', 'Origi

## Preparing and Uploading Test Data for Batch Transformation
In this section, we prepare the cleaned test data for batch transformation. We check for any missing features compared to the training dataset and ensure that categorical variables are properly one-hot encoded. After aligning the test data with the expected feature set, we save the adjusted dataset and upload it to the default S3 bucket used by SageMaker. Finally, we initiate a batch transform job to generate predictions based on the trained model.


In [18]:
import pandas as pd
import numpy as np
import sagemaker

# Load the cleaned test data
test_data = pd.read_csv("test_cleaned.csv")

# Print the original columns for debugging
print("Columns in the test dataset:", test_data.columns.tolist())

# Define the training feature names again for reference
training_feature_names = [
    'Distance', 'Quarter_2', 'Quarter_3', 'Quarter_4', 'Month_2', 'Month_3',
    'Month_4', 'Month_5', 'Month_6', 'Month_7', 'Month_8', 'Month_9',
    'Month_10', 'Month_11', 'Month_12', 'DayofMonth_2', 'DayofMonth_3',
    'DayofMonth_4', 'DayofMonth_5', 'DayofMonth_6', 'DayofMonth_7',
    'DayofMonth_8', 'DayofMonth_9', 'DayofMonth_10', 'DayofMonth_11',
    'DayofMonth_12', 'DayofMonth_13', 'DayofMonth_14', 'DayofMonth_15',
    'DayofMonth_16', 'DayofMonth_17', 'DayofMonth_18', 'DayofMonth_19',
    'DayofMonth_20', 'DayofMonth_21', 'DayofMonth_22', 'DayofMonth_23',
    'DayofMonth_24', 'DayofMonth_25', 'DayofMonth_26', 'DayofMonth_27',
    'DayofMonth_28', 'DayofMonth_29', 'DayofMonth_30', 'DayofMonth_31',
    'DayOfWeek_2', 'DayOfWeek_3', 'DayOfWeek_4', 'DayOfWeek_5',
    'DayOfWeek_6', 'DayOfWeek_7', 'Reporting_Airline_DL',
    'Reporting_Airline_OO', 'Reporting_Airline_UA', 'Reporting_Airline_WN',
    'Origin_CLT', 'Origin_DEN', 'Origin_DFW', 'Origin_IAH', 'Origin_LAX',
    'Origin_ORD', 'Origin_PHX', 'Origin_SFO', 'Dest_CLT', 'Dest_DEN',
    'Dest_DFW', 'Dest_IAH', 'Dest_LAX', 'Dest_ORD', 'Dest_PHX', 'Dest_SFO',
    'DepHourofDay_1', 'DepHourofDay_2', 'DepHourofDay_4', 'DepHourofDay_5',
    'DepHourofDay_6', 'DepHourofDay_7', 'DepHourofDay_8', 'DepHourofDay_9',
    'DepHourofDay_10', 'DepHourofDay_11', 'DepHourofDay_12',
    'DepHourofDay_13', 'DepHourofDay_14', 'DepHourofDay_15',
    'DepHourofDay_16', 'DepHourofDay_17', 'DepHourofDay_18',
    'DepHourofDay_19', 'DepHourofDay_20', 'DepHourofDay_21',
    'DepHourofDay_22', 'DepHourofDay_23',
]

# Check for missing features
missing_features = [feature for feature in training_feature_names if feature not in test_data.columns]
if missing_features:
    print(f"Missing features in the test data: {missing_features}")

# Identify categorical columns that are supposed to be one-hot encoded
# Adjust this list based on your actual categorical features used in training
categorical_columns = [
    'Quarter_2', 'Quarter_3', 'Quarter_4', 'Month_2', 'Month_3', 'Month_4',
    'Month_5', 'Month_6', 'Month_7', 'Month_8', 'Month_9', 'Month_10',
    'Month_11', 'Month_12', 'DayOfWeek_2', 'DayOfWeek_3', 'DayOfWeek_4',
    'DayOfWeek_5', 'DayOfWeek_6', 'DayOfWeek_7',
    'Reporting_Airline_DL', 'Reporting_Airline_OO', 'Reporting_Airline_UA',
    'Reporting_Airline_WN', 'Origin_CLT', 'Origin_DEN', 'Origin_DFW',
    'Origin_IAH', 'Origin_LAX', 'Origin_ORD', 'Origin_PHX', 'Origin_SFO',
    'Dest_CLT', 'Dest_DEN', 'Dest_DFW', 'Dest_IAH', 'Dest_LAX',
    'Dest_ORD', 'Dest_PHX', 'Dest_SFO'
]

# Remove categorical columns from the test set if they are not present
categorical_columns = [col for col in categorical_columns if col in test_data.columns]

# One-hot encode the categorical columns in the test dataset
test_data = pd.get_dummies(test_data, columns=categorical_columns, drop_first=True)

# Align the test data with the training feature names
test_data_filtered = test_data.reindex(columns=training_feature_names, fill_value=0)

# Verify that the test data now has the correct columns
print("Filtered Test Features:", test_data_filtered.columns)

# Save the filtered test data for transformation
updated_test_file_path = 'test_cleaned_updated.csv'
test_data_filtered.to_csv(updated_test_file_path, index=False, header=False)

# Upload cleaned and filtered test data to S3
s3_input_test = sagemaker_session.upload_data(updated_test_file_path, bucket=s3_bucket_name, key_prefix="xgboost/test")

# Proceed with batch transform using the updated test data
transformer = Transformer(
    model_name=predictor.endpoint_name,
    instance_count=1,
    instance_type="ml.m5.xlarge",
    output_path=f"s3://{s3_bucket_name}/xgboost/output"
)

# Perform batch transform
transformer.transform(data=s3_input_test, content_type="text/csv", split_type="Line")
transformer.wait()

print(f"Batch transform output saved to: s3://{s3_bucket_name}/xgboost/output")


Columns in the test dataset: ['0.0', '1440.0', '0', '0.1', '1', '0.2', '0.3', '0.4', '0.5', '0.6', '0.7', '0.8', '0.9', '1.1', '0.10', '0.11', '0.12', '0.13', '0.14', '0.15', '0.16', '0.17', '0.18', '0.19', '0.20', '0.21', '1.2', '0.22', '0.23', '0.24', '0.25', '0.26', '0.27', '0.28', '0.29', '0.30', '0.31', '0.32', '0.33', '0.34', '0.35', '0.36', '0.37', '0.38', '0.39', '0.40', '0.41', '0.42', '0.43', '0.44', '0.45', '1.3', '0.46', '0.47', '0.48', '0.49', '0.50', '0.51', '0.52', '0.53', '0.54', '1.4', '0.55', '0.56', '0.57', '0.58', '0.59', '0.60', '0.61', '0.62', '1.5', '0.63', '0.64', '0.65', '0.66', '0.67', '0.68', '0.69', '0.70', '0.71', '0.72', '1.6', '0.73', '0.74', '0.75', '0.76', '0.77', '0.78', '0.79', '0.80', '0.81', '0.82', '0.83', '0.84']
Missing features in the test data: ['Distance', 'Quarter_2', 'Quarter_3', 'Quarter_4', 'Month_2', 'Month_3', 'Month_4', 'Month_5', 'Month_6', 'Month_7', 'Month_8', 'Month_9', 'Month_10', 'Month_11', 'Month_12', 'DayofMonth_2', 'DayofMonth

INFO:sagemaker:Creating transform job with name: sagemaker-xgboost-2024-11-01-11-56-13-631


................................[34m[2024-11-01:12:01:33:INFO] No GPUs detected (normal if no gpus installed)[0m
[34m[2024-11-01:12:01:33:INFO] No GPUs detected (normal if no gpus installed)[0m
[34m[2024-11-01:12:01:33:INFO] nginx config: [0m
[34mworker_processes auto;[0m
[34mdaemon off;[0m
[35m[2024-11-01:12:01:33:INFO] No GPUs detected (normal if no gpus installed)[0m
[35m[2024-11-01:12:01:33:INFO] No GPUs detected (normal if no gpus installed)[0m
[35m[2024-11-01:12:01:33:INFO] nginx config: [0m
[35mworker_processes auto;[0m
[35mdaemon off;[0m
[34mpid /tmp/nginx.pid;[0m
[34merror_log  /dev/stderr;[0m
[34mworker_rlimit_nofile 4096;[0m
[34mevents {
  worker_connections 2048;[0m
[34m}[0m
[34mhttp {
  include /etc/nginx/mime.types;
  default_type application/octet-stream;
  access_log /dev/stdout combined;
  upstream gunicorn {
    server unix:/tmp/gunicorn.sock;
  }
  server {
    listen 8080 deferred;
    client_max_body_size 0;
    keepalive_timeout 3;
 

## Cleaning Up Previous Predictions
This code snippet checks for the existence of a file or directory named `predictions.csv` in the current working directory. If it finds a directory with that name, it lists its contents and then removes the entire directory along with all its files. If `predictions.csv` is a file instead, it deletes the file directly. This cleanup step ensures that any old predictions do not interfere with the new predictions being downloaded from the S3 bucket.


In [19]:
import os
import shutil

# Check if 'predictions.csv' exists and is a directory
if os.path.exists("predictions.csv"):
    if os.path.isdir("predictions.csv"):
        # List contents of the directory for inspection
        print("Contents of 'predictions.csv' directory:", os.listdir("predictions.csv"))
        
        # Remove the directory and all its contents
        shutil.rmtree("predictions.csv")
        print("'predictions.csv' directory and its contents have been removed.")
    else:
        # If it's a file, remove it directly
        os.remove("predictions.csv")
        print("'predictions.csv' file removed.")

# Proceed with the previous code to download the predictions file


## Evaluating Model Predictions
This code snippet is responsible for evaluating the model predictions after they have been processed by the SageMaker batch transform job. It begins by initializing the Boto3 S3 client and defining the S3 bucket and output prefix to retrieve the generated prediction files. It checks for any existing `predictions.csv` file or directory and cleans up to avoid conflicts.

The predictions are then downloaded from S3, and the code verifies that the file was created successfully. It loads the predictions and compares their shape against the ground truth values (`y_test`). In case of inconsistencies in lengths, it adjusts the predictions accordingly. 

Additionally, it ensures that `y_test` does not contain any NaN values, which could lead to errors during evaluation. The code then calculates various performance metrics, such as accuracy, precision, recall, F1 score, and AUC, and prints them to the console for review. This provides a comprehensive overview of the model's performance on the test dataset.


In [21]:
import boto3
import numpy as np
import os
import shutil
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score

# Initialize Boto3 S3 client
s3_client = boto3.client('s3')

# Define S3 bucket and output prefix
s3_bucket_name = sagemaker_session.default_bucket()
output_prefix = "xgboost/output"

# List files in the output directory to find the batch transform output file (.out)
response = s3_client.list_objects_v2(Bucket=s3_bucket_name, Prefix=output_prefix)
output_files = [content['Key'] for content in response.get('Contents', [])]
print("Files in output path:", output_files)

# Locate the file ending in '.out' (the predictions file)
predictions_file_key = next((file for file in output_files if file.endswith('.out')), None)

# Clean up any pre-existing 'predictions.csv' directory or file to avoid conflicts
if os.path.exists("predictions.csv"):
    if os.path.isdir("predictions.csv"):
        shutil.rmtree("predictions.csv")  # Recursively remove the directory
    else:
        os.remove("predictions.csv")  # Remove the file if it exists

if predictions_file_key:
    # Use a unique temporary name for download
    temp_file = "temp_predictions.csv"
    s3_client.download_file(s3_bucket_name, predictions_file_key, temp_file)
    
    # Rename the temporary file to 'predictions.csv'
    os.rename(temp_file, "predictions.csv")

    # Verify that 'predictions.csv' was created successfully
    if os.path.isfile("predictions.csv"):
        try:
            # Load predictions, skip empty lines
            y_pred = np.loadtxt("predictions.csv", delimiter=",")
            y_pred = y_pred.flatten()

            # Check the shape of y_pred and y_test
            print(f"y_pred shape: {y_pred.shape}, y_test shape: {y_test.shape}")

            # Ensure lengths match between y_pred and y_test
            if len(y_pred) != len(y_test):
                print(f"Inconsistent lengths detected: y_pred ({len(y_pred)}) vs y_test ({len(y_test)})")
                if len(y_pred) > len(y_test):
                    y_pred = y_pred[:len(y_test)]
                else:
                    y_pred = np.pad(y_pred, (0, len(y_test) - len(y_pred)), 'constant', constant_values=0)

            # Ensure y_test does not contain NaN values
            if np.any(np.isnan(y_test)):
                print("y_test contains NaN values. Cleaning up...")
                mask = ~np.isnan(y_test)
                y_test = y_test[mask]
                y_pred = y_pred[mask]

            # Print out the prediction values to understand the distribution
            print("Predictions distribution (first 10 values):", y_pred[:10])
            
            # Dynamically find a threshold if needed, but start with a default of 0.5
            threshold = 0.5
            y_pred_binary = np.where(y_pred >= threshold, 1, 0)

            # Check distribution of binary predictions
            print("Binary Predictions distribution:", np.bincount(y_pred_binary))

            # Define a function to calculate and print evaluation metrics
            def print_metrics(y_true, y_pred_binary):
                print(f"Accuracy: {accuracy_score(y_true, y_pred_binary):.2f}")
                print(f"Precision: {precision_score(y_true, y_pred_binary):.2f}")
                print(f"Recall: {recall_score(y_true, y_pred_binary):.2f}")
                print(f"F1 Score: {f1_score(y_true, y_pred_binary):.2f}")
                print(f"AUC: {roc_auc_score(y_true, y_pred):.2f}")  # Use continuous y_pred for AUC

            # Display performance metrics for the model on the test dataset
            print("Performance metrics for Dataset V2:")
            print_metrics(y_test, y_pred_binary)
        
        except ValueError as e:
            print(f"Error loading predictions from 'predictions.csv': {e}")
    else:
        print("Download failed or predictions.csv does not exist.")
else:
    print("Prediction file (.out) not found. Please check the batch transform job for errors.")


Files in output path: ['xgboost/output/sagemaker-xgboost-2024-11-01-11-29-21-563/debug-output/claim.smd', 'xgboost/output/sagemaker-xgboost-2024-11-01-11-29-21-563/debug-output/collections/000000000/worker_0_collections.json', 'xgboost/output/sagemaker-xgboost-2024-11-01-11-29-21-563/debug-output/events/000000000000/000000000000_worker_0.tfevents', 'xgboost/output/sagemaker-xgboost-2024-11-01-11-29-21-563/debug-output/events/000000000010/000000000010_worker_0.tfevents', 'xgboost/output/sagemaker-xgboost-2024-11-01-11-29-21-563/debug-output/events/000000000020/000000000020_worker_0.tfevents', 'xgboost/output/sagemaker-xgboost-2024-11-01-11-29-21-563/debug-output/events/000000000030/000000000030_worker_0.tfevents', 'xgboost/output/sagemaker-xgboost-2024-11-01-11-29-21-563/debug-output/events/000000000040/000000000040_worker_0.tfevents', 'xgboost/output/sagemaker-xgboost-2024-11-01-11-29-21-563/debug-output/events/000000000050/000000000050_worker_0.tfevents', 'xgboost/output/sagemaker-xgb

  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))


The performance metrics for Dataset V2 indicate a concerning situation regarding the predictive model's effectiveness. With an accuracy of 0.75, the model appears to correctly classify a significant proportion of instances; however, this figure alone does not tell the full story. The precision is reported as 0.00, indicating that when the model predicts a positive outcome, it fails to do so correctly. This suggests that all positive predictions are false positives, which can lead to a complete breakdown in trust for any positive classifications made by the model.

Moreover, the recall is also 0.00, implying that the model is unable to identify any actual positive cases present in the dataset. This is a critical flaw, especially in contexts where identifying positives is essential, such as in medical diagnoses or fraud detection. Consequently, the F1 score, which combines precision and recall into a single metric, is also 0.00, further highlighting the model's inadequacy.

The Area Under the Receiver Operating Characteristic Curve (AUC) is 0.50, suggesting that the model's ability to distinguish between classes is no better than random chance. Collectively, these metrics point to a need for significant improvement in the model's training process or feature engineering to enhance its predictive capabilities.

In [48]:
# Step 6: Clean up the endpoint
predictor.delete_endpoint()

INFO:sagemaker:Deleting endpoint configuration with name: sagemaker-xgboost-2024-11-01-09-36-50-451
INFO:sagemaker:Deleting endpoint with name: sagemaker-xgboost-2024-11-01-09-36-50-451


# Final comment 
In comparing the Linear Learner and XGBoost models for predicting flight delays, we see distinct advantages and trade-offs that influence their suitability based on factors like computational resources, accuracy, interpretability, and scalability.

### Resource Allocation and Computational Complexity
The Linear Learner model is a lightweight option, requiring minimal computational resources and thus allowing for faster deployment and iteration. This simplicity makes it ideal for environments with limited computational power or time constraints. However, this model’s simplicity may limit its ability to accurately capture the complexity of flight delay data, which includes intricate factors like seasonal weather patterns and high variability across airports.

XGBoost, on the other hand, is a more advanced ensemble method known for its accuracy, particularly with complex datasets. This model is computationally demanding due to its iterative, error-correcting process, which requires more memory and time. For projects that demand high accuracy, XGBoost’s ability to model complex patterns is invaluable, but it may not be ideal in time-sensitive or resource-constrained scenarios.

### Accuracy and Model Performance
The Linear Learner’s simplicity generally limits its capacity to model non-linear relationships and interactions within the data, leading to the risk of underfitting. This could result in lower prediction accuracy, as the model might miss important nuances in the data. XGBoost, however, excels in capturing these intricate patterns through its boosting technique, which corrects errors iteratively and allows it to handle complex feature interactions more effectively. This often leads to higher predictive accuracy but requires careful tuning to avoid overfitting, where the model may perform well on training data but struggle with new, unseen data.

### Interpretability and Practical Insights
Interpretability is crucial for understanding model predictions, especially in high-stakes areas like air travel. Linear Learner models are more interpretable since they provide clear, direct relationships between features and predictions. This transparency allows stakeholders to understand the primary drivers of delays and make informed decisions based on insights into factors such as weather or seasonal changes.

XGBoost, although typically more accurate, is less interpretable due to its ensemble structure, which uses multiple decision trees. Understanding predictions from XGBoost often requires additional interpretability tools, like SHAP values, which can make analysis more complex. While this added complexity may be justified in exchange for the model’s superior performance, it can pose challenges when straightforward explanations are needed for stakeholders.

### Scalability and Deployment Considerations
In deployment, Linear Learner’s simplicity supports scalability, especially in cases where rapid retraining or frequent deployment is required. This model’s minimal resource requirements also make it resilient in real-time applications. However, its inability to adapt to complex, evolving patterns over time can limit its long-term utility.

Conversely, XGBoost’s robustness to complex data makes it more adaptable to changes over time, making it better suited for long-term production in dynamic environments. While the model’s deployment may require a more powerful infrastructure, it’s more likely to maintain accuracy across shifts in data patterns, making it a strong candidate for applications prioritizing long-term predictive reliability.

### Conclusion
In summary, the Linear Learner offers simplicity, speed, and transparency, making it suitable for baseline modeling or when resources are constrained. However, its limited accuracy may hinder its effectiveness in complex applications. XGBoost provides enhanced accuracy and flexibility, valuable in environments where high performance is critical, though it requires more resources and interpretability tools. The final choice between these models depends on balancing resource availability, accuracy needs, interpretability, and the specific demands of the application.