# Problem: Predicting Airplane Delays

The goals of this notebook are:
- Process and create a dataset from downloaded ZIP files
- Exploratory data analysis (EDA)
- Establish a baseline model and improve it

## Introduction to business scenario
You work for a travel booking website that is working to improve the customer experience for flights that were delayed. The company wants to create a feature to let customers know if the flight will be delayed due to weather when the customers are booking the flight to or from the busiest airports for domestic travel in the US. 

You are tasked with solving part of this problem by using machine learning to identify whether a flight will be delayed due to weather. You have been given access to a dataset of on-time performance of domestic flights operated by large air carriers. You can use this data to train a machine learning model to predict whether a flight to or from the busiest airports will be delayed.

### Dataset
The provided dataset contains scheduled and actual departure and arrival times reported by certified US air carriers that account for at least 1 percent of domestic scheduled passenger revenues. The data was collected by the Office of Airline Information, Bureau of Transportation Statistics (BTS). The dataset contains the date, time, origin, destination, airline, distance, and delay status of flights between 2014 and 2018.
The data are in 60 compressed files, one for each month of the five years from 2014 to 2018, and each file contains a CSV with that month's flight details. The data can be downloaded from this [link](https://ucstaff-my.sharepoint.com/:f:/g/personal/ibrahim_radwan_canberra_edu_au/Er0nVreXmihEmtMz5qC5kVIB81-ugSusExPYdcyQTglfLg?e=bNO312). Please download the data files and place them in a path relative to this notebook. The dataset used in this assignment was compiled by the Office of Airline Information, Bureau of Transportation Statistics (BTS), Airline On-Time Performance Data, available at the following [link](https://www.transtats.bts.gov/Fields.asp?gnoyr_VQ=FGJ).
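
As a rough illustration of the first goal, the sketch below extracts every CSV from a folder of ZIP archives using only the standard library. The helper name `extract_csvs` is hypothetical, and the actual archive names and folder layout are not specified here, so both arguments are placeholders to adjust.

```python
import zipfile
from pathlib import Path


def extract_csvs(zip_dir, out_dir):
    """Extract every .csv member from each ZIP archive in zip_dir into out_dir."""
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)
    extracted = []
    # Sort the archives so the extraction order is reproducible.
    for zip_path in sorted(Path(zip_dir).glob("*.zip")):
        with zipfile.ZipFile(zip_path) as archive:
            for member in archive.namelist():
                # Skip any non-CSV members that may be bundled in the archive.
                if member.lower().endswith(".csv"):
                    archive.extract(member, out_path)
                    extracted.append(out_path / member)
    return extracted
```

If all 60 archives sit in one folder and each holds a single CSV, the returned list would be expected to contain 60 paths.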

# Step 1: Prepare the environment 

Use one of the labs that we practised with on Amazon SageMaker and perform the following steps:
1. Start a lab.
2. Create a notebook instance and name it "oncloudproject".
3. Increase the storage volume size to 25 GB under the additional configuration options.
4. Open Jupyter Lab and upload this notebook into it.
5. Upload the two combined CSV files (combined_csv_v1.csv and combined_csv_v2.csv), which you created in Part A of this project.

In [5]:
# Step 3: Build and evaluate ensemble models

Write code to perform the following steps:
1. Split the data into training, validation, and testing sets (70% - 15% - 15%).
2. Use the XGBoost estimator to build a classification model.
3. Host the model on another instance.
4. Perform a batch transform to evaluate the model on the testing data.
5. Report the performance metrics that you think best test the model's performance.
6. Write down your observations on the difference in performance between the simple and ensemble models.
Note: You are required to perform the above steps on the two combined datasets separately.

SyntaxError: invalid syntax (1920112552.py, line 4)

### 1. Load and Split Data
Load the dataset (`combined_csv_v2.csv`) and split it into training, validation, and testing sets.


In [2]:
import pandas as pd
from sklearn.model_selection import train_test_split

# Load the dataset
df = pd.read_csv("combined_csv_v1.csv")



### 2. Check and Clean NaN Values
Check for NaN values in each column, remove rows with any NaN values, and verify that no NaNs remain. Finally, save the cleaned data to a new CSV file.


In [3]:
# Check for NaN values
nan_counts = df.isna().sum()
print("NaN values in each column:\n", nan_counts[nan_counts > 0])

# Remove rows with any NaN values
df.dropna(inplace=True)

# Verify that there are no NaN values left
print("\nNaN values after cleaning:\n", df.isna().sum().sum())  # Should output 0 if no NaNs are left

# Save the cleaned data to a new CSV file
df.to_csv("combined_csv_v2_cleaned.csv", index=False)
print("Saved cleaned data to 'combined_csv_v2_cleaned.csv'")

NaN values in each column:
 target    22540
dtype: int64

NaN values after cleaning:
 0
Saved cleaned data to 'combined_csv_v2_cleaned.csv'


### 3. Convert and Split Data for Model Training
Convert any `'TRUE'/'FALSE'` strings to Boolean values if necessary. Split the dataset into features and target, then further split into training (70%), validation (15%), and test (15%) sets. Save these splits to CSV files for later use.


In [13]:
# Convert any 'TRUE'/'FALSE' strings to Boolean values if necessary
df.replace({'TRUE': True, 'FALSE': False}, inplace=True)

# Split features and target
X = df.drop(columns=['target'])  # Replace 'target' with the actual target column name if different
y = df['target']

# Split data into training (70%) and temp (30%)
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.3, random_state=42)

# Split temp data into validation (15%) and test (15%)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)

# Save these to CSV for uploading to S3
X_train.to_csv("X_train.csv", index=False)
y_train.to_csv("y_train.csv", index=False)
X_val.to_csv("X_val.csv", index=False)
y_val.to_csv("y_val.csv", index=False)
X_test.to_csv("X_test.csv", index=False)
y_test.to_csv("y_test.csv", index=False)
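
The two-stage split works because holding out 30% of the rows and then halving that hold-out leaves 15% each for validation and testing. The plain-Python sketch below reproduces that arithmetic on row indices. `three_way_split` is a hypothetical helper for illustration only, not part of scikit-learn, and its shuffling will not match `train_test_split` row for row.

```python
import random


def three_way_split(n_rows, holdout=0.3, seed=42):
    """Shuffle row indices, hold out a fraction, then halve the hold-out."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_holdout = round(n_rows * holdout)  # rows kept aside for val + test
    n_val = n_holdout // 2               # first half of the hold-out
    val = indices[:n_val]
    test = indices[n_val:n_holdout]
    train = indices[n_holdout:]
    return train, val, test


train_idx, val_idx, test_idx = three_way_split(1000)
print(len(train_idx), len(val_idx), len(test_idx))  # 700 150 150
```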


### 6. Check for NaN Values in S3 Files
Load each split file into a DataFrame and check for any NaN values. The code below reads the local copies of the files intended for the `flight-delay1` S3 bucket. Print the columns with NaN values if any are found, or confirm that there are none.


In [15]:
import pandas as pd

# List of file names to check for NaN values
files = ["X_train.csv", "y_train.csv", "X_val.csv", "y_val.csv", "X_test.csv", "y_test.csv"]

# Check each file for NaN values
for file in files:
    # Load the file into a DataFrame
    df = pd.read_csv(file)
    
    # Check for NaN values
    nan_counts = df.isna().sum()
    nan_columns = nan_counts[nan_counts > 0]
    
    if not nan_columns.empty:
        print(f"\nNaN values found in {file}:")
        print(nan_columns)
    else:
        print(f"No NaN values in {file}")


No NaN values in X_train.csv
No NaN values in y_train.csv
No NaN values in X_val.csv
No NaN values in y_val.csv
No NaN values in X_test.csv
No NaN values in y_test.csv


### 7. Check Column Names in S3 Files
Load each split file into a DataFrame and print its column names to verify consistency. The code below reads the local copies of the files intended for the `flight-delay1` S3 bucket.


In [16]:
import pandas as pd

# List of file names to check for column names
files = ["X_train.csv", "y_train.csv", "X_val.csv", "y_val.csv", "X_test.csv", "y_test.csv"]

# Print the column names of each file
for file in files:
    # Load the file into a DataFrame
    df = pd.read_csv(file)
    
    # Print column names
    print(f"\nColumn names in {file}:")
    print(df.columns.tolist())



Column names in X_train.csv:
['Distance', 'Quarter_2', 'Quarter_3', 'Quarter_4', 'Month_2', 'Month_3', 'Month_4', 'Month_5', 'Month_6', 'Month_7', 'Month_8', 'Month_9', 'Month_10', 'Month_11', 'Month_12', 'DayofMonth_2', 'DayofMonth_3', 'DayofMonth_4', 'DayofMonth_5', 'DayofMonth_6', 'DayofMonth_7', 'DayofMonth_8', 'DayofMonth_9', 'DayofMonth_10', 'DayofMonth_11', 'DayofMonth_12', 'DayofMonth_13', 'DayofMonth_14', 'DayofMonth_15', 'DayofMonth_16', 'DayofMonth_17', 'DayofMonth_18', 'DayofMonth_19', 'DayofMonth_20', 'DayofMonth_21', 'DayofMonth_22', 'DayofMonth_23', 'DayofMonth_24', 'DayofMonth_25', 'DayofMonth_26', 'DayofMonth_27', 'DayofMonth_28', 'DayofMonth_29', 'DayofMonth_30', 'DayofMonth_31', 'DayOfWeek_2', 'DayOfWeek_3', 'DayOfWeek_4', 'DayOfWeek_5', 'DayOfWeek_6', 'DayOfWeek_7', 'Reporting_Airline_DL', 'Reporting_Airline_OO', 'Reporting_Airline_UA', 'Reporting_Airline_WN', 'Origin_CLT', 'Origin_DEN', 'Origin_DFW', 'Origin_IAH', 'Origin_LAX', 'Origin_ORD', 'Origin_PHX', 'Origin_

In [17]:
import pandas as pd

# Map each split name to its file name
files = {
    "X_train": "X_train.csv",
    "y_train": "y_train.csv",
    "X_val": "X_val.csv",
    "y_val": "y_val.csv",
    "X_test": "X_test.csv",
    "y_test": "y_test.csv"
}

# Summarize the features and targets in each file
for key, file in files.items():
    # Load the file into a DataFrame
    df = pd.read_csv(file)
    
    # Print column names
    print(f"\nColumn names in {file}:")
    print(df.columns.tolist())
    
    # Summary: Count number of features or confirm target
    if 'target' in df.columns:
        print(f"{file} contains the target column with {len(df)} rows.")
    else:
        print(f"{file} has {len(df.columns)} features.")



Column names in X_train.csv:
['Distance', 'Quarter_2', 'Quarter_3', 'Quarter_4', 'Month_2', 'Month_3', 'Month_4', 'Month_5', 'Month_6', 'Month_7', 'Month_8', 'Month_9', 'Month_10', 'Month_11', 'Month_12', 'DayofMonth_2', 'DayofMonth_3', 'DayofMonth_4', 'DayofMonth_5', 'DayofMonth_6', 'DayofMonth_7', 'DayofMonth_8', 'DayofMonth_9', 'DayofMonth_10', 'DayofMonth_11', 'DayofMonth_12', 'DayofMonth_13', 'DayofMonth_14', 'DayofMonth_15', 'DayofMonth_16', 'DayofMonth_17', 'DayofMonth_18', 'DayofMonth_19', 'DayofMonth_20', 'DayofMonth_21', 'DayofMonth_22', 'DayofMonth_23', 'DayofMonth_24', 'DayofMonth_25', 'DayofMonth_26', 'DayofMonth_27', 'DayofMonth_28', 'DayofMonth_29', 'DayofMonth_30', 'DayofMonth_31', 'DayOfWeek_2', 'DayOfWeek_3', 'DayOfWeek_4', 'DayOfWeek_5', 'DayOfWeek_6', 'DayOfWeek_7', 'Reporting_Airline_DL', 'Reporting_Airline_OO', 'Reporting_Airline_UA', 'Reporting_Airline_WN', 'Origin_CLT', 'Origin_DEN', 'Origin_DFW', 'Origin_IAH', 'Origin_LAX', 'Origin_ORD', 'Origin_PHX', 'Origin_

In [18]:
# Report columns that also appear in a previously read X file.
# All columns are expected to be shared, since the splits come from one DataFrame.
x_files = ["X_train.csv", "X_val.csv", "X_test.csv"]
all_columns = set()

for file in x_files:
    df = pd.read_csv(file)
    file_columns = set(df.columns)
    duplicates = all_columns.intersection(file_columns)
    if duplicates:
        print(f"Duplicate columns found in {file}: {duplicates}")
    all_columns.update(file_columns)


Duplicate columns found in X_val.csv: {'Distance', 'DepHourofDay_7', 'DayofMonth_31', 'DayofMonth_12', 'DayofMonth_27', 'Origin_DFW', 'DepHourofDay_22', 'Month_12', 'Month_10', 'DayOfWeek_3', 'DayofMonth_21', 'Origin_LAX', 'Month_7', 'DayofMonth_15', 'DayOfWeek_5', 'DayofMonth_2', 'DayofMonth_3', 'Month_2', 'Reporting_Airline_DL', 'DayofMonth_30', 'DepHourofDay_13', 'DepHourofDay_6', 'Dest_SFO', 'Dest_DFW', 'DayofMonth_14', 'DepHourofDay_1', 'Month_9', 'Quarter_3', 'DayofMonth_10', 'Month_4', 'Month_6', 'DayofMonth_6', 'DayofMonth_19', 'DayofMonth_24', 'DepHourofDay_8', 'DayofMonth_20', 'Dest_LAX', 'DayOfWeek_2', 'Quarter_4', 'Dest_IAH', 'Origin_ORD', 'Dest_CLT', 'DepHourofDay_18', 'DepHourofDay_5', 'DepHourofDay_20', 'DayofMonth_16', 'Origin_CLT', 'DayofMonth_28', 'DayofMonth_26', 'Reporting_Airline_OO', 'Month_5', 'DayOfWeek_4', 'DepHourofDay_17', 'Month_8', 'DayofMonth_11', 'Origin_DEN', 'Dest_ORD', 'DayofMonth_5', 'Reporting_Airline_WN', 'DayofMonth_8', 'DayofMonth_7', 'DepHourofDa
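
Because all three feature files are written from the same DataFrame, every column is expected to appear in each of them, so a shared column is not a problem in itself. A check that may be more informative is whether the headers match exactly and in the same order. The sketch below does that with the standard `csv` module; `read_header` and `headers_match` are hypothetical helper names.

```python
import csv


def read_header(path):
    """Return the first row of a CSV file as a list of column names."""
    with open(path, newline="") as handle:
        return next(csv.reader(handle))


def headers_match(paths):
    """Return True when every file has the same columns in the same order."""
    headers = [read_header(path) for path in paths]
    return all(header == headers[0] for header in headers[1:])
```

Calling `headers_match(["X_train.csv", "X_val.csv", "X_test.csv"])` would be expected to return `True` if the splits are consistent.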

In [19]:
# Print the data types of the columns in each X file
for key, file in files.items():
    if 'X' in key:  # Only for X files
        df = pd.read_csv(file)
        print(f"\nData types in {file}:")
        print(df.dtypes)



Data types in X_train.csv:
Distance           float64
Quarter_2             bool
Quarter_3             bool
Quarter_4             bool
Month_2               bool
                    ...   
DepHourofDay_19       bool
DepHourofDay_20       bool
DepHourofDay_21       bool
DepHourofDay_22       bool
DepHourofDay_23       bool
Length: 93, dtype: object

Data types in X_val.csv:
Distance           float64
Quarter_2             bool
Quarter_3             bool
Quarter_4             bool
Month_2               bool
                    ...   
DepHourofDay_19       bool
DepHourofDay_20       bool
DepHourofDay_21       bool
DepHourofDay_22       bool
DepHourofDay_23       bool
Length: 93, dtype: object

Data types in X_test.csv:
Distance           float64
Quarter_2             bool
Quarter_3             bool
Quarter_4             bool
Month_2               bool
                    ...   
DepHourofDay_19       bool
DepHourofDay_20       bool
DepHourofDay_21       bool
DepHourofDay_22       bool
Dep

In [22]:
# Step 1: Install XGBoost
!conda install -c conda-forge xgboost -y
import xgboost as xgb
print("XGBoost is successfully installed and imported!")

Retrieving notices: ...working... done
Channels:
 - conda-forge
 - nvidia
 - pytorch
Platform: linux-64
Collecting package metadata (repodata.json): done
Solving environment: done


    current version: 24.7.1
    latest version: 24.9.2

Please update conda by running

    $ conda update -n base -c conda-forge conda



## Package Plan ##

  environment location: /home/ec2-user/anaconda3/envs/pytorch_p310

  added / updated specs:
    - xgboost


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    _py-xgboost-mutex-2.0      |            cpu_0           8 KB  conda-forge
    libxgboost-2.1.2           |   cpu_h3a1dfae_0         3.1 MB  conda-forge
    numpy-2.1.2                |  py310hd6e36ab_0         7.5 MB  conda-forge
    py-xgboost-2.1.2           | cpu_pyh15c3653_0         131 KB  conda-forge
    xgboost-2.1.2              | cpu_pyhac85b48_0          15 KB  conda-forge
    ------------

In [23]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report

# Load the data
X_train = pd.read_csv("X_train.csv")
y_train = pd.read_csv("y_train.csv")
X_val = pd.read_csv("X_val.csv")
y_val = pd.read_csv("y_val.csv")
X_test = pd.read_csv("X_test.csv")
y_test = pd.read_csv("y_test.csv")

# Prepare the data
# Assuming the target variable is binary, adjust as necessary
X_train = X_train.astype(float)  # Convert features to float if necessary
X_val = X_val.astype(float)
X_test = X_test.astype(float)

# Train the model
model = xgb.XGBClassifier(use_label_encoder=False, eval_metric='logloss')

# Fit the model
model.fit(X_train, y_train.values.ravel())  # Flatten y_train

# Make predictions
y_val_pred = model.predict(X_val)
y_test_pred = model.predict(X_test)

# Evaluate the model
print("Validation Accuracy:", accuracy_score(y_val, y_val_pred))
print("Validation Classification Report:")
print(classification_report(y_val, y_val_pred))

print("Test Accuracy:", accuracy_score(y_test, y_test_pred))
print("Test Classification Report:")
print(classification_report(y_test, y_test_pred))


Parameters: { "use_label_encoder" } are not used.



Validation Accuracy: 0.7959101321442256
Validation Classification Report:
              precision    recall  f1-score   support

         0.0       0.80      0.99      0.89    194137
         1.0       0.66      0.05      0.08     51201

    accuracy                           0.80    245338
   macro avg       0.73      0.52      0.48    245338
weighted avg       0.77      0.80      0.72    245338

Test Accuracy: 0.7940808432413925
Test Classification Report:
              precision    recall  f1-score   support

         0.0       0.80      0.99      0.88    193645
         1.0       0.67      0.04      0.08     51694

    accuracy                           0.79    245339
   macro avg       0.73      0.52      0.48    245339
weighted avg       0.77      0.79      0.72    245339
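
The accuracy figures are easier to interpret next to a majority-class baseline, that is, a model that always predicts class 0.0 (assumed here to be the no-delay label). The sketch below computes that baseline from the support counts in the two classification reports; `majority_baseline` is a hypothetical helper name.

```python
def majority_baseline(support_by_class):
    """Accuracy obtained by always predicting the most frequent class."""
    total = sum(support_by_class.values())
    return max(support_by_class.values()) / total


# Support counts copied from the classification reports.
val_support = {0.0: 194137, 1.0: 51201}
test_support = {0.0: 193645, 1.0: 51694}

print(f"Validation baseline: {majority_baseline(val_support):.4f}")
print(f"Test baseline:       {majority_baseline(test_support):.4f}")
```

The baselines come out at roughly 0.791 and 0.789, so the reported accuracies of about 0.796 and 0.794 are only marginally ahead of always predicting class 0.0. This is consistent with the low recall for class 1.0 (0.05 and 0.04), and it suggests that recall, precision, or F1 for the positive class may be more informative than accuracy on this data.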



In [27]:
# Step 2: Import necessary libraries
import pandas as pd
import xgboost as xgb
import joblib
import boto3
import sagemaker
from sagemaker import get_execution_role

# Step 3: Load and train the model
X_train = pd.read_csv("X_train.csv")
y_train = pd.read_csv("y_train.csv")

model = xgb.XGBClassifier(eval_metric='logloss')
model.fit(X_train, y_train.values.ravel())  # Flatten y_train
joblib.dump(model, 'xgboost_model.joblib')  # Save the model locally


['xgboost_model.joblib']

In [28]:
# Step 4: Upload the model to the default S3 bucket
sagemaker_session = sagemaker.Session()
bucket_name = sagemaker_session.default_bucket()
model_artifact = f's3://{bucket_name}/xgboost_model.joblib'
boto3.client('s3').upload_file('xgboost_model.joblib', bucket_name, 'xgboost_model.joblib')


In [31]:
# Step 5: Create a SageMaker model and deploy it
role = get_execution_role()  # Get the execution role
xgb_model = sagemaker.model.Model(
    model_data=model_artifact,
    role=role,
    entry_point='inference.py',  # Ensure this script exists
    sagemaker_session=sagemaker_session,
)

In [33]:
# Step 6: Deploy the model to an endpoint
predictor = xgb_model.deploy(
    initial_instance_count=1,  # Number of instances for the endpoint
    instance_type='ml.m5.large',  # Specify the instance type
    endpoint_name='xgboost-endpoint',
)


TypeError: expected string or bytes-like object
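
One plausible cause of this `TypeError`, offered as an assumption rather than a confirmed diagnosis, is that `sagemaker.model.Model` was created without an `image_uri`, so the SDK has no container image string to work with during deployment. In addition, SageMaker generally expects `model_data` to point to a `.tar.gz` archive rather than a bare `.joblib` file. The sketch below covers only the packaging step, using the standard library; `package_model` is a hypothetical helper, and the archive name `model.tar.gz` is a convention assumed here.

```python
import tarfile
from pathlib import Path


def package_model(model_path, archive_path="model.tar.gz"):
    """Pack a saved model file into a gzip-compressed tar archive."""
    model_path = Path(model_path)
    with tarfile.open(str(archive_path), "w:gz") as archive:
        # Store the file at the archive root, without its directory prefix.
        archive.add(str(model_path), arcname=model_path.name)
    return Path(archive_path)
```

For example, `package_model("xgboost_model.joblib")` would produce an archive that could be uploaded to S3 in the same way as the original model file. Deployment would still need an explicit `image_uri` or a framework-specific model class; those details depend on the SDK version and are not shown here.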

In [None]:
# Step 7: Make predictions (sample data)
sample_data = X_train.iloc[:5].values.tolist()  # Adjust this as necessary for your input
predictions = predictor.predict(sample_data)
print(predictions)