# Problem: Predicting Airplane Delays

The goals of this notebook are:
- Process and create a dataset from downloaded ZIP files
- Exploratory data analysis (EDA)
- Establish a baseline model and improve it

## Introduction to business scenario
You work for a travel booking website that is working to improve the customer experience for flights that were delayed. The company wants to create a feature to let customers know if the flight will be delayed due to weather when the customers are booking the flight to or from the busiest airports for domestic travel in the US.

You are tasked with solving part of this problem by leveraging machine learning to identify whether the flight will be delayed due to weather. You have been given access to the a dataset of on-time performance of domestic flights operated by large air carriers. You can use this data to train a machine learning model to predict if the flight is going to be delayed for the busiest airports.

### Dataset
The provided dataset contains scheduled and actual departure and arrival times reported by certified US air carriers that account for at least 1 percent of domestic scheduled passenger revenues. The data was collected by the Office of Airline Information, Bureau of Transportation Statistics (BTS). The dataset contains date, time, origin, destination, airline, distance, and delay status of flights for flights between 2014 and 2018.
The data are in 60 compressed files, where each file contains a CSV for the flight details in a month for the five years (from 2014 - 2018). The data can be downloaded from this link: [https://ucstaff-my.sharepoint.com/:f:/g/personal/ibrahim_radwan_canberra_edu_au/Er0nVreXmihEmtMz5qC5kVIB81-ugSusExPYdcyQTglfLg?e=bNO312]. Please download the data files and place them on a relative path. Dataset(s) used in this assignment were compiled by the Office of Airline Information, Bureau of Transportation Statistics (BTS), Airline On-Time Performance Data, available with the following link: [https://www.transtats.bts.gov/Fields.asp?gnoyr_VQ=FGJ].

# Step 1: Prepare the environment

Use one of the labs which we have practised on with the Amazon Sagemakers where you perform the following steps:
1. Start a lab.
2. Create a notebook instance and name it "oncloudproject".
3. Increase the used memory to 25 GB from the additional configurations.
4. Open Jupyter Lab and upload this notebook into it.
5. Upload the two combined CVS files (combined_csv_v1.csv and combined_csv_v2.csv), which you created in Part A of this project.

In [None]:
import gdown

#https://drive.google.com/file/d/1Qyav9ORUYqGXN-S7nx8zrYVxAtYR97VE/view?usp=sharing
#https://drive.google.com/file/d/1b5dA5u_VnZP1ZjQxmhigbuIOfmqjx70x/view?usp=drive_link

# Define a dictionary of file names and their corresponding file IDs
file_ids = {
    'combined_csv_v1.csv': '1Qyav9ORUYqGXN-S7nx8zrYVxAtYR97VE',
    'combined_csv_v2.csv': '1b5dA5u_VnZP1ZjQxmhigbuIOfmqjx70x',
}

# Define the destination folder where you want to save the files
destination_folder = '/content'

# Download the files
for file_name, file_id in file_ids.items():
    url = f'https://drive.google.com/uc?id={file_id}'
    output = f'{destination_folder}/{file_name}'
    gdown.download(url, output, quiet=False)

print('Files downloaded successfully.')


Downloading...
From: https://drive.google.com/uc?id=1Qyav9ORUYqGXN-S7nx8zrYVxAtYR97VE
To: /content/combined_csv_v1.csv
100%|██████████| 246M/246M [00:01<00:00, 182MB/s]
Downloading...
From: https://drive.google.com/uc?id=1b5dA5u_VnZP1ZjQxmhigbuIOfmqjx70x
To: /content/combined_csv_v2.csv
100%|██████████| 317M/317M [00:01<00:00, 190MB/s]


Files downloaded successfully.


# Step 2: Build and evaluate simple models

Write code to perform the follwoing steps:
1. Split data into training, validation and testing sets (70% - 15% - 15%).
2. Use linear learner estimator to build a classifcation model.
3. Host the model on another instance
4. Perform batch transform to evaluate the model on testing data
5. Report the performance metrics that you see better test the model performance

Note: You are required to perform the above steps on the two combined datasets separatey and to comments on the difference.

In [None]:
# Necessary imports
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report


In [None]:
import pandas as pd

def read_optimized_csv(filename, target_column='target'):
    # Read a small sample to infer data types
    small_sample = pd.read_csv(filename, nrows=1000)

    # Identify columns to be converted to bool (binary columns)
    bool_columns = [col for col in small_sample.columns if col not in [target_column, 'Distance'] and small_sample[col].nunique() == 2]

    # Create a dictionary with specified data types
    column_types = {col: 'bool' for col in bool_columns}
    column_types['Distance'] = 'float32'
    column_types[target_column] = 'float32'

    # Read the full CSV with optimized data types
    df = pd.read_csv(filename, dtype=column_types)

    return df

combined_csv_v2= read_optimized_csv("combined_csv_v2.csv", target_column='target')
combined_csv_v1 =read_optimized_csv("combined_csv_v1.csv", target_column='target')

In [None]:
# Write the final comments here and turn the cell type into markdown
combined_csv_v1.head()

Unnamed: 0,target,Distance,Quarter_2,Quarter_3,Quarter_4,Month_2,Month_3,Month_4,Month_5,Month_6,...,Origin_PHX,Origin_SFO,Dest_CLT,Dest_DEN,Dest_DFW,Dest_IAH,Dest_LAX,Dest_ORD,Dest_PHX,Dest_SFO
0,1.0,606.0,0,1,0,0,0,0,0,0,...,False,False,False,False,False,False,False,True,False,False
1,1.0,606.0,0,1,0,0,0,0,0,0,...,False,False,False,False,False,False,False,False,False,False
2,0.0,1199.0,0,1,0,0,0,0,0,0,...,False,False,False,True,False,False,False,False,False,False
3,0.0,1199.0,0,1,0,0,0,0,0,0,...,False,False,False,False,False,False,False,False,False,False
4,1.0,731.0,0,1,0,0,0,0,0,0,...,False,False,False,False,True,False,False,False,False,False


In [None]:
# Write the final comments here and turn the cell type into markdown
combined_csv_v2.head()

Unnamed: 0,target,Distance,DepHourofDay,AWND_O,PRCP_O,TAVG_O,AWND_D,PRCP_D,TAVG_D,SNOW_O,...,Origin_SFO,Dest_CLT,Dest_DEN,Dest_DFW,Dest_IAH,Dest_LAX,Dest_ORD,Dest_PHX,Dest_SFO,is_holiday_1
0,1.0,606.0,72,34,30,269.0,32,0,229.0,0.0,...,False,False,False,False,False,False,True,False,False,0
1,1.0,606.0,90,32,0,229.0,34,30,269.0,0.0,...,False,False,False,False,False,False,False,False,False,0
2,0.0,1199.0,15,34,30,269.0,38,0,269.0,0.0,...,False,False,True,False,False,False,False,False,False,0
3,0.0,1199.0,17,38,0,269.0,34,30,269.0,0.0,...,False,False,False,False,False,False,False,False,False,0
4,1.0,731.0,17,34,30,269.0,62,0,334.0,0.0,...,False,False,False,True,False,False,False,False,False,0


In [None]:
def split_data(df):
    # Splitting the data: 85% for train+validation and 15% for test
    train_val, test = train_test_split(df, test_size=0.15, random_state=42)
    # Splitting the 85% into 82% train and 18% validation (which gives us a 70-15-15 split overall)
    train, val = train_test_split(train_val, test_size=0.18, random_state=42)
    return train, val, test

train_v1, val_v1, test_v1 = split_data(combined_csv_v1)
train_v2, val_v2, test_v2 = split_data(combined_csv_v2)

In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import log_loss
from sklearn.linear_model import SGDClassifier
from sklearn.utils import gen_batches, shuffle

def train_model_with_batches_improved(train_df, val_df, target_column='target', batch_size=10000, epochs=10, learning_rate=0.01, alpha=0.0001):
    X_train = train_df.drop(columns=[target_column]).values  # Convert to numpy array
    y_train = train_df[target_column].values
    X_val = val_df.drop(columns=[target_column]).values  # Convert to numpy array
    y_val = val_df[target_column].values

    # Scaling the data
    scaler = StandardScaler()
    X_train = scaler.fit_transform(X_train)
    X_val = scaler.transform(X_val)

    # Instantiate the SGD classifier with log loss
    model = SGDClassifier(loss='log_loss', learning_rate='constant', eta0=learning_rate, alpha=alpha, max_iter=1, tol=None)

    # Ensure the data is in the right format
    y_train = y_train.astype(int)
    y_val = y_val.astype(int)

    # Training in batches with manual epochs
    n_samples = X_train.shape[0]
    for epoch in range(epochs):
        X_train, y_train = shuffle(X_train, y_train)  # Shuffle the data at the beginning of each epoch
        for batch in gen_batches(n_samples, batch_size=batch_size):
            model.partial_fit(X_train[batch], y_train[batch], classes=[0, 1])

        # Evaluate on validation set after each epoch
        val_predictions = model.predict_proba(X_val)[:, 1]
        val_loss = log_loss(y_val, val_predictions)
        print(f"Epoch {epoch + 1}/{epochs} - Validation Log Loss: {val_loss:.4f}")

    return model

# To use this function:
print("Model 1")
model_v1_batch = train_model_with_batches_improved(train_v1, val_v1, target_column='target')
print("Model 2")
model_v2_batch = train_model_with_batches_improved(train_v2, val_v2, target_column='target')


Model 1
Epoch 1/10 - Validation Log Loss: 0.5304
Epoch 2/10 - Validation Log Loss: 0.5397
Epoch 3/10 - Validation Log Loss: 0.5270
Epoch 4/10 - Validation Log Loss: 0.5407
Epoch 5/10 - Validation Log Loss: 0.5311
Epoch 6/10 - Validation Log Loss: 0.5339
Epoch 7/10 - Validation Log Loss: 0.5476
Epoch 8/10 - Validation Log Loss: 0.5249
Epoch 9/10 - Validation Log Loss: 0.5445
Epoch 10/10 - Validation Log Loss: 0.5304
Model 2
Epoch 1/10 - Validation Log Loss: 0.5313
Epoch 2/10 - Validation Log Loss: 0.5208
Epoch 3/10 - Validation Log Loss: 0.5157
Epoch 4/10 - Validation Log Loss: 0.5269
Epoch 5/10 - Validation Log Loss: 0.5170
Epoch 6/10 - Validation Log Loss: 0.5206
Epoch 7/10 - Validation Log Loss: 0.5336
Epoch 8/10 - Validation Log Loss: 0.5234
Epoch 9/10 - Validation Log Loss: 0.5236
Epoch 10/10 - Validation Log Loss: 0.5159


# 2. Use a linear learner estimator to build a classification model
```
def train_model(train_df, target_column='target'):
    X_train = train_df.drop(columns=[target_column])
    y_train = train_df[target_column]
    model = LogisticRegression(max_iter=10000)  # Increasing max_iter for convergence
    model.fit(X_train, y_train)
    return model

model_v1 = train_model(train_v1,target_column='is_delay')
model_v2 = train_model(train_v2,target_column='target')
```

In [None]:
def evaluate_model(model, test_df, target_column='target'):
    X_test = test_df.drop(columns=[target_column]).values  # Convert to numpy array
    predictions = model.predict(X_test)
    return predictions

predictions_v1 = evaluate_model(model_v1_batch, test_v1, target_column='target')
predictions_v2 = evaluate_model(model_v2_batch, test_v2, target_column='target')


In [None]:
from sklearn.metrics import classification_report

# Report Performance Metrics for combined_csv_v1
print("Metrics for combined_csv_v1:")
print(classification_report(test_v1['target'], predictions_v1, zero_division=1))  # Setting zero_division to 1

# Report Performance Metrics for combined_csv_v2
print("Metrics for combined_csv_v2:")
print(classification_report(test_v2['target'], predictions_v2, zero_division=1))  # Setting zero_division to 1


Metrics for combined_csv_v1:
              precision    recall  f1-score   support

         0.0       0.79      1.00      0.88    193677
         1.0       1.00      0.00      0.00     51662

    accuracy                           0.79    245339
   macro avg       0.89      0.50      0.44    245339
weighted avg       0.83      0.79      0.70    245339

Metrics for combined_csv_v2:
              precision    recall  f1-score   support

         0.0       0.80      0.96      0.87    193677
         1.0       0.42      0.10      0.16     51662

    accuracy                           0.78    245339
   macro avg       0.61      0.53      0.52    245339
weighted avg       0.72      0.78      0.72    245339



# Step 3: Build and evaluate ensembe models

Write code to perform the follwoing steps:
1. Split data into training, validation and testing sets (70% - 15% - 15%).
2. Use xgboost estimator to build a classifcation model.
3. Host the model on another instance
4. Perform batch transform to evaluate the model on testing data
5. Report the performance metrics that you see better test the model performance
6. write down your observation on the difference between the performance of using the simple and ensemble models.
Note: You are required to perform the above steps on the two combined datasets separatey.

In [None]:
# Necessary imports
import pandas as pd
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Assuming you've loaded the datasets
# combined_csv_v1 = pd.read_csv("combined_csv_v1.csv")
# combined_csv_v2 = pd.read_csv("combined_csv_v2.csv")

# 1. Splitting the data
def split_data(df):
    train_val, test = train_test_split(df, test_size=0.15, random_state=42)
    train, val = train_test_split(train_val, test_size=0.18, random_state=42)
    return train, val, test

train_v1, val_v1, test_v1 = split_data(combined_csv_v1)
train_v2, val_v2, test_v2 = split_data(combined_csv_v2)




In [None]:
import xgboost as xgb

# 2. Train XGBoost model
def train_xgboost(train_df, val_df, target_column='target'):
    dtrain = xgb.DMatrix(train_df.drop(columns=[target_column]), label=train_df[target_column])
    dval = xgb.DMatrix(val_df.drop(columns=[target_column]), label=val_df[target_column])

    params = {
        'objective': 'binary:logistic',  # for binary classification tasks
        'eval_metric': 'logloss',
        'max_depth': 6,
        'eta': 0.3,
    }

    watchlist = [(dtrain, 'train'), (dval, 'val')]
    model = xgb.train(params, dtrain, evals=watchlist, num_boost_round=1000, early_stopping_rounds=10, verbose_eval=False)
    return model

model_v1 = train_xgboost(train_v1, val_v1,target_column='target')
model_v2 = train_xgboost(train_v2, val_v2,target_column='target')





NameError: ignored

In [None]:
# 3. "Hosting" the Model on Another Instance
# This step would be similar to the pseudocode provided in the previous steps for SageMaker.

In [None]:
# 4. Batch Transform to Evaluate the Model
def evaluate_xgboost(model, test_df, target_column='target'):
    dtest = xgb.DMatrix(test_df.drop(columns=[target_column]))
    predictions = model.predict(dtest)
    return predictions

predictions_v1 = evaluate_xgboost(model_v1, test_v1,target_column='target')
predictions_v2 = evaluate_xgboost(model_v2, test_v2,target_column='target')



In [None]:
# 5. Report Performance Metrics
print("Metrics for combined_csv_v1 with XGBoost:")
print(classification_report(test_v1['target'], predictions_v1.round()))

print("\nMetrics for combined_csv_v2 with XGBoost:")
print(classification_report(test_v2['target'], predictions_v2.round()))

# Observations:
# You would need to compare the metrics (accuracy, precision, recall, F1 score, etc.) from the Logistic Regression model
# and the XGBoost model for both datasets to determine the performance improvement using ensemble methods.

Metrics for combined_csv_v1 with XGBoost:
              precision    recall  f1-score   support

         0.0       0.81      0.98      0.89    193677
         1.0       0.64      0.13      0.22     51662

    accuracy                           0.80    245339
   macro avg       0.72      0.56      0.55    245339
weighted avg       0.77      0.80      0.75    245339


Metrics for combined_csv_v2 with XGBoost:
              precision    recall  f1-score   support

         0.0       0.83      0.97      0.89    193677
         1.0       0.67      0.26      0.37     51662

    accuracy                           0.82    245339
   macro avg       0.75      0.61      0.63    245339
weighted avg       0.80      0.82      0.78    245339



In [None]:
# Write the final comments here and turn the cell type into markdown