# **Flight Cancellation Predictions for Flyzy**

**This notebook builds a model to predict flight cancellations for Flyzy. It follows a structured approach that covers data preprocessing, model building, and evaluation.**

## **1. Data Preprocessing**



### **1.1 Load the Dataset**

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score


In [None]:
# Loading the dataset

df = pd.read_excel(r"/content/Flyzy Flight Cancellation.xlsx")

### **1.2 Encoding Categorical Variables**

**Since machine learning models require numerical inputs, the categorical variables (Airline, Origin_Airport, Destination_Airport, and Airplane_Type) are one-hot encoded. The encoding itself is applied inside the modeling pipeline in section 2.1; the cells below inspect the data and separate the features from the target.**

In [None]:
print(df.head())

   Flight ID    Airline  Flight_Distance Origin_Airport Destination_Airport  \
0    7319483  Airline D              475      Airport 3           Airport 2   
1    4791965  Airline E              538      Airport 5           Airport 4   
2    2991718  Airline C              565      Airport 1           Airport 2   
3    4220106  Airline E              658      Airport 5           Airport 3   
4    2263008  Airline E              566      Airport 2           Airport 2   

   Scheduled_Departure_Time  Day_of_Week  Month Airplane_Type  Weather_Score  \
0                         4            6      1        Type C       0.225122   
1                        12            1      6        Type B       0.060346   
2                        17            3      9        Type C       0.093920   
3                         1            1      8        Type B       0.656750   
4                        19            7     12        Type E       0.505211   

   Previous_Flight_Delay_Minutes  Airline_Ra

In [None]:
# Separate the target variable and features
X = df.drop(columns=['Flight_Cancelled'])  # Drop only the target column
y = df['Flight_Cancelled']
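
As an illustration of the one-hot encoding described in this section, the following minimal sketch encodes a toy `Airline` column. The four rows are hypothetical values chosen to mirror the dataset preview; they are not taken from the real file.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy frame with hypothetical values mirroring the 'Airline' column
toy = pd.DataFrame({'Airline': ['Airline D', 'Airline E', 'Airline C', 'Airline E']})

encoder = OneHotEncoder()
# fit_transform returns a sparse matrix; convert it to a dense array for display
encoded = encoder.fit_transform(toy[['Airline']]).toarray()

# Categories are stored in sorted order; each row contains exactly one 1
categories = list(encoder.categories_[0])
print(categories)
print(encoded)
```

Each category becomes its own 0/1 column, so no artificial ordering between airlines is introduced, which is the usual reason to prefer one-hot over label encoding for a linear model.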

### **1.3 Splitting the Dataset**

**I have split the dataset into training and testing sets to evaluate the model's performance on unseen data.**

In [None]:
# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
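
The split above is purely random. When the target classes are uneven, a stratified split keeps the class ratio the same in both parts. The sketch below shows the `stratify` argument of `train_test_split` on hypothetical labels (80 zeros and 20 ones); it is an optional variation, not what the notebook ran.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 80 negatives and 20 positives
y_toy = np.array([0] * 80 + [1] * 20)
X_toy = np.arange(100).reshape(-1, 1)

# stratify=y_toy preserves the class proportions in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# Share of positives in the training and test parts
print(y_tr.mean(), y_te.mean())
```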

## **2. Model Building**



### **2.1 Logistic Regression**

**I have used Logistic Regression, a common algorithm for binary classification problems, to build the model.**

In [None]:
# Identify categorical and numerical columns
categorical_cols = ['Airline', 'Origin_Airport', 'Destination_Airport', 'Airplane_Type']
numerical_cols = [col for col in X.columns if col not in categorical_cols]
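
Note that the list comprehension above treats every non-categorical column as numerical, which includes `Flight ID`. If that column is only a record identifier (an assumption; the dataset description does not say), it can be left out of both lists. The sketch below uses a hypothetical three-column frame and a hypothetical `id_cols` list to show the idea.

```python
import pandas as pd

# Toy frame; column names mirror the dataset, values are hypothetical
X_toy = pd.DataFrame({
    'Flight ID': [7319483, 4791965],
    'Airline': ['Airline D', 'Airline E'],
    'Flight_Distance': [475, 538],
})

categorical_cols = ['Airline']
id_cols = ['Flight ID']  # assumption: an identifier with no predictive meaning

# Keep only columns that are neither categorical nor identifiers
numerical_cols = [col for col in X_toy.columns
                  if col not in categorical_cols + id_cols]
print(numerical_cols)
```

Because `ColumnTransformer` drops any column that is not listed in a transformer by default (`remainder='drop'`), excluding the identifier from both lists is enough to keep it out of the model.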

In [None]:
# Create a column transformer: standard scaling for numerical variables and
# one-hot encoding for categorical variables (categories unseen during
# fitting are encoded as all zeros instead of raising an error)
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numerical_cols),
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_cols)
    ])
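
By default, `OneHotEncoder` raises an error when it meets a category at prediction time that it did not see during fitting. The following minimal sketch contrasts that default with `handle_unknown='ignore'`. The value `'Type Z'` is hypothetical and does not appear in the dataset.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train = pd.DataFrame({'Airplane_Type': ['Type B', 'Type C', 'Type E']})
test = pd.DataFrame({'Airplane_Type': ['Type C', 'Type Z']})  # 'Type Z' is hypothetical

# Default behavior: an unseen category raises a ValueError at transform time
strict = OneHotEncoder().fit(train)
try:
    strict.transform(test)
    strict_failed = False
except ValueError:
    strict_failed = True

# With handle_unknown='ignore', the unseen category becomes a row of zeros
lenient = OneHotEncoder(handle_unknown='ignore').fit(train)
encoded = lenient.transform(test).toarray()

print(strict_failed)
print(encoded)
```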

In [None]:
# A pipeline that first applies the preprocessor and then fits the Logistic Regression model
model = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression(random_state=42))
])

In [None]:
# Train the model
model.fit(X_train, y_train)

In [None]:
# Make predictions on the test set
y_pred = model.predict(X_test)

## **3. Model Evaluation**



### **3.1 Evaluate the Model**

**The model is evaluated on the test set using accuracy, precision, recall, F1-score, and ROC-AUC.**

In [None]:
# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
# Note: ROC-AUC is computed here from hard 0/1 predictions,
# not from predicted probabilities
roc_auc = roc_auc_score(y_test, y_pred)

In [None]:
# Print the evaluation metrics
print(f'Accuracy: {accuracy:.2f}')
print(f'Precision: {precision:.2f}')
print(f'Recall: {recall:.2f}')
print(f'F1 Score: {f1:.2f}')
print(f'ROC-AUC Score: {roc_auc:.2f}')

Accuracy: 0.80
Precision: 0.84
Recall: 0.89
F1 Score: 0.86
ROC-AUC Score: 0.75
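
The ROC-AUC above is computed from hard class predictions, which reduces the ROC curve to a single operating point. `roc_auc_score` is normally given scores or probabilities, for example from `predict_proba`. The sketch below compares the two on a synthetic dataset from `make_classification`; the data and the resulting numbers are illustrative only and say nothing about the Flyzy results.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic binary data standing in for the real features (an assumption)
X_syn, y_syn = make_classification(
    n_samples=500, n_features=8, n_clusters_per_class=1, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=42
)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# ROC-AUC from hard 0/1 labels: a single threshold (0.5)
auc_from_labels = roc_auc_score(y_te, clf.predict(X_te))
# ROC-AUC from the probability of the positive class: all thresholds
auc_from_proba = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])

print(f'From labels: {auc_from_labels:.2f}')
print(f'From probabilities: {auc_from_proba:.2f}')
```

With the notebook's pipeline, the equivalent call would be `roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])`.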


## **Refining the model**

### **To improve the accuracy of the Logistic Regression model, I considered the following enhancements.**



**Hyperparameter Tuning:**

  Optimize the hyperparameters of the Logistic Regression model with GridSearchCV.

**Feature Engineering:**

  Create new features and transform existing ones to give the model more informative inputs.

**Handling Class Imbalance:**

  If the target classes are imbalanced, use SMOTE (Synthetic Minority Over-sampling Technique) or class weights to address it.

**Feature Selection:**

  Select the most important features to reduce noise and improve model performance.

**Additional Models:**

  Explore other models such as Random Forest, Gradient Boosting, or XGBoost to see whether they outperform Logistic Regression.
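
Of the options listed above, class weights are the simplest to try with Logistic Regression, because they only require one extra argument. The sketch below compares a plain model with `class_weight='balanced'` on a synthetic dataset with a 90/10 class ratio. The data is generated with `make_classification` and is an assumption made for illustration; it is not the Flyzy dataset.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic data with roughly 10% positives (hypothetical imbalance)
X_syn, y_syn = make_classification(
    n_samples=2000, n_features=8, weights=[0.9, 0.1],
    n_clusters_per_class=1, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=42, stratify=y_syn
)

# Plain model versus a model that reweights classes inversely to their frequency
plain = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
weighted = LogisticRegression(max_iter=1000, class_weight='balanced').fit(X_tr, y_tr)

recall_plain = recall_score(y_te, plain.predict(X_te))
recall_weighted = recall_score(y_te, weighted.predict(X_te))

print(f'Minority recall, plain: {recall_plain:.2f}')
print(f'Minority recall, balanced: {recall_weighted:.2f}')
```

Balanced weights typically raise recall on the minority class at some cost in precision, so the choice depends on which kind of error is more expensive.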

Of these, this notebook implements hyperparameter tuning with GridSearchCV to find the best parameters for Logistic Regression. Before tuning, the data is also checked for missing values and duplicates.

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score
from sklearn.impute import SimpleImputer


In [None]:
# Display the first few rows to understand the data structure
print(df.head())
print(df.info())
print(df.describe())

   Flight ID    Airline  Flight_Distance Origin_Airport Destination_Airport  \
0    7319483  Airline D              475      Airport 3           Airport 2   
1    4791965  Airline E              538      Airport 5           Airport 4   
2    2991718  Airline C              565      Airport 1           Airport 2   
3    4220106  Airline E              658      Airport 5           Airport 3   
4    2263008  Airline E              566      Airport 2           Airport 2   

   Scheduled_Departure_Time  Day_of_Week  Month Airplane_Type  Weather_Score  \
0                         4            6      1        Type C       0.225122   
1                        12            1      6        Type B       0.060346   
2                        17            3      9        Type C       0.093920   
3                         1            1      8        Type B       0.656750   
4                        19            7     12        Type E       0.505211   

   Previous_Flight_Delay_Minutes  Airline_Ra

### **Handle Missing Values: Impute or drop missing values.**


In [None]:
# Check for missing values
print(df.isnull().sum())

Flight ID                        0
Airline                          0
Flight_Distance                  0
Origin_Airport                   0
Destination_Airport              0
Scheduled_Departure_Time         0
Day_of_Week                      0
Month                            0
Airplane_Type                    0
Weather_Score                    0
Previous_Flight_Delay_Minutes    0
Airline_Rating                   0
Passenger_Load                   0
Flight_Cancelled                 0
dtype: int64


In [None]:
# Handling missing values
# The check above found none, so imputation acts only as a safeguard here:
# numerical columns use the mean and categorical columns use the mode
numerical_cols = df.select_dtypes(include=['float64', 'int64']).columns
categorical_cols = df.select_dtypes(include=['object', 'category']).columns

In [None]:
imputer_num = SimpleImputer(strategy='mean')
df[numerical_cols] = imputer_num.fit_transform(df[numerical_cols])

In [None]:
imputer_cat = SimpleImputer(strategy='most_frequent')
df[categorical_cols] = imputer_cat.fit_transform(df[categorical_cols])

In [None]:
# Verify that there are no more missing values
print(df.isnull().sum())

Flight ID                        0
Airline                          0
Flight_Distance                  0
Origin_Airport                   0
Destination_Airport              0
Scheduled_Departure_Time         0
Day_of_Week                      0
Month                            0
Airplane_Type                    0
Weather_Score                    0
Previous_Flight_Delay_Minutes    0
Airline_Rating                   0
Passenger_Load                   0
Flight_Cancelled                 0
dtype: int64
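
The imputers above are fit on the full DataFrame before the train/test split. Since this dataset has no missing values, that has no effect here, but a common alternative is to place imputation inside the pipeline so that its statistics are learned from the training rows only. The following is a minimal sketch of that arrangement on hypothetical toy rows; the column names mirror the dataset, while the values and gaps are invented for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical training rows with one gap per column
train = pd.DataFrame({
    'Weather_Score': [0.2, np.nan, 0.6, 0.4],
    'Airline': pd.Series(['Airline D', 'Airline E', np.nan, 'Airline E'], dtype=object),
})
# Hypothetical test row that is entirely missing
test = pd.DataFrame({
    'Weather_Score': [np.nan],
    'Airline': pd.Series([np.nan], dtype=object),
})

numeric = Pipeline([
    ('impute', SimpleImputer(strategy='mean')),
    ('scale', StandardScaler()),
])
categorical = Pipeline([
    ('impute', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore')),
])
prep = ColumnTransformer(
    [('num', numeric, ['Weather_Score']), ('cat', categorical, ['Airline'])],
    sparse_threshold=0,  # always return a dense array
)

# Statistics (mean, mode, scaling) come from the training rows only
prep.fit(train)
out = prep.transform(test)
learned_mean = prep.named_transformers_['num'].named_steps['impute'].statistics_[0]
print(learned_mean)
print(out)
```

The test row is filled with the training mean and the training mode, so nothing from the test set influences the preprocessing.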


### **Check for Duplicates**

In [None]:
# Check for duplicates and remove if any
df.drop_duplicates(inplace=True)

In [None]:
# Feature Engineering
# Assuming no additional features are required at this step

# Separate the target variable and features
X = df.drop(columns=['Flight_Cancelled'])
y = df['Flight_Cancelled']

In [None]:
# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [None]:
# Identify categorical and numerical columns
categorical_cols = ['Airline', 'Origin_Airport', 'Destination_Airport', 'Airplane_Type']
numerical_cols = [col for col in X.columns if col not in categorical_cols]

In [None]:
# Create a column transformer: standard scaling for numerical variables and
# one-hot encoding for categorical variables (categories unseen during
# fitting are encoded as all zeros instead of raising an error)
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numerical_cols),
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_cols)
    ])

In [None]:
# Create a pipeline that first applies the preprocessor and then fits the Logistic Regression model
pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression(random_state=42))
])


## **Improved Model with Hyperparameter Tuning**

In [None]:
# Define the hyperparameters for GridSearchCV
param_grid = {
    'classifier__C': [0.01, 0.1, 1, 10, 100],
    'classifier__penalty': ['l1', 'l2'],
    'classifier__solver': ['liblinear']
}

In [None]:
# Implement GridSearchCV
grid_search = GridSearchCV(pipeline, param_grid, cv=5, scoring='accuracy')

In [None]:
# Train the model using GridSearchCV
grid_search.fit(X_train, y_train)


In [None]:
# Get the best model from GridSearchCV
best_model = grid_search.best_estimator_


In [None]:
# Make predictions on the test set
y_pred = best_model.predict(X_test)

In [None]:
# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
# Note: ROC-AUC is computed here from hard 0/1 predictions,
# not from predicted probabilities
roc_auc = roc_auc_score(y_test, y_pred)

In [None]:
# Print the evaluation metrics and best hyperparameters
print(f'Best Hyperparameters: {grid_search.best_params_}')
print(f'Accuracy: {accuracy:.2f}')
print(f'Precision: {precision:.2f}')
print(f'Recall: {recall:.2f}')
print(f'F1 Score: {f1:.2f}')
print(f'ROC-AUC Score: {roc_auc:.2f}')

Best Hyperparameters: {'classifier__C': 0.1, 'classifier__penalty': 'l1', 'classifier__solver': 'liblinear'}
Accuracy: 0.80
Precision: 0.84
Recall: 0.88
F1 Score: 0.86
ROC-AUC Score: 0.75
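
To judge how much the tuning actually mattered, it can help to look at the cross-validation scores of every candidate rather than only the winner. The sketch below runs the same grid of `C` and `penalty` values on a synthetic dataset and tabulates `cv_results_`. The data, and the choice of `scoring='roc_auc'` instead of accuracy, are assumptions made for illustration.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic binary data standing in for the real features (an assumption)
X_syn, y_syn = make_classification(
    n_samples=300, n_features=8, n_clusters_per_class=1, random_state=42
)

search = GridSearchCV(
    LogisticRegression(solver='liblinear', random_state=42),
    {'C': [0.01, 0.1, 1, 10, 100], 'penalty': ['l1', 'l2']},
    cv=5,
    scoring='roc_auc',
)
search.fit(X_syn, y_syn)

# One row per hyperparameter combination, with mean and spread across folds
results = pd.DataFrame(search.cv_results_)[
    ['param_C', 'param_penalty', 'mean_test_score', 'std_test_score']
]
print(results.sort_values('mean_test_score', ascending=False).head())
print(search.best_params_)
```

If the mean scores of the candidates differ by less than their standard deviations across folds, the model is not very sensitive to these hyperparameters, and other kinds of changes are more likely to move the metrics.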


## **Conclusion**


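This notebook provides a complete workflow for predicting flight cancellations using Logistic Regression: data preprocessing (one-hot encoding of categorical variables and feature scaling), model training, and evaluation. The baseline model reached an accuracy of 0.80 and an F1 score of 0.86 on the test set. Hyperparameter tuning with GridSearchCV selected C = 0.1 with an L1 penalty but left the test metrics essentially unchanged (recall moved from 0.89 to 0.88), so further gains would likely have to come from the other enhancements listed above. A model like this can support Flyzy's business objectives of enhancing customer satisfaction, optimizing operational efficiency, improving business reputation, and increasing profitability.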