## Dataset Column Descriptions

| **Column Name**        | **Description**                                                                |
|------------------------|--------------------------------------------------------------------------------|
| **flightdate**         | The date of the flight in YYYY-MM-DD format.                                   |
| **day_of_week**        | The day of the week the flight occurred (e.g., Monday, Tuesday).               |
| **airline**            | The airline operating the flight (e.g., Delta, United).                        |
| **tail_number**        | The unique identifier for the aircraft.                                        |
| **dep_airport**        | The IATA code of the departure airport.                                        |
| **dep_cityname**       | The name of the city where the departure airport is located.                   |
| **deptime_label**      | The time label of the flight's departure (e.g., morning, afternoon).           |
| **dep_delay**          | The delay time in minutes for the departure.                                   |
| **dep_delay_tag**      | Whether the flight was delayed or not.                                         |
| **dep_delay_type**     | The severity of the delay time (e.g., low, medium, high).                      |
| **arr_airport**        | The IATA code of the arrival airport.                                          |
| **arr_cityname**       | The name of the city where the arrival airport is located.                     |
| **arr_delay**          | The delay time in minutes for the arrival.                                     |
| **arr_delay_type**     | The severity of the delay time (e.g., low, medium, high).                      |
| **flight_duration**    | The total duration of the flight in minutes.                                   |
| **distance_type**      | A classification of the flight's distance (e.g., short-haul, long-haul).       |
| **delay_carrier**      | Delay time (in minutes) attributed to the carrier.                             |
| **delay_weather**      | Delay time (in minutes) caused by weather conditions.                          |
| **delay_nas**          | Delay time (in minutes) due to National Airspace System issues.                |
| **delay_security**     | Delay time (in minutes) due to security-related issues.                        |
| **delay_lastaircraft** | Delay time (in minutes) caused by the previous aircraft's issues.              |
| **manufacturer**       | The manufacturer of the aircraft (e.g., Boeing, Airbus).                       |
| **model**              | The specific model of the aircraft (e.g., 737-800, A320).                      |
| **aicraft_age**        | The age of the aircraft in years.                                              |
| **tavg**               | The average temperature (°C) on the flight date.                               |
| **tmin**               | The minimum temperature (°C) on the flight date.                               |
| **tmax**               | The maximum temperature (°C) on the flight date.                               |
| **prcp**               | The amount of precipitation (mm) on the flight date.                           |
| **snow**               | The amount of snowfall (mm) on the flight date.                                |
| **wdir**               | The wind direction (degrees) on the flight date.                       |
| **wspd**               | The wind speed (km/h) on the flight date.                              |
| **pres**               | The atmospheric pressure (hPa) on the flight date.                             |

## Import Libraries

In [1]:
import pandas as pd
import numpy as np
import boto3
from io import BytesIO
from botocore.exceptions import NoCredentialsError, ClientError
import seaborn as sns
import time
import matplotlib.pyplot as plt
import warnings
from pyspark.sql import SparkSession
from pyspark.ml.classification import (
    LogisticRegression,
    RandomForestClassifier,
    GBTClassifier,
    DecisionTreeClassifier
)

# For Pipeline and Feature Transformation
from pyspark.ml import Pipeline
from pyspark.ml.feature import (
    VectorAssembler,
    StringIndexer,
    OneHotEncoder,
    StandardScaler,
    MinMaxScaler,
    RobustScaler
)

# For Model Evaluation
from pyspark.ml.evaluation import MulticlassClassificationEvaluator, BinaryClassificationEvaluator

# For SQL and DataFrame Operations
from pyspark.sql.functions import col, monotonically_increasing_id
from pyspark.sql.types import DoubleType

In [2]:
pd.set_option("display.max_columns", None)

warnings.filterwarnings("ignore")

sns.set(style="whitegrid")

In [3]:
spark = SparkSession.builder \
    .appName("Classification") \
    .config("spark.executor.memory", "16g") \
    .config("spark.driver.memory", "16g") \
    .config("spark.memory.offHeap.enabled", True) \
    .config("spark.memory.offHeap.size", "2g") \
    .config("spark.executor.cores", "2") \
    .config("spark.task.maxFailures", "4") \
    .getOrCreate()

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).


25/01/09 07:02:25 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


In [4]:
spark.sparkContext.setLogLevel("ERROR")

## Import Data from S3 Bucket

In [5]:
bucket = 'baobucketaws'
prefix = 'asm3/raw/airport/'

# Create an S3 client
s3 = boto3.client('s3')


def read_csv_from_s3_as_df(bucket, key):
    try:
        # Get the object from S3
        obj = s3.get_object(Bucket=bucket, Key=key)

        # Read the contents of the file into a pandas DataFrame
        df_pre_clean = pd.read_csv(BytesIO(obj['Body'].read()), header=0)

        return df_pre_clean
    except NoCredentialsError:
        print("Credentials not available")
    except ClientError as e:
        print(f"An error occurred: {e}")
    except Exception as e:
        print(f"An error occurred during DataFrame conversion: {e}")

In [6]:
def list_s3_csv_files(bucket, prefix):
    try:
        response = s3.list_objects_v2(Bucket=bucket, Prefix=prefix)
        if 'Contents' in response:
            return [obj['Key'] for obj in response['Contents'] if obj['Key'].endswith('.csv')]
        else:
            return []
    except ClientError as e:
        print(f"An error occurred while listing files: {e}")
        return []


# Fetch all CSV files in the folder
csv_keys = list_s3_csv_files(bucket, prefix)

# Initialize an empty list to hold DataFrames
dataframes = []

# Read and concatenate all CSV files
for key in csv_keys:
    df = read_csv_from_s3_as_df(bucket, key)
    if df is not None:
        dataframes.append(df)

# Concatenate all DataFrames into a single DataFrame
if dataframes:
    concatenated_df = pd.concat(dataframes, ignore_index=True)
    print("Concatenated DataFrame:")
    print(concatenated_df)
else:
    print("No dataframes to concatenate.")

Concatenated DataFrame:
         FlightDate  Day_Of_Week                Airline Tail_Number  \
0        2023-01-02            1           Endeavor Air      N601LR   
1        2023-01-03            2           Endeavor Air      N910XJ   
2        2023-01-04            3           Endeavor Air      N607LR   
3        2023-01-05            4           Endeavor Air      N600LR   
4        2023-01-06            5           Endeavor Air      N607LR   
...             ...          ...                    ...         ...   
6743399  2023-12-19            2  Skywest Airlines Inc.      N507SY   
6743400  2023-12-29            5  Skywest Airlines Inc.      N520SY   
6743401  2023-12-22            5  Skywest Airlines Inc.      N706SK   
6743402  2023-12-28            4  Skywest Airlines Inc.      N707SK   
6743403  2023-12-29            5  Skywest Airlines Inc.      N719EV   

        Dep_Airport                    Dep_CityName DepTime_label  Dep_Delay  \
0               ABE  Allentown/Bethlehem/Ea

## Data Exploration

In [7]:
df = concatenated_df

In [8]:
df.columns = df.columns.str.lower()

In [9]:
df.head(5)

Unnamed: 0,flightdate,day_of_week,airline,tail_number,dep_airport,dep_cityname,deptime_label,dep_delay,dep_delay_tag,dep_delay_type,arr_airport,arr_cityname,arr_delay,arr_delay_type,flight_duration,distance_type,delay_carrier,delay_weather,delay_nas,delay_security,delay_lastaircraft,manufacturer,model,aicraft_age,time,tavg,tmin,tmax,prcp,snow,wdir,wspd,pres,airport_id
0,2023-01-02,1,Endeavor Air,N601LR,ABE,"Allentown/Bethlehem/Easton, PA",Afternoon,-5,0,Low <5min,ATL,"Atlanta, GA",-23,Low <5min,117,Short Haul >1500Mi,0,0,0,0,0,CANADAIR REGIONAL JET,CRJ,17,2023-01-02,5.4,0.0,11.7,0.0,0.0,353.0,3.6,1019.6,ABE
1,2023-01-03,2,Endeavor Air,N910XJ,ABE,"Allentown/Bethlehem/Easton, PA",Afternoon,-4,0,Low <5min,ATL,"Atlanta, GA",-5,Low <5min,134,Short Haul >1500Mi,0,0,0,0,0,CANADAIR REGIONAL JET,CRJ,17,2023-01-03,8.4,7.2,9.4,15.2,0.0,50.0,5.0,1013.9,ABE
2,2023-01-04,3,Endeavor Air,N607LR,ABE,"Allentown/Bethlehem/Easton, PA",Afternoon,9,1,Low <5min,ATL,"Atlanta, GA",1,Low <5min,127,Short Haul >1500Mi,0,0,0,0,0,CANADAIR REGIONAL JET,CRJ,16,2023-01-04,11.1,6.7,17.2,0.0,0.0,302.0,4.7,1009.8,ABE
3,2023-01-05,4,Endeavor Air,N600LR,ABE,"Allentown/Bethlehem/Easton, PA",Afternoon,-5,0,Low <5min,ATL,"Atlanta, GA",12,Low <5min,152,Short Haul >1500Mi,0,0,0,0,0,CANADAIR REGIONAL JET,CRJ,17,2023-01-05,12.7,6.7,14.4,7.9,0.0,292.0,7.2,1013.0,ABE
4,2023-01-06,5,Endeavor Air,N607LR,ABE,"Allentown/Bethlehem/Easton, PA",Afternoon,-7,0,Low <5min,ATL,"Atlanta, GA",-23,Low <5min,119,Short Haul >1500Mi,0,0,0,0,0,CANADAIR REGIONAL JET,CRJ,16,2023-01-06,5.8,2.8,7.2,5.8,0.0,308.0,9.0,1016.6,ABE


In [10]:
def sample_flights(dataframe, dep_airport_col='dep_airport', date_col='flightdate', n_flights=10):
    dataframe[date_col] = pd.to_datetime(dataframe[date_col])

    # Group by dep_airport and date
    grouped = dataframe.groupby([dep_airport_col, date_col])

    # Sample 10 flights from each group, allowing replacement if a group has fewer rows
    sampled_data = grouped.apply(lambda x: x.sample(n=n_flights, replace=len(x) < n_flights)).reset_index(drop=True)

    return sampled_data

In [11]:
df = sample_flights(df, dep_airport_col='dep_airport', date_col='flightdate', n_flights=10)

# Verify the sampling
df.shape

(1192190, 34)

In [12]:
df = spark.createDataFrame(df)

## Data Cleaning

In [13]:
df.dtypes

[('flightdate', 'timestamp'),
 ('day_of_week', 'bigint'),
 ('airline', 'string'),
 ('tail_number', 'string'),
 ('dep_airport', 'string'),
 ('dep_cityname', 'string'),
 ('deptime_label', 'string'),
 ('dep_delay', 'bigint'),
 ('dep_delay_tag', 'bigint'),
 ('dep_delay_type', 'string'),
 ('arr_airport', 'string'),
 ('arr_cityname', 'string'),
 ('arr_delay', 'bigint'),
 ('arr_delay_type', 'string'),
 ('flight_duration', 'bigint'),
 ('distance_type', 'string'),
 ('delay_carrier', 'bigint'),
 ('delay_weather', 'bigint'),
 ('delay_nas', 'bigint'),
 ('delay_security', 'bigint'),
 ('delay_lastaircraft', 'bigint'),
 ('manufacturer', 'string'),
 ('model', 'string'),
 ('aicraft_age', 'bigint'),
 ('time', 'string'),
 ('tavg', 'double'),
 ('tmin', 'double'),
 ('tmax', 'double'),
 ('prcp', 'double'),
 ('snow', 'double'),
 ('wdir', 'double'),
 ('wspd', 'double'),
 ('pres', 'double'),
 ('airport_id', 'string')]

In [14]:
df = df.drop("time", "airport_id")

In [15]:
num_cols = [col for col, dtype in df.dtypes if dtype in ["double", "bigint", "int"]]

cat_cols = [col for col, dtype in df.dtypes if dtype in ["string", "timestamp"]]

In [16]:
print(f"Numerical Columns: {num_cols} with count of {len(num_cols)}","\n")

print(f"Categorical Columns: {cat_cols} with count of {len(cat_cols)}")

Numerical Columns: ['day_of_week', 'dep_delay', 'dep_delay_tag', 'arr_delay', 'flight_duration', 'delay_carrier', 'delay_weather', 'delay_nas', 'delay_security', 'delay_lastaircraft', 'aicraft_age', 'tavg', 'tmin', 'tmax', 'prcp', 'snow', 'wdir', 'wspd', 'pres'] with count of 19 

Categorical Columns: ['flightdate', 'airline', 'tail_number', 'dep_airport', 'dep_cityname', 'deptime_label', 'dep_delay_type', 'arr_airport', 'arr_cityname', 'arr_delay_type', 'distance_type', 'manufacturer', 'model'] with count of 13


## Train Model (Classification)

In [17]:
def encode_data(X_train, X_test, encoder_type='label', columns=None):
    if columns is None:
        # Default to all string (categorical) columns if no columns are specified
        columns = [coll for coll, dtype in X_train.dtypes if dtype == 'string']

    stages = []  # List to store transformation stages

    for coll in columns:
        if encoder_type == 'label':
            # Label Encoding
            indexer = StringIndexer(inputCol=coll, outputCol=f"{coll}_indexed", handleInvalid="keep")
            stages.append(indexer)
        elif encoder_type == 'onehot':
            # One-Hot Encoding
            indexer = StringIndexer(inputCol=coll, outputCol=f"{coll}_indexed", handleInvalid="keep")
            onehot_encoder = OneHotEncoder(inputCol=f"{coll}_indexed", outputCol=f"{coll}_encoded")
            stages.append(indexer)
            stages.append(onehot_encoder)

    # Create a Pipeline with all the stages
    pipeline = Pipeline(stages=stages)

    # Fit the pipeline on the training data
    pipeline_model = pipeline.fit(X_train)

    # Transform both train and test data
    X_train_encoded = pipeline_model.transform(X_train)
    X_test_encoded = pipeline_model.transform(X_test)

    return X_train_encoded, X_test_encoded

In [18]:
def encode_target(y_train, y_test, encoder_type='label'):
    if encoder_type == 'label':
        # Combine y_train and y_test for consistent encoding
        combined_df = y_train.union(y_test)

        # StringIndexer for label encoding
        indexer = StringIndexer(inputCol="label", outputCol="label_indexed")
        indexer_model = indexer.fit(combined_df)  # Fit on combined data to ensure consistency

        # Transform training and testing data
        y_train_encoded = indexer_model.transform(y_train)
        y_test_encoded = indexer_model.transform(y_test)
    else:
        raise ValueError("Invalid encoder_type. Currently supported: 'label'.")

    return y_train_encoded, y_test_encoded

In [19]:
def scale_data(X_train, X_test, scaler_type='standard'):
    # Step 1: Choose the appropriate scaler
    if scaler_type == 'standard':
        scaler = StandardScaler(inputCol="features", outputCol="scaled_features", withMean=True, withStd=True)
    elif scaler_type == 'minmax':
        scaler = MinMaxScaler(inputCol="features", outputCol="scaled_features")
    elif scaler_type == 'robust':
        scaler = RobustScaler(inputCol="features", outputCol="scaled_features")
    else:
        raise ValueError("Invalid scaler_type. Choose from 'standard', 'minmax', 'robust'.")

    # Step 2: Create a Pipeline
    pipeline = Pipeline(stages=[scaler])

    # Step 3: Fit the pipeline on the training data
    pipeline_model = pipeline.fit(X_train)

    # Step 4: Transform both train and test data
    X_train_scaled = pipeline_model.transform(X_train).drop("features").withColumnRenamed("scaled_features", "features")
    X_test_scaled = pipeline_model.transform(X_test).drop("features").withColumnRenamed("scaled_features", "features")

    return X_train_scaled, X_test_scaled

In [20]:
def evaluate_classification_models(X_train, y_train, X_test, y_test, models):
    # Combine features and labels into single DataFrames
    train_data = X_train.join(y_train, "id")
    test_data = X_test.join(y_test, "id")

    model_results = []
    trained_models = {}

    for model in models:
        start_time = time.time()

        # Train the model using a pipeline
        pipeline = Pipeline(stages=[model])
        trained_model = pipeline.fit(train_data)
        trained_models[model.__class__.__name__] = trained_model

        # Make predictions on both train and test datasets
        predictions = trained_model.transform(test_data)

        # Define evaluators for metrics
        evaluator_accuracy = MulticlassClassificationEvaluator(labelCol="label", predictionCol="prediction", metricName="accuracy")
        evaluator_f1 = MulticlassClassificationEvaluator(labelCol="label", predictionCol="prediction", metricName="f1")
        evaluator_precision = MulticlassClassificationEvaluator(labelCol="label", predictionCol="prediction", metricName="weightedPrecision")
        evaluator_recall = MulticlassClassificationEvaluator(labelCol="label", predictionCol="prediction", metricName="weightedRecall")
        evaluator_roc = BinaryClassificationEvaluator(labelCol="label", rawPredictionCol="rawPrediction", metricName="areaUnderROC")
        evaluator_pr = BinaryClassificationEvaluator(labelCol="label", rawPredictionCol="rawPrediction", metricName="areaUnderPR")

        # Calculate metrics for test data
        accuracy = evaluator_accuracy.evaluate(predictions)
        f1_score = evaluator_f1.evaluate(predictions)
        precision = evaluator_precision.evaluate(predictions)
        recall = evaluator_recall.evaluate(predictions)
        roc_auc = evaluator_roc.evaluate(predictions)
        pr_auc = evaluator_pr.evaluate(predictions)

        inference_time = time.time() - start_time  # Inference time in seconds

        # Log results
        print(f"{model.__class__.__name__} is ready")

        model_results.append({
            "Model-Name": model.__class__.__name__,
            "Test_Accuracy": accuracy,
            "F1_Score": f1_score,
            "Precision": precision,
            "Recall": recall,
            "ROC_AUC": roc_auc,
            "PR_AUC": pr_auc,
            "Inference Time (ms)": inference_time * 1000
        })

    # Convert results to a pandas DataFrame
    models_df = pd.DataFrame(model_results)
    models_df = models_df.set_index("Model-Name")

    return models_df.sort_values("Test_Accuracy", ascending=False), trained_models

In [21]:
def evaluate_classification_metrics(y_true, y_pred, target_names=None, display=True):
    # Ensure the columns are named correctly for PySpark evaluation
    data = y_true.join(y_pred, "id")

    # Initialize evaluators
    evaluator_accuracy = MulticlassClassificationEvaluator(labelCol="label", predictionCol="prediction", metricName="accuracy")
    evaluator_f1 = MulticlassClassificationEvaluator(labelCol="label", predictionCol="prediction", metricName="f1")
    evaluator_precision = MulticlassClassificationEvaluator(labelCol="label", predictionCol="prediction", metricName="weightedPrecision")
    evaluator_recall = MulticlassClassificationEvaluator(labelCol="label", predictionCol="prediction", metricName="weightedRecall")

    # Calculate metrics
    accuracy = evaluator_accuracy.evaluate(data)
    f1_score = evaluator_f1.evaluate(data)
    precision = evaluator_precision.evaluate(data)
    recall = evaluator_recall.evaluate(data)

    # Confusion matrix
    cm_df = data.groupBy("label", "prediction").count().toPandas()
    cm = pd.crosstab(cm_df["label"], cm_df["prediction"], values=cm_df["count"], aggfunc="sum").fillna(0).to_numpy()

    # Normalized confusion matrix
    cm_normalized = cm.astype("float") / cm.sum(axis=1)[:, np.newaxis]

    # Optionally display confusion matrix
    if display:
        plt.figure(figsize=(8, 6))
        sns.heatmap(cm_normalized, annot=True, fmt=".2f", cmap="Blues", xticklabels=target_names, yticklabels=target_names)
        plt.title("Normalized Confusion Matrix")
        plt.xlabel("Predicted Label")
        plt.ylabel("True Label")
        plt.show()

    # Generate classification report
    if target_names is None:
        target_names = [str(i) for i in range(cm.shape[0])]

    classification_report_df = pd.DataFrame({
        "Precision": cm.diagonal() / cm.sum(axis=0),
        "Recall": cm.diagonal() / cm.sum(axis=1),
        "F1-Score": 2 * (cm.diagonal() / cm.sum(axis=0)) * (cm.diagonal() / cm.sum(axis=1)) /
                        ((cm.diagonal() / cm.sum(axis=0)) + (cm.diagonal() / cm.sum(axis=1))),
    }, index=target_names).fillna(0)

    # Return results in a dictionary
    evaluation_results = {
        "Accuracy": accuracy,
        "F1 Score": f1_score,
        "Precision": precision,
        "Recall": recall,
        "Confusion Matrix": cm,
        "Confusion Matrix Normalized": cm_normalized,
        "Classification Report": classification_report_df,
    }

    return evaluation_results

In [22]:
classification_models = [
    LogisticRegression(featuresCol="features", labelCol="label", maxIter=10, regParam=0.01),

    RandomForestClassifier(featuresCol="features", labelCol="label", numTrees=50, seed=42),

    GBTClassifier(featuresCol="features", labelCol="label", maxIter=10, seed=42),

    DecisionTreeClassifier(featuresCol="features", labelCol="label", maxDepth=5, seed=42),
]

In [23]:
# Combine X and y into a single DataFrame
data = df.select(
    *[col(c) for c in df.columns if c not in [
        'dep_delay', "flightdate", "tail_number", "deptime_label",
        "dep_airport", "dep_cityname", "tmin", "tmax", "day_of_week",
        "delay_carrier", "delay_nas", "delay_security", "delay_lastaircraft", "delay_weather"
    ]],  # Keep only desired columns
).withColumn("label", col("dep_delay_tag").cast(DoubleType()))  # Set 'label' column as target variable

# Drop the original target column to avoid duplication
data = data.drop("dep_delay_tag")

In [24]:
data = data.withColumn("id", monotonically_increasing_id())

In [25]:
data.dtypes

[('airline', 'string'),
 ('dep_delay_type', 'string'),
 ('arr_airport', 'string'),
 ('arr_cityname', 'string'),
 ('arr_delay', 'bigint'),
 ('arr_delay_type', 'string'),
 ('flight_duration', 'bigint'),
 ('distance_type', 'string'),
 ('manufacturer', 'string'),
 ('model', 'string'),
 ('aicraft_age', 'bigint'),
 ('tavg', 'double'),
 ('prcp', 'double'),
 ('snow', 'double'),
 ('wdir', 'double'),
 ('wspd', 'double'),
 ('pres', 'double'),
 ('label', 'double'),
 ('id', 'bigint')]

In [26]:
feature_columns = [col for col, dtype in data.dtypes if dtype in ['double', 'bigint'] and col != 'label']
vector_assembler = VectorAssembler(inputCols=feature_columns, outputCol="features")
data = vector_assembler.transform(data).select("id", "features", "label")

In [27]:
# Split the data into training and testing sets
train_data, test_data = data.randomSplit([0.8, 0.2], seed=42)

X_train = train_data.select("id", "features")
y_train = train_data.select("id", "label")
X_test = test_data.select("id", "features")
y_test = test_data.select("id", "label")

In [28]:
X_train_encoded, X_test_encoded = encode_data(X_train, X_test, encoder_type='label')

In [29]:
y_train_encoded, y_test_encoded = encode_target(y_train, y_test)

                                                                                

In [30]:
models_class_no_s, trained_no_s = evaluate_classification_models(X_train_encoded, y_train_encoded, X_test_encoded,
                                                                 y_test_encoded, classification_models)

                                                                                

LogisticRegression is ready


                                                                                

RandomForestClassifier is ready


                                                                                

GBTClassifier is ready


                                                                                

DecisionTreeClassifier is ready


In [31]:
models_class_no_s.iloc[:, :-1]

Unnamed: 0_level_0,Test_Accuracy,F1_Score,Precision,Recall,ROC_AUC,PR_AUC
Model-Name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
GBTClassifier,0.860186,0.855004,0.857366,0.860186,0.910736,0.846864
DecisionTreeClassifier,0.857656,0.85287,0.854386,0.857656,0.548771,0.578424
RandomForestClassifier,0.854373,0.851973,0.851264,0.854373,0.881739,0.809275
LogisticRegression,0.826738,0.804625,0.841806,0.826738,0.901547,0.823468


In [32]:
X_train_ss, X_test_ss = scale_data(X_train_encoded, X_test_encoded, scaler_type="standard")

                                                                                

In [33]:
models_class_ss, trained_ss = evaluate_classification_models(X_train_ss, y_train_encoded, X_test_ss,
                                                             y_test_encoded, classification_models)

models_class_ss.iloc[:, :-1]

                                                                                

LogisticRegression is ready


                                                                                

RandomForestClassifier is ready


                                                                                

GBTClassifier is ready


                                                                                

DecisionTreeClassifier is ready


Unnamed: 0_level_0,Test_Accuracy,F1_Score,Precision,Recall,ROC_AUC,PR_AUC
Model-Name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
GBTClassifier,0.860207,0.855301,0.857237,0.860207,0.910774,0.846757
DecisionTreeClassifier,0.857028,0.854012,0.853687,0.857028,0.544734,0.580071
RandomForestClassifier,0.853556,0.850413,0.850029,0.853556,0.887456,0.813551
LogisticRegression,0.826738,0.804625,0.841806,0.826738,0.901546,0.82347


In [34]:
X_train_sm, X_test_sm = scale_data(X_train_encoded, X_test_encoded, scaler_type="minmax")

                                                                                

In [35]:
models_class_mm, trained_mm = evaluate_classification_models(X_train_sm, y_train_encoded, X_test_sm,
                                                             y_test_encoded, classification_models)

models_class_mm.iloc[:, :-1]

                                                                                

LogisticRegression is ready


                                                                                

RandomForestClassifier is ready


                                                                                

GBTClassifier is ready


                                                                                

DecisionTreeClassifier is ready


Unnamed: 0_level_0,Test_Accuracy,F1_Score,Precision,Recall,ROC_AUC,PR_AUC
Model-Name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
GBTClassifier,0.859319,0.854858,0.856088,0.859319,0.910653,0.846566
DecisionTreeClassifier,0.857325,0.853857,0.853884,0.857325,0.551643,0.584544
RandomForestClassifier,0.853665,0.850546,0.850151,0.853665,0.889513,0.812863
LogisticRegression,0.826738,0.804625,0.841806,0.826738,0.901546,0.823468


In [36]:
X_train_sr, X_test_sr = scale_data(X_train_encoded, X_test_encoded, scaler_type="robust")

                                                                                

In [37]:
models_class_rs, trained_rs = evaluate_classification_models(X_train_sr, y_train_encoded, X_test_sr,
                                                             y_test_encoded, classification_models)

models_class_rs.iloc[:, :-1]

                                                                                

LogisticRegression is ready


                                                                                

RandomForestClassifier is ready


                                                                                

GBTClassifier is ready


                                                                                

DecisionTreeClassifier is ready


Unnamed: 0_level_0,Test_Accuracy,F1_Score,Precision,Recall,ROC_AUC,PR_AUC
Model-Name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
GBTClassifier,0.860186,0.855004,0.857366,0.860186,0.910766,0.846868
DecisionTreeClassifier,0.857656,0.85287,0.854386,0.857656,0.548771,0.578424
RandomForestClassifier,0.854365,0.851985,0.851269,0.854365,0.881719,0.808871
LogisticRegression,0.826729,0.804614,0.841799,0.826729,0.901546,0.823471


In [38]:
models_class_no_s["Scaler"] = "No Scaling"
models_class_ss["Scaler"] = "Standard Scaler"
models_class_mm["Scaler"] = "MinMax Scaler"
models_class_rs["Scaler"] = "Robust Scaler"


all_models = pd.concat([models_class_no_s, models_class_ss, models_class_mm, models_class_rs], axis=0)
all_models = all_models.sort_values(by="Test_Accuracy", ascending=False)
all_models

Unnamed: 0_level_0,Test_Accuracy,F1_Score,Precision,Recall,ROC_AUC,PR_AUC,Inference Time (ms),Scaler
Model-Name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
GBTClassifier,0.860207,0.855301,0.857237,0.860207,0.910774,0.846757,82801.120758,Standard Scaler
GBTClassifier,0.860186,0.855004,0.857366,0.860186,0.910736,0.846864,85721.712351,No Scaling
GBTClassifier,0.860186,0.855004,0.857366,0.860186,0.910766,0.846868,83913.283825,Robust Scaler
GBTClassifier,0.859319,0.854858,0.856088,0.859319,0.910653,0.846566,82483.63924,MinMax Scaler
DecisionTreeClassifier,0.857656,0.85287,0.854386,0.857656,0.548771,0.578424,75446.5518,No Scaling
DecisionTreeClassifier,0.857656,0.85287,0.854386,0.857656,0.548771,0.578424,76135.913372,Robust Scaler
DecisionTreeClassifier,0.857325,0.853857,0.853884,0.857325,0.551643,0.584544,75215.095758,MinMax Scaler
DecisionTreeClassifier,0.857028,0.854012,0.853687,0.857028,0.544734,0.580071,75246.641397,Standard Scaler
RandomForestClassifier,0.854373,0.851973,0.851264,0.854373,0.881739,0.809275,115112.721443,No Scaling
RandomForestClassifier,0.854365,0.851985,0.851269,0.854365,0.881719,0.808871,114909.654856,Robust Scaler
