# Problem: Predicting Airplane Delays

The goals of this notebook are:
- Process and create a dataset from downloaded ZIP files
- Exploratory data analysis (EDA)
- Establish a baseline model and improve it

## Introduction to business scenario
You work for a travel booking website that is working to improve the customer experience for flights that were delayed. The company wants to create a feature to let customers know if the flight will be delayed due to weather when the customers are booking the flight to or from the busiest airports for domestic travel in the US. 

You are tasked with solving part of this problem by leveraging machine learning to identify whether the flight will be delayed due to weather. You have been given access to the a dataset of on-time performance of domestic flights operated by large air carriers. You can use this data to train a machine learning model to predict if the flight is going to be delayed for the busiest airports.

### Dataset
The provided dataset contains scheduled and actual departure and arrival times reported by certified US air carriers that account for at least 1 percent of domestic scheduled passenger revenues. The data was collected by the Office of Airline Information, Bureau of Transportation Statistics (BTS). The dataset contains date, time, origin, destination, airline, distance, and delay status of flights for flights between 2014 and 2018.
The data are in 60 compressed files, where each file contains a CSV for the flight details in a month for the five years (from 2014 - 2018). The data can be downloaded from this [link](https://ucstaff-my.sharepoint.com/:f:/g/personal/ibrahim_radwan_canberra_edu_au/Er0nVreXmihEmtMz5qC5kVIB81-ugSusExPYdcyQTglfLg?e=bNO312). Please download the data files and place them on a relative path. Dataset(s) used in this assignment were compiled by the Office of Airline Information, Bureau of Transportation Statistics (BTS), Airline On-Time Performance Data, available with the following [link](https://www.transtats.bts.gov/Fields.asp?gnoyr_VQ=FGJ). 

# Step 1: Prepare the environment 

Use one of the labs which we have practised on with the Amazon Sagemakers where you perform the following steps:
1. Start a lab.
2. Create a notebook instance and name it "oncloudproject".
3. Increase the used memory to 25 GB from the additional configurations.
4. Open Jupyter Lab and upload this notebook into it.
5. Upload the two combined CVS files (combined_csv_v1.csv and combined_csv_v2.csv), which you created in Part A of this project.

# Step 2: Build and evaluate simple models

Write code to perform the follwoing steps:
1. Split data into training, validation and testing sets (70% - 15% - 15%).
2. Use linear learner estimator to build a classifcation model.
3. Host the model on another instance
4. Perform batch transform to evaluate the model on testing data
5. Report the performance metrics that you see better test the model performance 

Note: You are required to perform the above steps on the two combined datasets separatey and to comments on the difference.

### 1. Load and Split Data
Load the dataset (`combined_csv_v2.csv`) and split it into training, validation, and testing sets.


In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split

# Load the dataset
df = pd.read_csv("combined_csv_v2.csv")



  df = pd.read_csv("combined_csv_v2.csv")


### 2. Check and Clean NaN Values
Check for NaN values in each column, remove rows with any NaN values, and verify that no NaNs remain. Finally, save the cleaned data back to the CSV file.


In [2]:
# Check for NaN values
nan_counts = df.isna().sum()
print("NaN values in each column:\n", nan_counts[nan_counts > 0])

# Remove rows with any NaN values
df.dropna(inplace=True)

# Verify that there are no NaN values left
print("\nNaN values after cleaning:\n", df.isna().sum().sum())  # Should output 0 if no NaNs are left

# Save the cleaned data back to the CSV
df.to_csv("combined_csv_v2_cleaned.csv", index=False)
print("Saved cleaned data to 'combined_csv_v2_cleaned.csv'")

NaN values in each column:
 Year_2016         1
Year_2017         1
Year_2018         1
Quarter_2         1
Quarter_3         1
                 ..
DestState_GA      1
DestState_IL      1
DestState_NC      1
DestState_TX      1
isHoliday_True    1
Length: 93, dtype: int64

NaN values after cleaning:
 0
Saved cleaned data to 'combined_csv_v2_cleaned.csv'


### 3. Convert and Split Data for Model Training
Convert any `'TRUE'/'FALSE'` strings to Boolean values if necessary. Split the dataset into features and target, then further split into training (70%), validation (15%), and test (15%) sets. Save these splits to CSV files for later use.


In [3]:
# Convert any 'TRUE'/'FALSE' strings to Boolean values if necessary
df.replace({'TRUE': True, 'FALSE': False}, inplace=True)

# Split features and target
X = df.drop(columns=['target'])  # Replace 'target' with the actual target column name if different
y = df['target']

# Split data into training (70%) and temp (30%)
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.3, random_state=42)

# Split temp data into validation (15%) and test (15%)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)

# Save these to CSV for uploading to S3
X_train.to_csv("X_train.csv", index=False)
y_train.to_csv("y_train.csv", index=False)
X_val.to_csv("X_val.csv", index=False)
y_val.to_csv("y_val.csv", index=False)
X_test.to_csv("X_test.csv", index=False)
y_test.to_csv("y_test.csv", index=False)


### 4. Initialize S3 Client and Create Bucket
Initialize the S3 client using `boto3` and create a bucket (`flight-delay1`) in the current region. Handle different region constraints accordingly, with error handling to capture any issues during bucket creation.


In [4]:
import boto3

# Initialize the S3 client
s3 = boto3.client('s3')
bucket_name = 'flight-delay1'  # Bucket name specified

# Get the current region
current_region = boto3.session.Session().region_name

# Create the bucket based on region
try:
    if current_region == 'us-east-1':
        # For us-east-1, we do not need to specify LocationConstraint
        response = s3.create_bucket(Bucket=bucket_name)
    else:
        # For other regions, include the LocationConstraint
        response = s3.create_bucket(
            Bucket=bucket_name,
            CreateBucketConfiguration={'LocationConstraint': current_region}
        )
    print(f"Bucket '{bucket_name}' created successfully in {current_region}!")
except Exception as e:
    print(f"Error creating bucket: {e}")


Bucket 'flight-delay1' created successfully in us-east-1!


### 5. Upload Data Files to S3 Bucket
Use `boto3` to upload training, validation, and test data files to the `flight-delay1` S3 bucket under the `linear_learner` folder.


In [5]:
import boto3

# Initialize the S3 client
s3 = boto3.client('s3')
bucket_name = 'flight-delay1'  # The existing bucket name

# Upload files to the flight-delay S3 bucket
for file in ["X_train.csv", "y_train.csv", "X_val.csv", "y_val.csv", "X_test.csv", "y_test.csv"]:
    s3.upload_file(file, bucket_name, f'linear_learner/{file}')
    print(f"Uploaded {file} to s3://{bucket_name}/linear_learner/{file}")


Uploaded X_train.csv to s3://flight-delay1/linear_learner/X_train.csv
Uploaded y_train.csv to s3://flight-delay1/linear_learner/y_train.csv
Uploaded X_val.csv to s3://flight-delay1/linear_learner/X_val.csv
Uploaded y_val.csv to s3://flight-delay1/linear_learner/y_val.csv
Uploaded X_test.csv to s3://flight-delay1/linear_learner/X_test.csv
Uploaded y_test.csv to s3://flight-delay1/linear_learner/y_test.csv


### 6. Check for NaN Values in S3 Files
Download each data file from the `flight-delay1` S3 bucket, load it into a DataFrame, and check for any NaN values. Print the columns with NaN values if found, or confirm if there are none.


In [6]:
import pandas as pd
import boto3

# Initialize the S3 client
s3 = boto3.client('s3')
bucket_name = 'flight-delay1'

# List of file names to check for NaN values
files = ["X_train.csv", "y_train.csv", "X_val.csv", "y_val.csv", "X_test.csv", "y_test.csv"]

# Function to check NaN values in each file
for file in files:
    # Download file from S3
    s3.download_file(bucket_name, f'linear_learner/{file}', file)
    
    # Load the file into a DataFrame
    df = pd.read_csv(file)
    
    # Check for NaN values
    nan_counts = df.isna().sum()
    nan_columns = nan_counts[nan_counts > 0]
    
    if not nan_columns.empty:
        print(f"\nNaN values found in {file}:")
        print(nan_columns)
    else:
        print(f"No NaN values in {file}")


No NaN values in X_train.csv
No NaN values in y_train.csv
No NaN values in X_val.csv
No NaN values in y_val.csv
No NaN values in X_test.csv
No NaN values in y_test.csv


### 7. Check Column Names in S3 Files
Download each data file from the `flight-delay1` S3 bucket, load it into a DataFrame, and print the column names for each file to verify consistency.


In [7]:
import pandas as pd
import boto3

# Initialize the S3 client
s3 = boto3.client('s3')
bucket_name = 'flight-delay1'

# List of file names to check for column names
files = ["X_train.csv", "y_train.csv", "X_val.csv", "y_val.csv", "X_test.csv", "y_test.csv"]

# Function to check column names in each file
for file in files:
    # Download file from S3
    s3.download_file(bucket_name, f'linear_learner/{file}', file)
    
    # Load the file into a DataFrame
    df = pd.read_csv(file)
    
    # Print column names
    print(f"\nColumn names in {file}:")
    print(df.columns.tolist())



Column names in X_train.csv:
['CRSDepTime', 'Cancelled', 'Diverted', 'Distance', 'DistanceGroup', 'ArrDelay', 'ArrDelayMinutes', 'target.1', 'AirTime', 'DepHourofDay', 'AWND_O', 'PRCP_O', 'SNOW_O', 'TAVG_O', 'AWND_D', 'PRCP_D', 'SNOW_D', 'TAVG_D', 'Year_2015', 'Year_2016', 'Year_2017', 'Year_2018', 'Quarter_2', 'Quarter_3', 'Quarter_4', 'Month_2', 'Month_3', 'Month_4', 'Month_5', 'Month_6', 'Month_7', 'Month_8', 'Month_9', 'Month_10', 'Month_11', 'Month_12', 'Day_2', 'Day_3', 'Day_4', 'Day_5', 'Day_6', 'Day_7', 'Day_8', 'Day_9', 'Day_10', 'Day_11', 'Day_12', 'Day_13', 'Day_14', 'Day_15', 'Day_16', 'Day_17', 'Day_18', 'Day_19', 'Day_20', 'Day_21', 'Day_22', 'Day_23', 'Day_24', 'Day_25', 'Day_26', 'Day_27', 'Day_28', 'Day_29', 'Day_30', 'Day_31', 'DayOfWeek_2', 'DayOfWeek_3', 'DayOfWeek_4', 'DayOfWeek_5', 'DayOfWeek_6', 'DayOfWeek_7', 'Reporting_Airline_DL', 'Reporting_Airline_OO', 'Reporting_Airline_UA', 'Reporting_Airline_WN', 'Origin_CLT', 'Origin_DEN', 'Origin_DFW', 'Origin_IAH', 'O

### 8. Combine and Upload Training and Validation Data
Load training and validation features and targets, combine them with the target column as the first column, save these combined files, and upload them to the `flight-delay1` S3 bucket under the `linear_learner` folder.


In [8]:
import pandas as pd
import boto3

# Load training and validation features and targets, then combine them
X_train = pd.read_csv("X_train.csv")
y_train = pd.read_csv("y_train.csv")
train_combined = pd.concat([y_train, X_train], axis=1)  # Place target as the first column

X_val = pd.read_csv("X_val.csv")
y_val = pd.read_csv("y_val.csv")
val_combined = pd.concat([y_val, X_val], axis=1)

# Save combined files
train_combined.to_csv("train_combined.csv", index=False)
val_combined.to_csv("val_combined.csv", index=False)

# Upload to S3
s3 = boto3.client('s3')
bucket_name = 'flight-delay1'
s3.upload_file("train_combined.csv", bucket_name, "linear_learner/train_combined.csv")
s3.upload_file("val_combined.csv", bucket_name, "linear_learner/val_combined.csv")
print("Combined files uploaded to S3.")


Combined files uploaded to S3.


### 9. Train a Binary Classifier with SageMaker Linear Learner
Set up a SageMaker session, download combined training and validation data from the `flight-delay1` S3 bucket, prepare the data, and define a `LinearLearner` estimator for binary classification. Convert data to RecordSets and initiate model training.


In [9]:
from sagemaker import LinearLearner
import boto3
import pandas as pd
import sagemaker

# Set up SageMaker session and role
sagemaker_session = sagemaker.Session()
role = sagemaker.get_execution_role()
bucket_name = 'flight-delay1'

# Download and load data
train_file = "train_combined.csv"
val_file = "val_combined.csv"
s3 = boto3.client('s3')
s3.download_file(bucket_name, f'linear_learner/{train_file}', train_file)
s3.download_file(bucket_name, f'linear_learner/{val_file}', val_file)

# Prepare training and validation data as numpy arrays
train_df = pd.read_csv(train_file)
val_df = pd.read_csv(val_file)
X_train_np = train_df.drop(columns=['target']).values.astype('float32')
y_train_np = train_df['target'].values.astype('float32')
X_val_np = val_df.drop(columns=['target']).values.astype('float32')
y_val_np = val_df['target'].values.astype('float32')

# Define Linear Learner estimator
linear = LinearLearner(
    role=role,
    instance_count=1,
    instance_type='ml.m5.large',
    predictor_type='binary_classifier',
    sagemaker_session=sagemaker_session,
)

# Convert to RecordSets
train_record_set = linear.record_set(X_train_np, labels=y_train_np)
val_record_set = linear.record_set(X_val_np, labels=y_val_np, channel="validation")

# Train the model by passing the RecordSets directly
linear.fit([train_record_set, val_record_set])


sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /home/ec2-user/.config/sagemaker/config.yaml


INFO:sagemaker.image_uris:Same images used for training and inference. Defaulting to image scope: inference.
INFO:sagemaker.image_uris:Ignoring unnecessary instance type: None.
INFO:sagemaker:Creating training-job with name: linear-learner-2024-10-27-11-04-18-005


2024-10-27 11:04:19 Starting - Starting the training job...
2024-10-27 11:04:33 Starting - Preparing the instances for training...
2024-10-27 11:05:00 Downloading - Downloading input data...
2024-10-27 11:05:41 Downloading - Downloading the training image........[34mDocker entrypoint called with argument(s): train[0m
[34mRunning default environment configuration script[0m
[34m[10/27/2024 11:07:06 INFO 140606739629888] Reading default configuration from /opt/amazon/lib/python3.8/site-packages/algorithm/resources/default-input.json: {'mini_batch_size': '1000', 'epochs': '15', 'feature_dim': 'auto', 'use_bias': 'true', 'binary_classifier_model_selection_criteria': 'accuracy', 'f_beta': '1.0', 'target_recall': '0.8', 'target_precision': '0.8', 'num_models': 'auto', 'num_calibration_samples': '10000000', 'init_method': 'uniform', 'init_scale': '0.07', 'init_sigma': '0.01', 'init_bias': '0.0', 'optimizer': 'auto', 'loss': 'auto', 'margin': '1.0', 'quantile': '0.5', 'loss_insensitivity':

### 10. Deploy the Model to Create an Endpoint
Deploy the trained model to create an endpoint with `LinearLearner`. Retrieve and print the endpoint name for future inference.


In [10]:
# Deploy the model to create an endpoint
linear_predictor = linear.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.large"
)

# Get the endpoint name
model_name = linear_predictor.endpoint_name
print(f"Model deployed with endpoint name: {model_name}")


INFO:sagemaker.image_uris:Same images used for training and inference. Defaulting to image scope: inference.
INFO:sagemaker.image_uris:Ignoring unnecessary instance type: None.
INFO:sagemaker:Creating model with name: linear-learner-2024-10-27-11-08-10-588
INFO:sagemaker:Creating endpoint-config with name linear-learner-2024-10-27-11-08-10-588
INFO:sagemaker:Creating endpoint with name linear-learner-2024-10-27-11-08-10-588


-------!Model deployed with endpoint name: linear-learner-2024-10-27-11-08-10-588


### 11. Clean and Prepare Test Data
Load the test data without headers to avoid misinterpretation, convert all values to numeric (replacing non-numeric data with NaN), and remove rows with NaN values. Verify that the data has the expected number of columns, then save the cleaned test data as `X_test_final.csv` without headers or index.


In [21]:
import pandas as pd

# Load the test CSV without headers to avoid accidental misinterpretation
X_test = pd.read_csv("X_train.csv", header=None)

# Check for non-numeric data and ensure all values are numeric
X_test = X_test.apply(pd.to_numeric, errors='coerce')  # Convert to NaN if non-numeric

# Remove rows with NaN values
X_test.dropna(inplace=True)

# Ensure consistent row length matches expected number of features
expected_columns = 112  # Adjust this number based on your model's input features
if X_test.shape[1] != expected_columns:
    print(f"Error: Expected {expected_columns} columns, but found {X_test.shape[1]}.")
else:
    # Save the cleaned data without headers and index
    X_test.to_csv("X_test_final.csv", index=False, header=False)
    print("Cleaned data saved as X_test_final.csv.")


  X_test = pd.read_csv("X_train.csv", header=None)


Cleaned data saved as X_test_final.csv.


### 12. Validate and Adjust Test Data Column Count
Load the training and cleaned test datasets, and check for column count consistency. If `X_test` has extra columns, adjust it to match the expected number from `X_train`. Save the corrected test data as `X_test_final.csv`.


In [24]:
# Load the training and test datasets
X_train = pd.read_csv("X_train.csv")
X_test = pd.read_csv("X_test_final.csv")

# Check for matching column count
if X_test.shape[1] != X_train.shape[1]:
    print(f"Mismatch detected: X_test has {X_test.shape[1]} columns but the model expects {X_train.shape[1]}.")
    # Drop any extra columns if present
    X_test = X_test.iloc[:, :X_train.shape[1]]
    print(f"Adjusted X_test to {X_test.shape[1]} columns to match the model.")

# Save the corrected file
X_test.to_csv("X_test_final.csv", index=False)

### 13. Upload Cleaned Test Data to S3
Upload the cleaned test file (`X_test_final.csv`) to the `flight-delay1` S3 bucket under the `linear_learner` folder.


In [25]:
import boto3

# Upload the cleaned test file to S3
s3 = boto3.client('s3')
s3.upload_file("X_test_final.csv", "flight-delay1", "linear_learner/X_test_final.csv")
print("Uploaded cleaned test file to S3.")


Uploaded cleaned test file to S3.


### 14. Verify Cleaned Test Data
Load the cleaned test data (`X_test_final.csv`) to check its dimensions and ensure all columns are numeric.


In [27]:
import pandas as pd
test_data = pd.read_csv("X_test_final.csv")
print(test_data.shape)  # Check dimensions
print(test_data.dtypes) # Ensure all columns are numeric


(23819, 112)
1355.0    float64
0.0       float64
0.0.1     float64
2139.0    float64
9.0       float64
           ...   
0.0.85    float64
0.0.86    float64
0.0.87    float64
0.0.88    float64
0.0.89    float64
Length: 112, dtype: object


### 15. Reload and Clean Test Data with Simple Column Names
Reload the test data with simplified numeric column names, ensure all columns are numeric, and save the cleaned data as `X_test_final_cleaned.csv`.


In [28]:
# Reload with simple column names
test_data.columns = range(test_data.shape[1]) 
# Ensure all columns are numeric
test_data = test_data.apply(pd.to_numeric, errors='coerce')
# Save cleaned data
test_data.to_csv("X_test_final_cleaned.csv", index=False)


### 16. Perform Batch Transformation with SageMaker
Reload and clean test data, upload it to S3, and initiate a SageMaker batch transform job using the specified model endpoint. After processing, the transformation results are saved to the `linear_learner/output` folder in the `flight-delay1` S3 bucket.


In [29]:
from sagemaker.transformer import Transformer
import pandas as pd
import boto3
import os

# Initialize S3 client and specify bucket and file paths
s3 = boto3.client('s3')
bucket_name = "flight-delay1"
input_file = "linear_learner/X_test_final.csv"
output_path = f"s3://{bucket_name}/linear_learner/output"

# Reload, clean, and save test data
df = pd.read_csv(f"s3://{bucket_name}/{input_file}")
df.columns = range(df.shape[1])  # Rename columns for consistency
df = df.apply(pd.to_numeric, errors='coerce')
df.to_csv("/tmp/X_test_final_cleaned.csv", index=False)

# Upload cleaned data to S3
s3.upload_file("/tmp/X_test_final_cleaned.csv", bucket_name, input_file)

# Initialize Transformer with the correct model endpoint
transformer = Transformer(
    model_name="linear-learner-2024-10-27-11-08-10-588",
    instance_count=1,
    instance_type="ml.m5.large",
    output_path=output_path
)

# Run the batch transform job with cleaned data
transformer.transform(f"s3://{bucket_name}/{input_file}", content_type="text/csv", split_type="Line")
transformer.wait()

print("Batch transformation completed. Results saved to S3.")


severe performance issues, see also https://github.com/dask/dask/issues/10276

To fix, you should specify a lower version bound on s3fs, or
update the current installation.

INFO:botocore.credentials:Found credentials from IAM Role: BaseNotebookInstanceEc2InstanceRole
INFO:sagemaker:Creating transform job with name: linear-learner-2024-10-27-11-44-34-525


...............................[34mDocker entrypoint called with argument(s): serve[0m
[34mRunning default environment configuration script[0m
[34m[10/27/2024 11:49:42 INFO 140355388847936] Memory profiler is not enabled by the environment variable ENABLE_PROFILER.[0m
  if num_device is 1 and 'dist' not in kvstore:[0m
  if cons['type'] is 'ineq':[0m
  if len(self.X_min) is not 0:[0m
[34m[10/27/2024 11:49:46 INFO 140355388847936] loaded entry point class algorithm.serve.server_config:config_api[0m
[34m[10/27/2024 11:49:46 INFO 140355388847936] loading entry points[0m
[34m[10/27/2024 11:49:46 INFO 140355388847936] loaded request iterator application/json[0m
[34m[10/27/2024 11:49:46 INFO 140355388847936] loaded request iterator application/jsonlines[0m
[34m[10/27/2024 11:49:46 INFO 140355388847936] loaded request iterator application/x-recordio-protobuf[0m
[34m[10/27/2024 11:49:46 INFO 140355388847936] loaded request iterator text/csv[0m
[34m[10/27/2024 11:49:46 INFO

### 17. List Batch Transformation Output Files in S3
List the objects in the `linear_learner/output` folder of the `flight-delay1` S3 bucket to verify the files generated from the batch transformation job.


In [30]:
# List objects in the output folder
s3 = boto3.client('s3')
response = s3.list_objects_v2(Bucket='flight-delay1', Prefix='linear_learner/output/')
for obj in response.get('Contents', []):
    print(obj['Key'])


linear_learner/output/X_test_final.csv.out


### 18. Download Batch Transformation Results
Download the batch transformation output file (`X_test_final.csv.out`) from the `linear_learner/output` folder in the `flight-delay1` S3 bucket and save it locally as `predictions.csv`. Adjust the file name if necessary.


In [31]:
# Replace 'X_test.csv.out' with the correct file name if necessary
s3.download_file('flight-delay1', 'linear_learner/output/X_test_final.csv.out', 'predictions.csv')


### 19. Evaluate Model Predictions
Download the batch transformation predictions from the S3 bucket, parse them, and compare with the actual labels to evaluate the model's performance. Calculate and display metrics such as accuracy and a classification report.


In [32]:
import pandas as pd
import json
from sklearn.metrics import accuracy_score, classification_report
import boto3

# Define the S3 path for predictions and download them
output_key = 'linear_learner/output/X_test_final.csv.out'
s3 = boto3.client('s3')
s3.download_file('flight-delay1', output_key, 'predictions.csv')

# Load predictions, parsing JSON if needed
with open('predictions.csv', 'r') as f:
    predictions = pd.Series([json.loads(line)['predicted_label'] for line in f])

# Load the actual labels
y_test = pd.read_csv('y_test.csv', header=None).squeeze()

# Ensure both y_test and predictions are of numeric type
y_test = pd.to_numeric(y_test, errors='coerce')
predictions = pd.to_numeric(predictions, errors='coerce')

# Drop any rows where conversion failed (if any NaNs were introduced by conversion)
y_test = y_test.dropna().reset_index(drop=True)
predictions = predictions.dropna().reset_index(drop=True)

# Align lengths
min_length = min(len(y_test), len(predictions))
y_test_aligned = y_test.iloc[:min_length]
predictions_aligned = predictions.iloc[:min_length]

# Calculate and display performance metrics
print("Test Accuracy:", accuracy_score(y_test_aligned, predictions_aligned))
print(classification_report(y_test_aligned, predictions_aligned))


Test Accuracy: 0.6389212827988339
              precision    recall  f1-score   support

         0.0       0.76      0.76      0.76      5216
         1.0       0.25      0.25      0.25      1644

    accuracy                           0.64      6860
   macro avg       0.50      0.50      0.50      6860
weighted avg       0.64      0.64      0.64      6860



### Final Comments

The model achieved an overall test accuracy of **63.9%**. The performance metrics for each class are as follows:

- **Class 0 (Non-target)**: High precision (0.76) and recall (0.76), indicating that the model is relatively effective at identifying this majority class.
- **Class 1 (Target)**: Lower precision (0.25) and recall (0.25), suggesting the model struggles to identify this minority class accurately.

### Observations:

- The **macro average** f1-score (0.50) highlights the uneven performance across classes, while the **weighted average** (0.64) reflects the model's bias towards the majority class.

### Recommendations:

To improve Class 1 predictions, consider:
- **Balancing the dataset** through techniques like oversampling or synthetic data generation for the minority class.
- **Adjusting the model** to account for class imbalance, such as using class weights or exploring ensemble methods that could enhance minority class prediction.
