# Part D: Classification with Random Forest
This notebook addresses Part D of the assignment step by step.
The goal is to train a classification model using the provided training dataset (`datasetTV.csv`) and apply it to predict labels for the test dataset (`datasetTest.csv`).

## 1. Load and Preprocess the Data
We will:
- Load the training dataset (`datasetTV.csv`) and split it into features and labels.
- Load the test dataset (`datasetTest.csv`).
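- Rename the test dataset's columns to match the training dataset's feature names (this assumes both files list the features in the same order).
- Standardize the features using `StandardScaler`, fitted on the training data only and then applied to the test data.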
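
The preprocessing steps listed above can be sketched on a small synthetic frame. The column names (`f0`, `f1`, `label`) and all values are made up for illustration and are not taken from the assignment files. The point is that the scaler is fitted on the training features only and then reused unchanged for the test features.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data standing in for datasetTV.csv / datasetTest.csv
train = pd.DataFrame({
    "f0": [1.0, 2.0, 3.0, 4.0],
    "f1": [10.0, 20.0, 30.0, 40.0],
    "label": [0, 1, 0, 1],
})
test = pd.DataFrame([[2.5, 25.0], [5.0, 50.0]])  # columns are just 0 and 1

# Last column is the label, everything before it is a feature
X_train = train.iloc[:, :-1]
y_train = train.iloc[:, -1]

# Renaming is only safe when the feature counts agree
assert test.shape[1] == X_train.shape[1], "feature count mismatch"
X_test = test.copy()
X_test.columns = X_train.columns

# Fit on the training features only, then apply the same transform to the test set
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.mean(axis=0))  # approximately 0 for each feature
print(X_test_scaled[0])             # a row equal to the training mean maps to about 0
```

Fitting the scaler on the training set alone keeps information about the test distribution out of the preprocessing step.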

In [None]:

# Import necessary libraries
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler

# Load datasets
dataset_tv_path = "datasetTV.csv"
dataset_test_path = "datasetTest.csv"

train_data = pd.read_csv(dataset_tv_path)
test_data = pd.read_csv(dataset_test_path)

# Split into features and labels
X_train = train_data.iloc[:, :-1]  # All features except the last column
y_train = train_data.iloc[:, -1]   # Last column as the label

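# Rename columns of test data to match training data.
# Work on a copy so that test_data keeps its original column names.
X_test = test_data.copy()
X_test.columns = X_train.columns

# Standardize the data (the scaler is fitted on the training set only)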
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)


## 2. Train a Random Forest Classifier
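We will tune a `RandomForestClassifier` with `GridSearchCV` over `n_estimators`, `max_depth`, and `min_samples_split` (27 combinations, each evaluated with 3-fold cross-validation).
The combination with the highest mean cross-validation accuracy is selected and refit on the full training set.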
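
To illustrate how this selection works, the sketch below runs a much smaller grid on synthetic data generated with `make_classification`. The data and the reduced grid are assumptions chosen so the example runs quickly; they are not the assignment's dataset or search space.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the training set (not the assignment data)
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# Deliberately tiny grid: 2 x 2 = 4 combinations
param_grid = {"n_estimators": [10, 20], "max_depth": [3, 5]}

search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=3,
    scoring="accuracy",
)
search.fit(X, y)

# One mean cross-validation score per parameter combination
results = search.cv_results_
for combo, score in zip(results["params"], results["mean_test_score"]):
    print(combo, round(float(score), 4))

# The top-ranked combination is refit on all of X, y (refit=True by default)
print("Best parameters:", search.best_params_)
print("Best CV accuracy:", round(float(search.best_score_), 4))
```

`best_score_` is the mean accuracy over the held-out folds, so it is usually a more realistic estimate of performance than accuracy measured on the data the model was fitted on.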

In [None]:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score

# Train a Random Forest Classifier with hyperparameter tuning
rf_clf = RandomForestClassifier(random_state=42)
params = {
    'n_estimators': [50, 100, 150],
    'max_depth': [10, 20, 30],
    'min_samples_split': [2, 5, 10]
}

grid_search = GridSearchCV(rf_clf, params, cv=3, scoring='accuracy', n_jobs=-1)
grid_search.fit(X_train_scaled, y_train)

# Best model
best_rf = grid_search.best_estimator_
print("Best Parameters:", grid_search.best_params_)

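# Cross-validation accuracy of the selected parameters (estimate of generalization)
print(f"Best CV Accuracy: {grid_search.best_score_:.4f}")

# Evaluate on the training data. This score is optimistic because the model
# has already seen these samples during fitting.
y_train_pred = best_rf.predict(X_train_scaled)
train_accuracy = accuracy_score(y_train, y_train_pred)
print(f"Training Accuracy: {train_accuracy:.4f}")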


## 3. Predict on the Test Set
Using the trained model, we predict the labels for the test dataset and save the predictions as a NumPy array in `labelsX.npy`.
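
As a quick sanity check on the file format, the sketch below saves an array with `np.save` and reads it back with `np.load`. The prediction values are hypothetical, and the file is written to a temporary directory so that the example does not overwrite the real `labelsX.npy`.

```python
import os
import tempfile

import numpy as np

# Hypothetical predictions standing in for the model output
y_pred = np.array([0, 2, 1, 1, 0])

# Write to a temporary directory so the example leaves no files behind
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "labelsX.npy")
    np.save(path, y_pred)
    loaded = np.load(path)

# The round trip preserves shape, dtype, and values
print(loaded.shape, loaded.dtype)
print(np.array_equal(loaded, y_pred))
```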

In [None]:

# Predict on the test set
y_test_pred = best_rf.predict(X_test_scaled)

# Save predictions to a numpy file
output_path = "labelsX.npy"
np.save(output_path, y_test_pred)

print(f"Predictions saved to {output_path}")


## Observations
- The model is a Random Forest whose hyperparameters were tuned with a grid search and 3-fold cross-validation.
- The predictions for the test dataset were saved in NumPy format (`labelsX.npy`).
- The classifier uses a fixed `random_state=42`, so rerunning the notebook on the same data should give the same model and predictions.