# TDS Project: Part 2 - Advanced Model Analysis, Optimization, and Conclusions
**Group Members:**

- Adir Elmakais - 316413640

## Installation Guide

### Use Python 3.12.0

To get started with the project, ensure you are using **Python 3.12.0**.

1. **Install Python 3.12.0**:
   - Download the installer for Python 3.12.0 from the [official Python website](https://www.python.org/downloads/release/python-3120/).
   - If you use the Windows installer, check the **"Add Python to PATH"** box during installation.

2. **macOS: Install `libomp`**:
   - On macOS, XGBoost needs the OpenMP runtime `libomp`. Install it with Homebrew:
     ```bash
     brew install libomp
     ```

3. **Install Required Packages**:
   - Once Python 3.12.0 is installed, install the packages listed in `requirements.txt` by running the following command in your terminal:
     ```bash
     pip install -r requirements.txt
     ```
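Because this guide pins the interpreter version, a quick check before running the notebook can catch a mismatched environment early. The sketch below is only an illustration: the helper name `is_expected_python` is hypothetical, and comparing only the major and minor version (rather than 3.12.0 exactly) is an assumption.

```python
import sys

REQUIRED = (3, 12)  # major, minor expected by this project (assumption: any 3.12.x is fine)


def is_expected_python(version_info=sys.version_info, required=REQUIRED):
    """Return True when the interpreter's major.minor matches `required`."""
    return tuple(version_info[:2]) == tuple(required)


found = ".".join(str(part) for part in sys.version_info[:3])
if is_expected_python():
    print(f"Python {found} matches the expected 3.12.x series.")
else:
    print(f"Warning: found Python {found}; this project expects 3.12.x.")
```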

## Introduction
In Part 2 of the TDS Project, we improve the machine learning pipeline from Part 1 by addressing the weaknesses identified during error analysis. We optimize the model through hyperparameter tuning, feature engineering, and outlier handling, then analyze the improved model, compare it with the baseline, and draw conclusions from the results.



## Loading the Baseline Model and Data
We'll begin by loading the cleaned dataset and the baseline model saved in Part 1. This will allow us to build upon the existing pipeline and apply further optimizations.

In [5]:
# Importing necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import joblib
import os

# Setting plot styles
sns.set_theme(style="whitegrid")
%matplotlib inline

# Paths to the cleaned data and the baseline model
cleaned_data_path = os.path.join('data', 'StudentPerformaceFactorsClean.csv')
model_path = os.path.join('models', 'pipeline_model.joblib')

# Loading the cleaned dataset
data_cleaned = pd.read_csv(cleaned_data_path)
print("Cleaned dataset loaded successfully.")

# Loading the baseline model
pipeline = joblib.load(model_path)
print("Baseline model loaded successfully.")


Cleaned dataset loaded successfully.
Baseline model loaded successfully.


1. **Error Analysis Conclusions**
   
   - **Residual Distribution**
     - **Observation**: Slight heteroscedasticity observed in residuals, with increased variance for higher predicted scores.
     - **Action**: Investigate transformations of the target variable or features to stabilize variance.
   
   - **Subgroup Performance**
     - **Observation**: Lower performance for students with Low and High parental involvement levels.
     - **Action**: Explore interactions between parental involvement and other features or incorporate additional relevant features.
   
   - **Feature Importance**
     - **Observation**: Attendance, Hours_Studied, and Previous_Scores are significant predictors, but other features may also play crucial roles.
     - **Action**: Conduct feature engineering to create new features or transform existing ones to capture more information.
   
   - **Bias in Predictions**
     - **Observation**: Slight tendency towards underestimation in predictions.
     - **Action**: Adjust the model to reduce bias, possibly by addressing data imbalance or refining the loss function.
   
   - **Outliers**
     - **Observation**: Presence of outliers affecting model performance.
     - **Action**: Implement robust scaling or outlier detection methods to mitigate their impact.
   
2. **Work Plan**
   
   - **Hyperparameter Tuning**: Optimize XGBoost parameters to enhance model performance.
   - **Feature Engineering**: Create new features and transform existing ones based on domain knowledge and EDA insights.
   - **Handling Outliers**: Apply techniques to detect and handle outliers effectively.
   - **Data Balancing**: If applicable, ensure that the model is not biased towards certain subgroups by balancing the data.
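
As one possible starting point for the outlier step in the work plan, the sketch below flags values that fall outside 1.5 × IQR of a numeric column. The IQR rule is an assumption (the plan does not name a detection method), the helper `flag_iqr_outliers` is hypothetical, and the toy values are made up for illustration rather than taken from the project dataset.

```python
import pandas as pd


def flag_iqr_outliers(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Return a boolean mask that is True where a value lies outside
    the interval [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return (series < lower) | (series > upper)


# Toy data for illustration only
toy = pd.DataFrame({"Hours_Studied": [18, 20, 21, 19, 22, 20, 44]})
toy["is_outlier"] = flag_iqr_outliers(toy["Hours_Studied"])
print(toy)
```

Flagging rather than dropping keeps the decision reversible: the mask can be used to compare model performance with and without the flagged rows before committing to a treatment.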


## 2. Improving Model Performance

### a. Hyperparameter Tuning

Hyperparameter tuning searches for the model configuration that gives the best cross-validated score. We use `GridSearchCV` to run an exhaustive search over the parameter grid defined below for the XGBoost regressor, scoring each candidate by R² under 5-fold cross-validation.
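
To get a feel for the cost of an exhaustive search, the sketch below counts the candidates in a grid with five parameters of three values each (the same shape as the grid used in this section) and multiplies by the number of cross-validation folds. It uses only the standard library, and the variable names `n_candidates` and `n_fits` are illustrative.

```python
from math import prod

# Same shape as the grid in this section: five parameters, three values each
param_grid = {
    "model__n_estimators": [100, 200, 300],
    "model__max_depth": [3, 5, 7],
    "model__learning_rate": [0.01, 0.1, 0.2],
    "model__subsample": [0.7, 0.8, 0.9],
    "model__colsample_bytree": [0.7, 0.8, 0.9],
}
cv_folds = 5

# Every combination of values is one candidate; each candidate is fit once per fold
n_candidates = prod(len(values) for values in param_grid.values())
n_fits = n_candidates * cv_folds  # excludes the final refit on the full training set

print(f"{n_candidates} candidates x {cv_folds} folds = {n_fits} fits")
# prints: 243 candidates x 5 folds = 1215 fits
```

With `n_jobs=-1` these fits run in parallel across the available cores, but the total still grows multiplicatively with every value added to the grid.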

Step-by-step implementation:

1. Define features and target
2. Identify categorical and numerical features
3. Handle categorical data
4. Perform train-test split
5. Set up and run GridSearchCV
6. Save the improved model

In [None]:
from sklearn.model_selection import train_test_split, GridSearchCV

# Define the parameter grid for XGBoost
param_grid = {
    'model__n_estimators': [100, 200, 300],
    'model__max_depth': [3, 5, 7],
    'model__learning_rate': [0.01, 0.1, 0.2],
    'model__subsample': [0.7, 0.8, 0.9],
    'model__colsample_bytree': [0.7, 0.8, 0.9]
}

# Initialize GridSearchCV
grid_search = GridSearchCV(
    estimator=pipeline,
    param_grid=param_grid,
    cv=5,
    scoring='r2',
    verbose=1,
    n_jobs=-1
)

# Define features and target
X = data_cleaned.drop('Exam_Score', axis=1)
y = data_cleaned['Exam_Score']

# Identify categorical and numerical features.
# These lists are informational here: the loaded pipeline is assumed to
# carry its own preprocessing steps from Part 1.
categorical_features = X.select_dtypes(include=['object', 'category', 'bool']).columns.tolist()
numeric_features = X.select_dtypes(include=[np.number]).columns.tolist()

# Convert boolean columns to integers, if any
for col in categorical_features:
    if X[col].dtype == 'bool':
        X[col] = X[col].astype(int)

# Perform train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit GridSearchCV
print("Starting hyperparameter tuning...")
grid_search.fit(X_train, y_train)
print("Hyperparameter tuning completed.")

# Best parameters
print("Best Parameters:")
print(grid_search.best_params_)

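# Best mean cross-validated R² (averaged over the 5 folds of the training set)
print(f"Best mean cross-validated R²: {grid_search.best_score_:.4f}")

# Retrieve the pipeline that GridSearchCV refit on the full training set
# with the best parameters (refit=True is the default)
best_pipeline = grid_search.best_estimator_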

# Save the improved model
improved_model_path = os.path.join('models', 'improved_pipeline_model.joblib')
joblib.dump(best_pipeline, improved_model_path)
print(f"Improved model saved at '{improved_model_path}'.")
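
The score reported by `GridSearchCV` is computed on the training split only. A comparison with the baseline would normally use the held-out test set, and the sketch below shows one way to summarize such a comparison. The helper `regression_report` is hypothetical and the arrays are made-up numbers, not project results; in the notebook one would pass `y_test` and the predictions of each pipeline instead.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score


def regression_report(y_true, y_pred):
    """Return R², MAE, and RMSE for one set of predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return {
        "r2": float(r2_score(y_true, y_pred)),
        "mae": float(mean_absolute_error(y_true, y_pred)),
        "rmse": float(np.sqrt(mean_squared_error(y_true, y_pred))),
    }


# Made-up values for illustration only
y_true = [60, 65, 70, 75, 80]
baseline_pred = [62, 63, 73, 72, 78]
improved_pred = [61, 64, 71, 74, 79]

for name, pred in [("baseline", baseline_pred), ("improved", improved_pred)]:
    metrics = regression_report(y_true, pred)
    print(name, {key: round(value, 3) for key, value in metrics.items()})
```

Reporting MAE and RMSE alongside R² makes the comparison easier to read in exam-score units, and a gap between the two hints at a few large errors dominating.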