# Student Marks Prediction

## Description

This project aims to predict the marks obtained by students based on their study time and the number of courses they have opted for. The dataset for this project has been downloaded from the UCI Machine Learning Repository. 

The dataset consists of a small number of instances and attributes, making it both simple and challenging. Despite the limited number of features and samples, the goal is to build a regression model that captures the patterns in the data while ensuring generalizability.

## Properties of the Dataset:

- **Number of Instances**: 100
- **Number of Attributes**: 3 (including the target variable)

## Objective:

1. **Understand the Dataset & Cleanup**: Explore and clean the dataset as necessary.
2. **Build Regression Models**: Develop regression models to predict student marks based on study time and number of courses.
3. **Model Evaluation**: Evaluate the models using appropriate metrics like R-squared (R²), Root Mean Squared Error (RMSE), etc., and compare the results.

## Source

This dataset is available on Kaggle at the following link:
> https://www.kaggle.com/datasets/yasserh/student-marks-dataset/data

## Data Dictionary:

- **number_courses**: Numeric. Number of courses the student has opted for.
- **time_study**: Numeric. Average time the student studies per day (in hours).
- **Marks**: Numeric. Marks obtained by the student (target variable).

## Problem Statement

1. **Model Training**: Train a model on the dataset so that it can predict a student's marks.
2. **Model Evaluation**: Evaluate the performance of the trained model with the R2 score, and measure its loss (residual error) with Root Mean Squared Error, Mean Squared Error and Mean Absolute Error.
3. **Model Optimization**: Find an optimal model for this dataset using cross-validation and hyperparameter tuning, improving the performance of the trained model and reducing its loss so that it predicts students' marks accurately.
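
The four metrics named above can be sketched in plain Python to show what each one measures. This is a minimal illustration, not the code used later in the notebook (which relies on `sklearn.metrics`); the helper name `regression_metrics` and the sample values are hypothetical and do not come from the dataset.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return R2, MAE, MSE and RMSE for two equal-length sequences."""
    n = len(y_true)
    residuals = [t - p for t, p in zip(y_true, y_pred)]

    # Average absolute and squared residuals
    mae = sum(abs(r) for r in residuals) / n
    mse = sum(r * r for r in residuals) / n
    rmse = math.sqrt(mse)

    # R2 = 1 - (residual sum of squares / total sum of squares)
    mean_true = sum(y_true) / n
    ss_res = sum(r * r for r in residuals)
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1 - ss_res / ss_tot

    return {"r2": r2, "mae": mae, "mse": mse, "rmse": rmse}


# Made-up values for illustration only, not taken from the dataset
y_true = [10.0, 20.0, 30.0, 40.0]
y_pred = [12.0, 18.0, 33.0, 39.0]
metrics = regression_metrics(y_true, y_pred)
print(metrics)
```

MAE, MSE and RMSE are error measures, so lower is better, and RMSE is expressed in the same unit as the marks. R2 is a goodness-of-fit score, where 1.0 means the predictions match the targets exactly.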

### Load Necessary Libraries

In [16]:
# General
import pandas as pd
import numpy as np
import os
import warnings
import pickle

# Preprocessing
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

# Model and Evaluation Metrics
from xgboost import XGBRegressor
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error

# Hyperparameter tuning
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold
from sklearn.model_selection import GridSearchCV

### Settings

In [3]:
# Warnings
warnings.filterwarnings("ignore")
# Path
data_path = "../data"
csv_path = os.path.join(data_path, "Student_Marks.csv")

### Load Data

In [4]:
# Load data
df = pd.read_csv(csv_path)

In [5]:
# Check Dataset
df.head()

|   | number_courses | time_study | Marks  |
|---|----------------|------------|--------|
| 0 | 3              | 4.508      | 19.202 |
| 1 | 4              | 0.096      | 7.734  |
| 2 | 4              | 3.133      | 13.811 |
| 3 | 6              | 7.909      | 53.018 |
| 4 | 8              | 7.811      | 55.299 |


### Preprocessing

- Preprocessing is needed to do the following:
    - Separate the input and output features present in the dataset.
    - Split the data into training and testing sets, so that the model is trained on the training data and its performance and loss are evaluated on the testing data.

In [6]:
# Separate the input and output features
X = df.drop("Marks", axis= 1)
y = df["Marks"]

In [7]:
# Split the data into training and testing sets (80% / 20%)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size= 0.2, random_state= 42)

### Train the Model and Evaluate Performance

- Train the model on the training dataset so that the trained model can predict a student's marks from the number of courses in which the student is enrolled and the time (in hours) spent studying.
- Evaluate the performance of the trained model with the R2 score.
- Evaluate the loss (residual error) of the trained model with Root Mean Squared Error, Mean Squared Error and Mean Absolute Error.

In [8]:
# Function to train the model and evaluate the model
def train_and_evaluate(model):
    # Train the model
    model.fit(X_train, y_train)

    # Make predictions for training and testing data
    y_train_pred = model.predict(X_train)
    y_test_pred = model.predict(X_test)

    # Print the evaluation metrics
    print("=" * 60)
    print("EVALUATION METRICS FOR TRAINING DATA")
    print("=" * 60)
    print(f"Score: {r2_score(y_train, y_train_pred): .2f}")
    print(f"Mean Absolute Error(MAE): {mean_absolute_error(y_train, y_train_pred): .2f}")
    print(f"Mean Squared Error(MSE): {mean_squared_error(y_train, y_train_pred): .2f}")
    print(f"Root Mean Squared Error(RMSE): {np.sqrt(mean_squared_error(y_train, y_train_pred)): .2f}")
    print("=" * 60)
    print("EVALUATION METRICS FOR TESTING DATA")
    print("=" * 60)
    print(f"Score: {r2_score(y_test, y_test_pred): .2f}")
    print(f"Mean Absolute Error(MAE): {mean_absolute_error(y_test, y_test_pred): .2f}")
    print(f"Mean Squared Error(MSE): {mean_squared_error(y_test, y_test_pred): .2f}")
    print(f"Root Mean Squared Error(RMSE): {np.sqrt(mean_squared_error(y_test, y_test_pred)): .2f}")

In [9]:
# Train and evaluate an XGBoost Regressor with default parameters
xgbr = XGBRegressor()
train_and_evaluate(xgbr)

EVALUATION METRICS FOR TRAINING DATA
Score:  1.00
Mean Absolute Error(MAE):  0.00
Mean Squared Error(MSE):  0.00
Root Mean Squared Error(RMSE):  0.00
EVALUATION METRICS FOR TESTING DATA
Score:  0.98
Mean Absolute Error(MAE):  1.54
Mean Squared Error(MSE):  5.55
Root Mean Squared Error(RMSE):  2.36


### Insights

- **Training Score of 1.00**:

A training score of 1.00, with MAE, MSE and RMSE all rounding to 0.00, means the model reproduces the training data almost exactly, which is often a sign that the model is fitted too closely to the training data.
This could indicate that the model is memorizing the data instead of learning the underlying patterns, a real risk here because the dataset has only 100 instances.

- **Testing Score of 0.98**:

A testing score of 0.98 indicates that the model performs very well on the held-out test split, which suggests that it generalizes well.
However, the testing score is slightly lower than the training score and the test RMSE (2.36) is clearly above the training RMSE (0.00), which suggests some overfitting, though not a severe amount.


### Hyperparameter Tuning

- **Cross-Validation**: Use techniques like **k-fold cross-validation** to check how the model performs on different subsets of the data. This will help you ensure the model generalizes well across various parts of the data.
- **Regularization**: Introduce regularization (e.g., L1, L2) to penalize overly complex models and reduce overfitting. For instance, in XGBoost, you can set alpha and lambda (named `reg_alpha` and `reg_lambda` in the scikit-learn wrapper) for L1 and L2 regularization.
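
To make the k-fold idea concrete, the sketch below shows how 100 samples can be partitioned into 5 shuffled folds, so that every sample is used for validation exactly once. It is a plain-Python illustration only; the helper `k_fold_indices` and its `seed` argument are hypothetical and are not the scikit-learn `KFold` implementation used in the notebook.

```python
import random


def k_fold_indices(n_samples, n_splits, seed=0):
    """Yield (train_idx, val_idx) pairs for shuffled k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)

    # Spread samples as evenly as possible: the first
    # (n_samples % n_splits) folds receive one extra sample
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        size = base + (1 if fold < extra else 0)
        val_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, val_idx
        start += size


# 100 instances and 5 folds, matching the dataset size and the fold count used here
folds = list(k_fold_indices(n_samples=100, n_splits=5))
for i, (train_idx, val_idx) in enumerate(folds):
    print(f"fold {i}: train={len(train_idx)} val={len(val_idx)}")
```

Each fold trains on 80 samples and validates on the remaining 20. Fixing the shuffle seed, as assumed in this sketch, makes the fold assignment repeatable between runs.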

In [12]:
# Define KFold
kf = KFold(n_splits= 5, shuffle= True)
# Define Regressor model
xgbr_cv = XGBRegressor()

# Find cross validation score
scores = cross_val_score(xgbr_cv, X, y, cv= kf, scoring= "r2")
# print the mean score
print(f"Mean Score: {np.mean(scores): 0.2f}")

Mean Score:  0.98


In [18]:
# Tune hyperparameter for regularization
#xgbr_ht = XGBRegressor(reg_lambda=1.0, reg_alpha= 0.1, gamma= 0.1)
#train_and_evaluate(xgbr_ht)
# Define parameter dictionary
param_grid = {
    'alpha': [0, 0.1, 1, 10],
    'lambda': [0, 1, 10],
    'gamma': [0, 0.1, 1]
}

# Define model
xgbr_ht = XGBRegressor()

# Define Grid Search
gscv = GridSearchCV(
    estimator= xgbr_ht,
    param_grid = param_grid,
    cv= 5,
    verbose= 1,
    scoring= "r2"
)
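# Run the grid search over every hyperparameter combination
# Note: the search is fitted on the full dataset (X, y), so the test split is also seen during tuning
gscv.fit(X, y)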
# Get best parameter set
best_params = gscv.best_params_
print(f"Best Score: {gscv.best_score_: 0.2f}")
print(best_params)

Fitting 5 folds for each of 36 candidates, totalling 180 fits
Best Score:  0.99
{'alpha': 0.1, 'gamma': 0, 'lambda': 10}
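
The "36 candidates, totalling 180 fits" line in the log follows directly from the grid: 4 values of `alpha`, 3 of `lambda` and 3 of `gamma`, each combination evaluated on 5 folds. The sketch below reproduces that count with `itertools.product`; it only enumerates the combinations and does not train any model.

```python
from itertools import product

# Same grid and fold count as in the grid search above
param_grid = {
    "alpha": [0, 0.1, 1, 10],
    "lambda": [0, 1, 10],
    "gamma": [0, 0.1, 1],
}
n_folds = 5

# Every candidate is one combination of one value per parameter
names = list(param_grid)
candidates = [dict(zip(names, values)) for values in product(*param_grid.values())]

total_fits = len(candidates) * n_folds
print(f"{len(candidates)} candidates x {n_folds} folds = {total_fits} fits")
```

Because the number of fits grows multiplicatively with each added parameter or value, grids are usually kept small for a quick search like this one.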


In [19]:
# Train the model with best parameter set
model = XGBRegressor(**best_params)
train_and_evaluate(model)

EVALUATION METRICS FOR TRAINING DATA
Score:  1.00
Mean Absolute Error(MAE):  0.10
Mean Squared Error(MSE):  0.02
Root Mean Squared Error(RMSE):  0.13
EVALUATION METRICS FOR TESTING DATA
Score:  0.99
Mean Absolute Error(MAE):  1.36
Mean Squared Error(MSE):  3.81
Root Mean Squared Error(RMSE):  1.95
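
To put the change in test error into perspective, the sketch below computes the relative reduction of each test metric between the default model and the tuned model. The numbers are copied from the two evaluation outputs in this notebook; the helper `relative_reduction` is a hypothetical name introduced for illustration.

```python
# Test-set metrics reported above for the default and the tuned model
baseline = {"MAE": 1.54, "MSE": 5.55, "RMSE": 2.36}
tuned = {"MAE": 1.36, "MSE": 3.81, "RMSE": 1.95}


def relative_reduction(before, after):
    """Percentage by which `after` is lower than `before`."""
    return (before - after) / before * 100


reductions = {name: relative_reduction(baseline[name], tuned[name]) for name in baseline}
for name, pct in reductions.items():
    print(f"{name}: {baseline[name]:.2f} -> {tuned[name]:.2f} ({pct:.1f}% lower)")
```

The test errors drop by roughly 12% (MAE), 31% (MSE) and 17% (RMSE). These figures come from a single 20-sample test split, so small differences should be read with caution.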


### Conclusion

After regularizing the model through hyperparameter tuning, the testing R2 score reaches **0.99** while the training score remains 1.00, which indicates that slight overfitting is still present in the model. This might be 