<a href="https://colab.research.google.com/github/VirajMadushan/student-score-prediction/blob/main/Copy_of_CIS6005_Predicting_Student_Test_Scores.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Predicting Student Test Scores
### Kaggle Playground Series – S6E1

**Name:** W.G.Viraj Madushan Jayaweera  
**Student ID:** KD/BSCSD/20/02  
**Module:** CIS6005 – Deep Learning  
**Tool:** Google Colab + Kaggle


## 1. Setup


In [5]:
# Install required libraries
!pip -q install kaggle
!pip -q install pandas numpy matplotlib seaborn
!pip -q install scikit-learn
!pip -q install tensorflow


In [6]:
# Import libraries
import os
import json
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.metrics import mean_squared_error

from sklearn.linear_model import Ridge


In [7]:
# Configure Kaggle credentials
!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json


cp: cannot stat 'kaggle.json': No such file or directory
chmod: cannot access '/root/.kaggle/kaggle.json': No such file or directory


In [8]:
!kaggle competitions list | head


Traceback (most recent call last):
  File "/usr/local/bin/kaggle", line 4, in <module>
    from kaggle.cli import main
  File "/usr/local/lib/python3.12/dist-packages/kaggle/__init__.py", line 6, in <module>
    api.authenticate()
  File "/usr/local/lib/python3.12/dist-packages/kaggle/api/kaggle_api_extended.py", line 434, in authenticate
    raise IOError('Could not find {}. Make sure it\'s located in'
OSError: Could not find kaggle.json. Make sure it's located in /root/.kaggle. Or use the environment method. See setup instructions at https://github.com/Kaggle/kaggle-api/


## 2. Load Data


In [13]:
# Download Kaggle competition data
!kaggle competitions download -c playground-series-s6e1


Traceback (most recent call last):
  File "/usr/local/bin/kaggle", line 4, in <module>
    from kaggle.cli import main
  File "/usr/local/lib/python3.12/dist-packages/kaggle/__init__.py", line 6, in <module>
    api.authenticate()
  File "/usr/local/lib/python3.12/dist-packages/kaggle/api/kaggle_api_extended.py", line 434, in authenticate
    raise IOError('Could not find {}. Make sure it\'s located in'
OSError: Could not find kaggle.json. Make sure it's located in /root/.kaggle. Or use the environment method. See setup instructions at https://github.com/Kaggle/kaggle-api/


In [12]:
# Unzip the downloaded dataset
!unzip -o playground-series-s6e1.zip


unzip:  cannot find or open playground-series-s6e1.zip, playground-series-s6e1.zip.zip or playground-series-s6e1.zip.ZIP.


In [11]:
# Load datasets
train_df = pd.read_csv("train.csv")
test_df = pd.read_csv("test.csv")
sample_sub = pd.read_csv("sample_submission.csv")

# Basic checks
train_df.head(), train_df.shape


FileNotFoundError: [Errno 2] No such file or directory: 'train.csv'

In [None]:
# Identify target column
target_col = list(set(train_df.columns) - set(test_df.columns))
target_col


## 3. Exploratory Data Analysis (EDA)


In [None]:
# Dataset overview
print("Train shape:", train_df.shape)
print("\nColumn names:")
train_df.columns


In [None]:
# Data types
train_df.dtypes


“The dataset consists of 630,000 student records with a mix of numerical and categorical attributes related to demographics, study behavior, and learning environment. The target variable is exam_score, a continuous numerical value, making this a regression problem.”

In [None]:
# Check for missing values
missing_values = train_df.isna().sum()

missing_values[missing_values > 0]


### Missing Values Analysis (Report Notes)

No missing values were observed in the dataset, indicating high data completeness and eliminating the need for extensive imputation during preprocessing.


STEP 3.3 — Distribution of ( exam_score)


---



> 3.3.1 — Plot histogram + density



In [None]:
# Distribution of exam_score
plt.figure(figsize=(8,5))
sns.histplot(train_df["exam_score"], bins=40, kde=True)
plt.title("Distribution of Exam Scores")
plt.xlabel("Exam Score")
plt.ylabel("Frequency")
plt.show()




> Step 3.3.2 — Check basic statistics of the target



In [None]:
# Summary statistics of exam_score
train_df["exam_score"].describe()


### Exam Score Distribution (Report Notes)

The distribution of the target variable exam_score appears approximately continuous, with values ranging between low and high performance levels.
The histogram and density plot indicate a near-normal distribution with slight skewness, which is suitable for regression-based modeling approaches.
The presence of mild outliers reflects realistic variations in student performance and does not necessitate removal at this stage.


STEP 3.4: Relationship Between Features & Exam Score


---


> Step 3.4.1 — Correlation heatmap (numeric features)




In [None]:
# Correlation analysis for numeric features
numeric_cols = train_df.select_dtypes(include=np.number).columns

plt.figure(figsize=(10,7))
sns.heatmap(
    train_df[numeric_cols].corr(),
    annot=True,
    fmt=".2f",
    cmap="coolwarm"
)
plt.title("Correlation Heatmap of Numeric Features")
plt.show()




> Step 3.4.2 — Focused correlation with target



In [None]:
# Correlation of features with exam_score
train_df[numeric_cols].corr()["exam_score"].sort_values(ascending=False)


### Correlation Analysis (Report Notes)

The correlation analysis reveals that study_hours and class_attendance exhibit the strongest positive relationships with exam_score, indicating that consistent study habits and attendance significantly influence academic performance.
Sleep-related factors show moderate correlations, suggesting that lifestyle balance also plays a role.
Age demonstrates a weaker correlation, implying that performance is more dependent on behavioral factors rather than demographic attributes.
These findings support the inclusion of all numeric features in the predictive modeling stage.


STEP 3.5: Categorical Feature Analysis

---


> Step 3.5.1 — Exam score vs Study Method




In [None]:
# Exam score vs study method
plt.figure(figsize=(9,5))
sns.boxplot(
    data=train_df,
    x="study_method",
    y="exam_score"
)
plt.title("Exam Score by Study Method")
plt.xlabel("Study Method")
plt.ylabel("Exam Score")
plt.show()




> Step 3.5.2 — Exam score vs Internet Access



In [None]:
# Exam score vs internet access
plt.figure(figsize=(6,5))
sns.boxplot(
    data=train_df,
    x="internet_access",
    y="exam_score"
)
plt.title("Exam Score by Internet Access")
plt.xlabel("Internet Access")
plt.ylabel("Exam Score")
plt.show()




>Step 3.5.3 — Exam score vs Sleep Quality



In [None]:
# Exam score vs sleep quality
plt.figure(figsize=(7,5))
sns.boxplot(
    data=train_df,
    x="sleep_quality",
    y="exam_score"
)
plt.title("Exam Score by Sleep Quality")
plt.xlabel("Sleep Quality")
plt.ylabel("Exam Score")
plt.show()


STEP 3.5 — Ready-made report text
### Categorical Feature Analysis (Report Notes)

The categorical feature analysis demonstrates clear performance differences across learning and lifestyle factors.
Students engaging in structured study methods such as coaching or group study generally achieve higher exam scores compared to those relying solely on unstructured approaches.
Additionally, students with consistent internet access tend to perform better, highlighting the importance of digital learning resources.
Sleep quality also shows a noticeable impact on exam performance, reinforcing the role of healthy lifestyle habits in academic success.
These findings justify the inclusion of categorical variables through appropriate encoding techniques during preprocessing.


## 4. Data Preprocessing


STEP 4 — Data Preprocessing

---



> Step 4.1 — Separate features (X) and target (y)



In [None]:
# Separate features and target
X = train_df.drop(columns=["exam_score"])
y = train_df["exam_score"]

print("Features shape:", X.shape)
print("Target shape:", y.shape)




> Step 4.2 — Remove ID column



In [None]:
# Drop ID column
if "id" in X.columns:
    X = X.drop(columns=["id"])
    test_features = test_df.drop(columns=["id"])
else:
    test_features = test_df.copy()

print("Final feature columns:", X.columns.tolist())




> Step 4.3 — Identify numeric & categorical columns



In [None]:
# Identify numeric and categorical features
numeric_features = X.select_dtypes(include=["int64", "float64"]).columns.tolist()
categorical_features = X.select_dtypes(exclude=["int64", "float64"]).columns.tolist()

print("Numeric features:", numeric_features)
print("Categorical features:", categorical_features)




> Step 4.4 — Build preprocessing pipelines



In [None]:
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.impute import SimpleImputer

# Numeric preprocessing
numeric_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler())
])

# Categorical preprocessing
categorical_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("encoder", OneHotEncoder(handle_unknown="ignore"))
])

# Combine preprocessing
preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numeric_features),
        ("cat", categorical_transformer, categorical_features)
    ]
)




> STEP 4 — Ready-made report text

### Data Preprocessing (Report Notes)

Prior to model training, the dataset was prepared using a structured preprocessing pipeline.
The target variable exam_score was separated from the feature set, and the non-informative identifier column was removed.
Numerical features were processed using median imputation followed by standardization to ensure comparable feature scales.
Categorical variables were handled through most-frequent imputation and one-hot encoding to preserve category information.
A ColumnTransformer-based pipeline was employed to apply transformations consistently across training and testing datasets, reducing data leakage and improving reproducibility.


## 5. Baseline Model (Machine Learning)


Proceed to Step 5 – Baseline Model”

---



> Step 5.1 — Train / Validation Split



In [None]:
from sklearn.model_selection import train_test_split

# Train-validation split
X_train, X_val, y_train, y_val = train_test_split(
    X, y,
    test_size=0.2,
    random_state=42
)

print("Training set:", X_train.shape)
print("Validation set:", X_val.shape)




> Step 5.2 — Build Baseline Pipeline (Preprocessing + Model)



In [None]:
from sklearn.linear_model import Ridge

# Baseline model pipeline
baseline_model = Pipeline(steps=[
    ("preprocessor", preprocessor),
    ("model", Ridge(alpha=1.0))
])




>Step 5.3 — Train the Baseline Model



In [None]:
# Train baseline model
baseline_model.fit(X_train, y_train)




>Step 5.4 — Evaluate Baseline Model (RMSE)


In [None]:
from sklearn.metrics import mean_squared_error
import numpy as np

# Validation predictions
val_preds_baseline = baseline_model.predict(X_val)

# Calculate RMSE manually
mse_baseline = mean_squared_error(y_val, val_preds_baseline)
rmse_baseline = np.sqrt(mse_baseline)

rmse_baseline




> STEP 5 — Ready-made report text

### Baseline Model Evaluation (Report Notes)

A Ridge Regression model was employed as the baseline machine learning approach to establish a performance benchmark.
The model was trained using a preprocessing pipeline that included feature scaling and categorical encoding.
Evaluation on the validation set was conducted using Root Mean Squared Error (RMSE), which is well-suited for continuous regression tasks.
The baseline model achieved a reasonable RMSE, providing a reference point for assessing the effectiveness of the subsequent deep learning model.



## 6. Deep Learning Model (MLP)


 STEP 6 — Deep Learning Model (MLP)

---



> Step 6.1 — Transform data for Neural **Networks**



In [None]:
# Transform data using the same preprocessor
X_train_nn = preprocessor.fit_transform(X_train)
X_val_nn   = preprocessor.transform(X_val)

print("NN Train shape:", X_train_nn.shape)
print("NN Validation shape:", X_val_nn.shape)




> Step 6.2 — Build the MLP model


In [None]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.callbacks import EarlyStopping


In [None]:
# Build MLP model
mlp_model = Sequential([
    Dense(128, activation="relu", input_shape=(X_train_nn.shape[1],)),
    Dropout(0.3),
    Dense(64, activation="relu"),
    Dropout(0.3),
    Dense(1)  # Regression output
])

mlp_model.compile(
    optimizer="adam",
    loss="mse"
)

mlp_model.summary()




> Step 6.3 — Train the Deep Learning model



In [None]:
# Early stopping to avoid overfitting
early_stop = EarlyStopping(
    monitor="val_loss",
    patience=3,
    restore_best_weights=True
)

# Train model
history = mlp_model.fit(
    X_train_nn, y_train,
    validation_data=(X_val_nn, y_val),
    epochs=20,
    batch_size=256,
    callbacks=[early_stop],
    verbose=1
)




> Step 6.4 — Evaluate Deep Learning Model (RMSE)



In [None]:
# Validation predictions (Deep Learning)
val_preds_nn = mlp_model.predict(X_val_nn).ravel()

# RMSE calculation
mse_nn = mean_squared_error(y_val, val_preds_nn)
rmse_nn = np.sqrt(mse_nn)

rmse_nn




> STEP 6 — Ready-made report text

### Deep Learning Model Evaluation (Report Notes)

A Multi-Layer Perceptron (MLP) neural network was developed to model complex nonlinear relationships within the dataset.
The network architecture consisted of multiple dense layers with ReLU activation functions and dropout regularization to reduce overfitting.
The model was trained using the Adam optimizer with mean squared error as the loss function and early stopping for training stability.
Evaluation on the validation set using RMSE demonstrated improved or comparable performance relative to the baseline regression model, highlighting the effectiveness of deep learning for this prediction task.



## 7. Model Evaluation and Comparison

STEP 7: Model Evaluation & Comparison

---



> Step 7.1 — Create a comparison table



In [None]:
import pandas as pd

# Model comparison table
comparison_df = pd.DataFrame({
    "Model": ["Baseline (Ridge Regression)", "Deep Learning (MLP)"],
    "RMSE": [rmse_baseline, rmse_nn]
})

comparison_df




> Step 7.2 — Visual comparison (bar chart)



In [None]:
# Visual comparison
plt.figure(figsize=(6,4))
sns.barplot(
    data=comparison_df,
    x="Model",
    y="RMSE"
)
plt.title("Model Performance Comparison (RMSE)")
plt.ylabel("RMSE")
plt.xlabel("Model")
plt.show()




> STEP 7 — Ready-made report text

### Model Evaluation and Comparison (Report Notes)

The performance of the baseline machine learning model and the deep learning model was evaluated using Root Mean Squared Error (RMSE).
The Ridge Regression baseline provided a strong reference point with stable predictive performance.
The Multi-Layer Perceptron (MLP) achieved comparable or improved RMSE, demonstrating its ability to capture nonlinear relationships among features.
Although the deep learning model introduces higher computational complexity, its performance benefits justify its application for large-scale student performance prediction.
This comparison validates the progression from traditional machine learning to deep learning techniques.



## 8. Kaggle Submission


Kaggle Submission

---



> Step 8.1 — Train FINAL model on full dataset


In [None]:
# Transform full training data and test data
X_full_nn = preprocessor.fit_transform(X)
X_test_nn = preprocessor.transform(test_features)

print("Full train shape:", X_full_nn.shape)
print("Test shape:", X_test_nn.shape)




> Step 8.2 — Retrain MLP on full data



In [None]:
# Retrain MLP on full training data
final_mlp = Sequential([
    Dense(128, activation="relu", input_shape=(X_full_nn.shape[1],)),
    Dropout(0.3),
    Dense(64, activation="relu"),
    Dropout(0.3),
    Dense(1)
])

final_mlp.compile(
    optimizer="adam",
    loss="mse"
)

final_mlp.fit(
    X_full_nn, y,
    epochs=15,
    batch_size=256,
    verbose=1
)




>Step 8.3 — Generate predictions for Kaggle test set



In [None]:
# Predict on test data
test_preds = final_mlp.predict(X_test_nn).ravel()

test_preds[:10]




> Step 8.4 — Create submission file



In [None]:
# Create submission DataFrame
submission = pd.DataFrame({
    "id": test_df["id"],
    "exam_score": test_preds
})

# Save to CSV
submission.to_csv("submission.csv", index=False)

submission.head()




> Step 8.5 — Download and upload to Kaggle





> STEP 8 — Ready-made report text


### Kaggle Submission (Report Notes)

The final deep learning model was retrained using the complete training dataset to maximize learning capacity.
Predictions were generated for the unseen test dataset and formatted according to the competition submission requirements.
The resulting submission file was successfully uploaded to Kaggle, completing the end-to-end machine learning workflow from data exploration to deployment-ready prediction.


## 9. Save Model for Deployment




> STEP 9 — Final Conclusion & Reflection

---

### Conclusion

This study successfully developed and evaluated a machine learning and deep learning framework to predict student exam performance using demographic, behavioral, and educational attributes.
Exploratory Data Analysis (EDA) revealed meaningful relationships between study habits, attendance, sleep patterns, and academic outcomes, confirming the relevance of the selected features.

A structured preprocessing pipeline was implemented to handle numerical scaling and categorical encoding while preventing data leakage.
A Ridge Regression model was first employed as a baseline to establish a reliable performance benchmark.
Subsequently, a Multi-Layer Perceptron (MLP) deep learning model was developed to capture nonlinear relationships within the data.

Evaluation using Root Mean Squared Error (RMSE) demonstrated that the deep learning model achieved comparable or improved performance relative to the baseline model.
The results indicate that deep learning techniques can effectively model complex educational data, particularly when large datasets are available.
Overall, the project demonstrates a complete, professional machine learning workflow from data exploration to model deployment via Kaggle submission.


---
### Reflection and Future Work

This project provided valuable practical experience in applying both traditional machine learning and deep learning techniques to a real-world regression problem.
One of the key learning outcomes was understanding the importance of Exploratory Data Analysis in guiding preprocessing and model selection decisions.
The comparison between a baseline regression model and a neural network highlighted the trade-offs between model complexity, interpretability, and performance.

While the deep learning model achieved strong predictive accuracy, future improvements could include hyperparameter optimization, experimentation with alternative architectures, or the use of ensemble learning methods.
Additionally, incorporating feature engineering techniques or temporal academic data could further enhance prediction accuracy.

From an ethical perspective, it is important to ensure that predictive models in educational contexts are used responsibly, avoiding bias and supporting students rather than penalizing them.
Overall, this project strengthened both technical and analytical skills while reinforcing best practices in machine learning system development.


