In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

In [None]:
# load data
df = pd.read_csv("/kaggle/input/student-mat-csv/student-mat.csv", sep=";")
df.head()

In [None]:
# Verify the dataset dimensions (rows, columns)
df.shape

In [None]:
# Target variable
y = df["G3"]

# Features (drop target)
X = df.drop("G3", axis=1)

X.head()

In [None]:
# categorical columns 
categorical_cols = X.select_dtypes(include=["object"]).columns
categorical_cols

In [None]:
X_encoded = pd.get_dummies(X, columns=categorical_cols, drop_first=True)
X_encoded.shape

In [None]:
X_encoded = X_encoded.drop(columns=["G1", "G2"], errors="ignore")

In [None]:
X_encoded.isnull().sum()


## Data Cleaning and Preprocessing

The dataset was prepared for modeling by separating the target variable (G3) from
the input features. To prevent data leakage and unrealistic performance estimates,
the prior period grades (G1 and G2) were excluded from the feature set.

Categorical variables were converted into numerical form using one-hot encoding,
ensuring compatibility with regression-based machine learning models.
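
To make the encoding step concrete, the sketch below applies `pd.get_dummies` with `drop_first=True` to a small made-up frame. The `toy` frame and its values are illustrative assumptions, not rows taken from the dataset.

```python
import pandas as pd

# Hypothetical toy frame: one categorical column and one numeric column
toy = pd.DataFrame({
    "school": ["GP", "MS", "GP"],
    "age": [15, 17, 16],
})

# drop_first=True drops the first category level ("GP"),
# so a two-level column becomes a single indicator column
toy_encoded = pd.get_dummies(toy, columns=["school"], drop_first=True)
print(toy_encoded)
```

Dropping one level per categorical variable avoids perfectly collinear indicator columns, which matters mainly for the linear model.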

In [None]:
# Train-Test split
X_train, X_test, y_train, y_test = train_test_split(X_encoded, y, test_size=0.2, random_state=42)
X_train.shape, X_test.shape

In [None]:
# Linear Regression 
lr = LinearRegression()
lr.fit(X_train, y_train)

In [None]:
# Predictions 
y_pred = lr.predict(X_test)
y_pred[:10]

## Linear Regression Model 

A baseline linear regression model was trained using the preprocessed dataset.
The data was split into training and testing sets to evaluate generalization performance.

Model performance was evaluated using Root Mean Squared Error (RMSE) and the
coefficient of determination (R²), which provide insight into prediction accuracy
and variance explained by the model.
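
As a reference for how the two metrics are defined, the sketch below computes RMSE and R² by hand with NumPy. The arrays `y_true` and `y_hat` are made-up values chosen for easy arithmetic and are not results from this notebook.

```python
import numpy as np

# Made-up grades and predictions (illustrative only)
y_true = np.array([10.0, 12.0, 8.0, 14.0])
y_hat = np.array([11.0, 11.0, 9.0, 13.0])

residuals = y_true - y_hat

# RMSE: square root of the mean squared residual
rmse_manual = np.sqrt(np.mean(residuals ** 2))

# R^2: 1 minus the ratio of residual to total sum of squares
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(rmse_manual, r2_manual)
```

These are the same quantities that `mean_squared_error` (followed by a square root) and `r2_score` return.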

In [None]:
# Model performance
mse = mean_squared_error(y_test, y_pred)
# RMSE: typical prediction error in grade points
rmse = np.sqrt(mse)
# R^2: proportion of variance explained
r2 = r2_score(y_test, y_pred)
rmse, r2


### Model Performance Interpretation

The linear regression model achieved an RMSE of approximately 4.2, indicating that
predictions typically deviate from actual final grades by about four points on the
0 to 20 grading scale.

The R² value of approximately 0.14 suggests that a limited proportion of the variance
in student performance is explained by the selected features. This outcome is expected
given the exclusion of prior academic scores to prevent data leakage.
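
A low R² is easier to judge against a trivial reference model. The sketch below, which is not part of the original analysis, scores a mean-only predictor using scikit-learn's `DummyRegressor`. It runs on synthetic data; `X_demo` and `y_demo` are assumptions standing in for the real features and grades.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_squared_error, r2_score

# Synthetic stand-in data (assumption): features carry no signal about the target
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = rng.normal(loc=10, scale=4, size=100)

X_tr, X_te = X_demo[:80], X_demo[80:]
y_tr, y_te = y_demo[:80], y_demo[80:]

# Always predicts the mean of the training target
baseline = DummyRegressor(strategy="mean").fit(X_tr, y_tr)
y_base = baseline.predict(X_te)

rmse_base = np.sqrt(mean_squared_error(y_te, y_base))
r2_base = r2_score(y_te, y_base)
print(rmse_base, r2_base)
```

A constant prediction can never score above zero R² on the test set, so any model with a positive R² is doing at least somewhat better than guessing the mean.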

In [None]:
# Decision Tree Regressor
from sklearn.tree import DecisionTreeRegressor
dt = DecisionTreeRegressor(random_state=42, max_depth=5)
# Fit on the training set
dt.fit(X_train, y_train)

y_pred_dt = dt.predict(X_test)

# Evaluate on the tree's own predictions (y_pred_dt), not the linear model's y_pred
rmse_dt = np.sqrt(mean_squared_error(y_test, y_pred_dt))
r2_dt = r2_score(y_test, y_pred_dt)

rmse_dt, r2_dt

In [None]:
#Random Forest Regressor
from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor(
    n_estimators=200,
    random_state=42
)

rf.fit(X_train, y_train)

y_pred_rf = rf.predict(X_test)

rmse_rf = np.sqrt(mean_squared_error(y_test, y_pred_rf))
r2_rf = r2_score(y_test, y_pred_rf)

rmse_rf, r2_rf

## Model Comparison

To improve predictive performance, non-linear models were evaluated alongside
the baseline linear regression model.

Decision Tree and Random Forest regressors were trained to capture potential
non-linear relationships in the data. Model performance was compared using
RMSE and R² metrics to assess prediction accuracy and generalization.
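
With a small dataset, a single train-test split can give noisy comparisons. One possible complement, not used in this notebook, is k-fold cross-validation. The sketch below shows the idea on synthetic data; `X_demo`, `y_demo`, and the reduced `n_estimators=50` are assumptions made to keep the example quick.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in data (assumption): target depends linearly on the first feature
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(120, 4))
y_demo = 2.0 * X_demo[:, 0] + rng.normal(scale=0.5, size=120)

cv = KFold(n_splits=5, shuffle=True, random_state=42)
models = {
    "Linear Regression": LinearRegression(),
    "Random Forest": RandomForestRegressor(n_estimators=50, random_state=42),
}

cv_rmse = {}
for name, model in models.items():
    # scikit-learn returns negated RMSE so that higher is better; flip the sign back
    scores = cross_val_score(
        model, X_demo, y_demo, cv=cv, scoring="neg_root_mean_squared_error"
    )
    cv_rmse[name] = -scores.mean()

print(cv_rmse)
```

Averaging over folds reduces the dependence of the comparison on one particular split.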

In [None]:
# Compare results
results = pd.DataFrame({
    "Model": ["Linear Regression", "Decision Tree", "Random Forest"],
    "RMSE": [rmse, rmse_dt, rmse_rf],
    "R2": [r2, r2_dt, r2_rf],
})
results

###  Feature Importance Analysis

Feature importance was evaluated using a Random Forest model trained on the
preprocessed dataset with prior academic scores excluded to prevent data leakage.

The most influential factors included attendance (absences), history of academic
difficulties (failures), and behavioral or lifestyle-related variables such as
study time, free time, and social activity. These results align with intuitive
educational expectations, which lends some plausibility to the modeling approach.
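
The importances in this notebook come from the forest's impurity-based `feature_importances_`. A possible cross-check, which is a different technique and not part of the original analysis, is permutation importance measured on held-out data. The sketch below uses synthetic data; `X_demo`, `y_demo`, and `rf_demo` are assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

# Synthetic stand-in data (assumption): only feature 0 drives the target
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = 3.0 * X_demo[:, 0] + rng.normal(scale=0.1, size=200)

X_tr, X_te = X_demo[:150], X_demo[150:]
y_tr, y_te = y_demo[:150], y_demo[150:]

rf_demo = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_tr, y_tr)

# Shuffle each feature on the held-out set and measure the drop in score
perm = permutation_importance(rf_demo, X_te, y_te, n_repeats=5, random_state=0)

# Feature indices ordered from most to least important
ranking = np.argsort(perm.importances_mean)[::-1]
print(ranking, perm.importances_mean)
```

If both methods rank the same features near the top, the ranking is less likely to be an artifact of how impurity-based importances are computed.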

In [None]:
# Feature importance (random forest)
importances = rf.feature_importances_

feature_importance = pd.DataFrame({
    "Feature": X_encoded.columns,
    "Importance": importances
}).sort_values(by="Importance", ascending=False)

feature_importance.head(10)

In [None]:
# Visualize the top 10 feature importances

plt.figure(figsize=(10, 6))

sns.barplot(
    data=feature_importance.head(10),
    x="Importance",
    y="Feature"
)

plt.title("Top 10 Feature Importances (Random Forest)")
plt.xlabel("Relative Importance")
plt.ylabel("Feature")

plt.tight_layout()
plt.show()


## Limitations

This project has several limitations. The dataset is relatively small and
represents students from a limited geographic and educational context, which may
restrict generalizability.

Additionally, the analysis is based on observational data. While the models identify
associations between features and academic performance, they do not establish causal
relationships. Important external factors not captured in the dataset may also
influence outcomes.

## Ethical Considerations

Predictive models applied to educational data must be interpreted with caution.
There is a risk that such models could reinforce biases or be misused to label
students unfairly.

The results of this project are intended for analytical and exploratory purposes
only, emphasizing the importance of responsible and transparent use of machine
learning in educational contexts.

## Conclusion

This project demonstrated a complete machine learning workflow, including data
preprocessing, model training, evaluation, and interpretation.

By comparing linear and non-linear models and analyzing feature importance, the study
highlighted key factors associated with student performance while maintaining
methodological rigor and avoiding data leakage.