### 📥 Step 1: Import Libraries and Load Dataset

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
import joblib  # for exporting the model

# Load dataset
df = pd.read_csv("student-por.csv")


### 🧼 Step 2: Data Preparation

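We'll use the earlier period grades (`G1`, `G2`), `studytime`, `failures`, `absences`, parental education (`Medu`, `Fedu`), and three categorical features (`sex`, `schoolsup`, `internet`) to predict the final grade `G3`.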

In [2]:
# Select features and target
features = ['G1', 'G2', 'studytime', 'failures', 'absences', 'sex', 'schoolsup', 'internet', 'Medu', 'Fedu']
target = 'G3'

X = df[features]
y = df[target]


### 🧪 Step 3: Train-Test Split

In [3]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


### ⚙️ Step 4: Preprocessing Pipeline

We'll encode categorical columns and scale numerical ones.

In [4]:
# Separate categorical and numeric columns
categorical_cols = ['sex', 'schoolsup', 'internet']
numeric_cols = ['G1', 'G2', 'studytime', 'failures', 'absences', 'Medu', 'Fedu']

# Preprocessing for numeric and categorical features
preprocessor = ColumnTransformer(transformers=[
    ('num', StandardScaler(), numeric_cols),
    ('cat', OneHotEncoder(drop='first'), categorical_cols)
])


### 🤖 Step 5: Build and Train the Model

In [5]:
# Full pipeline: preprocessing + regression
pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('regressor', LinearRegression())
])

# Fit the model
pipeline.fit(X_train, y_train)


### 📊 Step 6: Evaluate the Model

In [6]:
# Predict and evaluate
y_pred = pipeline.predict(X_test)

rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r2 = r2_score(y_test, y_pred)

print(f"Root Mean Squared Error (RMSE): {rmse:.2f}")
print(f"R² Score: {r2:.2f}")


Root Mean Squared Error (RMSE): 1.16
R² Score: 0.86
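
For readers who want to see what the two metrics measure, the sketch below recomputes RMSE and R² by hand on four invented grades and checks them against scikit-learn. The arrays `y_true` and `y_hat` are made up for illustration and are unrelated to the notebook's results.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Invented grades, used only to illustrate the two formulas
y_true = np.array([10.0, 12.0, 14.0, 16.0])
y_hat = np.array([11.0, 12.0, 13.0, 17.0])

residuals = y_true - y_hat

# RMSE: square root of the mean squared residual, in grade points
rmse_manual = np.sqrt(np.mean(residuals ** 2))

# R²: 1 minus (residual sum of squares / total sum of squares around the mean)
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(f"RMSE: {rmse_manual:.3f}")
print(f"R²: {r2_manual:.3f}")
```

RMSE is in the same units as the target, so an RMSE of 1.16 means predictions are typically off by about one grade point on the 0 to 20 scale.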


### 📊 Step 7: Compare Multiple Regression Models

In [8]:
from sklearn.linear_model import Ridge, Lasso, ElasticNet, LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.pipeline import Pipeline
import pandas as pd

# Define models to compare
models = {
    "LinearRegression": LinearRegression(),
    "Ridge": Ridge(),
    "Lasso": Lasso(),
    "ElasticNet": ElasticNet(),
    "DecisionTree": DecisionTreeRegressor(random_state=42),
    "RandomForest": RandomForestRegressor(random_state=42)
}

# Define your columns
categorical_cols = ['sex', 'schoolsup', 'internet']
numeric_cols = ['G1', 'G2', 'studytime', 'failures', 'absences', 'Medu', 'Fedu']
target = 'G3'

# Load and prepare data
df = pd.read_csv("student-por.csv")
X = df[categorical_cols + numeric_cols]
y = df[target]

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Preprocessing pipeline
preprocessor = ColumnTransformer(transformers=[
    ('num', StandardScaler(), numeric_cols),
    ('cat', OneHotEncoder(drop='first'), categorical_cols)
])

# Run and evaluate each model
results = []

for name, model in models.items():
    pipeline = Pipeline(steps=[
        ('preprocessor', preprocessor),
        ('regressor', model)
    ])
    pipeline.fit(X_train, y_train)
    y_pred = pipeline.predict(X_test)
    # squared=False is deprecated in recent scikit-learn releases; take the root explicitly
    rmse = mean_squared_error(y_test, y_pred) ** 0.5
    r2 = r2_score(y_test, y_pred)
    results.append((name, rmse, r2))

# Show comparison
results_df = pd.DataFrame(results, columns=["Model", "RMSE", "R2"])
print(results_df.sort_values(by="RMSE"))




              Model      RMSE        R2
0  LinearRegression  1.155739  0.863026
1             Ridge  1.155812  0.863009
5      RandomForest  1.300856  0.826469
4      DecisionTree  1.406031  0.797274
2             Lasso  1.424042  0.792047
3        ElasticNet  1.483339  0.774368


## 📘 Conclusion: Student Performance Regression Modeling

This project aimed to predict students' final grades (`G3`) from features such as earlier period grades, study time, absences, and demographic data. After exploratory data analysis (EDA) and training six regression models, the following conclusions were drawn:

---

### 🔍 Key Findings

- **Strongest Predictors**:
  - `G1` and `G2` (grades from the first and second periods) have a **strong linear correlation** with the final score `G3`.
  - Features like `studytime`, `failures`, `absences`, `Medu`, and `Fedu` also showed some predictive power.

- **Best Model**:
  - **Linear Regression** achieved the **lowest RMSE (1.16)** and **highest R² score (0.86)** on the held-out test set, meaning it explains about 86% of the variance in the final grade `G3` for that split.
  - **Ridge Regression** performed nearly identically, offering regularization benefits with similar accuracy.

- **Other Models**:
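  - **Random Forest** and **Decision Tree** performed reasonably but with higher test RMSE (1.30 and 1.41, versus 1.16 for Linear Regression).
  - **Lasso** and **ElasticNet** had the highest errors (RMSE 1.42 and 1.48). Both were run with scikit-learn's default penalty strength (`alpha=1.0`) and were not tuned, so the gap may reflect over-shrinkage of the coefficients as much as the lack of noisy or irrelevant features in this dataset.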
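
One possible reason for the weaker Lasso and ElasticNet scores is that both were run with the default `alpha=1.0`. The sketch below uses synthetic data (a single standardized feature with a true slope of 3, invented for illustration) to show how much the default penalty shrinks a coefficient compared with a small `alpha`. It does not reproduce the notebook's results.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

rng = np.random.default_rng(0)

# Synthetic data: one standardized feature, true slope 3, modest noise
x = rng.standard_normal(500)
x = (x - x.mean()) / x.std()
y = 3.0 * x + rng.normal(scale=0.5, size=500)
X = x.reshape(-1, 1)

ols_coef = LinearRegression().fit(X, y).coef_[0]
lasso_default_coef = Lasso(alpha=1.0).fit(X, y).coef_[0]   # scikit-learn default
lasso_small_coef = Lasso(alpha=0.01).fit(X, y).coef_[0]

# For a single standardized feature, the L1 penalty pulls the slope toward 0 by about alpha
print(f"OLS slope:          {ols_coef:.3f}")
print(f"Lasso (alpha=1.0):  {lasso_default_coef:.3f}")
print(f"Lasso (alpha=0.01): {lasso_small_coef:.3f}")
```

If regularized models are worth revisiting here, searching over `alpha` (for example with `LassoCV` or `GridSearchCV`) would give a fairer comparison than the defaults.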

---

### ✅ Final Conclusion

- A **simple linear regression model** is sufficient for this dataset: it matched or beat every other model tested (test RMSE 1.16, R² 0.86).
- **Earlier period grades (`G1` and `G2`)** are the most reliable indicators of the final grade.
- This model can help educators **identify students at risk** based on early performance and offer timely interventions.

---


### ✅ Cross-Validation

In [10]:
from sklearn.model_selection import cross_val_score

linear_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('regressor', LinearRegression())
])

# R² scores
cv_r2_scores = cross_val_score(linear_pipeline, X, y, cv=5, scoring='r2')
print("Cross-Validated R²:", cv_r2_scores.mean())

# RMSE scores
cv_rmse_scores = cross_val_score(linear_pipeline, X, y, cv=5, scoring='neg_root_mean_squared_error')
print("Cross-Validated RMSE:", -cv_rmse_scores.mean())


Cross-Validated R²: 0.8207463106991397
Cross-Validated RMSE: 1.2534823920275762
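
The mean alone hides how much the score varies between folds. The sketch below, on synthetic data with invented coefficients, prints per-fold scores and their spread. It also uses a shuffled `KFold`, which is an assumption on my part rather than what the cell above does (`cv=5` splits a regression dataset in file order without shuffling).

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(42)

# Synthetic regression data (invented), standing in for the student features
X_toy = rng.normal(size=(200, 3))
y_toy = X_toy @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.5, size=200)

# Shuffled folds guard against any ordering in the rows
cv = KFold(n_splits=5, shuffle=True, random_state=42)

fold_r2 = cross_val_score(LinearRegression(), X_toy, y_toy, cv=cv, scoring="r2")
# The scorer returns negated RMSE so that "higher is better"; flip the sign back
fold_rmse = -cross_val_score(
    LinearRegression(), X_toy, y_toy, cv=cv, scoring="neg_root_mean_squared_error"
)

print("R² per fold:  ", np.round(fold_r2, 3))
print("RMSE per fold:", np.round(fold_rmse, 3))
print(f"R² mean ± std:   {fold_r2.mean():.3f} ± {fold_r2.std():.3f}")
print(f"RMSE mean ± std: {fold_rmse.mean():.3f} ± {fold_rmse.std():.3f}")
```

Reporting the standard deviation alongside the mean helps judge whether the gap between the single-split R² (0.86) and the cross-validated R² (0.82) is within normal fold-to-fold variation.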


### ✅ Add Interaction Terms / Non-Linear Features

In [11]:
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline

# Use polynomial features for numeric columns
from sklearn.linear_model import LinearRegression

poly_preprocessor = ColumnTransformer(transformers=[
    ('num', make_pipeline(PolynomialFeatures(degree=2, include_bias=False), StandardScaler()), numeric_cols),
    ('cat', OneHotEncoder(drop='first'), categorical_cols)
])

poly_pipeline = Pipeline(steps=[
    ('preprocessor', poly_preprocessor),
    ('regressor', LinearRegression())
])

poly_pipeline.fit(X_train, y_train)
y_poly_pred = poly_pipeline.predict(X_test)

# squared=False is deprecated in recent scikit-learn releases; take the root explicitly
print("Polynomial RMSE:", mean_squared_error(y_test, y_poly_pred) ** 0.5)
print("Polynomial R²:", r2_score(y_test, y_poly_pred))


Polynomial RMSE: 1.2870016013545684
Polynomial R²: 0.8301454712268641
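
The polynomial model scores slightly worse than plain linear regression on the test set (RMSE 1.29 vs. 1.16). One plausible contributor is the jump in feature count: a degree-2 expansion of the seven numeric columns adds squares and pairwise products. The sketch below counts the expanded features; it needs only the number of columns, not the data.

```python
from math import comb

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

n_numeric = 7  # G1, G2, studytime, failures, absences, Medu, Fedu

poly = PolynomialFeatures(degree=2, include_bias=False)
poly.fit(np.zeros((1, n_numeric)))  # only the column count matters for fitting
n_expanded = poly.n_output_features_

# Closed form: all monomials of degree 1 or 2 in n variables
expected = comb(n_numeric + 2, 2) - 1

print(n_expanded)  # 35 = 7 linear + 7 squared + 21 pairwise interaction terms
```

Going from 7 to 35 numeric inputs gives an unregularized linear model more room to fit noise in the training split, which may explain the lower test score.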




### ✅ Exporting the Model for Flask or Streamlit Deployment

In [12]:
import joblib

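# Save the best pipeline: linear regression had the lower test RMSE (1.16 vs. 1.29 for polynomial)
linear_pipeline.fit(X_train, y_train)  # cross_val_score fits clones, so fit this instance before saving
joblib.dump(linear_pipeline, "final_student_model.pkl")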
print("✅ Model saved as 'final_student_model.pkl'")


✅ Model saved as 'final_student_model.pkl'
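
A serving app would typically load the saved file once and call `predict` on a one-row `DataFrame` whose column names match the training data. The sketch below shows that round trip. To stay runnable without `student-por.csv`, it trains a small stand-in pipeline on invented rows and writes to a temporary directory; `train`, `grades`, `model`, and `new_student` are illustrative names.

```python
import os
import tempfile

import joblib
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Stand-in training data (invented) so the sketch runs on its own
train = pd.DataFrame({
    "G1": [10, 12, 14, 8, 15, 11],
    "G2": [11, 12, 15, 9, 16, 10],
    "sex": ["F", "M", "F", "M", "F", "M"],
})
grades = pd.Series([11, 13, 15, 9, 16, 10], name="G3")

model = Pipeline(steps=[
    ("preprocessor", ColumnTransformer(transformers=[
        ("num", StandardScaler(), ["G1", "G2"]),
        ("cat", OneHotEncoder(drop="first"), ["sex"]),
    ])),
    ("regressor", LinearRegression()),
])
model.fit(train, grades)

# Save and reload, as a Flask or Streamlit app would do at startup
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "final_student_model.pkl")
    joblib.dump(model, path)
    loaded = joblib.load(path)

# New input must be a DataFrame with the same column names used in training
new_student = pd.DataFrame([{"G1": 13, "G2": 14, "sex": "F"}])
prediction = loaded.predict(new_student)
print(f"Predicted G3: {prediction[0]:.2f}")
```

Because the preprocessing lives inside the pipeline, the app does not need to repeat scaling or encoding. It is safest to load the file with the same scikit-learn version that was used to save it.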
