# Machine Learning Model Training

This notebook trains a model to predict student performance (the final grade, G3) using the processed data from our data wrangling step.

### Import Dependencies

In [3]:
# Import required libraries
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import cross_val_score
import os
import joblib

### Load Processed Data

In [4]:
# Load the processed training and testing data
X_train = pd.read_csv('processed_data/X_train.csv')
X_test = pd.read_csv('processed_data/X_test.csv')
y_train = pd.read_csv('processed_data/y_train.csv')
y_test = pd.read_csv('processed_data/y_test.csv')

# Convert y data from DataFrame to Series
y_train = y_train.squeeze()
y_test = y_test.squeeze()

print("Data loaded successfully:")
print(f"X_train shape: {X_train.shape}")
print(f"X_test shape: {X_test.shape}")
print(f"y_train shape: {y_train.shape}")
print(f"y_test shape: {y_test.shape}")

FileNotFoundError: [Errno 2] No such file or directory: 'processed_data/X_train.csv'
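
The `FileNotFoundError` above means `processed_data/X_train.csv` was not found relative to the notebook's working directory. The usual causes are that the data wrangling step has not been run yet or that the notebook was started from a different folder. Below is a minimal sketch of a check to run before loading, assuming the four file names used in the loading cell; the helper name `find_missing_files` is hypothetical and not part of the original notebook.

```python
from pathlib import Path

# File names taken from the loading cell above
EXPECTED_FILES = ["X_train.csv", "X_test.csv", "y_train.csv", "y_test.csv"]

def find_missing_files(data_dir, names=EXPECTED_FILES):
    """Return the expected files that do not exist in data_dir (hypothetical helper)."""
    data_dir = Path(data_dir)
    return [name for name in names if not (data_dir / name).is_file()]

missing = find_missing_files("processed_data")
if missing:
    # Printing the working directory helps spot a wrong start folder
    print(f"Working directory: {Path.cwd()}")
    print(f"Missing from processed_data/: {missing}")
else:
    print("All processed data files found.")
```

If files are reported missing, rerunning the data wrangling notebook from the same folder as this one is the likely fix.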

### Model Selection and Training

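We'll use Linear Regression as our base model since:
1. The target variable (G3) is continuous, so this is a regression problem
2. Its coefficients are directly interpretable, which shows how each feature shifts the predicted grade
3. The relationship between features and grades may be approximately linear, which makes it a reasonable baseline before trying more complex models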

In [None]:
# Initialize the model
model = LinearRegression()

# Train the model
model.fit(X_train, y_train)

print("Model training completed.")
print("\nModel Parameters:")
print(f"Intercept: {model.intercept_:.2f}")
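print("\nTop 5 Feature Coefficients (by absolute value):")
coefficients = pd.DataFrame({
    'Feature': X_train.columns,
    'Coefficient': model.coef_
})
# Rank by magnitude so that strong negative effects are not left out
top_idx = coefficients['Coefficient'].abs().nlargest(5).index
print(coefficients.loc[top_idx])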
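
To make the meaning of `intercept_` and `coef_` concrete, the sketch below fits the same estimator on synthetic, noise-free data (an assumption made purely for illustration, not the student dataset) and checks that a prediction is the intercept plus the coefficient-weighted sum of the features. The names `X_demo`, `y_demo`, and `demo_model` are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data (assumption): y = 3 + 2*x0 - 1*x1, with no noise
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 2))
y_demo = 3.0 + 2.0 * X_demo[:, 0] - 1.0 * X_demo[:, 1]

demo_model = LinearRegression().fit(X_demo, y_demo)

# A prediction is the intercept plus the coefficient-weighted sum of features
manual_pred = demo_model.intercept_ + X_demo @ demo_model.coef_
print(np.allclose(manual_pred, demo_model.predict(X_demo)))
print(np.round(demo_model.coef_, 2), round(float(demo_model.intercept_), 2))
```

Coefficient magnitudes depend on the scale of each feature, so they can be compared across features only when those features are on comparable scales.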

### Model Evaluation

In [None]:
# Make predictions
y_pred_train = model.predict(X_train)
y_pred_test = model.predict(X_test)

# Calculate metrics
train_mse = mean_squared_error(y_train, y_pred_train)
test_mse = mean_squared_error(y_test, y_pred_test)
train_r2 = r2_score(y_train, y_pred_train)
test_r2 = r2_score(y_test, y_pred_test)

print("Model Performance Metrics:")
print("\nTraining Set:")
print(f"MSE: {train_mse:.2f}")
print(f"RMSE: {np.sqrt(train_mse):.2f}")
print(f"R² Score: {train_r2:.2f}")

print("\nTesting Set:")
print(f"MSE: {test_mse:.2f}")
print(f"RMSE: {np.sqrt(test_mse):.2f}")
print(f"R² Score: {test_r2:.2f}")
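
For reference, the metrics printed above can be computed by hand. The sketch below uses made-up grades (for illustration only, not results from this notebook) and compares the manual values with `mean_squared_error` and `r2_score`.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up grades for illustration only
y_true = np.array([10.0, 12.0, 8.0, 15.0, 11.0])
y_hat = np.array([11.0, 11.0, 9.0, 14.0, 12.0])

residuals = y_true - y_hat
mse = np.mean(residuals ** 2)                    # average squared error
rmse = np.sqrt(mse)                              # same units as the grade
ss_res = np.sum(residuals ** 2)                  # unexplained variation
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation around the mean
r2 = 1.0 - ss_res / ss_tot

print(np.isclose(mse, mean_squared_error(y_true, y_hat)))
print(np.isclose(r2, r2_score(y_true, y_hat)))
```

RMSE is usually the easiest of the three to read because it is expressed in grade points, while R² gives the share of the variance in G3 that the model explains.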

### Cross-Validation

In [None]:
# Perform 5-fold cross-validation
cv_scores = cross_val_score(model, X_train, y_train, cv=5, scoring='r2')

print("Cross-Validation Results:")
print(f"R² Scores: {cv_scores}")
print(f"Mean R²: {cv_scores.mean():.2f}")
print(f"Standard Deviation: {cv_scores.std():.2f}")
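
As a rough illustration of what `cross_val_score` does with `cv=5` for a regressor, the sketch below runs the five folds by hand with `KFold` on synthetic data (an assumption; the arrays `X_cv` and `y_cv` are hypothetical) and compares the result with the one-line call.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression data (assumption, for illustration only)
rng = np.random.default_rng(42)
X_cv = rng.normal(size=(100, 3))
y_cv = X_cv @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.5, size=100)

manual_scores = []
for train_idx, val_idx in KFold(n_splits=5).split(X_cv):
    # Fit a fresh model on four folds, score it on the held-out fold
    fold_model = LinearRegression().fit(X_cv[train_idx], y_cv[train_idx])
    manual_scores.append(r2_score(y_cv[val_idx], fold_model.predict(X_cv[val_idx])))

auto_scores = cross_val_score(LinearRegression(), X_cv, y_cv, cv=5, scoring='r2')
print(np.allclose(manual_scores, auto_scores))
```

Each fold fits a fresh copy of the estimator, so the `model` trained earlier in the notebook is not modified by cross-validation.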

### Save Model

In [None]:
# Create directory if it doesn't exist
os.makedirs('models', exist_ok=True)

# Save the trained model
model_path = 'models/student_performance_model.joblib'
joblib.dump(model, model_path)

print(f"Model saved successfully to {model_path}")
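
A saved model can later be restored with `joblib.load`. The sketch below shows the round trip on a tiny synthetic model written to a temporary directory (both are assumptions made to keep the example self-contained; the names `trained`, `reloaded`, and `demo_model.joblib` are hypothetical).

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression

# Tiny synthetic model (assumption, for illustration only)
X_small = np.array([[0.0], [1.0], [2.0], [3.0]])
y_small = np.array([1.0, 3.0, 5.0, 7.0])
trained = LinearRegression().fit(X_small, y_small)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_model.joblib")
    joblib.dump(trained, path)      # save
    reloaded = joblib.load(path)    # restore

# The restored model should give the same predictions as the original
same_predictions = np.allclose(trained.predict(X_small), reloaded.predict(X_small))
print(same_predictions)
```

In this notebook the path would be `models/student_performance_model.joblib`, and any new data passed to `predict` needs the same feature columns, in the same order, as `X_train`.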

### Model Summary

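The Linear Regression model predicts a student's final grade (G3) from the processed features. Key points:
1. MSE, RMSE, and R² on the training and testing sets show how closely predictions match actual grades; a large gap between the two sets points to overfitting
2. 5-fold cross-validation checks whether the R² score is stable across different splits of the training data
3. The saved model (`models/student_performance_model.joblib`) can be reloaded for future predictions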