# Model Training

## 1. Mathematical Foundations (MSc Level)

### Entropy
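$$ H(S) = -\sum_i p_i \log_2(p_i) $$
Measures the impurity of a dataset $S$, where $p_i$ is the proportion of samples in class $i$. Entropy is 0 for a pure set and largest when all classes are equally frequent.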
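
As a minimal sketch of the formula above, the helper below computes entropy in bits from a list of class labels. The function name `entropy` is illustrative and does not appear elsewhere in this notebook.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (in bits) of the class distribution in `labels`."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    # np.unique only returns classes that occur, so every p_i > 0 and log2 is safe
    h = -np.sum(p * np.log2(p))
    # Adding 0.0 turns a possible -0.0 (pure node) into 0.0
    return float(h) + 0.0

balanced = entropy([0, 0, 1, 1])  # two equally frequent classes
pure = entropy([1, 1, 1, 1])      # a single class
print(balanced, pure)
```

A balanced binary set gives the maximum of 1 bit, while a pure set gives 0.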

### Gini Index
$$ Gini = 1 - \sum_i p_i^2 $$
The probability that a randomly drawn sample is misclassified when it is labelled at random according to the class proportions $p_i$.
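
The Gini formula can be sketched in the same way. The helper name `gini` is an assumption made for illustration, not something defined in this notebook.

```python
import numpy as np

def gini(labels):
    """Gini impurity of the class distribution in `labels`."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(1.0 - np.sum(p ** 2))

gini_balanced = gini([0, 0, 1, 1])  # two equally frequent classes
gini_pure = gini([1, 1, 1, 1])      # a single class
print(gini_balanced, gini_pure)
```

For two balanced classes the impurity is 0.5, and it drops to 0 for a pure set.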

### Logistic Regression
Uses the sigmoid function:
$$ \sigma(z) = \frac{1}{1 + e^{-z}} $$
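The output $\sigma(z)$ is read as the probability of the positive class. The cost function is the log loss (binary cross-entropy), which is minimized iteratively, for example by gradient descent. Note that scikit-learn's `LogisticRegression` uses the L-BFGS solver by default rather than plain gradient descent.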
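
To make the optimization step concrete, here is a rough sketch of batch gradient descent on the mean log loss. It is not what scikit-learn runs internally; the function names, the learning rate, the iteration count, and the toy data are all assumptions chosen for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic_gd(X, y, learning_rate=0.1, n_iter=1000):
    """Batch gradient descent on the mean log loss; returns (weights, bias)."""
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    b = 0.0
    for _ in range(n_iter):
        p = sigmoid(X @ w + b)   # predicted probabilities
        error = p - y            # gradient of the log loss with respect to z
        w -= learning_rate * (X.T @ error) / n_samples
        b -= learning_rate * error.mean()
    return w, b

# Toy 1-D data (made up): larger x means class 1
X_toy = np.array([[-2.0], [-1.0], [1.0], [2.0]])
y_toy = np.array([0, 0, 1, 1])

w, b = fit_logistic_gd(X_toy, y_toy)
preds = (sigmoid(X_toy @ w + b) >= 0.5).astype(int)
print(preds)
```

On this separable toy set the learned weight is positive, so points with larger `x` are assigned to class 1.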

### Metrics
- **Precision**: TP / (TP + FP)
- **Recall**: TP / (TP + FN)
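- **F1-Score**: Harmonic mean of Precision and Recall: 2 · Precision · Recall / (Precision + Recall)
- **ROC-AUC**: Area under the Receiver Operating Characteristic curve (true positive rate against false positive rate across thresholds); 0.5 corresponds to random ranking and 1.0 to perfect ranking.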
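
The first three metrics above can be checked by hand from confusion-matrix counts. The sketch below uses hypothetical counts, and the helper name `precision_recall_f1` is illustrative.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall and F1 from raw counts, returning 0.0 on empty denominators."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Hypothetical counts: 40 true positives, 10 false positives, 20 false negatives
precision, recall, f1 = precision_recall_f1(tp=40, fp=10, fn=20)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
# prints: precision=0.800 recall=0.667 f1=0.727
```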


In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score, roc_curve
import matplotlib.pyplot as plt
import seaborn as sns
import joblib


## 2. Load and Prepare Data


In [None]:
df = pd.read_csv('../data/student_dropout_1000.csv')

# Basic preprocessing: drop duplicate rows
df = df.drop_duplicates()

# Encode Target
target = 'Target' if 'Target' in df.columns else df.columns[-1]
if df[target].dtype == 'object':
    df[target] = df[target].apply(lambda x: 1 if str(x).strip() == 'Dropout' else 0)

# Label-encode the remaining categorical (object) feature columns
for col in df.select_dtypes(include=['object']).columns:
    if col != target:
        le = LabelEncoder()
        df[col] = le.fit_transform(df[col])

# Fill remaining NaNs with each numeric column's median
df = df.fillna(df.median(numeric_only=True))


## 3. Splitting and Scaling


In [None]:
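X = df.drop(target, axis=1)
y = df[target]

# Split before scaling so the scaler is fitted on training data only (no test-set leakage)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)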
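
An alternative worth considering is to bundle the scaler and the model in a scikit-learn pipeline, so that cross-validation refits the scaler on each training fold. The sketch below uses synthetic data from `make_classification` as a stand-in for this notebook's `X` and `y`; the names `X_demo`, `y_demo`, and `pipe` are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real features and target (assumption, not the dropout data)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# The scaler is refitted inside every CV fold, using that fold's training part only
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
cv_scores = cross_val_score(pipe, X_tr, y_tr, cv=3, scoring="accuracy")

pipe.fit(X_tr, y_tr)
test_accuracy = pipe.score(X_te, y_te)
print(cv_scores.mean(), test_accuracy)
```

A fitted pipeline can also be saved as a single object, which keeps the scaler and the model in sync at prediction time.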


## 4. Logistic Regression


In [None]:
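# max_iter is raised above the default of 100 to reduce the risk of convergence warnings
lr = LogisticRegression(max_iter=1000)
lr.fit(X_train, y_train)
y_pred_lr = lr.predict(X_test)
print(classification_report(y_test, y_pred_lr))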


## 5. Random Forest & GridSearchCV


In [None]:
# Tune the number of trees and the maximum depth with 3-fold cross-validation
rf = RandomForestClassifier(random_state=42)
param_grid = {'n_estimators': [50, 100], 'max_depth': [10, 20, None]}
grid = GridSearchCV(rf, param_grid, cv=3, scoring='accuracy')
grid.fit(X_train, y_train)

# best_estimator_ is refitted on the whole training set with the best parameters
best_rf = grid.best_estimator_
y_pred_rf = best_rf.predict(X_test)
print("Best Params:", grid.best_params_)
print(classification_report(y_test, y_pred_rf))



## 6. Confusion Matrix & ROC


In [None]:
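# Rows are true labels and columns are predicted labels
ax = sns.heatmap(confusion_matrix(y_test, y_pred_rf), annot=True, fmt='d')
ax.set_xlabel('Predicted label')
ax.set_ylabel('True label')
plt.title('Confusion Matrix (RF)')
plt.show()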
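
The ROC part of this section could be computed along the following lines. The sketch uses a tiny made-up example so that it runs on its own; in this notebook one would presumably pass `y_test` and `best_rf.predict_proba(X_test)[:, 1]` instead of the `*_demo` lists, which are illustrative names.

```python
from sklearn.metrics import roc_auc_score, roc_curve

# Made-up labels and positive-class scores for illustration
y_true_demo = [0, 0, 1, 1]
y_score_demo = [0.1, 0.4, 0.35, 0.8]

# roc_curve needs scores or probabilities, not hard 0/1 predictions
fpr, tpr, thresholds = roc_curve(y_true_demo, y_score_demo)
auc = roc_auc_score(y_true_demo, y_score_demo)
print(f"AUC = {auc:.2f}")  # prints: AUC = 0.75
```

The `fpr` and `tpr` arrays can then be drawn with `plt.plot(fpr, tpr)`, with a diagonal line added as the random-ranking baseline.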


## 7. Save Models


In [None]:
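# Create the output directory if needed, then persist the model and the fitted scaler
import os
os.makedirs('../models', exist_ok=True)
joblib.dump(best_rf, '../models/model.pkl')
joblib.dump(scaler, '../models/scaler.pkl')
print('Model and scaler saved.')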
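
Saved artifacts are read back with `joblib.load`. The round-trip sketch below uses a temporary directory and a small demo scaler rather than the notebook's files; `demo_scaler` and the sample values are assumptions for illustration.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.preprocessing import StandardScaler

# Small stand-in for the fitted scaler saved above
demo_scaler = StandardScaler().fit(np.array([[0.0], [2.0], [4.0]]))

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "scaler.pkl"
    joblib.dump(demo_scaler, path)
    restored = joblib.load(path)

# The restored scaler should transform new rows exactly like the original
new_rows = np.array([[2.0], [6.0]])
same = bool(np.allclose(demo_scaler.transform(new_rows), restored.transform(new_rows)))
print(same)
```

At prediction time the scaler should be loaded and applied before the model, since the model was trained on scaled features.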
