## 🧩 Part 1: Merging Mathematics and Portuguese Student Datasets

The dataset is split into two files, `student-mat.csv` (Mathematics) and `student-por.csv` (Portuguese), and there is **no unique student ID**.  
To correctly merge students who appear in both subjects, we:

✔ Used all **non-grade and non-subject-specific columns** (e.g., age, school, parents' education, address) as matching keys. Two records that differ in any of these columns are treated as different students.  
✔ Excluded subject-specific columns like `G1_math`, `G2_math`, `G3_math`, `absences_math` and their Portuguese equivalents.  
✔ Performed an **outer merge** so students appearing in only one dataset are still included.  
✔ Saved the merged dataset as `student_combined_data_final.csv` for further modeling.

This will allow us to build **one classification model for Math** and **one for Portuguese**, using consistent student records.


In [None]:
import pandas as pd

# --- Load data ---
math = pd.read_csv("student-mat.csv")
por  = pd.read_csv("student-por.csv")

# --- Grade & subject-specific columns ---
math_subject_cols = ['G1_math', 'G2_math', 'G3_math', 'absences_math']
por_subject_cols  = ['G1_por', 'G2_por', 'G3_por', 'absences_por']

# --- Build strict match key: all shared non-subject, non-grade columns ---
common_cols = sorted(set(math.columns) & set(por.columns))
exclude_cols = set(math_subject_cols + por_subject_cols)
key_cols = [c for c in common_cols if c not in exclude_cols]

# --- Keep only key columns + Portuguese subject columns ---
# Drop Portuguese rows with an identical key so each key appears at most once on that side.
por_keep = por[key_cols + por_subject_cols].drop_duplicates(subset=key_cols, keep='first')

# --- Merge: all columns from Math + Portuguese subject columns ---
merged = pd.merge(
    math, por_keep,
    on=key_cols,
    how='outer'
)

# --- Diagnostics ---
has_math = merged[['G1_math', 'G2_math', 'G3_math']].notna().any(axis=1)
has_por  = merged[['G1_por', 'G2_por', 'G3_por']].notna().any(axis=1)
matched  = (has_math & has_por).sum()
only_math = (has_math & ~has_por).sum()
only_por  = (~has_math & has_por).sum()

print(f"Matched students (identical on all non-subject columns): {matched}")
print(f"Math-only rows: {only_math}")
print(f"Portuguese-only rows: {only_por}")
print(f"Total rows in merged: {len(merged)}")

# --- Save final combined dataset ---
merged.to_csv("student_combined_data_final.csv", index=False)
print("✅ Merged dataset saved as 'student_combined_data_final.csv'")


Matched students (identical on all non-subject columns): 162
Math-only rows: 233
Portuguese-only rows: 485
Total rows in merged: 880
✅ Merged dataset saved as 'student_combined_data_final.csv'
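
As a cross-check on the diagnostics above, `pd.merge` can label where each row came from through its `indicator` parameter. The sketch below uses two tiny invented frames (the `school`/`age` keys and the grade values are illustrative, not taken from the real files), so treat it as a pattern rather than a result.

```python
import pandas as pd

# Toy stand-ins for the two files; all values are invented for illustration.
math_toy = pd.DataFrame({
    "school": ["GP", "GP", "MS"],
    "age": [15, 16, 17],
    "G3_math": [10, 14, 8],
})
por_toy = pd.DataFrame({
    "school": ["GP", "MS", "MS"],
    "age": [15, 17, 18],
    "G3_por": [12, 9, 15],
})

# indicator=True adds a `_merge` column: 'both', 'left_only', or 'right_only'.
merged_toy = pd.merge(
    math_toy, por_toy,
    on=["school", "age"],
    how="outer",
    indicator=True,
)

counts = merged_toy["_merge"].value_counts()
print(counts)
```

Applied to the real merge, `both` should correspond to the matched students, `left_only` to the Math-only rows, and `right_only` to the Portuguese-only rows.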


## 🎯 Part 2: Binary Classification - Predicting Pass/Fail in Mathematics (G3_math)

### ✅ Objective:
Convert the grade-prediction (regression) task into a **binary classification problem**:
- **Pass = 1** if `G3_math` > 15  
- **Fail = 0** if `G3_math` ≤ 15
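
A detail worth checking when this rule is applied to the merged file: students who appear only in the Portuguese data have no `G3_math`, and any comparison against NaN evaluates to `False`. The sketch below, on three invented grades, shows the effect and one possible way to keep missing grades missing. The names `naive` and `safe` are illustrative.

```python
import numpy as np
import pandas as pd

# Toy grades: the middle student has no Math record (NaN after an outer merge).
g3 = pd.Series([18.0, np.nan, 9.0], name="G3_math")

# NaN > 15 is False, so np.where silently labels the missing grade as 0 (Fail).
naive = pd.Series(np.where(g3 > 15, 1, 0), index=g3.index)

# One option: compute the label, then restore NaN wherever the grade is missing.
safe = (g3 > 15).astype(float).where(g3.notna())

print(naive.tolist())  # [1, 0, 0]
print(safe.tolist())   # [1.0, nan, 0.0]
```

With the second form, a later `dropna` on the target column removes students without a Math grade instead of counting them as failures.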

### ✅ Why Logistic Regression?
- The professor specifically asked us to use **linear models only**.
- Logistic Regression is a **linear classification model**.
- It provides **interpretable coefficients and odds ratios**, useful for analysis.
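
For reference, the link between coefficients and odds ratios follows from the form of the model. With $p$ the predicted probability of passing and $x_j$ the features as they enter the model:

$$
\log\frac{p}{1-p} = \beta_0 + \sum_{j} \beta_j x_j
\quad\Longrightarrow\quad
\frac{\mathrm{odds}(x_j + 1)}{\mathrm{odds}(x_j)} = e^{\beta_j}
$$

For example, a coefficient of $\beta_j = 0.7$ corresponds to an odds ratio of $e^{0.7} \approx 2.01$. If numeric features are standardized before fitting, "one unit" means one standard deviation, and for a one-hot encoded feature it means switching from the reference category to that level. Note that scikit-learn's `LogisticRegression` applies L2 regularization by default, so the coefficients are shrunk relative to an unpenalized fit.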

### ✅ Important Constraints Followed:
- We are **not using G1 or G2**, because they correlate strongly with G3 and would make the task trivial.
- We only use demographic, behavioral, and academic support features.
- Rows with missing values in the selected features are dropped. In the merged file this removes the Portuguese-only students, who have no `absences_math` value.

### ✅ Pipeline Steps:
1. Create binary target column (`G3_math_pass`).  
2. Select meaningful predictors (studytime, failures, alcohol consumption, etc.).  
3. Split into training/testing sets using **stratified sampling**.  
4. Apply **StandardScaler + OneHotEncoder** using `ColumnTransformer`.  
5. Train `LogisticRegression(class_weight='balanced')` to handle class imbalance.  
6. Evaluate using **Accuracy**, **Precision**, **Recall**, **F1**, **ROC-AUC**, and **Cross-Validation**.
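
To make step 5 concrete, the sketch below reproduces the weighting that scikit-learn documents for `class_weight='balanced'`, namely `n_samples / (n_classes * np.bincount(y))`. The label counts are invented for illustration and are not the real class distribution.

```python
import numpy as np

# Toy labels: 90 "Fail" (0) and 10 "Pass" (1); counts are invented for illustration.
y_toy = np.array([0] * 90 + [1] * 10)

# Formula documented by scikit-learn for the 'balanced' heuristic.
n_samples = len(y_toy)
n_classes = len(np.unique(y_toy))
weights = n_samples / (n_classes * np.bincount(y_toy))

# Map each class label to its weight.
print(dict(zip([0, 1], weights.round(3).tolist())))
```

The rarer class receives the larger weight, so errors on "Pass" students cost more during training. This tends to raise recall on the minority class at the expense of precision and overall accuracy.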

This step helps us understand whether passing a Math exam can be predicted using socio-demographic and behavioral features.


In [None]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, classification_report, confusion_matrix

# -------------------------------
# 1. Load Dataset
# -------------------------------
data = pd.read_csv("student_combined_data_final.csv")

# -------------------------------
# 2. Create Binary Target: Pass/Fail
# -------------------------------
# Fail = 0 (G3 ≤ 15), Pass = 1 (G3 > 15)
data['G3_math_pass'] = np.where(data['G3_math'] > 15, 1, 0)

# -------------------------------
# 3. Select Relevant Features (No G1, G2)
# -------------------------------
features = [
    'address', 'Medu', 'Fedu', 'Mjob', 'Fjob', 'studytime', 'failures', 'schoolsup',
    'famsup', 'paid', 'activities', 'nursery', 'higher', 'internet', 'romantic',
    'famrel', 'freetime', 'goout', 'Dalc', 'Walc', 'health', 'absences_math'
]

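# Remove rows with missing values in the selected columns or the raw grade.
# (G3_math_pass is never NaN because np.where maps a missing grade to 0, so filter on G3_math.)
data_clean = data.dropna(subset=features + ['G3_math'])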

X = data_clean[features]
y = data_clean['G3_math_pass']

# -------------------------------
# 4. Preprocessing (Scaling + Encoding)
# -------------------------------
numeric_features = X.select_dtypes(include=['float64', 'int64']).columns
categorical_features = X.select_dtypes(include=['object']).columns

preprocessor = ColumnTransformer([
    ('num', StandardScaler(), numeric_features),
    ('cat', OneHotEncoder(drop='first'), categorical_features)
])

# -------------------------------
# 5. Build Logistic Regression Model
# -------------------------------
model = Pipeline([
    ('preprocess', preprocessor),
    ('logreg', LogisticRegression(max_iter=2000, class_weight='balanced'))
])

# -------------------------------
# 6. Train-Test Split
# -------------------------------
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# -------------------------------
# 7. Train Model
# -------------------------------
model.fit(X_train, y_train)

# -------------------------------
# 8. Evaluate Performance
# -------------------------------
y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test)[:,1]

print("Accuracy:", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
print("Recall:", recall_score(y_test, y_pred))
print("F1 Score:", f1_score(y_test, y_pred))
print("ROC-AUC:", roc_auc_score(y_test, y_prob))

# Cross-Validation ROC-AUC
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cv_auc = cross_val_score(model, X, y, cv=cv, scoring='roc_auc')
print("CV ROC-AUC Mean:", cv_auc.mean(), " | Std:", cv_auc.std())

print("\nClassification Report:\n", classification_report(y_test, y_pred, target_names=['Fail (≤15)', 'Pass (>15)']))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))

# -------------------------------
# 9. Feature Importance (Coefficients & Odds Ratios)
# -------------------------------
logreg = model.named_steps['logreg']
pre = model.named_steps['preprocess']

# Get final feature names after encoding
encoded_features = np.concatenate([
    numeric_features,
    pre.named_transformers_['cat'].get_feature_names_out(categorical_features)
])

coef_df = pd.DataFrame({
    'Feature': encoded_features,
    'Coefficient': logreg.coef_[0],
    'OddsRatio': np.exp(logreg.coef_[0])
}).sort_values(by='Coefficient', ascending=False)

print("\nTop Positive Predictors (Higher → More Likely to Pass):\n", coef_df.head(10))
print("\nTop Negative Predictors (Higher → More Likely to Fail):\n", coef_df.tail(10))


Accuracy: 0.6835443037974683
Precision: 0.13043478260869565
Recall: 0.375
F1 Score: 0.1935483870967742
ROC-AUC: 0.6637323943661972
CV ROC-AUC Mean: 0.7288732394366197  | Std: 0.12359606347841197

Classification Report:
               precision    recall  f1-score   support

  Fail (≤15)       0.91      0.72      0.80        71
  Pass (>15)       0.13      0.38      0.19         8

    accuracy                           0.68        79
   macro avg       0.52      0.55      0.50        79
weighted avg       0.83      0.68      0.74        79

Confusion Matrix:
 [[51 20]
 [ 5  3]]

Top Positive Predictors (Higher → More Likely to Pass):
           Feature  Coefficient  OddsRatio
14  Mjob_services     1.353140   3.869558
19   Fjob_teacher     1.070378   2.916480
0            Medu     0.761710   2.141936
25     higher_yes     0.709080   2.032122
16    Fjob_health     0.613805   1.847448
26   internet_yes     0.473609   1.605779
4          famrel     0.437194   1.548356
24    nursery_yes
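
The headline metrics can be recomputed by hand from the confusion matrix printed above, which is a useful sanity check on how to read them. The sketch below only uses the four counts from that matrix. The always-Fail baseline is an added comparison, not part of the original output.

```python
# Confusion matrix from the Math output: rows = actual, columns = predicted.
tn, fp = 51, 20   # actual Fail: predicted Fail, predicted Pass
fn, tp = 5, 3     # actual Pass: predicted Fail, predicted Pass

total = tn + fp + fn + tp
accuracy = (tp + tn) / total
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

# Baseline: always predict the majority class (Fail).
baseline_accuracy = (tn + fp) / total

print(f"accuracy={accuracy:.4f} precision={precision:.4f} recall={recall:.4f} f1={f1:.4f}")
print(f"always-Fail baseline accuracy={baseline_accuracy:.4f}")
# accuracy=0.6835 precision=0.1304 recall=0.3750 f1=0.1935
# always-Fail baseline accuracy=0.8987
```

With only 8 "Pass" students among the 79 test rows, accuracy alone is not very informative here: the model scores below the always-Fail baseline on accuracy, which is why recall, F1, and ROC-AUC are reported alongside it.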

## 📚 Part 3: Binary Classification - Predicting Pass/Fail in Portuguese (G3_por)

We now apply the **same classification process** to Portuguese instead of Mathematics.

### ✅ Why Repeat This Model?
- The goal is to **compare subjects** and see whether the same factors help predict passing in both Math and Portuguese.
- This also helps us answer assignment questions such as:
  - Is it possible to predict pass/fail with this dataset?
  - Does absenteeism affect performance?
  - Does parental education impact success?

### ✅ Pipeline (Same as Math Model):
✔ Create binary target (`G3_por_pass`).  
✔ Use the same predictors, replacing `absences_math` with `absences_por`.  
✔ Apply the same preprocessing (scaling + one-hot encoding).  
✔ Use `LogisticRegression(max_iter=2000, class_weight='balanced')`.  
✔ Evaluate using classification metrics + cross-validation.

By keeping the modeling strategy identical, we can **fairly compare** which factors affect Math vs Portuguese differently.


In [None]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, classification_report, confusion_matrix

# -------------------------------
# 1. Load Dataset
# -------------------------------
data = pd.read_csv("student_combined_data_final.csv")

# -------------------------------
# 2. Create Binary Target: Pass/Fail for Portuguese
# -------------------------------
data['G3_por_pass'] = np.where(data['G3_por'] > 15, 1, 0)

# -------------------------------
# 3. Feature Selection (No G1 or G2 used)
# -------------------------------
features = [
    'address', 'Medu', 'Fedu', 'Mjob', 'Fjob', 'studytime', 'failures', 'schoolsup',
    'famsup', 'paid', 'activities', 'nursery', 'higher', 'internet', 'romantic',
    'famrel', 'freetime', 'goout', 'Dalc', 'Walc', 'health', 'absences_por'
]

# Filter on the raw grade: G3_por_pass is never NaN because np.where maps a missing grade to 0.
data_clean = data.dropna(subset=features + ['G3_por'])
X = data_clean[features]
y = data_clean['G3_por_pass']

# -------------------------------
# 4. Preprocessing (Scaling + Encoding)
# -------------------------------
numeric_features = X.select_dtypes(include=['float64', 'int64']).columns
categorical_features = X.select_dtypes(include=['object']).columns

preprocessor = ColumnTransformer([
    ('num', StandardScaler(), numeric_features),
    ('cat', OneHotEncoder(drop='first'), categorical_features)
])

# -------------------------------
# 5. Build Logistic Regression Model
# -------------------------------
model_por = Pipeline([
    ('preprocess', preprocessor),
    ('logreg', LogisticRegression(max_iter=2000, class_weight='balanced'))
])

# -------------------------------
# 6. Train-Test Split
# -------------------------------
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# -------------------------------
# 7. Train Model
# -------------------------------
model_por.fit(X_train, y_train)

# -------------------------------
# 8. Evaluate Model
# -------------------------------
y_pred = model_por.predict(X_test)
y_prob = model_por.predict_proba(X_test)[:, 1]

print("📌 G3_por Pass/Fail Model Performance")
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
print("Recall:", recall_score(y_test, y_pred))
print("F1 Score:", f1_score(y_test, y_pred))
print("ROC-AUC:", roc_auc_score(y_test, y_prob))

# Cross-validation ROC-AUC
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cv_auc = cross_val_score(model_por, X, y, cv=cv, scoring='roc_auc')
print("CV ROC-AUC Mean:", cv_auc.mean(), " | Std:", cv_auc.std())

print("\nClassification Report:\n", classification_report(y_test, y_pred, target_names=['Fail (≤15)', 'Pass (>15)']))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))

# -------------------------------
# 9. Feature Importance (Coefficients & Odds Ratios)
# -------------------------------
logreg = model_por.named_steps['logreg']
pre = model_por.named_steps['preprocess']

encoded_features = np.concatenate([
    numeric_features,
    pre.named_transformers_['cat'].get_feature_names_out(categorical_features)
])

coef_df_por = pd.DataFrame({
    'Feature': encoded_features,
    'Coefficient': logreg.coef_[0],
    'OddsRatio': np.exp(logreg.coef_[0])
}).sort_values(by='Coefficient', ascending=False)

print("\nTop Positive Predictors (More Likely to Pass Portuguese):\n", coef_df_por.head(10))
print("\nTop Negative Predictors (Higher → More Likely to Fail):\n", coef_df_por.tail(10))


📌 G3_por Pass/Fail Model Performance
Accuracy: 0.676923076923077
Precision: 0.21739130434782608
Recall: 0.625
F1 Score: 0.3225806451612903
ROC-AUC: 0.6869517543859649
CV ROC-AUC Mean: 0.696570796460177  | Std: 0.05497964088479665

Classification Report:
               precision    recall  f1-score   support

  Fail (≤15)       0.93      0.68      0.79       114
  Pass (>15)       0.22      0.62      0.32        16

    accuracy                           0.68       130
   macro avg       0.57      0.65      0.56       130
weighted avg       0.84      0.68      0.73       130

Confusion Matrix:
 [[78 36]
 [ 6 10]]

Top Positive Predictors (More Likely to Pass Portuguese):
            Feature  Coefficient  OddsRatio
25      higher_yes     1.651193   5.213197
16     Fjob_health     0.856613   2.355170
18   Fjob_services     0.786520   2.195741
19    Fjob_teacher     0.620713   1.860254
0             Medu     0.547811   1.729463
17      Fjob_other     0.365254   1.440880
26    internet
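
One possible way to compare the two subjects is to align the coefficient tables on the encoded feature name. The sketch below hard-codes four coefficients copied from the printed Math and Portuguese outputs. In the notebook, the full `coef_df` and `coef_df_por` frames could be merged the same way. The column name `Diff_por_minus_math` is illustrative.

```python
import pandas as pd

# Coefficients copied from the printed outputs (features visible in both tables).
coef_math = pd.DataFrame({
    "Feature": ["Medu", "higher_yes", "Fjob_health", "Fjob_teacher"],
    "Coefficient": [0.761710, 0.709080, 0.613805, 1.070378],
})
coef_por = pd.DataFrame({
    "Feature": ["higher_yes", "Fjob_health", "Fjob_teacher", "Medu"],
    "Coefficient": [1.651193, 0.856613, 0.620713, 0.547811],
})

# Align on the feature name and compute the gap between subjects.
comparison = coef_math.merge(coef_por, on="Feature", suffixes=("_math", "_por"))
comparison["Diff_por_minus_math"] = (
    comparison["Coefficient_por"] - comparison["Coefficient_math"]
)

print(comparison.sort_values("Diff_por_minus_math", ascending=False).to_string(index=False))
```

Differences like these are descriptive only. Given the spread of the cross-validated ROC-AUC scores, a gap between two coefficients should not be read as a confirmed subject-specific effect without further testing.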