# Preprocessing 

In this notebook, we prepare the UCI Student Performance dataset for machine learning.
We will:
- Identify numerical and categorical features
- Encode categorical variables
- Standardize numeric features (zero mean, unit variance)
- Create target variables for regression and classification
- Split the data into train/test sets
- Save the processed data for the modeling stage


In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler


In [2]:
# The raw file is semicolon-separated, so the separator must be given explicitly
df = pd.read_csv("../data/raw/student_performance_full.csv", sep=";")

In [3]:
df.head()

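| | school | sex | age | address | famsize | Pstatus | Medu | Fedu | Mjob | Fjob | ... | famrel | freetime | goout | Dalc | Walc | health | absences | G1 | G2 | G3 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | GP | F | 18 | U | GT3 | A | 4 | 4 | at_home | teacher | ... | 4 | 3 | 4 | 1 | 1 | 3 | 4 | 0 | 11 | 11 |
| 1 | GP | F | 17 | U | GT3 | T | 1 | 1 | at_home | other | ... | 5 | 3 | 3 | 1 | 1 | 3 | 2 | 9 | 11 | 11 |
| 2 | GP | F | 15 | U | LE3 | T | 1 | 1 | at_home | other | ... | 4 | 3 | 2 | 2 | 3 | 3 | 6 | 12 | 13 | 12 |
| 3 | GP | F | 15 | U | GT3 | T | 4 | 2 | health | services | ... | 3 | 2 | 2 | 1 | 1 | 5 | 0 | 14 | 14 | 14 |
| 4 | GP | F | 16 | U | GT3 | T | 3 | 3 | other | other | ... | 4 | 3 | 2 | 1 | 2 | 5 | 0 | 11 | 13 | 13 |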
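
A quick sanity check right after loading can catch a wrong separator or missing values early. The sketch below is only an illustration: `raw` and `toy` are hypothetical stand-ins for the real file and `df`, built from a three-row sample in the same semicolon-separated layout.

```python
import io

import pandas as pd

# Hypothetical three-row sample in the same ";"-separated layout as the raw file.
raw = "school;sex;age;G3\nGP;F;18;11\nGP;F;17;11\nMS;M;15;12\n"
toy = pd.read_csv(io.StringIO(raw), sep=";")

# With the wrong separator every row collapses into a single column,
# so the column count is a cheap check that sep=";" was correct.
n_rows, n_cols = toy.shape

# Missing values per column; all zeros means there is nothing to impute.
missing = toy.isna().sum()

print(n_rows, n_cols)
print(missing)
```

On the real data, the same two checks would be `df.shape` and `df.isna().sum()`.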


Identify Numerical and Categorical Features

In [5]:
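# Integer and float columns are treated as numeric features
numeric_cols = df.select_dtypes(include=['int64', 'float64']).columns
# String (object) columns are treated as categorical features
categorical_cols = df.select_dtypes(include=['object']).columns
numeric_cols, categorical_cols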

(Index(['age', 'Medu', 'Fedu', 'traveltime', 'studytime', 'failures', 'famrel',
        'freetime', 'goout', 'Dalc', 'Walc', 'health', 'absences', 'G1', 'G2',
        'G3'],
       dtype='object'),
 Index(['school', 'sex', 'address', 'famsize', 'Pstatus', 'Mjob', 'Fjob',
        'reason', 'guardian', 'schoolsup', 'famsup', 'paid', 'activities',
        'nursery', 'higher', 'internet', 'romantic'],
       dtype='object'))
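
The explicit `['int64', 'float64']` list works for this file, but it would skip columns stored with another width such as `int32`. A slightly more general variant, shown here on a hypothetical toy frame (`toy` is not part of the notebook), uses the `"number"` selector instead:

```python
import pandas as pd

# Hypothetical toy frame with an int32 column that ['int64', 'float64'] would miss.
toy = pd.DataFrame({
    "age": pd.Series([18, 17, 15], dtype="int32"),
    "absences": [4.0, 2.0, 6.0],
    "school": ["GP", "GP", "MS"],
})

# "number" matches every integer and float dtype, whatever its width.
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
# Everything that is not numeric is treated as categorical in this sketch.
categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()

print(numeric_cols, categorical_cols)
```

Note that `exclude="number"` also keeps boolean columns, which is harmless here because the raw data has none.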

Explore Unique Values of Categorical Columns

In [6]:
# List the distinct values of each categorical column to see how many levels it has
for col in categorical_cols:
    print(col, ":", df[col].unique())


school : ['GP' 'MS']
sex : ['F' 'M']
address : ['U' 'R']
famsize : ['GT3' 'LE3']
Pstatus : ['A' 'T']
Mjob : ['at_home' 'health' 'other' 'services' 'teacher']
Fjob : ['teacher' 'other' 'services' 'health' 'at_home']
reason : ['course' 'other' 'home' 'reputation']
guardian : ['mother' 'father' 'other']
schoolsup : ['yes' 'no']
famsup : ['no' 'yes']
paid : ['no' 'yes']
activities : ['no' 'yes']
nursery : ['yes' 'no']
higher : ['yes' 'no']
internet : ['no' 'yes']
romantic : ['no' 'yes']
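
The listing shows that most categorical columns are binary, while `Mjob`, `Fjob`, `reason`, and `guardian` have more levels. As a sketch, the same split can be computed with `nunique()`; the frame `toy` and the names `binary_cols` and `multi_cols` are hypothetical.

```python
import pandas as pd

# Hypothetical toy frame with two binary columns and one three-level column.
toy = pd.DataFrame({
    "school": ["GP", "GP", "MS"],
    "guardian": ["mother", "father", "other"],
    "internet": ["no", "yes", "yes"],
})

# Number of distinct values per column, in column order.
levels = toy.nunique()

binary_cols = levels[levels == 2].index.tolist()
multi_cols = levels[levels > 2].index.tolist()

print(binary_cols, multi_cols)
```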


One-Hot Encoding

In [7]:
# drop_first=True drops the first level of each column, so a k-level column yields k-1 dummies
df_encoded = pd.get_dummies(df, columns=categorical_cols, drop_first=True)
df_encoded.head()


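| | age | Medu | Fedu | traveltime | studytime | failures | famrel | freetime | goout | Dalc | ... | guardian_mother | guardian_other | schoolsup_yes | famsup_yes | paid_yes | activities_yes | nursery_yes | higher_yes | internet_yes | romantic_yes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 18 | 4 | 4 | 2 | 2 | 0 | 4 | 3 | 4 | 1 | ... | True | False | True | False | False | False | True | True | False | False |
| 1 | 17 | 1 | 1 | 1 | 2 | 0 | 5 | 3 | 3 | 1 | ... | False | False | False | True | False | False | False | True | True | False |
| 2 | 15 | 1 | 1 | 1 | 2 | 0 | 4 | 3 | 2 | 2 | ... | True | False | True | False | False | False | True | True | True | False |
| 3 | 15 | 4 | 2 | 1 | 3 | 0 | 3 | 2 | 2 | 1 | ... | True | False | False | True | False | True | True | True | True | True |
| 4 | 16 | 3 | 3 | 1 | 2 | 0 | 4 | 3 | 2 | 1 | ... | False | False | False | True | False | False | True | True | False | False |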
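
To make the effect of `drop_first=True` concrete, here is a minimal sketch on a hypothetical toy frame (`toy` is not part of the notebook). The alphabetically first level of each column becomes the implicit baseline, which is why the output above has `guardian_mother` and `guardian_other` but no `guardian_father`.

```python
import pandas as pd

# Hypothetical toy frame with one three-level and one binary column.
toy = pd.DataFrame({
    "guardian": ["mother", "father", "other", "mother"],
    "internet": ["no", "yes", "yes", "no"],
})

# The first level of each column ("father", "no") is dropped and becomes the baseline.
encoded = pd.get_dummies(toy, columns=["guardian", "internet"], drop_first=True)

print(encoded.columns.tolist())
# Row 1 has guardian == "father": both guardian dummies are False there.
print(encoded.loc[1])
```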


Standardize Numeric Features

In [9]:
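scaler = StandardScaler()
df_scaled = df_encoded.copy()

# Standardize every int/float column; the boolean dummy columns are left unchanged.
# Note: G3 is an integer column, so the regression target is standardized too,
# and the scaler is fit on the full dataset, before the train/test split.
num_features = df_scaled.select_dtypes(include=['int64', 'float64']).columns
df_scaled[num_features] = scaler.fit_transform(df_scaled[num_features])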
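
The cell above fits the scaler on all rows, so test-set statistics influence the training features. A common alternative, sketched here under stated assumptions and not the method this notebook uses, is to split first, fit the scaler on the training rows only, and leave the target in grade units. The frame `toy` and its values are hypothetical.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical toy frame standing in for df_encoded.
toy = pd.DataFrame({
    "age": [15, 16, 17, 18, 15, 16, 17, 18, 19, 20],
    "absences": [0, 2, 4, 6, 8, 10, 0, 2, 4, 6],
    "G3": [11, 11, 12, 14, 13, 9, 15, 10, 8, 16],
})

X = toy.drop(columns=["G3"])
y = toy["G3"]  # the target is kept in grade units

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit on the training rows only, then apply the same transform to the test rows.
scaler = StandardScaler()
X_train_scaled = pd.DataFrame(
    scaler.fit_transform(X_train), columns=X.columns, index=X_train.index
)
X_test_scaled = pd.DataFrame(
    scaler.transform(X_test), columns=X.columns, index=X_test.index
)

print(X_train_scaled.mean().round(6))
```

The training columns have mean zero by construction; the test columns generally do not, which is expected.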


Create the Regression Target

In [10]:
target_reg = "G3"
# Features are every column except the target; both come from the scaled frame
X_reg = df_scaled.drop(columns=[target_reg])
y_reg = df_scaled[target_reg]
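
Because `G3` is an integer column, the scaling step standardized it together with the features, so `y_reg` is in standard-deviation units, not grade points. A minimal sketch of converting such values back, using the fitted scaler's `mean_` and `scale_` attributes, is shown below on a hypothetical toy frame (`toy`, `g3_pos`, and `recovered` are illustrative names).

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy frame; G3 is scaled together with the feature, as in the notebook.
toy = pd.DataFrame({
    "age": [15.0, 16.0, 17.0, 18.0],
    "G3": [8.0, 11.0, 14.0, 15.0],
})

scaler = StandardScaler()
scaled = pd.DataFrame(scaler.fit_transform(toy), columns=toy.columns)

# mean_ and scale_ are ordered like the columns passed to fit().
g3_pos = list(toy.columns).index("G3")
g3_mean = scaler.mean_[g3_pos]
g3_std = scaler.scale_[g3_pos]

# Undo the standardization: z * std + mean gives grade points again.
recovered = scaled["G3"] * g3_std + g3_mean

print(recovered.tolist())
```

The same formula applies to model predictions made on the standardized target.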


Create Classification Target (risk / medium / good)

We convert the final grade (G3) into 3 categories:

- 0–9   → risk
- 10–14 → medium
- 15–20 → good


In [11]:
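def categorize_grade(g):
    """Map a final grade (0-20) to 'risk', 'medium', or 'good'."""
    if g < 10:
        return "risk"
    elif g < 15:
        return "medium"
    else:
        return "good"

# Categories are derived from the raw (unscaled) G3 in df
df_encoded["G3_category"] = df["G3"].apply(categorize_grade)

# Classification features come from df_encoded, so they are not standardized
X_clf = df_encoded.drop(columns=["G3", "G3_category"])
y_clf = df_encoded["G3_category"]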
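
The same binning can also be written with `pd.cut`, which avoids the row-by-row `apply`. This is a sketch of an equivalent formulation, not a replacement the notebook requires; the bin edges are an assumption chosen to match the three ranges listed above, and `grades` is a hypothetical series covering every grade from 0 to 20.

```python
import pandas as pd


def categorize_grade(g):
    """Same thresholds as the notebook's function."""
    if g < 10:
        return "risk"
    elif g < 15:
        return "medium"
    else:
        return "good"


# Hypothetical series covering every possible grade.
grades = pd.Series(range(21), name="G3")

# Right-closed bins: (-1, 9] -> risk, (9, 14] -> medium, (14, 20] -> good.
categories = pd.cut(grades, bins=[-1, 9, 14, 20], labels=["risk", "medium", "good"])

# Both approaches agree on every grade in 0..20.
same = (categories.astype(str) == grades.apply(categorize_grade)).all()
print(same)
```

One difference to keep in mind: `pd.cut` returns `NaN` for values outside the bins (for example 21), whereas the function would label them "good".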


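Split the Data into Train/Test Sets

In [12]:
# 80/20 split. Both calls use the same random_state on the same number of rows,
# so the regression and classification splits contain the same students.
X_train_reg, X_test_reg, y_train_reg, y_test_reg = train_test_split(
    X_reg, y_reg, test_size=0.2, random_state=42
)

X_train_clf, X_test_clf, y_train_clf, y_test_clf = train_test_split(
    X_clf, y_clf, test_size=0.2, random_state=42
)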
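
For the classification task, the three classes may be unevenly sized, and a plain random split can shift their proportions between train and test. One option, shown as a sketch on hypothetical toy data and not used in the cell above, is the `stratify` argument of `train_test_split`:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 5 risk, 10 medium, 5 good.
X = pd.DataFrame({"x": range(20)})
y = pd.Series(["risk"] * 5 + ["medium"] * 10 + ["good"] * 5, name="G3_category")

# stratify=y keeps the class proportions the same in both parts of the split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(y_test.value_counts().to_dict())
print(y_train.value_counts().to_dict())
```

With stratification the classification split would no longer pick the same rows as the regression split, so this is a trade-off to decide on deliberately.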


Save the Processed Data

In [13]:
# Write each split to ../data/processed/ (the directory must already exist)
X_train_reg.to_csv("../data/processed/X_train_reg.csv", index=False)
X_test_reg.to_csv("../data/processed/X_test_reg.csv", index=False)
y_train_reg.to_csv("../data/processed/y_train_reg.csv", index=False)
y_test_reg.to_csv("../data/processed/y_test_reg.csv", index=False)

X_train_clf.to_csv("../data/processed/X_train_clf.csv", index=False)
X_test_clf.to_csv("../data/processed/X_test_clf.csv", index=False)
y_train_clf.to_csv("../data/processed/y_train_clf.csv", index=False)
y_test_clf.to_csv("../data/processed/y_test_clf.csv", index=False)
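
`to_csv` does not create missing directories, so the save cell fails on a fresh checkout if `../data/processed/` is absent. A minimal sketch of creating the folder first and checking the round trip follows; it writes to a temporary directory (`base` is a hypothetical stand-in for the project root) so it can run anywhere.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Temporary directory standing in for the project root (hypothetical).
base = Path(tempfile.mkdtemp())
out_dir = base / "data" / "processed"

# Create the folder and any missing parents; do nothing if it already exists.
out_dir.mkdir(parents=True, exist_ok=True)

y_train = pd.Series([11, 12, 14], name="G3")
y_train.to_csv(out_dir / "y_train_reg.csv", index=False)

# The Series name is written as the header, so it comes back as a column.
reloaded = pd.read_csv(out_dir / "y_train_reg.csv")["G3"]
print(reloaded.tolist())
```

In the notebook itself, the equivalent call would be `Path("../data/processed").mkdir(parents=True, exist_ok=True)` before the first `to_csv`.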


# Conclusion

The dataset is now ready for the modeling stage.
We produced:
- An encoded dataset (`df_encoded`)
- A scaled dataset (`df_scaled`)
- Regression features (X_reg) and target (G3, standardized along with the other numeric columns)
- Classification features (X_clf, taken from the unscaled encoded dataset) and target (G3_category)
- Train/test splits for both tasks
- Processed CSV files saved under `../data/processed/`

Next step:
Open `03-regression-models.ipynb` to train the first regression models.
