# 🧹 Step 2: Data Preprocessing

Goal: Prepare the dataset for machine learning.

**Tasks:**
1. Load the cleaned student dataset from `data/processed/student.csv` (or the raw file if it has not been processed yet).
2. Handle missing values (drop or impute).
3. Encode categorical variables (LabelEncoder or OneHotEncoder).
4. Normalize numerical features (MinMaxScaler or StandardScaler).
5. Separate features (X) and target (y = G3).
6. Split into training and testing sets (e.g., 80/20).

**Bonus:**
- Save the processed data to `data/processed/`.
- Include comments for each block of code.


In [1]:
# 1. Import required libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, MinMaxScaler, StandardScaler
import os

In [2]:
# 2. Load the dataset (prefer processed, fallback to raw)
processed_path = '../data/processed/student.csv'
raw_path = '../data/raw/student-mat.csv'

if os.path.exists(processed_path):
    df = pd.read_csv(processed_path)
    print('Loaded processed dataset.')
else:
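    # Try a comma-delimited parse first; if it fails outright or does not
    # yield the expected columns, re-read the file as semicolon-delimited.
    try:
        df = pd.read_csv(raw_path)
    except pd.errors.ParserError:
        df = None
    if df is None or 'G1' not in df.columns:
        df = pd.read_csv(raw_path, sep=';')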
    print('Loaded raw dataset.')
print('Columns:', df.columns.tolist())
print(df.head())

Loaded raw dataset.
Columns: ['school', 'sex', 'age', 'address', 'famsize', 'Pstatus', 'Medu', 'Fedu', 'Mjob', 'Fjob', 'reason', 'guardian', 'traveltime', 'studytime', 'failures', 'schoolsup', 'famsup', 'paid', 'activities', 'nursery', 'higher', 'internet', 'romantic', 'famrel', 'freetime', 'goout', 'Dalc', 'Walc', 'health', 'absences', 'G1', 'G2', 'G3']
  school sex  age address famsize Pstatus  Medu  Fedu     Mjob      Fjob  ...  \
0     GP   F   18       U     GT3       A     4     4  at_home   teacher  ...   
1     GP   F   17       U     GT3       T     1     1  at_home     other  ...   
2     GP   F   15       U     LE3       T     1     1  at_home     other  ...   
3     GP   F   15       U     GT3       T     4     2   health  services  ...   
4     GP   F   16       U     GT3       T     3     3    other     other  ...   

  famrel freetime  goout  Dalc  Walc health absences  G1  G2  G3  
0      4        3      4     1     1      3        6   5   6   6  
1      5        3     

In [3]:
# 3. Handle missing values (drop or impute)
# For simplicity, drop rows with missing values
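# dropna() removes whole rows, so count rows (not individual cells) with gaps
rows_with_missing = df.isnull().any(axis=1).sum()
if rows_with_missing > 0:
    print(f'Dropping {rows_with_missing} rows with missing values.')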
    df = df.dropna()
else:
    print('No missing values found.')

No missing values found.
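
The task list also allows imputing instead of dropping. Since this dataset has no gaps, the following is only a minimal sketch on a hypothetical toy frame (`toy`, `num_cols`, and `cat_cols` are illustrative names, not part of the notebook): numeric columns are filled with the median and categorical columns with the most frequent value.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with gaps; column names only mirror the student dataset.
toy = pd.DataFrame({
    'age': [18.0, np.nan, 15.0, 16.0],
    'absences': [6.0, 4.0, np.nan, 2.0],
    'Mjob': ['at_home', np.nan, 'health', 'at_home'],
})

num_cols = ['age', 'absences']
cat_cols = ['Mjob']

imputed = toy.copy()
# Numeric columns: fill gaps with the column median (robust to outliers).
for col in num_cols:
    imputed[col] = imputed[col].fillna(imputed[col].median())
# Categorical columns: fill gaps with the most frequent category.
for col in cat_cols:
    imputed[col] = imputed[col].fillna(imputed[col].mode().iloc[0])

print(imputed)
```

In a real pipeline the fill values would ideally be computed on the training split only and then reused for the test split.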


In [4]:
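# 4. Encode categorical variables (one-hot encoding via pd.get_dummies for non-numeric columns, excluding the target)
# drop_first=True drops one level per variable to avoid redundant (collinear) dummy columns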
categorical_cols = df.select_dtypes(include=['object']).columns.tolist()
if 'G3' in categorical_cols:
    categorical_cols.remove('G3')
df_encoded = pd.get_dummies(df, columns=categorical_cols, drop_first=True)
print('Categorical variables encoded.')
print(df_encoded.head())

Categorical variables encoded.
   age  Medu  Fedu  traveltime  studytime  failures  famrel  freetime  goout  \
0   18     4     4           2          2         0       4         3      4   
1   17     1     1           1          2         0       5         3      3   
2   15     1     1           1          2         3       4         3      2   
3   15     4     2           1          3         0       3         2      2   
4   16     3     3           1          2         0       4         3      2   

   Dalc  ...  guardian_mother  guardian_other  schoolsup_yes  famsup_yes  \
0     1  ...             True           False           True       False   
1     1  ...            False           False          False        True   
2     2  ...             True           False           True       False   
3     1  ...             True           False          False        True   
4     1  ...            False           False          False        True   

   paid_yes  activities_yes  nu

In [5]:
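# 5. Standardize numerical features to zero mean and unit variance (excluding the target G3)
# Boolean dummy columns are not matched by include='number', so they stay unscaled.
# Note: fitting the scaler on the full dataset before the split lets test-set
# statistics influence the transform; fitting on the training split only avoids this.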
scaler = StandardScaler()
numeric_cols = df_encoded.select_dtypes(include='number').columns.tolist()
numeric_cols.remove('G3')
df_encoded[numeric_cols] = scaler.fit_transform(df_encoded[numeric_cols])
print('Numerical features normalized.')
print(df_encoded.head())

Numerical features normalized.
        age      Medu      Fedu  traveltime  studytime  failures    famrel  \
0  1.023046  1.143856  1.360371    0.792251  -0.042286 -0.449944  0.062194   
1  0.238380 -1.600009 -1.399970   -0.643249  -0.042286 -0.449944  1.178860   
2 -1.330954 -1.600009 -1.399970   -0.643249  -0.042286  3.589323  0.062194   
3 -1.330954  1.143856 -0.479857   -0.643249   1.150779 -0.449944 -1.054472   
4 -0.546287  0.229234  0.440257   -0.643249  -0.042286 -0.449944  0.062194   

   freetime     goout      Dalc  ...  guardian_mother  guardian_other  \
0 -0.236010  0.801479 -0.540699  ...             True           False   
1 -0.236010 -0.097908 -0.540699  ...            False           False   
2 -0.236010 -0.997295  0.583385  ...             True           False   
3 -1.238419 -0.997295 -0.540699  ...             True           False   
4 -0.236010 -0.997295 -0.540699  ...            False           False   

   schoolsup_yes  famsup_yes  paid_yes  activities_yes  nurse
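
The task list offers `MinMaxScaler` as an alternative to `StandardScaler`. The difference is easiest to see on a tiny hypothetical array (`values` is an illustrative name): standardization centers on the mean and divides by the standard deviation, while min-max scaling maps the observed range onto [0, 1].

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical single feature with three observations.
values = np.array([[1.0], [2.0], [3.0]])

# (x - mean) / std, using the population standard deviation.
standardized = StandardScaler().fit_transform(values)

# (x - min) / (max - min), so the result lies in [0, 1].
rescaled = MinMaxScaler().fit_transform(values)

print(standardized.ravel())  # approximately [-1.2247, 0.0, 1.2247]
print(rescaled.ravel())      # [0.0, 0.5, 1.0]
```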

In [6]:
# 6. Separate features (X) and target (y = G3)
X = df_encoded.drop('G3', axis=1)
y = df_encoded['G3']
print('Features and target separated.')
print('X shape:', X.shape)
print('y shape:', y.shape)

Features and target separated.
X shape: (395, 41)
y shape: (395,)


In [7]:
# 7. Split into training and testing sets (80/20)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print('Data split into training and testing sets.')
print('X_train shape:', X_train.shape)
print('X_test shape:', X_test.shape)

Data split into training and testing sets.
X_train shape: (316, 41)
X_test shape: (79, 41)
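
One possible variation on the order of steps, sketched here on synthetic data rather than the student dataset: split first, then fit the scaler on the training rows only and reuse it for the test rows. All names below (`demo`, `X_tr`, `X_te`, `demo_scaler`) are hypothetical and chosen to avoid clashing with the notebook's variables.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; values are random, not taken from the dataset.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    'age': rng.integers(15, 23, size=50),
    'absences': rng.integers(0, 30, size=50),
    'G3': rng.integers(0, 21, size=50),
})
X_demo = demo.drop(columns='G3').astype(float)
y_demo = demo['G3']

# Split before any statistics are computed.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)
X_tr, X_te = X_tr.copy(), X_te.copy()

feature_cols = ['age', 'absences']
demo_scaler = StandardScaler()
# Learn mean and standard deviation from the training rows only...
X_tr[feature_cols] = demo_scaler.fit_transform(X_tr[feature_cols])
# ...and apply the same transform to the held-out rows.
X_te[feature_cols] = demo_scaler.transform(X_te[feature_cols])

print(X_tr.shape, X_te.shape)
```

With this ordering the test set plays no part in choosing the scaling parameters, so its columns will generally not have exactly zero mean after the transform.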


In [8]:
# 8. Save the processed data to data/processed/
os.makedirs('../data/processed', exist_ok=True)  # create the folder if it does not exist yet
X_train.to_csv('../data/processed/X_train.csv', index=False)
X_test.to_csv('../data/processed/X_test.csv', index=False)
y_train.to_csv('../data/processed/y_train.csv', index=False)
y_test.to_csv('../data/processed/y_test.csv', index=False)
print('Processed data saved to data/processed/.')

Processed data saved to data/processed/.
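
When the saved files are read back in a later step, note that a Series written with `to_csv` comes back as a one-column DataFrame. A minimal round-trip sketch, using a temporary directory and a hypothetical Series `y_demo` instead of the real files:

```python
import os
import tempfile

import pandas as pd

# Hypothetical target values; the name 'G3' becomes the CSV header.
y_demo = pd.Series([6, 6, 10], name='G3')

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'y_train.csv')
    y_demo.to_csv(path, index=False)
    # read_csv returns a DataFrame; select the column to get a Series again.
    reloaded = pd.read_csv(path)['G3']

print(reloaded.tolist())
```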
