# Student Performance Data Preprocessing

The following list contains the data transformations that we intend to apply to the dataset:
1. Concatenate math and Portuguese data
2. Split data into train and test sets
3. Remove columns that do not help our model generalize for all students
4. Remove columns with little or no correlation with the target variable to reduce noise in the dataset
5. Remove columns that are highly correlated with other columns to reduce multicollinearity in the dataset
6. Standardize or normalize all numeric columns so they are on the same scale
7. Encode the ordinal categorical variables
8. One-hot encode the nominal categorical variables

See DataExploration.ipynb for justification of the above transformations.

Ultimately, we want to evaluate our models using cross-validation. However, to shorten training time and let us experiment faster, we will first try out algorithms on a single train-test split and then apply cross-validation only to the models that perform best on that split.
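
The two-stage evaluation described above could look roughly like the sketch below. It is only an illustration: the data is synthetic, `LinearRegression` is a stand-in model, and the names `X_demo`, `y_demo`, `holdout_r2`, and `cv_scores` are hypothetical and do not appear elsewhere in this notebook.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in data (assumption: NOT the student dataset)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

# Stage 1: quick screening on a single hold-out split
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.1, random_state=12, shuffle=True
)
model = LinearRegression().fit(X_tr, y_tr)
holdout_r2 = model.score(X_te, y_te)

# Stage 2: slower but more stable estimate from 5-fold cross-validation,
# reserved for the candidates that looked promising in stage 1
cv_scores = cross_val_score(LinearRegression(), X_demo, y_demo, cv=5, scoring="r2")

print(f"hold-out R^2: {holdout_r2:.3f}")
print(f"5-fold R^2:   {cv_scores.mean():.3f} +/- {cv_scores.std():.3f}")
```

The hold-out score depends on a single random partition, which is why it is used here only for fast screening.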

### Import libraries

In [36]:
import pandas as pd
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import train_test_split

### Define constants

In [37]:
RAND_STATE = 12
TEST_SIZE = 0.1
TRAIN_FILE = 'data/processed/train.csv'
TEST_FILE = 'data/processed/test.csv'

### Read data into memory

In [38]:
math_data = pd.read_csv(filepath_or_buffer = 'data/student-mat.csv', sep=';', header=0)
port_data = pd.read_csv(filepath_or_buffer = 'data/student-por.csv', sep=';', header=0)

In [39]:
df = pd.concat([math_data, port_data])
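
One detail worth knowing about the concatenation: `pd.concat` keeps each frame's original row labels by default, so the combined frame contains repeated index values. The sketch below uses two tiny made-up frames (`a` and `b` are hypothetical stand-ins for the math and Portuguese data) to show the effect and the `ignore_index=True` alternative.

```python
import pandas as pd

# Two tiny made-up frames standing in for the math and Portuguese data
a = pd.DataFrame({"G3": [10, 12]})
b = pd.DataFrame({"G3": [14, 16]})

kept = pd.concat([a, b])                      # original labels are preserved
fresh = pd.concat([a, b], ignore_index=True)  # labels are renumbered from 0

print(kept.index.tolist())   # [0, 1, 0, 1]
print(fresh.index.tolist())  # [0, 1, 2, 3]
```

Repeated labels are harmless for the positional operations used later in this notebook, but they matter for any label-based lookup or alignment.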

### Split data into train and test sets

In [40]:
X = df.loc[:, df.columns != 'G3']
y = df['G3']

In [41]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = TEST_SIZE, random_state = RAND_STATE, shuffle=True)

### Define preprocessing transformations

In [42]:
# define numerical attributes to keep
# note: ordinal variables (such as traveltime and studytime) are treated as numeric
# because they are already integer-encoded in the raw data
numeric_cols_to_keep = ['age', 'Medu', 'traveltime', 'studytime', 'failures', 'goout', 'Dalc', 'absences']
# define nominal attributes to keep
nominal_cols_to_keep = ['address', 'Fjob', 'guardian', 'higher', 'internet', 'romantic']

In [43]:
ct = ColumnTransformer(
         transformers = [("numeric", MinMaxScaler(), numeric_cols_to_keep),
                         ("nominal", OneHotEncoder(drop='if_binary', handle_unknown='error'), nominal_cols_to_keep)],
         remainder = 'drop',
         n_jobs = -1
)
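
To make the effect of this transformer concrete, here is a minimal sketch on a three-row toy frame. The values are invented, `toy` and `toy_ct` are hypothetical names, and `sparse_threshold=0` is added only so the result prints as a dense array; the transformer defined above leaves that setting at its default.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Toy frame with made-up values; column names borrowed from the dataset
toy = pd.DataFrame({
    "age": [15, 17, 19],
    "internet": ["yes", "no", "yes"],           # binary: becomes one column
    "guardian": ["mother", "father", "other"],  # three categories: three columns
    "school": ["GP", "MS", "GP"],               # not listed below: dropped
})

toy_ct = ColumnTransformer(
    transformers=[
        ("numeric", MinMaxScaler(), ["age"]),
        ("nominal", OneHotEncoder(drop="if_binary"), ["internet", "guardian"]),
    ],
    remainder="drop",
    sparse_threshold=0,  # assumption for this demo: always return a dense array
)

toy_out = toy_ct.fit_transform(toy)
print(toy_out.shape)  # (3, 5): 1 scaled + 1 binary + 3 one-hot columns
print(toy_out)
```

`MinMaxScaler` maps `age` onto [0, 1], `drop='if_binary'` keeps a single indicator column for two-category features such as `internet`, and `remainder='drop'` discards every column that is not named in a transformer.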

### Transform data

In [45]:
X_train_processed = ct.fit_transform(X_train)
X_test_processed = ct.transform(X_test)

In [66]:
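# get column names from transformers
# note: get_feature_names was removed in scikit-learn 1.2; get_feature_names_out requires scikit-learn >= 1.0
nominal_col_names = ct.named_transformers_['nominal'].get_feature_names_out(nominal_cols_to_keep)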
transformed_col_names = np.concatenate([numeric_cols_to_keep, nominal_col_names])

Convert the NumPy arrays back into pandas DataFrames and add the target variable back

In [67]:
train_df = pd.DataFrame(X_train_processed, columns = transformed_col_names)
train_df['G3'] = np.array(y_train)
test_df = pd.DataFrame(X_test_processed, columns = transformed_col_names)
test_df['G3'] = np.array(y_test)
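
The target is wrapped in `np.array(...)` before being assigned. A likely reason, sketched below with made-up values, is that assigning a pandas Series to a DataFrame column aligns on index labels, while the rebuilt DataFrames have a fresh 0..n-1 index that no longer matches the shuffled labels carried by `y_train` and `y_test`. The names `y_demo`, `features`, `aligned`, and `positional` are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical target whose index labels were shuffled by a train-test split
y_demo = pd.Series([10, 12, 14], index=[7, 9, 5], name="G3")
# Rebuilt feature frame with a fresh index 0, 1, 2
features = pd.DataFrame({"age": [0.0, 0.5, 1.0]})

aligned = features.copy()
aligned["G3"] = y_demo               # aligns on labels; none match, so all NaN

positional = features.copy()
positional["G3"] = np.array(y_demo)  # plain array: assigned by position

print(aligned["G3"].isna().all())  # True
print(positional["G3"].tolist())   # [10, 12, 14]
```

Converting to an array keeps each target value on the same row as the features it was split with.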

### Write processed data to disk

In [69]:
!mkdir -p 'data/processed'

In [70]:
train_df.to_csv(path_or_buf = TRAIN_FILE, header=True)
test_df.to_csv(path_or_buf = TEST_FILE, header=True)
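
Because `to_csv` writes the row index by default, the files above contain an extra unnamed first column. The sketch below shows one way a downstream notebook might read such a file back. It uses a temporary directory and a made-up frame (`demo_df` and `demo_path` are hypothetical), not the real output files.

```python
import os
import tempfile

import pandas as pd

demo_df = pd.DataFrame({"age": [0.0, 1.0], "G3": [10, 12]})

with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "train.csv")  # hypothetical path
    demo_df.to_csv(path_or_buf=demo_path, header=True)  # index is written too
    naive = pd.read_csv(demo_path)                  # index returns as a data column
    restored = pd.read_csv(demo_path, index_col=0)  # first column used as index

print(naive.columns.tolist())     # ['Unnamed: 0', 'age', 'G3']
print(restored.columns.tolist())  # ['age', 'G3']
```

Reading with `index_col=0` keeps the stored index out of the feature columns, so it is not mistaken for a feature during modeling.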