# Feature preprocessing
Student performance dataset: https://archive.ics.uci.edu/ml/datasets/student+performance

# Attributes for both student-mat.csv (Math course) and student-por.csv (Portuguese language course) datasets:
1. school - student's school (binary: "GP" - Gabriel Pereira or "MS" - Mousinho da Silveira)
2. sex - student's sex (binary: "F" - female or "M" - male)
3. age - student's age (numeric: from 15 to 22)
4. address - student's home address type (binary: "U" - urban or "R" - rural)
5. famsize - family size (binary: "LE3" - less or equal to 3 or "GT3" - greater than 3)
6. Pstatus - parent's cohabitation status (binary: "T" - living together or "A" - apart)
7. Medu - mother's education (numeric: 0 - none,  1 - primary education (4th grade), 2 – 5th to 9th grade, 3 – secondary education or 4 – higher education)
8. Fedu - father's education (numeric: 0 - none,  1 - primary education (4th grade), 2 – 5th to 9th grade, 3 – secondary education or 4 – higher education)
9. Mjob - mother's job (nominal: "teacher", "health" care related, civil "services" (e.g. administrative or police), "at_home" or "other")
10. Fjob - father's job (nominal: "teacher", "health" care related, civil "services" (e.g. administrative or police), "at_home" or "other")
11. reason - reason to choose this school (nominal: close to "home", school "reputation", "course" preference or "other")
12. guardian - student's guardian (nominal: "mother", "father" or "other")
13. traveltime - home to school travel time (numeric: 1 - <15 min., 2 - 15 to 30 min., 3 - 30 min. to 1 hour, or 4 - >1 hour)
14. studytime - weekly study time (numeric: 1 - <2 hours, 2 - 2 to 5 hours, 3 - 5 to 10 hours, or 4 - >10 hours)
15. failures - number of past class failures (numeric: n if 1<=n<3, else 4)
16. schoolsup - extra educational support (binary: yes or no)
17. famsup - family educational support (binary: yes or no)
18. paid - extra paid classes within the course subject (Math or Portuguese) (binary: yes or no)
19. activities - extra-curricular activities (binary: yes or no)
20. nursery - attended nursery school (binary: yes or no)
21. higher - wants to take higher education (binary: yes or no)
22. internet - Internet access at home (binary: yes or no)
23. romantic - with a romantic relationship (binary: yes or no)
24. famrel - quality of family relationships (numeric: from 1 - very bad to 5 - excellent)
25. freetime - free time after school (numeric: from 1 - very low to 5 - very high)
26. goout - going out with friends (numeric: from 1 - very low to 5 - very high)
27. Dalc - workday alcohol consumption (numeric: from 1 - very low to 5 - very high)
28. Walc - weekend alcohol consumption (numeric: from 1 - very low to 5 - very high)
29. health - current health status (numeric: from 1 - very bad to 5 - very good)
30. absences - number of school absences (numeric: from 0 to 93)

# These grades are related to the course subject, Math or Portuguese:
31. G1 - first period grade (numeric: from 0 to 20)
32. G2 - second period grade (numeric: from 0 to 20)
33. G3 - final grade (numeric: from 0 to 20, output target)

Additional note: there are several (382) students that belong to both datasets.
These students can be identified by searching for identical attributes
that characterize each student, as shown in the annexed R file.
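
The annexed R file is not reproduced here, so the following is only a minimal pandas sketch of the same idea: an inner join of the two course tables on the attributes that identify a student. The list in `KEY_COLUMNS` and the helper `find_shared_students` are assumptions made for illustration, not taken from this notebook.

```python
import pandas as pd

# Assumed identifying attributes (hypothetical choice, adjust to match the R file).
KEY_COLUMNS = ["school", "sex", "age", "address", "famsize", "Pstatus",
               "Medu", "Fedu", "Mjob", "Fjob", "reason", "nursery", "internet"]


def find_shared_students(mat, por, keys=KEY_COLUMNS):
    """Inner-join the Math and Portuguese tables on the identifying attributes."""
    return mat.merge(por, on=list(keys), suffixes=("_mat", "_por"))


# Tiny made-up tables, only to show the mechanics of the join.
toy_mat = pd.DataFrame({"school": ["GP", "GP"], "sex": ["F", "M"],
                        "age": [15, 16], "G3": [10, 12]})
toy_por = pd.DataFrame({"school": ["GP", "MS"], "sex": ["F", "M"],
                        "age": [15, 16], "G3": [13, 9]})

shared = find_shared_students(toy_mat, toy_por, keys=["school", "sex", "age"])
print(shared)
```

With the real files, `mat` and `por` would be the frames read from `student-mat.csv` and `student-por.csv`; columns that are not join keys get the `_mat` / `_por` suffixes.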


In [548]:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, LabelEncoder
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
import seaborn as sns
from matplotlib import pyplot as plt
import warnings
warnings.simplefilter('ignore')

### Step 1. Load the dataset

In [549]:
df = pd.read_csv('student-por.csv', sep=';')
df.head()

Unnamed: 0,school,sex,age,address,famsize,Pstatus,Medu,Fedu,Mjob,Fjob,...,famrel,freetime,goout,Dalc,Walc,health,absences,G1,G2,G3
0,GP,F,18,U,GT3,A,4,4,at_home,teacher,...,4,3,4,1,1,3,4,0,11,11
1,GP,F,17,U,GT3,T,1,1,at_home,other,...,5,3,3,1,1,3,2,9,11,11
2,GP,F,15,U,LE3,T,1,1,at_home,other,...,4,3,2,2,3,3,6,12,13,12
3,GP,F,15,U,GT3,T,4,2,health,services,...,3,2,2,1,1,5,0,14,14,14
4,GP,F,16,U,GT3,T,3,3,other,other,...,4,3,2,1,2,5,0,11,13,13


### Let's look at the features of our dataset.
The dataset contains a lot of `object` (string) features, which we have to encode as numbers before passing them to an ML algorithm.
There are also some categorical features stored as numeric codes, such as `Medu, Fedu, famrel, freetime, goout, Dalc, Walc, health`.

In [550]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 649 entries, 0 to 648
Data columns (total 33 columns):
school        649 non-null object
sex           649 non-null object
age           649 non-null int64
address       649 non-null object
famsize       649 non-null object
Pstatus       649 non-null object
Medu          649 non-null int64
Fedu          649 non-null int64
Mjob          649 non-null object
Fjob          649 non-null object
reason        649 non-null object
guardian      649 non-null object
traveltime    649 non-null int64
studytime     649 non-null int64
failures      649 non-null int64
schoolsup     649 non-null object
famsup        649 non-null object
paid          649 non-null object
activities    649 non-null object
nursery       649 non-null object
higher        649 non-null object
internet      649 non-null object
romantic      649 non-null object
famrel        649 non-null int64
freetime      649 non-null int64
goout         649 non-null int64
Dalc          649 no
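
To decide how each string feature should be encoded, it helps to count its distinct values. Below is a small sketch on made-up data (the frame `toy` is hypothetical); `select_dtypes(exclude="number")` is used instead of `'object'` only so that the example does not depend on how pandas stores strings.

```python
import pandas as pd

# Made-up miniature of the dataset: two string columns and one integer column.
toy = pd.DataFrame({
    "sex": ["F", "M", "F"],
    "Mjob": ["teacher", "other", "health"],
    "age": [15, 16, 17],
})

# Keep only the non-numeric columns and count distinct values per column.
n_unique = toy.select_dtypes(exclude="number").nunique()

binary = n_unique[n_unique == 2].index.tolist()  # candidates for 0/1 coding
multi = n_unique[n_unique > 2].index.tolist()    # candidates for dummy columns
print(binary, multi)
```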

### Step 2. Splitting the dataset
Let's split our dataset into 3 parts, `object_features`, `numeric_features` and `numeric_categorical_features`, to preprocess them separately.

In [551]:
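# String-valued columns and integer columns; copies keep later in-place edits away from df
object_features = df.select_dtypes('object').copy()
numeric_features = df.select_dtypes('int64').copy()

# Integer-coded columns that are really categories
numeric_categorial = ['Medu', 'Fedu', 'famrel', 'freetime', 'goout', 'Dalc', 'Walc', 'health']
numeric_features.drop(numeric_categorial, axis=1, inplace=True)
numeric_categorical_features = df.loc[:, numeric_categorial].copy()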

In [552]:
print(object_features.shape, numeric_features.shape, numeric_categorical_features.shape)

(649, 17) (649, 8) (649, 8)
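
The three shapes above add up to the 33 original columns (17 + 8 + 8). The same three-way split can be sketched on a toy frame; the names `toy` and `ordinal_cols` are made up for this illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "sex": ["F", "M"],   # string feature
    "age": [15, 16],     # plain numeric feature
    "Medu": [4, 1],      # integer code that is really a category
    "G3": [11, 12],      # numeric target
})
ordinal_cols = ["Medu"]

object_part = toy.select_dtypes(exclude="number").copy()
numeric_part = toy.select_dtypes(include="number").drop(columns=ordinal_cols)
ordinal_part = toy[ordinal_cols].copy()

# Every column should land in exactly one of the three groups.
n_cols = object_part.shape[1] + numeric_part.shape[1] + ordinal_part.shape[1]
print(n_cols == toy.shape[1])
```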


### Step 3. Object features preprocessing
Let's start with the `object_features`. Some features take only two values; let's call them `binary_features` and apply `LabelEncoder` to them. It assigns codes in sorted order of the labels, so values like `yes/no` become `1/0`.

In [553]:
label_encoder = LabelEncoder()
binary_features = ['Pstatus', 'school', 'sex', 'schoolsup', 'paid', 'activities', 'nursery', 'famsup', 'higher', 'internet', 'romantic', 'famsize', 'address']
# The encoder is refit on every column, so each column gets its own 0/1 coding
for i in binary_features:
    object_features[i] = label_encoder.fit_transform(object_features[i])
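
Because the loop above reuses one encoder, the label-to-code mapping of each column is lost after the next `fit_transform`. A small sketch on made-up data that keeps the mapping per column (the `mappings` dictionary is a name introduced here, not part of the notebook):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

toy = pd.DataFrame({"schoolsup": ["yes", "no", "no"], "sex": ["F", "M", "F"]})

mappings = {}
for col in toy.columns:
    encoder = LabelEncoder()
    toy[col] = encoder.fit_transform(toy[col])
    # classes_ is sorted, and the position of a label is its integer code
    mappings[col] = {str(label): code for code, label in enumerate(encoder.classes_)}

print(mappings)
print(toy)
```

Storing the mapping makes it easy to read the encoded table later, for example to confirm that `0` in `sex` stands for `F`.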


### Step 3.1
For the other object features, which take more than two values (`Mjob, Fjob, reason, guardian`), we are going to apply a dummy (one-hot) transformation.

In [554]:
dummy_features = object_features.loc[:, ['Mjob', 'Fjob', 'reason', 'guardian']]
dummy_data = pd.get_dummies(dummy_features)
dummy_data.head()

Unnamed: 0,Mjob_at_home,Mjob_health,Mjob_other,Mjob_services,Mjob_teacher,Fjob_at_home,Fjob_health,Fjob_other,Fjob_services,Fjob_teacher,reason_course,reason_home,reason_other,reason_reputation,guardian_father,guardian_mother,guardian_other
0,1,0,0,0,0,0,0,0,0,1,1,0,0,0,0,1,0
1,1,0,0,0,0,0,0,1,0,0,1,0,0,0,1,0,0
2,1,0,0,0,0,0,0,1,0,0,0,0,1,0,0,1,0
3,0,1,0,0,0,0,0,0,1,0,0,1,0,0,0,1,0
4,0,0,1,0,0,0,0,1,0,0,0,1,0,0,1,0,0
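
The table above shows `0/1` values (stored as `uint8`). Recent pandas releases return boolean dummy columns by default, so a sketch that asks for integers explicitly may be useful; the data below is made up.

```python
import pandas as pd

toy = pd.DataFrame({"guardian": ["mother", "father", "other", "mother"]})

# dtype=int requests 0/1 integers regardless of the pandas default
dummies = pd.get_dummies(toy, dtype=int)
print(dummies)

# Each row has exactly one active dummy per original column.
rows_ok = bool(dummies.sum(axis=1).eq(1).all())
print(rows_ok)
```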


Then we drop `'Mjob', 'Fjob', 'reason', 'guardian'` from `object_features` and concatenate `object_features` with `dummy_data`. We get the same `object_features` as in Step 2, but with every feature encoded as a number.

In [555]:
object_features.drop(labels=['Mjob', 'Fjob', 'reason', 'guardian'], axis=1, inplace=True)

In [556]:
object_features = pd.concat([object_features, dummy_data], axis=1)
object_features.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 649 entries, 0 to 648
Data columns (total 30 columns):
school               649 non-null int64
sex                  649 non-null int64
address              649 non-null int64
famsize              649 non-null int64
Pstatus              649 non-null int64
schoolsup            649 non-null int64
famsup               649 non-null int64
paid                 649 non-null int64
activities           649 non-null int64
nursery              649 non-null int64
higher               649 non-null int64
internet             649 non-null int64
romantic             649 non-null int64
Mjob_at_home         649 non-null uint8
Mjob_health          649 non-null uint8
Mjob_other           649 non-null uint8
Mjob_services        649 non-null uint8
Mjob_teacher         649 non-null uint8
Fjob_at_home         649 non-null uint8
Fjob_health          649 non-null uint8
Fjob_other           649 non-null uint8
Fjob_services        649 non-null uint8
Fjob_teacher   

### Step 4. Working with numeric categorical features.
Here we should do the same operation as in Step 3.1, but by default `pd.get_dummies` leaves numeric columns unchanged and only encodes `object` and `category` columns, so we first need to convert `numeric_categorical_features` to the `category` dtype.

In [557]:
for i in numeric_categorial:
    numeric_categorical_features[i] = numeric_categorical_features[i].astype('category')
numeric_categorical_features.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 649 entries, 0 to 648
Data columns (total 8 columns):
Medu        649 non-null category
Fedu        649 non-null category
famrel      649 non-null category
freetime    649 non-null category
goout       649 non-null category
Dalc        649 non-null category
Walc        649 non-null category
health      649 non-null category
dtypes: category(8)
memory usage: 6.7 KB


In [558]:
numeric_data = pd.get_dummies(numeric_categorical_features)
numeric_data.head()

Unnamed: 0,Medu_0,Medu_1,Medu_2,Medu_3,Medu_4,Fedu_0,Fedu_1,Fedu_2,Fedu_3,Fedu_4,...,Walc_1,Walc_2,Walc_3,Walc_4,Walc_5,health_1,health_2,health_3,health_4,health_5
0,0,0,0,0,1,0,0,0,0,1,...,1,0,0,0,0,0,0,1,0,0
1,0,1,0,0,0,0,1,0,0,0,...,1,0,0,0,0,0,0,1,0,0
2,0,1,0,0,0,0,1,0,0,0,...,0,0,1,0,0,0,0,1,0,0
3,0,0,0,0,1,0,0,1,0,0,...,1,0,0,0,0,0,0,0,0,1
4,0,0,0,1,0,0,0,0,1,0,...,0,1,0,0,0,0,0,0,0,1
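
As an alternative to the `astype('category')` conversion, `pd.get_dummies` also accepts a `columns` argument that names the columns to encode, whatever their dtype. A minimal sketch on made-up data:

```python
import pandas as pd

toy = pd.DataFrame({"Medu": [4, 1, 3], "age": [18, 17, 16]})

# Without `columns`, integer columns pass through unchanged.
untouched = pd.get_dummies(toy)

# With `columns`, the listed integer column is expanded into dummy columns.
encoded = pd.get_dummies(toy, columns=["Medu"], dtype=int)

print(list(untouched.columns))
print(list(encoded.columns))
```

Note that dummy columns are created only for the levels that actually occur in the data.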


In [559]:
cat_feat = pd.concat([object_features, numeric_data], axis=1)
df = pd.concat([cat_feat, numeric_features], axis=1)
df.head()

Unnamed: 0,school,sex,address,famsize,Pstatus,schoolsup,famsup,paid,activities,nursery,...,health_4,health_5,age,traveltime,studytime,failures,absences,G1,G2,G3
0,0,0,1,0,0,1,0,0,0,1,...,0,0,18,2,2,0,4,0,11,11
1,0,0,1,0,1,0,1,0,0,0,...,0,0,17,1,2,0,2,9,11,11
2,0,0,1,1,1,1,0,0,0,1,...,0,0,15,1,2,0,6,12,13,12
3,0,0,1,0,1,0,1,0,1,1,...,0,1,15,1,3,0,0,14,14,14
4,0,0,1,0,1,0,1,0,0,1,...,0,1,16,1,2,0,0,11,13,13


## Machine Learning
This is a regression task, so we import regression models. For this task we are going to use `LinearRegression, Ridge, Lasso, XGBRegressor`.

In [560]:
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from xgboost import XGBRegressor

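Splitting the dataset into the feature matrix and the target feature. `X` takes every column except the last one (`G3`), so the period grades `G1` and `G2` stay among the predictors.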

In [263]:
X = df.iloc[:, :-1]
y = df['G3']

Splitting into training and test sets

In [456]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, test_size=0.3)
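
With 649 rows and `test_size=0.3`, the split sizes can be checked on placeholder arrays of the same length (the arrays below are synthetic stand-ins, not the student data):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with the same number of rows as the dataset.
X_demo = np.arange(649 * 2).reshape(649, 2)
y_demo = np.arange(649)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, random_state=42, test_size=0.3
)
print(len(X_tr), len(X_te))
```

The test share is rounded up, so 195 rows go to the test set and 454 remain for training.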

In [563]:
lin_reg = LinearRegression()
lin_reg.fit(X_train, y_train)
lin_pred = lin_reg.predict(X_test)
lin_mse = mean_squared_error(y_test, lin_pred)
# For regressors, .score() returns R^2 (coefficient of determination), not classification accuracy
train_score = lin_reg.score(X_train, y_train)
test_score = lin_reg.score(X_test, y_test)

print('Train set accuracy: {0:2f}\nTest set accuracy: {1:2f}'.format(train_score, test_score))
print('Mean square error: {0}'.format(lin_mse))

Train set accuracy: 0.856842
Test set accuracy: 0.855881
Mean square error: 1.59369046672319
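
The "accuracy" printed above is the value of `.score()`, which for scikit-learn regressors is the coefficient of determination R². A minimal sketch on synthetic data (not the student data) shows that it matches `r2_score` and how an RMSE can be derived from the MSE:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Synthetic regression problem, used only to illustrate the metrics.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.3, size=200)

model = LinearRegression().fit(X_demo, y_demo)
pred = model.predict(X_demo)

r2_from_score = model.score(X_demo, y_demo)  # what the notebook prints as "accuracy"
r2_from_metric = r2_score(y_demo, pred)      # the same quantity, computed explicitly
rmse = np.sqrt(mean_squared_error(y_demo, pred))
print(r2_from_score, r2_from_metric, rmse)
```

Taking the square root of the MSE reported above (about 1.594) gives an RMSE of roughly 1.26 grade points on the 0 to 20 scale.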


In [564]:
lasso = Lasso(alpha=0.5)
lasso.fit(X_train, y_train)
lasso_pred = lasso.predict(X_test)
lasso_mse = mean_squared_error(y_test, lasso_pred)
# .score() is R^2 here as well
train_score = lasso.score(X_train, y_train)
test_score = lasso.score(X_test, y_test)

print('Train set accuracy: {0:2f}\nTest set accuracy: {1:2f}'.format(train_score, test_score))
print('Mean square error: {0}'.format(lasso_mse))

Train set accuracy: 0.826927
Test set accuracy: 0.884352
Mean square error: 1.278849735654452


In [566]:
ridge = Ridge(alpha=0.5)
ridge.fit(X_train, y_train)
ridge_pred = ridge.predict(X_test)
ridge_mse = mean_squared_error(y_test, ridge_pred)
# .score() is R^2 here as well
train_score = ridge.score(X_train, y_train)
test_score = ridge.score(X_test, y_test)

print('Train set accuracy: {0:2f}\nTest set accuracy: {1:2f}'.format(train_score, test_score))
print('Mean square error: {0}'.format(ridge_mse))

Train set accuracy: 0.856832
Test set accuracy: 0.856998
Mean square error: 1.5813390533565268


In [567]:
xgb = XGBRegressor()
xgb.fit(X_train, y_train)
xgb_pred = xgb.predict(X_test)
xgb_mse = mean_squared_error(y_test, xgb_pred)
# .score() is R^2 here as well
train_score = xgb.score(X_train, y_train)
test_score = xgb.score(X_test, y_test)

print('Train set accuracy: {0:2f}\nTest set accuracy: {1:2f}'.format(train_score, test_score))
print('Mean square error: {0}'.format(xgb_mse))

Train set accuracy: 0.946292
Test set accuracy: 0.871694
Mean square error: 1.418821265638591
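
All of the scores above come from a single train/test split. `cross_val_score` is imported at the top of the notebook but never used; the sketch below shows one way it could compare the linear models. It runs on synthetic data, so the numbers it prints say nothing about the student dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.model_selection import cross_val_score

# Synthetic data standing in for X and y.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(300, 5))
y_demo = X_demo @ np.array([2.0, 0.0, -1.0, 0.0, 0.5]) + rng.normal(scale=0.5, size=300)

models = {
    "linear": LinearRegression(),
    "ridge": Ridge(alpha=0.5),
    "lasso": Lasso(alpha=0.5),
}

cv_scores = {}
for name, model in models.items():
    # 5-fold cross-validation, scored with R^2 on each held-out fold
    scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring="r2")
    cv_scores[name] = scores.mean()
    print(f"{name}: mean R^2 = {scores.mean():.3f} (std {scores.std():.3f})")
```

Averaging over folds gives a steadier estimate than one split, which matters for a dataset of only 649 rows.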


That's it. This dataset can also be treated as a classification task with the question "Will the student pass the exam or not?". For that we just need to encode the target feature as ones and zeros: if the student's grade is less than 10, it's zero (student failed); if the grade is greater than or equal to 10, it's one (student passed). But that is for next time :)
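
The pass/fail coding described above can be sketched in one line; the grades below are made up, and `passed` is a name introduced only for this example.

```python
import pandas as pd

# Made-up final grades on the 0-20 scale.
grades = pd.Series([0, 9, 10, 14, 20], name="G3")

# 1 if the grade is at least 10 (passed), otherwise 0 (failed)
passed = (grades >= 10).astype(int)
print(passed.tolist())
```

On the real data the same expression would be applied to `df['G3']` to build the classification target.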