# Student Grades Regression Model

In this notebook you will build a regression model to predict student grades.

## Imports

In [1]:
import pandas as pd
import numpy as np

## 1. Dataset

This dataset comes from Kaggle. It records student grades and alcohol consumption, along with information about each student's family:

https://www.kaggle.com/uciml/student-alcohol-consumption/kernels

In [2]:
raw_data = pd.read_csv('/data/student-alcohol-consumption/student-mat.csv')

In [3]:
raw_data.head()

Unnamed: 0,school,sex,age,address,famsize,Pstatus,Medu,Fedu,Mjob,Fjob,...,famrel,freetime,goout,Dalc,Walc,health,absences,G1,G2,G3
0,GP,F,18,U,GT3,A,4,4,at_home,teacher,...,4,3,4,1,1,3,6,5,6,6
1,GP,F,17,U,GT3,T,1,1,at_home,other,...,5,3,3,1,1,3,4,5,5,6
2,GP,F,15,U,LE3,T,1,1,at_home,other,...,4,3,2,2,3,3,10,7,8,10
3,GP,F,15,U,GT3,T,4,2,health,services,...,3,2,2,1,1,5,2,15,14,15
4,GP,F,16,U,GT3,T,3,3,other,other,...,4,3,2,1,2,5,4,6,10,10


In [4]:
raw_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 395 entries, 0 to 394
Data columns (total 33 columns):
school        395 non-null object
sex           395 non-null object
age           395 non-null int64
address       395 non-null object
famsize       395 non-null object
Pstatus       395 non-null object
Medu          395 non-null int64
Fedu          395 non-null int64
Mjob          395 non-null object
Fjob          395 non-null object
reason        395 non-null object
guardian      395 non-null object
traveltime    395 non-null int64
studytime     395 non-null int64
failures      395 non-null int64
schoolsup     395 non-null object
famsup        395 non-null object
paid          395 non-null object
activities    395 non-null object
nursery       395 non-null object
higher        395 non-null object
internet      395 non-null object
romantic      395 non-null object
famrel        395 non-null int64
freetime      395 non-null int64
goout         395 non-null int64
Dalc          395 no

## 2. Features

Create a feature `DataFrame`, `X`, with the following columns:

* `Dalc` (weekday alcohol consumption)
* `Walc` (weekend alcohol consumption)
* `Medu` (mother's education level)
* `Fedu` (father's education level)
* `traveltime`
* `studytime`
* `goout`
* `romantic` (one hot encoded, with `get_dummies` and `drop_first=True, prefix='romantic'`)
* `higher` (one hot encoded, with `get_dummies` and `drop_first=True, prefix='higher'`)
* `sex` (one hot encoded, with `get_dummies` and `drop_first=True, prefix='sex'`)
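
As a small illustration of what `drop_first=True` does, the sketch below encodes a made-up three-row frame (the `toy` data is hypothetical, not taken from the dataset). Each two-level column collapses to a single indicator column, because the first category in sorted order is dropped.

```python
import pandas as pd

# Hypothetical toy data, only for illustration.
toy = pd.DataFrame({
    "studytime": [2, 3, 1],
    "sex": ["F", "M", "F"],
    "higher": ["yes", "yes", "no"],
})

# One indicator column per encoded feature: the first category
# in sorted order ("F" and "no") is dropped.
encoded = pd.get_dummies(
    toy, columns=["sex", "higher"], drop_first=True, prefix=["sex", "higher"]
)
print(list(encoded.columns))
```

Dropping one level avoids a redundant column: `sex_M` alone already determines whether a row is `F` or `M`.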

In [5]:
features = ['Dalc', 'Walc', 'Medu', 'Fedu', 'traveltime', 'studytime', 'goout', 'romantic', 'higher', 'sex']
to_one_hot = ['higher', 'sex', 'romantic']
X = raw_data.loc[:, features]
X = pd.get_dummies(X, columns=to_one_hot, drop_first=True, prefix=to_one_hot)

In [6]:
print(X.columns)

Index(['Dalc', 'Walc', 'Medu', 'Fedu', 'traveltime', 'studytime', 'goout',
       'higher_yes', 'sex_M', 'romantic_yes'],
      dtype='object')


In [7]:
assert list(X.columns)==['Dalc', 'Walc', 'Medu', 'Fedu', 'traveltime', 'studytime', 'goout',
       'higher_yes', 'sex_M', 'romantic_yes']

In [8]:
X.head()

Unnamed: 0,Dalc,Walc,Medu,Fedu,traveltime,studytime,goout,higher_yes,sex_M,romantic_yes
0,1,1,4,4,2,2,4,1,0,0
1,1,1,1,1,1,2,3,1,0,0
2,2,3,1,1,1,2,2,1,0,0
3,1,1,4,2,1,3,2,1,0,1
4,1,2,3,3,1,2,2,1,0,0


Create the target column `y` from the `G3` column (final grade):

In [9]:
y = raw_data.G3

In [10]:
assert list(y.value_counts().values)==[56, 47, 38, 33, 32, 31, 31, 28, 27, 16, 15, 12,  9,  7,  6,  5,  1,
        1]

In [11]:
from sklearn.model_selection import train_test_split

In [12]:
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, train_size=276, test_size=119)

In [13]:
assert Xtrain.shape==(276,10)
assert Xtest.shape==(119,10)
assert ytrain.shape==(276,)
assert ytest.shape==(119,)
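
The split above does not fix a seed, so the rows in each partition, and therefore the scores reported later, can change from run to run. A minimal sketch of a reproducible split, assuming a fixed `random_state` is acceptable for this exercise (the `X_demo` and `y_demo` arrays are made up):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 20 rows, 2 features.
X_demo = np.arange(40).reshape(20, 2)
y_demo = np.arange(20)

# The same random_state yields the same partition on every call.
split_a = train_test_split(X_demo, y_demo, train_size=14, test_size=6, random_state=0)
split_b = train_test_split(X_demo, y_demo, train_size=14, test_size=6, random_state=0)

same = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print(same)
```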

## 3. Regression model

In [14]:
from sklearn.linear_model import LinearRegression, Lasso, Ridge
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
from sklearn.metrics import r2_score
from sklearn.model_selection import GridSearchCV

In the following cells, create and tune regression pipelines built on the following models:

* `LinearRegression`
* `RandomForestRegressor`
* `Ridge`

For each of the models:

* Create a pipeline with a `PolynomialFeatures` preprocessor first and model second.
* Compute the $R^2$ score for both the training and test datasets.
* Tune the model parameters, including the polynomial degree, to balance the bias and variance of the model.
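
When tuning such a pipeline, parameters are addressed as `<step name>__<parameter>`. The sketch below shows how `make_pipeline` derives the step names; `Ridge` is used only as an example estimator.

```python
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

pipe = make_pipeline(PolynomialFeatures(2), Ridge())

# Step names are the lowercased class names of each stage.
step_names = [name for name, _ in pipe.steps]
print(step_names)

# Nested parameters are exposed under '<step>__<parameter>' keys.
params = pipe.get_params()
print("polynomialfeatures__degree" in params, "ridge__alpha" in params)
```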

Create, fit, tune and predict using the `LinearRegression` model here:

In [15]:
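def PolynomialRegression(model, degree=2, **kwargs):
    """Build a pipeline: polynomial feature expansion, then `model(**kwargs)`."""
    return make_pipeline(PolynomialFeatures(degree), model(**kwargs))

# make_pipeline names each step after its lowercased class name,
# so parameters are addressed as '<step>__<parameter>'.
linear_param_grid = {
    'polynomialfeatures__degree': np.arange(4),  # degrees 0, 1, 2, 3
    'linearregression__fit_intercept': [True, False],
    # Note: `normalize` was removed from LinearRegression in scikit-learn 1.2.
    'linearregression__normalize': [True, False]
}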

linear_grid = GridSearchCV(PolynomialRegression(LinearRegression), linear_param_grid, cv=7)
linear_grid.fit(X, y)
print(linear_grid.best_params_)

{'polynomialfeatures__degree': 1, 'linearregression__normalize': True, 'linearregression__fit_intercept': False}
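
Note that the search above is fitted on all of `X` and `y`, so the rows later held out as `Xtest` also influence the choice of hyperparameters. A minimal sketch of a variant that tunes on the training split only, using synthetic data (`X_demo`, `y_demo`, and the small grid are assumptions made for illustration, not the notebook's values):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic, nearly linear data standing in for the real features.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(120, 3))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=120)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)

grid = GridSearchCV(
    make_pipeline(PolynomialFeatures(), Ridge()),
    {"polynomialfeatures__degree": [1, 2], "ridge__alpha": [0.01, 0.1, 1.0]},
    cv=5,
)
grid.fit(X_tr, y_tr)               # cross-validation sees training rows only
test_r2 = grid.score(X_te, y_te)   # R^2 of the refitted best pipeline on unseen rows
print(grid.best_params_, test_r2)
```

With the default `refit=True`, the best pipeline is refitted on the whole training split, so `grid` can be scored on the test split directly.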


Compute and print the training and test $R^2$ scores here:

In [16]:
linear_model = linear_grid.best_estimator_
linear_model.fit(Xtrain, ytrain)
linear_ytrain_predict = linear_model.predict(Xtrain)
linear_ytest_predict = linear_model.predict(Xtest)

print("Train:", r2_score(ytrain, linear_ytrain_predict))
print("Test:", r2_score(ytest, linear_ytest_predict))

Train: 0.131764270796
Test: 0.0670502699842
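
For reference, $R^2 = 1 - SS_{res} / SS_{tot}$, where $SS_{res}$ is the sum of squared residuals and $SS_{tot}$ is the sum of squared deviations of the targets from their mean. The sketch below checks this against `r2_score`, using the five `G3` values shown in `raw_data.head()` and made-up predictions (`y_pred` is hypothetical).

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([6.0, 6.0, 10.0, 15.0, 10.0])  # first five G3 values
y_pred = np.array([7.0, 7.0, 9.0, 13.0, 11.0])   # hypothetical predictions

ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
r2_manual = 1.0 - ss_res / ss_tot

print(r2_manual, r2_score(y_true, y_pred))
```

A score near 0 means the model does little better than always predicting the mean grade.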


Create, fit, tune and predict using the `RandomForestRegressor` model here:

In [17]:
rforest_param_grid = {
    'polynomialfeatures__degree': np.arange(4),
    'randomforestregressor__n_estimators': np.arange(50, 100, 5)
}

rforest_grid = GridSearchCV(PolynomialRegression(RandomForestRegressor), rforest_param_grid, cv=5,
                            verbose=1, n_jobs=3)
rforest_grid.fit(X, y)
print(rforest_grid.best_params_)

Fitting 5 folds for each of 40 candidates, totalling 200 fits


[Parallel(n_jobs=3)]: Done  82 tasks      | elapsed:    3.8s
[Parallel(n_jobs=3)]: Done 200 out of 200 | elapsed:   54.2s finished


{'polynomialfeatures__degree': 0, 'randomforestregressor__n_estimators': 95}


Compute and print the training and test $R^2$ scores here:

In [18]:
rforest_model = rforest_grid.best_estimator_
rforest_model.fit(Xtrain, ytrain)
rforest_ytrain_predict = rforest_model.predict(Xtrain)
rforest_ytest_predict = rforest_model.predict(Xtest)

print("Train:", r2_score(ytrain, rforest_ytrain_predict))
print("Test:", r2_score(ytest, rforest_ytest_predict))

Train: -4.54558435006e-05
Test: -0.00839966355545
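
These near-zero scores are consistent with the selected `polynomialfeatures__degree` of 0. Assuming `PolynomialFeatures(0)` emits only the constant bias column, the forest has nothing to split on, and every tree predicts the mean of its bootstrap sample. The sketch below reproduces the effect on made-up targets with an explicit constant column (`X_const` and `y_demo` are hypothetical).

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
y_demo = rng.integers(0, 21, size=200).astype(float)  # grades on a 0-20 scale
X_const = np.ones((200, 1))                           # a single constant feature

forest = RandomForestRegressor(n_estimators=95, random_state=0).fit(X_const, y_demo)
pred = forest.predict(X_const)

# Every row gets the same prediction, close to the mean of y_demo,
# so R^2 on the training data is at most (and very close to) zero.
r2_const = r2_score(y_demo, pred)
print(np.ptp(pred), r2_const)
```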


Create, fit, tune and predict using the `Ridge` model here:

In [19]:
ridge_param_grid = {
    'polynomialfeatures__degree': np.arange(4),
    'ridge__fit_intercept': [True, False],
    # Note: `normalize` was removed from Ridge in scikit-learn 1.2.
    'ridge__normalize': [True, False],
    'ridge__alpha': np.linspace(0.01, 0.5, num=5)
}

ridge_grid = GridSearchCV(PolynomialRegression(Ridge), ridge_param_grid, cv=7, verbose=1)
ridge_grid.fit(X, y)
print(ridge_grid.best_params_)

Fitting 7 folds for each of 80 candidates, totalling 560 fits
{'polynomialfeatures__degree': 1, 'ridge__fit_intercept': True, 'ridge__normalize': True, 'ridge__alpha': 0.255}


[Parallel(n_jobs=1)]: Done 560 out of 560 | elapsed:   10.9s finished


Compute and print the training and test $R^2$ scores here:

In [20]:
ridge_model = ridge_grid.best_estimator_
ridge_model.fit(Xtrain, ytrain)
ridge_ytrain_predict = ridge_model.predict(Xtrain)
ridge_ytest_predict = ridge_model.predict(Xtest)

print("Train:", r2_score(ytrain, ridge_ytrain_predict))
print("Test:", r2_score(ytest, ridge_ytest_predict))

Train: 0.124966311167
Test: 0.0808930337889
