In this notebook we go a step further in fitting a model that predicts the final grades for the Portuguese and the Mathematics classes. We now also use the factor variables (except the school variable, which only indicates which school the pupil attended). To do this we convert the factor variables from symbolic values such as "yes" / "no" to numerical values that are suitable for the Lasso technique.

In [10]:
import numpy as np
import pandas as pd
from sklearn import linear_model
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split

students_combined = pd.read_csv('../data/students-combined.csv', sep = ';')
students_combined = students_combined.drop(columns = 'Unnamed: 0')

students_combined

Unnamed: 0,school,sex,age,address,famsize,Pstatus,Medu,Fedu,Mjob,Fjob,...,health,absences_x,G1_x,G2_x,G3_x,paid_y,absences_y,G1_y,G2_y,G3_y
0,GP,F,18,U,GT3,A,4,4,at_home,teacher,...,3,4,0,11,11,no,6,5,6,6
1,GP,F,17,U,GT3,T,1,1,at_home,other,...,3,2,9,11,11,no,4,5,5,6
2,GP,F,15,U,GT3,T,4,2,health,services,...,5,0,14,14,14,yes,2,15,14,15
3,GP,F,16,U,GT3,T,3,3,other,other,...,5,0,11,13,13,yes,4,6,10,10
4,GP,M,16,U,LE3,T,4,3,services,other,...,5,6,12,12,13,yes,10,15,15,15
5,GP,M,16,U,LE3,T,2,2,other,other,...,3,0,13,12,13,no,0,12,12,11
6,GP,F,17,U,GT3,A,4,4,other,teacher,...,1,2,10,13,13,no,6,6,5,6
7,GP,M,15,U,LE3,A,3,2,services,other,...,1,0,15,16,17,yes,0,16,18,19
8,GP,M,15,U,GT3,T,3,4,other,other,...,5,0,12,12,13,yes,0,14,15,15
9,GP,F,15,U,GT3,T,4,4,teacher,health,...,2,2,14,14,14,yes,0,10,8,9


We use the following encoding: 
- "yes" becomes 1 and "no" becomes -1
- "F" becomes 1 and "M" becomes -1
- "U" becomes 1 and "R" becomes -1
- "LE3" becomes 1 and "GT3" becomes -1
- "T" becomes 1 and "A" becomes -1
- for Mjob and Fjob: "teacher" becomes 1, "health" becomes 2, "services" becomes 3, "at_home" becomes 4 and "other" becomes 5
- for reason: "close to home" ("home" in the data) becomes 1, "school reputation" ("reputation" in the data) becomes 2, "course" becomes 3 and "other" becomes 4
- for guardian: "mother" becomes 1, "father" becomes 2 and "other" becomes 3

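Traveltime, age and school are ignored as variables (this was decided by analysing the features of the data sets). Note that sex, although listed above, is not encoded in the cell below and does not appear in any of the feature lists.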
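The encoding listed above can also be written as explicit lookup tables, which makes a typo in a label easier to spot than in a chain of conditional expressions. The following is only a sketch: the helper `encode`, the dictionary names and the two-row `toy` frame are hypothetical and not part of the notebook; the labels are the ones used in the encoding cell.

```python
import pandas as pd

# Hypothetical lookup tables mirroring the encoding described above.
ADDRESS_MAP = {'U': 1, 'R': -1}
REASON_MAP = {'home': 1, 'reputation': 2, 'course': 3, 'other': 4}
YES_NO_MAP = {'yes': 1, 'no': -1}

def encode(df, column_maps):
    """Return a copy of df with each listed column replaced by its numeric code."""
    encoded = df.copy()
    for column, mapping in column_maps.items():
        codes = encoded[column].map(mapping)
        # Labels missing from the mapping become NaN, so mistakes surface here.
        if codes.isna().any():
            raise ValueError('unmapped value in column {}'.format(column))
        encoded[column] = codes.astype(int)
    return encoded

# Tiny made-up frame, only to show the call.
toy = pd.DataFrame({'address': ['U', 'R'],
                    'reason': ['course', 'other'],
                    'paid_x': ['yes', 'no']})
toy_encoded = encode(toy, {'address': ADDRESS_MAP,
                           'reason': REASON_MAP,
                           'paid_x': YES_NO_MAP})
print(toy_encoded)
```

With such tables, a label that is not covered raises an error instead of silently falling into the last `else` branch.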

In [11]:
students_combined['paid_y']     = students_combined['paid_y'].apply(lambda x: -1 if x == "no" else 1)
students_combined['paid_x']     = students_combined['paid_x'].apply(lambda x: -1 if x == "no" else 1)
students_combined['address']    = students_combined['address'].apply(lambda x: -1 if x == "R" else 1)
students_combined['famsize']    = students_combined['famsize'].apply(lambda x: -1 if x == "GT3" else 1)
students_combined['Pstatus']    = students_combined['Pstatus'].apply(lambda x: -1 if x == "A" else 1)
students_combined['Mjob']       = students_combined['Mjob'].apply(lambda x: 1 if x == "teacher" else 2 \
                                                                  if x == "health" else 3 if x == "services" else 4 \
                                                                  if x == "at_home" else 5)
students_combined['Fjob']       = students_combined['Fjob'].apply(lambda x: 1 if x == "teacher" else 2 \
                                                                  if x == "health" else 3 if x == "services" else 4 \
                                                                  if x == "at_home" else 5)
students_combined['reason']     = students_combined['reason'].apply(lambda x: 1 if x == "home" else 2 \
                                                                    if x == "reputation" else 3 if x == "course" else 4)
students_combined['guardian']   = students_combined['guardian'].apply(lambda x: 1 if x == "mother" else 2 \
                                                                      if x == "father" else 3)
students_combined['schoolsup']  = students_combined['schoolsup'].apply(lambda x: -1 if x == "no" else 1)
students_combined['famsup']     = students_combined['famsup'].apply(lambda x: -1 if x == "no" else 1)
students_combined['activities'] = students_combined['activities'].apply(lambda x: -1 if x == "no" else 1)
students_combined['nursery']    = students_combined['nursery'].apply(lambda x: -1 if x == "no" else 1)
students_combined['higher']     = students_combined['higher'].apply(lambda x: -1 if x == "no" else 1)
students_combined['internet']   = students_combined['internet'].apply(lambda x: -1 if x == "no" else 1)
students_combined['romantic']   = students_combined['romantic'].apply(lambda x: -1 if x == "no" else 1)


In [12]:
students_combined

Unnamed: 0,school,sex,age,address,famsize,Pstatus,Medu,Fedu,Mjob,Fjob,...,health,absences_x,G1_x,G2_x,G3_x,paid_y,absences_y,G1_y,G2_y,G3_y
0,GP,F,18,1,-1,-1,4,4,4,1,...,3,4,0,11,11,-1,6,5,6,6
1,GP,F,17,1,-1,1,1,1,4,5,...,3,2,9,11,11,-1,4,5,5,6
2,GP,F,15,1,-1,1,4,2,2,3,...,5,0,14,14,14,1,2,15,14,15
3,GP,F,16,1,-1,1,3,3,5,5,...,5,0,11,13,13,1,4,6,10,10
4,GP,M,16,1,1,1,4,3,3,5,...,5,6,12,12,13,1,10,15,15,15
5,GP,M,16,1,1,1,2,2,5,5,...,3,0,13,12,13,-1,0,12,12,11
6,GP,F,17,1,-1,-1,4,4,5,1,...,1,2,10,13,13,-1,6,6,5,6
7,GP,M,15,1,1,-1,3,2,3,5,...,1,0,15,16,17,1,0,16,18,19
8,GP,M,15,1,-1,1,3,4,5,5,...,5,0,12,12,13,1,0,14,15,15
9,GP,F,15,1,-1,1,4,4,1,2,...,2,2,14,14,14,1,0,10,8,9


First we also include the grades when assessing the relevance of the features, via model fitting, for predicting the final grades.
Later on we shall remove grades and absences ...
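A note on reading the "most relevant columns" lists printed by the cells below: `coef.argsort()[-10:]` orders the coefficients by signed value, so negative coefficients are pushed out of the ranking and zero coefficients fill up the top ten whenever fewer than ten features are actually used. A minimal sketch of an alternative ranking by absolute value, restricted to non-zero coefficients, could look as follows. The helper `top_features` and the toy names and numbers are hypothetical and are not results from this notebook.

```python
import numpy as np

def top_features(coef, names, k=10):
    """Return up to k (name, coefficient) pairs with the largest non-zero |coefficient|."""
    coef = np.asarray(coef, dtype=float)
    order = np.argsort(-np.abs(coef))      # largest magnitude first
    return [(names[i], coef[i]) for i in order[:k] if coef[i] != 0]

# Made-up coefficients: one positive, one negative, two exactly zero.
names = ['G2_x', 'failures', 'higher', 'famsize']
coef = np.array([0.9, -0.4, 0.0, 0.0])
ranked = top_features(coef, names)
print(ranked)
```

In this toy case only `G2_x` and `failures` are reported, whereas a ranking by signed value would place the two zero coefficients ahead of `failures`.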

In [19]:
features = students_combined.loc[:, ['address', 'famsize', 'Pstatus', 'Mjob', 'Fjob', 'reason', 'guardian', 'schoolsup', 
                                    'famsup', 'paid_x', 'paid_y', 'activities', 'nursery', 'higher', 'internet', 'romantic',
                                    'Medu', 'Fedu', 'studytime', 'failures', 'famrel', 'freetime', 'goout', 'Dalc', 'Walc',
                                    'health', 'absences_x', 'absences_y', 'G1_x', 'G1_y', 'G2_x', 'G2_y']]

target   = students_combined.loc[:, ['G3_x', 'G3_y']]

X_train, X_test, y_train, y_test = train_test_split(features, target, test_size = 0.2)

In [22]:
print('Default number of features: {}'.format(2 * features.shape[1]))
print('{} features for the first predicted value and {} for the second one'.format(features.shape[1], features.shape[1]))
print('=' * 100)

lasso_normal = Lasso(max_iter = int(10e6))  # max_iter has to be an integer
lasso_normal = lasso_normal.fit(X_train, y_train)

train_score = lasso_normal.score(X_train, y_train)
test_score  = lasso_normal.score(X_test, y_test)
coefficients_used = np.sum(lasso_normal.coef_ != 0)

print("Training score default lasso: {}".format(train_score))
print("Test score default lasso: {}".format(test_score))
print("Number of features used: {}".format(coefficients_used))

coef1, coef2 = np.array(lasso_normal.coef_)
most_relevant1 = coef1.argsort()[-10:]
most_relevant2 = coef2.argsort()[-10:]
print("Most relevant columns for Portuguese: {}".format(features.columns[most_relevant1[::-1]]))
print("Most relevant columns for Mathematics: {}".format(features.columns[most_relevant2[::-1]]))

alpha_values = [0.1, 0.01, 0.001, 2, 5, 10, 15]
for alpha_iter in alpha_values:
    
    lasso = Lasso(alpha = alpha_iter, max_iter = int(10e6))
    lasso = lasso.fit(X_train, y_train)

    train_score = lasso.score(X_train, y_train)
    test_score  = lasso.score(X_test, y_test)
    coefficients_used = np.sum(lasso.coef_ != 0)

    print("-" * 100)
    print("Training score alpha = {} lasso: {}".format(alpha_iter, train_score))
    print("Test score alpha = {} lasso: {}".format(alpha_iter, test_score))
    print("Number of features used: {}".format(coefficients_used))

    coef1, coef2 = np.array(lasso.coef_)
    most_relevant1 = coef1.argsort()[-10:]
    most_relevant2 = coef2.argsort()[-10:]
    print("Most relevant columns for Portuguese: {}".format(features.columns[most_relevant1[::-1]]))
    print("Most relevant columns for Mathematics: {}".format(features.columns[most_relevant2[::-1]]))

Default number of features: 64
32 features for the first predicted value and 32 for the second one
Training score default lasso: 0.8116939786280594
Test score default lasso: 0.8255361194477168
Number of features used: 5
Most relevant columns for Portuguese: Index(['G2_x', 'G1_x', 'schoolsup', 'higher', 'nursery', 'activities',
       'paid_y', 'paid_x', 'famsup', 'G2_y'],
      dtype='object')
Most relevant columns for Mathematics: Index(['G2_y', 'G1_y', 'absences_y', 'schoolsup', 'nursery', 'activities',
       'paid_y', 'paid_x', 'famsup', 'reason'],
      dtype='object')
----------------------------------------------------------------------------------------------------
Training score alpha = 0.1 lasso: 0.8324489912051912
Test score alpha = 0.1 lasso: 0.8437870527739532
Number of features used: 24
Most relevant columns for Portuguese: Index(['G2_x', 'G1_x', 'absences_x', 'Fjob', 'famsup', 'higher', 'nursery',
       'activities', 'paid_y', 'paid_x'],
      dtype='object')
Most relev

Dropping the absences:

In [23]:
features = students_combined.loc[:, ['address', 'famsize', 'Pstatus', 'Mjob', 'Fjob', 'reason', 'guardian', 'schoolsup', 
                                    'famsup', 'paid_x', 'paid_y', 'activities', 'nursery', 'higher', 'internet', 'romantic',
                                    'Medu', 'Fedu', 'studytime', 'failures', 'famrel', 'freetime', 'goout', 'Dalc', 'Walc',
                                    'health', 'G1_x', 'G1_y', 'G2_x', 'G2_y']]

target   = students_combined.loc[:, ['G3_x', 'G3_y']]

X_train, X_test, y_train, y_test = train_test_split(features, target, test_size = 0.2)

In [24]:
print('Default number of features: {}'.format(2 * features.shape[1]))
print('{} features for the first predicted value and {} for the second one'.format(features.shape[1], features.shape[1]))
print('=' * 100)

lasso_normal = Lasso(max_iter = int(10e6))  # max_iter has to be an integer
lasso_normal = lasso_normal.fit(X_train, y_train)

train_score = lasso_normal.score(X_train, y_train)
test_score  = lasso_normal.score(X_test, y_test)
coefficients_used = np.sum(lasso_normal.coef_ != 0)

print("Training score default lasso: {}".format(train_score))
print("Test score default lasso: {}".format(test_score))
print("Number of features used: {}".format(coefficients_used))

coef1, coef2 = np.array(lasso_normal.coef_)
most_relevant1 = coef1.argsort()[-10:]
most_relevant2 = coef2.argsort()[-10:]
print("Most relevant columns for Portuguese: {}".format(features.columns[most_relevant1[::-1]]))
print("Most relevant columns for Mathematics: {}".format(features.columns[most_relevant2[::-1]]))

alpha_values = [0.1, 0.01, 0.001, 2, 5, 10, 15]
for alpha_iter in alpha_values:
    
    lasso = Lasso(alpha = alpha_iter, max_iter = int(10e6))
    lasso = lasso.fit(X_train, y_train)

    train_score = lasso.score(X_train, y_train)
    test_score  = lasso.score(X_test, y_test)
    coefficients_used = np.sum(lasso.coef_ != 0)

    print("-" * 100)
    print("Training score alpha = {} lasso: {}".format(alpha_iter, train_score))
    print("Test score alpha = {} lasso: {}".format(alpha_iter, test_score))
    print("Number of features used: {}".format(coefficients_used))

    coef1, coef2 = np.array(lasso.coef_)
    most_relevant1 = coef1.argsort()[-10:]
    most_relevant2 = coef2.argsort()[-10:]
    print("Most relevant columns for Portuguese: {}".format(features.columns[most_relevant1[::-1]]))
    print("Most relevant columns for Mathematics: {}".format(features.columns[most_relevant2[::-1]]))

Default number of features: 60
30 features for the first predicted value and 30 for the second one
Training score default lasso: 0.8499907389472877
Test score default lasso: 0.6882863747767161
Number of features used: 5
Most relevant columns for Portuguese: Index(['G2_x', 'G1_x', 'G1_y', 'G2_y', 'nursery', 'famsize', 'Pstatus', 'Mjob',
       'Fjob', 'reason'],
      dtype='object')
Most relevant columns for Mathematics: Index(['G2_y', 'G1_y', 'higher', 'famsize', 'Pstatus', 'Mjob', 'Fjob',
       'reason', 'guardian', 'schoolsup'],
      dtype='object')
----------------------------------------------------------------------------------------------------
Training score alpha = 0.1 lasso: 0.8649294070176218
Test score alpha = 0.1 lasso: 0.7045774822003723
Number of features used: 17
Most relevant columns for Portuguese: Index(['G2_x', 'G1_x', 'address', 'internet', 'famsup', 'G1_y', 'health',
       'activities', 'famsize', 'Pstatus'],
      dtype='object')
Most relevant columns for Math

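Ok, so we could say that the absences count quite a bit: at the default alpha the test score drops from about 0.83 to about 0.69, although each cell draws a new random train/test split, so part of the difference may be due to the split. Now we add the absences back and remove the first and second grades (G1 and G2).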
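Because `train_test_split` is called without a `random_state`, two runs with different feature sets are also scored on different splits. One way to make such comparisons less dependent on the split is to score both feature sets on the same cross-validation folds. The sketch below uses synthetic data as a stand-in for the student table; the array names and the choice of five folds are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for the feature table: the target depends on columns 0 and 4.
rng = np.random.default_rng(0)
n = 200
X_full = rng.normal(size=(n, 5))
y = 2.0 * X_full[:, 0] + 0.5 * X_full[:, 4] + rng.normal(scale=0.5, size=n)
X_reduced = X_full[:, :4]                  # same data with the last column dropped

# A fixed seed gives identical folds for both feature sets.
folds = KFold(n_splits=5, shuffle=True, random_state=0)
model = Lasso(alpha=0.1, max_iter=100000)

scores_full = cross_val_score(model, X_full, y, cv=folds)
scores_reduced = cross_val_score(model, X_reduced, y, cv=folds)
print('full    : {:.3f} +/- {:.3f}'.format(scores_full.mean(), scores_full.std()))
print('reduced : {:.3f} +/- {:.3f}'.format(scores_reduced.mean(), scores_reduced.std()))
```

The spread across folds gives a rough idea of how much of a score difference can be attributed to the split alone.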

In [25]:
features = students_combined.loc[:, ['address', 'famsize', 'Pstatus', 'Mjob', 'Fjob', 'reason', 'guardian', 'schoolsup', 
                                    'famsup', 'paid_x', 'paid_y', 'activities', 'nursery', 'higher', 'internet', 'romantic',
                                    'Medu', 'Fedu', 'studytime', 'failures', 'famrel', 'freetime', 'goout', 'Dalc', 'Walc',
                                    'health', 'absences_x', 'absences_y']]

target   = students_combined.loc[:, ['G3_x', 'G3_y']]

X_train, X_test, y_train, y_test = train_test_split(features, target, test_size = 0.2)

In [26]:
print('Default number of features: {}'.format(2 * features.shape[1]))
print('{} features for the first predicted value and {} for the second one'.format(features.shape[1], features.shape[1]))
print('=' * 100)

lasso_normal = Lasso(max_iter = int(10e6))  # max_iter has to be an integer
lasso_normal = lasso_normal.fit(X_train, y_train)

train_score = lasso_normal.score(X_train, y_train)
test_score  = lasso_normal.score(X_test, y_test)
coefficients_used = np.sum(lasso_normal.coef_ != 0)

print("Training score default lasso: {}".format(train_score))
print("Test score default lasso: {}".format(test_score))
print("Number of features used: {}".format(coefficients_used))

coef1, coef2 = np.array(lasso_normal.coef_)
most_relevant1 = coef1.argsort()[-10:]
most_relevant2 = coef2.argsort()[-10:]
print("Most relevant columns for Portuguese: {}".format(features.columns[most_relevant1[::-1]]))
print("Most relevant columns for Mathematics: {}".format(features.columns[most_relevant2[::-1]]))

alpha_values = [0.1, 0.01, 0.001, 2, 5, 10, 15]
for alpha_iter in alpha_values:
    
    lasso = Lasso(alpha = alpha_iter, max_iter = int(10e6))
    lasso = lasso.fit(X_train, y_train)

    train_score = lasso.score(X_train, y_train)
    test_score  = lasso.score(X_test, y_test)
    coefficients_used = np.sum(lasso.coef_ != 0)

    print("-" * 100)
    print("Training score alpha = {} lasso: {}".format(alpha_iter, train_score))
    print("Test score alpha = {} lasso: {}".format(alpha_iter, test_score))
    print("Number of features used: {}".format(coefficients_used))

    coef1, coef2 = np.array(lasso.coef_)
    most_relevant1 = coef1.argsort()[-10:]
    most_relevant2 = coef2.argsort()[-10:]
    print("Most relevant columns for Portuguese: {}".format(features.columns[most_relevant1[::-1]]))
    print("Most relevant columns for Mathematics: {}".format(features.columns[most_relevant2[::-1]]))

Default number of features: 56
28 features for the first predicted value and 28 for the second one
Training score default lasso: 0.012455849443039147
Test score default lasso: 0.005531208684226709
Number of features used: 2
Most relevant columns for Portuguese: Index(['higher', 'absences_x', 'famsize', 'Pstatus', 'Mjob', 'Fjob', 'reason',
       'guardian', 'schoolsup', 'famsup'],
      dtype='object')
Most relevant columns for Mathematics: Index(['absences_y', 'nursery', 'famsize', 'Pstatus', 'Mjob', 'Fjob', 'reason',
       'guardian', 'schoolsup', 'famsup'],
      dtype='object')
----------------------------------------------------------------------------------------------------
Training score alpha = 0.1 lasso: 0.21582058323844547
Test score alpha = 0.1 lasso: 0.08900128613047147
Number of features used: 37
Most relevant columns for Portuguese: Index(['address', 'studytime', 'internet', 'famsup', 'Medu', 'Fedu',
       'activities', 'romantic', 'famsize', 'reason'],
      dtype='ob

Without grades the predictive power is really low, as in the previous case (Lasso technique on the data sets). Thus we add the grades back in and then try to normalize each feature that is used.
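The cells below standardize the whole table before it is split, so the means and standard deviations also depend on the test rows. A common alternative, sketched here under the assumption that only the features are scaled, is to put a `StandardScaler` in a pipeline so that the scaling parameters are estimated on the training part only. The data is synthetic and the variable names are hypothetical.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic features with a non-zero mean and a non-unit spread.
rng = np.random.default_rng(1)
X = rng.normal(loc=10.0, scale=3.0, size=(150, 4))
y = 0.8 * X[:, 0] - 0.3 * X[:, 1] + rng.normal(size=150)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# The scaler is fitted inside the pipeline, i.e. on the training rows only.
pipeline = make_pipeline(StandardScaler(), Lasso(alpha=0.1, max_iter=100000))
pipeline.fit(X_train, y_train)
scaler = pipeline.named_steps['standardscaler']

print('train R^2:', pipeline.score(X_train, y_train))
print('test  R^2:', pipeline.score(X_test, y_test))
```

The target is left on its original scale here, which keeps the meaning of `alpha` closer to the unnormalized runs.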

In [36]:
features = students_combined.loc[:, ['address', 'famsize', 'Pstatus', 'Mjob', 'Fjob', 'reason', 'guardian', 'schoolsup', 
                                    'famsup', 'paid_x', 'paid_y', 'activities', 'nursery', 'higher', 'internet', 'romantic',
                                    'Medu', 'Fedu', 'studytime', 'failures', 'famrel', 'freetime', 'goout', 'Dalc', 'Walc',
                                    'health', 'absences_x', 'absences_y', 'G1_x', 'G1_y', 'G2_x', 'G2_y']]

target   = students_combined.loc[:, ['G3_x', 'G3_y']]

for feature in features.columns: 
    features[feature] = (features[feature] - np.mean(features[feature])) / np.std(features[feature])

for feature in target.columns:
    target[feature] = (target[feature] - np.mean(target[feature])) / np.std(target[feature])

X_train, X_test, y_train, y_test = train_test_split(features, target, test_size = 0.2)

In [39]:
print('Default number of features: {}'.format(2 * features.shape[1]))
print('{} features for the first predicted value and {} for the second one'.format(features.shape[1], features.shape[1]))
print('=' * 100)

lasso_normal = Lasso(max_iter = int(10e6))  # max_iter has to be an integer
lasso_normal = lasso_normal.fit(X_train, y_train)

train_score = lasso_normal.score(X_train, y_train)
test_score  = lasso_normal.score(X_test, y_test)
coefficients_used = np.sum(lasso_normal.coef_ != 0)

print("Training score default lasso: {}".format(train_score))
print("Test score default lasso: {}".format(test_score))
print("Number of features used: {}".format(coefficients_used))

coef1, coef2 = np.array(lasso_normal.coef_)
most_relevant1 = coef1.argsort()[-10:]
most_relevant2 = coef2.argsort()[-10:]
print("Most relevant columns for Portuguese: {}".format(features.columns[most_relevant1[::-1]]))
print("Most relevant columns for Mathematics: {}".format(features.columns[most_relevant2[::-1]]))

alpha_values = [0.1, 0.01, 0.001, 2, 5, 10, 15]
for alpha_iter in alpha_values:
    
    lasso = Lasso(alpha = alpha_iter, max_iter = int(10e6))
    lasso = lasso.fit(X_train, y_train)

    train_score = lasso.score(X_train, y_train)
    test_score  = lasso.score(X_test, y_test)
    coefficients_used = np.sum(lasso.coef_ != 0)

    print("-" * 100)
    print("Training score alpha = {} lasso: {}".format(alpha_iter, train_score))
    print("Test score alpha = {} lasso: {}".format(alpha_iter, test_score))
    print("Number of features used: {}".format(coefficients_used))

    coef1, coef2 = np.array(lasso.coef_)
    most_relevant1 = coef1.argsort()[-10:]
    most_relevant2 = coef2.argsort()[-10:]
    print("Most relevant columns for Portuguese: {}".format(features.columns[most_relevant1[::-1]]))
    print("Most relevant columns for Mathematics: {}".format(features.columns[most_relevant2[::-1]]))

Default number of features: 64
32 features for the first predicted value and 32 for the second one
Training score default lasso: -2.6702146031099082e-15
Test score default lasso: -0.012556856795476164
Number of features used: 0
Most relevant columns for Portuguese: Index(['G2_y', 'G2_x', 'famsize', 'Pstatus', 'Mjob', 'Fjob', 'reason',
       'guardian', 'schoolsup', 'famsup'],
      dtype='object')
Most relevant columns for Mathematics: Index(['G2_y', 'G2_x', 'famsize', 'Pstatus', 'Mjob', 'Fjob', 'reason',
       'guardian', 'schoolsup', 'famsup'],
      dtype='object')
----------------------------------------------------------------------------------------------------
Training score alpha = 0.1 lasso: 0.8060077869204453
Test score alpha = 0.1 lasso: 0.8238295000105613
Number of features used: 6
Most relevant columns for Portuguese: Index(['G2_x', 'G1_x', 'schoolsup', 'higher', 'nursery', 'activities',
       'paid_y', 'paid_x', 'famsup', 'G2_y'],
      dtype='object')
Most relevant co

So we remove the grades to see what happens, i.e. whether the normalization helps in any way to reduce their influence on the prediction ...

In [40]:
features = students_combined.loc[:, ['address', 'famsize', 'Pstatus', 'Mjob', 'Fjob', 'reason', 'guardian', 'schoolsup', 
                                    'famsup', 'paid_x', 'paid_y', 'activities', 'nursery', 'higher', 'internet', 'romantic',
                                    'Medu', 'Fedu', 'studytime', 'failures', 'famrel', 'freetime', 'goout', 'Dalc', 'Walc',
                                    'health', 'absences_x', 'absences_y']]

target   = students_combined.loc[:, ['G3_x', 'G3_y']]

for feature in features.columns: 
    features[feature] = (features[feature] - np.mean(features[feature])) / np.std(features[feature])

for feature in target.columns:
    target[feature] = (target[feature] - np.mean(target[feature])) / np.std(target[feature])

X_train, X_test, y_train, y_test = train_test_split(features, target, test_size = 0.2)

In [41]:
print('Default number of features: {}'.format(2 * features.shape[1]))
print('{} features for the first predicted value and {} for the second one'.format(features.shape[1], features.shape[1]))
print('=' * 100)

lasso_normal = Lasso(max_iter = int(10e6))  # max_iter has to be an integer
lasso_normal = lasso_normal.fit(X_train, y_train)

train_score = lasso_normal.score(X_train, y_train)
test_score  = lasso_normal.score(X_test, y_test)
coefficients_used = np.sum(lasso_normal.coef_ != 0)

print("Training score default lasso: {}".format(train_score))
print("Test score default lasso: {}".format(test_score))
print("Number of features used: {}".format(coefficients_used))

coef1, coef2 = np.array(lasso_normal.coef_)
most_relevant1 = coef1.argsort()[-10:]
most_relevant2 = coef2.argsort()[-10:]
print("Most relevant columns for Portuguese: {}".format(features.columns[most_relevant1[::-1]]))
print("Most relevant columns for Mathematics: {}".format(features.columns[most_relevant2[::-1]]))

alpha_values = [0.1, 0.01, 0.001, 2, 5, 10, 15]
for alpha_iter in alpha_values:
    
    lasso = Lasso(alpha = alpha_iter, max_iter = int(10e6))
    lasso = lasso.fit(X_train, y_train)

    train_score = lasso.score(X_train, y_train)
    test_score  = lasso.score(X_test, y_test)
    coefficients_used = np.sum(lasso.coef_ != 0)

    print("-" * 100)
    print("Training score alpha = {} lasso: {}".format(alpha_iter, train_score))
    print("Test score alpha = {} lasso: {}".format(alpha_iter, test_score))
    print("Number of features used: {}".format(coefficients_used))

    coef1, coef2 = np.array(lasso.coef_)
    most_relevant1 = coef1.argsort()[-10:]
    most_relevant2 = coef2.argsort()[-10:]
    print("Most relevant columns for Portuguese: {}".format(features.columns[most_relevant1[::-1]]))
    print("Most relevant columns for Mathematics: {}".format(features.columns[most_relevant2[::-1]]))

Default number of features: 56
28 features for the first predicted value and 28 for the second one
Training score default lasso: 2.250751829198468e-16
Test score default lasso: -0.007625325663889372
Number of features used: 0
Most relevant columns for Portuguese: Index(['absences_y', 'absences_x', 'famsize', 'Pstatus', 'Mjob', 'Fjob',
       'reason', 'guardian', 'schoolsup', 'famsup'],
      dtype='object')
Most relevant columns for Mathematics: Index(['absences_y', 'absences_x', 'famsize', 'Pstatus', 'Mjob', 'Fjob',
       'reason', 'guardian', 'schoolsup', 'famsup'],
      dtype='object')
----------------------------------------------------------------------------------------------------
Training score alpha = 0.1 lasso: 0.18299319284987398
Test score alpha = 0.1 lasso: 0.07699716139004577
Number of features used: 18
Most relevant columns for Portuguese: Index(['address', 'studytime', 'Medu', 'higher', 'activities', 'nursery',
       'famsize', 'Pstatus', 'Fjob', 'reason'],
      dt

Nope, the normalization didn't really help ...
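The "0 features used" result of the default Lasso on the normalized data has a simple explanation. scikit-learn's `Lasso` minimizes (1 / (2 n)) * ||y - Xw||^2 + alpha * ||w||_1, and every coefficient is zero once alpha reaches max_j |x_j . y| / n (for centered data). When a feature and the target both have zero mean and unit variance, x_j . y / n is their correlation, which cannot exceed 1, so the default alpha = 1 removes everything. The sketch below checks this on synthetic data; in the notebook the standardization is done before the split, so there the bound holds only approximately.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic data: the target depends on the first of six features.
rng = np.random.default_rng(2)
n = 300
X = rng.normal(size=(n, 6))
y = 1.5 * X[:, 0] + rng.normal(size=n)

# Standardize features and target to zero mean and unit variance.
Xs = (X - X.mean(axis=0)) / X.std(axis=0)
ys = (y - y.mean()) / y.std()

# Smallest alpha for which all Lasso coefficients are zero;
# for standardized data this is the largest absolute correlation.
alpha_max = np.max(np.abs(Xs.T @ ys)) / n
print('alpha_max:', alpha_max)

default_fit = Lasso(alpha=1.0).fit(Xs, ys)
small_fit = Lasso(alpha=0.1).fit(Xs, ys)
print('non-zero at alpha = 1.0:', np.sum(default_fit.coef_ != 0))
print('non-zero at alpha = 0.1:', np.sum(small_fit.coef_ != 0))
```

On standardized data the interesting range of alpha therefore lies below 1, which is why only the smaller values in `alpha_values` produce non-trivial models in these runs.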

In [54]:
features = students_combined.loc[:, ['address', 'famsize', 'Pstatus', 'Mjob', 'Fjob', 'reason', 'guardian', 'schoolsup', 
                                    'famsup', 'paid_x', 'paid_y', 'activities', 'nursery', 'higher', 'internet', 'romantic',
                                    'Medu', 'Fedu', 'studytime', 'failures', 'famrel', 'freetime', 'goout', 'Dalc', 'Walc',
                                    'health', 'absences_x', 'absences_y', 'G2_y']]

target   = students_combined.loc[:, ['G3_x', 'G3_y']]

for feature in features.columns: 
    features[feature] = (features[feature] - np.mean(features[feature])) / np.std(features[feature])

for feature in target.columns:
    target[feature] = (target[feature] - np.mean(target[feature])) / np.std(target[feature])

X_train, X_test, y_train, y_test = train_test_split(features, target, test_size = 0.2)

In [55]:
print('Default number of features: {}'.format(2 * features.shape[1]))
print('{} features for the first predicted value and {} for the second one'.format(features.shape[1], features.shape[1]))
print('=' * 100)

lasso_normal = Lasso(max_iter = int(10e6))  # max_iter has to be an integer
lasso_normal = lasso_normal.fit(X_train, y_train)

train_score = lasso_normal.score(X_train, y_train)
test_score  = lasso_normal.score(X_test, y_test)
coefficients_used = np.sum(lasso_normal.coef_ != 0)

print("Training score default lasso: {}".format(train_score))
print("Test score default lasso: {}".format(test_score))
print("Number of features used: {}".format(coefficients_used))

coef1, coef2 = np.array(lasso_normal.coef_)
most_relevant1 = coef1.argsort()[-10:]
most_relevant2 = coef2.argsort()[-10:]
print("Most relevant columns for Portuguese: {}".format(features.columns[most_relevant1[::-1]]))
print("Most relevant columns for Mathematics: {}".format(features.columns[most_relevant2[::-1]]))

alpha_values = [0.1, 0.01, 0.001, 2, 5, 10, 15]
for alpha_iter in alpha_values:
    
    lasso = Lasso(alpha = alpha_iter, max_iter = int(10e6))
    lasso = lasso.fit(X_train, y_train)

    train_score = lasso.score(X_train, y_train)
    test_score  = lasso.score(X_test, y_test)
    coefficients_used = np.sum(lasso.coef_ != 0)

    print("-" * 100)
    print("Training score alpha = {} lasso: {}".format(alpha_iter, train_score))
    print("Test score alpha = {} lasso: {}".format(alpha_iter, test_score))
    print("Number of features used: {}".format(coefficients_used))

    coef1, coef2 = np.array(lasso.coef_)
    most_relevant1 = coef1.argsort()[-10:]
    most_relevant2 = coef2.argsort()[-10:]
    print("Most relevant columns for Portuguese: {}".format(features.columns[most_relevant1[::-1]]))
    print("Most relevant columns for Mathematics: {}".format(features.columns[most_relevant2[::-1]]))

Default number of features: 58
29 features for the first predicted value and 29 for the second one
Training score default lasso: 6.186311585769665e-16
Test score default lasso: -0.008018899782198493
Number of features used: 0
Most relevant columns for Portuguese: Index(['G2_y', 'higher', 'famsize', 'Pstatus', 'Mjob', 'Fjob', 'reason',
       'guardian', 'schoolsup', 'famsup'],
      dtype='object')
Most relevant columns for Mathematics: Index(['G2_y', 'higher', 'famsize', 'Pstatus', 'Mjob', 'Fjob', 'reason',
       'guardian', 'schoolsup', 'famsup'],
      dtype='object')
----------------------------------------------------------------------------------------------------
Training score alpha = 0.1 lasso: 0.6088876275315738
Test score alpha = 0.1 lasso: 0.5552354978361113
Number of features used: 9
Most relevant columns for Portuguese: Index(['G2_y', 'studytime', 'address', 'famsup', 'health', 'higher', 'famsize',
       'Pstatus', 'Mjob', 'Fjob'],
      dtype='object')
Most relevant co

In [73]:
features = students_combined.loc[:, ['address', 'famsize', 'Pstatus', 'Mjob', 'Fjob', 'reason', 'guardian', 'schoolsup', 
                                    'famsup', 'paid_x', 'paid_y', 'activities', 'nursery', 'higher', 'internet', 'romantic',
                                    'Medu', 'Fedu', 'studytime', 'failures', 'famrel', 'freetime', 'goout', 'Dalc', 'Walc',
                                    'health', 'absences_x', 'absences_y', 'G2_x']]

target   = students_combined.loc[:, ['G3_x', 'G3_y']]

for feature in features.columns: 
    features[feature] = (features[feature] - np.mean(features[feature])) / np.std(features[feature])

for feature in target.columns:
    target[feature] = (target[feature] - np.mean(target[feature])) / np.std(target[feature])

X_train, X_test, y_train, y_test = train_test_split(features, target, test_size = 0.2)

In [74]:
print('Default number of features: {}'.format(2 * features.shape[1]))
print('{} features for the first predicted value and {} for the second one'.format(features.shape[1], features.shape[1]))
print('=' * 100)

lasso_normal = Lasso(max_iter = int(10e6))  # max_iter has to be an integer
lasso_normal = lasso_normal.fit(X_train, y_train)

train_score = lasso_normal.score(X_train, y_train)
test_score  = lasso_normal.score(X_test, y_test)
coefficients_used = np.sum(lasso_normal.coef_ != 0)

print("Training score default lasso: {}".format(train_score))
print("Test score default lasso: {}".format(test_score))
print("Number of features used: {}".format(coefficients_used))

coef1, coef2 = np.array(lasso_normal.coef_)
most_relevant1 = coef1.argsort()[-10:]
most_relevant2 = coef2.argsort()[-10:]
print("Most relevant columns for Portuguese: {}".format(features.columns[most_relevant1[::-1]]))
print("Most relevant columns for Mathematics: {}".format(features.columns[most_relevant2[::-1]]))

alpha_values = [0.1, 0.01, 0.001, 2, 5, 10, 15]
for alpha_iter in alpha_values:
    
    lasso = Lasso(alpha = alpha_iter, max_iter = int(10e6))
    lasso = lasso.fit(X_train, y_train)

    train_score = lasso.score(X_train, y_train)
    test_score  = lasso.score(X_test, y_test)
    coefficients_used = np.sum(lasso.coef_ != 0)

    print("-" * 100)
    print("Training score alpha = {} lasso: {}".format(alpha_iter, train_score))
    print("Test score alpha = {} lasso: {}".format(alpha_iter, test_score))
    print("Number of features used: {}".format(coefficients_used))

    coef1, coef2 = np.array(lasso.coef_)
    most_relevant1 = coef1.argsort()[-10:]
    most_relevant2 = coef2.argsort()[-10:]
    print("Most relevant columns for Portuguese: {}".format(features.columns[most_relevant1[::-1]]))
    print("Most relevant columns for Mathematics: {}".format(features.columns[most_relevant2[::-1]]))

Default number of features: 58
29 features for the first predicted value and 29 for the second one
Training score default lasso: 7.747260153887927e-16
Test score default lasso: -0.008345059049089906
Number of features used: 0
Most relevant columns for Portuguese: Index(['G2_x', 'higher', 'famsize', 'Pstatus', 'Mjob', 'Fjob', 'reason',
       'guardian', 'schoolsup', 'famsup'],
      dtype='object')
Most relevant columns for Mathematics: Index(['G2_x', 'higher', 'famsize', 'Pstatus', 'Mjob', 'Fjob', 'reason',
       'guardian', 'schoolsup', 'famsup'],
      dtype='object')
----------------------------------------------------------------------------------------------------
Training score alpha = 0.1 lasso: 0.5335277964912588
Test score alpha = 0.1 lasso: 0.5599681461440892
Number of features used: 6
Most relevant columns for Portuguese: Index(['G2_x', 'higher', 'famsize', 'Pstatus', 'Mjob', 'Fjob', 'reason',
       'guardian', 'schoolsup', 'famsup'],
      dtype='object')
Most relevant c

The situation is more stable in this case: even if the Mathematics grade weighs a little more than the Portuguese one when it comes to the predictive power of the model, the relevance of the Mathematics grade is not as strong as last time (unnormalized data).

Anyhow, many of the conclusions from the previous Lasso analysis still hold (Lasso technique on the data sets). Besides these conclusions we can add the following points:
- the absences carry some weight in the predictive capabilities of the model; however, without any previous grades the grade prediction does not give satisfying results
- if only one grade is available, then the preferred one (for better predictions) is the second Mathematics grade; if we add a Portuguese grade, any will do.
