# Chapter 15 - Regularization

## Section 15.2 - Why regularize?

Regularization is a technique used to reduce overfitting. Let's illustrate it with a basic linear regression case:

In [1]:
import pandas as pd
from patsy import dmatrices 
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

acs = pd.read_csv('../data/acs_ny.csv')
print(acs.columns)

# Build the response and predictor matrices from a formula
response, predictors = dmatrices('FamilyIncome ~ NumBedrooms + NumChildren + NumPeople + NumRooms + '\
                                 'NumUnits + NumVehicles + NumWorkers + OwnRent + YearBuilt + ElectricBill + '\
                                 'FoodStamp + HeatingFuel + Insurance + Language', data=acs)

# Now split the data into a training set and a test set
x_train, x_test, y_train, y_test = train_test_split(predictors, response, random_state=0)

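# And now fit a linear regression on the training data
# Note: `normalize` was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this call only runs on older versions (see the warning text in the output)
lr = LinearRegression(normalize=True).fit(x_train, y_train)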

model_coefs = pd.DataFrame(list(zip(predictors.design_info.column_names, lr.coef_[0])), columns=['variable', 'coef_lr'])
print(model_coefs)

Index(['Acres', 'FamilyIncome', 'FamilyType', 'NumBedrooms', 'NumChildren',
       'NumPeople', 'NumRooms', 'NumUnits', 'NumVehicles', 'NumWorkers',
       'OwnRent', 'YearBuilt', 'HouseCosts', 'ElectricBill', 'FoodStamp',
       'HeatingFuel', 'Insurance', 'Language'],
      dtype='object')
                       variable       coef_lr
0                     Intercept  3.522660e-11
1   NumUnits[T.Single attached]  3.135646e+04
2   NumUnits[T.Single detached]  2.418368e+04
3           OwnRent[T.Outright]  2.839186e+04
4             OwnRent[T.Rented]  7.229586e+03
5        YearBuilt[T.1940-1949]  1.292169e+04
6        YearBuilt[T.1950-1959]  2.057793e+04
7        YearBuilt[T.1960-1969]  1.764835e+04
8        YearBuilt[T.1970-1979]  1.756881e+04
9        YearBuilt[T.1980-1989]  2.552566e+04
10       YearBuilt[T.1990-1999]  2.983944e+04
11       YearBuilt[T.2000-2004]  3.012502e+04
12            YearBuilt[T.2005]  4.318648e+04
13            YearBuilt[T.2006]  3.242038e+04
14            Yea

If you wish to scale the data, use Pipeline with a StandardScaler in a preprocessing stage. To reproduce the previous behavior:

from sklearn.pipeline import make_pipeline

model = make_pipeline(StandardScaler(with_mean=False), LinearRegression())

If you wish to pass a sample_weight parameter, you need to pass it as a fit parameter to each step of the pipeline as follows:

kwargs = {s[0] + '__sample_weight': sample_weight for s in model.steps}
model.fit(X, y, **kwargs)




And now let's check the model's score (R^2):

In [2]:
print(lr.score(x_train, y_train))

print(lr.score(x_test, y_test))

0.2726140465638569
0.26976979568488124


The score is very low (about 0.27 on both the training and test sets), which shows that the model performs poorly.

In a scenario where regularization is actually useful, we would see a high R^2 on the training set and a low one on the test set, which indicates that the model is overfitting. That is not the case here: the two scores are nearly identical.
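To make the overfitting pattern concrete, here is a minimal sketch on synthetic data (not the ACS data, and not from the book). It assumes a setup with more features than training rows, where only the first feature carries signal; the sizes and the noise level are arbitrary illustrative choices.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data (an assumption for illustration): 60 rows, 40 features
rng = np.random.default_rng(0)
n_samples, n_features = 60, 40
X = rng.normal(size=(n_samples, n_features))

# Only the first feature influences the response; the rest is pure noise
y = 3.0 * X[:, 0] + rng.normal(scale=1.0, size=n_samples)

# 30 training rows against 40 features: the model can memorize the training set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0
)

model = LinearRegression().fit(X_train, y_train)
train_r2 = model.score(X_train, y_train)
test_r2 = model.score(X_test, y_test)

print(f"train R^2: {train_r2:.3f}")
print(f"test R^2:  {test_r2:.3f}")
```

With fewer training rows than features, the unregularized fit reproduces the training responses almost exactly, so the training score is close to 1 while the test score is clearly lower. That gap is the symptom regularization is meant to address.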

Regularization would then shrink the magnitude of the model's coefficients. There are several types of regularization; the following sections of the book do not explain how they work, they only use them.

Note: it is surprising that the book chose such a crude example to motivate regularization.
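Since the book does not explain how these penalties work, the following is a rough orientation using the usual textbook formulations, not the book's own notation. The symbols $\beta$ (coefficients) and $\lambda$, $\lambda_1$, $\lambda_2$ (penalty strengths) are assumptions introduced here; scikit-learn calls the strength `alpha` and scales the terms somewhat differently.

```latex
\begin{aligned}
\text{Least squares:} \quad & \min_{\beta} \; \lVert y - X\beta \rVert_2^2 \\
\text{LASSO (L1):} \quad & \min_{\beta} \; \lVert y - X\beta \rVert_2^2 + \lambda \sum_j \lvert \beta_j \rvert \\
\text{Ridge (L2):} \quad & \min_{\beta} \; \lVert y - X\beta \rVert_2^2 + \lambda \sum_j \beta_j^2 \\
\text{Elastic net:} \quad & \min_{\beta} \; \lVert y - X\beta \rVert_2^2 + \lambda_1 \sum_j \lvert \beta_j \rvert + \lambda_2 \sum_j \beta_j^2
\end{aligned}
```

In every case the extra term penalizes large coefficients, and setting the penalty strength to zero recovers ordinary least squares.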

## Section 15.3 - LASSO regression (Least Absolute Shrinkage and Selection Operator)

This regression, also known as regression with L1 regularization, can even shrink some coefficients exactly to zero (effectively dropping those variables):

In [3]:
from sklearn.linear_model import Lasso

# The book fits this on the test data, but a model should always be fit on the training data
# Note: `normalize` was removed in scikit-learn 1.2, so this call needs an older version
lasso = Lasso(normalize=True, random_state=0).fit(x_train, y_train)

coefs_lasso = pd.DataFrame(list(zip(predictors.design_info.column_names, lasso.coef_)), columns=['variable', 'coef_lr'])
print(coefs_lasso)

print(lasso.score(x_train, y_train))
print(lasso.score(x_test, y_test))

                       variable       coef_lr
0                     Intercept      0.000000
1   NumUnits[T.Single attached]  28192.431471
2   NumUnits[T.Single detached]  21101.168897
3           OwnRent[T.Outright]  27074.779758
4             OwnRent[T.Rented]   6274.559912
5        YearBuilt[T.1940-1949]  -7081.568645
6        YearBuilt[T.1950-1959]      0.000000
7        YearBuilt[T.1960-1969]  -2429.382254
8        YearBuilt[T.1970-1979]  -2798.031405
9        YearBuilt[T.1980-1989]   4410.551326
10       YearBuilt[T.1990-1999]   8550.687504
11       YearBuilt[T.2000-2004]   8696.421930
12            YearBuilt[T.2005]  20967.365359
13            YearBuilt[T.2006]  10594.529271
14            YearBuilt[T.2007]  13591.992592
15            YearBuilt[T.2008]  14617.015659
16            YearBuilt[T.2009]   7408.678779
17            YearBuilt[T.2010]  50201.873426
18     YearBuilt[T.Before 1939]  -8240.050914
19             FoodStamp[T.Yes] -27585.672088
20   HeatingFuel[T.Electricity]   

If you wish to scale the data, use Pipeline with a StandardScaler in a preprocessing stage. To reproduce the previous behavior:

from sklearn.pipeline import make_pipeline

model = make_pipeline(StandardScaler(with_mean=False), Lasso())

If you wish to pass a sample_weight parameter, you need to pass it as a fit parameter to each step of the pipeline as follows:

kwargs = {s[0] + '__sample_weight': sample_weight for s in model.steps}
model.fit(X, y, **kwargs)

Set parameter alpha to: original_alpha * np.sqrt(n_samples). 
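To see the coefficient-zeroing behavior described at the start of this section on data where the answer is known, here is a minimal sketch on synthetic data (not the ACS data). It scales the features with a `StandardScaler` inside a pipeline instead of the removed `normalize` argument, so it does not reproduce the numbers above; `alpha=0.5`, the sample size, and the true coefficients are arbitrary choices made for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (an assumption for illustration): 10 features, 2 informative
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = 5.0 * X[:, 0] - 3.0 * X[:, 1] + rng.normal(scale=0.5, size=200)

# Scale the features, then fit an L1-penalized regression
model = make_pipeline(StandardScaler(), Lasso(alpha=0.5)).fit(X, y)

# make_pipeline names each step after its lowercased class name
coefs = model.named_steps["lasso"].coef_
n_zero = int(np.sum(coefs == 0))

print(coefs.round(2))
print("zeroed coefficients:", n_zero)
```

The two informative features keep nonzero (though shrunken) coefficients, while the noise features are expected to be driven to exactly zero. This is why LASSO is often described as doing variable selection.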


## Section 15.4 - Ridge regression

It is also known as regression with L2 regularization:

In [4]:
from sklearn.linear_model import Ridge

# Note: `normalize` was removed in scikit-learn 1.2, so this call needs an older version
ridge = Ridge(normalize=True, random_state=0).fit(x_train, y_train)

coefs_ridge = pd.DataFrame(list(zip(predictors.design_info.column_names, ridge.coef_[0])), columns=['variable', 'coef_lr'])
print(coefs_ridge)

print(ridge.score(x_train, y_train))
print(ridge.score(x_test, y_test))

                       variable       coef_lr
0                     Intercept      0.000000
1   NumUnits[T.Single attached]   4571.129321
2   NumUnits[T.Single detached]   4514.956813
3           OwnRent[T.Outright]  10674.890982
4             OwnRent[T.Rented] -10180.631863
5        YearBuilt[T.1940-1949]  -3672.096659
6        YearBuilt[T.1950-1959]   1221.616020
7        YearBuilt[T.1960-1969]    -15.801437
8        YearBuilt[T.1970-1979]  -1868.746915
9        YearBuilt[T.1980-1989]   2664.343363
10       YearBuilt[T.1990-1999]   4079.639281
11       YearBuilt[T.2000-2004]   5615.285677
12            YearBuilt[T.2005]  12607.557029
13            YearBuilt[T.2006]   5783.401233
14            YearBuilt[T.2007]   8019.076178
15            YearBuilt[T.2008]   7964.342869
16            YearBuilt[T.2009]   3892.605415
17            YearBuilt[T.2010]  28469.966885
18     YearBuilt[T.Before 1939]  -4271.925584
19             FoodStamp[T.Yes] -21854.708263
20   HeatingFuel[T.Electricity]  -

If you wish to scale the data, use Pipeline with a StandardScaler in a preprocessing stage. To reproduce the previous behavior:

from sklearn.pipeline import make_pipeline

model = make_pipeline(StandardScaler(with_mean=False), Ridge())

If you wish to pass a sample_weight parameter, you need to pass it as a fit parameter to each step of the pipeline as follows:

kwargs = {s[0] + '__sample_weight': sample_weight for s in model.steps}
model.fit(X, y, **kwargs)

Set parameter alpha to: original_alpha * n_samples. 
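Unlike the L1 penalty, the L2 penalty shrinks coefficients toward zero but does not typically set them exactly to zero. The sketch below illustrates this on synthetic data (not the ACS data); the true coefficients and the three `alpha` values are arbitrary choices made for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Synthetic data (an assumption for illustration): 5 features, 3 informative
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
true_coefs = np.array([4.0, -2.0, 1.0, 0.0, 0.0])
y = X @ true_coefs + rng.normal(scale=0.5, size=100)

# Fit ridge with increasing penalty strength and track the coefficient norm
norms = []
for alpha in [0.1, 10.0, 1000.0]:
    coef = Ridge(alpha=alpha).fit(X, y).coef_
    norms.append(float(np.linalg.norm(coef)))
    print(f"alpha={alpha:>7}: {coef.round(3)}")

print("coefficient norms:", [round(n, 3) for n in norms])
```

The norm of the coefficient vector decreases as `alpha` grows, yet it stays above zero even for the largest penalty. That contrast with LASSO is the practical difference between the two methods.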


## Section 15.5 - Elastic net

It combines the two previous regularization techniques (LASSO and ridge):

In [5]:
from sklearn.linear_model import ElasticNet

en = ElasticNet(random_state=42).fit(x_train, y_train)


coefs_en = pd.DataFrame(list(zip(predictors.design_info.column_names, en.coef_)), columns=['variable', 'coef_en'])

model_coefs = pd.merge(model_coefs, coefs_en, on='variable')
print(model_coefs)

                       variable       coef_lr       coef_en
0                     Intercept  3.522660e-11      0.000000
1   NumUnits[T.Single attached]  3.135646e+04   1342.291706
2   NumUnits[T.Single detached]  2.418368e+04    168.728479
3           OwnRent[T.Outright]  2.839186e+04    445.533238
4             OwnRent[T.Rented]  7.229586e+03   -600.673747
5        YearBuilt[T.1940-1949]  1.292169e+04   -794.239494
6        YearBuilt[T.1950-1959]  2.057793e+04    513.289101
7        YearBuilt[T.1960-1969]  1.764835e+04   -275.576200
8        YearBuilt[T.1970-1979]  1.756881e+04   -574.365605
9        YearBuilt[T.1980-1989]  2.552566e+04    708.813588
10       YearBuilt[T.1990-1999]  2.983944e+04   1357.944466
11       YearBuilt[T.2000-2004]  3.012502e+04    798.576141
12            YearBuilt[T.2005]  4.318648e+04    445.271666
13            YearBuilt[T.2006]  3.242038e+04    202.040682
14            YearBuilt[T.2007]  3.562061e+04    222.170314
15            YearBuilt[T.2008]  3.71247