# 07 - Regularization

In this lesson, we cover the following topics:
- Cost function and regularization;
- Ridge;
- Lasso;
- Elastic-Net.

## Regularization

<br>

__Regularization__ is an important tool for fitting Linear Regression models. When fitting a Multiple Linear Regression, the goal is to estimate the coefficients of the equation below:

<br>

$$ Y_i = \beta_0 + \sum_{j=1}^{p} \beta_j X_{ij} = \beta_0 + \beta X $$

<br>

To determine the values of all the parameters $\beta$, the fitting process looks for the parameters that minimize the so-called __cost function__, which measures the cost (that is, the error incurred) of estimating $Y$. For regressions, the cost function is the __residual sum of squares__, as follows:

<br>

$$
\Theta = \sum_{i = 1}^{n}[y_i - (\beta_0 + \beta X)]^2
$$
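
As a minimal sketch of the cost function above, the residual sum of squares can be computed directly with NumPy. The helper name `rss` and the toy data are illustrative assumptions, not part of the original material.

```python
import numpy as np

def rss(X, y, beta0, beta):
    """Residual sum of squares for the linear model y_hat = beta0 + X @ beta."""
    residuals = y - (beta0 + X @ beta)
    return float(np.sum(residuals ** 2))

# Hypothetical toy data generated exactly by y = 1 + 2*x1 + 1*x2
X = np.array([[1.0, 2.0], [2.0, 0.0], [3.0, 1.0], [4.0, 3.0]])
y = np.array([5.0, 5.0, 8.0, 12.0])

cost_true = rss(X, y, beta0=1.0, beta=np.array([2.0, 1.0]))     # generating coefficients
cost_shifted = rss(X, y, beta0=0.0, beta=np.array([2.0, 1.0]))  # intercept off by one
print(cost_true, cost_shifted)
```

With the coefficients that generated the toy data the cost is zero; shifting the intercept by one unit leaves a residual of 1 on each of the four samples, so the cost becomes 4.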

<br>

During the iterative process of estimating the parameters, however, one problem that may arise is _overfitting_, as discussed in earlier topics. Instead of learning to __generalize__, the model merely __memorizes__ the answers in the training data, which hurts the real predictive power of Linear Regression, or of any other _Machine Learning_ model.

The way to reduce this effect in regressions is precisely __regularization__: depending on the type of regularization, a term known as the __penalty__, which grows with the magnitude of the coefficients $\beta$, is added to the cost function. As a result, minimizing the cost function also pushes the parameters $\beta$ toward smaller values.

The following topics present the main regularization methods for regressions: __Ridge__, __Lasso__ and __Elastic-Net__.

<br><br>

###  Ridge (L2)

<br>

The Ridge method, or L2 penalty, adds a quadratic term in the parameters to the cost function:

<br>

$$
\Theta_{Ridge} = \sum_{i = 1}^{n}[y_i - (\beta_0 + \beta X)]^2 + \alpha\sum_{j = 1}^{p}\beta^{2}_{j} 
$$

<br>

This type of regularization is most useful when __all the features in the data are important__ but the model is expected to generalize better. The parameter $\alpha$ controls the complexity of the model: the larger $\alpha$ is, the simpler the model, that is, the lower the variance and the higher the chance of _underfitting_.

Training and prediction work just as they do for the `LinearRegression` class. To use [_Ridge_](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Ridge.html), simply import its specific class:

<br>

```python
# Import the Ridge class
from sklearn.linear_model import Ridge

# Instantiate the model
model = Ridge(alpha = 1.0) # Ridge regularization strength
```
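
To illustrate how $\alpha$ controls the amount of shrinkage, the sketch below fits `Ridge` with increasing values of `alpha` and records the norm of the coefficient vector. The synthetic data and the chosen `alpha` values are assumptions made only for this illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Hypothetical synthetic data: 3 features with known coefficients plus noise
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([3.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

norms = []
for alpha in [0.01, 1.0, 100.0, 10000.0]:
    model = Ridge(alpha=alpha).fit(X, y)
    # The L2 norm of the coefficients summarizes how strongly they were shrunk
    norms.append(float(np.linalg.norm(model.coef_)))
    print(alpha, np.round(model.coef_, 3))
```

The coefficient norm decreases as `alpha` grows, but the individual coefficients get smaller without becoming exactly zero, which is why Ridge suits the case where every feature is expected to matter.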

<br><br>

###  Lasso (L1)

<br>

The Lasso method, or L1 penalty, adds the absolute value of the parameters to the cost function, instead of the square used in Ridge:

$$
\Theta_{Lasso} = \sum_{i = 1}^{n}[y_i - (\beta_0 + \beta X)]^2 + \alpha\sum_{j = 1}^{p}|\beta_{j}|
$$

Lasso has an interesting additional use because, in the iterative process of minimizing the cost function, some parameters $\beta$ are driven to exactly __zero__. In other words, the method can be used for __feature selection__, zeroing out the features that are least relevant to the model. With $\alpha = 0$, Lasso reduces to classical linear regression; for $\alpha > 0$, the larger the value of $\alpha$, the more parameters are zeroed.
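
To see why the L1 penalty can zero a coefficient while the L2 penalty only shrinks it, consider a simplified one-feature case. The setup is an assumption made for illustration: no intercept, a single feature scaled so that $\sum_i x_i^2 = 1$, and $z = \sum_i x_i y_i$ the unpenalized least-squares estimate. Up to a constant, the residual sum of squares is then $(z - \beta)^2$, and the two penalized problems have closed-form minimizers:

$$
\hat{\beta}_{Ridge} = \arg\min_{\beta}\,\left[(z - \beta)^2 + \alpha\beta^2\right] = \frac{z}{1 + \alpha}
$$

$$
\hat{\beta}_{Lasso} = \arg\min_{\beta}\,\left[(z - \beta)^2 + \alpha|\beta|\right] = \mathrm{sign}(z)\,\max\left(|z| - \frac{\alpha}{2},\ 0\right)
$$

Ridge rescales $z$ and reaches zero only in the limit $\alpha \to \infty$, whereas Lasso subtracts a fixed amount and returns exactly zero whenever $|z| \le \alpha/2$.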

As with _Ridge_, to use [__Lasso__](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lasso.html?highlight=lasso#sklearn.linear_model.Lasso) simply import its specific class:

<br>

```python
# Import the Lasso class
from sklearn.linear_model import Lasso

# Instantiate the model
model = Lasso(alpha = 1.0) # Lasso regularization strength
```
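
The feature-selection behavior can be sketched by counting how many coefficients `Lasso` sets to zero as `alpha` grows. The synthetic data below is an assumption for illustration. Note also that scikit-learn's `Lasso` scales the squared-error term by $1/(2n)$, so its `alpha` is not numerically identical to the $\alpha$ in the formula above.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Hypothetical synthetic data: two of the five true coefficients are zero
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
true_coef = np.array([4.0, 0.0, -3.0, 0.0, 0.5])
y = X @ true_coef + rng.normal(scale=0.1, size=200)

zeros_per_alpha = {}
for alpha in [0.01, 0.1, 1.0, 10.0]:
    model = Lasso(alpha=alpha).fit(X, y)
    # Count the coefficients that were driven to exactly zero
    zeros_per_alpha[alpha] = int(np.sum(model.coef_ == 0))

print(zeros_per_alpha)
```

For a small `alpha` the large coefficients survive, while a sufficiently large `alpha` removes every feature and leaves only the intercept.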

<br><br>

###  Elastic-Net (L1 + L2)

<br>

__Elastic-Net__ is a particularly interesting case because it combines both the L1 and L2 penalties, as described by the formula below:

<br>

$$
\Theta_{EN} = \sum_{i = 1}^{n}[y_i - (\beta_0 + \beta X)]^2 + \alpha_{1}\sum_{j = 1}^{p}|\beta_{j}| + \alpha_{2}\sum_{j = 1}^{p}\beta_{j}^{2}
$$

<br>

In other words, _Elastic-Net_ combines the coefficient shrinkage of _Ridge_ with the feature-selection behavior of _Lasso_. To use [_Elastic-Net_](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.ElasticNet.html), simply import its specific class:

<br>

```python
# Import the ElasticNet class
from sklearn.linear_model import ElasticNet

# Instantiate the model
model = ElasticNet(alpha = 1.0) # Overall penalty strength (l1_ratio defaults to 0.5)
```
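
Note that scikit-learn's `ElasticNet` does not take $\alpha_1$ and $\alpha_2$ separately. It exposes an overall strength `alpha` and a mixing parameter `l1_ratio`: the L1 term is weighted by `alpha * l1_ratio` and the L2 term by `0.5 * alpha * (1 - l1_ratio)`, with the squared-error term scaled by $1/(2n)$. The sketch below, on synthetic data (an assumption for illustration), checks that `l1_ratio = 1.0` reproduces Lasso.

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso

# Hypothetical synthetic data
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ np.array([4.0, 0.0, -3.0, 0.0, 0.5]) + rng.normal(scale=0.1, size=200)

# l1_ratio = 1.0 puts all the penalty weight on the L1 term
en_as_lasso = ElasticNet(alpha=0.1, l1_ratio=1.0).fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)

# l1_ratio = 0.5 (the default) mixes the L1 and L2 penalties
en_mixed = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)

print(np.round(en_as_lasso.coef_, 3))
print(np.round(lasso.coef_, 3))
print(np.round(en_mixed.coef_, 3))
```

In practice both `alpha` and `l1_ratio` are usually tuned together, since the defaults are not tied to the scale of any particular dataset.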

<br><br>

__Example:__ Let's return to the exercise with the `Car_Prices.csv` _dataset_ and now evaluate the data with different models:

In [1]:
# Import the required libraries
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error

In [3]:
# Load the CSV
cars = pd.read_csv('datasets/Car_Prices.csv', index_col = 0)

# Show the first rows
cars.head()

car_ID,symboling,CarName,fueltype,aspiration,doornumber,carbody,wheelbase,carlength,carwidth,carheight,...,cylindernumber,enginesize,boreratio,stroke,compressionratio,horsepower,peakrpm,citympg,highwaympg,price
1,3,alfa-romero giulia,gas,std,two,convertible,88.6,168.8,64.1,48.8,...,four,130,3.47,2.68,9.0,111,5000,21,27,13495.0
2,3,alfa-romero stelvio,gas,std,two,convertible,88.6,168.8,64.1,48.8,...,four,130,3.47,2.68,9.0,111,5000,21,27,16500.0
3,1,alfa-romero Quadrifoglio,gas,std,two,hatchback,94.5,171.2,65.5,52.4,...,six,152,2.68,3.47,9.0,154,5000,19,26,16500.0
4,2,audi 100 ls,gas,std,four,sedan,99.8,176.6,66.2,54.3,...,four,109,3.19,3.4,10.0,102,5500,24,30,13950.0
5,2,audi 100ls,gas,std,four,sedan,99.4,176.6,66.4,54.3,...,five,136,3.19,3.4,8.0,115,5500,18,22,17450.0


In [4]:
# Apply get_dummies (one-hot encode the categorical columns)
cars_dummies = pd.get_dummies(cars,
                              prefix_sep = '_',
                              columns = ['fueltype', 
                                         'aspiration', 
                                         'doornumber', 
                                         'carbody', 
                                         'cylindernumber'],
                              drop_first = True)

In [5]:
# Inspect the transformed data
cars_dummies.head()

car_ID,symboling,CarName,wheelbase,carlength,carwidth,carheight,curbweight,enginesize,boreratio,stroke,...,carbody_hardtop,carbody_hatchback,carbody_sedan,carbody_wagon,cylindernumber_five,cylindernumber_four,cylindernumber_six,cylindernumber_three,cylindernumber_twelve,cylindernumber_two
1,3,alfa-romero giulia,88.6,168.8,64.1,48.8,2548,130,3.47,2.68,...,0,0,0,0,0,1,0,0,0,0
2,3,alfa-romero stelvio,88.6,168.8,64.1,48.8,2548,130,3.47,2.68,...,0,0,0,0,0,1,0,0,0,0
3,1,alfa-romero Quadrifoglio,94.5,171.2,65.5,52.4,2823,152,2.68,3.47,...,0,1,0,0,0,0,1,0,0,0
4,2,audi 100 ls,99.8,176.6,66.2,54.3,2337,109,3.19,3.4,...,0,0,1,0,0,1,0,0,0,0
5,2,audi 100ls,99.4,176.6,66.4,54.3,2824,136,3.19,3.4,...,0,0,1,0,1,0,0,0,0,0


In [6]:
# Split into X and y
X = cars_dummies.drop(['CarName', 'price'], axis = 1)
y = cars_dummies['price']

In [7]:
# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size = 0.3,
                                                    random_state = 42)

In [8]:
# Instantiate the standardization
scaler = StandardScaler()

# Fit the scaler on the training set only, then transform both sets
X_train_std = scaler.fit_transform(X_train)
X_test_std = scaler.transform(X_test)

### Linear Regression

In [9]:
# Instantiate the model
linreg = LinearRegression()

In [10]:
# Fit the model on the training data
linreg.fit(X_train_std, y_train)

In [11]:
# Compute the predictions for the test set
y_pred_lr = linreg.predict(X_test_std)

In [12]:
# Feature importance - LinReg
coefs = linreg.coef_

list_columns = X_train.columns
list_feature = []
list_score = []

for i, v in enumerate(coefs):
    list_feature.append(list_columns[i])
    list_score.append(v)

dictionary = {'Features': list_feature,
              'Scores': list_score}

df_features = pd.DataFrame(dictionary)
df_features = df_features.sort_values(by=['Scores'], ascending=False)
df_features.reset_index(inplace=True, drop=True)
df_features

,Features,Scores
0,enginesize,5315.351
1,compressionratio,2924.531
2,fueltype_gas,2700.265
3,curbweight,1243.329
4,highwaympg,1047.525
5,carwidth,981.8643
6,peakrpm,959.9665
7,aspiration_turbo,872.8895
8,wheelbase,761.8308
9,cylindernumber_two,696.099
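
The coefficient table built in the cell above could also be produced by a small reusable helper. The sketch below is one possible version: the name `coef_table` and the toy data are assumptions, and unlike the cell above it sorts by absolute value, so that large negative coefficients also appear near the top.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

def coef_table(model, feature_names):
    """Return the fitted coefficients sorted by absolute value, largest first."""
    table = pd.DataFrame({'Features': list(feature_names),
                          'Scores': model.coef_})
    order = table['Scores'].abs().sort_values(ascending=False).index
    return table.loc[order].reset_index(drop=True)

# Hypothetical toy data with an exact linear relationship: y = 2*a - 5*b
X_toy = pd.DataFrame({'a': [0.0, 1.0, 2.0, 3.0], 'b': [1.0, 0.0, 1.0, 0.0]})
y_toy = 2.0 * X_toy['a'] - 5.0 * X_toy['b']

toy_model = LinearRegression().fit(X_toy, y_toy)
toy_table = coef_table(toy_model, X_toy.columns)
print(toy_table)
```

The same helper would work for any fitted linear model that exposes a one-dimensional `coef_`, such as `Ridge`, `Lasso` or `ElasticNet`.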


### Ridge

In [13]:
# Instantiate the model
ridge = Ridge()

In [14]:
# Fit the model on the training data
ridge.fit(X_train_std, y_train)

In [15]:
# Compute the predictions for the test set
y_pred_ridge = ridge.predict(X_test_std)

In [16]:
# Feature importance - Ridge
coefs2 = ridge.coef_

list_columns = X_train.columns
list_feature = []
list_score = []

for i, v in enumerate(coefs2):
    list_feature.append(list_columns[i])
    list_score.append(v)

dictionary2 = {'Features': list_feature,
              'Scores': list_score}

df_features2 = pd.DataFrame(dictionary2)
df_features2 = df_features2.sort_values(by=['Scores'], ascending=False)
df_features2.reset_index(inplace=True, drop=True)
df_features2

,Features,Scores
0,enginesize,4358.903526
1,curbweight,1412.142609
2,compressionratio,1165.949671
3,carwidth,1077.449684
4,fueltype_gas,922.637678
5,peakrpm,914.678189
6,wheelbase,642.395811
7,highwaympg,601.577174
8,aspiration_turbo,589.250228
9,cylindernumber_two,482.123002


### Lasso

In [17]:
# Instantiate the model
lasso = Lasso(alpha = 5)

In [18]:
# Fit the model on the training data
lasso.fit(X_train_std, y_train)

In [19]:
# Compute the predictions for the test set
y_pred_lasso = lasso.predict(X_test_std)

In [20]:
# Feature importance - Lasso
coefs3 = lasso.coef_

list_columns = X_train.columns
list_feature = []
list_score = []

for i, v in enumerate(coefs3):
    list_feature.append(list_columns[i])
    list_score.append(v)

dictionary3 = {'Features': list_feature,
              'Scores': list_score}

df_features3 = pd.DataFrame(dictionary3)
df_features3 = df_features3.sort_values(by=['Scores'], ascending=False)
df_features3.reset_index(inplace=True, drop=True)
df_features3

,Features,Scores
0,enginesize,5018.971189
1,compressionratio,1814.80548
2,fueltype_gas,1576.132772
3,curbweight,1234.433335
4,carwidth,1056.063431
5,peakrpm,971.719399
6,highwaympg,752.561728
7,aspiration_turbo,727.886137
8,cylindernumber_two,644.306544
9,wheelbase,634.338673


### ElasticNet

In [21]:
# Instantiate the model
EN = ElasticNet()

In [22]:
# Fit the model on the training data
EN.fit(X_train_std, y_train)

In [23]:
# Compute the predictions for the test set
y_pred_EN = EN.predict(X_test_std)

In [24]:
# Feature importance - Elastic-Net
coefs4 = EN.coef_

list_columns = X_train.columns
list_feature = []
list_score = []

for i, v in enumerate(coefs4):
    list_feature.append(list_columns[i])
    list_score.append(v)

dictionary4 = {'Features': list_feature,
              'Scores': list_score}

df_features4 = pd.DataFrame(dictionary4)
df_features4 = df_features4.sort_values(by=['Scores'], ascending=False)
df_features4.reset_index(inplace=True, drop=True)
df_features4

,Features,Scores
0,enginesize,1403.34743
1,carwidth,1013.262644
2,horsepower,997.267325
3,curbweight,961.87536
4,carbody_hardtop,848.036148
5,wheelbase,449.036598
6,carlength,334.534228
7,boreratio,326.37469
8,peakrpm,220.400167
9,doornumber_two,209.115872


Comparing the metrics to evaluate the models:

In [25]:
# Comparing the R2 scores
print('R2 - Linear Regression: ', np.round(r2_score(y_test, y_pred_lr), 4))
print('R2 - Ridge:             ', np.round(r2_score(y_test, y_pred_ridge), 4))
print('R2 - Lasso:             ', np.round(r2_score(y_test, y_pred_lasso), 4))
print('R2 - Elastic-Net:       ', np.round(r2_score(y_test, y_pred_EN), 4))

R2 - Linear Regression:  0.7923
R2 - Ridge:              0.7912
R2 - Lasso:              0.7949
R2 - Elastic-Net:        0.7291


In [26]:
# Comparing the MSE
print('MSE - Linear Regression: ', np.round(mean_squared_error(y_test, y_pred_lr), 4))
print('MSE - Ridge:             ', np.round(mean_squared_error(y_test, y_pred_ridge), 4))
print('MSE - Lasso:             ', np.round(mean_squared_error(y_test, y_pred_lasso), 4))
print('MSE - Elastic-Net:       ', np.round(mean_squared_error(y_test, y_pred_EN), 4))

MSE - Linear Regression:  14390119.8668
MSE - Ridge:              14466383.1511
MSE - Lasso:              14210201.8536
MSE - Elastic-Net:        18770205.5038


In [27]:
# Comparing the MAE
print('MAE - Linear Regression: ', np.round(mean_absolute_error(y_test, y_pred_lr), 4))
print('MAE - Ridge:             ', np.round(mean_absolute_error(y_test, y_pred_ridge), 4))
print('MAE - Lasso:             ', np.round(mean_absolute_error(y_test, y_pred_lasso), 4))
print('MAE - Elastic-Net:       ', np.round(mean_absolute_error(y_test, y_pred_EN), 4))

MAE - Linear Regression:  2632.2202
MAE - Ridge:              2638.6542
MAE - Lasso:              2632.6767
MAE - Elastic-Net:        3014.7481
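
The three comparison cells above repeat the same pattern for each metric. One possible way to collect everything into a single table is sketched below; the helper name `compare_models` and the toy predictions are assumptions for illustration, not part of the original notebook.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error

def compare_models(y_true, predictions):
    """Build one row of R2, MSE and MAE per model from a dict of predictions."""
    rows = []
    for name, y_pred in predictions.items():
        rows.append({'Model': name,
                     'R2': r2_score(y_true, y_pred),
                     'MSE': mean_squared_error(y_true, y_pred),
                     'MAE': mean_absolute_error(y_true, y_pred)})
    return pd.DataFrame(rows).set_index('Model').round(4)

# Hypothetical toy example: a perfect predictor and one that always predicts the mean
y_true_toy = np.array([1.0, 2.0, 3.0, 4.0])
toy_predictions = {'perfect': y_true_toy.copy(),
                   'mean-only': np.full(4, 2.5)}

summary = compare_models(y_true_toy, toy_predictions)
print(summary)
```

In the notebook, the same function could be called with `y_test` and a dictionary such as `{'Linear Regression': y_pred_lr, 'Ridge': y_pred_ridge, 'Lasso': y_pred_lasso, 'Elastic-Net': y_pred_EN}`.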


## Exercises

__1)__ Re-evaluate the `insurance.csv` dataset and compare the regularized models against Linear Regression.

In [2]:
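# Load the CSV
# Note: index_col = 0 turns the first column (age) into the index,
# so age is not available as a feature in the steps below
insurance = pd.read_csv('datasets/insurance.csv', index_col = 0)

# Show the first rows
insurance.head()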

age,sex,bmi,children,smoker,region,charges
19,female,27.9,0,yes,southwest,16884.924
18,male,33.77,1,no,southeast,1725.5523
28,male,33.0,3,no,southeast,4449.462
33,male,22.705,0,no,northwest,21984.47061
32,male,28.88,0,no,northwest,3866.8552
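
In the output above, `age` appears as the index rather than as a regular column, which is the effect of `index_col = 0` on a file whose first column is a feature. The sketch below reproduces this on a tiny in-memory CSV; the three-column sample is an assumption that only mimics the layout of the file.

```python
import io
import pandas as pd

# Hypothetical miniature of a CSV whose first column is a feature, not an ID
csv_text = "age,bmi,charges\n19,27.9,16884.92\n18,33.77,1725.55\n"

with_index = pd.read_csv(io.StringIO(csv_text), index_col=0)
without_index = pd.read_csv(io.StringIO(csv_text))

print(list(with_index.columns))     # age has moved to the index
print(list(without_index.columns))  # age is kept as a column
```

If `age` should take part in the model, reading the file without `index_col` keeps it among the columns, at the cost of changing the results shown in this exercise.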


In [3]:
insurance_final = insurance.copy()

# Encode the binary columns as 0/1 (assigning back avoids chained in-place edits)
insurance_final['sex'] = insurance_final['sex'].map({'female': 0, 'male': 1})
insurance_final['smoker'] = insurance_final['smoker'].map({'no': 0, 'yes': 1})

# One-hot encode region; all four dummies are kept here (no drop_first)
insurance_final = pd.get_dummies(insurance_final, columns=['region'])

insurance_final.head()

age,sex,bmi,children,smoker,charges,region_northeast,region_northwest,region_southeast,region_southwest
19,0,27.9,0,1,16884.924,0,0,0,1
18,1,33.77,1,0,1725.5523,0,0,1,0
28,1,33.0,3,0,4449.462,0,0,1,0
33,1,22.705,0,0,21984.47061,0,1,0,0
32,1,28.88,0,0,3866.8552,0,1,0,0


In [8]:
X = insurance_final.drop(['charges'], axis = 1)
y = insurance_final['charges']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

scaler = StandardScaler()

X_train_std = scaler.fit_transform(X_train)
X_test_std = scaler.transform(X_test)

(936, 8) (402, 8) (936,) (402,)


### Linear Regression

In [12]:
linreg = LinearRegression()

linreg.fit(X_train_std, y_train)

y_pred_lr = linreg.predict(X_test_std)

coefs = linreg.coef_

list_columns = X_train.columns
list_feature = []
list_score = []

for i, v in enumerate(coefs):
    list_feature.append(list_columns[i])
    list_score.append(v)

dictionary = {'Features': list_feature,
              'Scores': list_score}

df_features = pd.DataFrame(dictionary)
df_features = df_features.sort_values(by=['Scores'], ascending=False)
df_features.reset_index(inplace=True, drop=True)
df_features

,Features,Scores
0,smoker,9457.081
1,bmi,2628.342
2,children,724.8849
3,sex,0.7035956
4,region_southwest,-1.882187e+17
5,region_northwest,-1.902237e+17
6,region_northeast,-1.934809e+17
7,region_southeast,-1.942608e+17
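
The enormous coefficients on the `region_*` columns in the table above are a likely symptom of the dummy variable trap: with all four region dummies kept, every row sums to 1 across them, which duplicates the information carried by the intercept and leaves the unregularized least-squares problem without a unique solution. The sketch below shows the dependency and how `drop_first=True` removes it; the five-row sample is an assumption for illustration.

```python
import pandas as pd

# Hypothetical sample covering the four region categories
regions = pd.DataFrame({'region': ['northeast', 'northwest', 'southeast',
                                   'southwest', 'northeast']})

# Keeping every category: the dummy columns always add up to 1
full = pd.get_dummies(regions, columns=['region'], dtype=int)

# Dropping the first category breaks the exact linear dependency
reduced = pd.get_dummies(regions, columns=['region'], drop_first=True, dtype=int)

print(full.sum(axis=1).unique())
print(list(reduced.columns))
```

The regularized models in this exercise are less affected because the penalty makes the solution unique even when the columns are linearly dependent.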


### Ridge

In [13]:
ridge = Ridge()

ridge.fit(X_train_std, y_train)

y_pred_ridge = ridge.predict(X_test_std)

coefs2 = ridge.coef_

list_columns = X_train.columns
list_feature = []
list_score = []

for i, v in enumerate(coefs2):
    list_feature.append(list_columns[i])
    list_score.append(v)

dictionary2 = {'Features': list_feature,
              'Scores': list_score}

df_features2 = pd.DataFrame(dictionary2)
df_features2 = df_features2.sort_values(by=['Scores'], ascending=False)
df_features2.reset_index(inplace=True, drop=True)
df_features2

,Features,Scores
0,smoker,9420.498561
1,bmi,2524.707476
2,children,744.557499
3,region_northeast,283.37588
4,region_northwest,118.655123
5,sex,-1.110443
6,region_southwest,-110.300944
7,region_southeast,-291.557233


### Lasso

In [14]:
lasso = Lasso(alpha = 5)

lasso.fit(X_train_std, y_train)

y_pred_lasso = lasso.predict(X_test_std)

coefs3 = lasso.coef_

list_columns = X_train.columns
list_feature = []
list_score = []

for i, v in enumerate(coefs3):
    list_feature.append(list_columns[i])
    list_score.append(v)

dictionary3 = {'Features': list_feature,
              'Scores': list_score}

df_features3 = pd.DataFrame(dictionary3)
df_features3 = df_features3.sort_values(by=['Scores'], ascending=False)
df_features3.reset_index(inplace=True, drop=True)
df_features3

,Features,Scores
0,smoker,9425.388484
1,bmi,2520.277457
2,children,740.487511
3,region_northeast,389.397211
4,region_northwest,222.775889
5,sex,-0.0
6,region_southwest,-0.0
7,region_southeast,-177.200936


### ElasticNet

In [16]:
EN = ElasticNet()

EN.fit(X_train_std, y_train)

y_pred_EN = EN.predict(X_test_std)

coefs4 = EN.coef_

list_columns = X_train.columns
list_feature = []
list_score = []

for i, v in enumerate(coefs4):
    list_feature.append(list_columns[i])
    list_score.append(v)

dictionary4 = {'Features': list_feature,
              'Scores': list_score}

df_features4 = pd.DataFrame(dictionary4)
df_features4 = df_features4.sort_values(by=['Scores'], ascending=False)
df_features4.reset_index(inplace=True, drop=True)
df_features4

,Features,Scores
0,smoker,6274.641594
1,bmi,1603.315179
2,children,549.085216
3,region_northeast,147.949491
4,sex,115.469825
5,region_southeast,-24.198958
6,region_northwest,-36.171328
7,region_southwest,-88.48408


Comparing the metrics to evaluate the models:

In [18]:
# Comparing the R2 scores
print('Comparing the R2 scores')
print('R2 - Linear Regression: ', np.round(r2_score(y_test, y_pred_lr), 4))
print('R2 - Ridge:             ', np.round(r2_score(y_test, y_pred_ridge), 4))
print('R2 - Lasso:             ', np.round(r2_score(y_test, y_pred_lasso), 4))
print('R2 - Elastic-Net:       ', np.round(r2_score(y_test, y_pred_EN), 4))

# Comparing the MSE
print('----------------------------------------------------')
print('Comparing the MSE')
print('MSE - Linear Regression: ', np.round(mean_squared_error(y_test, y_pred_lr), 4))
print('MSE - Ridge:             ', np.round(mean_squared_error(y_test, y_pred_ridge), 4))
print('MSE - Lasso:             ', np.round(mean_squared_error(y_test, y_pred_lasso), 4))
print('MSE - Elastic-Net:       ', np.round(mean_squared_error(y_test, y_pred_EN), 4))

# Comparing the MAE
print('----------------------------------------------------')
print('Comparing the MAE')
print('MAE - Linear Regression: ', np.round(mean_absolute_error(y_test, y_pred_lr), 4))
print('MAE - Ridge:             ', np.round(mean_absolute_error(y_test, y_pred_ridge), 4))
print('MAE - Lasso:             ', np.round(mean_absolute_error(y_test, y_pred_lasso), 4))
print('MAE - Elastic-Net:       ', np.round(mean_absolute_error(y_test, y_pred_EN), 4))

Comparing the R2 scores
R2 - Linear Regression:  0.6897
R2 - Ridge:              0.6903
R2 - Lasso:              0.6904
R2 - Elastic-Net:        0.5997
----------------------------------------------------
Comparing the MSE
MSE - Linear Regression:  45492554.874
MSE - Ridge:              45404153.4826
MSE - Lasso:              45400503.5976
MSE - Elastic-Net:        58693493.3441
----------------------------------------------------
Comparing the MAE
MAE - Linear Regression:  5282.3897
MAE - Ridge:              5277.8712
MAE - Lasso:              5277.4146
MAE - Elastic-Net:        5980.8752


__2)__ Re-evaluate the `Admission_Predict.csv` dataset and compare the regularized models against Linear Regression.

In [21]:
# Load the CSV
admission_predict = pd.read_csv('datasets/Admission_Predict.csv', index_col = 0)

# Show the first rows
admission_predict.head()

Serial No.,GRE Score,TOEFL Score,University Rating,SOP,LOR,CGPA,Research,Chance of Admit
1,337,118,4,4.5,4.5,9.65,1,0.92
2,324,107,4,4.0,4.5,8.87,1,0.76
3,316,104,3,3.0,3.5,8.0,1,0.72
4,322,110,3,3.5,2.5,8.67,1,0.8
5,314,103,2,2.0,3.0,8.21,0,0.65


In [22]:
X = admission_predict.drop(['Chance of Admit'], axis = 1)
y = admission_predict['Chance of Admit']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

scaler = StandardScaler()

X_train_std = scaler.fit_transform(X_train)
X_test_std = scaler.transform(X_test)

(280, 7) (120, 7) (280,) (120,)


### Linear Regression

In [23]:
linreg = LinearRegression()

linreg.fit(X_train_std, y_train)

y_pred_lr = linreg.predict(X_test_std)

coefs = linreg.coef_

list_columns = X_train.columns
list_feature = []
list_score = []

for i, v in enumerate(coefs):
    list_feature.append(list_columns[i])
    list_score.append(v)

dictionary = {'Features': list_feature,
              'Scores': list_score}

df_features = pd.DataFrame(dictionary)
df_features = df_features.sort_values(by=['Scores'], ascending=False)
df_features.reset_index(inplace=True, drop=True)
df_features

,Features,Scores
0,CGPA,0.069904
1,GRE Score,0.02087
2,TOEFL Score,0.019058
3,LOR,0.012549
4,Research,0.009151
5,University Rating,0.005591
6,SOP,0.001019


### Ridge

In [24]:
ridge = Ridge()

ridge.fit(X_train_std, y_train)

y_pred_ridge = ridge.predict(X_test_std)

coefs2 = ridge.coef_

list_columns = X_train.columns
list_feature = []
list_score = []

for i, v in enumerate(coefs2):
    list_feature.append(list_columns[i])
    list_score.append(v)

dictionary2 = {'Features': list_feature,
              'Scores': list_score}

df_features2 = pd.DataFrame(dictionary2)
df_features2 = df_features2.sort_values(by=['Scores'], ascending=False)
df_features2.reset_index(inplace=True, drop=True)
df_features2

,Features,Scores
0,CGPA,0.068938
1,GRE Score,0.021166
2,TOEFL Score,0.019223
3,LOR,0.012649
4,Research,0.009149
5,University Rating,0.005769
6,SOP,0.00121


### Lasso

In [25]:
lasso = Lasso(alpha = 5)

lasso.fit(X_train_std, y_train)

y_pred_lasso = lasso.predict(X_test_std)

coefs3 = lasso.coef_

list_columns = X_train.columns
list_feature = []
list_score = []

for i, v in enumerate(coefs3):
    list_feature.append(list_columns[i])
    list_score.append(v)

dictionary3 = {'Features': list_feature,
              'Scores': list_score}

df_features3 = pd.DataFrame(dictionary3)
df_features3 = df_features3.sort_values(by=['Scores'], ascending=False)
df_features3.reset_index(inplace=True, drop=True)
df_features3

,Features,Scores
0,GRE Score,0.0
1,TOEFL Score,0.0
2,University Rating,0.0
3,SOP,0.0
4,LOR,0.0
5,CGPA,0.0
6,Research,0.0


### ElasticNet

In [26]:
EN = ElasticNet()

EN.fit(X_train_std, y_train)

y_pred_EN = EN.predict(X_test_std)

coefs4 = EN.coef_

list_columns = X_train.columns
list_feature = []
list_score = []

for i, v in enumerate(coefs4):
    list_feature.append(list_columns[i])
    list_score.append(v)

dictionary4 = {'Features': list_feature,
              'Scores': list_score}

df_features4 = pd.DataFrame(dictionary4)
df_features4 = df_features4.sort_values(by=['Scores'], ascending=False)
df_features4.reset_index(inplace=True, drop=True)
df_features4

,Features,Scores
0,GRE Score,0.0
1,TOEFL Score,0.0
2,University Rating,0.0
3,SOP,0.0
4,LOR,0.0
5,CGPA,0.0
6,Research,0.0


Comparing the metrics to evaluate the models:

In [27]:
# Comparing the R2 scores
print('Comparing the R2 scores')
print('R2 - Linear Regression: ', np.round(r2_score(y_test, y_pred_lr), 4))
print('R2 - Ridge:             ', np.round(r2_score(y_test, y_pred_ridge), 4))
print('R2 - Lasso:             ', np.round(r2_score(y_test, y_pred_lasso), 4))
print('R2 - Elastic-Net:       ', np.round(r2_score(y_test, y_pred_EN), 4))

# Comparing the MSE
print('----------------------------------------------------')
print('Comparing the MSE')
print('MSE - Linear Regression: ', np.round(mean_squared_error(y_test, y_pred_lr), 4))
print('MSE - Ridge:             ', np.round(mean_squared_error(y_test, y_pred_ridge), 4))
print('MSE - Lasso:             ', np.round(mean_squared_error(y_test, y_pred_lasso), 4))
print('MSE - Elastic-Net:       ', np.round(mean_squared_error(y_test, y_pred_EN), 4))

# Comparing the MAE
print('----------------------------------------------------')
print('Comparing the MAE')
print('MAE - Linear Regression: ', np.round(mean_absolute_error(y_test, y_pred_lr), 4))
print('MAE - Ridge:             ', np.round(mean_absolute_error(y_test, y_pred_ridge), 4))
print('MAE - Lasso:             ', np.round(mean_absolute_error(y_test, y_pred_lasso), 4))
print('MAE - Elastic-Net:       ', np.round(mean_absolute_error(y_test, y_pred_EN), 4))

Comparing the R2 scores
R2 - Linear Regression:  0.7956
R2 - Ridge:              0.7955
R2 - Lasso:              -0.0176
R2 - Elastic-Net:        -0.0176
----------------------------------------------------
Comparing the MSE
MSE - Linear Regression:  0.0047
MSE - Ridge:              0.0047
MSE - Lasso:              0.0232
MSE - Elastic-Net:        0.0232
----------------------------------------------------
Comparing the MAE
MAE - Linear Regression:  0.0495
MAE - Ridge:              0.0495
MAE - Lasso:              0.1234
MAE - Elastic-Net:        0.1234
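
The all-zero Lasso and Elastic-Net coefficients in this exercise, and the slightly negative R2, are what one would expect when the penalty is large relative to the scale of the target: `Chance of Admit` lies between 0 and 1, so `alpha = 5` (and even the default `alpha = 1.0`) can overwhelm the squared-error term, and the model falls back to predicting a constant. A minimal sketch with synthetic data of a similar scale (the data and coefficients are assumptions, not the admissions dataset) shows the effect and how `LassoCV` could pick a smaller `alpha` by cross-validation:

```python
import numpy as np
from sklearn.linear_model import Lasso, LassoCV

# Hypothetical data: standardized features and a target confined to a narrow range
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = 0.7 + X @ np.array([0.05, 0.02, 0.0, 0.01]) + rng.normal(scale=0.01, size=300)

# A penalty far above the scale of the target removes every feature
too_strong = Lasso(alpha=5).fit(X, y)
print(too_strong.coef_)

# Cross-validation searches a grid of alphas adapted to the data
tuned = LassoCV(cv=5, random_state=0).fit(X, y)
print(tuned.alpha_, np.round(tuned.coef_, 3))
```

Choosing `alpha` by cross-validation, or at least on a scale comparable to the target, avoids comparing an effectively empty model against Linear Regression.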


__3)__ Re-evaluate the `usa_housing.csv` dataset and compare the regularized models against Linear Regression.

In [31]:
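# Load the CSV
# Note: index_col = 0 turns the first column (Avg. Area Income) into the index,
# so it is not available as a feature in the steps below
usa_housing = pd.read_csv('datasets/usa_housing.csv', index_col = 0)

# Show the first rows
usa_housing.head()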

Avg. Area Income,Avg. Area House Age,Avg. Area Number of Rooms,Avg. Area Number of Bedrooms,Area Population,Price,Address
79545.458574,5.682861,7.009188,4.09,23086.800503,1059034.0,"208 Michael Ferry Apt. 674\nLaurabury, NE 3701..."
79248.642455,6.0029,6.730821,3.09,40173.072174,1505891.0,"188 Johnson Views Suite 079\nLake Kathleen, CA..."
61287.067179,5.86589,8.512727,5.13,36882.1594,1058988.0,"9127 Elizabeth Stravenue\nDanieltown, WI 06482..."
63345.240046,7.188236,5.586729,3.26,34310.242831,1260617.0,USS Barnett\nFPO AP 44820
59982.197226,5.040555,7.839388,4.23,26354.109472,630943.5,USNS Raymond\nFPO AE 09386


In [33]:
X = usa_housing.drop(['Address', 'Price'], axis = 1)
y = usa_housing['Price']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

scaler = StandardScaler()

X_train_std = scaler.fit_transform(X_train)
X_test_std = scaler.transform(X_test)

(3500, 4) (1500, 4) (3500,) (1500,)


### Linear Regression

In [34]:
linreg = LinearRegression()

linreg.fit(X_train_std, y_train)

y_pred_lr = linreg.predict(X_test_std)

coefs = linreg.coef_

list_columns = X_train.columns
list_feature = []
list_score = []

for i, v in enumerate(coefs):
    list_feature.append(list_columns[i])
    list_score.append(v)

dictionary = {'Features': list_feature,
              'Scores': list_score}

df_features = pd.DataFrame(dictionary)
df_features = df_features.sort_values(by=['Scores'], ascending=False)
df_features.reset_index(inplace=True, drop=True)
df_features

,Features,Scores
0,Avg. Area House Age,165169.479807
1,Area Population,146204.661275
2,Avg. Area Number of Rooms,116065.26473
3,Avg. Area Number of Bedrooms,12170.177959


### Ridge

In [35]:
ridge = Ridge()

ridge.fit(X_train_std, y_train)

y_pred_ridge = ridge.predict(X_test_std)

coefs2 = ridge.coef_

list_columns = X_train.columns
list_feature = []
list_score = []

for i, v in enumerate(coefs2):
    list_feature.append(list_columns[i])
    list_score.append(v)

dictionary2 = {'Features': list_feature,
              'Scores': list_score}

df_features2 = pd.DataFrame(dictionary2)
df_features2 = df_features2.sort_values(by=['Scores'], ascending=False)
df_features2.reset_index(inplace=True, drop=True)
df_features2

,Features,Scores
0,Avg. Area House Age,165121.780184
1,Area Population,146163.064874
2,Avg. Area Number of Rooms,116024.383736
3,Avg. Area Number of Bedrooms,12185.539384


### Lasso

In [36]:
lasso = Lasso(alpha = 5)

lasso.fit(X_train_std, y_train)

y_pred_lasso = lasso.predict(X_test_std)

coefs3 = lasso.coef_

list_columns = X_train.columns
list_feature = []
list_score = []

for i, v in enumerate(coefs3):
    list_feature.append(list_columns[i])
    list_score.append(v)

dictionary3 = {'Features': list_feature,
              'Scores': list_score}

df_features3 = pd.DataFrame(dictionary3)
df_features3 = df_features3.sort_values(by=['Scores'], ascending=False)
df_features3.reset_index(inplace=True, drop=True)
df_features3

,Features,Scores
0,Avg. Area House Age,165164.742496
1,Area Population,146199.602985
2,Avg. Area Number of Rooms,116066.201468
3,Avg. Area Number of Bedrooms,12164.709647


### ElasticNet

In [37]:
EN = ElasticNet()

EN.fit(X_train_std, y_train)

y_pred_EN = EN.predict(X_test_std)

coefs4 = EN.coef_

list_columns = X_train.columns
list_feature = []
list_score = []

for i, v in enumerate(coefs4):
    list_feature.append(list_columns[i])
    list_score.append(v)

dictionary4 = {'Features': list_feature,
              'Scores': list_score}

df_features4 = pd.DataFrame(dictionary4)
df_features4 = df_features4.sort_values(by=['Scores'], ascending=False)
df_features4.reset_index(inplace=True, drop=True)
df_features4

,Features,Scores
0,Avg. Area House Age,109784.051095
1,Area Population,97531.873041
2,Avg. Area Number of Rooms,74203.872487
3,Avg. Area Number of Bedrooms,20954.650659


Comparing the metrics to evaluate the models:

In [38]:
# Compare R2
print('Comparing R2')
print('R2 - Linear Regression: ', np.round(r2_score(y_test, y_pred_lr), 4))
print('R2 - Ridge:             ', np.round(r2_score(y_test, y_pred_ridge), 4))
print('R2 - Lasso:             ', np.round(r2_score(y_test, y_pred_lasso), 4))
print('R2 - Elastic-Net:       ', np.round(r2_score(y_test, y_pred_EN), 4))

# Compare MSE
print('----------------------------------------------------')
print('Comparing MSE')
print('MSE - Linear Regression: ', np.round(mean_squared_error(y_test, y_pred_lr), 4))
print('MSE - Ridge:             ', np.round(mean_squared_error(y_test, y_pred_ridge), 4))
print('MSE - Lasso:             ', np.round(mean_squared_error(y_test, y_pred_lasso), 4))
print('MSE - Elastic-Net:       ', np.round(mean_squared_error(y_test, y_pred_EN), 4))

# Compare MAE
print('----------------------------------------------------')
print('Comparing MAE')
print('MAE - Linear Regression: ', np.round(mean_absolute_error(y_test, y_pred_lr), 4))
print('MAE - Ridge:             ', np.round(mean_absolute_error(y_test, y_pred_ridge), 4))
print('MAE - Lasso:             ', np.round(mean_absolute_error(y_test, y_pred_lasso), 4))
print('MAE - Elastic-Net:       ', np.round(mean_absolute_error(y_test, y_pred_EN), 4))

Comparing R2
R2 - Linear Regression:  0.4909
R2 - Ridge:              0.4909
R2 - Lasso:              0.4909
R2 - Elastic-Net:        0.439
----------------------------------------------------
Comparing MSE
MSE - Linear Regression:  60074362397.2338
MSE - Ridge:              60074081397.7371
MSE - Lasso:              60074244223.8931
MSE - Elastic-Net:        66206437518.9981
----------------------------------------------------
Comparing MAE
MAE - Linear Regression:  194585.1324
MAE - Ridge:              194584.6976
MAE - Lasso:              194584.8595
MAE - Elastic-Net:        203914.7427
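
The same comparison can be collected into a single table, which is easier to scan than twelve printed lines. The sketch below is only an illustration: `compare_metrics` is a hypothetical helper, not something defined in this notebook, and the toy arrays are made-up values chosen so the metrics can be checked by hand.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score


def compare_metrics(y_true, predictions):
    """Return one row of R2, MSE and MAE per model (hypothetical helper)."""
    rows = {
        name: {
            'R2': r2_score(y_true, y_pred),
            'MSE': mean_squared_error(y_true, y_pred),
            'MAE': mean_absolute_error(y_true, y_pred),
        }
        for name, y_pred in predictions.items()
    }
    return pd.DataFrame.from_dict(rows, orient='index').round(4)


# Toy values only, to show the call shape; this is not the notebook's data.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
toy_predictions = {
    'perfect': np.array([3.0, 5.0, 7.0, 9.0]),    # reproduces y_true exactly
    'mean only': np.full(4, y_true.mean()),       # always predicts the mean
}
metrics_table = compare_metrics(y_true, toy_predictions)
print(metrics_table)
```

In the notebook, the call would presumably look like `compare_metrics(y_test, {'Linear Regression': y_pred_lr, 'Ridge': y_pred_ridge, 'Lasso': y_pred_lasso, 'Elastic-Net': y_pred_EN})`.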


__4)__ Using the `penguins` dataset and the regression and regularization models, build a regression to predict the penguins' body mass (`body_mass_g`).

In [57]:
penguins = sns.load_dataset('penguins')
penguins.head()

| | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex |
|---|---|---|---|---|---|---|---|
| 0 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | Male |
| 1 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | Female |
| 2 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | Female |
| 3 | Adelie | Torgersen | NaN | NaN | NaN | NaN | NaN |
| 4 | Adelie | Torgersen | 36.7 | 19.3 | 193.0 | 3450.0 | Female |


In [58]:
penguins.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 344 entries, 0 to 343
Data columns (total 7 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   species            344 non-null    object 
 1   island             344 non-null    object 
 2   bill_length_mm     342 non-null    float64
 3   bill_depth_mm      342 non-null    float64
 4   flipper_length_mm  342 non-null    float64
 5   body_mass_g        342 non-null    float64
 6   sex                333 non-null    object 
dtypes: float64(4), object(3)
memory usage: 18.9+ KB


In [59]:
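# One-hot encode the categorical columns
penguins = pd.get_dummies(penguins, columns=['species','island','sex'])
# Fill the remaining missing values with 0. Note that this also sets the target
# body_mass_g to 0 for the two rows without measurements (see row 3 below).
penguins.replace(np.nan, 0, inplace=True)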

penguins.head()

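| | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | species_Adelie | species_Chinstrap | species_Gentoo | island_Biscoe | island_Dream | island_Torgersen | sex_Female | sex_Male |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 39.1 | 18.7 | 181.0 | 3750.0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 |
| 1 | 39.5 | 17.4 | 186.0 | 3800.0 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 0 |
| 2 | 40.3 | 18.0 | 195.0 | 3250.0 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 0 |
| 3 | 0.0 | 0.0 | 0.0 | 0.0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 4 | 36.7 | 19.3 | 193.0 | 3450.0 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 0 |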
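
Row 3 above ends up with a body mass of 0 g, which is not a real measurement. One possible alternative, sketched here under the assumption that dropping incomplete rows is acceptable for this exercise, is to remove rows with missing measurements before encoding. The small `sample` frame just re-creates the five rows shown by `penguins.head()`; it is not the full dataset, and this is not the approach used in the notebook.

```python
import numpy as np
import pandas as pd

# Five rows copied from the penguins.head() output above.
sample = pd.DataFrame({
    'species': ['Adelie'] * 5,
    'island': ['Torgersen'] * 5,
    'bill_length_mm': [39.1, 39.5, 40.3, np.nan, 36.7],
    'bill_depth_mm': [18.7, 17.4, 18.0, np.nan, 19.3],
    'flipper_length_mm': [181.0, 186.0, 195.0, np.nan, 193.0],
    'body_mass_g': [3750.0, 3800.0, 3250.0, np.nan, 3450.0],
    'sex': ['Male', 'Female', 'Female', np.nan, 'Female'],
})

numeric_cols = ['bill_length_mm', 'bill_depth_mm', 'flipper_length_mm', 'body_mass_g']

# Drop rows whose measurements are missing instead of filling them with 0,
# so the regression never sees a penguin that weighs 0 g.
clean = sample.dropna(subset=numeric_cols).reset_index(drop=True)

# One-hot encode the categorical columns, as in the notebook.
encoded = pd.get_dummies(clean, columns=['species', 'island', 'sex'])
print(encoded)
```

Applied to the full dataset, this would change the train/test shapes and therefore every metric reported below, so the numbers in this notebook would not carry over.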


In [60]:
X = penguins.drop(['body_mass_g'], axis = 1)
y = penguins['body_mass_g']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

scaler = StandardScaler()

X_train_std = scaler.fit_transform(X_train)
X_test_std = scaler.transform(X_test)

(240, 11) (104, 11) (240,) (104,)
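
The scaler above is fitted on the training split only and then reused on the test split, which avoids leaking test statistics into training. A `Pipeline` can bundle these two steps with the model. The sketch below uses synthetic data from `make_regression` (an assumption made for illustration, not the penguins data) and checks that the pipeline gives the same test score as the manual steps.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression problem, used only to keep the example self-contained.
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=42)

# Pipeline: the scaler is fitted on the training data inside fit(),
# and only applied (not refitted) when scoring on the test data.
pipe = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
pipe.fit(X_tr, y_tr)
r2_pipe = pipe.score(X_te, y_te)

# Manual version, mirroring the notebook's fit_transform / transform steps.
demo_scaler = StandardScaler()
manual_model = Ridge(alpha=1.0).fit(demo_scaler.fit_transform(X_tr), y_tr)
r2_manual = manual_model.score(demo_scaler.transform(X_te), y_te)

print(r2_pipe, r2_manual)
```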


### Linear Regression

In [61]:
linreg = LinearRegression()

linreg.fit(X_train_std, y_train)

y_pred_lr = linreg.predict(X_test_std)

coefs = linreg.coef_

# Pair each feature name with its fitted coefficient, sorted from largest to smallest
df_features = pd.DataFrame({'Features': X_train.columns, 'Scores': coefs})
df_features = df_features.sort_values(by='Scores', ascending=False).reset_index(drop=True)
df_features

| | Features | Scores |
|---|---|---|
| 0 | species_Gentoo | 347.171849 |
| 1 | flipper_length_mm | 256.059526 |
| 2 | sex_Male | 253.087376 |
| 3 | bill_length_mm | 141.049713 |
| 4 | bill_depth_mm | 128.279608 |
| 5 | sex_Female | 56.845807 |
| 6 | island_Dream | 2.166776 |
| 7 | island_Torgersen | -0.725164 |
| 8 | island_Biscoe | -1.595297 |
| 9 | species_Adelie | -150.731645 |


### Ridge

In [62]:
ridge = Ridge()

ridge.fit(X_train_std, y_train)

y_pred_ridge = ridge.predict(X_test_std)

coefs2 = ridge.coef_

# Pair each feature name with its fitted coefficient, sorted from largest to smallest
df_features2 = pd.DataFrame({'Features': X_train.columns, 'Scores': coefs2})
df_features2 = df_features2.sort_values(by='Scores', ascending=False).reset_index(drop=True)
df_features2

| | Features | Scores |
|---|---|---|
| 0 | species_Gentoo | 336.10188 |
| 1 | flipper_length_mm | 262.280404 |
| 2 | sex_Male | 243.140175 |
| 3 | bill_length_mm | 142.596798 |
| 4 | bill_depth_mm | 117.830053 |
| 5 | sex_Female | 45.557276 |
| 6 | island_Biscoe | 1.139919 |
| 7 | island_Dream | 0.135914 |
| 8 | island_Torgersen | -1.869347 |
| 9 | species_Adelie | -144.355703 |
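
To connect these coefficients to the Ridge cost function, the sketch below solves the penalized least-squares problem in closed form, $\beta = (X^TX + \alpha I)^{-1}X^Ty$, and compares it with scikit-learn's `Ridge`. It assumes no intercept (`fit_intercept=False`) and uses random synthetic data, so it illustrates the formula rather than reproducing the penguin results.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Synthetic data with known coefficients (illustrative values only).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 3))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=50)

alpha = 1.0

# Closed-form Ridge solution without intercept: (X^T X + alpha * I)^-1 X^T y
beta_closed = np.linalg.solve(X_demo.T @ X_demo + alpha * np.eye(3), X_demo.T @ y_demo)

# The same problem solved by scikit-learn.
beta_sklearn = Ridge(alpha=alpha, fit_intercept=False).fit(X_demo, y_demo).coef_

# Ordinary least squares (alpha = 0) for comparison.
beta_ols = np.linalg.lstsq(X_demo, y_demo, rcond=None)[0]

print('closed form:', beta_closed)
print('sklearn:    ', beta_sklearn)
print('OLS:        ', beta_ols)
```

The Ridge coefficient vector has a smaller norm than the least-squares one, which is the shrinkage effect of the L2 penalty.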


### Lasso

In [63]:
lasso = Lasso(alpha = 5)

lasso.fit(X_train_std, y_train)

y_pred_lasso = lasso.predict(X_test_std)

coefs3 = lasso.coef_

# Pair each feature name with its fitted coefficient, sorted from largest to smallest
df_features3 = pd.DataFrame({'Features': X_train.columns, 'Scores': coefs3})
df_features3 = df_features3.sort_values(by='Scores', ascending=False).reset_index(drop=True)
df_features3

| | Features | Scores |
|---|---|---|
| 0 | species_Gentoo | 429.06412 |
| 1 | flipper_length_mm | 294.249529 |
| 2 | sex_Male | 206.39499 |
| 3 | bill_length_mm | 132.412901 |
| 4 | bill_depth_mm | 77.107327 |
| 5 | island_Biscoe | 3.988554 |
| 6 | species_Adelie | -0.0 |
| 7 | island_Dream | -0.0 |
| 8 | island_Torgersen | -0.0 |
| 9 | sex_Female | 0.0 |
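
Unlike Ridge, Lasso sets several coefficients to exactly zero here, which works as a form of feature selection. The sketch below illustrates that behaviour on synthetic data (an assumption, not the penguins set). For scikit-learn's Lasso objective there is a threshold, $\alpha_{max} = \max_j |X_j^T (y - \bar{y})| / n$, above which every coefficient is zero, so the `alpha` values are expressed as fractions of it.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

# Synthetic problem: 10 features, only 3 of which carry signal.
X_demo, y_demo = make_regression(n_samples=200, n_features=10, n_informative=3,
                                 noise=5.0, random_state=0)
X_demo = StandardScaler().fit_transform(X_demo)

# Smallest alpha for which all Lasso coefficients are zero.
n_samples = X_demo.shape[0]
alpha_max = np.max(np.abs(X_demo.T @ (y_demo - y_demo.mean()))) / n_samples

# Count the coefficients that are exactly zero for several penalty strengths.
zero_counts = {}
for fraction in (0.01, 0.1, 0.5, 1.01):
    coef = Lasso(alpha=fraction * alpha_max).fit(X_demo, y_demo).coef_
    zero_counts[fraction] = int(np.sum(coef == 0))

print(zero_counts)
```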


### ElasticNet

In [64]:
EN = ElasticNet()

EN.fit(X_train_std, y_train)

y_pred_EN = EN.predict(X_test_std)

coefs4 = EN.coef_

# Pair each feature name with its fitted coefficient, sorted from largest to smallest
df_features4 = pd.DataFrame({'Features': X_train.columns, 'Scores': coefs4})
df_features4 = df_features4.sort_values(by='Scores', ascending=False).reset_index(drop=True)
df_features4

| | Features | Scores |
|---|---|---|
| 0 | flipper_length_mm | 241.189422 |
| 1 | species_Gentoo | 170.267349 |
| 2 | bill_length_mm | 147.243155 |
| 3 | sex_Male | 117.179362 |
| 4 | island_Biscoe | 67.47205 |
| 5 | bill_depth_mm | -0.0 |
| 6 | island_Torgersen | -19.774188 |
| 7 | island_Dream | -55.402517 |
| 8 | sex_Female | -71.392983 |
| 9 | species_Adelie | -79.042746 |
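
`ElasticNet()` is called here with its default hyperparameters, `alpha=1.0` and `l1_ratio=0.5`, so the penalty is an even mix of L1 and L2. The sketch below, on synthetic data chosen only for illustration, prints those defaults and checks one limiting case: with `l1_ratio=1.0` the penalty becomes pure L1 and the fit coincides with `Lasso` at the same `alpha`.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet, Lasso

# Default hyperparameters used when calling ElasticNet() with no arguments.
en_default = ElasticNet()
print('alpha:', en_default.alpha, '| l1_ratio:', en_default.l1_ratio)

# Synthetic data, used only to compare the two estimators.
X_demo, y_demo = make_regression(n_samples=100, n_features=6, n_informative=3,
                                 noise=5.0, random_state=1)

# With l1_ratio=1.0 the L2 part of the penalty vanishes, leaving the Lasso problem.
en_l1 = ElasticNet(alpha=0.5, l1_ratio=1.0).fit(X_demo, y_demo)
lasso_demo = Lasso(alpha=0.5).fit(X_demo, y_demo)
same_as_lasso = np.allclose(en_l1.coef_, lasso_demo.coef_)

print('ElasticNet(l1_ratio=1.0) matches Lasso:', same_as_lasso)
```

Both `alpha` and `l1_ratio` can be tuned; the defaults are a starting point rather than a recommendation for this dataset.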


Comparing the metrics to evaluate the models:

In [65]:
# Compare R2
print('Comparing R2')
print('R2 - Linear Regression: ', np.round(r2_score(y_test, y_pred_lr), 4))
print('R2 - Ridge:             ', np.round(r2_score(y_test, y_pred_ridge), 4))
print('R2 - Lasso:             ', np.round(r2_score(y_test, y_pred_lasso), 4))
print('R2 - Elastic-Net:       ', np.round(r2_score(y_test, y_pred_EN), 4))

# Compare MSE
print('----------------------------------------------------')
print('Comparing MSE')
print('MSE - Linear Regression: ', np.round(mean_squared_error(y_test, y_pred_lr), 4))
print('MSE - Ridge:             ', np.round(mean_squared_error(y_test, y_pred_ridge), 4))
print('MSE - Lasso:             ', np.round(mean_squared_error(y_test, y_pred_lasso), 4))
print('MSE - Elastic-Net:       ', np.round(mean_squared_error(y_test, y_pred_EN), 4))

# Compare MAE
print('----------------------------------------------------')
print('Comparing MAE')
print('MAE - Linear Regression: ', np.round(mean_absolute_error(y_test, y_pred_lr), 4))
print('MAE - Ridge:             ', np.round(mean_absolute_error(y_test, y_pred_ridge), 4))
print('MAE - Lasso:             ', np.round(mean_absolute_error(y_test, y_pred_lasso), 4))
print('MAE - Elastic-Net:       ', np.round(mean_absolute_error(y_test, y_pred_EN), 4))

Comparing R2
R2 - Linear Regression:  0.8802
R2 - Ridge:              0.8812
R2 - Lasso:              0.8852
R2 - Elastic-Net:        0.8712
----------------------------------------------------
Comparing MSE
MSE - Linear Regression:  95519.1959
MSE - Ridge:              94653.9768
MSE - Lasso:              91516.4602
MSE - Elastic-Net:        102688.7134
----------------------------------------------------
Comparing MAE
MAE - Linear Regression:  232.4242
MAE - Ridge:              231.7732
MAE - Lasso:              229.8542
MAE - Elastic-Net:        251.6719
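
The models above use fixed penalty strengths (`Ridge()` and `ElasticNet()` defaults, `Lasso(alpha = 5)`), so part of the difference between them may come from the choice of `alpha` rather than from the type of penalty. A common next step is to pick `alpha` by cross-validation. The sketch below does this with `RidgeCV` and `LassoCV` on synthetic data; the data and the candidate grid `alphas` are assumptions made for illustration only.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV, RidgeCV
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression problem, used only to keep the example self-contained.
X_demo, y_demo = make_regression(n_samples=300, n_features=8, n_informative=4,
                                 noise=15.0, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=42)

# Candidate penalty strengths on a log scale (hypothetical grid).
alphas = np.logspace(-3, 3, 13)

# Each pipeline scales the features, then selects alpha by cross-validation
# using the training split only.
ridge_cv = make_pipeline(StandardScaler(), RidgeCV(alphas=alphas)).fit(X_tr, y_tr)
lasso_cv = make_pipeline(StandardScaler(), LassoCV(alphas=alphas, cv=5)).fit(X_tr, y_tr)

best_ridge_alpha = ridge_cv[-1].alpha_
best_lasso_alpha = lasso_cv[-1].alpha_

print('Ridge: alpha =', best_ridge_alpha, '| test R2 =', round(ridge_cv.score(X_te, y_te), 4))
print('Lasso: alpha =', best_lasso_alpha, '| test R2 =', round(lasso_cv.score(X_te, y_te), 4))
```

The test split is touched only once, for the final score, so the selected `alpha` is not tuned against the data used to report performance.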


## 