# The "Regularization" Technique

* Regularization reduces a model's complexity by penalizing the coefficients of its least relevant variables.
* With PCA, variables were removed or combined into new ones; here, variables that do not contribute enough to the model are penalized instead.
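As a minimal sketch of what "penalization" means here, the snippet below computes the two penalty terms usually associated with Lasso (L1) and Ridge (L2). The coefficient vector, the `alpha` value, and the helper names `l1_penalty` and `l2_penalty` are hypothetical, chosen only for illustration. In scikit-learn the penalty is added to a squared-error term whose scaling differs between `Lasso` and `Ridge`, so the same `alpha` is not directly comparable across the two models.

```python
def l1_penalty(coefs, alpha):
    """L1 term used by Lasso: alpha * sum(|w_j|)."""
    return alpha * sum(abs(w) for w in coefs)


def l2_penalty(coefs, alpha):
    """L2 term used by Ridge: alpha * sum(w_j ** 2)."""
    return alpha * sum(w ** 2 for w in coefs)


# Hypothetical coefficient vector, chosen only for illustration.
example_coefs = [1.5, -0.5, 0.0, 2.0]
example_alpha = 0.1

l1_value = l1_penalty(example_coefs, example_alpha)  # 0.1 * (1.5 + 0.5 + 0.0 + 2.0)
l2_value = l2_penalty(example_coefs, example_alpha)  # 0.1 * (2.25 + 0.25 + 0.0 + 4.0)

print("L1 penalty:", l1_value)
print("L2 penalty:", l2_value)
```

The L2 term grows quadratically, so it punishes large coefficients much more than small ones, while the L1 term charges the same rate per unit of coefficient regardless of size.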

In [26]:
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

from sklearn.linear_model import LinearRegression, Lasso, Ridge

from sklearn.metrics import mean_squared_error

In [27]:
dataset = pd.read_csv('felicidad.csv')
dataset.head()

,country,rank,score,high,low,gdp,family,lifexp,freedom,generosity,corruption,dystopia
0,Norway,1,7.537,7.594445,7.479556,1.616463,1.533524,0.796667,0.635423,0.362012,0.315964,2.277027
1,Denmark,2,7.522,7.581728,7.462272,1.482383,1.551122,0.792566,0.626007,0.35528,0.40077,2.313707
2,Iceland,3,7.504,7.62203,7.38597,1.480633,1.610574,0.833552,0.627163,0.47554,0.153527,2.322715
3,Switzerland,4,7.494,7.561772,7.426227,1.56498,1.516912,0.858131,0.620071,0.290549,0.367007,2.276716
4,Finland,5,7.469,7.527542,7.410458,1.443572,1.540247,0.809158,0.617951,0.245483,0.382612,2.430182


In [28]:
dataset.describe()

,rank,score,high,low,gdp,family,lifexp,freedom,generosity,corruption,dystopia
count,155.0,155.0,155.0,155.0,155.0,155.0,155.0,155.0,155.0,155.0,155.0
mean,78.0,5.354019,5.452326,5.255713,0.984718,1.188898,0.551341,0.408786,0.246883,0.12312,1.850238
std,44.888751,1.13123,1.118542,1.14503,0.420793,0.287263,0.237073,0.149997,0.13478,0.101661,0.500028
min,1.0,2.693,2.864884,2.521116,0.0,0.0,0.0,0.0,0.0,0.0,0.377914
25%,39.5,4.5055,4.608172,4.374955,0.663371,1.042635,0.369866,0.303677,0.154106,0.057271,1.591291
50%,78.0,5.279,5.370032,5.193152,1.064578,1.253918,0.606042,0.437454,0.231538,0.089848,1.83291
75%,116.5,6.1015,6.1946,6.006527,1.318027,1.414316,0.723008,0.516561,0.323762,0.153296,2.144654
max,155.0,7.537,7.62203,7.479556,1.870766,1.610574,0.949492,0.658249,0.838075,0.464308,3.117485


In [29]:
dataset.columns

Index(['country', 'rank', 'score', 'high', 'low', 'gdp', 'family', 'lifexp',
       'freedom', 'generosity', 'corruption', 'dystopia'],
      dtype='object')

In [30]:
df_features = dataset[['gdp', 'family', 'lifexp', 'freedom', 'corruption', 'generosity', 'dystopia']]

df_target = dataset[['score']]

In [31]:
X_train, X_test, y_train, y_test = train_test_split(df_features, df_target, test_size = 0.25, random_state = 42)

In [32]:
model_linear = LinearRegression().fit(X_train, y_train)
y_pred_linear = model_linear.predict(X_test)

model_lasso = Lasso(alpha = 0.02).fit(X_train, y_train) # alpha controls how strongly the coefficients are penalized
y_pred_lasso = model_lasso.predict(X_test)

model_ridge = Ridge(alpha = 1).fit(X_train, y_train) # we keep alpha=1, which is the default value
y_pred_ridge = model_ridge.predict(X_test)
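To get a feel for what `alpha` does, the sketch below fits `Lasso` and `Ridge` for several `alpha` values and counts how many coefficients end up exactly zero. It uses synthetic data (an assumption, not the happiness dataset) in which only two of five features influence the target; the names `X_demo`, `y_demo`, and the chosen `alpha` grid are hypothetical.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Synthetic data (an assumption for illustration, not the happiness dataset):
# only the first two of five features influence the target.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
y_demo = 3.0 * X_demo[:, 0] + 2.0 * X_demo[:, 1] + rng.normal(scale=0.1, size=200)

alphas = [0.001, 0.1, 1.0, 10.0]
lasso_zeros = {}
ridge_zeros = {}

for alpha in alphas:
    lasso_coef = Lasso(alpha=alpha).fit(X_demo, y_demo).coef_
    ridge_coef = Ridge(alpha=alpha).fit(X_demo, y_demo).coef_
    # Count the coefficients that are exactly zero after fitting.
    lasso_zeros[alpha] = int(np.sum(lasso_coef == 0))
    ridge_zeros[alpha] = int(np.sum(ridge_coef == 0))
    print(f"alpha={alpha}: Lasso zeros={lasso_zeros[alpha]}, Ridge zeros={ridge_zeros[alpha]}")
```

With a large enough `alpha`, Lasso drives every coefficient to zero, which is why the value is normally tuned (for example by cross-validation) rather than fixed by hand.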

* We evaluate the loss with the mean squared error (MSE), although several other metrics exist for measuring how closely the model fits the data.
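As a small illustration of some of those other metrics, the sketch below computes MSE, RMSE, MAE, and R² with functions from `sklearn.metrics`. The values in `y_true_demo` and `y_pred_demo` are hypothetical and were picked so the arithmetic is easy to follow by hand.

```python
import math

from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Hypothetical true values and predictions, chosen only for illustration.
y_true_demo = [3.0, 5.0, 7.0, 9.0]
y_pred_demo = [2.5, 5.0, 7.5, 9.0]

mse = mean_squared_error(y_true_demo, y_pred_demo)   # mean of squared errors
rmse = math.sqrt(mse)                                # same units as the target
mae = mean_absolute_error(y_true_demo, y_pred_demo)  # mean of absolute errors
r2 = r2_score(y_true_demo, y_pred_demo)              # share of variance explained

print("MSE :", mse)
print("RMSE:", rmse)
print("MAE :", mae)
print("R2  :", r2)
```

MSE and RMSE weight large errors more heavily than MAE does, while R² is unitless and therefore easier to compare across targets with different scales.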


In [33]:
Linear_loss = mean_squared_error(y_test, y_pred_linear)
print("Linear loss", Linear_loss)

Lasso_loss = mean_squared_error(y_test, y_pred_lasso)
print("Lasso loss", Lasso_loss)

Ridge_loss = mean_squared_error(y_test, y_pred_ridge)
print("Ridge loss", Ridge_loss)

Linear loss 9.893337283082652e-08
Lasso loss 0.049605751139829075
Ridge loss 0.00565012449996281


In [34]:
print("Coef lasso", model_lasso.coef_)
print("==="*32)
print("Coef ridge",model_ridge.coef_)

Coef lasso [1.28921417 0.91969417 0.47686397 0.73297273 0.         0.14245522
 0.89965327]
Coef ridge [[1.07234856 0.97048582 0.85605399 0.87400159 0.68583271 0.73285696
  0.96206567]]


* The output shows the weight (coefficient) of each variable in each model. The coefficients follow the column order of `df_features`, so "corruption" is the fifth value.
* For example, the coefficient of "corruption" in the Lasso model is 0, because the model determined that this variable does not contribute enough to the prediction.
* In Ridge, no coefficient is 0. This is characteristic of the model: it shrinks coefficients but does not eliminate them.
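The raw arrays are easier to read once each coefficient is paired with its feature name. The sketch below does that with the values copied from the printed output and the feature order from the `df_features` selection; the variable names (`lasso_by_name`, `dropped_by_lasso`, and so on) are hypothetical. In the live notebook one would presumably use `model_lasso.coef_` and `model_ridge.coef_.ravel()` instead of hard-coded lists, since the Ridge coefficients come back as a 2D array when the target is a DataFrame.

```python
# Feature order taken from the df_features selection in the notebook.
feature_names = ['gdp', 'family', 'lifexp', 'freedom',
                 'corruption', 'generosity', 'dystopia']

# Coefficient values copied from the notebook's printed output.
lasso_coefs = [1.28921417, 0.91969417, 0.47686397, 0.73297273,
               0.0, 0.14245522, 0.89965327]
ridge_coefs = [1.07234856, 0.97048582, 0.85605399, 0.87400159,
               0.68583271, 0.73285696, 0.96206567]

lasso_by_name = dict(zip(feature_names, lasso_coefs))
ridge_by_name = dict(zip(feature_names, ridge_coefs))

for name in feature_names:
    print(f"{name:<12} lasso={lasso_by_name[name]:.4f}  ridge={ridge_by_name[name]:.4f}")

# Features whose coefficient is exactly zero in each model.
dropped_by_lasso = [name for name, coef in lasso_by_name.items() if coef == 0]
dropped_by_ridge = [name for name, coef in ridge_by_name.items() if coef == 0]

print("Dropped by Lasso:", dropped_by_lasso)
print("Dropped by Ridge:", dropped_by_ridge)
```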