# Regression models in scikit-learn
## 1. Objective

Become familiar with the core concepts behind regression models.

## Melb Houses data
* Suburb: Suburb
* Address: Address
* Rooms: Number of rooms
* Price: Price in Australian dollars
* Method:
  * S - property sold;
  * SP - property sold prior;
  * PI - property passed in;
  * PN - sold prior not disclosed;
  * SN - sold not disclosed;
  * NB - no bid;
  * VB - vendor bid;
  * W - withdrawn prior to auction;
  * SA - sold after auction;
  * SS - sold after auction price not disclosed.
  * N/A - price or highest bid not available.

* Type:
  * br - bedroom(s);
  * h - house,cottage,villa, semi,terrace;
  * u - unit, duplex;
  * t - townhouse;
  * dev site - development site;
  * res - other residential.

* SellerG: Real Estate Agent

* Date: Date sold

* Distance: Distance from CBD in Kilometres
* Regionname: General Region (West, North West, North, North east …etc)
* Propertycount: Number of properties that exist in the suburb.
* Bedroom2 : Scraped # of Bedrooms (from different source)
* Bathroom: Number of Bathrooms
* Car: Number of carspots
* Landsize: Land Size in Metres
* BuildingArea: Building Size in Metres
* YearBuilt: Year the house was built
* CouncilArea: Governing council for the area
* Lattitude: Self-explanatory
* Longtitude: Self-explanatory


* Melb Data: https://www.kaggle.com/code/dansbecker/handling-missing-values/data?select=melb_data.csv
* Sklearn Cheat Sheet from Datacamp: https://media.datacamp.com/legacy/image/upload/v1676302389/Marketing/Blog/Scikit-Learn_Cheat_Sheet.pdf

## 2. Working libraries

In [1]:
# Install pandas, seaborn and scikit-learn if they are not already available
# (pip install does not accept a -y flag)
#pip install pandas seaborn scikit-learn

In [2]:
import os
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.linear_model import (
    LinearRegression,
    Lasso,
    Ridge,
    ElasticNet
)

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.preprocessing import (
    LabelEncoder,
    OrdinalEncoder,
    LabelBinarizer,
    OneHotEncoder
)

from sklearn.metrics import root_mean_squared_error
from sklearn.model_selection import train_test_split

import warnings
warnings.filterwarnings('ignore')

## 3. Reading the data

First we read the data, telling Python which folder contains it and the names of the files relevant to the analysis.

In [3]:
# Path to the folder on your computer
# where the Melb Houses data is stored
# Example: "C:\Users\[your name]\Downloads"

DATA_PATH="../data"

Next we define a variable holding the file name together with its extension (for example, `.csv`):

In [4]:
FILE_DATA_PATH = "melb_data.csv"

We will use Python's `os.path.join` utility, which builds the path to a file on your computer, so that pandas can find the data files.


**Example**

The following example reads the file `melb_data.csv`:

In [5]:
# Example
print(f"File name: {FILE_DATA_PATH}")
print(os.path.join(DATA_PATH, FILE_DATA_PATH))

File name: melb_data.csv
../data/melb_data.csv


In [6]:
# Read the file with pandas
df = pd.read_csv(
    os.path.join(DATA_PATH, FILE_DATA_PATH)
    )

In [7]:
df.sample(10)

Unnamed: 0,Suburb,Address,Rooms,Type,Price,Method,SellerG,Date,Distance,Postcode,...,Bathroom,Car,Landsize,BuildingArea,YearBuilt,CouncilArea,Lattitude,Longtitude,Regionname,Propertycount
3454,Keilor East,17/24 Craig St,3,u,580000,S,Rendina,10/12/16,12.8,3033,...,2,2.0,112,138.0,2013.0,Moonee Valley,-37.7434,144.881,Western Metropolitan,5629
8658,North Melbourne,5 Murphy St,3,h,1485000,S,Jellis,20/05/17,2.3,3051,...,1,1.0,198,95.0,1968.0,Melbourne,-37.8029,144.9459,Northern Metropolitan,6821
12981,Glen Iris,6 Bardolph St,2,h,708000,S,Jellis,19/08/17,7.3,3146,...,1,1.0,151,,,,-37.85156,145.07958,Southern Metropolitan,10412
190,Altona North,153 Mills St,3,h,1405000,S,Sweeney,04/03/17,11.1,3025,...,1,3.0,655,,,Hobsons Bay,-37.8338,144.8562,Western Metropolitan,5132
7202,Ormond,1/25 Murray Rd,3,u,820000,PI,Buxton,22/05/16,11.8,3204,...,2,1.0,238,127.0,1998.0,Glen Eira,-37.9062,145.027,Southern Metropolitan,3578
7605,Brunswick,25/5 Evans St,3,t,981000,SP,Nelson,13/05/17,5.2,3056,...,2,1.0,122,140.0,2012.0,Moreland,-37.7698,144.9661,Northern Metropolitan,11918
6948,Fawkner,1/26 Tucker St,3,t,492000,S,Ray,28/08/16,12.4,3060,...,1,1.0,208,107.0,2011.0,Moreland,-37.7092,144.9695,Northern Metropolitan,5070
157,Altona,1/60 David St,3,u,605000,S,Jas,10/12/16,13.8,3018,...,2,2.0,230,,,Hobsons Bay,-37.8646,144.827,Western Metropolitan,5301
11624,Caulfield North,6/374 Dandenong Rd,2,u,610000,PI,Greg,22/07/17,7.8,3161,...,1,1.0,816,88.0,2007.0,Glen Eira,-37.86068,145.01121,Southern Metropolitan,6923
630,Balwyn North,5 Lemon Rd,5,h,2070000,SP,Jellis,12/11/16,9.2,3104,...,2,1.0,1138,,,Boroondara,-37.7928,145.1006,Southern Metropolitan,7809


Let's review the summary information of the data:

In [8]:
df.info()

<class 'pandas.DataFrame'>
RangeIndex: 13580 entries, 0 to 13579
Data columns (total 21 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Suburb         13580 non-null  str    
 1   Address        13580 non-null  str    
 2   Rooms          13580 non-null  int64  
 3   Type           13580 non-null  str    
 4   Price          13580 non-null  int64  
 5   Method         13580 non-null  str    
 6   SellerG        13580 non-null  str    
 7   Date           13580 non-null  str    
 8   Distance       13580 non-null  float64
 9   Postcode       13580 non-null  int64  
 10  Bedroom2       13580 non-null  int64  
 11  Bathroom       13580 non-null  int64  
 12  Car            13518 non-null  float64
 13  Landsize       13580 non-null  int64  
 14  BuildingArea   7130 non-null   float64
 15  YearBuilt      8205 non-null   float64
 16  CouncilArea    12211 non-null  str    
 17  Lattitude      13580 non-null  float64
 18  Longtitude     13

## 2. Basic Machine Learning Models

This section covers some of the main models used in supervised learning problems.

### 2.1 Linear Regression Models

Regression models are statistical models that assume a linear relationship between the features of the problem and the variable to predict; that is, the target variable can be written as

$$y_i = \beta_0 + \beta_1 x_{i1}  + \beta_2 x_{i2} + \ldots + \beta_n x_{in}+ \epsilon_i$$

where the coefficients $\beta_j$ are parameters to be determined and $\epsilon_i$ represents the noise in the relationship between $y_i$ and $x_{i1}, \ldots, x_{in}$. In statistical jargon, the variables are called regressors.

Because the parameters must be estimated, statistical assumptions are needed for the linear relationship between the variables and the target to be a good approximation. The quality of that approximation is judged through the residual, the name given to the error between $y_i$ and the value predicted from $x_{i1}, \ldots, x_{in}$, which is usually summarized by the mean squared error.
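
As a small illustration of the residual and the mean squared error, the sketch below evaluates a linear model with hand-picked coefficients on a hypothetical toy dataset. The arrays `X` and `y` and the values of `beta_0` and `beta` are invented for the example and are unrelated to the Melb Houses data.

```python
import numpy as np

# Hypothetical toy data: 5 observations, 2 regressors
X = np.array([[1.0, 2.0], [2.0, 0.0], [3.0, 1.0], [4.0, 3.0], [5.0, 2.0]])
y = np.array([1.5, 4.5, 6.0, 6.5, 8.5])

# Hand-picked coefficients (assumed for illustration, not estimated)
beta_0 = 1.0
beta = np.array([2.0, -1.0])

# Prediction of the linear model: beta_0 + beta_1 * x_1 + beta_2 * x_2
y_hat = beta_0 + X @ beta

# Residual: observed value minus predicted value
residuals = y - y_hat

# Mean squared error and its square root (RMSE)
mse = np.mean(residuals ** 2)
rmse = np.sqrt(mse)

print("residuals:", residuals)
print("MSE:", mse)
print("RMSE:", rmse)
```

The RMSE is expressed in the same units as the target, which is why it is convenient for reporting errors on prices.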

Statistical theory identifies several assumptions that should hold for regression to work well (that is, with the bias and variance of the prediction under control):

* **Linearity:** The relationship between the dependent variable $Y$ and the independent variables $X$ must be linear.
* **Normality of the residuals:** The residuals should follow a normal distribution; this assumption can be relaxed to an approximately normal distribution.
* **Homogeneity of the residual variance:** The residuals should have constant variance (homoscedasticity), that is, their spread should not blow up in some region of the feature space.
* **Independence of the residuals:** The residuals should be independent of one another, that is, they should not be correlated with each other.

The coefficients $\beta_j$ are estimated with numerical methods that follow statistical principles. When linearity, constant error variance and uncorrelated errors hold, the least-squares estimate has minimum variance among linear unbiased estimators (Gauss-Markov theorem). Scientific computing packages normally carry out these calculations, but verifying the assumptions is the responsibility of the analyst.
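
To make the estimation step concrete, the following sketch compares a direct least-squares solve with NumPy against scikit-learn's `LinearRegression` on synthetic data. The data-generating values (`true_beta`, the intercept of 4.0, the noise scale) are assumptions chosen only for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data with known coefficients (assumed values for illustration)
rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 3))
true_beta = np.array([2.0, -1.0, 0.5])
y = 4.0 + X @ true_beta + rng.normal(scale=0.1, size=n)

# Least squares by hand: add a column of ones for the intercept
X_design = np.column_stack([np.ones(n), X])
beta_hat, *_ = np.linalg.lstsq(X_design, y, rcond=None)

# Same problem solved with scikit-learn
ols_model = LinearRegression().fit(X, y)

print("lstsq intercept and coefficients:", beta_hat)
print("sklearn intercept:", ols_model.intercept_)
print("sklearn coefficients:", ols_model.coef_)
```

Both approaches minimize the same sum of squared residuals, so the estimates should agree up to numerical precision.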

It is worth noting that regression models are powerful supervised learning tools, but a few points should be kept in mind:

* They are strongly affected by outliers and, particularly when regularization is used, by features on different scales, so it is common to pre-process the data to remove noisy values and to bring all features to comparable orders of magnitude.
* They require the variables to be linearly independent in the linear algebra sense; a good correlation analysis can usually help discard variables that are linearly correlated with each other and so improve performance.
* They are very flexible: the features can typically be transformed to capture non-linear relationships.
* In general, the number of variables that can be included is limited by the number of points in the training set.
* If all variables are on the same scale, the coefficients can be interpreted as the effect a feature has on the target variable while all other features are held fixed (*ceteris paribus*).
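
The flexibility point can be sketched with a feature transformation. The example below is a minimal illustration on synthetic data, assuming a quadratic relationship; `PolynomialFeatures` is one possible transformation among many, not necessarily the one that suits the Melb Houses data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data with a quadratic relationship (assumed for illustration)
rng = np.random.default_rng(1)
x = np.linspace(-3, 3, 100).reshape(-1, 1)
y = 1.0 + 0.5 * x[:, 0] ** 2 + rng.normal(scale=0.1, size=100)

# A straight line cannot follow the curvature
linear_fit = LinearRegression().fit(x, y)

# Adding x^2 as a feature keeps the model linear in its coefficients
quadratic_fit = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    LinearRegression(),
).fit(x, y)

r2_linear = linear_fit.score(x, y)
r2_quadratic = quadratic_fit.score(x, y)

print("R^2 with the raw feature:", round(r2_linear, 3))
print("R^2 with polynomial features:", round(r2_quadratic, 3))
```

The model remains a linear regression; only the inputs change, which is why the same estimation machinery applies.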

#### 2.1.1 Linear Regression Models and Regularization

To help avoid overfitting in these models, there are techniques that penalize model complexity, that is, the size the coefficients can take, and that in general lead to better predictive results.

Regression models that include penalties on the coefficients have special names:

* **Lasso:** This model is similar to the one described above, but the coefficients are required to lie in a region defined by $\sum |\beta_j| \leq K_1$, where $K_1$ is some constant. This constraint generally makes some of the coefficients $\beta_j$ close to zero or exactly zero.
* **Ridge:** Similar to the previous model, but the coefficients are required to lie in a region defined by $\sum |\beta_j|^2 \leq K_2$, where $K_2$ is some constant.
* **Elastic Net:** In this case, the coefficients are required to satisfy the constraint $\sum |\beta_j| + \sum |\beta_j|^2 \leq K_3$, where $K_3$ is some constant.


An alternative way to describe the above is the Lagrangian formulation of constrained minimization problems, where one minimizes the error plus a penalty on the size of the parameters:

* **Lasso:** $$||y - X\beta||^2_2 + \alpha ||\beta||_1$$ where the coefficient $\alpha$ is a parameter that controls how strongly the size of $\beta$ is restricted.
* **Ridge:** $$||y - X\beta||^2_2 + \alpha ||\beta||^2_2$$ where the coefficient $\alpha$ is a parameter that controls how strongly the size of $\beta$ is restricted.
* **Elastic Net:** $$||y - X\beta||^2_2 + \alpha_1 ||\beta||_1 + \alpha_2 ||\beta||^2_2$$ where the coefficients $\alpha_1$ and $\alpha_2$ are parameters that control how strongly the size of $\beta$ is restricted. (The scikit-learn parametrization is equivalent, see https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.ElasticNet.html#sklearn.linear_model.ElasticNet)
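
The penalized objectives can be evaluated by hand. The sketch below assumes the usual convention of an L1 penalty $\sum |\beta_j|$ for Lasso and a squared L2 penalty $\sum \beta_j^2$ for Ridge; the helper `penalized_losses` and the toy arrays are hypothetical and only illustrate the arithmetic (scikit-learn additionally rescales the error term).

```python
import numpy as np

def penalized_losses(X, y, beta, alpha=1.0, alpha_1=1.0, alpha_2=1.0):
    """Evaluate the three penalized objectives for a given beta."""
    rss = np.sum((y - X @ beta) ** 2)   # squared error ||y - X beta||_2^2
    l1 = np.sum(np.abs(beta))           # L1 norm of beta
    l2_sq = np.sum(beta ** 2)           # squared L2 norm of beta
    return {
        "lasso": rss + alpha * l1,
        "ridge": rss + alpha * l2_sq,
        "elastic_net": rss + alpha_1 * l1 + alpha_2 * l2_sq,
    }

# Hypothetical toy problem
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = np.array([1.0, 2.0, 3.0])
beta = np.array([1.0, -2.0])

losses = penalized_losses(X, y, beta, alpha=0.5, alpha_1=0.5, alpha_2=0.5)
print(losses)
```

Here the squared error is 32, the L1 norm is 3 and the squared L2 norm is 5, so the three objectives differ only in the penalty term that is added.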

In scikit-learn, these models are available as the classes `LinearRegression`, `Lasso`, `Ridge` and `ElasticNet`. In Python, the values of $\alpha_i$ and their reparametrizations control how strongly the coefficients are restricted in order to limit overfitting. Because they are chosen before fitting rather than estimated from the data, they are called **hyperparameters**.
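
The practical difference between the two penalties can be seen on synthetic data. In the sketch below only the first two of five features influence the target; the value `alpha=0.5` and the data-generating coefficients are assumptions picked so the effect is easy to see.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Synthetic data: only the first two features matter (assumed setup)
rng = np.random.default_rng(42)
n = 200
X = rng.normal(size=(n, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=n)

lasso_model = Lasso(alpha=0.5).fit(X, y)
ridge_model = Ridge(alpha=0.5).fit(X, y)

# Count coefficients that are exactly zero under each penalty
n_zero_lasso = int(np.sum(lasso_model.coef_ == 0))
n_zero_ridge = int(np.sum(ridge_model.coef_ == 0))

print("Lasso coefficients:", lasso_model.coef_)
print("Ridge coefficients:", ridge_model.coef_)
print("Exact zeros (Lasso, Ridge):", n_zero_lasso, n_zero_ridge)
```

Lasso tends to set the coefficients of uninformative features exactly to zero, while Ridge shrinks them without eliminating them.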

## 3. Regression models to predict the price of Melb Houses

In [9]:
# Define the lists of columns used for modelling
num_features = [
    'Rooms', 
    'BuildingArea',
    'Landsize',
    'Distance',
    'Bathroom',
    'YearBuilt'
 ]

cat_cols = ['Regionname', 'Type']

# List combining all groups of feature columns
non_target_cols = num_features + cat_cols

target = ['Price']

In [10]:
df["BuildingArea"] = df["BuildingArea"].fillna(0)

In [11]:
df["YearBuilt"] = df["YearBuilt"].fillna(df["YearBuilt"].median())

In [12]:
df[non_target_cols].isna().sum()

Rooms           0
BuildingArea    0
Landsize        0
Distance        0
Bathroom        0
YearBuilt       0
Regionname      0
Type            0
dtype: int64

In [13]:
X_train, X_test, y_train, y_test = train_test_split(
    df[non_target_cols],
    df[target],
    test_size=0.2,
)
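
Since the split above is random, the errors reported later will vary from run to run. One common option, shown here as a sketch on a hypothetical toy array rather than on the notebook's DataFrame, is to pass `random_state` to `train_test_split` so the same split is reproduced every time.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 rows, 2 features
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# The same seed yields the same partition on every call
split_a = train_test_split(X, y, test_size=0.2, random_state=0)
split_b = train_test_split(X, y, test_size=0.2, random_state=0)

same_split = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print("Identical splits:", same_split)
print("Test set shape:", split_a[1].shape)
```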

In [14]:
# Pipeline that scales with the standard z-score
numerical_pipe = Pipeline([
    ('standar_scaler', StandardScaler())
])

categorical_pipe = Pipeline([
    ('one_hot', OneHotEncoder(handle_unknown='ignore'))
])

In [15]:
# Combine both pipelines, each applied to the columns given in its list
pre_processor = ColumnTransformer([
    ('numerical', numerical_pipe, num_features),
    ('categorical', categorical_pipe, cat_cols),
], remainder='passthrough')

## 4. Models

### 4.1 Linear Regression

In [None]:
lr = LinearRegression()

In [17]:
X_train.info()

<class 'pandas.DataFrame'>
Index: 10864 entries, 11451 to 12272
Data columns (total 8 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Rooms         10864 non-null  int64  
 1   BuildingArea  10864 non-null  float64
 2   Landsize      10864 non-null  int64  
 3   Distance      10864 non-null  float64
 4   Bathroom      10864 non-null  int64  
 5   YearBuilt     10864 non-null  float64
 6   Regionname    10864 non-null  str    
 7   Type          10864 non-null  str    
dtypes: float64(3), int64(3), str(2)
memory usage: 763.9 KB


In [18]:
lr.fit(X_train[["Rooms","Landsize","Bathroom"]], y_train)

LinearRegression()


In [19]:
lr.get_params()

{'copy_X': True,
 'fit_intercept': True,
 'n_jobs': None,
 'positive': False,
 'tol': 1e-06}

In [20]:
lr.coef_[0]

array([2.30526088e+05, 3.08483871e+00, 2.52347255e+05])

In [21]:
lr.intercept_

array([13649.89696125])

In [22]:
print("LinearRegression Coefficients: ")
pd.concat(
[
    pd.DataFrame(
    {
        "coef_names": "coef_" + lr.feature_names_in_,
        "coef_values": lr.coef_[0]
    }
    ),
    pd.DataFrame(
    {
        "coef_names": ["intercept"],
        "coef_values": lr.intercept_
    }
    )
]
)

LinearRegression Coefficients: 


Unnamed: 0,coef_names,coef_values
0,coef_Rooms,230526.087626
1,coef_Landsize,3.084839
2,coef_Bathroom,252347.254813
0,intercept,13649.896961


### 4.2 Ridge Regression

In [23]:
ridge = Ridge(alpha=0.23)

In [24]:
ridge.fit(X_train[["Rooms","Landsize","Bathroom"]], y_train)

Ridge(alpha=0.23)


In [25]:
ridge.coef_

array([2.30525212e+05, 3.08490527e+00, 2.52336733e+05])

In [26]:
ridge.intercept_[0]

np.float64(13668.560503840446)

In [27]:
print("Ridge Regression Coefficients: ")
pd.concat(
[
    pd.DataFrame(
    {
        "coef_names": "coef_" + ridge.feature_names_in_,
        # np.ravel returns every coefficient whether coef_ is 1-D or 2-D
        "coef_values": np.ravel(ridge.coef_)
    }
    ),
    pd.DataFrame(
    {
        "coef_names": ["intercept"],
        "coef_values": ridge.intercept_
    }
    )
]
)

Ridge Regression Coefficients: 


Unnamed: 0,coef_names,coef_values
0,coef_Rooms,230525.212
1,coef_Landsize,3.085
2,coef_Bathroom,252336.733
0,intercept,13668.561


### 4.3 Lasso Regression

In [43]:
lasso = Lasso(alpha=0.00005)

In [44]:
lasso.fit(X_train[["Rooms","Landsize","Bathroom"]], y_train)

Lasso(alpha=5e-05)


In [45]:
lasso.coef_

array([2.30526088e+05, 3.08483871e+00, 2.52347255e+05])

In [46]:
lasso.intercept_[0]

np.float64(13649.89714757097)

In [48]:
print("Lasso Regression Coefficients: ")
pd.concat(
[
    pd.DataFrame(
    {
        "coef_names": "coef_" + lasso.feature_names_in_,
        # np.ravel returns every coefficient whether coef_ is 1-D or 2-D
        "coef_values": np.ravel(lasso.coef_)
    }
    ),
    pd.DataFrame(
    {
        "coef_names": ["intercept"],
        "coef_values": lasso.intercept_
    }
    )
]
)

Lasso Regression Coefficients: 


Unnamed: 0,coef_names,coef_values
0,coef_Rooms,230526.088
1,coef_Landsize,3.085
2,coef_Bathroom,252347.255
0,intercept,13649.897


In [49]:
models_regresion = {
    'regression': LinearRegression(),
    'lasso_01': Lasso(alpha=0.1),
    'lasso_001': Lasso(alpha=0.01),
    'lasso_0001': Lasso(alpha=0.001),
    'lasso_000001': Lasso(alpha=0.0001),
    'ridge_01': Ridge(alpha=0.1),
    'ridge_001': Ridge(alpha=0.01),
    'ridge_0001': Ridge(alpha=0.001),
    'ridge_000001': Ridge(alpha=0.0001),
    'elastic_05_03': ElasticNet(alpha=0.5, l1_ratio=0.3),
    'elastic_0001_01': ElasticNet(alpha=0.0001, l1_ratio=0.0001)
    }

In [50]:
models = []
models_train_errors = []
models_test_errors = []

for model_name, estimator in models_regresion.items():

    print("Model:", model_name)
    model = Pipeline([
        ('transform', pre_processor),
        ('model', estimator)
    ])

    # Fit the model on the training data
    model.fit(X_train[non_target_cols], y_train)

    y_train_pred = model.predict(X_train[non_target_cols])
    y_test_pred = model.predict(X_test[non_target_cols])

    # RMSE on the training and test sets
    error_train = root_mean_squared_error(y_train, y_train_pred)
    error_test = root_mean_squared_error(y_test, y_test_pred)

    # Report the errors
    print("RMSE on train:", round(error_train, 4))
    print("RMSE on test:", round(error_test, 4))

    print("----------------------------------------------")

    models.append(model_name)
    models_train_errors.append(error_train)
    models_test_errors.append(error_test)


Model: regression
RMSE on train: 417580.1179
RMSE on test: 378921.2817
----------------------------------------------
Model: lasso_01
RMSE on train: 417580.1179
RMSE on test: 378921.2876
----------------------------------------------
Model: lasso_001
RMSE on train: 417580.1179
RMSE on test: 378921.2823
----------------------------------------------
Model: lasso_0001
RMSE on train: 417580.1179
RMSE on test: 378921.2818
----------------------------------------------
Model: lasso_000001
RMSE on train: 417580.1179
RMSE on test: 378921.2817
----------------------------------------------
Model: ridge_01
RMSE on train: 417580.1212
RMSE on test: 378921.5409
----------------------------------------------
Model: ridge_001
RMSE on train: 417580.1179
RMSE on test: 378921.3074
----------------------------------------------
Model: ridge_0001
RMSE on train: 417580.1179
RMSE on test: 378921.2843
---

In [51]:
pd.DataFrame({
    "model": models,
    "rmse_train": models_train_errors,
    "rmse_test": models_test_errors,
}).sort_values(["rmse_test"])

Unnamed: 0,model,rmse_train,rmse_test
0,regression,417580.117914,378921.281741
4,lasso_000001,417580.117914,378921.281747
3,lasso_0001,417580.117914,378921.2818
8,ridge_000001,417580.117914,378921.281998
2,lasso_001,417580.117914,378921.282328
7,ridge_0001,417580.117914,378921.284308
1,lasso_01,417580.117928,378921.287627
6,ridge_001,417580.117947,378921.307428
5,ridge_01,417580.121172,378921.540856
10,elastic_0001_01,417580.506785,378924.116555


In [52]:
models_regresion["regression"].coef_

array([[ 179571.42792329,   14512.76555257,   14633.75922752,
        -235972.83944522,  156957.68529727,  -66363.71259451,
         -47031.27524426,  263267.55235155, -263423.36436662,
          85553.32319038,  161823.15892971,  213518.29246167,
        -329283.75697173,  -84423.9303507 ,  254675.06382366,
         -24098.221241  , -230576.84258267]])

In [53]:
models_regresion["lasso_001"].coef_

array([ 179571.40688229,   14512.78719712,   14633.75740076,
       -235972.68009611,  156957.70865895,  -66363.72182054,
         34009.3917758 ,  344305.11581886, -182382.43814581,
        166589.82412378,  242863.32406909,  294559.09994184,
       -248242.89855746,   -3379.79200663,  433447.68655046,
        154674.28043807,  -51804.12812081])