<a href="https://colab.research.google.com/github/fcochaux/MINE-4101_Quiz_3/blob/main/MINE_4101_Quiz_3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Quiz 3: Applied Data Science

**Student:** Francisco José Chaux Guzmán <br/>
**Student ID:** 202210155 <br/>
**Program:** Master's in Information Engineering

Link to the repository for this workshop:

In [1]:
!git clone "https://github.com/fcochaux/MINE-4101_Quiz_3.git"

Cloning into 'MINE-4101_Quiz_3'...
remote: Enumerating objects: 14, done.
remote: Counting objects: 100% (14/14), done.
remote: Compressing objects: 100% (9/9), done.
remote: Total 14 (delta 0), reused 0 (delta 0), pack-reused 0
Unpacking objects: 100% (14/14), done.


Loading libraries:

In [2]:
# data processing

import numpy as np
import pandas as pd

# visualization

import matplotlib.pyplot as plt
import seaborn as sns

# modeling

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_squared_error, mean_absolute_error

## 1) Dataset preparation

Reading the dataset:

In [3]:
# read the dataset
df = pd.read_csv('/content/MINE-4101_Quiz_3/datos/insurance.csv')
# convert column names to lowercase
df.columns = df.columns.str.lower()
# inspect 5 random rows
df.sample(5)

,age,sex,bmi,children,smoker,region,charges
397,21,male,31.02,0,no,southeast,16586.49771
193,56,female,26.6,1,no,northwest,12044.342
1110,54,female,32.3,1,no,northeast,11512.405
661,57,female,23.98,1,no,southeast,22192.43711
1257,54,female,27.645,1,no,northwest,11305.93455


Reviewing the structure of the dataset:

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1338 entries, 0 to 1337
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       1338 non-null   int64  
 1   sex       1338 non-null   object 
 2   bmi       1338 non-null   float64
 3   children  1338 non-null   int64  
 4   smoker    1338 non-null   object 
 5   region    1338 non-null   object 
 6   charges   1338 non-null   float64
dtypes: float64(2), int64(2), object(3)
memory usage: 73.3+ KB


There are no missing values. There are no identifier variables either, so duplicated rows can be checked directly:

In [5]:
df[df.duplicated()]

,age,sex,bmi,children,smoker,region,charges
581,19,male,30.59,0,no,northwest,1639.5631


Since one duplicate was found, the repeated row is removed:

In [6]:
df = df.drop_duplicates()

The dataset now has one fewer row:

In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1337 entries, 0 to 1337
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       1337 non-null   int64  
 1   sex       1337 non-null   object 
 2   bmi       1337 non-null   float64
 3   children  1337 non-null   int64  
 4   smoker    1337 non-null   object 
 5   region    1337 non-null   object 
 6   charges   1337 non-null   float64
dtypes: float64(2), int64(2), object(3)
memory usage: 83.6+ KB


The descriptive statistics of the numeric variables are as follows:

In [8]:
df.describe().applymap('{:,.2f}'.format)

,age,bmi,children,charges
count,1337.0,1337.0,1337.0,1337.0
mean,39.22,30.66,1.1,13279.12
std,14.04,6.1,1.21,12110.36
min,18.0,15.96,0.0,1121.87
25%,27.0,26.29,0.0,4746.34
50%,39.0,30.4,1.0,9386.16
75%,51.0,34.7,2.0,16657.72
max,64.0,53.13,5.0,63770.43


The summary shows that the number of children has a maximum of 5, so the variable takes only six distinct values (0 to 5). Because it is an integer that can be read as a category describing the person, I treat it as a categorical variable.

Next, dummy variables are generated for the categorical variables:

In [9]:
# for sex
df = pd.get_dummies(df, prefix='sex', columns=['sex'], prefix_sep = '_')
# for smoker
df = pd.get_dummies(df, prefix='smoker', columns=['smoker'], prefix_sep = '_')
# for region
df = pd.get_dummies(df, prefix='region', columns=['region'], prefix_sep = '_')
# for number of children
df = pd.get_dummies(df, prefix='children', columns=['children'], prefix_sep = '_')
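
Encoding every level of each categorical variable makes each group of dummy columns sum to one, so the columns within a group are perfectly collinear. This matters mostly for an unregularized linear regression. A minimal sketch of one common alternative, dropping the first level of each variable with `drop_first=True`, is shown here; the small `toy` frame is hypothetical and only mirrors the column names used in this notebook:

```python
import pandas as pd

# hypothetical toy data that mirrors the categorical columns of the insurance set
toy = pd.DataFrame({
    'sex': ['male', 'female', 'female', 'male'],
    'smoker': ['no', 'yes', 'no', 'no'],
    'children': [0, 2, 1, 0],
})

# full encoding: one column per level, so each group sums to 1 in every row
full = pd.get_dummies(toy, columns=['sex', 'smoker', 'children'])

# reduced encoding: the first level of each variable becomes the reference category
reduced = pd.get_dummies(toy, columns=['sex', 'smoker', 'children'], drop_first=True)

print(list(full.columns))
print(list(reduced.columns))
```

With the reduced encoding, each remaining coefficient is read relative to the dropped reference level.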

Resulting dataset:

In [10]:
df.sample(5)

,age,bmi,charges,sex_female,sex_male,smoker_no,smoker_yes,region_northeast,region_northwest,region_southeast,region_southwest,children_0,children_1,children_2,children_3,children_4,children_5
221,53,33.25,10564.8845,1,0,1,0,1,0,0,0,1,0,0,0,0,0
1170,18,27.36,17178.6824,0,1,0,1,1,0,0,0,0,1,0,0,0,0
555,28,23.8,3847.674,0,1,1,0,0,0,0,1,0,0,1,0,0,0
188,41,32.2,6775.961,1,0,1,0,0,0,0,1,0,1,0,0,0,0
921,62,33.2,13462.52,1,0,1,0,0,0,0,1,1,0,0,0,0,0


Splitting the dataset:

In [11]:
# separate the dependent variable from the features
y = df['charges']
x = df.loc[:, ~df.columns.isin(['charges'])]
# split into training and test sets
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.2, random_state = 1)

Log transformation of the target variable: not applied in this run; the models below are fit on `charges` in its original units.

Standardization of the features:

In [12]:
scaler = StandardScaler()
x_train_scaled = scaler.fit_transform(x_train)
x_test_scaled = scaler.transform(x_test)
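
The target `charges` is right-skewed (its mean of 13,279 is well above its median of 9,386), which is the usual motivation for a log transformation. A minimal sketch of how that could be done with scikit-learn's `TransformedTargetRegressor` follows. It runs on synthetic data, and `X_demo`, `y_demo`, and `log_model` are illustrative names that are not part of this notebook:

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# synthetic, right-skewed target (an assumption, standing in for `charges`)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = np.exp(1.0 + X_demo @ np.array([0.5, 0.3, 0.0]) + rng.normal(scale=0.1, size=200))

# fit on log1p(y); predictions are mapped back to the original scale with expm1
log_model = TransformedTargetRegressor(
    regressor=LinearRegression(), func=np.log1p, inverse_func=np.expm1
)
log_model.fit(X_demo, y_demo)

# RMSE is computed in the original units, so it stays comparable with other models
rmse_demo = np.sqrt(mean_squared_error(y_demo, log_model.predict(X_demo)))
print(rmse_demo)
```

Because the wrapper inverts the transformation at prediction time, the resulting RMSE can be compared directly with a model fit on the raw target.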

## 2) Training


### 2.1) Without regularization or polynomial features

Model fit:

In [13]:
regr = LinearRegression()
regr.fit(x_train_scaled, y_train)

LinearRegression()

Model evaluation:

In [14]:
preds_train = regr.predict(x_train_scaled)
preds_test = regr.predict(x_test_scaled)
print(f'Train RMSE: {np.sqrt(mean_squared_error(y_train, preds_train))}')
print(f'Test RMSE: {np.sqrt(mean_squared_error(y_test, preds_test))}')

Train RMSE: 6132.541570633522
Test RMSE: 5717.791255307469
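
The same evaluation is repeated for every model, and `mean_absolute_error` is imported but never used. A small helper could report both metrics at once. This is only a sketch: `report_errors` is a hypothetical name and the numbers in the call are toy values, not results from this dataset:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

def report_errors(y_true, y_pred):
    """Return RMSE and MAE, both in the units of the target."""
    rmse = float(np.sqrt(mean_squared_error(y_true, y_pred)))
    mae = float(mean_absolute_error(y_true, y_pred))
    return {'rmse': rmse, 'mae': mae}

# toy values (hypothetical): absolute errors of 10, 10 and 30
toy_errors = report_errors([100.0, 200.0, 300.0], [110.0, 190.0, 330.0])
print(toy_errors)
```

RMSE penalizes large errors more than MAE, so a wide gap between the two hints that a few cases are predicted very badly.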


To interpret the error, the distribution of the target variable is reviewed:

In [15]:
y_train.describe()

count     1069.000000
mean     13631.418932
std      12263.054249
min       1121.873900
25%       4889.036800
50%       9583.893300
75%      17904.527050
max      63770.428010
Name: charges, dtype: float64

In [16]:
y_test.describe()

count      268.000000
mean     11873.875335
std      11394.925051
min       1137.469700
25%       4347.206725
50%       8601.522600
75%      13051.470012
max      62592.873090
Name: charges, dtype: float64

The model's error appears high: the root mean squared error (about 6,133 for training and 5,718 for test) falls between the 25th percentile and the median of `charges` in both sets, so a typical error is comparable in size to a typical charge.

Because the training error is larger than the test error, there is no evidence of overfitting. Together with the high error on both sets, this points instead to underfitting. Note also that the test set has a lower mean and spread than the training set, which helps explain its lower error.
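
One way to put the RMSE in context is to divide it by the standard deviation of the target, since a model that always predicts the mean has an RMSE of roughly one standard deviation. The sketch below only reuses the numbers printed above; the variable names are illustrative:

```python
# values copied from the outputs above
rmse_train, rmse_test = 6132.541570633522, 5717.791255307469
std_train, std_test = 12263.054249, 11394.925051

# ratio of RMSE to the standard deviation of the target:
# about 1.0 is what predicting the mean would give, lower is better
ratio_train = rmse_train / std_train
ratio_test = rmse_test / std_test
print(round(ratio_train, 2), round(ratio_test, 2))
```

Both ratios come out at about 0.5, so the linear model roughly halves the error of a mean-only baseline on each set.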

### 2.2) With polynomial features and regularization

Polynomial transformations of degree 2 and degree 5:

In [17]:
# degree 2
poly_features2 = PolynomialFeatures(degree=2, include_bias = False)
x_train_poly2 = poly_features2.fit_transform(x_train_scaled)
# reuse the transformer fitted on the training set
x_test_poly2 = poly_features2.transform(x_test_scaled)
# degree 5
poly_features5 = PolynomialFeatures(degree=5, include_bias = False)
x_train_poly5 = poly_features5.fit_transform(x_train_scaled)
# reuse the transformer fitted on the training set
x_test_poly5 = poly_features5.transform(x_test_scaled)
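
The size of each expansion helps explain the results that follow. With `include_bias=False`, a full polynomial expansion of $n$ features up to degree $d$ has $\binom{n+d}{d} - 1$ columns. A short sketch of that count for the 16 columns of `x` (the helper name `n_poly_terms` is hypothetical):

```python
from math import comb

def n_poly_terms(n_features: int, degree: int) -> int:
    """Number of polynomial terms up to `degree`, excluding the constant term."""
    return comb(n_features + degree, degree) - 1

n_features = 16  # columns in x after one-hot encoding
terms_deg2 = n_poly_terms(n_features, 2)
terms_deg5 = n_poly_terms(n_features, 5)
print(terms_deg2)  # 152
print(terms_deg5)  # 20348
```

With 1,069 training rows, the degree-5 expansion has far more columns than observations, so it can fit the training set very closely even with light regularization.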

Both cases fitted with Ridge regularization with $\alpha=0.1$:

In [18]:
# degree 2
ridge_reg2 = Ridge(alpha=0.1, solver='cholesky')
ridge_reg2.fit(x_train_poly2,y_train)
# degree 5
ridge_reg5 = Ridge(alpha=0.1, solver='cholesky')
ridge_reg5.fit(x_train_poly5,y_train)

Ridge(alpha=0.1, solver='cholesky')

Evaluation, degree 2:

In [19]:
preds_train = ridge_reg2.predict(x_train_poly2)
preds_test = ridge_reg2.predict(x_test_poly2)
print(f'Train RMSE: {np.sqrt(mean_squared_error(y_train, preds_train))}')
print(f'Test RMSE: {np.sqrt(mean_squared_error(y_test, preds_test))}')

Train RMSE: 4661.137111252048
Test RMSE: 4817.562845412789


Evaluation, degree 5:

In [20]:
preds_train = ridge_reg5.predict(x_train_poly5)
preds_test = ridge_reg5.predict(x_test_poly5)
print(f'Train RMSE: {np.sqrt(mean_squared_error(y_train, preds_train))}')
print(f'Test RMSE: {np.sqrt(mean_squared_error(y_test, preds_test))}')

Train RMSE: 2911.7533188462576
Test RMSE: 20598.57727695553


With Ridge regularization with $\alpha=0.5$:

In [21]:
# degree 2
ridge_reg2 = Ridge(alpha=0.5, solver='cholesky')
ridge_reg2.fit(x_train_poly2,y_train)
# degree 5
ridge_reg5 = Ridge(alpha=0.5, solver='cholesky')
ridge_reg5.fit(x_train_poly5,y_train)

Ridge(alpha=0.5, solver='cholesky')

Evaluation, degree 2:

In [22]:
preds_train = ridge_reg2.predict(x_train_poly2)
preds_test = ridge_reg2.predict(x_test_poly2)
print(f'Train RMSE: {np.sqrt(mean_squared_error(y_train, preds_train))}')
print(f'Test RMSE: {np.sqrt(mean_squared_error(y_test, preds_test))}')

Train RMSE: 4661.137742428537
Test RMSE: 4817.410020656636


Evaluation, degree 5:

In [23]:
preds_train = ridge_reg5.predict(x_train_poly5)
preds_test = ridge_reg5.predict(x_test_poly5)
print(f'Train RMSE: {np.sqrt(mean_squared_error(y_train, preds_train))}')
print(f'Test RMSE: {np.sqrt(mean_squared_error(y_test, preds_test))}')

Train RMSE: 2915.253737119541
Test RMSE: 17326.610400321333


With Ridge regularization with $\alpha=0.7$:

In [24]:
# degree 2
ridge_reg2 = Ridge(alpha=0.7, solver='cholesky')
ridge_reg2.fit(x_train_poly2,y_train)
# degree 5
ridge_reg5 = Ridge(alpha=0.7, solver='cholesky')
ridge_reg5.fit(x_train_poly5,y_train)

Ridge(alpha=0.7, solver='cholesky')

Evaluation, degree 2:

In [25]:
preds_train = ridge_reg2.predict(x_train_poly2)
preds_test = ridge_reg2.predict(x_test_poly2)
print(f'Train RMSE: {np.sqrt(mean_squared_error(y_train, preds_train))}')
print(f'Test RMSE: {np.sqrt(mean_squared_error(y_test, preds_test))}')

Train RMSE: 4661.13837309605
Test RMSE: 4817.333961274703


Evaluation, degree 5:

In [26]:
preds_train = ridge_reg5.predict(x_train_poly5)
preds_test = ridge_reg5.predict(x_test_poly5)
print(f'Train RMSE: {np.sqrt(mean_squared_error(y_train, preds_train))}')
print(f'Test RMSE: {np.sqrt(mean_squared_error(y_test, preds_test))}')

Train RMSE: 2916.848696852341
Test RMSE: 16745.832202347592


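## 3) Conclusions

The results support the following conclusions:

- Training error dropped markedly as the polynomial degree increased: the training RMSE is close to the 25th percentile of `charges` with degree 2 (about 4,661) and below it with degree 5 (about 2,915). Test error behaves differently: it improved with degree 2 (about 4,817, versus 5,718 for the linear model) but rose sharply with degree 5 (between 16,746 and 20,599).
- Training error exceeded test error only for the plain linear model, where the high error on both sets suggests underfitting. With degree 2 the test error is slightly above the training error, and with degree 5 the gap is very large, which indicates overfitting.
- The regularization strength $\alpha$ has almost no effect with degree 2. With degree 5, raising $\alpha$ from 0.1 to 0.7 increases the training error slightly (from about 2,912 to 2,917) and lowers the test error (from about 20,599 to 16,746), so within the range tested, stronger regularization generalizes better.

As for the coefficients, the values for each model are reviewed below: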
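
Only three values of $\alpha$ were tried by hand, and they were compared on the test set. A more systematic option is to choose $\alpha$ by cross-validation on the training data. The following is a minimal sketch with scikit-learn's `RidgeCV` on synthetic data; the grid of `alphas`, the data, and all variable names are assumptions for illustration, not settings used in this notebook:

```python
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# synthetic data (an assumption): one linear and one quadratic effect plus noise
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(300, 4))
y_demo = 2.0 * X_demo[:, 0] + X_demo[:, 1] ** 2 + rng.normal(scale=0.5, size=300)

# candidate regularization strengths on a log scale
alphas = np.logspace(-2, 3, 20)

# scaling and polynomial expansion sit inside the pipeline,
# so they are fitted on the same data as the ridge model
model = make_pipeline(
    StandardScaler(),
    PolynomialFeatures(degree=2, include_bias=False),
    RidgeCV(alphas=alphas, cv=5),
)
model.fit(X_demo, y_demo)

# alpha with the best 5-fold cross-validated score
best_alpha = model.named_steps['ridgecv'].alpha_
print(best_alpha)
```

Selecting $\alpha$ this way keeps the test set untouched until the final evaluation.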

In [27]:
# pair each coefficient with its feature name;
# x (not df) holds the columns the model was trained on, without 'charges'
coef_dict = dict(zip(x.columns, regr.coef_))
coef_dict

{'age': 3624.95769250073,
 'bmi': 2325.130766689254,
 'sex_female': -1.0407737876506344e+17,
 'sex_male': -1.0407737876506378e+17,
 'smoker_no': 5.46633961721754e+16,
 'smoker_yes': 5.466339617218511e+16,
 'region_northeast': -2.597739272246916e+17,
 'region_northwest': -2.5538907140750294e+17,
 'region_southeast': -2.668659666072071e+17,
 'region_southwest': -2.5812074095713766e+17,
 'children_0': -1.3459883717574115e+17,
 'children_1': -1.1786593820662995e+17,
 'children_2': -1.0507384335733854e+17,
 'children_3': -8.790849510513446e+16,
 'children_4': -3.6021448816558236e+16,
 'children_5': -3.0994167390863624e+16}

In [28]:
ridge_reg2 = Ridge(alpha=0.1, solver='cholesky')
ridge_reg2.fit(x_train_poly2,y_train)
# names of the polynomial terms, in the same order as the coefficients
feat_names = poly_features2.get_feature_names_out(x.columns)
# show the 16 linear terms plus the first squared term (age^2)
coef_dict = dict(zip(feat_names[:17], ridge_reg2.coef_[:17]))
coef_dict

{'age': 3515.4543467790404,
 'bmi': 2278.6304173493936,
 'sex_female': 196.7592771698507,
 'sex_male': -196.75927716893142,
 'smoker_no': -1322.0801261414529,
 'smoker_yes': 1322.0801261498914,
 'region_northeast': 115.08478778328369,
 'region_northwest': 16.194309429476572,
 'region_southeast': -67.11399710534424,
 'region_southwest': -62.45695390958237,
 'children_0': -213.78541649949454,
 'children_1': 57.717840172322475,
 'children_2': 169.76521324549645,
 'children_3': 42.04220262738617,
 'children_4': 7.944230422673783,
 'children_5': 4.9166738085962365,
 'age^2': 859.9830529789559}

In [29]:
ridge_reg5 = Ridge(alpha=0.1, solver='cholesky')
# note: this fit uses the degree-2 features (x_train_poly2), so the
# coefficients are those of the degree-2 model, not of a degree-5 model
ridge_reg5.fit(x_train_poly2,y_train)
# names of the polynomial terms, in the same order as the coefficients
feat_names = poly_features2.get_feature_names_out(x.columns)
# show the 16 linear terms plus the first squared term (age^2)
coef_dict = dict(zip(feat_names[:17], ridge_reg5.coef_[:17]))
coef_dict

{'age': 3515.4543467790404,
 'bmi': 2278.6304173493936,
 'sex_female': 196.7592771698507,
 'sex_male': -196.75927716893142,
 'smoker_no': -1322.0801261414529,
 'smoker_yes': 1322.0801261498914,
 'region_northeast': 115.08478778328369,
 'region_northwest': 16.194309429476572,
 'region_southeast': -67.11399710534424,
 'region_southwest': -62.45695390958237,
 'children_0': -213.78541649949454,
 'children_1': 57.717840172322475,
 'children_2': 169.76521324549645,
 'children_3': 42.04220262738617,
 'children_4': 7.944230422673783,
 'children_5': 4.9166738085962365,
 'age^2': 859.9830529789559}

In [30]:
ridge_reg2 = Ridge(alpha=0.5, solver='cholesky')
ridge_reg2.fit(x_train_poly2,y_train)
# names of the polynomial terms, in the same order as the coefficients
feat_names = poly_features2.get_feature_names_out(x.columns)
# show the 16 linear terms plus the first squared term (age^2)
coef_dict = dict(zip(feat_names[:17], ridge_reg2.coef_[:17]))
coef_dict

{'age': 3514.067125732168,
 'bmi': 2277.7076363831743,
 'sex_female': 196.68620381189646,
 'sex_male': -196.68620381145425,
 'smoker_no': -1321.9823791594026,
 'smoker_yes': 1321.9823791583394,
 'region_northeast': 115.04962314702779,
 'region_northwest': 16.157955776717277,
 'region_southeast': -67.06833006699102,
 'region_southwest': -62.43280938990244,
 'children_0': -213.7483355514581,
 'children_1': 57.69046504131172,
 'children_2': 169.74663367353514,
 'children_3': 42.04813991707984,
 'children_4': 7.942392121583351,
 'children_5': 4.908028491009149,
 'age^2': 859.6583578437538}

In [31]:
ridge_reg5 = Ridge(alpha=0.5, solver='cholesky')
# note: this fit uses the degree-2 features (x_train_poly2), so the
# coefficients are those of the degree-2 model, not of a degree-5 model
ridge_reg5.fit(x_train_poly2,y_train)
# names of the polynomial terms, in the same order as the coefficients
feat_names = poly_features2.get_feature_names_out(x.columns)
# show the 16 linear terms plus the first squared term (age^2)
coef_dict = dict(zip(feat_names[:17], ridge_reg5.coef_[:17]))
coef_dict

{'age': 3514.067125732168,
 'bmi': 2277.7076363831743,
 'sex_female': 196.68620381189646,
 'sex_male': -196.68620381145425,
 'smoker_no': -1321.9823791594026,
 'smoker_yes': 1321.9823791583394,
 'region_northeast': 115.04962314702779,
 'region_northwest': 16.157955776717277,
 'region_southeast': -67.06833006699102,
 'region_southwest': -62.43280938990244,
 'children_0': -213.7483355514581,
 'children_1': 57.69046504131172,
 'children_2': 169.74663367353514,
 'children_3': 42.04813991707984,
 'children_4': 7.942392121583351,
 'children_5': 4.908028491009149,
 'age^2': 859.6583578437538}

In [32]:
ridge_reg2 = Ridge(alpha=0.7, solver='cholesky')
ridge_reg2.fit(x_train_poly2,y_train)
# names of the polynomial terms, in the same order as the coefficients
feat_names = poly_features2.get_feature_names_out(x.columns)
# show the 16 linear terms plus the first squared term (age^2)
coef_dict = dict(zip(feat_names[:17], ridge_reg2.coef_[:17]))
coef_dict

{'age': 3513.373948200924,
 'bmi': 2277.246583910599,
 'sex_female': 196.64966686670593,
 'sex_male': -196.6496668663386,
 'smoker_no': -1321.9335252947126,
 'smoker_yes': 1321.9335252932117,
 'region_northeast': 115.03204212280191,
 'region_northwest': 16.139802394379423,
 'region_southeast': -67.04550966337476,
 'region_southwest': -62.42074806560621,
 'children_0': -213.7298038135419,
 'children_1': 57.676783334999605,
 'children_2': 169.73734747150533,
 'children_3': 42.0511067900516,
 'children_4': 7.941473437794925,
 'children_5': 4.9037138865588545,
 'age^2': 859.4961194551341}

In [33]:
ridge_reg5 = Ridge(alpha=0.7, solver='cholesky')
# note: this fit uses the degree-2 features (x_train_poly2), so the
# coefficients are those of the degree-2 model, not of a degree-5 model
ridge_reg5.fit(x_train_poly2,y_train)
# names of the polynomial terms, in the same order as the coefficients
feat_names = poly_features2.get_feature_names_out(x.columns)
# show the 16 linear terms plus the first squared term (age^2)
coef_dict = dict(zip(feat_names[:17], ridge_reg5.coef_[:17]))
coef_dict

{'age': 3513.373948200924,
 'bmi': 2277.246583910599,
 'sex_female': 196.64966686670593,
 'sex_male': -196.6496668663386,
 'smoker_no': -1321.9335252947126,
 'smoker_yes': 1321.9335252932117,
 'region_northeast': 115.03204212280191,
 'region_northwest': 16.139802394379423,
 'region_southeast': -67.04550966337476,
 'region_southwest': -62.42074806560621,
 'children_0': -213.7298038135419,
 'children_1': 57.676783334999605,
 'children_2': 169.73734747150533,
 'children_3': 42.0511067900516,
 'children_4': 7.941473437794925,
 'children_5': 4.9037138865588545,
 'age^2': 859.4961194551341}

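In all cases, among the terms listed, the largest interpretable coefficients belong to age and body mass index (the polynomial models have many more coefficients than are shown here). The coefficients of order $10^{17}$ in the unregularized model are a numerical artifact of perfectly collinear dummy columns, since each complete set of dummies sums to one, and should not be read as importance.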