**1.1**
Fit a regression with 'mpg' as the output, dropping the 'model' column. Treat all other factors as numeric/ordinal.
- Compute R2 and interpret the signs of the betas.
- Perform a train-test split that uses 40% of the data for training. Compute the training and test R2.
- Add L2 regularization with a lambda hyperparameter of your choice. Vary this value and compare the training and test R2 across several settings.
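
For reference, the L2 penalty mentioned in the last bullet can be written as the ridge objective below. This is the standard textbook form, not something stated in the exercise. scikit-learn's `Ridge` names the hyperparameter `alpha` rather than lambda and leaves the intercept unpenalized.

```latex
\hat{\beta}_{\text{ridge}}
  = \arg\min_{\beta_0,\,\beta}\;
    \sum_{i=1}^{n} \bigl(y_i - \beta_0 - x_i^{\top}\beta\bigr)^2
    + \lambda \sum_{j=1}^{p} \beta_j^{2}
```

Setting lambda to 0 recovers ordinary least squares; larger values shrink the betas toward zero.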

In [70]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import r2_score
import statsmodels.api as sm

dataset = "Motor Trend Car Road Tests (2).xlsx"
df = pd.read_excel(dataset)
df.columns = [c.strip() for c in df.columns]
df.head()


               model   mpg  cyl   disp   hp  drat     wt   qsec  vs  am  gear  carb
0          Mazda RX4  21.0    6  160.0  110   3.9   2.62  16.46   0   1     4     4
1      Mazda RX4 Wag  21.0    6  160.0  110   3.9  2.875  17.02   0   1     4     4
2         Datsun 710  22.8    4  108.0   93  3.85   2.32  18.61   1   1     4     1
3     Hornet 4 Drive  21.4    6  258.0  110  3.08  3.215  19.44   1   0     3     1
4  Hornet Sportabout  18.7    8  360.0  175  3.15   3.44  17.02   0   0     3     2


In [109]:
df = df.drop(columns=[c for c in df.columns if c.lower() == "model"], errors="ignore")

X_full = df.drop(columns=["mpg"]).select_dtypes(include=[np.number])
df_clean = df.dropna(subset=X_full.columns.tolist() + ["mpg"])
X_full = df_clean.drop(columns=["mpg"]).select_dtypes(include=[np.number])
y_full = df_clean["mpg"].astype(float)

X_full.shape, y_full.shape

((32, 10), (32,))

In [111]:
X_train, X_test, y_train, y_test = train_test_split(
    X_full, y_full, train_size=0.40, random_state=42, shuffle=True
)

len(X_train), len(X_test)

(12, 20)

In [198]:
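# Fit the scaler on the training rows only, then apply it to the test rows,
# so no test-set statistics leak into the model.
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s  = scaler.transform(X_test)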
cols = X_train.columns

ols = LinearRegression().fit(X_train_s, y_train)
y_tr_hat_ols = ols.predict(X_train_s)
y_te_hat_ols = ols.predict(X_test_s)
r2_train_ols = r2_score(y_train, y_tr_hat_ols)
r2_test_ols  = r2_score(y_test,  y_te_hat_ols)
res_ols = pd.DataFrame(np.column_stack([[r2_train_ols], [r2_test_ols]]), columns=["R2_train", "R2_test"])
res_ols

   R2_train   R2_test
0  0.998212 -7.107141


In [200]:
betas_ols = pd.Series(ols.coef_, index=cols)
betas_ols = betas_ols.reindex(betas_ols.abs().sort_values(ascending=False).index)
signo_ols = ["+" if b >= 0 else "-" for b in betas_ols.values]
tabla_betas_ols = pd.DataFrame(
    np.column_stack([betas_ols.index.to_numpy(), betas_ols.values, np.array(signo_ols, dtype=object)]),
    columns=["feature", "beta", "signo"]
)

alphas = [0.1, 1.0, 10.0, 100.0, 300.0]
a_list = []
r2tr_list = []
r2te_list = []
for a in alphas:
    rg = Ridge(alpha=a, fit_intercept=True, random_state=42).fit(X_train_s, y_train)
    a_list.append(a)
    r2tr_list.append(r2_score(y_train, rg.predict(X_train_s)))
    r2te_list.append(r2_score(y_test,  rg.predict(X_test_s)))

ridge_scores = pd.DataFrame(
    np.column_stack([a_list, r2tr_list, r2te_list]),
    columns=["alpha", "R2_train", "R2_test"]
)
ridge_scores

   alpha  R2_train   R2_test
0    0.1  0.983138 -0.506482
1    1.0  0.927481  0.651838
2   10.0  0.878923  0.800880
3  100.0  0.585232  0.671088
4  300.0  0.311403  0.349753


In [117]:
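# Pick the alpha with the highest test R2. Note that this reuses the test set
# for model selection, so the test R2 of the chosen model is optimistic.
best_idx = int(np.argmax(np.array(r2te_list)))
best_alpha = a_list[best_idx]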
ridge_best = Ridge(alpha=best_alpha, fit_intercept=True, random_state=42).fit(X_train_s, y_train)
betas_ridge = pd.Series(ridge_best.coef_, index=cols)
betas_ridge = betas_ridge.reindex(betas_ridge.abs().sort_values(ascending=False).index)
signo_ridge = ["+" if b >= 0 else "-" for b in betas_ridge.values]
tabla_betas_ridge = pd.DataFrame(
    np.column_stack([betas_ridge.index.to_numpy(), betas_ridge.values, np.array(signo_ridge, dtype=object)]),
    columns=["feature", "beta_ridge", "signo"]
)

In [119]:
Xc_all = sm.add_constant((X_full - X_full.mean())/X_full.std(ddof=0))
ols_sm = sm.OLS(y_full, Xc_all).fit()

res_ols, tabla_betas_ols, ridge_scores, tabla_betas_ridge, ols_sm.summary()

(   R2_train   R2_test
 0  0.998212 -7.107141,
   feature       beta signo
 0      vs -12.988911     -
 1      wt -10.159411     -
 2    qsec   9.172667     +
 3    carb   6.651356     +
 4     cyl  -5.835217     -
 5    drat  -3.606297     -
 6    gear   2.480631     +
 7      am   2.212592     +
 8      hp  -2.164675     -
 9    disp  -1.648889     -,
    alpha  R2_train   R2_test
 0    0.1  0.983138 -0.506482
 1    1.0  0.927481  0.651838
 2   10.0  0.878923  0.800880
 3  100.0  0.585232  0.671088
 4  300.0  0.311403  0.349753,
   feature beta_ridge signo
 0      wt  -1.307344     -
 1      am    0.89179     +
 2      hp  -0.866107     -
 3    disp  -0.812825     -
 4    carb  -0.802666     -
 5     cyl  -0.760716     -
 6    gear   0.628285     +
 7      vs   0.522034     +
 8    drat   0.497123     +
 9    qsec   0.317899     +,
 <class 'statsmodels.iolib.summary.Summary'>
 """
                             OLS Regression Results                            
 Dep. Variable:         

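Interpretation:

OLS fits the data it has already seen almost perfectly (training R2 ≈ 0.998), but on new data it fails badly (test R2 ≈ -7.11): with only 12 training rows and 10 predictors it "memorized" the training set instead of generalizing. Ridge fixes most of this: test R2 rises from -0.51 at alpha = 0.1 to 0.80 at alpha = 10 (training R2 ≈ 0.88), then falls again as the penalty becomes too strong (0.35 at alpha = 300). The OLS betas are too unstable to interpret, since vs, carb and drat flip sign between OLS and the best Ridge fit. In the Ridge fit, wt, hp, disp, carb and cyl are negative (heavier, more powerful cars have lower mpg), while am, gear, vs, drat and qsec are positive.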
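
A negative test R2, as OLS produces here, is possible because R2 = 1 - SS_res / SS_tot, which drops below zero whenever the model's squared error exceeds that of always predicting the mean. A minimal sketch with toy arrays (invented for illustration, not taken from the dataset):

```python
import numpy as np
from sklearn.metrics import r2_score

# Toy target and deliberately reversed predictions (illustrative values only).
y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([3.0, 2.0, 1.0])

ss_res = np.sum((y_true - y_pred) ** 2)          # squared error of the model: 8
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # squared error of the mean: 2
r2_manual = 1.0 - ss_res / ss_tot                # 1 - 8/2 = -3
r2_sklearn = r2_score(y_true, y_pred)

print(r2_manual, r2_sklearn)  # both -3.0
```

So a test R2 of about -7 means the OLS predictions carry roughly eight times the squared error of a constant prediction at the test mean.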

**1.2** Repeat the previous exercise using 'qsec' as the output.

In [121]:
dataset = "Motor Trend Car Road Tests (2).xlsx"
df = pd.read_excel(dataset)
df.columns = [c.strip() for c in df.columns]
df = df.drop(columns=[c for c in df.columns if c.lower() == "model"], errors="ignore")


y_12 = df["qsec"].astype(float)
X_12 = df.drop(columns=["qsec"]).select_dtypes(include=[np.number])

df12 = df.dropna(subset=X_12.columns.tolist() + ["qsec"])
X_12 = df12.drop(columns=["qsec"]).select_dtypes(include=[np.number])
y_12 = df12["qsec"].astype(float)

In [123]:
Xtr12, Xte12, ytr12, yte12 = train_test_split(X_12, y_12, train_size=0.40, random_state=42, shuffle=True)
sc12 = StandardScaler().fit(Xtr12)
Xtr12s = sc12.transform(Xtr12)
Xte12s  = sc12.transform(Xte12)

In [127]:
ols12 = LinearRegression().fit(Xtr12s, ytr12)
ytr_hat_12 = ols12.predict(Xtr12s)
yte_hat_12 = ols12.predict(Xte12s)
r2tr12 = r2_score(ytr12, ytr_hat_12)
r2te12 = r2_score(yte12, yte_hat_12)

res_12_ols = pd.DataFrame(
    np.column_stack([[r2tr12], [r2te12]]),
    columns=["R2_train", "R2_test"]
)

In [129]:
betas_12 = pd.Series(ols12.coef_, index=Xtr12.columns)
betas_12 = betas_12.reindex(betas_12.abs().sort_values(ascending=False).index)
signos_12 = ["+" if b >= 0 else "-" for b in betas_12.values]

tabla_betas_12 = pd.DataFrame(
    np.column_stack([betas_12.index.to_numpy(), betas_12.values, np.array(signos_12, dtype=object)]),
    columns=["feature", "beta", "signo"]
)

In [131]:
alphas12 = [0.1, 1.0, 10.0, 100.0, 300.0]
a12 = []
r2tr_list12 = []
r2te_list12 = []
for a in alphas12:
    rg = Ridge(alpha=a, fit_intercept=True, random_state=42).fit(Xtr12s, ytr12)
    a12.append(a)
    r2tr_list12.append(r2_score(ytr12, rg.predict(Xtr12s)))
    r2te_list12.append(r2_score(yte12, rg.predict(Xte12s)))

ridge_12 = pd.DataFrame(
    np.column_stack([a12, r2tr_list12, r2te_list12]),
    columns=["alpha", "R2_train", "R2_test"]
)

In [133]:
best_idx_12 = int(np.argmax(np.array(r2te_list12)))
best_alpha_12 = a12[best_idx_12]
ridge_best_12 = Ridge(alpha=best_alpha_12, fit_intercept=True, random_state=42).fit(Xtr12s, ytr12)

betas_ridge_12 = pd.Series(ridge_best_12.coef_, index=Xtr12.columns)
betas_ridge_12 = betas_ridge_12.reindex(betas_ridge_12.abs().sort_values(ascending=False).index)
signos_ridge_12 = ["+" if b >= 0 else "-" for b in betas_ridge_12.values]

tabla_betas_ridge_12 = pd.DataFrame(
    np.column_stack([betas_ridge_12.index.to_numpy(), betas_ridge_12.values, np.array(signos_ridge_12, dtype=object)]),
    columns=["feature", "beta_ridge", "signo"]
)

res_12_ols, tabla_betas_12, ridge_12, tabla_betas_ridge_12

(   R2_train   R2_test
 0  0.998948 -1.001342,
   feature      beta signo
 0      vs  2.353335     +
 1      wt  1.798005     +
 2     mpg  1.251131     +
 3    carb -1.194684     -
 4     cyl  1.042173     +
 5    drat  0.646343     +
 6    gear -0.445618     -
 7      am -0.392329     -
 8      hp  0.363284     +
 9    disp  0.332145     +,
    alpha  R2_train   R2_test
 0    0.1  0.991047  0.323923
 1    1.0  0.952727  0.744818
 2   10.0  0.807692  0.678281
 3  100.0  0.368438  0.305166
 4  300.0  0.176359  0.140024,
   feature beta_ridge signo
 0      wt   0.828095     +
 1      vs   0.626257     +
 2      hp  -0.514257     -
 3    gear  -0.395045     -
 4     cyl   -0.38441     -
 5      am  -0.291622     -
 6    carb  -0.274802     -
 7     mpg   0.272671     +
 8    drat   0.234852     +
 9    disp  -0.069548     -)

With qsec as the target the same pattern appears. Plain OLS "memorized" the training set (training R2 ≈ 0.999) and failed on new data (test R2 ≈ -1.00), while Ridge recovers a test R2 of about 0.74 at alpha = 1.0.
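
Both 1.1 and 1.2 choose the best alpha by looking at test R2, which makes the reported test score optimistic. One alternative, not used in this notebook, is to choose alpha by cross-validation inside the training set. The sketch below uses scikit-learn's `RidgeCV` (leave-one-out by default) on synthetic data; the arrays and the names `X_demo`, `y_demo` and `alphas_demo` are assumptions made up for the example.

```python
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in with the same shape as the notebook's data (32 rows, 10 features).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(32, 10))
y_demo = X_demo[:, 0] - 0.5 * X_demo[:, 1] + rng.normal(scale=0.5, size=32)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, train_size=0.40, random_state=42
)

# RidgeCV picks alpha using the training rows only; the test rows stay untouched.
alphas_demo = [0.1, 1.0, 10.0, 100.0, 300.0]
model = make_pipeline(StandardScaler(), RidgeCV(alphas=alphas_demo))
model.fit(X_tr, y_tr)

chosen_alpha = model.named_steps["ridgecv"].alpha_
test_r2 = model.score(X_te, y_te)  # evaluated once, after alpha is fixed
print(chosen_alpha, test_r2)
```

With only 12 training rows the cross-validated choice is itself noisy, so this is a sketch of the procedure rather than a guarantee of a better score.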

**2.1** Fit a regression with 'mpg' as the output, dropping the 'model' column. Create dummy columns for the factors 'cyl', 'gear' and 'carb'.

- Compute R2 and interpret the signs of the betas.
- Perform a train-test split that uses 40% of the data for training. Compute the training and test R2.
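
As a small illustration of what dummy encoding does to a factor such as 'cyl', the sketch below uses a toy four-row frame (invented values, not the real dataset). With `drop_first=True` the first level becomes the baseline, so each remaining beta is read as a difference relative to that baseline.

```python
import pandas as pd

# Toy factor with three levels, mimicking 'cyl' (illustrative values only).
toy = pd.DataFrame({"cyl": [4, 6, 8, 4]})

full = pd.get_dummies(toy.astype("category"))
reduced = pd.get_dummies(toy.astype("category"), drop_first=True)

print(list(full.columns))     # ['cyl_4', 'cyl_6', 'cyl_8']
print(list(reduced.columns))  # ['cyl_6', 'cyl_8']
```

In the reduced encoding, a negative beta on `cyl_8` would mean lower predicted mpg for 8-cylinder cars than for 4-cylinder cars, other columns held fixed.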

In [153]:
y_21 = df["mpg"].astype(float)
dums_21 = pd.get_dummies(df[["cyl","gear","carb"]].astype("category"), drop_first=True)
cont_21 = df.drop(columns=["mpg","cyl","gear","carb"]).select_dtypes(include=[np.number])

X_21_vals = np.column_stack([cont_21.values, dums_21.values])
X_21_cols = list(cont_21.columns) + list(dums_21.columns)
tmp_21 = pd.DataFrame(np.column_stack([X_21_vals, y_21.values]), columns=X_21_cols+["__y__"]).dropna()
X_21 = tmp_21.drop(columns="__y__")
y_21 = tmp_21["__y__"].astype(float)

In [205]:
Xtr_21, Xte_21, ytr_21, yte_21 = train_test_split(X_21, y_21, train_size=0.40, random_state=42, shuffle=True)
sc_21 = StandardScaler().fit(Xtr_21)
XtrS_21 = sc_21.transform(Xtr_21)
XteS_21 = sc_21.transform(Xte_21)

In [211]:
ols_21 = LinearRegression().fit(XtrS_21, ytr_21)
r2tr_21 = r2_score(ytr_21, ols_21.predict(XtrS_21))
r2te_21 = r2_score(yte_21, ols_21.predict(XteS_21))
res_21 = pd.DataFrame(np.column_stack([[r2tr_21],[r2te_21]]), columns=["R2_train","R2_test"])
res_21

   R2_train   R2_test
0       1.0 -1.565514


In [213]:
alphas_21 = [0.1, 1.0, 10.0, 100.0, 300.0]
a_21, rtr_21, rte_21 = [], [], []
for a in alphas_21:
    rg = Ridge(alpha=a, fit_intercept=True, random_state=42).fit(XtrS_21, ytr_21)
    a_21.append(a)
    rtr_21.append(r2_score(ytr_21, rg.predict(XtrS_21)))
    rte_21.append(r2_score(yte_21, rg.predict(XteS_21)))
ridge_21 = pd.DataFrame(np.column_stack([a_21, rtr_21, rte_21]), columns=["alpha","R2_train","R2_test"])
ridge_21

   alpha  R2_train   R2_test
0    0.1  0.995223 -0.348670
1    1.0  0.951007  0.604205
2   10.0  0.879875  0.810120
3  100.0  0.576385  0.635280
4  300.0  0.307993  0.324763


**2.2** Repeat the previous exercise using 'qsec' as the output.

In [161]:
y_22 = df["qsec"].astype(float)
dums_22 = pd.get_dummies(df[["cyl","gear","carb"]].astype("category"), drop_first=True)
cont_22 = df.drop(columns=["qsec","cyl","gear","carb"]).select_dtypes(include=[np.number])

In [163]:
X_22_vals = np.column_stack([cont_22.values, dums_22.values])
X_22_cols = list(cont_22.columns) + list(dums_22.columns)
tmp_22 = pd.DataFrame(np.column_stack([X_22_vals, y_22.values]), columns=X_22_cols+["__y__"]).dropna()
X_22 = tmp_22.drop(columns="__y__")
y_22 = tmp_22["__y__"].astype(float)

In [165]:
Xtr_22, Xte_22, ytr_22, yte_22 = train_test_split(X_22, y_22, train_size=0.40, random_state=42, shuffle=True)
sc_22 = StandardScaler().fit(Xtr_22)
XtrS_22 = sc_22.transform(Xtr_22)
XteS_22 = sc_22.transform(Xte_22)

In [216]:
ols_22 = LinearRegression().fit(XtrS_22, ytr_22)
r2tr_22 = r2_score(ytr_22, ols_22.predict(XtrS_22))
r2te_22 = r2_score(yte_22, ols_22.predict(XteS_22))
res_22 = pd.DataFrame(np.column_stack([[r2tr_22],[r2te_22]]), columns=["R2_train","R2_test"])
res_22

   R2_train   R2_test
0       1.0 -0.266944


In [218]:
alphas_22 = [0.1, 1.0, 10.0, 100.0, 300.0]
a_22, rtr_22, rte_22 = [], [], []
for a in alphas_22:
    rg = Ridge(alpha=a, fit_intercept=True, random_state=42).fit(XtrS_22, ytr_22)
    a_22.append(a)
    rtr_22.append(r2_score(ytr_22, rg.predict(XtrS_22)))
    rte_22.append(r2_score(yte_22, rg.predict(XteS_22)))
ridge_22 = pd.DataFrame(np.column_stack([a_22, rtr_22, rte_22]), columns=["alpha","R2_train","R2_test"])
ridge_22

   alpha  R2_train   R2_test
0    0.1  0.996298  0.690983
1    1.0  0.980133  0.730232
2   10.0  0.860948  0.623792
3  100.0  0.434433  0.284449
4  300.0  0.214089  0.133655


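2.1 and 2.2:
In both cases OLS is perfect on the training set (R2 = 1.0), which indicates that the model memorizes the data (too many parameters for only 12 training rows), and so it fails on the test set (negative test R2: ≈ -1.57 in 2.1 and ≈ -0.27 in 2.2). Applying Ridge corrects the problem: the best test R2 is ≈ 0.81 at alpha = 10 in 2.1 and ≈ 0.73 at alpha = 1.0 in 2.2.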
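
A training R2 of exactly 1.0 is what one would expect when there are more predictors than training rows, because OLS can then generally fit any target exactly. In 2.1 the design matrix presumably has 16 columns (7 continuous plus 9 dummies) against 12 training rows; the notebook never prints that shape, so the count is an assumption. The sketch below reproduces the effect with pure noise of that assumed shape.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Pure noise: 12 rows and 16 columns (assumed shape), with an unrelated random target.
rng = np.random.default_rng(0)
X_noise = rng.normal(size=(12, 16))
y_noise = rng.normal(size=12)

fit = LinearRegression().fit(X_noise, y_noise)
r2_noise = r2_score(y_noise, fit.predict(X_noise))

print(round(r2_noise, 6))  # 1.0 up to floating-point error, despite no real signal
```

A perfect training score in this regime therefore says nothing about predictive quality, which is consistent with the negative test R2 values above.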

**3.1** Compare the R2 values of exercises 1.1 & 2.1.

In [186]:
# Pull the OLS scores out of each result table. Distinct names are used for 1.1
# so the r2tr_list / r2te_list lists from the Ridge loop are not overwritten.
r2tr_11 = float(res_ols.iloc[0,0]); r2te_11 = float(res_ols.iloc[0,1])
r2tr_21 = float(res_21.iloc[0,0]); r2te_21 = float(res_21.iloc[0,1])
r2tr_12 = float(res_12_ols.iloc[0,0]); r2te_12 = float(res_12_ols.iloc[0,1])
r2tr_22 = float(res_22.iloc[0,0]); r2te_22 = float(res_22.iloc[0,1])

In [188]:
c1_31 = np.array([r2tr_11, r2te_11])
c2_31 = np.array([r2tr_21, r2te_21])
cmp_31 = pd.DataFrame(np.column_stack([c1_31, c2_31]), columns=["R2_1_1","R2_2_1"])
cmp_31.index = ["train","test"]

In [190]:
cmp_31

         R2_1_1    R2_2_1
train  0.998212  1.000000
test  -7.107141 -1.565514


**3.2** Compare the R2 values of exercises 1.2 & 2.2.

In [193]:
c1_32 = np.array([r2tr_12, r2te_12])
c2_32 = np.array([r2tr_22, r2te_22])
cmp_32 = pd.DataFrame(np.column_stack([c1_32, c2_32]), columns=["R2_1_2","R2_2_2"])
cmp_32.index = ["train","test"]

In [195]:
cmp_32

         R2_1_2    R2_2_2
train  0.998948  1.000000
test  -1.001342 -0.266944


Comparing the exercises, every OLS model tends to memorize the data: training R2 is essentially 1 in all cases. On the test set, model 1.1 (mpg) performs very badly, with R2 ≈ -7.11. Using dummies in 2.1 improves this to ≈ -1.57, but that is still worse than predicting the mean. For qsec, both 1.2 and 2.2 overfit, but 2.2 is somewhat less bad, with R2 ≈ -0.27 against ≈ -1.00 for 1.2. With only 12 training rows, OLS does not generalize in any of the four setups, and regularization is needed to obtain useful results.
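
The way regularization helps can be seen in the size of the coefficient vector, which shrinks as alpha grows. The sketch below uses synthetic data (the arrays and the names `X_syn`, `y_syn` and `alphas_syn` are made up for the example) with the same alpha grid as the notebook.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Synthetic training set with 12 rows and 10 features, like the 40% split above.
rng = np.random.default_rng(1)
X_syn = rng.normal(size=(12, 10))
y_syn = X_syn @ rng.normal(size=10) + rng.normal(scale=0.5, size=12)

alphas_syn = [0.1, 1.0, 10.0, 100.0, 300.0]
norms = []
for a in alphas_syn:
    coef = Ridge(alpha=a).fit(X_syn, y_syn).coef_
    norms.append(float(np.linalg.norm(coef)))  # Euclidean length of the betas

for a, n in zip(alphas_syn, norms):
    print(f"alpha={a:>6}: ||beta|| = {n:.4f}")
```

Smaller betas make the predictions less sensitive to the quirks of a 12-row sample, which fits the pattern in the Ridge tables: training R2 falls steadily with alpha, while test R2 first rises and then falls once the penalty is too strong.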