# Cross-validation

We generate data with the following characteristics: n=40 observations and p=3 regressors.

The three regressors, $X_1,X_2,X_3$, follow a multivariate normal distribution with mean zero, variance 1, and pairwise covariances of 0.5. The dependent variable is generated as

$$
Y=2+X_1+U
$$

where $U\sim N(0,3^2)$.
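
As a reference point for the MSE values reported later, the population quantities implied by this design can be worked out directly. This is a short derivation from the stated model, assuming $U$ is independent of the regressors (the text does not say so explicitly, but the simulation code draws $U$ separately):

$$
\operatorname{Var}(Y)=\operatorname{Var}(X_1)+\operatorname{Var}(U)=1+9=10,
\qquad
R^2_{\text{pop}}=\frac{\operatorname{Var}(X_1)}{\operatorname{Var}(Y)}=\frac{1}{10}=0.1
$$

Under that assumption, even the true model $2+X_1$ has an expected out-of-sample MSE of $\operatorname{Var}(U)=9$, so cross-validated MSEs near or above 9 are what the design would lead us to expect.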

We use cross-validation (CV) to choose among the 7 possible linear models, one for each non-empty subset of the three regressors.
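
The count of 7 comes from the non-empty subsets of three regressors, $2^3-1=7$. A minimal sketch that enumerates them with `itertools.combinations`; the labels `X1`, `X2`, `X3` are illustrative names, not the column names used later in the notebook:

```python
from itertools import combinations

regressors = ["X1", "X2", "X3"]  # illustrative labels

# All non-empty subsets of the regressors: 2**3 - 1 = 7 candidate models
feature_sets = [
    list(subset)
    for k in range(1, len(regressors) + 1)
    for subset in combinations(regressors, k)
]

for i, features in enumerate(feature_sets, 1):
    print(f"m{i}: Y ~ {' + '.join(features)}")
```

Generating the list this way scales to more regressors, although the number of candidate models doubles with each one added.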

In [None]:
!pip install numpy pandas matplotlib seaborn scikit-learn scipy

**Step 1** Create the data

In [None]:
import numpy as np
from scipy.stats import multivariate_normal

# Parameters
mu = [0, 0, 0]
sigma = np.array([[1, 0.5, 0.5],
                  [0.5, 1, 0.5],
                  [0.5, 0.5, 1]])

n = 40
np.random.seed(12345)

# Generate multivariate normal data
X = multivariate_normal.rvs(mean=mu, cov=sigma, size=n)
print(X[:6])  # equivalent to head(X) in R



In [None]:
# Generate the error term U ~ N(0, 3^2), one draw per observation
e = np.random.normal(0, 3, n)

# Generate y
y = 2 + 1 * X[:, 0] + e

# Combine X and y
data = np.column_stack((X, y))
print(data[:6])

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, KFold
import pandas as pd

For illustration, we first run CV only for the linear model with $X_1,X_2,X_3$.

In [None]:
# Prepare the data as a DataFrame
df = pd.DataFrame(data, columns=['V1', 'V2', 'V3', 'y'])

# Features and target
X = df[['V1', 'V2', 'V3']]
y = df['y']

# Set up 5-fold cross-validation
kf = KFold(n_splits=5, shuffle=True, random_state=10101)
model = LinearRegression()

cv_scores = cross_val_score(model, X, y, cv=kf, scoring="neg_mean_squared_error")
mse_scores = -cv_scores  # Convert to positive MSE values

print("Cross-validation MSE scores:", mse_scores)
print("Mean MSE:", mse_scores.mean())
print("Std MSE:", mse_scores.std())
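
To make explicit what `cross_val_score` computes in the cell above, here is a sketch that reproduces the per-fold MSE by hand. It simulates its own data with the same design, using a separate generator and seed, so the numbers will not match the notebook's output; the names `X_demo`, `y_demo`, and `kf_demo` are introduced only for this example:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Simulated data with the same design as above (own seed, so values differ)
rng = np.random.default_rng(0)
sigma = np.array([[1.0, 0.5, 0.5], [0.5, 1.0, 0.5], [0.5, 0.5, 1.0]])
X_demo = rng.multivariate_normal(np.zeros(3), sigma, size=40)
y_demo = 2 + X_demo[:, 0] + rng.normal(0, 3, size=40)

kf_demo = KFold(n_splits=5, shuffle=True, random_state=10101)

# Manual K-fold: fit on 4 folds, compute the MSE on the held-out fold
manual_mse = []
for train_idx, test_idx in kf_demo.split(X_demo):
    fit = LinearRegression().fit(X_demo[train_idx], y_demo[train_idx])
    resid = y_demo[test_idx] - fit.predict(X_demo[test_idx])
    manual_mse.append(np.mean(resid ** 2))
manual_mse = np.array(manual_mse)

# Same computation through cross_val_score (returns negative MSE)
auto_mse = -cross_val_score(LinearRegression(), X_demo, y_demo,
                            cv=kf_demo, scoring="neg_mean_squared_error")

print("Manual:", manual_mse)
print("cross_val_score:", auto_mse)
print("Match:", np.allclose(manual_mse, auto_mse))
```

With n=40 and 5 folds, each held-out fold has only 8 observations, which is one reason the per-fold MSEs can vary considerably.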

**Step 2** Define the 7 models and run CV for each of them

We select the model with the lowest mean MSE across folds

In [None]:
# List of feature sets for each model
feature_sets = [
    ['V1'],
    ['V2'],
    ['V3'],
    ['V1', 'V2'],
    ['V1', 'V3'],
    ['V2', 'V3'],
    ['V1', 'V2', 'V3']
]

kf = KFold(n_splits=5, shuffle=True, random_state=10101)
model = LinearRegression()

for i, features in enumerate(feature_sets, 1):
    X = df[features]
    y = df['y']
    cv_scores = cross_val_score(model, X, y, cv=kf, scoring="neg_mean_squared_error")
    mse_scores = -cv_scores  # Convert to positive MSE values
    print(f"Model m{i}: y ~ {' + '.join(features)}")
    print("  Cross-validation MSE scores:", mse_scores)
    print("  Mean MSE:", mse_scores.mean())
    print("  Std MSE:", mse_scores.std())
    print()
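
The loop above prints the results but leaves the selection to the reader. One way to pick the model with the lowest mean CV MSE programmatically is sketched below. It is self-contained and simulates its own data with the same design, so the selected model need not coincide with the notebook's; `df_sim`, `mean_mse`, and `results` are names introduced only for this example:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Simulated data with the same design as above (own seed, so values differ)
rng = np.random.default_rng(0)
sigma = np.array([[1.0, 0.5, 0.5], [0.5, 1.0, 0.5], [0.5, 0.5, 1.0]])
X_sim = rng.multivariate_normal(np.zeros(3), sigma, size=40)
df_sim = pd.DataFrame(X_sim, columns=["V1", "V2", "V3"])
df_sim["y"] = 2 + df_sim["V1"] + rng.normal(0, 3, size=40)

feature_sets = [["V1"], ["V2"], ["V3"], ["V1", "V2"],
                ["V1", "V3"], ["V2", "V3"], ["V1", "V2", "V3"]]
kf_sim = KFold(n_splits=5, shuffle=True, random_state=10101)

# Mean CV MSE for each candidate model, keyed by its formula
mean_mse = {}
for features in feature_sets:
    scores = cross_val_score(LinearRegression(), df_sim[features], df_sim["y"],
                             cv=kf_sim, scoring="neg_mean_squared_error")
    mean_mse[" + ".join(features)] = -scores.mean()

# Sort ascending so the first entry is the selected model
results = pd.Series(mean_mse, name="mean_cv_mse").sort_values()
best_model = results.index[0]
print(results)
print("Selected model: y ~", best_model)
```

Reusing the same `KFold` object with a fixed `random_state` means every model is evaluated on the same folds, which keeps the comparison fair.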
    

## LASSO

We fit a model with the LASSO estimator and evaluate it out of sample.
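
For reference, the LASSO estimator solves a penalized least-squares problem. The scaling below is the one given in scikit-learn's documentation for `Lasso`; the intercept $\beta_0$ is not penalized, and $\alpha$ is the penalty that `LassoCV` chooses by cross-validation:

$$
(\hat\beta_0,\hat\beta)=\arg\min_{\beta_0,\beta}\;\frac{1}{2n}\sum_{i=1}^{n}\left(y_i-\beta_0-x_i'\beta\right)^2+\alpha\sum_{j=1}^{p}|\beta_j|
$$

Because the penalty depends on the scale of each $\beta_j$, the regressors are usually standardized before fitting.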


In [None]:
from sklearn.model_selection import train_test_split

In [None]:
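# Use all three regressors explicitly and fix random_state so the split is reproducible
X_train, X_test, y_train, y_test = train_test_split(df[['V1', 'V2', 'V3']], df['y'], test_size=.2, shuffle=True, random_state=10101)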

In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
import sklearn.linear_model as lm  # LassoCV is accessed below as lm.LassoCV

In [None]:
# Fit LASSO with 5-fold CV to choose alpha; regressors are standardized first
lasso_cv = Pipeline([('scale', StandardScaler()),  # standardize the regressors
                     ('lasso', lm.LassoCV(cv=5, random_state=10101))])
model = lasso_cv.fit(X_train, y_train)

In [None]:
# Print summary information
lasso = model.named_steps['lasso']
print("Optimal alpha:", lasso.alpha_)
print("Coefficients:", lasso.coef_)
print("Intercept:", lasso.intercept_)
print("Number of Iterations:", lasso.n_iter_)

In [None]:
yhat_lasso = model.predict(X_test)
mse_test = np.mean((y_test - yhat_lasso) ** 2)
print("Test MSE:", mse_test)
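
A test MSE is easier to read next to a baseline. The sketch below compares LASSO with plain OLS on the same train/test split. It simulates its own data with the same design, so the numbers will differ from the notebook's, and the names `X_sim`, `y_sim`, `lasso_pipe`, and `ols` are introduced only for this example:

```python
import numpy as np
from sklearn.linear_model import LassoCV, LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Simulated data with the same design as above (own seed, so values differ)
rng = np.random.default_rng(0)
sigma = np.array([[1.0, 0.5, 0.5], [0.5, 1.0, 0.5], [0.5, 0.5, 1.0]])
X_sim = rng.multivariate_normal(np.zeros(3), sigma, size=40)
y_sim = 2 + X_sim[:, 0] + rng.normal(0, 3, size=40)

# Same split for both estimators: 32 training and 8 test observations
X_tr, X_te, y_tr, y_te = train_test_split(X_sim, y_sim, test_size=0.2,
                                          random_state=10101)

lasso_pipe = Pipeline([("scale", StandardScaler()),
                       ("lasso", LassoCV(cv=5, random_state=10101))]).fit(X_tr, y_tr)
ols = LinearRegression().fit(X_tr, y_tr)

mse_lasso = np.mean((y_te - lasso_pipe.predict(X_te)) ** 2)
mse_ols = np.mean((y_te - ols.predict(X_te)) ** 2)
print("Test MSE (LASSO):", mse_lasso)
print("Test MSE (OLS):  ", mse_ols)
```

With only 8 test observations, either estimate is noisy, so a small difference between the two should not be over-interpreted.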