Load the "mpg" dataset bundled with `seaborn`:

In [44]:
from seaborn import load_dataset
data = load_dataset("mpg")
data.head()

| | mpg | cylinders | displacement | horsepower | weight | acceleration | model_year | origin | name |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 18.0 | 8 | 307.0 | 130.0 | 3504 | 12.0 | 70 | usa | chevrolet chevelle malibu |
| 1 | 15.0 | 8 | 350.0 | 165.0 | 3693 | 11.5 | 70 | usa | buick skylark 320 |
| 2 | 18.0 | 8 | 318.0 | 150.0 | 3436 | 11.0 | 70 | usa | plymouth satellite |
| 3 | 16.0 | 8 | 304.0 | 150.0 | 3433 | 12.0 | 70 | usa | amc rebel sst |
| 4 | 17.0 | 8 | 302.0 | 140.0 | 3449 | 10.5 | 70 | usa | ford torino |


We drop the variables that are not meaningful for the model (here, `name`) and remove the rows with missing values.

In [45]:
data = data.drop(columns=["name"]).dropna()

We perform *forward* variable selection at a significance level of $5\%$:

In [46]:
from estyp.linear_model.stepwise import forward_selection
import statsmodels.api as sm

variable_respuesta = "mpg"
nivel_significancia = 0.05

formula_obtenida = forward_selection(
    y     = variable_respuesta,
    data  = data,
    model = sm.OLS,
    alpha = nivel_significancia
)

Variable agregada: weight                         | valor-p: <0.0001
Variable agregada: model_year                     | valor-p: <0.0001
Variable agregada: origin                         | valor-p: <0.0001
|| Fin de la selección ||
Fórmula obtenida: mpg ~ weight + model_year + origin
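
To make the procedure concrete, the following is a minimal sketch of one common way to do forward selection by p-value. It is not necessarily how `estyp` implements `forward_selection`: the function name `forward_selection_sketch`, the synthetic `toy` data, and the choice of a partial F-test (via statsmodels' `compare_f_test`) as the entry criterion are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf


def forward_selection_sketch(y, data, alpha=0.05):
    """Hypothetical helper: greedy forward selection using partial F-tests."""
    remaining = [col for col in data.columns if col != y]
    selected = []
    while remaining:
        # Model with the variables selected so far (intercept only at the start).
        rhs = " + ".join(selected) if selected else "1"
        current = smf.ols(f"{y} ~ {rhs}", data=data).fit()

        # p-value of the F-test comparing the current model with each
        # model that adds exactly one candidate variable.
        pvalues = {}
        for candidate in remaining:
            rhs_candidate = " + ".join(selected + [candidate])
            enlarged = smf.ols(f"{y} ~ {rhs_candidate}", data=data).fit()
            _, p_value, _ = enlarged.compare_f_test(current)
            pvalues[candidate] = p_value

        # Add the most significant candidate, or stop if none passes alpha.
        best = min(pvalues, key=pvalues.get)
        if pvalues[best] >= alpha:
            break
        selected.append(best)
        remaining.remove(best)

    return f"{y} ~ " + (" + ".join(selected) if selected else "1")


# Synthetic data (an assumption, not the mpg dataset): y depends strongly
# on x1, weakly on x2, and not at all on x3.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "x1": rng.normal(size=200),
    "x2": rng.normal(size=200),
    "x3": rng.normal(size=200),
})
toy["y"] = 3.0 * toy["x1"] + 1.0 * toy["x2"] + rng.normal(size=200)

formula = forward_selection_sketch("y", toy, alpha=0.05)
print(formula)
```

Because each candidate is tested as a whole term, a categorical variable would enter or stay out as a single block rather than one level at a time.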


We examine the model fitted with the resulting formula:

In [47]:
especificacion = sm.OLS.from_formula(formula_obtenida, data)
modelo_resultante = especificacion.fit()
display(modelo_resultante.summary())

| | | | |
|---|---|---|---|
| Dep. Variable: | mpg | R-squared: | 0.819 |
| Model: | OLS | Adj. R-squared: | 0.817 |
| Method: | Least Squares | F-statistic: | 437.9 |
| Date: | Wed, 26 Jul 2023 | Prob (F-statistic): | 3.53e-142 |
| Time: | 21:24:11 | Log-Likelihood: | -1026.1 |
| No. Observations: | 392 | AIC: | 2062.0 |
| Df Residuals: | 387 | BIC: | 2082.0 |
| Df Model: | 4 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | -16.3306 | 3.927 | -4.158 | 0.000 | -24.052 | -8.610 |
| origin[T.japan] | 0.2382 | 0.559 | 0.426 | 0.670 | -0.861 | 1.337 |
| origin[T.usa] | -1.9763 | 0.518 | -3.815 | 0.000 | -2.995 | -0.958 |
| weight | -0.0059 | 0.000 | -22.647 | 0.000 | -0.006 | -0.005 |
| model_year | 0.7698 | 0.049 | 15.818 | 0.000 | 0.674 | 0.866 |

| | | | |
|---|---|---|---|
| Omnibus: | 32.293 | Durbin-Watson: | 1.251 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 58.234 |
| Skew: | 0.507 | Prob(JB): | 2.26e-13 |
| Kurtosis: | 4.593 | Cond. No. | 72200.0 |


**Important note**: *Forward* selection by p-value is carried out using the ANOVA table. The ANOVA p-values differ from those shown by `modelo_resultante.summary()`: the summary reports one t-test per coefficient (so a categorical variable such as `origin` gets one p-value per level), whereas the ANOVA table reports one F-test per variable. It is therefore normal for the summary to contain coefficients whose p-value exceeds the significance level $\alpha$, as `origin[T.japan]` does here ($0.670$). The ANOVA table below verifies this:

In [42]:
from statsmodels.stats.anova import anova_lm

display(anova_lm(modelo_resultante))

| | df | sum_sq | mean_sq | F | PR(>F) |
|---|---|---|---|---|---|
| origin | 2.0 | 7904.291038 | 3952.145519 | 354.834109 | 2.934332e-88 |
| weight | 1.0 | 8817.605374 | 8817.605374 | 791.668003 | 1.263286e-95 |
| model_year | 1.0 | 2786.687559 | 2786.687559 | 250.196202 | 8.022354e-44 |
| Residual | 387.0 | 4310.409498 | 11.138009 | | |


Every p-value in the ANOVA table is below 0.05!
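
The difference between the two sets of p-values can be reproduced on a small example. The sketch below uses synthetic data (the `toy` frame and its column names are assumptions, not part of the mpg dataset) in which two of the three levels of a categorical variable share the same mean. It compares the per-coefficient t-tests with a per-term F-test. It also passes `typ=2` to `anova_lm`, which tests each term given all the others. The call in the notebook uses the default `typ=1`, which is sequential and depends on the order of the terms.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

# Synthetic data: levels "a" and "b" have the same mean, level "c" is shifted.
rng = np.random.default_rng(0)
n_per_group = 50
toy = pd.DataFrame({
    "group": pd.Categorical(np.repeat(["a", "b", "c"], n_per_group)),
    "x": rng.normal(size=3 * n_per_group),
})
shift = np.repeat([0.0, 0.0, 3.0], n_per_group)
toy["y"] = 1.0 + 2.0 * toy["x"] + shift + rng.normal(size=len(toy))

fit = smf.ols("y ~ group + x", data=toy).fit()

# One t-test per coefficient: "group" appears as group[T.b] and group[T.c].
coef_pvalues = fit.pvalues
print(coef_pvalues)

# One F-test per term: "group" appears as a single row.
term_table = anova_lm(fit, typ=2)
print(term_table)
```

In this construction the `group` term as a whole is clearly significant, while the coefficient for level `b` only measures its difference from the baseline level `a`, which is zero by design.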