### Milestone 3 for School Performance
*"Descriptive modeling, which seeks to identify the main determinants of the object of study. The predictive model can then be built or refined based on this section."*

####  Loading libraries and importing the .csv obtained in the previous milestone

In [1]:
%matplotlib inline

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

import statsmodels.formula as sm
import statsmodels.formula.api as smf

#from graficos import *
import funciones as fun

import warnings
warnings.filterwarnings('ignore')

#plt.style.use('seaborn') # seaborn-style plots
#plt.rcParams["figure.figsize"] = (8,6) # figure size (5, 3)
#plt.rcParams["figure.dpi"] = 75 # figure resolution 100

In [2]:
# Import the data
df = pd.read_csv('students_ready.csv')
print(df.shape)
df.head()

(395, 33)


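| Unnamed: 0 | school | sex | age | address | famsize | Pstatus | Medu | Fedu | Mjob | Fjob | ... | famrel | freetime | goout | Dalc | Walc | health | absences | G1 | G2 | G3 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | GP | F | NaN | U | GT3 | A | 4.0 | 4.0 | at_home | teacher | ... | 4.0 | 3.0 | 4.0 | 1.0 | 1.0 | 3.0 | 6.0 | 5.0 | 6.0 | 6.0 |
| 1 | GP | F | 17.0 | U | GT3 | T | 1.0 | 1.0 | at_home | other | ... | 5.0 | 3.0 | 3.0 | 1.0 | 1.0 | 3.0 | 4.0 | 5.0 | 5.0 | 6.0 |
| 2 | GP | F | 15.0 | U | LE3 | T | 1.0 | 1.0 | at_home | other | ... | 4.0 | 3.0 | 2.0 | 2.0 | 3.0 | 3.0 | 10.0 | NaN | 8.0 | 10.0 |
| 3 | GP | F | 15.0 | U | GT3 | T | 4.0 | 2.0 | health | services | ... | 3.0 | 2.0 | 2.0 | 1.0 | 1.0 | 5.0 | 2.0 | 15.0 | 14.0 | 15.0 |
| 4 | GP | F | NaN | U | GT3 | T | 3.0 | 3.0 | other | other | ... | 4.0 | 3.0 | 2.0 | 1.0 | 2.0 | 5.0 | 4.0 | 6.0 | 10.0 | 10.0 |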


###  Descriptive modeling

- Removal of null values.
- Transformation of categorical variables into dummy variables (one indicator column per level, with the first level dropped).
- The dataframe index is reset.
- Column names are replaced.

In [3]:
# Drop rows with null values
df = df.dropna()

# Replace yes/no values
for i in ['schoolsup', 'famsup', 'paid', 'activities', 'nursery', 'higher', 'internet', 'romantic']:
    df[i] = df[i].replace(fun.option_yn)

# Cast to str
for i in ['goout', 'health']:
    df[i] = df[i].astype(str)

# Cast to int
for i in ['G1', 'G3']:
    df[i] = df[i].astype(int)

# Dummies
df_dummies = pd.get_dummies(df, drop_first=True)
df_dummies.columns = [col.replace('-', '_') for col in df_dummies.columns]
# regex=False: treat '.0' as a literal string, not as the pattern "any character followed by 0"
df_dummies.columns = df_dummies.columns.str.replace('.0', '', regex=False)
print('Shape of the dataframe with dummies:', df_dummies.shape)


Shape of the dataframe with dummies: (284, 48)
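
To make the dummy-encoding step concrete, here is a minimal sketch that applies the same `pd.get_dummies(..., drop_first=True)` call and the same column-name cleanup to a made-up three-row frame. The frame `toy` and its values are illustrative assumptions, not rows from `students_ready.csv`, and `dtype=int` is added only so the indicators print as 0/1 regardless of the pandas version.

```python
import pandas as pd

# Hypothetical toy data: one categorical column, one numeric-looking string column, one target
toy = pd.DataFrame({
    "Mjob": ["at_home", "health", "other"],
    "goout": ["4.0", "3.0", "2.0"],
    "G1": [5, 15, 6],
})

# drop_first=True drops the first (alphabetically sorted) level of each categorical column,
# so that level becomes the reference category in the regression
dummies = pd.get_dummies(toy, drop_first=True, dtype=int)

# Strip the literal '.0' suffix so the names are valid in a model formula
dummies.columns = dummies.columns.str.replace(".0", "", regex=False)

print(list(dummies.columns))
print(dummies)
```

Each categorical column with k levels contributes k - 1 indicator columns, which avoids perfect collinearity with the intercept in the OLS models that follow.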


Estimating the models for the target variables G1 and G2:
- With the dataframe prepared, the semester averages (G1, G2) are modeled descriptively using OLS (ordinary least squares).
- A formula is generated for each target variable.
- Models are tested for each target variable.

In [4]:
# Drop the other target variables, depending on which average is being modeled.
df_dummies_g1 = df_dummies.drop(['G2', 'G3'], axis=1) # only G1 remains
df_dummies_g2 = df_dummies.drop(['G1', 'G3'], axis=1) # only G2 remains

# Build the OLS formula
base_g1_1 = fun.mf(df_dummies_g1, var_obj='G1')
base_g2_1 = fun.mf(df_dummies_g2, var_obj='G2')
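
The source of `fun.mf` is not shown in this notebook. As a rough sketch of what such a formula builder might look like, assuming it simply regresses the target on every other column, the hypothetical helper `make_formula` below builds a formula string of the kind `smf.ols` accepts. The helper name, the toy frame, and its behavior are assumptions for illustration only.

```python
import pandas as pd

def make_formula(df, var_obj):
    """Hypothetical stand-in for fun.mf: returns 'target ~ x1 + x2 + ...'."""
    # Every column except the target is used as a predictor
    predictors = [col for col in df.columns if col != var_obj]
    return f"{var_obj} ~ " + " + ".join(predictors)

# Made-up frame, not the real student data
toy = pd.DataFrame({"G1": [5, 15, 6], "age": [17, 15, 16], "studytime": [2, 3, 1]})

formula = make_formula(toy, var_obj="G1")
print(formula)
```

This reading is at least consistent with the summary for G1, where the 46 columns of `df_dummies_g1` (48 minus the two dropped targets) yield 45 predictors plus the target.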

OLS for G1

In [5]:
model_g1_1 = smf.ols(base_g1_1, data=df_dummies_g1).fit()
model_g1_1.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | G1 | R-squared: | 0.349 |
| Model: | OLS | Adj. R-squared: | 0.226 |
| Method: | Least Squares | F-statistic: | 2.837 |
| Date: | Sat, 09 Jul 2022 | Prob (F-statistic): | 1.7e-07 |
| Time: | 10:49:14 | Log-Likelihood: | -679.63 |
| No. Observations: | 284 | AIC: | 1451.0 |
| Df Residuals: | 238 | BIC: | 1619.0 |
| Df Model: | 45 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | 9.4379 | 3.818 | 2.472 | 0.014 | 1.916 | 16.959 |
| age | 0.1241 | 0.188 | 0.660 | 0.510 | -0.246 | 0.494 |
| Medu | 0.0962 | 0.287 | 0.335 | 0.738 | -0.469 | 0.662 |
| Fedu | 0.1468 | 0.237 | 0.619 | 0.536 | -0.320 | 0.614 |
| traveltime | -0.0750 | 0.294 | -0.255 | 0.799 | -0.654 | 0.504 |
| studytime | 0.5767 | 0.248 | 2.323 | 0.021 | 0.088 | 1.066 |
| failures | -1.1226 | 0.285 | -3.935 | 0.000 | -1.685 | -0.561 |
| schoolsup | -1.7347 | 0.569 | -3.048 | 0.003 | -2.856 | -0.614 |
| famsup | -0.9889 | 0.413 | -2.394 | 0.017 | -1.803 | -0.175 |
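
As a quick way to read the coefficient rows shown above, the sketch below copies the reported coefficients and p-values into a plain dictionary and keeps the terms with p < 0.05. The 0.05 threshold is an assumed convention rather than something stated in the notebook, and only the rows displayed here are included.

```python
# (coef, p-value) pairs copied from the rows displayed in the summary
reported = {
    "Intercept": (9.4379, 0.014),
    "age": (0.1241, 0.510),
    "Medu": (0.0962, 0.738),
    "Fedu": (0.1468, 0.536),
    "traveltime": (-0.0750, 0.799),
    "studytime": (0.5767, 0.021),
    "failures": (-1.1226, 0.000),
    "schoolsup": (-1.7347, 0.003),
    "famsup": (-0.9889, 0.017),
}

ALPHA = 0.05  # assumed significance level

# Keep the terms whose p-value falls below the threshold, in table order
significant = [name for name, (coef, p) in reported.items() if p < ALPHA]

for name in significant:
    coef, p = reported[name]
    direction = "higher" if coef > 0 else "lower"
    print(f"{name}: coef={coef:+.4f}, p={p:.3f} -> associated with {direction} G1")
```

Among the displayed predictors, `studytime` has a positive association with G1, while `failures`, `schoolsup`, and `famsup` have negative ones. These are descriptive associations from the fitted model, not causal effects.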

| | | | |
|---|---|---|---|
| Omnibus: | 6.163 | Durbin-Watson: | 1.975 |
| Prob(Omnibus): | 0.046 | Jarque-Bera (JB): | 4.108 |
| Skew: | 0.133 | Prob(JB): | 0.128 |
| Kurtosis: | 2.475 | Cond. No. | 442.0 |
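
The fit statistics in the summary can be cross-checked by hand. The sketch below recomputes adjusted R², AIC, and BIC from the reported R², log-likelihood, and degrees of freedom, using the standard definitions with k = Df Model + 1 estimated parameters (the extra one being the intercept). The variable names are illustrative.

```python
import math

# Values copied from the OLS summary for G1
n_obs = 284
df_model = 45
df_resid = 238
r_squared = 0.349
log_lik = -679.63

k = df_model + 1  # number of estimated parameters, including the intercept

# Adjusted R-squared penalizes R-squared for the number of regressors
adj_r2 = 1 - (1 - r_squared) * (n_obs - 1) / df_resid

# Information criteria: lower is better when comparing models on the same data
aic = 2 * k - 2 * log_lik
bic = k * math.log(n_obs) - 2 * log_lik

print(round(adj_r2, 3), round(aic), round(bic))
```

The drop from R² = 0.349 to an adjusted R² of about 0.226 reflects the cost of fitting 45 regressors on 284 observations, which suggests that a more parsimonious model may describe G1 nearly as well.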
