## Enem 2021 Data Analysis

The goal of this notebook is to cover:

1.   A descriptive analysis of the dataset
2.   Exploring ideas with multilevel modeling


This analysis uses the Enem microdata published on the Inep portal [here](https://www.gov.br/inep/pt-br/acesso-a-informacao/dados-abertos/microdados/enem).

The data analysis was done in Python, and [Rpy2](https://rpy2.github.io/) was used to fit a model in R with the [lme4](https://cran.r-project.org/web/packages/lme4/index.html) package.

In [None]:
# Install R and Rpy2
!apt-get install r-base
!pip install -Iv rpy2==3.4.2

# Install LMER packages (THIS TAKES ABOUT 3~5 minutes)
packnames = ('lme4', 'lmerTest', 'emmeans', "geepack","optimx")
from rpy2.robjects.packages import importr
from rpy2.robjects.vectors import StrVector
utils = importr("utils")
utils.chooseCRANmirror(ind=1)
utils.install_packages(StrVector(packnames))

In [None]:
%load_ext rpy2.ipython
# Enable cell magic for Rpy2 interface

The rpy2.ipython extension is already loaded. To reload it, use:
  %reload_ext rpy2.ipython


In [None]:
import pandas as pd

url = 'https://raw.githubusercontent.com/accustodio/WDS_DadosPublicos/main/amostra_enem_2021.csv?token=GHSAT0AAAAAACDIHCW2ALPK4ECAYLEQBZZ2ZD2QEDQ'
df = pd.read_csv(url)
df.head()

Unnamed: 0,NU_NOTA_MT,TP_COR_RACA,TP_ESCOLA,TP_SEXO,Q002,NU_NOTA_CN,NU_NOTA_CH,NU_NOTA_LC,TP_ANO_CONCLUIU
0,501.1,1,1,F,H,556.8,393.2,373.7,4
1,555.2,3,2,M,H,365.2,419.7,424.6,0
2,577.7,3,1,F,G,605.5,592.5,626.4,4
3,424.0,3,1,F,E,386.9,388.6,391.1,0
4,416.9,4,1,F,C,438.3,540.5,481.1,3


In [None]:
df.shape

Unnamed: 0,NU_NOTA_MT,TP_COR_RACA,TP_ESCOLA,TP_SEXO,Q002,NU_NOTA_CN,NU_NOTA_CH,NU_NOTA_LC,TP_ANO_CONCLUIU,TP_ESCOLACAT
0,501.1,1,1,F,H,556.8,393.2,373.7,4,NaoRespondeu
1,555.2,3,2,M,H,365.2,419.7,424.6,0,Publica
2,577.7,3,1,F,G,605.5,592.5,626.4,4,NaoRespondeu
3,424.0,3,1,F,E,386.9,388.6,391.1,0,NaoRespondeu
4,416.9,4,1,F,C,438.3,540.5,481.1,3,NaoRespondeu


In [None]:
df.isnull().sum()

NU_NOTA_MT           0
TP_COR_RACA          0
TP_ESCOLA            0
TP_SEXO              0
Q002                 0
NU_NOTA_CN           0
NU_NOTA_CH         426
NU_NOTA_LC         426
TP_ANO_CONCLUIU      0
dtype: int64

In [None]:
mapeamento = {1: 'NaoRespondeu', 2: 'Publica', 3: 'Privada'}
df['TP_ESCOLACAT'] = df['TP_ESCOLA'].replace(mapeamento)
df['TP_ESCOLACAT'].value_counts()

NaoRespondeu    69481
Publica         33762
Privada          9057
Name: TP_ESCOLACAT, dtype: int64

In [None]:
df.Q002.value_counts().sort_index()

A     2457
B    12314
C    11412
D    12308
E    39729
F    16060
G    15520
H     2500
Name: Q002, dtype: int64

In [None]:
df.NU_NOTA_MT.describe()

count    112300.000000
mean        535.248245
std         110.424382
min           0.000000
25%         445.000000
50%         516.000000
75%         614.100000
max         953.100000
Name: NU_NOTA_MT, dtype: float64

In [None]:
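%%R -i df

# LMER model in R: random intercept per school type.
# lmerTest wraps lme4::lmer and adds the Satterthwaite t-tests shown in the summary.
library(lmerTest)
modelo <- lmer('NU_NOTA_MT ~ Q002+(1|TP_ESCOLACAT)', data=df)
print(summary(modelo))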


Linear mixed model fit by REML. t-tests use Satterthwaite's method [
lmerModLmerTest]
Formula: "NU_NOTA_MT ~ Q002+(1|TP_ESCOLACAT)"
   Data: df

REML criterion at convergence: 1358325

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-5.7640 -0.7609 -0.1263  0.6866  4.3311 

Random effects:
 Groups       Name        Variance Std.Dev.
 TP_ESCOLACAT (Intercept)  1233     35.12  
 Residual                 10485    102.40  
Number of obs: 112300, groups:  TP_ESCOLACAT, 3

Fixed effects:
             Estimate Std. Error        df t value Pr(>|t|)    
(Intercept) 4.754e+02  2.038e+01 2.043e+00   23.32  0.00165 ** 
Q002B       2.303e+01  2.263e+00 1.123e+05   10.18  < 2e-16 ***
Q002C       4.030e+01  2.279e+00 1.123e+05   17.68  < 2e-16 ***
Q002D       4.879e+01  2.266e+00 1.123e+05   21.53  < 2e-16 ***
Q002E       6.989e+01  2.132e+00 1.123e+05   32.78  < 2e-16 ***
Q002F       1.114e+02  2.226e+00 1.123e+05   50.05  < 2e-16 ***
Q002G       1.209e+02  2.233e+00 1.123e+05   54.15  <
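
The variance components listed under "Random effects" can be summarized as an intraclass correlation (ICC): the share of the total variance attributed to the grouping factor. The following is a minimal sketch, assuming the two variances printed in the summary above (1233 for `TP_ESCOLACAT`, 10485 residual). The helper name `icc` is hypothetical and is not part of lme4 or rpy2.

```python
def icc(var_group: float, var_residual: float) -> float:
    """Intraclass correlation for a random-intercept model:
    group variance divided by total (group + residual) variance."""
    total = var_group + var_residual
    if total <= 0:
        raise ValueError("total variance must be positive")
    return var_group / total


# Variances copied by hand from the summary(modelo) output for
# NU_NOTA_MT ~ Q002 + (1 | TP_ESCOLACAT)
icc_q002 = icc(1233, 10485)
print(f"ICC = {icc_q002:.3f}")
```

With these numbers the ICC is roughly 0.105, so about a tenth of the variance in the math score sits between school types. Since the grouping factor has only 3 levels, the between-group variance is estimated from very little information and this figure should be read with caution.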

In [None]:
%%R
(rr2 <- ranef(modelo))

$TP_ESCOLACAT
             (Intercept)
NaoRespondeu   -5.820972
Privada        16.073418
Publica       -10.252446

with conditional variances for “TP_ESCOLACAT” 


In [None]:
%%R -i df
# LMER model in R
modelo<-lmer('NU_NOTA_MT ~ Q002+NU_NOTA_CN+(1|TP_ESCOLACAT)', data=df)
print(summary(modelo))


Linear mixed model fit by REML. t-tests use Satterthwaite's method [
lmerModLmerTest]
Formula: "NU_NOTA_MT ~ Q002+NU_NOTA_CN+(1|TP_ESCOLACAT)"
   Data: df

REML criterion at convergence: 1302373

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-6.8200 -0.7036 -0.0185  0.6897  9.8830 

Random effects:
 Groups       Name        Variance Std.Dev.
 TP_ESCOLACAT (Intercept)  199     14.11   
 Residual                 6370     79.81   
Number of obs: 112300, groups:  TP_ESCOLACAT, 3

Fixed effects:
             Estimate Std. Error        df t value Pr(>|t|)    
(Intercept) 8.633e+01  8.433e+00 2.285e+00  10.236  0.00588 ** 
Q002B       1.261e+01  1.764e+00 1.123e+05   7.148 8.88e-13 ***
Q002C       2.326e+01  1.778e+00 1.123e+05  13.083  < 2e-16 ***
Q002D       2.928e+01  1.767e+00 1.123e+05  16.567  < 2e-16 ***
Q002E       3.850e+01  1.666e+00 1.123e+05  23.113  < 2e-16 ***
Q002F       5.617e+01  1.747e+00 1.123e+05  32.148  < 2e-16 ***
Q002G       5.998e+01  1.755e+00 1.123e+05

In [None]:
%%R -i df
# LMER model in R
modelo<-lmer('NU_NOTA_MT ~ NU_NOTA_CN+TP_SEXO+(1|TP_ESCOLACAT)', data=df)
print(summary(modelo))


Linear mixed model fit by REML. t-tests use Satterthwaite's method [
lmerModLmerTest]
Formula: "NU_NOTA_MT ~ NU_NOTA_CN+TP_SEXO+(1|TP_ESCOLACAT)"
   Data: df

REML criterion at convergence: 1304728

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-6.6389 -0.7032 -0.0172  0.6938  9.7850 

Random effects:
 Groups       Name        Variance Std.Dev.
 TP_ESCOLACAT (Intercept)  384.6   19.61   
 Residual                 6504.0   80.65   
Number of obs: 112300, groups:  TP_ESCOLACAT, 3

Fixed effects:
             Estimate Std. Error        df t value Pr(>|t|)    
(Intercept) 1.015e+02  1.143e+01 2.072e+00   8.883   0.0111 *  
NU_NOTA_CN  8.821e-01  3.096e-03 1.123e+05 284.872   <2e-16 ***
TP_SEXOM    1.961e+01  4.996e-01 1.123e+05  39.250   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Correlation of Fixed Effects:
           (Intr) NU_NOT
NU_NOTA_CN -0.134       
TP_SEXOM    0.000 -0.130
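
The fixed effects above can be turned into a population-level prediction by plugging in a natural sciences score and a sex indicator. This is a rough sketch, assuming the rounded coefficients from the table (101.5, 0.8821, 19.61); the function `predict_mt` and its `school_offset` parameter are hypothetical helpers, not something produced by lme4.

```python
# Fixed-effect estimates copied (rounded) from the summary of
# NU_NOTA_MT ~ NU_NOTA_CN + TP_SEXO + (1 | TP_ESCOLACAT)
INTERCEPT = 101.5
COEF_NOTA_CN = 0.8821
COEF_SEXO_M = 19.61


def predict_mt(nota_cn: float, sexo: str, school_offset: float = 0.0) -> float:
    """Predicted NU_NOTA_MT. `school_offset` is an optional random-intercept
    deviation for a school type; 0.0 gives the population-level prediction."""
    if sexo not in ("F", "M"):
        raise ValueError("sexo must be 'F' or 'M'")
    is_male = 1.0 if sexo == "M" else 0.0
    return INTERCEPT + COEF_NOTA_CN * nota_cn + COEF_SEXO_M * is_male + school_offset


pred_f = predict_mt(500, "F")
pred_m = predict_mt(500, "M")
print(f"F: {pred_f:.2f}  M: {pred_m:.2f}  difference: {pred_m - pred_f:.2f}")
```

For the same `NU_NOTA_CN`, the two predictions differ by exactly the `TP_SEXOM` coefficient, which is how that row of the table is meant to be read.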


In [None]:
%%R

(rr2 <- ranef(modelo))

$TP_ESCOLACAT
             (Intercept)
NaoRespondeu   -8.773191
Privada        22.455481
Publica       -13.682291

with conditional variances for “TP_ESCOLACAT” 


In [None]:
%%R -i df
# LMER model in R
modelo2<-lmer('NU_NOTA_MT ~ Q002+(1+Q002|TP_ESCOLACAT)', data=df)
print(summary(modelo2))


Linear mixed model fit by REML. t-tests use Satterthwaite's method [
lmerModLmerTest]
Formula: "NU_NOTA_MT ~ Q002+(1+Q002|TP_ESCOLACAT)"
   Data: df

REML criterion at convergence: 1358068

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-5.8206 -0.7599 -0.1252  0.6858  4.3653 

Random effects:
 Groups       Name        Variance Std.Dev. Corr                               
 TP_ESCOLACAT (Intercept)  1185.51  34.431                                     
              Q002B         162.64  12.753   0.87                              
              Q002C         135.16  11.626   0.71  0.97                        
              Q002D          84.12   9.172   0.07  0.56  0.76                  
              Q002E         132.59  11.515  -0.06  0.44  0.66  0.99            
              Q002F         406.09  20.152  -0.11  0.40  0.62  0.98  1.00      
              Q002G         496.89  22.291   0.03  0.52  0.73  1.00  1.00  0.99
              Q002H         717.79  26.792   0.93  0

In [None]:
%%R

(rr2 <- ranef(modelo2))

UsageError: Cell magic `%%R` not found.
