# ENEM - Data Cleaning and Feature Selection

In [25]:
#Import libraries
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score

In [2]:
#Import Dataset
dataEnem = pd.read_csv("microdados_enem_2019/DADOS/MICRODADOS_ENEM_2019.csv",
                       sep=';', encoding='ISO-8859-1')

In [3]:
dataEnem.columns.values

array(['NU_INSCRICAO', 'NU_ANO', 'CO_MUNICIPIO_RESIDENCIA',
       'NO_MUNICIPIO_RESIDENCIA', 'CO_UF_RESIDENCIA', 'SG_UF_RESIDENCIA',
       'NU_IDADE', 'TP_SEXO', 'TP_ESTADO_CIVIL', 'TP_COR_RACA',
       'TP_NACIONALIDADE', 'CO_MUNICIPIO_NASCIMENTO',
       'NO_MUNICIPIO_NASCIMENTO', 'CO_UF_NASCIMENTO', 'SG_UF_NASCIMENTO',
       'TP_ST_CONCLUSAO', 'TP_ANO_CONCLUIU', 'TP_ESCOLA', 'TP_ENSINO',
       'IN_TREINEIRO', 'CO_ESCOLA', 'CO_MUNICIPIO_ESC',
       'NO_MUNICIPIO_ESC', 'CO_UF_ESC', 'SG_UF_ESC',
       'TP_DEPENDENCIA_ADM_ESC', 'TP_LOCALIZACAO_ESC', 'TP_SIT_FUNC_ESC',
       'IN_BAIXA_VISAO', 'IN_CEGUEIRA', 'IN_SURDEZ',
       'IN_DEFICIENCIA_AUDITIVA', 'IN_SURDO_CEGUEIRA',
       'IN_DEFICIENCIA_FISICA', 'IN_DEFICIENCIA_MENTAL',
       'IN_DEFICIT_ATENCAO', 'IN_DISLEXIA', 'IN_DISCALCULIA',
       'IN_AUTISMO', 'IN_VISAO_MONOCULAR', 'IN_OUTRA_DEF', 'IN_GESTANTE',
       'IN_LACTANTE', 'IN_IDOSO', 'IN_ESTUDA_CLASSE_HOSPITALAR',
       'IN_SEM_RECURSO', 'IN_BRAILLE', 'IN_AMPLIADA_24

## The contextual biases

In [4]:
#Select the columns to keep
selectedColumns = ['NU_INSCRICAO', 'TP_NACIONALIDADE', 'TP_ST_CONCLUSAO', 'TP_ESCOLA',
       'TP_DEPENDENCIA_ADM_ESC', 'TP_SIT_FUNC_ESC',
       'TP_PRESENCA_CN', 'TP_PRESENCA_CH', 'TP_PRESENCA_LC',
       'TP_PRESENCA_MT', 'NU_NOTA_CN', 'NU_NOTA_CH', 'NU_NOTA_LC',
       'NU_NOTA_MT', 'NU_NOTA_REDACAO', 'Q001', 'Q002', 'Q003', 'Q004', 'Q005', 'Q006',
       'Q007', 'Q008', 'Q009', 'Q010', 'Q011', 'Q012', 'Q013', 'Q014',
       'Q015', 'Q016', 'Q017', 'Q018', 'Q019', 'Q020', 'Q021', 'Q022',
       'Q023', 'Q024', 'Q025']

#Keep only the selected columns (drops everything else)
filteredDataEnem = dataEnem.filter(items=selectedColumns)
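Since only a subset of the columns is kept, the same selection could also be applied while the file is being read, so the unused columns never reach memory. The following is a minimal sketch, assuming the `usecols` parameter of `pd.read_csv` suits this workflow. The in-memory CSV, its values, and the shortened `wanted_columns` list are hypothetical stand-ins, not the real microdata.

```python
import io
import pandas as pd

# Tiny in-memory stand-in for the ENEM file (made-up values).
sample_csv = io.StringIO(
    "NU_INSCRICAO;NU_ANO;TP_NACIONALIDADE;NU_NOTA_MT;Q001\n"
    "1;2019;1;512.3;A\n"
    "2;2019;3;640.0;C\n"
)

# Shortened version of selectedColumns, for illustration only.
wanted_columns = ['NU_INSCRICAO', 'TP_NACIONALIDADE', 'NU_NOTA_MT', 'Q001']

# usecols makes read_csv parse only the listed columns.
small_data = pd.read_csv(sample_csv, sep=';', usecols=wanted_columns)
print(small_data.columns.tolist())
```

One difference to keep in mind: `DataFrame.filter(items=...)` silently skips names that are absent from the frame, whereas `usecols` raises an error when a listed column is missing from the file.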

In [5]:
#Filter Brazilians only
brDataEnem = filteredDataEnem[filteredDataEnem['TP_NACIONALIDADE'] == 1]

In [6]:
#Filter only people who are in the last year of high school
brDataEnem = brDataEnem[brDataEnem['TP_ST_CONCLUSAO'] == 2]

In [7]:
#Filter only people from schools that are open
brDataEnem = brDataEnem[brDataEnem['TP_SIT_FUNC_ESC'] == 1]

In [8]:
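#Filter only people that did not miss any test (all four presence flags must be 1)
presenceColumns = ['TP_PRESENCA_CN', 'TP_PRESENCA_CH', 'TP_PRESENCA_LC', 'TP_PRESENCA_MT']
brDataEnem = brDataEnem[(brDataEnem[presenceColumns] == 1).all(axis=1)]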
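The four filters above are applied one after another, so it can be useful to see how many rows survive each step. Here is a minimal sketch of such a "funnel" on a hypothetical five-row frame with made-up values. The `steps` list, the labels, and the choice to check all four `TP_PRESENCA_*` flags in the presence step are assumptions of this sketch.

```python
import pandas as pd

# Hypothetical miniature of filteredDataEnem (made-up values).
toy = pd.DataFrame({
    'TP_NACIONALIDADE': [1, 1, 3, 1, 1],
    'TP_ST_CONCLUSAO':  [2, 2, 2, 1, 2],
    'TP_SIT_FUNC_ESC':  [1, 1, 1, 1, 2],
    'TP_PRESENCA_CN':   [1, 0, 1, 1, 1],
    'TP_PRESENCA_CH':   [1, 1, 1, 1, 1],
    'TP_PRESENCA_LC':   [1, 1, 1, 1, 1],
    'TP_PRESENCA_MT':   [1, 1, 1, 1, 1],
})
presence_columns = ['TP_PRESENCA_CN', 'TP_PRESENCA_CH',
                    'TP_PRESENCA_LC', 'TP_PRESENCA_MT']

# Each entry pairs a label with a boolean condition over the rows.
steps = [
    ('brazilian',   toy['TP_NACIONALIDADE'] == 1),
    ('last year',   toy['TP_ST_CONCLUSAO'] == 2),
    ('school open', toy['TP_SIT_FUNC_ESC'] == 1),
    ('present',     (toy[presence_columns] == 1).all(axis=1)),
]

# Accumulate the conditions and record how many rows remain after each one.
mask = pd.Series(True, index=toy.index)
rows_left = {}
for label, condition in steps:
    mask &= condition
    rows_left[label] = int(mask.sum())

cleaned = toy[mask]
print(rows_left)
```

A step that removes far more rows than expected is then easy to spot before the cleaned data is saved.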

In [None]:
brDataEnem.to_csv('brDataEnem.csv')
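Note that `to_csv` writes the DataFrame index by default, and after filtering that index has gaps. The sketch below, on a hypothetical two-row frame written to an in-memory buffer instead of a file, shows what comes back when the CSV is read again, with and without `index=False`.

```python
import io
import pandas as pd

# Hypothetical frame whose index has gaps, as happens after row filtering.
toy = pd.DataFrame({'NU_NOTA_MT': [512.3, 640.0]}, index=[3, 7])

# Default behaviour: the index is written as an unnamed first column.
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
columns_with_index = pd.read_csv(with_index).columns.tolist()

# index=False leaves the index out of the file.
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
columns_without_index = pd.read_csv(without_index).columns.tolist()

print(columns_with_index)
print(columns_without_index)
```

If the old row positions are not needed later, `index=False` keeps the extra column out of the next notebook. Otherwise `pd.read_csv(..., index_col=0)` restores it as the index.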

#### Questions

1. Since we have a well-defined context and purpose, it makes sense to choose by hand the features that matter most for predicting our result. Would it make more sense to use an automated method such as Stepwise Selection or LASSO regression?

2. Since our learning model was not fully successful, should we have used another classification model, or do you think the problem lies in the size of the training dataset? Another possibility that came to mind is the large number of features (Apr 25) together with the number of possible classes: success and failure seem very binary.

3. For the next and final phase we want to predict students' scores based on their socioeconomic data, which makes it a regression problem. How do you think we should approach it? Should we focus on the Gradient Descent model?
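To make questions 1 and 3 more concrete, here is a minimal sketch of LASSO regression used as an automated feature selector on one-hot-encoded questionnaire answers. Everything in it is an assumption for illustration: the data is synthetic, only three of the `Q0xx` columns are used, the score is generated so that only `Q006` matters, and the `alpha` value is arbitrary. It is not a result from the ENEM data.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n = 400

# Synthetic stand-in for questionnaire answers (letters), not real ENEM data.
toy = pd.DataFrame({
    'Q001': rng.choice(list('ABC'), size=n),
    'Q006': rng.choice(list('ABC'), size=n),
    'Q025': rng.choice(list('AB'), size=n),
})

# By construction only Q006 drives the score; Q001 and Q025 are pure noise.
effect = toy['Q006'].map({'A': 0.0, 'B': 40.0, 'C': 80.0})
score = 500.0 + effect + rng.normal(0.0, 5.0, size=n)

# One-hot encode the categorical answers so a linear model can use them.
X = pd.get_dummies(toy, dtype=float)

# alpha sets the strength of the L1 penalty (arbitrary value for this sketch).
model = Lasso(alpha=1.0)
model.fit(X, score)

# The L1 penalty drives some coefficients to exactly zero;
# the columns with non-zero coefficients are the ones the model kept.
coefficients = pd.Series(model.coef_, index=X.columns)
kept = coefficients[coefficients != 0].index.tolist()
print(kept)
```

On real data the penalty strength would need tuning, for example with `LassoCV`, which picks `alpha` by cross-validation. Regarding gradient descent: it is an optimisation method rather than a model in itself, and scikit-learn's `SGDRegressor` is one linear regressor that is fitted with stochastic gradient descent.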