## **Codenation 2020** - ENEM - Desafio 4 (Semana 9)

**Nome**: Camila Morais de Melo <br>
**E-mail**: camila_moraismelo@hotmail.com 

**Desafio:** Descubra quem fez a prova do ENEM apenas para treinar

**Pontos de Atenção:** Muitas universidades brasileiras utilizam o ENEM para selecionar seus futuros alunos e alunas. Isto é feito com uma média ponderada das notas das provas de matemática, ciências da natureza, linguagens e códigos, ciências humanas e redação, com os pesos abaixo:

* matemática: 3
*ciências da natureza: 2
*linguagens e códigos: 1.5
*ciências humanas: 1
*redação: 3

Alguns estudantes decidem realizar prova do ENEM de forma precoce, como um teste.

No arquivo test.csv crie um modelo para prever quem fez a prova apenas para treino (coluna **IN_TREINEIRO**) de quem participou do ENEM 2016.

Salve sua resposta em um arquivo chamado answer.csv com duas colunas: **NU_INSCRICAO** e **IN_TREINEIRO**.

Faça o upload do arquivo answer.csv usando o botão “Submeter resposta”.

## **Setup**

In [1]:
# Pandas nos permite trabajar con DataFrames
import pandas as pd
# Para la visualización de datos
import plotly.express as px
import json
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# model
from sklearn.impute import SimpleImputer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier,GradientBoostingClassifier ,VotingClassifier

# roc curve and auc score
from sklearn.metrics import roc_curve
from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import LabelEncoder

# show Colab Tables
from IPython.display import display
pd.set_option('display.max_columns', 100)
pd.options.display.max_columns = 100

#%matplotlib inline
#%load_ext google.colab.data_table

In [3]:
#!wget https://s3-us-west-1.amazonaws.com/codenation-challenges/enem-ps/testfiles.zip
#!unzip -o testfiles.zip

In [2]:
#!ls

## **Exploratório**

In [4]:
## Funções
def ler_arquivos(arquivo):
  df = pd.read_csv(arquivo)  
  print("Dimenções ",arquivo, "(filas, columnas)  : ", df.shape)
  return df

def remover_valores(df,column):
  df = df[(df[column].notnull())]
  print(df.shape)
  return df

def missing_values(df_train,df_test):
  percent_missing_train = df_train.isnull().sum() * 100 / len(df_train)
  percent_missing_test = df_test.isnull().sum() * 100 / len(df_test)
  missing_value_df = pd.DataFrame({'column_name': df_train.columns,
                                  'percent_missing_train': percent_missing_train,
                                  'percent_missing_test': percent_missing_test}).round(2)

  missing_value_df.sort_values('percent_missing_train', inplace=True)
  return missing_value_df

In [5]:
#Importa os dataframes e apresenta as dimenções 
df_training = ler_arquivos('train.csv')
df_test = ler_arquivos('test.csv')

('Dimen\xc3\xa7\xc3\xb5es ', 'train.csv', '(filas, columnas)  : ', (13730, 167))
('Dimen\xc3\xa7\xc3\xb5es ', 'test.csv', '(filas, columnas)  : ', (4570, 43))


In [6]:
#Manter apenas as colunas existentes no modelo de teste
#Neste caso fiz isso pois as dimenções eram diferentes
col = df_test.columns
#Acrescenta a variavel target
col = col.append(pd.Index(["IN_TREINEIRO"]))

df_training = df_training.filter(col)
print(df_training.shape)
print(df_test.shape)

(13730, 44)
(4570, 43)


In [7]:
cat_var = [key for key in dict(df_training.dtypes)
             if dict(df_training.dtypes)[key] in ['object'] ]

cat_var

['Q025',
 'Q024',
 'Q026',
 'Q047',
 'Q027',
 'Q006',
 'Q002',
 'SG_UF_RESIDENCIA',
 'Q001',
 'NU_INSCRICAO',
 'TP_SEXO']

In [118]:
#DataSet desalanceado
df_training['IN_TREINEIRO'].value_counts(normalize=True)

0    0.870138
1    0.129862
Name: IN_TREINEIRO, dtype: float64

In [11]:
ms = missing_values(df_training,df_test)
ms


invalid value encountered in rint



Unnamed: 0,column_name,percent_missing_test,percent_missing_train
CO_UF_RESIDENCIA,NU_INSCRICAO,0.0,0.0
Q024,NU_NOTA_CN,0.0,0.0
Q025,NU_NOTA_CH,0.0,0.0
Q026,NU_NOTA_LC,0.0,0.0
Q047,TP_STATUS_REDACAO,0.0,0.0
SG_UF_RESIDENCIA,NU_NOTA_COMP1,0.0,0.0
TP_ANO_CONCLUIU,NU_NOTA_COMP2,0.0,0.0
TP_COR_RACA,NU_NOTA_COMP3,0.0,0.0
TP_ESCOLA,NU_NOTA_REDACAO,0.0,0.0
TP_LINGUA,Q001,0.0,0.0


In [16]:
#Transforma features categoricas em numericas
lb_make = LabelEncoder()
df_training["Q001"] = lb_make.fit_transform(df_training["Q001"])
df_test["Q001"] = lb_make.transform(df_test["Q001"])

lb_make = LabelEncoder()
df_training["Q002"] = lb_make.fit_transform(df_training["Q002"])
df_test["Q002"] = lb_make.transform(df_test["Q002"])

lb_make = LabelEncoder()
df_training["Q006"] = lb_make.fit_transform(df_training["Q006"])
df_test["Q006"] = lb_make.transform(df_test["Q006"])

lb_make = LabelEncoder()
df_training["Q024"] = lb_make.fit_transform(df_training["Q024"])
df_test["Q024"] = lb_make.transform(df_test["Q024"])

lb_make = LabelEncoder()
df_training["Q025"] = lb_make.fit_transform(df_training["Q025"])
df_test["Q025"] = lb_make.transform(df_test["Q025"])

lb_make = LabelEncoder()
df_training["Q026"] = lb_make.fit_transform(df_training["Q026"])
df_test["Q026"] = lb_make.transform(df_test["Q026"])

lb_make = LabelEncoder()
df_training["Q047"] = lb_make.fit_transform(df_training["Q047"])
df_test["Q047"] = lb_make.transform(df_test["Q047"])

lb_make = LabelEncoder()
df_training["TP_SEXO"] = lb_make.fit_transform(df_training["TP_SEXO"])
df_test["TP_SEXO"] = lb_make.transform(df_test["TP_SEXO"])

lb_make = LabelEncoder()
df_training["SG_UF_RESIDENCIA"] = lb_make.fit_transform(df_training["SG_UF_RESIDENCIA"])
df_test["SG_UF_RESIDENCIA"] = lb_make.transform(df_test["SG_UF_RESIDENCIA"])

#lb_make = LabelEncoder()
#df_training["CO_PROVA_CN"] = lb_make.fit_transform(df_training["CO_PROVA_CN"])
#df_test["CO_PROVA_CN"] = lb_make.transform(df_test["CO_PROVA_CN"])

#lb_make = LabelEncoder()
#df_training["CO_PROVA_CH"] = lb_make.fit_transform(df_training["CO_PROVA_CH"])
#df_test["CO_PROVA_CH"] = lb_make.transform(df_test["CO_PROVA_CH"])

#lb_make = LabelEncoder()
#df_training["CO_PROVA_LC"] = lb_make.fit_transform(df_training["CO_PROVA_LC"])
#df_test["CO_PROVA_LC"] = lb_make.transform(df_test["CO_PROVA_LC"])

#b_make = LabelEncoder()
#df_training["CO_PROVA_MT"] = lb_make.fit_transform(df_training["CO_PROVA_MT"])
#df_test["CO_PROVA_MT"] = lb_make.transform(df_test["CO_PROVA_MT"])

In [17]:
#Cria valor NA para P:Com que idade você começou a exercer uma atividade remunerada?
df_training['Q027'] = df_training.Q027.fillna("NA")
df_test['Q027'] = df_test.Q027.fillna("NA")

lb_make = LabelEncoder()
df_training["Q027"] = lb_make.fit_transform(df_training["Q027"])
df_test["Q027"] = lb_make.transform(df_test["Q027"])

In [18]:
#Substitui o NA por um valor negativo
df_training['Q027'] = df_training.Q027.replace(to_replace=13,value=-1)
df_test['Q027'] = df_test.Q027.replace(to_replace=13,value=-1)

df_training.Q027.unique()

array([ 7, -1,  5,  4, 12,  2,  3,  1,  9,  0,  8,  6, 10, 11],
      dtype=int64)

In [19]:
cat_var = [key for key in dict(df_training.dtypes)
             if dict(df_training.dtypes)[key] in ['object'] ]

print(cat_var)

['NU_INSCRICAO']


## **Tratamento dos nulos e enriquecimento dos dados**

In [61]:
df_training.IN_TREINEIRO.value_counts(normalize=True)

0    0.870138
1    0.129862
Name: IN_TREINEIRO, dtype: float64

In [21]:
df_training['TP_ENSINO'] = df_training.TP_ENSINO.fillna(1)
df_test['TP_ENSINO'] = df_training.TP_ENSINO.fillna(1)

df_training.TP_ENSINO.value_counts(normalize=True)

1.0    0.971158
3.0    0.026948
2.0    0.001894
Name: TP_ENSINO, dtype: float64

In [23]:
#Notas NA são informadas como -1 pois não foram informadas e há notas atualmente como 0 para features aqui atualizadas
df_training['NU_NOTA_CH'] = df_training.TP_ENSINO.fillna(-1)
df_test['NU_NOTA_CH'] = df_training.TP_ENSINO.fillna(-1)

df_training['NU_NOTA_CN'] = df_training.TP_ENSINO.fillna(-1)
df_test['NU_NOTA_CN'] = df_training.TP_ENSINO.fillna(-1)

df_training['NU_NOTA_LC'] = df_training.TP_ENSINO.fillna(-1)
df_test['NU_NOTA_LC'] = df_training.TP_ENSINO.fillna(-1)

df_training['NU_NOTA_REDACAO'] = df_training.TP_ENSINO.fillna(-1)
df_test['NU_NOTA_REDACAO'] = df_training.TP_ENSINO.fillna(-1)

df_training['NU_NOTA_COMP5'] = df_training.TP_ENSINO.fillna(-1)
df_test['NU_NOTA_COMP5'] = df_training.TP_ENSINO.fillna(-1)

df_training['NU_NOTA_COMP4'] = df_training.TP_ENSINO.fillna(-1)
df_test['NU_NOTA_COMP4'] = df_training.TP_ENSINO.fillna(-1)

df_training['NU_NOTA_COMP3'] = df_training.TP_ENSINO.fillna(-1)
df_test['NU_NOTA_COMP3'] = df_training.TP_ENSINO.fillna(-1)

df_training['NU_NOTA_COMP2'] = df_training.TP_ENSINO.fillna(-1)
df_test['NU_NOTA_COMP2'] = df_training.TP_ENSINO.fillna(-1)

df_training['NU_NOTA_COMP1'] = df_training.TP_ENSINO.fillna(-1)
df_test['NU_NOTA_COMP1'] = df_training.TP_ENSINO.fillna(-1)

df_training['TP_STATUS_REDACAO'] = df_training.TP_ENSINO.fillna(-1)
df_test['TP_STATUS_REDACAO'] = df_training.TP_ENSINO.fillna(-1)

#Nota mínima determinada para não gerar no modelo nota negativa
#df_training['NU_NOTA_MT'] = df_training.NU_NOTA_MT.fillna(0)

In [26]:
# TP_DEPENDENCIA_ADM_ESC: Dependência administrativa (Escola)
# 1	Federal
# 2	Estadual
# 3	Municipal
# 4	Privada
# NOVO: 5 NA
#
# TP_ESCOLA: Tipo de escola do Ensino Médio
# 1	Não Respondeu
# 4	Exterior

#Considerar que os valores faltantes são de Escolas Estaduais
df_training['TP_DEPENDENCIA_ADM_ESC'] = df_training.TP_DEPENDENCIA_ADM_ESC.fillna(2)
df_test['TP_DEPENDENCIA_ADM_ESC'] = df_training.TP_DEPENDENCIA_ADM_ESC.fillna(2)

In [25]:
ms = missing_values(df_training,df_test)
ms

Unnamed: 0,column_name,percent_missing_test,percent_missing_train
CO_UF_RESIDENCIA,NU_INSCRICAO,0.0,0.0
Q024,NU_NOTA_CN,0.0,0.0
Q025,NU_NOTA_CH,0.0,0.0
Q026,NU_NOTA_LC,0.0,0.0
Q027,TP_LINGUA,0.0,0.0
Q047,TP_STATUS_REDACAO,0.0,0.0
SG_UF_RESIDENCIA,NU_NOTA_COMP1,0.0,0.0
TP_ANO_CONCLUIU,NU_NOTA_COMP2,0.0,0.0
TP_COR_RACA,NU_NOTA_COMP3,0.0,0.0
Q006,TP_PRESENCA_MT,0.0,0.0


In [0]:
#lb_make = LabelEncoder()
#df_training["NU_INSCRICAO"] = lb_make.fit_transform(df_training["NU_INSCRICAO"])
#df_test["NU_INSCRICAO"] = lb_make.fit_transform(df_test["NU_INSCRICAO"])

## **Modelo**

### **KNeighborsClassifier**

In [62]:
X = df_training
y = df_training['IN_TREINEIRO']

X = X.drop(['IN_TREINEIRO'],axis=1)
X = X.drop(['NU_INSCRICAO'],axis=1)

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

In [63]:
from sklearn.linear_model import LinearRegression
linreg = LinearRegression()
linreg.fit(X_train, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
         normalize=False)

In [84]:
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors = 1)

knn.fit(X_train,y_train)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=None, n_neighbors=1, p=2,
           weights='uniform')

In [85]:
y_pred = knn.predict(X_test)

In [86]:
from sklearn import metrics
from sklearn.metrics import mean_squared_error

print(np.sqrt(metrics.mean_squared_error(y_test, y_pred)))

0.21588530901209396


#### **Submission**

In [107]:
df_test.shape

(4570, 43)

In [108]:
df = df_test.drop(['NU_INSCRICAO',],axis=1)
pred = knn.predict(df)

In [109]:
sub= pd.Series(pred, index=df_test['NU_INSCRICAO'].astype(np.str), name='IN_TREINEIRO')
sub.shape

(4570L,)

In [110]:
sub.head()

NU_INSCRICAO
ba0cc30ba34e7a46764c09dfc38ed83d15828897    0
177f281c68fa032aedbd842a745da68490926cd2    0
6cf0d8b97597d7625cdedc7bdb6c0f052286c334    1
5c356d810fa57671402502cd0933e5601a2ebf1e    0
df47c07bd881c2db3f38c6048bf77c132ad0ceb3    0
Name: IN_TREINEIRO, dtype: int64

#### **Submission**


In [113]:
sub.to_csv("answer.csv", header=True)