<h3>Rebuild ENEM's answers</h3>

The ENEM math test has 45 single-choice questions, each with alternatives ranging from A to E. In this scenario the last five answers have been removed from the test dataset, so the task is to rebuild them from the final average result by creating a model that predicts the answers each student marked.

In [1]:
import pandas as pd
import numpy as np

from src.send_answer import send_answer
from src.features.build_features import estimate_math
from src.models.score import score, naive_approach

pd.set_option('display.max_columns', 500)

In [2]:
# input data
train = pd.read_csv('../../data/raw/train.csv', index_col=0).set_index('NU_INSCRICAO')
test = pd.read_csv('../../data/raw/test3.csv').set_index('NU_INSCRICAO')

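# quick data clean-up: replace the literal '.' marker with '*'
# (regex=False treats '.' as a plain character, not as "any character")
train.loc[:,'TX_RESPOSTAS_MT'] = train.loc[:,'TX_RESPOSTAS_MT'].str.replace('.', '*', regex=False)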
train = train.loc[train.TX_RESPOSTAS_MT.dropna(axis=0).index]

<h3>Strategy</h3><br>
Considering the five alternatives (A to E) plus the possibility of leaving a question blank (*), each answer has six possible values, so under a uniform distribution a student's last five answers would be guessed correctly with probability 1/6^5 = 1/7776.<br><br> The strategy for dealing with this diversity is to segment the data into subsets in which similar answers become more likely. The dataset is segmented not only by the type of test applied but also by math grade, which is recreated for the test set using the model defined in the previous challenge.
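
The 1/7776 figure can be checked with a few lines of plain Python. This is only a sketch of the arithmetic: the names `options` and `n_missing` are illustrative and do not come from the notebook.

```python
from itertools import product

options = "ABCDE*"   # five alternatives plus the blank marker
n_missing = 5        # number of answers removed from the test set

# Number of distinct answer sequences and the chance of guessing one uniformly
n_sequences = len(options) ** n_missing
uniform_probability = 1 / n_sequences

# Cross-check the closed form by enumerating every possible sequence
assert sum(1 for _ in product(options, repeat=n_missing)) == n_sequences

print(n_sequences, uniform_probability)
```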

In [3]:
# predict the grades on the test set using the Quantile Transformation
grade_prediction = estimate_math(train.drop('TX_RESPOSTAS_MT', axis=1), test.drop('TX_RESPOSTAS_MT', axis=1))
test.loc[list(grade_prediction.index), 'NU_NOTA_MT'] = grade_prediction.loc[:,'NU_NOTA_MT']

# remove the 0 scores from the training set
train = train.loc[train.NU_NOTA_MT != 0,:]

# keep the training columns in the same order as the test set, plus the answer key
train = train[list(test.columns)+['TX_GABARITO_MT']].copy()

# separate the datasets into quartiles based on the math grade
quartiles = 4
merged_grades = pd.qcut(pd.concat([train.NU_NOTA_MT, test.NU_NOTA_MT]), quartiles, labels = False)
train['MT_QT'] = merged_grades.loc[train.index].values
test['MT_QT'] = merged_grades.loc[test.index].values

predict_n = 5

train['PREDICTION'] = ''
test['PREDICTION'] = '' 
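
To illustrate the quartile step, here is a minimal sketch on toy data. The grades and index labels are made up for illustration. It shows why the train and test grades are pooled before calling `pd.qcut`: both sets then share the same bin edges.

```python
import pandas as pd

# Toy grades (hypothetical values), indexed by made-up registration ids
train_grades = pd.Series([350.0, 420.0, 510.0, 640.0], index=["a", "b", "c", "d"])
test_grades = pd.Series([380.0, 450.0, 580.0, 700.0], index=["e", "f", "g", "h"])

# Pool both sets so the quartile edges are computed once, over all students
pooled = pd.concat([train_grades, test_grades])
bins = pd.qcut(pooled, 4, labels=False)

# Split the bin labels back by index
train_bins = bins.loc[train_grades.index]
test_bins = bins.loc[test_grades.index]
print(train_bins.tolist(), test_bins.tolist())
```

Splitting by index afterwards assumes the registration ids are unique across both sets, as in the toy data here.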

<h3>Estimate</h3><br>
The idea behind the following function is to predict, for each test type and each performance quartile within it, the most common answer marked at each of the last positions.

In [4]:
def predict_answers(dataset, predict_n=5):
    
    df = dataset.copy()
    # iterate through each type of math test in the training set
    for code in df.CO_PROVA_MT.unique():
        
        # iterate through each quartile in this math test
        for quartile in df.loc[df.CO_PROVA_MT == code, 'MT_QT'].unique():
            
            enem_answer = ''
            # rows belonging to this test type and grade quartile
            mask = (df.CO_PROVA_MT == code) & (df.MT_QT == quartile)
            for _ in range(predict_n):

                # append the most common answer at the next of the last n positions
                enem_answer += df.loc[mask, 'TX_RESPOSTAS_MT'].str[-predict_n+len(enem_answer)].mode()[0]
            df.loc[mask, 'PREDICTION'] = enem_answer
    return df
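
The per-position mode at the core of `predict_answers` can be restated in plain Python for a single segment. This is a sketch on toy answer strings, not the notebook's implementation: `most_common_tail` is a hypothetical helper, and ties are broken by sort order to mirror what `mode()[0]` returns in pandas.

```python
from collections import Counter

def most_common_tail(answers, predict_n=5):
    """Build a string from the most frequent character at each of the last predict_n positions."""
    tail = []
    for offset in range(-predict_n, 0):
        counts = Counter(answer[offset] for answer in answers)
        # Highest count wins; ties go to the character that sorts first
        tail.append(min(counts, key=lambda ch: (-counts[ch], ch)))
    return "".join(tail)

# Toy segment: three students with the same test type and grade quartile
segment = ["ABCDEABCDE", "BBCDAABCDA", "CBCEEABEDE"]
print(most_common_tail(segment))
```

Each position is decided independently, so the predicted string need not match the answers of any single student in the segment.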

In [5]:
train_answers = predict_answers(train)
test_answers = predict_answers(test)
print('Naive approach accuracy: {:.2f}%'.format(naive_approach(train)*100))
print('Training set accuracy: {:.2f}%'.format(score(train_answers.TX_RESPOSTAS_MT.str[-predict_n:], train_answers.PREDICTION)*100))

Naive approach accuracy: 19.90%
Training set accuracy: 32.32%
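
The `score` function imported from `src.models.score` is not shown in this notebook. One plausible way to compare two sets of answer strings is position-by-position accuracy. The sketch below is an assumption about what such a metric could look like, not the project's actual implementation, and `position_accuracy` is a hypothetical name.

```python
def position_accuracy(actual, predicted):
    """Fraction of characters that match position-by-position across all string pairs."""
    matches = 0
    total = 0
    for a, p in zip(actual, predicted):
        matches += sum(x == y for x, y in zip(a, p))
        total += len(a)
    return matches / total

# Toy example: one mismatch among ten compared positions
actual = ["ABCDE", "AAAAA"]
predicted = ["ABCDA", "AAAAA"]
print('{:.2f}%'.format(position_accuracy(actual, predicted) * 100))
```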


In [6]:
# send answers
answer = test_answers.copy().loc[:,['PREDICTION']]
answer = answer.rename(index=str, columns={"PREDICTION": "TX_RESPOSTAS_MT"})
#send_answer(answer.reset_index(), 3)