<h3>Rebuild ENEM's answers<br></h3>

The ENEM math test consists of 45 single-choice questions with alternatives ranging from A to E. In this scenario the last five answers have been removed from the test dataset, so you will rebuild them from the final average result by creating a model that predicts the answers each student marked.

In [140]:
import pandas as pd
import numpy as np

from src.send_answer import send_answer
from src.models.markov import Markov
from src.models.score import score, naive_approach

pd.set_option('display.max_columns', 500)

Considering the five alternatives (A to E) plus the possibility of leaving a question blank (*), each question has six possible marks. Under a uniform distribution, any specific sequence for a student's last 5 answers therefore has a probability of 1/6^5 = 1/7776.<br><br>
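The 1/7776 figure can be checked with a few lines of arithmetic. The names `options` and `streak` below are used only for illustration:

```python
from fractions import Fraction

# Five alternatives (A to E) plus the blank mark (*)
options = "ABCDE*"
# Number of removed answers to rebuild
streak = 5

# Under a uniform distribution every question is an independent 1-in-6 draw
total_sequences = len(options) ** streak
uniform_probability = Fraction(1, total_sequences)

print(total_sequences)      # 7776
print(uniform_probability)  # 1/7776
```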

In [136]:
# input data
train = pd.read_csv('../../data/interim/train3.csv').set_index('NU_INSCRICAO')
validation = pd.read_csv('../../data/interim/validation3.csv').set_index('NU_INSCRICAO')
test = pd.read_csv('../../data/interim/test3.csv').set_index('NU_INSCRICAO')

### Strategy
The strategy for coping with the diversity of the dataset consists of segmenting the data into subsets in which similar answers are more likely. The dataset is segmented not only by the **type of test applied**, but also by **using the model defined in the previous challenge to recreate the math grades**. **Classifying those grades by quantile** then provides an additional segmentation category.
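A minimal sketch of the quantile step, assuming `pandas.qcut` is an acceptable way to bucket the grades. The toy values and the choice of four buckets are assumptions made for illustration; the interim CSV files may have been prepared differently:

```python
import pandas as pd

# Toy data: column names mirror the notebook, values are made up
df = pd.DataFrame({
    "CO_PROVA_MT": ["X", "X", "X", "X", "Y", "Y", "Y", "Y"],
    "NU_NOTA_MT": [350.0, 420.5, 510.0, 640.2, 380.1, 455.0, 530.7, 700.3],
})

# Replace each grade by its quantile bucket (0 = lowest grades)
n_quantiles = 4  # assumed number of buckets
df["NU_NOTA_MT"] = pd.qcut(df["NU_NOTA_MT"], q=n_quantiles, labels=False)

# Each (test type, grade bucket) pair identifies one segment
segments = df.groupby(["CO_PROVA_MT", "NU_NOTA_MT"]).size()
print(segments)
```

With more buckets each segment becomes more homogeneous but also smaller, so the number of quantiles trades specificity against the amount of data available per segment.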

In [None]:
# number of previous answers to take into account when predicting the next values
order = 3

# number of answers to predict per row
streak = 5

# target to predict
target = 'TX_RESPOSTAS_MT'

# fields to segment the dataset
id = ['CO_PROVA_MT', 'NU_NOTA_MT']

<h3>Estimate</h3><br>
The underlying idea of the following function is to predict the marked answers for each performance quantile and its corresponding test using a Markov chain.<br>

*Reference materials:*
<ol>
    <li><a href="https://www.youtube.com/watch?v=eGFJ8vugIWA">Coding Train - Markov Chains</a></li>
    <li><a href="http://setosa.io/ev/markov-chains/">Markov Chains Visually Explained</a></li>
</ol>

<br>In this case, the Markov chain is trained on the answers of the training dataset, with each state made up of the 3 answers that precede the answer being predicted. The prediction then takes the last three elements of the input to estimate the next 5 answers.
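The idea can be sketched with a small, self-contained chain. This is not the `Markov` class from `src.models.markov`, whose implementation is not shown here; `train_chain`, `predict_streak` and the `fallback` answer are hypothetical names, and the segmentation by test and quantile is left out for brevity:

```python
from collections import Counter, defaultdict


def train_chain(sequences, order):
    """Count which answer follows each window of `order` answers."""
    transitions = defaultdict(Counter)
    for seq in sequences:
        for i in range(len(seq) - order):
            state = seq[i:i + order]
            transitions[state][seq[i + order]] += 1
    return transitions


def predict_streak(transitions, seed, order, streak, fallback="A"):
    """Extend `seed` by `streak` answers, always taking the most frequent successor."""
    state, predicted = seed[-order:], ""
    for _ in range(streak):
        counts = transitions.get(state)
        # Unseen state: fall back to a fixed answer (an arbitrary choice)
        nxt = counts.most_common(1)[0][0] if counts else fallback
        predicted += nxt
        # Slide the window so the new answer becomes part of the state
        state = (state + nxt)[-order:]
    return predicted


chain = train_chain(["ABCABCABCABC", "ABCABDABCABC"], order=3)
print(predict_streak(chain, "CABCAB", order=3, streak=5))  # CABCA
```

Each predicted answer is fed back into the state, so an early mistake propagates through the remaining answers of the streak.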

In [137]:
model = Markov(order, streak, target, id)
model.train_chain(train, save=False)

In [138]:
predict = {
    'train': lambda df, id, target: model.predict(df[target][-(order+streak):-streak], tuple(df.loc[id].values)),
    'test': lambda df, id, target: model.predict(df[target][-order:], tuple(df.loc[id].values))
}


train['PREDICTION'] = train.apply(predict['train'], id=id, target=target, axis=1)

validation['PREDICTION'] = validation.apply(predict['train'], id=id, target=target, axis=1)

test['PREDICTION'] = test.apply(predict['test'], id=id, target=target, axis=1)

```order = 3```

The order of the Markov chain is crucial to the success of the model. A small order can make the estimator's guesses nearly random, while a very large order can produce a model that is strongly biased toward the sequences seen in training.
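One way to see why the order matters is to count how many distinct states the chain may have to learn. The sketch below assumes the six possible marks discussed earlier (A to E plus blank); `states_per_order` is an illustrative name:

```python
# Six possible marks per question: A to E plus the blank (*)
n_symbols = 6

# Number of distinct windows of `order` answers the chain may encounter
states_per_order = {order: n_symbols ** order for order in range(1, 6)}

for order, n_states in states_per_order.items():
    print(f"order={order}: {n_states} possible states")
```

As the order grows, each state is observed fewer times within a segment, so its transition counts reflect a handful of training rows rather than a general pattern.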

### Results

In [142]:
print('Training set accuracy: %.2f' % (score(train.TX_RESPOSTAS_MT.str[-streak:], train.PREDICTION)*100))
print('Validation set accuracy: %.2f' % (score(validation.TX_RESPOSTAS_MT.str[-streak:], validation.PREDICTION)*100))
print('Naive approach accuracy: %.2f' % (naive_approach(validation)*100))

Training set accuracy: 58.55
Validation set accuracy: 22.73
Naive approach accuracy: 20.06


Comparing the training and validation results, it is plausible to infer that the Markov chain is strongly conditioned on the training data, which indicates overfitting.
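The `score` helper is imported from `src.models.score` and its definition is not shown here. As an assumption, a per-question accuracy of the kind reported above could be computed as follows; `per_question_accuracy` is a hypothetical stand-in, not the project's function:

```python
def per_question_accuracy(expected, predicted):
    """Share of individual answers that match, position by position."""
    hits = total = 0
    for exp, pred in zip(expected, predicted):
        hits += sum(e == p for e, p in zip(exp, pred))
        total += len(exp)
    return hits / total


# Made-up answers for two students, five questions each
expected = ["ABCDE", "EEDCB"]
predicted = ["ABCAA", "EADCB"]
print("Accuracy: %.2f" % (per_question_accuracy(expected, predicted) * 100))  # Accuracy: 70.00
```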

**Send answers**

In [6]:
answer = test.copy().loc[:,['PREDICTION']]
answer = answer.rename(index=str, columns={"PREDICTION": "TX_RESPOSTAS_MT"})
#send_answer(answer.reset_index(), 3)