<h3>Rebuild ENEM's answers</h3>

The ENEM math test has 45 single-choice questions, each with alternatives ranging from A to E. In this scenario the last five answers have been removed from the test dataset, so you will rebuild them from the final average result by creating a model that predicts the answers each student marked.

In [2]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

from scipy.spatial.distance import cdist

from sklearn.cluster import KMeans
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

from src.send_answer import send_answer
from src.models.markov import Markov
from src.models.score import score, naive_approach

np.random.seed(42)
pd.set_option('display.max_columns', 500)

<h3>Strategy</h3><br>
Considering the available options (A to E) plus the possibility of leaving the question blank (*), each question has six possible outcomes, so a student's last 5 answers would have a 1/7776 probability (1/6^5) of being guessed correctly under a uniform distribution.<br><br>
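As a quick check of the 1/7776 figure, the arithmetic can be reproduced in a few lines. This is only an illustrative sketch; the variable names are not part of the project code.

```python
# Answer alphabet: five alternatives plus '*' for a blank question.
options = "ABCDE*"
n_missing = 5

# Number of distinct ways to fill the five missing answers.
n_sequences = len(options) ** n_missing

# Chance of matching all five answers with a uniform random guess.
p_uniform = 1 / n_sequences

print(n_sequences)         # 7776
print(f"{p_uniform:.6f}")  # 0.000129
```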

The strategy to overcome the dataset's diversity consists of segmenting the data into subsets in which similar answers are more likely. The subsets are defined not only by the test codes but also by applying `k_means` to the previously filled answers.

Additionally, a validation set will be taken from the train set to identify potential overfitting.
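The segmentation step described above might look roughly like the sketch below. It assumes the already-filled answers are one-hot encoded before clustering; the encoding, the toy data, the helper `one_hot_answers`, and the choice of two clusters are all illustrative assumptions rather than the notebook's actual preprocessing.

```python
import numpy as np
from sklearn.cluster import KMeans

OPTIONS = "ABCDE*"  # five alternatives plus '*' for a blank


def one_hot_answers(answers):
    """Encode equal-length answer strings as a one-hot matrix (6 columns per question)."""
    n_questions = len(answers[0])
    X = np.zeros((len(answers), n_questions * len(OPTIONS)))
    for row, answer in enumerate(answers):
        for pos, letter in enumerate(answer):
            X[row, pos * len(OPTIONS) + OPTIONS.index(letter)] = 1.0
    return X


# Toy data (hypothetical): the already-filled part of each answer string.
known_answers = ["AAAAAAAA", "AAAAAAAB", "AAAAAAAA",
                 "EEEEEEEE", "EEEEEEE*", "EEEEEEEE"]

X = one_hot_answers(known_answers)

# Cluster students by answer pattern; the label becomes the 'group' used for segmentation.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
groups = kmeans.fit_predict(X)
print(groups)
```

Students with similar answer patterns end up with the same label, so a model trained per label sees a narrower range of answers than one trained on the whole dataset.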

<h3>Estimate</h3><br>
The underlying idea of the following function is to predict the written answers for each segmented performance quartile, as well as for its corresponding test, using a Markov Chain.<br>

*Reference materials:*
<ol>
    <li><a href="https://www.youtube.com/watch?v=eGFJ8vugIWA">Coding Train - Markov Chains</a></li>
    <li><a href="http://setosa.io/ev/markov-chains/">Markov Chains Visually Explained</a></li>
</ol>

<br>Unlike the other cases, this Markov Chain predicts all the answers at once, which reduces the potential randomness in the results but likely increases overfitting.
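To make "predicting all the answers at once" concrete, here is a minimal sketch of a chain that maps a context of `order` answers directly to the most frequent block of `n_predictions` answers that followed it. The class `BlockMarkov` is hypothetical and written only for illustration; it is not the `Markov` class from `src.models.markov`, whose internals are not shown here.

```python
from collections import Counter, defaultdict


class BlockMarkov:
    """Maps a context of `order` answers to the most frequent following block."""

    def __init__(self, order, n_predictions):
        self.order = order
        self.n_predictions = n_predictions
        self.counts = defaultdict(Counter)

    def train(self, sequence):
        # `sequence` holds order + n_predictions answers: context, then target block.
        context = sequence[:self.order]
        block = sequence[self.order:self.order + self.n_predictions]
        self.counts[context][block] += 1

    def predict(self, context):
        # A context never seen in training raises KeyError, so callers can fall back.
        if context not in self.counts:
            raise KeyError(context)
        return self.counts[context].most_common(1)[0][0]


model = BlockMarkov(order=3, n_predictions=5)
for seq in ["ABCDEABC", "ABCDEABC", "ABCAAAAA", "EEEBBBBB"]:
    model.train(seq)

print(model.predict("ABC"))  # DEABC
print(model.predict("EEE"))  # BBBBB
```

Because the whole block is memorized per context, a block seen only a few times in training is reproduced verbatim, which is one way the overfitting mentioned above can arise.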


In [16]:
# set up the variables for the Markov Chains

# number of letters used as context by the Markov Chain (e.g. it reads the last three answers to predict the next ones)

order = 3
n_predictions = 5
shift = n_predictions + order

# iterate through all the math test codes
test_codes = prev_answers.code.unique()
for cod in test_codes:
    
    # iterate through the classified groups
    groups = prev_answers.group.unique()
    for group in groups:
       
        model = Markov(order, n_predictions)
        # train markov chain using each line
        
        train_set = train.loc[(train.CO_PROVA_MT == cod) & (train.group == group)]
        validation_set = validation.loc[(validation.CO_PROVA_MT == cod) & (validation.group == group)]
        test_set = test.loc[(test.CO_PROVA_MT == cod) & (test.group == group)]
        
        train_validation_test_set = pd.concat([
            train_set.loc[:,'TX_RESPOSTAS_MT'].str[-shift:-n_predictions], 
            validation_set.loc[:,'TX_RESPOSTAS_MT'].str[-order:],
            test_set.loc[:,'TX_RESPOSTAS_MT'].str[-order:]
        ])
        
        # assimilate all answers at once
        for i in train_set['TX_RESPOSTAS_MT'].str[-shift:]:
            model.train(i, multilple=False)
              
        for index, element in train_validation_test_set.items():
            # build answer from empty string
            enem_answer = ''
            try:    
                enem_answer += model.predict(element)
            except (KeyError, ValueError):
                for _ in range(n_predictions):
                    # In case it tries to make an unseen prediction, the result will be the mode on that position
                    enem_answer += train_set.loc[:, 'TX_RESPOSTAS_MT'].str[-n_predictions+len(enem_answer)].mode()[0]
            
            train_validation_test_set.loc[index] = enem_answer
        
        train.loc[train_set.index,'PREDICTION'] = train_validation_test_set.loc[train_set.index]
        validation.loc[validation_set.index,'PREDICTION'] = train_validation_test_set.loc[validation_set.index]
        test.loc[test_set.index,'PREDICTION'] = train_validation_test_set.loc[test_set.index]

print('Training set accuracy: %.2f' % (score(train.TX_RESPOSTAS_MT.str[-n_predictions:], train.PREDICTION)*100))
print('Validation set accuracy: %.2f' % (score(validation.TX_RESPOSTAS_MT.str[-n_predictions:], validation.PREDICTION)*100))

Training set accuracy: 50.29
Validation set accuracy: 25.55


In [1]:
from src.models.markov import Markov
import pandas as pd

t = Markov(streak=5)

# input data
train = pd.read_csv('../../data/interim/train3.csv').set_index('NU_INSCRICAO')

t.train_chain(train)