<h3>Rebuild ENEM's answers<br></h3>

The ENEM math test consists of 45 single-choice questions, each with alternatives ranging from A to E. In this scenario, the last five answers have been removed from the test dataset, so the goal is to rebuild them from the final average result by creating a model that predicts the answers each student marked.

In [1]:
import pandas as pd
import numpy as np

from src.send_answer import send_answer
from src.models.regression import predict
from src.models.score import score

np.random.seed(42)
pd.set_option('display.max_columns', 500)

In [2]:
# input data
train = pd.read_csv('../data/raw/train.csv', index_col=0).set_index('NU_INSCRICAO')
test = pd.read_csv('../data/raw/test3.csv').set_index('NU_INSCRICAO')

<h3>Strategy</h3><br>
Considering the five alternatives (A to E) plus the possibility of leaving a question blank (*), there are 6 possible marks per question, so a student's last 5 answers have a 1/7776 probability of being guessed correctly under a uniform distribution.<br><br> The strategy to cope with the diversity of the dataset consists of segmenting the data into subsets in which similar answers are more likely. The dataset is segmented not only by the type of test taken but also by math grade, using the model defined in the previous challenge to recreate the math grades for the test set.
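
The 1/7776 figure follows from six possible marks per question over five questions. A quick check of that arithmetic (the variable names are illustrative only):

```python
# six possible marks per question: A, B, C, D, E or blank (*)
options_per_question = 6
n_questions = 5

# number of distinct ways to fill in the last five answers
n_combinations = options_per_question ** n_questions

# chance of a uniform random guess matching all five answers
probability = 1 / n_combinations
print(n_combinations, probability)
```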

In [3]:
# quick data clean-up
# replace literal dots with '*' (regex=False so '.' is not treated as a wildcard)
train.loc[:, 'TX_RESPOSTAS_MT'] = train.loc[:, 'TX_RESPOSTAS_MT'].str.replace('.', '*', regex=False)
train = train.loc[train.TX_RESPOSTAS_MT.dropna(axis=0).index]

# predict the grades on the test set using the Quantile Transformation
grade_prediction = predict(train.drop('TX_RESPOSTAS_MT', axis=1), test.drop('TX_RESPOSTAS_MT', axis=1))
test.loc[list(grade_prediction.index), 'NU_NOTA_MT'] = grade_prediction.loc[:,'NU_NOTA_MT']

# remove the 0 scores from the training set
train = train.loc[train.NU_NOTA_MT != 0,:]

# keep only the columns shared with the test set, plus the answer key
train = train.copy()[list(test.columns)+['TX_GABARITO_MT']]

# separate the datasets into quartiles based on the math grade
quartiles = 4
merged_grades = pd.qcut(pd.concat([train.NU_NOTA_MT, test.NU_NOTA_MT]), quartiles, labels = False)
train['MT_QT'] = merged_grades.loc[train.index].values
test['MT_QT'] = merged_grades.loc[test.index].values


train['PREDICTION'] = ''
test['PREDICTION'] = '' 
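
The quartile step in the cell above computes the bin edges on the combined train and test grades, so both sets share the same bins. A minimal sketch on made-up grades (the values and index labels are assumptions for illustration, not ENEM data):

```python
import pandas as pd

# toy grades, invented for illustration only
train_grades = pd.Series([350.0, 420.5, 510.0, 640.2], index=['a', 'b', 'c', 'd'])
test_grades = pd.Series([390.0, 455.0, 580.3, 700.1], index=['e', 'f', 'g', 'h'])

# compute the quartile edges once, over both sets together
merged = pd.qcut(pd.concat([train_grades, test_grades]), 4, labels=False)

# split the quartile labels back by index
train_quartiles = merged.loc[train_grades.index]
test_quartiles = merged.loc[test_grades.index]
print(train_quartiles.tolist(), test_quartiles.tolist())
```

Binning each set separately would instead give each its own edges, so the same grade could land in different quartiles in train and test.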

<h3>Estimate</h3><br>
The idea behind the following code is to predict the marked answers separately for each performance quartile and each corresponding test code, using a Markov chain.<br>

*Reference materials:*
<ol>
    <li><a href="https://www.youtube.com/watch?v=eGFJ8vugIWA">Coding Train - Markov Chains</a></li>
    <li><a href="http://setosa.io/ev/markov-chains/">Markov Chains Visually Explained</a></li>
</ol>

<br>In this case, the Markov chain is trained on the answers in the training dataset, starting from the 3 answers that precede the first answer to predict. Prediction happens incrementally: the chain predicts the answers one by one, and each predicted answer becomes part of the context for the next.


In [8]:
class Markov:
    def __init__(self, order = 3):
        
        self.states = {}
        self.order = order
    
    def train(self, elements):
        # slide a window of `order` answers over the sequence and record
        # which answer follows each window
        for i in range(len(elements) - self.order):
            key = tuple(elements[i:i + self.order])
            self.states.setdefault(key, []).append(elements[i + self.order])
    
    def predict(self, elements):
        # sample the next answer among those observed after the last `order`
        # answers; raises KeyError if that context was never seen in training
        return np.random.choice(self.states[tuple(elements[-self.order:])])
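
To make the `states` dictionary concrete, here is a standalone sketch that mirrors the training loop of the `Markov` class on a toy answer string, keeping only contexts that have a successor. The `build_states` helper and the input string are hypothetical and exist only for this illustration.

```python
from collections import defaultdict

def build_states(sequence, order=3):
    """Map each window of `order` answers to the answers seen right after it."""
    states = defaultdict(list)
    for i in range(len(sequence) - order):
        key = tuple(sequence[i:i + order])
        states[key].append(sequence[i + order])
    return dict(states)

# toy answer string, invented for illustration only
states = build_states("ABCABCAD", order=3)
for context, successors in states.items():
    print(context, successors)
```

A context with several recorded successors, such as `('B', 'C', 'A')` here, is where the random choice in `predict` matters: each successor is drawn in proportion to how often it was observed.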

In [7]:
# set up the variables for the Markov Chains

# number of preceding answers used as context by the Markov Chain (e.g. read the last three answers to predict the next one)
order = 3 
n_predictions = 5
shift = n_predictions + order

# iterate through all the math test codes in the training set
test_codes = train.CO_PROVA_MT.unique()
for cod in test_codes:
    
    # iterate through the performance quartiles
    grade_quartiles = train.MT_QT.unique()
    for quartile in grade_quartiles:
       
        model = Markov(order)
        # train markov chain using each line
        
        train_set = train.loc[(train.CO_PROVA_MT == cod) & (train.MT_QT == quartile)]
        test_set = test.loc[(test.CO_PROVA_MT == cod) & (test.MT_QT == quartile)]
        
        train_test_set = pd.concat([
            train_set.loc[:,'TX_RESPOSTAS_MT'].str[-shift:-n_predictions], 
            test_set.loc[:,'TX_RESPOSTAS_MT'].str[-order:]
        ])
        
        for i in train_set['TX_RESPOSTAS_MT'].str[-shift:]:
            model.train(i)
            # attempt to enforce higher grades
            #for _ in range(int(quartile)):
            #    model.train(train.loc[i,'TX_GABARITO_MT'][-shift:])
              
        for index, element in train_test_set.items():
            # build answer from empty string
            enem_answer = ''
            for _ in range(n_predictions):
                try:
                    enem_answer += model.predict(element)
                except (KeyError, ValueError):
                    # if the context was never seen in training, fall back to the most frequent answer at that position
                    enem_answer += train_set.loc[:, 'TX_RESPOSTAS_MT'].str[-n_predictions+len(enem_answer)].mode()[0]
                element = element[-order+1:]+enem_answer[-1]
            train_test_set.loc[index] = enem_answer
        
        train.loc[train_set.index,'PREDICTION'] = train_test_set.loc[train_set.index]
        test.loc[test_set.index,'PREDICTION'] = train_test_set.loc[test_set.index]

print('Training set accuracy: {:.2f}%'.format(score(train.TX_RESPOSTAS_MT.str[-n_predictions:], train.PREDICTION)*100))

Training set accuracy: 23.42%
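
When the chain meets a context it never saw in training, the loop above falls back to the most frequent answer at the position being predicted. A minimal sketch of that fallback in isolation (the toy `answers` series is an assumption for illustration, not ENEM data):

```python
import pandas as pd

# toy answer strings, invented for illustration only
answers = pd.Series(["ABCDE", "ABCDA", "CBCDE"])
n_predictions = 5

fallback = ""
for _ in range(n_predictions):
    # negative index of the position being filled: -5, -4, ..., -1
    position = -n_predictions + len(fallback)
    # most frequent answer at that position across the toy set
    fallback += answers.str[position].mode()[0]

print(fallback)
```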


The order of the Markov chain is crucial to the success of the model. A small order gives the estimator too little context, so its guesses are close to random, while a very large order ties the model too closely to the exact sequences seen in training, causing overfitting.
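
One way to see this trade-off is to count how many observations back each context as the order grows. The sketch below uses synthetic, uniformly random answer strings (an assumption for illustration, not ENEM data) and a hypothetical `observations_per_context` helper. With a fixed amount of data, higher orders spread the observations over many more contexts.

```python
import random
from collections import Counter

rng = random.Random(0)  # fixed seed so the toy data is reproducible
alphabet = "ABCDE*"
# synthetic answer strings (NOT ENEM data): 200 sequences of 8 marks each
sequences = ["".join(rng.choice(alphabet) for _ in range(8)) for _ in range(200)]

def observations_per_context(sequences, order):
    """Average number of observed successors per distinct context."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - order):
            counts[seq[i:i + order]] += 1
    return sum(counts.values()) / len(counts)

avg_by_order = {order: observations_per_context(sequences, order) for order in range(1, 6)}
for order, avg in avg_by_order.items():
    print(f"order={order}: {avg:.1f} observations per context")
```

Once the average approaches one observation per context, the chain can do little more than replay individual training sequences.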

In [None]:
answer = test.copy().loc[:,['PREDICTION']]
answer = answer.rename(index=str, columns={"PREDICTION": "TX_RESPOSTAS_MT"})
#send_answer(answer.reset_index(), 3)