<h3>Rebuild ENEM's answers</h3>

Since some ENEM answers have been lost, you will rebuild them from the final average score by creating a model that predicts the answers each candidate marked.

In [1]:
import pandas as pd
import numpy as np
import sys

# the refactored modules are not installed as a package, so add their folders to the import path
sys.path.insert(0, '../src')
from send_answer import send_answer

sys.path.insert(0, '../src/models')
from regression import predict
from score import score

pd.set_option('display.max_columns', 500)

In [2]:
# input data
train = pd.read_csv('../data/raw/train.csv', index_col=0).set_index('NU_INSCRICAO')
test = pd.read_csv('../data/raw/test3.csv').set_index('NU_INSCRICAO')

# quick data clean-up
# match the dot literally; regex=False keeps this independent of the pandas default
train.loc[:,'TX_RESPOSTAS_MT'] = train.loc[:,'TX_RESPOSTAS_MT'].str.replace('.', '*', regex=False)
train = train.loc[train.TX_RESPOSTAS_MT.dropna(axis=0).index]

Based on the previous challenge, the strategy consists of recreating the math grades so that the dataset can be segmented by performance.
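As a minimal sketch of the segmentation step, the snippet below bins a combined set of grades into quartiles with `pd.qcut`. The grades and index labels are made up for illustration; only the pattern of concatenating both sets before binning mirrors the notebook.

```python
import pandas as pd

# Hypothetical grades; the index labels stand in for NU_INSCRICAO values.
train_grades = pd.Series([380.0, 455.5, 520.1, 610.9], index=['a', 'b', 'c', 'd'])
test_grades = pd.Series([402.3, 498.7, 575.0, 701.2], index=['e', 'f', 'g', 'h'])

# Bin both sets together so that a given quartile label covers the same
# grade range in the training set and in the test set.
combined = pd.concat([train_grades, test_grades])
quartile = pd.qcut(combined, 4, labels=False)

# Split the labels back by index.
train_quartile = quartile.loc[train_grades.index]
test_quartile = quartile.loc[test_grades.index]
print(train_quartile.tolist(), test_quartile.tolist())
```

Binning the two sets separately would give each its own cut points, so the same label could describe different grade ranges in each set.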

In [3]:
# predict the grades on the test set using the Quantile Transformation
grade_prediction = predict(train.drop('TX_RESPOSTAS_MT', axis=1), test.drop('TX_RESPOSTAS_MT', axis=1))
test.loc[list(grade_prediction.index), 'NU_NOTA_MT'] = grade_prediction.loc[:,'NU_NOTA_MT']

# remove the 0 scores from the training set
train = train.loc[train.NU_NOTA_MT != 0,:]

# keep only the test columns plus the answer key, in the same order as the test set
train = train[list(test.columns)+['TX_GABARITO_MT']].copy()

# separate the datasets into quartiles based on the math grade
quartiles = 4
merged_grades = pd.qcut(pd.concat([train.NU_NOTA_MT, test.NU_NOTA_MT]), quartiles, labels = False)
train['MT_QT'] = merged_grades.loc[train.index].values
test['MT_QT'] = merged_grades.loc[test.index].values

predict_n = 5

train['PREDICTION'] = ''
test['PREDICTION'] = '' 

<h3>Prediction strategy</h3><br>
The idea behind the following function is to predict, for each math test version and each performance quartile within it, the most common answer marked at each of the last questions.
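To make the idea concrete, here is a small sketch of the per-position mode for a single group. The answer strings are invented, and `predict_n = 3` is chosen only to keep the example short.

```python
import pandas as pd

# Hypothetical answer strings for one test version and one quartile.
answers = pd.Series(['ABCDE', 'ABCDA', 'CBCDE', 'ABEDE'])
predict_n = 3  # number of trailing questions to predict

# For each of the last predict_n positions, take the most frequent letter.
prediction = ''.join(
    answers.str[position].mode()[0]
    for position in range(-predict_n, 0)
)
print(prediction)
```

`Series.mode` returns its values sorted, so on a tie `[0]` picks the alphabetically first letter.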

In [4]:
def predict_answers(dataset, predict_n=5):
    
    df = dataset.copy()
    # iterate through each type of math test in the training set
    for code in df.CO_PROVA_MT.unique():
        
        # iterate through each quartile in this math test
        for quartile in df.loc[df.CO_PROVA_MT == code, 'MT_QT'].unique():
            
            enem_answer = ''
            mask = (df.CO_PROVA_MT == code) & (df.MT_QT == quartile)
            for position in range(-predict_n, 0):

                # append the most common answer at this position, counted from the end
                enem_answer += df.loc[mask, 'TX_RESPOSTAS_MT'].str[position].mode()[0]
            df.loc[mask, 'PREDICTION'] = enem_answer
    return df

def naive_approach(dataset, predict_n = 5):
    df = dataset.copy()
    for i in df.index:
        # guess predict_n answers uniformly at random
        df.loc[i,'PREDICTION'] = ''.join(np.random.choice(['A','B','C','D','E'], size=predict_n))
    print('Naive approach: %.2f' % (score(df.TX_RESPOSTAS_MT.str[-predict_n:], df.PREDICTION)*100))

Predictions

In [5]:
# compare the random baseline with the mode-based predictions

naive_approach(train)
train = predict_answers(train)
test = predict_answers(test)
print('Training set accuracy: %.2f' % (score(train.TX_RESPOSTAS_MT.str[-predict_n:], train.PREDICTION)*100))

Naive approach: 20.08
Training set accuracy: 32.30
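A baseline near 20% is what uniform guessing over five options should give, assuming `score` measures the share of questions answered correctly. The notebook does not show its implementation, so that is an assumption. The sketch below checks the expectation on simulated data, encoding the options A to E as integers.

```python
import numpy as np

rng = np.random.default_rng(0)
n_rows, n_questions, n_options = 20_000, 5, 5

# Encode the options A-E as the integers 0-4 (simulated data, not ENEM answers).
truth = rng.integers(0, n_options, size=(n_rows, n_questions))
guess = rng.integers(0, n_options, size=(n_rows, n_questions))

# Share of positions where the random guess matches the simulated truth.
accuracy = (truth == guess).mean()
print('Random baseline: %.2f' % (accuracy * 100))
```

Because the guesses are independent of the truth, the expected match rate stays at 1/5 regardless of how the true answers are distributed among the five options.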


In [6]:
answer = test.copy().loc[:,['PREDICTION']]
answer = answer.rename(index=str, columns={"PREDICTION": "TX_RESPOSTAS_MT"})
#send_answer(answer.reset_index(), 3)