My code is based on the code here: https://github.com/siyuanzhao/2016-EDM.

The ASSISTDataProvider differs from the usual DataProvider in a few ways:
 - You need to give it the directory containing the .npz files.
 - You can tell it which_year (09 or 15), since there are two different ASSISTments data sets, one from each year.
 - For each batch, it produces inputs, targets AND target_ids. The target_ids contain indices for extracting a predictions vector from the output of the RNN (exactly like the 2016-EDM code).
 - There is no .npz file for the validation set. Instead, we use k-fold cross-validation: first construct a DataProvider from the training data, then call its get_k_folds method, which returns k tuples of DataProviders: (train_dp, val_dp).
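
To make the role of target_ids concrete, here is a minimal NumPy sketch of the extraction step. It is not the provider's actual code: it assumes the RNN output is laid out as (student, time step, skill) and flattened in row-major order, and the sizes and the `answered` triples are made up purely for illustration.

```python
import numpy as np

# Hypothetical sizes, chosen only to keep the example small.
batch_size, num_steps, num_skills = 2, 3, 4

# Stand-in for the RNN output: one score per (student, time step, skill).
logits = np.arange(batch_size * num_steps * num_skills, dtype=float).reshape(
    batch_size, num_steps, num_skills)

# Made-up (student, step, skill) triples for which a target exists.
answered = [(0, 0, 2), (0, 1, 3), (1, 0, 1)]

# Assumed layout: flat index into the row-major flattened output.
target_ids = np.array(
    [s * num_steps * num_skills + t * num_skills + k for s, t, k in answered])

# One prediction per target, in the same order as the targets.
predictions = logits.reshape(-1)[target_ids]
print(predictions)
```

Because each index picks out exactly one entry of the flattened output, the predictions vector has the same length as targets, which is why the two can be compared directly in a loss.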

In [3]:
import numpy as np
import pandas as pd
import scipy.sparse as sp
from data_provider import ASSISTDataProvider
from sklearn.model_selection import KFold

In [4]:
data_dir = '/home/ben/mlp/mlp-group-project/data/assist09/'

In [30]:
# batch_size = 10, so 10 students in this batch
train_dp = ASSISTDataProvider(data_dir=data_dir, batch_size=10, max_num_batches=1, shuffle_order=True)

In [32]:
for inputs, targets, target_ids in train_dp:
    
    # inputs is a list of arrays. List has length equal to max number of questions any student has answered
    print(len(inputs)) 
    
    # shape of design matrix for the first question each student answered
    # (1, num_students, length of feature vector)
    print(inputs[0].shape)
    
    # each student has a sequence of correctness scores (1=correct, 0=incorrect) for the problems they answered
    # targets is just a flattened array containing all these scores.
    print(len(targets))
    
    # indices that we need after learning to extract a predictions vector. Should be same shape as targets.
    print(len(target_ids))

973
(1, 10, 293)
816
816


In [37]:
# example of how to use cross-validation
for i, (data_provider_train, data_provider_val) in enumerate(train_dp.get_k_folds(5), start=1):
    print('FOLD {}'.format(i))
    print('train data provider has {} students'.format(data_provider_train.inputs.shape[0]))
    print('val data provider has {} students'.format(data_provider_val.inputs.shape[0]))
    print('----------------')

FOLD 1
train data provider has 2483 students
val data provider has 621 students
----------------
FOLD 2
train data provider has 2483 students
val data provider has 621 students
----------------
FOLD 3
train data provider has 2483 students
val data provider has 621 students
----------------
FOLD 4
train data provider has 2483 students
val data provider has 621 students
----------------
FOLD 5
train data provider has 2484 students
val data provider has 620 students
----------------
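
The fold sizes above are what you get from splitting 3104 students (2483 + 621) into 5 nearly equal parts. The sketch below shows one way such a split could be done on student indices. It is an assumption about the general approach, not the actual body of get_k_folds, and `k_fold_indices` is a hypothetical helper name.

```python
import numpy as np

def k_fold_indices(num_students, k, seed=0):
    """Yield (train_idx, val_idx) index arrays for each of k folds."""
    rng = np.random.RandomState(seed)
    order = rng.permutation(num_students)  # shuffle students once
    folds = np.array_split(order, k)       # k nearly equal parts
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, val_idx

# 3104 is inferred from the printed fold sizes, not read from the data.
sizes = [(len(train), len(val)) for train, val in k_fold_indices(3104, 5)]
print(sizes)
# [(2483, 621), (2483, 621), (2483, 621), (2483, 621), (2484, 620)]
```

Each student lands in exactly one validation fold, so every student is used for validation once and for training k - 1 times.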
