This code is based on [this code](https://github.com/siyuanzhao/2016-EDM), which was used in the paper "[Going Deeper with Deep Knowledge Tracing](http://www.educationaldatamining.org/EDM2016/proceedings/paper_133.pdf)".

The ASSISTments data is collected from a web-based automated math tutoring system. Core features include the student id, the question id, and whether or not the student's answer was correct.

There is [a dataset from 2009](https://sites.google.com/site/assistmentsdata/home/assistment-2009-2010-data/skill-builder-data-2009-2010) and [a dataset from 2015](https://sites.google.com/site/assistmentsdata/home/2015-assistments-skill-builder-data).

The ASSISTDataProvider has a couple of modifications compared to the usual DataProvider.
 - You need to give it the path to the directory containing the .npz files
 - You can tell it which_year ('09' or '15')
 - For each batch, it produces inputs, targets AND target_ids.  
     - The target_ids contain indices for extracting a predictions vector from the output of the RNN (exactly like the 2016-EDM code).
 - There is no .npz file for the validation set. 
     - Instead, we use k-fold cross-validation: first construct a DataProvider from the training data, then call its get_k_folds method, which returns k tuples of DataProviders: (train_dp, val_dp)
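
To make the role of target_ids more concrete, here is a minimal NumPy sketch of index-based extraction. It assumes the RNN output is flattened in (student, step, question) order and that each id is a flat index into that array; this layout, the 0-based question ids and all names below are assumptions for illustration, not necessarily what ASSISTDataProvider does.

```python
import numpy as np

# Toy sizes (hypothetical): 2 students, 3 time steps, 4 questions
batch_size, num_steps, num_questions = 2, 3, 4
rng = np.random.default_rng(0)

# Stand-in for the RNN output: one logit per (student, step, question)
logits = rng.normal(size=(batch_size, num_steps, num_questions))

# (student, step, question) triples we want a prediction for
triples = [(0, 0, 2), (0, 1, 3), (1, 0, 1)]

# Flat index into the flattened logits, assuming row-major layout
target_ids = np.array(
    [(s * num_steps + t) * num_questions + q for s, t, q in triples]
)

# Pick out exactly one logit per answered question
selected = logits.reshape(-1)[target_ids]
print(selected.shape)
```

The selected vector has one entry per triple, so it can be compared element-wise with a targets vector of the same length.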

In [1]:
from data_provider import ASSISTDataProvider

In [2]:
# your path to directory containing data files
DATA_DIR = '/home/ben/mlp/mlp-group-project/data/assist09'

In [3]:
# batch_size is the number of students included in each training batch
TrainingProvider = ASSISTDataProvider(DATA_DIR, batch_size=10)

In [4]:
# iterate through the batches
for inputs, targets, target_ids in TrainingProvider:

    # inputs holds one entry per student in the batch,
    # so its length equals batch_size
    print(len(inputs))

    # each entry is one student's full answer sequence and has shape
    # (max_number_of_questions_answered, length_of_feature_vector)
    print(inputs[0].shape)

    # each student has a sequence of answer correctness labels
    # for the problems they answered, with 0=correct, 1=incorrect.
    # targets is a flattened array containing all these labels, so
    # its length is \sum_i num_questions_answered_by_student_i
    print(len(targets))

    # ids needed to extract a predictions vector from the RNN output.
    # Note: in the output below its length is 10 * 973 * 146
    # (batch_size * max_number_of_questions_answered * max_prob_set_id),
    # which is not the same as len(targets)
    print(len(target_ids))

10
(973, 293)
895
1420580
10
(973, 293)
321
1420580
10
(973, 293)
484
1420580
...

In [5]:
print(TrainingProvider.max_prob_set_id)

146


In [6]:
# example of how to use cross-validation
for i, (data_provider_train, data_provider_val) in enumerate(TrainingProvider.get_k_folds(5), start=1):
    print('FOLD {}'.format(i))
    print('train data provider has {} students'.format(data_provider_train.inputs.shape[0]))
    print('val data provider has {} students'.format(data_provider_val.inputs.shape[0]))
    print('----------------')

FOLD 1
train data provider has 2483 students
val data provider has 621 students
----------------
FOLD 2
train data provider has 2483 students
val data provider has 621 students
----------------
FOLD 3
train data provider has 2483 students
val data provider has 621 students
----------------
FOLD 4
train data provider has 2483 students
val data provider has 621 students
----------------
FOLD 5
train data provider has 2484 students
val data provider has 620 students
----------------
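
The fold sizes above are what you get from splitting 3104 students into 5 nearly equal parts. The sketch below reproduces them with plain NumPy index arrays. The helper k_fold_indices is hypothetical and only illustrates the idea; get_k_folds may be implemented differently (for example, it may shuffle the students first).

```python
import numpy as np

def k_fold_indices(num_students, k):
    """Yield (train_idx, val_idx) pairs of student indices.

    Hypothetical helper for illustration; not part of ASSISTDataProvider.
    """
    # array_split allows folds of unequal size when k does not divide evenly
    folds = np.array_split(np.arange(num_students), k)
    for i, val_idx in enumerate(folds):
        # training indices are all folds except the held-out one
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        yield train_idx, val_idx

fold_sizes = [(len(tr), len(va)) for tr, va in k_fold_indices(3104, 5)]
print(fold_sizes)
```

Each student appears in exactly one validation fold, so every training/validation pair is disjoint.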


### Notes on how data is represented:

In [7]:
# entire dataset `.inputs` is stored as a sparse matrix
train_set = TrainingProvider.inputs.todense()

# sparse matrix must be 2D, so has shape
# (num_students, ((2*max_question_id)+1) * max_number_of_questions_answered)
print(train_set.shape)

(3104, 285089)


- Each row of data is a student 
    - So, matrix has first dimension = num_students
- Conceptually, each row is a sequence of "answer labels" (incorrect/correct), one per question answered, and the length of that sequence is the maximum number of questions any student answered 
    - So, before encoding, the second dimension would be max_number_of_questions_answered 
    - Example: Student_A answered more questions than anyone else, answering 100 questions. Each row then holds 100 answer labels, and Student_B, who answered only 90 questions, has zeros in the last 10.
- However, each "answer label" is encoded as a one-hot vector in the following way:
    - there are max_question_id questions, e.g. 15 different questions
    - let the vector 'is_incorrect' be a one-hot vector with a 1 in the i^th position if a student got the question with id number i incorrect
    - let the vector 'is_correct' be a one-hot vector with a 1 in the i^th position if a student got the question with id number i correct
    - each "answer label" is then represented by the vector [is_incorrect, is_correct], which has length 2 $\times$ max_question_id
    - this vector is left-padded with a zero (I don't know why...)
    - So, each "answer label" is a one-hot vector of length (2 $\times$ max_question_id)+1
    - So, matrix has second dimension = ((2 $\times$ max_question_id)+1) $\times$ max_number_of_questions_answered
    - This agrees with the shapes printed above: (2 $\times$ 146)+1 = 293, and 293 $\times$ 973 = 285089
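
As a rough illustration of the encoding described above, here is a small NumPy sketch that builds one student's row. It assumes question ids run from 1 to max_question_id and that the answer labels are laid out one after another in time order within the row; both are assumptions, and encode_answer / encode_student are hypothetical helpers, not functions from data_provider.

```python
import numpy as np

def encode_answer(question_id, is_correct, max_question_id):
    """Encode one answer as [pad, is_incorrect, is_correct] (hypothetical helper)."""
    vec = np.zeros(2 * max_question_id + 1, dtype=np.int8)
    # index 0 is the padding zero; the correct half starts after the incorrect half
    offset = max_question_id if is_correct else 0
    vec[offset + question_id] = 1
    return vec

def encode_student(answers, max_question_id, max_num_answers):
    """Concatenate a student's encoded answers, zero-padding unused slots."""
    feature_len = 2 * max_question_id + 1
    row = np.zeros(max_num_answers * feature_len, dtype=np.int8)
    for t, (qid, correct) in enumerate(answers):
        row[t * feature_len:(t + 1) * feature_len] = encode_answer(
            qid, correct, max_question_id
        )
    return row

# Student answered question 3 correctly, then question 1 incorrectly
row = encode_student([(3, True), (1, False)], max_question_id=15, max_num_answers=100)
print(row.shape)

# Reshape back to (max_num_answers, feature_len) to recover the sequence
steps = row.reshape(100, 2 * 15 + 1)
print(np.flatnonzero(steps[0]), np.flatnonzero(steps[1]))
```

With 15 questions each answer label has length 31, so a row for up to 100 answers has 3100 columns, and all slots after the student's last answer stay zero.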