### Data Description
The skill builder dataset has 30 columns.
Please refer to this [link](https://sites.google.com/site/assistmentsdata/home/assistment-2009-2010-data) for a detailed description.

**The relevant columns are:**
- order_id: the id of the record; ids are assigned chronologically.
- user_id: the id of the student doing the problem.
- problem_id: the id of the problem.
- correct: 1 means correct on the first attempt; 0 means incorrect on the first attempt, or that the student asked for help.

**The following columns are useful but may not be used for DKT:**
- skill_id: the skill associated with the problem. 
- **original: 1 means main problem, 0 means scaffolding problem**
    - It is needed to decide whether to include scaffolding problems.
- ms_first_response: the time in milliseconds for the student's first response.
- hint_count: number of hints the student requested on this problem.
- attempt_count: number of student attempts on this problem.
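
As a minimal sketch of how the `original` flag could be used, the snippet below keeps only main problems from a small, made-up frame that uses the column names listed above. The toy values and the name `main_only` are illustrative assumptions and are not part of the dataset or of the processing code in this notebook.

```python
import pandas as pd

# Toy records using the column names described above (values are made up).
toy = pd.DataFrame({
    'user_id':    [1, 1, 2, 2],
    'problem_id': [10, 11, 10, 12],
    'original':   [1, 0, 1, 1],
    'correct':    [1, 0, 0, 1],
})

# Keep main problems only; scaffolding problems have original == 0.
main_only = toy[toy['original'] == 1]
print(main_only[['user_id', 'problem_id', 'correct']])
```

Whether scaffolding problems should be dropped depends on the experiment, so the filter is shown here only as an option.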

---
The following code uses NumPy and pandas to convert the **2009-2010 ASSISTment Data** into a TensorFlow-friendly data file.

In [2]:
import numpy as np
import pandas as pd
import logging
import csv

LOGGER = logging.getLogger(__name__)
file_path = './data/skill_builder_data.csv'

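# The file is not UTF-8 encoded, so the encoding must be given explicitly.
# pd.DataFrame.from_csv has been removed from pandas; pd.read_csv with
# index_col=0 keeps the first column as the index, as from_csv did.
data = pd.read_csv(file_path, encoding='ISO-8859-1', index_col=0)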

In [3]:
num_users = len(data.user_id.unique())
num_problems = len(data.problem_id.unique())
num_records = data.shape[0]
msg = "In this dataset, there are {0} records, with {1} students and {2} \
different questions."
print(msg.format(num_records, num_users, num_problems))
print("With the following columns: \n", data.columns)

In this dataset, there are 525534 records, with 4217 students and 26688 different questions.
With the following columns: 
 Index(['assignment_id', 'user_id', 'assistment_id', 'problem_id', 'original',
       'correct', 'attempt_count', 'ms_first_response', 'tutor_mode',
       'answer_type', 'sequence_id', 'student_class_id', 'position', 'type',
       'base_sequence_id', 'skill_id', 'skill_name', 'teacher_id', 'school_id',
       'hint_count', 'hint_total', 'overlap_time', 'template_id', 'answer_id',
       'answer_text', 'first_action', 'bottom_hint', 'opportunity',
       'opportunity_original'],
      dtype='object')


### Processing the data
1. Filter out students with exactly one interaction.
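
The cells below do not show this filter explicitly, so here is one way it might be done, assuming a frame with a `user_id` column. The helper name `drop_single_interaction_users` and the toy data are hypothetical.

```python
import pandas as pd

def drop_single_interaction_users(df, user_col='user_id'):
    """Return only the rows of users who have more than one record."""
    # Number of records per user, aligned back onto every row.
    per_row_counts = df[user_col].map(df[user_col].value_counts())
    return df[per_row_counts > 1]

# Made-up example: user 2 has a single interaction and is dropped.
toy = pd.DataFrame({'user_id': [1, 1, 2, 3, 3, 3],
                    'correct': [1, 0, 1, 0, 1, 1]})
filtered = drop_single_interaction_users(toy)
print(filtered.user_id.unique())
```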

In [4]:
def generate_id_to_idx_dict(df, column):
    ids = df[column].unique()
    num_unique_ids = len(ids)
    id_to_idx_dict = dict(zip(ids, range(num_unique_ids)))
    return id_to_idx_dict
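
To make the mapping concrete: the helper above gives each distinct id a dense index (0, 1, 2, ...) in order of first appearance. The sketch below reproduces that numbering on made-up problem ids with `pd.factorize`, which is a swapped-in pandas alternative and not what this notebook uses.

```python
import pandas as pd

# Made-up problem ids with repeats.
toy = pd.DataFrame({'problem_id': [507, 112, 507, 98, 112]})

# factorize numbers distinct values in order of first appearance.
codes, uniques = pd.factorize(toy['problem_id'])
id_to_idx = {pid: idx for idx, pid in enumerate(uniques)}

print(codes.tolist())
print(id_to_idx)
```

Dense indices of this kind are convenient when the ids are later used to index into an embedding or one-hot vector.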

In [9]:
REQUIRE_COLS = ['time_idx', 'user_id', 'problem_id', 'correct']

# get the time index
data['time_idx'] = data.index.values
data.head()

# drop the records whose skill_id is missing (NaN)
nan_records = data.skill_id.isna()
data = data[~nan_records]
print("The data shape after remove nan:", data.shape)

# remove duplicated records
columns = set(data.columns.values)
columns.remove('opportunity')
columns.remove('opportunity_original')
columns = list(columns)
data = data[~data.duplicated(subset=columns)]
print("The data shape after remove duplicated records:", data.shape)

The data shape after remove nan: (459208, 30)
The data shape after remove duplicated records: (338001, 30)


In [12]:
user_ids = data.user_id.unique()
problem_to_idx_dict = generate_id_to_idx_dict(data, column='problem_id')

tuples = []
for uid in user_ids:  # 'uid' avoids shadowing the built-in id()
    # all interactions of this student, in their original row order
    df = data.loc[data.user_id == uid, REQUIRE_COLS]
    problems = [problem_to_idx_dict[pid] for pid in df.problem_id]
    corrects = df.correct.tolist()
    num_problems = len(problems)
#     print (num_problems)
#     print (problems)
#     print (corrects)
#     print ("============")
    result = (num_problems, problems, corrects)
    tuples.append(result)
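
The loop above filters the whole frame once per student. As a hedged alternative, the sketch below builds the same kind of `(num_problems, problems, corrects)` tuples in a single `groupby` pass over a made-up frame. The names `toy` and `toy_tuples` are illustrative, and the approach has not been run against the real dataset.

```python
import pandas as pd

# Made-up interactions for two students.
toy = pd.DataFrame({
    'user_id':    [7, 7, 9, 7, 9],
    'problem_id': [507, 112, 507, 98, 112],
    'correct':    [1, 0, 1, 1, 0],
})

# Dense problem indices in order of first appearance.
problem_to_idx = {pid: i for i, pid in enumerate(toy.problem_id.unique())}
toy['problem_idx'] = toy.problem_id.map(problem_to_idx)

# sort=False keeps students in order of first appearance, like unique().
toy_tuples = [
    (len(group), group.problem_idx.tolist(), group.correct.tolist())
    for _, group in toy.groupby('user_id', sort=False)
]
print(toy_tuples)
```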

In [7]:
# newline='' lets the csv module control line endings on every platform
with open('data/temp.csv', 'w', newline='') as f:
    writer = csv.writer(f, 
                        delimiter=',', 
                        quotechar="'", 
                        quoting=csv.QUOTE_MINIMAL,
                        lineterminator='\n')
    for tup in tuples:
        writer.writerow([tup[0]])
        writer.writerow(tup[1])
        writer.writerow(tup[2])

In [14]:
len(tuples)

4217
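
The file written above holds three lines per student: the number of interactions, the problem indices, and the correctness flags. A minimal sketch of reading that layout back is shown below. The function name `read_triples` is hypothetical, the sample content is made up, and the parser assumes every value is an integer.

```python
import csv
import os
import tempfile

def read_triples(path):
    """Parse a file of 3-line records: count, problem indices, correctness."""
    with open(path, newline='') as f:
        rows = list(csv.reader(f, delimiter=',', quotechar="'"))
    records = []
    for i in range(0, len(rows) - 2, 3):
        n = int(rows[i][0])
        problems = [int(x) for x in rows[i + 1]]
        corrects = [int(x) for x in rows[i + 2]]
        # The count line should agree with both sequences.
        assert n == len(problems) == len(corrects)
        records.append((n, problems, corrects))
    return records

# Round-trip a tiny made-up example through a temporary file.
sample = "3\n0,1,2\n1,0,1\n2\n0,1\n1,0\n"
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'sample.csv')
    with open(path, 'w', newline='') as f:
        f.write(sample)
    records = read_triples(path)
print(records)
```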