Goal:
Classify a task for a given student as (1) too easy, (2) just right, or (3) too difficult.

Data:
- Robotanist: order and times of solved tasks
- proxy for difficulty levels:
    - too easy = less than 1 minute
    - too difficult = more than 20 minutes
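The time-based proxy above can be sketched as a small labelling function. This is a minimal sketch, assuming solving times are recorded in seconds; the name `label_difficulty` and the label strings are illustrative and not part of the dataset.

```python
import numpy as np
import pandas as pd

# Thresholds from the proxy above, assuming times are in seconds.
TOO_EASY_MAX = 60             # less than 1 minute
TOO_DIFFICULT_MIN = 20 * 60   # more than 20 minutes


def label_difficulty(seconds):
    """Map solving times (seconds) to difficulty labels; NaN stays NaN."""
    seconds = pd.Series(seconds, dtype=float)
    labels = pd.Series('just right', index=seconds.index, dtype=object)
    labels[seconds < TOO_EASY_MAX] = 'too easy'
    labels[seconds > TOO_DIFFICULT_MIN] = 'too difficult'
    # Unsolved tasks (missing times) get no label.
    labels[seconds.isna()] = np.nan
    return labels


example = label_difficulty([25, 60, 600, 1201, np.nan])
print(example)
```

Since the rest of the notebook works with log times, the same comparison would there be made against `np.log(60)` and `np.log(1200)`.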

Ideally, the same methods should be usable in a scenario where we have explicit user information (qualitative data) about perceived difficulty (obtained via the flow-level question).

Usage:
- soft recommendation by coloring each task (too easy, just right, too difficult)
- soft recommendation via coarse time predictions (1m, 10m, >20m)
- set challenge level (1m or 10m)
- hard recommendation: hide/lock too easy/difficult tasks

In [60]:
%matplotlib inline
from collections import OrderedDict
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Data

In [114]:
ordering = pd.read_csv('../data/robotanik/user_time_ordering.csv').set_index('Login')
# Make task IDs integers (from string labels).
ordering.columns = ordering.columns.astype(int)
ordering.head()

Login,635,636,637,638,639,640,641,642,643,644,...,1402,1403,1404,1405,1406,1407,1704,1705,1706,1707
U1,1.0,2.0,13.0,5.0,,9.0,,,,10.0,...,,,,,,,,,,
U2,1.0,2.0,4.0,3.0,5.0,6.0,9.0,10.0,8.0,11.0,...,,,,,,,58.0,59.0,60.0,61.0
U4,1.0,2.0,,,,,,,,,...,,,,,,,,,,
U5,1.0,2.0,3.0,4.0,16.0,5.0,15.0,,11.0,13.0,...,,,,,,,,,,
U6,1.0,2.0,3.0,7.0,,8.0,,,,,...,,,,,,,,,,


In [115]:
times = pd.read_csv('../data/robotanik/user_time.csv').set_index('Login')
# Make task IDs integers (from string labels).
times.columns = times.columns.astype(int)
# We will work with log times; TODO: explain why
times = np.log(times)
times.head()

Login,635,636,637,638,639,640,641,642,643,644,...,1402,1403,1404,1405,1406,1407,1704,1705,1706,1707
U1,3.218876,2.564949,2.639057,3.637586,,6.12905,,,,4.882802,...,,,,,,,,,,
U2,2.302585,1.609438,3.89182,2.944439,4.521789,3.135494,4.043051,3.688879,3.73767,5.351858,...,,,,,,,5.220356,4.727388,3.583519,4.290459
U4,3.258097,1.791759,,,,,,,,,...,,,,,,,,,,
U5,3.951244,3.663562,3.218876,2.944439,5.897154,3.218876,4.59512,,4.025352,3.828641,...,,,,,,,,,,
U6,4.219508,2.995732,3.465736,2.890372,,3.73767,,,,,...,,,,,,,,,,
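The reason for the log transform is left as a TODO above. One common motivation, stated here as an assumption rather than as the author's rationale, is that solving times tend to be strongly right-skewed and that taking logs makes them more symmetric. A minimal sketch on synthetic data (not the Robotanist data) shows how this could be checked:

```python
import numpy as np
import pandas as pd

# Synthetic, log-normally distributed "times" for illustration only.
rng = np.random.default_rng(0)
raw_times = pd.Series(rng.lognormal(mean=4.0, sigma=1.0, size=5000))
log_times = np.log(raw_times)

# Sample skewness: large positive values indicate a long right tail.
raw_skew = raw_times.skew()
log_skew = log_times.skew()
print(f'skewness of raw times: {raw_skew:.2f}')
print(f'skewness of log times: {log_skew:.2f}')
```

The same two `skew()` calls could be run on columns of the raw `user_time.csv` frame before the transform to see whether the assumption holds for this dataset.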


The dataset contains about 10.6K users and 78 tasks.

In [116]:
times.shape

(10605, 78)

TODO: split data randomly into train/validation/test sets

In [117]:
train_times = times.head(10)
train_order = ordering.head(10)
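The cell above takes the first 10 users as a stand-in for a training set. The random split from the TODO could be sketched as follows; the 60/20/20 proportions, the seed, and the helper name `split_users` are assumptions, not choices made in this notebook.

```python
import numpy as np
import pandas as pd


def split_users(index, fractions=(0.6, 0.2, 0.2), seed=0):
    """Randomly partition user IDs into train/validation/test index sets."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(np.asarray(index))
    n_train = int(fractions[0] * len(shuffled))
    n_valid = int(fractions[1] * len(shuffled))
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    # Everything left over goes to the test set.
    test = shuffled[n_train + n_valid:]
    return pd.Index(train), pd.Index(valid), pd.Index(test)


# Toy index standing in for times.index (hypothetical logins).
users = pd.Index([f'U{i}' for i in range(1, 11)], name='Login')
train_ids, valid_ids, test_ids = split_users(users)
print(len(train_ids), len(valid_ids), len(test_ids))
```

With the real data this would become `train_times = times.loc[train_ids]` and `train_order = ordering.loc[train_ids]`, so that both frames cover the same users. Splitting by user rather than by individual event keeps all of one student's history in a single set.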

# Exploratory Analysis

# Features

Start with 2 features:
1. how many tasks the student has already solved, as a fraction of all tasks (student_solved_percentage)
2. how many students have solved the task, as a fraction of all students (task_solved_percentage)

Feature ideas:
- student-related (skill):
    - number of solved tasks
    - number of solved tasks in easy/flow/difficult mode
    - the most difficult solved task [in easy/flow/difficult mode]
    - which tasks solved [in which mode] / log-times / quartiles
- task-related (difficulty)
    - how many students have solved the task (percentage)
    - mean/median order
    - median log-time
    - percentage of easy/flow/difficult attempts
- content based
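Some of the task-related ideas above can be computed directly from a wide user-by-task frame such as `ordering`. A minimal sketch on a toy frame (the values are made up for illustration, and the column names `order_mean` and `order_median` are hypothetical):

```python
import numpy as np
import pandas as pd

# Toy ordering frame: rows are students, columns are task IDs,
# values are the position at which the student solved the task.
toy_order = pd.DataFrame(
    {635: [1, 1, 1], 636: [2, 2, np.nan], 637: [3, np.nan, np.nan]},
    index=pd.Index(['U1', 'U2', 'U3'], name='Login'))

toy_task_features = pd.DataFrame({
    # Share of students who solved the task (NaN means not solved).
    'solved_percentage': toy_order.count() / len(toy_order),
    # Typical position of the task in students' solving sequences.
    'order_mean': toy_order.mean(),
    'order_median': toy_order.median(),
})
print(toy_task_features)
```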

In [118]:
train_order

Login,635,636,637,638,639,640,641,642,643,644,...,1402,1403,1404,1405,1406,1407,1704,1705,1706,1707
U1,1.0,2.0,13.0,5.0,,9.0,,,,10.0,...,,,,,,,,,,
U2,1.0,2.0,4.0,3.0,5.0,6.0,9.0,10.0,8.0,11.0,...,,,,,,,58.0,59.0,60.0,61.0
U4,1.0,2.0,,,,,,,,,...,,,,,,,,,,
U5,1.0,2.0,3.0,4.0,16.0,5.0,15.0,,11.0,13.0,...,,,,,,,,,,
U6,1.0,2.0,3.0,7.0,,8.0,,,,,...,,,,,,,,,,
U7,1.0,2.0,,,4.0,,3.0,,,,...,,,,,,,,,,
U8,1.0,2.0,5.0,6.0,16.0,4.0,23.0,,11.0,12.0,...,,35.0,,,,,31.0,33.0,34.0,32.0
U9,1.0,2.0,5.0,3.0,13.0,4.0,7.0,9.0,8.0,10.0,...,,,,,,,,,,
U10,1.0,2.0,4.0,5.0,,3.0,,,,,...,,,,,,,,,,
U11,1.0,2.0,4.0,3.0,9.0,5.0,8.0,19.0,7.0,10.0,...,72.0,70.0,69.0,73.0,71.0,,74.0,75.0,76.0,77.0


In [119]:
task_features = pd.DataFrame({
  'time_median': train_times.median(),
  'solved_percentage': train_times.count() / len(train_times)
})
task_features.head()

,solved_percentage,time_median
635,1.0,3.329647
636,1.0,2.381087
637,0.8,3.310037
638,0.8,3.138322
639,0.6,4.516324


In [120]:
# Create a dataframe of events (student-task interactions).
task_count = train_order.shape[1]
data = []
for user_id in train_order.index:
    # Tasks solved by this student, in the order they were solved.
    order = train_order.loc[user_id].dropna().sort_values()
    for task_id, student_order in order.items():
        event = OrderedDict(
            # Strip the 'U' prefix from the login, e.g. 'U11' -> 11.
            student_id=int(user_id[1:]),
            student_order=int(student_order),
            # Fraction of all tasks solved before this one.
            student_solved_percentage=(student_order - 1) / task_count,
            task_id=int(task_id),
            time=train_times.loc[user_id, task_id])
        data.append(event)


events = pd.DataFrame(data)
events.head()

,student_id,student_order,student_solved_percentage,task_id,time
0,1,1,0.0,635,3.218876
1,1,2,0.012821,636,2.564949
2,1,3,0.025641,656,4.59512
3,1,4,0.038462,698,3.850148
4,1,5,0.051282,638,3.637586
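The nested loop above is easy to follow but may be slow on the full dataset of about 10K users. A possible vectorized alternative, sketched here on toy data, reshapes the wide frames with `melt`; the helper name `to_long` is hypothetical and the toy values are made up.

```python
import numpy as np
import pandas as pd


def to_long(wide, value_name):
    """Reshape a user-by-task frame into (Login, task_id, value) rows."""
    return (wide.reset_index()
                .melt(id_vars=['Login'], var_name='task_id',
                      value_name=value_name)
                .dropna(subset=[value_name]))


# Toy frames with the same layout as train_order and train_times.
idx = pd.Index(['U1', 'U2'], name='Login')
toy_order = pd.DataFrame({635: [1, 1], 636: [2, np.nan]}, index=idx)
toy_times = pd.DataFrame({635: [3.2, 2.3], 636: [2.6, np.nan]}, index=idx)

toy_events = (to_long(toy_order, 'student_order')
              .merge(to_long(toy_times, 'time'), on=['Login', 'task_id'])
              .sort_values(['Login', 'student_order'])
              .reset_index(drop=True))
# Fraction of all tasks solved before this one, as in the loop version.
toy_events['student_solved_percentage'] = (
    (toy_events['student_order'] - 1) / toy_order.shape[1])
print(toy_events)
```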


In [122]:
# join events with task features
# TODO: reset index
# TODO: sort by (student_id, student_order)
events = pd.merge(events, task_features, left_on='task_id', right_index=True)
events.head()

,student_id,student_order,student_solved_percentage,task_id,time,solved_percentage,time_median
0,1,1,0.0,635,3.218876,1.0,3.329647
14,2,1,0.0,635,2.302585,1.0,3.329647
75,4,1,0.0,635,3.258097,1.0,3.329647
78,5,1,0.0,635,3.951244,1.0,3.329647
101,6,1,0.0,635,4.219508,1.0,3.329647
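The two TODOs in the cell above could be handled by chaining `sort_values` and `reset_index` after the merge. A sketch on toy data that mirrors the column names used here (the toy values are made up):

```python
import pandas as pd

toy_events = pd.DataFrame({
    'student_id': [1, 1, 2],
    'student_order': [1, 2, 1],
    'task_id': [635, 636, 635],
})
toy_task_features = pd.DataFrame(
    {'solved_percentage': [1.0, 0.5]}, index=[635, 636])

toy_joined = (
    pd.merge(toy_events, toy_task_features,
             left_on='task_id', right_index=True)
    # Restore chronological order within each student.
    .sort_values(['student_id', 'student_order'])
    # Replace the index left over from the merge with 0..n-1.
    .reset_index(drop=True)
)
print(toy_joined)
```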
