Goal:
classify a task for a given student as (1) too easy, (2) just right, or (3) too difficult.

Data:
- Robotanist, order and times of solved tasks
- proxy for difficulty levels:
    - too easy = less than 1 minute
    - too difficult = more than 20 minutes
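
A minimal sketch of the time-based proxy above. It assumes the solving times are stored in seconds (the unit is not stated in the data); the function name `label_difficulty` and the two constants are hypothetical, and unsolved tasks (NaN) are left unlabeled.

```python
import numpy as np
import pandas as pd

# Thresholds from the proxy definition; seconds as the unit is an assumption.
TOO_EASY_SECONDS = 60
TOO_DIFFICULT_SECONDS = 20 * 60


def label_difficulty(times):
    """Map solving times to 'too easy' / 'just right' / 'too difficult'; NaN stays NaN."""
    labels = pd.Series(np.nan, index=times.index, dtype=object)
    solved = times.notna()
    labels[solved] = 'just right'
    labels[solved & (times < TOO_EASY_SECONDS)] = 'too easy'
    labels[solved & (times > TOO_DIFFICULT_SECONDS)] = 'too difficult'
    return labels


example_times = pd.Series([25.0, 459.0, 1500.0, np.nan], index=['a', 'b', 'c', 'd'])
example_labels = label_difficulty(example_times)
```

The same function could be applied column by column to a student x task table, for example with `times.apply(label_difficulty)`.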

Ideally, the same methods should be usable in a scenario where we have explicit user information (qualitative data) about perceived difficulty (obtained via the flow-level question).

Usage:
- soft recommendation by coloring each task (too easy, just right, too difficult)
- soft recommendation via coarse time predictions (1m, 10m, >20m)
- set challenge level (1m or 10m)
- hard recommendation: hide/lock too easy/difficult tasks

In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Data

In [33]:
ordering = pd.read_csv('../data/robotanik/user_time_ordering.csv').set_index('Login')
ordering.head()

Login,635,636,637,638,639,640,641,642,643,644,...,1402,1403,1404,1405,1406,1407,1704,1705,1706,1707
U1,1.0,2.0,13.0,5.0,,9.0,,,,10.0,...,,,,,,,,,,
U2,1.0,2.0,4.0,3.0,5.0,6.0,9.0,10.0,8.0,11.0,...,,,,,,,58.0,59.0,60.0,61.0
U4,1.0,2.0,,,,,,,,,...,,,,,,,,,,
U5,1.0,2.0,3.0,4.0,16.0,5.0,15.0,,11.0,13.0,...,,,,,,,,,,
U6,1.0,2.0,3.0,7.0,,8.0,,,,,...,,,,,,,,,,


In [29]:
times = pd.read_csv('../data/robotanik/user_time.csv').set_index('Login')
times.head()

Login,635,636,637,638,639,640,641,642,643,644,...,1402,1403,1404,1405,1406,1407,1704,1705,1706,1707
U1,25.0,13.0,14.0,38.0,,459.0,,,,132.0,...,,,,,,,,,,
U2,10.0,5.0,49.0,19.0,92.0,23.0,57.0,40.0,42.0,211.0,...,,,,,,,185.0,113.0,36.0,73.0
U4,26.0,6.0,,,,,,,,,...,,,,,,,,,,
U5,52.0,39.0,25.0,19.0,364.0,25.0,99.0,,56.0,46.0,...,,,,,,,,,,
U6,68.0,20.0,32.0,18.0,,42.0,,,,,...,,,,,,,,,,


The dataset contains about 10K users and roughly 80 tasks.

In [37]:
times.shape

(10605, 78)

TODO: split data randomly into train/validation/test sets
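
One possible way to do the random split, sketched on a toy table. The function name `split_students`, the 60/20/20 fractions, and the seed are assumptions, not choices made in this notebook. The split is by student (row), so all events of one student end up in the same set.

```python
import numpy as np
import pandas as pd


def split_students(table, fractions=(0.6, 0.2, 0.2), seed=0):
    """Randomly split rows (students) into train/validation/test tables."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(table.index.to_numpy())
    n_train = int(fractions[0] * len(shuffled))
    n_valid = int(fractions[1] * len(shuffled))
    train_ids = shuffled[:n_train]
    valid_ids = shuffled[n_train:n_train + n_valid]
    test_ids = shuffled[n_train + n_valid:]  # the remainder goes to the test set
    return table.loc[train_ids], table.loc[valid_ids], table.loc[test_ids]


# Toy table standing in for the real student x task data.
toy = pd.DataFrame({'635': range(10)}, index=[f'U{i}' for i in range(10)])
train, valid, test = split_students(toy)
```

With the same seed, calling the function on `times` and on `ordering` selects the same students in each set, provided both tables share the same index in the same order.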

In [46]:
train_times = times.head(10)
train_order = ordering.head(10)

# Exploratory Analysis

# Features

Start with 2 features:
1. how many tasks the student has already solved (student_solved_percentage)
2. how many students have solved the task (task_solved_percentage)
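
A sketch of how these two features might be computed from an order table, shown on toy data. The interpretation is an assumption: `student_solved_percentage` is taken as the share of all tasks the student had solved before attempting the given task, and `task_solved_percentage` as the share of students who solved the task. The function name `build_events` is hypothetical.

```python
import numpy as np
import pandas as pd


def build_events(order):
    """Turn a student x task order table into one row per solved (student, task) pair."""
    tasks_count = order.shape[1]
    task_solved = order.notna().mean(axis=0)  # share of students who solved each task
    events = (
        order.rename_axis(index='student', columns='task')
        .reset_index()
        .melt(id_vars='student', var_name='task', value_name='order')
        .dropna(subset=['order'])  # keep solved tasks only
        .sort_values(['student', 'order'])
        .reset_index(drop=True)
    )
    # Tasks solved before this one, as a share of all tasks (assumed interpretation).
    events['student_solved_percentage'] = (events['order'] - 1) / tasks_count
    events['task_solved_percentage'] = events['task'].map(task_solved)
    return events[['student', 'task', 'student_solved_percentage', 'task_solved_percentage']]


toy_order = pd.DataFrame(
    {'635': [1.0, 1.0, 1.0], '636': [2.0, np.nan, 2.0], '637': [np.nan, np.nan, 3.0]},
    index=pd.Index(['U1', 'U2', 'U3'], name='Login'),
)
toy_events = build_events(toy_order)
```

To avoid leaking information, the task-level share would be computed on the training set only and then reused for validation and test data.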

Feature ideas:
- student-related (skill):
    - number of solved tasks
    - number of solved tasks in easy/flow/difficult mode
    - the most difficult solved task [in easy/flow/difficult mode]
    - which tasks solved [in which mode] / log-times / quartiles
- task-related (difficulty)
    - how many students have solved the task (percentage)
    - mean/median order
    - median log-time
    - percentage of easy/flow/difficult attempts
- content based
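
Some of the task-related ideas above can be sketched as simple column summaries. This is an illustration on toy data, not the notebook's implementation; the function name `task_difficulty_features` is hypothetical, and the log assumes strictly positive times.

```python
import numpy as np
import pandas as pd


def task_difficulty_features(times, order):
    """Per-task summaries over the students who solved each task."""
    return pd.DataFrame({
        'solved_percentage': times.notna().mean(axis=0),
        'median_order': order.median(axis=0),            # NaN (unsolved) is skipped
        'median_log_time': np.log(times).median(axis=0),  # natural log of the times
    })


toy_times = pd.DataFrame(
    {'635': [10.0, 100.0, np.nan], '636': [np.nan, 1000.0, np.nan]},
    index=['U1', 'U2', 'U3'],
)
toy_order = pd.DataFrame(
    {'635': [1.0, 1.0, np.nan], '636': [np.nan, 2.0, np.nan]},
    index=['U1', 'U2', 'U3'],
)
toy_task_features = task_difficulty_features(toy_times, toy_order)
```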

In [47]:
train_order

Login,635,636,637,638,639,640,641,642,643,644,...,1402,1403,1404,1405,1406,1407,1704,1705,1706,1707
U1,1.0,2.0,13.0,5.0,,9.0,,,,10.0,...,,,,,,,,,,
U2,1.0,2.0,4.0,3.0,5.0,6.0,9.0,10.0,8.0,11.0,...,,,,,,,58.0,59.0,60.0,61.0
U4,1.0,2.0,,,,,,,,,...,,,,,,,,,,
U5,1.0,2.0,3.0,4.0,16.0,5.0,15.0,,11.0,13.0,...,,,,,,,,,,
U6,1.0,2.0,3.0,7.0,,8.0,,,,,...,,,,,,,,,,
U7,1.0,2.0,,,4.0,,3.0,,,,...,,,,,,,,,,
U8,1.0,2.0,5.0,6.0,16.0,4.0,23.0,,11.0,12.0,...,,35.0,,,,,31.0,33.0,34.0,32.0
U9,1.0,2.0,5.0,3.0,13.0,4.0,7.0,9.0,8.0,10.0,...,,,,,,,,,,
U10,1.0,2.0,4.0,5.0,,3.0,,,,,...,,,,,,,,,,
U11,1.0,2.0,4.0,3.0,9.0,5.0,8.0,19.0,7.0,10.0,...,72.0,70.0,69.0,73.0,71.0,,74.0,75.0,76.0,77.0


In [48]:
tasks_count = train_order.shape[1]

In [63]:
def map_student_to_events(group):
    # Placeholder: returns constant feature values until the real features are computed.
    return pd.DataFrame([{
        'student_solved_percentage': 0.1,
        'task_solved_percentage': 0.2,
    }])


train_order.groupby(level=0).apply(map_student_to_events)

Login,,student_solved_percentage,task_solved_percentage
U1,0,0.1,0.2
U10,0,0.1,0.2
U11,0,0.1,0.2
U2,0,0.1,0.2
U4,0,0.1,0.2
U5,0,0.1,0.2
U6,0,0.1,0.2
U7,0,0.1,0.2
U8,0,0.1,0.2
U9,0,0.1,0.2


In [56]:
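def map_student_to_events(row):
    # Debug output: each row is a Series with one entry per task.
    print(row.shape)
    print('next')
    # Returning a DataFrame from a row-wise apply fails here (see the ValueError below);
    # a pd.Series or a scalar is expected instead.
    return pd.DataFrame({'x': [0], 'y': [1]})

train_order.apply(map_student_to_events, axis=1)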

(78,)
next
(78,)
next
(78,)
next
(78,)
next
(78,)
next
(78,)
next
(78,)
next
(78,)
next
(78,)
next
(78,)
next
(78,)
next


ValueError: cannot copy sequence with size 2 to array axis with dimension 1
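
A possible way around this error, sketched on toy data: when the function passed to a row-wise `apply` returns a `pd.Series`, pandas assembles the results into a DataFrame with one column per Series entry. The function name `summarize_student` and the two output columns are hypothetical.

```python
import numpy as np
import pandas as pd


def summarize_student(row):
    """Return per-student values as a Series so apply() builds a DataFrame."""
    solved = row.notna().sum()
    return pd.Series({'solved_count': solved, 'solved_percentage': solved / len(row)})


toy_order = pd.DataFrame(
    {'635': [1.0, 1.0], '636': [2.0, np.nan]},
    index=pd.Index(['U1', 'U2'], name='Login'),
)
student_summary = toy_order.apply(summarize_student, axis=1)
```

This yields one row per student. Producing several rows per student (one per event) needs a different approach, such as reshaping the table to long format first.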

In [None]:
train_features
train_targets