# M4 | Research Investigation Notebook

In this notebook, you will do a research investigation of your chosen dataset in teams. You will begin by formally selecting your research question (task 0), then processing your data (task 1), creating a predictive model (task 2), evaluating your model's results (task 3), and describing the contributions of each team member (task 4).

For grading, please make sure your notebook has all cells run and is stored in your team's [GitHub Classroom repository](https://classroom.github.com/a/CNxME27U). You will also need to write a short, two-page report about your team's design decisions, to be stored in your repository. The Milestone 4 submission will be the contents of your repository at the due date (April 28 at 23:59 CET).

## Brief overview of Calcularis
[Calcularis](https://school.alemira.com/de/calcularis/) by Alemira School is a mathematics learning program developed with neuroscientists and computer scientists from ETH Zurich. It promotes the development and interaction of the different areas of the brain that are responsible for processing numbers and quantities and solving mathematical tasks. Calcularis can be used from 1st grade to high school. Children with dyscalculia also benefit in the long term and overcome their arithmetic weakness.

The Calcularis dataset has three main tables:
* ***users***: meta information about users (e.g. total time spent learning with Calcularis, geographic location).
* ***events***: events done by the users in the platform (e.g. playing a game, selecting a new animal in the zoo simulation).
* ***subtasks***: sub-tasks with answer attempts solved by users, primarily in the context of game events.

These tables and useful metadata information are described in detail in the [Milestone 2 data exploration notebook](https://github.com/epfl-ml4ed/mlbd-2023/blob/main/project/milestone-02/m2_calcularis_sciper.ipynb).

We have provided access to the [full dataset](https://moodle.epfl.ch/mod/forum/discuss.php?d=88179) (~65k users) and a randomly selected subset (~1k users from M2). We have also provided access to a [test account to experiment with Calcularis](https://moodle.epfl.ch/mod/forum/discuss.php?d=88094). You should provide arguments and justifications for all of your design decisions throughout this investigation. You can use your M3 responses as the basis for this discussion.

In [1]:
# Imports
# data
import numpy as np
import pandas as pd
import re

# graph
import networkx as nx
from networkx.drawing.nx_agraph import read_dot

# plots
import matplotlib.pyplot as plt
import matplotlib
matplotlib.use('TkAgg')

from pyBKT.models import Model

In [2]:
# Import the tables of the data set as dataframes.

DATA_DIR = './data' # You may change the directory

# You can use the nrows=X argument in pd.read_csv to truncate your data
users = pd.read_csv('{}/calcularis_small_users.csv'.format(DATA_DIR), index_col=0)
events = pd.read_csv('{}/calcularis_small_events.csv'.format(DATA_DIR), index_col=0)
subtasks = pd.read_csv('{}/calcularis_small_subtasks.csv'.format(DATA_DIR), index_col=0)

## Task 0: Research Question

**Research question:**
*Your chosen research question goes here*

## Task 1: Data Preprocessing

In this section, you are asked to preprocess your data in a way that is relevant for the model. Please include 1-2 visualizations of features / data explorations that are related to your downstream prediction task.

In [3]:
# The subtasks table references more event_ids than the events table contains
print(f'How many events in dataset: {len(events)}')
print(f'How many subtasks in dataset: {len(subtasks)}')
# Keep only subtasks whose event is present in events (events is indexed by event_id)
subtasks = subtasks[subtasks['event_id'].isin(events.index)]

# Set the game names in subtasks dataset (label-based lookup by event_id)
subtasks = subtasks.copy()
subtasks['game_name'] = events.loc[subtasks['event_id'], 'game_name'].values

How many events in dataset: 34094
How many subtasks in dataset: 55047


In [5]:
# Read the DOT file and store it as a NetworkX graph
dot_file_path = 'data/04_calcularis_skill_map_dot_file.dot'
G = read_dot(dot_file_path)

In [6]:
# Draw graph of skills
def draw_graph(G):
    plt.figure(figsize=(40, 80))
    pos = nx.spring_layout(G, k=0.3, iterations=50)
    node_sizes = [len(G.adj[node]) * 100 for node in G.nodes]
    edge_widths = [1 + len(G.get_edge_data(u, v)) for u, v in G.edges()]
    nx.draw_networkx_nodes(G, pos, node_size=node_sizes, alpha=0.5)
    nx.draw_networkx_edges(G, pos, width=edge_widths, alpha=0.3, arrowsize=10, arrowstyle='->')
    labels = {node: node.replace('\n', ' ') for node in G.nodes}
    nx.draw_networkx_labels(G, pos, labels=labels, font_size=10)
    plt.axis('off')
    plt.savefig('graph')

draw_graph(G)

*Your discussion about your processing decisions goes here*

In [7]:
# Inspect one user's subtasks with a week number attached
user_subtasks = subtasks.loc[subtasks['user_id'] == 1].copy()

user_subtasks['subtask_finished_timestamp'] = pd.to_datetime(user_subtasks['subtask_finished_timestamp'])

# Drop subtasks without a completion timestamp
user_subtasks = user_subtasks.dropna(subset=['subtask_finished_timestamp'])

# Group by week and assign week number
week_groups = user_subtasks.groupby(pd.Grouper(key='subtask_finished_timestamp', freq='W-MON', closed='left'))
user_subtasks['week_number'] = week_groups.ngroup() + 1

print(user_subtasks[['user_id', 'week_number', 'game_name', 'correct']])

            user_id  week_number                  game_name  correct
subtask_id                                                          
0                 1            1                 Subitizing     True
1                 1            1                 Subitizing     True
2                 1            2                 Conversion     True
3                 1            3                 Conversion     True
4                 1            4                    Landing    False
5                 1            5                 Conversion     True
6                 1            6                 Conversion    False
7                 1            7                 Comparison     True
8                 1            7                 Comparison     True
9                 1           12                    Landing     True
10                1           13                    Landing     True
11                1           14  Estimation on Number Line     True
12                1           15  

In [8]:
# Get the nodes that match the pattern built from a skill rank and a game name
def get_nodes(G, first_word, last_word):

    # Create the regular expression pattern; re.escape keeps regex
    # metacharacters in the inputs from being interpreted
    pattern = re.compile(
        r"\b" + re.escape(first_word) + r"\b(.*?)" + re.escape(last_word.split()[0]) + r"\b(.*)"
    )

    # Search each node once and keep the matched text
    found = (pattern.search(node) for node in G.nodes())
    return [match.group() for match in found if match]

# Get the skill_id ranks of the game
def get_skill_id_ranks(G, game):

    all_matches = get_nodes(G, '', game)

    skill_ranks = [s.split()[0] for s in all_matches]
    return set(skill_ranks)

# Find the ranking for skill_id
def choose_ranking(G, game, skill_id):

    rankings = list(get_skill_id_ranks(G, game))

    upper_limits = np.array([int(rank.split("-")[1]) for rank in rankings])
    # No ranking found for this game
    if upper_limits.size == 0:
        return ''
    is_rank = skill_id < upper_limits

    if not is_rank.any():
        # skill_id exceeds every upper limit: fall back to the highest range
        max_index = np.argmax(upper_limits)
    else:
        min_limit = np.min(upper_limits[is_rank])
        max_index = np.where(upper_limits == min_limit)[0][0]

    return rankings[max_index]

# Find the name of game in names of games from graph
def find_word_in_list(word, string_list):
    for string in string_list:
        if str(word) in string:
            return True
    return False

In [9]:
# Calculate mastery level
def calculate_mastery_level(G, user_subtasks, week, game, skill_id):

    # Find ranking of skill
    skill_rank = choose_ranking(G, game, skill_id)

    # Get nodes that matches the game and skill id 
    skill_nodes = get_nodes(G, skill_rank, game)

    # If the game is not in skills graph
    if not skill_nodes:
        games = [game]
    # Otherwise collect the matched node and all of its ancestors
    else:
        skill_node = skill_nodes[0]
        ancestors = nx.ancestors(G, skill_node)
        # Flatten into one list of node labels so each label is searched for the game name
        games = list(ancestors) + [skill_node]

    # Keep only the subtasks completed up to and including the given week
    user_subtasks = user_subtasks[user_subtasks['week_number'] <= week]
    answers = []

    for idx, subtask in user_subtasks.iterrows():
        answers.append(find_word_in_list(subtask['game_name'], games) and subtask['correct'])

    if not answers:
        return 0.0
    return sum(answers) / len(answers)
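
The mastery level computed above is a ratio: among all of a user's subtasks up to a given week, the share that were answered correctly and belong to the game or to one of its ancestors in the skill graph. The following is a minimal, dependency-free sketch of that idea. The toy graph `EDGES`, the helper names `ancestors` and `mastery_level`, and the sample attempts are all hypothetical and exist only for illustration; the notebook itself uses `nx.ancestors` on the real skill map.

```python
# Toy prerequisite graph (hypothetical): an edge (a, b) means skill a precedes skill b.
EDGES = [("Subitizing", "Conversion"), ("Conversion", "Landing"), ("Comparison", "Landing")]

def ancestors(edges, target):
    """Return every node with a directed path to `target`."""
    parents = {}
    for src, dst in edges:
        parents.setdefault(dst, set()).add(src)
    seen, stack = set(), [target]
    while stack:
        node = stack.pop()
        for parent in parents.get(node, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

def mastery_level(attempts, edges, game, week):
    """Share of attempts up to `week` that were correct AND on `game` or one of its ancestors.

    `attempts` is a list of (week_number, game_name, correct) tuples for one user.
    """
    relevant = ancestors(edges, game) | {game}
    window = [a for a in attempts if a[0] <= week]
    if not window:
        return 0.0
    hits = sum(1 for _, name, correct in window if name in relevant and correct)
    return hits / len(window)

# Hypothetical attempts for a single user
attempts = [
    (1, "Subitizing", True),
    (1, "Subitizing", True),
    (2, "Conversion", True),
    (4, "Landing", False),
    (7, "Comparison", True),
]
print(mastery_level(attempts, EDGES, "Conversion", 2))
print(mastery_level(attempts, EDGES, "Conversion", 7))
```

Note that the denominator counts every attempt in the window, including attempts on unrelated games, so playing other games lowers the mastery level of a game even when no mistakes were made in it. This is worth discussing as a design decision.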

In [20]:
# Find the records corresponding to the chosen user_id
def find_user_subtasks(user_id):
    user_subtasks = subtasks.loc[subtasks['user_id'] == user_id]
    user_subtasks = user_subtasks.copy()

    user_subtasks['subtask_finished_timestamp'] = pd.to_datetime(user_subtasks['subtask_finished_timestamp'])
    user_subtasks = user_subtasks.dropna(subset=['subtask_finished_timestamp'])

    # Group by week and assign week number
    week_groups = user_subtasks.groupby(pd.Grouper(key='subtask_finished_timestamp', freq='W-MON', closed='left'))
    user_subtasks['week_number'] = week_groups.ngroup() + 1
    return user_subtasks
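
To make the week numbering easier to reason about, here is a plain-Python sketch of the same idea: calendar weeks starting on Monday, counted from the user's first active week. The function name `week_numbers` is hypothetical, and the sketch is not guaranteed to match the `pd.Grouper` bins above edge for edge; it is only meant to illustrate why inactive weeks leave gaps in the numbering.

```python
from datetime import datetime, timedelta

def week_numbers(timestamps):
    """Map each timestamp to a 1-based week index counted from the earliest Monday-start week."""
    def week_start(ts):
        # Monday of the calendar week containing ts
        return (ts - timedelta(days=ts.weekday())).date()

    first_week = min(week_start(ts) for ts in timestamps)
    return [(week_start(ts) - first_week).days // 7 + 1 for ts in timestamps]

# Illustrative timestamps (two in the same week, then one per following week)
stamps = [
    datetime(2015, 3, 19, 18, 48),
    datetime(2015, 3, 20, 18, 7),
    datetime(2015, 3, 23, 15, 18),
    datetime(2015, 4, 2, 14, 3),
]
print(week_numbers(stamps))
```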

In [21]:
# Create dataframe with multi-index [user_id, game_name, week]
def create_dataframe_multi_index(G, how_many=100, verbose=False):

    # Create empty dataframe
    multi_index = [[], [], []]
    df = pd.DataFrame(columns = ['mastery_level'], index = multi_index)
    df.index = df.index.set_names(['user_id', 'game_name', 'week'])

    for user_id, user in users.iterrows(): 
        # Find user_subtasks 
        user_subtasks = find_user_subtasks(user_id)

        # Create index for the user:
        # Get unique game names (taken from all subtasks, so every user shares the same games)
        game_names = events.loc[subtasks['event_id']]['game_name'].unique()

        # Get the unique weeks in which the user was active
        unique_weeks = user_subtasks['week_number'].unique()

        # One row per (user, game, week); games vary slowest, matching the loop order below
        index = pd.MultiIndex.from_product(
            [[user_id], game_names, unique_weeks],
            names=['user_id', 'game_name', 'week']
        )
        user_df = pd.DataFrame(columns = ['mastery_level'], index = index)

        # Calculate mastery level
        mastery_level = []
        # Join the user's subtasks with their events once, instead of once per (game, week) pair
        user_events = user_subtasks.merge(events, on='event_id')[['week_number', 'game_name_x', 'skill_id']]
        for game in game_names:
            for week in unique_weeks:
                # Select the user's records for this game in this week
                associated_events = user_events[(user_events['game_name_x'] == game) & (user_events['week_number'] == week)]

                # If the user played the game during the week
                if not associated_events.empty:
                    mean_skill = associated_events['skill_id'].mean()
                    lv = calculate_mastery_level(G, user_subtasks, week, game, mean_skill)
                    mastery_level.append(lv)
                # If the user did not play the game during the week, carry the previous value forward
                elif week > unique_weeks[0]:
                    mastery_level.append(mastery_level[-1])
                # If this is the user's first week and they did not play the game
                else: 
                    mastery_level.append(0.0)
                

        # Assign to the column
        mastery_level = pd.DataFrame(mastery_level, columns = ['mastery_level'])
        mastery_level.index = index
        user_df['mastery_level'] = mastery_level

        # Add stats of the user to the dataframe
        df = pd.concat([df, user_df], axis=0)

        # Display info
        if user_id % 10 == 0 and verbose:
            print(f'**** processing data for user id == {user_id} ****')
        # Stop after the first user_id above how_many, so only a fraction of all users is processed
        if user_id > how_many:
            break
    return df

df = create_dataframe_multi_index(G, 100, True)
df.to_csv('my_dataframe.csv')

**** processing data for user id == 10 ****
**** processing data for user id == 20 ****
**** processing data for user id == 30 ****
**** processing data for user id == 40 ****
**** processing data for user id == 50 ****
**** processing data for user id == 60 ****
**** processing data for user id == 70 ****
**** processing data for user id == 80 ****


In [33]:
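# TODO: remove before pushing to GitHub
# Reload the cached result; index_col restores the (user_id, game_name, week) MultiIndex
# that the cells below rely on
df = pd.read_csv('my_dataframe.csv', index_col=['user_id', 'game_name', 'week'])
df.head()

user_id,game_name,week,mastery_level
1,Subitizing,1,1.0
1,Subitizing,2,1.0
1,Subitizing,3,1.0
1,Subitizing,4,1.0
1,Subitizing,5,1.0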


In [17]:
# Select up to `how_many` users with a non-zero mastery level for the game
# (an all-zero series means the user never played it)
def get_random_ids(df, how_many, game):
    user_ids = df.index.get_level_values('user_id').unique()
    random_user_ids = []
    for user in user_ids:
        # Sum the column directly; calling float() on a one-element Series is deprecated
        if df.loc[(user, game), 'mastery_level'].sum() > 0:
            random_user_ids.append(user)

    return np.array(random_user_ids)[:how_many]

In [19]:
# Display mastery levels
def show_mastery_lvls(df, users, game):
    fig, ax = plt.subplots(figsize=(12, 6))

    for i, user in enumerate(users):
        temp_df = df.sort_index().loc[(user, game), :]
        ax.plot(temp_df.index, temp_df['mastery_level'], label=f'{user}')

    ax.set_title(f'Mastery of {game} Over Time')
    ax.set_xlabel('Weeks')
    ax.set_ylabel('Mastery Level')
    ax.legend(title='User ID', loc='center left', bbox_to_anchor=(1, 0.5))

    plt.show()

random_user_ids = get_random_ids(df, 10, 'Subitizing')
show_mastery_lvls(df, random_user_ids, 'Subitizing')



## Task 2: Model Building

Train a model for your research question. 

### Training baseline model: BKT

In [88]:
# Load the preprocessed dataframe generated in experiment.ipynb
# "Unnamed: 0" is the index column of the dataframe before preprocessing (many rows were removed)
df_task_events = pd.read_csv('data/calcularis_small_task_events.csv')
df_task_events.rename(columns={'game_name': 'skill_name'}, inplace=True)
df_task_events.head()

Unnamed: 0.1,Unnamed: 0,event_id,user_id,mode_event,skill_name,number_range,skill_id,type_subtask,date,Year,Week,Day,week_sequential,correct,level_2,cumulative_percent_correct
0,0,118,7,NORMAL,Subitizing,R10,1.0,ConciseTimeoutDescription,2015-03-19 18:48:57.303000+00:00,2015,12,4,0,True,135,1.0
1,1,118,7,NORMAL,Subitizing,R10,1.0,ConciseSubitizingTaskDescription,2015-03-19 18:48:57.303000+00:00,2015,12,4,0,True,134,1.0
2,2,119,7,NORMAL,Conversion,R10,3.0,ConciseConversionTaskDescription,2015-03-20 18:07:17.288000+00:00,2015,12,5,0,True,136,1.0
3,3,120,7,NORMAL,Landing,R10,19.0,ConciseLandingTaskDescription,2015-03-23 15:18:26.515000+00:00,2015,13,1,1,True,137,1.0
4,4,121,7,END_OF_NR,Conversion,R10,7.0,ConciseConversionTaskDescription,2015-04-02 14:03:06.836000+00:00,2015,14,4,2,True,138,1.0


In [94]:
# Creating the first model with default parameters and fitting it for all games
model = Model(seed=0)
model.fit(data=df_task_events, forgets=False)
model.evaluate(data=df_task_events, metric='auc') 

0.7765331719667002
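
The AUC above is computed on `df_task_events`, the same rows the model was fitted on, so it describes fit rather than generalization. One option is to hold out a set of users before fitting. The sketch below splits user ids into disjoint train and test sets using only the standard library; the helper name `split_users`, the 20% test fraction, and the seed are assumptions made for illustration.

```python
import random

def split_users(user_ids, test_fraction=0.2, seed=0):
    """Split the unique user ids into disjoint train and test sets."""
    unique_ids = sorted(set(user_ids))
    rng = random.Random(seed)
    rng.shuffle(unique_ids)
    # Always keep at least one user for testing
    n_test = max(1, round(len(unique_ids) * test_fraction))
    test_ids = set(unique_ids[:n_test])
    train_ids = set(unique_ids[n_test:])
    return train_ids, test_ids

# Illustrative ids; in the notebook these would come from df_task_events['user_id']
train_ids, test_ids = split_users([7, 7, 7, 1, 1, 2, 3, 4, 5, 6])
print(len(train_ids), len(test_ids))
```

The resulting sets can then be used to filter the dataframe, for example `df_task_events[df_task_events['user_id'].isin(train_ids)]` for fitting and the complementary rows for `model.evaluate`. Splitting by user rather than by row keeps each student's full answer sequence on one side of the split.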

In [100]:
# Visualizing parameters
model.params().head(10)

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,value
skill,param,class,Unnamed: 3_level_1
Subitizing,prior,default,0.66811
Subitizing,learns,default,0.05285
Subitizing,guesses,default,0.4548
Subitizing,slips,default,0.20789
Subitizing,forgets,default,0.0
Conversion,prior,default,0.89709
Conversion,learns,default,0.03963
Conversion,guesses,default,0.6221
Conversion,slips,default,0.1036
Conversion,forgets,default,0.0


In [131]:
# Predicting mastery level for all users
# state_predictions: probability (between 0 and 1) that the student has mastered the skill when attempting that question; the first value per user and skill equals the fitted prior
# correct_predictions: probability (between 0 and 1), according to the model, that the student will answer that question correctly
df_preds = model.predict(data=df_task_events)
df_preds[df_preds['skill_name']=='Conversion'][['user_id', 'correct', 'correct_predictions', 'state_predictions']].head()

Unnamed: 0,user_id,correct,correct_predictions,state_predictions
50952,1,1,0.86817,0.89709
51366,1,1,0.87698,0.92918
52090,1,1,0.88317,0.95175
52365,1,0,0.88745,0.96736
48376,2,0,0.86817,0.89709
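
To see where these numbers come from, the standard BKT update can be traced by hand. The sketch below is a plain-Python restatement of the textbook BKT equations, not pyBKT's internal code; the function name `bkt_trace` is hypothetical. It uses the Conversion parameters from the `model.params()` table above (prior 0.89709, learns 0.03963, guesses 0.6221, slips 0.1036) and user 1's observed answers (correct, correct, correct, incorrect). Under these assumptions the trace agrees with the `correct_predictions` and `state_predictions` columns to about four decimal places; small differences come from the rounded parameters.

```python
def bkt_trace(prior, learns, guesses, slips, observations, forgets=0.0):
    """Return a list of (p_correct, p_known) pairs, one per observation.

    Both values are computed BEFORE the answer at that step is taken into account.
    """
    p_known = prior
    trace = []
    for correct in observations:
        # Probability of a correct answer given the current mastery estimate
        p_correct = p_known * (1 - slips) + (1 - p_known) * guesses
        trace.append((p_correct, p_known))
        # Bayes update of the mastery estimate on the observed answer
        if correct:
            posterior = p_known * (1 - slips) / p_correct
        else:
            posterior = p_known * slips / (1 - p_correct)
        # Transition: the student may learn (or forget) before the next question
        p_known = posterior * (1 - forgets) + (1 - posterior) * learns
    return trace

# Conversion parameters of the model fitted without forgetting
conversion = bkt_trace(
    prior=0.89709, learns=0.03963, guesses=0.6221, slips=0.1036,
    observations=[1, 1, 1, 0],
)
for p_correct, p_known in conversion:
    print(f"correct_prediction={p_correct:.4f}  state_prediction={p_known:.4f}")
```

The optional `forgets` argument adds the forgetting transition, so the same sketch can be used to inspect the parameters of a model fitted with `forgets=True`.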


In [97]:
# Creating the second model with forgets=True and fitting it for all games
model_forgets = Model(seed=0)
model_forgets.fit(data=df_task_events, forgets=True)
model_forgets.evaluate(data=df_task_events, metric='auc')

0.7830957178701483

In [99]:
# Visualizing parameters
model_forgets.params().head(10)

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,value
skill,param,class,Unnamed: 3_level_1
Subitizing,prior,default,0.8048
Subitizing,learns,default,0.35636
Subitizing,guesses,default,0.25194
Subitizing,slips,default,0.21333
Subitizing,forgets,default,0.06911
Conversion,prior,default,0.84631
Conversion,learns,default,0.29306
Conversion,guesses,default,0.54626
Conversion,slips,default,0.08488
Conversion,forgets,default,0.02918


In [132]:
# Predicting mastery level for all users
df_preds_forgets = model_forgets.predict(data=df_task_events)
df_preds_forgets[df_preds_forgets['skill_name']=='Landing'][['user_id', 'correct', 'correct_predictions', 'state_predictions']].head()

Unnamed: 0,user_id,correct,correct_predictions,state_predictions
51742,1,0,0.37399,0.36995
53936,1,1,0.47541,0.55869
54302,1,1,0.59904,0.78873
54907,1,1,0.61771,0.82348
49587,2,0,0.37399,0.36995


In [106]:
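# Inspect the forgetting model's predictions: the values shown start from its
# Subitizing prior (0.8048), not the first model's (0.66811)
df_preds_forgets.head()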

Unnamed: 0.1,Unnamed: 0,event_id,user_id,mode_event,skill_name,number_range,skill_id,type_subtask,date,Year,Week,Day,week_sequential,correct,level_2,cumulative_percent_correct,correct_predictions,state_predictions
50463,54328,0,1,NORMAL,Subitizing,R10,1.0,ConciseSubitizingTaskDescription,2022-11-02 08:39:12.355000+00:00,2022,44,3,0,1,0,1.0,0.68229,0.8048
50464,54329,0,1,NORMAL,Subitizing,R10,1.0,ConciseTimeoutDescription,2022-11-02 08:39:12.355000+00:00,2022,44,3,0,1,1,1.0,0.72757,0.88948
50952,54340,1,1,NORMAL,Conversion,R10,4.0,ConciseConversionTaskDescription,2022-11-11 10:26:27.893000+00:00,2022,45,5,1,1,2,1.0,0.85843,0.84631
51366,54331,2,1,NORMAL,Conversion,R10,7.0,ConciseConversionTaskDescription,2022-11-18 10:34:01.044000+00:00,2022,46,5,2,1,3,1.0,0.87991,0.90453
51742,54330,3,1,NORMAL,Landing,R10,19.0,ConciseLandingTaskDescription,2022-11-25 10:32:43.428000+00:00,2022,47,5,3,0,4,0.0,0.37399,0.36995


In [None]:
import matplotlib.pyplot as plt
%matplotlib inline
# Plotting the learning rate per skill
# Plotting the mastery level for a random user



## Task 3: Model Evaluation
In this task, you will use metrics to evaluate your model.
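
As a starting point, two common metrics for predicted probabilities of a correct answer are RMSE and AUC. The sketch below implements both in plain Python so their definitions are explicit; the function names `rmse` and `auc` are illustrative, and in practice a library implementation would normally be used. The example inputs are the five Landing rows shown in the prediction table of the forgetting model in Task 2, which is far too few for a meaningful estimate and serves only as a worked example.

```python
import math

def rmse(y_true, y_score):
    """Root mean squared error between binary outcomes and predicted probabilities."""
    return math.sqrt(sum((t - s) ** 2 for t, s in zip(y_true, y_score)) / len(y_true))

def auc(y_true, y_score):
    """Probability that a random positive is scored above a random negative (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# 'correct' and 'correct_predictions' from the five Landing rows shown in Task 2
y_true = [0, 1, 1, 1, 0]
y_score = [0.37399, 0.47541, 0.59904, 0.61771, 0.37399]
print(f"RMSE: {rmse(y_true, y_score):.3f}")
print(f"AUC:  {auc(y_true, y_score):.3f}")
```

AUC only measures how well the predictions rank correct above incorrect answers, while RMSE also penalizes poorly calibrated probabilities, so reporting both gives a fuller picture.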

In [None]:
# Your code for model evaluation goes here

*Your discussion/interpretation about your model's behavior goes here*

## Task 4: Team Reflection
Please describe the contributions of each team member to Milestone 4. Reflect on how you worked as a team: what went well, and what can be improved for the next milestone?

*Your discussion about team responsibilities goes here*