# M4 | Research Investigation Notebook

In this notebook, you will do a research investigation of your chosen dataset in teams. You will begin by formally selecting your research question (task 0), then processing your data (task 1), creating a predictive model (task 2), evaluating your model's results (task 3), and describing the contributions of each team member (task 4).

For grading, please make sure your notebook has all cells run and is stored in your team's [Github Classroom repository](https://classroom.github.com/a/CNxME27U). You will also need to write a short, 2 page report about your design decisions as a team, to be stored in your repository. The Milestone 4 submission will be the contents of your repository at the due date (April 28 at 23:59 CET).

## Brief overview of Calcularis
[Calcularis](https://school.alemira.com/de/calcularis/) by Alemira School is a mathematics learning program developed with neuroscientists and computer scientists from ETH Zurich. It promotes the development and interaction of the different areas of the brain that are responsible for processing numbers and quantities and solving mathematical tasks. Calcularis can be used from 1st grade to high school. Children with dyscalculia also benefit in the long term and overcome their arithmetic weakness.

The Calcularis dataset has three main tables:
* ***users***: meta information about users (i.e. total time spent learning with Calcularis, geographic location).
* ***events***: events done by the users in the platform (i.e. playing a game, selecting a new animal in the zoo simulation).
* ***subtasks***: sub-tasks with answer attempts solved by users, primarily in the context of game events.

These tables and useful metadata information are described in detail in the [Milestone 2 data exploration notebook](https://github.com/epfl-ml4ed/mlbd-2023/blob/main/project/milestone-02/m2_calcularis_sciper.ipynb).

We have provided access to the [full dataset](https://moodle.epfl.ch/mod/forum/discuss.php?d=88179) (~65k users) and a randomly selected subset (~1k users from M2). We have also provided access to a [test account to experiment with Calcularis](https://moodle.epfl.ch/mod/forum/discuss.php?d=88094). You should provide arguments and justifications for all of your design decisions throughout this investigation. You can use your M3 responses as the basis for this discussion.

In [1]:
# Imports
# plots
import matplotlib.pyplot as plt
import matplotlib
matplotlib.use('TkAgg')

# helper methods
from utils import *

In [2]:
# Import the tables of the data set as dataframes.

DATA_DIR = './data' # You many change the directory

# You can use the nrows=X argument in pd.read_csv to truncate your data
users = pd.read_csv('{}/calcularis_small_users.csv'.format(DATA_DIR), index_col=0)
events = pd.read_csv('{}/calcularis_small_events.csv'.format(DATA_DIR), index_col=0)
subtasks = pd.read_csv('{}/calcularis_small_subtasks.csv'.format(DATA_DIR), index_col=0)

## Task 0: Research Question

**Research question:**
* What factors influence the process of learning?  

* Which tasks should be solved to obtain the fastest progress? Is there any tasks of this kind? 

* How in task-based learning we can assume that the student is learnt. Is the time most important? How many correct answers? 

## Task 1: Data Preprocessing

In this section, you are asked to preprocess your data in a way that is relevant for the model. Please include 1-2 visualizations of features / data explorations that are related to your downstream prediction task.

In [3]:
# In subtasks dataset exist more event_id than in events dataset
print(f'How many events in dataset: {len(events)}')
print(f'How many subtasks in dataset: {len(subtasks)}')
subtasks = subtasks[subtasks.event_id < len(events)]

# Set the game names in subtasks dataset
#subtasks = subtasks.copy()
#subtasks['game_name'] = events.iloc[subtasks['event_id']]['game_name'].values

How many events in dataset: 34094
How many subtasks in dataset: 55047


### Skill graph
We decide to base calculation of the lmastery in games on skill graph, which we are downloading below.

In [4]:
# Read the DOT file and store it as a NetworkX graph
dot_file_path = 'data/04_calcularis_skill_map_dot_file.dot'
G = read_dot(dot_file_path)

In [5]:
# Draw graph of skills
def draw_graph(G):
    plt.figure(figsize=(40, 80))
    pos = nx.spring_layout(G, k=0.3, iterations=50)
    node_sizes = [len(G.adj[node]) * 100 for node in G.nodes]
    edge_widths = [1 + len(G.get_edge_data(u, v)) for u, v in G.edges()]
    nx.draw_networkx_nodes(G, pos, node_size=node_sizes, alpha=0.5)
    nx.draw_networkx_edges(G, pos, width=edge_widths, alpha=0.3, arrowsize=10, arrowstyle='->')
    labels = {node: node.replace('\n', ' ') for node in G.nodes}
    nx.draw_networkx_labels(G, pos, labels=labels, font_size=10)
    plt.axis('off')
    plt.savefig('skill_graph')
    plt.close()

draw_graph(G)

![myfig](skill_map.png)

### Dataframe with mastery levels
At the beginning, we are looking for the games titles in skill map, which fits with game name and skill id found in the dataframe. Then we have to find ancestors of the each game, based on position in the skill map. Thanks to this we can calculate the mastery skill over all games that contributes to development of each skill for all the users.

To calculate mastery level we are using methods from utils.py. They are described in comments.

Finally we stored created dataframe in dataframe.csv file

In [7]:
# Create dataframe with multindex [user_id, week, game]
def create_dataframe_multi_index(G, how_many=100, verbose=False):

    # Create empty dataframe
    multi_index = [[], [], []]
    df = pd.DataFrame(columns = ['mastery_level', 'mastery_level_diff'], index = multi_index)
    df.index = df.index.set_names(['user_id', 'game_name', 'week'])

    subtasks_events = subtasks.merge(events, on='event_id')

    for user_id, user in users.iterrows(): 
        # Find user_subtasks 
        user_subtasks = find_user_subtasks(subtasks_events, user_id)[['week_number', 'game_name', 'skill_id', 'correct']]
        
        # Create index for the user:
        # Get unique games names
        game_names = events.loc[subtasks['event_id']]['game_name'].unique()

        # Get unique weeks
        unique_weeks = user_subtasks['week_number'].unique()

        len_weeks = len(unique_weeks)
        len_game_names = len(game_names)

        user_ids = [user_id for i in range(len_weeks * len_game_names)]

        user_unique_weeks = user_subtasks['week_number'].unique()
        user_unique_weeks = np.concatenate([user_unique_weeks]* (len_game_names))

        user_unique_games = [game for game in game_names for week in range(len_weeks)]

        tuples = list(zip(user_ids, user_unique_games, user_unique_weeks))

        # Assign index values
        index = pd.MultiIndex.from_tuples(tuples, names=['user_id', 'game_name', 'week'])
        user_df = pd.DataFrame(columns = ['mastery_level', 'mastery_level_diff'], index = index)

        # Calculate mastery level
        mastery_level = []
        for game in game_names:
            for week in unique_weeks:
                # Find info about statistics of the user for the game for the week
                associated_events = user_subtasks[(user_subtasks['game_name'] == game) & (user_subtasks['week_number'] == week)]
                
                # If the user played the game during the week
                if not associated_events.empty:
                    mean_skill = associated_events['skill_id'].mean()
                    lv = calculate_mastery_level(G, user_subtasks, week, game, mean_skill)
                    mastery_level.append(lv)
                # If the user did not play the game during the week, but we can get previous statictics
                elif week > unique_weeks[0]:
                    mastery_level.append(mastery_level[-1])
                # If the week == 1 and player did not play the game
                else: 
                    mastery_level.append(0.0)
                

        # Assign mastery levls
        mastery_level = pd.DataFrame(mastery_level, columns = ['mastery_level'])
        mastery_level.index = index
        user_df['mastery_level'] = mastery_level

        # Assign the difference of mastery lvls
        user_df['mastery_level_diff'] = user_df['mastery_level'].diff()
        user_df.loc[user_df.index.get_level_values('week') == 1, 'mastery_level_diff'] = 0.0

        # Add stats of the user to the dataframe
        df = pd.concat([df, user_df], axis=0)

        # Display info
        if user_id % 10 == 0 and verbose:
            print(f'**** processing data for user id == {user_id} ****')
        # Calculate for fraction of all users
        if user_id > how_many:
            break
    return df

df = create_dataframe_multi_index(G, len(users), True)
convert_to_csv(df, 'dataframe.csv')

**** processing data for user id == 10 ****
**** processing data for user id == 20 ****
**** processing data for user id == 30 ****
**** processing data for user id == 40 ****
**** processing data for user id == 50 ****
**** processing data for user id == 60 ****
**** processing data for user id == 70 ****
**** processing data for user id == 80 ****
**** processing data for user id == 90 ****
**** processing data for user id == 100 ****
**** processing data for user id == 110 ****
**** processing data for user id == 120 ****
**** processing data for user id == 130 ****
**** processing data for user id == 140 ****
**** processing data for user id == 150 ****
**** processing data for user id == 160 ****
**** processing data for user id == 170 ****
**** processing data for user id == 180 ****
**** processing data for user id == 190 ****
**** processing data for user id == 200 ****
**** processing data for user id == 210 ****
**** processing data for user id == 220 ****
**** processing dat

In [8]:
df = load_from_csv('dataframe.csv')

### Display results
We are creating two graphs, each of them showing one column in our dataframe for 10 users for chosen game. 
- mastery_level: it is representation of development of skill over time
- mastery_level: it represents the progress between each week ver time

In [9]:
# Display mastery levels
def show_mastery_details(df, users, game, col):
    fig, ax = plt.subplots(figsize=(12, 6))

    for i, user in enumerate(users):
        temp_df = df.sort_index().loc[(user, game), :]
        ax.plot(temp_df.index, temp_df[col], label=f'{user}')

    ax.set_title(f'{col} of {game} Over Time')
    ax.set_xlabel('Weeks')
    ax.set_ylabel(col)
    ax.legend(title='User ID', loc='center left', bbox_to_anchor=(1, 0.5))

    plt.savefig(f'{col}.png')
    plt.close()

random_user_ids = get_random_ids(df, 10, 'Subitizing', 'mastery_level')
show_mastery_details(df, random_user_ids, 'Subitizing', 'mastery_level')

random_user_ids = get_random_ids(df, 10, 'Subitizing', 'mastery_level_diff')
show_mastery_details(df, random_user_ids, 'Subitizing', 'mastery_level_diff')

  if df.loc[(user, game), col].sum() != 0.0:
  if df.loc[(user, game), col].sum() != 0.0:


*Your discussion about your processing decisions goes here*

#### mastery level
We can track the progress or regress of developing of skill, what is important not many users achive the mastery level above 0.5 at the end of their learning, so it may mean that they drop the learning process because of some reasons, which could be boredom or lack of motivation to continue.

![mastery_level](mastery_level.png)

#### mastery level differences
What is of much significance, most of users are not making progress during the period of research. They only play the game in some of the weeks, where in others they are not progressing at all. This could provide us some insights why they are not making great progess which can be observed in first plot. Also the differences between mastery levels over weeks are not bigger than 0.4. The ratio between how much of them making progress and regress is close to 0.5

![mastery_level_diff](mastery_level_diff.png)

We can conclude that the values of this column and patterns that can be found in this graphs are completely different. These will be used to check which allows us to produce better predictions, as the first one might be misinterpreted by our model, because eventhough player is not playing the game his mastery skill is maintained.


## Task 2: Model Building

Train a model for your research question. 

In [None]:
# Your code for training a model goes here

*Your discussion about your model training goes here*

## Task 3: Model Evaluation
In this task, you will use metrics to evaluate your model.

In [None]:
# Your code for model evaluation goes here

*Your discussion/interpretation about your model's behavior goes here*

## Task 4: Team Reflection
Please describe the contributions of each team member to Milestone 4. Reflect on how you worked as team: what went well, what can be improved for the next milestone?

*Your discussion about team responsibilities goes here*