# Welcome to the MOCK MLBD final exam (Spring 2022)

The exam questions are contained in this Jupyter Notebook. The `data` folder contains the data. 

The logistical details, rules, and guidelines pertaining to the exam are stated below. 

### Timeline and Submission
**Exam date:** Wednesday, June 1, 2022   
**Exam start:** 08:20 am (CEST)   
**Exam end:** 09:20 am (CEST)

### Instructions
This exam consists of two parts: a Moodle quiz with conceptual questions, and programming exercises in this notebook. Be sure to answer the conceptual questions and upload this notebook to Moodle.

In case of issues with Moodle, send your file, named "SCIPER_Firstname_Lastname.ipynb", by email to paola.mejia@epfl.ch with the subject "[MLBD] Exam notebook".

### Rules

1. You are allowed to use any environment. We recommend EPFL's Noto environment, accessible at [https://noto.epfl.ch/](https://noto.epfl.ch/). The default Noto installation includes a Python environment with all the packages you might need for the exam. If you want to use additional packages, feel free to install and use them in a virtual environment. In that case, it is your own responsibility to make sure that your environment is functional and that your results can be properly interpreted for grading.

2. For questions containing the **/Discuss:/** prefix, answer not with code, but with a textual explanation (in markdown). Add a textual description of your thought process, the assumptions you made, and your results.

3. Please write all your comments in English, and use meaningful variable names in your code.

4. When asked for plots, please include all the needed decorations, namely title, x/y-axis labels, appropriate x/y-ticks, legend, and so on. 

5. We will grade your notebook as is, which means that only the results shown in your evaluated code cells will be considered. Please be sure to submit a **fully-run and evaluated notebook**. We will not run the notebook for you. Interactive plots, such as those generated using `plotly`, should be **strictly avoided**.

6. You can use all the online resources you want except for communication tools (emails, web chats, forums, phone, etc.). Remember, this is not a project assignment. Therefore, no teamwork is allowed.

In [1]:
### SOME MINIMALISTIC IMPORTS ###
import matplotlib.pyplot as plt
import seaborn as sns

import pandas as pd
import numpy as np
import sklearn

%matplotlib inline

### YOUR ADDITIONAL IMPORT STATEMENTS BELOW ###

# > Task A. Dropout Prediction in MOOCs (30 points)

Massive open online course (MOOC) platforms allow learners to interact with the **course content** and learn concepts of the learning domain **on their own**. However, the consistently high dropout rate of MOOC learners is a major issue. Guidance in the form of **targeted interventions** has the potential to decrease this dropout rate and improve educational outcomes. In this task, you will explore a promising approach for this purpose: **classification of new learners in real time**.

### The data
The dataset for this task stems from XuetangX, one of the largest MOOC platforms in China. The reduced dataset provided to you includes information about only one course. For each user, the system records multiple types of activities: video watching (e.g., load or play a video), forum discussion (e.g., create and delete threads), and problem completion (with correct/incorrect answers). The dropout label is provided for each user.

The dataset is available in the `data/` directory pushed to the same GitHub repo as the exam. Inside the data directory, you will find two files:

#### 1. `log.csv`: a comma-separated file with the following information about user-course interactions

- *user_id*: the id of the user
- *session_id*: the id of the session
- *action*: the type of action performed by the user 
    - 'load_video': the user has decided to open and load a course video 
    - 'play_video': the user has started playing a course video
    - 'pause_video': the user has paused a course video
    - 'stop_video': the user has stopped a course video
    - 'seek_video': the user has selected to go to a different point in the course video
    - 'problem_get': the user has accessed a problem page
    - 'problem_check_correct': the problem answer submitted by the student was *correct*
    - 'problem_check_incorrect': the problem answer submitted by the student was *incorrect*
    - 'create_comment': the user has created a forum comment
    - 'delete_comment': the user has deleted a forum comment
    - 'create_thread': the user has created a forum thread
    - 'delete_thread': the user has deleted a forum thread
- *timestamp*: the timestamp of the action
- *week*: the week of the course in which the action occurred (week 0 is the first week, week 5 is the last one included here)

#### 2. `cluster_labels.csv`: a comma-separated file with the following information about the cluster each student belongs to
- *user_id*: the id of the user
- *cluster*: the label of the user's cluster
- *gender*: 1 for female, 0 for male, 2 for non-binary

### > A1. Load and prepare the data 
1. Load `data/log.csv` and `data/cluster_labels.csv`.
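
One possible way to load the two files is sketched below. The helper name `load_exam_data` is hypothetical, and the sketch assumes only the column names listed in the data description. Because it accepts file-like objects as well as paths, it can be tried on a tiny in-memory sample; the sample rows, including the timestamp format, are made up.

```python
import io

import pandas as pd


def load_exam_data(log_source="data/log.csv",
                   labels_source="data/cluster_labels.csv"):
    """Read the interaction log and the cluster labels into two dataframes.

    Hypothetical helper: both arguments may be paths or file-like objects.
    """
    logs = pd.read_csv(log_source)
    cluster_labels = pd.read_csv(labels_source)
    return logs, cluster_labels


# Tiny in-memory sample with the documented columns (all values are made up).
sample_log = io.StringIO(
    "user_id,session_id,action,timestamp,week\n"
    "1,s1,load_video,2022-01-01 10:00:00,0\n"
    "1,s1,problem_check_correct,2022-01-01 10:05:00,0\n"
    "2,s2,play_video,2022-01-02 09:00:00,1\n"
)
sample_labels = io.StringIO("user_id,cluster,gender\n1,0,1\n2,1,0\n")

logs, cluster_labels = load_exam_data(sample_log, sample_labels)
print(logs.shape, cluster_labels.shape)
```

On the real files, calling `load_exam_data()` without arguments would read from the `data/` directory.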

In [2]:
### YOUR CODE HERE ###

2. From the `logs` pandas dataframe, compute the percentage of correctly answered problems per user, taking into account only the first three weeks (week 0, week 1, and week 2).
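
A minimal sketch of this computation follows. It assumes that "percentage of correct problems" means the share of checked answers (`problem_check_correct` plus `problem_check_incorrect`) that were correct; other readings are possible. The function name `percentage_correct` and the toy log are hypothetical.

```python
import pandas as pd


def percentage_correct(logs, max_week=2):
    """Per-user percentage of correct problem checks in weeks 0..max_week."""
    early = logs[logs["week"] <= max_week]
    checks = early[early["action"].isin(
        ["problem_check_correct", "problem_check_incorrect"])]
    # 1.0 for a correct check, 0.0 for an incorrect one.
    is_correct = checks["action"].eq("problem_check_correct").astype(float)
    pct = is_correct.groupby(checks["user_id"]).mean() * 100
    # Users without any checked problem in the window get NaN (undefined ratio).
    all_users = pd.Index(logs["user_id"].unique(), name="user_id")
    return pct.reindex(all_users).rename("pct_correct")


# Toy log (made up): user 3 never checks a problem, user 2's correct
# answer happens in week 4 and is therefore outside the window.
toy_logs = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2, 3],
    "action": ["problem_check_correct", "problem_check_incorrect",
               "problem_check_correct", "problem_check_incorrect",
               "problem_check_correct", "play_video"],
    "week": [0, 1, 2, 0, 4, 1],
})
pct_correct = percentage_correct(toy_logs)
print(pct_correct)
```

Leaving the value as NaN for users with no checked problems keeps "no attempts" distinguishable from "all attempts wrong"; how to handle it later (imputation, an extra indicator feature) is a modelling choice.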


In [4]:
### YOUR CODE HERE ###

3. Compute two other interesting features.
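
As an illustration only, the sketch below computes two candidate features from the first three weeks: the number of distinct sessions and the number of video-related actions per user. These choices, the function name `early_activity_features`, and the toy log are assumptions rather than the expected answer; many other features would be equally defensible.

```python
import pandas as pd


def early_activity_features(logs, max_week=2):
    """Per-user session count and video-action count in weeks 0..max_week."""
    early = logs[logs["week"] <= max_week]
    n_sessions = (early.groupby("user_id")["session_id"]
                  .nunique().rename("n_sessions"))
    # All video actions in the log description end with "_video".
    is_video = early["action"].str.endswith("_video")
    n_video_actions = (is_video.groupby(early["user_id"])
                       .sum().rename("n_video_actions"))
    features = pd.concat([n_sessions, n_video_actions], axis=1)
    # Users with no activity in the window get zero for both counts.
    all_users = pd.Index(logs["user_id"].unique(), name="user_id")
    return features.reindex(all_users, fill_value=0).astype(int)


# Toy log (made up): user 3 is only active in week 4.
toy_logs = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2, 3],
    "session_id": ["a", "a", "b", "c", "c", "d"],
    "action": ["load_video", "play_video", "problem_get",
               "play_video", "create_thread", "play_video"],
    "week": [0, 0, 1, 2, 2, 4],
})
activity = early_activity_features(toy_logs)
print(activity)
```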

In [5]:
### YOUR CODE HERE ###

4. **/Discuss:/** Why did you add these features? Why do you think they can help in modelling users' behavior with respect to dropout? You are expected to state a hypothesis in text.

> YOUR DISCUSSION HERE

### > A2. Classify users' behavior for personalized early intervention

You will train a classifier that predicts a student's cluster label early (at week 3), based on the three features you computed in A1.


1. Perform a train-test split on the combined features `X` and the corresponding cluster labels `y`, using 80\% of the users as the train set and the remaining users as the test set. Use `random_state=0`. Save the train data in `X_train` (train features) and `y_train` (train cluster labels) and the test data in `X_test` (test features) and `y_test` (test cluster labels).
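
The split can be sketched with scikit-learn's `train_test_split`. The feature table below is synthetic, and its column names are placeholders for whatever three features were computed in A1.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real feature table (one row per user).
rng = np.random.default_rng(0)
n_users = 50
X = pd.DataFrame({
    "pct_correct": rng.uniform(0, 100, n_users),
    "n_sessions": rng.integers(1, 20, n_users),
    "n_video_actions": rng.integers(0, 200, n_users),
})
y = pd.Series(rng.integers(0, 3, n_users), name="cluster")

# 80% of the users for training, the remaining 20% for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)
print(X_train.shape, X_test.shape)
```

Passing `stratify=y` would keep the cluster proportions similar in both sets; the task does not ask for it, so it is left out here.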

In [7]:
### YOUR CELLS AND CODE HERE ###

2. Train a Random Forest classifier, exploring the following hyperparameters:
* Number of estimators: [10, 100, 500]
* Maximum depth: [2, 5, 10, None]

Please note that you need to choose an appropriate way of transforming the features so that they can be correctly fed into the classifier. Use `random_state=0`.
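
One way to explore this grid is a cross-validated grid search over a pipeline, sketched below. Several choices here are assumptions rather than requirements of the task: median imputation as the feature transformation (useful if some features are undefined for inactive users), 3-fold cross-validation, and balanced accuracy as the selection metric. The helper name `tune_random_forest` is hypothetical, and the demo uses synthetic data with a reduced grid so that it runs quickly.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# The grid from the task; the "forest__" prefix targets the pipeline step.
EXAM_GRID = {
    "forest__n_estimators": [10, 100, 500],
    "forest__max_depth": [2, 5, 10, None],
}


def tune_random_forest(X_train, y_train, param_grid=None, cv=3):
    """Grid-search a Random Forest behind a median imputer (assumed setup)."""
    pipeline = Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("forest", RandomForestClassifier(random_state=0)),
    ])
    search = GridSearchCV(pipeline, param_grid or EXAM_GRID,
                          cv=cv, scoring="balanced_accuracy")
    search.fit(X_train, y_train)
    return search


# Quick demo on synthetic data with a reduced grid (values are made up).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(60, 3))
y_demo = (X_demo[:, 1] > 0).astype(int)
X_demo[::10, 0] = np.nan  # simulate users with an undefined feature
small_grid = {"forest__n_estimators": [10], "forest__max_depth": [2, None]}
search = tune_random_forest(X_demo, y_demo, param_grid=small_grid)
print(search.best_params_)
```

Calling `tune_random_forest(X_train, y_train)` without a grid would use the full grid from the task. Tree-based models do not need feature scaling, which is why the pipeline only imputes.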

In [12]:
### YOUR CELLS AND CODE HERE ###

3. Report the performance of the tuned Random Forest classifier on `X_test` (test features) and `y_test` (test cluster labels), and create a heatmap of the confusion matrix for this classifier.
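
A sketch of the evaluation step follows, using made-up labels and predictions in place of `y_test` and the classifier's output. Balanced accuracy is one reasonable metric when clusters differ in size; the task does not prescribe a metric, so this is an assumption. The helper name `plot_confusion_heatmap` is hypothetical.

```python
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
from sklearn.metrics import balanced_accuracy_score, confusion_matrix


def plot_confusion_heatmap(y_true, y_pred, labels):
    """Draw an annotated confusion-matrix heatmap and return (matrix, axes)."""
    matrix = confusion_matrix(y_true, y_pred, labels=labels)
    fig, ax = plt.subplots(figsize=(4, 3))
    sns.heatmap(matrix, annot=True, fmt="d", cmap="Blues",
                xticklabels=labels, yticklabels=labels, ax=ax)
    ax.set_xlabel("Predicted cluster")
    ax.set_ylabel("True cluster")
    ax.set_title("Confusion matrix on the test set")
    return matrix, ax


# Made-up labels standing in for y_test and the model's predictions.
y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 1, 1, 1, 2, 0])
matrix, ax = plot_confusion_heatmap(y_true, y_pred, labels=[0, 1, 2])
balanced_accuracy = balanced_accuracy_score(y_true, y_pred)
print("Balanced accuracy:", balanced_accuracy)
```

Rows of the matrix correspond to true clusters and columns to predicted clusters, which is why the axis labels are set explicitly.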


In [14]:
### YOUR CELLS AND CODE HERE ###

4. **/Discuss:/** Are you satisfied with the model performance? How does the predictive performance vary across cluster labels? Do you think that this model could be applied for early intervention in the real world? Then discuss the performance of the model by comparing it to a random prediction baseline and a majority class prediction baseline.
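
The two baselines can be obtained with scikit-learn's `DummyClassifier`, as in the sketch below. The helper name `baseline_scores` and the toy labels are hypothetical; the features are ignored by both baselines and are only passed to satisfy the estimator interface.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, balanced_accuracy_score


def baseline_scores(X_train, y_train, X_test, y_test):
    """Score a majority-class and a uniformly random baseline on the test set."""
    scores = {}
    for name, strategy in [("majority", "most_frequent"),
                           ("random", "uniform")]:
        dummy = DummyClassifier(strategy=strategy, random_state=0)
        dummy.fit(X_train, y_train)
        y_pred = dummy.predict(X_test)
        scores[name] = {
            "accuracy": accuracy_score(y_test, y_pred),
            "balanced_accuracy": balanced_accuracy_score(y_test, y_pred),
        }
    return scores


# Toy data (made up): class 0 is the majority in the training labels.
X_train = np.zeros((8, 1))
y_train = np.array([0, 0, 0, 0, 0, 0, 1, 1])
X_test = np.zeros((4, 1))
y_test = np.array([0, 0, 0, 1])
scores = baseline_scores(X_train, y_train, X_test, y_test)
print(scores)
```

In this toy case the majority baseline reaches a high plain accuracy but only chance-level balanced accuracy, which illustrates why comparing against both baselines and both metrics can be informative.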

> YOUR DISCUSSION HERE

5. The school principal is worried that your model is better for males than for females. Is this true? Support your claim.
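
One way to support such a claim is to compute the same metric separately for each gender group, as sketched below. The sketch assumes that a `gender` array aligned row by row with the test users is available (for example, looked up by `user_id` in `cluster_labels.csv`), and it uses the encoding from the data description (1 for female, 0 for male). The helper name `score_by_gender` and the toy values are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import balanced_accuracy_score


def score_by_gender(y_true, y_pred, gender):
    """Balanced accuracy and group size per gender value."""
    # np.asarray drops any pandas index so rows are matched by position.
    results = pd.DataFrame({
        "y_true": np.asarray(y_true),
        "y_pred": np.asarray(y_pred),
        "gender": np.asarray(gender),
    })
    rows = {}
    for value, group in results.groupby("gender"):
        rows[value] = {
            "n_users": len(group),
            "balanced_accuracy": balanced_accuracy_score(
                group["y_true"], group["y_pred"]),
        }
    return pd.DataFrame.from_dict(rows, orient="index")


# Toy values (made up): the model makes one mistake, on a female user.
gender = [0, 0, 0, 0, 1, 1, 1, 1]
y_true = [0, 1, 0, 1, 0, 1, 0, 1]
y_pred = [0, 1, 0, 1, 0, 0, 0, 1]
by_gender = score_by_gender(y_true, y_pred, gender)
print(by_gender)
```

Reporting the group sizes next to the scores matters: with few users in a group, a gap between groups may simply be noise.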

> YOUR DISCUSSION HERE