# Welcome to the MLBD final exam (Spring 2022)

The exam questions are contained in this Jupyter Notebook. The `data` folder contains the data. 

The logistical details, rules, and guidelines pertaining to the exam are stated below.   

### Timeline and Submission
**Exam date:** July 6, 2022   
**Exam start:** 15h15  
**Exam end:** 18h15

### Instructions
This exam consists of two parts: a Moodle quiz with conceptual questions and programming exercises in this notebook. **The Moodle quiz with the conceptual questions closes at 16h15, so please answer the conceptual questions within the first hour of the exam. To submit this notebook for the coding questions, upload it to Moodle by 18h15 at the latest.**

In case of issues with Moodle, send your file named as "SCIPER_Firstname_Lastname.ipynb" via email to paola.mejia@epfl.ch, subject "[MLBD] Exam notebook".

### Rules

1. You are allowed to use any environment. We recommend EPFL's Noto environment, accessible at [https://noto.epfl.ch/](https://noto.epfl.ch/). The default Noto installation includes a Python environment with all the packages you might need for the exam. If you want to use additional packages, feel free to install and use them in a virtual environment. In that case, it is your responsibility to make sure that your environment is functional and that your results can be properly interpreted for grading.


2. Please write all your comments in English, and use meaningful variable names in your code.

3. When asked for plots, please include all the needed decorations: namely title, x/y-axis labels, appropriate x/y-ticks, legend, and so on. 

4. We will grade your notebook as is, which means that only the results shown in your evaluated code cells will be considered. Please be sure to submit a **fully run and evaluated notebook**. We will not run the notebook for you. Interactive plots, such as those generated using `plotly`, should be **strictly avoided**.

5. You can use all the online resources (including the code from the demo notebooks from the course) you want except for communication tools (emails, web chats, forums, phone, etc.). Remember, this is not a project assignment. Therefore, no teamwork is allowed.
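
As an illustration of rule 3, the following is a minimal sketch of a static `matplotlib` plot that carries a title, axis labels, explicit ticks, and a legend. The data and the group names are made up for the example and have nothing to do with the exam data.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic values, used only to illustrate the plot decorations
weeks = np.arange(6)
points_group_a = np.array([10, 12, 9, 14, 13, 15])
points_group_b = np.array([8, 9, 11, 10, 12, 11])

fig, ax = plt.subplots(figsize=(6, 4))
ax.plot(weeks, points_group_a, marker="o", label="Group A")
ax.plot(weeks, points_group_b, marker="s", label="Group B")

# Decorations: title, axis labels, ticks, and legend
ax.set_title("Weekly points per group (synthetic data)")
ax.set_xlabel("Week")
ax.set_ylabel("Total points")
ax.set_xticks(weeks)
ax.legend(title="Group")
fig.tight_layout()
```

In a notebook, the figure is rendered as a static image when the cell finishes, which is the kind of output that can be graded without re-running the notebook.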

### Setup
We intend this notebook to be completed on EPFL's Noto environment. As in past lecture exercises, you will need to use the `Tensorflow` kernel for the dependencies to be installed appropriately. Change the kernel in the upper right corner of Noto. Select `Tensorflow`.


In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
import tensorflow as tf
import seaborn as sns
from scipy import linalg

from sklearn.preprocessing import StandardScaler, OneHotEncoder, MinMaxScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import adjusted_mutual_info_score
from sklearn.linear_model import Lasso
from sklearn.feature_selection import SelectFromModel, mutual_info_classif
from sklearn import model_selection
from sklearn.metrics import roc_auc_score, balanced_accuracy_score, confusion_matrix, ConfusionMatrixDisplay, silhouette_score
from sklearn.model_selection import train_test_split

from sklearn.neighbors import kneighbors_graph
from sklearn.metrics.pairwise import pairwise_kernels
from sklearn.manifold import spectral_embedding
from scipy.sparse.csgraph import laplacian
from sklearn.cluster import KMeans
from scipy.spatial.distance import pdist, squareform

# Question 1 (15 points)
You are the Senior Data Scientist at a learning platform called LernTime. You have realized that many users stop using the platform and want to increase user retention. For this purpose, you decide to build a model to predict whether a student will stop using the learning platform or not.

Your data science team built a data frame in which each row contains the aggregated features per student (calculated over the first 5 weeks of interactions) and the feature `dropout` indicates whether the student stopped using the platform (1) or not (0) before week 10.

The dataframe is in the file `data/lerntime_dropout.csv` and contains the following features:
- `video_time`: total video time (in minutes) 
- `num_sessions`: total number of sessions
- `num_quizzes`: total number of quiz attempts
- `reading_time`: total theory reading time
- `previous_knowledge`: standardized previous knowledge
- `browser_speed`: standardized browser speed
- `device`:  whether the student logged in using a smartphone (1) or a computer (-1)
- `topics`: the topics covered by the user
- `education`: current level of education (0: middle school, 1: high school, 2: bachelor, 3: master, 4: Ph.D.).
- `dropout`: whether the student stopped using the platform (1) or not (0) before week 10.

The newest data scientist created two models with excellent performance. As a Senior Data Scientist, you are suspicious of the results and decide to review the code.

Your task is to:

a) Identify the mistakes. In the first cell, add a comment above each line in which you identify an error and explain the error.

b) In the second cell, you must correct the code.

In [8]:
df = pd.read_csv('data/lerntime_dropout.csv')

y = df['dropout']
X = df[['video_time', 'num_sessions', 'num_quizzes', 'reading_time',
       'previous_knowledge', 'browser_speed']]

### a) Identify the mistakes in the code (10 points)
In the following cell, add a comment above each line in which you identify an error and explain why it is erroneous.
Please start each of your comments with `#ERROR:`. For example:

`#ERROR: the RMSE of the model is printed instead of the AUC`

`print("The AUC of the model is: {}".format(rmse))          `

You may assume that: 
- all the features are continuous and numerical. 
- the features have already been cleaned and processed. 

In [9]:
## 1. Scale the features
scaler = StandardScaler()
X = scaler.fit_transform(X)

## 2. Feature selection (Lasso)
print(X.shape)
lasso = Lasso(alpha=0.1, random_state=0).fit(X, y)
selector = SelectFromModel(lasso, prefit = True)
X = selector.transform(X)
print(X.shape)

## 3. Split the data 
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state=0)

## Model 1
clf = RandomForestClassifier(n_estimators=10, max_depth=2, random_state=0)
clf.fit(X,y)
preds = clf.predict(X_test)
print("Score model 1: {}".format(np.round(adjusted_mutual_info_score(preds, y_test), 2)))

## Model 2
clf = RandomForestClassifier(n_estimators=1000, max_depth=None, random_state=0)
clf.fit(X,y)
preds = clf.predict(X_test)
print("Score model 2: {}".format(np.round(adjusted_mutual_info_score(preds, y_test), 2)))

## Discussion
# Our second model achieved perfect results with unseen data and outperforms the first model.
## This is because we increased the number of estimators.

(300, 6)
(300, 3)
Score model 1: 0.31
Score model 2: 1.0


### b) Correct the code (5 points)
Correct all the erroneous code in the following cell.

In [10]:
y = df['dropout']
X = df[['video_time', 'num_sessions', 'num_quizzes', 'reading_time',
       'previous_knowledge', 'browser_speed']]
## 1. Scale the features
scaler = StandardScaler()
X = scaler.fit_transform(X)

## 2. Feature selection (Lasso)
print(X.shape)
lasso = Lasso(alpha=0.1, random_state=0).fit(X, y)
selector = SelectFromModel(lasso, prefit = True)
X = selector.transform(X)
print(X.shape)

## 3. Split the data 
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state=0)

## Model 1
clf = RandomForestClassifier(n_estimators=10, max_depth=2, random_state=0)
clf.fit(X,y)
preds = clf.predict(X_test)
print("Score model 1: {}".format(np.round(adjusted_mutual_info_score(preds, y_test), 2)))

## Model 2
clf = RandomForestClassifier(n_estimators=1000, max_depth=None, random_state=0)
clf.fit(X,y)
preds = clf.predict(X_test)
print("Score model 2: {}".format(np.round(adjusted_mutual_info_score(preds, y_test), 2)))

## Discussion
# Our second model achieved perfect results with unseen data and outperforms the first model.
## This is because we increased the number of estimators.

(300, 6)
(300, 3)
Score model 1: 0.31
Score model 2: 1.0


# Question 2 (5 points)
You decide to explore the data further. You are especially interested in the two features `device` and `education` and decide to explore the relationship between them.  

What is the relationship between the two features `device` and `education`? Support your answer with informative metrics. 

In [None]:
# YOUR CODE FOR DATA EXPLORATION AND INFORMATIVE METRICS HERE

> YOUR DISCUSSION HERE

# Question 3 (40 points)

After having looked at the features in more detail, you decide to explore the different types of users. You want to use your knowledge from your MLBD course and decide to cluster using Spectral Clustering. In the course, you learnt different ways of constructing the similarity graph, which yields the adjacency matrix that serves as input to Spectral Clustering. Based on your in-depth exploration of the data, you decide to construct the similarity graph as a *k-nearest neighbor graph*.

Your tasks are to:

a) Write a function to compute the k-nearest neighbor graph.

b) Cluster the users using Spectral Clustering and your k-nearest neighbor graph function (use 4 neighbors). Use only the features `reading_time` and `topics`. You can assume that the optimal number of clusters is 2.

c) Discuss the fairness of the obtained cluster solution regarding the level of education (`education`).

## a) Computation of the k-nearest neighbor graph (17 points)
Unfortunately, scikit-learn offers no function that builds a k-nearest neighbor graph directly from a similarity matrix, so you have to implement the function yourself. The function `k_nearest_neighbor_graph` takes a similarity matrix `S` and the number of neighbors `k` as input and returns the adjacency matrix `W`.

Note that we will not evaluate the coding efficiency of your function. 
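
For orientation, here is a minimal sketch of one way such a graph can be built with NumPy. It is an illustration under stated assumptions, not a reference solution: the name `knn_graph_sketch` is hypothetical, a point is not counted as its own neighbor, edges keep the similarity as weight, and the graph is symmetrized by keeping an edge whenever either endpoint selects the other. A mutual k-nearest neighbor graph, which keeps an edge only if both endpoints select each other, would be an equally valid convention.

```python
import numpy as np

def knn_graph_sketch(S, k):
    """Build a symmetric k-nearest neighbor adjacency matrix from similarities.

    Assumptions (not prescribed by the exam): self-similarity is ignored,
    edge weights equal the similarities, and an edge is kept if at least
    one of its endpoints has the other among its k most similar points.
    """
    S = np.asarray(S, dtype=float)
    n = S.shape[0]
    W = np.zeros((n, n))
    for i in range(n):
        sims = S[i].copy()
        sims[i] = -np.inf                       # never pick the point itself
        neighbors = np.argsort(sims)[::-1][:k]  # indices of the k largest similarities
        W[i, neighbors] = S[i, neighbors]
    # Symmetrize: keep the edge if either direction was selected
    return np.maximum(W, W.T)

S_example = [[1, 0.2, 0.7, 0.1],
             [0.2, 1, 0.8, 0.4],
             [0.7, 0.8, 1, 0.6],
             [0.1, 0.4, 0.6, 1]]
W_example = knn_graph_sketch(S_example, k=2)
```

With this convention the result is symmetric with a zero diagonal, and points 0 and 3 stay unconnected because neither is among the two most similar points of the other.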

In [None]:
def k_nearest_neighbor_graph(S, k):
    # S: similarity matrix
    # k: number of neighbors
    # YOUR CODE HERE
    W = None # CHANGE THIS
    
    return W

In [None]:
k = 2
# Please run this cell for evaluation purposes
S = [[1, 0.2, 0.7, 0.1],
     [0.2, 1, 0.8, 0.4],
     [0.7, 0.8, 1, 0.6],
     [0.1, 0.4, 0.6, 1]]

k_nearest_neighbor_graph(S, k)

In [None]:
# Please run this cell for evaluation purposes
S = [[1, 0.3, 0.01, 0.1],
     [0.3, 1, 0.8, 0.9],
     [0.01, 0.8, 1, 0.6],
     [0.1, 0.9, 0.6, 1]]

k_nearest_neighbor_graph(S, k)

## b) Spectral Clustering (15 points)
Perform a spectral clustering using a k-nearest neighbor graph (with 4 neighbors). Use the two features `reading_time` and `topics` only. If you did not manage to solve task 3a), use a *fully connected graph* as similarity graph to obtain the adjacency matrix `W`. You can assume that the optimal number of clusters is 2. Print the obtained cluster labels. 
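
The mechanics of spectral clustering for two clusters can be sketched as follows. This is a simplified illustration, not the course implementation: it uses the unnormalized Laplacian `L = D - W` and splits the points by the sign of the eigenvector belonging to the second-smallest eigenvalue (the Fiedler vector), whereas running k-means on the leading eigenvectors is a common alternative. The function name `spectral_bipartition` and the toy graph are hypothetical, and the sketch assumes a connected graph.

```python
import numpy as np

def spectral_bipartition(W):
    """Split a connected graph into two clusters via the Fiedler vector."""
    W = np.asarray(W, dtype=float)
    degrees = W.sum(axis=1)
    laplacian_matrix = np.diag(degrees) - W      # unnormalized Laplacian L = D - W
    # eigh returns eigenvalues in ascending order for symmetric matrices
    eigenvalues, eigenvectors = np.linalg.eigh(laplacian_matrix)
    fiedler_vector = eigenvectors[:, 1]          # second-smallest eigenvalue
    return (fiedler_vector > 0).astype(int)

# Toy graph: two tightly connected triangles joined by one weak edge
W_toy = np.array([[0, 1, 1, 0,    0, 0],
                  [1, 0, 1, 0,    0, 0],
                  [1, 1, 0, 0.01, 0, 0],
                  [0, 0, 0.01, 0, 1, 1],
                  [0, 0, 0, 1,    0, 1],
                  [0, 0, 0, 1,    1, 0]], dtype=float)
toy_labels = spectral_bipartition(W_toy)
print(toy_labels)
```

Which triangle receives label 0 and which receives label 1 is arbitrary, because the sign of an eigenvector is not determined.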

In [None]:
# YOUR CODE FOR PERFORMING SPECTRAL CLUSTERING HERE

## c) Fairness of clustering solution (8 points)
Some students approach you and say that your clustering algorithm is not fair with respect to the education level (specified in the feature `education`). You therefore decide to investigate the fairness of the obtained clustering solution. To do so, you choose an appropriate fairness metric, implement it, and apply it to compute the fairness of your clustering solution. You further decide to visualize the obtained results in an informative manner, as a basis for your discussion with the students.

Choose an appropriate fairness metric, implement it, and apply it to compute the fairness of your clustering solution. Justify your choice of metric and discuss your results.
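
To make the idea of a clustering fairness metric concrete, here is a sketch of one candidate, a balance-style measure. It compares the share of each protected group inside each cluster with that group's share in the whole data set and reports the worst ratio, so 1.0 means every cluster mirrors the overall composition and 0.0 means some group is absent from some cluster. The function name `cluster_balance` and the toy inputs are hypothetical, and whether this is the most appropriate metric for the task is for you to argue.

```python
import numpy as np

def cluster_balance(cluster_labels, group_labels):
    """Worst-case ratio between in-cluster and overall group proportions."""
    cluster_labels = np.asarray(cluster_labels)
    group_labels = np.asarray(group_labels)
    balance = 1.0
    for group in np.unique(group_labels):
        overall_share = np.mean(group_labels == group)
        for cluster in np.unique(cluster_labels):
            members = group_labels[cluster_labels == cluster]
            cluster_share = np.mean(members == group)
            if cluster_share == 0:
                return 0.0  # the group is missing from this cluster
            balance = min(balance,
                          overall_share / cluster_share,
                          cluster_share / overall_share)
    return balance

# Toy examples with two clusters and two groups
balanced = cluster_balance([0, 0, 1, 1], [0, 1, 0, 1])
segregated = cluster_balance([0, 0, 1, 1], [0, 0, 1, 1])
```

In the first toy example both clusters contain the two groups in equal shares, while in the second each cluster contains only one group.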

In [None]:
# YOUR CODE FOR THE FAIRNESS ANALYSIS HERE

Justify your choice of metric.

> YOUR DISCUSSION HERE

Visualize your results in an informative manner.

In [None]:
# YOUR CODE FOR THE FAIRNESS VISUALIZATION HERE

Is your clustering solution fair? If yes, why? If not, why not?

> YOUR DISCUSSION HERE

# Question 4 (30 points)
To improve course quality, the CEO of LernTime decides to adapt the difficulty level of the tasks to the knowledge of the students. She asks you to develop a type of knowledge tracing model able to predict the number of points a student will get on the next problem, based on the observed performance (in terms of points) on all the past problems.
You are provided with an example data set from a mathematics course, containing the following columns: 

| Name                   | Description                         |
| ---------------------- | ------------------------------------------------------------ |
| user_id | The ID of the student who is solving the problem. |
| order_id | The temporal ID (timestamp) associated with the student's answer to the problem. |
| relative_week | The week # since the student's first interaction with the platform. |
| problem_id | The ID associated with the problem. |
| score | The student's performance on the problem in terms of obtained points. The maximum number of points is 10 and the minimum number of points is 0. |

You decide to use a Deep Knowledge Tracing model (with an LSTM layer). Unfortunately, you cannot directly use a standard DKT architecture as:

1. Your data does not have skill names or IDs available, so you will have to modify DKT to predict based on problem IDs.

2. Instead of predicting a binary outcome (right/wrong), your goal is to predict the number of obtained points (score).

Your tasks therefore are to:

a) Implement and evaluate an adjusted version of a DKT model able to predict the number of points a student will obtain on a problem.

b) Justify and discuss all your design choices.

In [11]:
student_df = pd.read_csv('data/lerntime_kt.csv')
student_df

Unnamed: 0,user_id,order_id,relative_week,problem_id,score
0,94400,163987467,0,6473,10.0
1,94400,164499411,5,11893,10.0
2,94987,172053491,0,37570,10.0
3,94987,172092447,0,7195,10.0
4,94987,172243850,2,5945,10.0
...,...,...,...,...,...
31653,361511,175196995,1,164496,10.0
31654,361770,175148600,0,37570,6.5
31655,361770,175280418,1,39162,0.0
31656,361771,175138828,0,39162,10.0


### a) Implementation of (adjusted) DKT model (20 points)

Fortunately, you already have code available (below) for properly training and evaluating a standard DKT model. **Modify** this code to be able to predict the number of points a student will get on the next problem, based on the observed performance (in terms of points) on all the past problems. Note that the code in its current format **will not run properly** due to the two differences mentioned above: the data set at hand does not have skill names available and your model needs to predict the total number of obtained points for a problem instead of a binary outcome.

Train your model for 10 epochs, and use the best model callback to find the optimal model. Evaluate your model using **appropriate performance metric(s)** - please print your metric(s). You do not need to tune the hyperparameters of your model for this task; instead use the following settings (already provided in the code below):

```
params['batch_size'] = 32
params['mask_value'] = -1.0
params['verbose'] = 1
params['best_model_weights'] = 'weights/bestmodel' 
params['optimizer'] = 'adam'
params['recurrent_units'] = 32
params['epochs'] = 10
params['dropout_rate'] = 0.1
```
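
Since the target is now a score between 0 and 10 instead of a binary label, regression errors such as RMSE or MAE are natural candidates for evaluation, and padded time steps should not contribute to them. The following NumPy sketch shows that masking idea in isolation. It is a simplification of what the TensorFlow code has to do: the function name `masked_regression_errors` is hypothetical, and the arrays hold only scores, without the one-hot part that the DKT targets carry.

```python
import numpy as np

MASK_VALUE = -1.0  # same padding value as params['mask_value']

def masked_regression_errors(y_true, y_pred, mask_value=MASK_VALUE):
    """Compute RMSE and MAE over the non-padded positions only."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    valid = y_true != mask_value          # real scores lie in [0, 10], never -1
    errors = y_pred[valid] - y_true[valid]
    rmse = np.sqrt(np.mean(errors ** 2))
    mae = np.mean(np.abs(errors))
    return rmse, mae

# Two toy sequences padded to length 3 with the mask value
toy_true = np.array([[10.0, 0.0, -1.0],
                     [6.5, -1.0, -1.0]])
toy_pred = np.array([[8.0, 1.0, 5.0],
                     [6.5, 3.0, 2.0]])
toy_rmse, toy_mae = masked_regression_errors(toy_true, toy_pred)
print(toy_rmse, toy_mae)
```

Only the three real time steps enter the computation here. The predictions at padded positions are ignored, however large they are.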

In [None]:
# MODIFY THE CODE BELOW TO ADJUST THE DKT IMPLEMENTATION

In [None]:
# Function for splitting the data into a training and test set
def create_iterator(data):
    '''
    Create an iterator to split interactions in data into train and test, with the same student not appearing in two different folds.
    :param data:        Dataframe with student's interactions.
    :return:            An iterator.
    '''    
    # Both passing a matrix with the raw data or just an array of indexes works
    X = np.arange(len(data.index))
    # Groups of interactions are identified by the user id (we do not want the same user appearing in two folds)
    groups = data['user_id'].values 
    return model_selection.GroupShuffleSplit(n_splits=1, train_size=.8, test_size=0.2, random_state=0).split(X, groups=groups)

In [None]:
# Hyperparameters are fixed!
params = {}
params['batch_size'] = 32
params['mask_value'] = -1.0
params['verbose'] = 1
params['best_model_weights'] = 'weights/bestmodel' 
params['optimizer'] = 'adam'
params['recurrent_units'] = 32
params['epochs'] = 10
params['dropout_rate'] = 0.1

In [None]:
# Functions for building the Tensorflow input sequences for the model
def prepare_seq(df):
    '''
    Extract user_id sequence in preparation for DKT. The output of this function 
    feeds into the prepare_data() function. 
    '''
    # Enumerate skill id as a categorical variable 
    # (i.e. [32, 12, 32, 45] -> [0, 1, 0, 2])
    df['skill'], skill_codes = pd.factorize(df['skill_name'], sort=True)

    # Cross skill id with answer to form a synthetic feature
    df['skill_with_answer'] = df['skill'] * 2 + df['correct']

    # Convert to a sequence per user_id and shift features 1 timestep
    seq = df.groupby('user_id').apply(lambda r: (r['skill_with_answer'].values[:-1], r['skill'].values[1:], r['correct'].values[1:],))
    
    # Get max skill depth and max feature depth
    # (factorized codes start at 0, so the number of distinct skills is max + 1)
    skill_depth = df['skill'].max() + 1
    features_depth = df['skill_with_answer'].max() + 1

    return seq, int(features_depth), int(skill_depth)

def prepare_data(seq, params, features_depth, skill_depth):
    '''
    Manipulate the data sequences into the right format for DKT with padding by batch
    and encoding categorical features.
    '''
    
    # Get Tensorflow Dataset
    dataset = tf.data.Dataset.from_generator(generator=lambda: seq, output_types=(tf.int32, tf.int32, tf.float32))

    # Encode categorical features and merge skills with labels to compute target loss
    dataset = dataset.map(
        lambda feat, skill, label: (
            tf.one_hot(feat, depth=features_depth),
            tf.concat(values=[tf.one_hot(skill, depth=skill_depth), tf.expand_dims(label, -1)], axis=-1)
        )
    )

    # Pad sequences to the appropriate length per batch
    dataset = dataset.padded_batch(
        batch_size=params['batch_size'],
        padding_values=(params['mask_value'], params['mask_value']),
        padded_shapes=([None, None], [None, None]),
        drop_remainder=True
    )

    return dataset.repeat(), len(seq)

In [None]:
# Function for getting the Tensorflow output sequences for the model
def get_target(y_true, y_pred, mask_value=params['mask_value']):
    ''' 
    Adjust y_true and y_pred to ignore predictions made using padded values.
    '''
    # Get skills and labels from y_true
    mask = 1. - tf.cast(tf.equal(y_true, mask_value), y_true.dtype)
    y_true = y_true * mask

    skills, y_true = tf.split(y_true, num_or_size_splits=[-1, 1], axis=-1)

    # Get predictions for each skill
    y_pred = tf.reduce_sum(y_pred * skills, axis=-1, keepdims=True)

    return y_true, y_pred

In [None]:
# Obtain indexes for training and test sets
train_index, test_index = next(create_iterator(student_df))

# Split the data into training and test
X_train, X_test = student_df.iloc[train_index], student_df.iloc[test_index]

# Obtain indexes for training and validation sets
train_val_index, val_index = next(create_iterator(X_train))

# Split the training data into training and validation
X_train_val, X_val = X_train.iloc[train_val_index], X_train.iloc[val_index]

In [None]:
# Build TensorFlow sequence datasets for training, validation, and test data
seq, features_depth, skill_depth = prepare_seq(student_df)
seq_train = seq[X_train_val.user_id.unique()]
seq_val = seq[X_val.user_id.unique()]
seq_test = seq[X_test.user_id.unique()]

# Prepare the training, validation, and test data in the DKT input format
tf_train, length = prepare_data(seq_train, params, features_depth, skill_depth)
tf_val, val_length  = prepare_data(seq_val, params, features_depth, skill_depth)
tf_test, test_length = prepare_data(seq_test, params, features_depth, skill_depth)

# Calculate the length of each of the train-test-val sets and store as parameters
params['train_size'] = int(length // params['batch_size'])
params['val_size'] = int(val_length // params['batch_size'])
params['test_size'] = int(test_length // params['batch_size'])

In [None]:
# Custom metrics for training and testing
class AUC(tf.keras.metrics.AUC):
    # Our custom AUC calls our get_target function first to remove predictions on padded values, 
    # then computes a standard AUC metric.
    def __init__(self):
        # We use a super constructor here just to make our metric name pretty!
        super(AUC, self).__init__(name='auc')

    def update_state(self, y_true, y_pred, sample_weight=None):
        true, pred = get_target(y_true, y_pred)
        super(AUC, self).update_state(y_true=true, y_pred=pred, sample_weight=sample_weight)

def CustomBinaryCrossEntropy(y_true, y_pred): 
    # Our custom binary cross entropy loss calls our get_target function first 
    # to remove predictions on padded values, then computes standard binary cross-entropy.
    y_true, y_pred = get_target(y_true, y_pred)
    return tf.keras.losses.binary_crossentropy(y_true, y_pred)

In [None]:
# Function for creating the model itself
def create_model_lstm(nb_features, nb_skills, params):
    
    # Create an LSTM model architecture
    inputs = tf.keras.Input(shape=(None, nb_features), name='inputs')

    # We use a masking layer here to ignore our masked padding values
    x = tf.keras.layers.Masking(mask_value=params['mask_value'])(inputs)

    # This LSTM layer is the crux of the model; we use our parameters to specify
    # what this layer should look like (# of recurrent_units, fraction of dropout).
    x = tf.keras.layers.LSTM(params['recurrent_units'], return_sequences=True, dropout=params['dropout_rate'])(x)
    
    # We use a dense layer with the sigmoid function activation to map our predictions 
    # between 0 and 1.
    dense = tf.keras.layers.Dense(nb_skills, activation='sigmoid')

    # The TimeDistributed layer takes the dense layer predictions and applies the sigmoid 
    # activation function to all time steps.
    outputs = tf.keras.layers.TimeDistributed(dense, name='outputs')(x)
    model = tf.keras.models.Model(inputs=inputs, outputs=outputs, name='DKT')

    # Compile the model with our loss functions, optimizer, and metrics.
    model.compile(loss=CustomBinaryCrossEntropy, 
                  optimizer=params['optimizer'], 
                  metrics=[AUC()])
    
    return model

# Create our DKT model using an LSTM
dkt_lstm = create_model_lstm(features_depth, skill_depth, params)

In [None]:
# This line tells our training procedure to only save the best version of the model at any given time.
ckp_callback = tf.keras.callbacks.ModelCheckpoint(params['best_model_weights'], 
                                                  save_best_only=True, save_weights_only=True)

# Let's fit our LSTM model on the training data.
history = dkt_lstm.fit(tf_train, epochs=params['epochs'], steps_per_epoch=params['train_size']-1, 
                       validation_data=tf_val, validation_steps=params['val_size'],
                       callbacks=[ckp_callback], verbose=params['verbose'])

In [None]:
# We load the LSTM model with the best performance, and evaluate it on the test set. 
dkt_lstm.load_weights(params['best_model_weights'])
dkt_lstm.evaluate(tf_test, steps=params['test_size'], verbose=params['verbose'], return_dict=True)

### b) Justification of design choices (10 points)

In your model architecture, how did you construct your model inputs and outputs? Why did you design it this way?

> YOUR DISCUSSION HERE

Which metric(s) did you choose to measure your model performance? Why?

> YOUR DISCUSSION HERE

# Question 5 (30 points)
The CEO of LernTime decides to further improve the quality of the platform. Specifically, she would like to support struggling students early on by offering them targeted interventions and to also provide advanced tasks to excellent students. She asks you to develop a model that is able to identify the very high and very low performers **early on**, i.e. after the **first 6 weeks** of their interactions with a course. For all courses, advanced tasks should be offered to the top 20% of students, while the bottom 20% of students should benefit from interventions. The overall performance of a student at the end of the course is determined by the final exam score at the end of the course, from 0 to 100 in `exam_score`.

Using the dataframe `aggregated_student_df` from the already familiar mathematics course, your data scientist colleagues have already divided the students into three groups based on the procedure described above:

1. intervene: students who need help (`exam_score` <= 46.5)
2. on-track: students who are on track (`exam_score` > 46.5 and `exam_score` <= 70.5)
3. advanced: exceptional students (`exam_score` > 70.5)

They provide you with the group of each student in the `group` column of the `aggregated_student_df` dataframe, as well as the `exam_score` each student obtained at the end of the course.

Your tasks are to:

a) Implement and evaluate a model (using an LSTM layer) able to predict the *group* (intervene, on-track, advanced) of a student based on their performance during the first six weeks (total points on problems obtained each week: `week_0`, `week_1`, ... `week_5`).

b) Visualize and discuss the performance of your model.

In [12]:
# Dataframe with group labels, aggregated from the original student_df.
# The scores for week_0 through week_5 are the aggregated (summed) points of all 
# the problems the student answered that week.

aggregated_student_df = pd.read_csv('data/lerntime_classification.csv')
aggregated_student_df

Unnamed: 0,user_id,week_0,week_1,week_2,week_3,week_4,week_5,group,exam_score
0,94400,10.0,0.0,0.0,0.0,0.0,10.0,intervene,34.0
1,94987,20.0,0.0,10.0,0.0,0.0,0.0,on-track,63.5
2,95610,20.0,10.0,0.0,0.0,0.0,0.0,advanced,71.0
3,96409,10.0,0.0,0.0,0.0,0.0,20.0,on-track,63.5
4,118571,10.0,0.0,0.0,0.0,0.0,10.0,on-track,51.0
...,...,...,...,...,...,...,...,...,...
5864,361379,10.0,0.0,20.0,0.0,0.0,0.0,on-track,63.5
5865,361510,10.0,10.0,0.0,0.0,0.0,0.0,on-track,51.0
5866,361511,10.0,10.0,0.0,0.0,0.0,0.0,on-track,60.5
5867,361770,6.5,0.0,0.0,0.0,0.0,0.0,intervene,38.5


### a) Implementation of performance prediction model (20 points)
Create a time-series model (using an LSTM layer) able to predict the `group` of a student based on the interactions of the first six weeks. Train your model for 10 epochs, and use the best model callback to find the optimal model. Evaluate your model using an appropriate performance metric (simply print it). Again, you do not need to tune the hyperparameters of your model for this task; instead use the following settings:

```
params['batch_size'] = 32
params['mask_value'] = -1.0
params['verbose'] = 1
params['best_model_weights'] = 'weights/bestmodel' 
params['optimizer'] = 'adam'
params['recurrent_units'] = 32
params['epochs'] = 10
params['dropout_rate'] = 0.1
```

Luckily, one of your colleagues has already implemented a skeleton for the model and therefore, you only need to add your code to this skeleton.
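
A Keras LSTM layer expects inputs of shape `(n_samples, n_timesteps, n_features)`, so the six weekly totals of a student form a sequence of six time steps with one feature each. The sketch below illustrates that reshaping and an integer encoding of the group labels on the first three rows of the dataframe preview. The variable names and the label order in `group_to_index` are assumptions made for the example, and it uses plain NumPy arrays instead of the dataframe.

```python
import numpy as np

# Weekly totals (week_0 ... week_5) of the first three students in the preview
weekly_points = np.array([[10.0,  0.0,  0.0, 0.0, 0.0, 10.0],
                          [20.0,  0.0, 10.0, 0.0, 0.0,  0.0],
                          [20.0, 10.0,  0.0, 0.0, 0.0,  0.0]])
groups = np.array(["intervene", "on-track", "advanced"])

# Arbitrary but fixed mapping from group name to class index (assumption)
group_to_index = {"intervene": 0, "on-track": 1, "advanced": 2}

# Add a trailing feature axis: (n_students, 6) -> (n_students, 6, 1)
x_sequences = weekly_points[..., np.newaxis]
y_classes = np.array([group_to_index[group] for group in groups])

print(x_sequences.shape, y_classes)
```

Integer class labels of this kind pair with a softmax output over three units and a sparse categorical cross-entropy loss. One-hot labels with a categorical cross-entropy loss would be an equivalent choice.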

In [None]:
# Build df_x (the input data) and  df_y (the labels)

# YOUR CODE FOR CONSTRUCTING THE MODEL INPUTS AND OUTPUTS HERE

In [None]:
# Split the data into training and test set
df_x_train, df_x_test, df_y_train, df_y_test = train_test_split(df_x, 
                                                                df_y,
                                                                test_size=0.2, 
                                                                random_state=0, 
                                                                stratify=df_y)

# Split the training data further into training and validation sets.
df_x_train_val, df_x_val, df_y_train_val, df_y_val = train_test_split(df_x_train, 
                                                                      df_y_train, 
                                                                      test_size=0.2,
                                                                      random_state=0, 
                                                                      stratify=df_y_train)

In [None]:
# Hyperparameters are fixed!
params = {}
params['batch_size'] = 32
params['mask_value'] = -1.0
params['verbose'] = 1
params['best_model_weights'] = 'weights/bestmodel' 
params['optimizer'] = 'adam'
params['recurrent_units'] = 32
params['epochs'] = 10
params['dropout_rate'] = 0.1

In [None]:
# Create the LSTM time-series model (use one LSTM layer)

# YOUR CODE FOR BUILDING THE MODEL HERE

In [None]:
# We save only the best model during the training process.
ckp_callback = tf.keras.callbacks.ModelCheckpoint(params['best_model_weights'], 
                                                  save_best_only=True, save_weights_only=True)

# Fit the time-series LSTM on the given data set.
history = time_series_lstm.fit(df_x_train_val, 
                               df_y_train_val, 
                               epochs=params['epochs'],
                               validation_data=(df_x_val, df_y_val),
                               callbacks=[ckp_callback], 
                               verbose=params['verbose'])

In [None]:
# Load the best version of the trained model and compute its predictions
time_series_lstm.load_weights(params['best_model_weights'])
predictions = time_series_lstm.predict(df_x_test)

In [None]:
# Use an appropriate error metric to evaluate the performance of your model and print the error metric

# YOUR CODE FOR EVALUATING MODEL PERFORMANCE HERE

### b) Visualization and discussion (10 points)
How well does your model perform? Provide a visualization to support your argument.
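
Because the groups are imbalanced by construction (roughly 20% / 60% / 20%), plain accuracy can look good for a model that mostly predicts the majority group. A confusion matrix and the balanced accuracy (the mean of the per-class recalls) make this visible. The sketch below computes both with NumPy on made-up class probabilities. The helper names are hypothetical, and `sklearn.metrics.confusion_matrix` and `sklearn.metrics.balanced_accuracy_score` offer the same functionality.

```python
import numpy as np

def confusion_counts(y_true, y_pred, n_classes):
    """Count how often true class t (row) was predicted as class p (column)."""
    matrix = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        matrix[t, p] += 1
    return matrix

def balanced_accuracy(matrix):
    """Mean recall over classes; assumes every class has at least one true sample."""
    recalls = np.diag(matrix) / matrix.sum(axis=1)
    return recalls.mean()

# Made-up softmax outputs for four students and three classes
toy_probabilities = np.array([[0.7, 0.2, 0.1],
                              [0.1, 0.8, 0.1],
                              [0.2, 0.5, 0.3],
                              [0.1, 0.2, 0.7]])
toy_true_classes = np.array([0, 1, 2, 2])
toy_pred_classes = toy_probabilities.argmax(axis=1)  # most probable class per student

toy_matrix = confusion_counts(toy_true_classes, toy_pred_classes, n_classes=3)
toy_balanced_accuracy = balanced_accuracy(toy_matrix)
print(toy_matrix)
print(toy_balanced_accuracy)
```

Plotting such a matrix as an annotated heatmap with labeled axes is one informative, static way to show which groups the model confuses.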

In [None]:
# YOUR CODE FOR THE MODEL PERFORMANCE VISUALIZATION HERE

> YOUR DISCUSSION HERE