# M2 | Exploration Notebook (GoGymi)

In this notebook, you will do a first exploration of the data set that you will use for your project. One part of this exploration is guided, i.e. we will ask you to solve specific questions (Tasks 1-3). The other part is open and requires building a full ML pipeline for a regression or classification task of your choice (Task 4).

For Tasks 1, 2 and 3, a discussion or justification of one or two sentences per question is sufficient. For Task 4, please explain your choices and approach in detail.

Please upload your solved notebook to Moodle (under Milestone 2 Submission) with your SCIPER number in the file name, for example: `m2-gogymi-456392.ipynb`

## Brief Overview of GoGymi

GoGymi is a platform that helps students prepare for the gymnasium exam in Switzerland.

The platform includes four main aspects:

- Personal AI tutor: an always-available chat-based intelligent digital learning companion

- AI-based essay correction: detailed feedback on student-submitted essays for improvement and correction

- Exercises and solutions: prepare for the exam by practicing sample questions from previous years

- Learning analytics: detailed analysis of students' learning activities and their readiness for the exam

For the guided part of the exploration we will focus on two main tables:

- `quiz_results.csv` (corresponds to the math quizzes)

- `essay_results.csv` (corresponds to the essay results)

The data description is available in the data folder shared with you.

## Task 0: Load the Data

In [None]:
# Imports to start with

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

DATA_DIR = '.' # you might need to change the directory

quiz_results = pd.read_csv('{}/quiz_results.csv'.format(DATA_DIR))
essay_results = pd.read_csv('{}/essay_results.csv'.format(DATA_DIR))

## Task 1: Simple Statistics

In this task you are asked to do a first coarse exploration of the data set, using simple statistics and visualizations.

### A) How many distinct users do we have in each of the math and essay results submissions?

In [None]:
# Your code goes here

### B) What is the distribution of the number of words in the essays submitted by the students? Provide a suitable visualization and discuss your findings.

In [None]:
# Your code goes here

### C) How many submission attempts has each user made in the math and essay results datasets? Please provide a visualization (showing the math and essay results next to each other) and discuss the distribution. Which type of plot do you think is the most appropriate for this task, and why?

In [None]:
# Your code goes here

Your explanation goes here

## Task 2: Static Analysis

In this second task, you will do univariate and multivariate explorations of some of the features of the dataset.

### A) Build a dataframe containing one row per user:

`[user_id, average_point_math_absolute, average_point_math_relative, average_structure_coherence_absolute, average_structure_coherence_relative]`

The features are defined as follows:

- `user_id`: The ID of the user, who should be both in the math and essay results datasets.

- `average_point_math_absolute`: The average of the `points` for the user from the quiz results dataset.

- `average_point_math_relative`: The average of the relative points for the user from the quiz results dataset, that is, the average of the values of `points / max_points`.

- `average_structure_coherence_absolute`: The average of the `structure_coherence` for the user from the essay results dataset.

- `average_structure_coherence_relative`: The average of the relative structure coherence for the user from the essay results dataset, that is, the average of the values of `structure_coherence / max_structure_coherence` where `max_structure_coherence` is the maximum value of `structure_coherence` for ALL users in the dataset.

At the end of the cell display the first 20 rows of the dataframe, sorted by user_id.

Hint 1: You will need to merge the two data tables.

Hint 2: Check how many rows in the quiz results dataset have `max_points` less than `points`. If there are any, set `max_points` equal to `points` for those rows.
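
To make the feature definitions and the two hints concrete, here is a minimal sketch on made-up toy data. The column names `user_id`, `points`, `max_points` and `structure_coherence` are taken from the description above; the toy values, the helper column `relative` and the variable names are assumptions for illustration only, and the real tables may need additional cleaning.

```python
import pandas as pd

# Toy stand-ins for the two tables (values are invented for illustration).
quiz_toy = pd.DataFrame({
    "user_id": [1, 1, 2, 3],
    "points": [3, 5, 2, 4],
    "max_points": [4, 4, 2, 8],  # second row is inconsistent: 4 < 5
})
essay_toy = pd.DataFrame({
    "user_id": [1, 2, 2, 4],
    "structure_coherence": [2.0, 4.0, 3.0, 1.0],
})

# Hint 2: count inconsistent rows, then raise max_points to points where needed.
n_inconsistent = int((quiz_toy["max_points"] < quiz_toy["points"]).sum())
quiz_toy["max_points"] = quiz_toy[["points", "max_points"]].max(axis=1)

# Relative points are computed per row first, then averaged per user.
quiz_toy["relative"] = quiz_toy["points"] / quiz_toy["max_points"]
math_features = quiz_toy.groupby("user_id").agg(
    average_point_math_absolute=("points", "mean"),
    average_point_math_relative=("relative", "mean"),
).reset_index()

# The normalizer is the maximum structure_coherence over ALL users.
max_structure_coherence = essay_toy["structure_coherence"].max()
essay_toy["relative"] = essay_toy["structure_coherence"] / max_structure_coherence
essay_features = essay_toy.groupby("user_id").agg(
    average_structure_coherence_absolute=("structure_coherence", "mean"),
    average_structure_coherence_relative=("relative", "mean"),
).reset_index()

# Hint 1: an inner merge keeps only users present in both tables.
features = (
    math_features.merge(essay_features, on="user_id", how="inner")
    .sort_values("user_id")
    .reset_index(drop=True)
)
print(features.head(20))
```

In this toy example only users 1 and 2 appear in both tables, so the merged dataframe has two rows. Averaging the per-row ratios is not the same as dividing the average points by the average maximum, which is why the ratio is computed before grouping.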


In [None]:
# Your code goes here

### B) Perform a univariate analysis (including descriptive statistics and visualizations) for the four features (`average_point_math_absolute, average_point_math_relative, average_structure_coherence_absolute, average_structure_coherence_relative`) of your dataframe. Please check the lecture slides for information on how to perform a univariate analysis for categorical and numerical features. Discuss your results: how are the features distributed?

In [None]:
# Your code goes here

Your explanation goes here

### C) Come up with two additional features on your own, based on the two initial CSV files, and add them to the dataframe. Please provide an explanation/description of your features as well as an argument/hypothesis of why you think these features are interesting to explore.

In [None]:
# Your code goes here

Your explanation goes here

### D) Perform a univariate analysis of your features (including descriptive statistics and visualization). What can you observe?

In [None]:
# Your code goes here

Your explanation goes here

### E) Perform a multivariate analysis for one pair of features of your choice. Please provide a metric and a visualization. Please discuss: why did you choose this pair? What was your hypothesis? Do the results confirm your hypothesis?

In [None]:
# Your code goes here

Your explanation goes here

### F) Find the two features with the highest correlation and the two features with the lowest correlation. Discuss the correlation.
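
One possible way to search all feature pairs is sketched below on synthetic data. The columns `a`, `b`, `c` are hypothetical, and reading "lowest correlation" as "smallest absolute Pearson correlation" is an assumption; you may prefer "most negative" and should state which reading you use.

```python
import numpy as np
import pandas as pd

# Synthetic features: b is nearly a multiple of a, c is independent noise.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "b": 2 * a + rng.normal(scale=0.1, size=200),
    "c": rng.normal(size=200),
})

corr = toy.corr()  # Pearson correlation by default

# Keep only the upper triangle (k=1 excludes the diagonal) so each
# unordered pair appears exactly once and self-correlations are dropped.
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(upper).stack().dropna()

strongest_pair = pairs.abs().idxmax()
weakest_pair = pairs.abs().idxmin()
print("strongest:", strongest_pair, pairs[strongest_pair])
print("weakest:", weakest_pair, pairs[weakest_pair])
```

Pearson correlation only captures linear relationships, so a scatter plot of the selected pairs is a useful sanity check before interpreting the numbers.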

In [None]:
# Your code for correlation analysis goes here

*Your discussion/interpretation goes here*

## Task 3: Time-series analysis

In this task, you will perform a time-series analysis.

### A) Build a dataframe containing one row per user **per day**:

`[user_id, absolute_points, relative_points, day_index]`

The features are defined as follows:

- `user_id`: The ID of the user, with possible values equal to all the users found in the quiz results dataset.

- `absolute_points`: The `points` for the user from the quiz results dataset.

- `relative_points`: The relative points for the user from the quiz results dataset, that is, the values of `points / max_points`.

- `day_index`: The index of the day of the submission. This should be a discrete version of the submission time, where the first day on which a user submitted is 0, the second such day is 1, and so on. If a day has no data point, skip it and count the next day with data as the next time step. All times should be relative to the first submission time of the user.

At the end of the cell display the first 20 rows of the dataframe, sorted by user_id.
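
The `day_index` definition can be illustrated with a small sketch on toy data. The timestamp column name `created_at` is hypothetical (check the data description for the real name), and averaging several submissions made on the same day is an assumption, since other aggregations such as the sum are equally defensible.

```python
import pandas as pd

# Toy submissions; `created_at` is a hypothetical timestamp column name.
toy = pd.DataFrame({
    "user_id": [1, 1, 1, 1, 2, 2],
    "created_at": pd.to_datetime([
        "2024-01-01 09:00", "2024-01-01 17:30", "2024-01-04 10:00",
        "2024-01-05 08:00", "2024-02-10 12:00", "2024-02-12 12:00",
    ]),
    "points": [2, 3, 1, 4, 5, 2],
    "max_points": [4, 4, 2, 4, 5, 4],
})

# Truncate timestamps to calendar days and measure them from each
# user's first submission day.
toy["day"] = toy["created_at"].dt.normalize()
first_day = toy.groupby("user_id")["day"].transform("min")
toy["days_since_first"] = (toy["day"] - first_day).dt.days

# A dense rank skips days without submissions: calendar offsets
# 0, 0, 3, 4 become day indices 0, 0, 1, 2.
toy["day_index"] = (
    toy.groupby("user_id")["days_since_first"].rank(method="dense").astype(int) - 1
)

# One row per user per day (assumption: average over same-day submissions).
toy["relative"] = toy["points"] / toy["max_points"]
daily = toy.groupby(["user_id", "day_index"], as_index=False).agg(
    absolute_points=("points", "mean"),
    relative_points=("relative", "mean"),
)
print(daily.sort_values("user_id").head(20))
```

User 1 submits twice on the first day and then again three and four days later, so the calendar gap is compressed and the resulting indices are 0, 1 and 2.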


In [None]:
# Your code goes here

### B) Select two features and analyze their behavior over time. Please provide a hypothesis and visualization for both features. For ideas on how to perform a time series exploration, please check the lecture slides and notebook. Discuss your results: what do you observe? Do the results confirm your hypotheses?

In [None]:
# Your code goes here

Your explanation goes here

## Task 4: Machine learning pipeline

Build a complete ML pipeline consisting of:

1. Data preprocessing
2. Feature engineering
3. Model building and training
4. Model evaluation
5. Interpretation of results


You are free to choose any classification or regression problem based on the dataset. For example, you can try to predict user dropout or answer correctness given prior behavior.

Make sure to select a model that suits your chosen features and an evaluation method that suits your problem. Apply the standard principles for ML methods, data preparation and evaluation covered in the lectures.
Visualize the loss values during training.
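
As a rough illustration of what recording the loss during training can look like, the sketch below trains a logistic regression with plain gradient descent on synthetic data. Everything in it (the data, the 80/20 split, the learning rate, the number of epochs) is an assumption chosen for illustration; it is not a recommended model for this dataset, and any library that exposes per-epoch losses works equally well.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic binary classification data (stand-in for engineered features).
n = 400
X = rng.normal(size=(n, 3))
true_w = np.array([1.5, -2.0, 0.5])
y = (X @ true_w + rng.normal(scale=0.5, size=n) > 0).astype(float)

# Hold out 20% of the rows for evaluation.
idx = rng.permutation(n)
split = int(0.8 * n)
train_idx, test_idx = idx[:split], idx[split:]
X_train, y_train = X[train_idx], y[train_idx]
X_test, y_test = X[test_idx], y[test_idx]

# Standardize with training statistics only, to avoid leaking test data.
mu, sigma = X_train.mean(axis=0), X_train.std(axis=0)
X_train = (X_train - mu) / sigma
X_test = (X_test - mu) / sigma

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_loss(y_true, p, eps=1e-12):
    p = np.clip(p, eps, 1 - eps)
    return float(-np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p)))

# Full-batch gradient descent, storing the training loss at every epoch.
w, b, lr = np.zeros(X_train.shape[1]), 0.0, 0.5
train_losses = []
for epoch in range(200):
    p = sigmoid(X_train @ w + b)
    train_losses.append(log_loss(y_train, p))
    w -= lr * (X_train.T @ (p - y_train)) / len(y_train)
    b -= lr * float(np.mean(p - y_train))

# Evaluate on the held-out rows.
test_pred = (sigmoid(X_test @ w + b) >= 0.5).astype(float)
test_accuracy = float(np.mean(test_pred == y_test))
print(f"final train loss: {train_losses[-1]:.3f}, test accuracy: {test_accuracy:.2f}")
```

Passing `train_losses` to `plt.plot` then gives the loss curve. Tracking a validation loss alongside it helps to spot overfitting, and for user-level data it is worth considering a split by user rather than by row.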


Justify your problem choice (what do you predict?), feature selection (why do you predict from these features?), model choice and model evaluation.

*Your justification goes here*

In [None]:
# Your code goes here:

### DATA PREPROCESSING

In [None]:
### FEATURE ENGINEERING

In [None]:
### MODEL BUILDING AND TRAINING (make sure to visualize the loss)

In [None]:
### MODEL EVALUATION

*Your interpretation of results goes here*