# M2 | Exploration Notebook

In this notebook, you will do a first exploration of the data set that you will use for your project. One part of this exploration is guided, i.e. we will ask you to solve specific questions (Tasks 1-3). The other part is open, i.e. we will ask you to come up with your own exploration ideas (Task 4).

Please upload your solved notebook to Moodle (under Milestone 2 Submission), adding your SCIPER number to the file name, for example: m2-lernnavi-456392.ipynb


## Brief overview of Classtime
[Classtime](https://www.classtime.com/en/) is a Swiss EdTech startup that complements in-class teaching with immediate feedback on students' level of understanding. Teachers create Classtime sessions for their class, where each session consists of *n* questions and a reflection at the end. Each student is tracked by a **participant_id**, which is unique within a session and in some cases can be tracked over sessions using the **social_user_id**. We have 5 data tables from this startup:
- *sessions* (meta information about session format, teacher)
- *questions* (all questions asked across sessions)
- *answers* (each row represents a student answer to a question)
- *reflection_questions* (4 reflection questions available to ask students)
- *reflection_answers* (each row represents a student answer to a reflection question)

The *answers* table has been filtered to ~20% of the original dataset, only keeping participants who have also answered reflection questions. Similarly, the *questions* table has been filtered to only include questions where our subset of participants have answered. If you choose this dataset for the project, we will give you access to the full data later on.

### sessions:
- **id**: The unique identifier of a ClassTime session, also referred to in other tables as **session_id**.
- **teacher_id**: The unique identifier of a teacher, who creates a Classtime session.
- **title**: The title of the session. The default title is extracted from the Question Set title but the teacher can change it manually.
- **participant_count**: N/A
- **group**: N/A
- **mode**: round-based = the whole class goes through the session together, one question at a time, and the teacher sets the pace. flexible = the teacher can still activate/deactivate questions, but generally students can answer at their own pace.
- **question_set_id**: Identifier of the question set that was used to create the session (not always populated). If this is the same for two sessions, it means the same questions were used for both.
- **is_onboarding**: If true, this session was automatically created by Classtime when the teacher signed up, to show them how Classtime works. These are not real sessions.
- **feedback_mode**: per-question = the solution is shown immediately after answering, for each question. manual = solutions are only shown when the teacher triggers it.
- **is_one_attempt**: If true, students cannot change their answers after submission (note that if students answer the same question multiple times, this will also show up in the answer dump as multiple rows).
- **is_shuffle_choices**: For appropriate question types (i.e. multi-choice) randomly shuffle the options for each student. If False, show in order as it is defined in the question.
- **is_shuffle_questions**: If true, randomly shuffle order of questions for each student, otherwise the same order is used for all students.
- **is_partial_grading**: If true, ratings other than {0, question weight} are possible.
- **platform_only**: If not empty (NaN), students need to identify themselves (e.g. with their Google account) to join the session. If NaN, students can join with a nickname for the session and are not required to sign in.
- **has_reflection**: If true, students can answer session reflection questions.
- **force_reflection**: If true, teachers actively required students to answer session reflections.
- **timer**: Number of seconds each student has to answer all questions. The timer starts when the student joins the session.
- **is_solo**: Self-learning session without teacher

### questions:
- **id**: The unique identifier of this question.
- **session_id**: Session ID (see *sessions* table).
- **library_question_id**: Also referred to as **question_id**, the library ID for this question. The first time a question is used, a new **library_question_id** is created. If the same question is used over multiple sessions, all of its occurrences will have the same **library_question_id**.
- **kind**: Type of question (i.e. choice, multiple, categorizer, cloze).
- **title**: Title of question.
- **content**: Raw JSON of question in the exact format available to the student.
- **explanation**: Explanation of the correct answer choice, available at the instructor's discretion.
- **provider**: N/A
- **weight**: Maximum points available for this question.
- **video**: If True, question includes a video.
- **image**: If True, question includes an image.

### answers:
- **id**: The unique identifier of this specific answer attempt.
- **participant_id**: Identifier of each student in a session.
- **question_id**: Question ID (see *questions* table).
- **session_id**: Session ID (see *sessions* table).
- **is_correct**: True or False regarding question correctness.
- **rating**: Number of points the student received for this response. The maximum possible value is the question's **weight**.
- **created_at**: Timestamp of answer response.
- **social_user_id**: Authentication User ID that can be used to track participants across sessions. This is only available for students who sign in using an authentication platform.
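Because repeated answers to the same question appear as multiple rows in *answers*, it can help to decide early how to treat them. The following is a minimal sketch on a small invented table (`toy_answers` and its values are hypothetical, not taken from the real data) that counts attempts and keeps only the latest attempt per participant and question. Keeping the last attempt is just one possible convention.

```python
import pandas as pd

# Toy stand-in for the answers table (all values are invented for illustration).
toy_answers = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "participant_id": ["p1", "p1", "p1", "p2"],
    "question_id": ["q1", "q1", "q2", "q1"],
    "is_correct": [False, True, True, False],
    "created_at": pd.to_datetime([
        "2021-03-01 10:00:00",
        "2021-03-01 10:00:30",
        "2021-03-01 10:01:00",
        "2021-03-01 10:00:10",
    ]),
})

# Number of attempts for each (participant, question) pair.
attempts = toy_answers.groupby(["participant_id", "question_id"]).size()

# Keep only the most recent attempt for each pair (one possible convention).
last_attempts = (
    toy_answers.sort_values("created_at")
    .drop_duplicates(subset=["participant_id", "question_id"], keep="last")
    .sort_values("id")
    .reset_index(drop=True)
)
```

Whether the first or the last attempt is the more meaningful one depends on the question you are asking, so it is worth stating the choice in your discussion.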

### reflection_questions:
- **id**: The unique identifier of this reflection question. Referred to in *reflection_answers* as **reflection_question_id**.
- **title**: Title of the reflection question.
- **kind**: Type of reflection question (yesno or emoji).

### reflection_answers:
- **id**: The unique identifier of this reflection question response.
- **participant_id**: The identifier of each student in a session.
- **session_id**: Session ID (see *sessions* table).
- **reflection_questions_id**: Reflection Question ID (see *reflection_questions* table).
- **value**: student response to the reflection question (yes/no, happy/neutral/upset).

The *questions* and *answers* tables can be joined using **session_id**, **participant_id** and **question_id**. Each teacher has a unique **teacher_id**. Participant responses can be tracked over a single session in the *answers* and *reflection_answers* tables using **participant_id**, and over sessions using **social_user_id**.
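As an illustration of such a join, here is a minimal sketch on invented toy tables. It assumes that **question_id** in *answers* refers to **id** in *questions*; please verify this against the real data before relying on it. The names `toy_questions`, `toy_answers`, and `merged` are hypothetical.

```python
import pandas as pd

# Toy stand-ins for the questions and answers tables (invented values).
toy_questions = pd.DataFrame({
    "id": ["q1", "q2"],
    "session_id": ["s1", "s1"],
    "kind": ["choice", "cloze"],
    "weight": [1.0, 2.0],
})
toy_answers = pd.DataFrame({
    "id": [1, 2, 3],
    "participant_id": ["p1", "p1", "p2"],
    "question_id": ["q1", "q2", "q1"],
    "session_id": ["s1", "s1", "s1"],
    "rating": [1.0, 0.0, 0.0],
})

# Assumption: answers.question_id refers to questions.id.
# validate="many_to_one" raises an error if a question key is duplicated,
# which protects against silently multiplying answer rows.
merged = toy_answers.merge(
    toy_questions.rename(columns={"id": "question_id"}),
    on=["session_id", "question_id"],
    how="left",
    validate="many_to_one",
)
```

A left join keeps every answer row even when no matching question is found, so missing matches show up as NaN in the question columns rather than disappearing.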


In [None]:
# Import the tables of the data set as dataframes.
# Note that this cell will likely take ~10 minutes to run.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

DATA_DIR = './data'  # You may change the directory

sessions = pd.read_csv('{}/classtime_sessions.csv.gz'.format(DATA_DIR))
questions = pd.read_csv('{}/classtime_questions.csv.gz'.format(DATA_DIR))
answers = pd.read_csv('{}/classtime_answers.csv.gz'.format(DATA_DIR), nrows=10000000)
reflection_questions = pd.read_csv('{}/classtime_reflection_questions.csv.gz'.format(DATA_DIR))
reflection_answers = pd.read_csv('{}/classtime_reflection_answers.csv.gz'.format(DATA_DIR))

## Task 1: Simple Statistics

In this task you are asked to do a first coarse exploration of the data set, using simple statistics and visualizations.

#### a) How many distinct sessions do we have in the data set? How many distinct participants do we have in the data set?

In [None]:
# Your code here

#### b) What is the total number of questions ("questions") that each user has answered? Please provide a visualization and discuss the distribution.

In [None]:
# Your code for computing the feature and the visualization here

*Your discussion/interpretation goes here*

#### c) How many different question types ("kind") do we have in the data set? How many distinct questions do we have for each question type? Please provide a visualization and discuss/interpret your results.

In [None]:
# Your code for computing the feature and the visualization here

*Your discussion/interpretation goes here*

## Task 2: Static Analysis

In this second task, you will do a univariate and multivariate exploration of some aggregated features.

#### a) Build a data frame containing one row per participant: 
``[participant_id, session_mode, num_questions, percentage_correct, num_reflections, feeling_of_learning]``

The features are defined as follows:
- **session_mode**: mode of the session the participant belongs to - round_based (class together) or flexible (participants work on their own)
- **num_questions**: total number of questions the participant answered in this session
- **percentage_correct**: percentage of correct answers (number of correct answers/total number of answers)
- **feeling_of_learning**: how did the participant feel about their learning progress (happy, neutral, upset, not_answered)
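The aggregation pattern behind **num_questions** and **percentage_correct** can be sketched with a named `groupby` aggregation. This is only a sketch on invented data, assuming **is_correct** is a clean boolean column and that each row is one answer; the real table may need cleaning first (for example, missing values or repeated attempts).

```python
import pandas as pd

# Toy stand-in for the answers table (invented values).
toy_answers = pd.DataFrame({
    "participant_id": ["p1", "p1", "p1", "p2", "p2"],
    "question_id": ["q1", "q2", "q3", "q1", "q2"],
    "is_correct": [True, False, True, False, False],
})

# One row per participant: distinct questions answered and share of correct answers.
per_participant = (
    toy_answers.groupby("participant_id")
    .agg(
        num_questions=("question_id", "nunique"),
        percentage_correct=("is_correct", "mean"),
    )
    .reset_index()
)

# The mean of a boolean column is a fraction in [0, 1]; convert it to a percentage.
per_participant["percentage_correct"] *= 100
```

Note that `nunique` counts distinct questions, while the mean is taken over all answer rows, so the two features treat repeated attempts differently unless duplicates are removed beforehand.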

In [None]:
# Your code for building the data frame here

#### b) Perform a univariate analysis (including descriptive statistics and visualizations) for the five features (session_mode, num_questions, percentage_correct, num_reflections, feeling_of_learning) of your data frame. Please check the lecture slides for information on how to perform a univariate analysis for categorical and numerical features. Discuss your results: how are the features distributed? Are there any anomalies?

In [None]:
# Your code for univariate analysis here

*Your discussion/interpretation goes here*

#### c) Create one additional feature on your own and add it to the data frame. Please provide an explanation/description of your feature as well as an argument/hypothesis of why you think this feature is interesting to explore.

In [None]:
# Your code for computing the feature and adding it to the data frame.

*Your feature description and argument/hypothesis goes here*

#### d) Perform a univariate analysis of your feature (including descriptive statistics and visualization). What can you observe? Do the results confirm your hypothesis?

In [None]:
# Your code for univariate analysis goes here

*Your discussion goes here*

#### e) Perform a multivariate analysis for two pairs of features of your choice. Please provide a metric and a visualization for both pairs. Please discuss: why did you choose these two pairs? What was your hypothesis? Do the results confirm your hypothesis?

In [None]:
# Your code for multivariate analysis goes here

*Your discussion goes here*

## Task 3: Time-Series Analysis
In the last task, you will perform a time-series analysis.

#### a) Build a data frame that is structured as follows (by answered question):  
``[answer, participant_id, session_mode, answer_correctness, answer_time, feeling_of_learning, your feature]``

For each participant, limit the number of answered questions to 10 (i.e. include only the first 10 answers in the data frame)

The features are defined as follows:
- **session_mode**: mode of the session the participant belongs to - round_based (class together) or flexible (participants work on their own)
- **answer_correctness**: is the answer correct?
- **answer_time**: time in seconds to answer the question (i.e. time stamp of next answer - time stamp of this answer)
- **feeling_of_learning**: how did the participant feel about their learning progress (happy, neutral, upset, not_answered)
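The per-participant ordering, the limit of 10 answers, and the **answer_time** definition above can be sketched as follows. The data is invented, and `MAX_ANSWERS`, `ordered`, and `first_answers` are hypothetical names. The sketch follows the definition as stated (time stamp of the next answer minus time stamp of this answer), so the last answer of each participant has no value.

```python
import pandas as pd

# Toy stand-in for the answers table (invented values).
toy_answers = pd.DataFrame({
    "participant_id": ["p1", "p1", "p1", "p2", "p2"],
    "created_at": pd.to_datetime([
        "2021-03-01 10:00:00",
        "2021-03-01 10:00:20",
        "2021-03-01 10:01:05",
        "2021-03-01 10:00:05",
        "2021-03-01 10:00:50",
    ]),
    "is_correct": [True, False, True, True, False],
})

MAX_ANSWERS = 10  # keep only the first 10 answers per participant

# Order answers chronologically within each participant.
ordered = toy_answers.sort_values(["participant_id", "created_at"]).copy()

# 0-based position of each answer in the participant's sequence.
ordered["answer"] = ordered.groupby("participant_id").cumcount()

# Time stamp of the participant's next answer (NaT for their last answer).
next_ts = ordered.groupby("participant_id")["created_at"].shift(-1)
ordered["answer_time"] = (next_ts - ordered["created_at"]).dt.total_seconds()

first_answers = ordered[ordered["answer"] < MAX_ANSWERS].reset_index(drop=True)
```

Computing **answer_time** before truncating to the first 10 answers means the tenth answer can still use the eleventh time stamp, if one exists.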

In [None]:
# Your code for building the data frame goes here

#### b) Select two features and analyze their behavior over time (aggregated over participants). Please provide a hypothesis and visualization for both features. For ideas on how to perform a time series exploration, please check the lecture slides and notebook. Discuss your results: what do you observe? Do the results confirm your hypotheses?

In [None]:
# Your code for the time series analyses of the features goes here

*Your discussion goes here*

## Task 4: Creative Extension

Please provide **one** new hypothesis you would like to explore with the data and provide a visualization for it. Discuss your results: what do you observe? Do the results confirm your hypotheses?

In [None]:
# Your creative visualization here

*Your discussion goes here*