# Programming Lesson and Exercises - DAS

The goal of this programming lesson and its exercises is to teach you how to filter and aggregate time series data.


This notebook has the following structure:

- The first part introduces the concepts for this week. The theory is interleaved with small exercises that let you practice the concepts that were just introduced.
- At the end there are one or more larger exercises that test what you have learned earlier. These will be more difficult and will require more independent work than the exercises in the first part.

All exercises can be solved with the concepts that were introduced earlier. Since there is often more than one correct way to solve a programming problem, we try to accept various correct answers. However, many of the automatic tests in Momotor (see _How to submit your work_ below) assume that your answers are constructed using the concepts introduced in these notebooks. If you look for answers on the Internet (e.g. if you import other libraries) you run the risk that your answers will be rejected.

Some of the small exercises can be solved by copy-pasting code from the examples. However, it is up to you to try to solve the exercises yourself, which will help you learn, before copy-pasting the answers. The ease of looking up answers is meant to provide guidance when you get stuck, especially for those of you who are new to programming.

For your convenience, in the `support` directory you will find a summary of the Python methods introduced in this notebook.



# Introduction to This Template Notebook

* This is a **personal** notebook.
* Make sure you work in a **copy** of `...-template.ipynb`,
**renamed** to `...-yourIDnr.ipynb`,
where `yourIDnr` is your TU/e identification number.

<div class="alert alert-danger" role="danger">
<h3>Integrity</h3>
<ul>
    <li>In this course you must act according to the rules of the TU/e code of scientific conduct.</li>
    <li>All the exercises and the graded assignments are to be executed individually and independently.</li>
    <li>You must not copy from the Internet, your friends, books... If you represent other people's work as your own, then that constitutes fraud and will be reported to the Examination Committee.</li>
    <li>Making your work available to others (complicity) also constitutes fraud.</li>
</ul>
</div>

You are expected to work with Python code in this notebook.

The locations where you should write your solutions can be recognized by
**marker lines**,
which look like this:

>`#//`
>    `BEGIN_TODO [Label]` `Description` `(n points)`
>
>`#//`
>    `END_TODO [Label]`

<div class="alert alert-warning" role="alert">Do NOT modify or delete these marker lines.  Keep them as they are.<br/>
NEVER write code <i>outside</i> the marked blocks.
Such code cannot be evaluated.
</div>

Proceed in this notebook as follows:
* **Read** the text.
* **Fill in** your solutions between `BEGIN_TODO` and `END_TODO` marker lines.
* **Run** _all_ code cells (also the ones _without_ your code),
    _in linear order_ from the first code cell.

**Personalize your notebook**:
1. Copy the following three lines of code:

  ```python
  AUTHOR_NAME = 'Your Full Name'
  AUTHOR_ID_NR = '1234567'
  AUTHOR_DATE = 'YYYY-MM-DD'
  ```
1. Paste them between the marker lines in the next code cell.
1. Fill in your _full name_, _identification number_, and the current _date_ (i.e. when you first modified this notebook, e.g. '2022-02-06') as strings between the `Author` markers.
1. Run the code cell by putting the cursor there and typing **Control-Enter**.


In [None]:
#// BEGIN_TODO [Author] Name, Id.nr., Date, as strings (1 point)

# ===== =====> Replace this line by your code. <===== ===== #

#// END_TODO [Author]

AUTHOR_NAME, AUTHOR_ID_NR, AUTHOR_DATE

## Table of Contents

- [Learning Objectives](#Learning-Objectives)
- [Loading the Libraries](#Loading-the-Libraries)
- [1. Filtering](#1.-Filtering)
    - [Gaussian Filter](#Gaussian-Filter)
    - [Rolling Windows](#Rolling-Windows)
    - [Filtering Mouse Trajectories](#Filtering-Mouse-Trajectories)
        - [Exercise 1.a](#Exercise-1.a)
        - [Exercise 1.b](#Exercise-1.b)
        - [Exercise 1.c](#Exercise-1.c)
        - [Exercise 1.d](#Exercise-1.d)
        - [Exercise 1.e](#Exercise-1.e)
    - [Computing Changes and Finding Maxima](#Computing-Changes-and-Finding-Maxima)
        - [Exercise 1.f](#Exercise-1.f)
        - [Exercise 1.g](#Exercise-1.g)
        - [Exercise 1.h](#Exercise-1.h)
    - [Approximating Derivatives and Finding Trends](#Approximating-Derivatives-and-Finding-Trends)
        - [Exercise 1.i](#Exercise-1.i)
- [2. Data Aggregation](#2.-Data-Aggregation)
    - [Data Preparation](#Data-Preparation)
        - [Data: Mouse Trajectories](#Data:-Mouse-Trajectories)
        - [Multi-level Indexing](#Multi-level-Indexing)
        - [Data: User Properties](#Data:-User-Properties)
        - [Conversion to Readable Table Entries](#Conversion-to-Readable-Table-Entries)
        - [Data: User Trial Properties](#Data:-User-Trial-Properties)
            - [Exercise 2.a](#Exercise-2.a)
            - [Exercise 2.b](#Exercise-2.b)
    - [Data Aggregation](#Data-Aggregation)
        - [Computing Several Aggregated Quantities at Once](#Computing-Several-Aggregated-Quantities-at-Once)
        - [Joining Two Dataframes](#Joining-Two-Dataframes)
            - [Exercise 2.c](#Exercise-2.c)
            - [Exercise 2.d](#Exercise-2.d)
- [3. Empirical Cumulative Distribution Functions](#3.-Empirical-Cumulative-Distribution-Functions)
    - [Exercises: Compare ECDFs for Mouse a Trackpad Trajectories](#Exercises:-Compare-ECDFs-for-Mouse-a-Trackpad-Trajectories)
        - [Exercise 3.a](#Exercise-3.a)
        - [Exercise 3.b](#Exercise-3.b)
        - [Exercise 3.c](#Exercise-3.c)
        - [Exercise 3.d](#Exercise-3.d)
        - [Exercise 3.e](#Exercise-3.e)
        - [Exercise 3.f](#Exercise-3.f)
        - [Exercise 3.g](#Exercise-3.g)
- [4. Exercise: Find Ballistic Motion](#4.-Exercise:-Find-Ballistic-Motion)
    - [Exercise 4.a](#Exercise-4.a)
    - [Exercise 4.b](#Exercise-4.b)
    - [Exercise 4.c](#Exercise-4.c)
    - [Exercise 4.d](#Exercise-4.d)
    - [Exercise 4.e](#Exercise-4.e)

## Learning Objectives

In these lessons, we are going to extract features from the mouse trajectories recorded in the mouse experiment. We will first extract features by filtering signals, where a signal represents a quantity that changes with respect to another variable (e.g. time). For this, we will need to learn:

* how to filter signals in _Python_
* how to work with some new _Pandas_ functions

Next, we will extract more features and perform aggregation over the recorded data and will learn

* how to use the _Pandas_ **`agg()`** for convenient aggregation and computation of simple features
* how to combine/join two _Pandas_ tables with the function **`join()`**
* how to compute, use and plot empirical cumulative distribution functions

## Loading the Libraries

To show examples, we load some Data Analytics libraries first:

In [None]:
import numpy as np  # import auxiliary library, typical idiom
import pandas as pd  # import the Pandas library, typical idiom

from statsmodels.distributions.empirical_distribution import ECDF

from scipy.interpolate import interp1d
from scipy.ndimage import gaussian_filter1d  # the scipy.ndimage.filters namespace is deprecated
from scipy import stats

# next command ensures that plots appear inside the notebook
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()  # set Seaborn defaults
plt.rcParams['figure.figsize'] = 10, 5  # default hor./vert. size of plots, in inches
plt.rcParams['lines.markeredgewidth'] = 1  # to fix issue with seaborn box plots; needed after import seaborn

from mouse_experiment import MouseExperiment

# reveal a hint only while holding the mouse down
from IPython.display import HTML
HTML("<style>.h,.c{display:none}.t{color:#296eaa}.t:active+.h{display:block;}</style>")

## 1. Filtering

To illustrate how to filter signals, we will apply filtering to the (time series in the) _stocks_ data set that we have also seen in the visualization exercises. In the exercises that follow, you will apply what you have learned to recorded mouse trajectories.

This time, the _stocks_ data set contains the _daily_ (rather than the monthly) closing value of the <a href="https://en.wikipedia.org/wiki/NASDAQ_Composite">NASDAQ Composite</a> index. We first load the data into a dataframe `df_nasdaq`.

In [None]:
df_nasdaq = pd.read_csv('datasets/NASDAQ.csv', parse_dates=[0])
df_nasdaq = df_nasdaq.set_index('Date')[['Close']]
df_nasdaq.columns = ['close']
df_nasdaq.head()

Let us plot the daily closing index.

In [None]:
ax = df_nasdaq['close'].plot()
ax.set_ylabel('Index')
ax.set_title('Closing NASDAQ Composite index',fontsize=16);

### Gaussian Filter

As you can see, the closing price is very volatile. If we want to approximate the graph by a smoother curve, we can do so by _filtering_. In this case, we apply a Gaussian filter. We store the filtered data in a column called `'close_filtered'`. The Gaussian filter needs a parameter `sigma`, the standard deviation, which gives an indication of the width of the applied filter. We choose `sigma` to be `30` (days).

(There are different ways to deal with the boundaries. We choose the mode `'nearest'`, since for the later exercises it is the most natural choice.)
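
To get a feeling for the role of `sigma`, here is a minimal sketch on a synthetic signal (the sine wave, the noise level and the two `sigma` values are assumptions chosen purely for illustration). Note that `gaussian_filter1d` interprets `sigma` as a number of samples (rows), so it only corresponds to a time span when the rows are equally spaced in time.

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

# Synthetic signal (illustrative assumption): a slow sine wave plus noise.
rng = np.random.default_rng(0)
t = np.arange(500)
noisy = np.sin(2 * np.pi * t / 250) + rng.normal(scale=0.3, size=t.size)

# sigma is measured in samples; a larger sigma gives a wider, smoother filter.
smooth_narrow = gaussian_filter1d(noisy, sigma=2, mode='nearest')
smooth_wide = gaussian_filter1d(noisy, sigma=15, mode='nearest')


def jitter(signal):
    """Mean absolute change between consecutive samples."""
    return np.abs(np.diff(signal)).mean()


# The wider the filter, the less sample-to-sample jitter remains.
print(jitter(noisy), jitter(smooth_narrow), jitter(smooth_wide))
```

A very wide filter also flattens genuine features of the signal, so the choice of `sigma` is a trade-off between removing noise and keeping detail.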

In [None]:
df_nasdaq['close_filtered'] = gaussian_filter1d(df_nasdaq['close'], sigma=30, mode='nearest')
df_nasdaq.head()

Let us plot the filtered data together with the original prices in one figure.

In [None]:
ax_nasdaq = df_nasdaq['close'].plot(color='orange')
df_nasdaq['close_filtered'].plot(ax = ax_nasdaq, color='black');
ax_nasdaq.set_title('Closing NASDAQ Composite index',fontsize=16);
ax_nasdaq.set_ylabel('Index')

### Rolling Windows

There are many more ways to filter signals, each with their own advantages and disadvantages. A **median filter** is less sensitive to outliers than a Gaussian filter.

The Pandas function **`rolling()`** lets you compute rolling-window statistics, such as a rolling median. To use it, you need to specify the size of the window. In the following example, we use a window size of `5`. You can then apply a statistic to the result: in our median-filter example, we calculate the `median`.

In [None]:
df_nasdaq['median_filtered'] = df_nasdaq['close'].rolling(5).median()
df_nasdaq.head(10)

In every row of the resulting series, you get the median of the values in the window. If you have a window size of `k`, the result in the `n`th row is the median of the rows `n-k+1`, `n-k+2` up to `n`. Note that for rows `0` to `k-2` (the first `k-1` rows) the window is not yet full, so the median cannot be computed, resulting in `NaN` values in those rows.
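
The following small sketch makes the window behaviour concrete on a hand-made series (the values, including the outlier `100.0`, are made up for illustration). With a window size of `3`, the first two rows are `NaN`, and the rolling median is barely affected by the outlier, whereas a rolling mean is pulled strongly towards it.

```python
import pandas as pd

# Toy series (illustrative assumption); 100.0 plays the role of an outlier.
s = pd.Series([1.0, 2.0, 100.0, 3.0, 4.0, 5.0])

med = s.rolling(3).median()  # median of the current row and the two rows before it
mean = s.rolling(3).mean()   # same window, but using the mean

print(pd.DataFrame({'value': s, 'median_3': med, 'mean_3': mean}))
```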

Let us plot the filtered data together with the original prices in one figure.

In [None]:
ax_nasdaq = df_nasdaq['close'].plot(color='orange')
df_nasdaq['median_filtered'].plot(ax=ax_nasdaq, color='black');
ax_nasdaq.set_title('Closing NASDAQ Composite index',fontsize=16);
ax_nasdaq.set_ylabel('Index');

### Filtering Mouse Trajectories

Now we will apply the filtering to recorded mouse trajectories stored in the file `'datasets/paths.csv'`.

In [None]:
df_paths = pd.read_csv('datasets/paths.csv', parse_dates=[0])

<div class="alert-info alert" role="alert-info">

You may also like to experiment on trajectories that you record yourself. The following code will launch the mouse experiment and store the recorded trajectories in the dataframe <code>df_paths</code>:

<code>experiment     = MouseExperiment()
_, df_paths, _ = experiment.start()
</code>

When you execute this code (by uncommenting it in the code cell below) the application will start in a new window. Note that it may be minimized, in which case you may need to find it in the task bar. Next, draw some trajectories. When you close the window, the trajectories will be stored in the dataframe <code>df_paths</code>.

**Before you submit your notebook, make sure that your code also works on the trajectories provided in <tt>datasets/paths.csv</tt>.**

> **Hint:** It is easier to understand the exercises if you **make a drawing, such as a triangle or a square**, instead of drawing straight lines.

In [None]:
#experiment = MouseExperiment()
#_, df_paths, _ = experiment.start()

We print below the first five rows of `df_paths`.

In [None]:
df_paths.head()

> **Note:** We will use the terms *trajectory* and *path* interchangeably.

The meaning of the different columns is as follows:

* The **`'trial'`** column contains a unique number per trajectory
* The **`'t'`** column contains the time since the start of the trial in ms
* The **`'x'`** column contains the $x$-coordinate on the trajectory at time `'t'` in pixels
* The **`'y'`** column contains the $y$-coordinate on the trajectory at time `'t'` in pixels

We make sure that the `'trial'` column is of integer type, which is needed for the later exercises:

In [None]:
df_paths['trial'] = df_paths['trial'].astype(int)

#### Exercise 1.a

Plot the trajectories obtained from the experiment, i.e. those stored in `df_paths`, in one figure, with both axes ranging from -300 to 300. Recall that different trajectories have a different `'trial'` number.

Here is an example of a possible result for two mouse trajectories:

![example-trajectories.png](attachment:example-trajectories.png)

> **Hint:** You can plot two columns labeled `'col_1'` and `'col_2'` from a dataframe or a groupby object against each other by providing the arguments `x='col_1'` and `y='col_2'` to the plot function.

In [None]:
#// BEGIN_TODO [DAS_1a] Visualize the paths (1 point)

# ===== =====> Replace this line by your code. <===== ===== #

#// END_TODO [DAS_1a]

#### Exercise 1.b

Select the data from the last recorded trajectory using a boolean mask, and store it in a dataframe named `df_last_path`.

> **Hint:** The `'trial'` numbers are increasing, i.e. later trajectories have larger `'trial'` numbers.

In [None]:
#// BEGIN_TODO [DAS_1b] Select data (1 point)

# ===== =====> Replace this line by your code. <===== ===== #

#// END_TODO [DAS_1b]

df_last_path.head()

#### Exercise 1.c

Plot the last recorded trajectory in the `df_last_path` dataframe. Set the limits of both axes from -300 to 300.

In [None]:
#// BEGIN_TODO [DAS_1c] Select data (1 point)

# ===== =====> Replace this line by your code. <===== ===== #

#// END_TODO [DAS_1c]

> **Hint:** After doing this exercise, go back and record some new mouse trajectories. Then, execute the code that you wrote for the previous exercises again, **without modifying it**. It should select the trajectory that you drew last.

#### Exercise 1.d

Use a Gaussian filter with standard deviation of $25$ ms, to approximate the $x$- and $y$- coordinates of the trajectory in `df_last_path`. Store the $x$- and $y$-coordinates of the filtered path in columns labeled `'filt_x'` and `'filt_y'` respectively. Then use a Gaussian filter with standard deviation of $200$ ms, and store the filtered path in columns labeled `'filt_x_coarse'` and `'filt_y_coarse'`.

Finally, plot the original and the two filtered paths in one figure, with both axes ranging from -300 to 300.

<span class="t">Hint<span class="c">:</span></span>
<span class="h">
Consider setting the `figsize` argument of the `plot()` function to something larger (e.g. `figsize=(10,10)`) to better visualize the differences between the filtering options.
</span>

In [None]:
#// BEGIN_TODO [DAS_1d] Gaussian filter (1 point)

# ===== =====> Replace this line by your code. <===== ===== #


In [None]:
#// END_TODO [DAS_1d]

Which filtered trajectory approximates the original curve best?

#### Exercise 1.e

Now use a median filter with a window size of `80` to approximate the $x$- and $y$-coordinates of the trajectory in `df_last_path`. Store the $x$- and $y$-coordinates of the filtered path in columns labeled `'med_filt_x'` and `'med_filt_y'` respectively. Similarly, use a median filter with a window size of `200` and store the coordinates of the filtered path as columns `'med_filt_x_coarse'` and `'med_filt_y_coarse'`.

Finally, plot the original and the two filtered paths in one figure, with both axes ranging from -300 to 300.

In [None]:
#// BEGIN_TODO [DAS_1e] Median filter (1 point)

# ===== =====> Replace this line by your code. <===== ===== #

#// END_TODO [DAS_1e]

What differences do you see with respect to the Gaussian-filtered trajectories? You can experiment by drawing different trajectories.

### Computing Changes and Finding Maxima

We will now introduce several new _Pandas_ functions. To illustrate their use, we consider the following example question:

_On what day was the biggest drop in the NASDAQ Composite?_

We approach this as follows. First, we calculate the differences between consecutive rows in a _Series_ with the _Pandas_ function **`diff()`**. We use it to add a new column, labeled `'close_diff'`, to the dataframe `df_nasdaq`.

In [None]:
df_nasdaq['close_diff'] = df_nasdaq['close'].diff()

df_nasdaq[['close', 'close_diff']].head()

In the example above, the value of `'close_diff'` on `2007-01-04` equals `'close'` on `2007-01-04` minus `'close'` on `2007-01-03`, and so on: in every row except the first, `'close_diff'` is the value of `'close'` in that row minus the value of `'close'` in the previous row. The value in the first row (`2007-01-03`) is `NaN`, because there is no previous row to subtract.

Next, we want to know on which date the drop of the NASDAQ Composite was the largest. In other words, we want to know for which `'Date'` the change in closing value was the smallest (i.e. the most negative). We can find this out with the Pandas function **`idxmin()`**.

In [None]:
date_min = df_nasdaq['close_diff'].idxmin()
date_min

Apparently, the value drop was the largest on the 24th of June in 2016. Let us see what the drop actually was.

In [None]:
df_nasdaq.loc[date_min, 'close_diff']

Similarly, we can find out the date with the largest increase of NASDAQ Composite by using the Pandas function **`idxmax()`**.

In [None]:
date_max = df_nasdaq['close_diff'].idxmax()
date_max
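
The same steps can be traced by hand on a tiny series. The dates and values below are invented for illustration. Note that `idxmin()` and `idxmax()` return the index *label* (here a date) of the extreme value, not its position, and that they skip the `NaN` in the first row.

```python
import pandas as pd

# Toy closing values (illustrative assumption), indexed by date.
dates = pd.to_datetime(['2020-01-01', '2020-01-02', '2020-01-03', '2020-01-06'])
toy = pd.Series([10.0, 12.0, 7.0, 8.0], index=dates)

change = toy.diff()             # NaN, 2.0, -5.0, 1.0
biggest_drop = change.idxmin()  # label of the most negative change
biggest_rise = change.idxmax()  # label of the most positive change

print(change)
print(biggest_drop, biggest_rise)
```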

If we sum all the differences up to a particular row, we should get back the difference between the value in that row and the value in the initial row. We can compute such a _cumulative sum_ with the _Pandas_ function **`cumsum()`**. We add the column with cumulative sums to the dataframe as a new column labeled `'close_diff_cumulative'`. We also add a column labeled `'change_since_beginning'` in which we directly calculate the difference between the current value and the first value. The last two columns should then have the same values.

In [None]:
df_nasdaq['close_diff_cumulative'] = df_nasdaq['close_diff'].cumsum()
df_nasdaq['change_since_beginning'] = df_nasdaq['close'] - df_nasdaq['close'].iloc[0]
df_nasdaq[['close','close_diff','close_diff_cumulative','change_since_beginning']].head()

The values in the last two columns indeed coincide (except for the first row due to the `NaN`s). Note that we used `.iloc[0]` to refer to the first row in `df_nasdaq`, which is equivalent to `.loc[df_nasdaq.index[0]]`.
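
The reason the two columns agree is that the sum of consecutive differences telescopes: $(x_1 - x_0) + (x_2 - x_1) + \dots + (x_n - x_{n-1}) = x_n - x_0$. A minimal check on made-up numbers (the values are an assumption for illustration):

```python
import numpy as np
import pandas as pd

s = pd.Series([10.0, 12.0, 7.0, 8.0])  # toy values

cumulative = s.diff().cumsum()  # NaN, 2.0, -3.0, -2.0
direct = s - s.iloc[0]          # 0.0, 2.0, -3.0, -2.0

# Apart from the first row (NaN versus 0.0), both computations agree.
print(np.allclose(cumulative.iloc[1:], direct.iloc[1:]))
```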

To get an impression of how volatile the NASDAQ Composite is, we may also be interested in the total absolute change up to a particular date. To get the absolute change from `'close_diff'` we could use the _Pandas_ function `abs()`. However, for the later exercises it is helpful to show an alternative way.

In [None]:
df_nasdaq['abs_change'] = (df_nasdaq['close_diff']**2)**(1 / 2)
df_nasdaq['total_abs_change'] = df_nasdaq['abs_change'].cumsum()
df_nasdaq[['close','close_diff','abs_change','total_abs_change']].tail()

Note that here we used `.tail()` to show the last 5 rows of the dataframe.

#### Exercise 1.f

Add a new column to the dataframe `df_last_path`, with the label `'segment_length'`, containing in every row the length of the line segment between the $(x,y)$-coordinate of the current row and the previous row. In the first row, manually set the value of `'segment_length'` to `0`.

> **Hint:** The first row in a dataframe `df` is not necessarily the one at index = 0, but rather the one at index = `df.index[0]`.

In [None]:
#// BEGIN_TODO [DAS_1f] Lengths of line segments (1 point)

# ===== =====> Replace this line by your code. <===== ===== #

#// END_TODO [DAS_1f]

df_last_path.head(20)

#### Exercise 1.g

Find the index in `df_last_path` of the first occurrence at which the `'segment_length'` was longest. Assign your answer to the variable `i_longest_segment`.

In [None]:
#// BEGIN_TODO [DAS_1g] Longest segment (1 point)

# ===== =====> Replace this line by your code. <===== ===== #

#// END_TODO [DAS_1g]

i_longest_segment

#### Exercise 1.h

Add a new column named `'path_length'` to `df_last_path`, containing in every row the cumulative sum of the line segments so far.

In [None]:
#// BEGIN_TODO [DAS_1h] Compute path length (1 point)

# ===== =====> Replace this line by your code. <===== ===== #

#// END_TODO [DAS_1h]

df_last_path.head()

The next step will be to estimate the speed of the trajectory at any given moment. For this, we need to know more about how to approximate derivatives.

### Approximating Derivatives and Finding Trends

Suppose now that we are interested in finding some trends in the stock data. For instance, we want to find periods in which the NASDAQ index was generally increasing and periods in which it was decreasing. We could try to plot the differences between consecutive days in closing prices, but this signal is so volatile that it is impossible to extract some useful information out of it:

In [None]:
ax_marg = df_nasdaq['close_diff'].plot()
ax_marg.set_ylabel('Change in index')
ax_marg.set_title('Change in index between consecutive days');

You can use filtering to find an averaged derivative of the signal over a certain time window. For this, we apply a higher-order Gaussian filter to the original time signal. We indicate that we want a first-order filter by specifying the keyword argument `order=1`. (In general, `order=n` gives an approximation of the `n`th derivative. The default value is `order=0`, which just gives an approximation to the original function.) We choose a standard deviation `sigma=30` for the Gaussian filter to average the derivatives over a time period of approximately a month.

In [None]:
df_nasdaq['close_filtered_deriv'] = gaussian_filter1d(df_nasdaq['close'], sigma=30, order=1, mode='nearest')
df_nasdaq.head()

We visualize the approximate derivative in the graph below.

In [None]:
ax_deriv = df_nasdaq['close_filtered_deriv'].plot()
ax_deriv.set_title('Approximate derivative of NASDAQ index', fontsize=16);
ax_deriv.set_ylabel('Change in index per day');
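
As a sanity check of what `order=1` computes, the sketch below applies the filter to a synthetic straight line with a known slope (the slope, length and `sigma` are assumptions for illustration). Away from the boundaries, the filtered values are close to the slope. The result is a change *per sample* (per row); it is only a change per time unit if the rows are equally spaced in time.

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

slope = 2.0                               # known slope of the toy signal
ramp = slope * np.arange(200, dtype=float)

deriv = gaussian_filter1d(ramp, sigma=3, order=1, mode='nearest')

# Look only at the interior, away from boundary effects of mode='nearest'.
interior = deriv[50:150]
print(interior.min(), interior.max())
```

Near the boundaries the estimate is less reliable, because `mode='nearest'` extends the signal with constant values, which pulls the estimated slope towards zero there.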

#### Exercise 1.i

Approximate the _speed_ (in pixels per ms) of the trajectory in `df_last_path` by applying a Gaussian filter of order $1$ to the path length. Store the result in a column `'approximate_speed'`. Use a standard deviation of $25$ ms.
Plot the result.

In [None]:
#// BEGIN_TODO [DAS_1i] Compute approximate speed (1 point)

# ===== =====> Replace this line by your code. <===== ===== #

#// END_TODO [DAS_1i]

## 2. Data Aggregation

In this section, we will compute more features of the data and introduce some very powerful _Pandas_ techniques:

* **Multi-level indexing**
* **Advanced data aggregation** with the function **`agg()`**
* **Joining** two dataframes with the function **`join()`**
* **Applying** a function, Series or dictionary to another Series with the function **`map()`**

Master these techniques, and you will be a truly skilled _Pandas_ user.

We work with the mouse trajectory data that was collected during the mouse experiment, which we load and prepare first. 

## Data Preparation

### Data: Mouse Trajectories

The actual trajectories are stored in `'datasets/fitts.csv'`, which we read into the dataframe `df_fitts`. To create this file, we have processed recorded trajectories as in the previous section.

In [None]:
df_fitts = pd.read_csv('datasets/fitts.csv')
df_fitts.head()

The dataframe is very similar to the one of the previous section. The biggest difference is that it contains paths recorded by more than one user. Each user is assigned a unique number, which is stored in the **`'user'`** column.

### Multi-level Indexing

Every row is _uniquely_ identified by the **triple** of values in the `'user'`, `'trial'` and `'t'` columns. This triple would make for an ideal index for the dataframe. An index that is made out of multiple components is called a **multi-level index** (sometimes called a hierarchical index or MultiIndex). We make the triple of columns `'user'`, `'trial'` and `'t'` into a multi-level index by providing their names as a list to the `set_index()` function.

In [None]:
df_fitts.set_index(['user','trial','t'], inplace=True)
df_fitts.head()

Rows can be retrieved from a dataframe with a multi-level index by supplying a value for all the index components to `.loc[]`. For example, the following will return the third row in the above dataframe:

In [None]:
df_fitts.loc[1164, 5, 0.04]

In the remainder of this notebook, we will use multi-level indexes mainly for joining dataframes.
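
For reference, here is a minimal sketch of multi-level indexing on a hand-made table (the table `toy` and its values are hypothetical and only mimic the layout of `df_fitts`). Supplying all index components selects one row; supplying only the first components selects all rows that share them.

```python
import pandas as pd

# Hypothetical toy table with the same index columns as df_fitts.
toy = pd.DataFrame({
    'user':  [1, 1, 1, 2],
    'trial': [5, 5, 6, 5],
    't':     [0, 20, 0, 0],
    'x':     [0.0, 3.5, 0.0, 1.0],
})
toy = toy.set_index(['user', 'trial', 't'])

# Full key (user, trial, t) plus a column label: a single value.
x_value = toy.loc[(1, 5, 20), 'x']

# Partial key (user, trial): all rows of that trial, indexed by the remaining level 't'.
trial_rows = toy.loc[(1, 5), :]

print(x_value)
print(trial_rows)
```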

### Data: User Properties

The file `'datasets/user_props.csv'` contains settings that are constant for each user. It contains a table with the following columns:

* **`'user'`**: An integer number identifying each user
* **`'use_tue_laptop'`**: Whether user used a TU/e laptop
* **`'right_handed'`**: Whether user is right-handed or not
* **`'platform'`**: The operating system of the user
* **`'platform_version'`**: The version of the operating system of the user

We load it into the dataframe `df_user_props`.

In [None]:
df_user_props = pd.read_csv('datasets/user_props.csv')
df_user_props.head()

Every row in this table is uniquely identified by the integer in the `'user'` column. Therefore, we set this column as the index of the dataframe.

In [None]:
df_user_props.set_index('user',inplace = True)
df_user_props.head()

### Conversion to Readable Table Entries

Some of the columns contain integers that do not have a clear meaning. The interpretation of these integers is encoded by the following dictionaries.

In [None]:
dict_use_tue_laptop = {0: False, 1: True}
dict_right_handed = {0: False, 1: True}

You can use these dictionaries to create more readable columns. There is one technical point to take care of first: the values in a column such as `'use_tue_laptop'` may be stored as floats rather than integers, so we convert the column to integers before applying the dictionary.

In [None]:
df_user_props['use_tue_laptop'] = df_user_props['use_tue_laptop'].astype(int)

#### The _Pandas_ Function **`map()`**

You can now use the _Pandas_ function **`map()`** with the argument `dict_use_tue_laptop` to map every `0` to a `False` and every `1` to a `True`.

In [None]:
df_user_props['use_tue_laptop'] = df_user_props['use_tue_laptop'].map(dict_use_tue_laptop)
df_user_props.head()
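
One detail of **`map()`** is worth knowing: a value that does not occur as a key in the dictionary is mapped to `NaN`. The sketch below shows this on a made-up series (the codes and labels are assumptions for illustration).

```python
import pandas as pd

codes = pd.Series([0, 1, 1, 2])          # toy codes; 2 has no entry in the dictionary
labels = codes.map({0: 'no', 1: 'yes'})  # 2 is mapped to NaN

print(labels)
```

Checking the result for unexpected `NaN` values (for instance with `isna().sum()`) is a quick way to spot codes that the dictionary does not cover.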

### Data: User Trial Properties

The file `'datasets/user_trial_props.csv'` contains settings that are specific to each user trial. It contains a table with the following columns:

* **`'user'`**: an integer that identifies the user who drew the trajectory
* **`'trial'`**: an integer that identifies the trajectory. The pair of `'user'` and `'trial'` identify a trajectory uniquely
* **`'delay'`**: the time in seconds between the moment the user moved the mouse onto the red square at the origin and the moment the target appeared
* **`'input_method'`**: whether the user used a trackpad or a mouse
* **`'target_radius'`**: the radius of the target in pixels
* **`'target_x'`**: the $x$-coordinate of the target in pixels
* **`'target_y'`**: the $y$-coordinate of the target in pixels
* **`'total_time'`**: the total time of the trial in seconds

The meaning of the values in the `'input_method'` column is encoded by the following dictionary.

In [None]:
dict_input_method = { 0 : 'trackpad', 1 : 'mouse' }

We read those properties into the dataframe `df_user_trial_props`.

In [None]:
df_user_trial_props = pd.read_csv('datasets/user_trial_props.csv')
df_user_trial_props.head()

Each row in the dataframe `df_user_trial_props` is uniquely determined by the pair of values in the `'user'` and `'trial'` columns. This pair of values would therefore make for an ideal **multi-level index** of the dataframe.

#### Exercise 2.a

Set the index of the dataframe `df_user_trial_props` to a multi-level index consisting of the `'user'` and `'trial'` columns.

In [None]:
#// BEGIN_TODO [DAS_2a] Multi-level index (1 point)

# ===== =====> Replace this line by your code. <===== ===== #

#// END_TODO [DAS_2a]

df_user_trial_props.head()

#### Exercise 2.b

Use the dictionary `dict_right_handed` to convert the data in the column `'right_handed'` in `df_user_props` to more readable values. Similarly, use the dictionary `dict_input_method` to convert the data in the column `'input_method'` in the dataframe `df_user_trial_props` to more readable values.

In [None]:
#// BEGIN_TODO [DAS_2b] Convert columns in dataframes (1 point)

# ===== =====> Replace this line by your code. <===== ===== #

#// END_TODO [DAS_2b]

df_user_trial_props.head()

## Data Aggregation

In the EDA exercises you have already encountered various ways to aggregate data. One very useful method was to _group_ the data. For instance, if we want to compute the average approximate speed in every trial, we can do this by first grouping on `'user'` and `'trial'`, selecting the `'approximate_speed'` column and computing the mean.

In [None]:
df_speeds = df_fitts.groupby(['user','trial'])[['approximate_speed']].mean()
df_speeds.head()

Note that the resulting dataframe is indexed by the pair of `'user'` and `'trial'` as a multi-level index.

### Computing Several Aggregated Quantities at Once

The _Pandas_ library provides a very convenient function **`agg()`**, which can compute several aggregated quantities on grouped data at once. As an argument, you can supply a dictionary, which maps column names to either

* a name of a function (string), e.g. `'mean'`, `'median'`, `'sum'`, or `'count'`
* a function
* a list of names of functions, e.g. `['sum','count']`

We present an example of the last usage. We compute both the **mean** and the **median** of `'approximate_speed'` and the **max** value of `'x'`.

In [None]:
df_features = df_fitts.groupby(['user','trial']).agg({'approximate_speed':['mean','median'], 'x':['max']})
df_features.head()

In the result, not only do the rows have a multi-level index, but also the columns. For our next step, this is not so useful, so we first rename the columns.

In [None]:
df_features.columns=['appr_speed_mean', 'appr_speed_median', 'x_max']
df_features.head()
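
To see what the multi-level column index looks like before renaming, here is a small sketch on a made-up table (the table `toy` and the new column names are assumptions for illustration). Each column label is a pair of the original column name and the aggregation function.

```python
import pandas as pd

# Hypothetical toy measurements for two users.
toy = pd.DataFrame({
    'user':  [1, 1, 1, 2, 2],
    'speed': [1.0, 2.0, 6.0, 4.0, 8.0],
    'x':     [0.0, 5.0, 3.0, 2.0, 7.0],
})

feats = toy.groupby('user').agg({'speed': ['mean', 'median'], 'x': ['max']})

# The column labels are tuples: (column name, aggregation function).
multi_cols = feats.columns.tolist()
print(multi_cols)

# Flatten to single-level names, in the same order as the tuples above.
feats.columns = ['speed_mean', 'speed_median', 'x_max']
print(feats)
```

When assigning new names this way, the order of the list must match the order of the existing columns, so it is worth printing the columns first.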

### Joining Two Dataframes

We now want to add the computed features as columns to the old dataframe with properties `df_user_trial_props`. We can do this by _joining_ two dataframes with the _Pandas_ function **`join()`**. We store the result in a dataframe `df_results`. 

In [None]:
df_results = df_user_trial_props.join(df_features)
df_results.head()

> **Important:** For the **`join()`** function to behave properly, it is important that the two dataframes are indexed in the same way. For example, it was essential for joining `df_user_trial_props` with `df_features` that we used the same multi-level index for both dataframes.

Because only the paths of a few users are stored in `df_fitts`, for most users we could not compute the features, and they appear as `NaN`s in the table above. However, if we remove those, we can see that the features were added.

In [None]:
df_results.dropna().head()
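
To see in isolation why the `NaN`s appear, here is a small sketch with two made-up dataframes that share the same `('user', 'trial')` multi-level index. The names `df_props`, `df_feats`, and the column `'device'` are hypothetical and chosen only for this illustration.

```python
import pandas as pd

# Made-up properties for three (user, trial) pairs
df_props = pd.DataFrame(
    {'user': [1, 1, 2], 'trial': [1, 2, 1], 'device': ['mouse', 'mouse', 'trackpad']}
).set_index(['user', 'trial'])

# Made-up features, available for only one of those pairs
df_feats = pd.DataFrame(
    {'user': [1], 'trial': [2], 'x_max': [5]}
).set_index(['user', 'trial'])

# join() matches rows on the index and, by default, keeps every row
# of the left dataframe; pairs without features get NaN
df_joined = df_props.join(df_feats)
df_joined
```

Only the pair `(1, 2)` has a value in `'x_max'`, so `df_joined.dropna()` keeps just that one row.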

#### Exercise 2.c

Compute the length for each trajectory in `df_fitts` and join the result with the dataframe `df_results`. Call the resulting dataframe `df_fitts_results`. Make sure that the column containing the total length is called `'length'`.

In [None]:
df_fitts.head()

In [None]:
#// BEGIN_TODO [DAS_2c] Compute the length per trajectory (1 point)

# ===== =====> Replace this line by your code. <===== ===== #

#// END_TODO [DAS_2c]

df_fitts_results.dropna().head()

#### Exercise 2.d

Compare the mean and median trajectory length per input method. The result should be a dataframe indexed by input method and have columns containing the mean and the median total length.

> **Hint:** Use the `df_fitts_results` dataframe.

In [None]:
#// BEGIN_TODO [DAS_2d] Mean and median per input method (1 point)

# ===== =====> Replace this line by your code. <===== ===== #

#// END_TODO [DAS_2d]

Do you see any differences in path lengths between mouse trajectories and trackpad trajectories? 

Let us now compare total times instead. Rather than computing only two statistics of the data, the median and the mean, we will look at the *distribution* of the total times.

## 3. Empirical Cumulative Distribution Functions

To better compare the total times for mouse and trackpad trajectories, we will now look at their empirical cumulative distribution functions (ECDFs).

Intuitively, the $\mathrm{ECDF}$ is a function of $x$ that, for a given value of $x$, returns the fraction of observations that are *less than or equal to* $x$.

More precisely, if $x_1, \dots, x_N$ are the $N$ outcomes of an experiment, the ECDF, evaluated at $x$, is defined as the number $N_x$ of indices $i$ such that $x_i \leq x$, divided by $N$:

$$
\mathrm{ECDF}(x) = \frac{N_x}{N}= \frac{1}{N} \# \{ i \ |\ x_i \leq x \}. 
$$
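
The definition can be translated almost literally into plain Python. The following is a minimal sketch: the helper name `ecdf_at` and the outcomes are made up for illustration, and this direct count is not the approach used in the remainder of this section.

```python
def ecdf_at(outcomes, x):
    """Fraction of outcomes that are less than or equal to x."""
    n_x = sum(1 for outcome in outcomes if outcome <= x)
    return n_x / len(outcomes)

outcomes = [5, 3, 5, 7, 1]   # made-up outcomes, N = 5

print(ecdf_at(outcomes, 0))  # 0.0  (no outcome is <= 0)
print(ecdf_at(outcomes, 4))  # 0.4  (1 and 3 are <= 4)
print(ecdf_at(outcomes, 5))  # 0.8  (1, 3, 5 and 5 are <= 5)
print(ecdf_at(outcomes, 9))  # 1.0  (all outcomes are <= 9)
```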

The following figure shows an example of an ECDF for an experiment where the outcomes ranged between 0 and 50 and were sampled from a normal distribution:

![ecdf.png](attachment:ecdf.png)

We will show how to compute and plot the ECDF. 

Although the Python library _statsmodels_ provides a function `ECDF()` that can compute the ECDF for you, we show how to compute the ECDF by hand. This way, we get to know and practice the very useful _Pandas_ functions **`value_counts()`** and **`sort_index()`**, and we gain more insight into what the ECDF actually is. Moreover, plotting an ECDF remains a bit tricky, so using the library function provides almost no advantage.

Let us consider a made-up experiment sampling integer numbers between 0 and 10 with outcomes `[5,3,5,7,1]`.

In [None]:
df_experiment = pd.DataFrame([5,3,5,7,1], columns=['outcome']) 
df_experiment

To determine the ECDF, we first want to know, for every value appearing in the `'outcome'` column, how often it occurs. For this, we use the _Pandas_ function **`value_counts()`**.

In [None]:
df_experiment['outcome'].value_counts()

We interpret the result of `value_counts()` as follows: The value `5` appears `2` times in the `'outcome'` column, the value `7` appears `1` time, as do the values `3` and `1`. We rename the Series to `'counts'`, and convert it to a dataframe `df_counts`.

In [None]:
df_counts = pd.DataFrame( df_experiment['outcome'].value_counts().rename('counts') )
df_counts

Next, we want to sort the index. This will later allow us to compute the ECDF efficiently.

In [None]:
df_counts.sort_index(inplace=True)
df_counts

The ECDF can now be computed easily. To find the ECDF evaluated at a point $x$, we need two pieces of information:
* The number $N_x$ of outcomes smaller than or equal to $x$
* The total number of outcomes $N$

Then $\mathrm{ECDF}(x) = N_x/ N$.
Therefore, the ECDF evaluated at each value in the `index` of the dataframe above is equal to the cumulative sum of the `'counts'` column up to and including that row, divided by $N$.

In [None]:
df_counts['ecdf'] = df_counts['counts'].cumsum() / df_counts['counts'].sum()
df_counts
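
As a cross-check, the same ECDF values can be obtained in a single chained expression, because `value_counts()` accepts the keyword argument `normalize=True`, which divides the counts by $N$. This is only a compact sketch on the same made-up outcomes; the step-by-step version above shows more clearly what is being computed.

```python
import pandas as pd

outcomes = pd.Series([5, 3, 5, 7, 1], name='outcome')

# normalize=True returns relative frequencies (counts divided by N);
# sorting the index first makes the cumulative sum run from small to large outcomes
ecdf = outcomes.value_counts(normalize=True).sort_index().cumsum()
ecdf
```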

Plotting the ECDF is in fact quite tricky. A graph of an ECDF is a step function, i.e. piecewise flat. To achieve this, we provide the keyword argument `drawstyle='steps-post'`.

In [None]:
ax = df_counts[['ecdf']].plot(drawstyle='steps-post')
ax.set_xlim(0,10)

The graph has a major flaw: it suddenly starts at the smallest recorded outcome, and stops at the largest recorded outcome, even though the ECDF is actually defined on the whole real line. We want to draw the graph for the whole plot range, which for our example above means that we want to plot the ECDF from `x=0` to `x=10`.

Therefore, we copy the `'ecdf'` column of the dataframe `df_counts` to a dataframe `df_ecdf` and add two more $x$-values to the dataframe `df_ecdf` together with the corresponding value of the ECDF: 
* one value of $x$ much smaller than the smallest recorded outcome, e.g. $x = -2000$, so the ECDF at that point is $\mathrm{ECDF}(-2000) = 0$.
* one value of $x$ much larger than the largest recorded outcome, e.g. $x = 2000$, so the ECDF at that point is $\mathrm{ECDF}(2000) = 1$.

> **Note:** The precise values $x=-2000$ and $x=2000$ are not important. For the plotting it only matters that the first value is smaller than the left boundary of the plot, and the second is larger than the right boundary. We choose our values _much_ smaller and _much_ larger respectively, so that we don't have to change the values if we change our mind about our desired plot range.

In [None]:
df_ecdf = df_counts[['ecdf']].copy()
df_ecdf.loc[-2000,'ecdf'] = 0
df_ecdf.loc[2000,'ecdf'] = 1
df_ecdf

We sort the dataframe by its index once more, so that the two new rows end up at the beginning and the end.

In [None]:
df_ecdf.sort_index(inplace=True)
df_ecdf

Now we are finally ready to plot the ECDF. 

In [None]:
ax = df_ecdf['ecdf'].plot( drawstyle ='steps-post' )
ax.set_xlim(0, 10)
ax.set_xlabel('outcome')
ax.set_title('ECDF', fontsize=14);

### Exercises: Compare ECDFs for Mouse and Trackpad Trajectories

To get more insight into the differences in total times needed to reach a target by using a mouse vs. a trackpad, we aim to plot both the ECDF for the total times for the *mouse* trajectories and the ECDF for the *trackpad* trajectories in one figure. This is a rather big task, so we split it up into several exercises for the mouse trajectories. Afterwards, you can apply the same steps to the trackpad trajectories.

#### Exercise 3.a

Use the dataframe `df_user_trial_props` to create a dataframe `df_mouse`. The dataframe `df_mouse` should contain one column, labeled `'total_time'`, with the total time for each trajectory that was recorded with a **mouse**.

In [None]:
#// BEGIN_TODO [DAS_3a] Total times mouse trajectories (1 point)

# ===== =====> Replace this line by your code. <===== ===== #

#// END_TODO [DAS_3a]

df_mouse.head()

#### Exercise 3.b

Create a dataframe `df_mouse_counts` indexed by all total times occurring in the dataframe `df_mouse`. The dataframe `df_mouse_counts` should contain one column, labeled `'counts'`, which contains how often that total time occurs in `df_mouse`. Make sure that the indices are sorted.

In [None]:
#// BEGIN_TODO [DAS_3b] Count occurences total times (1 point)

# ===== =====> Replace this line by your code. <===== ===== #

#// END_TODO [DAS_3b]

df_mouse_counts.head()

#### Exercise 3.c

Add a column labeled `'ecdf'` to the dataframe `df_mouse_counts`, containing the value of the ECDF evaluated at the time in the index.

In [None]:
#// BEGIN_TODO [DAS_3c] Add column with ECDF (1 point)

# ===== =====> Replace this line by your code. <===== ===== #

#// END_TODO [DAS_3c]

df_mouse_counts.head()

#### Exercise 3.d

Copy the `'ecdf'` column of the dataframe `df_mouse_counts` to a new dataframe `df_mouse_ecdf`. Add two new values of the ECDF to this dataframe, one at a very large negative time, and one at a very large positive time, let's say at time $-2000$ s and $2000$ s. Make sure that the index of the dataframe is sorted.

In [None]:
#// BEGIN_TODO [DAS_3d] Add new values to ECDF dataframe (1 point)

# ===== =====> Replace this line by your code. <===== ===== #

#// END_TODO [DAS_3d]

df_mouse_ecdf.head()

> **Note:** When you are working towards a goal (such as plotting an ECDF) and you need to make multiple steps, it is good to regularly display some intermediate results to see whether everything looks as expected. Exactly for this reason, we have called the `head()` function at the end of each code cell. However, after this exercise, we would also like to know whether the value of the ECDF at time $2000$ s is inserted correctly.

#### Exercise 3.e

Display the last five rows of the dataframe `df_mouse_ecdf`. 

> **Hint:** Just like the function `head()` displays the first five rows of a dataframe, the function `tail()` displays the last five rows.

In [None]:
#// BEGIN_TODO [DAS_3e] Display last five rows ECDF dataframe (1 point)

# ===== =====> Replace this line by your code. <===== ===== #

#// END_TODO [DAS_3e]

Does everything look okay? Then let's follow the same steps for the trackpad trials.

#### Exercise 3.f
    
Create a dataframe `df_trackpad_ecdf` indexed by all total times occurring in **trackpad** trajectories, an additional very large negative time and a very large positive time, and with a column `'ecdf'` containing the value of the ECDF at these times. We advise you to use multiple code cells and regularly display output to check whether everything looks as expected.

In [None]:
#// BEGIN_TODO [DAS_3f] Create ECDF for trackpkad trajectories (2 points)

# ===== =====> Replace this line by your code. <===== ===== #


In [None]:
#// END_TODO [DAS_3f]

df_trackpad_ecdf.head()

Time to see the results... 

#### Exercise 3.g

Plot in one figure:

* the ECDF of the total time used for the **mouse** trials
* the ECDF of the total time used for the **trackpad** trials

Make sure the $x$-axis runs from $0$ to $4$ (s), and include a legend explaining which graph corresponds to the mouse, and which corresponds to the trackpad.

<span class="t">Hint<span class="c">:</span></span>
<span class="h">
Use dataframes `df_mouse_ecdf` and `df_trackpad_ecdf` defined above.
</span>

In [None]:
#// BEGIN_TODO [DAS_3g] Plot ECDFs (1 point)

# ===== =====> Replace this line by your code. <===== ===== #

#// END_TODO [DAS_3g]

## 4. Exercise: Find Ballistic Motion


In the earlier exercises you were closely guided. In the following exercises we will put your knowledge to the test.

Note that these exercises may seem more difficult, as you will need to work more independently. If you struggle with an exercise, go back to the corresponding earlier section and make sure you really understand the introduced concepts. Do not hesitate to experiment with your own code!


In this exercise, we will use the data in `'datasets/path.csv'`, loaded into the dataframe `df_path`:

In [None]:
df_path = pd.read_csv('datasets/path.csv')
df_path.head()

It contains several features that were derived from the raw data of a single mouse trajectory, similarly to the exercises in Section 1. The `'filt_x'` and `'filt_y'` columns contain the approximate $x$- and $y$-coordinates computed by applying a Gaussian filter to the raw coordinates, and the `'approximate_speed'` column contains the approximate speed (in pixels per ms) computed by applying a Gaussian filter of order 1 to the path length.

In this exercise we will extract one more feature, which was already mentioned in the lecture: we are going to extract the _ballistic part_ of the mouse trajectory. This is quite challenging, so do not hesitate to ask your tutor for hints.

The ballistic part of the motion is the motion restricted to a certain time-interval (from `i_left` to `i_right`) around the time `i_max` at which the speed is maximal, as illustrated by the following picture.

![ballistic-illustration.png](attachment:ballistic-illustration.png)

Here is a picture of the corresponding path, where the ballistic part of the motion is indicated in black.

![ballistic-superimposed.png](attachment:ballistic-superimposed.png)

We want to use _Pandas_ to extract this ballistic part of the trajectory. For that, we need a _very precise definition_. The precise definition of the _ballistic_ part of the mouse trajectory is as follows: we first find the index `i_max` for which the (approximate) _speed_ is maximal (if there are multiple such indices, we take the smallest). Then, we define the range of indices from `i_left` to `i_right` as the largest range of indices containing `i_max` such that the speed is _increasing_ from `i_left` to `i_max` and _decreasing_ from `i_max` to `i_right`. The _ballistic_ part of the mouse trajectory is defined as the mouse trajectory restricted to the range from `i_left` to `i_right`.

In other words, `i_left` is the smallest index such that for every index `i` between `i_left`  and `i_max` the approximate speed at `i` is _larger than or equal to_ the approximate speed at `i-1`. Similarly, `i_right` is the largest index such that for every `i` between `i_max + 1` and `i_right`, the approximate speed at `i` is _smaller than or equal to_ the approximate speed at `i-1`.
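
To make the two conditions concrete, here is a small plain-Python sketch that checks whether a given triple of indices satisfies them. It deliberately does not check that the range is the *largest* one, and it is not the _Pandas_ approach asked for in the exercises below. The helper name `satisfies_ballistic_conditions` and the speeds are made up for illustration.

```python
def satisfies_ballistic_conditions(speeds, i_left, i_max, i_right):
    """Check the two monotonicity conditions (not maximality of the range).

    Assumes 1 <= i_left <= i_max <= i_right < len(speeds).
    """
    # Speed must not decrease on the way up to i_max
    rising = all(speeds[i] >= speeds[i - 1] for i in range(i_left, i_max + 1))
    # Speed must not increase after i_max
    falling = all(speeds[i] <= speeds[i - 1] for i in range(i_max + 1, i_right + 1))
    return rising and falling

# Made-up speeds; the maximum 0.9 first occurs at index 4
speeds = [0.1, 0.3, 0.2, 0.5, 0.9, 0.9, 0.7, 0.4, 0.6]

print(satisfies_ballistic_conditions(speeds, 3, 4, 7))  # True
print(satisfies_ballistic_conditions(speeds, 2, 4, 7))  # False: speed drops from index 1 to 2
print(satisfies_ballistic_conditions(speeds, 3, 4, 8))  # False: speed rises from index 7 to 8
```

For these made-up speeds, the range from `3` to `7` cannot be extended in either direction, so here `i_left` is `3` and `i_right` is `7`.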

Our goal will be to find `i_left` and `i_right` for the trajectory in `df_path`. We will afterwards define the dataframe `df_ballistic` as `df_path.loc[i_left:i_right]`, which includes both the row at `i_left` and the row at `i_right`.

### Exercise 4.a

Find the index in `df_path` for which the `'approximate_speed'` is maximal. Assign your answer to the variable `i_max`.

In [None]:
#// BEGIN_TODO [DAS_4a] Find i_max (1 point)

# ===== =====> Replace this line by your code. <===== ===== #

#// END_TODO [DAS_4a]

df_path[i_max-2:i_max+3]

In the above dataframe, the approximate speed should be maximal for the row in the middle, with index equal to `i_max`:

In [None]:
i_max

### Exercise 4.b

Add a column labeled `'speed_diff'` to the dataframe `df_path`, containing in every row the difference between the approximate speed in that row and the approximate speed in the previous row.

In [None]:
#// BEGIN_TODO [DAS_4b] Add speed_diff (1 point)

# ===== =====> Replace this line by your code. <===== ===== #

#// END_TODO [DAS_4b]

df_path.head()

### Exercise 4.c

Find `i_left`, the smallest index in `df_path` such that for every index `i` between `i_left` and `i_max` (i.e. `i_left` $\leq$ `i` $\leq$ `i_max`) the approximate speed at `i` is _larger than or equal to_ the approximate speed at `i-1`.

<span class="t">Hint<span class="c">:</span></span>
<span class="h">
Note that a trajectory consists of alternating chunks: those for which the approximate speed at `i` is _larger than or equal to_ the approximate speed at `i-1`, and those for which the opposite holds. We can get `i_left` by finding the last chunk before `i_max` for which the opposite holds (i.e. where the speed at `i` is _strictly less than_ the speed at `i-1`).
</span>

<span class="t">Hint<span class="c">:</span></span>
<span class="h">
First use slicing and the index `i_max` to make a new dataframe `df_first_part` which only contains the rows up to (and including) the index `i_max`. Next, use the column `'speed_diff'` created before to select only those rows `i` for which the speed at `i` is *strictly less than* the speed at `i-1`. You can use this last dataframe to find out what `i_left` should be.
</span>

In [None]:
#// BEGIN_TODO [DAS_4c] Find i_left (1 point)

# ===== =====> Replace this line by your code. <===== ===== #


In [None]:
#// END_TODO [DAS_4c]

i_left

The following code cell can give you a quick (although not a full) check of your work. If everything worked out, the value in the column `'speed_diff'` should be less than zero in the first row, and larger than or equal to zero in the other rows.

In [None]:
df_path.loc[i_left-1:i_left+3]

Note that an interval `.loc[a:b]` includes both `a` and `b`.
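
The difference with ordinary slicing is easy to overlook, so here is a minimal sketch on a made-up dataframe `df_toy` (hypothetical, with the default index `0` to `4`):

```python
import pandas as pd

df_toy = pd.DataFrame({'value': [10, 11, 12, 13, 14]})  # default index 0..4

print(df_toy.loc[1:3].index.tolist())   # [1, 2, 3]  label-based, end included
print(df_toy.iloc[1:3].index.tolist())  # [1, 2]     position-based, end excluded
print(df_toy[1:3].index.tolist())       # [1, 2]     plain slicing is also position-based
```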

### Exercise 4.d

Find `i_right`, the largest index such that for every index `i` between `i_max + 1` and `i_right` (`i_max + 1` $\leq$ `i` $\leq$ `i_right`) the approximate speed at `i` is _smaller than or equal to_ the approximate speed at `i-1`.

In [None]:
#// BEGIN_TODO [DAS_4d] Find i_right (1 point)

# ===== =====> Replace this line by your code. <===== ===== #

#// END_TODO [DAS_4d]

i_right

The next code cell can give you a quick (although not a full) check of your work. If everything worked out, the value in the column `'speed_diff'` should be larger than zero in the last row, and less than or equal to zero in the other rows.

In [None]:
df_path.loc[i_right - 3:i_right + 1]

Now we are ready to define `df_ballistic`:

In [None]:
df_ballistic = df_path.loc[i_left:i_right]
df_ballistic.head()

In [None]:
df_ballistic.tail()

### Exercise 4.e

Plot the full, filtered, mouse trajectory (with coordinates `'filt_x'` and `'filt_y'`) in `df_path` in blue, and on top of it indicate the ballistic motion with a thick, black curve (i.e. with `linewidth` equal to `3`).

In [None]:
#// BEGIN_TODO [DAS_4e] Plot ballistic motion (1 point)

# ===== =====> Replace this line by your code. <===== ===== #

#// END_TODO [DAS_4e]

# Feedback

Please fill in this questionnaire to help us improve this course for next year. Your feedback will be anonymized and will not affect your grade in any way!

### How many hours did you spend on these Exercises?

Assign a number to `feedback_time`.

In [None]:
#// BEGIN_FEEDBACK [Feedback_1] (0 point)

#// END_FEEDBACK [Feedback_1] (0 point)

import numbers
assert isinstance(feedback_time, numbers.Number), "Please assign a number to feedback_time"
feedback_time

### How difficult did you find these Exercises?

Assign an integer to `feedback_difficulty`, on a scale 0 - 10, with 0 being very easy, 5 being just right, and 10 being very difficult.

In [None]:
#// BEGIN_FEEDBACK [Feedback_2] (0 point)

#// END_FEEDBACK [Feedback_2] (0 point)

import numbers
assert isinstance(feedback_difficulty, numbers.Number), "Please assign a number to feedback_difficulty"
feedback_difficulty

### (Optional) What did you like?

Assign a string to `feedback_like`.

In [None]:
#// BEGIN_FEEDBACK [Feedback_3] (0 point)

#// END_FEEDBACK [Feedback_3] (0 point)

### (Optional) What can be improved?

Assign a string to `feedback_improve`. Please be specific, so that we can act on your feedback. For example, mention the specific exercises and what was unclear.

In [None]:
#// BEGIN_FEEDBACK [Feedback_4] (0 point)

#// END_FEEDBACK [Feedback_4] (0 point)




## How to Submit Your Work

1. **Before submitting**, you must run your notebook by doing **Kernel > Restart & Run All**.  
   Make sure that your notebook runs without errors **in linear order**.
1. Remember to rename the notebook, replacing `...-template.ipynb` with `...-yourIDnr.ipynb`, where `yourIDnr` is your TU/e identification number.
1. Submit the executed notebook with your work
   for the appropriate assignment in **Canvas**.
1. In the **Momotor** tab in Canvas,
  you can select that assignment again to find some feedback on your submitted work.
  If there are any problems reported by _Momotor_,
  then you need to fix those,
  and **resubmit the fixed notebook**.

In case of a high workload on our server
(because many students submit close to the deadline),
it may take longer to receive the feedback.




---

In [None]:
# List all defined names
%whos

---

# (End of Notebook) <span class="tocSkip"></span>

&copy; 2017-2023 - **TU/e** - Eindhoven University of Technology