---
title: "Lab: Exploring a Dataset"
toc: true
output-file: lab_explore_dataset.html
---

We want to take a look at this real-world dataset: [https://github.com/OpenNeuroDatasets/ds005420](https://github.com/OpenNeuroDatasets/ds005420)

## Download data

### Clone the repository 
Inside the `/pycourse/data` folder:
```bash
git clone https://github.com/OpenNeuroDatasets/ds005420
```

### Install `git-annex`
The files with the actual data are not in the cloned repository yet; it only contains references to them, which we can use to pull the data down.
To do that, we need the [git-annex](https://git-annex.branchable.com/install/) tool.


### Pull the data
Open a terminal inside `ds005420` and run:
```bash
git-annex get .
```
You should see a progress dialog showing `...from s3-PUBLIC...`  
After that, you're ready to go with the exercises.
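
If you want to double-check that the download is complete, the following is a minimal sketch. It assumes that git-annex stores each data file as a symlink into `.git/annex/objects` (its default mode on Linux and macOS), so a dangling symlink means the content is still missing. The function name `find_missing_annex_files` is made up for this lab.

```python
import os
from pathlib import Path


def find_missing_annex_files(dataset_dir):
    """Return annexed files whose content has not been downloaded yet.

    Assumption: each annexed file is a symlink into .git/annex/objects,
    so a dangling symlink means the content is not available locally.
    """
    missing = []
    for root, dirs, files in os.walk(dataset_dir):
        # Do not descend into git's own bookkeeping directory.
        dirs[:] = [d for d in dirs if d != ".git"]
        for name in files:
            path = Path(root) / name
            # is_symlink() is True even for dangling links; exists() follows them.
            if path.is_symlink() and not path.exists():
                missing.append(path)
    return sorted(missing)
```

An empty list from `find_missing_annex_files("/pycourse/data/ds005420")` would suggest that every referenced file was fetched.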

:::{ .callout-tip }
Jupyter Notebooks are ideal for this kind of exploratory task.
Make a directory called `/python/notebooks` and open a JupyterLab instance there.
Keeping the notebooks there will help us keep things tidy for later reproducibility of our workflow.
:::

## Explore files
1) List only the sub-directories in the dataset path.  
2) List only the sub-directories with subject data.
3) Write a function that lists sub-directories with subject data.
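
As one possible starting point for the third task, here is a short sketch using `pathlib`. It assumes that subject directories follow the BIDS naming convention and start with `sub-`; check this against what you actually see in the dataset. The function name `list_subject_dirs` is only a suggestion.

```python
from pathlib import Path


def list_subject_dirs(dataset_dir, prefix="sub-"):
    """Return the subject sub-directories of a dataset, sorted by name.

    Assumption: subject folders start with "sub-" (BIDS convention).
    """
    dataset_dir = Path(dataset_dir)
    return sorted(
        path
        for path in dataset_dir.iterdir()
        # Keep directories only, and only those that look like subjects.
        if path.is_dir() and path.name.startswith(prefix)
    )
```

Calling `list_subject_dirs("/pycourse/data/ds005420")` should then return a list of `Path` objects, one per subject.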

## Validating the data
We will start by making sure our data/metadata contains the information we expect at a high level.

1) Write a unit test (inside `/pycourse/tests/test_data.py`) to make sure the number of subject sub-directories corresponds to the actual number of subjects. **Hint:** Look at the metadata.
2) Verify that all subject directories have an `eeg` sub-directory.  
3) Verify that all data files in a subject directory match the subject number.  
4) Assert that EEG data for all subjects was taken using 20 channels and sampling frequency 500.  
5) (Optional) Write a file (`discarded_subjects.txt`) with the subject numbers that do not match that criterion.  

## Exploratory data analysis 
Now we want to look at the data.
We find that the data is in a particular format, `.edf`, that we cannot read directly in Python.  
**Hint:**
We need to install a third-party library `mne` to read `.edf` files.  
You can check out the [library documentation here](https://mne.tools/dev/)

:::{ .callout-tip }
It's a *very* good idea to first take a look at the documentation of a tool before installing it.
Executing someone else's code is a potential risk, so you should try to find out whether you can actually trust the source.
:::

1) Plot one time series.  
2) Plot all time series with labels according to channel name.  
3) Plot the channels that start with "T" and "O".  
4) Plot a correlation plot of the "T" and "O" channels as a heatmap.
5) Plot a histogram of `RecordingDuration` across all subjects.  
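
To get started with the plotting tasks, here is a hedged sketch that reads one recording with `mne.io.read_raw_edf` and plots the channels whose label starts with "T" or "O". It assumes that channel labels carry a leading "EEG " (as in "EEG C4-A1A2"), which is ignored when matching. The helper names `pick_channels_by_prefix` and `plot_channels` are invented for this lab, and you need to supply the path to an `.edf` file yourself.

```python
def pick_channels_by_prefix(ch_names, prefixes=("T", "O"), strip="EEG "):
    """Return channel names whose label starts with one of `prefixes`.

    Assumption: labels may look like "EEG C4-A1A2", so a leading "EEG "
    is ignored when matching.
    """
    picked = []
    for name in ch_names:
        label = name[len(strip):] if name.startswith(strip) else name
        if label.startswith(tuple(prefixes)):
            picked.append(name)
    return picked


def plot_channels(edf_path, prefixes=("T", "O")):
    """Plot the selected channels of one EDF recording against time."""
    # Third-party imports live here so the helper above works without them.
    import matplotlib.pyplot as plt
    import mne

    raw = mne.io.read_raw_edf(edf_path, preload=True)
    names = pick_channels_by_prefix(raw.ch_names, prefixes)
    data = raw.get_data(picks=names)  # shape: (n_channels, n_samples)

    fig, ax = plt.subplots()
    for name, series in zip(names, data):
        ax.plot(raw.times, series, label=name)
    ax.set_xlabel("Time (s)")
    ax.set_ylabel("Amplitude")
    ax.legend()
    return fig
```

Passing the full list `raw.ch_names` instead of a selection gives the plot of all time series with channel labels.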

## Process data
After having taken this quick look at the data, we want to start processing the data.

1) Clean the column names by removing "EEG", e.g. "EEG C4-A1A2" -> "C4-A1A2".
2) Subtract the mean from each channel.  
3) Plot the correlation matrix of all-vs-all channels. **Hint:** Look at the seaborn documentation on heatmaps.
4) Save the correlation plot as vector graphics.
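
The processing steps could look roughly like the sketch below. It assumes the signal is available as a NumPy array of shape `(n_channels, n_samples)` and that the channel labels carry a leading "EEG ". The function names and the output file name `correlation.svg` are placeholders chosen for this example.

```python
import numpy as np


def clean_channel_names(names, prefix="EEG "):
    """Drop the leading "EEG " from each label, e.g. "EEG C4-A1A2" -> "C4-A1A2"."""
    return [n[len(prefix):] if n.startswith(prefix) else n for n in names]


def demean(data):
    """Subtract each channel's mean; rows are channels, columns are samples."""
    data = np.asarray(data, dtype=float)
    return data - data.mean(axis=1, keepdims=True)


def channel_correlation(data):
    """Return the all-vs-all Pearson correlation matrix of the channels."""
    # np.corrcoef centres each row itself, so demeaning here is explicit
    # rather than strictly required.
    return np.corrcoef(demean(data))


def save_correlation_heatmap(corr, names, out_path="correlation.svg"):
    """Draw the matrix with seaborn and save it as vector graphics."""
    # Plotting libraries are imported lazily; the functions above only need NumPy.
    import matplotlib.pyplot as plt
    import seaborn as sns

    fig, ax = plt.subplots(figsize=(8, 7))
    sns.heatmap(corr, xticklabels=names, yticklabels=names,
                vmin=-1, vmax=1, cmap="coolwarm", square=True, ax=ax)
    fig.savefig(out_path)  # the .svg suffix selects a vector format
    return fig
```

Fixing `vmin=-1` and `vmax=1` keeps the colour scale comparable between subjects, which is a design choice rather than a requirement.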