## Running Data Pipelines In a Jupyter Notebook

Jupyter Notebook is an open-source web application that lets you create and share documents containing live code, equations, visualizations, and narrative text. It is useful for data cleaning and transformation, numerical simulation, statistical modeling, data visualization, machine learning, and much more, and it supports interactive data science and scientific computing in over 40 programming languages. In this workshop, we will mostly use Jupyter Notebooks to write Python code for data analysis and visualization.

Here are some of the key use cases of Jupyter Notebooks:
  - Documentation (using Markdown Cells)
  - Code (using Code Cells)
  - Running Code in other files (using the `%run` command)
  - Installing Python Packages (using the `%pip install` command)

By combining these, you can create a documented series of steps that explains the analyses and/or shows how someone else can run the code!
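
`%run` is an IPython magic, so it only works inside a notebook or an IPython session. As a rough illustration of what "running code in another file" means, the sketch below uses the standard library's `runpy.run_path`, which executes a script and returns the variables it defined. It is only loosely comparable to `%run`, and the script name `hello_pipeline.py` is a hypothetical placeholder created just for this example.

```python
import runpy
import tempfile
from pathlib import Path

# Hypothetical throwaway script, written here only so the example is self-contained.
script_dir = Path(tempfile.mkdtemp())
script_path = script_dir / "hello_pipeline.py"
script_path.write_text("message = 'Hello from a script!'\nprint(message)\n")

# Run the file as if it were the main program and collect the variables it defined.
script_globals = runpy.run_path(str(script_path), run_name="__main__")

# Variables created by the script are available in the returned dictionary.
print(script_globals["message"])
```

In a notebook, `%run` goes one step further and places the script's variables directly into the notebook's namespace.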

### Data for this course

The analyses in our course focus on a dataset provided by Nicholas Steinmetz and his colleagues as part of their 2019 publication in Nature ([Steinmetz et al., 2019](https://www.nature.com/articles/s41586-019-1787-x)).

In this notebook there are several steps containing code for:
  - Downloading the paper
  - Downloading example videos
  - Downloading the data provided by the authors
  - Converting the data into easier-to-analyze formats for this course


Each step has two parts, both of which need to be run:
  1. **Download the Dependencies**. Some of the steps might need certain packages; to download and install them into your Python environment, run the cells with `%pip install` in them.
  2. **Run the Code**.  Some of the cells have the code written directly here, while others run scripts found in other files.  Just run the cell and when the code is finished running you will see "Success!" printed below it!
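
If you are unsure whether a step's dependencies are already installed, one possible check is sketched below. It uses the standard library's `importlib.util.find_spec`; the helper name `missing_packages` is an assumption made for this example and is not part of the course scripts. Note that the name you import can differ from the name you pass to `%pip install`.

```python
import importlib.util

def missing_packages(import_names):
    """Return the names from import_names that cannot be imported here."""
    return [name for name in import_names if importlib.util.find_spec(name) is None]

# 'requests' and 'tqdm' are used in Step 2; the last name is made up on purpose
# so that the list is never empty.
print(missing_packages(["requests", "tqdm", "surely_not_installed_pkg"]))
```

Any name that appears in the printed list still needs to be installed with the matching `%pip install` cell.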

### Exercise: Run the Data Pipeline   

Please run the code in the following sections to download the data and save it as a NetCDF file (`.nc` extension). **Note** that you do not need to understand the details of the code here; the goal is to get a feel for using and running Jupyter notebooks.

You can find some of the useful keyboard shortcuts for Jupyter Notebook in [this cheatsheet](jupyter_cheatsheet.md).

---

### Step 1: Download The Nature Paper

Nature Paper: https://www.nature.com/articles/s41586-019-1787-x

In [None]:
import os
import urllib.request

url = 'https://www.nature.com/articles/s41586-019-1787-x.pdf'
filename = 'references/steinmetz2019.pdf'

os.makedirs('references', exist_ok=True)
urllib.request.urlretrieve(url, filename);
print('Success!')
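
A download can finish without an error even when the server sends back something other than the paper, for example an HTML page. As a minimal sketch of a sanity check, the helper below (the name `looks_like_pdf` is an assumption for this example) tests whether the saved file starts with the `%PDF-` signature that PDF files begin with.

```python
from pathlib import Path

def looks_like_pdf(path):
    """Return True if the file exists and starts with the PDF signature '%PDF-'."""
    path = Path(path)
    if not path.is_file():
        return False
    with path.open("rb") as f:
        return f.read(5) == b"%PDF-"

# Path taken from the cell above; prints False if the file is missing or is not a PDF.
print(looks_like_pdf("references/steinmetz2019.pdf"))
```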

---

### Step 2: Download the Trimmed Videos

  - *Inputs From*: iBOTS-Hosted Sciebo Folder
  - *Outputs To*: `vids/*.avi`

##### Install Dependencies

In [None]:
%pip install --upgrade requests tqdm

##### Run Code

In [None]:
import os
import requests
from tqdm import tqdm

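def download_from_sciebo(public_url, to_filename, is_file=True):
    """
    Downloads a file or folder from a shared URL on Sciebo
    """
    r = requests.get(public_url + "/download", stream=True)
    r.raise_for_status()  # stop on a bad link instead of saving an error page
    # Only single files report a size; fall back to an open-ended progress bar otherwise
    total = (int(r.headers.get('Content-Length', 0)) or None) if is_file else None
    # The context managers close both the file and the progress bar when the download ends
    with open(to_filename, 'wb') as f, tqdm(desc=f"Downloading {to_filename}", unit='B', unit_scale=True, total=total) as progress_bar:
        for chunk in r.iter_content(chunk_size=8192):
            f.write(chunk)
            progress_bar.update(len(chunk))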


os.makedirs("vids", exist_ok=True)
download_from_sciebo("https://uni-bonn.sciebo.de/s/oMoBlis0VvAsblG", "vids/eyetracking_example_steinmetz2019.avi")
download_from_sciebo("https://uni-bonn.sciebo.de/s/fDY3V8JnZEOPnCR", "vids/mouse_wheel_example_steinmetz2019.avi")
print('Success!')
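
The download function above writes each video in small chunks rather than loading the whole file into memory at once. The sketch below shows the same loop on in-memory streams so that it runs without a network connection. The helper name `copy_in_chunks` and the placeholder bytes are assumptions for this illustration only.

```python
import io

def copy_in_chunks(source, destination, chunk_size=8192):
    """Copy a binary stream piece by piece and return the number of bytes written."""
    total = 0
    while True:
        chunk = source.read(chunk_size)
        if not chunk:  # an empty chunk means the source is exhausted
            break
        destination.write(chunk)
        total += len(chunk)
    return total

# 20,000 placeholder bytes stand in for a downloaded video.
fake_download = io.BytesIO(b"x" * 20_000)
saved = io.BytesIO()
print(copy_in_chunks(fake_download, saved))  # prints 20000
```

Because only one chunk is held in memory at a time, the same approach works for files much larger than the available RAM.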

--- 

### Step 3: Download the Steinmetz et al, 2019 Dataset

From the [Neuromatch Academy Data Archive](https://osf.io/hygbm), hosted by the [Center for Open Science](https://osf.io/)

  - *Input from*: The Internet
  - *Output to*: `data/raw`

##### Install Dependencies

In Python you can make a single-line comment using the "`#`" symbol. For instance, the package installation in the cell below is commented out (since we already installed these packages in the last step).

In [None]:
# %pip install --upgrade requests tqdm

##### Run Python script

In [None]:
%run scripts/1_download_data.py
print('Success!')

**This step takes around 10 minutes**: While the data is being downloaded, feel free to watch [this video](https://youtu.be/WXn4-FpVaOo?si=JtCNyh6Xf102zOHg) from the Neuromatch Academy, in which Nicholas Steinmetz briefly explains the dataset. If you have questions, feel free to discuss them in your group!

---

### Step 4: Process the Data

  - *Inputs from*: `data/raw/*`
  - *Outputs to*: `data/processed/*.nc`


##### Install Dependencies

In [None]:
%pip install --upgrade numpy pandas xarray netCDF4 pyarrow

##### Run Python script

In [None]:
%run scripts/2_convert_to_netcdf.py
print('Success!')
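
To confirm that a step actually produced its output files, you can list the target folder. The sketch below uses `pathlib` from the standard library; the helper name `list_outputs` is an assumption for this example, and the folder and pattern are the ones named in this step's *Outputs to* line.

```python
from pathlib import Path

def list_outputs(folder, pattern):
    """Return (file name, size in bytes) pairs for files in folder matching pattern."""
    matches = sorted(Path(folder).glob(pattern))
    return [(p.name, p.stat().st_size) for p in matches if p.is_file()]

# Prints nothing if the folder is missing or no file matches yet.
for name, size in list_outputs("data/processed", "*.nc"):
    print(f"{name}: {size / 1e6:.1f} MB")
```

The same helper can be pointed at `data/raw` or `vids` to check the earlier steps.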

---

### Step 5: Extract Tables for Today's Analysis

  - *Inputs from*: `data/processed/*.nc`
  - *Outputs to*: `data/final/*.csv`

##### Install Dependencies

In [None]:
# %pip install pandas xarray

##### Run Python script

In [None]:
%run scripts/3_extract_to_csv.py
print('Success!')
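
Once the CSV files exist, a quick way to peek at one without any extra packages is the standard library's `csv` module. This is a minimal sketch: the helper name `preview_csv` is an assumption, and because the exact file names in `data/final` are not listed here, the example simply picks the first CSV file it finds.

```python
import csv
from itertools import islice
from pathlib import Path

def preview_csv(path, n_rows=5):
    """Return the column names and the first n_rows rows of a CSV file."""
    with open(path, newline="") as f:
        reader = csv.DictReader(f)
        rows = list(islice(reader, n_rows))
        return reader.fieldnames, rows

# Look at the first CSV file in data/final, if there is one.
for csv_path in sorted(Path("data/final").glob("*.csv"))[:1]:
    columns, rows = preview_csv(csv_path)
    print(csv_path.name, columns)
    for row in rows:
        print(row)
```

Each row comes back as a dictionary that maps column names to values, and all values are read as text.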

---

### Step 6: Explore VS Code's CSV extensions

VS Code is a highly customizable Integrated Development Environment (IDE), and you can find many useful extensions in the Extensions Marketplace.

To explore the extensions, click the Extensions icon in the left sidebar (usually the 5th icon from the top) and type "csv" in the search box to see the CSV-related extensions that you can add to VS Code.

Feel free to explore some of these extensions (e.g. 🌈 Rainbow CSV) and see if they make it easier to view or interact with a CSV file.