# Exploratory Data Analysis: Part 1

Aim: demonstrate a **standard project setup** which makes analyses easy to **run**, **understand**, **reproduce** and **share**

## Agenda

🏔️ Project Layout

📘 Notebooks

📦 Packages

🌳 Project Environment 

📸 Version Control

Follow along here: https://github.com/brown-ccv/ccv-bootcamp-python-2023

# The core project layout (based on 'cookiecutter') is a reasonable starting point 
🏔️ **Project Layout**

```
├── notebooks          <- Interactive analysis notebooks (we are here!)
├── data
│   ├── raw            <- Original, immutable data dump
│   ├── interim        <- Transformed data
│   └── processed      <- Final data sets for modeling
└── src                <- Scripts to...
    ├── data           <- Download or generate data  
    ├── features       <- Turn data into features for modeling
    ├── models         <- Train models and make predictions
    └── visualization  <- Visualize data & results
```

You can set up a project like this with the instructions here: [cookiecutter data science](https://github.com/drivendata/cookiecutter-data-science)
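If you only want the directory skeleton shown above, without the rest of the cookiecutter template, a minimal sketch using just the standard library could look like this. The function name `create_layout` and the project name `my-project` are placeholders for illustration, not part of cookiecutter.

```python
from pathlib import Path
import tempfile

# Sub-directories from the layout above, relative to the project root.
LAYOUT = [
    "notebooks",
    "data/raw",
    "data/interim",
    "data/processed",
    "src/data",
    "src/features",
    "src/models",
    "src/visualization",
]


def create_layout(root):
    """Create the project skeleton under `root` and return the created paths."""
    root = Path(root)
    created = []
    for relative in LAYOUT:
        path = root / relative
        path.mkdir(parents=True, exist_ok=True)  # safe to re-run
        created.append(path)
    return created


# Demonstrate in a throwaway directory so nothing real is touched.
with tempfile.TemporaryDirectory() as tmp:
    paths = create_layout(Path(tmp) / "my-project")  # placeholder project name
    layout_ok = all(path.is_dir() for path in paths)
    print(layout_ok)
```

For a real project you would call `create_layout(".")` from the project root instead of using a temporary directory.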

# Notebooks combine code, narrative text, visualizations and mathematics
📘 **Notebooks**

> [(Jupyter) Notebook](https://jupyter-notebook.readthedocs.io/en/stable/) is a web-based notebook environment for interactive computing.

Notebooks combine:

- code (with results including visualizations),
- formatted text including mathematics and links, and 
- interactive widgets.

Adapted from [the 2022 bootcamp session](https://github.com/compbiocore/ccv_bootcamp_python/blob/main/notebooks/Using_jupyter.ipynb) – thanks August and Ashok!

# Notebooks are great for presenting analyses but not for writing reusable code
📘 **Notebooks**: When to use them (and when not)

### ❤️ Good
Jupyter notebooks are good for presenting analyses (like a paper) or describing code ([Literate programming](https://www-cs-faculty.stanford.edu/~knuth/lp.html)).

### ⚡️ Challenging

[Extensions (nbdev)](https://nbdev.fast.ai) are required for writing packages with them 
(it's tricky to [import functions from other notebooks](https://ipynb.readthedocs.io/en/latest/)).

Larger analyses with several notebooks might [need scripting](https://nbconvert.readthedocs.io/en/latest/execute_api.html#executing-notebooks-from-the-command-line) to ensure consistency.
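As a rough sketch of such scripting, the snippet below builds the `jupyter nbconvert --execute` command for each notebook in a directory and, optionally, runs them in name order. It assumes `jupyter` and `nbconvert` are installed and on your `PATH`; the helper names `nbconvert_command` and `run_all` are made up for this example.

```python
import subprocess
from pathlib import Path


def nbconvert_command(notebook):
    """Build the command that executes one notebook from top to bottom."""
    return [
        "jupyter", "nbconvert",
        "--to", "notebook",
        "--execute",
        "--inplace",  # overwrite the notebook with its freshly executed version
        str(notebook),
    ]


def run_all(notebook_dir, dry_run=True):
    """Execute every notebook in `notebook_dir` in name order.

    With dry_run=True the commands are only printed, not run.
    """
    notebooks = sorted(Path(notebook_dir).glob("*.ipynb"))
    commands = [nbconvert_command(notebook) for notebook in notebooks]
    for command in commands:
        if dry_run:
            print(" ".join(command))
        else:
            subprocess.run(command, check=True)  # raises if a notebook fails
    return commands


# Only print the commands for notebooks in the current directory.
run_all(".")
```

Calling `run_all("notebooks", dry_run=False)` would actually execute the notebooks and stop at the first one that raises an error.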

# The menu bar shows all the main ways of interacting with the notebook
📘 **Notebooks**: Menu Bar

![](images/menu-bar.png)

# Code cells contain code to be executed
📘 **Notebooks**: Cells: Code

A code cell looks like this (and shows the final output immediately afterwards):

In [2]:
print("Hello, World!")

Hello, World!


# Markdown cells contain text, mathematics and images to be displayed
📘 **Notebooks**: Cells: Markdown
... whereas [markdown](https://daringfireball.net/projects/markdown/) cells look like this:

### Mathematical expression

Here's an inline expression $e^{i\pi}=-1$
and here's a displayed equation:
$$
\sum_{i = 0}^\infty \frac{1}{i!} = \frac{1}{0!} + \frac{1}{1!} + \frac{1}{2!} + \frac{1}{3!} + \frac{1}{4!} + \cdots = e
$$

### Table
| Syntax      | Description |
| ----------- | ----------- |
| Header      | Title       |
| Paragraph   | Text        |

### Image

![](images/brown.svg)

# View a variable's value by putting it at the end of the cell
📘 **Notebooks**: Variable Viewing

In [11]:
k = 0
for i in range(7):
    k += i
k

21

# ⚠️ Out-of-Order Notebooks are Bad. Notebooks should run without errors from top to bottom
📘 **Notebooks**: Cell Ordering Problems

These cells are out-of-order:

In [9]:
# Setup for the coming demonstration – delete the variable "a"
# – run this first, but only if you've already run the cell two below (which initializes a)!
del a

In [7]:
# Increment a by one – run this third!
a += 1

In [6]:
# Initialize a to zero – run this first (the first time around) or second (thereafter)!
a = 0

In [8]:
# Print the current value of a – run this last!
print(a)

1


... so I had to *document* how to run them.

# ✅ In-Order Notebooks are Good
📘 **Notebooks**: Cell Ordering Solutions

It's much easier, and more understandable, to have them correctly ordered. 

### Pro-Tip: restart and rerun often
Get into the habit of regularly
- restarting the IPython "kernel", then
- running all cells of your notebook, then
- checking for and fixing any errors.
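To see why this habit catches ordering problems, here is a small simulation of "restart and run all": each cell's source is executed in order in a fresh, empty namespace. This is only an illustration, not how Jupyter is implemented, and the helper `run_all_cells` is invented for the example.

```python
def run_all_cells(cells):
    """Run cell sources in order in a fresh namespace, like restart-and-run-all."""
    namespace = {}
    for source in cells:
        exec(source, namespace)
    return namespace


# The same two cells, once in a sensible order and once not.
in_order = ["a = 0", "a += 1"]
out_of_order = ["a += 1", "a = 0"]

print(run_all_cells(in_order)["a"])  # prints 1

try:
    run_all_cells(out_of_order)
except NameError as error:
    # The first cell uses `a` before any cell has defined it.
    print(f"Out-of-order notebook fails: {error}")
```

The out-of-order version only ever worked because of leftover state in the kernel, and a restart removes that state.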

# Notebooks can use IPython's "magic" functions ("magics")
📘 **Notebooks**: IPython magics

Jupyter has access to all the Magics from the IPython kernel, which can make life easier!

Some examples are 

- `%%writefile`: writes the contents of a cell to a file (for example a Python script).
- `%pycat`: shows you (in a popup) the syntax-highlighted contents of an external file.
- `%load`: replaces the contents of the cell with an external script, taken either from a file on your computer or from a URL.

We'll not go into them today. Check out the 
[IPython Magics Documentation](https://ipython.readthedocs.io/en/stable/interactive/magics.html) 
for more info.

# Python has thousands of freely available packages which add to its functionality
📦 **Packages**

Python packages provide functionality beyond the standard library, e.g.:
- pandas (for data analysis)
- scikit-learn (for machine learning)
- matplotlib (for plotting)
- numpy (for array programming)
- thousands more...

You need a **package manager** like `pip` or `conda` to resolve dependencies and find packages.

# We use imports to leverage python's extensive package ecosystem
📦 **Packages**: Imports

In [None]:
# Based on https://matplotlib.org/stable/gallery/lines_bars_and_markers/simple_plot.html
import numpy as np
import matplotlib.pyplot as plt

# Data for plotting
t = np.linspace(0.0, 2.0, 101)  # values [0.0, 0.01, 0.02, ..., 2.0]
s = 1 + np.sin(2 * np.pi * t)  

# Plot it
plt.plot(t, s)
plt.xlabel('time (s)')
plt.ylabel('voltage (mV)')
plt.title('About as simple as it gets, folks')
plt.grid()
plt.show()

# Find information on packages using `help(...)` and the web
For instance, you might want to find out more about `numpy`. 

In [None]:
import numpy

Once you've imported it, you can show its built-in documentation like this:

In [None]:
help(numpy)

You can do `help(...)` on functions too:

In [None]:
help(numpy.linspace)

... and many packages offer online documentation:
https://numpy.org/doc/stable/

# Complex package dependencies mean that package managers are vital
📦 **Packages**: Conflict Resolution ∴ Package Manager

Problem: Packages might depend on different versions of the same package, e.g.
- `pandas` needs `numpy` ≥ 1.21.6
- `scikit-learn` needs `numpy` ≥ 1.17.3

If you want both, you need `numpy` ≥ 1.21.6

For projects with many dependencies, manual resolution is very time consuming. 

A **package manager** resolves these conflicts recursively for you (or tells you if they aren't satisfiable).
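For the two-package example above, the "resolution" amounts to taking the highest of the minimum versions. The sketch below does just that, using the version numbers from the text. Real resolvers in `pip` and `conda` also handle upper bounds, pre-releases and transitive dependencies, so treat this as a toy model; the helper names are invented for illustration.

```python
def parse_version(text):
    """Turn '1.21.6' into (1, 21, 6) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


def combined_minimum(minimums):
    """Return the lowest version that satisfies every '>=' constraint."""
    return max(minimums, key=parse_version)


# Minimum numpy version required by each package (values from the text above).
numpy_minimums = {"pandas": "1.21.6", "scikit-learn": "1.17.3"}

print(combined_minimum(numpy_minimums.values()))  # prints 1.21.6
```

Versions are compared as tuples of integers because plain string comparison would wrongly rank `"1.9"` above `"1.21"`.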

# Complex package dependencies mean that package managers are vital
📦 **Packages**: Registries ∴ Package Manager

Python packages are published to **registries**, like [PyPI.org](https://pypi.org) and [Anaconda.org](https://anaconda.org).

A **package manager**: 
- searches a registry for package versions you need
- determines which are compatible with your system
- installs the right ones.

# Package managers include `pip` (a good default) and `conda` (great for science)
📦 **Packages**: Package managers

The main package managers for python are:

- `pip`:
  - bundled with python 
  - uses the [PyPI.org](https://pypi.org) registry by default
- `conda`:
  - installs packages beyond python, such as binaries like [`graphviz`](https://graphviz.org)
  - part of the [Anaconda distribution](https://anaconda.org) for scientific computing
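Whichever package manager you use, you can check from Python which versions ended up installed in the current environment. The sketch below uses the standard library's `importlib.metadata` (Python 3.8+); the helper `installed_version` and the list of package names are just examples.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version of `package`, or None if it is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Report on a few of the packages mentioned earlier.
for name in ["numpy", "pandas", "matplotlib", "scikit-learn"]:
    print(name, installed_version(name) or "not installed")
```

The output depends entirely on your environment, which is exactly why the next sections introduce project-specific environments.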

# Each project has a "virtual environment" and a package manager
🌳 **Project Environment**

Every project has its own dependencies ∴ each project needs its own "virtual environment"  

You need an **environment manager** like `venv` (a fine default), `conda` (the standard for science), or `poetry`/`hatch`/`pdm` (for package development).

# Different projects may have different dependencies, and project dependencies can't be allowed to conflict
🌳 **Project Environment**: Problems

Different projects may require different "environments", with different packages and/or package versions.

Coders often work...:

- on more than one project over time ∴ environments must be precisely reproducible
- on more than one project at once ∴ multiple environments mustn't conflict
- on more than one computer ∴ environments must be transferable
- in groups ∴ environments must be shareable

# You need virtual environments to isolate the dependencies of different projects
🌳 **Project Environment**: Virtual environments

The solution is separate **virtual environments**, which: [[ref]](https://docs.python.org/3/library/venv.html)

- have their own Python binary (& standard library)
- can have their own independent set of installed packages
- come with a way to "freeze", share and reproduce a set of packages precisely, elsewhere and/or later
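Environments are usually created from the command line, but the standard library's `venv` module can also be driven from Python, which makes for a compact demonstration. The sketch below creates an environment in a temporary directory (without `pip`, to keep it quick) and checks whether the current interpreter is itself running inside a virtual environment. The directory name `.venv` and the helper `in_virtual_environment` are choices made for this example.

```python
import sys
import tempfile
import venv
from pathlib import Path


def in_virtual_environment():
    """True if this interpreter runs inside a venv-style virtual environment."""
    return sys.prefix != sys.base_prefix


with tempfile.TemporaryDirectory() as tmp:
    env_dir = Path(tmp) / ".venv"
    # with_pip=False skips installing pip into the new environment.
    venv.create(env_dir, with_pip=False)
    # Every venv has a pyvenv.cfg file pointing back at the base Python.
    config_exists = (env_dir / "pyvenv.cfg").is_file()
    print("created:", config_exists)

print("inside a virtual environment:", in_virtual_environment())
```

In a real project you would create the environment once in the project directory and then activate it in your shell before installing packages.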

# You need virtual environments to isolate the dependencies of different projects
🌳 **Project Environment**: Environment managers

The default environment manager for python is 
[`venv`](https://docs.python.org/3/library/venv.html) (or its extensions
[`virtualenv`](https://virtualenv.pypa.io/en/latest/) & 
[`virtualenvwrapper`](https://virtualenvwrapper.readthedocs.io/en/latest/)). It works with `pip`.

`conda` is the best choice for science. 
It works with python *plus* R, julia, system binaries...

`poetry`, `hatch` and `pdm` provide special utilities for package development.

# Version control allows you to reproduce *every* analysis (if commits are small)
📸 **Version control**
Both solo and collaborative coding are made easier with a version control system.
In short, version control:

- Is a time machine allowing you to revisit every analysis as it was
- Makes collaboration easier – simplifying sharing code and merging changes

This project uses **git**.

For more info, see the [workshop session on **git**](https://brownccv.notion.site/Coding-Collaboration-using-Git-GitHub-and-Pull-Requests-afdc0e8c48a449f2864f0e3e8b5b4a59?pvs=4).
