# Data analysis and visualization in Jupyter Notebooks


## Use cases
- Experimenting with new ideas, testing new libraries/databases 
- Interactive code, data analysis and visualization development
- Interactive work on HPC clusters
- Sharing and explaining code to colleagues
- Learning from other notebooks
- Keeping track of interactive sessions, like a digital lab notebook
- Supplementary information with published articles
- Teaching (programming, experimental/theoretical science)
- Presentations with slides using [RISE](https://github.com/damianavila/RISE), which is built on Reveal.js

## Exploring a library

- Tab completion and question marks can be used to learn about a new library

In [None]:
import numpy as np
import matplotlib.pyplot as plt

In [None]:
# numpy.

In [None]:
# numpy.sum?
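
Outside a notebook, roughly the same information is available through plain Python introspection. The following is a minimal sketch using only the standard library. It inspects the `math` module instead of `numpy` purely to stay dependency-free, and it only approximates what tab completion and `?` display.

```python
import inspect
import math

# Roughly what tab completion after "math." offers: the public attribute names.
public_names = [name for name in dir(math) if not name.startswith("_")]

# Roughly what "math.sqrt?" displays: the docstring of the object.
sqrt_doc = inspect.getdoc(math.sqrt)

print(public_names[:5])
print(sqrt_doc)
```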

## Interactive plotting

Jupyter supports interactive plotting with matplotlib and other visualization libraries (including for other languages). 

In [None]:
%matplotlib inline

x = np.linspace(0,2*np.pi,100)
y = np.sin(x)
plt.plot(x,y, 'r-')
plt.show()

## Widgets

Widgets add more interactivity to Notebooks, allowing one to visualize and control changes in data, parameters etc.

In [None]:
from ipywidgets import interact

#### Use `interact` as a function

In [None]:
def f(x, y, s):
    return (x, y, s)

interact(f, x=True, y=1.0, s="Hello");

#### Use `interact` as a decorator

In [None]:
@interact(x=True, y=1.0, s="Hello")
def g(x, y, s):
    return (x, y, s)
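
In both forms, `interact` chooses a widget from the type of each default value. The sketch below summarizes the commonly documented choices in plain Python. The helper `widget_for` is hypothetical and is not part of `ipywidgets`; it only returns widget names as strings to illustrate the mapping.

```python
def widget_for(default):
    """Return the name of the widget `interact` typically builds for a default value."""
    # bool must be tested before int, because bool is a subclass of int in Python.
    if isinstance(default, bool):
        return "Checkbox"
    if isinstance(default, str):
        return "Text"
    if isinstance(default, int):
        return "IntSlider"
    if isinstance(default, float):
        return "FloatSlider"
    if isinstance(default, tuple):
        # (min, max) or (min, max, step): a float anywhere gives a float slider.
        if any(isinstance(v, float) for v in default):
            return "FloatSlider"
        return "IntSlider"
    if isinstance(default, list):
        return "Dropdown"
    raise TypeError(f"no widget abbreviation for {default!r}")

# The defaults used in the cells above, plus a (min, max) tuple.
for value in (True, 1.0, "Hello", (1, 6)):
    print(repr(value), "->", widget_for(value))
```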

## More interactive plotting using widgets

In [None]:
from ipywidgets import interact 
@interact
def plot(n=(1,6)):
    x = np.linspace(0,2*np.pi,100)
    y = np.sin(n*x)
    plt.plot(x,y, 'r-')
    plt.show()

## Magics

Magics are a simple command language that significantly extends the power of Jupyter.

Two kinds of magics:

  - **Line magics**: commands prefixed with a single `%` character, whose arguments extend only to the end of the current line.
  - **Cell magics**: commands prefixed with two percent characters (`%%`), which receive the whole cell as their argument and must be on the first line of the cell.

Other features:
  - Use `%lsmagic` magic to list all available line and cell magics
  - Question mark shows help: `%cd?`
  - Additional magics can be created, see below for example

In [None]:
%lsmagic

You can capture the output of line magics (and shell commands) in a Python variable.

In [None]:
ls_out = %ls
ls_out

In [None]:
%sx?

In [None]:
ls_out = %sx ls
ls_out
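
In a plain Python script, where magics are not available, a comparable result can be obtained with the standard `subprocess` module. This is a sketch under that assumption, not a description of how `%sx` is implemented. The child process is the Python interpreter itself so that the example runs on any platform.

```python
import subprocess
import sys

# Run a child process and capture its standard output as text.
# The child here is the Python interpreter itself, so the example is portable;
# in practice it would be a shell command such as ["ls"].
completed = subprocess.run(
    [sys.executable, "-c", "print('first'); print('second')"],
    capture_output=True,
    text=True,
    check=True,
)

# Split into a list of lines, similar in spirit to what `%sx` returns.
lines = completed.stdout.splitlines()
print(lines)
```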

### %timeit
- Timing execution
- Both Line and Cell level

In [None]:
%timeit import time ; time.sleep(1)

In [None]:
%%timeit 
a = np.random.rand(100, 100)
np.linalg.eigvals(a)
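
Comparable measurements can be made outside a notebook with the standard `timeit` module. The sketch below is an assumption about how one might do this in a script. The workload `build_squares` is a made-up example, and the repeat and loop counts are fixed by hand here rather than chosen automatically.

```python
import timeit

def build_squares(n=1000):
    """Small workload to time: build a list of squares."""
    return [i * i for i in range(n)]

# Three repetitions of 100 calls each; each entry is the total time for 100 calls.
totals = timeit.repeat(build_squares, repeat=3, number=100)

# The minimum is usually the least noisy estimate of the per-call cost.
best_per_call = min(totals) / 100
print(f"best of 3: {best_per_call * 1e6:.1f} µs per call")
```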

### %%writefile
Writes the cell contents as a named file

In [None]:
%%writefile foo.py
print('Hello world')

### %run 
 - Executes Python code from .py files 
 - Can also execute other Jupyter notebooks

In [None]:
%run foo
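
For comparison, writing a file and running it as a script can also be done in plain Python. The sketch below uses the standard `pathlib` and `runpy` modules and a temporary directory (an assumption made here to avoid leaving files behind). It approximates the `%%writefile` and `%run` pair and is not how those magics are implemented.

```python
import contextlib
import io
import pathlib
import runpy
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    # Equivalent in spirit to "%%writefile foo.py".
    script = pathlib.Path(tmp) / "foo.py"
    script.write_text("print('Hello world')\n")

    # Equivalent in spirit to "%run foo": execute the file as a script
    # and collect whatever it prints.
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        runpy.run_path(str(script), run_name="__main__")
    captured = buffer.getvalue()

print(captured, end="")
```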

### %load
 - Loads code directly into a cell, either from a file on local disk or from a URL
 - After uncommenting the code below and executing it, the content of the cell is replaced with the contents of the file.

In [None]:
# %load https://matplotlib.org/_downloads/annotate_transform.py

### %debug
Activate interactive debugger

Let's try using `%debug` to hunt down a bug. We first execute the cell, and then run the `%debug` magic.

In [None]:
def calc_reciprocal(x):
    inv_x = []
    for i in x:
        inv_x.append(1.0 / i)
    return inv_x

x = [1,5,2,0,5]
y = calc_reciprocal(x)

Run the debugger post-mortem. If an exception has just occurred, the `%debug` magic lets you inspect its stack frames interactively.

In [None]:
%debug

**Don't forget to exit the debugger by typing `q` and `Enter`!**  
If you don't, the background process will not be ready for your next command.
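
The kind of information the debugger exposes can also be reached programmatically. The following sketch reuses `calc_reciprocal` from above and reads the local variables of the frame in which the exception was raised. It illustrates what a post-mortem inspection looks at and is not a replacement for the interactive debugger.

```python
import sys

def calc_reciprocal(x):
    inv_x = []
    for i in x:
        inv_x.append(1.0 / i)
    return inv_x

try:
    calc_reciprocal([1, 5, 2, 0, 5])
except ZeroDivisionError:
    # Walk to the innermost traceback entry: the frame where the error was raised.
    tb = sys.exc_info()[2]
    while tb.tb_next is not None:
        tb = tb.tb_next
    # Copy the local variables of that frame.
    failing_locals = dict(tb.tb_frame.f_locals)

# The loop variable and the partial result at the moment of failure.
print("i =", failing_locals["i"])
print("inv_x =", failing_locals["inv_x"])
```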

### %prun
 - Python code profiler
 - Cell and Line magic

In [None]:
%%prun 
a = np.random.rand(1000, 1000)
np.linalg.eigvals(a)
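
A similar profile can be collected in a script with the standard `cProfile` and `pstats` modules. This is a minimal sketch. The workload `slow_sum` is a made-up example, and the report is written to a string so that it can be inspected or saved.

```python
import cProfile
import io
import pstats

def slow_sum(n):
    """Deliberately naive workload: sum of squares in a Python loop."""
    total = 0
    for i in range(n):
        total += i * i
    return total

# Run the function under the profiler and keep its return value.
profiler = cProfile.Profile()
result = profiler.runcall(slow_sum, 100_000)

# Format the statistics into a string, sorted by cumulative time.
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(5)
report = stream.getvalue()
print(report)
```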

## Exercises

> Possible solutions can be found in the [solutions.ipynb](solutions.ipynb) notebook

### <font color="red"> *Exercise 1: Widgets for interactive data fitting* </font>

1. Execute the cell below. It fits a 5th-order polynomial to a Gaussian function with some random noise.
2. Use the `@interact` decorator together with the function `fit`, such that you can visualize fits with polynomial orders `n` ranging from, say, 3 to 30


In [None]:
# gaussian function
def gauss(x,param):
    [a,b,c] = param
    return a*np.exp(-b*(x-c)**2)

# gaussian array y in interval -5 < x < 5
nx = 100
x = np.linspace(-5.,5.,nx)
p = [2.0,0.5,1.5] # some parameters
y = gauss(x,p)

# add some noise
noise = np.random.normal(0,0.2,nx)
y += noise

# we fit a 5th order polynomial to it

def fit(n):
    pfit = np.polyfit(x,y,n)
    yfit = np.polyval(pfit,x)
    plt.plot(x,y,"r",label="Data")
    plt.plot(x,yfit,"b",label="Fit")
    plt.legend()
    plt.ylim(-0.5,2.5)
    plt.show()
    
# call function fit
# these lines are unnecessary when you use the interact widget
n=5
fit(n)

### <font color="red"> *Exercise 2a: Cell profiling* </font>

1. Load the random_walk.py code (in the current directory) into a cell below with the appropriate magic command 
    - note that you have to rerun the cell after the content is loaded
2. Split up the functions over cells (either via Edit menu or keyboard shortcut `Ctrl-Shift-minus`). 
3. Initializing `n` and calling `walk()` doesn't need to be in a main function, and you can remove the `__name__` stuff.
4. Plot the random walk trajectory.
5. Time the execution of `walk()` with a line magic.
6. Run the prun cell profiler.
7. Can you spot a little mistake which is slowing down the code?
8. In the next exercise you will install a line profiler which will more easily expose the performance mistake.

### <font color="red"> *Exercise 2b: Installing a magic command for line profiling* </font>



Magics can be installed using `pip` and loaded like plugins using the `%load_ext` magic. You will now install a line profiler to get a more detailed profile and, hopefully, insight into how to speed up the code from the previous exercise.

1. First install the line profiler using `!pip install line_profiler`.
2. Next load it using `%load_ext line_profiler`.
3. Have a look at the new magic command that has been enabled with `%lprun?`
4. Load `random_walk.py` into a new cell, and execute it.
5. In a new cell, run the line profiler on each function of the example code using something like:   
`%lprun -f <func1> -f <func2> -f <func3> main()`
6. Inspect the output. Can you more easily see the mistake now?

### <font color="red"> *Exercise 3: Data analysis with `pandas`* </font>

Data science and data analysis are key use cases of Jupyter. In this exercise we will familiarize ourselves with dataframes and various inbuilt analysis methods in the high-level `pandas` data exploration library. We will use a dataset containing information on Nobel prizes.

1. Start by navigating in the File Browser to the `data/` subfolder, and double-click on the `nobels.csv` dataset. This will open JupyterLab's inbuilt data browser.
2. Have a look at the data, column names, etc.
3. In your own notebook, import the `pandas` module and load the dataset into a *dataframe*:  

```python
import pandas as pd
nobel = pd.read_csv("data/nobels.csv")
```

4. The "share" column of the dataframe contains the number of Nobel recipients that shared the prize. Have a look at the statistics of this column using  

```python
nobel["share"].describe()
```

5. The `describe()` method is smart about data types. Try this:  
```python
nobel["bornCountryCode"].describe()
```

    - What country has received the largest number of Nobel prizes, and how many?
    - How many countries are represented in the dataset?
6. Now analyze the age of prize recipients. You first need to convert the "born" column to datetime format: 

```python
nobel["born"] = pd.to_datetime(nobel["born"], 
                               errors ='coerce')
```

7. Next subtract the birth year from the year in which the prize was received, insert the result into a new column "age", and then print the first 10 entries using `head()`:

```python
nobel["age"] = nobel["year"] - nobel["born"].dt.year
nobel[["surname","age"]].head(10)
```

8. Now plot results in two different ways:

```python
nobel["age"].plot.hist(bins=[20,30,40,50,60,70,80,
                             90,100],alpha=0.6);
nobel.boxplot(column="age", by="category")
```

9. Which Nobel laureates have been Swedish? See if you can use the `nobel.loc[CONDITION]` statement to extract the relevant rows from the `nobel` dataframe using the appropriate condition.

10. Finally, try the powerful `groupby()` method to analyze the number of Nobel prizes per country, and visualize it with the high-level `seaborn` plotting library. 
 - First add a column "number" to the `nobel` dataframe containing ones (to enable the counting below).
 - Then extract any 4 countries (replace below) and create a subset of the dataframe:
```python
countries = np.array([COUNTRY1, COUNTRY2, COUNTRY3, COUNTRY4])
nobel2 = nobel.loc[nobel['bornCountry'].isin(countries)]
```
 - Next use `groupby()` and `sum()`, and inspect the resulting dataframe:
```python
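# numeric_only=True skips non-numeric columns; in recent pandas versions
# a datetime column such as "born" cannot be summed
nobels_by_country = nobel2.groupby(['bornCountry',"category"], 
                                   sort=True).sum(numeric_only=True)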
```
 - Next use the `pivot_table` method to reshape the dataframe to a spreadsheet-like structure, and display the result:
```python
table = nobel2.pivot_table(values="number", index="bornCountry", 
                           columns="category", aggfunc="sum")
```
 - Finally visualize using a heatmap:
 ```python
import seaborn as sns
sns.heatmap(table,linewidths=.5, annot=True);
```
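
The date conversion and the pivot step can be tried in isolation on a tiny table before applying them to the real dataset. The rows below are made up for illustration and do not come from `nobels.csv`; only the column names follow the exercise.

```python
import pandas as pd

# Made-up rows; one birth date is deliberately invalid.
toy = pd.DataFrame({
    "surname": ["A", "B", "C", "D"],
    "born": ["1900-01-15", "not a date", "1950-06-30", "1920-03-01"],
    "year": [1950, 1960, 2000, 1980],
    "bornCountry": ["X", "X", "Y", "Y"],
    "category": ["physics", "physics", "chemistry", "physics"],
})

# errors="coerce" turns unparseable dates into NaT instead of raising.
toy["born"] = pd.to_datetime(toy["born"], errors="coerce")
toy["age"] = toy["year"] - toy["born"].dt.year

# A column of ones makes counting a matter of summing.
toy["number"] = 1
table = toy.pivot_table(values="number", index="bornCountry",
                        columns="category", aggfunc="sum")
print(toy[["surname", "age"]])
print(table)
```

Combinations that do not occur in the data, such as country "X" with category "chemistry" here, appear as missing values in the pivoted table.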


### <font color="red"> *Exercise 4: Defining your own custom magic command* </font>


It is possible to create new magic commands using the `@register_cell_magic` decorator from the `IPython.core.magic` module. Here you will create a cell magic command that compiles C++ code and executes it.


> This example has been borrowed from the [IPython Minibook](http://ipython-books.github.io/), by Cyrille Rossant, Packt Publishing, 2015.


1. First import `register_cell_magic`

```python
from IPython.core.magic import register_cell_magic
```

2. Next execute the cell below here to register the new cell magic command. You can now start using the magic using `%%cpp`.

3. Write some C++ code into a cell and try executing it.

4. To be able to use the magic in another notebook, you need to add the following function at the end and then write the cell to a file in your PYTHONPATH. If the file is called `cpp_ext.py`, you can then load it by `%load_ext cpp_ext`.

```python
def load_ipython_extension(ipython):
    ipython.register_magic_function(cpp,'cell')
```


In [3]:
@register_cell_magic
def cpp(line, cell):
    """Compile, execute C++ code, and return the standard output."""

    # We first retrieve the current IPython interpreter instance.
    ip = get_ipython()
    # We define the source and executable filenames.
    source_filename = '_temp.cpp'
    program_filename = '_temp'
    # We write the code to the C++ file.
    with open(source_filename, 'w') as f:
        f.write(cell)
    # We compile the C++ code into an executable
    # (the compiler output is captured but not shown).
    compile_output = ip.getoutput("g++ {0:s} -o {1:s}".format(
        source_filename, program_filename))
    # We execute the executable and return the output.
    output = ip.getoutput('./{0:s}'.format(program_filename))
    print('\n'.join(output))




## Mixing in other languages (assuming that they're installed)

The `%%script` magic is like the `#!` (shebang) line of a Unix script,
specifying a program (bash, perl, ruby, etc.) with which to run the cell.  
One can also use these cell magics directly:
- %%ruby
- %%perl
- %%bash
- %%html
- %%latex
- %%R
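
The idea behind `%%script` can be sketched in plain Python: the text of the cell is handed to an external program on its standard input, and the program's output is collected. The helper `run_cell_with` is hypothetical and only illustrates the principle. The Python interpreter stands in for bash, perl or ruby so that the example does not depend on what is installed.

```python
import subprocess
import sys

def run_cell_with(program, cell_text):
    """Feed `cell_text` to `program` on standard input and return its standard output."""
    completed = subprocess.run(
        program, input=cell_text, capture_output=True, text=True, check=True
    )
    return completed.stdout

# The Python interpreter stands in for bash, perl or ruby so that the
# example runs anywhere; "-" tells it to read the program from stdin.
output = run_cell_with([sys.executable, "-"], "print('Hi, this is a script cell.')\n")
print(output, end="")
```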

Why would you want to mix programming languages in the same notebook?
 - leverage strengths from different languages
 - using code from colleagues
 - a fantastic library exists in another language than your favorite one

In [None]:
%%ruby
puts 'Hi, this is ruby.'

In [None]:
%%script ruby
puts 'Hi, this is also ruby.'

In [None]:
%%perl
print "Hello, this is perl\n";

In [None]:
%%bash
echo "Hullo, I'm bash"

In [None]:
%%html
<table>
<tr>
<th>Header 1</th>
<th>Header 2</th>
</tr>
<tr>
<td>row 1, cell 1</td>
<td>row 1, cell 2</td>
</tr>
<tr>
<td>row 2, cell 1</td>
<td>row 2, cell 2</td>
</tr>
</table>

In [None]:
%%latex
\begin{align}
\nabla \times \vec{\mathbf{B}} -\, \frac1c\, \frac{\partial\vec{\mathbf{E}}}{\partial t} & = \frac{4\pi}{c}\vec{\mathbf{j}} \\
\nabla \cdot \vec{\mathbf{E}} & = 4 \pi \rho \\
\nabla \times \vec{\mathbf{E}}\, +\, \frac1c\, \frac{\partial\vec{\mathbf{B}}}{\partial t} & = \vec{\mathbf{0}} \\
\nabla \cdot \vec{\mathbf{B}} & = 0
\end{align}

### R

The R world already has a powerful IDE, RStudio, where one can annotate code using Markdown and export to HTML.  
A key difference between RStudio and Jupyter is that in Jupyter one can modify and rerun individual cells, without having to rerun everything.

In [None]:
# first we need to install the necessary packages
#!conda install -c r r-essentials 
#!conda install -y rpy2

To run R from the Python kernel we need to load the rpy2 IPython extension

In [None]:
%load_ext rpy2.ipython

In [None]:
%%R
myString <- "Hello, this is R"
print ( myString)

Inline plotting in R is straightforward 

In [None]:
%%R 
# Define the cars vector with 5 values
cars <- c(1, 3, 6, 4, 9)

# Graph cars using blue points overlaid by a line 
plot(cars, type="o", col="blue")

# Create a title with a red, bold/italic font
title(main="Autos", col.main="red", font.main=4)

Data in R cells is of course persistent

In [None]:
%%R 
barplot(cars)

We can plot a Python pandas dataframe with R code

In [None]:
import pandas as pd
df = pd.DataFrame({
    'cups_of_coffee': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
    'productivity': [2, 5, 6, 8, 9, 8, 0, 1, 0, -1]
})

In [None]:
%%R -i df -w 6 -h 4 --units cm -r 200
# the first line says: import df and make the default figure size 6 by 4 cm
# with resolution 200. You can change the units to px, in, etc. as you wish.
library(ggplot2)
ggplot(df, aes(x=cups_of_coffee, y=productivity)) + geom_line()

## Sharing notebooks

- You can enter a URL, GitHub repo or username, or Gist ID in [`nbviewer`](https://nbviewer.jupyter.org/) and view a rendered Jupyter notebook
    - try entering just "coderefinery" and see if you can find this current notebook
- Read the Docs can render Jupyter Notebooks via the [nbsphinx package](https://nbsphinx.readthedocs.io/)
- [Binder](https://mybinder.org/) creates live notebooks based on a GitHub repository
- [CoCalc](https://cocalc.com/) (formerly SageMathCloud) allows collaborative editing of notebooks in the cloud 
- Google's [Colaboratory](https://colab.research.google.com/) lets you work on notebooks in the cloud, and you can [read and write to notebook files on Drive](https://colab.research.google.com/notebooks/io.ipynb)
- [Microsoft Azure Notebooks](https://notebooks.azure.com/) also offers free notebooks in the cloud
- [JupyterLab](https://github.com/jupyterlab/jupyterlab) supports sharing and collaborative editing of notebooks via Google Drive 
- [Notedown](https://github.com/aaren/notedown), [Jupinx](https://github.com/QuantEcon/sphinxcontrib-jupyter) and [DocOnce](https://github.com/hplgit/doconce) can take Markdown or Sphinx files and generate Jupyter Notebooks
- The `jupyter nbconvert` tool can convert a (`.ipynb`) notebook file to:
    - python code (`.py` file) 
    - an HTML file
    - a LaTeX file
    - a PDF file
    - a slide-show in the browser

Note: the Google, Microsoft and CoCalc platforms are free but have paid subscriptions for faster access to cloud resources

## Lesson key points

- Keyboard shortcuts simplify using Jupyter
- Magics allow you to
 - access the filesystem
 - time, debug and profile your code
 - run shell commands in underlying system
- You can also create your own magics
- You can add inline plots, and widgets provide more interactivity
- The JSON format of Jupyter Notebooks is not optimal for version control with Git, but the nbdime tool helps
- Jupyter can run many kernels, among them Python, Octave, Julia and R (assuming they are installed on the host running Jupyter)
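
To see why the notebook file format causes noise under version control, consider the sketch below. It builds two simplified one-cell notebooks as dictionaries (real notebook files carry more metadata) that contain the same code but were executed at different points in a session. The helpers `make_notebook` and `strip_outputs` are hypothetical. This is not how nbdime works; it only illustrates where the spurious differences come from.

```python
import copy
import json

def make_notebook(execution_count, output_text):
    """Build a simplified one-cell notebook as a plain dictionary."""
    return {
        "cells": [{
            "cell_type": "code",
            "execution_count": execution_count,
            "metadata": {},
            "outputs": [{"name": "stdout", "output_type": "stream",
                         "text": [output_text]}],
            "source": ["print('Hello world')"],
        }],
        "metadata": {},
        "nbformat": 4,
        "nbformat_minor": 4,
    }

def strip_outputs(notebook):
    """Return a copy with outputs and execution counts cleared from code cells."""
    cleaned = copy.deepcopy(notebook)
    for cell in cleaned["cells"]:
        if cell["cell_type"] == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return cleaned

first_run = make_notebook(1, "Hello world\n")
second_run = make_notebook(7, "Hello world\n")

# Same code, yet the serialized files differ because of the execution count.
same_raw = json.dumps(first_run, sort_keys=True) == json.dumps(second_run, sort_keys=True)
same_clean = strip_outputs(first_run) == strip_outputs(second_run)
print(same_raw, same_clean)
```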