<div class="licence">
<span>Licence CC BY-NC-ND</span>
<span>Valérie Roy</span>
<span><img src="media/ensmp-25-alpha.png" /></span>
</div>

# data visualization in python

   - https://github.com/rougier/matplotlib-tutorial#introduction
   - https://www.labri.fr/perso/nrougier/python-opengl/

# matplotlib

## 1) introduction

   - the project started in $\approx$ **2003**
   - it is inspired by **MATLAB**
   - it was the first **Python data visualization** library
   - and it is **for the time being** the **most popular** library
   - there is an **active developer community**
   - its **license** is **based** on the **Python Software Foundation** (PSF) **license**
   - https://matplotlib.org/

   - it is a **2D plotting** library
   - it can be used with the **Jupyter notebook**
   
   
   - **3D plotting** is provided by an **mpl toolkit** (*from mpl_toolkits.mplot3d import Axes3D*)

   - it has a **concise syntax**
   - it is rather **simple** and **powerful**
   - it makes **heavy use** of **numpy** to have **good performance** for **large arrays**
   - some other libraries are **built** on top of **matplotlib** (e.g. **seaborn**)
   - **pandas** has **wrappers** over **matplotlib**

   - it offers the **classic** functionalities: **plots**, **histograms**, **bar charts**, **scatterplots**, ...
   - https://matplotlib.org/gallery/index.html
   
   
   - that you can **customize** with **texts**, **grids**, **labels**, **legends**, ...   
   - **parameters** control **colors**, **line styles**, **font properties**, **axes properties**, ...
   
   
   - https://matplotlib.org/api/pyplot_summary.html

   - **pyplot** is the interface to the matplotlib plotting library

In [None]:
#import matplotlib as mpl

In [None]:
#mpl.__version__

## 2) **plots**

#### a) a simple plot

   - we **import** the useful **libraries**
   - **plots** are done by the *matplotlib.pyplot* **functions**
   - **by convention** *matplotlib.pyplot* is **named** *plt*

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
import pandas as pd
import numpy as np

   - we **create** an **array** *x* with **values** linearly spaced between $0$ and $2\pi$
   - we **get** a *numpy.ndarray* 

In [None]:
x = np.linspace(0, 2*np.pi, 50)

   - we **create** an **array** *y* by **computing** the **sine** of the values of *x*
   - we **create** an **array** *z* by **computing** the **cosine** of the values of *x*
   - we **get** two *numpy.ndarray* 

In [None]:
y = np.sin(x)
z = np.cos(x)

   - *pyplot.plot(x, y)* **plots** **y** versus **x** as a line whose **width**, **color**, etc. can be set
   - *pyplot.scatter(x, y)* draws a **scatter plot** of **y** versus **x** whose **marker** **shape**, **size** and **color** can vary

In [None]:
plt.plot(x, y)
plt.scatter(x, z)

   - we have **plotted** with the **default** settings

#### b) improving the plot

with **parameters** and **methods**, we can **add** to the **drawing**:
   - a **title**
   - **labels** on the **axes**
   - **labels** for the **plots** (shown in the **legend**), ...
   - each with its own **fontsize**

In [None]:
x = np.linspace(0, 2*np.pi, 50)
y = np.sin(x)

plt.title(r'trigonometric functions of angles between 0 and $2 \pi$', fontsize=20)

plt.xlabel('x coordinate', fontsize=18) # label of the x axis
plt.ylabel('y coordinate') # label of the y axis

plt.plot(x, y, label='sine')
plt.scatter(x, z, label='cosine')

plt.legend(fontsize=12) # make the legend appear

plt.show() # not mandatory in jupyter notebooks !

   - we can **vary** marker, color, size, linewidth

In [None]:
x = np.arange(1, 10)
y = np.power(x, 2)

plt.plot(x, y, color='orange', linestyle='--', linewidth=3)

In [None]:
plt.plot(x, y, 'g--', linewidth=4)    # green, dashed line
plt.plot(x, y, 'rs', markersize=15)   # red, square marker
plt.plot(x, y, 'y^', markersize=6)    # yellow, triangle marker

   - **varying** **colors** and **sizes** depending on the values of **c** and **s**

*plt.scatter* 
   - you can pass to the parameter **c** a **sequence** of **numbers** to be **mapped** to **colors**
   - you can pass to the parameter **s** a **sequence** of **numbers** giving the marker **sizes** (areas, in points squared)
   

the **c** values go through a **colormap** (the list is in the documentation at https://matplotlib.org/users/colormaps.html)

In [None]:
x = np.arange(10)
y = x + 10 * np.random.randn(10)

z = np.random.randint(100, 10000, 10) # random values for colors
v = np.random.randint(100, 5000, 10)  # random values for size 

plt.scatter(x, y, marker='o', c=z, s=v, cmap='Blues')
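To see how the numbers passed to **c** end up as colors, a colorbar can be attached to the scatter plot. This is only a minimal sketch: the seeded generator and the names `rng`, `colors`, `sizes`, `sc` and `cbar` are illustrative choices, not part of the example above.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)          # seeded so that the figure is reproducible

x = np.arange(10)
y = x + 10 * rng.standard_normal(10)
colors = rng.integers(100, 10000, 10)   # numbers mapped to colors through the colormap
sizes = rng.integers(100, 5000, 10)     # marker areas, in points squared

sc = plt.scatter(x, y, c=colors, s=sizes, cmap='Blues', alpha=0.7)
cbar = plt.colorbar(sc)                 # shows which color corresponds to which value of c
cbar.set_label('value passed to c')
```

The colorbar is built from the object returned by *plt.scatter*, which keeps both the values of **c** and the colormap.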

####  we can set the **limits** and the **ticks** of the **axes** 

In [None]:
x = np.linspace(0, 2*np.pi, 50)
y = np.sin(x)
plt.plot(x, y)

   - setting the **limits** of the **abscissa**
   - here from $-2\pi$ to $2\pi$ 

In [None]:
plt.xlim(-2*np.pi, 2*np.pi)
#plt.plot(x, y)

   - setting the **ticks** on the **abscissa** (here $10$ evenly spaced ticks)

In [None]:
plt.xticks(np.linspace(-2*np.pi, 2*np.pi, 10, endpoint=True));

   - we can set **tick labels**

In [None]:
plt.xticks([-2*np.pi, -np.pi, 0, np.pi/2, np.pi, 2*np.pi],
           [r'$-2\pi$', r'$-\pi$', '0', r'$\pi/2$', r'$\pi$', r'$2\pi$']);

   - we can do the same for the **ordinate**

   - setting the **limits** of the **ordinate**
   - here from $-2$ to $2$

In [None]:
plt.ylim(-2.0, 2.0)

   - setting the **ticks** on the **ordinate** (here $15$ evenly spaced ticks)

In [None]:
plt.yticks(np.linspace(-2, 2, 15, endpoint=True));

   - the **whole** figure

In [None]:
x = np.linspace(-np.pi, np.pi, 50)
y = np.sin(x)

plt.xlim(-4, 4) 
plt.xticks(np.linspace(-4, 4, 10))

plt.ylim(-1, 1)
plt.yticks(np.linspace(-2, 2, 10))

plt.plot(x, y);

#### c) writing **text**

In [None]:
plt.text(0.5, 0.5, 'I wrote here !', fontsize=20, bbox=dict(facecolor='red', alpha=0.5));

  - with *plt.annotate* **text** can be used to **annotate** some **feature** of the **plot**
  - you give **two points**:
     - the **location** being **annotated** (parameter *xy*)
     - the **location** of the **text** (parameter *xytext*)
     - you can add an **arrow** that will point toward the **point**

In [None]:
plt.scatter([0, 1, 2], [0, 1, 2], color='magenta')

plt.annotate('a point', xy=(0, 0), xytext=(0.25, 0.251),
             arrowprops=dict())

plt.annotate('not a point', xy=(1, 2), xytext=(0.15, 1.75),
             arrowprops=dict(arrowstyle='fancy'))
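As a small illustration of *xy* versus *xytext*, the sketch below annotates the highest point of a sine curve. The text position `(3.5, 0.75)` and the names `i_max`, `peak` and `note` are arbitrary choices made for this example.

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 2*np.pi, 200)
y = np.sin(x)

i_max = np.argmax(y)                       # index of the highest point of the curve
peak = (x[i_max], y[i_max])

plt.plot(x, y)
note = plt.annotate('maximum',
                    xy=peak,               # the point being annotated
                    xytext=(3.5, 0.75),    # where the text is written
                    arrowprops=dict(arrowstyle='->'))
```

The arrow starts at the text and points toward *xy*, so moving *xytext* changes only the layout, not the annotated point.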

#### d) boxplot

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.boxplot.html

   - we generate a **dataframe** of **people** with random **age**, **gender**, **height** and **weight**

In [None]:
N = 20 # number of elements
df = pd.DataFrame({'age':     np.random.randint(15, 45, size=N),
                   'gender':  np.random.choice(['S', 'M'], size=N),
                   'height' : np.random.randint(140, 189, size=N)/100,
                   'weight':np.random.randint(350, 890, size=N)/10},)

In [None]:
df.boxplot(['age', 'weight'])

   - we add some **outliers**
   - read the help

In [None]:
df.loc[0, 'height'] = 2.5
df.loc[1, 'height'] = 2.6
df.loc[2, 'height'] = 0.8
df.loc[3, 'height'] = 0.6

In [None]:
df.boxplot(['height'])
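To understand which points a boxplot draws as outliers, the rule can be reproduced by hand. The sketch below assumes the default whisker setting of matplotlib (`whis=1.5`, i.e. 1.5 times the interquartile range); the `heights` values are made up for the illustration.

```python
import pandas as pd

# hand-made heights in meters, with two unrealistic values
heights = pd.Series([1.55, 1.60, 1.62, 1.68, 1.70, 1.72, 1.75, 1.80, 2.60, 0.60])

q1, q3 = heights.quantile([0.25, 0.75])      # first and third quartiles (edges of the box)
iqr = q3 - q1                                # interquartile range (height of the box)
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr   # beyond these limits a point is drawn as an outlier

outliers = heights[(heights < low) | (heights > high)]
print(outliers.tolist())
```

Only the two extreme heights fall outside the limits, which matches the isolated markers that *boxplot* draws above and below the whiskers.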

#### d) histograms

In [None]:
df.hist();

In [None]:
df.hist(['height'], grid=False, bins = 10);

#### d) barchart

In [None]:
df = pd.DataFrame({'speed' : [0.1, 17.5, 40, 48, 52, 69, 88],
                   'lifespan' : [2, 8, 70, 1.5, 25, 12, 28]},
                  index = ['snail', 'pig', 'elephant',
                           'rabbit', 'giraffe', 'coyote', 'horse'])

In [None]:
df.plot.barh();

In [None]:
ax = df.plot.barh(x='lifespan', y='speed')

#### e) you can **save** the **plots**

In [None]:
#plt.savefig?

In [None]:
x = np.linspace(-10, 10, 50)
y = np.power(x, 2)
plt.title('$y = x^2$', fontsize=20)
plt.xlabel('x')
plt.ylabel('$x^2$')
plt.plot(x, y, label='$x^2$')
plt.legend(fontsize=12)

plt.savefig('my_figure.png')
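*plt.savefig* chooses the file format from the extension and accepts options such as the resolution. A minimal sketch follows; the temporary folder and the names `out_dir`, `png_path` and `pdf_path` are assumptions made so that the example does not overwrite any of your files.

```python
import os
import tempfile

import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(-10, 10, 50)
plt.figure()
plt.plot(x, x**2, label='$x^2$')
plt.legend()

out_dir = tempfile.mkdtemp()                         # hypothetical output folder
png_path = os.path.join(out_dir, 'my_figure.png')
pdf_path = os.path.join(out_dir, 'my_figure.pdf')

plt.savefig(png_path, dpi=150, bbox_inches='tight')  # raster image, 150 dots per inch
plt.savefig(pdf_path, bbox_inches='tight')           # vector format, chosen from the extension
```

`bbox_inches='tight'` trims the white margins around the drawing, which is often convenient for reports.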

#### f) plotting an array like an **image** (on a 2D regular raster)

   - we create an array

In [None]:
i = np.random.random((50, 100)) # numbers in the half-open interval [0, 1)

   - we **plot** the array like we plot an **image**

In [None]:
my_map = plt.imshow(i)

In [None]:
plt.imshow(i, cmap=plt.cm.Blues, alpha=0.5) # color map, transparency (alpha)
plt.colorbar()

#### g) plotting a 2D function using a grid of points

   - **suppose** you want to plot a **function** *foo(x, y)*

   - you create all the **pairs** *(x, y)* that **cover** the area
   - you **compute** the function on **each** point
   - you plot the **image**

example of the Gaussian function of the variables $x$ and $y$:
   - $foo(x, y) = \dfrac{1}{2 \pi \sigma^2} e^{-\dfrac{(x-\mu_x)^2+(y-\mu_y)^2}{2 \sigma^2}}$

In [None]:
def foo (x, y, mu_x, mu_y, sigma):
    return (1/(2*np.pi*sigma**2))*np.exp(-(np.power(x - mu_x, 2) + np.power(y - mu_y, 2))/(2*sigma**2))

In [None]:
x, y = np.mgrid[-50:50, -50:50]
z = foo(x, y, 0.5, 0.4, 20)

In [None]:
plt.imshow(z)
plt.axis('on')  # to show the axes
cc = plt.colorbar()
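By default *imshow* labels the axes with row and column indices and puts row $0$ at the top. The sketch below shows one possible way to display data coordinates instead, using the *origin* and *extent* parameters. The function name `gaussian`, the peak position `(10, -20)` and the use of *np.meshgrid* are choices made for this illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

def gaussian(x, y, mu_x, mu_y, sigma):
    # same formula as foo, rewritten here to keep the sketch self-contained
    return np.exp(-((x - mu_x)**2 + (y - mu_y)**2) / (2 * sigma**2)) / (2 * np.pi * sigma**2)

xs = np.arange(-50, 50)
ys = np.arange(-50, 50)
x, y = np.meshgrid(xs, ys)        # x varies along the columns, y along the rows
z = gaussian(x, y, 10, -20, 20)   # peak placed at (10, -20), an arbitrary choice

img = plt.imshow(z,
                 origin='lower',                         # row 0 at the bottom, like a usual y axis
                 extent=[xs[0], xs[-1], ys[0], ys[-1]])  # axes labelled in data coordinates
plt.colorbar(img)
```

With these two parameters the bright spot appears near $x = 10$, $y = -20$ on the axes, instead of at an array index.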

## 3) figures

https://matplotlib.org/faq/usage_faq.html

  - with **functions** of *matplotlib.pyplot*
  - there is the **notion** of a **current figure** and **current axes**

in this example:
   - the **first call** *plt.plot(x, y)* **creates** the **Axes**
   - the **second call** *plt.plot(x, z)* **adds** the plot to the **same Axes**

In [None]:
x = np.linspace(0, np.pi, 100)
y = np.cos(4*x)+ np.sin(-x)
z = -2 * np.cos(x)+ 4*np.sin(5*x)

plt.plot(x, y)
plt.plot(x, z);

   - we can create a **figure** with one or many **drawings**
   - the **drawings** are called **Axes**

   - by default the **figure** has a grid of $1$ row and $1$ column, and the plot has index $1$

**parameters** of **figures**

   - the **current** figure can be **accessed**

In [None]:
fig = plt.gcf()

In [None]:
fig.number   # number of figure

In [None]:
fig.dpi         # resolution in dots per inch

In [None]:
fig.get_size_inches()  # figure size in inches (width, height)

In [None]:
fig.get_frameon() # whether the figure frame (background patch) is drawn

In [None]:
fig.get_figheight(), fig.get_figwidth() # fig size in inch

   - you can give the **size** and the **dpi**

In [None]:
# size is 8x6 inches, 80 dots per inch
plt.figure(figsize=(8, 6), dpi=80);

### a) create a **figure** with a single **Axes** (a single drawing)

In [None]:
x = np.linspace(0, 2*np.pi, 100)
y = np.sin(x)
fig, ax = plt.subplots()

ax.plot(x, y);

   - you can add a **title** and **axis labels** to the **Axes**

In [None]:
ax.set_title('sinus')
ax.set_xlabel('x coordinate')
ax.set_ylabel('y coordinate');
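The limits and ticks set earlier with *plt.xlim*, *plt.xticks* and *plt.ylim* have counterparts among the **Axes** methods. A minimal sketch, with tick positions chosen only for the illustration:

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(-2*np.pi, 2*np.pi, 100)

fig, ax = plt.subplots()
ax.plot(x, np.sin(x))

ax.set_xlim(-2*np.pi, 2*np.pi)                         # same role as plt.xlim
ax.set_xticks([-2*np.pi, -np.pi, 0, np.pi, 2*np.pi])   # same role as plt.xticks (positions)
ax.set_xticklabels([r'$-2\pi$', r'$-\pi$', '0', r'$\pi$', r'$2\pi$'])  # tick labels
ax.set_ylim(-2, 2)                                     # same role as plt.ylim
```

Working on `ax` makes it explicit which drawing is modified, which matters as soon as a figure holds several **Axes**.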

   - getting the **current** **Axes**

In [None]:
x = np.linspace(-2*np.pi, 2*np.pi, 50)
y = np.sin(x)
plt.plot(x, y)
ax = plt.gca() # gca stands for 'get current axes'

   - giving a size to a figure

In [None]:
fig = plt.figure(figsize=(10, 2))
x = np.linspace(-2*np.pi, 2*np.pi, 50)
y = np.sin(x)
plt.plot(x, y)

   - the **lines** at the **borders** of an **Axes** are called **spines**
   - there are four **spines**: **top**, **left**, **right** and **bottom**
   - you can hide or move the **spines** of an **Axes**

   - to hide a spine, **set** its **color** to the string **'none'**
   - to move a spine, give its **new** position in **data** coordinates

   - to change the **spines** you need a reference to the **Axes**

   - we move the **left** and **bottom** **spines**
   - of the **current Axes**
   - to the **origin** of the **data**

In [None]:
x = np.linspace(-np.pi, np.pi, 50)
y = np.sin(x)
plt.plot(x, y)

ax = plt.gca()  # gca stands for 'get current axes'

ax.spines['right'].set_color('none')
ax.spines['top'].set_color('none')

ax.spines['bottom'].set_position(('data', 0))
ax.spines['left'].set_position(('data', 0))


### b) **create** a **figure** with **several** **Axes** with *pyplot.subplots(nrows, ncols)*

   - you give the **number** of **rows** and the **number** of **cols**

In [None]:
fig, axes = plt.subplots(2, 3) # 2 rows, 3 columns

   - the **axes** is a *numpy.ndarray*

In [None]:
axes.shape # 2 rows and 3 columns

   - we can **attach** **plots** to **Axes**

In [None]:
x = np.linspace(0, 2*np.pi, 50)
y = np.sin(x)
z = np.cos(x)

In [None]:
axes[0, 0].plot(x, y) 
axes[1, 2].plot(x, z) 
fig

   - we can give a **global title** to the **figure**

In [None]:
fig.suptitle('trigonometric functions')
fig
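When every **Axes** of the grid receives a plot, looping over the array is shorter than indexing each cell. The following sketch assumes we want one sine curve per cell; the `sharex`, `sharey` and `figsize` values are illustrative choices.

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 2*np.pi, 50)

fig, axes = plt.subplots(2, 3, sharex=True, sharey=True, figsize=(9, 4))

# axes.flat walks through the 2x3 array row by row
for k, ax in enumerate(axes.flat, start=1):
    ax.plot(x, np.sin(k * x))
    ax.set_title(f'sin({k}x)')

fig.suptitle('sine curves with increasing frequency')
fig.tight_layout()    # reduce overlaps between titles and tick labels
```

Sharing the axes keeps the six drawings on the same scale, so the curves can be compared at a glance.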

### c) **create** a **figure** with **several** **Axes** with *pyplot.subplot(pos)*

you give the **position** by a **three digit integer**:
   - the **first** digit is the number of **rows**
   - the **second** digit is the number of **columns**
   - the **third** digit is the **index** of the subplot
   - the **index** goes from **1** (upper left corner) to the number of subplots (lower right corner), row by row

   - the **previous** example becomes

In [None]:
plt.figure()
ax = plt.subplot(231)
ax.plot(x, y)

ax = plt.subplot(232)
ax = plt.subplot(233)
ax = plt.subplot(234)
ax = plt.subplot(235)

ax = plt.subplot(236)
ax.plot(x, z)

### d) to **create** **multiple** figures

   - call *plt.figure(i)* with **increasing** figure numbers

In [None]:
fig1 = plt.figure(1)
plt.subplot(121)
plt.subplot(122)

fig2 = plt.figure(2)
plt.subplot(211)
plt.subplot(212)
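Calling *plt.figure* with the number of an existing figure does not create a new one, it makes that figure current again. A short sketch of this behavior, where the name `open_figures` is only illustrative:

```python
import matplotlib.pyplot as plt

plt.close('all')                 # start from a clean state

fig1 = plt.figure(1)
plt.plot([0, 1], [0, 1])         # drawn on figure 1

fig2 = plt.figure(2)
plt.plot([0, 1], [1, 0])         # drawn on figure 2

plt.figure(1)                    # figure 1 becomes the current figure again
plt.title('first figure')        # so the title goes to figure 1

open_figures = plt.get_fignums() # numbers of the figures that currently exist
```

This is the state-based interface at work: each *plt* call applies to whatever figure and **Axes** are current at that moment.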

## 5) use **matplotlib** for data science in **python**

**matplotlib**
   - it has **two** **interfaces**: a **state-based** interface (*pyplot*) and an **object-oriented** interface
   - furthermore, you can use **tools** (like pandas or seaborn) built **on top of** matplotlib
   - **finding out** which **solution** to use can be **challenging** and **confusing**


   - **but** to **work** with the **python data science stack**
   - you **need** to have **basic** knowledge of **matplotlib**

  - the *plt.figure* is the **whole** image
  - a **figure** can contain $1$ or more **axes**
  - **axes** are the **individual** plots
  
  
  - data-visualisation in **matplotlib** is based on **figures** and **axes**


  - when dealing with **pandas** **data frames** use **pandas** plotting

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline

import numpy as np
import pandas as pd

### a) simple plotting of a **dataframe**

   - the **dataset** contains the French nuclear and renewable **electricity production** between 1970 and 2011

   - we **read** the file
   - the **index** is set to the **first** column (the years)

In [None]:
file = 'france_prod_elec_enr_nucl.csv'

In [None]:
df = pd.read_csv(file, index_col=0)

In [None]:
df.head()

   - a *pandas.DataFrame* has a **method** to  **plot**
   - the **method** is **built on** **matplotlib.pyplot.plot**

In [None]:
df.plot();

   - a *pandas.Series* has a **method** to  **plot**
   - the **method** is **built on** **matplotlib.pyplot.plot**

In [None]:
(df['prod brute elec nucleaire'] + df['prod brute elec primaire renouv']).plot()
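Because the pandas *plot* method draws with matplotlib, it returns the **Axes** it used, and this object can then be customized with the usual methods. The sketch below uses a tiny made-up table (`df_demo`, with invented numbers) instead of the CSV file, only to stay self-contained.

```python
import pandas as pd
import matplotlib.pyplot as plt

# made-up numbers, only to have something to draw
df_demo = pd.DataFrame({'nuclear': [10, 60, 220, 310],
                        'renewable': [55, 65, 60, 70]},
                       index=[1970, 1980, 1990, 2000])

ax = df_demo.plot()                       # pandas returns the matplotlib Axes it drew on
ax.set_title('electricity production (illustrative data)')
ax.set_ylabel('production (arbitrary unit)')
ax.grid(True)
```

The same pattern applies to a *pandas.Series*: keep the returned `ax` and continue with matplotlib.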

### b) plotting columns two by two

- **2D plots** can show **several** pieces of information **thanks to** **colors**, **shapes**, ...

   - we use the well-known **iris flowers** dataset
   - it describes three types of **iris**: **virginica**, **versicolor** and **setosa**
   - by the **length** and the **width** of their **sepals** and **petals**
   - we have $50$ irises of each type

In [None]:
df = pd.read_csv('iris123.csv')

In [None]:
df.head(2)

In [None]:
df.dtypes

   - we can plot one column as a function of another

In [None]:
df.plot.scatter(x='sepal length', y = 'sepal width')

   - it is more **informative** if you add the **type**
   - with the **c** parameter of the **scatter** function

In [None]:
plt.scatter(df['sepal length'], df['sepal width'], c=df['type'])

   - we can **use** another column to **size** the **markers**

In [None]:
plt.scatter(df['sepal length'], df['sepal width'], c=df['type'], s=df['petal width']*50)

   - a simple plot can give you a lot of information on your data

---

   - we can **use** the **codes** of a categorical column as **color levels** (see complements)

In [None]:
df['type'].value_counts(), df.dtypes['type']

   - **type** is an integer, we need a **categorical type**

In [None]:
df['type'] = df['type'].astype('category')

   - the **type** column now contains values of the **categorical** type

In [None]:
df['type'].cat.codes[47:52]
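The categorical **codes** are plain integers, so they can be passed to the **c** parameter even when the categories themselves are strings. The sketch below uses a tiny hand-made table (`flowers`, with a hypothetical `name` column) standing in for the iris file.

```python
import pandas as pd
import matplotlib.pyplot as plt

# tiny hand-made table standing in for the iris file
flowers = pd.DataFrame({'sepal length': [5.1, 4.9, 7.0, 6.4, 6.3, 5.8],
                        'sepal width':  [3.5, 3.0, 3.2, 3.2, 3.3, 2.7],
                        'name': ['setosa', 'setosa', 'versicolor',
                                 'versicolor', 'virginica', 'virginica']})

flowers['name'] = flowers['name'].astype('category')
codes = flowers['name'].cat.codes            # one integer per category: 0, 1, 2

sc = plt.scatter(flowers['sepal length'], flowers['sepal width'],
                 c=codes, cmap='viridis')
handles, _ = sc.legend_elements()            # one legend entry per distinct code
plt.legend(handles, list(flowers['name'].cat.categories), title='type')
```

The legend pairs each color with its category name, which a bare list of integer codes would not show.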

## 4) plotting 3D figures

   - you construct **two** **arrays** of points $x$ and $y$
   - you **cover** the **area** $(x \times y)$ by a **grid**
   - you have pairs $(x_i, y_j)$
   - you compute a **function** on the **grid** $z = f(x, y)$
   - you **plot** the surface
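The grid step above can be checked on a very small case before drawing anything. In this sketch the arrays and the function (a simple sum) are arbitrary choices, meant only to show the shapes that *np.meshgrid* produces.

```python
import numpy as np

x = np.array([0, 1, 2])         # 3 abscissas
y = np.array([10, 20])          # 2 ordinates

x_mg, y_mg = np.meshgrid(x, y)  # both results have shape (len(y), len(x)) = (2, 3)
z = x_mg + y_mg                 # any function applied element-wise on the grid

print(x_mg)   # each row repeats x
print(y_mg)   # each column repeats y
print(z)      # z[j, i] is the function evaluated at (x[i], y[j])
```

Every pair $(x_i, y_j)$ appears exactly once, at row $j$ and column $i$ of the two grids.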

   - you **import** the **library**

In [None]:
from mpl_toolkits.mplot3d import Axes3D

   - you create a **figure**

In [None]:
fig = plt.figure()

   - you create the **Axes** (the plot) of the **figure**

In [None]:
ax = fig.add_subplot(projection='3d')  # 3D Axes attached to the figure

   - we re-use the example of the Gaussian function
   - $foo(x, y) = \dfrac{1}{2 \pi \sigma^2} e^{-\dfrac{(x-\mu_x)^2+(y-\mu_y)^2}{2 \sigma^2}}$

In [None]:
def foo (x, y, mu_x, mu_y, sigma):
    return (1/(2*np.pi*sigma**2))*np.exp(-(np.power(x - mu_x, 2) + np.power(y - mu_y, 2))/(2*sigma**2))

   - you create the arrays $x$ and $y$

In [None]:
x = np.arange(-30, 30, 0.5)   # x is an array from -30 to +30 (excluded), one point every 0.5
y = x                          # y is the same array

In [None]:
x[0:10]

In [None]:
x.shape

   - you create the **grid**

In [None]:
x_mg, y_mg = np.meshgrid(x, y)

In [None]:
x_mg.shape

In [None]:
x_mg, y_mg

   - you compute the function

In [None]:
z = foo(x_mg, y_mg, mu_x = 0.5, mu_y = 0.4, sigma = 20)

   - you plot the **function**

In [None]:
ax.plot_surface(x_mg, y_mg, z, cmap='winter')

  - all together

In [None]:
x = np.arange(-30, 30, 0.5)   # x is an array from -30 to +30 (excluded), one point every 0.5
y = x                          # y is the same array

x_mg, y_mg = np.meshgrid(x, y)  # we create a grid that covers the square spanned by x and y

z = foo(x_mg, y_mg, 0.5, 0.4, 20) # we apply the function foo (a Gaussian)


fig = plt.figure()                      # we create a figure
ax = fig.add_subplot(projection='3d')   # we create a 3D plot

ax.plot_surface(x_mg, y_mg, z,    # we plot the 3D surface
                alpha = 0.8,      # alpha is the transparency factor
                cmap = 'spring'); # the color map

###### exercise:
   1. create two arrays $x$ and $y$ from $-d$ to $d$ with a step of $0.5$
   1. compute the **meshgrid**
   1. compute the **Euclidean distance** from the origin to each point of the grid
   1. compute the **sine** of the distances
   1. create the **figure** and the **Axes** (the plot)
   1. draw the **sine** in **3D**
   1. do the same but create a function **distance** and call it

In [None]:
# 1
d_x = 4
d_y = 5
step = 0.25

x = np.arange(-d_x, d_x, step)
y = np.arange(-d_y, d_y, step)

# 2
x_mg, y_mg = np.meshgrid(x, y)

# 3
d = np.sqrt(x_mg**2 + y_mg**2)

# 4
z = np.sin(d)

# 5
fig = plt.figure()
ax = fig.add_subplot(projection='3d')

# 6
ax.plot_surface(x_mg, y_mg, z, cmap='Reds')
d.shape

In [None]:
# 1
d_x = 4
d_y = 5
step = 0.25

x = np.arange(-d_x, d_x, step)
y = np.arange(-d_y, d_y, step)

# 2
x_mg, y_mg = np.meshgrid(x, y)

# 7 
def dist (a, b):
    return np.sqrt(a**2 + b**2)

# 4
z = np.sin(dist(x_mg, y_mg))

# 5
fig = plt.figure()
ax = fig.add_subplot(projection='3d')

# 6
ax.plot_surface(x_mg, y_mg, z, cmap='Blues')
z.shape

## xxx) customizing **matplotlib** 

   - **matplotlib** provides **pre-defined** **styles**
   - that you can **set globally**

### a) listing the available styles

In [None]:
from matplotlib import pyplot as plt
import numpy as np

In [None]:
print(plt.style.available)

   - you can see a lot of **seaborn** styles
   - we will talk about **seaborn** later on

### b) setting a global style in your current environment

In [None]:
plt.style.use('ggplot') # a popular plotting-style
plt.plot(np.sin(np.linspace(0, 4*np.pi, 100)))

   - you can **isolate** the effect of a style
   - in a **context**

In [None]:
with plt.style.context('dark_background'):   # set a style locally (in the body of the with)
    plt.plot(np.sin(np.linspace(0, 2 * np.pi)))
plt.show()
plt.plot(np.sin(np.linspace(0, 2 * np.pi)))   # use the previously globally-set style
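A style is essentially a set of values for *plt.rcParams*, so the effect of the context can be observed without drawing anything. The sketch below looks at a single parameter, `axes.facecolor`, chosen here only as an example.

```python
import matplotlib.pyplot as plt

before = plt.rcParams['axes.facecolor']          # background color of the Axes in the current style

with plt.style.context('dark_background'):
    inside = plt.rcParams['axes.facecolor']      # value imposed by the style, inside the with block

after = plt.rcParams['axes.facecolor']           # restored when the block ends

print(before, inside, after)
```

The value seen inside the block differs from the one outside, and the original setting comes back as soon as the block is left.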

   - you can create your own **custom** styles
   - (outside the scope of this presentation) 
   - see https://matplotlib.org/users/customizing.html

### xxx) the **xkcd** style
   - https://xkcd.com/676/
   - https://matplotlib.org/xkcd/gallery.html
   
   
   - installing new **fonts** in **matplotlib**
   - is **outside** the **scope** of this **introduction**
   - (**refer** to the documentation)

In [None]:
with plt.xkcd():
    plt.plot(np.sin(np.linspace(0, 4*np.pi)))
    plt.title('Whoo Hoo!!!');

# Seaborn
http://seaborn.pydata.org/index.html

   - it is built on top of Matplotlib
   - its default styles and color palettes are more sophisticated than Matplotlib's
   - it is a higher-level library than matplotlib, i.e. it is easier to generate certain kinds of plots

   - **seaborn** is for more complex statistical visualizations
   - use matplotlib to customize the pandas or seaborn visualization



   - a **library** dedicated to **statistical graphics** in Python
   - built **on top** of **matplotlib**
   - **closely** integrated with **pandas**
   - i.e. **plotting functions** operate on **dataframes** and **arrays**


   - it offers API specialized on **statistics**
   - it eases **visualising** the structure of complex datasets (through abstractions)
   - it eases **controlling** the **styling** of **matplotlib** figures with **built-in themes**
   - it makes **visualization** a **central part** of **exploring** and **understanding** data

**pip install seaborn**

In [None]:
import seaborn as sns

   - **seaborn** offers many **functionalities**
   - but **further** customization requires **matplotlib directly**

basic 

In [None]:
sns.set()
tips = sns.load_dataset("tips")
sns.relplot(x="total_bill", y="tip", col="time",
            hue="smoker", style="smoker", size="size",
            data=tips);

In [None]:
import seaborn