# Data visualisation




[Matplotlib](https://matplotlib.org) is the most commonly used visualisation library in Python. It provides basic 2D, statistical, coordinate and 3D plots. We briefly introduce Matplotlib in this session; most visualisations, however, will be done with [Seaborn](https://seaborn.pydata.org), which is built on top of Matplotlib, offers more sophisticated plot methods and hides most low-level interactions with Matplotlib. With Seaborn, similar visualisations can be achieved in different ways, which may seem confusing at first but also provides flexibility.


### Outline

- Data preparation :
    - Reshape : long and wide format
    - Join DataFrames : left/right/inner/outer
- Matplotlib basics
- Seaborn
    - Univariate plots
    - Bivariate plots
    - Axes-level and Grid-level
    - FacetGrid


In [None]:
import numpy as np
import pandas as pd
rng = np.random.default_rng(42)

## Join DataFrames

We have already seen how to concatenate DataFrames along a given axis (`pd.concat`). There are situations where we would like to join DataFrames based on the values of one or more variables, also known as `key(s)`. This can be done with the method `DataFrame.merge`. Let `df1` and `df2` be DataFrames with common key(s), then:

**Synopsis: &nbsp; &nbsp;**<tt>df1.merge(df2, on=None, how='inner',...)</tt>
- on: variable(s) in both DataFrames, known as key(s)
- how: {left, right, outer, inner}, with 'inner' as default

This returns a DataFrame with all columns of `df1` and `df2`, with rows combined where a match was found on the given key(s). The merge result can be controlled with the `how` argument:

| Join Type | Description |
| --- | --- |
| **Inner** | Returns only rows that have matching keys in both DataFrames. |
| **Outer** | Returns all rows from both DataFrames. Fills in missing values as `NaN` for keys that don't overlap. |
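| **Left** | Returns all rows from the left DataFrame with the matching rows from the right DataFrame. Columns from the right are `NaN` where no match exists. |
| **Right** | Returns all rows from the right DataFrame with the matching rows from the left DataFrame. Columns from the left are `NaN` where no match exists. |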
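
The four join types in the table can be compared on two small hand-made DataFrames. This is only a sketch: the names and values below are made up for illustration, and `indicator=True` adds a `_merge` column that shows where each row came from.

```python
import pandas as pd

# Two toy DataFrames sharing the key 'name' (values are made up)
left = pd.DataFrame({"name": ["Ann", "Bob", "Cy"], "age": [31, 45, 27]})
right = pd.DataFrame({"name": ["Bob", "Cy", "Dee"], "height": [180, 165, 172]})

# Run the same merge with each join type; indicator=True adds a '_merge'
# column with the values 'left_only', 'right_only' or 'both'
results = {
    how: left.merge(right, on="name", how=how, indicator=True)
    for how in ["inner", "outer", "left", "right"]
}

for how, res in results.items():
    print(f"how={how!r}: {len(res)} rows")
    print(res, end="\n\n")
```

Only `Bob` and `Cy` occur in both DataFrames, so the inner join keeps those two rows, while the outer join also keeps `Ann` and `Dee` with missing values in the columns of the other DataFrame.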



In [None]:
import names
name_pool =  [names.get_first_name() for _ in range(10)]
df1 = pd.DataFrame({'name': rng.choice(name_pool,5, replace=False) , 'age':  rng.choice(range(18,80),5) })
df2 = pd.DataFrame({'name': rng.choice(name_pool,5, replace=False) , 'height':  rng.choice(range(150,190),5) })

In [None]:
df1.merge(df2, on='name') # default how='inner'
df1.merge(df2, how='outer', on='name') # how='outer' keeps the names of both DataFrames

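The key(s) should normally be unique within each DataFrame. When a key value occurs more than once, every matching row on one side is paired with every matching row on the other side, so the result may contain more rows than expected. You can check uniqueness with the `duplicated` method: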

In [None]:
df1.name.duplicated().sum(), df2.name.duplicated().sum()
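
To see what a non-unique key does in practice, the sketch below merges two toy DataFrames in which `Ann` occurs twice on both sides (made-up values). It also uses the `validate` argument of `merge`, which raises an error when the keys do not satisfy the stated relationship.

```python
import pandas as pd

# 'Ann' occurs twice in both DataFrames (values are made up)
people = pd.DataFrame({"name": ["Ann", "Ann", "Bob"], "age": [31, 52, 45]})
heights = pd.DataFrame({"name": ["Ann", "Ann", "Bob"], "height": [160, 175, 180]})

# Each 'Ann' on the left is paired with each 'Ann' on the right
merged = people.merge(heights, on="name")
print(merged)
print("rows after merge:", len(merged))

# validate='one_to_one' makes pandas check key uniqueness on both sides
try:
    people.merge(heights, on="name", validate="one_to_one")
    validation_failed = False
except pd.errors.MergeError as err:
    validation_failed = True
    print("merge rejected:", err)
```

Two rows for `Ann` on each side yield four combined rows, which is usually not what was intended when the key was assumed to identify a single person.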

## Data preparation

We will use the dataset [Framingham Heart Study](https://www.framinghamheartstudy.org/) ([Wikipedia](https://en.wikipedia.org/wiki/Framingham_Heart_Study)) with 4434 observations:


  - categorical :
    - general: SEX, CURSMOKE, EDUC, BPMEDS
    - events : ANGINA, HOSPMI, STROKE, CVD, HYPERTEN, DEATH
  - discrete : AGE, RANDID, HEARTRTE
  - continuous : SYSBP, DIABP, BMI

See also the appendix at the end of this document for the variable descriptions.

In [None]:
fmh = pd.read_csv("data/framingham.csv")

### Reshape

The same data may be organised in different ways depending on the context. In the long format, several columns are collapsed into one categorical variable plus one value column; in the wide format, each category becomes a column of its own. Data must often be transformed into the proper shape before it can be visualised.


#### Wide to long : `pd.melt`

**Synopsis: &nbsp; &nbsp;**<tt>pandas.melt(frame, id_vars=None, value_vars=None, var_name=None, value_name='value')</tt>
- frame: DataFrame to reshape
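- id_vars : column(s) to keep as identifier variables
- value_vars : column(s) to collect into a new categorical variable (default: all columns not in id_vars)
- var_name : name of the new column holding the former column names
- value_name : name of the new column holding the values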

To illustrate we will take a small sample of three events {ANGINA,CVD,DEATH} along with the `RANDID`:

In [None]:
df = fmh[["RANDID", "ANGINA", "CVD", "DEATH"]].head(3)
df

In [None]:
df_long = pd.melt(frame=df, id_vars='RANDID', value_vars=["ANGINA", "CVD", "DEATH"], var_name='EVENT', value_name='VALUE')
df_long

#### Long to wide : `pivot`


**Synopsis: &nbsp; &nbsp;**<tt>DataFrame.pivot(index=None, columns=None, values=None)</tt>
- index : column to set as index
- columns : variable containing the new column names
- values : variable containing the values

In [None]:
df_wide = df_long.pivot(index='RANDID', columns='EVENT', values='VALUE') # .reset_index().rename_axis(None,axis=1)
df_wide
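
The commented-out `.reset_index().rename_axis(None, axis=1)` above turns the pivoted result back into a flat DataFrame. The sketch below checks this round trip on a small stand-in table; the values are made up and only the column names follow the Framingham sample.

```python
import pandas as pd

# Small wide table with the same column names as the sample above (made-up values)
wide = pd.DataFrame({
    "RANDID": [1, 2, 3],
    "ANGINA": [0, 1, 0],
    "CVD": [1, 1, 0],
    "DEATH": [0, 1, 0],
})

# Wide -> long: without value_vars, all columns except id_vars are collected
long = wide.melt(id_vars="RANDID", var_name="EVENT", value_name="VALUE")

# Long -> wide: pivot, then move RANDID back to a column and drop the
# leftover 'EVENT' name on the column axis
restored = (
    long.pivot(index="RANDID", columns="EVENT", values="VALUE")
        .reset_index()
        .rename_axis(None, axis=1)
)

print(long.shape)
print(restored)
print("round trip recovers the original:", restored.equals(wide))
```

Note that `pivot` requires each (`index`, `columns`) pair to occur only once; with duplicates it raises an error, and an aggregating function such as `pivot_table` is needed instead.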

## Matplotlib basics

In [None]:
import matplotlib.pyplot as plt

To quickly plot data on a default figure and axes, we can use the [pyplot.*](https://matplotlib.org/stable/api/pyplot_summary.html) functions, which silently create a figure and axes.

In [None]:
# single plot ; line
x_ = rng.standard_normal(100)
plt.plot(x_);                   # linestyle='--', color='orange', linewidth=2, alpha=0.5
# plt.plot(range(len(x_)),x_)   # <=> plot(x_)
# plt.hist(x_, bins=10)
# plt.boxplot(x_);
# plt.ecdf(x_)
plt.xlabel('X')
plt.ylabel('Y')
plt.title('Not my first plot')
plt.legend(['Standard Normal'])

You may need to manage multiple figures simultaneously and be able to have more control over the figure attributes such as figure size and resolution (DPI).


In [None]:
# single figure with a single plot
fig = plt.figure()
ax = fig.add_subplot()
ax.plot(x_)
ax.set_title('Not my first plot')
ax.set_xlabel('X')
ax.set_ylabel('Y')
ax.set_ylim([-5,5]);
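
The figure size and resolution mentioned above can be set when the figure is created. The sketch below is a minimal illustration with random data; the sizes and DPI values are arbitrary choices, not recommendations.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
values = rng.standard_normal(100)

# figsize is given in inches (width, height); dpi is dots per inch
fig_small, ax_small = plt.subplots(figsize=(4, 2), dpi=72)
ax_small.plot(values)
ax_small.set_title("4 x 2 inches at 72 DPI")

# A second, independent figure with its own size and resolution
fig_large, ax_large = plt.subplots(figsize=(8, 4), dpi=150)
ax_large.hist(values, bins=15)
ax_large.set_title("8 x 4 inches at 150 DPI")

# Size in pixels = size in inches * DPI
print(fig_small.get_size_inches() * fig_small.dpi)
print(fig_large.get_size_inches() * fig_large.dpi)
```

Keeping the `fig` and `ax` objects in separate variables is what makes it possible to work on several figures at once, in contrast to the implicit `plt.*` calls that always act on the current figure.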

In [None]:
# multiple plots
x_ = rng.standard_normal(50)
y_ = rng.standard_normal(50)
fig, axes = plt.subplots(2,2) # sharex, sharey, figsize, dpi
axes[0,0].plot(x_)
axes[0,1].hist(x_)
axes[1,0].scatter(x_,y_)
axes[1,1].boxplot([x_,y_]);

## Exercise : Matplotlib

1) The `diamonds` dataset  is included in the data folder of this session.

 - a) Plot price against volume (x\*y\*z).
 - b) The same plot as (a) but only for entries with `volume` >0 and <600.
 - c) Set point size=.5 and transparency=.5.
 - d) Colour data points per `cut`. You can use the `c` argument for colour.
 - e) Add legend  with `plt.legend` for the cut and the corresponding colours. (advanced, optional)
 - f) Set labels and title


## Seaborn

The Seaborn library is organised around three families of plots: [relational](https://seaborn.pydata.org/tutorial/relational.html), [distributional](https://seaborn.pydata.org/tutorial/distributions.html) and [categorical](https://seaborn.pydata.org/tutorial/categorical.html). Each family has one figure-level function (`relplot`, `displot`, `catplot`) and a set of corresponding axes-level functions (e.g. `scatterplot`, `histplot`, `boxplot`).


**Axes-level synopsis: &nbsp; &nbsp;**<tt>sns.{plot-func}(data, x, y, hue, ...)</tt>

**Figure-level synopsis: &nbsp; &nbsp;**<tt>sns.{relplot | catplot | displot}(data, x, y, hue, kind, ...)</tt>

- data: DataFrame, ...
- x, y: variables inside data to be plotted
- hue: variable used to group observations by colour
- kind: the type of plot to draw within the family, e.g. relplot(kind='line', ...)

### Axes vs figure-level plots

| **Aspect** | **Axes-Level Plots** | **Figure-Level (Grid) Plots** |
| --- | --- | --- |
| **Definition** | Focuses on creating individual plots on a single `Axes`. | Manages multiple subplots using a grid layout. |
| **Examples of Functions** | `sns.boxplot()`, `sns.violinplot()`, `sns.histplot()` | `sns.catplot()`, `sns.relplot()`, `sns.lmplot()` |
| **Use Case** | Simple, single plots (one Axes). | Facet-based plots split across categorical variables. |
| **Figure/Layout Control** | Requires manual control using Matplotlib (`plt.figure`, `plt.subplots`). | Automatically manages figure size and layout. |
| **Facet/Grid Support** | Does not support splitting into facets. | Supports splitting data into facets using `col`, `row`. |
| **Customization Scope** | Customize using Matplotlib functions directly, e.g., `plt.title()`, `plt.xlabel()`. | Use Seaborn’s `.set()` method or access `.figure` for full figure adjustments. |
| **Size Control** | Use Matplotlib's `plt.figure(figsize=(w, h))`. | Use `height` (per subplot height) and `aspect` (aspect ratio). |
| **Example Functionality** | `sns.boxplot(x='var', y='value', data=dataset)` | `sns.catplot(x='var', y='value', col='category', data=dataset)` |
| **Output Object** | Returns an `Axes` object. | Returns a grid object (e.g., `FacetGrid`, `PairGrid` or `JointGrid`). |

In [None]:
import seaborn as sns

### Univariate plots

- categorical : countplot
- continuous: histogram, boxplot, violinplot, kdeplot, ecdfplot


In [None]:
ax = sns.countplot(data=fmh,x='SEX', hue='DEATH') # y='SEX', stat={'percent',...}

### sns.histplot

Seaborn is built on top of Matplotlib, and the two interact seamlessly:

In [None]:
# Create a 2x2 grid of plots
fig, axes = plt.subplots(2,2)

# add labels and titles
fig.suptitle("Diastolic and Systolic blood pressure")
axes[0,0].set_title("Systolic")
axes[0,1].set_title("Diastolic")

# place different plots inside the grid
sns.histplot(data=fmh,x='SYSBP', hue='SEX', ax=axes[0,0])
sns.histplot(data=fmh,x='DIABP', hue='SEX', ax=axes[0,1])
sns.histplot(data=fmh,x='SYSBP', hue='HYPERTEN', ax=axes[1,0])
sns.histplot(data=fmh,x='DIABP', hue='HYPERTEN', ax=axes[1,1])

# adjust the vertical space between the rows of plots; see subplots_adjust for more options:
# https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.subplots_adjust.html
fig.subplots_adjust(hspace=.4)

**Quiz:** change the labels in the legends for SEX to {male, female}.

This can be achieved by converting the variable `SEX` to the `category` type and renaming its categories:

In [None]:
# convert 'SEX' [1,2]=>['male','female']
fmh['SEX'] = fmh.SEX.astype('category').cat.rename_categories(['male', 'female'])
# Density line can be added by setting `kde=True`
sns.histplot(data=fmh,x='SYSBP', hue='SEX', kde=True);
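
Passing a list to `rename_categories` relies on the order of the existing categories. As an alternative sketch, a dictionary states the mapping explicitly; the series below is a made-up stand-in for the `SEX` column.

```python
import pandas as pd

# Made-up stand-in for the SEX column (1 = male, 2 = female)
sex = pd.Series([1, 2, 2, 1, 2], name="SEX")

# The dictionary maps each old category to its new label
sex_cat = sex.astype("category").cat.rename_categories({1: "male", 2: "female"})

print(sex_cat.cat.categories.tolist())
print(sex_cat.value_counts().to_dict())
```

Seaborn uses the category labels in the legend, so after this conversion the legend shows `male` and `female` instead of `1` and `2`.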

### sns.boxplot

In [None]:
fig, axes = plt.subplots(1,2, figsize=(10,3))
sns.boxplot(data=fmh,y='SYSBP', hue='SEX', ax=axes[0])
sns.boxplot(data=fmh,y='DIABP', hue='SEX', ax=axes[1])
# legends

axes[0].legend(loc="upper left")
axes[1].legend(loc="upper left");

Plot attributes can be modified via `axes`; see [Axes](https://matplotlib.org/stable/api/axes_api.html) for more details.

### sns.kdeplot

In [None]:
# kernel density estimate
ax = sns.kdeplot(data=fmh,x='SYSBP', hue='SEX', fill=True)
ax.set_title('My density plot')

In [None]:
# empirical cumulative distribution function
sns.ecdfplot(data=fmh,x='SYSBP', hue='SEX');

### Bivariate plots

- continuous: scatter, lmplot, regplot, hex (via jointplot), pair, kde (bivariate)
- categorical : line, bar, point

In [None]:
# lineplot
sns.lineplot(data=fmh, x='AGE', y='HEARTRTE');

**Exercise:** Implement `sns.ecdfplot` by calculating [ecdf](https://en.wikipedia.org/wiki/Empirical_distribution_function) and plotting using `sns.lineplot`.

### sns.violinplot

In [None]:
# violinplot
fig, axes = plt.subplots(1,2)
sns.violinplot(data=fmh,y='SYSBP', hue='SEX', ax=axes[0])
sns.violinplot(data=fmh,y='DIABP', hue='SEX', ax=axes[1])

axes[0].legend(loc="upper left")
axes[1].legend(loc="upper left")
fig.tight_layout();  # call after plotting so that labels and legends are taken into account

### sns.barplot

In [None]:
# barplot
plt.figure(figsize=(8,6))
ax = sns.barplot(data=fmh, x='AGE', y='HEARTRTE')

# # Overlapping labels
# ax.tick_params(axis='x', labelsize=10, labelrotation = 45)
# ax.tick_params(axis='y', labelsize=10)

### sns.scatterplot

A scatter plot can also be created via `relplot`, which will be shown later, but it is included here to illustrate some of its arguments.

**Synopsis: &nbsp; &nbsp;**<tt>scatterplot(data, x, y, hue, style, palette, s, c, alpha)</tt>
- data : DataFrame
- x,y :  variables of interest
- hue :  categorical variable for colouring
- style : categorical variable for the marker style
- palette : colour scheme, e.g.  deep, muted, bright, pastel, dark, colorblind
- kwargs : passed on to Matplotlib, e.g. s (marker size), alpha (transparency)

In [None]:
sns.scatterplot(fmh, x='SYSBP', y='DIABP', hue='SEX',style='ANGINA', palette='colorblind', s=20, alpha=0.5);

### sns.histplot

In [None]:
# histplot
sns.histplot(data=fmh,x='SYSBP',y='DIABP',  hue='SEX')

### sns.kdeplot

In [None]:
sns.kdeplot(fmh, x='SYSBP', y='DIABP', cmap='Greys', fill=True); # filled bivariate density in shades of grey

In [None]:
# combine kdeplot and histplot
sns.kdeplot(fmh, x='SYSBP', y='DIABP', cmap='Greys')
sns.histplot(data=fmh,x='SYSBP',y='DIABP',  hue='SEX', palette="colorblind")


### Figure-level (grid) plots


In [None]:
# axes-level linear regression plot; returns an Axes
p = sns.regplot(fmh.sample(100), x='SYSBP', y='DIABP')

In [None]:
# figure-level counterpart of regplot; returns a FacetGrid
p = sns.lmplot(fmh.sample(100), x='SYSBP', y='DIABP') # lowess requires statsmodels module ; line_kws={'color': 'red'}

In [None]:
# FacetGrid
p = sns.relplot(fmh.sample(100), x='SYSBP', y='DIABP', row='SEX', col='EDUC', hue='ANGINA', style='CURSMOKE')

In [None]:
# JointGrid
jg = sns.jointplot(fmh, x='SYSBP', y='DIABP', hue='SEX', kind='scatter') # 'scatter', 'hist', 'kde'
                                                                        # see jg.plot_joint(...) and jg.plot_marginals(...)

### FacetGrid

In [None]:
fg = sns.FacetGrid(data=fmh,row='SEX', col='EDUC', hue="DEATH")
fg.map_dataframe(sns.scatterplot, x='SYSBP', y='DIABP')
fg.add_legend()

In [None]:
sns.pairplot(data=fmh[['AGE','BMI','SYSBP','DIABP']])

In [None]:
sns.heatmap(fmh[['SYSBP','DIABP','AGE','BMI','HEARTRTE']].corr(), cmap=sns.color_palette('colorblind'))

See [palettes](https://seaborn.pydata.org/tutorial/color_palettes.html) for more  options.

In [None]:
diamonds = pd.read_csv("data/diamonds.csv")
sns.heatmap(diamonds.select_dtypes(include=np.number).corr(),annot=True, linewidths=.01, cmap=sns.color_palette('colorblind'))

## Exercise

### Data : Natural gas consumption

In the exercises on `pandas` we used the *Natural gas consumption in the Netherlands* dataset from [CBS Open data StatLine](https://opendata.cbs.nl/statline/portal.html?_la=en&_catalog=CBS). We continue with the same dataset here for visualisation. We repeat the solution of the last exercise to include `term` and `date` in our DataFrame, but now all other columns of the data are included as well. Some missing values are represented as `'       .'`; replace these with `np.nan`.


1. Plot lines:

- a) Plot `TotalSupply_1` against `date` on a yearly basis.
- b) Draw a horizontal line to mark `TotalSupply_1` at the point where `TotalSupply_1` is just below the latest observation.
- c) Repeat (b), taking 2021 as the latest observation.


2. The import/export variables are those with names starting with `Import` and `Export`. Plot Import/Export  against `date` for all import/export variables (Hint: reshape data). Make sure the legend is correctly placed. Set X and Y axis labels to `Year` (JJ) and `Natural gas (MCM)` respectively and set legend's label to `Import/Export`.

3. Plot (point and lines) `TotalSupply_1` against `month` of all time. Note that you will need to summarise (use groupby/sum) on months of the entire dataset (MM only). Set X and Y axis labels to `Month` and `Natural gas (MCM)` respectively. Set `Month` axis ticks to represent month abbreviations. Hint: use calendar module to get month abbreviations.

4. Plot boxplots of `TotalSupply_1` against `month` of all time. Set X and Y axis labels to `Month` and `Natural gas (MCM)` respectively. Set `Month` axis ticks to represent month abbreviations.

5. Plot boxplots of the import/export variables on a yearly (JJ) basis in log10 scale. Set X and Y axis labels to `Natural gas (MCM)` and `Import/Export` respectively.


## Appendix

###  Framingham Heart Study variables description

- `RANDID`:  Unique identification number for each participant
- `SEX`: Participant sex (1 = male, 2 = female)
- `AGE`: age at examination (years)
- `SYSBP`:  Systolic Blood Pressure (mean of last two of three measurements) (mmHg)
- `DIABP`: Diastolic Blood Pressure (mean of last two of three measurements) (mmHg)
- `BPMEDS`: Use of Anti-hypertensive medication at examination (0=not currently used, 1=currently in use)
- `CURSMOKE`: Smoking (0=not currently, 1=currently)
- `EDUC`:   Attained Education (1=school, 2=High school diploma, 3=some college, 4=college degree)
- `BMI`: Body Mass Index ($weight_{kg}/height_{m}^{2}$)
- `HEARTRTE`: heart rate (bpm)
- `ANGINA`: Angina Pectoris
- `HOSPMI`: Hospitalized Myocardial Infarction
- `STROKE`:  Atherothrombotic infarction, Cerebral Embolism, Intracerebral Hemorrhage, Fatal Cerebrovascular Disease
- `CVD`: Myocardial infarction, Fatal Coronary Heart Disease or Cerebrovascular Disease
- `HYPERTEN`: Hypertensive
- `DEATH`: Death from any cause


### Useful links


| Library | Best For | Official Link |
| --- | --- | --- |
| Matplotlib | General-purpose 2D plotting | [Matplotlib](https://matplotlib.org/) |
| Seaborn | Statistical visualizations | [Seaborn](https://seaborn.pydata.org/) |
| Plotly | Interactive, web-based plots | [Plotly](https://plotly.com/python/) |
| Bokeh | Interactive visualizations & dashboards | [Bokeh](https://bokeh.org/) |
| Altair | Declarative plots for dataframes | [Altair](https://altair-viz.github.io/) |
| Pandas | Fast, simple visualizations for pandas data | [Pandas](https://pandas.pydata.org/) |
| Plotnine | Grammar-of-graphics plots | [Plotnine](https://plotnine.readthedocs.io/) |
| HvPlot | High-level, interactive plotting | [hvPlot](https://hvplot.holoviz.org/) |
| Holoviews | Simplified Python plotting | [Holoviews](https://holoviews.org/) |
| Dash | Apps & dashboards with Plotly | [Dash](https://dash.plotly.com/) |
| Pygal | SVG-based interactive charts | [Pygal](http://www.pygal.org/) |
| Geopandas | Geographical data visualizations | [Geopandas](https://geopandas.org/) |
| Cartopy | Geographic mapping | [Cartopy](https://scitools.org.uk/cartopy/docs/latest/) |
| Datashader | Massive datasets visualization | [Datashader](https://datashader.org/) |
| Mayavi | 3D scientific visualizations | [Mayavi](https://docs.enthought.com/mayavi/mayavi/) |
| VisPy | High-performance visualizations | [VisPy](https://vispy.org/) |



