# Visualisation of BubbleColumn testing data and Covid19 data

Now we will look a bit deeper into different visualisation options.

Again we will start by reading in the files with read_excel and read_csv.

## Our data for plotting:
1. Bubble column test data (combined data from 3 test runs)
2. RKI Covid19 data

In [None]:
import pandas as pd

### Read bubblecolumn excel files and combine into one pandas dataframe

In [None]:
df_bub1 = pd.read_excel("../Data/BubbleColumn/Test_01.xlsx",header=[0,1])
df_bub2 = pd.read_excel("../Data/BubbleColumn/Test_02.xlsx",header=[0,1])
df_bub3 = pd.read_excel("../Data/BubbleColumn/Test_03.xlsx",header=[0,1])

In [None]:
df_bub=pd.concat([df_bub1,df_bub2,df_bub3],keys=["Test1","Test2","Test3"],axis=0,names=["Param","Row_Index"],ignore_index=False)
df_bub

In [None]:
df_bub.index.get_level_values(level=0)

In [None]:
# We convert the Param level of the row multiindex into an additional column (easier for filtering)
df_bub.reset_index(level=0,inplace=True)
df_bub
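
The effect of keys, names and reset_index is easier to see on a tiny example. The following is a minimal sketch with made-up frames (demo_a, demo_b and the diameter column are hypothetical stand-ins, not the bubble column data):

```python
import pandas as pd

# Two tiny stand-in frames (hypothetical values, not the bubble column data)
demo_a = pd.DataFrame({"diameter": [1.0, 1.2]})
demo_b = pd.DataFrame({"diameter": [0.9, 1.1, 1.3]})

# keys labels each input frame; names labels the two resulting index levels
demo_all = pd.concat([demo_a, demo_b], keys=["Test1", "Test2"], names=["Param", "Row_Index"])

# Move the outer index level into a regular column for easier filtering
demo_all = demo_all.reset_index(level="Param")

print(demo_all)
print(demo_all[demo_all["Param"] == "Test2"])
```

Each entry of the list needs its own key, so a frame that is passed twice by accident shows up under two different labels without any warning.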

Now we do some data exploration to check for non-numerical or missing values and to check whether the data is as we expect it.

In [None]:
df_bub.describe()

In [None]:
# Let's check the multiindex column names
df_bub.columns.values
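
With two header rows, every column name is a tuple. As a small sketch of how such columns can be addressed (the frame demo and the column "Area" are hypothetical and only mimic the shape of the real data):

```python
import pandas as pd

# Hypothetical two-level header, similar in shape to header=[0, 1] in read_excel
columns = pd.MultiIndex.from_tuples(
    [("cam0", "Area"), ("cam0", "Max Feret Diameter"), ("erg", "Zeit [ms]")]
)
demo = pd.DataFrame([[3.4, 2.1, 0], [4.9, 2.5, 10]], columns=columns)

# A single column is addressed with a (level 0, level 1) tuple
feret = demo[("cam0", "Max Feret Diameter")]

# All columns below one level-0 label
cam0_only = demo.loc[:, pd.IndexSlice[["cam0"], :]]

print(feret.tolist())
print(cam0_only.columns.get_level_values(1).tolist())
```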

### We also read in RKI Covid19 data as an example of time series and categorical data

In [None]:
df_rki=pd.read_csv("https://www.arcgis.com/sharing/rest/content/items/f10774f1c63e40168479a1feb6c7ca74/data")
df_rki

In [None]:
# Some preprocessing of RKI data to get official results:
# Is the data up to date?
print(df_rki["Datenstand"].unique())
df_rki_temp = df_rki[((df_rki["NeuerFall"]==0) | (df_rki["NeuerFall"]==1))]
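
The same kind of filter can also be written with isin. The sketch below uses a made-up mini sample (demo_rki and its values are hypothetical). As far as I understand the RKI data description, NeuerFall is 0 or 1 for cases contained in the current publication and -1 for cases that only appeared in the previous day's publication, which is why only 0 and 1 are kept:

```python
import pandas as pd

# Hypothetical mini sample with the same column names as the RKI file
demo_rki = pd.DataFrame({
    "Landkreis": ["SK A", "SK A", "LK B", "LK B"],
    "NeuerFall": [0, 1, -1, 0],
    "AnzahlFall": [5, 2, -1, 3],
})

# Keep rows flagged 0 or 1; isin replaces two == comparisons joined with |
demo_valid = demo_rki[demo_rki["NeuerFall"].isin([0, 1])]

print(demo_valid["AnzahlFall"].sum())
```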

## Some general tips:
Always very useful:
- Data Dictionary: Metadata for your column names (explanations, units, etc.).
- Data Catalogue: A catalogue with metadata and storage paths for your testing data.
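
A data dictionary does not need special tooling. One possible form is a small table kept next to the data. This is only a sketch: the descriptions and units below are placeholders and are not taken from the test data.

```python
import pandas as pd

# Hypothetical entries; descriptions and units are placeholders, not taken from the test data
data_dictionary = pd.DataFrame(
    [
        {"column": "Zeit [ms]", "description": "time stamp of the frame", "unit": "ms"},
        {"column": "Max Feret Diameter", "description": "largest caliper distance of a bubble", "unit": "unknown"},
    ]
).set_index("column")

# Look up the unit of one column
print(data_dictionary.loc["Zeit [ms]", "unit"])
```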

## Overview of python visualisation packages 

- matplotlib

    Widely used visualisation library. Easy to use and has a good online community presence.


- pandas built-in plotting library
    
    Plots a dataframe with a single-line command. A scatter plot matrix is easier to create with this library than with matplotlib or bokeh.


- bokeh

    Visualisations are more appealing and come with built-in plot tools (zoom, pan, etc.). However, plots take longer to load, and the library is better suited to creating dashboards. The documentation is also not always clear.
    
- seaborn

    Plotting based on matplotlib, but with lots of nice preformatting. Optimized for statistical plots of dataframes.

- plotly

    Supports contour plots and 3D plots.

- altair / Vega-Lite

    Declarative statistical visualization library with preformatted output. Only a minimal amount of code is required for nice plotting results.
    
### Keep in mind:

- Check the documentation of a module or function with the help() function or by putting ? in front of the function name!
- Questions, Problems? --> Google! --> One of the best sources is stackoverflow
- Module features are dependent on the module version! Check your version:
```
import matplotlib
matplotlib.__version__
```


In [None]:
# import necessary libraries:
import matplotlib.pyplot as plt
import bokeh.plotting as bp # another plotting option 
from bokeh.plotting import figure,output_notebook,show # for plotting
import seaborn as sb # yet another plotting option
import plotly.express as px
import altair as alt

# interactive widgets
import ipywidgets as widgets # interactive notebooks - make selection etc
from IPython.display import display # to display the widgets in notebook

# some more useful stuff:
import os
import datetime 

## Plot histograms
In order to understand the typical distribution of values, you can always start with a histogram.

We start with our BubbleColumn testing data.

We will compare the histogram plots from the matplotlib library and the pandas built-in plotting.

In [None]:
# 1. df_bub - matplotlib

plt.figure(figsize=(15,5))
plt.hist(df_bub['cam0', 'Max Feret Diameter'].dropna(),bins=25, color='green',alpha=0.7) # Remember to dropna!
plt.xlabel('Max Feret Diameter')
plt.title('Max Feret Diameter')
plt.show()

# 2. df_bub - matplotlib: also apply some filtering to zoom into a smaller range
feret = df_bub['cam0', 'Max Feret Diameter']

plt.figure(figsize=(15,5))
plt.hist(feret[(feret>1.2) & (feret<6.0)].dropna(),bins=25, color='grey',alpha=0.7) # Remember to dropna!
plt.xlabel('Max Feret Diameter')
plt.title('Max Feret Diameter')
plt.show()

In [None]:
# 3. df_bub - pandas built-in
df_bub.plot(y=("cam0",'Max Feret Diameter'),kind="hist",bins=25,color="green",alpha=0.7,figsize=(15,5),title='Max Feret Diameter')

## How can histograms be extremely valuable? 
With the help of histograms you can already get an idea about outliers.

They are especially useful if you have data from multiple tests and want to know how one specific test compares to all tests combined.

In [None]:
help(plt.hist)

In [None]:
# Plot 2 overlaid histograms for comparison.
# For this we also need the density keyword! Otherwise the bars of the single test would be much smaller than those of all tests.
plt.figure(figsize=(15,5))
plt.hist(df_bub['cam0', 'Max Feret Diameter'].dropna(),bins=20,density=True, color='blue',label="All tests")
plt.hist(df_bub[df_bub["Param"]=="Test1"]['cam0', 'Max Feret Diameter'].dropna(),bins=20, density=True,color='orange',alpha= 0.35, label="Test 1")
plt.xlabel('Max Feret Diameter')
plt.title('Comparison of one test with all tests combined')
plt.legend()
plt.show()
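
Why the density keyword matters can be checked numerically. The following sketch uses synthetic samples (all_tests and one_test are made-up normal distributions, not the measured diameters) and numpy's histogram function, which to my knowledge normalises in the same way as plt.hist with density=True:

```python
import numpy as np

rng = np.random.default_rng(seed=0)
all_tests = rng.normal(loc=3.0, scale=1.0, size=3000)  # hypothetical "all tests" sample
one_test = rng.normal(loc=3.5, scale=0.8, size=300)    # hypothetical single test

bins = np.linspace(0, 7, 21)  # shared bin edges keep the bars comparable

counts_all, _ = np.histogram(all_tests, bins=bins)
counts_one, _ = np.histogram(one_test, bins=bins)
dens_all, _ = np.histogram(all_tests, bins=bins, density=True)
dens_one, _ = np.histogram(one_test, bins=bins, density=True)

widths = np.diff(bins)
area_all = (dens_all * widths).sum()
area_one = (dens_one * widths).sum()

print(counts_all.max(), counts_one.max())  # raw counts: the larger sample dominates
print(area_all, area_one)                  # with density=True both areas are 1 (up to rounding)
```

Passing the same bin edges to both histograms is a design choice of this sketch. With bins=20 given as a number, each call derives its own edges from its own data range, so the bars of the two histograms do not line up exactly.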

## Some simple scatter / line plots
### Created in a loop with filtering of a large dataframe

1. Example: Bubble testing data
2. Example: RKI Covid19 cases for different Landkreise

Advantages of a scatter plot over a line plot:
Whenever you look at a distribution or a change over time, a line plot hides the density of the data points. To get a feeling for the data, it is always better to start with 'point' as marker instead of 'line'.

In [None]:
df_bub[df_bub["Param"]=="Test3"]["erg","z_bild "]

In [None]:
plt.figure(figsize=(15,5))

for i in df_bub["Param"].unique():
    print(i)
    df_temp=df_bub[df_bub["Param"]==i]
    x=df_temp["erg","Zeit [ms]"]
    y=(df_temp["erg","z_bild "].shift(1)-df_temp["erg","z_bild "])/(df_temp["erg","t_Bilder LabV"].shift(1)-df_temp["erg","t_Bilder LabV"])
    plt.scatter(x,y,label=i)
    
plt.legend()
plt.ylim(0,0.5)
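
The y values in the loop above are a finite-difference quotient: the change in z_bild divided by the change in time between consecutive rows. A minimal sketch with made-up numbers (z and t are hypothetical) shows that the same quotient can be written more compactly with diff():

```python
import pandas as pd

# Hypothetical position z and time t samples
z = pd.Series([0.0, 0.1, 0.3, 0.6])
t = pd.Series([0.0, 1.0, 2.0, 4.0])

rate_shift = (z.shift(1) - z) / (t.shift(1) - t)  # form used in the loop above
rate_diff = z.diff() / t.diff()                   # same quotient, written with diff()

print(rate_diff.tolist())
```

The first entry is NaN in both versions because the first row has no predecessor. plt.scatter simply skips such points.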


## Interactive selection widgets:
Another option to get the plots for different tests interactively:

In this minimal example you have to rerun the plot cell every time you have changed the dropdown values. But of course you can also add a so-called callback to renew the plot automatically when a dropdown value changes.
Check it out: the observe method of the widget (widget.observe).
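
A callback could look roughly like the following sketch. The dropdown options are hard-coded stand-ins for the test names, and on_test_change is a hypothetical function name. In a real notebook the function body would redraw the plot instead of printing.

```python
import ipywidgets as widgets

# Hypothetical stand-in for the test names in df_bub["Param"]
test_dropdown = widgets.Dropdown(options=["Test1", "Test2", "Test3"], value="Test2", description="Test")

selected_history = []

def on_test_change(change):
    # change behaves like a dict with the keys "old", "new", "owner", "name" and "type"
    selected_history.append(change["new"])
    print(f"Selection changed from {change['old']} to {change['new']}")

# names="value" restricts the callback to changes of the selected value
test_dropdown.observe(on_test_change, names="value")

# Changing the value in code triggers the callback just like a click in the notebook
test_dropdown.value = "Test3"
```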

In [None]:
# First we create the selection widget for the test run
Test_selection=widgets.Dropdown(options=df_bub["Param"].unique(), value="Test2", description="Select one test")
display(Test_selection)

In [None]:
# Then we create the selection widget for the 0th level of the multiindex columns:
Parameter1_selection=widgets.Dropdown(options=df_bub.columns.get_level_values(level=0).unique(), value="cam0", description="Select one parameter")
display(Parameter1_selection)

In [None]:
# Then we create the selection widget for the 1st level of the multiindex columns:
Parameter2_selection=widgets.Dropdown(options=df_bub.loc[:,pd.IndexSlice[["cam0"], :]].columns.get_level_values(1).unique(),
                                      description="Select one parameter")
display(Parameter2_selection)

In [None]:
plt.figure(figsize=(15,5))

df_temp=df_bub[df_bub["Param"]==Test_selection.value]
x=df_temp["erg","Zeit [ms]"]
y=df_temp[Parameter1_selection.value,Parameter2_selection.value]
plt.scatter(x,y,label=str(Parameter1_selection.value)+", "+str(Parameter2_selection.value))
    
plt.legend()

## Now let's have a look at the same plot with different packages
### Bokeh --> Interactive plots
Try the different menu options you can see at the right side of the plot.

In [None]:
# This command lets you display bokeh plots below the execution cell
output_notebook()

In [None]:
df_temp=df_bub[df_bub["Param"]==Test_selection.value]
x=df_temp["erg","Zeit [ms]"]
y=df_temp[Parameter1_selection.value,Parameter2_selection.value]

# 1. Bokeh
p = figure(title="Parameter Selection {}, {} for {}".
           format(Parameter1_selection.value,Parameter2_selection.value,Test_selection.value),x_axis_type='datetime',
          width=800,height=250)
p.circle(x=x,
         y=y)
show(p)

### Seaborn
Not interactive, but preformatted for a nice appearance

In [None]:
df_temp=df_bub[df_bub["Param"]==Test_selection.value]
x=df_temp["erg","Zeit [ms]"]
y=df_temp[Parameter1_selection.value,Parameter2_selection.value]

plt.figure(figsize=(15,5))
sb.scatterplot(x=x,y=y) # pass x and y as keywords; recent seaborn versions no longer accept them positionally

### Plotly
Interactive plots. Here you can see the single values when hovering over the points.

With plotly you can also do 3D plots!

In [None]:
df_temp=df_bub[df_bub["Param"]==Test_selection.value]
x=df_temp["erg","Zeit [ms]"]
y=df_temp[Parameter1_selection.value,Parameter2_selection.value]

fig = px.scatter(x=x, y=y)
fig.show()

### And now Altair as well

In [None]:
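# Here we need some additional extensions, so maybe you need to install some additional packages to be able to display the plot.
df_temp=df_bub[df_bub["Param"]==Test_selection.value]
x=df_temp["erg","Zeit [ms]"]
y=df_temp[Parameter1_selection.value,Parameter2_selection.value]

# Altair expects a dataframe with simple (non-multiindex) column names
df_alt=pd.DataFrame({"x":x.to_numpy(),"y":y.to_numpy()})

# Note: by default Altair raises a MaxRowsError for data with more than 5000 rows
chart=alt.Chart(df_alt).mark_point().encode(x="x",y="y").interactive()

# As the last expression of the cell, the chart is displayed in the notebook
chart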

## Doing some line plots in a loop
Let's look at the Covid19 case numbers for each Landkreis.

In [None]:
# Plotting the case numbers
fig=plt.figure(figsize=(12,10))
ax1=fig.add_subplot(111)

df_rki_lk=df_rki_temp.groupby(["Landkreis","Meldedatum"],as_index=False)[["AnzahlFall"]].sum()
for i in df_rki_lk["Landkreis"].unique():
    df=df_rki_lk[df_rki_lk["Landkreis"]==i].copy() # copy, so the inplace operations below do not act on a slice of df_rki_lk
    df.set_index("Meldedatum", inplace=True, drop=True)
    df.index=pd.to_datetime(df.index,format="%Y-%m-%d")
    df.sort_index(inplace=True)
    ax1.plot(df["AnzahlFall"],color="grey",alpha=0.3)

    if "Berlin" in i:
        df_b=df

ax1.plot(df_b["AnzahlFall"],color="red",label="Berlin")
plt.yscale("log")
plt.title("Case numbers - Reporting Date - for each Landkreis in Germany")

In [None]:
df_rki_lk["Landkreis"].unique()

### Create multiple subplots

Plot different Landkreise in subplots.

In [None]:
plt.figure(figsize=(15,7))

plt.subplot(2,2,1)
plt.plot(df_b["AnzahlFall"],'.')
plt.plot(df_b["AnzahlFall"],'-', color="grey", alpha=0.5)
plt.title("Subplot1 - Berlin")

plt.subplot(2,2,2)
plt.plot(df_rki_lk[df_rki_lk["Landkreis"]=="LK Darmstadt-Dieburg"]["AnzahlFall"],'.')
plt.plot(df_rki_lk[df_rki_lk["Landkreis"]=="LK Darmstadt-Dieburg"]["AnzahlFall"],'-', color="grey", alpha=0.5)
plt.title("Subplot 2 - LK Darmstadt-Dieburg")

plt.subplot(2,2,3)
plt.plot(df_rki_lk[df_rki_lk["Landkreis"]=="LK Friesland"]["AnzahlFall"],'.')
plt.plot(df_rki_lk[df_rki_lk["Landkreis"]=="LK Friesland"]["AnzahlFall"],'-', color="grey", alpha=0.5)
plt.title("Subplot 3 - LK Friesland")

plt.subplot(2,2,4)
plt.plot(df_rki_lk[df_rki_lk["Landkreis"]=="LK Heinsberg"]["AnzahlFall"],'.')
plt.plot(df_rki_lk[df_rki_lk["Landkreis"]=="LK Heinsberg"]["AnzahlFall"],'-', color="grey", alpha=0.5)
plt.title("Subplot 4 - LK Heinsberg")

plt.show()
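
The four nearly identical blocks can also be written as a loop over a grid of axes. This is a sketch with synthetic data (demo_cases and the district names "LK A" to "LK D" are made up). It puts the reporting date on the x axis of every subplot.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical daily case counts for four made-up districts
dates = pd.date_range("2020-03-01", periods=5, freq="D")
demo_cases = pd.DataFrame({
    "Landkreis": [name for name in ["LK A", "LK B", "LK C", "LK D"] for _ in dates],
    "Meldedatum": list(dates) * 4,
    "AnzahlFall": [1, 2, 4, 3, 5] * 4,
})

fig, axes = plt.subplots(2, 2, figsize=(15, 7), sharex=True)

# One subplot per district; groupby yields (name, sub-frame) pairs
for ax, (name, group) in zip(axes.flat, demo_cases.groupby("Landkreis")):
    series = group.set_index("Meldedatum")["AnzahlFall"].sort_index()
    ax.plot(series, ".")
    ax.plot(series, "-", color="grey", alpha=0.5)
    ax.set_title(name)

fig.autofmt_xdate()  # rotate the date labels so they do not overlap
```

plt.subplots returns the figure and an array of axes, so further districts only require a larger grid and no additional plotting code.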

## Create a Correlation / Scatterplot matrix 

"A scatter plot matrix is a grid (or matrix) of scatter plots used to visualize bivariate relationships between combinations of variables. Each scatter plot in the matrix visualizes the relationship between a pair of variables, allowing many relationships to be explored in one chart."
(https://pro.arcgis.com/en/pro-app/latest/help/analysis/geoprocessing/charts/scatter-plot-matrix.htm)

In [None]:
# for simplicity we just look at a smaller df with just one level of column indices
df_temp=df_bub[df_bub["Param"]=="Test1"]
df_temp=df_temp.loc[:,pd.IndexSlice[["cam0"], :]]

In [None]:
df_temp.columns.values

In [None]:
df_temp=df_temp[[('cam0', 'Waddel Disk Diameter'),('cam0', 'equi Ellipse Minor'),('cam0', 'Max Feret Diameter'),
               ('cam0', 'equi Ellipse Minor Axis (Feret)'),('cam0', 'equi Rect Short Side (Feret)')]]

In [None]:
pd.plotting.scatter_matrix(df_temp, figsize=(15, 15), marker='o',
                        hist_kwds={'bins': 20}, s=1, alpha=.25)
plt.show()
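
The scatter plot matrix shows the relationships visually. The matching numbers can be obtained with the corr method of a dataframe. The sketch below uses synthetic columns (demo_feat, minor_axis and noise are hypothetical), not the measured bubble features:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=1)
diameter = rng.uniform(1.0, 6.0, size=200)

# Hypothetical features: one tied closely to diameter, one unrelated
demo_feat = pd.DataFrame({
    "diameter": diameter,
    "minor_axis": 0.8 * diameter + rng.normal(0, 0.1, size=200),
    "noise": rng.normal(0, 1, size=200),
})

corr = demo_feat.corr()  # Pearson correlation by default
print(corr.round(2))
```

A coefficient close to 1 or -1 corresponds to a narrow diagonal band in the scatter plot matrix, a coefficient near 0 to a shapeless cloud. Note that the Pearson coefficient only captures linear relationships, so it is worth looking at both.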

### Links to visualisation examples and more:

* Finding the right diagram
https://www.visual-literacy.org/periodic_table/periodic_table.html
* Finding the right colormap for our visualisation
http://colorbrewer2.org/#type=sequential&scheme=BuGn&n=3
* More visualisation examples: 
https://d3js.org/
https://docs.bokeh.org/en/latest/docs/gallery.html
* Broad overview of various tools available in python 
https://github.com/EthicalML/awesome-production-machine-learning
* Need multiple y-axis?
https://matplotlib.org/3.1.1/gallery/ticks_and_spines/multiple_yaxis_with_spines.html