In [None]:
!mkdir -p data
![ -f data/style.css ] || curl -L https://raw.githubusercontent.com/giovastabile/elaborazione_statistica/main/notebooks/data/style.css -o data/style.css
![ -f data/participants.csv ] || curl -L https://raw.githubusercontent.com/giovastabile/elaborazione_statistica/main/notebooks/data/data/participants.csv -o data/participants.csv
from IPython.display import HTML
HTML('<style>{}</style>'.format(open('data/style.css').read()))

# Plotly Express

## Introduction

In this notebook, we explore Plotly Express, the high-level interface of the plotly library.  It works directly with pandas dataframe objects, so most plots need only a single function call.

## Library imports

In [None]:
import plotly.express as px
import plotly.io as pio

pio.templates.default = 'none'

In [None]:
import numpy as np
import pandas as pd

## Scatter plots

In [None]:
df = pd.read_csv('data/participants.csv')
df.head()

Our first plot will simply have *Feature1* as the independent variable on the $x$-axis and *Feature2* as the dependent variable on the $y$-axis.

The `scatter()` function creates a scatter plot.  Our first argument is the dataframe object from which we want to take our data.  We specify the column names for the `x=` and `y=` arguments.  Finally, we add the `.show()` method to show the plot.

In [None]:
px.scatter(df, x='Feature1', y='Feature2').show()

We can show the correlation between *Feature1* and *Feature2* separately for each sample space element of a categorical variable such as *Target2*.  We find the sample space elements using the `.unique()` method of the `df.Target2` series object.

In [None]:
df.Target2.unique()

By setting the `color=` argument to the specified column, the scatter plot displays the data colored according to the sample space elements of the chosen categorical variable.

In [None]:
px.scatter(df, x='Feature1', y='Feature2', color='Target2').show()

We add a title to our plot with the `title=` argument.

In [None]:
px.scatter(df, x='Feature1', y='Feature2', color='Target2', title='Correlation between Feature 1 and Feature 2')

To make more changes to the labels of our scatter plot, we give the plot a computer variable name and update the plot with various methods.  Below, we specify the axes labels.  For the $x$-axis, we specify the title and its attributes separately.  For the $y$-axis, we do it all in the same method.

In [None]:
fig = px.scatter(df, x='Feature1', y='Feature2', color='Target2', title='Correlation between Feature 1 and Feature 2')
fig.update_xaxes(title_text='Values for feature 1')
fig.update_xaxes(title_font=dict(size=12, family='Courier', color='grey'))
fig.update_yaxes(title_text='Values for feature 2', title_font=dict(size=12, family='Courier', color='grey'))
fig.show()

For a simple plot, we can use the `labels=` argument to change the axes labels.  By default, the column names in the dataframe object are used as the axis names.  This argument takes a dictionary as its value, where each key is a column name and each value is the new name.

In [None]:
px.scatter(df, x='Feature1', y='Feature2', color='Target2', labels={'Feature1':'Values for feature 1', 'Feature2':'Values for feature 2'}).show()

The plotly express library allows us to add more visual information along the axes.  Scatter plots make it difficult to visualize the distribution in the data.  Below, we add a histogram along the $x$-axis (displayed at the top) and a box-and-whisker plot along the $y$-axis (displayed on the right).

In [None]:
fig = px.scatter(df, x='Feature1', y='Feature2', color='Target2', title='Correlation between Feature 1 and Feature 2', marginal_x='histogram', marginal_y='box')
fig.show()

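To show a linear model in the plot, we add the `trendline=` argument.  Setting it to `'ols'` produces a linear model using ordinary least squares, with a separate line fitted for each color group.  This option requires the statsmodels package to be installed.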

In [None]:
px.scatter(df, x='Feature1', y='Feature2', color='Target2', labels={'Feature1':'Values for feature 1', 'Feature2':'Values for feature 2'}, trendline='ols').show()
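
As a rough illustration of what an ordinary least squares fit computes, the sketch below fits a straight line to invented data with NumPy's `polyfit` function.  This is a stand-in for illustration only: plotly delegates its trendline to statsmodels, and the numbers here are made up.

```python
import numpy as np

# Invented data: y is roughly 2*x + 1 with small deviations.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.0])

# A degree-1 polynomial fit minimises the sum of squared residuals.
slope, intercept = np.polyfit(x, y, deg=1)

# Residuals are the vertical distances between the data and the line.
residuals = y - (slope * x + intercept)
print(f'slope={slope:.3f}, intercept={intercept:.3f}')
```

For a least squares line with an intercept, the residuals sum to zero (up to rounding), which is a handy sanity check.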

A scatter plot can become a bit *busy*.  Instead of showing all the data on a single plot, we use the `facet_col=` argument to create plots in columns.

In [None]:
px.scatter(df, x='Feature1', y='Feature2', facet_col='Target2', labels={'Feature1':'Values for feature 1', 'Feature2':'Values for feature 2'}, trendline='ols').show()

A scatter plot can encode a third continuous numerical variable in the color of each marker.  This is done by setting a numerical variable as the value of the `color=` argument.  We specify the color range we would like to use with the `color_continuous_scale=` argument to override the default; below, we choose the built-in Viridis scale as an example.

In [None]:
px.scatter(df, x='Feature1', y='Feature2', color='Feature3', color_continuous_scale=px.colors.sequential.Viridis, labels={'Feature1':'Values for feature 1', 'Feature2':'Values for feature 2'}, trendline='ols').show()

Another way to visually add a third numerical variable is by changing the size of each marker.  This is done with the `size=` argument.

In [None]:
px.scatter(df, x='Feature1', y='Feature2', size='Feature3', color='Target2', labels={'Feature1':'Values for feature 1', 'Feature2':'Values for feature 2'}, trendline='ols').show()

## Heat maps

Heat maps are also used for visualizing the joint distribution of two continuous numerical variables.

In [None]:
px.density_heatmap(
    df,
    x='Feature1',
    y='Feature2'
)

The color information displays how many of our subjects fall within the bins created by the two variables.  Since there are more cases towards the middle on both axes, with fewer cases further out, we suspect a normal distribution for both variables.  We visualize this by adding histograms to the margins.

In [None]:
px.density_heatmap(
    df,
    x='Feature1',
    y='Feature2',
    marginal_x='histogram',
    marginal_y='histogram'
)
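
The counts behind a density heat map can be sketched with NumPy's `histogram2d` function.  The data below are randomly generated for illustration (they are not the participants data), and the choice of five bins per axis is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Invented data: two independent, roughly bell-shaped variables.
feature1 = rng.normal(loc=50, scale=10, size=500)
feature2 = rng.normal(loc=100, scale=20, size=500)

# counts[i, j] is the number of points in x-bin i and y-bin j.
counts, x_edges, y_edges = np.histogram2d(feature1, feature2, bins=5)
print(counts.astype(int))
```

With bell-shaped data, the largest counts sit near the centre of the grid, which matches the pattern described above.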

## Contour plots

Contour plots provide similar information, but instead of bins, we have continuous curves.

In [None]:
px.density_contour(
    df,
    x='Feature1',
    y='Feature2',
    facet_col='Target2'
)

## Box-and-whisker plots

Box plots remain among the most widely used data visualizations for numerical variables, especially when comparing the values of a variable across the sample space elements of a categorical variable.  The `box()` function in plotly express creates box-and-whisker plots.

Below, we show the distribution of the *Feature1* variable for the three classes in the *Group* variable.

In [None]:
df.Group.unique()

In [None]:
px.box(
    df,
    y='Feature1',
    x='Group'
)
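
The quantities drawn in a box-and-whisker plot can be sketched with NumPy.  The values below are invented, and the quartiles use NumPy's default interpolation, which may differ slightly from the rule plotly applies.  The 1.5 × IQR fences follow the usual convention for whiskers and suspected outliers.

```python
import numpy as np

# Invented sample with one unusually large value.
values = np.array([2.0, 4.0, 4.5, 5.0, 5.5, 6.0, 7.0, 8.0, 15.0])

# The box spans the first to third quartile, with a line at the median.
q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1

# Points beyond these fences are drawn as individual outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = values[(values < lower_fence) | (values > upper_fence)]
print(q1, median, q3, outliers)
```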

To get an idea of the actual data values, we set `'all'` as the value for the `points=` argument.  We also set titles and axes labels as before.

In [None]:
px.box(
    df,
    y='Feature1',
    x='Group',
    points='all',
    title='Comparison of feature 1 per group',
    labels={
        'Feature1':'feature 1',
        'Group':'groups'
    }
)

We use the `color=` argument set to another categorical variable for further subdivision of the visualization.

In [None]:
px.box(
    df,
    y='Feature1',
    x='Group',
    color='Target2'
)

## Histograms

Histograms give a visual representation of the distribution of data point values for a continuous numerical variable.

Below, we create a stacked histogram showing the counts of subjects falling in the bin intervals for the *Feature1* variable divided by the sample space elements of the *Target2* variable.

In [None]:
px.histogram(
    df,
    x='Feature1',
    color='Target2',
    title='Distribution of feature 1 for targets I and II'
)

Setting the `barmode=` argument to `'overlay'` draws the distributions on top of each other, with transparency so that both remain visible.

In [None]:
px.histogram(
    df,
    x='Feature1',
    color='Target2',
    barmode='overlay',
    title='Distribution of feature 1 for targets I and II'
)

By default, the counts are shown on the $y$-axis.  Below, we change this to show a relative frequency expressed as a percentage using the `histnorm=` argument set to `'percent'`.

In [None]:
px.histogram(
    df,
    x='Feature1',
    color='Target2',
    barmode='overlay',
    histnorm='percent',
    title='Distribution of feature 1 for targets I and II'
)
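
The conversion from counts to percentages can be sketched with NumPy's `histogram` function.  The values and the choice of four bins are invented for illustration, and plotly chooses its own bin edges.

```python
import numpy as np

# Invented sample of ten values.
values = np.array([1, 2, 2, 3, 3, 3, 4, 4, 5, 9])

# Count the values falling in each of four equal-width bins.
counts, edges = np.histogram(values, bins=4)

# Relative frequency of each bin, expressed as a percentage.
percent = counts / counts.sum() * 100
print(counts, percent)
```

The percentages of all bins add up to 100.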

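The number of bins is set using the `nbins=` argument, which corresponds to the `nbinsx` attribute of a histogram trace in plotly graph objects.  Plotly treats this value as an upper limit when choosing the bin size, so the plot may show fewer bins than requested.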

In [None]:
px.histogram(
    df,
    x='Feature1',
    nbins=10
)

The `cumulative=` argument creates a cumulative histogram.  Below, we express this as a percent.

In [None]:
px.histogram(
    df,
    x='Feature1',
    cumulative=True,
    histnorm='percent'
)
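
A cumulative histogram expressed as a percent is a running total of the bin counts divided by the sample size.  The sketch below shows the arithmetic with NumPy on invented values, and the four bins are an arbitrary choice.

```python
import numpy as np

# Invented sample of ten values.
values = np.array([1, 2, 2, 3, 3, 3, 4, 4, 5, 9])

counts, edges = np.histogram(values, bins=4)

# Running total of the counts, as a percentage of all values.
cumulative_percent = np.cumsum(counts) / counts.sum() * 100
print(cumulative_percent)
```

The last bin always reaches 100 percent, since by then every value has been counted.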