# CLINICAL RESEARCH

# Data visualization

>by Dr Juan H Klopper

<div><img src="https://drive.google.com/uc?id=1sQfuEM09OlKYGpNQ-nsXnpW8OogryOJv" width=200/></div><div><img src="https://drive.google.com/uc?id=16jSLpztk9lAMJ3X28K0rkPrdz2U0DvJ-" width=200/></div><div><img src="https://drive.google.com/uc?id=1b-pCh1IVchhcpyurIF4eQi-kAMCYNyDb" width=200/></div>

## Introduction

Visualizing data is not only a pleasing activity; it can also provide a richer understanding of the data than summary statistics alone.

In this notebook, we take a look at plotly, one of the many plotting libraries in Python.  It is an interactive and powerful plotting library.

## Library imports

In [1]:
import pandas as pd

In [4]:
import plotly.graph_objects as go
import plotly.express as px
import plotly.io as pio
pio.templates.default = 'plotly_white'

In [None]:
%load_ext google.colab.data_table

## Data import

In [None]:
# List files in the DATA directory of your Google Drive
from google.colab import drive
drive.mount('/gdrive')
%cd /gdrive
%cd My\ Drive/Coursera/Understanding\ clinical\ research/DATA
%ls

In [2]:
df = pd.read_csv('../DATA/data.csv')

In [3]:
df

Unnamed: 0,Name,DOB,Age,Vocation,Smoke,HR,sBP,CholesterolBefore,TAG,Survey,CholesterolAfter,Delta,Group
0,Dylan Patton,1981-10-07,43,Energy manager,0,47,145,1.2,1.2,1,0.7,0.5,Active
1,Sandra Howard,1993-01-27,53,Tax adviser,0,51,115,1.2,0.6,3,1.0,0.2,Active
2,Samantha Williams,1973-12-21,33,IT consultant,0,54,120,2.0,1.3,3,1.7,0.3,Active
3,Ashley Hensley,1981-12-01,43,"Nurse, children's",0,54,103,2.1,1.6,4,2.1,0.0,Active
4,Robert Wilson,1964-06-23,46,Clinical embryologist,0,61,138,2.8,2.1,5,2.8,0.0,Active
...,...,...,...,...,...,...,...,...,...,...,...,...,...
195,John Curtis,1936-11-25,66,"Sales professional, IT",1,96,201,10.1,5.1,5,10.0,0.1,Control
196,Jessica Tanner,1986-07-01,54,Paramedic,1,93,183,10.1,5.3,5,10.0,0.1,Control
197,Charles Smith,1959-01-30,61,Chartered certified accountant,0,99,212,10.1,5.6,4,9.7,0.4,Control
198,Barry Porter,1979-05-30,65,Dancer,1,98,200,10.1,5.3,3,10.0,0.1,Control


## Bar plot

Bar plots are great for indicating frequency and relative frequency.  In other words, they show how many times each sample space element of a categorical or discrete variable occurs.

One axis, usually the horizontal axis, is reserved for the sample space elements of the categorical variable.  The other axis shows the frequency or relative frequency, i.e. the height of a bar.  There are spaces between the bars to indicate that the sample space elements are not continuous.

- Create a bar plot indicating the number of patients who are non-smokers vs smokers vs ex-smokers.

In [5]:
# Calculate the frequency count
df.Smoke.value_counts()

Smoke
0    88
1    85
2    27
Name: count, dtype: int64

In [6]:
# Simple bar plot
smokers_fig = go.Figure()

smokers_fig.add_trace(go.Bar(
    x=['Non-smokers', 'Smokers', 'Ex-smokers'],
    y=[88, 85, 27]
))

smokers_fig.show()

We entered the `x` and `y` values by hand.  While this is easy enough when there are only a few sample space elements, that is not always the case.  We can get the sample space elements using the `unique()` method.  They are returned in the order in which the method encounters them in the relevant pandas series object.

In [7]:
df.Smoke.unique()

array([0, 2, 1], dtype=int64)

This might not be the order in which we want the bars to appear.  In the case of the *Smoke* variable, we have the additional problem that the sample space elements are encoded as integers, which is not what we want in a plot.

With the default argument values of the `.value_counts()` method, we get the frequencies in descending order.

In [8]:
df.Smoke.value_counts()

Smoke
0    88
1    85
2    27
Name: count, dtype: int64

This might also not be useful, as in the case of a statistical variable such as *Month*, where calendar order matters more than frequency.  The bottom line is that we have to be careful when designing the plot with our code.

The `Bar()` function takes the frequencies as a list (or array) of values.  We extract a list below using the `.values` attribute and the `.tolist()` method.
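One way to control both the labels and the order of the bars is to map the integer codes to names and then reindex the counts explicitly.  The sketch below is only an illustration: the `smoke` series is a small made-up stand-in for `df.Smoke`, and the mapping of codes to labels (0 = non-smoker, 1 = smoker, 2 = ex-smoker) is assumed from the way the variable is used in this notebook.

```python
import pandas as pd

# Made-up stand-in for df.Smoke (assumed coding: 0 = non-smoker, 1 = smoker, 2 = ex-smoker)
smoke = pd.Series([0, 2, 1, 0, 1, 0, 2, 0, 1, 1, 1])

# Assumed mapping from integer codes to readable labels
labels = {0: 'Non-smokers', 1: 'Smokers', 2: 'Ex-smokers'}

# Count the labelled values, then force the order we want on the axis
counts = (
    smoke.map(labels)
    .value_counts()
    .reindex(['Non-smokers', 'Smokers', 'Ex-smokers'])
)

x_vals = counts.index.tolist()  # bar labels, in the chosen order
y_vals = counts.tolist()        # matching frequencies
```

In this made-up sample the smokers are the most frequent group, so the default descending order would have put them first.  Because `x_vals` and `y_vals` come from the same object, the labels and the counts cannot drift out of step when they are passed to `go.Bar()`.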

In [9]:
df.Smoke.value_counts().values.tolist()

[88, 85, 27]

In [10]:
# Simple bar plot
smokers_fig = go.Figure()

smokers_fig.add_trace(go.Bar(
    x=['Non-smokers', 'Smokers', 'Ex-smokers'],
    y=df.Smoke.value_counts().values.tolist(),
    marker={'color':['green', 'red', 'orange'],
            'line':{'color':'black', 'width':1}}
))

smokers_fig.show()

In [11]:
# Add a title
smokers_fig.update_layout(title='Number of non-smokers vs smokers vs ex-smokers')
smokers_fig.show()

In [12]:
# Add axes labels
smokers_fig.update_layout(xaxis=dict(title='Groups of smokers'),
                          yaxis=dict(title='Counts'))
smokers_fig.show()

In the bar plot below, I add many more properties as an illustration of what is possible.

In [13]:
df.Smoke.value_counts(normalize=True) * 100  # Calculate percentage relative frequency

Smoke
0    44.0
1    42.5
2    13.5
Name: proportion, dtype: float64

In [14]:
smokers_fig = go.Figure()

smokers_fig.add_trace(go.Bar(
    x=['Non-smokers', 'Smokers', 'Ex-smokers'],
    y=df.Smoke.value_counts().values.tolist(),
    text=df.Smoke.value_counts().values.tolist(),
    textposition='outside',
    hovertext=['44% are non-smokers', '42.5% are smokers', '13.5% are ex-smokers'],
    marker={'color':['green', 'rgba(255, 0, 0, 1)', 'orange'],
            'line':{'color':'black', 'width':1},
            'opacity':0.7}
))

smokers_fig.update_layout(title='Number of non-smokers vs smokers vs ex-smokers')

smokers_fig.update_layout(xaxis = dict(title='Groups of smokers'),
                          xaxis_tickangle=-25,
                          yaxis=dict(title='Counts'))

smokers_fig.show()

We can also create horizontal bar plots.

In [15]:
smokers_fig = go.Figure()

smokers_fig.add_trace(go.Bar(
    y=['Non-smokers', 'Smokers', 'Ex-smokers'],
    x=df.Smoke.value_counts().values.tolist(),
    text=df.Smoke.value_counts().values.tolist(),
    textposition='inside',
    hovertext=['44% are non-smokers', '42.5% are smokers', '13.5% are ex-smokers'],
    marker={'color':['green', 'rgba(255, 0, 0, 1)', 'orange'],
            'opacity':0.7},
    orientation='h'
))

smokers_fig.update_layout(title='Number of non-smokers vs smokers vs ex-smokers')

smokers_fig.update_layout(yaxis = dict(title='Groups of smokers'),
                          xaxis=dict(title='Counts'))

smokers_fig.show()

We can group data by another categorical variable.

In [16]:
pd.crosstab(df.Group, df.Survey)

Group,Survey 1,Survey 2,Survey 3,Survey 4,Survey 5
Active,21,18,17,23,21
Control,17,32,13,14,24


In [17]:
surv_grp_fig = go.Figure()

surv_grp_fig.add_trace(go.Bar(
    x=['1', '2', '3', '4', '5'],
    y=[21, 18, 17, 23, 21],
    text=[21, 18, 17, 23, 21],
    textposition='outside',
    name='Active group',
    marker={'color':'orange', 'opacity':0.7}
))

surv_grp_fig.add_trace(go.Bar(
    x=['1', '2', '3', '4', '5'],
    y=[17, 32, 13, 14, 24],
    text=[17, 32, 13, 14, 24],
    textposition='outside',
    name='Control group',
    marker={'color':'deepskyblue', 'opacity':0.7}
))

surv_grp_fig.update_layout(title='Survey frequencies by treatment group')

surv_grp_fig.update_layout(xaxis = dict(title='Survey answer'),
                           yaxis=dict(title='Counts'),
                           barmode='group')

surv_grp_fig.show()

#### Exercise

Create the bar plot above, but group the horizontal axis by the treatment groups.

#### Solution

The idea is to create arrays and lists so that we can loop over their elements.

In [18]:
pd.crosstab(df.Group, df.Survey)

Group,Survey 1,Survey 2,Survey 3,Survey 4,Survey 5
Active,21,18,17,23,21
Control,17,32,13,14,24


The `crosstab()` function creates a pandas DataFrame object, from which we can extract a numpy array object.

In [19]:
# Extract an array object
vals = pd.crosstab(df.Survey, df.Group).values
vals

array([[21, 17],
       [18, 32],
       [17, 13],
       [23, 14],
       [21, 24]], dtype=int64)
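Note that the array has one row per survey answer, whereas the table displayed earlier has one row per treatment group.  This is because the two arguments of `crosstab()` were swapped.  The sketch below, using a made-up `demo` data frame, illustrates that the two results are transposes of each other.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the Group and Survey columns of df
demo = pd.DataFrame({
    'Group': ['Active', 'Active', 'Active', 'Control',
              'Control', 'Control', 'Active', 'Control'],
    'Survey': [1, 2, 2, 1, 3, 3, 3, 1],
})

by_group = pd.crosstab(demo.Group, demo.Survey)   # one row per treatment group
by_answer = pd.crosstab(demo.Survey, demo.Group)  # one row per survey answer

vals_demo = by_answer.values

# Swapping the arguments gives the same counts as transposing the table
same = np.array_equal(vals_demo, by_group.T.values)
```

Each row of `vals_demo` therefore holds the counts of one survey answer across the groups, which is the shape needed to add one trace per answer.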

In [20]:
# A list object to hold the sample space elements of the treatment groups
grps = ['Active', 'Control']
grps

['Active', 'Control']

In [21]:
# A list object to hold the sample space elements of the survey answers
nms = ['Answer 1', 'Answer 2', 'Answer 3', 'Answer 4', 'Answer 5']
nms

['Answer 1', 'Answer 2', 'Answer 3', 'Answer 4', 'Answer 5']

In [22]:
surv_grp_fig = go.Figure()

# Loop over the rows of the vals array object and the elements of the nms list object
for i in range(5):
  surv_grp_fig.add_trace(go.Bar(
      x=grps,
      y=vals[i],
      text=vals[i],
      textposition='outside',
      name=nms[i],
))

surv_grp_fig.update_layout(title='Survey frequencies by treatment group')

# Using alternate dictionary syntax
surv_grp_fig.update_layout({'xaxis':{'title':'Treatment group'},
                            'yaxis':{'title':'Counts'},
                            'barmode':'group'})

surv_grp_fig.show()

Below, we turn this into a stacked bar chart.

In [23]:
surv_grp_fig = go.Figure()

for i in range(5):
  surv_grp_fig.add_trace(go.Bar(
      x=grps,
      y=vals[i],
      text=vals[i],
      textposition='inside',
      name=nms[i],
))

surv_grp_fig.update_layout(title='Survey frequencies by treatment group')

surv_grp_fig.update_layout(xaxis = dict(title='Treatment group'),
                           yaxis=dict(title='Counts'),
                           barmode='stack')

surv_grp_fig.show()

For more information about bar plots click [HERE](https://plotly.com/python/bar-charts/).

## Histogram

A histogram is used for continuous numerical variables.  By dividing the range of values into intervals, called bins, we can count how many values fall in each interval.

Histograms are great at showing us the distribution of the data.  Below, we plot a histogram of the age of our participants.  This time, we make use of the plotly express library.
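To make the idea of binning concrete, the sketch below counts made-up ages in ten-year intervals with numpy.  The values and bin edges are assumptions chosen for the illustration, and plotly selects its own bins unless told otherwise.

```python
import numpy as np

ages = [21, 25, 34, 38, 39, 41, 47, 52, 58, 60]  # made-up ages
edges = [20, 30, 40, 50, 60]                     # assumed bin edges

# Each bin includes its left edge and excludes its right edge,
# except the last bin, which includes both edges
counts, _ = np.histogram(ages, bins=edges)
```

Here `counts` holds one frequency per interval, which is the height of the corresponding bar in a histogram.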

In [24]:
age_hist = px.histogram(df,
                        x='Age')
age_hist.show()

We can create a **stacked** histogram by using the `color=` argument to point to a categorical variable.

In [25]:
age_hist = px.histogram(
    df,
    x='Age',
    color='Group',
    title='Histogram of participant ages',
    opacity=0.7,
    marginal='rug',
)

age_hist.show()

The graph objects library of plotly provides even more control.

In [26]:
age_hist = go.Figure()

age_hist.add_trace(go.Histogram(
    x=df.Age
))

age_hist.show()

Let's look at the age distribution for the non-smokers and the smokers (excluding the ex-smokers for brevity).  Using the `barmode='overlay'` option, we create an overlaid (**non-stacked**) histogram.

In [27]:
ages_smoke = go.Figure()

ages_smoke.add_trace(go.Histogram(
    x=df[df.Smoke == 0]['Age'],
    name='non-smokers',
    marker_color='orange',
    xbins=dict(start=10,
               end=90,
               size=5)
))
ages_smoke.add_trace(go.Histogram(
    x=df[df.Smoke == 1]['Age'],
    name='smokers',
    marker_color='deepskyblue',
    xbins=dict(start=10,
               end=90,
               size=5)
))

ages_smoke.update_layout(barmode='overlay',
                         title='Age distribution of non-smokers and smokers',
                         xaxis=dict(title='Age'),
                         yaxis=dict(title='Count'))

ages_smoke.update_traces(opacity=0.75)

ages_smoke.show()

For more information on histograms from plotly click [HERE](https://plotly.com/python/histograms/).

## Box plot

Box-and-whisker plots are another visual representation of the distribution of a continuous numerical variable.
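The box spans the first to the third quartile, with a line at the median.  As a rough sketch of the numbers behind such a plot, the code below calculates the quartiles for a handful of made-up ages, together with the usual limits of 1.5 times the interquartile range beyond which values are drawn as suspected outliers.  Note that pandas and plotly may use slightly different methods to calculate quartiles, so the values need not match a plot exactly.

```python
import pandas as pd

ages = pd.Series([30, 35, 40, 45, 50])  # made-up ages

# Quartiles that define the box and the line inside it
q1, median, q3 = ages.quantile([0.25, 0.5, 0.75])

# Interquartile range and the limits for suspected outliers
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
```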

- Create box plots of the ages of non-smokers, smokers, and ex-smokers.

In [None]:
# Simple box plots using express
ages_smoke_box_px = px.box(
    df,
    x='Smoke',
    y='Age',
    title='Distribution of age by smoking status')
ages_smoke_box_px.show()

With the `Box()` function, we can take more control over each trace.

In [None]:
# Extracting list objects
non_smoker_age = df[df.Smoke == 0]['Age'].to_list()
smoker_age = df[df.Smoke == 1]['Age'].to_list()
ex_smoker_age = df[df.Smoke == 2]['Age'].to_list()

In [None]:
# Adding separate traces and configuration
ages_smoke_box = go.Figure()

ages_smoke_box.add_trace(go.Box(
    y=non_smoker_age,
    name='non-smokers',
    marker_color='green',
    boxmean=True,
    boxpoints='all'
))

ages_smoke_box.add_trace(go.Box(
    y=smoker_age,
    name='smokers',
    marker_color='red',
    boxmean='sd',
    boxpoints='all'
))

ages_smoke_box.add_trace(go.Box(
    y=ex_smoker_age,
    name='ex-smokers',
    marker_color='orange',
    boxmean='sd',
    boxpoints='all'
))

ages_smoke_box.update_layout(title='Distribution of ages',
                             xaxis={'title':'Group'},
                             yaxis={'title':'Age'})  # The vertical axis shows age, not a count

ages_smoke_box.show()

You can learn more about box-and-whisker plots [HERE](https://plotly.com/python/box-plots/).

## Scatter plots

Scatter plots allow us to view the difference between participants with respect to continuous numerical variables.  If we restrict ourselves to two variables, each dot on a plane with an $x$ and a $y$ axis represents the values of a pair of continuous numerical variables for one participant.

- Create a scatter plot of age ($x$ axis) vs systolic blood pressure ($y$ axis).

In [None]:
age_sbp = go.Figure()

age_sbp.add_trace(go.Scatter(
    x=df.Age,
    y=df.sBP,
    mode='markers'
))

age_sbp.update_layout(title="Age vs systolic blood pressure",
                      xaxis=dict(title="Age"),
                      yaxis=dict(title="Systolic blood pressure"))

age_sbp.show()

We can add a third variable in the form of the size of the marker.  Below, we do just that and compare the age and systolic blood pressure of the participants.  To this, we add the heart rate as the size of the marker.  We also split the patients by the sample space elements of the treatment group.

We go even further and add box-and-whisker plots as marginal plots, together with a linear model (using ordinary least squares).

- Compare age vs systolic blood pressure for the treatment groups, with HR indicated by the size of the marker (bubble chart)

In [None]:
# Bubble chart with marginal box-and-whisker plots and OLS trendlines
age_sbp_group_px = px.scatter(
    df,
    x='Age',
    y='sBP',
    size='HR',  # Determines size of markers
    color='Group',  # Group by this variable
    marginal_y='box',
    marginal_x='box',
    trendline='ols',
    title='Comparing age vs systolic BP for each of the two groups',
    labels={'sBP':'Systolic BP'})  # Over-write column names
age_sbp_group_px.show()

Instead of box-and-whisker plots, we can also add rug plots and histograms.

In [None]:
# Rug plot and histogram as marginal plots
age_sbp_group_px = px.scatter(
    df,
    x='Age',
    y='sBP',
    size='HR',  # Determines size of markers
    color='Group',  # Group by this variable
    marginal_y='histogram',
    marginal_x='rug',
    trendline='ols',
    title='Comparing age vs systolic BP for each of the two groups',
    labels={'sBP':'Systolic BP'})  # Over-write column names
age_sbp_group_px.show()

Instead of the size of the marker as visual indicator of the third variable, we can also use color.  Below, we also add the `facet_col=` argument.  This creates individual plots based on the sample space elements of a categorical variable and positions them as columns.

- Separate scatter plots
- _Third dimension_ by color scale

In [None]:
age_sbp_group_px_facet = px.scatter(
    df,
    x='Age',
    y='sBP',
    color='HR',
    facet_col='Group',
    trendline='ols',
    title='Separate scatter plots per group',
    labels={'sBP':'Systolic BP'},
    color_continuous_scale=px.colors.sequential.Viridis)
age_sbp_group_px_facet.show()

(Hover on the trendline to see the regression model and the coefficient of determination, $R^2$.  We will learn more about this when we look at linear regression.)
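As a preview of what the hover text reports, the sketch below fits a straight line by ordinary least squares and calculates $R^2$ by hand.  Plotly express relies on the statsmodels package for the `ols` trendline; numpy is used here only to keep the example small, and the ages and blood pressures are made up.

```python
import numpy as np

# Made-up ages and systolic blood pressures
age = np.array([30, 40, 50, 60])
sbp = np.array([110, 125, 130, 150])

# Degree-1 least-squares fit: sbp is approximately slope * age + intercept
slope, intercept = np.polyfit(age, sbp, 1)

# Coefficient of determination: 1 - (residual variation / total variation)
predicted = slope * age + intercept
ss_res = np.sum((sbp - predicted) ** 2)
ss_tot = np.sum((sbp - sbp.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot
```

For these values the fitted line has a slope of 1.25 and an intercept of 72.5, with an $R^2$ of about 0.95.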

You can learn more about scatter plots [HERE](https://plotly.com/python/line-and-scatter/).

## Conclusion

There is so much more to learn about plotly.  Visit the homepage [HERE](https://plotly.com/python/).  For references on all the arguments click [HERE](https://plotly.com/python-api-reference/).