[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/databyjp/AcademyXI_DA/blob/main/notebooks/AcademyXi_DA_Module_4_dataviz_tutorial.ipynb)

# Data visualisation tutorial
This is an introductory notebook showing how to use the [Plotly.py](https://github.com/plotly/plotly.py) library to build charts. 

Run this notebook through Google Colab.

If you are interested in exploring Plotly.py further, check out the Plotly.py [documentation](https://plotly.com/python/).

In [None]:
%%capture tmp
# Install the additional libraries (fsspec and s3fs) required to load files from AWS S3
!pip install fsspec s3fs

# Import libraries to be used
import plotly.express as px
import pandas as pd
import numpy as np

Now we are ready to load our data from a CSV file into a `pandas` DataFrame.

In [None]:
# Load data from S3
df = pd.read_csv("s3://databyjp/academyxi/healthcare-expenditure-vs-gdp-2014_clean.csv")

Check that the data has been loaded properly. You should see the first five rows of data here.

In [None]:
df.head()


---

We are ready to plot the data. 

Let's plot a bar graph below of healthcare expenditure per capita (`health_per_cap`) for the first 20 countries in the DataFrame. 

Here, the **first line** creates a figure object with name `fig`, and the **second line** displays the figure with the method `show`.

In [None]:
fig = px.bar(df[:20], x="Entity", y="health_per_cap")
fig.show()

Now plot the `gdp_per_cap` values for the last 20 countries in the DataFrame.

In [None]:
fig = px.bar(df[-20:], x="Entity", y="gdp_per_cap")
fig.show()

And now the 20 countries with the highest `gdp_per_cap` values. Sorting in ascending order and taking the last 20 rows selects them.

In [None]:
fig = px.bar(df.sort_values("gdp_per_cap")[-20:], x="Entity", y="gdp_per_cap")
fig.show()
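
As a side note, pandas also offers `nlargest` for this kind of "top N" selection. The sketch below compares the two approaches on a small made-up DataFrame (the `toy` data and its values are hypothetical; only the column names mirror the tutorial's DataFrame). The main difference is the row order of the result.

```python
import pandas as pd

# Hypothetical toy data; column names mirror the tutorial's DataFrame.
toy = pd.DataFrame({
    "Entity": ["A", "B", "C", "D", "E"],
    "gdp_per_cap": [1200.0, 45000.0, 800.0, 23000.0, 9500.0],
})

# Approach used above: sort ascending, then slice off the last rows.
top3_sorted = toy.sort_values("gdp_per_cap")[-3:]

# Alternative: nlargest selects the same rows, ordered from highest to lowest.
top3_largest = toy.nlargest(3, "gdp_per_cap")

print(top3_sorted["Entity"].tolist())   # lowest of the top three first
print(top3_largest["Entity"].tolist())  # highest first
```

Either result can be passed to `px.bar`; the bars will simply appear in a different order.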

We can use colours to encode an additional dimension of information. 

Let's add the healthcare costs per capita data to the bar graph.

In [None]:
fig = px.bar(df.sort_values("gdp_per_cap")[-20:], x="Entity", y="gdp_per_cap",
             color="health_per_cap", color_continuous_scale=px.colors.sequential.Agsunset)
fig.show()

You can see from above that **bar charts** are great for comparing quantities against each other.

It's relatively easy to tell which values are higher even for very similar values.


---



Now, let's build a different type of chart: a **histogram**. A histogram is a statistical chart for visualising the distribution of values. 

In [None]:
fig = px.histogram(df, x="health_per_cap")
fig.show()

You can see above that the dataset has been separated into a series of bars, which are called "bins". The height of each bin indicates how many data points fall in its range.
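
To make the idea of bins concrete, here is a minimal sketch that counts values per bin with NumPy. The `values` array is made up for illustration, and the number of bins is chosen by hand; Plotly picks its own bin edges automatically, so this is not a reproduction of what `px.histogram` does internally.

```python
import numpy as np

# Hypothetical values, not taken from the tutorial's dataset.
values = np.array([1, 2, 2, 3, 5, 6, 7, 9, 10, 10])

# Split the range of the data into 3 equal-width bins and count the points in each.
counts, edges = np.histogram(values, bins=3)

for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f"{left:.0f} to {right:.0f}: {count} values")
```

Each count corresponds to the height of one bar in a histogram drawn with the same bin edges.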

Next up are box plots. A box plot gives an overview of a distribution by displaying the dataset's quartile ranges, as well as any outliers. Take a look:

In [None]:
fig = px.box(df, x="gdp_per_cap", orientation="h", hover_data=["Entity"])
fig.show()
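
If you are curious about the numbers behind a box plot, the sketch below computes quartiles for a small made-up series and flags outliers using the common 1.5 × IQR rule of thumb. The `gdp` values are hypothetical, and pandas' default quartile method may differ slightly from the one Plotly uses, so treat the results as illustrative.

```python
import pandas as pd

# Hypothetical values, not taken from the tutorial's dataset.
gdp = pd.Series([500, 1000, 2000, 4000, 8000, 16000, 100000], name="gdp_per_cap")

# The box spans the first quartile (q1) to the third quartile (q3),
# with a line at the median.
q1, median, q3 = gdp.quantile([0.25, 0.5, 0.75])
iqr = q3 - q1  # interquartile range

# Common convention: points beyond 1.5 * IQR from the box count as outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = gdp[(gdp < lower_fence) | (gdp > upper_fence)]

print(q1, median, q3)
print(outliers.tolist())
```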

Notice that the GDP per capita data looks very compressed at the low end. Let's see what it looks like if we adopt a log scale.

In [None]:
fig = px.box(df, x="gdp_per_cap", orientation="h", log_x=True, hover_data=["Entity"])
fig.show()

That looks better. Some data lend themselves more to logarithmic scales, where growth tends to happen in *orders of magnitude* (e.g. 10x, 100x, 1000x, etc.).
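
A small sketch of why this works, using made-up values that each grow by a factor of 10. On a linear scale the gaps between them grow enormously, while on a log scale (base 10 is assumed here) every step has the same size.

```python
import numpy as np

# Hypothetical values, each 10x larger than the previous one.
values = np.array([100, 1_000, 10_000, 100_000])

# Gaps between neighbouring values on a linear scale.
steps_linear = np.diff(values)

# Gaps between neighbouring values after taking log base 10.
steps_log = np.diff(np.log10(values))

print(steps_linear)  # gaps grow by 10x each time
print(steps_log)     # gaps are all (approximately) equal
```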


---



Next, we plot both of these variables, GDP per capita and healthcare expenditure per capita, together on a scatter plot.

In [None]:
fig = px.scatter(df, x="gdp_per_cap", y="health_per_cap", hover_data=["Entity"])
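fig.show()

As with the box plot, applying a log scale (this time to both axes) spreads out the points that are bunched together at the low end.
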
In [None]:
fig = px.scatter(df, x="gdp_per_cap", y="health_per_cap", hover_data=["Entity"],
                 log_x=True, log_y=True)
fig.show()

To create a bubble chart, specify a column name for the `size` parameter.

In [None]:
fig = px.scatter(df, x="gdp_per_cap", y="health_per_cap", hover_data=["Entity"],
                 log_x=True, log_y=True, size="total_pop")
fig.show()

**Scatter plots** are well suited to showing relationships between variables and to spotting outliers quickly.

The dataset includes data on which continent each country on the list belongs to. We can display this information using colours.

In [None]:
fig = px.scatter(df, x="gdp_per_cap", y="health_per_cap", hover_data=["Entity"],
                 log_x=True, log_y=True, size="total_pop", color="Continent")
fig.show()

Continent is "categorical" (or "qualitative") information, in contrast to the "quantitative" values that the colour scale encoded in the bar chart earlier.

Plotly also lets you customise the graph in any number of ways. Here is an example with a few simple styling options applied.

In [None]:
fig = px.scatter(df, x="gdp_per_cap", y="health_per_cap", hover_data=["Entity"],
                 log_x=True, log_y=True, size="total_pop", color="Continent",
                 title="GDP vs Healthcare Expenditure",
                 template="plotly_white",
                 labels={"health_per_cap": "Healthcare Expenditure per Capita",
                         "gdp_per_cap": "GDP per Capita"})
fig.update_traces(marker=dict(line=dict(width=1, color='DarkSlateGrey')))
fig.show()

### Bonus exercise
A scatter matrix is a grid of scatter plots, one for each pair of columns. 

In [None]:
fig = px.scatter_matrix(df)
fig.show()

Can you tell which two columns show the strongest relationship?

We will come back to this later in the course.
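
In the meantime, one way to put a number on "strength of relationship" is a correlation coefficient. The sketch below is only an illustration on made-up data (the `toy` DataFrame and its values are hypothetical; the column names mirror the tutorial's). It uses pandas' default Pearson correlation, which measures linear relationships only, so it can understate relationships that are curved.

```python
from itertools import combinations

import pandas as pd

# Hypothetical toy data; column names mirror the tutorial's DataFrame.
toy = pd.DataFrame({
    "Entity": ["A", "B", "C", "D", "E"],
    "gdp_per_cap": [1000, 5000, 10000, 20000, 40000],
    "health_per_cap": [50, 300, 700, 1500, 3500],
    "total_pop": [30, 5, 80, 12, 60],
})

# Keep only the numeric columns and compute pairwise correlations.
corr = toy.select_dtypes("number").corr()

# Find the pair of distinct columns with the largest absolute correlation.
strongest = max(combinations(corr.columns, 2),
                key=lambda pair: abs(corr.loc[pair[0], pair[1]]))

print(corr.round(2))
print(strongest)
```

You could try the same approach on `df` and compare the result with what you see in the scatter matrix.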

If you would like to learn more, check out the Plotly.py [documentation here](https://plotly.com/python/).