# Data Viz Basics

This notebook generates examples of data viz basics in Plotly. For examples in other libraries, see the corresponding notebooks.

In [1]:
import plotly.express as px
import pandas as pd



In [2]:
# Load Plotly dataset of Gapminder data
df = px.data.gapminder()
df.head()

| | country | continent | year | lifeExp | pop | gdpPercap | iso_alpha | iso_num |
|---|---|---|---|---|---|---|---|---|
| 0 | Afghanistan | Asia | 1952 | 28.801 | 8425333 | 779.445314 | AFG | 4 |
| 1 | Afghanistan | Asia | 1957 | 30.332 | 9240934 | 820.85303 | AFG | 4 |
| 2 | Afghanistan | Asia | 1962 | 31.997 | 10267083 | 853.10071 | AFG | 4 |
| 3 | Afghanistan | Asia | 1967 | 34.02 | 11537966 | 836.197138 | AFG | 4 |
| 4 | Afghanistan | Asia | 1972 | 36.088 | 13079460 | 739.981106 | AFG | 4 |


### Scatter Plots

Scatter plots are an excellent way to compare data that is discrete, i.e. non-continuous. If there's no obvious expectation that the data flows from one value to the next along the x-axis (as it does with time, for example), the relationship between two variables can be visualized by plotting their paired values against each other.

In the plot below, it's clear that there's a moderately strong positive relationship between life expectancy and GDP per capita, at least in G7 countries. The same data could be shown over time, in which case a temporal component would be included and a causal relationship could be investigated, but regardless of *when* the data was collected, the relationship generally holds.

In [3]:
# Scatter plot
g7 = ['Canada', 'United States', 'United Kingdom', 'Germany', 'France', 'Italy', 'Japan']
fig = px.scatter(df[df['country'].isin(g7)], x='gdpPercap', y='lifeExp', color='country',
                 width=500, height=400, title='G7 life expectancy vs GDP per capita',
                 labels={'lifeExp':'Life Expectancy (years)', 'gdpPercap':'GDP per capita (USD)', 'country':'Country'})
fig.show()

### Line Plots
Similar to scatter plots, line plots can connect sets of data, only with the implicit assumption that the data is *continuous*, or connected from one value to the next. This is obvious when the independent variable is some measure of **time**, but less obvious when it's another factor.

Though the plotted data is ostensibly discrete, in that we only have measurements for certain years, the implication of the line plot is that the trend is maintained *between* the data points as well, since life expectancy could in principle be measured at any given time. The data could just as easily be drawn as a scatter plot, but the line plot implies the relationship holds between data points, and it is (nearly) always more appropriate when the X-axis is time.

In [4]:
# Line plot
fig = px.line(df[df['country'].isin(g7)], x='year', y='lifeExp', color = 'country',
           width=500, height=400, title='G7 life expectancy over time',
           labels={'lifeExp':'Life Expectancy', 'year':'Year', 'country':'Country'})
fig.show()

### Bar Plots

Bar plots are an excellent way to compare either counts or single values across different groups. They can be further faceted to show hierarchies of groups or different combinations of factors. What they _aren't_ great for is showing changes over time, where a line plot might be more effective, or comparing discrete data that isn't clearly categorical.

Additionally, bar plots are one of the most basic plots you can make from data, but also one that is often misunderstood and misused. Because the general public is so familiar with bar plots, it's very easy to overlook bias that has been intentionally or unintentionally added to one. The first thing to check is the Y-axis, where changing the range can magnify or obscure real differences between groups.

In [5]:
px.bar(df[(df['country'].isin(g7)) & (df['year']==2002)], x='country', y='pop', color='country', 
       title='Population of G7 countries in 2002', labels={'pop':'Population', 'country':'Country'},
       width=500, height=400)

### Histograms

Histograms are an excellent way to explore data initially and get an idea of the underlying distributions. By looking at the shape of the histogram, you can get an appreciation for the range of values, the minimum and maximum, and the latent patterns that might underlie how the data was generated.

When doing EDA, histograms are almost a requirement, at least initially, and many Python libraries make it extremely easy to generate histograms from data. Depending on the type of data and how it's distributed, it may be necessary to change the scale (for example, from linear to logarithmic) to get a more readable visualization, especially with monetary or population data. For example, the first of the visualizations below has a linear scale for count on the Y-axis, while the second has a logarithmic one. Note that here we are switching the Y-axis (count) to a log scale; often you'll instead have to transform the variable you're investigating, rather than the count.

The interactivity of Plotly also excels here, as you can hover over the bins to see the actual values in each bar.

In [6]:
px.histogram(df[df['year']==1997], 'pop', title='Histogram of country populations in 1997 (linear Y scale)')

In [7]:
px.histogram(df[df['year']==1997], 'pop', title='Histogram of country populations in 1997 (log Y scale)', 
             log_y=True)

### Box Plots

Box plots are another great way to visualize the distribution within a given variable (or compare multiple variables). At a glance, you can also get an idea of the basic descriptive statistics, such as min/max, median/mean, outliers, etc. Exactly how the whiskers and outliers are calculated is usually configurable, as is whether the mean or median (or both) are prominently displayed. Note the use of the log scale on the X-axis below.

If you're more interested in the shape of the distribution around the median, a violin plot presents similar information but draws the estimated density of the data, so features that a box plot hides, such as multiple peaks, become visible.

In [8]:
px.box(df[df['year']==1972], 'gdpPercap', color='continent', title='Boxplot of country GDP per capita in 1972', orientation='h', log_x=True,
       labels={'gdpPercap':'GDP per capita', 'continent':'Continent'}, hover_data=['country'])

## Basic tenets of plotting

Here are a few 'rules of thumb' to consider when building plots:

### 1. Plots can typically only display 2-4 features at a time

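2-4 features (or dimensions) is the ideal range for plots. With more than 4 dimensions, visualizations get quite crowded; at the other end, a graph of a single feature (with no measure of it) isn't really a visualization. The example below encodes four features: GDP per capita (X-axis), life expectancy (Y-axis), continent (colour) and population (marker size).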

In [28]:
df = px.data.gapminder()
px.scatter(df, x='gdpPercap', y='lifeExp', color='continent', size='pop', size_max=60, log_x=True,
           title='Plot of life expectancy vs GDP per capita, alongside continent <br>membership and population size',
           labels={'gdpPercap':'GDP per capita (USD)', 'lifeExp':'Life Expectancy (years)', 'pop':'Population', 'continent':'Continent'},
           hover_name='country', width=600, height=600).update_layout(xaxis_tickprefix = '$')

### How to improve a plot

Here we'll go through several iterations of the same plot to show how different elements can be added or modified to improve readability.

For this, we'll return to the Gapminder data.

In [10]:
import plotly.express as px
df = px.data.gapminder()
df.sample(10)

| | country | continent | year | lifeExp | pop | gdpPercap | iso_alpha | iso_num |
|---|---|---|---|---|---|---|---|---|
| 682 | Hungary | Europe | 2002 | 72.59 | 10083313 | 14843.93556 | HUN | 348 |
| 198 | Burkina Faso | Africa | 1982 | 48.122 | 6634596 | 807.198586 | BFA | 854 |
| 594 | Greece | Europe | 1982 | 75.24 | 9786480 | 15268.42089 | GRC | 300 |
| 58 | Argentina | Americas | 2002 | 74.34 | 38331121 | 8797.640716 | ARG | 32 |
| 1430 | Sri Lanka | Asia | 1962 | 62.192 | 10421936 | 1074.47196 | LKA | 144 |
| 574 | Germany | Europe | 2002 | 78.67 | 82350671 | 30035.80198 | DEU | 276 |
| 772 | Italy | Europe | 1972 | 72.19 | 54365564 | 12269.27378 | ITA | 380 |
| 842 | Korea, Rep. | Asia | 1962 | 55.292 | 26420307 | 1536.344387 | KOR | 410 |
| 549 | Gabon | Africa | 1997 | 60.461 | 1126189 | 14722.84188 | GAB | 266 |
| 1612 | United States | Americas | 1972 | 71.34 | 209896000 | 21806.03594 | USA | 840 |


#### Basic plot

In [11]:
g7 = ['Canada', 'United States', 'United Kingdom', 'Germany', 'France', 'Italy', 'Japan']
g = px.line(df[df['country'].isin(g7)], x='year', y='gdpPercap',
            height=400, width=1000)
g

#### Adding a title and per-country colour

In [12]:
g2 = px.line(df[df['country'].isin(g7)], x='year', y='gdpPercap', title='GDP per capita of G7 countries over time', color='country',
             height=400, width=1000)
g2

#### Improving labels

In [13]:
g3 = px.line(df[df['country'].isin(g7)], x='year', y='gdpPercap', title='GDP per capita of G7 countries over time', color='country',
             height=400, width=1000, labels={'year':'Year', 'gdpPercap':'GDP per capita (USD)', 'country':'Country'})
g3

#### Adding annotations

In [14]:
g4 = px.line(df[df['country'].isin(g7)], x='year', y='gdpPercap', title='GDP per capita of G7 countries over time', color='country',
             height=400, width=1000, labels={'year':'Year', 'gdpPercap':'GDP per capita (USD)', 'country':'Country'})
g4.add_annotation(x=1997, y=6000, text='*Inflation-adjusted to 2007', showarrow=False)
g4


#### Styling units

In [15]:
g5 = px.line(df[df['country'].isin(g7)], x='year', y='gdpPercap', title='GDP per capita of G7 countries over time', color='country',
             height=400, width=1000, labels={'year':'Year', 'gdpPercap':'GDP per capita (USD)', 'country':'Country'})
g5.add_annotation(x=1997, y=6000, text='*Inflation-adjusted to 2007', showarrow=False).update_layout(yaxis_tickprefix = '$')
g5

#### Customizing colour

In [16]:
# Colourblind-friendly palette (these eight values match the Okabe-Ito palette)
colors = ['rgb(0,0,0)',
          'rgb(230,159,0)',
          'rgb(86,180,233)',
          'rgb(0,158,115)',
          'rgb(240,228,66)',
          'rgb(0,114,178)',
          'rgb(213,94,0)',
          'rgb(204,121,167)']

In [17]:
g6 = px.line(df[df['country'].isin(g7)], x='year', y='gdpPercap', title='GDP per capita of G7 countries over time', color='country',
             height=400, width=1000, labels={'year':'Year', 'gdpPercap':'GDP per capita (USD)', 'country':'Country'},
             color_discrete_sequence=colors)
g6.add_annotation(x=1997, y=6000, text='*Inflation-adjusted to 2007', showarrow=False).update_layout(yaxis_tickprefix = '$')
g6