# Data Viz Basics

This notebook generates examples of data visualization basics in Plotly. For examples in other libraries, see the corresponding notebooks.

In [None]:
import plotly.express as px
import pandas as pd

In [None]:
# Load Plotly dataset of Gapminder data
df = px.data.gapminder()
df.head()

### Scatter Plots

Scatter plots are an excellent way to compare data that is discrete, i.e. made up of separate observations rather than a continuous sequence. If there's no obvious expectation that the data flows from one value to the next along the x-axis (e.g. over time), the relationship between two variables can be visualized by plotting each pair of matched observations as a point.

In the plot below, it's clear that there's a moderately strong positive relationship between life expectancy and GDP per capita, at least in G7 countries. The same data could be shown over time, in which case a temporal component could be included and a causal relationship might be inferred, but regardless of *when* the data was collected, the relationship generally holds.

In [None]:
# Scatter plot
g7 = ['Canada', 'United States', 'United Kingdom', 'Germany', 'France', 'Italy', 'Japan']
fig = px.scatter(df[df['country'].isin(g7)], x='gdpPercap', y='lifeExp', color='country',
           width=500, height=400, title='G7 life expectancy vs. GDP per capita',
           labels={'lifeExp':'Life Expectancy (years)', 'gdpPercap':'GDP per capita (USD)', 'country':'Country'})
fig.show()

### Line Plots
Like scatter plots, line plots relate two sets of data, but with the implicit assumption that the data is *continuous*, i.e. connected from one value to the next. This is obvious when the independent variable is some measure of **time**, but less obvious when it's another factor.

Though the plotted data is ostensibly discrete, in that the dataset only has a measurement every five years, the implication of the line plot is that the trend is maintained *between* the data points as well, since life expectancy could be measured at any given time. The data could just as easily be plotted as a scatter plot, but the line plot implies the relationship holds even between data points, and is (nearly) always more appropriate when the X-axis is time.

In [None]:
# Line plot
fig = px.line(df[df['country'].isin(g7)], x='year', y='lifeExp', color='country',
           width=500, height=400, title='G7 life expectancy over time',
           labels={'lifeExp':'Life Expectancy', 'year':'Year', 'country':'Country'})
fig.show()

### Bar Plots

Bar plots are an excellent way to compare either counts or single values across different groups. They can be further faceted to show hierarchies of groups or different combinations of factors. What they _aren't_ great for is showing changes over time, where a line plot might be more effective, or comparing discrete data that isn't clearly categorical.

Additionally, bar plots are one of the most basic plots you can make from data, but also one that is often misunderstood and misused. Because the general public is so familiar with bar plots, it's very easy to overlook bias that has been intentionally or unintentionally added to one. The first place to look is the Y-axis, where scaling the range can magnify or obscure real differences between groups.

In [None]:
px.bar(df[(df['country'].isin(g7)) & (df['year']==2002)], x='country', y='pop', color='country', 
       title='Population of G7 countries in 2002', labels={'pop':'Population', 'country':'Country'},
       width=500, height=400)

### Histograms

Histograms are an excellent way to initially explore the data and get an idea of the underlying distributions. By looking at the shape of the histogram, you can get an appreciation for the range of values, the maxima and minima, and the latent patterns that might underlie how the data was created.

When doing EDA, histograms are almost a requirement, at least initially, and many Python libraries make it extremely easy to generate them quickly. Depending on the type of data and how it's distributed, it may be necessary to change the scale to get a more readable visualization, especially with monetary or population data. For example, the first of the visualizations below uses a linear scale for the count on the Y-axis, while the second uses a logarithmic one. Note that here we are switching the Y-axis (count) to a log scale; more often, you'll need to apply the transformation to the data you're investigating rather than to the count.

The interactivity of Plotly also excels here, as you can hover over the bins to see the actual values in each bar.

In [None]:
px.histogram(df[df['year']==1997], 'pop', title='Histogram of country populations in 1997 (linear Y scale)')

In [None]:
px.histogram(df[df['year']==1997], 'pop', title='Histogram of country populations in 1997 (log Y scale)', 
             log_y=True)