# Visualization of a Gapminder Data Set

We are going to look at Plotly's subset of data from Gapminder. 

[You may find it very interesting to explore https://www.gapminder.org/; Gapminder is an independent educational non-profit fighting global misconceptions.]

## First step:  import the libraries and data

For reference, Plotly's guide to time-series plots: https://plotly.com/python/time-series/

In [None]:
# Plotly Express ships inside the plotly package, so we install plotly (and upgrade seaborn)
!pip install -U plotly seaborn

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

import plotly.express as px
import plotly.graph_objects as go

In [None]:
px.data.gapminder()

In [None]:
gp = px.data.gapminder()

In [None]:
gp_usa = gp.loc[gp['country']=='United States']

In [None]:
gp_usa.plot(x='year', y='lifeExp')

In [None]:
gp_usa.plot(x='year', y='lifeExp', kind='scatter')

In [None]:
ax = gp_usa.plot(x='year', y='lifeExp')
gp_usa.plot(x='year', y='lifeExp', kind='scatter', ax=ax)

In [None]:
ax = gp_usa.plot(x='year', y='lifeExp')
gp_usa.plot(x='year', y='gdpPercap', ax=ax)

In [None]:
fig,ax=plt.subplots(2,1,sharex=True)
gp_usa.plot(x='year', y='lifeExp', ax=ax[0])
gp_usa.plot(x='year', y='gdpPercap', ax=ax[1])

In [None]:
gp_usa[['lifeExp','gdpPercap']].corr()
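
For reference, `DataFrame.corr()` computes the Pearson correlation coefficient by default. The sketch below reproduces that quantity with NumPy on a small made-up data set (the numbers are illustrative, not Gapminder values) and compares it with `np.corrcoef`.

```python
import numpy as np

# Made-up illustrative values, not taken from the Gapminder data
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

# Pearson r: sum of co-deviations divided by the product of the spreads
dx = x - x.mean()
dy = y - y.mean()
r_manual = (dx * dy).sum() / np.sqrt((dx**2).sum() * (dy**2).sum())

# np.corrcoef returns the full 2x2 correlation matrix; the off-diagonal entry is r
r_numpy = np.corrcoef(x, y)[0, 1]

print(r_manual, r_numpy)
```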

In [None]:
gp_usa.plot(x='gdpPercap', y='lifeExp', kind='scatter')

## Diversion:  Fitting lines and interpolating values

In [None]:
import statsmodels.api as sm
lowess = sm.nonparametric.lowess

from scipy.interpolate import UnivariateSpline

In [None]:
x = gp_usa['gdpPercap']
y = gp_usa['lifeExp']

In [None]:
ax = gp_usa.plot(x='gdpPercap', y='lifeExp', kind='scatter', color='black')

z = np.polyfit(x, y, 1)
ax.plot(x, z[1]+z[0]*x, 'blue')
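
The expression `z[1]+z[0]*x` above works because `np.polyfit` returns coefficients with the highest power first, so for a degree-1 fit `z[0]` is the slope and `z[1]` is the intercept. A small sketch on synthetic data (a noiseless line chosen purely for illustration) makes the ordering explicit; `np.polyval` evaluates the same polynomial without indexing by hand.

```python
import numpy as np

# Synthetic, noiseless data on the line y = 2x + 1 (illustrative only)
x = np.linspace(0.0, 10.0, 11)
y = 2.0 * x + 1.0

# Degree-1 fit: coefficients come back highest power first
slope, intercept = np.polyfit(x, y, 1)

# np.polyval applies the coefficients in that same order
y_fit = np.polyval([slope, intercept], x)

print(slope, intercept)
```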

In [None]:
ax = gp_usa.plot(x='gdpPercap', y='lifeExp', kind='scatter', color='black')

z = lowess(y,x)
ax.plot(z[:,0], z[:,1], 'blue')

In [None]:
ax = gp_usa.plot(x='gdpPercap', y='lifeExp', kind='scatter', color='black')

# UnivariateSpline expects x to be in increasing order
spl = UnivariateSpline(x, y)
ax.plot(x, spl(x), 'blue')
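
The fits above are only evaluated at the observed `x` values. To estimate a value between observations, one simple option (an addition here, not something the cells above do) is linear interpolation with `np.interp`. The years and life expectancies below are made-up placeholders rather than Gapminder values.

```python
import numpy as np

# Made-up placeholder samples at five-year intervals (not Gapminder values)
years = np.array([1952, 1957, 1962])
life_exp = np.array([68.0, 69.0, 70.0])

# Linear interpolation between the two samples that bracket 1955
estimate_1955 = np.interp(1955, years, life_exp)

print(estimate_1955)
```

A fitted spline object such as `spl` can be evaluated at new `x` values in the same spirit, by calling it on them.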

## Telling stories with animation

* Use plotly for this
* Build up our scatter plot into an animated visualization

In [None]:
px.scatter(gp_usa, 
           x="gdpPercap", 
           y="lifeExp")

The above is just for the USA.  We're going to expand to all countries now:

In [None]:
px.scatter(gp, 
           x="gdpPercap", 
           y="lifeExp")

Wait... not only do we not know which point is which country, we also don't know how the points evolve in time.

In [None]:
px.scatter(gp, 
           x="year", 
           y="lifeExp")

We could look at a plot of all values for a given year.

In [None]:
years = gp.year.unique()

i = years[0]

px.scatter(gp.loc[gp['year']==i], 
           x="gdpPercap", 
           y="lifeExp")

What's that outlier?

Let's add 'hover_name' so that we can more easily get information about points by simply moving our mouse to them.

In [None]:
i = years[0]

px.scatter(gp.loc[gp['year']==i], 
           x="gdpPercap", 
           y="lifeExp",
           hover_name='country')

One way to look at time: manually change which year you are plotting. Here we also switch the x-axis to a log scale with `log_x=True`.

In [None]:
i = years[10]

px.scatter(gp.loc[gp['year']==i], 
           x="gdpPercap", 
           y="lifeExp",
           hover_name='country',
           log_x=True)

Another way to visualize change over time: animate the plot, with one frame per year (`animation_frame="year"`).

In [None]:
# Pass the full DataFrame rather than a single-year slice (gp.loc[gp['year']==i]);
# animation_frame then builds one frame per year.
px.scatter(gp,
           x="gdpPercap", 
           y="lifeExp",
           hover_name='country',
           log_x=True,
           animation_frame="year")

This will benefit from further customization:

* Many countries sit at low gdpPercap values. The log scale on the x-axis (`log_x=True`, already applied above) stretches that region so the separation is more visible.
* How do we know what's evolving where?
  * Add color so we can keep track of individual points
  * That is a lot of colors, so also scale the point size by population to help distinguish the dots
* Fix the axes' ranges so that all points stay within the visualized space in every frame
* Change the figure's width and height to spread out the points
* Increase the maximum point size so that smaller points are easier to see

* **Change how we look at time:**
  * One way to look at time: manually change which year you are plotting
  * Another way to visualize change over time: animate the plot so it steps through the years

In [None]:
px.scatter(gp, 
           x="gdpPercap", 
           y="lifeExp",
           hover_name='country', color='country', size='pop',
           log_x=True,
           range_x=[100,100000], 
           range_y=[25,90],
           width=800, 
           height=600,
           size_max=60,
           template='simple_white',
           animation_frame="year",
          )

## Bringing it back around to something quantitative

In [None]:
gp_china = gp.loc[gp['country']=='China']

x = gp_china['gdpPercap']
y = gp_china['lifeExp']

ax = gp_china.plot(x='gdpPercap', y='lifeExp', kind='scatter', color='black')

z = np.polyfit(x, y, 1)
ax.plot(x, z[1]+z[0]*x, 'blue')

While the linear fit for the USA was not too bad, here the linear fit clearly looks bad.
* China's GDP per capita and life expectancy did not increase at constant rates over this time period.
* The GDP per capita evolved more slowly before ~1980 than after 1980.
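
One way to put a number on "looks bad" is the coefficient of determination, R². The sketch below is an illustration on synthetic data, not China's actual series: GDP per capita grows exponentially in time while life expectancy grows linearly, so a straight line fits poorly against GDP but well against log GDP. The helper name `r_squared` is hypothetical.

```python
import numpy as np

def r_squared(x, y):
    """R^2 of a degree-1 least-squares fit of y against x."""
    slope, intercept = np.polyfit(x, y, 1)
    residuals = y - (slope * x + intercept)
    ss_res = (residuals**2).sum()
    ss_tot = ((y - y.mean())**2).sum()
    return 1.0 - ss_res / ss_tot

# Synthetic series (illustrative only) over 12 evenly spaced time steps
t = np.arange(12)
gdp = 400.0 * np.exp(0.25 * t)   # growth that accelerates over time
life = 45.0 + 2.5 * t            # steady improvement

r2_linear = r_squared(gdp, life)
r2_log = r_squared(np.log(gdp), life)

print(r2_linear, r2_log)
```

Applied to the notebook's data, the same helper could be called as `r_squared(x, y)` and `r_squared(np.log(x), y)` with the `x` and `y` defined for China above; whether the log transform helps there is something to check rather than assume.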

In [None]:
fig,ax=plt.subplots(2,1,sharex=True,figsize=(8,6))

gp_usa.plot(x='year', y='lifeExp', ax=ax[0], color='black', label='USA')
gp_china.plot(x='year', y='lifeExp', ax=ax[0], color='blue', label='China')
ax[0].set_title('Life Expectancy')

gp_usa.plot(x='year', y='gdpPercap', ax=ax[1], color='black', label='USA')
gp_china.plot(x='year', y='gdpPercap', ax=ax[1], color='blue', label='China')
ax[1].set_title('GDP Per Capita')

plt.show()

# Stocks

Examples of a few cool time series plots to make with plotly.

In [None]:
help(px.data)

In [None]:
df = px.data.stocks()

In [None]:
df

In [None]:
sns.lineplot(data=df.set_index('date'))

In [None]:
fig = px.line(df,x='date', y='GOOG', title="GOOG stocks")
fig.show()

In [None]:
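df_diff = df.copy()
# The stock columns appear to be normalized to 1.0 on the first date,
# so subtracting 1 gives each stock's relative change since then
df_diff.loc[:,df_diff.columns != 'date'] = df_diff.loc[:,df_diff.columns != 'date'] - 1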

In [None]:
fig = px.bar(df_diff, x='date', y="GOOG")
fig.show()

In [None]:
px.data.stocks(indexed=True)

In [None]:
df_diff

In [None]:
df_diff.set_index('date')

In [None]:
df_diff.columns = df_diff.columns.rename('company')

In [None]:
df_diff

In [None]:
df_diff = df_diff.set_index('date')
df_diff

In [None]:
fig = px.area(df_diff, facet_col='company', facet_col_wrap=3)
fig.show()

In [None]:
fig = px.line(df,x='date', y='AAPL', title="AAPL stocks", 
              range_x=['2018-07-01','2019-12-31'])
fig.show()

In [None]:
# columns[1:7] selects all six company columns (column 0 is 'date')
fig = px.line(df,x='date', y=df.columns[1:7], title="6 company stocks plot")
fig.show()

In [None]:
# columns[1:7] selects all six company columns (column 0 is 'date')
fig = px.line(df,x='date', y=df.columns[1:7], title="6 company stocks plot")
fig.update_xaxes(
    dtick="M1",
    tickformat="%b\n%Y")
fig.show()

In [None]:
fig = px.line(df,x='date', y=df.columns[1:7], title="6 company stocks plot")
fig.update_xaxes(
    dtick="M1",
    tickformat="%b\n%Y")
fig.update_layout(template=go.layout.Template())
fig.show()

In [None]:
fig = px.scatter(df,x='date', y=df.columns[1:7])
fig.show()

In [None]:
fig = px.scatter(df,x='date', y=df.columns[1:7])
fig.update_xaxes(rangeslider_visible=True)
fig.show()

In [None]:
fig = px.area(df,x='date', y=df.columns[1:7], height=600)
fig.update_xaxes(rangeslider_visible=True)
fig.show()