# Data Visualization using Python: Solutions

## Import Packages

In [None]:
import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt

from pyprojroot import here

%matplotlib inline

## Import the Data Set

In [None]:
gm = pd.read_csv(here('data/gapminder.tsv'), sep='\t')

## Challenge Questions: Histograms


In [None]:
latest_year = gm['year'].max()
gm_latest = gm[gm['year'] == latest_year]

1. Create a histogram of life expectancy in the year 2007 across all 142 countries in the gapminder dataset. Play with the `bins=` parameter to find the most informative bin number.

In [None]:
# Try 17 bins
plt.hist(gm_latest['lifeExp'], bins=17)
plt.title('Distribution of Global Life Expectancy in 2007')
plt.xlabel('Life Expectancy (Years)')
plt.ylabel('# of Countries');

In [None]:
# Try 25 bins
plt.hist(gm_latest['lifeExp'], bins=25)
plt.title('Distribution of Global Life Expectancy in 2007')
plt.xlabel('Life Expectancy (Years)')
plt.ylabel('# of Countries');

In [None]:
# Try 10 bins
plt.hist(gm_latest['lifeExp'], bins=10)
plt.title('Distribution of Global Life Expectancy in 2007')
plt.xlabel('Life Expectancy (Years)')
plt.ylabel('# of Countries');

2. What can you say about the distribution of life expectancy values in 2007?

The distribution appears to have three peaks: one at the lower end (around 50-60 years), one around 70, and one around 80. Many countries still have a low life expectancy, while another group has a very high life expectancy.
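If trial and error over `bins=` feels arbitrary, NumPy's built-in bin-width rules offer a starting point. The sketch below is only illustrative: `life_exp` is a synthetic stand-in for `gm_latest['lifeExp']` (142 made-up values, not gapminder data), so the bin counts it prints will differ from what you get with the real column.

```python
import numpy as np

# Synthetic stand-in for gm_latest['lifeExp']: 142 made-up values, NOT real data.
rng = np.random.default_rng(0)
life_exp = np.concatenate([
    rng.normal(52, 5, 40),
    rng.normal(72, 4, 70),
    rng.normal(80, 1.5, 32),
])

# Ask NumPy how many bins each rule of thumb suggests for this sample.
bin_counts = {}
for rule in ('sturges', 'fd', 'auto'):
    edges = np.histogram_bin_edges(life_exp, bins=rule)
    bin_counts[rule] = len(edges) - 1

print(bin_counts)
```

With the real data you could pass the same rule names straight to `plt.hist(gm_latest['lifeExp'], bins='auto')` and then adjust by eye.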

## Challenge Questions: Bar Plots

In [None]:
countries = gm[['country', 'continent']]
countries = countries.drop_duplicates()
country_counts = countries.groupby('continent', as_index=False).agg('count')
country_counts.columns = ['continent', 'n_countries']
continents = country_counts['continent']

In [None]:
# Get the countries in Oceania in 2007
gm_latest_oceania = gm_latest[gm_latest['continent'] == 'Oceania']
gm_latest_oceania

3. Create a bar plot showing the per-capita GDP for all the countries in Oceania during 2007.

In [None]:
plt.bar(range(len(gm_latest_oceania)), gm_latest_oceania['gdpPercap'])
plt.xticks(range(len(gm_latest_oceania)), gm_latest_oceania['country'])
plt.ylabel('Per-Capita GDP ($USD)')
plt.title('Per-Capita GDP of Countries in Oceania in 2007');

4. **\[OPTIONAL\]** The above bar plot shows the count of countries in each continent. We might instead be interested in the proportion of countries in each of the five continents. Do a web search for `plt.pie` and figure out how to make a pie plot that displays the proportion of all countries contained in each of the five continents.

In [None]:
plt.pie(country_counts['n_countries'], labels=continents, autopct='%.1f%%')
plt.title('Proportion of Countries per Continent');
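To check the percentages printed on the pie wedges, the same proportions can be computed numerically. This is a minimal sketch on a hypothetical table (`toy`, with made-up country names), assuming the same long format as `gm`, where each country appears on several rows.

```python
import pandas as pd

# Hypothetical country/continent rows with repeats, as in a long-format table.
toy = pd.DataFrame({
    'country': ['A', 'A', 'B', 'C', 'C', 'D', 'E'],
    'continent': ['Africa', 'Africa', 'Africa', 'Asia', 'Asia', 'Europe', 'Africa'],
})

# Keep one row per country, then convert continent counts to percentages.
unique_countries = toy.drop_duplicates()
pct = unique_countries['continent'].value_counts(normalize=True) * 100

print(pct.round(1))
```

Applied to the real data, `countries['continent'].value_counts(normalize=True)` should agree with the wedge labels up to rounding.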

## Challenge Questions: Boxplots

5. Knowing how to interpret your plots is almost as important as knowing how to make them! Looking at the above box plot of per-capita GDP for each continent, what information do you take away from it? Where do you think the U.S.A. is represented in this plot, and how could you confirm that?

In [None]:
gm_latest[gm_latest['country'] == 'United States']
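Once you have a country's value, one way to confirm where it sits in a box plot is to compute the whisker limit yourself. By default, matplotlib draws whiskers out to 1.5 times the interquartile range and shows anything beyond them as individual points. The sketch below uses a hypothetical array `gdp` (made-up numbers, not gapminder values) to show the arithmetic.

```python
import numpy as np

# Hypothetical per-capita GDP values for one continent (illustrative only).
gdp = np.array([1200., 3500., 5200., 7400., 9100., 11000., 12500., 42000.])

# Quartiles and interquartile range.
q1, q3 = np.percentile(gdp, [25, 75])
iqr = q3 - q1

# Values above this limit are drawn as individual outlier points.
upper_fence = q3 + 1.5 * iqr
outliers = gdp[gdp > upper_fence]

print(upper_fence, outliers)
```

With the real data, you would run the same calculation on the `gdpPercap` values for the country's continent and compare the country's value against `upper_fence`.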

## Challenge Questions: Line Plots

In [None]:
portugal = gm[gm['country'] == 'Portugal']
spain = gm[gm['country'] == 'Spain']

6. Create another line plot showing the life expectancy for Spain and Portugal across all the years in the dataset similar to the one above, but try to add some customizations (e.g., changing the font sizes, different line colors, etc.). You can use the `help()` function to see what kind of customizations are available.

In [None]:
plt.figure(figsize=(10, 5))
plt.plot(spain['year'], spain['lifeExp'], label='Spain', color='blue')
plt.plot(portugal['year'], portugal['lifeExp'], label='Portugal', color='green')
plt.title('Life Expectancy of Portugal & Spain', fontsize=18)
plt.xlabel('Time (Years)', fontsize=14)
plt.ylabel('Life Expectancy (Years)', fontsize=14)
plt.legend(prop={'size': 13});

7. Does Spain or Portugal have a higher life expectancy across all the years? How does this relate to per-capita GDP? How might we look for a relationship?

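Spain has a higher life expectancy than Portugal. To look for a relationship with per-capita GDP, we could plot per-capita GDP over time for both countries, or make a scatter plot of life expectancy against per-capita GDP.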
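Both parts of this answer can also be checked numerically rather than by eye. The sketch below assumes a table shaped like `gm`, but the rows in `toy` are hypothetical values chosen for illustration, so the printed results say nothing about the real data.

```python
import pandas as pd

# Hypothetical rows shaped like the gapminder table (values are illustrative only).
toy = pd.DataFrame({
    'country': ['Spain', 'Spain', 'Spain', 'Portugal', 'Portugal', 'Portugal'],
    'year': [1952, 1977, 2007, 1952, 1977, 2007],
    'lifeExp': [65.0, 74.0, 81.0, 60.0, 70.0, 78.0],
    'gdpPercap': [3800.0, 13200.0, 28800.0, 3100.0, 10200.0, 20500.0],
})

# One column per country, indexed by year, so the rows line up for comparison.
wide = toy.pivot(index='year', columns='country', values='lifeExp')
spain_always_higher = bool((wide['Spain'] > wide['Portugal']).all())

# Pearson correlation between life expectancy and per-capita GDP across all rows.
corr = toy['lifeExp'].corr(toy['gdpPercap'])

print(spain_always_higher, round(corr, 2))
```

A correlation coefficient only summarizes a linear trend, so it complements a plot rather than replacing it.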

## Challenge Questions: Scatter Plots

8. We've seen that life expectancy and per-capita GDP have a positive relationship. What about population and per-capita GDP: is there a relationship? Create a scatter plot that compares the two across all countries in 2007.

In [None]:
plt.scatter(np.log10(gm_latest['gdpPercap']), gm_latest['pop'], marker='.')
plt.xlabel('log10 of Per-Capita GDP ($USD)')
plt.yscale('log')
plt.ylabel('Population');
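An alternative to transforming the data with `np.log10` is to leave the values alone and put the axis itself on a log scale, which keeps the tick labels in the original units. A minimal sketch using the object-oriented interface follows; `gdp_percap` and `population` are hypothetical stand-ins for the two `gm_latest` columns.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical values standing in for gm_latest['gdpPercap'] and gm_latest['pop'].
gdp_percap = np.array([500.0, 2000.0, 9000.0, 30000.0])
population = np.array([8e7, 1.2e9, 4e7, 3e8])

fig, ax = plt.subplots()
ax.scatter(gdp_percap, population, marker='.')

# Log-scale both axes; the data are untouched, only the axis spacing changes.
ax.set_xscale('log')
ax.set_yscale('log')
ax.set_xlabel('Per-Capita GDP ($USD)')
ax.set_ylabel('Population');
```

Either approach gives the same picture; the difference is whether the x-axis ticks read as exponents or as dollar amounts.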

9. Is that relationship between population and per-capita GDP different for the first year we have data in the dataset? Plot the scatter plots for the first and latest years next to each other in the same figure, using different subplots. What can you say about any outliers you see?

(*HINT*: First you need to extract another `DataFrame` containing the data from the first year).

In [None]:
first_year = gm['year'].min()
print(first_year)
gm_first = gm[gm['year'] == first_year]

In [None]:
plt.figure(figsize=(15,5))

plt.subplot(1,2,1)
plt.scatter(np.log10(gm_first['gdpPercap']), gm_first['pop'], marker='.')
plt.title('1952')
plt.xlabel('log10 of Per-Capita GDP ($USD)')
plt.yscale('log')
plt.ylabel('Population');

plt.subplot(1,2,2)
plt.scatter(np.log10(gm_latest['gdpPercap']), gm_latest['pop'], marker='.')
plt.title('2007')
plt.xlabel('log10 of Per-Capita GDP ($USD)')
plt.yscale('log')
plt.ylabel('Population');
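To put names to any outliers you spot in the scatter plots, you can ask pandas for the rows with the largest values. The sketch below runs on a hypothetical single-year table, `toy_year`, with made-up countries and numbers; on the real data you would call the same method on `gm_first` or `gm_latest`.

```python
import pandas as pd

# Hypothetical single-year rows (country names and values are made up).
toy_year = pd.DataFrame({
    'country': ['Country A', 'Country B', 'Country C', 'Country D'],
    'pop': [1_300_000_000, 9_000_000, 45_000_000, 600_000],
    'gdpPercap': [4_900.0, 2_100.0, 31_000.0, 95_000.0],
})

# The row with the largest value along each axis of the scatter plot.
largest_pop = toy_year.nlargest(1, 'pop')['country'].tolist()
largest_gdp = toy_year.nlargest(1, 'gdpPercap')['country'].tolist()

print(largest_pop, largest_gdp)
```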

#### Pandas approach

Here is a pair of plots related to (but not directly answering) the above questions, using Pandas methods again.

In [None]:
min_yr = gm.year.min()
gm_first_yr = gm[gm.year == min_yr]
# Plot only the first year's rows, to match the title
gm_first_yr.plot(x='gdpPercap', y='pop', kind='scatter', figsize=(10,8))
plt.xlabel('GDP per capita')
plt.ylabel('Population')
plt.title('Population vs. GDP per capita, first year');

In [None]:
gm.plot(x='gdpPercap', y='pop', c='year', cmap='spring', kind='scatter', figsize=(10,8))
plt.xlabel('GDP per capita')
plt.ylabel('Population')
plt.title('Population vs. GDP per capita, across years');

10. **\[OPTIONAL\]** Above we created a scatter plot between life expectancy and per-capita GDP colored by year. That coloring was done in a continuous way. What if we wanted to color it by decade instead, making a discrete coloring? Run the code cell below to create a new variable called `decades`. Then create another scatter plot of life expectancy vs per-capita GDP assigning the color strings in `hexsix` to data points from each of the six decades in the dataset. 

In [None]:
hexsix = np.array(['#ffffcc', '#d9f0a3', '#addd8e', '#78c679', '#31a354', '#006837'])
gm['decade'] = (gm['year'] / 10).astype(int) * 10
decades = gm['decade'].unique()
decades

You could also apply a special temporary kind of function called a `lambda` function to every item in the year column to create the decades. See [here](https://docs.python.org/3/tutorial/controlflow.html#lambda-expressions) for more on lambda functions.

In [None]:
gm['decade'] = gm['year'].apply(lambda x: int(x / 10) * 10)
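As a quick sanity check that the two approaches agree, here is a small sketch on a standalone `years` Series (a few of the survey years, not the full column). It also shows integer floor division, `//`, as a third way to write the same calculation.

```python
import pandas as pd

# A few survey years as a standalone Series.
years = pd.Series([1952, 1957, 1962, 1997, 2002, 2007], name='year')

# Vectorized floor division: drop the last digit, then scale back up.
decade_vectorized = years // 10 * 10

# The lambda version, applied to one value at a time.
decade_lambda = years.apply(lambda x: int(x / 10) * 10)

print(decade_vectorized.tolist())
print(decade_lambda.tolist())
```

The vectorized form is usually preferable on large tables, since `apply` calls the Python function once per row.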

In [None]:
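# Draw one scatter per decade so each gets its own color and legend entry
for i, cur_decade in enumerate(decades):
    cur_decade_gm = gm[gm['decade'] == cur_decade]
    plt.scatter(np.log10(cur_decade_gm['gdpPercap']),
                cur_decade_gm['lifeExp'],
                marker='.',
                color=hexsix[i],
                label=str(cur_decade))
plt.xlabel('log10 of Per-Capita GDP ($USD)')
plt.ylabel('Life Expectancy (Years)')
plt.legend(title='Decade');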
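The loop above draws one scatter per decade. Another option is to build a per-row color column and draw everything in a single call. The sketch below shows only the mapping step; `decade_col` is a hypothetical handful of rows standing in for `gm['decade']`, and `decades` is written out by hand rather than taken from the data.

```python
import numpy as np
import pandas as pd

hexsix = np.array(['#ffffcc', '#d9f0a3', '#addd8e', '#78c679', '#31a354', '#006837'])
decades = np.array([1950, 1960, 1970, 1980, 1990, 2000])

# Pair each decade with one color string.
color_for_decade = dict(zip(decades.tolist(), hexsix.tolist()))

# Hypothetical decade column for a handful of rows.
decade_col = pd.Series([1950, 2000, 1970, 2000])

# Look up the color for every row.
point_colors = decade_col.map(color_for_decade)

print(point_colors.tolist())
```

A sequence like `point_colors` can be passed to `plt.scatter` through the `c=` argument. The trade-off is that a single call produces no per-decade legend entries, which is why the loop version is convenient here.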