# Python Data Visualization

This workshop will introduce you to how to create beautiful and functional visualizations in Python. We will start with a brief introduction to some key principles of visualization in Python, introduce `matplotlib` and `seaborn` as key packages for data visualization in Python, and explore basic visualizations and ways to customize them.

Visualization is meant to convey information.

> The power of a graph is its ability to enable one to take in the quantitative information, organize it, and see patterns and structure not readily revealed by other means of studying the data.

\- Cleveland and McGill, 1984

Certain techniques make that information easier to interpret and understand. In their 1984 paper titled, "[Graphical Perception: Theory, Experimentation, and Application to the Development of Graphical Methods](https://www-jstor-org.libproxy.berkeley.edu/stable/2288400?seq=1#page_scan_tab_contents)," Cleveland and McGill identify 10 elementary perceptual tasks that are used to "extract quantitative information from graphs." Their premise is:

> A graphical form that involves elementary perceptual tasks that lead to more accurate judgments than another graphical form (with the same quantitative information) will result in better organization and increase the chances of a correct perception of patterns and behavior.

Whereas graph design had, up to that point, been "largely unscientific," Cleveland and McGill took a systematic approach in analyzing human graphical perception through experimentation. Their research helped identify the most and least accurate elementary perceptual tasks, ordered below from most to least accurate:

1. Position
2. Length, direction, angle
3. Area, volume
4. Shading, color saturation

In 2010, [Heer and Bostock](http://vis.stanford.edu/files/2010-MTurk-CHI.pdf) confirmed these results using Amazon's Mechanical Turk.

Let's take a look at a few examples. Because we're only interested in relative sizes, we don't include a legend with size information or reference points.

Let's take a look at the circles below- what perceptual property is being used to communicate differences?

**Question:** Which value is bigger: E or L? How about L or M?

![circles](../images/circles.png)

For circles of distinctly different sizes, the comparison is simple. For example, "A" is smaller than "B." However, for circles, such as "L" and "M," that are almost the same size, it's difficult to tell which is smaller. 

Now let's look at the same data using another property. Which property is being used below?

**Question:** Which value is bigger: E or L? How about L and M?

![bars](../images/bars.png)

Focusing on "L" and "M," it is easier to see which is larger.
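
The comparison above can be reproduced with a small sketch. The labels and values here are made up for illustration (they are not the data behind the images), and the `* 60` scaling of the marker area is an arbitrary choice to make the circles visible.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical values; the last two differ by only about 4%.
labels = ['A', 'B', 'L', 'M']
values = np.array([10, 40, 25, 26])

fig, (ax_area, ax_len) = plt.subplots(1, 2, figsize=(10, 4))

# Area encoding: the `s` argument of scatter sets the marker area.
ax_area.scatter(range(len(values)), np.zeros(len(values)), s=values * 60)
ax_area.set_xticks(range(len(values)))
ax_area.set_xticklabels(labels)
ax_area.set_yticks([])
ax_area.margins(0.3)
ax_area.set_title('Area')

# Length encoding: bar height is proportional to the value.
bars = ax_len.bar(labels, values)
ax_len.set_title('Length')
```

With these values, the difference between "L" and "M" should be much easier to spot in the right-hand panel than in the left-hand one.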

The way we design plots can also highlight certain properties of the data. For this example, let's suppose we're working with student English and math test scores. Here, we'll use a bar for each student, which we arbitrarily label Z-L. The question is, which bars should we use? This is a case where the answer depends on what we're trying to communicate. 

One option is to use a **stacked bar chart**.

**Question:** Let's compare the overall scores of students S and N - which one has a higher overall score? Now let's compare the math scores of students S and N - which is higher?

![two_bars](../images/two-series-0.png)

An alternative version is the 'double-ended bar chart', shown below. Compare S and N in the chart below: which has a higher math score? Which has a higher overall score?

**Question:** If you wanted to communicate overall score, which plot would you use? If comparing individual tests, which would you use?

![two_bars_centered](../images/two-series-1.png)
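
Both layouts can be sketched with `matplotlib` bars. The student labels and scores below are invented for illustration, and the vertical orientation is an assumption; the images above may be laid out differently.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical scores for three students (not the workshop's data).
students = ['S', 'N', 'Q']
english = np.array([70, 55, 80])
math_scores = np.array([60, 80, 45])

fig, (ax_stack, ax_double) = plt.subplots(1, 2, figsize=(10, 4))

# Stacked: math bars start where the English bars end, so totals are easy to read.
ax_stack.bar(students, english, label='English')
ax_stack.bar(students, math_scores, bottom=english, label='Math')
ax_stack.set_title('Stacked')
ax_stack.legend()

# Double-ended: both series start at zero, so each test is easy to compare.
ax_double.bar(students, english, label='English')
ax_double.bar(students, -math_scores, label='Math')
ax_double.axhline(0, color='black', linewidth=1)
ax_double.set_title('Double-ended')
ax_double.legend()
```

In the stacked panel only the English bars share a common baseline, while in the double-ended panel both series do, at the cost of making totals harder to read.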

## Form and Function

A good visualization *reveals* patterns in the data, rather than simply repeating numbers. Often the aesthetics (form) of the visualization serve this goal, but it is important to center the key information being communicated (function). For example, redundant contrasts can make groupings in data more interpretable and more clearly communicated in your visualization. 

Ultimately, identify your audience and their needs and interests- your visualization should tell a story about your data, and visually highlight the most important information.



## Visualization in Python:  `matplotlib`

Now, we'll start learning how to make visualizations in Python. We'll start by using a popular Python package called `matplotlib` to explore the main basic types of plots and how to make them. `Matplotlib` is the foundation of Python visualization code, so understanding how it functions is key to using other packages effectively as well.

In [None]:
import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt

### Import the data

To illustrate these packages, we will use the so-called Gapminder dataset, which was compiled by Jennifer Bryan. This dataset contains several key demographic and economic statistics for many countries, across many years. For more information, see the [gapminder](https://github.com/jennybc/gapminder) repository.

We'll use the `pandas` Python package to load the file that contains the dataset. The `pandas` package provides `DataFrame` objects that organize datasets in tabular form (think Microsoft Excel spreadsheets). We've created an alias for the `pandas` package, called `pd`, which is the convention. To read in a delimited text file such as a `.csv` (comma separated values) file, we simply use `pd.read_csv`. This particular file happens to be tab-delimited, so we need to specify `sep='\t'`.

In [None]:
gm = pd.read_csv('../data/gapminder.tsv', sep='\t')
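
To see what `sep='\t'` does without needing the data file, here is a minimal sketch that reads a made-up, two-row table from a string. The country names and numbers are placeholders, not Gapminder values.

```python
import io
import pandas as pd

# A made-up, two-row sample in tab-separated form (values are placeholders).
sample_text = "country\tyear\tpop\nAland\t2007\t100\nBorduria\t2007\t250\n"

# sep='\t' tells pandas to split columns on tab characters instead of commas.
sample = pd.read_csv(io.StringIO(sample_text), sep='\t')
print(sample.shape)  # (2, 3)
```

Leaving out `sep='\t'` here would make `pandas` look for commas, and each line would end up in a single column.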

`pandas` includes a type of object called a `DataFrame`, which is a tabular data format that comes with methods (or bound functions) that help us interact with the data. Let's use the `head()` method to look at the first few rows of the data set.

**Question:** What does a row represent in the data set?

In [None]:
gm.head()

It looks like we have information about life expectancy (`lifeExp`), population (`pop`) and per-capita GDP (`gdpPercap`), across multiple years per country. 

To start off, let's say we want to explore the data from the most recent year in the dataset. First we'll find the maximum value for year, and then create a second `DataFrame` containing data from only that year. We'll do that using boolean masking (for an introduction on how to use `pandas`, see the DLab's ["Introduction to Pandas" workshop](https://github.com/dlab-berkeley/introduction-to-pandas)).

What is the most recent year in the data set?

In [None]:
latest_year = gm['year'].max()
latest_year

Subset the data set to just the latest year using a *boolean mask*.
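
As a rough illustration of how a boolean mask works, the sketch below uses a tiny hypothetical table (the names `toy`, `mask`, and `toy_latest` are ours, not part of the workshop data).

```python
import pandas as pd

# Hypothetical miniature table with the same kind of 'year' column.
toy = pd.DataFrame({'country': ['X', 'X', 'Y', 'Y'],
                    'year': [2002, 2007, 2002, 2007]})

# The comparison yields one True/False value per row.
mask = toy['year'] == toy['year'].max()
print(mask.tolist())  # [False, True, False, True]

# .loc keeps only the rows where the mask is True.
toy_latest = toy.loc[mask, :]
```

The same pattern applies to the full dataset: the comparison builds the mask, and `.loc` uses it to select rows.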

In [None]:
gm_latest = gm.loc[gm['year'] == latest_year,:]
gm_latest.shape

Ok, looks like we have 142 values, or rows, across our 6 variables, or columns. Let's get an idea of how per-capita GDP was distributed across all of the countries during 2007 by calculating some **summary statistics**. We'll do that using the `DataFrame`'s `.describe()` method.

In [None]:
gm_latest['gdpPercap'].describe()

Across 142 countries the mean GDP was ~\\$11680, and the standard deviation was ~\\$12860! There was a lot of deviation in GDP across countries, so these summary statistics alone don't give us the whole picture. So now, let's turn to our first basic visualization, the histogram.

### Histograms

Histograms plot the distribution of one variable across its entire range of values. This is an effective visualization for looking at central tendencies (mean, median, mode) and variance.


First, we need to get the data in the right shape for this plot. Since we are plotting one variable, we need to extract one column from the data frame: 

In [None]:
gm_latest['gdpPercap']

Now let's use that with the `matplotlib` function `plt.hist()` to make our first visualization. In `matplotlib` each type of plot has a specific function, with arguments that allow you to customize that plot - we will see an example of this in just a moment.

**Question:** What do the x and y axes represent?

In [None]:
plt.hist(gm_latest['gdpPercap'])

This histogram tells us that many of the countries had a low GDP, which was less than 5,000. There is also a second "bump" in the histogram around 30,000. This type of distribution is known as **bi-modal**, since there are two modes, or most common values.

To make this histogram more interpretable let's add a title and labels for the $x$ and $y$ axes. We'll pass strings to `plt.title()`, `plt.xlabel()`, and `plt.ylabel()` to do so.

In [None]:
plt.hist(gm_latest['gdpPercap'])
plt.title('Distribution of Global Per-Capita GDP in 2007')
plt.xlabel('Per-Capita GDP (International Dollars)')
plt.ylabel('# of Countries');

`matplotlib` can be thought of as making a visualization from a series of layers. This is achieved by running several lines of code to build a visualization, then displaying it at the end of the block of code. In Jupyter notebook, a good guide is to use one visualization per cell.

Now back to our histogram: each bar represents a bin. The height of the bar represents the number of items (countries in this case) within the range of values spanned by the bin. In the last plots we used the default number of bins (10); now let's use more bins by specifying the `bins=30` parameter. This parameter is specific to the histogram plot.

In [None]:
plt.hist(gm_latest['gdpPercap'], bins=30)
plt.title('Distribution of Global Per-Capita GDP in 2007')
plt.xlabel('Per-Capita GDP (International Dollars)')
plt.ylabel('# of Countries');

We can see this histogram doesn't look as "smooth" as the last one. There's no "right" way to display a histogram, but some bin counts definitely are more informative than others. For example, using only 3 bins we cannot see the bi-modal nature of the GDP distribution.

In [None]:
plt.hist(gm_latest['gdpPercap'], bins=3)
plt.title('Distribution of Global Per-Capita GDP in 2007')
plt.xlabel('Per-Capita GDP (International Dollars)')
plt.ylabel('# of Countries');
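
To make the idea of bins concrete, here is a small sketch using `np.histogram`, which computes the counts and bin edges without drawing anything. The values are made up for illustration.

```python
import numpy as np

# Made-up values standing in for a column of data.
values = np.array([1, 2, 2, 3, 8, 9, 9, 10])

# np.histogram returns the count in each bin and the edges of the bins.
counts, edges = np.histogram(values, bins=3)
print(counts.tolist())  # [4, 0, 4]
print(edges.tolist())   # [1.0, 4.0, 7.0, 10.0]
```

Three equal-width bins between the minimum (1) and maximum (10) have edges at 1, 4, 7 and 10, and every value falls into exactly one bin, so the counts always add up to the number of values.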

### Using `pandas` to make plots
If you are familiar with `pandas`, you may also be familiar with the `pandas` plot method. 

These methods' plots are still produced by `matplotlib` on the backend, but can be accessed through `pandas`. The methods often require less code but they also provide less extensive customization. So they impose a **trade-off between convenience and customizability**.

Let's take a look at the `pandas` equivalent of the histogram above, using the `DataFrame.hist()` method.

**Question:** What are the differences between the plot below and the equivalent plot above?


In [None]:
gm_latest.hist(column='gdpPercap', bins=3);

Notice that `pandas` plotting has some default settings and style, so if you are trying to recreate a `matplotlib` plot this can take some additional code. We can also tap into all of the same title and labelling functions as in the `matplotlib` version of the histogram.

In [None]:
gm_latest.hist(column='gdpPercap', bins=3, grid=False)
plt.title('Distribution of Global Per-Capita GDP in 2007')
plt.xlabel('Per-Capita GDP (International Dollars)')
plt.ylabel('# of Countries');
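
One way to see that `pandas` plots are `matplotlib` plots underneath is to keep the object that the plotting method returns. The sketch below uses `Series.plot.hist` on made-up values; the variable names are ours.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Made-up values standing in for a numeric column.
s = pd.Series([1, 2, 2, 3, 8, 9, 9, 10], name='value')

# Series.plot.hist draws with matplotlib and returns the Axes it drew on.
ax = s.plot.hist(bins=3)

# Because `ax` is a matplotlib Axes, the usual labelling methods apply.
ax.set_title('Histogram drawn through pandas')
ax.set_xlabel('Value')
ax.set_ylabel('Count')
```

Holding on to `ax` is optional; the `plt.title()` style used above works too, since both act on the same underlying plot.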

---
### Challenge 1: Make a histogram for life expectancy

Let's create a histogram of life expectancy in the year 2007 in the gapminder data set. The data are already subset for you in the `gm_latest` variable, so your task is to select the appropriate column and make a histogram. Remember to add informative axis labels and a title! Change the `bins=` parameter and find your favorite number of bins for the histogram.

---

In [None]:
gm_latest

## YOUR CODE HERE
plt.hist(...)

### Bar Plots

Now let's turn to our second basic plot, a bar plot. A bar plot is used to compare a value for different groups. In this case, we are going to look at the number of countries per continent.

The first step is to get a DataFrame that has a row for each continent and a count for the number of countries. To do so we will need to follow a couple of steps. 

First, we need to change our `DataFrame` so instead of one row per country per year, there is just one row per continent. We will do this by using the `DataFrame`'s `.groupby()` method together with `nunique` in order to count the number of unique countries per continent.
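
Here is a minimal sketch of the same grouping on a tiny hypothetical table, so the effect of `nunique` is easy to check by eye. The continent and country names are placeholders.

```python
import pandas as pd

# Hypothetical long-format table: one row per country per year.
toy = pd.DataFrame({
    'continent': ['East', 'East', 'East', 'West', 'West'],
    'country':   ['A',    'A',    'B',    'C',    'C'],
    'year':      [2002,   2007,   2007,   2002,   2007],
})

# Count the distinct country names within each continent.
toy_counts = toy[['continent', 'country']].groupby('continent', as_index=False).agg('nunique')
print(toy_counts['country'].tolist())  # [2, 1]
```

Country "A" appears twice in "East" but is only counted once, which is the behavior we want when the data has one row per country per year.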

In [None]:
country_counts = gm[['continent','country']].groupby('continent', as_index=False).agg('nunique')
country_counts

Now that we have the data we want to visualize, we can plot these data as bars. 

Making a bar plot in `matplotlib` is done using `plt.bar`. This takes two arguments: the first is the location on the $x$-axis where the bars should appear (typically a sequence of integers, one per group), and the second is the height of each bar on the $y$-axis, which in our case is the `country` column of the `country_counts` variable. 

**Question:** Which continent has the highest number of countries? Where is that represented in the graph?

In [None]:
x = range(len(country_counts['continent']))
y = country_counts['country']
plt.bar(x,y)

Let's use `plt.xticks()` and the continent names to label the $x$-ticks, which are the text below each bar on the $x$-axis. The `plt.xticks()` function takes two arguments. The first is the *position* for the label and the second is the label itself.

In [None]:
plt.bar(x, y)

plt.title('Number of Countries per Continent')
plt.xticks(x, country_counts['continent']);

**Note:** It is common practice to include error bars in bar plots. While it is possible to plot error bars in `matplotlib`, this is more commonly (and simply) done in `seaborn`, which is a package that we will introduce towards the end of the workshop. 
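
For reference, a minimal sketch of error bars in plain `matplotlib` uses the `yerr` argument of `plt.bar`. The group names, means, and standard deviations below are invented for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

# Made-up group means and standard deviations, for illustration only.
groups = ['G1', 'G2', 'G3']
means = np.array([4.0, 6.5, 5.2])
stds = np.array([0.5, 1.0, 0.8])

# yerr draws a symmetric error bar of +/- stds on each bar; capsize adds end caps.
bars = plt.bar(groups, means, yerr=stds, capsize=4)
plt.ylabel('Mean value')
```

Note that here the error values have to be computed beforehand and passed in explicitly.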

### Boxplots

Now that we've seen how GDP was distributed during 2007, and how many countries are in each continent, we might want to know how GDP is distributed within each continent. While we could plot 5 histograms, we can also take advantage of a useful type of plot for just this purpose, a **boxplot**.

`plt.boxplot()` creates just that, and can take a *list of arrays* with each array representing a distribution to plot. 

For `matplotlib` visualizations, it is necessary to transform the `DataFrame` into the appropriate shape for the visualization. In this case, since the number of countries in each continent is different, we will create an array for each continent that contains the GDP values of all countries in that continent. In the following cell, we use `pandas` to generate a list of arrays, one per continent.

In [None]:
continent_gdp_latest = []
for c in country_counts['continent']:
    gm_latest_cur_cont = gm_latest[gm_latest['continent'] == c]
    cur_gdp_vals = gm_latest_cur_cont['gdpPercap'].values
    continent_gdp_latest.append(cur_gdp_vals)

In [None]:
continent_gdp_latest
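
The same kind of list can also be built by iterating over a `groupby` object, which yields one (name, sub-table) pair per group. This is a sketch on a hypothetical five-row table, not a replacement for the cell above.

```python
import pandas as pd

# Hypothetical table with a grouping column and a numeric column.
toy = pd.DataFrame({
    'continent': ['East', 'West', 'East', 'West', 'East'],
    'gdpPercap': [1.0, 2.0, 3.0, 4.0, 5.0],
})

names, arrays = [], []
# Iterating over a groupby yields each group's key and its rows.
for name, group in toy.groupby('continent'):
    names.append(name)
    arrays.append(group['gdpPercap'].values)

print(names)  # ['East', 'West']
```

The arrays have different lengths (three values for "East", two for "West"), which is why a list of arrays is used rather than a single rectangular table.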

Now we'll use this list of arrays to make a boxplot. We'll also update the x-ticks, title, and axes labels.

In [None]:
plt.boxplot(continent_gdp_latest)
plt.title('Per-Capita GDP Distributions Per Continent')
plt.xlabel('Continents')
plt.ylabel('Per-Capita GDP (International Dollars)')
plt.xticks(range(1, len(country_counts['continent']) + 1), country_counts['continent']);


**Question:** What are some features that you notice about this figure?

---
### Challenge 2: Documentation and Arguments

Let's take a look at the [documentation](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.boxplot.html) for the boxplot function. Take a moment to read through the documentation and answer the following questions: 

1. What does the (orange) line in the middle of the box mean?
2. How are outliers determined? How can you change that value? (hint, look at the `whis` argument)
3. Let's say you want to flip the boxplot so the boxes are horizontal, not vertical- what argument would you change? Try it out below


**Hint:** All of these answers can be found in the documentation, but sometimes it can be more helpful to look at examples or do an internet search-- feel free to use any and all resources!

---

In [None]:
## YOUR CODE HERE

### Line Plots

Our next type of plot is a line plot, which is often used to show trends. For example, let's say that we're interested in visualizing a single country's per-capita GDP over time. To make things easier, we'll create a second `DataFrame` containing just data from Portugal to look at in this exercise.

In [None]:
portugal = gm[gm['country'] == 'Portugal']

In [None]:
portugal.head()

Before we jump into plotting the function let's take another look at the documentation [Plot Types](https://matplotlib.org/stable/plot_types/index.html) to find an example of a line plot - what plot function should we use?

Now let's make our line plot.

In [None]:
x = portugal['year']
y = portugal['gdpPercap'] 
plt.plot(x,y)
plt.title('Per-capita GDP of Portugal')
plt.xlabel('Time (Years)')
plt.ylabel('Per-capita GDP (International Dollars)');

This plot clearly shows that Portugal's per-capita GDP has been increasing over time. Cool!

Let's say we wanted to compare the GDP between Portugal and Spain. We can make a *multi-line* plot by calling `plt.plot()` twice, once on each subset of the data. Let's look at how that visualization will look below. 

**Note:** This is a case where while it is *possible* to make it with `matplotlib`, and useful for understanding how visualizations are layered, in *practice*, packages like `seaborn` are more often used for making multi-line plots.

In [None]:
spain = gm[gm['country'] == 'Spain']

**Question:** In the plots below, which line corresponds to which country? How do you know?

In [None]:
plt.figure(figsize=(12, 5))
plt.plot(spain['year'], spain['gdpPercap'])
plt.plot(portugal['year'], portugal['gdpPercap'])
plt.title('Per-capita GDP of Portugal & Spain')
plt.xlabel('Time (Years)')
plt.ylabel('Per-capita GDP (International Dollars)');

Which line represents which country? To determine that we need a legend. `matplotlib` makes it easy to create a legend. First, we need to add the `label=<country_name>` parameter to the `plt.plot()` functions, then call `plt.legend()`. Let's see it:

In [None]:
plt.figure(figsize=(12,5))
plt.plot(spain['year'], spain['gdpPercap'], label='Spain')
plt.plot(portugal['year'], portugal['gdpPercap'], label='Portugal')
plt.title('Per-capita GDP of Portugal & Spain')
plt.xlabel('Time (Years)')
plt.ylabel('Per-capita GDP (International Dollars)')
plt.legend();

Much better! Now we can compare how these two GDPs change over the time period.

### Scatter Plots

The last basic plot that we will cover in `matplotlib` is the **scatter plot**. This plot is most useful for showing the relationship between two variables.

To illustrate this we'll use `plt.scatter()` to visualize the relationship between per-capita GDP (`gdpPercap` on the $x$-axis) and life expectancy (`lifeExp` on the $y$-axis) across all countries and all years. Specifying the `marker='.'` argument tells the plot to use small circles to indicate each data point. There are many other marker styles, see [here](https://matplotlib.org/stable/api/markers_api.html) for more.


**Question:** How would you describe the relationship between GDP and life expectancy?

In [None]:
plt.scatter(gm['gdpPercap'], gm['lifeExp'], marker='.')
plt.xlabel('Per-Capita GDP (International Dollars)')
plt.ylabel('Life Expectancy (years)');

### Transformations

The above scatter plot has some really large GDP values out to the right of the plot. When dealing with data that have large outliers like this, plotting a transformation of the data can make it more interpretable. A standard transformation is to apply the `log` function, so let's try that here.

**Note:** Let's not forget to change the x-axis label to indicate the new units being displayed! 

In [None]:
plt.scatter(np.log10(gm['gdpPercap']), gm['lifeExp'], marker='.')
plt.xlabel('Log Per-Capita GDP (International Dollars)')
plt.ylabel('Life Expectancy (years)');

Now the scatter plot shows that there is a roughly linear relationship between the log of GDP and life expectancy. Log transformation is a very common technique for data where there are a small number of very large outliers, for example.
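
An alternative to transforming the data is to put the axis itself on a log scale with `plt.xscale('log')`, which keeps the tick labels in the original units. The sketch below uses synthetic, randomly generated values (not the Gapminder data) purely to show the call.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic, skewed x values (not Gapminder data).
rng = np.random.default_rng(0)
x = 10 ** rng.uniform(2.5, 4.5, size=200)
y = 20 * np.log10(x) + rng.normal(0, 3, size=200)

plt.scatter(x, y, marker='.')
# Log-scale the axis instead of the data; tick labels stay in original units.
plt.xscale('log')
plt.xlabel('x (original units, log scale)')
plt.ylabel('y')
```

Which approach reads better depends on the audience: a log-scaled axis avoids having to explain "log dollars", while a transformed variable can be more convenient for later analysis.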

### Transparency

Where points overlap, it can be hard to get a clear idea of how many points are in those regions. To fix that we can change the transparency of the markers using the `alpha` parameter. This is a value from `0`-`1`, where `0` is completely transparent (i.e. it's not displayed) and `1` is completely opaque (which is the default seen in the previous plot).

And while we're at it we'll change the fill color with the `facecolor` parameter, and the border color of each marker with the `edgecolor` parameter using one of the pre-defined `Matplotlib` colors.

For *no* color, use `'None'`. For more information on colors in `Matplotlib` see [the documentation](https://matplotlib.org/stable/api/colors_api.html).

A great place to find information on color palettes is [ColorBrewer](http://colorbrewer2.org/). Matt Davis has created a great Python package called [Palettable](https://jiffyclub.github.io/palettable/) that gives you access to the ColorBrewer, Cubehelix, Tableau, and Wes Anderson palettes.

We will discuss color in more detail in the following section. In the meantime try a few different  combinations of `edgecolor` and `facecolor` in the cell below.

In [None]:
plt.scatter(np.log10(gm['gdpPercap']), 
            gm['lifeExp'], 
            marker='o',
            alpha=.75,
            facecolor='DarkBlue',
            edgecolor='None')
plt.xlabel('Log Per-Capita GDP (International Dollars)')
plt.ylabel('Life Expectancy (years)');

---
### Challenge 3: Customizing Markers

1. Try at least three different values of alpha below. Which value is your favorite?

2. What other features of the plot might you change to make it more aesthetically pleasing or interpretable? Try out changing some of the other arguments below, or make a list of properties of the plot you would like to change.
---

In [None]:
plt.scatter(np.log10(gm['gdpPercap']), 
            gm['lifeExp'], 
            marker='o',
            alpha=1, #change transparency here
            facecolor='DarkBlue',
            edgecolor='None')
plt.xlabel('Log Per-Capita GDP (International Dollars)')
plt.ylabel('Life Expectancy (years)');

Now that we've seen that there exists a relationship between GDP and life expectancy at the global scale and across the last 50 years, let's see if we can use similar scatter plots to break that relationship down as a function of both time (year) and location (continent). 

To do that we'll introduce two new techniques, the use of color and `subplots`.

### Color

   Within a scatter plot, each data point may be assigned a different color depending on its value in a *numeric* third variable. To do so, we will use the `c` argument the same way that we define `x` or `y` in the plot. There are two ways `c` can be used:
   
   1. If `c` is a string, it will make all points in the plot that color
   2. If `c` is a column of a DataFrame (or similar) it will assign a color to each point according to the **numeric** value of the column.
   
We'll make the same exact scatter plot as we just did, but add color to represent the year the data comes from. We'll choose the color scheme for the plot by setting the `cmap` (color map) parameter.


In [None]:
plt.scatter(np.log10(gm['gdpPercap']), gm['lifeExp'], marker='.', c=gm['year'], cmap = 'hot')
plt.xlabel('Log Per-Capita GDP (International Dollars)')
plt.ylabel('Life Expectancy (Years)');

While we can see there is some sort of trend dependent on color, we don't know what the color values mean. 

We can use `plt.colorbar()` to add a colorbar which will let us interpret the colors. By adding `.set_label()` to it we can set a textual label describing what the values in the colorbar represent.

We'll also increase the figure size, and the font size used in the xlabel, ylabel and colorbar label using the `fontsize=16` parameter. We'll also use `matplotlib`'s built-in mathtext, which accepts LaTeX-style syntax between dollar signs, to write a subscript 10 after the word Log, like this: $\log_{10}$.

In [None]:
plt.figure(figsize=(10, 8))
plt.scatter(np.log10(gm['gdpPercap']), gm['lifeExp'], marker='.', c=gm['year'], cmap='hot')
plt.xlabel('$Log_{10}$ Per-Capita GDP (International Dollars)', fontsize=16)
plt.ylabel('Life Expectancy (Years)', fontsize=16)

#add a colorbar
plt.colorbar().set_label('Year', fontsize=16);

Ok, now we can see that over time, average life expectancy has also increased. Nice! 

Now let's use `subplots` to break this down even further and see if this trend holds across all continents.

#### Subplots

Subplots allow you to draw multiple plots within a single figure. To do so, you use `plt.subplot(<num_rows>, <num_cols>, <index>)` where the number of rows and columns you want in the figure are specified as the first two parameters, respectively. The `<index>` tells subplot which subplot subsequent calls to `plt` will draw in. It starts at `1` for the top-left subplot and increases from left to right across each row, then continues on the next row down. 

**Note:** `plt.subplot()` uses a 1-based index (not 0-based like Python) to emulate the behavior of the MATLAB version of this function. This can cause confusion!
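
The indexing is easiest to see by drawing it. This small sketch creates an empty 2-by-3 grid and writes each subplot's index in its center; the grid size is an arbitrary choice for illustration.

```python
import matplotlib.pyplot as plt

plt.figure(figsize=(9, 4))
n_rows, n_cols = 2, 3

# Indices run from 1 to n_rows * n_cols, inclusive.
for index in range(1, n_rows * n_cols + 1):
    ax = plt.subplot(n_rows, n_cols, index)
    # Write the index in the middle of each (empty) subplot.
    ax.text(0.5, 0.5, str(index), ha='center', va='center', fontsize=20)
    ax.set_xticks([])
    ax.set_yticks([])
```

Indices 1 to 3 land in the top row and 4 to 6 in the bottom row.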

Let's look at a simple example plotting Spain and Portugal's GDP on separate plots next to each other in the same row.

**Question:** How does this look different from the multi-line plot above? What information is being highlighted here?

In [None]:
plt.figure(figsize=(15, 5))
plt.subplot(1, 2, 1)
plt.plot(spain['year'], spain['gdpPercap'], label='Spain', color='blue')
plt.title('Spain')
plt.xlabel('Time (Years)')
plt.ylabel('Per-capita GDP (International Dollars)')
plt.legend();

plt.subplot(1, 2, 2)
plt.plot(portugal['year'], portugal['gdpPercap'], label='Portugal', color='red')
plt.title('Portugal')
plt.xlabel('Time (Years)')
plt.ylabel('Per-capita GDP (International Dollars)')
plt.legend();

To set the y-axis limits explicitly, we'll call `plt.ylim()` with the lower and upper limits in each subplot, so that both plots share the same scale. Let's see it:

In [None]:
plt.figure(figsize=(15, 5))
plt.subplot(1, 2, 1)
plt.plot(spain['year'], spain['gdpPercap'], label='Spain', color='blue')
plt.title('Spain')
plt.xlabel('Time (Years)')
plt.ylabel('Per-capita GDP (International Dollars)')
plt.ylim(2500, 30000)
plt.legend();

plt.subplot(1, 2, 2)
plt.plot(portugal['year'], portugal['gdpPercap'], label='Portugal', color='red')
plt.title('Portugal')
plt.xlabel('Time (Years)')
plt.ylabel('Per-capita GDP (International Dollars)')
plt.ylim(2500, 30000)
plt.legend();
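
Another way to give subplots a common scale is to create them together with `plt.subplots()` and pass `sharey=True`, so the limits do not have to be typed in twice. This is a sketch with made-up series, using the Axes objects that `plt.subplots()` returns rather than the `plt.subplot()` calls used above.

```python
import matplotlib.pyplot as plt

# Made-up series on very different scales.
years = [1990, 2000, 2010]
series_a = [5, 12, 30]
series_b = [3, 6, 9]

# sharey=True makes both subplots use the same y-axis limits.
fig, (ax_a, ax_b) = plt.subplots(1, 2, sharey=True, figsize=(10, 4))
ax_a.plot(years, series_a)
ax_a.set_title('Series A')
ax_b.plot(years, series_b)
ax_b.set_title('Series B')
```

With a shared axis the limits are chosen automatically to fit both series; explicit limits are still the better choice when a specific range is needed.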

---
### Challenge 4: Population and GDP

We've seen that life expectancy and per-capita GDP have a positive relationship. Now let's take a look at the relationship between population and per-capita GDP. Create a scatter plot that compares the two across all countries in 2007. Modify the scatter plot using any of the parameters we've discussed in the past sections (or other parameters as well!). What is the relationship between the two variables?

---

In [None]:
gm_latest
##YOUR CODE HERE

## Plot Customization

We've already changed several properties of the figures we've made, including the color and opacity of scatter plots, and font sizes for the title and axis label text. However, in cases when we want to apply the same customizations to multiple plots we will want to make those changes once, rather than to every single plot. We can do so by setting the **rc parameters** or default style settings for `matplotlib`.

Let's start with the Spain and Portugal side-by-side subplots from above. First let's generate the default plot again:

In [None]:
plt.figure(figsize=(15, 5))
plt.subplot(1, 2, 1)
plt.plot(spain['year'], spain['gdpPercap'], label='Spain', color='blue')
plt.title('Spain')
plt.xlabel('Time (Years)')
plt.ylabel('Per-capita GDP (International Dollars)')
plt.ylim(2500, 30000)
plt.legend();

plt.subplot(1, 2, 2)
plt.plot(portugal['year'], portugal['gdpPercap'], label='Portugal', color='red')
plt.title('Portugal')
plt.xlabel('Time (Years)')
plt.ylabel('Per-capita GDP (International Dollars)')
plt.ylim(2500, 30000)
plt.legend();

Now let's change some default rc settings directly using Python code. This will change the settings for the overall Python session - we will see what that means in a moment. The most common starting place is to use a **style sheet**, which re-defines the default settings for `matplotlib`.


#### Style Sheets

Style sheets are built-in collections of rc parameters that allow for a quick and easy way to get nice looking plots in a particular style. To use a style sheet you simply call the `plt.style.use()` function and give it the name of the style sheet you want to use. The basic `matplotlib` style that you've seen so far in this notebook is called `default`, but there are many other styles to try. Let's take a look at the "Five Thirty Eight" style sheet, the name of which you might recognize as Nate Silver's website, of New York Times data visualization fame.

For more on style sheets, print all available style sheets using `plt.style.available`, or see the [documentation](https://matplotlib.org/stable/tutorials/introductory/customizing.html).

In [None]:
plt.style.use('fivethirtyeight')

Now we'll use the same code again, and see how it has changed.

In [None]:
plt.figure(figsize=(15, 5))
plt.subplot(1, 2, 1)
plt.plot(spain['year'], spain['gdpPercap'], label='Spain', color='blue')
plt.title('Spain')
plt.xlabel('Time (Years)')
plt.ylabel('Per-capita GDP (International Dollars)')
plt.ylim(2500, 30000)
plt.legend();

plt.subplot(1, 2, 2)
plt.plot(portugal['year'], portugal['gdpPercap'], label='Portugal', color='red')
plt.title('Portugal')
plt.xlabel('Time (Years)')
plt.ylabel('Per-capita GDP (International Dollars)')
plt.ylim(2500, 30000)
plt.legend();

While the default style settings are certainly sufficient for exploratory data analysis and visualizing trends in data, one of the best things about plotting with Python is extensive control over the style for more polished figures. Almost every component of the plot can be customized, from color to font type and size, to placement of the legend. And if we want to further fine tune our customization, we can set one or more `rcParams` by hand ourselves.

Let's say we like the fivethirtyeight style but we want to remove the background gridlines. We can combine `rcParams` with the fivethirtyeight style to remove the grid in the background, while retaining the other aspects of the style. This means that it can be very efficient to find a style sheet that is close to the desired final product, then tune it to our needs. 

`rcParams` is a dictionary with the structure *key* = setting name and *value* = setting value. So to set a new parameter we call `plt.rcParams[setting name] = new value`. The trick is knowing the names and values that you want to change- see this handy [guide](https://matplotlib.org/stable/tutorials/introductory/customizing.html) with a description of all the settings you can change, or try an internet search!
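
Because `rcParams` behaves like a dictionary, one quick way to hunt for a setting name is to search its keys. This sketch looks for every setting whose name contains the word "grid"; the search term is just an example.

```python
import matplotlib.pyplot as plt

# rcParams behaves like a dictionary, so its keys can be searched.
grid_keys = [key for key in plt.rcParams if 'grid' in key]
print(grid_keys)

# Look up the current value of one setting.
print(plt.rcParams['axes.grid'])
```

The printed list should include `axes.grid` along with other grid-related settings such as the grid color and line style.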

Now let's turn off the grid in our plot by setting `axes.grid` equal to `False`. Notice that we *don't* need to set the style sheet to `fivethirtyeight` again, because changing the style sheet changes the defaults until the kernel is restarted or the style sheet is set again.

In [None]:
plt.rcParams['axes.grid'] = False


plt.figure(figsize=(15, 5))
plt.subplot(1, 2, 1)
plt.plot(spain['year'], spain['gdpPercap'], label='Spain', color='blue')
plt.title('Spain')
plt.xlabel('Time (Years)')
plt.ylabel('Per-capita GDP (International Dollars)')
plt.ylim(2500, 30000)
plt.legend();

plt.subplot(1, 2, 2)
plt.plot(portugal['year'], portugal['gdpPercap'], label='Portugal', color='red')
plt.title('Portugal')
plt.xlabel('Time (Years)')
plt.ylabel('Per-capita GDP (International Dollars)')
plt.ylim(2500, 30000)
plt.legend();
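
The changes above stay in effect for the rest of the session. When a style or setting is only wanted for a single figure, `matplotlib` also provides the context managers `plt.style.context()` and `plt.rc_context()`. The sketch below uses the `ggplot` style and a trivial line plot as stand-ins; the variable names are ours.

```python
import matplotlib.pyplot as plt

grid_before = plt.rcParams['axes.grid']

# Both context managers restore the previous settings when the block ends.
with plt.style.context('ggplot'), plt.rc_context({'axes.grid': False}):
    grid_inside = plt.rcParams['axes.grid']
    plt.plot([0, 1, 2], [0, 1, 4])
    plt.title('ggplot style, grid turned off')

grid_after = plt.rcParams['axes.grid']
print(grid_after == grid_before)  # True
```

Inside the `with` block the figure gets the `ggplot` look without its grid, and afterwards the settings return to whatever they were before.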

---
### Challenge 5: Customizing a style sheet

Now let's explore another plot customization. For the plot below:

1. Look at the [documentation](https://matplotlib.org/stable/gallery/style_sheets/style_sheets_reference.html) and pick a style that you find appealing.

2. Choose a [colormap](https://matplotlib.org/stable/tutorials/colors/colormaps.html) for the `cmap` parameter.

3. Use `rcParams` to customize one (or more) aspects of the style. 

4. Are there any further changes that you'd like to make and aren't sure how? 

---

In [None]:
## Choose style sheet

## Customize one or more parameters


plt.figure(figsize=(10, 8))
plt.scatter(np.log10(gm['gdpPercap']), gm['lifeExp'], marker='.', c=gm['year'], cmap='hot') #modify the cmap here
plt.xlabel('$Log_{10}$ Per-Capita GDP (International Dollars)', fontsize=16)
plt.ylabel('Life Expectancy (Years)', fontsize=16)

#add a colorbar
plt.colorbar().set_label('Year', fontsize=16);

Best practice is to specify any style sheets or parameters at the *top* of your notebook, right after package imports and before you create any visualizations. This helps your code be more replicable.
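
As a rough sketch, a "setup cell" following this practice might look like the one below. The specific style and values are only examples chosen for illustration, not a recommendation.

```python
import matplotlib.pyplot as plt

# Hypothetical setup cell: imports first, then the style sheet, then overrides.
plt.style.use('fivethirtyeight')           # pick the base style sheet
plt.rcParams['axes.grid'] = False          # then override individual settings
plt.rcParams['figure.figsize'] = (10, 8)   # example default figure size
```

Every figure created later in the notebook picks up these defaults, so anyone rerunning the notebook from the top gets the same look.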

## Seaborn

Now we will turn to the second major package for visualization, `seaborn`. 

While `seaborn` also provides convenient and useful style changes over `matplotlib`, its key benefits come in terms of more complex types of plots, support for statistical analysis (such as regression and error bars), and integration with `pandas`.

However, all of this functionality is built on the foundation of `matplotlib`, so understanding that plotting package well is essential to using `seaborn` effectively. A typical visualization will include a `seaborn` base plot, combined with `matplotlib` code for items such as the title and axis labels.

Let's try it out!

In [None]:
import seaborn as sb
sb.set(rc={'axes.facecolor' : '#EEEEEE'})

The `sb.set()` function is another way to change some of the `rcParams`. Here, we're changing the plot's face color.
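
To see that this really is the same mechanism, the small check below reads the setting back from `matplotlib` after calling `sb.set()`. The `to_hex` helper is used here only to print the color in a consistent format.

```python
import matplotlib.colors as mcolors
import matplotlib.pyplot as plt
import seaborn as sb

sb.set(rc={'axes.facecolor': '#EEEEEE'})

# The value passed through `rc` is now visible in matplotlib's own rcParams.
facecolor = mcolors.to_hex(plt.rcParams['axes.facecolor'])
print(facecolor)
```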

`seaborn` has the capacity to create a large number of informative, beautiful plots very easily. Here we'll review several types, but please visit their [gallery](https://seaborn.pydata.org/examples/index.html) for a more complete picture of all that you can do with `seaborn`.

### Boxplots Revisited
We previously looked at boxplots in `matplotlib`. Let's now use `seaborn` to look at the distributions of life expectancies separately for each continent. 

`seaborn` includes native support for `pandas` data structures. To use it, we specify the `DataFrame` we want to take the data from in the `data` parameter. We then specify which column names (or variables) from that `DataFrame` to use for `x` and `y`. `x` specifies the variable to group by, and `y` specifies the variable whose distribution should be plotted.

We'll plot the continents in alphabetical order by specifying the `order=` parameter, make the box face colors all white using `color='white'`, and disable drawing of the fliers (or outliers) using `fliersize=0`.

In addition to the boxplot, we'll plot a **stripplot**, which overlays a scatter plot of all the data points over each box. In combination, they're quite useful for understanding distributions. The stripplot takes many of the same parameters as the boxplot, plus `jitter=True`, which causes the scatter markers to be slightly jittered horizontally.

In [None]:
plt.figure(figsize=(10, 8))

sb.boxplot(x="continent", y="lifeExp", data=gm,
           order=np.sort(gm.continent.unique()),
           color='white', fliersize=0)

sb.stripplot(x="continent", y="lifeExp", data=gm,
             order=np.sort(gm.continent.unique()),
             alpha=0.1, size=5, jitter=True,
             color='black', edgecolor='black')

plt.title("Life Expectancy by Continent")
plt.xlabel('Continent')
plt.ylabel('Life Expectancy');

With just a few lines of code we have a very nice looking plot using `seaborn`. It's possible to create a stripplot using `matplotlib`, but it's not as easy as it is with `seaborn`.

### Lineplots Revisited

Now let's go back to look at how `seaborn` can simplify making multi-line plots. Let's return to the GDP by continent over time plot. First let's look at how this plot was generated in `matplotlib`. 

**Question:** How many times does `plot` get called in the code chunk below?

In [None]:
#calculate the mean gdp by continent by year
per_continent_mean_gdp = gm.groupby(['continent', 'year'], as_index=False)['gdpPercap'].mean()

plt.figure(figsize=(10, 8))
#plot one line per continent
for continent in gm['continent'].unique():
    cur_continent_df = per_continent_mean_gdp[per_continent_mean_gdp['continent'] == continent]
    plt.plot(cur_continent_df['year'], 
             cur_continent_df['gdpPercap'], 
             alpha=0.75, 
             label=continent)

plt.title('Continent-Level Average GDP Per Capita, by Year')
plt.xlabel('Year')
plt.ylabel('Average GDP Per Capita')
plt.legend(loc='upper left');


Now let's make the same plot with `seaborn`. The `hue` argument can be used to generate the multiple lines without needing to plot each line individually. Here, we pass a column name to `hue`, which tells `seaborn` to plot each group in that column in a different color. Behind the scenes, the operations are very similar to the above plot, but this reduces the amount of code that the user needs to write.

**Note:** `hue` contrasts with the `color` argument, which is used to specify a single color for the whole plot rather than a grouping variable.

In [None]:
plt.figure(figsize=(10, 8))

sb.lineplot(data=gm, x='year', y='gdpPercap', hue='continent')

plt.title('Continent-Level Average GDP Per Capita, by Year')
plt.xlabel('Year')
plt.ylabel('Average GDP Per Capita')
plt.legend(loc='upper left');

The other main difference between this and the previous approach is that there is no need to aggregate the data first (i.e., calculate the mean GDP for each year) because `seaborn` does this for us and automatically draws an error band around each line. We can control both behaviors with the `estimator` (aggregation function) and `errorbar` (error estimation) arguments.

---
### Challenge 6: Exploring Seaborn 

Let's further customize this plot in a few different ways. (**Hint:** The [documentation](https://seaborn.pydata.org/generated/seaborn.lineplot.html) will be a great resource) 

1. Redundant contrasts are important for robust, accessible visualizations. Modify this plot to have a second contrast in addition to color (such as line size, line type) using the `continent` column.

2. Let's also modify our plot to use another type of error measurement. Identify the argument to modify error measurement type and try 2-3 different configurations. Which one is your favorite? (**Hint:** Look at the examples at the bottom of the documentation to see possible error configurations)

3. Write down one other modification you might make, either aesthetically or substantively, to the plot. **Bonus:** What argument would you modify to make that change?
---

In [None]:
## Modify the code below to generate your plot

sb.lineplot(data=gm, x='year', y='gdpPercap', hue='continent')

plt.title('Continent-Level Average GDP Per Capita, by Year')
plt.xlabel('Year')
plt.ylabel('Average GDP Per Capita')
plt.legend(loc='upper left');

### Other plots with `seaborn`

`seaborn` also offers many types of plots that build on the basic visualizations we've explored so far in this workshop. A key skill for using `seaborn` is to identify what you want to visualize and the appropriate plot for doing so.

Let's say that we have three numeric variables and are interested in looking at the relationship / potential correlations between each pair of variables.

**Question:** What basic plot would you use to do this? How many times would `subplot` be called?

We can also take a look at the `seaborn` [gallery](https://seaborn.pydata.org/examples/index.html) to identify a plot type that makes this easier. In this case let's try out the [`pairplot`](https://seaborn.pydata.org/generated/seaborn.pairplot.html#seaborn.pairplot) on `gm_latest` (data from all countries for 2007).

In [None]:
sb.pairplot(gm_latest)

This is pretty close, except since we subset the data to just one year, the `year` column isn't informative. So let's plot it again, but drop the `year` column. Let's also color our points by continent to add more information to the plot.

In [None]:
sb.pairplot(gm_latest.drop(['year'],axis=1),hue='continent')

Using `pairplot()` reduced the code we need for generating this plot by many lines, but the same result can also be achieved by using `scatterplot` + `kdeplot` (a close relative of `histogram`) + `subplot`. Most plots are variations of or based on the basic plots discussed above.
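
For comparison, a hand-built version of this kind of grid might look like the sketch below. It uses made-up random data rather than `gm_latest`, and it swaps in plain `matplotlib` histograms on the diagonal instead of `kdeplot`, so it is only a rough stand-in for what `pairplot` draws.

```python
import numpy as np
import matplotlib.pyplot as plt

# Toy data (made up for illustration): three numeric variables.
rng = np.random.default_rng(0)
columns = {'a': rng.normal(size=50),
           'b': rng.normal(size=50),
           'c': rng.normal(size=50)}
names = list(columns)

# One subplot per pair of variables: a 3 x 3 grid.
fig, axes = plt.subplots(len(names), len(names), figsize=(8, 8))
for row, y_name in enumerate(names):
    for col, x_name in enumerate(names):
        ax = axes[row, col]
        if row == col:
            # Diagonal: distribution of a single variable.
            ax.hist(columns[x_name])
        else:
            # Off-diagonal: relationship between two variables.
            ax.scatter(columns[x_name], columns[y_name], marker='.')
        if row == len(names) - 1:
            ax.set_xlabel(x_name)
        if col == 0:
            ax.set_ylabel(y_name)
fig.tight_layout()
```

Even without coloring by group, the loop and labeling logic take far more code than the single `pairplot` call.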

---
### Challenge 7: Exploring new plots in `seaborn`

Now that we've covered all of the basics of visualization with Python, let's do some exploring! One of the benefits of using Python for data visualization is that it can give different angles and insights into the data set.

1. Look at the [gallery](https://seaborn.pydata.org/examples/index.html) and choose another plot that you'd like to try out. It can help to think about what type of information you'd like to communicate, or choose a plot that you'd like to learn how to use.
2. Read the documentation and look at the examples to see how the plot is implemented.
3. Visualize some aspect of the gapminder dataset and use `matplotlib` to customize the plot to highlight the most relevant information. Include an informative title and axis labels, if appropriate. Write down a wishlist of any further customizations/adjustments that you would like to make to the visualization.
4. What was the most useful resource for learning how to use this plot? What remaining questions do you have?

If you aren't sure what plot to use, consider `kdeplot` (smoothed histogram), `regplot` (scatterplot with regression), or `jointplot` (bivariate scatterplot + univariate histograms).


---


In [None]:
# YOUR CODE HERE

## Saving Plots

Finally, if you'd like to save a plot you can use the `plt.savefig` function that is part of the matplotlib package. This will create an image file saved to wherever you specify. Running the cell below will save the plot as `facetgrid_graph.png` within your current directory. Refer to `help(plt.savefig)` for more documentation on this function.

In [None]:
sb.relplot(x='year', 
            y='gdpPercap', 
            hue='continent',
            col='continent', 
            col_wrap=3, 
            legend=None,
            kind='line',
            facet_kws={'sharex':False},
            data=per_continent_mean_gdp);
plt.savefig('facetgrid_graph.png')
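
A few optional arguments to `savefig` are often useful for polished figures. The sketch below uses a made-up toy figure and a temporary directory so it does not clutter your working directory; the file names are only examples. `dpi` controls the resolution of raster formats such as PNG, `bbox_inches='tight'` trims extra whitespace around the figure, and the output format is inferred from the file extension.

```python
import os
import tempfile
import matplotlib.pyplot as plt

# Toy figure (made up for illustration).
fig, ax = plt.subplots()
ax.plot([1, 2, 3], [2, 4, 8])
ax.set_title('Toy figure')

# Save into a temporary directory; the file names are only examples.
out_dir = tempfile.mkdtemp()
png_path = os.path.join(out_dir, 'toy_figure.png')
pdf_path = os.path.join(out_dir, 'toy_figure.pdf')

fig.savefig(png_path, dpi=300, bbox_inches='tight')   # high-resolution raster image
fig.savefig(pdf_path, bbox_inches='tight')            # vector format
print(os.path.exists(png_path), os.path.exists(pdf_path))
```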

## Going further

There are many visualization libraries in Python and you now have experience in using two popular ones.

Several other Python visualization libraries exist for creating *interactive* visualizations such as [Plotly](https://plot.ly/python/), [Bokeh](http://bokeh.pydata.org/en/latest/), or [Toyplot](http://toyplot.readthedocs.org/en/stable/tutorial.html#getting-started).