# Matplotlib
Data visualization is a key skill for aspiring data scientists. Matplotlib makes it easy to create meaningful and insightful plots. In this Jupyter notebook, you’ll learn how to build various types of plots, and customize them to be more visually appealing and interpretable.

#### Import libraries and other dependencies

In [None]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
%matplotlib inline
plt.rcParams['figure.dpi'] = 180

In [None]:
# Import gapminder data
gapminder = pd.read_csv('../../data/gapminder.csv')

# Create lists from gapminder columns
gdp_cap = gapminder['gdp_cap'].tolist()
life_exp = gapminder['life_exp'].tolist()
pop = gapminder['population'].tolist()

In [None]:
# Import gapminder_2 data
gapminder_2 = pd.read_csv('../../data/gapminder_2.csv')

# Create lists from gapminder_2 columns
year = gapminder_2['Year'].tolist()
population = gapminder_2['Pop'].tolist()
life_exp_1950 = gapminder_2['life_exp1950'].tolist()

# Basic plots with Matplotlib
You will learn how to visualize data and how to store data in new data structures. Along the way, you will master control structures, which you need to customize the flow of your scripts and algorithms. Data scientists use these tools every day. We'll finish this chapter with a case study in which you blend together everything you've learned to solve an exciting challenge.
<br>
This Jupyter notebook is about data visualization, which is a very important part of data analysis. First, you will use it to explore your dataset: the better you understand your data, the better you'll be able to extract insights. Once you've found those insights, you'll need visualization again to share them with other people.

# Line plot ( 1 )
With matplotlib, you can create a bunch of different plots in Python. The most basic plot is the line plot.

The World Bank has estimates of the world population for the years 1950 up to 2100. The years are in a list called `year`, and the corresponding populations are in a list called `population`.

In [None]:
# Print the last item from year and population
print(year[-1])
print(population[-1])

# Make a line plot: year on the x-axis, population on the y-axis
plt.plot(year, population)

# Show plot
plt.show()

# Line plot ( 2 ): Interpretation
Have another look at the plot you created. Based on the plot, in approximately what year will there be more than **ten billion** human beings on this planet?
<br>
Answer: 2060
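
The same question can also be answered programmatically instead of by reading the plot. The sketch below uses a small made-up projection rather than the World Bank data, and the names `demo_year` and `demo_population` are hypothetical, chosen so they do not overwrite the notebook's own lists.

```python
import numpy as np

# Hypothetical projection in billions of people (NOT the World Bank data)
demo_year = np.array([2040, 2050, 2060, 2070, 2080])
demo_population = np.array([9.2, 9.7, 10.1, 10.4, 10.6])

# Boolean mask: True for every year in which the population exceeds ten billion
above_ten = demo_population > 10

# np.argmax returns the index of the first True value in the mask
first_year = int(demo_year[np.argmax(above_ten)])
print(first_year)  # 2060
```

Note that `np.argmax` returns 0 when the mask contains no `True` at all, so with real data it is worth checking `above_ten.any()` first.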

# Line plot ( 3 )
Now that you've built your first line plot, let's start working on the data that Professor Hans Rosling used to build his beautiful bubble chart. The data was collected in 2007. Two lists are available for you:
- `life_exp` which contains the life expectancy for each country and
- `gdp_cap`, which contains the GDP per capita (i.e. per person) for each country expressed in US Dollars.

GDP stands for Gross Domestic Product. It basically represents the size of the economy of a country. Divide this by the population, and you get the GDP per capita.
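
As a quick illustration of that division, here is a minimal sketch with made-up figures for an imaginary country. The variable names and numbers are assumptions for illustration only and do not come from the gapminder data.

```python
# Hypothetical figures for an imaginary country (not from the gapminder data)
demo_gdp_usd = 500_000_000_000   # total GDP: 500 billion US dollars
demo_inhabitants = 25_000_000    # 25 million inhabitants

# GDP per capita is the total GDP divided by the number of inhabitants
demo_gdp_per_capita = demo_gdp_usd / demo_inhabitants
print(demo_gdp_per_capita)  # 20000.0
```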

In [None]:
# Print the last item of gdp_cap and life_exp
print(gdp_cap[-1])
print(life_exp[-1])

# Make a line plot, gdp_cap on the x-axis, life_exp on the y-axis
plt.plot(gdp_cap, life_exp)

# Show plot
plt.show()

# Scatter plot ( 1 )
When you have a timescale along the horizontal axis, the line plot is your friend. But in many other cases, when you're trying to assess if there's a correlation between two variables, the scatter plot is the better choice.

In [None]:
# Scatter plot instead of a line plot
plt.scatter(gdp_cap, life_exp)

# Put the x-axis on a logarithmic scale
plt.xscale('log')

# Show plot
plt.show()

# Scatter plot ( 2 )
You saw that a higher GDP per capita usually corresponds to a higher life expectancy. In other words, there is a positive correlation.
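
A positive correlation can also be expressed as a number. The sketch below computes the Pearson correlation coefficient with `np.corrcoef` on a handful of made-up values; the `demo_` arrays are hypothetical and only stand in for the real lists. GDP is log-transformed first, mirroring the logarithmic x-axis of the scatter plot.

```python
import numpy as np

# Hypothetical values for five imaginary countries (not the gapminder data)
demo_gdp_cap = np.array([800.0, 2500.0, 9000.0, 25000.0, 45000.0])
demo_life_exp = np.array([52.0, 61.0, 70.0, 77.0, 81.0])

# Log-transform GDP per capita, as the log-scaled x-axis does visually
log_gdp = np.log10(demo_gdp_cap)

# np.corrcoef returns a 2x2 matrix; the off-diagonal entry is the correlation
r = np.corrcoef(log_gdp, demo_life_exp)[0, 1]
print(round(float(r), 3))
```

A value close to 1 indicates a strong positive linear relationship, a value close to -1 a strong negative one, and a value near 0 no linear relationship.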

Do you think there's a relationship between the population and the life expectancy of a country? The list `life_exp` from the previous exercise is already available. In addition, the list `pop` is now available, containing the corresponding populations for the countries in 2007. The populations are in millions of people.

In [None]:
# Build Scatter plot
plt.scatter(pop, life_exp)

plt.xlabel('Population')
plt.ylabel('Life Expectancy')

# Show plot
plt.show()

# Histogram
The histogram is a type of visualization that's very useful to explore your data. It can help you to get an idea about the distribution of your variables.
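
Under the hood, a histogram splits the range of the data into equal-width bins and counts how many values fall into each one. The sketch below shows that counting step with `np.histogram` on made-up values; `demo_values` is a hypothetical name and the numbers are not taken from the gapminder data.

```python
import numpy as np

# Hypothetical life expectancies in years (not the gapminder data)
demo_values = np.array([43, 47, 52, 58, 61, 66, 70, 72, 73, 75, 78, 80])

# Split the range 43..80 into 4 equal-width bins and count the values per bin
counts, edges = np.histogram(demo_values, bins=4)

print(counts.tolist())  # [3, 2, 2, 5]
print(edges.tolist())   # [43.0, 52.25, 61.5, 70.75, 80.0]
```

There is always one more edge than there are bins, and the counts add up to the number of values.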

# Build a histogram ( 1 )
To see how life expectancy in different countries is distributed, let's create a histogram of `life_exp`, the list containing the life expectancy for different countries in 2007.

In [None]:
# Create histogram of life_exp data
plt.hist(life_exp)

# Show histogram
plt.show()

# Build a histogram ( 2 ): bins
In the cell above, you didn't specify the number of `bins`. In that case, Matplotlib uses 10 `bins` by default. The number of `bins` is pretty important. Too few `bins` will oversimplify reality and won't show you the details. Too many `bins` will overcomplicate reality and won't show the bigger picture.

To control how many `bins` your data is divided into, set the `bins` argument.
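
To see the effect of the bin count in numbers, the following sketch counts the same made-up values with 2 and with 6 bins using `np.histogram`. The data and the names `demo_values` and `results` are assumptions for illustration.

```python
import numpy as np

# Hypothetical life expectancies in years (not the gapminder data)
demo_values = np.array([43, 47, 52, 58, 61, 66, 70, 72, 73, 75, 78, 80])

results = {}
for n_bins in (2, 6):
    counts, edges = np.histogram(demo_values, bins=n_bins)
    bin_width = float(edges[1] - edges[0])
    results[n_bins] = counts.tolist()
    print(n_bins, results[n_bins], round(bin_width, 2))
```

With 2 bins the data collapses into two coarse groups, while 6 narrower bins reveal where the values cluster.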

In [None]:
# Build histogram with 5 bins
plt.hist(life_exp, bins=5)

# Show histogram
plt.show()

In [None]:
# Build histogram with 20 bins
plt.hist(life_exp, bins=20)

# Show histogram
plt.show()


# Build a histogram ( 3 ): compare
You saw population pyramids for the present day and for the future. Because we were using a histogram, it was very easy to make a comparison.

Do a similar comparison. `life_exp` contains life expectancy data for different countries in 2007. You also have access to a second list now, `life_exp_1950`, containing similar data for 1950. Can you make a histogram for both datasets?

You'll again be making two plots, one per cell. Each cell ends with `plt.show()` to render its figure. If you draw both plots in a single cell or script instead, `plt.clf()` clears the current figure between them.
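
Two histograms are easiest to compare when they share the same bin edges. One possible approach, sketched below with made-up values, is to compute common edges from both datasets with `np.histogram_bin_edges` and reuse them for each count. The `demo_` arrays are hypothetical and not the gapminder lists.

```python
import numpy as np

# Hypothetical life expectancies for two years (not the gapminder data)
demo_exp_1950 = np.array([30, 35, 40, 45, 50, 60])
demo_exp_2007 = np.array([50, 60, 65, 70, 75, 80])

# Compute 5 common bins spanning the combined range of both datasets
combined = np.concatenate([demo_exp_1950, demo_exp_2007])
shared_edges = np.histogram_bin_edges(combined, bins=5)

# Count each dataset against the same edges so the bins line up
counts_1950, _ = np.histogram(demo_exp_1950, bins=shared_edges)
counts_2007, _ = np.histogram(demo_exp_2007, bins=shared_edges)

print(shared_edges.tolist())  # [30.0, 40.0, 50.0, 60.0, 70.0, 80.0]
print(counts_1950.tolist())   # [2, 2, 1, 1, 0]
print(counts_2007.tolist())   # [0, 0, 1, 2, 3]
```

`plt.hist()` also accepts a sequence of bin edges for its `bins` argument, so the same `shared_edges` could be passed to both histogram calls.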

In [None]:
# Histogram of life_exp, 15 bins
plt.hist(life_exp, bins=15)

# Show histogram
plt.show()

In [None]:
# Histogram of life_exp_1950, 15 bins
plt.hist(life_exp_1950, bins=15)

# Show histogram
plt.show()

# Customization
For each visualization, you have many options. First, there are the different plot types. And for each plot, the possible customizations are practically unlimited: you can change colors, shapes, labels, axes, and so on. The choice depends on two things: the data, and the story you want to tell with this data. Since there are so many possible customizations, the best way to learn them is by example.

# Labels
You're going to work on the scatter plot with world development data: GDP per capita on the x-axis (logarithmic scale), life expectancy on the y-axis.

As a first step, let's add axis labels and a title to the plot. You can do this with the `xlabel()`, `ylabel()` and `title()` functions, available in `matplotlib.pyplot`.
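
The pyplot functions act on the current axes. As an alternative sketch, the same labels can be set through the methods of an `Axes` object, which is convenient once a figure holds more than one plot. The three data points and the title below are made up for illustration.

```python
import matplotlib.pyplot as plt

# Create a figure with a single Axes object
fig, ax = plt.subplots()

# Three made-up points (not the gapminder data)
ax.scatter([1000, 10000, 40000], [55, 70, 80])
ax.set_xscale('log')

# Axes equivalents of plt.xlabel(), plt.ylabel() and plt.title()
ax.set_xlabel('GDP per Capita [in USD]')
ax.set_ylabel('Life Expectancy [in years]')
ax.set_title('Demo plot with made-up data')

# In a notebook the figure renders at the end of the cell;
# in a script, call plt.show() to display it.
```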

In [None]:
# Basic scatter plot, log scale
plt.scatter(gdp_cap, life_exp)
plt.xscale('log')

# Strings
xlabel = 'GDP per Capita [in USD]'
ylabel = 'Life Expectancy [in years]'
title = 'World Development in 2007'

# Add axis labels
plt.xlabel(xlabel)
plt.ylabel(ylabel)

# Add title
plt.title(title)

# After customizing, display the plot
plt.show()

# Ticks
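You can control the y-ticks by passing two arguments to `plt.yticks()`: the tick positions and the labels to display at those positions, for example `plt.yticks([0,1,2], ["one","two","three"])`. The x-axis works the same way with `plt.xticks()`, which the cell below uses.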
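
The following minimal sketch applies the `plt.yticks()` call quoted above to a tiny made-up line plot and then reads the ticks back. Calling `plt.yticks()` without arguments returns the current positions and labels; the plotted data is an assumption for illustration.

```python
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot([0, 1, 2], [0, 1, 2])  # made-up data

# First argument: tick positions; second argument: labels shown at those positions
plt.yticks([0, 1, 2], ['one', 'two', 'three'])

# Without arguments, plt.yticks() returns the current positions and labels
locs, labels = plt.yticks()
tick_positions = [float(loc) for loc in locs]
tick_texts = [label.get_text() for label in labels]
print(tick_positions)
print(tick_texts)
```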

In [None]:
# Scatter plot
plt.scatter(gdp_cap, life_exp)

# Previous customizations
plt.xscale('log')
plt.xlabel('GDP per Capita [in USD]')
plt.ylabel('Life Expectancy [in years]')
plt.title('World Development in 2007')

# Definition of tick_values and tick_labels
tick_values = [1000, 10000, 100000]
tick_labels = ['1k', '10k', '100k']

# Adapt the ticks on the x-axis
plt.xticks(tick_values, tick_labels)

# After customizing, display the plot
plt.show()

# Sizes
Right now, the scatter plot is just a cloud of blue dots, indistinguishable from each other. Let's change this. Wouldn't it be nice if the size of each dot corresponded to the country's population?

To accomplish this, the list `pop` is loaded in your workspace. It contains the population of each country, expressed in millions. In the cell below, this list is converted to a NumPy array, doubled, and passed to `plt.scatter()` as the argument `s`, for size.
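
In `plt.scatter()`, the `s` argument is the marker area in points squared, not the diameter. The sketch below, using made-up populations under the hypothetical name `demo_pop`, shows what that means for the visible dot sizes.

```python
import numpy as np

# Hypothetical populations in millions (not the gapminder data)
demo_pop = np.array([4.0, 25.0, 100.0])

# Doubling the values doubles each marker's area
demo_sizes = demo_pop * 2

# Because s is an area, the visible diameter only grows with the square root
relative_diameter = np.sqrt(demo_sizes / demo_sizes[0])

print(demo_sizes.tolist())         # [8.0, 50.0, 200.0]
print(relative_diameter.tolist())  # [1.0, 2.5, 5.0]
```

A country with 25 times the population therefore gets a dot with 25 times the area but only 5 times the diameter.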

In [None]:
# Store pop as a numpy array: np_pop
np_pop = np.array(pop)

# Double np_pop
np_pop = np_pop * 2

# Update: set s argument to np_pop
plt.scatter(gdp_cap, life_exp, s=np_pop)

# Previous customizations
plt.xscale('log')
plt.xlabel('GDP per Capita [in USD]')
plt.ylabel('Life Expectancy [in years]')
plt.title('World Development in 2007')
plt.xticks([1000, 10000, 100000],['1k', '10k', '100k'])

# Display the plot
plt.show()

# Colors
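
The `c` argument of `plt.scatter()` sets the marker colors and `alpha` sets their opacity, from 0 (transparent) to 1 (opaque). The cell below expects a list `col` with one color per country. How that list is built is not shown in this notebook, so the following is only a sketch of one possible approach: mapping a continent name to a color. The continent list, the color mapping, and the names `demo_continents`, `color_by_continent` and `demo_col` are all hypothetical.

```python
# Hypothetical continent for each of five imaginary countries
demo_continents = ['Asia', 'Europe', 'Africa', 'Americas', 'Asia']

# Hypothetical mapping from continent to a Matplotlib color name
color_by_continent = {
    'Asia': 'red',
    'Europe': 'green',
    'Africa': 'blue',
    'Americas': 'yellow',
    'Oceania': 'black',
}

# Build one color per country, in the same order as the data points
demo_col = [color_by_continent[continent] for continent in demo_continents]
print(demo_col)  # ['red', 'green', 'blue', 'yellow', 'red']
```

A list built this way must have the same length and order as the x and y values passed to `plt.scatter()`.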

In [None]:
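# Specify c and alpha inside plt.scatter()
# Note: `col` must hold one color per country and is not defined earlier in this
# notebook; define it before running this cell
plt.scatter(x=gdp_cap, y=life_exp, s=np.array(pop) * 2, c=col, alpha=0.8)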

# Previous customizations
plt.xscale('log')
plt.xlabel('GDP per Capita [in USD]')
plt.ylabel('Life Expectancy [in years]')
plt.title('World Development in 2007')
plt.xticks([1000,10000,100000], ['1k','10k','100k'])

# Show the plot
plt.show()

# Additional Customizations
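
`plt.text(x, y, s)` places the string `s` at the data coordinates `(x, y)`, and `plt.grid(True)` draws grid lines. When several points need labels, a loop keeps the code short. The sketch below uses the equivalent `Axes` methods on two made-up points; the names and coordinates in `demo_points` are hypothetical.

```python
import matplotlib.pyplot as plt

fig, ax = plt.subplots()

# Hypothetical labelled points: name -> (x, y) in data coordinates
demo_points = {'A': (1500, 65), 'B': (6000, 73)}

for name, (x, y) in demo_points.items():
    ax.scatter(x, y)
    ax.text(x, y, name)  # label drawn at the point's data coordinates

ax.set_xscale('log')
ax.grid(True)

# Collect the label strings that were added to the axes
label_texts = [text.get_text() for text in ax.texts]
print(label_texts)  # ['A', 'B']
```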

In [None]:
# Scatter plot (uses the same `col` color list as the previous cell)
plt.scatter(x=gdp_cap, y=life_exp, s=np.array(pop) * 2, c=col, alpha=0.8)

# Previous customizations
plt.xscale('log')
plt.xlabel('GDP per Capita [in USD]')
plt.ylabel('Life Expectancy [in years]')
plt.title('World Development in 2007')
plt.xticks([1000,10000,100000], ['1k','10k','100k'])

# Additional customizations
plt.text(1550, 71, 'India')
plt.text(5700, 80, 'China')

# Add grid() call
plt.grid(True)

# Show the plot
plt.show()