# Python for Environmental Science Day 7
## Topics
* Making Nice Plots
* Matplotlib Usage
* Matplotlib Tips and Tricks

## Making Nice Plots
This day covers the usage of matplotlib (and to some extent seaborn), but first I would like to get a concept across that will help you make nicer plots: the [data-ink ratio](https://youtu.be/JIMUzJzqaA8). This concept helps you focus your graphics on their most essential parts. The following two bar charts are extreme examples of a very low (left) and a very high (right) data-ink ratio.


![Chilling](https://cdn-images-1.medium.com/max/1200/1*s_SdOBsrJizFfKs0m5PKug.png)

This does not mean that every graphic you make has to be as spartan as the right one, but you should always think about which parts of your graphic are essential to its message. If you want to learn a bit more about this and similar concepts, take a look at [this article](https://medium.com/marax-ai/intelligent-signals-visualising-data-df9152c10b00). You might also want to read [this article](http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1003833) for "10 simple rules for better figures", but this is optional.

### Practice Questions
* Explain the data-ink ratio in your own words. Can you think of a graphic you made yourself that had a bad data-ink ratio?
* What is the lie-factor of a graphic (see article)?

### Exercise 1
Try to find two graphics with a good data-ink ratio and two with a bad one. Good starting points for your search are [data is beautiful](https://www.reddit.com/r/dataisbeautiful/) and [data is ugly](https://www.reddit.com/r/dataisugly/).

## Matplotlib Usage
With this out of the way, let us start with the real topic for today: **matplotlib**. matplotlib is the main Python package for 2D graphics (though it can do a bit of 3D as well) and allows you to create publication-ready figures with relative ease. For starters, take a look at [this video](https://youtu.be/V-OWkPCYa0s) to get a general feel for the way matplotlib is structured and the cool things you will be capable of once you get the hang of it. The video is a bit hard to follow, but I could not find a better one that covers this topic; the next section outlines what you should have understood from it. You can probably stop watching after about 20 minutes, as he mainly talks about maps after that. Then watch [this video](https://youtu.be/q7Bo_J8x_dw) for other basic information and simple plotting.

## Matplotlib Structure
As we learned in the video, matplotlib is relatively similar to MATLAB, but has a few differences. First of all, it is free and open source, which allows a much larger community to work on it. The second important thing is the anatomy of matplotlib.

![Chilling](https://image.slidesharecdn.com/pyconcanada2015-151111021204-lva1-app6892/95/matplotlib-up-and-running-pycon-canada-2015-9-638.jpg?cb=1447208065)

As you can see in this picture, a matplotlib graphic is subdivided into three main parts:
* Figure: This is the overarching structure that contains everything else. If you work with several subplots, they are all contained in one single figure.
* Axes: This is where most of the real work happens, meaning that you will plot the things you want with the axes.
* Axis: The x and y axis, which you can access and work with separately.
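
To make the three levels a bit more tangible, here is a minimal sketch that creates one figure with one axes and then looks at the pieces. The data and the label text are made up for illustration.

```python
import matplotlib.pyplot as plt

# One figure that contains a single axes
fig, ax = plt.subplots()
ax.plot([0, 1, 2], [3, 1, 2])

# The figure keeps a list of all axes it contains
print(fig.axes)

# Each axes owns one x axis and one y axis object
print(type(ax.xaxis).__name__, type(ax.yaxis).__name__)

# An axis can be changed on its own, for example its label
ax.xaxis.set_label_text("Index")
```

Call plt.show() afterwards if you want to look at the figure.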

Interfaces of matplotlib:
* pyplot: This is the real deal. If you use matplotlib, you should access it via pyplot. It allows you to easily create figures and change parts of them.
* object-oriented API: This is useful if you want to embed your matplotlib figure in something like a webpage. But when you use matplotlib for science you can probably forget about it. It gives you more control, but is also more complicated than pyplot.
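
The following sketch draws the same made-up data once through each interface, so you can compare how the calls look. It is only meant as an illustration, not as a recommendation for one style over the other.

```python
import matplotlib.pyplot as plt

values = [3, 1, 4, 1, 5]  # made-up example data

# pyplot interface: functions act on the "current" figure and axes
plt.figure()
plt.plot(values, color="black")
plt.title("pyplot interface")
pyplot_axes = plt.gca()

# object-oriented interface: keep explicit references and call their methods
fig, ax = plt.subplots()
ax.plot(values, color="black")
ax.set_title("object-oriented interface")
```

Both variants end up with a figure that holds one axes with one black line; the difference is only whether you address the axes explicitly or let pyplot keep track of it.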

More good things to know about pyplot:
* pyplot can work with several figures
* if you call pyplot without having a figure, it will create one for you
* pyplot.gca() gives you the current axes, pyplot.gcf() gives you the current figure
* pyplot.close() closes the current figure and pyplot.close("all") closes all figures, so you can avoid plotting in old figures
* matplotlib works quite well with pandas dataframes
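
A short sketch of how pyplot keeps track of figures may help here. The variable names are chosen just for this example.

```python
import matplotlib.pyplot as plt

plt.close("all")  # start from a clean slate

# Calling a plotting function without a figure creates one implicitly
plt.plot([1, 2, 3])
first_fig = plt.gcf()
first_ax = plt.gca()

# pyplot can hold several figures; the newest one becomes the current one
second_fig = plt.figure()
print(plt.get_fignums())

# Close only the current (second) figure, then everything that is left
plt.close()
n_after_one_close = len(plt.get_fignums())
plt.close("all")
n_after_close_all = len(plt.get_fignums())
```

Note that plt.close() without arguments only closes the current figure, while plt.close("all") closes every open figure.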

An exhaustive course on matplotlib can be found [here](https://github.com/zutn/oreilly-matplotlib-course) if you want to dive in deep.

### Practice Questions
* What is the scripting layer of matplotlib and what is it used for?
* What is the difference between a figure, axes and an axis?

Don't worry, we will usually use pyplot to make things easier, but it can sometimes be helpful to know what happens behind the scenes of matplotlib.

### Most Basic Plotting
So, after this introduction let us try a bit of plotting ourselves. As stated in the videos, the most basic way to plot things in matplotlib is the plot function, which allows you to draw dots or lines to represent your data. Let's start with a line.

In [None]:
import matplotlib.pyplot as plt
import random
# Make the plot reproducible
random.seed(1)
# Create some fake data
data1 = [random.randint(0, 30) for i in range(10)]
plt.plot(data1, "-")

But we can easily use the same data and represent them by little stars.

In [None]:
plt.plot(data1, "*")

Or change the color.

In [None]:
plt.plot(data1, "-", color="black")

As you can already see, matplotlib lets you customize your figures quite a lot. This is often very helpful, as you can tailor a figure exactly to your needs, but it will also lead you to long stackoverflow sessions to change one minor detail that you just could not figure out. If you take a look at [the documentation](https://matplotlib.org/api/_as_gen/matplotlib.pyplot.plot.html) you will see that plot() usually expects x and y values. However, we only provided one list. From this, plot() infers that the list is a series and simply plots the values against their index (in this case 0 to 9).
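
To see this index behaviour for yourself, the following sketch plots the same made-up list once without and once with explicit x values and then compares the x data that matplotlib actually used.

```python
import matplotlib.pyplot as plt

y = [5, 3, 8, 6]  # made-up values

# Only y given: matplotlib uses the index 0, 1, 2, ... as x
plt.figure()
implicit = plt.plot(y, "-")[0]

# x given explicitly: this results in the same line
plt.figure()
explicit = plt.plot(range(len(y)), y, "-")[0]

print(list(implicit.get_xdata()))
print(list(explicit.get_xdata()))
```

plot() returns a list of line objects, which is why the sketch takes the first element with [0].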

**Note**: If we plot inside the notebook, the figures are shown instantly. This does not necessarily happen if you execute the same code in Spyder or as a script. There you have to tell matplotlib explicitly that you want to see your figure by invoking **plt.show()**. So get in the habit of calling plt.show() in the notebook as well.

### Labelling Stuff
When you create figures, especially scientific ones, you need to label your axes! This is relatively easy in matplotlib. A slightly longer explanation can be found [here](https://youtu.be/aCULcv_IQYw). Basically you just tell pyplot to handle it.

In [None]:
plt.plot(data1, "-", color="black")
plt.xlabel("Index")
plt.ylabel("Data")
# Also label the figure as a whole
plt.title("A wonderful graph")
plt.show()

When you have more than one dataset in a figure it becomes essential to label each of them, so the readers know what is what. As this is part of almost every figure you will make, matplotlib makes it easy. Just specify a label for the plot you are creating (with the label keyword of the function) and call **plt.legend()**. There are some cases where this approach will not give you the desired results, but stackoverflow will have a solution for you.

In [None]:
# Create a second dataset to plot also
import random
random.seed(0)
data2 = [random.randint(0, 30) for i in range(10)]
# Make the plotting
plt.plot(data1, "-", color="black", label="data1")
plt.plot(data2, "-", color="red", label="data2")
plt.xlabel("Index")
plt.ylabel("Data")
# Also label the figure as a whole
plt.title("Two wonderful graphs")
# And finally simply call the legend 
plt.legend()
plt.show()

### Practice Questions
* Can you include scientific notation in labels?

### Often used Kinds of Plots
In this section we will look at the most useful kinds of matplotlib plots. But first we need our pokemon dataset again.

In [None]:
import pandas as pd
pokemon = pd.read_csv("pokemon.csv")

#### Bar Chart
After this is out of the way we will take a look at [bar charts and histograms](https://youtu.be/ZyTO4SwhSeE). First let us prepare a part of a dataframe we want to plot. So let us get the max attack values for different 'Type 1' and sort them.

In [None]:
# Maximum attack value per 'Type 1', sorted from lowest to highest
max_attack = pokemon.groupby("Type 1")["Attack"].max().sort_values()
max_attack

And now we can create a barplot with this data. The barplot needs something to indicate where to plot the attack values on the x axis. Therefore, we also extract the types. And we change the rotation of the x-labels, so they are readable.

Play around with the rotation to see how it influences the plot. Also try giving it a different color.

In [None]:
types = max_attack.index
plt.bar(x=types, height=max_attack)
plt.xticks(rotation=90)
plt.show()

#### Histograms
Histograms are often used to look at distributions of things. In our case, let us check out the distribution of defense values in water pokemon. First extract the data.

In [None]:
water_defense = pokemon.loc[pokemon["Type 1"] == "Water", "Defense"]
water_defense

After we have the data we can simply call matplotlib's hist() function.

In [None]:
plt.hist(water_defense)
plt.show()

This is again a bit rudimentary. So let us make this thing a bit more interesting, by adding labels and changing some properties of the histogram. 

In [None]:
plt.hist(x=water_defense, histtype="step", linestyle=":", color="black", bins=15)
plt.xlabel("Defense Value")
plt.ylabel("Count")
plt.title("Distribution of Defense Values in Water Pokemon")
plt.show()

Again, you can see that matplotlib makes it very easy to change your plots. You will learn to appreciate this feature very soon!

#### Scatter Plots
Another common plot is the scatter plot. As you might have already noticed the matplotlib functions all work very similarly. Therefore, I will keep this short. But if you need additional information, take a look at the [documentation of pyplot](https://matplotlib.org/api/_as_gen/matplotlib.pyplot.html#module-matplotlib.pyplot). First we need some data to plot against each other. Let us look at attack and speed of all pokemon. 

In [None]:
attack = pokemon["Attack"]
speed = pokemon["Speed"]
plt.scatter(x=attack, y=speed, alpha=0.5)
plt.xlabel("Attack Value")
plt.ylabel("Speed Value")
plt.title("Scatter Plot Attack vs. Speed for all Pokemon")
plt.show()

### Practice Questions
* What does the 'alpha' keyword do and what is its use?
* What is the difference between a barplot and a histogram?
* What are the advantages of a boxplot over a barplot?

### Exercise 2
Write a Python program to draw a scatter plot taking a random distribution for the x and y values and plot them against each other. Make the scatter points black with an alpha of 0.5.

### Exercise 3
Try to recreate [this figure](https://www.w3resource.com/w3r_images/matplotlib-basic-exercise-1.png). Also add a label to the line (with the keyword) and create a legend.

Hint: It does not have to look exactly the same. 

### Exercise 4
Create a random sample of 30 values between 0 and 100 and create a boxplot of them. 

### Exercise 5
Create a random sample of 1000 values between 0 and 1. Use this to plot a histogram and set the histogram type to "step".  

### Exercise 6
Take a look at [the documentation](https://matplotlib.org/api/_as_gen/matplotlib.pyplot.boxplot.html#matplotlib.pyplot.boxplot) of the boxplot in matplotlib. Use this and your knowledge of pandas to create a figure that contains a boxplot for all meaningful attributes of the pokemon dataset.

### Working with Subplots
Often those kinds of problems are better tackled with subplots. matplotlib allows you to plot several subplots in one figure. Take a look at [this short tutorial](https://matplotlib.org/gallery/recipes/create_subplots.html) and [this question](https://stackoverflow.com/questions/31726643/how-do-i-get-multiple-subplots-in-matplotlib) to get a first understanding. Then let us think about what kind of figure might profit from several subplots next to each other. My idea would be to make four subplots and compare attack to defense, speed, special attack and special defense respectively. First let us extract the data we need. 

In [None]:
attack = pokemon["Attack"]
speed = pokemon["Speed"]
special_attack = pokemon["Sp. Atk"]
special_defense = pokemon["Sp. Def"]
defense = pokemon["Defense"]

Next we create our subplots. Usually it is better to create all the subplots you need first and then plot, instead of creating one subplot at a time, as it makes for cleaner programming. As we want four subplots, two rows and two columns of subplots are needed. 

In [None]:
fig, axes = plt.subplots(nrows=2, ncols=2, sharex=True, sharey=True)
axes[0,0].scatter(attack, defense, alpha=0.2)
axes[0,0].set_title("Attack vs. Defense")
axes[0,1].scatter(attack, speed, alpha=0.2)
axes[0,1].set_title("Attack vs. Speed")
axes[1,0].scatter(attack, special_attack, alpha=0.2)
axes[1,0].set_title("Attack vs. Special Attack")
axes[1,1].scatter(attack, special_defense, alpha=0.2)
axes[1,1].set_title("Attack vs. Special Defense")
# Keep the titles from overlapping with the neighbouring subplots
fig.tight_layout()
plt.show()

Our four subplots are stored in axes in the same way we see them. This means the upper left subplot is stored in axes[0,0], the upper right in axes[0,1], the lower left in axes[1,0] and the lower right in axes[1,1]. With this knowledge we can access axes to plot our scatterplots.

In addition, the plots suggest a linear relationship between attack and each of the other four attributes. However, it is not a very strong one.
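
If typing out every index feels repetitive, the grid of subplots can also be walked through in a loop. The sketch below assumes made-up random data instead of the pokemon dataset, so that it runs on its own; the dictionary name is chosen just for this example.

```python
import random
import matplotlib.pyplot as plt

random.seed(1)
# Made-up stand-in data, so the sketch runs without pokemon.csv
attack = [random.randint(0, 100) for _ in range(50)]
others = {
    "Defense": [random.randint(0, 100) for _ in range(50)],
    "Speed": [random.randint(0, 100) for _ in range(50)],
    "Special Attack": [random.randint(0, 100) for _ in range(50)],
    "Special Defense": [random.randint(0, 100) for _ in range(50)],
}

fig, axes = plt.subplots(nrows=2, ncols=2, sharex=True, sharey=True)
# axes is a 2x2 array; axes.flat walks through it row by row
for ax, (name, values) in zip(axes.flat, others.items()):
    ax.scatter(attack, values, alpha=0.2)
    ax.set_title(f"Attack vs. {name}")
fig.tight_layout()
```

The loop visits the subplots in the order upper left, upper right, lower left, lower right, which matches the order of the dictionary entries.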

### Practice Questions
* What problems can occur when you do not use 'sharex' and 'sharey'?
* How would you add an x and y label to the plot above?
* Find out what plt.close() does and why it might be useful in the exercises.

### Saving your Figures
Saving your figures in matplotlib is easy. You simply have to call **plt.savefig()** and provide a file name. This will save the current figure in your working directory. Here are a few additional tips on what savefig can do and what makes your saved figures nicer:

In [None]:
plt.plot([1,0], [2,3], linestyle=":")
# Get the current figure, so you can alter its properties
fig = plt.gcf()
# Change the size of the figure to adjust it to your needs (play around)
fig.set_size_inches(1,2)
# Sometimes (especially with subplots) the default saved images are cropped off at the edges
# fig.tight_layout() corrects this
fig.tight_layout()

# Now let us save the figure. There are many keywords you can use but for me the most helpful 
# are dpi (sets the resolution of your figure), bbox_inches (helps remove useless whitespace around the figure)
# and transparent (saves the figure without background)
plt.savefig("simple.png", dpi=300, bbox_inches="tight", transparent=True)
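
As a small addition, savefig picks the file format from the file extension. The sketch below saves the same figure as PNG and as PDF into a temporary folder; the folder and file names are only chosen for this example.

```python
import os
import tempfile
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot([1, 0], [2, 3], linestyle=":")

# Save into a temporary folder so the working directory stays clean
out_dir = tempfile.mkdtemp()
png_path = os.path.join(out_dir, "simple.png")
pdf_path = os.path.join(out_dir, "simple.pdf")

# The file extension decides the format:
# PNG is a pixel image, PDF is a vector graphic
fig.savefig(png_path, dpi=300, bbox_inches="tight")
fig.savefig(pdf_path, bbox_inches="tight")

# Close the figure once it is saved
plt.close(fig)
```

Vector formats like PDF stay sharp when you zoom in, which can be handy for figures that end up in a thesis or paper.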

### Exercise 7
Download [this dataset](https://www.kaggle.com/dorbicycle/world-foodfeed-production) from kaggle and read it in. 


Do the following:
* Let pandas describe the dataset (include="all") to give you a feeling for it and to determine if you read in everything correctly
* Print all the unique values for the 'Item' column to see what's there (convert it to a list to see all)
* Make a meaningful figure for each of the following descriptions:
    * shows the total amount of food and feed over the whole time period (for every year, not as a total).
    * compares amount of barley for food and for feed in Afghanistan in the year 1961.
    * shows the 50 countries that had the most food and feed combined for the whole time period (this is probably a bit easier if you use pandas plotting function). 
    * explores if there is a linear correlation between food and feed amount for the whole time period in all countries.
    * explores if the soybean amount 2000 to 2009 is larger in the northern or southern hemisphere (keep the latitude in mind). Do this also for the time period 1990 to 1999. Plot both figures in a subplot next to each other. 
    * shows the distribution of the amount of all food oils combined (by country, in one total per country) for the year 2000. [This](https://stackoverflow.com/questions/11350770/pandas-dataframe-select-by-partial-string) might be helpful. 

Hint: Some categorical data will be identified by pandas as numeric. Correct this misunderstanding.

Hint: This will also require you to work with pandas a lot. Also, this exercise might take you a while, but will help you realize that the tools you learned are enough to explore a new dataset on your own.

Hint: This is already quite similar to the final project, so make sure to ask a lot of questions. 

Hint: Reading the csv will probably raise a Unicode Error. Stackoverflow will help you!

Hint: If you want something funny to cheer you up during this long exercise type the following in your console:

In [None]:
import antigravity

## Matplotlib Tips and Tricks
While using matplotlib I came across a few things that help me make figures, and I want to share them with you.

### Seaborn
[Seaborn](https://seaborn.pydata.org/) is a package that builds upon matplotlib and allows you to make some very nice figures. For example, this one here was created with seaborn:

![Chilling](https://seaborn.pydata.org/_images/kde_ridgeplot.png)



In general I find the style and default color palette of seaborn much more appealing. If you want your matplotlib plots to look more like seaborn, import seaborn at the beginning of your script and call **seaborn.set_theme()** (**seaborn.set()** in older versions). This overrides some of the matplotlib defaults. Since seaborn 0.8 the import alone no longer changes them.

### Some Commands for a better Data-Ink Ratio
When I introduced the concept of data-ink ratio at the beginning of the notebook, you might have wondered how you can include this concept in your coding. Here are a few tips:
* Use the alpha keyword more often. With a clever usage you can highlight the most important parts of your plot
* Get rid of gridlines 
* Get rid of borders
* Use alpha also on labels 

So now let us try this with a simple figure:




In [None]:
plt.plot([1,0], [2,3], linestyle=":")
plt.grid(True)
plt.xlabel("X")
plt.ylabel("Y")
plt.title("X and Y")
plt.show()

And now with all the advice mentioned above:

In [None]:
plt.plot([1,0], [2,3], linestyle=":")
# Remove grid
plt.grid(False)
# Apply alpha
alpha = 0.6
plt.xlabel("X", alpha=alpha)
plt.ylabel("Y", alpha=alpha)
plt.title("X and Y", alpha=alpha)
# Remove the borders
ax = plt.gca()
for spine in ax.spines.values():
    spine.set_visible(False)
# Also use alpha on the ticklabels
plt.setp(ax.get_yticklabels(), alpha=alpha)
plt.setp(ax.get_xticklabels(), alpha=alpha)
plt.show()

You do not have to make all your figures as bare-bones as this one, but I think it cannot hurt to know the tools.

### xkcd style
You probably know [xkcd comics](https://xkcd.com/). If you like the style, you can simply activate it in your plots by calling **plt.xkcd()** at the beginning of your code. This makes your plots look like this:


![Chilling](https://matplotlib.org/xkcd/_images/xkcd_01.png)
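
If you only want the xkcd look for a single figure, plt.xkcd() can also be used in a with block, so that the style is switched off again afterwards. A minimal sketch with made-up data:

```python
import matplotlib.pyplot as plt

# Everything created inside the with block uses the xkcd style
with plt.xkcd():
    fig, ax = plt.subplots()
    ax.plot([0, 1, 2, 3], [0, 1, 4, 9])
    ax.set_title("Growth, roughly")

# Figures created from here on use the normal style again
```

matplotlib may print warnings about missing fonts if the comic-style fonts are not installed on your computer; the plot is still drawn with a fallback font.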

## Final Advice

![Chilling](https://cdn-images-1.medium.com/max/1600/1*IpediaLpieKBR_jS0nmQdA.jpeg)