# Plotting data
**45 min**
15:45 – 16:30

## Making simple plots with Pandas
We'll be using the `matplotlib` library for plotting.

Before we talk about using `matplotlib` functions directly, we'll get started with plotting the easy way.

We can actually create plots using `pandas`, because pandas will call `matplotlib` for us in the background. 

The Jupyter Notebook will render plots inline if we ask it to using a “magic” command:

In [None]:
import pandas as pd
%matplotlib inline 
# magic function ensuring that plots will be rendered inline

We'll keep working with the surveys data, so I'll make sure we have that imported:

In [None]:
surveys_df = pd.read_csv("surveys.csv")
surveys_df.columns # taking a look at the column names to see what may be of interest to plot

We'll plot the number of animals that were captured for each species.

First we get the number of records for each `species_id`. We can do this by using the `.groupby` method:

In [None]:
counts_by_species = surveys_df.groupby('species_id')['record_id'].count()
counts_by_species #let's take a look at the counts, what is the `type()`?

### Bar plots
Then we can use the `.plot` method that `pandas` provides for series:

In [None]:
counts_by_species.plot(kind='bar')

This `pandas` `.plot` method actually uses `matplotlib` in the background.

We can also get species counts in a simpler way than using `.groupby`. We can use the `value_counts()` method directly on a column of the dataframe.

This does groupby, count, and sort in the background for us and returns a "series":

In [None]:
species_counts = surveys_df['species_id'].value_counts()
species_counts
# type(species_counts)
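
To see how the two approaches relate, here is a minimal sketch on a small made-up dataframe (`toy_df` and its values are invented for illustration and are not part of the surveys data). It suggests that both give the same counts and differ only in how the result is ordered:

```python
import pandas as pd

# Made-up records standing in for surveys.csv (illustration only)
toy_df = pd.DataFrame({
    "record_id": [1, 2, 3, 4, 5, 6],
    "species_id": ["DM", "PB", "DM", "DO", "DM", "PB"],
})

# groupby + count: the result is ordered by the group key
by_group = toy_df.groupby("species_id")["record_id"].count()

# value_counts: the same counts, ordered from most to least frequent
by_value = toy_df["species_id"].value_counts()

print(by_group)
print(by_value)
```

Both results are series indexed by `species_id`, which is why either one can be passed straight to `.plot`.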

And now we can plot the values in the `species_counts` series:

In [None]:
species_counts.plot.bar()

The values here are the same, but we see that the species ids have been sorted from highest to lowest count.

**Challenge**: Create a bar graph using `pandas` `.plot` that shows the number of records for each plot.
**Hint**: Use the `plot_id` column.

In [None]:
# Challenge solutions

surveys_df['plot_id'].value_counts().plot.bar()

# or

surveys_df.groupby('plot_id')['record_id'].count().plot.bar()

Let's take another look at our `species_counts`.

In [None]:
species_counts

There are a lot of low values here, so I'd like to make our plots a little easier to look at by restricting ourselves to species that were recorded more than 1000 times.

We can also select data we'd like to keep using a "boolean mask". 

This is where what we learned in the "Subsetting Data using Criteria" section can be useful:

In [None]:
species_top_counts = species_counts.loc[species_counts > 1000]
species_top_counts.plot.bar()
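
As a rough illustration of what the mask looks like on its own, here is a sketch using a small made-up series (`toy_counts` and its numbers are invented, not taken from the surveys data):

```python
import pandas as pd

# Made-up counts standing in for species_counts (illustration only)
toy_counts = pd.Series({"DM": 1500, "PB": 1200, "DO": 900, "RM": 30})

# Comparing a series to a number gives a series of True/False values
toy_mask = toy_counts > 1000

# .loc keeps only the entries where the mask is True
toy_top = toy_counts.loc[toy_mask]

print(toy_mask)
print(toy_top)
```

The mask has the same index as the series it was built from, which is how `.loc` knows which entries to keep.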

If we look at our `species_top_counts` variable, we see that this is also a series with an index of `species_id` and a value that is the count:

In [None]:
species_top_counts

To get the ids of the top species, we can extract the index of this series:

In [None]:
top_species = species_top_counts.index
top_species

Just like we selected `species_top_counts` from the `species_counts` series using a "boolean mask", we can also select rows in our original dataframe based on the `top_species` ids.

First we'll create the mask to identify the rows in our dataframe that include one of our `top_species`:

In [None]:
# For each row of our dataframe `surveys_df`, return True if its `species_id`
# is one of the `top_species`, otherwise False.
mask = surveys_df.species_id.isin(top_species)
# Here is how our mask looks:
mask

Now we use our "boolean mask" to return the rows in the original dataframe where our mask says "True". To do this, we use `.loc`:

In [None]:
top_species_surveys = surveys_df.loc[mask]
top_species_surveys
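
The same two steps can be sketched on a tiny made-up dataframe, where it is easier to follow which rows survive (`toy_df` and `toy_top_species` are hypothetical names and values used only for this example):

```python
import pandas as pd

# Made-up records standing in for surveys_df (illustration only)
toy_df = pd.DataFrame({
    "record_id": [1, 2, 3, 4, 5],
    "species_id": ["DM", "RM", "PB", "DM", "DO"],
    "plot_id": [2, 3, 2, 7, 3],
})
toy_top_species = ["DM", "PB"]  # hypothetical list of ids to keep

# One True/False value per row: is this row's species_id in the list?
toy_mask = toy_df["species_id"].isin(toy_top_species)

# Keep only the rows where the mask is True
toy_subset = toy_df.loc[toy_mask]

print(toy_mask)
print(toy_subset)
```

Note that the rows that are kept hold on to their original index labels.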

*(Draw what is happening with `surveys_df`, `mask`, and `top_species_surveys`.)*

We can use our subsetted dataframe to look at how many animals from the top species were captured in each plot:

In [None]:
top_species_at_plots = top_species_surveys['plot_id'].value_counts()
top_species_at_plots.plot.bar()

But we can also do more using `matplotlib`.

## Using matplotlib

`matplotlib` is the most widely used scientific plotting library in Python. It is actually used by `pandas` to make these plots. 

Even though `pandas` can use `matplotlib` to show us plots, we can't call it directly right now. If we want to be able to use it directly to create and manipulate plots, we'll need to import it.

The most commonly used sub-library is `matplotlib.pyplot`:

In [None]:
import matplotlib.pyplot as plt

### Scatter plot
Let's start by making a scatter plot, to compare the `hindfoot_length` and `weight` values in our `top_species_surveys` dataframe.

When plotting with `matplotlib` directly, we will need to tell it which pieces of data to use. First, we'll define the variables we'd like to plot:

In [None]:
# Remove NaN values in our dataframe
top_species_surveys = top_species_surveys.dropna(how='any',axis=0)

# define x and y for clarity
x = top_species_surveys['hindfoot_length']
y = top_species_surveys['weight']
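
One thing to keep in mind is that `how='any'` drops a row when any of its columns is missing, not only the two columns we plan to plot. The sketch below uses a made-up dataframe (`toy_df` is invented for illustration) to compare this with the `subset` option of `dropna`, which is one possible alternative if you want to keep more rows:

```python
import numpy as np
import pandas as pd

# Made-up measurements with some missing values (illustration only)
toy_df = pd.DataFrame({
    "hindfoot_length": [35.0, np.nan, 20.0, 36.0],
    "weight": [40.0, 45.0, np.nan, 42.0],
    "sex": ["M", "F", "F", None],
})

# how='any' drops a row if any column is missing
all_columns = toy_df.dropna(how="any", axis=0)

# subset= only checks the columns we plan to plot
plot_columns = toy_df.dropna(subset=["hindfoot_length", "weight"])

print(all_columns)
print(plot_columns)
```

Here the last row is dropped by the first call only because `sex` is missing, even though both of its measurements are present.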

We'll use the `.scatter` function in pyplot to make our plot. This is very similar to what we gave `pandas` before:

In [None]:
## Instead of: top_species_surveys.plot.scatter
plt.scatter(x, y)
plt.show() # not necessary in Jupyter Notebooks because this is an interactive environment

**Note**: The scatter plot will actually ignore `NaN` values, but to be cleaner and on the safe side for other uses we removed `NaN` values from our data.

We can also learn more about how to use scatter:

In [None]:
plt.scatter?

We can change the appearance of our scatter plot by changing the default options, including the marker size, color, and shape:

In [None]:
plt.scatter(x, y, s=10, c='g', marker='x')

**Note**: In Jupyter Notebooks, we don't need to use `plt.show()` because this is an interactive environment. In other coding environments, you may need this function to show the plot.

We can also use `matplotlib` to add more information to our plot, including a title, and x and y labels:

In [None]:
plt.scatter(x, y, s=10, c='green', marker='o')
plt.title("Hind foot length and weight")
plt.xlabel("Hind foot length")
plt.ylabel("Weight")

`matplotlib` also makes other styles available for our figures:

In [None]:
plt.style.available

Let's try a new style, and also change the size for our labels:

In [None]:
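# On Matplotlib 3.6 and later the seaborn styles are named 'seaborn-v0_8-<name>',
# e.g. 'seaborn-v0_8-whitegrid'; check plt.style.available for the exact names
plt.style.use('seaborn-whitegrid')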
plt.scatter(x, y, s=10, c='green', marker='o')
plt.title("Hind foot length and weight", fontsize=22)
plt.xlabel("Hind foot length", fontsize=16)
plt.ylabel("Weight", fontsize=16)

`Matplotlib` has a lot of functionality and can be overwhelming. So, it's worth remembering there are multiple ways of doing things.

A useful strategy is to do as much as you easily can in a convenience layer. We can start by creating a plot using `pandas`, and then combine this with calls to `matplotlib` for the rest.
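
A minimal sketch of this strategy, assuming a small made-up dataframe (`toy_df` is invented for illustration): the `pandas` plotting call returns the `matplotlib` Axes object it drew on, so we can keep a reference to it and continue customizing from there.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Made-up measurements (illustration only)
toy_df = pd.DataFrame({
    "hindfoot_length": [35, 36, 20, 21, 50],
    "weight": [40, 44, 25, 27, 120],
})

# pandas draws the plot and returns the matplotlib Axes it drew on
ax = toy_df.plot.scatter(x="hindfoot_length", y="weight", s=10, c="green")

# The Axes object can then be customized with matplotlib methods
ax.set_title("Hind foot length and weight", fontsize=22)
ax.set_xlabel("Hind foot length", fontsize=16)
ax.set_ylabel("Weight", fontsize=16)
```

Calling methods on `ax` changes the same plot as the `plt.title`, `plt.xlabel` and `plt.ylabel` functions do, but it makes explicit which plot is being modified.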

Going back to our `top_species_surveys` dataframe, we'll use `plot.scatter` from there with options:

In [None]:
top_species_surveys.plot.scatter(x="hindfoot_length", y="weight", 
                title="Hind foot length and weight", s=10, 
                c='green', marker='x')

This figure made through `pandas` will automatically display the x and y labels from the dataframe column names. We can also find out more about using `df.plot`:

In [None]:
surveys_df.plot?

Now we can still add on to the plot with `matplotlib`:

In [None]:
surveys_df.plot(x="hindfoot_length", y="weight", kind="scatter", 
                s=10, c='green', marker='x')
plt.title("Hind foot length and weight", fontsize=22)
plt.xlabel("Hind foot length", fontsize=16)
plt.ylabel("Weight", fontsize=16)

This plot gives us an idea of the hindfoot length by weight across all species. We see a few clusters, but we can't tell which species our points are associated with.

Let's try adding this information to our plot by changing the colors of our points. We'll do this using the `groupby` method on our `top_species_surveys` dataframe and then plotting.

In [None]:
plt.style.use('seaborn-bright')
surveys_by_top_species = top_species_surveys.groupby('species_id')

for species, data in surveys_by_top_species: # Explain what's going on here
    plt.scatter(data.hindfoot_length, data.weight, label=species,
                s=10, marker='o', alpha=0.5)

plt.title("Hind foot length and weight by species", fontsize=16)
plt.xlabel("Hind foot length", fontsize=16)
plt.ylabel("Weight", fontsize=16)
plt.legend(loc='best', frameon=True, edgecolor='black') # Mess around a bit with the legend

This time, we included a label for each species plotted and a legend. Can you tell what the `alpha` option does?

Our legend shows that we only have 6 unique colors for 11 different species. So, we can't distinguish between DM and PB in our scatter plot.
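
One possible way around repeated colors is to pick a color for each group ourselves. The sketch below uses a made-up dataframe (`toy_df` is invented for illustration) and the `tab20` colormap, which is just one choice among several. It also shows what the loop receives: iterating over a groupby yields one (group key, sub-dataframe) pair per species.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Made-up measurements for three species (illustration only)
toy_df = pd.DataFrame({
    "species_id": ["DM", "PB", "DM", "DO", "PB", "DO"],
    "hindfoot_length": [35, 26, 36, 35, 27, 36],
    "weight": [42, 31, 45, 48, 33, 50],
})

cmap = plt.get_cmap("tab20")  # qualitative colormap with 20 distinct colors

fig, ax = plt.subplots()

# Iterating over a groupby yields (group key, sub-dataframe) pairs
for i, (species, data) in enumerate(toy_df.groupby("species_id")):
    ax.scatter(data["hindfoot_length"], data["weight"], label=species,
               s=10, alpha=0.5, color=cmap(i))  # i-th color of the colormap

ax.legend(loc="best")
```

With 20 colors available, each of up to 20 groups gets its own color instead of cycling through the style's default colors.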

**Challenge**: Try to change the size of the points for each species.

In [None]:
# Challenge Solution

plt.style.use('seaborn-bright')
surveys_by_top_species = top_species_surveys.groupby('species_id')

size = 8 # new line

for species, data in surveys_by_top_species:
    plt.scatter(data.hindfoot_length, data.weight, label=species,
                s=size, marker='o', alpha=0.5)
    size += 3 # new line, same as size = size + 3

plt.title("Hind foot length and weight by species", fontsize=16)
plt.xlabel("Hind foot length", fontsize=16)
plt.ylabel("Weight", fontsize=16)
plt.legend(loc='best', frameon=True, edgecolor='black')

In [None]:
plt.legend?

Now let's save our plot to a file:

In [None]:
plt.style.use('seaborn-paper')
surveys_by_top_species = top_species_surveys.groupby('species_id')

size = 8

for species, data in surveys_by_top_species:
    plt.scatter(data.hindfoot_length, data.weight, label=species,
                s=size, marker='o', alpha=0.5)
    size += 3

plt.title("Hind foot length and weight by species", fontsize=16)
plt.xlabel("Hind foot length", fontsize=16)
plt.ylabel("Weight", fontsize=16)
plt.legend(loc='best', frameon=True, edgecolor='black')

plt.savefig("my_figure.png") # new line

If you happen to want to do this across two Jupyter windows, or just want to refer to this figure later, you can first save a reference to the current figure in a local variable:

In [None]:
plt.style.use('seaborn-paper')
surveys_by_top_species = top_species_surveys.groupby('species_id')

size = 8

fig = plt.gcf() # new line, save a reference to the current figure in a local variable

for species, data in surveys_by_top_species:
    plt.scatter(data.hindfoot_length, data.weight, label=species,
                s=size, marker='o', alpha=0.5)
    size += 3

plt.title("Hind foot length and weight by species", fontsize=16)
plt.xlabel("Hind foot length", fontsize=16)
plt.ylabel("Weight", fontsize=16)
plt.legend(loc='best', frameon=True, edgecolor='black')

And then save the figure:

In [None]:
fig.savefig("my_figure.pdf")

This will save the figure to the file "my_figure.pdf".

The file format will automatically be deduced from the file name extension (supported formats include png, jpg, pdf, ps, eps and svg).
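
As a rough sketch of how the extension picks the format, the example below saves one made-up figure twice. The file names and the temporary folder are assumptions made for this example; `dpi` and `bbox_inches` are optional `savefig` arguments that are often handy but not required.

```python
import os
import tempfile

import matplotlib.pyplot as plt

# A small made-up figure (illustration only)
fig, ax = plt.subplots()
ax.scatter([35, 36, 20], [40, 44, 25], s=10)
ax.set_title("Hind foot length and weight")

out_dir = tempfile.mkdtemp()  # temporary folder, used here only for illustration
png_path = os.path.join(out_dir, "toy_figure.png")
svg_path = os.path.join(out_dir, "toy_figure.svg")

# dpi sets the resolution of raster formats such as png;
# bbox_inches="tight" trims extra whitespace around the figure
fig.savefig(png_path, dpi=300, bbox_inches="tight")
fig.savefig(svg_path, bbox_inches="tight")
```

Vector formats such as svg and pdf stay sharp at any zoom level, while png is usually easier to drop into slides or web pages.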

## Key Points
- `Matplotlib` is the engine behind `Pandas` plots.
- The object-based nature of `matplotlib` plots enables detailed customization after they have been created.
- Export plots to a file using the `savefig` method.
- There are other libraries for plotting worth looking up, including: `seaborn`, `plotnine`, `plot.ly`.