In [1]:
%matplotlib inline

In [2]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

# In-Class Lab 1 &mdash; Visualization with Matplotlib & Pandas

Now it is your turn to try creating some of the new chart types you learned in class! To practice, you are going to be using the Palmer Penguins dataset that was collected and made available by Dr. Kristen Gorman and the Palmer Station, Antarctica LTER, a member of the Long Term Ecological Research Network. It contains data for 344 penguins from three species, collected on three islands in the Palmer Archipelago, Antarctica. The Palmer Archipelago, also known as the Antarctic Archipelago, is a group of islands off the northwestern coast of the Antarctic Peninsula.

<img src="../images/lter_penguins.png" alt="Palmer Penguins" style="width:442px;"/>
Artwork Credit: @allison_horst https://github.com/allisonhorst/palmerpenguins

The dataset contains the penguin bill dimensions recorded as `"bill_length_mm"` and `"bill_depth_mm"`. These measurements are sometimes called culmen length and culmen depth; the culmen is the dorsal ridge atop the bill, as you can see here.

<img src="../images/culmen_depth.png" alt="Penguin Flippers" style="width:342px;"/>
Artwork Credit: @allison_horst https://github.com/allisonhorst/palmerpenguins

Here is some additional information about the fields in this dataset.

| Column | Type | Description |
| :----- | :--: | :---------- |
| `species` | categorical | The penguin's species type (Adélie, Chinstrap, or Gentoo). |
| `island` | categorical | The island in Palmer Archipelago, Antarctica (Biscoe, Dream, or Torgersen). |
| `bill_length_mm` | float | The length of the penguin's bill measured in millimeters. |
| `bill_depth_mm` | float | The bill depth measured in millimeters. |
| `flipper_length_mm` | float | Penguin wings are called flippers. They are flat, thin, and broad with a long, tapered shape and a blunt, rounded tip. The length of the flipper is measured in millimeters. |
| `body_mass_g` | float | The penguin's body mass measured in grams. |
| `sex` | categorical | The penguin's sex (Male or Female). |

To begin, start by importing the Palmer penguins dataset into a pandas DataFrame. Fortunately, this dataset was recently added as one of the seaborn library's built-in datasets.

Like every dataset, the Palmer penguins data could use a little cleaning! However, since the goal of this lab is to give you a chance to practice creating the new chart types from this morning's lecture, the cell below already converts the categorical columns to the appropriate data type and removes any rows with missing data.

In [3]:
# Import the Palmer penguins dataset into a pandas DataFrame
df = sns.load_dataset('penguins')

# Convert to category data type
df[['species', 'island', 'sex']] = df[['species', 'island', 'sex']].astype('category')

# Drop rows with missing values
df = df.dropna().reset_index(drop=True)

# Look at the first 5 rows of the data
df.head()

| | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex |
| --: | :-- | :-- | --: | --: | --: | --: | :-- |
| 0 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | Male |
| 1 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | Female |
| 2 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | Female |
| 3 | Adelie | Torgersen | 36.7 | 19.3 | 193.0 | 3450.0 | Female |
| 4 | Adelie | Torgersen | 39.3 | 20.6 | 190.0 | 3650.0 | Male |


# Style

Before you get started creating your charts, begin by selecting one of the pre-defined styles provided by matplotlib.

Matplotlib [style sheets](https://matplotlib.org/tutorials/introductory/customizing.html#using-style-sheets)
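
If you are not sure which style names your installation provides, the sketch below shows one way to check before choosing. The `"ggplot"` style is only an illustrative pick, not the one this lab expects; the list of names can differ between matplotlib versions.

```python
import matplotlib.pyplot as plt

# List the style names bundled with the installed matplotlib version
available = plt.style.available
print(available[:5])

# Apply one of them to every plot created afterwards;
# "ggplot" is just an example choice
plt.style.use("ggplot")
```

Calling `plt.style.use()` once near the top of the notebook is usually enough, since it changes the defaults for all later figures.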

In [4]:
# Select a particular matplotlib style


# Bar Chart

During today's class, we looked at how to use pandas `value_counts()` and `plot()` methods to create a bar chart, which you will use to visualize the distribution of penguins by `"species"`. A couple of points to keep in mind:
- Remember to specify that you want pandas to create a bar chart by passing `"bar"` to the `kind` parameter
- Pass a tuple to the `figsize` parameter to create a 9 x 6 inch Figure
- Set the `plot()` method's `rot` parameter to `0` so that the x-axis tick labels are not rotated
- Don't forget to give your plot a title and label the x- and y-axis! You should also adjust their size based on the size of your plot. (During class, you learned a couple of different ways to do this; however, here you should use the methods from matplotlib's pyplot API.)

Pandas `Series.plot.bar` [documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.plot.bar.html)  
Pandas `Series.plot` [documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.plot.html)  
Pandas `Series.value_counts` [documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.value_counts.html)  
Matplotlib `pyplot.title` [documentation](https://matplotlib.org/api/_as_gen/matplotlib.pyplot.title.html)  
Matplotlib `pyplot.xlabel` [documentation](https://matplotlib.org/api/_as_gen/matplotlib.pyplot.xlabel.html)  
Matplotlib `pyplot.ylabel` [documentation](https://matplotlib.org/api/_as_gen/matplotlib.pyplot.ylabel.html)
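
As a reminder of how these pieces fit together, here is a minimal sketch on a small made-up Series (the `colors` data, title, and labels are hypothetical, not part of the penguins dataset). It is one possible arrangement of the calls listed above, not the only valid one.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical toy data standing in for a categorical column
colors = pd.Series(["red", "blue", "red", "green", "red", "blue"])

# Count each unique value (sorted from most to least frequent)
counts = colors.value_counts()

# Draw the counts as a bar chart on a 9 x 6 inch Figure, tick labels unrotated
counts.plot(kind="bar", figsize=(9, 6), rot=0)

# pyplot functions act on the current Axes
plt.title("Toy color counts", size=16)
plt.xlabel("Color", size=13)
plt.ylabel("Count", size=13)
```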

In [5]:
# Create a bar chart using pandas value_counts() and plot() methods


# Add a title


# Label the x- and y-axis



Great job! By default, pandas `value_counts()` method returns a Series containing counts of the unique values. However, a percentage is sometimes more informative than a raw count. Therefore, to answer the question below, visualize the distribution of penguins by `"island"` and set `normalize=True` to get the relative frequency of each unique value.

During class, you also learned how to chain pandas `plot` and `bar()` together as `plot.bar()` to visualize your data, which is the approach you should use here. In addition, pass the `bar()` method the following parameters:
- Set the `color` parameter using a color from the [Tableau palette](https://matplotlib.org/gallery/color/named_colors.html)
- Set the `edgecolor` to black using `'k'` or `'black'`
- Set the `linewidth` parameter to `1.25`

Furthermore, this time let's customize your plot by using matplotlib's object-oriented API. Remember that this works with pandas because the `plot.bar()` call returns a matplotlib Axes object that you can modify. You should use matplotlib's object-oriented methods to do the following:
- Add a title and set the `size` parameter to an appropriate size
- Label the x- and y-axis and set the `size` parameter to an appropriate size
- Set the rotation of the x-axis tick labels to `0` degrees

Matplotlib Documentation
- [Add a title](https://matplotlib.org/api/_as_gen/matplotlib.axes.Axes.set_title.html)
- [Label the x-axis](https://matplotlib.org/api/_as_gen/matplotlib.axes.Axes.set_xlabel.html)
- [Label the y-axis](https://matplotlib.org/api/_as_gen/matplotlib.axes.Axes.set_ylabel.html)
- [Rotate the x-axis tickmarks](https://matplotlib.org/api/_as_gen/matplotlib.axis.Axis.set_tick_params.html)

**Question:** Which island has the largest number of penguins?  
**Answer:** 
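
The sketch below illustrates the object-oriented pattern on a made-up Series, so it does not give away the answer. The data, the `"tab:blue"` color, and the label text are all assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical toy data standing in for a categorical column
colors = pd.Series(["red", "blue", "red", "green", "red", "blue"])

# normalize=True returns relative frequencies that sum to 1
shares = colors.value_counts(normalize=True)

# plot.bar() returns a matplotlib Axes, saved here as "ax"
ax = shares.plot.bar(color="tab:blue", edgecolor="k", linewidth=1.25)

# Customize through Axes methods instead of pyplot functions
ax.set_title("Toy color shares", size=16)
ax.set_xlabel("Color", size=13)
ax.set_ylabel("Proportion", size=13)
ax.tick_params(axis="x", labelrotation=0)
```

Working through `ax` rather than `plt` makes it explicit which plot is being modified, which matters once a Figure holds more than one Axes.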

In [6]:
# Create a bar chart using pandas and save in a variable named "ax"


# Add a title


# Label the x- and y-axis



# Rotate the xticks


# Stacked Bar Plot

Stacked bar plots share a lot of the strengths and weaknesses of univariate bar charts. They tend to work best when the variables are nominal categorical, or ordinal categorical with only a few levels.

Here, you should use the pandas `groupby()` method that we covered in today's class to create a stacked bar chart that shows the number of penguins on each `"island"`, split into males and females.

Pandas `DataFrame.groupby` [documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html)  
Pandas `GroupBy.size` [documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.core.groupby.GroupBy.size.html)  
Pandas `Series.unstack` [documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.unstack.html)
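
The `groupby()`, `size()`, `unstack()` chain can be hard to picture, so here is a minimal sketch on a made-up DataFrame (the `city` and `plan` columns are hypothetical). With category-typed columns such as the ones in `df`, some pandas versions may also warn that `groupby()` wants an explicit `observed=` argument; treat that as something to check in your own environment.

```python
import pandas as pd

# Hypothetical toy data with two categorical columns
toy = pd.DataFrame({
    "city": ["Oslo", "Oslo", "Oslo", "Lima", "Lima", "Lima", "Lima"],
    "plan": ["free", "paid", "paid", "free", "free", "paid", "free"],
})

# size() counts the rows in each (city, plan) group;
# unstack() pivots the inner level ("plan") into columns
table = toy.groupby(["city", "plan"]).size().unstack()
print(table)

# One bar per city, with the plan counts stacked on top of each other
ax = table.plot(kind="bar", stacked=True, rot=0)
ax.set_ylabel("Count")
```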

In [7]:
# Groupby the "island" and "sex" columns to create a stacked bar chart



# Histogram

Bar plots are helpful when we are working with categorical data, such as when we want to know more about the species and islands where the penguins live. However, when we want to know the distribution of a continuous numerical feature, such as the length of the penguins' flippers, we need to use a histogram.

Python offers several different options for plotting and customizing histograms, which you will get to practice now. First, begin by using matplotlib's `plt.hist()` function to visualize the distribution of the `"flipper_length_mm"` column.

You should also customize your plot by setting the following parameters:
- Increase the number of `bins` to 15
- Use the `edgecolor` parameter to outline the bars in black
- Add grid lines to the y-axis of your plot using `plt.grid()`

Matplotlib `pyplot.grid` [documentation](https://matplotlib.org/api/_as_gen/matplotlib.pyplot.grid.html)  
Matplotlib `pyplot.hist` [documentation](https://matplotlib.org/api/_as_gen/matplotlib.pyplot.hist.html)
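
For reference, a minimal sketch of the same parameters on randomly generated numbers (the `values` array is made up and only roughly mimics a measurement in millimeters):

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical data: 300 draws from a normal distribution
rng = np.random.default_rng(0)
values = rng.normal(loc=200, scale=14, size=300)

# plt.hist() returns the bin counts, the bin edges, and the drawn bars
counts, edges, patches = plt.hist(values, bins=15, edgecolor="black")

plt.title("Toy histogram")

# Restrict the grid to horizontal lines (the y-axis grid)
plt.grid(axis="y")
```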

In [8]:
# Plot the distribution of the "flipper_length_mm" feature


# Add a title


# Add grid lines


Now, let's recreate the previous histogram using matplotlib's object-oriented API. However, this time instead of using the `edgecolor` parameter to distinguish between the bars, you should try adjusting the `rwidth` parameter. In addition, this time you should customize your chart by setting the `color` parameter to one of the many colors available in matplotlib.

Matplotlib `Axes.grid` [documentation](https://matplotlib.org/api/_as_gen/matplotlib.axes.Axes.grid.html)  
Matplotlib `Axes.hist` [documentation](https://matplotlib.org/api/_as_gen/matplotlib.axes.Axes.hist.html)  
Matplotlib [Color Demo](https://matplotlib.org/tutorials/colors/colors.html)
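
The object-oriented version follows the same steps, only through Axes methods. The sketch below again uses made-up data; the `rwidth` value and the `"seagreen"` color are illustrative assumptions.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical data: 300 draws from a normal distribution
rng = np.random.default_rng(0)
values = rng.normal(loc=200, scale=14, size=300)

# Create a Figure with a single Axes
fig, ax = plt.subplots(figsize=(8, 5))

# rwidth is the bar width as a fraction of the bin width,
# so values below 1 leave a gap between neighboring bars
counts, edges, patches = ax.hist(values, bins=15, rwidth=0.9, color="seagreen")

ax.set_title("Toy histogram (object-oriented API)")
ax.grid(axis="y")
```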

In [9]:
# Instantiate a Figure and a single Axes


# Plot the distribution of the "flipper_length_mm" feature


# Add a title


# Add grid lines


Finally, use the pandas `DataFrame.hist()` method to plot a histogram for each numerical feature in the dataset. In addition:

- Pass the `figsize` parameter to `hist()`, creating a 6x6 inch Figure
- Adjust the spacing between the bars using `rwidth`
- Adjust the spacing between your subplots by adding `plt.tight_layout()`

Matplotlib `pyplot.tight_layout` [documentation](https://matplotlib.org/api/_as_gen/matplotlib.pyplot.tight_layout.html)  
Pandas `DataFrame.hist` [documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.hist.html)
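
A minimal sketch of this pattern on a made-up DataFrame is shown below (column names and values are hypothetical). Note that `DataFrame.hist()` only draws the numeric columns, so the text column here is skipped.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical data: two numeric columns and one text column
rng = np.random.default_rng(1)
toy = pd.DataFrame({
    "score": rng.normal(70, 10, size=100),
    "weight": rng.uniform(40, 90, size=100),
    "label": ["x", "y"] * 50,
})

# One subplot per numeric column; extra keywords such as rwidth
# are passed through to matplotlib's hist()
axes = toy.hist(figsize=(6, 6), rwidth=0.9)

# Reduce overlap between subplot titles and tick labels
plt.tight_layout()
```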

In [10]:
# Plot a histogram for each numerical feature


# Adjust the layout


# Box Plots

Next, you will try to use pandas to create a box plot for each numerical column in your dataset. In addition, you should adjust the size of the Figure by setting `figsize=(8,4)`.

Pandas `DataFrame.plot.box` [documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.plot.box.html#pandas.DataFrame.plot.box)  
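
Here is a minimal sketch of `DataFrame.plot.box()` on made-up data. The two hypothetical columns are deliberately on very different scales, which is worth keeping in mind when you look at your own result.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one column in the hundreds, one in the thousands
rng = np.random.default_rng(2)
toy = pd.DataFrame({
    "length_mm": rng.normal(200, 14, size=100),
    "mass_g": rng.normal(4200, 800, size=100),
})

# One box per numeric column, all sharing the same y-axis
ax = toy.plot.box(figsize=(8, 4))
```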

In [11]:
# Create a box plot for each numerical feature


Drats! That wasn't very helpful because of the large differences in the scale of the data. Instead, this time create a box plot just for the `"flipper_length_mm"` column. You should also orient the plot horizontally by setting the `vert` parameter to `False`.

Pandas `Series.plot.box()` [documentation](https://pandas.pydata.org/docs/reference/api/pandas.Series.plot.box.html#pandas.Series.plot.box)  
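
A minimal sketch of a horizontal box plot for a single made-up Series follows. The data is hypothetical, and newer matplotlib releases may prefer an `orientation` argument over `vert`, so check the documentation for your installed version if you see a warning.

```python
import numpy as np
import pandas as pd

# Hypothetical data for a single numeric column
rng = np.random.default_rng(3)
lengths = pd.Series(rng.normal(200, 14, size=100), name="length_mm")

# vert=False draws the box horizontally instead of vertically
ax = lengths.plot.box(vert=False, figsize=(8, 3))
```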

In [12]:
# Create a box plot for a single numerical feature


# Scatter Plot

Scatter plots can be a useful way of examining the relationship between two one-dimensional data series. Here, you will start by creating a scatter plot to visualize the relationship between the `"bill_length_mm"` and `"bill_depth_mm"` columns. You should use the matplotlib Axes `scatter()` method to plot the `"bill_length_mm"` column along the x-axis and the `"bill_depth_mm"` column along the y-axis.

Matplotlib `Axes.scatter` [documentation](https://matplotlib.org/api/_as_gen/matplotlib.axes.Axes.scatter.html)
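
The sketch below shows the basic shape of an `Axes.scatter()` call on made-up data (the `x` and `y` arrays and the axis labels are hypothetical):

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical data: y depends loosely on x, plus noise
rng = np.random.default_rng(4)
x = rng.uniform(30, 60, size=50)
y = 0.2 * x + rng.normal(0, 1, size=50)

# Create a Figure with a single Axes
fig, ax = plt.subplots()

# First argument goes on the x-axis, second on the y-axis
points = ax.scatter(x, y)

ax.set_xlabel("x (toy units)")
ax.set_ylabel("y (toy units)")
```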

In [13]:
# Instantiate a Figure and a single Axes

# Create a scatter plot


Now, let's create a scatter plot using the pandas `plot()` method to see if there is a relationship between the `"flipper_length_mm"` and `"body_mass_g"` features. Pass the column names to `x` and `y` and set the `kind` parameter to `"scatter"`.

Pandas `DataFrame.plot.scatter` [documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.plot.scatter.html)
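
For comparison, a minimal sketch of the pandas version on a made-up DataFrame (the column names are hypothetical). Pandas looks the columns up by name, so you do not need to index into the DataFrame yourself.

```python
import numpy as np
import pandas as pd

# Hypothetical data with two related numeric columns
rng = np.random.default_rng(5)
height = rng.normal(170, 8, size=60)
toy = pd.DataFrame({
    "height_cm": height,
    "weight_kg": 0.5 * height - 15 + rng.normal(0, 4, size=60),
})

# x and y are column names; pandas uses them as the axis labels
ax = toy.plot(x="height_cm", y="weight_kg", kind="scatter")
```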

In [14]:
# Create a scatter plot
