# SW 282: Lab 5 - Plotting in Python

---

### Professor Erin Kerrison

In this lab, you will learn about matplotlib, Python's plotting library, and how to use it to make different types of data visualizations.

---

### Table of Contents

1. [Introduction to `matplotlib`](#section-1) <br>
&nbsp;&nbsp;&nbsp; a. [Introduction](#section-1a) <br>
&nbsp;&nbsp;&nbsp; b. [Basics of matplotlib](#section-1b) <br>
&nbsp;&nbsp;&nbsp; c. [Types of Plots](#section-1c) <br>
&nbsp;&nbsp;&nbsp; d. [Customizations](#section-1d) <br>
2. [Practice](#section-2) <br>
&nbsp;&nbsp;&nbsp; a. [Line Graphs](#section-2a) <br>
&nbsp;&nbsp;&nbsp; b. [Bar Graphs](#section-2b) <br>
&nbsp;&nbsp;&nbsp; c. [Histograms](#section-2c) <br>
3. [Choosing Plots](#section-3) <br>

---

In [None]:
from datascience import *
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
plt.style.use("fivethirtyeight")
from ipywidgets import interact, IntSlider
import ipywidgets as widgets
import warnings
warnings.simplefilter('ignore', FutureWarning)
%matplotlib inline
from lab05_plots import *

## 1. Introduction to `matplotlib` <a id="section-1"></a>

### 1a. Introduction <a id="section-1a"></a>

*What is matplotlib?*

> Here's the Wikipedia definition: "matplotlib is a plotting library for the Python programming language and its numerical mathematics extension NumPy."

Basically, matplotlib is a cool way we can create plots. 

A note on visualization in general: Visualization is SUPER important in data science. Sometimes, we create visualizations to examine our data before we begin analysis. Other times, we include visualizations in our final reports to convey important information in a nice manner.

With that, let's learn about matplotlib! We already imported it, along with numpy, in the dependencies cell above.

### 1b. Basics of matplotlib <a id="section-1b"></a>

First, let's talk about how matplotlib works in general. Here's how we do things: we work with the `plt` module (we imported `matplotlib.pyplot` as `plt`, so that's the name we're going to use now), adding things to our plot one by one. Then, when we're done adding all of the things we want to add, we call `plt.show()` to display our plot with all of its features!

Again, here's the general framework:

```python
plt.<...> # Adding things to our plot
plt.<...> # Adding things to our plot
plt.<...> # Adding things to our plot
plt.show() # Displaying the finished plot
```

Now that we know that, let's go ahead and start doing fun stuff!

**A quick note:** Look up in the dependencies cell above, where we import matplotlib. Notice the line that says `%matplotlib inline`? That is a special command in Jupyter Notebooks that tells the notebook to display our graphs right below the cell in which they're generated. If you don't include that command _somewhere_ in your notebook, your plots may not render in some notebook setups.

### 1c. Types of Plots <a id="section-1c"></a>

#### Simple Line Plot

Let's see how to create a line plot with matplotlib. What do you need to create a line? x and y coordinates of course! Let's make those:

In [None]:
x = [1, 2, 3]
y = [4, 5, 6]

Plotting a line with these two arrays is easy! We call the `plot` function from `plt` and pass in the x and y arrays. That's it!

In [None]:
plt.plot(x, y);

(Note the semicolon `;` in the call above. If you include this on the last line of your matplotlib call, Jupyter knows not to print out some annoying text supplied by matplotlib.)
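To see what the semicolon is hiding, here is a minimal sketch. It assumes the non-interactive `Agg` backend so that it also runs outside a notebook; inside this lab you would not need that line. `plt.plot` returns a list of line objects, and Jupyter echoes the value of a cell's last expression unless the line ends in a semicolon.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend so the sketch runs anywhere
import matplotlib.pyplot as plt

x = [1, 2, 3]
y = [4, 5, 6]

# plt.plot returns a list holding one line object per line drawn.
lines = plt.plot(x, y)

print(type(lines).__name__)     # the container type
print(type(lines[0]).__name__)  # the type of each drawn line
print(len(lines))               # how many lines this call drew
```

That returned list is the "annoying text" Jupyter would otherwise print above the plot.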

<div class="alert alert-info">

**QUESTION:** Plot the $x = y$ line by defining your own `x` and `y` arrays and then using matplotlib!

</div>

In [None]:
...

#### Bar Charts

Wow, line plots sure were easy! Let's move on to more of a challenge: bar charts. To review, a bar chart consists of one or many bars. Each bar represents a different category of a categorical variable, and the height or y-value of each bar is the value of that category. 

For example, say we had the counts of the animals your friends owned. We could visualize this with a bar chart. In fact, let's do exactly that!

In [None]:
animals = ["Cat", "Dog", "Bird", "Zebra"]
counts = [280, 120, 60, 3]

We are going to use the `plt.bar` function. It requires two arguments:

1) `x` (named `left` in older versions of matplotlib), a sequence of scalars giving the x coordinates of the bars

2) `height`, also a sequence of scalars, giving the heights of the bars

This may be a bit confusing, but here's what we're going to use for the x coordinates: an array of the numbers 0, 1, ..., up to one less than the number of bars we want in our bar chart. We'll store it in a variable called `left`, and we can create it with `np.arange`.

As a reminder, `np.arange(x)` creates an array with numbers 0, 1, ..., x - 1, which is exactly what we want!

In [None]:
left = np.arange(len(animals))

The heights are just the counts, which we already have. Let's go ahead and plot our bar chart!

In [None]:
plt.bar(left, counts);

Hmm... this doesn't seem to be very informative. Our array for `left` placed our bars nicely, but we don't actually want those x-values there. We want labels which are our animals! Here's how we'll accomplish that:

In [None]:
plt.bar(left, counts)
plt.xticks(left, animals);

Basically, what we did was change our x-labels from the numbers to the labels for the animals with the function `xticks`.
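As an alternative sketch, recent versions of matplotlib (2.1 and later, as far as we know) also accept the category labels themselves as the x-values, in which case the separate `xticks` call is not needed. The `Agg` backend line is an assumption so the snippet runs outside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend so the sketch runs anywhere
import matplotlib.pyplot as plt

animals = ["Cat", "Dog", "Bird", "Zebra"]
counts = [280, 120, 60, 3]

# Passing the labels directly lets matplotlib place and label the bars itself.
bars = plt.bar(animals, counts)

# Each bar is a rectangle whose height is the matching count.
heights = [bar.get_height() for bar in bars]
print(heights)
```

The `np.arange` approach used in this lab is still worth knowing, since it gives you explicit control over where each bar sits.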

<div class="alert alert-info">

**QUESTION:** Create your own bar chart! This time, we want you to plot the names of fruits against their counts.

</div>

In [None]:
fruits = ["strawberry", "blueberry", "cantaloupe", "orange", "apple"]
fruit_counts = [100, 230, 3, 52, 64]

...

#### Histograms

This is the final kind of plot we're going to learn today. Histograms! To review, histograms are "a diagram consisting of rectangles whose area is proportional to the frequency of a variable and whose width is equal to the class interval."

Let's create a histogram of exam scores.

In [None]:
scores = [80, 96, 78, 100, 23, 79, 93, 95]

plt.hist(scores);
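If you are curious how matplotlib grouped the scores, the following sketch captures what `plt.hist` returns: the count in each bin, the bin edges, and the drawn rectangles. The names `bin_counts` and `bin_edges` are our own, and the `Agg` backend line is an assumption so the snippet runs outside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend so the sketch runs anywhere
import matplotlib.pyplot as plt

scores = [80, 96, 78, 100, 23, 79, 93, 95]

# plt.hist returns the per-bin counts, the bin edges, and the bar patches.
bin_counts, bin_edges, patches = plt.hist(scores)

print(len(bin_counts))        # number of bins
print(len(bin_edges))         # always one more edge than bins
print(int(bin_counts.sum()))  # every score lands in exactly one bin
```

The first and last edges are the smallest and largest scores, so the bins span exactly the range of the data.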

Wow, that was easy! Why don't you try?

<div class="alert alert-info">

**QUESTION:** Create a histogram for these exam scores in `other_scores`.

</div>

In [None]:
other_scores = [83, 28, 12, 34, 29, 89]

...

### 1d. Customizations <a id="section-1d"></a>

Wow, we've learned so many plots today! However you may notice... our plots are pretty basic. They have the main plot and that's it. Wouldn't it be nice to add in things like titles and x-labels and colors? Let's learn how to do that!

Basically, we just need to add on more "layers" to our plot before calling show. Here are a few things we can add:

#### `title`

We can add a title to our plot by using `plt.title([name])`

In [None]:
plt.plot(x, y)
plt.title("My Line Plot!");

#### `xlabel`, `ylabel`

We can add labels for our x and y axes with these functions.

In [None]:
plt.plot(x, y)
plt.title("My Line Plot!")
plt.xlabel("cats")
plt.ylabel("dogs");

#### Adding a splash of color!

Let's figure out how to make our line a different color than what it is now! It's also pretty easy, we just need to add an extra thing to our plot call on `plt`.

In [None]:
plt.plot(x, y, color='green') # We added color = 'green'
plt.title("My Line Plot!")
plt.xlabel("cats")
plt.ylabel("dogs");
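Color is only one of several styling options that `plt.plot` accepts. As a rough sketch, the call below also sets a dashed line style, a thicker line, and a circular marker at each point; the particular values are arbitrary choices for illustration, and the `Agg` backend line is an assumption so the snippet runs outside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend so the sketch runs anywhere
import matplotlib.pyplot as plt

x = [1, 2, 3]
y = [4, 5, 6]

# Keyword arguments control how the line is drawn.
(line,) = plt.plot(x, y, color="green", linestyle="--", linewidth=2, marker="o")
plt.title("My Line Plot!")
plt.xlabel("cats")
plt.ylabel("dogs")

# The returned line object remembers the styling we asked for.
print(line.get_color(), line.get_linestyle(), line.get_marker())
```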

<div class="alert alert-info">

**QUESTION:** Your turn! Go ahead and customize the bar chart we created earlier. Make it look super nice!

</div>

In [None]:
# Add your customizations to the calls below!
plt.bar(left, counts)
plt.xticks(left, animals);

## 2. Practice <a id="section-2"></a>

In this section, we'll practice our `matplotlib` functions on a real-world data set. We'll also work through thinking about what types of plots are best to answer different types of questions.

In [None]:
hybrid = Table.read_table("data/hybrid.csv")
hybrid

### 2a. Line Graphs <a id="section-2a"></a>

Let's start by looking at something that line graphs represent well: time trends. In the cell below, we use our `hybrid` dataset to create a plot of how the `acceleration` changes by `year`. To start, we will group the table `hybrid` by year and take the average of the `acceleration`.
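If the idea of "group by year, then average" is unfamiliar, here is a small sketch of the same operation using pandas rather than the `datascience` Table used in this lab. The rows are made up for illustration and do not come from `hybrid.csv`.

```python
import pandas as pd

# Hypothetical rows, not taken from hybrid.csv.
cars = pd.DataFrame({
    "year": [2000, 2000, 2004, 2004, 2004],
    "acceleration": [8.0, 10.0, 11.0, 12.0, 13.0],
})

# Collect the rows for each year, then average acceleration within each group.
mean_by_year = cars.groupby("year")["acceleration"].mean()
print(mean_by_year)
```

Each year ends up with a single number, which is what lets us draw one point per year on a line graph.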

In [None]:
by_year = hybrid.group("year", np.mean)
plt.plot(by_year.column("year"), by_year.column("acceleration mean"))
plt.title("Acceleration by Year")
plt.xlabel("Year")
plt.ylabel("Acceleration")
plt.xticks(np.arange(1998, 2013, 2));

As you can see, there is a generally positive trend in the acceleration over time. If we want to plot multiple values, we can stack plots on top of each other using repeated plot calls, as shown in the cell below.

In [None]:
plt.plot(by_year.column("year"), by_year.column("acceleration mean"))
plt.plot(by_year.column("year"), by_year.column("mpg mean"))
plt.title("Acceleration by Year")
plt.xlabel("Year")
plt.legend(make_array("Acceleration", "MPG"))
plt.xticks(np.arange(1998, 2013, 2));

Notice in the plot above that we have included a legend; this is accomplished using the call `plt.legend` which takes an array of labels. This array should correspond to the order in which you call your `plt` functions. Since we called `plt.plot` on `acceleration` and then `mpg`, our legend call was

```python
plt.legend(make_array("Acceleration", "MPG"))
```

The labels in your legend do not need to match the column labels.
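Another common pattern, sketched here with made-up numbers rather than the `hybrid` data, is to attach a `label` to each `plt.plot` call and then call `plt.legend()` with no arguments. This keeps each label next to the line it describes, so the order of the calls matters less. The `Agg` backend line is an assumption so the snippet runs outside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend so the sketch runs anywhere
import matplotlib.pyplot as plt

# Hypothetical yearly averages, not taken from hybrid.csv.
years = [2000, 2002, 2004, 2006, 2008]
acceleration = [9.5, 10.1, 10.8, 11.6, 12.0]
mpg = [45.0, 41.2, 38.5, 36.0, 35.1]

# Each line carries its own legend label.
plt.plot(years, acceleration, label="Acceleration")
plt.plot(years, mpg, label="MPG")

# With no arguments, plt.legend collects the labels set above.
legend = plt.legend()
legend_labels = [text.get_text() for text in legend.get_texts()]
print(legend_labels)
```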

Let's say that we didn't want to see this trend over the whole interval, but that we were only interested in values from, say, 2000 to 2008. We can cut the span of our plot by setting limits on the axes. This is done using `plt.xlim` and `plt.ylim`, each of which takes an array of two values: the first is the low end of the range, and the second is the upper end.

In [None]:
plt.plot(by_year.column("year"), by_year.column("acceleration mean"))
plt.plot(by_year.column("year"), by_year.column("mpg mean"))
plt.title("Acceleration by Year")
plt.xlabel("Year")
plt.legend(make_array("Acceleration", "MPG"))
plt.xticks(np.arange(1998, 2013, 2))
plt.xlim(make_array(2000, 2008));  # set limits last: ticks outside them can widen the axis again
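As a small sketch with made-up values, `plt.xlim` also accepts the two limits as separate numbers, and calling it with no arguments reports the limits currently in effect, which is a handy way to check that your setting took hold. The `Agg` backend line is an assumption so the snippet runs outside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend so the sketch runs anywhere
import matplotlib.pyplot as plt

# Hypothetical values, not taken from hybrid.csv.
years = [1998, 2000, 2002, 2004, 2006, 2008, 2010, 2012]
values = [8.0, 9.5, 10.1, 10.8, 11.6, 12.0, 12.3, 12.9]

plt.plot(years, values)

# Restrict the visible x-range to 2000 through 2008.
plt.xlim(2000, 2008)

# With no arguments, plt.xlim returns the current (low, high) limits.
current_limits = plt.xlim()
print(current_limits)
```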

<div class="alert alert-info">

**QUESTION:** Create a line plot of the `msrp` variable. Include axis labels and a title. Set the color of the line to `"green"`.

</div>

You will need to take the MSRP mean grouped by year. The table `by_year` has already been created for you above. Use the column `msrp mean` from that table in your plot. Look at the calls above if you get stuck.

In [None]:
...

### 2b. Bar Graphs <a id="section-2b"></a>

In this section, we'll take a quick look at generating bar graphs. We'll do a quick example using a horizontal bar graph of the number of items in each `class`. To get this data, we need to group the `hybrid` table by `class` and get the count of each unique value in that column. This is done for you in the cell below.

In [None]:
counts = hybrid.group("class")
counts

If we wanted to create a horizontal bar graph, we would do something similar to that above, but we would use the call `plt.barh` and set the `yticks` instead of `xticks`:

In [None]:
plt.barh(np.arange(counts.num_rows), counts.column("count"))
plt.yticks(np.arange(counts.num_rows), counts.column("class"))
plt.xlabel("Count")
plt.ylabel("Class");
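Horizontal bar graphs are often easier to read when the bars are sorted by length. The sketch below shows one way to do that with `np.argsort`; the class names and counts are invented for illustration and are not the real values from `hybrid.csv`. The `Agg` backend line is an assumption so the snippet runs outside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend so the sketch runs anywhere
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical classes and counts, not taken from hybrid.csv.
classes = np.array(["Compact", "Midsize", "SUV", "Two Seater"])
class_counts = np.array([40, 60, 35, 5])

# argsort gives the row order that sorts the counts from smallest to largest.
order = np.argsort(class_counts)
positions = np.arange(len(classes))

# Position 0 is drawn at the bottom, so the longest bar ends up on top.
bars = plt.barh(positions, class_counts[order])
plt.yticks(positions, classes[order])
plt.xlabel("Count")
plt.ylabel("Class")

widths = [int(bar.get_width()) for bar in bars]
print(widths)
```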

In this example, we can think of the year as another _categorical variable_ (that is, one which takes on only a specific set of possible values). Because of this, we can generate a bar graph of the number of vehicles in our data from each year to get an idea of how our data is spread over time.

<div class="alert alert-info">

**QUESTION:** Create a horizontal bar graph of the `year` variable. Include axis labels.

</div>

To get the count of rows for each year, we have provided you with the table `year_counts`, which is structured in the same way as `counts` but for years instead of classes.

In [None]:
year_counts = hybrid.group("year")
...

### 2c. Histograms  <a id="section-2c"></a>

Histograms are good for getting an idea of the **distribution** of a variable, which describes how often the variable takes on values in each range. Let's think about the distribution of the MSRP of the hybrid vehicles. Because histograms deal with the frequencies of certain values, the call `plt.hist` requires only one argument: the array of values whose distribution we want to see.

In [None]:
plt.hist(hybrid.column("msrp"));

The histogram defaults to 10 _bins_ of equal size; a bin is a subset of the number line into which we will place values in the histogram. We can change the number of bins by setting the `bins` argument to an integer:

In [None]:
plt.hist(hybrid.column("msrp"), bins=20);

Now we can see smaller fluctuations in the data that weren't visible before, e.g. the dip just after \$40,000. We can also use fewer bins:

In [None]:
plt.hist(hybrid.column("msrp"), bins=5);

Now, we have lost some of the granularity of our data because we have too few bins.
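To make the trade-off concrete, here is a sketch that uses `np.histogram` (which computes the same counts without drawing anything) on a handful of invented prices. The prices are hypothetical and do not come from `hybrid.csv`; the point is only that more bins means narrower bins.

```python
import numpy as np

# Hypothetical prices in dollars, not taken from hybrid.csv.
prices = np.array([18000, 21000, 24500, 26000, 31000,
                   39000, 42000, 58000, 75000, 98000])

bin_widths = {}
for n_bins in (5, 10, 20):
    # np.histogram returns the per-bin counts and the bin edges.
    bin_counts, edges = np.histogram(prices, bins=n_bins)
    # The bins are equally wide, so one difference gives the width.
    bin_widths[n_bins] = float(edges[1] - edges[0])
    print(n_bins, "bins, each", bin_widths[n_bins], "dollars wide:", bin_counts)
```

With only ten data points, the 20-bin version is mostly empty bins, which is a hint that narrow bins need plenty of data to be meaningful.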

#### On the number of bins

Although it may seem logical to want as many bins as possible, having too many bins may lead you to draw conclusions from your data that aren't valid. For example, with 20 bins we see a small dip just after \$40,000. However, this could just be a quirk of our particular data caused by random chance. Choosing the number of bins is always a judgment call when drawing histograms; in the end, the best idea is to use enough that you can see the overall trend, but not so many that the trend has lots of hills and valleys.

Run the cell below so that you can see how the number of bins affects the histogram of the data.

In [None]:
def msrp_hist(bins):
    plt.hist(hybrid.column("msrp"), bins=bins);
    
interact(msrp_hist, bins=IntSlider(value=10, min=1, max=50));

For the distribution of the `msrp` variable, I think that the right number of bins (that is, the one with enough but not too much detail) is 10.

<div class="alert alert-info">

**QUESTION:** Create a histogram of the `mpg` variable. Choose an appropriate number of bins and label the x-axis.

</div>

In [None]:
...

## 3. Choosing Plots <a id="section-3"></a>

In this section, we will think about which types of plots are best to answer certain questions.

<div class="alert alert-info">

**QUESTION:** For this section, each code cell will generate four plots, each with a letter label. In the Markdown cell below each plot, write down which plot is the best to answer the question at the top of the figure, and justify your reasoning.

</div>

In [None]:
part_3_plot_1()

_Type your answer here, replacing this text._

In [None]:
part_3_plot_2()

_Type your answer here, replacing this text._

In [None]:
part_3_plot_3()

_Type your answer here, replacing this text._

In [None]:
part_3_plot_4()

_Type your answer here, replacing this text._

---

## Submission

Congrats on finishing another lab notebook! To turn in this lab assignment, go to File > Download as > PDF via Chrome and upload the PDF to bCourses.

---
Notebook developed by: Chris Pyles

Data Science Modules: http://data.berkeley.edu/education/modules