In [None]:
# Initialize Otter
import otter
grader = otter.Notebook("lab03.ipynb")

In [None]:
# Run this cell to load all required Python libraries 
exec(open("./utils.py").read())
%matplotlib inline

# Lab 3 – Data Visualization

## Data 6, Summer 2022

So far, we have discussed methods to interpret the data, but what if we want to present our data in a visual format? In this lab, you'll learn several important table methods for producing data visualizations. Visualizations are some of the most powerful tools in data science; they're helpful for showing data to people who don't necessarily have a background in data science, and allow data scientists like yourselves to help others understand the data in a more intuitive way.

In Lecture 8, we talked about methods we could use to visualize one variable, namely the `barh` and `hist` methods. We added the `scatter` and `plot` methods in Lecture 9. These methods allow us to visualize two or more variables at once, which can open up more patterns in the data and can further improve your ability to visualize data for people who do not necessarily understand data science.

As data scientists, our job is not only to use the visualization methods we know, but also to know *when* to use each one. As we build our toolkit of visualization techniques going forward, it's important to understand the advantages and disadvantages of each visualization type.

We will be working with the same `brfss` dataset as we did in the previous lab, so we will load that in to begin looking at the new methods. 

In [None]:
brfss = Table.read_table("data/brfss.csv")
brfss.show(5)

# The [barh](http://data8.org/datascience/_autosummary/datascience.tables.Table.barh.html#datascience.tables.Table.barh) method

The `barh` (horizontal bar chart) method is used to visualize **categorical** variable values. Categorical variables take values that are labels rather than quantities, such as names and qualities (colors, state names, etc.). As we saw in lecture, categorical variables come in 2 different types: *ordinal* and *nominal*. Refer to the [Lecture 8 Slides](https://docs.google.com/presentation/d/1QbR3eXN7XxxUvmPB4xOWzLsJYFUgtKc1wyLHuABH5mw/edit?usp=sharing) to see the difference between the two types.

The `barh` method takes in 1 mandatory argument, which is the **name of the column** you want on the left (vertical) axis of your `barh` plot. There are also optional arguments that have to do with plotting -- you'll see examples of those in this lab and in the homework. The remaining optional arguments in the `datascience` documentation linked above can also be used. Feel free to try out some of the others on your own!

To use the `barh` method properly, we first need to select the columns we want to see in the graph. We should not call `barh` directly on a large `Table` because without specifying a column, we get a bar graph for every single instance of every single variable, which you can imagine results in a lot of bar graphs.

In [None]:
# Just run this cell to load a table with State Counts
states = Table.read_table("data/state_counts.csv")
states

In [None]:
# Since the `states` table only has two columns, we can plot it with barh
states.barh("State")

Notice that each value in the "State" column is plotted with a bar with length corresponding to its count.

<!-- BEGIN QUESTION -->

**Question 1**: Plot a horizontal bar chart that shows the counts of each category from the `"Days Smoking"` column of the `brfss` table. 

*Hint*: Use the `smoking_counts` table.

<!--
BEGIN QUESTION
name: q1
points: 0
manual: true
-->

In [None]:
smoking_counts = Table.read_table("data/smoking_counts.csv")
...

<!-- END QUESTION -->



### Multiple Columns 

We can also use `barh` to see multiple statistics at once. Let's use the `barh` method to see the average number of both *poor mental health* and *poor physical health* days. We'll be using the following columns:
1. `"Physical Health"`: Now thinking about your physical health, which includes physical illness and injury, for how many days during the past 30 days was your physical health not good? 
2. `"Mental Health"`: Now thinking about your mental health, which includes stress, depression, and problems with emotions, for how many days during the past 30 days was your mental health not good? 

Run the following cell to show an example of how to create an *overlaid bar chart* with two statistics.

In [None]:
state_averages = Table.read_table("data/state_averages.csv")
state_averages

In [None]:
# `state_averages` already has one row per state, so we can call barh directly
state_averages.barh("State", overlay=True)

If we want different visualizations for each variable, we can set the optional `overlay` argument to `False`. The default value of `overlay` is `True`, so if you don't give it a value, you will get a plot with all the included variables at once.

In [None]:
state_averages.barh("State", overlay=False)

That way we can choose if we want to have one plot with all our information or a new plot for each piece of information!

In this case, do we prefer an overlaid plot or two separate plots? Can you think of a case where we might want to have two separate plots instead of one overlaid plot? (Hint: think about the units for both variables — are they the same or different?)

Discuss with the people around you and check in with your GSI to confirm. 

### Where `barh` fails

The `barh` method works well on categorical variables, but what if we want to see the distribution of a **numerical** variable? Let's see what happens if we try to use `barh` on a numerical variable (`"Binge Drinks"`) instead of a categorical variable:

In [None]:
# Just run this cell -- don't worry about this `group` method
brfss.group("Binge Drinks").barh("Binge Drinks")

As you can see, this bar plot is not particularly helpful. There are many categories that seem to not have any corresponding bar. Yet, that isn't the case! Seeing the breakdown of `"Binge Drinks"` does not provide us with any useful information, and it is also difficult to read or understand. Instead, for numerical variables, we have another visualization method that helps us visualize a numerical variable's distribution...

# The [hist](http://data8.org/datascience/_autosummary/datascience.tables.Table.hist.html#datascience.tables.Table.hist) method

The `hist` method allows us to see the distribution of a numerical variable. Categorical variables should be visualized using `barh`, and numerical variables should be visualized using `hist`.

The `hist` method takes in 1 mandatory argument and has several optional arguments (as is the case with `barh`, there are many other optional arguments, but here are just a few of them). For this lab, we'll set **`density` to be `False`**.

| **Argument** | **Description** | **Type** | **Mandatory?** |
| -- | -- | -- | -- |
| `column` | Column name whose values you want on the x-axis of your plot | Column name (string) | Yes |
| `density` | If `True`, the y-axis shows density (percent per unit) instead of the count of values in each bin | boolean | No |
| `group` | Similar to the Table method `group`, groups rows by this label before plotting | Column name (string) | No |
| `overlay` | When `False`, make a new plot for each eligible statistic in the Table | boolean | No |
| `bins` | A NumPy array of bin boundaries you want your histogram to gather data into | array | No |
| `unit` | A name for the units of the plotted column | string | No |

**Again, in all cases, `density` should be set to `False`.**

Keep in mind the same plotting optional arguments mentioned in the `barh` introduction.
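
As a rough illustration of what the `bins` and `density` arguments control, the following plain-Python sketch sorts a few made-up sleep times into one-hour bins and then converts the counts to percent per hour. The `bin_counts` helper and the sample values are hypothetical; they are not part of the `datascience` library, which does this work for you when you call `hist`.

```python
# Hypothetical sample of sleep times in hours (made-up values, not BRFSS data)
sleep_times = [7, 8, 6, 7, 7, 5, 8, 9, 7, 6]

def bin_counts(values, boundaries):
    """Count how many values fall in each bin [left, right).

    The final bin also includes its right endpoint, which mirrors how
    NumPy-style histograms treat the last bin.
    """
    counts = [0] * (len(boundaries) - 1)
    for v in values:
        for i in range(len(counts)):
            left, right = boundaries[i], boundaries[i + 1]
            is_last = i == len(counts) - 1
            if left <= v < right or (is_last and v == right):
                counts[i] += 1
                break
    return counts

boundaries = list(range(0, 25))  # same boundaries as np.arange(0, 25, 1)
counts = bin_counts(sleep_times, boundaries)

# density=False puts these raw counts on the y-axis
print(counts[5:10])

# A density-style view divides by the total and by the bin width (1 hour here)
total = len(sleep_times)
bin_width = 1
percent_per_hour = [100 * c / total / bin_width for c in counts]
print(percent_per_hour[5:10])
```

Both lists describe the same shape; only the scale of the y-axis changes.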

Let's take a look at the distribution of sleep time in different states to see how the `hist` method helps visualize numerical variables, starting with our favorite state, California. We'll first create the `sleep_no_negatives` table, which excludes missing values (encoded as `-1`).
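
To see why the `-1` placeholders should be removed before plotting or averaging, here is a small plain-Python sketch with made-up numbers. The list and the `mean` helper are hypothetical and only illustrate the idea; in the lab itself the filtering is done with `where` and `are.not_equal_to(-1)`.

```python
# Hypothetical sleep times where -1 marks a missing answer (made-up values)
sleep_times = [7, 8, -1, 6, -1, 7]

def mean(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)

# Keep only real answers, the same idea as filtering with are.not_equal_to(-1)
answered = [t for t in sleep_times if t != -1]

print(f"Mean with placeholders: {mean(sleep_times):.2f}")
print(f"Mean without placeholders: {mean(answered):.2f}")
```

Leaving the placeholders in pulls the average down and would also add a misleading bar below zero to a histogram.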

In [None]:
sleep_no_negatives = brfss.where("Sleep Time", are.not_equal_to(-1))

In [None]:
# This plot shows the distribution of sleep time for Californians
my_bins = np.arange(0, 25, 1)
california = sleep_no_negatives.where("State", "California")
california.hist("Sleep Time", density = False, bins=my_bins)

This shows us that people living in California tend to sleep between 7 and 8 hours a night, but there are many people who sleep more hours (10+) or fewer hours (less than 5). Let's see how that compares to sleep time in another state, Illinois:

<!-- BEGIN QUESTION -->

**Question 2:** Fill in the following **code cell** to produce a histogram representing the ***distribution of sleep time*** for respondents from the state of Illinois.

*Note*: Set the optional `bins` argument of the `hist` method to `my_bins`. We've provided this variable for you.

<!--
BEGIN QUESTION
name: q2
points: 0
manual: true
-->

In [None]:
# This plot shows the distribution of sleep time for Illinois residents
my_bins = np.arange(0, 25, 1)
il = ...
...

<!-- END QUESTION -->



### California vs. Illinois

We can use `hist` on a `Table` with just rows for these two states and use the optional `group` argument.

*Note*: You'll see how `are.contained_in` works with the `where` method next week. For now, think of it as finding any rows corresponding to *either* `"California"` or `"Illinois"`.

In [None]:
# Just run this cell to load the `il_ca` table
il_ca = sleep_no_negatives.where("State", are.contained_in(["California", "Illinois"]))
il_ca.show(5)

<!-- BEGIN QUESTION -->

**Question 3:** Now that we've created our `il_ca` table, fill in the following **code cell** to produce a histogram representing the ***distribution of sleep time*** for *both* California and Illinois. You'll first need to `select` the necessary columns from `il_ca` then fill in the appropriate call to the `hist` method.

*Hint*: Take a look at the optional `group` argument from the description above.

*Note*: Set the optional `bins` argument of the `hist` method to `my_bins`. We've provided this variable for you.

<!--
BEGIN QUESTION
name: q3
points: 0
manual: true
-->

In [None]:
# This plot shows the distribution of sleep time for people from California AND Illinois
my_bins = np.arange(0, 25, 1)
...

<!-- END QUESTION -->



It appears that sleep time in California is, on average, very similar to sleep time in Illinois. The plot above shows the California `Sleep Time` distribution lying almost exactly on top of the Illinois `Sleep Time` distribution. Let's see if we can use a table query to confirm this:

In [None]:
print(f"California average:\t{np.mean(california.column('Sleep Time'))}")
print(f"Illinois average:\t{np.mean(il.column('Sleep Time'))}")

As we can see, the plot we made suggested that the average amount of sleep is very similar in California and Illinois, and the table operations confirmed that! This is a benefit of visualization: we can learn about the dataset through visual observation alone. It is always beneficial to back your claims about data with concrete facts about the dataset, but visualizations can help abstract away some of the confusion of looking at raw data so that non-data-scientists can better understand what is going on.

<!-- BEGIN QUESTION -->

**Question 4 (*Discussion*):** Now, think about what would happen if you chose two states with **very different counts**. Why would it be more difficult to compare them with histograms?

Once you've discussed with someone around you or a GSI, proceed with the code cells below to confirm your answers. We'll look to compare **Texas** and **Delaware**.

<!--
BEGIN QUESTION
name: q4
points: 0
manual: true
-->

_Type your answer here, replacing this text._

<!-- END QUESTION -->

In [None]:
# Just run this cell
texas = sleep_no_negatives.where("State", "Texas")
delaware = sleep_no_negatives.where("State", "Delaware")
print(f"Texans in `sleep_no_negatives` table: {texas.num_rows}")
print(f"Delawareans in `sleep_no_negatives` table: {delaware.num_rows}")

Each individual plot looks fine:

In [None]:
# This plot shows the distribution of sleep times for Texas respondents
my_bins = np.arange(0, 25, 1)
texas.hist("Sleep Time", density = False, bins=my_bins)

In [None]:
# This plot shows the distribution of sleep times for Delaware respondents
my_bins = np.arange(0, 25, 1)
delaware.hist("Sleep Time", density = False, bins=my_bins)

Take a look at the y-axis on both of these plots. What do you think will happen when we try to plot them on the same graph?

In [None]:
# Just run this cell
texas_delaware = sleep_no_negatives.where("State", are.contained_in(["Texas", "Delaware"]))
texas_delaware.show(5)

<!-- BEGIN QUESTION -->

**Question 5:** Using the code in **Question 3** as reference, produce a histogram showing the distribution of sleep time for respondents from *Delaware* and *Texas*. What do you notice about this plot?

<!--
BEGIN QUESTION
name: q5
points: 0
manual: true
-->

In [None]:
# This plot shows the distribution of sleep time for people from Delaware and Texas
my_bins = np.arange(0, 25, 1)
...

<!-- END QUESTION -->



As you can see, there is so much more Texas data than Delaware data that we can hardly make comparisons between the two. Trying to figure out information from this plot is very difficult, so we would either have to use another type of visualization or change the perspective of this plot to be able to learn from it.
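
One way to "change the perspective" is to compare the share of each group in a bin rather than the raw count. The plain-Python sketch below uses made-up counts (not the real Texas and Delaware numbers) and a hypothetical `to_proportions` helper; it is only meant to illustrate the idea, not to prescribe how this lab's plots should be made.

```python
# Hypothetical bin counts for two groups of very different sizes
# (made-up numbers, not the real Texas and Delaware counts)
large_group = {"6 hours": 2000, "7 hours": 3000, "8 hours": 2500, "9 hours": 500}
small_group = {"6 hours": 50, "7 hours": 75, "8 hours": 60, "9 hours": 15}

def to_proportions(counts):
    """Convert raw counts to the share of the group in each bin."""
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}

large_share = to_proportions(large_group)
small_share = to_proportions(small_group)

for label in large_group:
    print(f"{label}: large {large_share[label]:.1%}, small {small_share[label]:.1%}")
```

On the raw-count scale the small group's bars are dwarfed by the large group's, but on the proportion scale the two shapes can be compared directly. This is roughly the idea behind the `density` argument of `hist`.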

# The [scatter](http://data8.org/datascience/_autosummary/datascience.tables.Table.scatter.html#datascience.tables.Table.scatter) method

As we mentioned, visualizing two variables can show us patterns in the data that can help us learn new information. The `scatter` method allows us to see the relationship between two numerical variables in our data using a **scatter plot**. The first provided column name goes along the x-axis and the second goes along the y-axis.

Let's take a look at the relationship between **Physical Health** and **Alcohol Consumption**. For reference, here are the following questions from the original BRFSS Survey that correspond to our `"Physical Health"` and `"Binge Drinks"` columns.

> **Physical Health:** Now thinking about your physical health, which includes physical illness and injury, for how many days during the past 30 days was your physical health not good?


>**Binge Drinks**: Considering all types of alcoholic beverages, how many times during the past 30 days did you have 5 or more drinks for men or 4 or more drinks for women on an occasion?

### Housekeeping

**Question 6:** As was the case with our previous visualizations lab, we know that the missing numerical values are encoded as `-1`s. Create a new table called `scatter_cleaned` which contains every row from the original `brfss` table that *does not* contain a `-1` in either the `"Physical Health"` column or the `"Binge Drinks"` column.

*Hint*: If you're having trouble with the code, feel free to reference the `hist` section of this lab, where we created the `sleep_no_negatives` table.

<!--
BEGIN QUESTION
name: q6
points: 0
-->

In [None]:
scatter_cleaned = ...
scatter_cleaned

In [None]:
grader.check("q6")

### Producing Scatter Plots

Now, we can call `scatter` on the `scatter_cleaned` table. Run the following cell to do so.

In [None]:
scatter_cleaned.scatter("Binge Drinks", "Physical Health")

Just like that, you've produced your first scatter plot! It looks a little messy, however. Oftentimes scatter plots can suffer from what's known as **[overplotting](https://www.displayr.com/what-is-overplotting/)**: when many data points fall on top of each other, creating a blob of data. When data is *overplotted*, it's often difficult to see the individual data points on the scatter plot.
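
Overplotting is especially common with survey data like ours, where answers are whole numbers and many respondents give exactly the same pair of values. The plain-Python sketch below, using made-up pairs rather than real BRFSS rows, counts how many responses share each coordinate.

```python
from collections import Counter

# Hypothetical (binge drinks, poor physical health days) pairs; made-up values
points = [(0, 0), (0, 0), (0, 0), (1, 0), (1, 0), (2, 5), (0, 0), (2, 5), (4, 30)]

# Count how many survey responses land on each exact coordinate
stack_sizes = Counter(points)

print(f"{len(points)} responses, but only {len(stack_sizes)} distinct dots")
for coordinate, n in stack_sizes.most_common(1):
    print(f"{n} responses are drawn at {coordinate}")
```

Every response in a stack is drawn at the same spot, so the plot cannot show how many people are hiding behind a single dot.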

To reduce overplotting, we can focus on a smaller subset of the data. In this case, we'll keep only the rows in which both the `"Binge Drinks"` value and the `"Physical Health"` value are below 30.

In [None]:
# Create a smaller subset of data
scatter_reduced = scatter_cleaned.where("Binge Drinks", 
                                are.below(30)).where("Physical Health", are.below(30))
scatter_reduced

<!-- BEGIN QUESTION -->

**Question 7:** Using the `scatter_reduced` table, produce a scatter plot that plots `"Binge Drinks"` on the x-axis and `"Physical Health"` on the y-axis. The code should be very similar to the previous scatter plot.

<!--
BEGIN QUESTION
name: q7
points: 0
manual: true
-->

In [None]:
...

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

That looks a little better! There is still a cluster of data points in the bottom left corner, but a clear relationship can be seen between the two variables.

**Question 8 (*Discussion*):** What relationship between binge drinking and number of days with poor physical health does the above scatter plot reveal? Discuss with someone around you and check in with your GSI once you've agreed on an answer.

<!--
BEGIN QUESTION
name: q8
points: 0
manual: true
-->

_Type your answer here, replacing this text._

<!-- END QUESTION -->



### The `group` and `labels` Optional Arguments

The `scatter` method also allows you to assign a group or a label to each data point using the `group` or `labels` keyword arguments.

Say we wanted to investigate the relationship between an individual's **number of children** and their **mental health**. The corresponding questions from the original BRFSS survey were as follows:
>  **Mental Health**: Now thinking about your mental health, which includes stress, depression, and problems with emotions, for how many days during the past 30 days was your mental health not good?

> **Children**: How many children less than 18 years of age live in your household?

**Question 9**: In order to take advantage of the optional arguments, let's first load an additional table from the `"states_scatter.csv"` file. We'll provide the code for this.

Then, using the `states` table, produce a scatter plot that plots the average children against the average number of poor mental health days. 

In [None]:
states = Table.read_table("data/states_scatter.csv")
states

<!-- BEGIN QUESTION -->

<!--
BEGIN QUESTION
name: q9
points: 0
manual: true
-->

In [None]:
...

<!-- END QUESTION -->



This plot looks good, but it is difficult to see which points correspond to which states. To mark each data point with its state, we can use the `group` or `labels` arguments:

In [None]:
states.scatter("Children mean", "Mental Health mean", labels="State")

In [None]:
states.scatter("Children mean", "Mental Health mean", group="State")

As you can see, one of these plots is easier to read than the other, so we were better off using the `labels` argument in this case. Since there are so many states, some of the colors get reused, making it difficult to interpret the *second* scatter plot. In other situations, however, `group` may be more useful than `labels`, so think about when each argument is the better choice.

Scatter plots are useful when visualizing two numerical variables together. If you want to plot two numerical variables but one of those variables corresponds to time, we can use a line plot to visualize the non-time variable as time passes.

# The [plot](http://data8.org/datascience/_autosummary/datascience.tables.Table.plot.html#datascience.tables.Table.plot) method

Similar to `scatter`, we give `plot` the names of two numerical columns and it creates a **line plot** for us. If we want to draw multiple line plots on the same set of axes, we give it a table with multiple numerical columns, and tell it which one contains the values for the x-axis.

The `plot` method allows us to see how non-time variables change over time. Let's use `plot` to look at sleep patterns over the course of the year. First, we will look at a single line plot using `plot`:

In [None]:
# Just run this cell to load a new table
months = Table.read_table("data/months.csv")
months

<!-- BEGIN QUESTION -->

**Question 10**: Using the `months` table and the `plot` method, produce a *line plot* that plots the average sleep time over time.

*Hint*: You'll want to plot the month on the x-axis and average sleep time on the y-axis.

<!--
BEGIN QUESTION
name: q10
points: 0
manual: true
-->

In [None]:
...

<!-- END QUESTION -->



### Identifying Temporal Patterns

Line plots are incredibly effective tools for identifying temporal patterns (i.e. changes over time). Let's utilize our newfound knowledge of the `plot` method to uncover underlying temporal patterns within our BRFSS data. Run the following cells and answer the question that follows.
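
The cells below call `group("Month", np.average)` before plotting. As a rough picture of what that step produces, here is a plain-Python sketch that averages made-up sleep times by month. The `average_by_month` helper is hypothetical and is not how the `datascience` library implements `group`.

```python
# Hypothetical (month, sleep time) rows; made-up values, not BRFSS data
rows = [(1, 7), (1, 8), (2, 6), (2, 7), (2, 8), (3, 9)]

def average_by_month(rows):
    """Return {month: mean of values}, with months in ascending order."""
    grouped = {}
    for month, value in rows:
        grouped.setdefault(month, []).append(value)
    return {month: sum(vals) / len(vals) for month, vals in sorted(grouped.items())}

monthly_average = average_by_month(rows)
for month, avg in monthly_average.items():
    print(month, avg)
```

Each month ends up with a single average, which becomes one point on the line plot.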

In [None]:
# Run this cell -- you should understand how this code works
vermont = sleep_no_negatives.where("State", "Vermont")
florida = sleep_no_negatives.where("State", "Florida")

In [None]:
# Run this cell to produce a line plot for Vermont
vt_grouped = vermont.group("Month", np.average)
vt_grouped.plot("Month", "Sleep Time average")

In [None]:
# Run this cell to produce a line plot for Florida
fl_grouped = florida.group("Month", np.average)
fl_grouped.plot("Month", "Sleep Time average")

### Multiple Variables

<!-- BEGIN QUESTION -->

If we want to see multiple variables on one plot, we can include them in the table we call `plot` on. 

**Question 11**: Read the `vermont_averages.csv` and `florida_averages.csv` files into two new tables, `vt_health` and `fl_health`, respectively. Then, for each table, select the following columns:
1. Month
2. Physical Health average
3. Mental Health average

Finally, for each table, produce a line plot with *one line per variable* that is not `"Month"`. That is, `"Month"` is what should be plotted on the x-axis.

<!--
BEGIN QUESTION
name: q11
points: 0
manual: true
-->

In [None]:
vt_health = ...
vt_health

<!-- END QUESTION -->

In [None]:
# Create the line plot
...

In [None]:
fl_health = ...
fl_health

In [None]:
# Create the line plot
...

<!-- BEGIN QUESTION -->

**Question 12 (*Discussion*)**: What insights can you draw for each state about how **mental and physical** health change over the course of the year?

*Note*: Remember that a *higher* value for both `"Mental Health"` and `"Physical Health"` corresponds to a *larger* number of days where the individual considered their mental or physical health to be *poor*.

<!--
BEGIN QUESTION
name: q12
points: 0
manual: true
-->

_Type your answer here, replacing this text._

<!-- END QUESTION -->



### Choose a State

We've just looked at two states, but there are many more to investigate. Run the following cell to experiment with other states.

In [None]:
def plot_state(state):
    # Drop missing values (-1) from both columns that are plotted below
    state_tbl = brfss.where("State", state).where("Physical Health",
                    are.not_equal_to(-1)).where("Mental Health", are.not_equal_to(-1))
    grouped = state_tbl.group("Month", np.average)
    reduced = grouped.select("Month", 
                  "Physical Health average", 
                  "Mental Health average")
    reduced.plot("Month")
    plt.title(f"{state} Line Plots")
    plt.ylim(0,15)
    plt.ylabel("Number of Days")
    
state_names = ['Alabama','Alaska','Arizona','Arkansas','California', 'Colorado', 'Connecticut', 'Delaware',
 'District of Columbia', 'Florida', 'Georgia', 'Guam', 'Hawaii', 'Idaho', 'Illinois', 'Indiana', 'Iowa',
 'Kansas', 'Kentucky', 'Louisiana', 'Maine', 'Maryland', 'Massachusetts', 'Michigan', 'Minnesota', 'Mississippi',
 'Missouri', 'Montana', 'Nebraska', 'Nevada', 'New Hampshire', 'New Jersey', 'New Mexico', 'New York',
 'North Carolina', 'North Dakota', 'Ohio', 'Oklahoma', 'Oregon', 'Pennsylvania', 'Puerto Rico', 'Rhode Island',
 'South Carolina', 'South Dakota', 'Tennessee', 'Texas', 'Utah', 'Vermont', 'Virginia', 'Washington',
 'West Virginia', 'Wisconsin', 'Wyoming']

In [None]:
interact_manual(plot_state, state=state_names);

## Done! 😇

That's it! There's nowhere for you to submit this, as labs are not assignments. However, please ask any questions you have about this notebook in lab or on Ed.

---

To double-check your work, the cell below will rerun all of the autograder tests.

In [None]:
grader.check_all()

## Submission

Make sure you have run all cells in your notebook in order before running the cell below, so that all images/graphs appear in the output. The cell below will generate a zip file for you to submit. **Please save before exporting!**

In [None]:
# Save your notebook first, then run this cell to export your submission.
grader.export(pdf=False)