# BUDS Report 07: Introduction to Visualizations

### Table of Contents

1. <a href='#section 1'>Numerical vs. Categorical Data</a>
2. <a href='#section 2'>Scatter Plots</a>
3. <a href='#section 3'>Histograms</a>
4. <a href='#section 4'>Bar Charts</a>

In [None]:
# run this cell
from datascience import *
import numpy as np
import math
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
%matplotlib inline

## 1. Numerical vs. Categorical Data <a id='section 1'></a>

When working with data, it's important to consider whether it is numerical or categorical. This will help you create visualizations and make better analyses regarding the data.

We work with two types of data: numerical and categorical. The distinction may sound straightforward, but here are the definitions.

- **Numerical Data:** values that are from an ordered number scale where the _differences_ are meaningful

- **Categorical Data:** values that come from a fixed inventory of categories, which may or may not have an ordering

Run the following cell to load the `ces_data` table from the past few days. It'll be loaded into your notebook in the same way as previous notebooks but with a little more data cleaning than before. Again, the data cleaning consists of getting rid of `nan` values (datapoint attributes that couldn't be collected). You don't have to understand how the code works, but do take note of the change in the number of rows. Answer the following questions about numerical and categorical data.

In [None]:
ces_data = Table.read_table("ces_data_v2.csv")
print("The original CalEnviroScreen data had " + str(ces_data.num_rows) + " rows.")

# this does a bit of data cleaning
# don't worry about understanding these next few lines of code
for i in np.arange(ces_data.num_columns):
    if i != 3 and i != 11:
        ces_data = ces_data.where(i, are.above_or_equal_to(0))
print("The cleaned CalEnviroScreen data now has " + str(ces_data.num_rows) + " rows.")
ces_data

Does the column "California.County" contain numerical or categorical data?

_Written Answer:_

Does the column "ZIP" contain numerical or categorical data?

_Written Answer:_

Does the column "Asthma" contain numerical or categorical data?

_Written Answer:_

As you can see, checking whether the values are numbers is not enough to determine whether the data is numerical. Categorical data is often represented with numbers, as in the column "Census.Tract". Although Census tracts appear as numbers, each one is a label for a community or region.
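To make the point about number-like labels concrete, here is a minimal sketch in plain Python. All values are made up for illustration and are not taken from `ces_data`; the names `asthma_rates` and `tract_ids` are hypothetical.

```python
from collections import Counter
from statistics import mean

# Hypothetical values, made up for illustration (not taken from ces_data).
asthma_rates = [35.2, 48.9, 61.4, 52.5]   # numerical: differences are meaningful
tract_ids = [1001, 1002, 1002, 1003]      # numbers used as labels for regions

# Averaging a numerical column gives an interpretable quantity.
average_rate = mean(asthma_rates)

# Averaging labels runs without error, but the result describes nothing real.
meaningless_average = mean(tract_ids)

# For categorical data, counting how often each label appears is the useful summary.
tract_counts = Counter(tract_ids)

print(average_rate)
print(meaningless_average)
print(tract_counts)
```

Python will happily average the labels, so it is up to you to decide whether a column's arithmetic means anything.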

## 2. Scatter Plots <a id='section 2'></a>
    
A scatter plot is a visualization that is pretty easy to recognize. Each datapoint (or row in our table) gets its own point on a graph where the _x_-axis and _y_-axis denote different numerical attributes. You create one with the method call `tbl.scatter(x_axis_column, y_axis_column)`.

As noted earlier, the CalEnviroScreen dataset has thousands of rows. If we choose to visualize all of our data, it might be hard to see the individual points. In fact, making a visualization with this many points will make it hard for this notebook to run. To prevent these issues, select a single county to work with. Recall from a previous notebook that counties made of two words (like "Los Angeles") are written normally, but counties made of one word (like "Alameda ") have a space at the end.
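The trailing space matters because string comparison is exact. The sketch below uses a small hypothetical list, `county_labels`, written the way the text describes. It is plain Python rather than a table, but the same exact-match idea applies when you filter by county name.

```python
# Hypothetical county labels written the way the text describes:
# one-word names carry a trailing space, two-word names do not.
county_labels = ["Alameda ", "Los Angeles", "Marin "]

# An exact comparison finds nothing when the trailing space is left off...
without_space = [label for label in county_labels if label == "Alameda"]

# ...and succeeds when the string matches character for character.
with_space = [label for label in county_labels if label == "Alameda "]

# repr() makes the otherwise invisible trailing space visible.
print(repr(county_labels[0]))
print(len(without_space), len(with_space))
```

If a filtered table comes back with zero rows, a missing or extra space in the county name is a likely cause.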

**Assign the string of this county to `your_county` and take note that we are now using a table called `your_data`.**

In [None]:
your_county = ...

your_data = ces_data.where("California.County", your_county)
your_data

Recall that [Report 06](https://highschool.datahub.berkeley.edu/hub/user-redirect/git-pull?repo=https%3A%2F%2Fgithub.com%2Fds-modules%2FBUDS-SU23&urlpath=tree%2FBUDS-SU23%2FWeek-2%2F6_Creating-Tables.ipynb&branch=main) looked at the poverty rate in California. Let's continue to look at this aspect of California tracts.

<div class="alert alert-warning">
    <b>PRACTICE:</b> In the following cell, create a scatter plot with poverty on the x-axis and pollution burden on the y-axis.
    </div>

In [None]:
...

It seems like there might be a slight upward trend, so let's consider what the pollution burden refers to. Refer back to your reference sheet and talk about why you might see this upward trend in the scatter plot. Feel free to look back at the [CalEnviroScreen reference sheet](https://drive.google.com/file/d/1i8Jr_y_Q49Kkf2fTzcwYXh-uYUIjiHlJ/view) or the [official CalEnviroScreen report](https://oehha.ca.gov/media/downloads/calenviroscreen/report/calenviroscreen40reportf2021.pdf/).

_Written Answer:_

<div class="alert alert-warning">
    <b>PRACTICE:</b> Many other indicators were considered when the CalEnviroScreen data was collected. Let's look at whether there is an association between health indicators and poverty. In the next cell, create two separate scatter plots, one for each of the two health indicators included in the dataset. Place the health indicator on the y-axis.
    </div>

In [None]:
...

We see a trend similar to the one in the previous scatter plot. Again, think about reasons we might see a correlation between these attributes. List a few that you think of. It might be helpful to consider how these characteristics were measured.

_Written Answer:_

<div class="alert alert-warning">
    <b>PRACTICE:</b> Finally, create a scatter plot with poverty on the x-axis and the CalEnviroScreen Score on the y-axis. 
    </div>

In [None]:
...

There seems to be a stronger/clearer relationship between the CalEnviroScreen score and poverty than the other three relationships we viewed. Why might we see this? Consider what the CalEnviroScreen score is composed of.

_Written Answer:_

### Reading Documentation

Although looking at these scatter plots provides insight into the county you chose, it may be helpful to compare scatter plots of multiple counties. Again, you can't do this with the whole dataset, but you can consider a smaller subset of counties.

Create and assign an array containing two different California counties to the variable `more_counties`. For reference, we have printed all California counties in the following cell.

**Make sure you take note of whether there is a space after the county name or not.**

In [None]:
# just run this cell; you don't need to understand the code
ces_data.group("California.County").column(0)

In [None]:
more_counties = ...

more_data = ces_data.where("California.County", are.contained_in(more_counties))
more_data

Great! Now you have a dataset with your selected counties. To create a scatter plot that effectively compares multiple counties, you need to add a few arguments to your call expression. Unlike the previous scatter plots, which each showed a single county, this one has to distinguish points that belong to different counties.

Instead of having you copy and paste code, let's practice reading documentation! Recall that **documentation** is a description that explains the arguments that a function/method takes and what it does.

Jupyter Notebook offers a helpful tool for quickly pulling up the documentation of a command. Here's how you can do it:

- Click on the function/method that you are calling.
- Press **`Shift + Tab`** at the same time.
    - At this point, a little pop-up menu should show up.
- Select the caret symbol `^` to get a full screen view of the documentation.
    - Otherwise, you can select the plus sign `+` for a minimized documentation window.
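
The same documentation text can also be retrieved in code, which is handy outside of Jupyter. The sketch below uses Python's built-in `inspect` module on `math.sqrt`; that function is chosen only as a stand-in, and any function or method can be passed instead.

```python
import inspect
import math

# inspect.getdoc returns a function's docstring as a string,
# the same kind of text the pop-up shows under "Docstring".
sqrt_doc = inspect.getdoc(math.sqrt)
print(sqrt_doc)
```

Calling the built-in `help(math.sqrt)` prints a longer, formatted version of the same information.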
    

<div class="alert alert-warning">
    <b>PRACTICE:</b> Try doing this with <code>np.arange</code> in the cell below. Be sure to click on the <code>arange</code> portion, as that is the function that we want to call. NumPy (denoted <code>np</code>) is the library that houses the function <code>arange</code>. In the following Markdown cell, explain what you see in the documentation.
    </div>

In [None]:
np.arange(0, 10, 2)

_Written Answer:_

<div class="alert alert-warning">
    <b>PRACTICE:</b> Repeat this process with the <code>scatter</code> method. A number of phrases (like "Signature" and "Docstring") might show up. Take note of the similarities between the documentation of <code>arange</code> and the documentation of <code>scatter</code>. What do you think is being described in the documentation of <code>scatter</code>? You don't need to understand all items in the documentation, but try skimming all parts.
    </div>
    
Once you have done this, talk to your team and facilitators about your guesses.

In [None]:
more_data.scatter(...)

_Written Answer:_

<div class="alert alert-warning">
    <b>PRACTICE:</b> Now that you have an idea of how <code>scatter</code> works and what each of its arguments does, create a scatter plot of your two counties' poverty-CalEnviroScreen score relationships. Keep "Poverty" on the <i>x</i>-axis and "CES.3.0.Score" on the <i>y</i>-axis. Try adjusting the different arguments so that your visualization is easier to interpret. For example, you might want to alter the size of the points if you have a county with many tracts.
    </div>

In [None]:
more_data.scatter(...)

Compare and contrast the two counties you just made a visualization of. Are there similarities in their trends? What differences do you see (if any stand out)?

_Written Answer:_

## 3. Histograms <a id='section 3'></a>

Another type of visualization that you may have seen before is a histogram. Because it displays the distribution of a quantitative variable, it is very useful for numerical data. In the context of a histogram, a **distribution** will show how frequently numbers from a number range appear.
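To see what "how frequently numbers from a number range appear" means in practice, here is a minimal sketch of the counting step behind a histogram. It uses NumPy's `np.histogram` on made-up scores, not values from `ces_data`, and bins of width 10 chosen only for illustration.

```python
import numpy as np

# Made-up scores for illustration (not taken from ces_data).
scores = np.array([3, 8, 12, 14, 15, 18, 22, 27, 29, 35])

# Bin edges 0, 10, 20, 30, 40 define four ranges of width 10.
bin_edges = np.arange(0, 50, 10)

# np.histogram counts how many values fall into each range.
counts, edges = np.histogram(scores, bins=bin_edges)

for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f"{left} to {right}: {count} values")
```

A histogram draws one bar per range, so the tallest bar here would sit over the 10 to 20 range.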

Run the following cell to see the distribution of CalEnviroScreen scores. Call expressions that make histograms will always follow this format: `tbl.hist(x_axis_column, normed=False)`.

In [None]:
ces_data.hist("CES.3.0.Score", normed=False)

What do you notice about the shape of this histogram? Is it nice and even? Does it pull off into one direction?

_Written Answer:_

<div class="alert alert-warning">
    <b>PRACTICE:</b> Let's once again take a look at either poverty or unemployment to create a visualization. (Choose which one you'd like to look at.) Make sure to use <code>ces_data</code> and the <code>normed=False</code> argument. How does this histogram's shape and spread compare to the histogram of CES scores?
    </div>

In [None]:
...

_Written Answer:_

We're able to use the entire dataset with histograms because it requires less computation to generate histograms than it does scatter plots. Still, it might be useful to look at a subset of our data again. Let's stick to using the tables you created earlier. Recall that you have one county in the `your_data` table and multiple counties in the `more_data` table.

In [None]:
# run this cell
print("The county in your_data is: ")
print(" - " + your_county)

print("The counties in more_data are: ")
for county in more_counties:
    print(" - " + county)

<div class="alert alert-warning">
    <b>PRACTICE:</b> Create the same histogram you made in the last "PRACTICE" question. This time, make it with <code>your_data</code> and see how its distribution compares to the rest of California. Note the differences and similarities. Draw on what you know about this county and explain why this might be.
    </div>

In [None]:
...

_Written Answer:_

<div class="alert alert-warning">
    <b>PRACTICE:</b> Try making a histogram that displays multiple different counties. Keep the same column on the x-axis as the previous two histograms and read the documentation of <code>hist</code> in the same way you read the documentation of <code>scatter</code>. Again, try adjusting the different arguments so that your visualization is easier to interpret.
    
Please be sure to keep the <code>normed=False</code> argument!
    </div>

As usual, discuss any arguments that are confusing or not quite intuitive with your team and your facilitators.

Also, ignore any red warning signs if your visualization successfully appears. The message will not break your computer or cause any issues; it's simply indicating that something is out-of-date. Don't worry about this.

In [None]:
more_data.hist(...)

<div class="alert alert-warning">
    <b>PRACTICE:</b> You might find that the shapes of these distributions are drastically different. Let's try a <i>new</i> subset of counties. Choose two <i>new</i> counties that you think might have similar population sizes and characteristics.

If your above distributions are already pretty similar, choose two counties that you think are pretty different.

    
Then, make the same histogram as above using these <i>new</i> counties.
    </div>

In [None]:
diff_counties = ...

diff_data = ces_data.where("California.County", are.contained_in(diff_counties))
diff_data

In [None]:
...

Do you find that the histograms with similar counties are easier to interpret than the histograms with different counties? Do you see any issues with this visualization? For example, what might be an issue when comparing counties with very different population sizes?

_Written Answer:_

Not all of the histograms look as nice as the CalEnviroScreen score histogram because some of our data is distributed differently. For example, take a look at the pesticides distribution. Even if you make changes to its arguments, the visualization is hard to use.

In [None]:
diff_data.hist("Pesticides", bins=30, unit="total pounds per square mile", group="California.County", normed=False)

## 4. Bar Charts <a id='section 4'></a>

It's a little hard to work with the current `ces_data` table because there are so many Census tracts. Let's try looking at unique counties instead of unique Census tracts. To do so, we need to think about how we want to represent our information. For example, is it better to take the *sum* of the San Francisco County Census tracts' populations or the *average* of those populations? Should we take the *sum* of unemployment percentages or the *average* of unemployment percentages?

<div class="alert alert-warning">
    <b>PRACTICE:</b> Assign <code>sum_attributes</code> to a new array containing "California.County" and the name of the column we would want to take sums of. You don't need to know how the following line of code works, but you can read it and see if you have an idea of what happens in that line.
    </div>

In [None]:
sum_attributes = ...

county_sums = ces_data.select(sum_attributes).group("California.County", sum)
county_sums

<div class="alert alert-warning">
    <b>PRACTICE:</b> Now assign <code>avg_attributes</code> to an array with "California.County", "CES.3.0.Score", and three other columns that we would want to take averages of. Again, you don't need to know how the following line works.
    </div>

In [None]:
avg_attributes = ...

county_avgs = ces_data.select(avg_attributes).group("California.County", np.average)
county_avgs

Notice what happened when we made tables with only one row per unique county. Both tables have only 56 rows (or 56 counties) and now have different column names.
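If the grouping step feels opaque, the idea can be sketched in plain Python: gather each county's tract values, then collapse them to one number per county. The rows below are made up, and "County A" and "County B" are hypothetical names.

```python
from collections import defaultdict
from statistics import mean

# Made-up tract rows: (county, population, unemployment percent).
tracts = [
    ("County A", 3000, 4.0),
    ("County A", 5000, 6.0),
    ("County B", 2000, 10.0),
]

# Collect each county's tract values in lists, one entry per tract.
populations = defaultdict(list)
unemployment = defaultdict(list)
for county, population, percent in tracts:
    populations[county].append(population)
    unemployment[county].append(percent)

# Collapse each list to a single number, giving one entry per county.
population_sums = {county: sum(values) for county, values in populations.items()}
unemployment_avgs = {county: mean(values) for county, values in unemployment.items()}

print(population_sums)
print(unemployment_avgs)
```

Three tract rows become two county entries, which mirrors how the grouped tables shrink to one row per county. The changed column names in `county_sums` and `county_avgs` reflect which function was applied.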

Since visualizing 56 different counties might get a little messy, we'll only work with Bay Area counties for the remainder of the notebook. You don't need to understand the next code cell, but it might help you understand what has happened.

In the next cell, we join the `county_sums` table with the `county_avgs` table so that all the data can be accessed in one table. Then, we make an array with only Bay Area counties and keep the rows where "California.County" is a Bay Area county.

In [None]:
county_statistics = county_sums.join("California.County", county_avgs)

bay_counties = make_array("Alameda ", "Contra Costa", "Marin ", "Napa ", "San Mateo", "Santa Clara", "Solano ", "Sonoma ", "San Francisco")
bay_county_statistics = county_statistics.where("California.County", are.contained_in(bay_counties))

bay_county_statistics
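
For a rough picture of what the cell above does, here is a plain-Python sketch of the same two steps: pairing up entries that share a county name, then keeping only the counties in a chosen list. The dictionaries and their numbers are made up, and the `toy_` names are hypothetical so they don't overwrite the notebook's tables.

```python
# Made-up per-county values, keyed by county name.
toy_sums = {"Alameda ": 100, "Fresno ": 80, "San Mateo": 60}
toy_avgs = {"Alameda ": 25.0, "Fresno ": 40.0, "San Mateo": 15.0}

# "Join": pair up the values that share the same county key.
toy_statistics = {
    county: (toy_sums[county], toy_avgs[county])
    for county in toy_sums
    if county in toy_avgs
}

# "Where ... contained in": keep only counties that appear in the chosen list.
toy_bay_counties = ["Alameda ", "San Mateo"]
toy_bay_statistics = {
    county: values
    for county, values in toy_statistics.items()
    if county in toy_bay_counties
}

print(toy_bay_statistics)
```

This is only an analogy for the table operations, not how the `datascience` library implements them.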

Now that you've set up a table with the necessary information, you can go ahead and make a bar chart. Calls to create bar charts use the following format: `tbl.barh(category_column, count_column)`.

We use `barh` instead of `bar` because horizontal bar charts tend to present better in Jupyter notebooks.

<div class="alert alert-warning">
    <b>PRACTICE:</b> Create a (horizontal) bar chart with the Bay Area counties as its categories and the column you took sums of as its counts.
    </div>

In [None]:
...

Let's sort the bars in order so that we have a better idea of the relative differences between different counties. Notice that the counties in the above bar graph are sorted in the same order as the column "California.County" in the `bay_county_statistics` table. Try to think of how we could get our bars to appear in order of lowest total population to highest.

In [None]:
...

### Time Permitting (Optional)

Finally, create more bar charts for the columns you took averages of. Feel free to read the documentation of `barh` in the same way you read the documentation of `scatter` and `hist`. This might make your visualizations much cleaner.

In [None]:
...

In [None]:
...

In [None]:
...

### Downloading as PDF

Congratulations on finishing this lengthy report!

Download this notebook as a PDF by clicking <b><code>File > Download as > PDF via LaTeX (.pdf)</code></b>. Submit the PDF to bCourses under the corresponding assignment.