# BUDS Report 04: Tables, Part 2 

### Table of Contents

1. <a href='#section 1'>Background Information</a>
2. <a href='#section 2'>Comparing Counties</a>
3. <a href='#section 3'>Looking at the Range of Values</a>

In [None]:
# run this cell
from datascience import *
import numpy as np
import math
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
%matplotlib inline

## 1. Background Information <a id='section 1'></a>

At this point, you should have discussed your knowledge about pollution with your team and looked at the CalEnviroScreen website. Even with extensive prior knowledge about the environment, there may have been many details that you had not considered. More extensive details can be found in the [official CalEnviroScreen report](https://oehha.ca.gov/media/downloads/calenviroscreen/report/calenviroscreen40reportf2021.pdf).

Load the data below to see how our table contains the characteristics that the CES report indicated and what additional information is provided.

In [None]:
ces_data = Table.read_table("ces_data_v2.csv")
ces_data

There are a _lot_ of functions that you can use on tables. In UC Berkeley's own introductory data science course, many students use the [Python Reference Sheet](https://www.data8.org/su23/reference/). You don't need to understand everything on the sheet, but take a quick look and notice what kind of information each column contains. Then try to answer the following question with the help of the reference sheet. Keep it open for the rest of the report, since it's hard to remember all the functions.

<div class="alert alert-warning">
<b>PRACTICE:</b> In the cell below, write code that evaluates to the number of rows and number of columns in the above table. Then, explain what each row in the table represents and the categories that the columns seem to fall under.
    </div>

In [None]:
ces_num_rows = ...
ces_num_columns = ...

print(f"There are {ces_num_rows} rows and {ces_num_columns} columns in this table.")

_Written Answer:_

## 2. Comparing Counties <a id='section 2'></a>

Now that we have an idea of how much information we have in our table, let's try to look at some relationships in the data. Many of the table manipulations you have learned may come in handy here.

If you are having trouble figuring out which function to use, consider what you want your table to look like at the end. How many columns/rows should you have, and what should they look like? Then, compare that with what your table currently looks like and think of the steps you would take to get to the end result.

<div class="alert alert-warning">
<b>PRACTICE:</b> Create a table with only the CES score, the name of county, the asthma rating, and three features that you think might be related to asthma rates. Call this table `asthma_data` and be sure that you do not pick latitude or longitude. Feel free to relabel the column names but keep the new names in mind as you continue.
    </div>

In [None]:
asthma_data = ...
asthma_data

As we learned, we need to *clean* our data before we can use it. Sometimes a value could not be collected, so it shows up in the table as `nan`. You don't need to know exactly what that means, but we do need to remove these values from our data.
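
For the curious, here is a minimal NumPy sketch of why `nan` values can be removed with a simple comparison. The `readings` array is made up for illustration. The key fact is that any comparison involving `nan` evaluates to `False`, which is presumably why a filter such as "greater than or equal to 0" drops the rows with missing values.

```python
import numpy as np

# Hypothetical measurements; np.nan marks a value that could not be collected.
readings = np.array([12.5, np.nan, 0.0, 48.1, np.nan])

# A comparison with nan is always False, so this test is True
# only for real numbers that are at least 0.
keep = readings >= 0

# Keep only the entries that passed the test.
cleaned = readings[keep]

print(keep)
print(cleaned)
```

Note that this kind of filter would also drop any real negative values, so it only works as a cleaning step when negative numbers are not expected in the column.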

Set `characteristics` to a new array containing the string "CES3.0.Data", the string "Asthma", and the labels of all your chosen characteristics. Then run the following cell to clean some of your data. You **don't** need to know what is happening in this cell, but feel free to ask your facilitators what is happening conceptually.

In [None]:
characteristics = ...

for characteristic in characteristics:
    asthma_data = asthma_data.where(characteristic, are.above_or_equal_to(0))
    
asthma_data

In this notebook, cleaning our data consisted of removing tracts that had missing values. Can you see a potential issue with this method? Can you think of other ways we might alleviate this issue?

_Written Answer:_

<div class="alert alert-warning">
<b>PRACTICE:</b> Next, create a table that shows the counties of the ten cities with the <i>highest</i> CES scores. Recall that counties can repeat in this table and that tables will automatically show the first ten entries.
    </div>

In [None]:
highest_ces = ...
highest_ces

<div class="alert alert-warning">
    <b>PRACTICE:</b> Now, make a table that looks at the counties of the cities with the <i>highest</i> asthma values.
    </div>

In [None]:
highest_asthma = ...
highest_asthma

Do the counties shown in the last two tables look similar? Why might they be different? 

_Hint:_ Think about what CES ratings are composed of.

_Written Answer:_

Although we didn't see a clear relationship between the CES score and asthma rates, let's check the relationship between asthma rates and the characteristics (columns) you chose.

<div class="alert alert-warning">
<b>PRACTICE:</b> Find the ten tracts most <i>negatively</i> affected by the first characteristic you chose. Consider and explain what values would be a good/bad amount for this characteristic and use that to find those ten tracts. For example, high percentages of potable water are a good thing, but high percentages of contaminated water are a bad thing. Repeat the same process for the remaining two characteristics you chose.
    </div>
    
To standardize your table names, you can name each table something like `highest_ozone` or `lowest_ozone`, depending on what you're calculating.

In [None]:
# put your first characteristic here
...

In [None]:
# put your second characteristic here
...

In [None]:
# put your third characteristic here
...

The first ten counties in your table(s) might look similar to those in the `highest_asthma` table. They also might not.

If the counties in any of your tables are similar to those in the `highest_asthma` table, explain why you think that might be. For all other tables, describe anything interesting you see.

_Written Answer:_

## 3. Looking at the Range of Values <a id='section 3'></a>

It's important to know how the values in each column are measured and interpreted. We might assume that we're looking at counts or percentages, but sometimes data is collected in an unconventional way or in a way that we hadn't thought about.

Consult the [CalEnviroScreen report](https://oehha.ca.gov/media/downloads/calenviroscreen/report/calenviroscreen40reportf2021.pdf) to learn a little more about asthma and the three features you chose. Then, answer the following question about them.

<div class="alert alert-warning">
<b>PRACTICE:</b> Find the range (the difference between the largest and smallest values) of the "Asthma" column. To do so, use the tables you created in the previous section; feel free to create new tables similar to them. Once you see the range of numbers, write down how you think "Asthma" might have been measured. Then, consult the CalEnviroScreen report to see whether your guess was correct, and note how the actual measurement differed. Finally, repeat this process for the features you chose.
    </div>
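
As a quick illustration of what "range" means here, the sketch below computes it for a made-up array. The `rainfall` values are hypothetical, and this uses NumPy directly rather than the sorted tables the question asks you to use, so treat it as a way to check your understanding rather than as the intended solution.

```python
import numpy as np

# Hypothetical column of values (made up for illustration).
rainfall = np.array([24, 3, 137, 51, 88])

# The range is the largest value minus the smallest value.
largest = rainfall.max()
smallest = rainfall.min()
value_range = largest - smallest

print("Largest:", largest)
print("Smallest:", smallest)
print("Range:", value_range)
```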

In [None]:
# find the largest asthma value
asthma_largest = ...

# create a new table of the counties with the lowest asthma values
# then find the smallest value and the difference
lowest_asthma = ...
asthma_smallest = ...

asthma_diff = ...

print(f"The largest value in the Asthma column is {asthma_largest}.")
print(f"The smallest value in the Asthma column is {asthma_smallest}.")
print(f"The range in the Asthma column is {asthma_diff}.")

_Written Answer:_

In [None]:
# first characteristic
# do a similar process as you did in the cell before
# feel free to copy and paste code but be sure to change any names that would differ
...

_Written Answer:_

In [None]:
# second characteristic
# do a similar process as you did in the cell before
# feel free to copy and paste code but be sure to change any names that would differ
...

_Written Answer:_

In [None]:
# third characteristic
# do a similar process as you did in the cell before
# feel free to copy and paste code but be sure to change any names that would differ
...

_Written Answer:_

That process might have been tedious, but you might have learned a lot about the values under each column. For example, do you feel like we can make comparisons between the different characteristics using these scales?

Recall how [CES scores](https://drive.google.com/file/d/1i8Jr_y_Q49Kkf2fTzcwYXh-uYUIjiHlJ/view) are calculated. What problems could arise with the CalEnviroScreen formula if some of our characteristics tend to be really small while others are really large and if some of our characteristics have a wide range while others do not?
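
To make the question concrete, here is a small sketch with two made-up characteristics on very different scales. Everything in it is an assumption for illustration: the numbers are invented, and the "naive score" simply adds raw values, which is not the CalEnviroScreen formula.

```python
import numpy as np

# Two hypothetical characteristics for the same four tracts (made-up numbers).
small_feature = np.array([0.02, 0.05, 0.03, 0.04])      # values near zero
large_feature = np.array([800.0, 300.0, 500.0, 400.0])  # values in the hundreds

# A naive combined score that simply adds the raw values.
# This is an illustration only, NOT the CalEnviroScreen formula.
naive_score = small_feature + large_feature

# Order the tracts from lowest to highest under each measure.
rank_by_score = np.argsort(naive_score)
rank_by_large = np.argsort(large_feature)
rank_by_small = np.argsort(small_feature)

print(rank_by_score)
print(rank_by_large)
print(rank_by_small)
```

In this toy example, the ordering by naive score matches the ordering by the large-scale feature exactly, so the small-scale feature has no visible effect on the result. Keep this in mind as you think about how the real formula handles characteristics with different scales.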

_Written Answer:_

Now that we know how a CES score is calculated and what pitfalls might come with it, let's look at the range of CES scores for counties. Because there are so many different ways to code the same result, we'll do this section a little differently from what we did before. This way, you'll get exposure to a variety of methods.

To simplify this task, let's only look at Bay Area counties. Assign the variable `bay_counties` to an array of Bay Area counties. Recall that in Report 03, we encountered some column names that were labeled weirdly. Some of the values under "California.County" might also be labeled like this, so try accessing some of those values or adding whitespace to your strings.

Again, you don't need to understand what is happening in the last line of the cell below; just know that it filters out all non-Bay Area counties.

In [None]:
bay_counties = ...

bay_data = asthma_data.where("California.County", are.contained_in(bay_counties))
bay_data
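
If `bay_data` has fewer rows than you expected, exact string matching may be the cause. The sketch below uses plain Python lists with made-up labels (the stray spaces are hypothetical, so check what the real values look like) to show that a name with extra whitespace does not match the clean name.

```python
# Hypothetical county labels as they might appear in a messy column (made up).
labels_in_table = ["Alameda ", "Marin", " Napa", "Alameda "]

# The names we are looking for, typed without any extra whitespace.
wanted = ["Alameda", "Marin", "Napa"]

# Exact matching: "Alameda " (with a trailing space) is not equal to "Alameda".
exact_matches = [label for label in labels_in_table if label in wanted]

# Matching after stripping surrounding whitespace from each label.
stripped_matches = [label for label in labels_in_table if label.strip() in wanted]

print(exact_matches)
print(stripped_matches)
```

Here only `"Marin"` survives the exact match, while all four labels match once the whitespace is ignored. In your own array, the strings need to look exactly like the values stored in the table.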

<div class="alert alert-warning">
<b>PRACTICE:</b> Now that we are only working with Bay Area data, let's try to find the highest CES score for each county. Create a table that contains the name of each county and its highest CES score. Try thinking of a way that we can do this using the <code>sort</code> method.
    </div>
    
*Hint:* If you are having trouble with this question, look back at [Report 03](https://highschool.datahub.berkeley.edu/hub/user-redirect/git-pull?repo=https%3A%2F%2Fgithub.com%2Fds-modules%2FBUDS-SU23&urlpath=tree%2FBUDS-SU23%2FWeek-1%2F3_Tables-Pt-1.ipynb&branch=main) and its more complicated questions.

In [None]:
bay_highest_ces = ...
bay_highest_ces = bay_highest_ces.select(...)
bay_highest_ces

Once you've done that, repeat the process for the *lowest* CES scores of each county. Then, describe the discrepancies you see in CES scores. Why might we see such large (or small) discrepancies? Do you have background knowledge on these counties that might explain what you see? Discuss with your team.

In [None]:
bay_lowest_ces = ...
bay_lowest_ces = bay_lowest_ces.select(...)
bay_lowest_ces

_Written Answer:_

### Downloading as PDF

Download this notebook as a PDF by clicking <b><code>File > Download as > PDF via LaTeX (.pdf)</code></b>. Submit the PDF to bCourses under the corresponding assignment.