## Module 3 Class activities
This notebook is a starting point for the exercises and activities that we'll do in class.

Before you attempt any of these activities, make sure to watch the video lectures for this module.

### Joining tables
Let's look at the spatial distribution of vaccine hesitancy early in the pandemic.

The CDC has a dataset at the county level, [available via Socrata](https://data.cdc.gov/Vaccinations/Vaccine-Hesitancy-for-COVID-19-County-and-local-es/q9mh-h2tw).


<div class="alert alert-block alert-info">
<strong>Exercise:</strong> Import the vaccine hesitancy dataset into a dataframe. Call it <strong>hesitancy</strong>.
</div>

*Hint*: Use the same approach as for the Seattle permits (class 2) or the Los Angeles housing (lecture 5). Just use a different URL.

*Hint*: Add the `$limit` parameter to the end of the URL to get more than the default 1000 rows. [See the example here](https://github.com/socrata/discuss/issues/145). There are 3,142 rows, according to the dataset's webpage, so you will be safe if you specify a limit of (say) 5000 rows.
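
As a minimal sketch of this hint, the query string can be built with the standard library instead of typed by hand. The helper name `with_limit` is hypothetical (it is not part of any Socrata client), and the endpoint is assumed to be the JSON resource for the dataset linked above.

```python
from urllib.parse import urlencode

# JSON endpoint for the vaccine hesitancy dataset (same dataset ID as the link above)
BASE_URL = 'https://data.cdc.gov/resource/q9mh-h2tw.json'

def with_limit(base_url, limit):
    """Hypothetical helper: append Socrata's $limit parameter to a resource URL."""
    # safe='$' keeps the dollar sign readable instead of encoding it as %24
    return base_url + '?' + urlencode({'$limit': limit}, safe='$')

url = with_limit(BASE_URL, 5000)
print(url)
```

You can then pass `url` to `requests.get` exactly as you did for the earlier datasets.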

In [None]:
# your code here
hesitancy = 9999

In [None]:
# get the vaccine hesitancy data from the API
import json
import requests
import pandas as pd


url = 'https://data.cdc.gov/resource/q9mh-h2tw.json?$limit=5000' # copied and pasted from the webpage
r = requests.get(url)
hesitancy = pd.DataFrame(json.loads(r.text))

Before we do any joins, let's look at some state-level summary statistics.

<div class="alert alert-block alert-info">
<strong>Exercise:</strong> Create a dataframe with the state-level means of estimated_hesitant and the vaccination rate.
</div>

*Hints*:
- It might make more sense to weight each county by population, but let's not worry about that here.
- The `percent_adults_fully` column gives the vaccination rate (as of June 2021).
- Use `groupby`!
- Before you do any operations, you might need to convert the data type of the column. I recommend creating a new column, e.g. `df['newcol'] = df.oldcol.astype(float)`
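
Here is a minimal sketch of the convert-then-group pattern from the hints, using a made-up toy dataframe rather than the CDC data. The column names `rate` and `rate_num` and all of the values are hypothetical.

```python
import pandas as pd

# Toy data: numbers stored as strings, as they often are when read from a JSON API
toy = pd.DataFrame({
    'state': ['A', 'A', 'B'],
    'rate': ['0.10', '0.30', '0.50'],
})

# Taking a mean of strings won't give a sensible answer, so convert into a new column first
toy['rate_num'] = toy['rate'].astype(float)

# One row per state, holding the unweighted mean of its counties
state_means = toy.groupby('state')['rate_num'].mean()
print(state_means)
```

Keeping the original string column alongside the new numeric one makes it easy to check that the conversion did what you expected.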

In [None]:
# your code here

In [None]:
hesitancy['hesitant'] = hesitancy.estimated_hesitant.astype(float)
hesitancy['vacc'] = hesitancy.percent_adults_fully.astype(float)
statelevel = hesitancy.groupby('state')[['hesitant','vacc']].mean()
statelevel.head()

<div class="alert alert-block alert-info">
<strong>Exercise:</strong> Create a scatter plot of state-level hesitancy vs. vaccination rates.
</div>

In [None]:
# your code here

In [None]:
statelevel.plot.scatter(x='hesitant', y='vacc')

## Joining
Now let's do a join.
It looks like the county boundaries are in our original dataframe, but in a weird format. We could try to parse them, but instead, let's get the county boundaries and total population using the Census API and `pygris`.

We'll just do one state, in order to reduce the sizes of the files for this exercise.

<div class="alert alert-block alert-info">
<strong>Exercise:</strong> Get a geodataframe of the county boundaries and population totals for a state of your choice.
</div>

*Hints*: 
* Look at the [examples](https://www.census.gov/data/developers/guidance/api-user-guide.Example_API_Queries.html) provided by the Census Bureau
* You'll need to specify the state (e.g. "California") and the level (use "county"). So your API query will have some text like this: `for=county:*&in=state:06`
* A small state will download faster! You don't have to do California (FIPS 06)
* The population variable is B01001_001E
* To download the county boundaries, you'll need `pygris.counties` rather than `pygris.tracts` (which we used before)
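
The Census API returns JSON as a list of lists, with the column names in the first row. As a rough sketch of how to turn that into a dataframe with a five-digit county identifier, here is an example that uses a made-up stand-in for the response (`toy_response` and its values are hypothetical, not real Census figures).

```python
import pandas as pd

# Made-up stand-in for r.json(): a header row followed by one row per county,
# with every value returned as a string
toy_response = [
    ['NAME', 'B01001_001E', 'state', 'county'],
    ['County A', '1000', '05', '001'],
    ['County B', '2000', '05', '003'],
]

# First row supplies the column names; the remaining rows are the data
df = pd.DataFrame(toy_response[1:], columns=toy_response[0])

# 2-digit state code + 3-digit county code = 5-digit county FIPS code
df['GEOID'] = df['state'] + df['county']

# Population counts arrive as strings, so convert before doing any arithmetic
df['pop'] = df['B01001_001E'].astype(int)
print(df)
```

Because the state and county codes are strings, `+` concatenates them rather than adding them, which is what we want here.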

In [None]:
import requests
import pandas as pd
import geopandas as gpd
import pygris

# add your code
# I'll use state 05, which is Arkansas
# use the 5-year ACS (acs5): the 1-year ACS (acs1) only covers counties with 65,000+ residents
r = requests.get('https://api.census.gov/data/2022/acs/acs5?get=NAME,B01001_001E&for=county:*&in=state:05')
censusdata = r.json()
df = pd.DataFrame(censusdata[1:], columns=censusdata[0])
df.rename(columns={'B01001_001E':'pop'}, inplace=True)
df['GEOID'] = df.state + df.county

# get the boundaries
# pygris is still not working for me - these two lines
#counties = pygris.counties(state='05', year=2022)
#counties['GEOID'] = counties.STATEFP + counties.COUNTYFP

# so I downloaded the counties from here: https://www.census.gov/geographies/mapping-files/time-series/geo/tiger-line-file.html
counties = gpd.read_file('/Users/adammb/Downloads/tl_2022_us_county')
censusDf = counties.set_index('GEOID')[['geometry']].join(df.set_index(['GEOID']))

You should now have `GEOID` as the index for your `censusDf`, as well as a geometry column.

<div class="alert alert-block alert-info">
<strong>Exercise:</strong> Join the COVID dataframe to your census dataframe.
</div>

*Hints:*
* Look at which column gives the county FIPS code in each dataframe.
* Do the data types match? Anything else you need to clean up before joining?
* It might be helpful to do a left join from the census dataframe. That means that you will automatically drop the data for counties in other states.
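
To illustrate the hints above, here is a small sketch with two made-up dataframes. All names and values are hypothetical; the point is the zero-padding of the FIPS code and the effect of a left join.

```python
import pandas as pd

# "Census-like" table: indexed by a 5-character string GEOID
left = pd.DataFrame(
    {'pop': [1000, 2000]},
    index=pd.Index(['05001', '05003'], name='GEOID'),
)

# "Survey-like" table: FIPS code stored as a number, so the leading zero is lost
right = pd.DataFrame({
    'fips_code': [5001, 6001],
    'hesitant': [0.1, 0.3],
})

# Convert to string and pad back to 5 characters so the keys match
right['GEOID'] = right['fips_code'].astype(str).str.zfill(5)

# Left join: keep every row of `left`, and only the matching rows of `right`
joined = left.join(right.set_index('GEOID'), how='left')
print(joined)
```

The row for `06001` disappears because it has no match in `left`, while `05003` is kept with missing values because it has no match in `right`.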

In [None]:
# your code here

In [None]:
hesitancy['GEOID'] = hesitancy.fips_code.astype(str).str.zfill(5)
hesitancy.set_index('GEOID', inplace=True)

joinedDf = censusDf.drop(columns='state').join(hesitancy)
joinedDf.head()

<div class="alert alert-block alert-info">
<strong>Exercise:</strong> Create some county-level maps and other simple analyses.
</div>

The [geopandas documentation](https://geopandas.org/en/stable/docs/user_guide/mapping.html) has some useful tips.

*Hint*: Make sure your column is numeric before you plot it!
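
One way to check this, sketched on a made-up column: `pd.to_numeric` with `errors='coerce'` turns anything that can't be parsed into a missing value instead of raising an error. The values below are hypothetical.

```python
import pandas as pd

# Toy column with strings, a missing value, and an unparseable entry
col = pd.Series(['0.12', '0.30', None, 'n/a'])

# Unparseable entries become NaN rather than stopping the conversion
numeric = pd.to_numeric(col, errors='coerce')

print(pd.api.types.is_numeric_dtype(numeric))  # check the dtype before plotting
print(numeric.isna().sum(), 'values could not be converted')
```

It is worth looking at how many values were coerced to missing, since a large number usually points to a cleaning problem earlier on.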

In [None]:
# your code here

In [None]:
# super basic but a start
# I'm going to limit to one state for clarity. If you don't, you should probably limit to the continental US
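# .copy() gives us an independent dataframe, so adding a column below won't raise a SettingWithCopyWarning
joinedDf = joinedDf[joinedDf.state=='ARKANSAS'].copy()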
joinedDf['hesitant_numeric'] = joinedDf.estimated_hesitant.astype(float)
joinedDf.plot('hesitant_numeric',  cmap='OrRd', legend=True)

<div class="alert alert-block alert-info">
<h3>What you should have learned</h3>
<ul>
  <li>Gain more practice with APIs.</li>
  <li>Understand basic data cleaning operations, such as converting strings to numeric fields.</li>
  <li>Understand how to compute group-level means and other summary statistics.</li>
  <li>Understand how to join tables on a common column.</li>
</ul>
</div>