
# How The Web Works... and an Introduction to APIs
<br>

<img src="https://cdn-images-1.medium.com/max/1920/1*CWytxLBZtxrxekPofi0-RQ.png" width=500>
<br>

Before we take a look at APIs, let's take a step back and learn about how the web works. Understanding some of the fundamentals of how information travels around the internet will help us a ton. The best place to start is with HTTP, the Hypertext Transfer Protocol.

## HTTP

A simple protocol called [HTTP](https://en.wikipedia.org/wiki/Hypertext_Transfer_Protocol) powers most of the communications on the web, including your browser and probably most of the apps that you use. HTTP allows you (via your browser, a mobile app or even code you write!) to **request** data (HTML or a web page, PDFs, MP3s, JSON, CSVs, etc.) from a service across the internet (e.g. google.com, twitter.com) and that service will respond with the requested data (i.e. the **response**). 

Let's look at HTTP in more detail by walking through Mike's slides on "How the Internet Works." You can find the Keynote (for Macs) or PDF copy of the slides in the docs directory of our GitHub repository.

### A Quick Overview of HTTP: Request and Response

Here is the tl;dr on how an HTTP request works:

1. A "**client**" (your browser, your Instagram app, or even some code that you are about to write) makes a **request** for data.

2. The "**request**" is in the form of a [URL](https://en.wikipedia.org/wiki/URL) (Uniform Resource Locator -- a web address). The URL specifies the site you are requesting information from and the page/document/data you want. For example: https://nytimes.com/ is the URL for the New York Times and this URL https://www.nytimes.com/interactive/2019/12/19/opinion/location-tracking-cell-phone.html specifies a given news story in the form of an HTML page.

3. A "**server**" receives the request (e.g. a New York Times server, if you are requesting a nytimes.com URL) and then returns the page/document/data you asked for. This is the **response**. This response can be an HTML page, a PDF, some JSON or CSV data, etc.

Simple!

One important note: this type of request is called a "**GET**" request. There are other types of HTTP requests which we'll learn about later. (The main difference being in how you specify the data you want -- GET specifies the data you want in the URL of the request as we'll see below.) 

### Anatomy of a URL

The [URL](https://en.wikipedia.org/wiki/URL), or Uniform Resource Locator, or "web address," contains a variety of important information about data that we are requesting. Here are the various fields in a URL:

<img src="https://camo.githubusercontent.com/43bd353c3d0879547481da33bba7d15768bdf4bb/68747470733a2f2f7261772e6769746875622e636f6d2f41544c2d5744492d437572726963756c756d2f686f772d7468652d696e7465726e65742d776f726b732f6d61737465722f696d616765732f616e61746f6d792d75726c2e706e67" width=500>
    
For now, we're just going to focus on the protocol, domain and path. The parameters are very important but we'll come back to that in a future lesson.
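As a rough sketch of the protocol/domain/path split described above, Python's standard-library `urllib.parse` module can pull a URL apart for us. The URL is the New York Times story from earlier; the variable names are just for illustration.

```python
from urllib.parse import urlsplit

story_url = "https://www.nytimes.com/interactive/2019/12/19/opinion/location-tracking-cell-phone.html"

# split the URL into its named pieces
parts = urlsplit(story_url)

print(parts.scheme)   # the protocol
print(parts.netloc)   # the domain
print(parts.path)     # the path to the page we want
print(parts.query)    # the parameters (empty for this URL)
```

Notice that this only takes the string apart; no request is sent anywhere.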

### What Kind of Data is on the Other End of a Request?

The data you find in a web page (HTML) or PDF document is meant to be read as you would read the page of a book. But in this class, we'll learn that that kind of reading is labor-intensive. We want a computer to read for us instead -- to take in the data and create something new. This means we want other formats, which leads us to CSV, JSON and XML.


[State of the Union, and specifically recent TV Ratings (**HTML**)](https://en.wikipedia.org/wiki/State_of_the_Union):<br> 
[`https://en.wikipedia.org/wiki/State_of_the_Union`](https://en.wikipedia.org/wiki/State_of_the_Union)

[NOAA Daily Weather Records (**HTML**)](https://www.ncdc.noaa.gov/cdo-web/datatools/records):
<br>
[`https://www.ncdc.noaa.gov/cdo-web/datatools/records`](https://www.ncdc.noaa.gov/cdo-web/datatools/records)

[USDA School Breakfast Program Monthly Data (**PDF**)](https://catalog.data.gov/dataset/school-breakfast-program-monthly-data/resource/11d5e56a-a7ed-4fb0-a07c-3e8aac48c4cf):
<br>
[`https://fns-prod.azureedge.net/sites/default/files/pd/35sbmonthly.pdf`](https://fns-prod.azureedge.net/sites/default/files/pd/35sbmonthly.pdf)

[FDNY Monthly Response Times (**CSV**)](https://data.cityofnewyork.us/Social-Services/FDNY-Monthly-Response-Times/j34j-vqvt):
<br>
[`https://data.cityofnewyork.us/api/views/j34j-vqvt/rows.csv`](https://data.cityofnewyork.us/api/views/j34j-vqvt/rows.csv)

[FDNY Monthly Response Times (**JSON**)](https://data.cityofnewyork.us/Social-Services/FDNY-Monthly-Response-Times/j34j-vqvt):
<br>
[`https://data.cityofnewyork.us/resource/6b8a-2fci.json`](https://data.cityofnewyork.us/resource/6b8a-2fci.json)


## Enough About URLs! Let's Write Some Code

OK, time for us to write some code to make our own HTTP requests. There are many Python libraries that handle all of the fun of HTTP for us - we'll use one simply called [`requests`](http://docs.python-requests.org/en/master/).

To install the `requests` Python library, run the following cell. Recall that the `!` sign indicates that the code in the cell is to be interpreted as something other than Python commands. In this case, we are giving instructions to the UNIX shell.

In [None]:
!pip install requests

In the code below, we will make an HTTP request to `https://nytimes.com`. This is what our browser does when we type nytimes.com into our browser bar.

In [None]:
from requests import get

# Specify the location of the information you want as a string

url = 'https://nytimes.com'

# Then fetch the data (the resource) at that address using get() from
# the "requests" package

response = get(url)

So what is `response`? Remember that we can inspect the object to see what type it is?

In [None]:
print(type(response))

I'm not sure how much that helps us, so let's jump over to the [`requests` library documentation](https://requests.readthedocs.io/en/master/) to see how we use this library. In particular, we can look at the [Response documentation](https://2.python-requests.org/en/master/api/#requests.Response).

In [None]:
# print out the HTTP status code
print(response.status_code)

There are a lot of possible [HTTP status codes](https://en.wikipedia.org/wiki/List_of_HTTP_status_codes) that the nytimes.com server might return, but we're hoping for a `200` here (which means "success!").
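If you want a quick reminder of what a few common status codes mean, the Python standard library ships a lookup table in `http.HTTPStatus`. This is a small aside, not part of `requests`; the particular codes chosen below are just examples.

```python
from http import HTTPStatus

# look up the standard short description for a few status codes
for code in (200, 301, 404, 500):
    status = HTTPStatus(code)
    print(code, status.phrase)
```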

In [None]:
# we can also print out the "headers" sent back by nytimes.com
# the response headers are data sent from nytimes.com
# and they contain information about the page we just requested
# take a look - anything interesting to see in there?
print(response.headers)

Again, the "headers" contain information sent to us alongside the page that we've requested (in this case, the nytimes.com homepage). The headers will have information about the page we just requested. Looking at the output in the cell above, can you tell the type of Python object? Hint: it rhymes with pict :-)

Now you - since `response.headers` is a dictionary, how would you find the "Content-Length" value? Content-Length is the nytimes.com server telling us how many bytes they've sent us.

In [None]:
# put your code here:




In [None]:
# best of all, we can see the page we've requested using the following code
# this is the nytimes.com homepage HTML
print(response.text)

**NOTE**: This is the same as opening the URL in Chrome and selecting `View ➡️ Developer ➡️ View Source`

### A Quick Exercise

Write some code in the box below to make an HTTP request to NPR's homepage (npr.org). After you make the request, print the homepage HTML. Ready? Go!

In [None]:
# put your code here:




**Follow-up Question**: Have a look at the HTML page. Some of you will have had experience writing or reading HTML (which hopefully makes you appreciate Markdown). You see `tags` that are used to structure the information on the page. You might see headers `<h1>` say, or paragraphs `<p>` or anchor tags (links) `<a>`. 

Now, try to find all the headlines on the page. Do they have special tags? Is there other information in the tags that indicate the content is a headline? When we surf the web making requests from Python, we have to learn to *parse* HTML pages and figure out how to pull out the information we are interested in. You might be interested in NPR headlines, for example. You might be interested in only the *Opinion* stories. So, looking at the NPR homepage HTML, what patterns do you notice?

Code which fetches a page, like the NYTimes or NPR, and parses out the headlines is an example of "web scraping."
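To make the idea of "parsing" concrete, here is a minimal sketch using Python's built-in `html.parser`. The HTML below is made up for illustration; it is NOT the real NPR or NYTimes markup, and the `headline` class name is an assumption. Real pages are messier, and people often reach for a dedicated parsing library instead.

```python
from html.parser import HTMLParser

# a made-up snippet of HTML (not taken from any real news site)
sample_html = """
<html><body>
  <h1>Example News</h1>
  <h2 class="headline">First made-up headline</h2>
  <p>Some story text.</p>
  <h2 class="headline">Second made-up headline</h2>
</body></html>
"""

class HeadlineParser(HTMLParser):
    """Collect the text inside every <h2 class="headline"> tag."""

    def __init__(self):
        super().__init__()
        self.in_headline = False
        self.headlines = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag
        if tag == "h2" and ("class", "headline") in attrs:
            self.in_headline = True

    def handle_endtag(self, tag):
        if tag == "h2":
            self.in_headline = False

    def handle_data(self, data):
        # only keep text that appears while we are inside a headline tag
        if self.in_headline:
            self.headlines.append(data.strip())

parser = HeadlineParser()
parser.feed(sample_html)
print(parser.headlines)
```

The pattern is the important part: find the tag (and attributes) that mark the content you care about, then collect the text inside.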

This isn't to say that it's all difficult. There is one kind of tag that is especially easy to work with. It's the `<table>` tag. It, well, creates a table. Have a look at the [Wikipedia page about the State of the Union Address](https://en.wikipedia.org/wiki/State_of_the_Union). It has a table of TV Ratings. Just so we don't forget about Pandas, we can pull that table into a DataFrame and start to work with it. We have seen `read_csv()` for the Axios data, but now we'll try `read_html()` for a `<table>` on a web page.

In [None]:
from pandas import read_html

wiki_sotu = read_html("https://en.wikipedia.org/wiki/State_of_the_Union")
print(type(wiki_sotu))

We have a list! One element per table on the page. Use the `View ➡️ Developer ➡️ View Source` to see that there are other `<table>` tags floating around to structure the different parts of the page. How many objects in the list?

In [None]:
print(len(wiki_sotu))

Write some code to pull out the first table...

In [None]:
# Your code here



... now the second...

In [None]:
# Your code here



Notice that the headers for the table are included in the first row of the table. That's too bad. We can tell `read_html()` to use the first row of the table (index 0) for the headings of the columns. The *argument* is called `header`. (We are also going to specify that "TBD" is a missing value.)

In [None]:
# Read in the HTML page, collecting the tables
wiki_sotu = read_html("https://en.wikipedia.org/wiki/State_of_the_Union", header=0, na_values="TBD")

# Pull out the second table, which has the TV ratings, and call the DataFrame "ratings"
ratings = wiki_sotu[1]

# And then have a look!
ratings

While it's a little "mean"ingless, take the average of the total viewers column.

In [None]:
ratings["Viewers, millions"].mean()

When someone puts the data you want on a web page, putting it in a table is a huge advantage. It is probably the closest thing that you can get to "publishing data" on a web page. There are still issues, like figuring out which table on the page you want or whether everything is formatted properly, but it can be pretty easy.

There is a lot more to learn about HTTP and "web scraping" but we'll pick that up in future lessons. For now, let's move on to APIs! With an API we get data, honest to goodness data, and not some piece of a document made to look sorta like data.

## What's an API?

An API, or application programming interface, allows you to specify the data you want and returns it in a computer-friendly format like [JSON](https://www.json.org/) or [XML](https://en.wikipedia.org/wiki/XML) rather than HTML. The "interface" is a regularized way to make requests, and a consistent specification for the data you asked for. Many organizations now publish APIs for their data, from [The New York Times](https://developer.nytimes.com/) and [ProPublica](https://propublica.github.io/campaign-finance-api-docs/), to governmental organizations like the ~~[EPA](https://developer.epa.gov/category/api/)~~, to social media sites like [Twitter](https://developer.twitter.com/en/docs), [Instagram](https://developers.facebook.com/products/instagram/) and [LinkedIn](https://developer.linkedin.com).

**The idea of an API is quite old,** and in fact APIs exist throughout the operating system in your computer. There is an API that lets different applications on your computer access printing capabilities, or communicate via your wireless hardware. These APIs, again, provide application developers with a regularized way to access services. So Word's print screen looks like the print screen from your PDF previewer or even Photoshop.

**Then in time, the services that were being advertised moved from your computer to the web.** So-called "mashups" came on the scene that let you feed data from one service into another. To put this in a vague historical perspective, if Web 1.0 meant putting your content online, then Web 2.0 was about cooperation between sites, sharing data via the internet to build new services. 

Salesforce.com led the way with its API in 2000 (I believe), recognizing that customers needed the same data across different platforms. eBay followed, providing an API so that others could embed their data and services. Personally, it was the Google Maps API that really drove the idea home. It appeared in 2005 and immediately spawned a number of mapping mashups. You can read about the history of APIs from [a services perspective](https://history.apievangelist.com/), [as evolution of the mashup](https://www.ibm.com/developerworks/library/x-mashups/index.html), or as [a technical innovation](http://www.openlegacy.com/blog/the-history-of-apis-and-how-they-impact-your-future), eventually leading back to a [PhD thesis in 2000 by Roy Fielding](http://www.ics.uci.edu/~fielding/pubs/dissertation/top.htm) laying out the whole scheme. 

Today there are so many APIs it's hard to keep track. Look at the growth, captured by the "readmeblog".

<img src="https://blog.readme.io/content/images/2016/11/Screenshot-2016-11-01-16.01.29.png" width=500>

Ah, but fortunately someone is keeping track for us! Have a look at [ProgrammableWeb](https://www.programmableweb.com/) for all the latest APIs. They have a [report on API growth](https://www.programmableweb.com/news/programmableweb-api-directory-eclipses-17000-api-economy-continues-surge/research/2017/03/13) that includes a table!

In [None]:
api_tables = read_html("https://www.programmableweb.com/news/programmableweb-api-directory-eclipses-17000-api-economy-continues-surge/research/2017/03/13")
api_tables[0]

OK now we're just showing off. 

**Google's autocomplete**. Let's start our API fun by looking at the API that powers Google's auto-complete/suggest feature. Every time you start typing something into your Chrome browser or the Google search box, it will make suggestions for you. That's all negotiated by an API. You can [read about it here](https://shreyaschand.com/blog/2013/01/03/google-autocomplete-api/). Notice that this is an "unpublished" API in that its specifics are not documented by Google. 

Here is how you'd access that programmatically (that is, with Python and not with a browser).

In [None]:
# make a request -- here it is like we've typed "donald trump is"
url = "http://suggestqueries.google.com/complete/search?client=firefox&q=donald trump is"
response = get(url)

# print out the response we get -- we aren't print()ing it so you can see it's a string
response.text

### Interlude - JSON

The `response.text` is a string. But it's in a format that looks eerily familiar. If you had to guess what would you say this string represents? That is, if you were to read the string as a piece of Python, what kinds of data do you see? What structures is it organized in?

The output here looks like a list of lists. It is formatted in something called JavaScript Object Notation (JSON); you can [read about it here](https://www.json.org/) or see [a tutorial here](https://restfulapi.net/introduction-to-json/) (just read up to the "syntax" part). Organizing data in JSON looks a lot like what we do in Python (with some small exceptions).

So in the string above, we see square brackets that mean a "list" in Python or an "array" in JSON. A "dictionary" in Python is specified in the same way as an "object" in JSON. There are some subtle differences: for example, `None` in Python is `null` in JSON. But let's ignore that for the moment. 
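Here is a minimal sketch of that Python/JSON correspondence, using the standard-library `json` module. The strings are made up for illustration (loosely modeled on an autocomplete-style response); the real data from Google will differ.

```python
import json

# a made-up JSON string: an array holding a string, another array, and a null
sample_text = '["example query", ["example query one", "example query two"], null]'
parsed = json.loads(sample_text)

print(type(parsed))        # a JSON array becomes a Python list
print(parsed[1])           # the inner array is a list too
print(parsed[2] is None)   # JSON null becomes Python None

# a JSON object becomes a Python dictionary
sample_obj = json.loads('{"state": "Iowa", "fips": "19"}')
print(type(sample_obj), sample_obj["state"])
```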

So what we see is Google providing us data in a format we can use directly in our code. The `requests` library returns an object that not only has the `.text` of the response, but also has a method called `.json()` which parses the JSON and tries to turn it into a Python object.

Let's make our request again - this time we will work with the Python object that's created.

In [None]:
url = "https://www.google.com/complete/search?client=firefox&q=donald trump is"
response = get(url)

# the requests library helps us with JSON. here, we can convert the response (JSON) to a python object
data = response.json()

# what type of object is it?
print(type(data))

In [None]:
print(len(data))

In [None]:
data[0]

In [None]:
data[1]

In [None]:
# write some code to pull out the fifth suggestion



In [None]:
# Try to auto-complete another phrase and look at the results 



*Technical note.* A URL can't include spaces, but the `requests` package and your browser are now smart enough to clean things up before they send it to Google or whatever service you're pulling data from. So adding "donald trump is" as the query string is strictly speaking not right, but the environment is making up for the mistake. We'll say more about character encodings later.
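For the curious, here is a small sketch of what that "clean up" can look like, using the standard-library `urllib.parse` module rather than `requests` itself. Exactly which encoding a given browser or library picks for a space can vary; both forms below are common.

```python
from urllib.parse import quote, urlencode

phrase = "donald trump is"

# quote() percent-encodes each space as %20
print(quote(phrase))

# urlencode() builds a whole query string and writes each space as "+"
print(urlencode({"client": "firefox", "q": phrase}))
```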

### API Authentication

The autocomplete from Google is designed to be used widely. We simply made a request and received data. Most API providers, on the other hand, require you as the developer to use a form of authentication while using their APIs. This way they know who's doing what and can impose limits (so that you don't put too much of a drain on their servers, say). There are various forms of authentication: OAuth, API keys and even usernames and passwords.

For example, some providers, like [The New York Times](https://developer.nytimes.com/), only require that you use an API key when making API calls. With API keys, you usually just pass the key in your API calls, like:

```
https://api.nytimes.com/svc/search/v2/articlesearch.json?q=tesla&api-key=abcxyz
```

[OAuth](https://en.wikipedia.org/wiki/OAuth) is a bit more complicated but provides more fine-grained control for the API service as well as the users. We'll come back to this next week when we work with the Twitter APIs (yep, they use OAuth for their API authentication).

## Census API

The [Census](https://www.census.gov/developers/) has a lot of data (available via their APIs) that will be useful for our Iowa predictions that we explored in class last week. You can have a look at the [various APIs](https://www.census.gov/data/developers/data-sets.html) that they make available, but we'll focus specifically on the [American Community Survey (ACS)](https://www.census.gov/programs-surveys/acs) data. In particular, we'll be looking at the [2018 ACS data](https://www.census.gov/programs-surveys/acs/news/data-releases/2018/release.html). The [ACS](https://en.wikipedia.org/wiki/American_Community_Survey) is an ongoing survey conducted by the Census and provides estimates each year in between the decennial census.

Before we do anything, please sign up for a Census API key. You can do that by filling out their form here: https://api.census.gov/data/key_signup.html. They should email you an API key (which we'll use in our API calls below). You can call the APIs before they issue you a key but you'll be rate limited.

Next, take a look at [this page](https://www.census.gov/data/developers/guidance/api-user-guide/query-examples.html) which will show you how to construct URLs to query the API to retrieve the ACS data that you're looking for. The documentation isn't the best :-) so we'll stumble through it together. Make sure to expand the *Example: The American Community Survey (ACS)* for the proper instructions. What you'll see on that page is that you can query for: 
 * a particular year
 * the 1-year or 5-year estimates
 * a particular region (state, county, etc)
 * the actual variables/data you are looking for (population, race, etc)
 

Let's start by constructing a URL that will fetch the name of each county in Iowa:

https://api.census.gov/data/2018/acs/acs5?get=NAME&for=county:*&in=state:19

Clicking on the link should give you a page that has the following results (truncated):

```
[["NAME","state","county"],
["Cass County, Iowa","19","029"],
["Cherokee County, Iowa","19","035"],
["Crawford County, Iowa","19","047"],
["Des Moines County, Iowa","19","057"],
```

If we break down the URL into its various components, you'll see that we have the base URL of:

`https://api.census.gov/data/2018/acs/acs5`

This means we are asking for the ACS 5-year data from 2018.

The URL is then followed by some URL parameters:

  * `get=NAME`: this tells the API that we want to get the location/geo name only (we'll ask for other fields later)
  * `for=county:*`: this tells the API that we want every county. The `*`, or wildcard, means "all" in this case.
  * `in=state:19`: this tells the API that we only want data for state #19. The census uses [FIPS codes](https://en.wikipedia.org/wiki/Federal_Information_Processing_Standard_state_code) and 19 is the state code for Iowa.
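The same breakdown can be done in code. This is a small sketch using the standard-library `urllib.parse` module to separate the base URL from its parameters; it only manipulates the string and does not contact the Census.

```python
from urllib.parse import urlsplit, parse_qs

census_url = "https://api.census.gov/data/2018/acs/acs5?get=NAME&for=county:*&in=state:19"
parts = urlsplit(census_url)

# everything before the "?" is the base URL
base_url = f"{parts.scheme}://{parts.netloc}{parts.path}"

# everything after the "?" is the set of URL parameters
# (parse_qs puts each value in a list, since a parameter can repeat)
params = parse_qs(parts.query)

print(base_url)
print(params)
```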


Now, let's make that same API call using a little python. You can use the URL above, but you should add your API key to the URL by adding `&key=your_own_api_key_goes_here` to the end of the URL. Put your own API key in there - you should have received one by now from the Census.


In [None]:
# call the API from python
url = "https://api.census.gov/data/2018/acs/acs5?get=NAME&for=county:*&in=state:19"
response = get(url)

# the requests library helps us with JSON. here, we can convert the response (JSON) to a python object
data = response.json()

# what type of object is it?
print(type(data))

In [None]:
# print out the first element in the list, which should be the header
data[0]

In [None]:
# how about the second element in the list
data[1]

In [None]:
# if we look at the entire thing, it looks like a "list-of-lists"
data

If we have a list of lists, we can easily drop this into a pandas DataFrame.

In [None]:
# you try...create a DataFrame from our list-of-lists




Now that we are able to make a simple API call to grab ACS data, let's ask for more data! In the API call, we can ask for multiple variables in the `get` URL parameter (as long as they are comma separated). This page has all of the variables we can ask for in the 2018 ACS 5-year data set: https://api.census.gov/data/2018/acs/acs5/variables.html

As a quick illustration, let's ask for the *Estimate!!Total (SEX BY AGE)* which is denoted by this name: `B01001_001E`. All we need to do is add `B01001_001E` to our URL from above and we should get each county's population. Notice that we've added it right after `NAME` - so, in this case, we are saying that we'd like to `get` both the `NAME` and `B01001_001E` for each county in Iowa:

https://api.census.gov/data/2018/acs/acs5?get=NAME,B01001_001E&for=county:*&in=state:19

Now, write some python to fetch this URL and print out the results:

In [None]:
# put your code here




In [None]:
# once you've done that, can you print out the population of Boone County?




### Tidying Up our API Calls

The API URLs are starting to look a little messy! Our last URL was `https://api.census.gov/data/2018/acs/acs5?get=NAME,B01001_001E&for=county:*&in=state:19` - if we start to add some more URL parameters, it'll be pretty unreadable. If you look at the URL parameters, do you notice a pattern? In `get=NAME,B01001_001E&for=county:*&in=state:19`, the URL parameters are delimited by the `&` character, which means we have the following parameters:
  * `get=NAME,B01001_001E`
  * `for=county:*`
  * `in=state:19`
  
Looking at each parameter, do you notice another pattern? Each of the URL parameters is a key/value pair, delimited by the `=` character. So, we have:
  * `get` --> `NAME,B01001_001E`
  * `for` --> `county:*`
  * `in`  --> `state:19`
  
When you hear "key/value pairs", what python object do you think of? Hint: it sounds like "pictionary."

The `requests` library gives us an easy way to add URL parameters to any URL in a clean way - by creating a dictionary of URL parameters. Let's take a look at how we do this.

Using the same example as above, we are going to use the `requests` library to call the same URL, but in a slightly cleaner way. The URL we want to request is: 

`https://api.census.gov/data/2018/acs/acs5?get=NAME,B01001_001E&for=county:*&in=state:19`

In [None]:
# using the same example as above
# the URL we want to call, looks like this:
# https://api.census.gov/data/2018/acs/acs5?get=NAME,B01001_001E&for=county:*&in=state:19

# and we can break it up into the "base" URL and the URL parameters

from requests import get

url = "https://api.census.gov/data/2018/acs/acs5"  # we're leaving off: ?get=NAME,B01001_001E&for=county:*&in=state:19

url_parameters = {
    "get": "NAME,B01001_001E",
    "for": "county:*",
    "in": "state:19"
}

response = get(url, params=url_parameters)

# we can print out the URL that requests has created, using our URL parameters
# does it look correct?
print(response.url)

### A Quick Note on URL Encoding

Here, the `requests` library takes our dictionary of URL parameters and properly appends them to our URL. When printing out our "final" URL, do you notice the `%2C` in the URL, instead of a `,`? URLs may only contain specific, valid characters (essentially `A-Z`, `a-z`, `0-9`, `-`, `_`, `.`, `~`, and a handful of other characters). Any characters not in the list of valid characters need to be encoded. You can read a bit more on [URL Encoding](https://en.wikipedia.org/wiki/Percent-encoding) but, essentially, any characters not in the valid list are encoded using the `%` character followed by the hexadecimal value of the [ASCII character](https://en.wikipedia.org/wiki/ASCII). In the case of the comma (`,`), it's encoded as: `%2C`. 2C is the hexadecimal value for the comma.
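You can check the `%2C` arithmetic yourself. The sketch below uses plain Python plus the standard-library `urllib.parse` module (not `requests`) to go from the comma to its ASCII code, to hexadecimal, and back again.

```python
from urllib.parse import quote, unquote

comma = ","

print(ord(comma))                  # the ASCII code of the comma, as a decimal number
print(format(ord(comma), "02X"))   # the same value written in hexadecimal
print(quote(comma))                # the percent-encoded form

# and decoding goes the other way
print(unquote("NAME%2CB01001_001E"))
```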

Anyway, I think this code looks a little cleaner and will be easier to edit as we want to add more data fields to the `get` field or even change the state/county values. 
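As one example of that kind of edit, here is a sketch of adding your Census API key to the dictionary of URL parameters. To keep it runnable without touching the network, it builds the final URL with the standard-library `urlencode()` instead of calling `get()`; with `requests`, you would simply pass the same dictionary as `params=`. The key value is the placeholder from earlier, not a real key.

```python
from urllib.parse import urlencode

base_url = "https://api.census.gov/data/2018/acs/acs5"

url_parameters = {
    "get": "NAME,B01001_001E",
    "for": "county:*",
    "in": "state:19"
}

# adding the API key is just one more key/value pair
url_parameters["key"] = "your_own_api_key_goes_here"

# build the full URL by hand so we can inspect it
full_url = base_url + "?" + urlencode(url_parameters)
print(full_url)
```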

### Saving our JSON Data to Files

Now that we're able to fetch data from the Census API, we may want to store the data locally (on our laptop) so we don't have to fetch it each time we want it. This is called "caching" data locally. Why would we want to do this? We might be working offline and need our data files to run some analysis. We might not want to wait for a slow API call each time we need to load our data.

In previous notebooks, we saw how to read data from files (usually CSV files) stored on our laptops. Now, let's look at how we can save data to a file. To write data to a file in python, we need to open up a file, write some data to it and then close the file. A simple version of that looks like this:

In [None]:
# first, open a file
# the "w" means we want to write to the file
# we could replace the "w" with an "a" if we want to "append" to an existing file
our_file = open("sample_file.txt", "w")

# write the string "hello!" to our file
our_file.write("hello!")

# close the file
our_file.close()

In [None]:
# now, we can open the file and read the data from the file back into our notebook
our_file = open("sample_file.txt", "r")

# read the contents of the file into our variable called data
data = our_file.read()

# close the file now that we're done reading
our_file.close()

# what type is it?
print(type(data))

In [None]:
# let's print it out. did we get what we expect?
print(data)
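A common alternative to calling `close()` yourself is Python's `with` statement, which closes the file automatically when the indented block ends. Here is a minimal sketch; it writes to a temporary directory (an assumption made so the example doesn't overwrite anything in your notebook directory).

```python
import os
import tempfile

# put the sample file in a temporary directory
path = os.path.join(tempfile.mkdtemp(), "sample_file.txt")

# the file is closed automatically at the end of each "with" block
with open(path, "w") as our_file:
    our_file.write("hello!")

with open(path, "r") as our_file:
    contents = our_file.read()

print(contents)
print(our_file.closed)   # the file has already been closed for us
```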

OK, now that we know how to "write to" and "read from" files, let's see how we store JSON data in a file. Let's use our code from above to fetch some data from the Census API and save it to a file:

In [None]:
url = "https://api.census.gov/data/2018/acs/acs5"  # we're leaving off: ?get=NAME,B01001_001E&for=county:*&in=state:19

url_parameters = {
    "get": "NAME,B01001_001E",
    "for": "county:*",
    "in": "state:19"
}

response = get(url, params=url_parameters)

data = response.json()

print(type(data))

Our variable `data` is a Python list. We can't just save a list to a file - we first need to convert it to a string before we write it to a file. The Python [`json`](https://docs.python.org/3/library/json.html) library has two great utilities:
 * `dumps()` converts a python object, like a list, to a string
 * `loads()` converts a string into a python object (if it's valid JSON)
 
Let's convert our list to a string and save it to a file:

In [None]:
# save our list from the census to a file called "iowa_data.json"
import json

# open our file for writing
census_file = open("iowa_data.json", "w")

# convert the list to a JSON string
data_str = json.dumps(data)

# we should have a string now!
print(type(data_str))

# let's write it to the file
census_file.write(data_str)

# and close up the file
census_file.close()

In [None]:
# now, let's read it back in and convert it back to a list

# open our file for *reading*
census_file = open("iowa_data.json", "r")

# read in the file contents
data_str = census_file.read()

# we should have a string (JSON)
print(type(data_str))

# convert the JSON string to a list
data = json.loads(data_str)

# we should have a list now
print(type(data))

# and close up the file
census_file.close()

In [None]:
# let's print it out to make sure it looks ok
print(data)
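Putting the pieces together, here is one way the "caching" idea from the start of this section could be sketched. The helper `load_or_fetch()` and the stand-in `fake_api_call()` are hypothetical names made up for this example, and the rows are made-up data; in practice the fetch function would wrap your `get(...)` call to the Census API. It uses `json.dump()` and `json.load()`, which write to and read from a file directly.

```python
import json
import os
import tempfile

def load_or_fetch(path, fetch):
    """Return cached data from path if the file exists; otherwise call fetch() and cache the result."""
    if os.path.exists(path):
        with open(path, "r") as cache_file:
            return json.load(cache_file)
    data = fetch()
    with open(path, "w") as cache_file:
        json.dump(data, cache_file)
    return data

# a stand-in for a slow API call; it records each time it is called
calls = []

def fake_api_call():
    calls.append(1)
    return [["NAME", "value"], ["Example County", "123"]]

cache_path = os.path.join(tempfile.mkdtemp(), "cached_data.json")

first = load_or_fetch(cache_path, fake_api_call)    # calls the "API" and writes the file
second = load_or_fetch(cache_path, fake_api_call)   # reads the file instead

print(first == second, len(calls))
```

The second call never touches the "API": the data comes straight from the file.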

### Saving Files to a Directory

When writing to files like we did above, the file will be created in the same directory as your notebook. In most cases, this is fine, but in some cases, you may want to save the files to another directory. Let's say you create a new directory called `data_files` and let's assume this directory is in the same directory as your notebook. You can put the directory name in the `open` command, like this:

```
census_file = open("data_files/iowa_data.json", "w")
census_file.write("hello!")
census_file.close()

```

This will create a new file called `iowa_data.json` in your `data_files` directory.
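One catch: `open()` will raise an error if the directory doesn't exist yet. A minimal sketch of creating the directory from Python first, using the standard-library `pathlib` module, is shown here. It works inside a temporary directory as a stand-in for your notebook's directory (an assumption made to keep the example self-contained).

```python
import tempfile
from pathlib import Path

# stand-in for the directory your notebook lives in
notebook_dir = Path(tempfile.mkdtemp())

# create data_files if it isn't there already
data_dir = notebook_dir / "data_files"
data_dir.mkdir(exist_ok=True)

file_path = data_dir / "iowa_data.json"
with open(file_path, "w") as census_file:
    census_file.write("hello!")

print(file_path.exists())
```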

### Census JSON to Pandas DataFrame

Now that we know how to make API calls to grab Census data, let's load the Census data into a DataFrame so we can run some simple analysis on it. Let's make the same API call from above (getting the name and population of each county in Iowa) and load the data into a DataFrame:

In [21]:
import pandas as pd

url = "https://api.census.gov/data/2018/acs/acs5"  # we're leaving off: ?get=NAME,B01001_001E&for=county:*&in=state:19

url_parameters = {
    "get": "NAME,B01001_001E",
    "for": "county:*",
    "in": "state:19"
}

response = get(url, params=url_parameters)

data = response.json()

# now, load the list into a DataFrame
iowa_df = pd.DataFrame(data)

iowa_df.head()

Unnamed: 0,0,1,2,3
0,NAME,B01001_001E,state,county
1,"Cass County, Iowa",13191,19,029
2,"Cherokee County, Iowa",11468,19,035
3,"Crawford County, Iowa",17132,19,047
4,"Des Moines County, Iowa",39600,19,057


Well, that doesn't quite look right! The first row in the DataFrame is the "header" row. There are a few ways to fix this, but one approach is to say we want to load everything from `data` but skip the first row. This is done by specifying `data[1:]`, which means "give me everything from the second position in the list, all the way to the end." As part of creating the DataFrame, we can also specify the headers:

In [26]:
iowa_df = pd.DataFrame(data[1:], columns=['Name', 'Population', 'State', 'County'])

iowa_df.head()

Unnamed: 0,Name,Population,State,County
0,"Cass County, Iowa",13191,19,029
1,"Cherokee County, Iowa",11468,19,035
2,"Crawford County, Iowa",17132,19,047
3,"Des Moines County, Iowa",39600,19,057
4,"Fayette County, Iowa",19929,19,065
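If the `data[1:]` slice feels mysterious, here is the same header/rows split in plain Python, without pandas. The list below copies just the first few rows of the Census response shown above; the variable names are for illustration only.

```python
# the first few rows of the list-of-lists returned by the Census API
data = [
    ["NAME", "B01001_001E", "state", "county"],
    ["Cass County, Iowa", "13191", "19", "029"],
    ["Cherokee County, Iowa", "11468", "19", "035"],
]

header = data[0]    # the first element is the header row
rows = data[1:]     # everything from the second element to the end

# pair each header with the matching value in each row
records = [dict(zip(header, row)) for row in rows]

print(len(rows))
print(records[0]["NAME"], records[0]["B01001_001E"])
```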


In [28]:
# let's sort the counties by population
iowa_df.sort_values(['Population'])

Unnamed: 0,Name,Population,State,County
69,"Montgomery County, Iowa",10155,19,137
43,"Keokuk County, Iowa",10200,19,107
14,"Woodbury County, Iowa",102398,19,193
85,"Franklin County, Iowa",10245,19,069
53,"Winnebago County, Iowa",10571,19,189
20,"Mitchell County, Iowa",10631,19,131
96,"Guthrie County, Iowa",10674,19,077
65,"Hancock County, Iowa",10888,19,081
88,"Louisa County, Iowa",11223,19,115
1,"Cherokee County, Iowa",11468,19,035


Hmm... that doesn't look quite right. It looks like it's sorting alphabetically instead of numerically. This means that the data in our `Population` column is of type `str` (string), not `int` (integer). We can easily convert all of the values in the `Population` column to integers using the following:

In [31]:
# convert all values in the Population column to integers and replace them in the data frame
iowa_df['Population'] = iowa_df['Population'].astype(int)

# then, let's try sorting by population again (descending this time)
iowa_df.sort_values(['Population'], ascending=False)

Unnamed: 0,Name,Population,State,County
11,"Polk County, Iowa",474274,19,153
28,"Linn County, Iowa",222121,19,113
29,"Scott County, Iowa",172288,19,163
83,"Johnson County, Iowa",147001,19,103
15,"Black Hawk County, Iowa",133009,19,013
14,"Woodbury County, Iowa",102398,19,193
13,"Story County, Iowa",96922,19,169
32,"Dubuque County, Iowa",96802,19,061
93,"Pottawattamie County, Iowa",93503,19,155
75,"Dallas County, Iowa",84002,19,049
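
If you're curious why the first sort went wrong, here is a tiny illustration in plain Python using four of the population values from the output above. Strings are compared one character at a time, which is why Woodbury County's `102398` ended up between `10200` and `10245`.

```python
populations_as_text = ["10155", "10200", "102398", "10245"]

# strings are compared character by character, so "102398" sorts before "10245"
# (they match on "102", and then "3" comes before "4")
print(sorted(populations_as_text))

# converting to int first gives the numeric order we expect
populations_as_numbers = [int(p) for p in populations_as_text]
print(sorted(populations_as_numbers))
```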


### Now It's Your Turn

Partner up with a friend or neighbor and fetch some more data from the Census that we'll need in our modeling. The Census may not have everything we're looking for, but it has a lot of it. Take a look at the list of available variables here: https://api.census.gov/data/2018/acs/acs5/variables.html. You can request additional variables (up to 50 of them in each request) by adding them to the `get` URL parameter (hint: it's comma-delimited, so make sure you separate new variables with a comma). See our examples above and add additional fields that you'd like to capture. If you want more than 50, split them across several requests and save the data from each one (to a file!).
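
As a starting point, here is a rough sketch of how you might build the URL parameters from a Python list of variables. The helper names `build_census_params` and `chunk` are made up for this example, and the two extra variable codes (`B19013_001E`, which should be median household income, and `B01002_001E`, which should be median age) are assumptions you should double-check against the variables page. The sketch only builds the parameters; it doesn't make the request.

```python
def build_census_params(variables, state_code="19"):
    """Build URL parameters for a county-level request in one state.

    Hypothetical helper for illustration; the 50-variable limit is the one
    mentioned in the text above.
    """
    if len(variables) > 50:
        raise ValueError("too many variables for one request (limit is 50)")
    return {
        "get": ",".join(variables),  # the API expects a comma-delimited string
        "for": "county:*",
        "in": f"state:{state_code}",
    }


def chunk(items, size=50):
    """Split a long list into pieces of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


# variable codes other than NAME and B01001_001E are assumptions; verify them
variables = ["NAME", "B01001_001E", "B19013_001E", "B01002_001E"]

params = build_census_params(variables)
print(params)
```

You could then pass `params` to `get(url, params=params)` just like in the earlier example. If you have a longer list, `chunk(variables)` gives you groups small enough to send one request per group.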

In [None]:
# Your turn....





### Combining Election Data with Our Census Data

The [MIT Election Data & Science Lab](https://electionlab.mit.edu/data) has a handy CSV data file of all county-level returns for Presidential elections from 2000 to 2016. This doesn't have any data from the current primaries, but it will help in our modeling of previous elections. The CSV file can be found [here](https://dataverse.harvard.edu/file.xhtml?persistentId=doi:10.7910/DVN/VOQCHQ/HEIJCQ&version=6.0), but we've also copied it to our [GitHub](https://raw.githubusercontent.com/computationaljournalism/columbia2020/master/data/countypres_2000-2016.csv) and you can easily load it into a DataFrame:

In [71]:
from pandas import read_csv

county_df = read_csv("https://raw.githubusercontent.com/computationaljournalism/columbia2020/master/data/countypres_2000-2016.csv")

county_df.head()

Unnamed: 0,year,state,state_po,county,FIPS,office,candidate,party,candidatevotes,totalvotes,version
0,2000,Alabama,AL,Autauga,1001.0,President,Al Gore,democrat,4942.0,17208,20191203
1,2000,Alabama,AL,Autauga,1001.0,President,George W. Bush,republican,11993.0,17208,20191203
2,2000,Alabama,AL,Autauga,1001.0,President,Ralph Nader,green,160.0,17208,20191203
3,2000,Alabama,AL,Autauga,1001.0,President,Other,,113.0,17208,20191203
4,2000,Alabama,AL,Baldwin,1003.0,President,Al Gore,democrat,13997.0,56480,20191203


In [37]:
# let's look at all Iowa county-level elections

county_df[ county_df['state'] == 'Iowa' ]

Unnamed: 0,year,state,state_po,county,FIPS,office,candidate,party,candidatevotes,totalvotes,version
3032,2000,Iowa,IA,Adair,19001.0,President,Al Gore,democrat,1753.0,4123,20191203
3033,2000,Iowa,IA,Adair,19001.0,President,George W. Bush,republican,2275.0,4123,20191203
3034,2000,Iowa,IA,Adair,19001.0,President,Ralph Nader,green,66.0,4123,20191203
3035,2000,Iowa,IA,Adair,19001.0,President,Other,,29.0,4123,20191203
3036,2000,Iowa,IA,Adams,19003.0,President,Al Gore,democrat,897.0,2145,20191203
3037,2000,Iowa,IA,Adams,19003.0,President,George W. Bush,republican,1170.0,2145,20191203
3038,2000,Iowa,IA,Adams,19003.0,President,Ralph Nader,green,51.0,2145,20191203
3039,2000,Iowa,IA,Adams,19003.0,President,Other,,27.0,2145,20191203
3040,2000,Iowa,IA,Allamakee,19005.0,President,Al Gore,democrat,2883.0,6462,20191203
3041,2000,Iowa,IA,Allamakee,19005.0,President,George W. Bush,republican,3277.0,6462,20191203


If you look at the DataFrame, you'll notice that a few of the columns need a little cleanup. First, see that the `FIPS` column looks like a `float`ing point number (decimal)? Let's convert that to a string because we'll need that in a few minutes. Second, the `candidatevotes` column is also a `float` and can be converted to an `int`. We've done this above with the `astype` method, so give it a try here:

In [74]:
# first, let's convert the candidate votes to ints
# the fillna(0) means that if we don't have a value in one of the rows, we'll turn it into a 0
county_df['candidatevotes'] = county_df['candidatevotes'].fillna(0).astype(int)

# now, let's convert the FIPS code to strings. it's a little bit of a dance
# but since the field has decimal points (e.g. 1001.0), we convert it to `int`s first,
# which will remove the decimal points, and then convert it to a string
county_df['FIPS'] = county_df['FIPS'].fillna(0).astype(int).astype(str)

county_df.head()
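
One thing to watch for: after this conversion, a FIPS code like `1001.0` becomes `"1001"`, while county FIPS codes are usually written with five digits (`"01001"`). Iowa's codes all start with `19`, so this doesn't affect us here, but if you work with other states you may want to pad with leading zeros. Here is a minimal sketch on a tiny made-up DataFrame (`fips_df` is not part of our data):

```python
import pandas as pd

# a tiny stand-in for the FIPS column, including one missing value
fips_df = pd.DataFrame({"FIPS": [1001.0, 19001.0, float("nan")]})

fips_df["FIPS"] = (
    fips_df["FIPS"]
    .fillna(0)      # missing values become 0
    .astype(int)    # 1001.0 -> 1001 (drops the decimal point)
    .astype(str)    # 1001 -> "1001"
    .str.zfill(5)   # "1001" -> "01001" (pad to five characters)
)

print(fips_df["FIPS"].tolist())
```

Note that missing values end up as `"00000"` here, so you may prefer to drop those rows instead of filling them.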

In [None]:
# now, your turn....
# create a new data frame with only the 2016 Iowa results





Great! So, now we have two data sets: our Iowa 2016 election results and our Census data. How would you combine the two data sets in pandas? Work in small groups and explore how we can combine these two data sets. Ideally, we want to do some analysis on how the Presidential candidates did in each Iowa county in 2016 and look at some of the characteristics of those counties (from the Census). This is the data we'll use in any prediction modeling we do in the future.
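
If you get stuck, here is one possible approach (certainly not the only one), sketched on tiny made-up DataFrames. It assumes the two data sets can be matched on a five-digit FIPS code built from the Census `State` and `County` columns. The candidate names and vote counts below are invented for illustration, and `census_df`, `election_df` and `combined_df` are just example names.

```python
import pandas as pd

# a tiny stand-in for our Census DataFrame
census_df = pd.DataFrame({
    "Name": ["Cass County, Iowa", "Cherokee County, Iowa"],
    "Population": [13191, 11468],
    "State": ["19", "19"],
    "County": ["029", "035"],
})

# a tiny stand-in for the election results (vote counts are made up)
election_df = pd.DataFrame({
    "FIPS": ["19029", "19029", "19035", "19035"],
    "candidate": ["Candidate A", "Candidate B", "Candidate A", "Candidate B"],
    "candidatevotes": [100, 200, 300, 400],
})

# build a matching key: state code + county code, e.g. "19" + "029" -> "19029"
census_df["FIPS"] = census_df["State"] + census_df["County"]

# keep every election row and attach the Census columns for its county
combined_df = election_df.merge(census_df, on="FIPS", how="left")

print(combined_df)
```

Using `how="left"` keeps every row of the election results even if a county is missing from the Census data, which makes gaps easy to spot.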


In [None]:
# Your turn....



