# Chronicling America API

In this notebook, we'll get some data out of the Library of Congress's historical newspapers collection called Chronicling America. You can browse this collection online here: https://chroniclingamerica.loc.gov/. Take a look at this search interface and try a few searches.

Remember that an HTTP API is just a way of interacting with an application by sending requests to URLs that control the application. You can use the Chronicling America API to get back machine-readable data in JSON format by just adding `&format=json` to the end of the URL on one of your searches. There is more information on how to formulate searches here: https://chroniclingamerica.loc.gov/about/api/#search.
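If you'd rather not glue query strings together by hand, here is a minimal sketch that builds a search URL with Python's standard library. The `proxtext` and `format` parameters come from the search URLs used in this notebook; the `build_search_url` helper is just a name I made up for illustration.

```python
from urllib.parse import urlencode

BASE_URL = "https://chroniclingamerica.loc.gov/search/pages/results/"


def build_search_url(term, **extra):
    """Return a search URL that asks for JSON instead of an HTML page.

    Hypothetical helper: any extra keyword arguments are passed
    through as additional query parameters.
    """
    params = {"proxtext": term, "format": "json"}
    params.update(extra)
    # urlencode escapes characters that aren't safe in a URL,
    # for example it turns a space into '+'.
    return BASE_URL + "?" + urlencode(params)


print(build_search_url("dogs"))
print(build_search_url("hot dogs"))
```

The second call shows why this can be handy: a search term with a space in it is escaped for you.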

## Let's get some data: GET requests to the ChronAm API

The point of exposing an API for an application is to make it easier to write programs that interact with that application. For this example, we're working on getting data out of the API, so we'll use the HTTP GET method. We'll use the Python programming language and its requests library to interact with the API. You can run the code in the cells below by highlighting them and pressing Shift + Enter.

In [None]:
# we start by importing the requests library
import requests

# we'll also import the JSON library to help us read
# data as JSON later on.
import json

In [None]:
# now we can use the requests library to make HTTP
# requests a lot like how we use the browser. The request
# below is the equivalent of typing a URL into your web
# browser and hitting enter. Let's try a search of the
# Chronicling America API:

requests.get(
    'https://chroniclingamerica.loc.gov/search/pages/results/?proxtext=dogs&format=json'
)

You should have got back something like `<Response [200]>`. This is an HTTP status code that means success, but we're hoping to get a little more back than just a message like this, so let's improve on our program.
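If you're curious what other status codes mean, Python's standard library knows the standard names. This is a small sketch using `http.HTTPStatus`; the `describe_status` helper is a name invented for this example and isn't part of the requests library.

```python
from http import HTTPStatus


def describe_status(code):
    """Return a status code together with its standard name."""
    status = HTTPStatus(code)
    return "{} {}".format(status.value, status.phrase)


for code in (200, 404, 500):
    print(describe_status(code))
# prints:
# 200 OK
# 404 Not Found
# 500 Internal Server Error
```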

In [None]:
# In this example we'll assign the response to a variable
# I'll call 'result'. Try modifying the search below to search
# for a different term.

result = requests.get(
    'https://chroniclingamerica.loc.gov/search/pages/results/?proxtext=dogs&format=json'
)

This time, you should see no output. Instead, we saved the response as 'result'. In the cells below, let's look at some of the things we can do with a response.

In [None]:
# We can check the status code
result.status_code

In [None]:
# We can check what URL we asked for
result.url

In [None]:
# We can look at the raw content of the response
result.content

In [None]:
# The above is kind of messy so we can also look 
# at the content as JSON
result.json()
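To see the difference between the raw content and the parsed JSON without hitting the network, here is a small sketch. The bytes below are a made-up stand-in for `result.content`, not real API output, and calling `json.loads` on them is roughly what `result.json()` does for us.

```python
import json

# Invented sample data standing in for result.content. Only the
# 'items' key is taken from this notebook; the titles are made up.
raw_content = b'{"items": [{"title": "Example Gazette"}, {"title": "Sample Herald"}]}'

# Raw content is just bytes: fine for a computer, hard to work with.
print(type(raw_content).__name__)

# Parsing turns the bytes into ordinary Python dicts and lists.
parsed = json.loads(raw_content)
print(type(parsed).__name__)
print(parsed["items"][0]["title"])
```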

## Parsing Data

Ok, now we have a big pile of data, but how do we know what's really in there? Let's use some of the techniques from the last lesson on working with JSON to explore what we have.

In [None]:
# First let's assign some data to a variable to make it
# easier to work with.

data = result.json()

In [None]:
# Recall that we can use the .keys() method on a JSON object
# (a Python dict). Try calling it on the data object below.


In [None]:
# OK, let's see what's in some of those keys. Remember
# that the syntax for retrieving the value associated with
# a key is data['keyname']



In [None]:
# Looks like the meat of the data is in the items element
# I'll assign one item out of the list to a new variable

item = data['items'][6]
item
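When an item has a lot of fields, it can help to list each key next to the type of its value before reading the values themselves. Here is a minimal sketch of that idea. The `summarize` helper is hypothetical, and the sample record is invented; it only borrows the `items`, `state`, and `ocr_eng` field names that this notebook uses.

```python
def summarize(record):
    """Return a dict mapping each key to the type name of its value."""
    return {key: type(value).__name__ for key, value in record.items()}


# Invented sample shaped like the data in this notebook.
sample = {
    "items": [
        {
            "title": "Example Gazette",
            "state": ["Ohio"],
            "ocr_eng": "some OCR text would go here",
        }
    ]
}

print(summarize(sample))
print(summarize(sample["items"][0]))
```

You could try the same helper on `data` and `item` from the cells above.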

## Writing a short program that consumes an API

Now that we've had a look at our data, let's write a short program to do some analysis. Since the data includes the full OCR'd text for each article we find with our searches, let's do some analysis on that text.

We'll write a short program that searches the newspaper archive for articles published during the Civil War that match a term. We'll count how many articles come up in Union, Confederate, and Border states. We'll also try to apply a technique called sentiment analysis (https://en.wikipedia.org/wiki/Sentiment_analysis) to each article we retrieve. For this program, we're using a tool called the Afinn word list. The Afinn word list assigns a positive or negative number to each word in a list of common words, depending on whether the word is associated with positive or negative sentiment.

While our research project wouldn't pass peer review, I hope it demonstrates that getting data in a machine-readable form can save lots of busy work, and shows some common techniques for working with data from HTTP APIs.
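To make the word-list idea concrete, here is a toy sketch of how scoring text against a word list can work. It is not the Afinn library: the four words and their scores are invented for illustration, and the real list is much larger.

```python
import re

# Hypothetical mini word list with made-up scores.
TOY_WORD_SCORES = {"good": 2, "great": 3, "bad": -2, "terrible": -3}


def toy_score(text):
    """Add up the scores of every known word in the text."""
    words = re.findall(r"[a-z']+", text.lower())
    # Words that aren't in the list count as zero.
    return sum(TOY_WORD_SCORES.get(word, 0) for word in words)


print(toy_score("A great victory and good news"))  # 5
print(toy_score("Terrible weather, bad roads"))    # -5
print(toy_score("The train arrived at noon"))      # 0
```

Notice that longer texts have more chances to pick up points in either direction, which is one reason it's worth looking at an average per article rather than only a raw total.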

In [None]:
# We'll need a list of which states were on which side during the war.
# There are nuances here, but for this demonstration I'm using the list
# from this NPS fact sheet: https://www.nps.gov/civilwar/facts.htm

union_states = [
    'Maine', 'New York', 'New Hampshire',
    'Vermont', 'Massachusetts', 'Connecticut',
    'Rhode Island', 'Pennsylvania', 'New Jersey',
    'Ohio', 'Indiana', 'Illinois', 'Kansas',
    'Michigan', 'Wisconsin', 'Minnesota', 'Iowa',
    'California', 'Nevada', 'Oregon'
]

confederate_states = [
    'Texas', 'Arkansas', 'Louisiana', 'Tennessee',
    'Mississippi', 'Alabama', 'Georgia', 'Florida',
    'South Carolina', 'North Carolina', 'Virginia'
]

border_states = [
    'Maryland', 'Delaware', 'West Virginia',
    'Kentucky', 'Missouri'
]
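With lists like these, deciding which side a state was on is a matter of checking which list it appears in. Here is a small sketch of that lookup. The `side_of` helper is a name I made up, and the lists are shortened so the example stands on its own; the full lists above would work the same way.

```python
def side_of(state, sides):
    """Return the name of the first group that contains the state.

    Returns None when the state isn't in any group, for example a
    territory or a state that didn't exist yet.
    """
    for side, states in sides.items():
        if state in states:
            return side
    return None


# Shortened lists, for illustration only.
sides = {
    "union": ["Maine", "New York", "Ohio"],
    "confederate": ["Texas", "Virginia"],
    "border": ["Maryland", "Kentucky"],
}

print(side_of("Ohio", sides))      # union
print(side_of("Kentucky", sides))  # border
print(side_of("Hawaii", sides))    # None
```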

### Paged Results

One common issue we run into when working with APIs is that they'll only give you a handful of results at a time. This is especially true of results from search engines like the one we're using. 

Try doing a search on the web interface, and go to the second page: https://chroniclingamerica.loc.gov/

See if you can find any hints in the URL about how the search engine keeps track of what page you're on.

We can take advantage of the paging system to make our program repeat itself for every page.

In [None]:
# Computers are good at counting and repeating things. 
# This is the technique we'll use to get several pages
# of results.

# count from 0 to 19 and assign the 
# result to a variable called numbers
numbers = range(0, 20)

# for every number in the range of numbers
# print that number

for number in numbers:
    print(number)

In [None]:
# We can also do more complicated procedures for every number
for number in numbers:
    print('https://chroniclingamerica.loc.gov/search/pages/results/'
          '?proxtext=dogs&format=json&page='
         + str(number))

Wow neat, every one of those is a valid URL for a page of data, and they're all in order!
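Once we can produce a URL for every page, the next step is to gather the items from each page into one list. Here is a sketch of that loop which runs without the network: `fetch_page` is a hypothetical function that takes a page number and returns parsed JSON, and the fake pages below are invented. Stopping early on an empty page is my assumption about how to detect the end of the results, not something the API documentation promises.

```python
def collect_items(fetch_page, first_page, max_pages):
    """Gather items from consecutive pages into a single list."""
    items = []
    for page in range(first_page, first_page + max_pages):
        page_items = fetch_page(page).get("items", [])
        if not page_items:
            # Assumption: an empty page means we ran out of results.
            break
        items.extend(page_items)
    return items


# Invented stand-in for the API: three pages, the last one empty.
fake_pages = {
    1: {"items": ["a", "b"]},
    2: {"items": ["c"]},
    3: {"items": []},
}


def fake_fetch(page):
    return fake_pages.get(page, {"items": []})


print(collect_items(fake_fetch, 1, 10))  # ['a', 'b', 'c']
```

In a real run, `fetch_page` could be a small function that calls `requests.get` on a page URL and returns `.json()`.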

### Our complete program

Our complete program will use the above technique to do a series of calculations for several pages of results with one search term. Let's talk through it and try running it a few times.

In [None]:
## Here's our complete program

# set up our sentiment analysis library
from afinn import Afinn
afinn = Afinn()

# We'll get 10 pages of what the LOC deems are
# the most relevant results for our search
pages = 10

# Set our search terms
terms = "Lincoln"

# start our scores at zero

union_score = 0.0
border_score = 0.0
confederate_score = 0.0

union_article_count = 0
border_article_count = 0
confederate_article_count = 0

print("calculating...")

for page in range(0, pages):

    print("fetching result page... {}".format(page))

    page_json = requests.get(
        "https://chroniclingamerica.loc.gov/search/pages/results/"
        "?proxtext={}&page={}&rows=20&date1=1861&date2=1865&format=json"
        .format(terms.lower(), page)).json()

    for item in page_json['items']:

        # Skip items that have no OCR text. Without this, we would
        # reuse the previous item's score (or crash on the first item).
        try:
            sentiment_score = afinn.score(item['ocr_eng'].lower())
        except KeyError:
            continue

        # 'state' is a list; we use the first state listed
        state = item['state'][0]

        if state in union_states:
            union_score += sentiment_score
            union_article_count += 1
        elif state in confederate_states:
            confederate_score += sentiment_score
            confederate_article_count += 1
        elif state in border_states:
            border_score += sentiment_score
            border_article_count += 1

# At the end, we'll just print out our results     
print("\n")
print("union sentiment score = " + str(union_score))
print("union total articles = " + str(union_article_count))
print("union average afinn score per article = " + str(union_score / union_article_count))
print("\n")

print("confederate sentiment score = " + str(confederate_score))
print("confederate total articles = " + str(confederate_article_count))
print("confederate average afinn score per article = " + str(confederate_score / confederate_article_count))
print("\n")

print("border sentiment score = " + str(border_score))
print("border total articles = " + str(border_article_count))
print("border average afinn score per article = " + str(border_score / border_article_count))
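One thing to watch for: if a search turns up no articles for one of the groups, dividing by its article count raises a `ZeroDivisionError`. Here is a minimal sketch of a guard against that. The `average_score` helper is hypothetical, and returning `None` for an empty group is just one reasonable choice.

```python
def average_score(total, count):
    """Return total / count, or None when there is nothing to average."""
    if count == 0:
        return None
    return total / count


print(average_score(12.0, 4))  # 3.0
print(average_score(0.0, 0))   # None
```

You could swap this in for the divisions in the print statements above, for example `str(average_score(border_score, border_article_count))`.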


        
