# Chronicling America API

[Chronicling America](https://chroniclingamerica.loc.gov/) is a collection of digitized American newspapers dating from 1777 to 1963 provided by the Library of Congress. The collection offers an application programming interface (API) which allows users to easily harvest large amounts of data.

In this notebook we will search Chronicling America's API, gather the search results into a Pandas dataframe, clean the data, and save it as a csv file.

In [1]:
# imports
import requests
import json
import math
import pandas as pd
import spacy



##Chronicling America URLs

If I search for a term, "abolition" for example, on https://chroniclingamerica.loc.gov/ I will get a results url that looks like this:

https://chroniclingamerica.loc.gov/search/pages/results/?state=&date1=1770&date2=1963&proxtext=abolition&x=12&y=18&dateFilterType=yearRange&rows=20&searchType=basic

These search results are human actionable, but not machine actionable. Chronicling America as an API that allows me to get machine actionable results if I add `&format=json`:

https://chroniclingamerica.loc.gov/search/pages/results/?state=&date1=1770&date2=1963&proxtext=abolition&x=12&y=18&dateFilterType=yearRange&rows=20&searchType=basic&format=json

If we examine the url we see that there are a number of search parameters:
- `state=`
- `date1=1770`
- `date2=1963`
- `proxtext=abolition`

We can edit these values to modify our search. I change the parameters to limit our search:

https://chroniclingamerica.loc.gov/search/pages/results/?state=Massachusetts&date1=1770&date2=1865&proxtext=prohibition&x=20&y=8&dateFilterType=yearRange&rows=20&searchType=basic&format=json

Now I can use the `requests` library to retrieve data from the url.

In [2]:
# initial search
url = 'https://chroniclingamerica.loc.gov/search/pages/results/?state=New+York&date1=1895&date2=1963&proxtext=Babe+Ruth&x=20&y=8&dateFilterType=yearRange&rows=20&searchType=basic&format=json'
response = requests.get(url)
raw = response.text
results = json.loads(raw)

## Explore search results

In [3]:
results.keys()

dict_keys(['totalItems', 'endIndex', 'startIndex', 'itemsPerPage', 'items'])

In [4]:
# explore items
print(type(results['items']))

<class 'list'>


In [5]:
print(results['items'][0])

{'sequence': 64, 'county': ['New York'], 'edition': None, 'frequency': 'Daily', 'id': '/lccn/sn83030214/1922-07-09/ed-1/seq-64/', 'subject': ['New York (N.Y.)--Newspapers.', 'New York (State)--New York County.--fast--(OCoLC)fst01234953', 'New York (State)--New York.--fast--(OCoLC)fst01204333', 'New York County (N.Y.)--Newspapers.'], 'city': ['New York'], 'date': '19220709', 'title': 'New-York tribune. [volume]', 'end_year': 1924, 'note': ['Also available in digital format on the Library of Congress website.', 'Archived issues are available in digital format as part of the Library of Congress Chronicling America online collection.', 'Available on microfilm from University Microfilms International, and Recordak.', 'Evening ed.: Evening edition of the tribune, 1866.', 'Merged with: New York herald (New York, N.Y. : 1920); to form: New York herald, New York tribune.', 'Semiweekly ed.: New-York tribune (New York, N.Y. : 1866 : Semiweekly), 1866-<1899>.', 'Triweekly eds.: New-York tri-weekly

In [6]:
print('totalItems:', results['totalItems'])
print('endIndex:', results['endIndex'])
print('startIndex:', results['startIndex'])
print('itemsPerPage:', results['itemsPerPage'])
print('Length and type of items:', len(results['items']), type(results['items']))

totalItems: 2871
endIndex: 20
startIndex: 1
itemsPerPage: 20
Length and type of items: 20 <class 'list'>


The Chronicling America API returned 1,656 results. However, it will only display 20 at a time by default. I can add a new parameter `page=` to cycle through all the results, but first I need to know how many pages there will be. I can find this out by dividing `totalItems` (1,656) by `itemsPerPage` (20) and then round-up using `math.ceil`.

In [7]:
# find total amount of pages
total_pages = math.ceil(results['totalItems'] / results['itemsPerPage'])
print(total_pages)

144


Now that I know how many pages there will be, I can use a for loop to iterate through each result page and then each item on each result page. I then gather the data I want from each item: newspaper title, city, date, and text.

Notice in the code below I placed the url string in parentheses () so that I could break it up over multiple lines making it easier to read.

Also, for the sake of this demonstration, I am only iterating over 10 pages. For the full results the for loop should begin: `for i in range(1, total_pages+1)` (the `+1` is necessary becase the seond number in the range function is exclusive).

In [8]:
# create empty list for data
data = []

In [9]:
# set search parameters
start_date = '1895'
end_date = '1963'
search_term = 'Babe+Ruth'
state = 'New+York'

In [10]:
# loop through search results and collect data
for i in range(1, 11):  # for sake of time I'm doing only 10, you will want to put total_pages+1
    url = (f'https://chroniclingamerica.loc.gov/search/pages/results/?state={state}&date1={start_date}'
           f'&date2={end_date}&proxtext={search_term}&x=16&y=8&dateFilterType=yearRange&rows=20'
           f'&searchType=basic&format=json&page={i}')  # f-string
    response = requests.get(url)
    raw = response.text
    print(f'page {i} status code:', response.status_code)  # checking for errors
    results = json.loads(raw)
    items_ = results['items']
    for item_ in items_:
        row_data = {}
        try:
          row_data['title'] = item_['title_normal']
        except:
          row_data['city'] = "none"
        try:
          row_data['city'] = item_['city']
        except:
          row_data['city'] = "none"
        try:
          row_data['date'] = item_['date']
        except:
          row_data['date'] = "none"
        try:
          row_data['raw_text'] = item_['ocr_eng']
        except:
          row_data['raw_text'] = 'none'
    data.append(row_data)

page 1 status code: 200
page 2 status code: 200
page 3 status code: 200
page 4 status code: 200
page 5 status code: 200
page 6 status code: 200
page 7 status code: 200
page 8 status code: 200
page 9 status code: 200
page 10 status code: 200


In [11]:
# put data into DataFrame
df = pd.DataFrame.from_dict(data)

In [12]:
df.head()

Unnamed: 0,title,city,date,raw_text
0,new-york tribune.,[New York],19210515,"WELLESLEY SENIORS, some two hundred and fifty ..."
1,new-york tribune.,[New York],19211006,Masterly Filching of Carl Mays Deceives Giants...
2,evening world.,[New York],19221110,"THE EVENING WORLD, FRIDAY, NOVEMBER 10, 1922.\..."
3,daily worker.,"[Chicago, New York]",19261106,"• ,\nis - "" - By V. F. Calverton\nEvery day of..."
4,new-york tribune.,[New York],19210508,Gi?^8J_y*g Second Game in jRow From DodgCTS^Ru...


### Change date format
Pandas allows us to clean and edit our data easily (relatively). We can first convert the string values in the date column to properly formated dates and then sort the dataframe by date.

In [13]:
# convert date column from string to date-time object
df['date'] = pd.to_datetime(df['date'])

In [14]:
df.head()

Unnamed: 0,title,city,date,raw_text
0,new-york tribune.,[New York],1921-05-15,"WELLESLEY SENIORS, some two hundred and fifty ..."
1,new-york tribune.,[New York],1921-10-06,Masterly Filching of Carl Mays Deceives Giants...
2,evening world.,[New York],1922-11-10,"THE EVENING WORLD, FRIDAY, NOVEMBER 10, 1922.\..."
3,daily worker.,"[Chicago, New York]",1926-11-06,"• ,\nis - "" - By V. F. Calverton\nEvery day of..."
4,new-york tribune.,[New York],1921-05-08,Gi?^8J_y*g Second Game in jRow From DodgCTS^Ru...


In [15]:
# sort by date
df = df.sort_values(by='date')

In [16]:
df.head()

Unnamed: 0,title,city,date,raw_text
7,new-york tribune.,[New York],1920-03-20,Ruth Finally Recovers His Batting Eye and Poun...
4,new-york tribune.,[New York],1921-05-08,Gi?^8J_y*g Second Game in jRow From DodgCTS^Ru...
0,new-york tribune.,[New York],1921-05-15,"WELLESLEY SENIORS, some two hundred and fifty ..."
1,new-york tribune.,[New York],1921-10-06,Masterly Filching of Carl Mays Deceives Giants...
6,new-york tribune.,[New York],1921-11-10,Serapis Wins Annapolis Handicap at Pimlico?Buf...


### Process text
We can now porcess our text for analysis. The text provded by Chronicling America comes from optical character recognition (ocr) and the accuracy of ocr can be low. Here I will remove new line characters (`\n`), stop words, and then lemamtize the text.

**Rememeber** the decisions you make in how to process your text should be based on the kind of analysis you want to do.

In [17]:
# write fuction to process text
# load nlp model
nlp = spacy.load("en_core_web_sm")
nlp.disable_pipes('ner', 'parser')  # these are unnecessary for the task at hand

def process_text(text):
    """Remove new line characters and lemmatize text. Returns string of lemmas"""
    text = text.replace('\n', ' ')
    doc = nlp(text)
    tokens = [token for token in doc]
    no_stops = [token for token in tokens if not token.is_stop]
    no_punct = [token for token in no_stops if token.is_alpha]
    lemmas = [token.lemma_ for token in no_punct]
    lemmas_lower = [lemma.lower() for lemma in lemmas]
    lemmas_string = ' '.join(lemmas_lower)
    return lemmas_string

In [18]:
# apply process_text function
# this may take a few minutes
df['lemmas'] = df['raw_text'].apply(process_text)

In [20]:
# save to csv
df.to_csv(f'{search_term}{start_date}-{end_date}.csv', index=False)