# Chronicling America API

[Chronicling America](https://chroniclingamerica.loc.gov/) is a collection of digitized American newspapers dating from 1777 to 1963 provided by the Library of Congress. The collection offers an application programming interface (API) which allows users to easily harvest large amounts of data.

In this notebook we will search Chronicling America's API, gather the search results into a Pandas dataframe, clean the data, and save it as a csv file.

In [1]:
# imports
import requests
import json
import math
import pandas as pd
import spacy

##Chronicling America URLs

If I search for a term, "abolition" for example, on https://chroniclingamerica.loc.gov/ I will get a results url that looks like this:

https://chroniclingamerica.loc.gov/search/pages/results/?state=&date1=1770&date2=1963&proxtext=abolition&x=12&y=18&dateFilterType=yearRange&rows=20&searchType=basic

These search results are human actionable, but not machine actionable. Chronicling America as an API that allows me to get machine actionable results if I add `&format=json`:

https://chroniclingamerica.loc.gov/search/pages/results/?state=&date1=1770&date2=1963&proxtext=abolition&x=12&y=18&dateFilterType=yearRange&rows=20&searchType=basic&format=json

If we examine the url we see that there are a number of search parameters:
- `state=`
- `date1=1770`
- `date2=1963`
- `proxtext=abolition`

We can edit these values to modify our search. I change the parameters to limit our search:

https://chroniclingamerica.loc.gov/search/pages/results/?state=Massachusetts&date1=1770&date2=1865&proxtext=prohibition&x=20&y=8&dateFilterType=yearRange&rows=20&searchType=basic&format=json

Now I can use the `requests` library to retrieve data from the url.

In [48]:
# initial search
url = 'https://chroniclingamerica.loc.gov/search/pages/results/?state=Florida&date1=1900&date2=1901&proxtext=federalism&x=20&y=8&dateFilterType=yearRange&rows=20&searchType=basic&format=json'
response = requests.get(url)
raw = response.text
results = json.loads(raw)

## Explore search results

In [49]:
# explore results
results.keys()

dict_keys(['totalItems', 'endIndex', 'startIndex', 'itemsPerPage', 'items'])

In [50]:
# explore items
type(results['items'])

list

In [51]:
# expore first item
print(results['items'][0])

{'sequence': 1, 'county': ['Marion'], 'edition': None, 'frequency': 'Daily (except Sunday)', 'id': '/lccn/sn84027621/1900-04-17/ed-1/seq-1/', 'subject': ['Florida--Marion County.--fast--(OCoLC)fst01204301', 'Florida--Ocala.--fast--(OCoLC)fst01215855', 'Marion County (Fla.)--Newspapers.', 'Ocala (Fla.)--Newspapers.'], 'city': ['Ocala'], 'date': '19000417', 'title': 'The Ocala evening star. [volume]', 'end_year': 1943, 'note': ['Archived issues are available in digital format from the Library of Congress Chronicling America online collection.', 'Description based on: Vol. 1, no. 5 (June 24, 1895).', 'In 1895, the Ocala (FL) Evening Star [LCCN: sn84027621] surfaced as a rival publication to the Ocala (FL) Banner [LCCN: sn88074815]. These two titles subsequently merged into one publication on September 1, 1943. The resulting Ocala (FL) Star-Banner [LCCN: sn78002071] has remained the daily newspaper in Marion County (FL) since that time. The Ocala Evening Star was also published from 1897 i

In [52]:
print('totalItems:', results['totalItems'])
print('endIndex:', results['endIndex'])
print('startIndex:', results['startIndex'])
print('itemsPerPage:', results['itemsPerPage'])
print('Length and type of items:', len(results['items']), type(results['items']))

totalItems: 288
endIndex: 20
startIndex: 1
itemsPerPage: 20
Length and type of items: 20 <class 'list'>


The Chronicling America API returned 1,656 results. However, it will only display 20 at a time by default. I can add a new parameter `page=` to cycle through all the results, but first I need to know how many pages there will be. I can find this out by dividing `totalItems` (1,656) by `itemsPerPage` (20) and then round-up using `math.ceil`.

In [53]:
# find total amount of pages
total_pages = math.ceil(results['totalItems'] / results['itemsPerPage'])
print(total_pages)

15


## Gather search results

Now that I know how many pages there will be, I can use a for loop to iterate through each result page and then each item on each result page. I then gather the data I want from each item: newspaper title, city, date, and text.

Notice in the code below I placed the url string in parentheses () so that I could break it up over multiple lines making it easier to read.

Also, for the sake of this demonstration, I am only iterating over 10 pages. For the full results the for loop should begin: `for i in range(1, total_pages+1)` (the `+1` is necessary becase the seond number in the range function is exclusive).

In [54]:
# create empty list for data
data = []
data

[]

In [55]:
# set search parameters
start_date = '1900'
end_date = '1901'
search_term = 'federalism'
state = 'Florida'

In [56]:
# loop through search results and collect data
for i in range(1, 400):  # for sake of time I'm doing only 10, you will want to put total_pages+1
    url = (f'https://chroniclingamerica.loc.gov/search/pages/results/?state={state}&date1={start_date}'
           f'&date2={end_date}&proxtext={search_term}&x=16&y=8&dateFilterType=yearRange&rows=20'
           f'&searchType=basic&format=json&page={i}')  # f-string
    
    response = requests.get(url)
    raw = response.text
    print(f'page {i} status code:', response.status_code)  # checking for errors
    results = json.loads(raw)
    items_ = results['items']
    for item_ in items_:
        row_data = {}
        try:
            row_data['title'] = item_['title_normal']
        except:
            row_data['city'] = "none"
        try:
            row_data['city'] = item_['city']
        except:
            row_data['city'] = "none"
        try:
            row_data['date'] = item_['date']
        except:
            row_data['date'] = "none"
        try:
            row_data['raw_text'] = item_['ocr_eng']
        except:
            row_data['raw_text'] = 'none'
    data.append(row_data)

page 1 status code: 200
page 2 status code: 200
page 3 status code: 200
page 4 status code: 200
page 5 status code: 200
page 6 status code: 200
page 7 status code: 200
page 8 status code: 200
page 9 status code: 200
page 10 status code: 200
page 11 status code: 200
page 12 status code: 200
page 13 status code: 200
page 14 status code: 200
page 15 status code: 200
page 16 status code: 200
page 17 status code: 200
page 18 status code: 200
page 19 status code: 200
page 20 status code: 200
page 21 status code: 200
page 22 status code: 200
page 23 status code: 200
page 24 status code: 200
page 25 status code: 200
page 26 status code: 200
page 27 status code: 200
page 28 status code: 200
page 29 status code: 200
page 30 status code: 200
page 31 status code: 200
page 32 status code: 200
page 33 status code: 200
page 34 status code: 200
page 35 status code: 200
page 36 status code: 200
page 37 status code: 200
page 38 status code: 200
page 39 status code: 200
page 40 status code: 200
page 41 s

JSONDecodeError: Expecting value: line 1 column 1 (char 0)

In [57]:
# put data into DataFrame
df = pd.DataFrame(data)

In [58]:
# sanity check
# df
data

[{'title': 'chipley banner.',
  'city': ['Chipley'],
  'date': '19000224',
  'raw_text': '1\nt T\nt\nI sr h iVff r lH B h C HIPLEY i r J BANNER r E < A t r v f\n1 ff i f r M tr t i k f\nA i J vt r if i t\ns t\nVOLUME VII CHIPLEY WASHINGTOCOUNTY FLORIDA BRD FEBRUARY 241900 NUMBKR36\nMR MCRUMS STATEMENT\n4\nI\nFormer Consul to Pretoria flakes Sensational\nCharges In Address to the Public\nHINTS A1UN ALLIANCE\nn\n1\nSays English Censor at Durban\nOpened and Inspected His\nI\nOff1 Documents\nI\nsigned statement iras given out\nat Washington Wednesday night by\nCharles E Macrum former United\nStates consul to Pretoria In part it\nwas as follows\nThe situation in Pretoriawas such\nthat first as an official could not\nremain there while my government at\nhome was apparently in the dark as to\nthe exact condition South Africa\nSeoond as a man and citizen of\nthe United States could not remain\nIn Pretoria sacrificing my own self\nrespect and that of the people of Pre\ntoria while the governmen

### Change date format
Pandas allows us to clean and edit our data easily (relatively). We can first convert the string values in the date column to properly formated dates and then sort the dataframe by date.

In [59]:
# convert date column from string to date-time object
df['date'] = pd.to_datetime(df['date'])

In [60]:
# sanity check
df

Unnamed: 0,title,city,date,raw_text
0,chipley banner.,[Chipley],1900-02-24,1\nt T\nt\nI sr h iVff r lH B h C HIPLEY i r J...
1,florida star.,"[Cocoa, Titusville]",1900-03-30,t fi T t\nI r ij\n< V I\nio Irn r >\nL l\nr I ...
2,weekly tallahasseean.,[Tallahassee],1901-07-11,TTTT\nQW\n>\ns siL 4i t\ni\nw iL to tow toY U ...
3,florida star.,"[Cocoa, Titusville]",1901-03-29,Ift 0 J p\n1t\nr it f 0\n1 0\nj I3 TEL JLOiI A...
4,weekly tallahasseean.,[Tallahassee],1900-07-26,i\nI\nI I Ia\nFVV I\nf 6 THE WEEKLY TALLAEASSE...
...,...,...,...,...
106,chipley banner.,[Chipley],1900-02-24,1\nt T\nt\nI sr h iVff r lH B h C HIPLEY i r J...
107,chipley banner.,[Chipley],1900-02-24,1\nt T\nt\nI sr h iVff r lH B h C HIPLEY i r J...
108,chipley banner.,[Chipley],1900-02-24,1\nt T\nt\nI sr h iVff r lH B h C HIPLEY i r J...
109,chipley banner.,[Chipley],1900-02-24,1\nt T\nt\nI sr h iVff r lH B h C HIPLEY i r J...


In [61]:
# sort by date
df = df.sort_values(by='date')

In [62]:
# sanity check
df

Unnamed: 0,title,city,date,raw_text
10,bradford county telegraph.,[Starke],1900-02-02,r yr yrW yrtpI\nW 3\ni1 tpI i\nBOERS AGAIN ROU...
14,daily news.,[Pensacola],1900-02-07,r\nrrHE DAII ATiLY 1 1rJf Y NEWS NEWSVOL30 I 1...
13,daily news.,[Pensacola],1900-02-13,w\nll 4 THE DAILY NEWS PENSACOLA FLORIDA TUESD...
81,chipley banner.,[Chipley],1900-02-24,1\nt T\nt\nI sr h iVff r lH B h C HIPLEY i r J...
80,chipley banner.,[Chipley],1900-02-24,1\nt T\nt\nI sr h iVff r lH B h C HIPLEY i r J...
...,...,...,...,...
6,florida star.,"[Cocoa, Titusville]",1901-03-08,B vrnSf 4eF ii t > vjfcfCS o 1 n y + ai 1 x Lc...
3,florida star.,"[Cocoa, Titusville]",1901-03-29,Ift 0 J p\n1t\nr it f 0\n1 0\nj I3 TEL JLOiI A...
2,weekly tallahasseean.,[Tallahassee],1901-07-11,TTTT\nQW\n>\ns siL 4i t\ni\nw iL to tow toY U ...
9,ocala banner.,[Ocala],1901-09-13,HI\nl ti THEOCALA THE UC9LA B BANNER NNER SEPT...


### Process search results
We can now porcess our text for analysis. The text provded by Chronicling America comes from optical character recognition (ocr) and the accuracy of ocr can be low. Here I will remove new line characters (`\n`), stop words, and then lemamtize the text.

**Rememeber** the decisions you make in how to process your text should be based on the kind of analysis you want to do.

In [63]:
# write fuction to process text
# load nlp model
nlp = spacy.load("en_core_web_sm")
nlp.disable_pipes('ner', 'parser')  # these are unnecessary for the task at hand

def process_text(text):
    """Remove new line characters and lemmatize text. Returns string of lemmas"""
    text = text.replace('\n', ' ')
    doc = nlp(text)
    tokens = [token for token in doc]
    no_stops = [token for token in tokens if not token.is_stop]
    no_punct = [token for token in no_stops if token.is_alpha]
    lemmas = [token.lemma_ for token in no_punct]
    lemmas_lower = [lemma.lower() for lemma in lemmas]
    lemmas_string = ' '.join(lemmas_lower)
    return lemmas_string

In [64]:
# apply process_text function
# this may take a few minutes
df['lemmas'] = df['raw_text'].apply(process_text)
df

Unnamed: 0,title,city,date,raw_text,lemmas
10,bradford county telegraph.,[Starke],1900-02-02,r yr yrW yrtpI\nW 3\ni1 tpI i\nBOERS AGAIN ROU...,r yr yrw yrtpi w tpi boer rout britons britons...
14,daily news.,[Pensacola],1900-02-07,r\nrrHE DAII ATiLY 1 1rJf Y NEWS NEWSVOL30 I 1...,r rrhe daii atily y news vol pensacola florida...
13,daily news.,[Pensacola],1900-02-13,w\nll 4 THE DAILY NEWS PENSACOLA FLORIDA TUESD...,w ll daily news pensacola florida tuesday febr...
81,chipley banner.,[Chipley],1900-02-24,1\nt T\nt\nI sr h iVff r lH B h C HIPLEY i r J...,t t t sr h ivff r lh b h c hipley r j banner r...
80,chipley banner.,[Chipley],1900-02-24,1\nt T\nt\nI sr h iVff r lH B h C HIPLEY i r J...,t t t sr h ivff r lh b h c hipley r j banner r...
...,...,...,...,...,...
6,florida star.,"[Cocoa, Titusville]",1901-03-08,B vrnSf 4eF ii t > vjfcfCS o 1 n y + ai 1 x Lc...,b vrnsf ii t vjfcfcs o n y ai x lc f j j y c s...
3,florida star.,"[Cocoa, Titusville]",1901-03-29,Ift 0 J p\n1t\nr it f 0\n1 0\nj I3 TEL JLOiI A...,ift j p t r f j tel jloii ajjs lrlnid j ort j ...
2,weekly tallahasseean.,[Tallahassee],1901-07-11,TTTT\nQW\n>\ns siL 4i t\ni\nw iL to tow toY U ...,tttt qw s sil t w il tow toy u ut uj y t j jl ...
9,ocala banner.,[Ocala],1901-09-13,HI\nl ti THEOCALA THE UC9LA B BANNER NNER SEPT...,hi l ti theocala b banner nner september presi...


In [67]:
# save to csv
df.to_csv(f'{search_term}{start_date}-{end_date}.csv', index=False)