The purpose of this Jupyter Notebook is to produce a dataset from the Chronicling America title essays. It introduces readers to the Chronicling America API, the various facets that can be queried, and the title essays themselves. It is meant to act as a stand-in workshop that introduces a few programming elements while producing the code necessary to bulk download and package the Chronicling America title essays. If you would like to jump ahead and see the packaged title essays, this repository includes a comma-separated values (CSV) file written on 08/29/2022. Unless you need fields beyond those queried in this code, feel free to use that CSV.

To best understand this Jupyter Notebook, read it in conjunction with the following two Jupyter notebooks: (link) and (link). While this notebook gives the reader the ability to produce datasets, those notebooks aim to contextualize these data by providing code snippets as possible project starting points.

As a whole, this notebook should be read as a hands-on tutorial for downloading data, selecting metadata fields, and writing that data to CSVs. Because the essay content that it queries is updated EVERY SO OFTEN, this code only needs to be run once every YEAR OR SO. 

NEXT A NOTE ABOUT THESE TITLE ESSAYS

NEXT, a note about WHAT DOES THIS CODE TELL US ABOUT DATA AND ACCESSING DATA

FINALLY, because this code is written for pedagogic purposes, it is highly commented. Feel free to fork this code, or take elements for your project. IT IS OPEN SOURCE (GET RIGHTS ASSOCIATED WITH LC)


---------

## Part 1

-------

First, as with most Python notebooks, we will import the necessary Python libraries. 

Included below is a list of the libraries we are importing along with a link to further documentation.

- [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/)
- [csv](https://docs.python.org/3/library/csv.html)
- [numpy](https://numpy.org/doc/stable/)
- [pandas](https://pandas.pydata.org/docs/)
- [random](https://docs.python.org/3/library/random.html)
- [requests](https://requests.readthedocs.io/en/latest/)
- [time](https://docs.python.org/3/library/time.html)

Feel free to run the code below by clicking in the box and pressing the Run button above.

In [3]:
# import necessary libraries.

from bs4 import BeautifulSoup
import csv
import numpy as np
import pandas as pd
import random
import requests
import time

In [10]:
# total_url = "https://www.loc.gov/collections/directory-of-us-newspapers-in-american-libraries/?all=True&c=50"
# loc_url = "https://www.loc.gov/collections/directory-of-us-newspapers-in-american-libraries/?all=true&c=50&fa=partof_collection:chronicling+america&sp=1"
# chron_am_url = "https://chroniclingamerica.loc.gov/newspapers/"

Let's first get an idea of the newspapers available in Chronicling America. 

To do this, we'll go to the Library of Congress's website for the [Directory of US Newspapers and filter by Chronicling America](https://www.loc.gov/collections/directory-of-us-newspapers-in-american-libraries/?all=true&c=50&fa=partof_collection:chronicling+america&sp=1). As of 8/28/2022, this search returns 3683 full-text newspapers.

We'll use these URLs, and the LCCNs they contain, to construct our dataset.

However, first we need to "crawl" this website to save these URLs. Web crawling is a form of data mining in which a program automatically visits webpages and collects information from them. Others have written more comprehensive guides, and tools do [exist](https://programminghistorian.org/en/lessons/fetch-and-parse-data-with-openrefine).

The following two blocks of code generate a list of URLs. Like most Jupyter notebooks, this code is broken down into manageable sections that can be run independently of each other.
1. The first block of code creates a list of pages from the Directory of US Newspapers that contain Chronicling America URLs.
2. The second block iterates through those pages and saves the Chronicling America URLs to a new list. Our code pauses for two seconds, or "sleeps," between pages so as not to overload the Library of Congress's servers. The end result is a list of 3683 URLs.
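
As a rough check on how many result pages to request, the arithmetic can be sketched as follows. This is a minimal sketch that assumes the search honors 1,000 results per page (the `c=1000` parameter) and uses the 3683 count reported above; `TOTAL_RESULTS`, `PER_PAGE`, and `page_urls` are illustrative names, not part of the notebook's code.

```python
import math

# Assumptions for illustration: the count reported on 8/28/2022 and
# the page size requested through the c= parameter.
TOTAL_RESULTS = 3683
PER_PAGE = 1000

url_template = (
    "https://www.loc.gov/collections/"
    "directory-of-us-newspapers-in-american-libraries/"
    "?all=true&c=1000&fa=partof_collection:chronicling+america&sp={}"
)

# Round up so that the final, partially filled page is included.
n_pages = math.ceil(TOTAL_RESULTS / PER_PAGE)

# The sp= parameter is the 1-based page number.
page_urls = [url_template.format(sp) for sp in range(1, n_pages + 1)]

print(n_pages)
print(page_urls[0])
```

With these numbers the sketch yields four pages, which matches the `range(1, 5)` used in the next cell.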

In [11]:
total_url = 'https://www.loc.gov/collections/directory-of-us-newspapers-in-american-libraries/?all=true&c=1000&fa=partof_collection:chronicling+america&sp={}'

# build one URL per results page (pages 1 through 4)
pages = [total_url.format(page_number) for page_number in range(1, 5)]
# len(pages)

In [12]:
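# use Beautiful Soup to grab the item URL for each newspaper title
links = []
for page in pages:
    response = requests.get(page)
    time.sleep(2)  # pause between pages so we do not overload the servers
    soup = BeautifulSoup(response.text, "html.parser")
    # each result title is a <span class="item-description-title"> wrapping a link
    for title in soup.find_all("span", "item-description-title"):
        anchor = title.find("a")
        if anchor is None or not anchor.get("href"):
            continue  # skip any result that has no link
        links.append(anchor["href"])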

Let's print out the first 10 links to make sure everything worked correctly:

In [13]:
links[:10]

['https://www.loc.gov/item/sn85026945/',
 'https://www.loc.gov/item/sn93067670/',
 'https://www.loc.gov/item/sn93067668/',
 'https://www.loc.gov/item/sn84026853/',
 'https://www.loc.gov/item/sn85042527/',
 'https://www.loc.gov/item/sn88064057/',
 'https://www.loc.gov/item/sn83045004/',
 'https://www.loc.gov/item/sn83045003/',
 'https://www.loc.gov/item/sn98069055/',
 'https://www.loc.gov/item/sn83016734/']

Next, we write those links out to a single-column CSV for analysis and storage.

In [19]:
header = ["lccn"]
# newline="" keeps the csv module from writing blank rows on Windows
with open("lc_output.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(header)
    for link in links:
        writer.writerow([link])
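
The column is named `lccn`, but each row holds a full item URL. If the bare LCCN is needed, it can be recovered from the URL path. The helper below is a hypothetical illustration, not part of the original notebook, and it assumes the LCCN is always the last path segment, as in the links printed above.

```python
from urllib.parse import urlparse


def lccn_from_item_url(url):
    """Return the last non-empty path segment of an item URL, or None."""
    segments = [part for part in urlparse(url).path.split("/") if part]
    return segments[-1] if segments else None


# Two of the links printed earlier in this notebook.
sample_links = [
    "https://www.loc.gov/item/sn85026945/",
    "https://www.loc.gov/item/sn93067670/",
]

sample_lccns = [lccn_from_item_url(link) for link in sample_links]
print(sample_lccns)
```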

------------

## Part 2

---------

In [None]:
## might be able to stick with concurrent.futures; needs testing

In [4]:
import concurrent.futures

In [5]:
df = pd.read_csv("lc_output.csv")  
    
df.head(10) 

Unnamed: 0,lccn
0,https://www.loc.gov/item/sn85026945/
1,https://www.loc.gov/item/sn93067670/
2,https://www.loc.gov/item/sn93067668/
3,https://www.loc.gov/item/sn84026853/
4,https://www.loc.gov/item/sn85042527/
5,https://www.loc.gov/item/sn88064057/
6,https://www.loc.gov/item/sn83045004/
7,https://www.loc.gov/item/sn83045003/
8,https://www.loc.gov/item/sn98069055/
9,https://www.loc.gov/item/sn83016734/


In [6]:
lccns = df.lccn
lccns.head()

0    https://www.loc.gov/item/sn85026945/
1    https://www.loc.gov/item/sn93067670/
2    https://www.loc.gov/item/sn93067668/
3    https://www.loc.gov/item/sn84026853/
4    https://www.loc.gov/item/sn85042527/
Name: lccn, dtype: object

In [8]:
# ask the loc.gov API for JSON by appending ?fo=json to each item URL
urls = [f"{lccn}?fo=json" for lccn in lccns]

In [9]:
len(urls)

3681
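
Appending `?fo=json` works here because these item URLs carry no query string of their own. If a URL already had parameters, a more defensive approach would be to edit the query with `urllib.parse`. The sketch below shows one way to do that; `with_json_format` is a hypothetical helper name, not something the notebook defines.

```python
from urllib.parse import parse_qsl, urlencode, urlparse, urlunparse


def with_json_format(url):
    """Return url with fo=json set, keeping any existing query parameters."""
    parts = urlparse(url)
    query = dict(parse_qsl(parts.query))
    query["fo"] = "json"
    return urlunparse(parts._replace(query=urlencode(query)))


plain = with_json_format("https://www.loc.gov/item/sn85026945/")
paged = with_json_format("https://www.loc.gov/item/sn85026945/?sp=1")
print(plain)
print(paged)
```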

In [11]:
# think this can be simplified too; might not be necessary if we use concurrent.futures
with open('user_agents.txt', 'r') as f:
    user_agents_list = [x.strip() for x in f.readlines()]

user_agents_list[:2]

['Mozilla/5.0 (Linux; U; Android 4.0.3; ko-kr; LG-L160L Build/IML74K) AppleWebkit/534.30 (KHTML, like Gecko) Version/4.0 Mobile Safari/534.30',
 'Mozilla/5.0 (Linux; U; Android 4.0.3; de-ch; HTC Sensation Build/IML74K) AppleWebKit/534.30 (KHTML, like Gecko) Version/4.0 Mobile Safari/534.30']

In [12]:
# grab a random user agent

val = random.randint(0, len(user_agents_list)-1)
# headers = {'User-agent' : user_agents_list[val]}
val

19

In [13]:
# picks a random user agent and returns it as request headers (to avoid timeouts)
def rando_headers():
    val = random.randint(0, len(user_agents_list)-1)
    headers = {'User-agent' : user_agents_list[val]}
    # print(headers)
    return headers
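
A slightly more compact way to pick a random user agent is `random.choice`, which avoids computing an index by hand. The sketch below is illustrative only: `example_agents` is a made-up stand-in for the contents of `user_agents.txt`, and `pick_headers` is a hypothetical name.

```python
import random

# Hypothetical stand-in for the lines read from user_agents.txt.
example_agents = ["ExampleAgent/1.0", "ExampleAgent/2.0", "ExampleAgent/3.0"]


def pick_headers(agents, rng=random):
    """Return a headers dict with one user agent chosen at random."""
    return {"User-Agent": rng.choice(agents)}


# Passing a seeded generator makes the choice repeatable while testing.
headers = pick_headers(example_agents, random.Random(0))
print(headers)
```

The important detail is that the function returns the dictionary, so that it can be passed straight to `requests.get(..., headers=...)`.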

In [21]:
def parse_json(url):
    datum = {}
    response = requests.get(url, headers=rando_headers(), timeout=15)

    # only parse the body when the request succeeded
    json_data = response.json() if response and response.status_code == 200 else None

    if json_data and 'item' in json_data:
        item = json_data['item']
        # subjects and title are read from the nested 'item' object;
        # fall back to an empty dict if that object is missing
        nested = item.get('item') or {}

        datum['created_published'] = item.get('created_published', 'none')
        datum['date'] = item.get('date', 'none')
        datum['dates_of_publication'] = item.get('dates_of_publication', 'none')
        datum['description'] = item.get('description', 'none')
        datum['essay'] = item.get('essay', 'none')
        datum['essay_contributor'] = item.get('essay_contributor', 'none')
        datum['language'] = item.get('language', 'none')
        datum['latlong'] = item.get('latlong', 'none')
        datum['location'] = item.get('location', 'none')
        datum['raw_lccn'] = item.get('raw_lccn', 'none')
        datum['subjects'] = nested.get('subjects', 'none')
        datum['title'] = nested.get('title', 'none')
        datum['url'] = item.get('url', 'none')

    # pause between requests so as not to overload the server
    time.sleep(0.8)

    return datum
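
The field-by-field lookups above all follow one pattern: read a key from the `item` object and fall back to `'none'` when it is absent. The sketch below shows that pattern on a small hand-made dictionary, so it can be tried without a network request. `extract_fields` and `sample` are hypothetical, and unlike `parse_json` this simplified version reads every field from the top-level `item` object.

```python
def extract_fields(json_data, fields, default="none"):
    """Copy the requested keys out of json_data['item'], using a default."""
    item = (json_data or {}).get("item")
    if not isinstance(item, dict):
        return {}  # mirrors parse_json returning an empty dict
    return {name: item.get(name, default) for name in fields}


# Hand-made example shaped like a tiny piece of an API response.
sample = {"item": {"title": "The Abbeville banner.", "date": "1847"}}

row = extract_fields(sample, fields=["title", "date", "essay"])
print(row)
```

Because `essay` is missing from the sample, it comes back as `'none'`, which is the same placeholder that ends up in the CSV.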

### Main Loop

In [23]:
%%time

datas = []

for url in urls[:5]:
    try:
        datas.append(parse_json(url))
    except Exception as e:
        print(e)
        
        with open('errors.txt', 'a') as f:
            f.write(f'\n{url}')
        continue
    time.sleep(0.7)

CPU times: user 158 ms, sys: 17.9 ms, total: 176 ms
Wall time: 8.6 s


In [26]:
#printing datas finds our data
#datas
len(datas)

5

In [27]:
field_names = ['created_published', 'date', 'dates_of_publication', 'description', 'essay', 'essay_contributor',
               'language', 'latlong', 'location', 'raw_lccn', 'subjects', 'title', 'url']

# newline="" avoids blank rows on Windows; utf-8 keeps non-ASCII characters intact
with open("raw.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, field_names)
    writer.writeheader()
    for row in datas:
        writer.writerow(row)

In [28]:
raw = pd.read_csv("raw.csv")  
    
raw

Unnamed: 0,created_published,date,dates_of_publication,description,essay,essay_contributor,language,latlong,location,raw_lccn,subjects,title,url
0,"['Abbeville, S.C. : Charles H. Allen, 1847-186...",1847,1847-1869,"['Weekly Vol. 4, no. 1 (Mar. 3, 1847)-v. 25, n...","<p>\n\tFor nearly a century, the <em>Abbeville...","['University of South Carolina; Columbia, SC']",['english'],"[34.17895, -82.38025]","['united states', 'south carolina', 'abbeville...",sn 85026945,"['Abbeville (S.C.)--Newspapers', 'Abbeville Co...",The Abbeville banner.,https://www.loc.gov/item/sn85026945/
1,"['Abbeville, S.C. : Hugh Wilson']",1865,1865-1865,['Weekly Began in July 1865. Ceased with Aug. ...,<p>\n\tThe short-lived weekly <em>Abbeville Bu...,"['University of South Carolina; Columbia, SC']",['english'],"[34.17895, -82.38025]","['united states', 'abbeville county', 'abbevil...",sn 93067670,"['Abbeville County (S.C.)--Newspapers', 'South...",The Abbeville bulletin.,https://www.loc.gov/item/sn93067670/
2,"['Abbeville, S.C. : Bonham and Perrin']",1884,1884-1887,"['Weekly Began Oct. 1, 1884; ceased in 1887. C...",<p>\n\tThe <em>Abbeville Messenger</em> (1884-...,"['University of South Carolina; Columbia, SC']",['english'],"[34.17895, -82.38025]","['united states', 'abbeville county', 'abbevil...",sn 93067668,"['Abbeville County (S.C.)--Newspapers', 'South...",The Abbeville messenger.,https://www.loc.gov/item/sn93067668/
3,"['Abbeville, S.C. : W.A. Lee & Hugh Wilson, 18...",1869,1869-1924,"['Triweekly, Jan. 7, 1920-Feb. 13, 1924 Began ...","<p>\n\tFor nearly a century, the <em>Abbeville...","['University of South Carolina; Columbia, SC']",['english'],"[34.17895, -82.38025]","['united states', 'south carolina', 'abbeville...",sn 84026853,"['Abbeville (S.C.)--Newspapers', 'Abbeville Co...",The Abbeville press and banner.,https://www.loc.gov/item/sn84026853/
4,"['Abbeville, S.C. : W.A. Lee and Hugh Wilson, ...",1860,1860-1869,"['Weekly Vol. 8, no. 28 (Nov. 9, 1860)-v. 17, ...","<p>\n\tFor nearly a century, the <em>Abbeville...","['University of South Carolina; Columbia, SC']",['english'],"[34.17895, -82.38025]","['united states', 'south carolina', 'abbeville...",sn 85042527,"['Abbeville (S.C.)--Newspapers', 'Abbeville Co...",Abbeville press.,https://www.loc.gov/item/sn85042527/


In [None]:
# not necessary but saving in case I can salvage it

# %%time

# MAX_THREADS = 10
# pause = 1

# def download_urls(dl_urls):
#     threads = min(MAX_THREADS, len(dl_urls))
    
#     with concurrent.futures.ThreadPoolExecutor(max_workers=threads) as executor:
#         executor.map(parse_json, dl_urls)
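
One thing to note about the commented-out draft above is that it discards whatever `executor.map` returns. If the threaded approach is revisited, the results need to be collected. The sketch below is an untested-against-the-API illustration of that idea: `fake_parse` is a hypothetical stand-in for `parse_json` so the example runs offline, and `download_all` is an illustrative name.

```python
import concurrent.futures
import time

MAX_THREADS = 10


def fake_parse(url):
    """Hypothetical stand-in for parse_json: no network request is made."""
    time.sleep(0.01)  # simulate a short wait for a response
    return {"url": url}


def download_all(dl_urls, worker=fake_parse):
    """Run worker over dl_urls in a thread pool and return the results in order."""
    if not dl_urls:
        return []
    threads = min(MAX_THREADS, len(dl_urls))
    with concurrent.futures.ThreadPoolExecutor(max_workers=threads) as executor:
        # executor.map yields results in the same order as the inputs
        return list(executor.map(worker, dl_urls))


results = download_all(["a", "b", "c"])
print(results)
```

More threads also means more requests per second, so any pause between requests would need to be reconsidered before running this against the Library of Congress's servers.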

In [None]:
## Notes:
## left only exists on lc but not chron am. appears a later addition? 
## right only appears only on chron am -- perhaps metadata tagging issues?
## number discrepancies on chron am site -- appears to be hardcoded and not updated
## also need to figure out filtering on lc site.