# **Interfacing with the API for the U.S. Library of Congress database, "Chronicling America"**
This database contains America's historic newspaper pages from 1789 to 1963. Search queries can be adjusted as needed.

For each search query, the metadata relevant to the query is retrieved and pickled. The second step then uses that metadata to download the desired articles in the desired file format.

**Note for future**: Consider using [aiohttp](https://docs.aiohttp.org/en/stable/) to download the URLs asynchronously.

## Setup

In [3]:
# !pip install wget
import wget
import requests
import json
import os
import pandas as pd

In [None]:
#If running in Colab and wanting to download files to Google Drive

# import os
# from google.colab import drive

# drive.mount('/content/gdrive', force_remount=True)
# root = os.getcwd()
# download_destination = 'gdrive/My Drive/COVID-19/data/spanish_flu_data'
# cwd = os.path.join(root, download_destination)
# os.chdir(cwd)
# print('Current working directory: ', os.getcwd())

## Getting Metadata and Articles: ##
This is a two-step process: 
1) Download the metadata via the API

2) Use the URLs from the metadata to get the PDFs that we're interested in

### __Step 1__

In [None]:
#Set the search pattern
start_date = 1918                                  #year
end_date = 1920                                    #year
items_per_page = 50                                #typically choose 20 or 50, but 100 can be used too
sequence =  0                                      #set to 1 for front pages only
search_term = 'spanish flu'                        #Run for: 'spanish flu' (done), 'spanish influenza' (done), 'flu', 'influenza'               
joined_search_term = search_term.replace(' ', '+') 

#Define initial search URL
#Adjacent string literals are joined, so no stray whitespace ends up inside the URL
search_url = ('https://chroniclingamerica.loc.gov/search/pages/results/'
              '?date1={}&sort=relevance&date2={}'
              '&searchType=basic&sequence={}'
              '&format=json&state=&rows={}'
              '&proxtext={}'
              '&y=50&x=14&dateFilterType=yearRange').format(start_date,
                                                            end_date,
                                                            sequence,
                                                            items_per_page,
                                                            joined_search_term)
#Get JSON for initial URL and secure total result count
results = requests.get(search_url).json()
total_items = results['totalItems']

#Page calculation (ceiling division, so a final partial page is counted)
num_pages = (total_items + items_per_page - 1) // items_per_page

#Cycle through pages 1..num_pages, recording all metadata from initial search results
records = []
for page_num in range(1, num_pages + 1):
    paginated_search_url = search_url + '&page={}'.format(page_num)
    results = requests.get(paginated_search_url).json()['items']
    records.extend(results)

#Build the dataframe once; DataFrame.append was removed in pandas 2.0
metadata = pd.DataFrame(records)

#Make sure all data was retrieved
print('Number of items collected: ', len(metadata))
print('Total items in dataset: ', total_items)

#Pickle results
metadata.to_pickle(''.join((search_term.replace(' ', '_'), '_metadata.pkl')))
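
The page arithmetic in Step 1 is easy to get wrong by one, so here is a small standalone sketch of it. `count_pages` and `page_numbers` are hypothetical helper names, not part of the notebook or the API; the sketch assumes only that result pages are numbered from 1, which is how the `&page=` parameter is used above.

```python
def count_pages(total_items, items_per_page):
    """Number of result pages needed to cover total_items (ceiling division)."""
    if items_per_page <= 0:
        raise ValueError('items_per_page must be positive')
    return (total_items + items_per_page - 1) // items_per_page


def page_numbers(total_items, items_per_page):
    """1-based page numbers to request, including a final partial page."""
    return list(range(1, count_pages(total_items, items_per_page) + 1))


print(page_numbers(120, 50))  # [1, 2, 3]
print(page_numbers(100, 50))  # [1, 2]
```

If the two counts printed at the end of Step 1 disagree, a missed final page is one of the first things worth checking.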

### __Step 2__
Downloading the files listed in metadata obtained from Step 1.

Possible format types are: 
* **ocr** - the OCR output as an XML file
* **jp2** - JPEG 2000 image file
* **pdf** - self-explanatory
* **text** - plain-text OCR file
* **sequence**, **title**, and **issue** - not files to download; this data is already in the metadata table
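
To make the format options concrete, the sketch below picks a file URL out of a page record and builds a local file name for it. `PAGE_RECORD` is made-up sample data standing in for the JSON returned by a page URL (the URLs are placeholders), and `EXTENSIONS` and `build_file_name` are hypothetical names; the extension mapping is an assumption, not something the API prescribes.

```python
# Hypothetical stand-in for the JSON that a page URL returns; the URLs are placeholders.
PAGE_RECORD = {
    'pdf': 'https://example.org/page/seq-3.pdf',
    'jp2': 'https://example.org/page/seq-3.jp2',
    'ocr': 'https://example.org/page/seq-3/ocr.xml',
    'text': 'https://example.org/page/seq-3/ocr.txt',
}

# Assumed mapping from format name to a conventional file extension.
EXTENSIONS = {'pdf': 'pdf', 'jp2': 'jp2', 'ocr': 'xml', 'text': 'txt'}


def build_file_name(lccn, article_date, sequence, file_format):
    """Build '<lccn>_<date>_<sequence>.<extension>' for a downloadable format."""
    if file_format not in EXTENSIONS:
        raise ValueError('not a downloadable format: {}'.format(file_format))
    return '{}_{}_{}.{}'.format(lccn, article_date, int(sequence), EXTENSIONS[file_format])


file_format = 'ocr'
print(PAGE_RECORD[file_format])
print(build_file_name('sn95076622', '19181024', 3.0, file_format))  # sn95076622_19181024_3.xml
```

Note that `get_articles` below uses the format name itself as the file extension, so the mapping here is a variation rather than the notebook's behavior.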

In [1]:
def get_articles(metadata, format='pdf', verbose=True):
    """Download every page listed in `metadata` in the requested format."""
    for lccn, url, article_date, sequence in zip(metadata['lccn'], metadata['url'],
                                                 metadata['date'], metadata['sequence']):
        #Each metadata URL returns JSON that maps format names to file URLs
        file_url = requests.get(url).json()[format]

        #Download the file unless it is already on disk
        file_name = ''.join((lccn, '_', article_date, '_', str(int(sequence)), '.', format))
        if not os.path.isfile(file_name):
            with open(file_name, 'wb') as f:
                f.write(requests.get(file_url).content)
            if verbose:
                print(file_name, '--> New download!')
        elif verbose:
            print(file_name, '--> Already downloaded!')
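
The note at the top suggests aiohttp for asynchronous downloads. As a dependency-free alternative (a swapped-in technique, not aiohttp), a thread pool from the standard library can overlap the network waits. This is only a sketch: `download_all` and `fetch` are hypothetical names, and `fetch` is passed in so the example can be tried without network access.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def download_all(urls, fetch, max_workers=8):
    """Fetch every URL concurrently; return ({url: content}, {url: exception})."""
    done, failed = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(fetch, url): url for url in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                done[url] = future.result()
            except Exception as exc:  # keep going if one download fails
                failed[url] = exc
    return done, failed


def fake_fetch(url):
    """Offline stand-in for a real fetch function."""
    if url.endswith('bad'):
        raise IOError('simulated failure')
    return url.encode('utf-8')


done, failed = download_all(['a.pdf', 'b.pdf', 'bad'], fake_fetch)
print(sorted(done), sorted(failed))  # ['a.pdf', 'b.pdf'] ['bad']
```

With the notebook's imports, a real `fetch` could be as simple as `lambda url: requests.get(url).content`. Keeping `max_workers` modest is probably wise when many requests go to a single public server.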


In [4]:
#Get articles (this pickle is produced by running Step 1 with search_term = 'spanish influenza')
metadata = pd.read_pickle('spanish_influenza_metadata.pkl')

get_articles(metadata)

sn95076622_19181024_3.pdf --> New download!
sn85025570_19190319_8.pdf --> New download!
sn85025570_19181120_8.pdf --> New download!
sn85025570_19181009_9.pdf --> New download!
sn85025570_19181030_1.pdf --> New download!
sn85025570_19181016_7.pdf --> New download!
sn85025570_19181204_4.pdf --> New download!
sn85025570_19190806_10.pdf --> New download!
sn85025570_19190108_6.pdf --> New download!
sn91050004_19181011_4.pdf --> New download!
sn88074815_19181025_1.pdf --> New download!
sn88074815_19181122_1.pdf --> New download!
sn87065469_19181212_9.pdf --> New download!
sn87065469_19181017_1.pdf --> New download!
sn87065469_19181024_1.pdf --> New download!
sn87065469_19181031_1.pdf --> New download!
sn96060765_19181011_1.pdf --> New download!
sn90051081_19181228_1.pdf --> New download!
sn90051081_19181005_3.pdf --> New download!
sn90051081_19190301_5.pdf --> New download!
sn86076201_19181019_3.pdf --> New download!
sn86076201_19181221_2.pdf --> New download!
sn89052329_19181017_3.pdf --> N