<img src="../images/APISearchBadge.png" alt="Chronicling America Search Example" width="150px" style="border:1px solid black;" align="right">

# Badge 3 - Chronicling America: Conventional Search versus API Search

What you'll learn in this Notebook:

- What an API is
- How it works
- How to use it yourself to collect and analyse text

## 1. Chronicling America

[Chronicling America](https://chroniclingamerica.loc.gov) is a digital repository produced by the Library of Congress which gives access to digitised copies of historical newspaper pages published in the United States from 1777 to 1963, the U.S. Newspaper Directory of American newspapers published between 1690 and the present, and other resources. Online users can access the data in Chronicling America via a search interface.

<img src="../images/CA-LOC-homepage.png" alt="Chronicling America Homepage" width="700px" style="border:1px solid black;">


## 2. Conventional Search

Websites whose data can be accessed using conventional search provide users with a search bar, such as the one you will be familiar with from Google. To enable conventional search, the Library of Congress (LOC), which provides access to Chronicling America, needs a **search engine** under the hood. Instead of accessing the whole World Wide Web, LOC's search engine is limited to the digitised collections it provides access to.

The way this works, taking Chronicling America as an example, is that all the text in the newspaper pages is indexed, i.e. it is stored in a large **search index**.  A search index can be thought of a bit like a structured encyclopaedia. When you look for (or query) a word, it is looked up in the index which can then easily retrieve all the documents (in this case, the newspaper pages) that are relevant to the query.  In the case of Chronicling America, the images of the newspaper pages are also stored along with the digitised text.  So when you look for words, the results are displayed as highlighted text in the relevant documents.
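To make the idea of a search index more concrete, here is a minimal sketch of a toy index in Python. It is only an illustration of the general principle, not how the Library of Congress search engine is actually built; the three "pages" and their IDs are made up for this example.

```python
from collections import defaultdict

# Three made-up "newspaper pages" (hypothetical IDs and text)
pages = {
    "page-1": "he showed little alacrity and energy",
    "page-2": "the energy of the little town",
    "page-3": "they answered with alacrity",
}

# Build the index: each word points to the set of pages it appears in
index = defaultdict(set)
for page_id, text in pages.items():
    for word in text.lower().split():
        index[word].add(page_id)

def search_all(words):
    """Return the IDs of pages that contain every word in the query."""
    sets = [index.get(word.lower(), set()) for word in words]
    return sorted(set.intersection(*sets)) if sets else []

print(search_all(["alacrity"]))
print(search_all(["little", "energy"]))
```

Looking a word up in the index is quick because the work of reading every page was done once, when the index was built.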

Let's start by searching for the phrase 'little alacrity and energy'. If you're trying out this search yourself, you'll see that it takes some time for the search engine to find the relevant sources that match your query. Once it's finished searching, you can see the results for this search in the screenshot below.

<img src="../images/CA-search-example-2.png" alt="Chronicling America Search Example" width="700px" style="border:1px solid black;">

The metadata, provided in text below each image of a newspaper page, shows the page and name of the paper, the publication date and the most relevant snippet matching the search query. Each newspaper image also shows the matching words highlighted in the page which can be helpful when trying to find the relevant phrases on the page. The information on the left-hand panel shows how many matches the query returned, and which newspapers they appear in. 

Using conventional search you can now browse through the results and dig into the data to find interesting mentions and their context. However, doing this is time-consuming and most people do not look past the first page of results returned as it requires a lot of clicking back and forth. But have a go yourself by following this direct link to our search 'little alacrity and energy':

https://www.loc.gov/collections/chronicling-america/?dl=page&ops=AND&qs=little+alacrity+and+energy&searchType=advanced

### 🐛 Mini task 2.1

Let's look at the newspaper pages which were returned.  Can you explain why they were returned?

For example, here is the first result:

<img src="../images/CA-result-3.png" alt="Chronicling America Result 1" width="700px" style="border:1px solid black;">

Here is another result. Can you spot anything odd about this example?

<img src="../images/CA-result-4.png" alt="Chronicling America Result 2" width="700px" style="border:1px solid black;">

In the following section, you will learn how to access such results and their transcribed text programmatically, so that you can do much more with them than just searching and inspecting them manually. To do that we need an Application Programming Interface (API).

## 3. Application Programming Interface (API) Search

### 3.1 What is an API?

API is an acronym for Application Programming Interface, which is a complicated way of describing an interface between two applications, or between a human and an application or database.  You can think of an API a bit like a waiter in a restaurant. You know how to talk to the waiter to order food or pay the bill. The waiter acts as the 'interface' between you and the restaurant, bringing you the food you've ordered while shielding you from all the complicated stuff that's going on behind the scenes such as food deliveries, stock management, cooking and cleaning dishes.

In the context of searching, an API provides a way to access data systematically using structured queries. A search API allows you to search existing items in a data collection or catalogue, in our case the historical newspaper collections hosted on Chronicling America. In this case, the search API is the interface to the data, and you just need to learn how to communicate with it (i.e. formulate your query using syntax the API can understand) to get the right information back, i.e. to return newspaper pages that are relevant to your search query.

A search API allows you to execute a search query using a URL and get back results that match the query. In this sense, it's not that different to conventional search. You just need to learn how the search query URLs are constructed to do the search (instead of typing out your query in the search box and clicking "Go", which can be laborious if you have a lot of queries to run). Another useful difference is that with an API, rather than results just being displayed in your browser, you can often request them in different formats and download them straight onto your computer.

Let's see how it works in action.

### 3.2 Using the Library of Congress API to search the Chronicling America data

### Simple Text Search

All Chronicling America API searches start with the following base URL (expanded with path parameters): https://www.loc.gov/collections/chronicling-america/

This URL is expanded further using specific search parameters.

E.g. a simple search for articles containing the word "alacrity" is made up of the base URL https://www.loc.gov/, expanded with the path parameters "collections/chronicling-america/" to specify the collection and followed by the search parameters "?ops=AND&qs=alacrity&searchType=advanced", all combined to: 

https://www.loc.gov/collections/chronicling-america/?ops=AND&qs=alacrity&searchType=advanced

`collections/chronicling-america/` specifies which collection to search.

Everything after the question mark specifies the search query and parameters. In this case, we want to search the full text of the newspapers, which is done by using `searchType=advanced`, using the query string "alacrity" (`qs=alacrity`) and the AND operator (which searches for all words in the query anywhere in the document).
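If you'd like to see these parts separated out, the following sketch takes the search URL apart using Python's built-in `urllib.parse` module. It doesn't contact the API at all; it only splits the text of the URL into its pieces.

```python
from urllib.parse import urlsplit, parse_qs

url = "https://www.loc.gov/collections/chronicling-america/?ops=AND&qs=alacrity&searchType=advanced"

# Split the URL into host, path and query string
parts = urlsplit(url)
print("Host: ", parts.netloc)
print("Path: ", parts.path)

# Turn the query string (everything after the question mark) into a dictionary
params = parse_qs(parts.query)
for key, values in params.items():
    print(f"Parameter {key} = {values[0]}")
```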

### Proximate Search

If you use Chronicling America's browser interface to do a search with multiple words, you are asking it to do what is also known as a proximate search. This means that the search engine searches for all the words within each page, but they don't need to appear one after the other.  You may have noticed this when inspecting the results of the conventional search earlier, as some results don't contain the exact search query.

When writing a search query for the API, proximate search can be replicated using the `ops=AND` search parameter and using plus signs between multiple words.  So, the API search equivalent to the browser search for "little alacrity and energy" is:

https://www.loc.gov/collections/chronicling-america/?ops=AND&qs=little+alacrity+and+energy&searchType=advanced

If you click on this link, a new tab should open which contains the results you looked at earlier.

### Phrase Text Search

You can also do a search for the exact phrase by using the `ops=PHRASE` parameter, like this:

https://www.loc.gov/collections/chronicling-america/?ops=PHRASE&qs=little+alacrity+and+energy&searchType=advanced

Note that this only returns a small number of pages containing the exact phrase "little alacrity and energy" but not the pages which also contain all four individual words.  You can achieve the same in the browser search by going to "Advanced Search" and running a search with the phrase "little alacrity and energy".
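Rather than typing these URLs by hand, you could also build them in Python. The sketch below uses a small helper called `build_search_url` (a name made up for this example) together with `urlencode` from the standard library, which joins the parameters with ampersands and turns spaces into plus signs for us.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.loc.gov/collections/chronicling-america/"

def build_search_url(query, ops="AND"):
    """Build a search URL for a query, using AND (proximate) search by default."""
    params = {"ops": ops, "qs": query, "searchType": "advanced"}
    return BASE_URL + "?" + urlencode(params)

# Same words, two different kinds of search
proximate_url = build_search_url("little alacrity and energy")
phrase_url = build_search_url("little alacrity and energy", ops="PHRASE")

print(proximate_url)
print(phrase_url)
```

The two URLs printed should be the same as the proximate and phrase search links shown above.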

But hang on, what's so special about API searches, if I can do the exact same thing in the browser? What are they useful for?

### Different Formats and Downloading Results

The most useful thing about Search APIs is that they often allow you to download the data in a specified format directly to your computer, so that you can do further analysis with it.

If you send a query to the Chronicling America API without specifying a file format, you will receive your results in html format, which is the default, meaning the results are displayed in a new tab in your browser.  So, nothing new there.

However, the API also allows you to ask for your results to be delivered in json, pdf or xml format. This way you can easily download the files to your computer or bring them into your notebook for further analysis, for instance if you want to use regular expressions to find strings in the text or perhaps perform sentiment analysis on the text.

**json** stands for JavaScript Object Notation, which is a standard text-based format for representing structured data based on JavaScript object syntax. It is commonly used for transmitting data in web applications (e.g., sending data from the server to the client, so it can be displayed on a web page, or vice versa).  In our case, it's used to store all the results for a search query.

**pdf** stands for Portable Document Format, which is a file format developed by Adobe and is widely used for sharing documents that need to maintain their visual appearance, such as reports, forms and publications (in our case the images of pages of newspapers). We don't work with the images in this class but they are available to you should you want to download them.

**xml** stands for eXtensible Markup Language, which is a markup language for encoding documents in a format that is both human-readable and machine-readable. XML uses tags to structure data hierarchically, making it ideal for storing structured information. In this case, XML format is used to store the underlying transcribed text displayed on each newspaper page.

We'll be working with the results in json and xml format. For example, if we adapt our previous query so it requests the results in json, it will look like this:

https://www.loc.gov/collections/chronicling-america/?ops=PHRASE&qs=little+alacrity+and+energy&searchType=advanced&fo=json

Note that `fo=json` is used to request json format and this search parameter is added after an ampersand (&) which separates multiple search parameter key/value pairs.
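As a small illustration of how key/value pairs are chained together, the sketch below adds `fo=json` to an existing search URL. The helper name `add_format` is made up for this example, and the code only manipulates the URL text without sending any request.

```python
from urllib.parse import urlsplit, parse_qsl, urlencode

def add_format(url, fmt="json"):
    """Return the URL with an extra fo=<format> search parameter appended."""
    parts = urlsplit(url)
    # Read the existing key/value pairs, then add the new one at the end
    pairs = parse_qsl(parts.query)
    pairs.append(("fo", fmt))
    return parts._replace(query=urlencode(pairs)).geturl()

html_url = "https://www.loc.gov/collections/chronicling-america/?ops=PHRASE&qs=little+alacrity+and+energy&searchType=advanced"
json_url = add_format(html_url)
print(json_url)
```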

When you click on the link above you'll get a new tab which, once it has opened properly, should look like this:
<img src="../images/CA-results-json-2.png" alt="Chronicling America Results in json">

(If your browser loads a page which looks different and has only a few words at top left, try clicking 'Raw Data'.)

This is quite hard to read but with an editor which can read json format it's much easier to figure out what's going on.  Guess what? We can load the json into this notebook and look at it.

To do this, we first need to import the Python packages `requests` and `json`, as well as a bunch of other libraries which we need later in this notebook.

In [None]:
# import various libraries needed in this notebook
from urllib.request import urlopen
import requests
import json
import time
import xml.etree.ElementTree as ET
import re
import math

We then specify the search URL, made up of the base URL and some search parameters.

To request the URL from the API, we use the `requests` package, read the response and load it as json.  We then print the first result using `json.dumps` with indentation, which shows it with slightly better formatting than a plain `print` would.

When you run the code cell below, you will see that the response contains a number of results. These correspond to the newspaper pages that matched our earlier search query, including metadata, such as the title of the newspaper, the place of publication, the publication date, etc.

In [None]:
# search query
query = "little+alacrity+and+energy"

# build the search URL from the base URL, the query and the search parameters
url = "https://www.loc.gov/collections/chronicling-america/?dl=page&ops=PHRASE&qs="+ query +"&searchType=advanced&fo=json"

# specify some additional search parameters using a dictionary
# "c": 100 request pages of 100 results at a time
# "at": "results,pagination" means include both the actual search results and pagination information
params = {"c": 100, "at": "results,pagination"}

# request URL and store its response
response = requests.get(url, params=params)

# checks the response status... 200 is good but 429 or 499 or any other numbers are bad
if response.status_code == 200:
    # treat response as json and show what's in the json results
    json_data = response.json()
    
    # If there's a 'results' key in the json data
    if 'results' in json_data:

        # Print the first result
        print("First result:")
        print(json.dumps(json_data['results'][0], indent=2))
    
else:
    print(f"Error loading URL: Status code {response.status_code}")
        


You can see that json format uses a key-value pair syntax.  So, the key (which you can think of as a label for the type of information, such as you might find in the header of a column in a table), is on the left, followed by a colon and then, on the right, the value (which you can think of as the information itself, such as you might find partway down a column in a table). Here are some examples of key-value pairs in the data we've just downloaded:


```
'date': '1899-07-05',
'dates': ['1899-07-05'],
'description': ['The Farm Second Crop In Peas After the hogs and stock have had the run of the field as long as possible the land should be sowed in some variety of the late matur ing cow pea such as the Unknown This pea sowed as late as the 10th of July will harvest a magnificent crop by the middle of October The value of the second crop planted in peas and without ferti lizers will fall but little short of the value of the wheat crop In dependent of the high value of the pea as a food for all kinds of stock we must not forget their manurial value to the land in drawing and depositing in the soil a large supply of free nitrogen from the air The stubble and roots will supply humus to the soil also and build up the physical condition of the land If the wheat field is laid off into rows three feet apart and the peas drilled and cultivated a heavier crop of peas may be harvested but it is cheap er to broadcast plow in and har row cutting the vines when the pods are ripening than otherwise As many acres'], 
...
'language': ['english'],
'location': ['united states', 'richmond', 'virginia'],
'location_city': ['richmond'],
'location_country': ['united states'],
'location_state': ['virginia'], 
...
'partof_title': ['the central presbyterian (richmond, va.) 1856-1908'],
'publication_frequency': ['weekly'],
'resources': [{'files': 1, 'url': 'https://www.loc.gov/resource/sn89053987/1899-07-05/ed-1/?sp=15'}], 
...
```

A value can belong to a range of data types: it can be a number, a string, a Boolean (so true or false), an array (e.g. an ordered list), an object or null (so left empty).

Key-value pairs are separated by commas, and they can also be nested (e.g. if you look at line 3 of the json output, you will see that `results` contains all the results returned by our search query.)
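To see these data types and the nesting in action, here is a small sketch which loads a hand-written piece of json with Python's `json` module. The snippet is loosely modelled on the output above, but the `digitized` and `notes` entries are included purely to illustrate a Boolean and a null value, so treat them as assumptions rather than as fields the API is guaranteed to return.

```python
import json

# A hand-written json snippet (not a real API response)
sample = """
{
  "results": [
    {
      "date": "1899-07-05",
      "language": ["english"],
      "resources": [{"files": 1, "url": "https://www.loc.gov/resource/sn89053987/1899-07-05/ed-1/?sp=15"}],
      "digitized": true,
      "notes": null
    }
  ]
}
"""

data = json.loads(sample)

# 'results' is an array, so we pick its first entry, which is an object
first = data["results"][0]

# Show which Python type each json value has been turned into
for key, value in first.items():
    print(key, "->", type(value).__name__)

# Nested values are reached by chaining keys and list positions
print(first["resources"][0]["files"])
```

In Python, a json object becomes a dictionary, an array becomes a list, true and false become `True` and `False`, and null becomes `None`.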

You will notice that each result for our query is very long as it contains all of the metadata and a short description of each result (i.e., the text following `'description': `), which in this case looks like the first few lines of the text on the page.

You might notice that this text contains some errors or peculiarities. That is because it is machine transcribed, i.e. OCRed.

**OCR** is short for Optical Character Recognition which is used for digitising historical documents, such as the newspapers in Chronicling America.  OCR is used to turn letters in an image into electronic text.

Though there have been huge advances in OCR in the recent past, it is not 100% accurate, and if you look at the descriptions more closely you will be able to find some OCR errors.

Let's make this data a bit more readable.

You can also store the json data in a list with one entry per result.

In [None]:
# read json results into a list
results = json_data["results"]
print(results[0])

The results are stored in a list of items which are still in json format.  To be able to read the information a bit better, let's store it in a table (also known as a data frame), with the information for each item (i.e. the values) being stored in one row, and one column for each type of information (i.e. the keys).

To do this, we'll use the pandas library, as it has a useful function called `json_normalize` which can read information in json format and flatten it into a data frame (which in the code block below is the `df` variable).

In [None]:
import pandas as pd
df = pd.json_normalize(results)

You can now check out the content of the data frame (`df`) we just created by using the `.head()` function, which you have learnt in an earlier tutorial.  The metadata associated with each newspaper page is now in a table format, and so is much easier to read.

In [None]:
df.head(10)

Note that the data frame even contains the URL pointing to the results for each item (scroll all the way to the right of the dataframe if you cannot see it).  When you follow that URL you'll also be able to get access to other file formats and the images of the newspaper pages themselves.

In [None]:
items = []
for result in results:
    # Get the link to the item record
    if result.get("id"):
        item = result.get("id")
        
        # add the json format search parameter so we can access each page in json later
        new_item = item + '&fo=json'
        items.append(new_item)
        print(new_item)

print('\nSuccess. Your API search query found '+str(len(items))+' related newspaper pages.')


In [None]:
print(items)

Up to now, we have accessed the overall search results and collected the link to each individual result. Next, we will follow these links to collect the OCRed text for each newspaper page returned by the API. The OCRed text is stored in XML format.

So, first we need to locate the XML files for each newspaper page.  We do this using the following function.  It takes as input a list of json URLs and returns a list of XML URLs.

In [None]:
#Collect XML URLs from a list of newspaper page items.
#Parameters:
    #items (list): List of URLs to newspaper page items
    #file_extension (str): File extension to filter for (default: 'xml')
    #sleep_time (int): Sleep time between requests to avoid overloading API (default: 5 seconds)
    #verbose (bool): Whether to print progress and URLs (default: True)
#Returns: List of XML URLs found
def collect_xml_urls(items, file_extension='xml', sleep_time=5, verbose=True):
    
    # Prints a message for the number of items found in the list
    if verbose:
        print(f'Locating {file_extension.upper()} files for {len(items)} newspaper pages found.')
    
    # We collect the XML URL for each query result and store them in a list
    page_urls = []
    
    # This for loop goes through the list of URLs and collects their json format
    for i, item in enumerate(items):
        try:
            call = requests.get(item)
            
            # A small break is applied to avoid overloading the API
            # if you delete the next line of code, you might be asking the API for too many requests
            # per minute/hour and might be locked out of it for a while
            time.sleep(sleep_time)
            
            # If the page loads properly then
            if call.status_code == 200:
                
                # grab the json format for all pages
                page_data = call.json()
                page = page_data['page']
        
                # Go through each page, collect the URLs for the files with the correct extension
                for p in page:
                    if 'url' in p:
                        page_url = p['url']
                        if page_url.endswith(file_extension):
                            page_urls.append(page_url)
                            if verbose:
                                print(f"Found: {page_url}")
            else:
                print(f"Error loading item {i+1}: Status code {call.status_code}")
        
        # this handles various exceptions which can occur if the API request goes wrong
        except requests.RequestException as e:
            print(f"Request error for item {i+1}: {e}")
        except KeyError as e:
            print(f"JSON structure error for item {i+1}: Missing key {e}")
        except Exception as e:
            print(f"Unexpected error for item {i+1}: {e}")
    
    # Prints the total number of XML URLs found
    if verbose:
        print(f"Total {file_extension.upper()} files found: {len(page_urls)}")
    
    return page_urls

When you call the `collect_xml_urls` function, some parameters have default values (see above: `file_extension='xml', sleep_time=5, verbose=True`) which you can choose to override if you like but don't have to.

Most of the time, users want the default behaviour but this gives them some flexibility to customise when needed.

Let's stick with the default parameters for now, which specify xml format, a pause of 5 seconds between requests and verbose printing of output to tell us what's happening as the code is running.

In [None]:
page_urls = collect_xml_urls(items)

### 🐛 Mini task 3.1

Change one or two of the default parameters (e.g. `sleep_time=6` and `verbose=False`) yourself when calling the function to see what happens. Be careful not to set `sleep_time` to less than 5 seconds, otherwise the code will make too many requests to the API per minute and it will block you.

In [None]:
# write your solution here


<details><summary style='color:blue'>CLICK HERE TO SEE THE ANSWER. BUT REALLY TRY TO DO IT YOURSELF FIRST!</summary>

    ### BEGIN SOLUTION

    page_urls = collect_xml_urls(items, sleep_time=6, verbose=False)

    ### END SOLUTION
    
</details>

Now, let's move on to check what's stored in the `page_urls` list.

In [None]:
print(page_urls)

The `page_urls` list now contains all the links to the XML files with the text for each newspaper page. If you open one of the links above in a new browser tab, you can see what the XML for each file looks like.

If you scroll down a bit, you can see that each page is made up of a print space, text blocks, text lines and strings, separated by white space (SP). This is how all the textual information on the pages is stored.  For example, see the following string: 

```
<String ID="S7" CONTENT="voice" WC="1" CC="0 0 0 0 0" HEIGHT="150" WIDTH="376" HPOS="1479" VPOS="2996"/>
```

The XML format contains the string ID (S7), the content of the string (i.e., the word "voice") and even its positional coordinates.  This information is used to highlight matched query words systematically on page images shown when searching the collection in a browser. 
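You can also read these attributes with Python. The sketch below parses a tiny, hand-written XML fragment in the style of the files above. It is not a real newspaper page: the second String (with the word "heard") and the exact nesting are made up for illustration, and we assume the version 2 ALTO namespace here, whereas a real file may declare a different one at the top.

```python
import xml.etree.ElementTree as ET

# A tiny, hand-written fragment in ALTO style (not a real newspaper page)
sample = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout><Page><PrintSpace><TextBlock>
    <TextLine>
      <String ID="S7" CONTENT="voice" HEIGHT="150" WIDTH="376" HPOS="1479" VPOS="2996"/>
      <SP/>
      <String ID="S8" CONTENT="heard" HEIGHT="150" WIDTH="380" HPOS="1900" VPOS="2996"/>
    </TextLine>
  </TextBlock></PrintSpace></Page></Layout>
</alto>"""

root = ET.fromstring(sample)

# The namespace tells ElementTree which version of the format the tags belong to
ns = {"alto": "http://www.loc.gov/standards/alto/ns-v2#"}

words = []
for string in root.findall(".//alto:String", ns):
    # Each attribute of the String tag can be read with .get()
    print(string.get("ID"), string.get("CONTENT"), string.get("HPOS"), string.get("VPOS"))
    words.append(string.get("CONTENT"))

print(" ".join(words))
```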

### 🐛 Mini task 3.2

Take a look at another String in the XML file you opened to see what its text and other relevant information are.

In our case, we are only interested in the text itself and not all of the XML markup around it. To analyse the text, we first need to extract it from the XML. To do that we use a helper function which downloads the XML for each page and extracts the text from it.

In [None]:
# helper function to collect the XML for each page and turn it to txt format
def extract_text_from_xml(url, sleep_time=5):
    """
    Extract and return the full text from ALTO XML format.
    """
    try:
        # Get the XML data
        response = requests.get(url)
        time.sleep(sleep_time)
        
        # Checks if the API sent a response correctly
        # If not correct, the code is 429 or another number that's not 200
        if response.status_code != 200:
            print(f"Failed to fetch XML from {url}. Status code: {response.status_code}")
            return ""
        
        # Ensure UTF-8 encoding
        # This is a particular character encoding which is commonly used to represent text
        response.encoding = 'utf-8'
        
        # Grabs the text of the response
        xml_content = response.text
        
        # Treats the text as XML format
        root = ET.fromstring(xml_content)
        
        # Define the ALTO namespaces to try
        # These are slightly different versions of ALTO XML format which are specified at the top of the XML file
        # A file is usually represented in one of these types of XML format (they are all quite similar)
        namespaces = [
            {'alto': 'http://www.loc.gov/standards/alto/ns-v2#'},
            {'alto': 'http://www.loc.gov/standards/alto/ns-v3#'},
            {'alto': 'http://www.loc.gov/standards/alto/ns-v4#'},
            {'alto': 'http://schema.ccs-gmbh.com/ALTO'},
        ]
        
        all_text = []
        
        # This for loop goes through all the name spaces and checks if the document is in one of them,
        # extracts the text lines in the XML and goes through each line to reconstruct the text
        # made up of either strings or white space which are appended to the all_text list
        for ns in namespaces:
            text_lines = root.findall('.//alto:TextLine', ns)
            if text_lines:  # If we found text lines with this namespace, use it
                for line in text_lines:
                    line_text = []
                    
                    # Process all children of the TextLine (String and SP elements)
                    for child in line:
                        if child.tag.endswith('String'):  # Word
                            content = child.get('CONTENT')
                            if content:  # Only add if content exists
                                line_text.append(content)
                        elif child.tag.endswith('SP'):  # Whitespace
                            line_text.append(' ')
                    
                    # Only append non-empty lines
                    if line_text:
                        all_text.append(''.join(line_text))
                break  # Exit the namespace loop once we found content
        
        # The function then joins all the lines into one overall document text by adding newline characters
        return '\n'.join(all_text)
    
    # Errors are handled here, mainly to do with requesting the URL or parsing the XML
    except ET.ParseError as e:
        print(f"XML parsing error for {url}: {e}")
        return ""
    except requests.RequestException as e:
        print(f"Request error for {url}: {e}")
        return ""
    except Exception as e:
        print(f"Unexpected error processing {url}: {e}")
        return ""
    

Next, we run the following bit of code, which extracts the text for each newspaper page in the search results and stores it in the `texts` list.

In [None]:
# Extracts the text for each URL in the list and stores the texts in a list, which we turn into a data frame later
texts = []

for page_url in page_urls:
    
    print(f"Extracting text from XML: {page_url}")
    t = extract_text_from_xml(page_url)
    texts.append(t)

When you print the beginning of the first text in the list and look at it more closely, you will see the OCRed text for each page.

Although there have been huge advances in OCR in the recent past, it is not 100% accurate, and if you look at the text closely you will be able to find some OCR errors.

In [None]:
print(texts[0][:4000])

In [None]:
# Create new data frame with the URLs and texts
df_texts = pd.DataFrame({
    'url': page_urls,
    'ocr': texts
})

df_texts.head()

Next, we'll show you how to print the text for each newspaper page a bit more neatly and how to find all matches for the search term in them.

In the next bit of code, we use a for loop to extract the values of the `ocr` column containing the OCRed text and run a regular expression search over it.  We use the `findall()` function to find all mentions of "alacrity".

The regular expression used is the following: `'[^\n]*\n*.*alacrity.*\n[^\n]*'`

This means "match any line containing the word 'alacrity' but also the line before and after the line it appears in" (to show some context).  This looks quite complicated at first, but let's take it apart to understand the different bits of the regular expression and how they work together. Remember that:

- `\n` means newline character
- `.*` means zero or more characters
- `[^\n]` means not a newline character
- `[^\n]*` means zero or more characters but not newline

So, looking at the first part of the regular expression, `[^\n]*\n*.*alacrity` means "match zero or more characters but not newline, followed by an optional newline character followed by zero or more characters and the word 'alacrity'".

Then, looking at the second part of the regular expression, `alacrity.*\n[^\n]*` means "match the word 'alacrity', followed by zero or more characters, followed by newline and zero or more characters but not newline".

So, together, the RegEx means "match the line containing the word 'alacrity' as well as the line before it (if there is one) and the line after it".  This might seem difficult but when you use regular expressions frequently, you'll become used to reading and constructing them.

The `re.IGNORECASE` flag at the end makes the search case-insensitive.
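Before running the search on the real OCRed text, it can help to try the regular expression on a few made-up lines. The sketch below does this with a short invented text. It also shows one edge case: the pattern requires a newline after the line containing the word, so a match on the very last line of a text with nothing after it is not found.

```python
import re

# A raw string (r'...') passes the backslashes to the regex engine unchanged
pattern = r'[^\n]*\n*.*alacrity.*\n[^\n]*'

# Four made-up lines of text
text = "first line\nsecond line with Alacrity here\nthird line\nfourth line"
matches = re.findall(pattern, text, re.IGNORECASE)
for match in matches:
    print(match)
    print("---")

# Edge case: no newline follows the word, so nothing is matched
no_match = re.findall(pattern, "only alacrity", re.IGNORECASE)
print(no_match)
```

The first search returns one match made up of the first three lines, and the second search returns an empty list.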

Finally, we'll store the matches from our RegEx search in a list (`all_results`) and then use another for loop to print them more neatly, displaying the number of the item, followed by the match in context.

In [None]:
all_results = []
# loops through data frame to find all matches of the search term (and the line before/after if there is one)
for (index, row) in df_texts.iterrows():
    item = row.loc['ocr']
    # a raw string (r'...') passes \n to the regex engine unchanged
    results = re.findall(r'[^\n]*\n*.*alacrity.*\n[^\n]*', item, re.IGNORECASE)
    all_results.append(results)

# print all results found
counter = 0
for results in all_results:
    counter = counter + 1
    print("Item " + str(counter) + ":")
    for result in results:
        # Yellow highlight, also case-insensitive so that e.g. "Alacrity" is highlighted too
        highlighted = re.sub('alacrity', lambda m: '\033[93m' + m.group(0) + '\033[0m', result, flags=re.IGNORECASE)
        print(highlighted + "\n")

Look how far you've come. You have:

* downloaded the results for a search query using an API,
* converted the data into a data frame,
* extracted the textual information,
* matched some text using a regular expression search, and
* displayed the results in your notebook.

Well done!

Now the power of using API search should be clearer.  It's very useful for searching and accessing data collections hosted online and especially so for analysing large numbers of results.  Imagine our search retrieved not seven but hundreds or thousands of results.  Using API search combined with Python RegEx search allows you to extract and analyse textual data much faster than having to navigate through it all manually in a browser.


### More advanced API-driven data collection

Up until now, we used quite a specific search query which only returns a few results.  In the following code cell, you'll see how we can collect a subset of data for an API search query that returns more results.

Next, you'll see a function called `safe_api_call` which you can use to ensure that the request to the API is successful and returns some results.  In some cases, an API might be down or a user might be blocked because they have sent too many queries to an API.  So this function checks that the API serves a response as it should.  The code only continues to run if the API returns a result. 

If it does, the code then determines how many results the query returns, i.e. the number of documents (or items) containing the search query.  In our previous example, we only worked with a small number of results.  Next, you'll learn how to collect and analyse data when the query returns many more results.

For example, for the query (`black cat`) below, we get over 58,900 results. At 20 results per page, that is nearly 3,000 pages of results.  In a browser, you would most likely only look at the first few pages of results to find information that's relevant to you and, in some cases, refine your query to get fewer, more specific results.

Using an API, you could collect and analyse all of the data. However, this would take quite a long time to do.  So in this case, we cap the number of results at 60 (3 pages of 20 results).  This is mainly to show how this kind of data collection works in principle without taking too long and overloading the API.
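The page arithmetic and the cap can be sketched in a few lines. The `page_plan` helper and the hit count passed to it are hypothetical and chosen only to illustrate the calculation; the real count comes from the API response.

```python
import math

def page_plan(total_hits, per_page=20, max_pages=3):
    """Return (total pages, pages to fetch, maximum items to collect)."""
    total_pages = math.ceil(total_hits / per_page)
    pages_to_fetch = min(total_pages, max_pages)
    max_items = min(total_hits, pages_to_fetch * per_page)
    return total_pages, pages_to_fetch, max_items

# A hypothetical hit count, chosen only to illustrate the arithmetic
print(page_plan(58_901))
# A small result set that fits within the cap
print(page_plan(7))
```

With 58,901 hits this gives 2,946 pages in total, of which only 3 are fetched, for at most 60 items. With 7 hits, everything fits on a single page.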

In some cases, APIs have constraints which allow you to search them only up to X times per minute, hour or day. If you exceed this limit, your computer will be blocked or blacklisted. This is also the case for the LOC API, and we want to avoid that happening when doing these exercises.
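One way to stay under such a limit is to enforce a minimum gap between requests on your own side. The following is a minimal sketch, assuming a fixed gap is enough; the `Throttle` class is a hypothetical helper, and the interval is an illustrative value rather than the actual limit of any particular API.

```python
import time

class Throttle:
    """Ensure at least min_interval seconds pass between successive calls to wait()."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last_call = None

    def wait(self):
        """Pause if the previous call was too recent, then record the current time."""
        now = self._clock()
        if self._last_call is not None:
            remaining = self.min_interval - (now - self._last_call)
            if remaining > 0:
                self._sleep(remaining)
        self._last_call = self._clock()

# Usage: call throttle.wait() immediately before each request in a loop
throttle = Throttle(min_interval=0.2)  # 0.2 s keeps this demo fast; use several seconds for a real API
start = time.monotonic()
for page in range(3):
    throttle.wait()
    print(f"Would request page {page + 1} now")
elapsed = time.monotonic() - start
```

The first call goes through immediately; each later call waits only for as long as is still needed, so time spent processing a response counts towards the gap.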

The following code cell contains some safety checks to ensure the API works as it should.

In [None]:
# This safe_api_call function adds an extra level of checking that the API returns results as it should.
# Sometimes the API might be down or overloaded, in which case you have to wait and come back to do this another time.
# If the request itself fails (e.g. a timeout or connection problem), the function tries again, up to three times in total.
# If the URL loads properly and returns JSON, the results are returned.
# If it doesn't (including any HTTP error status), None is returned.
def safe_api_call(url, max_retries=3, sleep_time=5):
    """Make API call with error handling and retries"""
    for attempt in range(max_retries):
        try:
            response = requests.get(url, timeout=10)
            print(f"Extracting results from page (status code {response.status_code})")
            time.sleep(sleep_time)
            
            if response.status_code == 200:
                # Check if response is actually JSON
                content_type = response.headers.get('content-type', '')
                if 'application/json' in content_type:
                    return response.json()
                else:
                    print(f"Expected JSON but got: {content_type}")
                    print(f"Response preview: {response.text[:200]}")
                    return None
            else:
                print(f"HTTP Error {response.status_code}: {response.text[:200]}")
                return None
                
        except requests.exceptions.RequestException as e:
            print(f"Request failed (attempt {attempt + 1}): {e}")
            if attempt < max_retries - 1:
                time.sleep(sleep_time)
            else:
                return None
        except ValueError as e:
            print(f"JSON parsing failed (attempt {attempt + 1}): {e}")
            print(f"Response content: {response.text[:200]}")
            if attempt < max_retries - 1:
                time.sleep(sleep_time)
            else:
                return None
    return None
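
The retry pattern inside `safe_api_call` can be seen in isolation with a stand-in function that fails a couple of times before succeeding. `retry_call` and `flaky_fetch` are hypothetical names used only in this sketch, and no real request is made.

```python
import time

def retry_call(func, max_retries=3, sleep_time=0.1):
    """Call func() up to max_retries times; return its result, or None if every attempt fails."""
    for attempt in range(max_retries):
        try:
            return func()
        except ConnectionError as e:
            print(f"Attempt {attempt + 1} failed: {e}")
            if attempt < max_retries - 1:
                time.sleep(sleep_time)  # wait before trying again
    return None

# Stand-in for an API request: fails twice, then succeeds
attempts = {"count": 0}

def flaky_fetch():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("simulated network problem")
    return {"pagination": {"of": 7}}

result = retry_call(flaky_fetch)
print(result)
```

Returning `None` after the last failed attempt, rather than raising an error, means the calling code has to check for `None` before using the result.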

Now, we specify the new search URL.

In [None]:
# First we specify a query and add it to the base url
query = "black+cat"
url = "https://www.loc.gov/collections/chronicling-america/?dl=page&ops=PHRASE&qs="+ query + "&searchType=advanced&fo=json&c=20&at=results,pagination"


And request it from the API, which provides an overall count of the results.

Please note that collecting any more than 50 or 60 results in class makes the code very slow, so we will cap the data collection at 60 just to show how it works.

In [None]:
# Trying to get a response from the API using the URL
print(f"Trying URL: {url}")
json_results = safe_api_call(url)

# The rest of the code only continues if the API call succeeds
if json_results is None:
    print("The API call failed. Please wait a while and then run this cell again.")
else:
    # Here we get the overall number of results and pages for the search query
    overall_count = json_results['pagination']['of']

    # We asked for 20 results per page, so we can work out the number of pages mathematically
    # The math.ceil() function rounds up the results to the next integer
    pages = math.ceil(overall_count / 20)
    print("There are " + str(overall_count) + " hits (presented on " + str(pages) + " pages) for the search query: " + query)

    print("So, way too many results to download in class.")
    print("We'll cap the number at the first 60 to show how it works in principle.")
    print("Feel free to experiment with slightly larger numbers of results in your own time,")
    print("but do not overload the API, or you will get blocked.")


### 🐛 Mini task 3.3

Now, try to find out how many results are returned for the phrase search "ginger cat" in comparison. You can reuse some of the code we just used, but please change all the variable names to avoid overwriting ones we created earlier and want to reuse later (e.g. `url` -> `new_url`, etc.).

In [None]:
# write your solution here


<details><summary style='color:blue'>CLICK HERE TO SEE THE ANSWER. BUT REALLY TRY TO DO IT YOURSELF FIRST!</summary>

    ### BEGIN SOLUTION

    new_query = "ginger+cat"
    new_url = "https://www.loc.gov/collections/chronicling-america/?dl=page&ops=PHRASE&qs="+ new_query + "&searchType=advanced&fo=json&c=20&at=results,pagination"

    print(f"Trying URL: {new_url}")
    new_json_results = safe_api_call(new_url)

    new_overall_count = new_json_results['pagination']['of']

    new_pages = math.ceil(new_overall_count / 20)
    print("There are " + str(new_overall_count) + " hits (presented on " + str(new_pages) + " pages) for the search query: " + new_query)


    ### END SOLUTION
    
</details>

Let's go back to the "black cat" results. 

The following function finds all the JSON URLs in the API response for a phrase search query.  It goes through the individual pages of results to get them all.

Remember that we are capping the results at a maximum of 3 pages (3 x 20 results), so no more are collected even if the query returns more.

In [None]:
# Run page 1 search and get a list of results, then look for results on further pages (if there are any)
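def get_item_ids(url, items=None, conditional='True', count=0, max_pages=3):
    # Use a fresh list for each top-level call (a mutable default list would be shared between calls)
    if items is None:
        items = []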
    # Check that the query URL is not an item or resource link.
    exclude = ["loc.gov/item","loc.gov/resource"]
    if any(string in url for string in exclude):
        raise NameError('Your URL points directly to an item or '
                        'resource page (you can tell because "item" '
                        'or "resource" is in the URL). Please use '
                        'a search URL instead. For example, instead '
                        'of \"https://www.loc.gov/item/2009581123/\", '
                        'try \"https://www.loc.gov/maps/?q=2009581123\". ')
    
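    json_results = safe_api_call(url)
    # Stop here if the API call failed, returning whatever has been collected so far
    if json_results is None:
        return items
    results = json_results['results']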
    for result in results:
        # Filter out anything that's a collection or web page
        filter_out = ("collection" in result.get("original_format")) \
                or ("web page" in result.get("original_format")) \
                or (eval(conditional)==False)
        if not filter_out:
            # Get the link to the item record
            if result.get("id"):
                item = result.get("id")
                #print(item)
                # Filter out links to Catalog or other platforms
                if item.startswith("http://www.loc.gov/resource"):
                    resource = item  # Assign item to resource
                    items.append(resource)
                if item.startswith("http://www.loc.gov/item"):
                    items.append(item)
                        
    # Repeat the loop on the next page, up to max_pages pages (3 by default)
    if json_results["pagination"]["next"] is not None and count < (max_pages - 1):
        next_url = json_results["pagination"]["next"]
        get_item_ids(next_url, items, conditional, count + 1, max_pages)
    return items
    

# Creates ids_list based on url results
ids_list = get_item_ids(url, items=[])

# Adds '&fo=json' to the end of each URL in ids_list
new_items = []
for item_url in ids_list:
    if not item_url.endswith('&fo=json'):
        item_url += '&fo=json'
    new_items.append(item_url)
ids = new_items

for i in new_items:
    print(i)

print('\nSuccess. Your API search query found the first '+str(len(new_items))+' related newspaper pages.')
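
The idea of following `next` links up to a page cap can be tried out without calling the API at all. The sketch below uses made-up pages in a dictionary and is written as a loop rather than recursion; `mock_pages` and `collect_ids` are hypothetical names, and only the `results` and `pagination` keys mirror the response structure used above.

```python
# Mock API responses keyed by URL; each page links to the next one
mock_pages = {
    "page1": {"results": [{"id": "item-1"}, {"id": "item-2"}], "pagination": {"next": "page2"}},
    "page2": {"results": [{"id": "item-3"}, {"id": "item-4"}], "pagination": {"next": "page3"}},
    "page3": {"results": [{"id": "item-5"}, {"id": "item-6"}], "pagination": {"next": "page4"}},
    "page4": {"results": [{"id": "item-7"}], "pagination": {"next": None}},
}

def collect_ids(first_url, fetch, max_pages=3):
    """Follow 'next' links from first_url, collecting ids from at most max_pages pages."""
    ids = []
    url = first_url
    pages_seen = 0
    while url is not None and pages_seen < max_pages:
        page = fetch(url)
        if page is None:  # a failed call ends the collection early
            break
        ids.extend(r["id"] for r in page["results"] if r.get("id"))
        url = page["pagination"]["next"]
        pages_seen += 1
    return ids

collected = collect_ids("page1", mock_pages.get)
print(collected)
```

Although four pages exist, only the first three are visited, so `item-7` is never collected. Passing the fetching function in as an argument makes it easy to swap the mock for a real API call later.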


Now, you can reuse the code we wrote earlier to collect the XML URLs and extract the text from them.  This will take a while (up to 4 min for the next code cell), because the code is deliberately slowed down so that we don't overload the API. So sit back and relax while you wait for the following code cell to complete.

Here, you have a really good example showing the beauty of functions. Once you have written a function that runs, you can reuse it as many times as you like and only need to write one line of code.

In [None]:
new_page_urls = collect_xml_urls(new_items)

And as before, you can now reuse the code to extract the text from each XML file. Note that the `sleep_time` parameter is set to 1 second only in this case.  This is because the text extraction part of the API actually allows users to access more data per minute.

In [None]:
# Extracts a list of texts for a list of URLs and stores them in a data frame, which we can use later
new_texts = []

for page_url in new_page_urls:
    
    print(f"Extracting text from XML: {page_url}")
    t = extract_text_from_xml(page_url, sleep_time=1)
    new_texts.append(t)

In [None]:
# Print the first 4000 characters of the first text in the list
print(new_texts[0][:4000])

There should be the same number of page URLs and texts extracted from them.  If not, then something has gone wrong higher up.

In [None]:
print(len(new_page_urls))
print(len(new_texts))

Take a look at the data frame to see the beginning of each text extracted.

In [None]:
# Create new data frame with the URLs and texts
all_df_texts = pd.DataFrame({
    'url': new_page_urls,
    'ocr': new_texts
})

all_df_texts.head(10)

Following on from here, you can run code similar to what we used before to print out the context containing the search query.  The output contains the results for the first 60 documents.  Note that in some cases, a document contains more than one match.

The code in the next cell prints the results in a slightly more sophisticated way than before. It prints the link to the XML file, the matched query and its context (the previous and next two lines of text, if there are any) and the line numbers in the document. It also prints a delimiter between documents, so you can more easily see what results were returned for each document.

Once you run the code below, take a look at the output to see in what contexts the phrase "black cat" is used in the data you obtained from Chronicling America.

In [None]:
all_results = []

# This for loop goes through the data frame and grabs the text (the 'ocr' column) for each matched document
for (index, row) in all_df_texts.iterrows():
    item = row.loc['ocr'] # Get text from dataframe
    url = row.loc['url']  # Get the URL from the dataframe
    
    # This splits the text into lines
    lines = item.split('\n')
    
    # A list which will store the lines that contain the query and their context
    matches_with_context = []
    
    # This for loop goes through the lines in each document
    for i, line in enumerate(lines):
        
        # And searches for the query (case-insensitively)
        if re.search("black cat", line, re.IGNORECASE):

            # Gets 2 lines before and 2 lines after the matched line
            start_idx = max(0, i - 2)
            end_idx = min(len(lines), i + 3)  # +3 because range is exclusive at end
            
            # Extract the context lines with line numbers
            context_lines_with_numbers = []
            for j in range(start_idx, end_idx):
                line_num = j + 1  # Line numbers start at 1
                line_content = lines[j]
            
                # Format with line number
                formatted_line = f"{line_num:4d}: {line_content}"
                context_lines_with_numbers.append(formatted_line)
            
            # create matches and their context
            matches_with_context.append('\n'.join(context_lines_with_numbers))
    
    # add matches and their context to list along with URL
    all_results.append({
        'url': url,
        'matches': matches_with_context
    })

# print the results
counter = 0
for result_dict in all_results:
    counter = counter + 1
    print(f"Document {counter}:")
    print(f"URL: {result_dict['url']}\n")
    
    for match in result_dict['matches']:
        print(match + "\n")  
    print("\n" + "="*50 + "\n")  # Added separator between documents

Notice how quickly the matching text is displayed once the text is stored locally.  Take a look at some of the examples to see in what contexts the search query was used.
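The context-window logic used in the cell above can be isolated into a small helper, which makes it easier to change the number of lines shown. This is a sketch on made-up lines; `context_window` and `sample_lines` are hypothetical names.

```python
def context_window(lines, index, before=2, after=2):
    """Return (line_number, text) pairs around lines[index], clipped to the document."""
    start = max(0, index - before)
    end = min(len(lines), index + after + 1)  # range end is exclusive
    return [(j + 1, lines[j]) for j in range(start, end)]

# Made-up lines standing in for one OCR'd page
sample_lines = [
    "A strange tale reached us",
    "from the county seat, where",
    "a black cat was seen on",
    "the courthouse roof at",
    "midnight on Friday last.",
    "No harm came of it.",
]

# The match is on the third line (index 2), so lines 1 to 5 are shown
for number, text in context_window(sample_lines, 2):
    print(f"{number:4d}: {text}")

# A match on the first line has no earlier lines to show
print(context_window(sample_lines, 0))
```

The `max` and `min` calls keep the window inside the document, so matches near the top or bottom of a page simply return fewer lines.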

Now that you have the text stored in the `all_df_texts` data frame you can apply further text analysis, like regular expression searches, sentiment analysis or concordance analysis. It should now make sense why APIs are so useful. 

However, please also note that while making data freely available via APIs is extremely useful to researchers such as us, it also makes the data providers vulnerable to getting their data hoovered up by large tech companies. So while it has its benefits for research, it does have some downsides as well.

### 🐛 Mini task 3.4

Discuss with your partner what you have learnt in this notebook and how you might be able to use API Search in your own research, either using the Chronicling America data or other datasets.

### 🦋 Final Task 

We would now like you to explore the Chronicling America newspaper data using your own searches. Try out some conventional searches of interest that return different numbers of results and then adapt the code above to do the same searches using the LOC Search API, download and extract the data, and find text snippets within the OCRed text using regular expressions.

We would recommend using the code under "More advanced API-driven data collection" for this as it applies a cap to the number of results, but feel free to experiment with bigger sets of search results by varying the number of results per page and/or the number of pages to be considered. Please ensure that you add a short break (`time.sleep(4)`) after each API request (particularly in loops) so that the API doesn't get overloaded.

Note when you adapt code, it's best to take a copy of the existing code and adapt it, so that you still have the original code working as intended.  So create some code cells below using the plus button (top left), copy some code from above and adapt it to your searches.

You do not need to copy functions but can reuse them simply by calling them. Avoid using the scissor button.  It deletes cells, though you can get them back by clicking on "Edit" -> "Undo Delete Cells".

There is no solution for this exercise but you have all you need above to get going.  Try not to put all the code into one long code cell so that you can inspect the intermediate data and debug in case there are errors.

In [None]:
# write your solution here and create some further code cells for additional code


## Appendix: Backup code in case the API is down

In [None]:
# This code can be used as a backup, in case the LOC API is down.
# The data for both examples above was provisionally stored in the data/chronicling-america folder.
# You can load it below to use it as input for further text analysis

# This commented out code was used to write the data to a backup data.json file
# The same was done for the little+alacrity+and+energy data.
#data = {'texts': texts, 'urls': page_urls}
#with open("../data/chronicling-america/black+cat/data.json", 'w') as f:
#    json.dump(data, f, indent=2)

# Load the black+cat backup data (there is also little+alacrity+and+energy data stored as a backup)
with open("../data/chronicling-america/black+cat/data.json", 'r') as f:
    data = json.load(f)
    backup_texts = data['texts']
    backup_page_urls = data['urls']
print(backup_texts[0][:1000] + " ...")
print(backup_page_urls)

# Now you have the data stored in the code so you can proceed with some of the code above without
# having to make any requests to the API.

In [None]:
# Create DataFrame with specified column names
all_df_texts_backup = pd.DataFrame({
    'url': backup_page_urls,
    'ocr': backup_texts
})
all_df_texts_backup.head()
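
If you want to keep a backup of your own search results in the same way, the save-and-reload round trip can be sketched as follows. The texts and URLs here are made up, and the file is written to a temporary folder so that no existing backup is overwritten; in practice you would choose your own path.

```python
import json
import os
import tempfile

# Made-up stand-ins for the notebook's lists of texts and page URLs
demo_texts = ["first page text", "second page text"]
demo_urls = ["https://example.org/page1.xml", "https://example.org/page2.xml"]

# Write to a temporary folder so that no real backup file is overwritten
with tempfile.TemporaryDirectory() as tmp_dir:
    backup_path = os.path.join(tmp_dir, "data.json")
    with open(backup_path, "w", encoding="utf-8") as f:
        json.dump({"texts": demo_texts, "urls": demo_urls}, f, indent=2)

    # Read the file back in to confirm nothing was lost
    with open(backup_path, "r", encoding="utf-8") as f:
        restored = json.load(f)

# Texts and URLs must stay aligned, one text per URL
print(len(restored["texts"]) == len(restored["urls"]))
```

Storing the texts and URLs together in one file keeps them in the same order, which matters when they are later combined into a data frame.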