<img src="../images/APISearchBadge.png" alt="Chronicling America Search Example" width="150px" style="border:1px solid black;" align="right">

# Badge 3 - Chronicling America: Conventional Search versus API Search

What you'll learn in this Notebook:

- What an API is
- How it works
- How to use it yourself to collect and analyse text

## 1. Chronicling America

[Chronicling America](https://chroniclingamerica.loc.gov) is a digital repository produced by the Library of Congress which gives access to digitised copies of historical newspaper pages published in the United States from 1777 to 1963, the U.S. Newspaper Directory of American newspapers published from 1690 to the present, and other resources. Online users can access the data in Chronicling America via a search interface.

The screenshot below demonstrates what the website looked like when I accessed it in August 2021.  When you visit the site yourself, it will look slightly different, in that the presented content will refer to the date when you are accessing the page, but structurally it should look fairly similar.

<img src="../images/CA-homepage.png" alt="Chronicling America Homepage" width="700px" style="border:1px solid black;">


## 2. Conventional Search

Websites whose data can be accessed using conventional search provide users with a search bar, such as the one you will be familiar with from Google. To enable conventional search, Chronicling America needs a **search engine** under the hood of its website. Instead of accessing the whole World Wide Web, however, Chronicling America's search engine is limited to the digitised collections it provides access to.

The way this works is that all the text in the newspaper pages is indexed, i.e. it is stored in a large **search index**.  A search index can be thought of a bit like a structured encyclopedia. When you look for (or query) a word, it is looked up in the index which can then easily retrieve all the documents (in this case, the newspaper pages) that are relevant to the query.  In the case of Chronicling America, the images of the newspaper pages are also stored along with the digitised text.  So when you look for words, the results are displayed as highlighted text in the relevant documents.

Let's start by searching for the phrase "little alacrity and energy". You can see the results for this search in the screenshot below. The metadata, given in white underlined text underneath each image of a newspaper page, shows the name, publication date and places of publication for each newspaper.

<img src="../images/CA-search-example.png" alt="Chronicling America Search Example" width="700px" style="border:1px solid black;">


### 🐛 Mini task 2.1

Let's look at the newspaper pages which were returned.  Can you explain why they were returned?

For example, here is the first result of the 7 pages that were returned:

<img src="../images/CA-result-1.png" alt="Chronicling America Result 1" width="700px" style="border:1px solid black;">

Here is the last result that was returned. Can you spot anything odd about this example?

<img src="../images/CA-result-2.png" alt="Chronicling America Result 2" width="700px" style="border:1px solid black;">

## API Search

### What an API is

API is an acronym for Application Programming Interface, which is a complicated way of describing an interface between two applications, or between a human and an application or database.  You can think of an API a bit like a waiter in a restaurant. You know how to talk to the waiter to order food or pay the bill. The waiter acts as the 'interface' between you and the restaurant, bringing you the food you've ordered while shielding you from all the complicated stuff that's going on behind the scenes such as food deliveries, stock management, cooking and cleaning dishes.

In the context of searching, an API provides a way to access data systematically using structured queries. A search API allows you to search existing items in a data collection or catalogue, in our case the historical newspaper collections hosted on Chronicling America. In this case, the search API is the interface to the data, and you just need to learn how to communicate with it (i.e. formulate your query using syntax the API can understand) to get the right information back, i.e. to return newspaper pages that are relevant to your search query.

A search API allows you to execute a search query using a URL and get back results that match the query. In this sense, it's not that different to conventional search. You just need to learn how the search query URLs are constructed (instead of typing your query into the search box and clicking "Go", which can be laborious if you have a lot of queries to run). Another useful difference is that with an API, rather than the results just being displayed in your browser, you can often request them in different formats and download them straight onto your computer.

Let's see how it works in action.

## Using the Chronicling America API

### Simple Text Search

All Chronicling America API searches start with the following base URL: https://chroniclingamerica.loc.gov/

The base URL is followed by the search, including specific search parameters.

E.g. a simple search for articles containing the word "alacrity" is made up of the base URL https://chroniclingamerica.loc.gov/ followed by "search/pages/results/?andtext=alacrity", so:

https://chroniclingamerica.loc.gov/search/pages/results/?andtext=alacrity

`/search/pages/results` is an instruction about where to search more specifically (i.e. the content of the pages, rather than other fields such as the newspaper's place of publication), and everything after the question mark specifies the search query and parameters. In this case, we want to search the full text of the newspapers, which is done by using `andtext` set to our search term "alacrity" (`andtext=alacrity`).

You can also search the newspaper titles only, or search within one or more specific newspapers (e.g. only the pages of the Salt Lake Herald, or those of the Chicago Tribune), but in this tutorial we are interested in searching the text of all of the newspapers in the Chronicling America collection.
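
The way such a URL is put together can be sketched in a few lines of Python. This is only a minimal sketch: the helper name `build_search_url` is made up for illustration and is not part of Chronicling America. It relies on `urlencode` from Python's standard library, which joins key/value pairs with `&` and replaces spaces with `+`.

```python
from urllib.parse import urlencode

BASE_URL = "https://chroniclingamerica.loc.gov/"
SEARCH_PATH = "search/pages/results/"

def build_search_url(**params):
    """Join the base URL, the search path and the query parameters (hypothetical helper)."""
    return BASE_URL + SEARCH_PATH + "?" + urlencode(params)

# Build the simple text search for "alacrity"
simple_url = build_search_url(andtext="alacrity")
print(simple_url)
```

Building URLs this way rather than typing them by hand becomes handy once you want to run many different queries.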

### Proximate Search

If you use Chronicling America's simple browser interface to do a search with multiple words, you are asking it to do what is known as a proximate search. This means that the search engine searches for all the words within each page, but they don't need to appear one after the other, even if you put them in double quotes.  You may have noticed this when inspecting the results of the conventional search earlier, as one page didn't contain the exact search query.

When writing a search query for the API, proximate search can be replicated using the `proxtext` search parameter, with plus signs between multiple words.  So, the API equivalent of the browser search for "little alacrity and energy" is:

https://chroniclingamerica.loc.gov/search/pages/results/?proxtext=little+alacrity+and+energy

If you click on this link, a new tab should open which contains all seven results you looked at earlier.

### Phrase Text Search

You can also do a search for the exact phrase by using the `phrasetext` parameter, like this:

https://chroniclingamerica.loc.gov/search/pages/results/?phrasetext=little+alacrity+and+energy

Note that this only returns the six pages containing the phrase "little alacrity and energy" but not the page which contains all four individual words but not in the exact phrase specified.  You can achieve the same in the browser search by going to "Advanced Search" and running a search with the phrase "little alacrity and energy". (If you do not get the number of hits you expect, check the date parameters and make sure you are searching from 1777 to 1963.)
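
To make the difference between the two kinds of search concrete, here is a toy model in Python. It is a simplified sketch, not a description of how the Chronicling America search engine is implemented: the function names and the two sample sentences are invented for illustration, and the real search engine may apply further rules that this tutorial does not describe.

```python
import re

def contains_all_words(page_text, query):
    """Simplified 'proximate' check: every query word occurs somewhere on the page."""
    page_words = set(re.findall(r"\w+", page_text.lower()))
    return all(word in page_words for word in query.lower().split())

def contains_phrase(page_text, query):
    """Simplified 'phrase' check: the query words occur one after the other."""
    normalised = " ".join(re.findall(r"\w+", page_text.lower()))
    # Pad with spaces so that only whole words can match
    return f" {query.lower()} " in f" {normalised} "

query = "little alacrity and energy"
page_a = "He showed little alacrity and energy in his work."   # invented sample
page_b = "With energy and alacrity the little boy ran home."   # invented sample

for name, page in [("page_a", page_a), ("page_b", page_b)]:
    print(name, contains_all_words(page, query), contains_phrase(page, query))
```

Both sample pages pass the "all words" check, but only `page_a` passes the phrase check, which mirrors the extra page returned by the proximate search.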

But hang on, what's so special about API searches, if I can do the exact same thing in the browser? What are they useful for?

### Different Formats and Downloading Results

The most useful thing about Search APIs is that they often allow you to download the data in a specified format directly to your computer, so that you can do further analysis with it.

If you send a query to the Chronicling America API without specifying a file format, you will receive your results in html format, which is the default, meaning the results are displayed in a new tab in your browser.  So, nothing new there.

However, the API also allows you to ask for your results to be delivered in json or atom format. This way you can easily download the files to your computer or bring them into your notebook for further analysis, for instance if you want to use regular expressions to find strings in the text or perhaps perform sentiment analysis on the text.

**json** stands for JavaScript Object Notation, which is a standard text-based format for representing structured data based on JavaScript object syntax. It is commonly used for transmitting data in web applications (e.g., sending data from the server to the client, so it can be displayed on a web page, or vice versa).  In our case, it's used to store all the results for a search query.

**atom** is the name of an XML-based Web content and metadata syndication format, and an application-level protocol for publishing and editing Web resources belonging to periodically updated websites. So, atom is a type of XML format but we don't use it in this tutorial.

We'll be working with the results in json format. For example, if we adapt our previous query so it requests the results in json, it will look like this:

https://chroniclingamerica.loc.gov/search/pages/results/?phrasetext=little+alacrity+and+energy&format=json

Note that `format=json` is used to request json format and this search parameter is added after an ampersand (&) which separates multiple search parameter key/value pairs.
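
If you want to check which key/value pairs a search URL contains, Python's standard library can take it apart for you. The sketch below only splits the URL text; it does not contact the API.

```python
from urllib.parse import urlsplit, parse_qs

url = "https://chroniclingamerica.loc.gov/search/pages/results/?phrasetext=little+alacrity+and+energy&format=json"

# Split the URL into its components and decode the part after the question mark
parts = urlsplit(url)
params = parse_qs(parts.query)

print(parts.path)
print(params)
```

Each parameter comes back as a key with a list of values, and the plus signs are turned back into spaces.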

When you click on this link you'll get a new tab which, once it has opened properly, should look like this:
<img src="../images/CA-results-json.png" alt="Chronicling America Results in json">

(If your browser loads a page which looks different and has only a few words at top left, try clicking 'Raw Data'.)

This is quite hard to read but with an editor which can read json format it's much easier to figure out what's going on.  Guess what? We can load the json into this notebook and look at it.

To do this, we first need to import the `urlopen` function (from Python's `urllib.request` module) and the `json` module. We then specify the URL, open it using the `urlopen()` function, read the results and load them as json.  When we call the final `json_results` variable (rather than printing it), we can see the results with slightly better formatting.

When you run the code cell below, you will see that the results contain six items. These correspond to the six newspaper pages that matched our earlier search query, including metadata, such as the title of the newspaper, the place of publication, the publication date, etc.

In [None]:
# import urlopen and json
from urllib.request import urlopen
import json

# store the URL in the url variable
url = "https://chroniclingamerica.loc.gov/search/pages/results/?phrasetext=little+alacrity+and+energy&format=json"

# store the response of URL
response = urlopen(url)

# read the response and load as json format
json_results = json.loads(response.read())
  
# show what is in the json results
json_results


You can see that json format uses a key-value pair syntax.  So, the key (which you can think of as a label for the type of information, such as you might find in the header of a column in a table), is on the left, followed by a colon and then, on the right, the value (which you can think of as the information itself, such as you might find partway down a column in a table). Here are some examples of key-value pairs in the data we've just downloaded:

```
{
...
'city': ['Washington'],
'date': '18400222',
'title': 'The native American. [volume]',
'publisher': 'J. Elliot Jr.',
...
}
```

A value can belong to a range of data types: it can be a number, a string, a Boolean (so true or false), an array (e.g. an ordered list), an object or null (so left empty).

Key-value pairs are separated by commas, and they can also be nested (e.g. if you look at line 5 of the json file, you will see that `items` contains all the items returned by our search query.)
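
The short sketch below shows how nested keys and values can be reached once json text has been loaded into Python. The sample is hand-written in the shape described above, reusing the example values shown earlier, and is not real API output. Note that json text itself uses double quotes; the notebook shows single quotes because it displays the loaded data as a Python dictionary.

```python
import json
from datetime import datetime

# Hand-written sample in the shape described above; not real API output
sample = '''
{
  "totalItems": 1,
  "items": [
    {
      "city": ["Washington"],
      "date": "18400222",
      "title": "The native American. [volume]",
      "publisher": "J. Elliot Jr."
    }
  ]
}
'''

data = json.loads(sample)

# "items" is an array (a Python list); each item is an object (a Python dict)
first_item = data["items"][0]
first_city = first_item["city"][0]

# The date is stored as a string, so we convert it if we want a real date
published = datetime.strptime(first_item["date"], "%Y%m%d").date()

print(type(data["totalItems"]).__name__, type(first_item["city"]).__name__)
print(first_city, published)
```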

You will notice that the result for our query is very long as it contains the OCRed text of each of the six pages (i.e., the text following `'ocr_eng': `).  

**OCR** is short for Optical Character Recognition which is used for digitising historical documents, such as the newspapers in Chronicling America.  OCR is used to turn letters in an image into electronic text.

Though there have been huge advances in OCR in the recent past, it is not 100% accurate, and if you look at the text more closely you will be able to find some OCR errors.

You'll also notice "\n" characters which you should be familiar with from the Regular Expression Notebook.  They signify newline characters.

Next, we'll show you how to print the text for each newspaper page a bit more neatly and how to find all matches for "alacrity" in them. 

First, we need to extract the items from our json results.

In [None]:
# extract items from json format
items = json_results["items"]

The results are stored in a list of items which are still in json format.  To be able to read the information a bit better, let's store it in a table (also known as a data frame), with the information for each item (i.e. the values) being stored in one row, and one column for each type of information (i.e. the keys).

To do this, we'll use pandas, as it has a useful function called `json_normalize` which can read information in json format and flatten it into a data frame (which in the code block below is the `df` variable).

In [None]:
import pandas as pd
df = pd.json_normalize(items)

You can now check out the content of the data frame we just created by using the `.head()` function, which you have learnt in an earlier tutorial.  The metadata associated with each newspaper page is now in a table format, and so is much easier to read.

In [None]:
df.head(10)

Note that the data frame even contains the URL pointing to the results for each item (scroll all the way to the right of the dataframe if you cannot see it).  When you follow that URL you'll also be able to get access to other file formats and the images of the newspaper pages themselves.
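
Instead of scrolling, you can also pick out just the columns you are interested in. The sketch below uses two hand-written stand-in items with placeholder values, so that it runs without contacting the API. The column name `url` is an assumption here; check `df.columns` to see the names in your own data frame.

```python
import pandas as pd

# Hand-written stand-ins for two items; the values are placeholders, not real results
sample_items = [
    {"title": "Paper A", "date": "18400222", "city": ["Washington"], "url": "https://example.org/a.json"},
    {"title": "Paper B", "date": "18510704", "city": ["Chicago"], "url": "https://example.org/b.json"},
]

sample_df = pd.json_normalize(sample_items)

# Keep only a few columns so that the table fits on the screen
subset = sample_df[["title", "date", "url"]]
print(subset)
```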

For now, let's look at the OCRed text and see what we can do with it.

In the next bit of code, we use a for loop to extract the values of the `ocr_eng` column containing the OCRed text and run a regular expression search over it.  We use the `findall()` function to find all mentions of "alacrity".

The regular expression used is the following: `[^\n]*\n*.*alacrity.*\n[^\n]*`

This means "match any line containing the word 'alacrity' but also the line before and after the line it appears in" (to show some context).  This looks quite complicated at first, but let's take it apart to understand the different bits of the regular expression and how they work together. Remember that:

- `\n` means newline character
- `.*` means zero or more characters (any character except newline)
- `[^\n]` means not a newline character
- `[^\n]*` means zero or more characters but not newline

So, looking at the first part of the regular expression, `[^\n]*\n*.*alacrity` means "match zero or more characters but not newline, followed by zero or more newline characters, followed by zero or more characters and the word 'alacrity'".

Then, looking at the second part of the regular expression, `alacrity.*\n[^\n]*` means "match the word 'alacrity', followed by zero or more characters, followed by newline and zero or more characters but not newline".

So, together, the RegEx means "match the line containing the word 'alacrity' as well as the line before, if there is one, and the line after".  Note that the newline after the matching line is required, so a match on the very last line of a page would be missed if that line does not end in a newline.  This might seem difficult but when you use regular expressions frequently, you'll become used to reading and constructing them.

The `re.IGNORECASE` flag at the end makes the search case-insensitive.
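
To see the pattern discussed above at work before applying it to the real data, you can try it on a few lines of text. The sample text in this sketch is hand-written for illustration. The pattern is written as a raw string (with an `r` in front of the quotes) so that each `\n` reaches the regular expression engine unchanged.

```python
import re

# Hand-written sample text; "\n" marks the line breaks, as in the OCRed text
sample_text = "first line\nWith great Alacrity he ran\nthird line\nfourth line"

pattern = r'[^\n]*\n*.*alacrity.*\n[^\n]*'
matches = re.findall(pattern, sample_text, re.IGNORECASE)

for match in matches:
    print(match)
```

The single match consists of the first three lines: the line containing "Alacrity" plus the line before and the line after it. The fourth line is not part of the match.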

Finally, we'll store the matches from our RegEx search in a list (`all_results`) and then use another for loop to print them more neatly, displaying the number of the item, followed by the match in context.

In [None]:
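import re

# collect the matches for each newspaper page
all_results = []
for (index, row) in df.iterrows():
    item = row.loc['ocr_eng']
    # the pattern is a raw string (r'...') so that \n reaches the regex engine unchanged
    results = re.findall(r'[^\n]*\n*.*alacrity.*\n[^\n]*', item, re.IGNORECASE)
    all_results.append(results)

# print the matches, numbered by item
counter = 0
for results in all_results:
    counter = counter + 1
    print("Item " + str(counter) + ":")
    for result in results:
        print(result + "\n")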

Look how far you've come. You have downloaded the results for a search query using an API, have converted the data into a data frame, have extracted the textual information, have matched some text using a Regular Expression search and have displayed the results in your notebook. Well done!

Now the power of using API search should be clearer.  It's very useful for searching and accessing data collections hosted online and especially so for analysing large numbers of results.  Imagine our search retrieved not seven but hundreds or thousands of results.  Using API search combined with Python RegEx search allows you to extract and analyse textual data much faster than having to navigate through it all manually in a browser.


### More advanced API-driven data collection

Up to now, we used quite a specific search query which only returns a few results.  In the following code cell, you'll see how we can collect data for an API search query that returns more results.

After the imports, you'll see a function called `safe_api_call` which you can use to ensure that the request to the API is successful and returns some results.  In some cases, an API might be down or a user might be blocked because they have sent too many queries to an API.  So this function checks that the API serves a response as it should.  The code only continues to run if the API returns a result. 

If so, the code then determines how many results the query returns, which tells us the number of documents (items) containing the search query.  In our previous example, we only worked with a handful of results.  In the following, you'll learn how to collect and analyse data if the query returns a lot more results.

For example, for the query (`black cat`) below, we get over 53,500 results with 20 results per page, so 2,678 pages of results.  In a browser, you would most likely only look at the first few pages of results to find information that's relevant to you and, in some cases, refine your query to get fewer, more specific results.
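
The number of pages follows from the number of results by dividing and rounding up. In the sketch below, the total of 53,550 hits is a hypothetical figure, chosen only because it is consistent with the numbers quoted above. The actual total depends on when you run the query.

```python
import math

RESULTS_PER_PAGE = 20
total_hits = 53550  # hypothetical total, consistent with "over 53,500 results"

# Round up, because a partly filled last page still counts as a page
pages = math.ceil(total_hits / RESULTS_PER_PAGE)

# Number of results left over for the last page
last_page_size = total_hits - (pages - 1) * RESULTS_PER_PAGE

print(pages, last_page_size)
```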

Using an API, you could collect and analyse all of the data. However, this would take quite a long time to do.  So in this case, we cap the number of results at 100.  This is mainly to show how this kind of data collection works in principle and also to avoid overloading the API.  In some cases, APIs have constraints which allow you to search them only up to X times per minute, hour or day, and if you go over this limit, your computer will be blocked or blacklisted.  We want to prevent that from happening when doing these exercises.

While the following code cell contains quite a substantial bit of code, it basically breaks down into some safety checking code to make sure the API works as it should, a for loop which goes through all of the pages and collects the results, and some code which stores the results into one overall data frame (`all_df`).  Each bit of code contains some print statements to provide an idea of what's currently happening as the code is run.

In [None]:
# all the imports needed to make this bit of code run
import sys
import requests
import pandas as pd
import math
import time


# this safe_api_call function is an extra level of checking that the API returns results as it should
# sometimes it might be down or overloaded, in which case you have to wait and come back to do this another time
# this function tries a URL up to three times if the request itself fails (e.g. a timeout or connection error).
# If the API responds with JSON, the results are returned.  Otherwise (including any HTTP error status) None is returned.
def safe_api_call(url, max_retries=3):
    """Make API call with error handling and retries"""
    for attempt in range(max_retries):
        try:
            response = requests.get(url, timeout=10)
            print(f"Status code: {response.status_code}")
            
            if response.status_code == 200:
                # Check if response is actually JSON
                content_type = response.headers.get('content-type', '')
                if 'application/json' in content_type:
                    return response.json()
                else:
                    print(f"Expected JSON but got: {content_type}")
                    print(f"Response preview: {response.text[:200]}")
                    return None
            else:
                print(f"HTTP Error {response.status_code}: {response.text[:200]}")
                return None
                
        except requests.exceptions.RequestException as e:
            print(f"Request failed (attempt {attempt + 1}): {e}")
            if attempt < max_retries - 1:
                time.sleep(2)
            else:
                return None
        except ValueError as e:
            print(f"JSON parsing failed (attempt {attempt + 1}): {e}")
            print(f"Response content: {response.text[:200]}")
            if attempt < max_retries - 1:
                time.sleep(2)
            else:
                return None
    return None

# Main code with error handling
# First we specify a query and add it to the base url.
query = "black+cat"
url = "https://chroniclingamerica.loc.gov/search/pages/results/?format=json&phrasetext=" + query

# Trying to get a response from the API using the URL
print(f"Trying URL: {url}")
json_results = safe_api_call(url)

# If the result is None then the code is exited
if json_results is None:
    print("API call failed. Cannot proceed.")
    sys.exit(1)

# The rest of the code only continues if the API call succeeds
# Here we get the overall number of results and pages for the search query
overall_count = json_results["totalItems"]

# By default Chronicling America returns 20 results per page, so we can work out the number of pages
# The math.ceil() function rounds up the results to the next integer
pages = math.ceil(overall_count / 20)
print("There are " + str(overall_count) + " hits (presented on " + str(pages) + " pages) for the search term: " + query)

# Here we create an empty data frame in which we will store all the results collected
all_df = pd.DataFrame()

# If there are more than 5 pages (more than 100 results) then we'll cap the data collection at 100 results.
# Otherwise we'll collect all results
if pages > 5:
    print("To avoid overloading the API and getting blocked, we will only collect data for the first 100 documents.")
    pages = 5
else:
    print("Collecting all results")

# This bit of code loops through the number of pages (5 or lower)
# and downloads the results for each page
for page in range(1, pages + 1):
    print("Downloading results for page: " + str(page))

    # We add a short 2 second break in between each API call to not overload it
    time.sleep(2)
    
    # The page_url is the previous url containing the query with two additional parameters added at the end:
    # the number of results per page (rows) and the number of the page (page)

    page_url = f"{url}&rows=20&page={page}"
    
    # As we are in the loop, the next line of code collects the results for each page
    json_results = safe_api_call(page_url)
    
    # These two if statements check that each page loads and contains results; if not, the loop continues
    # with the next page
    if json_results is None:
        print(f"Failed to get page {page}, skipping...")
        continue
    
    if "items" not in json_results:
        print(f"No items found on page {page}")
        continue
        
    # Here we extract the results for each of the documents (items)
    # and add them to the data frame which we created earlier.
    items = json_results["items"]
    df = pd.json_normalize(items)
    all_df = pd.concat([all_df, df], ignore_index=True)

    print(f"Total records so far: {len(all_df)}")

# This prints the length of the dataframe, i.e. how many documents that match the query it contains.
# As the query "black cat" returns way too many results we only collect the first 100.
print(f"Final dataset contains {len(all_df)} documents.")


Take a look at the data frame to see what documents were collected for the search query.

In [None]:
all_df.head(20)

Following on from here, you can run the code we used before to print out the text containing the search query.  The output will contain results for 100 documents (unless the query returned fewer results).  Note that in some cases, a document contains more than one match.

The code in the next code cell is a bit more sophisticated than what we used before, in that it prints out the context of each match (the two lines of text before and after it, if there are any).  It also prints a delimiter between documents, so you can more easily see what results were returned for each document.

Now, take a look at the output to see in what contexts the phrase "black cat" is used in the data you obtained from Chronicling America.
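
The core idea, picking out a matching line together with its neighbouring lines, can be tried out on a small piece of text first. The sketch below is a simplified variant of that idea, not the code cell itself: the function name `lines_with_context` and the sample text are made up for illustration, and it leaves out the line numbers and the loop over documents.

```python
import re

def lines_with_context(text, query, context=2):
    """Return each line matching `query` together with `context` lines either side."""
    lines = text.split("\n")
    snippets = []
    for i, line in enumerate(lines):
        # re.escape makes sure the query is treated as plain text, not as a pattern
        if re.search(re.escape(query), line, re.IGNORECASE):
            start = max(0, i - context)                # do not go before the first line
            end = min(len(lines), i + context + 1)     # do not go past the last line
            snippets.append("\n".join(lines[start:end]))
    return snippets

# Hand-written sample text for illustration
sample = "one\ntwo\nA Black Cat sat\nfour\nfive\nsix"
snippets = lines_with_context(sample, "black cat")
print(snippets[0])
```

Here the snippet contains the first five lines of the sample: the matching line, the two lines before it and the two lines after it.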

In [None]:
import re
all_results = []

# This for loop goes through the data frame and grabs the text (ocr_eng) for each matched document
for (index, row) in all_df.iterrows():
    item = row.loc['ocr_eng']
    
    # This splits the text into lines
    lines = item.split('\n')
    
    # A list which will store the lines that contain the query and their context
    matches_with_context = []
    
    # This for loop goes through the lines in each document
    for i, line in enumerate(lines):
        # And searches for the query (case-insensitively)
        if re.findall("black cat", line, re.IGNORECASE):

            # Gets 2 lines before and 2 lines after the matched line
            start_idx = max(0, i - 2)
            end_idx = min(len(lines), i + 3)  # +3 because range is exclusive at end
            
            # Extract the context lines with line numbers
            context_lines_with_numbers = []
            for j in range(start_idx, end_idx):
                line_num = j + 1  # Line numbers start at 1
                line_content = lines[j]
            
                # Format with line number
                formatted_line = f"{line_num:4d}: {line_content}"
                context_lines_with_numbers.append(formatted_line)
            
            matches_with_context.append('\n'.join(context_lines_with_numbers))
    
    all_results.append(matches_with_context)
    
counter = 0
for results in all_results:
    counter = counter + 1
    print("Document " + str(counter) + ":\n")
    for result in results:
        print(result+"\n")  
    print("\n"+"="*50 + "\n") # Added separator for clarity

### 🦋 Final Task 

We would now like you to explore the Chronicling America newspaper data using your own searches. Try out some conventional searches of interest that return different numbers of results, and then adapt the code above to do the same searches using the search API, download the data in json and find text snippets within the OCRed text using regular expressions.

Note that when you adapt code, it's best to take a copy of the existing code and change the copy, so that you still have the original code working as intended.  So create some code cells below using the plus button (top left), copy the code and adapt it to your searches.  Avoid using the scissor button.  It deletes cells, though you can get them back by clicking on "Edit" -> "Undo Delete Cells".

In [None]:
# write your solution here and create some further code cells for additional code
