# Chronicling America: Conventional Search versus API Search

What you'll learn in this Notebook:

- What an API is
- How it works
- How to run it yourself

## 1. Chronicling America

[Chronicling America](https://chroniclingamerica.loc.gov) is a website that hosts America's historic newspaper pages from 1777-1963, the U.S. Newspaper Directory of American newspapers published between 1690 and the present, and other resources. Its aim is to make the information in them easily accessible to the public through online search.

This is what the website looked like when I last accessed it.  When you access it, it will look slightly different, because the content shown depends on the date of your visit, but the structure will be roughly the same.

<img src="../images/CA-homepage.png" alt="Chronicling America Homepage" width="700px">


## 2. Conventional Search

Websites that offer conventional search have a search bar. We are all very familiar with this kind of search, as most of us know how to search for things on Google. To enable conventional search, Chronicling America needs something like a **search engine** under the hood of its website. Instead of covering the whole World Wide Web, Chronicling America's search engine is limited to the digitised collections it provides access to.

The way this works is that all the data, i.e. all the text in the newspaper pages, is indexed, which means it is stored in a large **search index**.  A search index can be thought of a bit like a structured encyclopedia.  So when you look for (or query) a word, it is looked up in the index, which can then easily retrieve all documents (e.g. newspaper pages) that are relevant to the query.  In the case of Chronicling America, the images of the newspaper pages are also stored along with the text.  So when you look for words, the matches are highlighted in the relevant documents.

For example, see the results for the search "little alacrity and energy" below.

<img src="../images/CA-search-example.png" alt="Chronicling America Search Example" width="700px">


### 🐛 Mini task 2.1

Take a look at all newspaper pages which were returned.  Can you explain why they were returned?

For example, here is the first result of the 7 pages that were returned:

<img src="../images/CA-result-1.png" alt="Chronicling America Result 1" width="700px">

Here is the last result that was returned. Can you spot anything odd about this example?

<img src="../images/CA-result-2.png" alt="Chronicling America Result 2" width="700px">



## 3. API Search

### What is an API

API is the acronym for Application Programming Interface, which is a complicated way of saying that it's an interface between two applications, or between a human and an application or database.  It's a bit like a waiter in a restaurant.  You know how to talk to them to order food and pay the bill, and they are the interface between you and the restaurant, shielding you from all the complicated stuff that's going on behind the scenes, including food deliveries, managing stock, cooking and cleaning dishes.

In the context of search, an API is a way to access data systematically using structured queries. A search API allows you to search existing items in a data collection or catalogue, for example, in the historical newspaper collections hosted on Chronicling America. In this case, the search API is the interface to the data and you just need to learn how to communicate with it to get the right information back, i.e. to return newspaper pages that are relevant to your search query.

A search API allows you to execute a search query using a URL and get back results that match the query. In that sense it's not that different from conventional search. You just need to learn how the search query URLs are constructed to do the search (instead of typing it and clicking "Go"). The other difference is that with an API, rather than the results just being displayed in your browser, you can often download them straight to your computer in different formats.

Let's see how it works in action.

## 4. Chronicling America API

### Simple Text Search

All Chronicling America API searches start with the following base URL: https://chroniclingamerica.loc.gov/

The base URL is followed by the search, including specific search parameters.

E.g. a simple search for newspaper pages containing the word "alacrity" is made up of the base URL https://chroniclingamerica.loc.gov/ and "search/pages/results/?andtext=alacrity", so:

https://chroniclingamerica.loc.gov/search/pages/results/?andtext=alacrity

`/search/pages/results/` specifies where to search, in this case the newspaper pages, and everything after the question mark specifies the search query and parameters. Here we want to search the text of the newspapers, which is done by setting `andtext` to our search term "alacrity" (`andtext=alacrity`).

You can also search the newspaper titles only, but we are interested in searching the text in all of the newspapers in this lesson.

### Proximate Search

The conventional simple Chronicling America browser search carries out a proximate search when your query contains multiple words.  This means that all words specified are searched within each page, but they don't need to appear one after the other, even if you put them in double quotes.  You should have noticed that when inspecting the results of the conventional search, as one page didn't contain the exact search query.

In the API, proximate search can be replicated using the `proxtext` search parameter, with plus signs between the words.  So, the API equivalent of the browser search for "little alacrity and energy" is:

https://chroniclingamerica.loc.gov/search/pages/results/?proxtext=little+alacrity+and+energy

If you click on this link, a new tab should open which contains all seven results you looked at earlier.
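If you'd rather not type the plus signs by hand, a query URL like this can also be assembled in Python. The following is only a minimal sketch: `build_search_url` is a hypothetical helper name made up for this illustration (it is not part of any Chronicling America tooling), and it relies on `urlencode` from Python's standard library, which encodes spaces as plus signs.

```python
from urllib.parse import urlencode

BASE_URL = "https://chroniclingamerica.loc.gov/"

def build_search_url(parameter, query, path="search/pages/results/"):
    """Hypothetical helper: join the base URL, the search path and one search parameter."""
    # urlencode turns {"proxtext": "little alacrity"} into "proxtext=little+alacrity"
    return BASE_URL + path + "?" + urlencode({parameter: query})

prox_url = build_search_url("proxtext", "little alacrity and energy")
print(prox_url)
```

The printed URL can be pasted into the browser just like a hand-written one.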

### Phrase Text Search

You can also do a search for the exact phrase by using the `phrasetext` parameter, like this:

https://chroniclingamerica.loc.gov/search/pages/results/?phrasetext=little+alacrity+and+energy

Note that this returns only the six pages that contain the exact phrase. The page which contains all four words, but not as the exact phrase, is no longer returned.  You can achieve the same in the browser search by going to "Advanced Search" and running a search with the phrase "little alacrity and energy".
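To see how such a query URL breaks down into its parts, you can take it apart with Python's standard library. This is just an illustrative sketch using `urlsplit` and `parse_qs`; the variable names are made up for the example.

```python
from urllib.parse import urlsplit, parse_qs

url = "https://chroniclingamerica.loc.gov/search/pages/results/?phrasetext=little+alacrity+and+energy"

# split the URL into its components
parts = urlsplit(url)
print(parts.netloc)  # the website we are talking to
print(parts.path)    # where to search
print(parts.query)   # everything after the question mark

# turn the query string into a dictionary of parameters;
# plus signs are decoded back into spaces
params = parse_qs(parts.query)
print(params)
```

Each parameter value comes back as a list, because a parameter may appear more than once in a URL.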

But hang on, what's so special about API searches, if I can do the exact same thing in the browser? What are they useful for?

### Different Formats and Downloading Results

The most useful thing about Search APIs is that they often allow you to download the data in a specified format so that you can do further analysis with it.

If you don't specify a format, the Chronicling America API returns results in html format by default, which means the results are displayed in a new tab in your browser.  So, nothing new there.

However, it can also return the results in json or atom format.  This way you can download them to your computer or into your notebook for further analysis, e.g., if you want to run a Regular Expression search to find strings in the text, or even carry out sentiment analysis of the text.

**json:** JavaScript Object Notation (json) is a standard text-based format for representing structured data based on JavaScript object syntax. It is commonly used for transmitting data in web applications (e.g., sending data from the server to the client, so it can be displayed on a web page, or vice versa).  In our case, it's used to store all the results for a search query.

**atom:** atom is the name of an XML-based Web content and metadata syndication format, and of an application-level protocol for publishing and editing Web resources belonging to periodically updated websites. So, atom is a type of XML format, but we don't use it in this lesson.

We'll be working with the results in json format. For example, our previous query requesting the results in json looks like this:

https://chroniclingamerica.loc.gov/search/pages/results/?phrasetext=little+alacrity+and+energy&format=json

When you click on this link you'll get a new tab which, once it has opened properly, should look like this:
<img src="../images/CA-results-json.png" alt="Chronicling America Results in json">

This is quite hard to read but with an editor which can read json format it's much easier to figure out what's going on.  Guess what? We can load the json into this notebook and look at it.

To do that we first need to import the `urlopen` function (from Python's `urllib.request` module) and the `json` module. We then specify the URL, open it using the `urlopen()` function, read the results and load them as json.  When you call the final `json_results` variable (rather than printing it), you can see the results in slightly better formatting.

When you run the next code cell, you can see that the results contain six items, our six newspaper pages that matched our search query, including metadata, such as the title of the newspaper, the place of publication, the publication date, etc.

In [None]:
# import urlopen and json
from urllib.request import urlopen
import json

# store the URL in url variable
url = "https://chroniclingamerica.loc.gov/search/pages/results/?phrasetext=little+alacrity+and+energy&format=json"
  
# store the response of URL
response = urlopen(url)

# read the response and load as json format
json_results = json.loads(response.read())
  
# show what is in the json results
json_results

You can see that json format uses a key-value pair syntax.  So, the key, the name of the information, is on the left followed by a colon and the value, the information on the right, e.g.:

```
{
...
'city': ['Washington'],
'date': '18400222',
'title': 'The native American. [volume]',
'publisher': 'J. Elliot Jr.',
...
}
```

A value can be either a number, a string, a Boolean (so true or false), an array (an ordered list of components), an object or null (so left empty). 

Key-value pairs are separated by commas, and they can also be nested, e.g. `items` contains all the items returned by our search query.
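To make the key-value structure and the nesting more concrete, here is a small sketch that works without an internet connection. The `sample_text` below is a hand-made, heavily shortened stand-in for a real response: it reuses only the values shown above and is an assumption for illustration, not actual API output. Note that json itself uses double quotes; the single quotes you see in the notebook output come from Python's way of displaying dictionaries.

```python
import json

# Hand-made sample in valid json (double quotes), reusing the values shown above
sample_text = """
{
  "items": [
    {
      "city": ["Washington"],
      "date": "18400222",
      "title": "The native American. [volume]",
      "publisher": "J. Elliot Jr."
    }
  ]
}
"""

sample_results = json.loads(sample_text)

# "items" holds an array (a Python list) of objects (Python dictionaries)
first_item = sample_results["items"][0]
print(first_item["title"])    # a string
print(first_item["city"][0])  # first element of a nested array

# how each json value type arrives in Python
json_types = json.loads('[1, "text", true, null, [1, 2], {"key": "value"}]')
type_names = [type(value).__name__ for value in json_types]
print(type_names)
```

The last line prints `['int', 'str', 'bool', 'NoneType', 'list', 'dict']`, i.e. the Python counterparts of a json number, string, Boolean, null, array and object.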

You will notice that the result for our query is very long as it contains the OCRed text of each of the six pages (i.e., the text following `'ocr_eng': `).  

**OCR** is short for Optical Character Recognition which is used for digitising historical documents, such as the newspapers in Chronicling America.  OCR is used to turn letters in an image into electronic text.

OCR does not perform 100% accurately and if you look at the text more closely you will be able to find some OCR errors.

You'll also notice "\n" characters which you should be familiar with from the Regular Expression Notebook.  They signify newline characters.  Next, we'll show you how to print the text for each newspaper page a bit more neatly and how to find all matches for "alacrity" in them. 

First, we need to extract the items from our json results.

In [None]:
# extract items from json format
items = json_results["items"]

The results are stored in a list of items which are still in json format.  To be able to read the information a bit better, let's store it in a table (or data frame), with the information for each item being stored in one row (values) and one column for each type of information (keys).

To do that we'll use pandas again which has a useful function called `json_normalize` to read information in json and flatten it into a data frame. 

In [None]:
import pandas as pd
df = pd.json_normalize(items)

Check out the content of the data frame we just created by using the `.head()` function which you learnt about earlier.  Now it's much easier to read the metadata associated with each newspaper page.

In [None]:
df.head(10)

Note, the data frame even contains the URL to the results for each item.  When you follow that you'll also be able to get access to other formats and the images of the newspaper pages themselves.

But let's look at the OCRed text and see what we can do with that. In the next bit of code, we use a for loop to extract the values of the `ocr_eng` column containing the OCRed text and run a regular expression search over it.  We use the `findall()` function to find all mentions of "alacrity".  

The regular expression used is the following: `'[^\n]*\n.*alacrity.*\n[^\n]*'`

It means, match any line containing the word "alacrity" but also the line before and after the one it appears in to show some context.  This looks quite complicated at first but let's take it apart. Remember that:

- `\n` means newline character
- `.*` means zero or more characters (by default, `.` matches any character except newline)
- `[^\n]` means not a newline character
- `[^\n]*` means zero or more characters but not newline

So `[^\n]*\n.*alacrity` means match zero or more characters but not newline, followed by a newline character followed by zero or more characters and the word "alacrity".

`alacrity.*\n[^\n]*` means match the word "alacrity", followed by zero or more characters, followed by newline and zero or more characters but not newline.

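So together the RegEx means match the line containing the word "alacrity" as well as the line before and the line after it.  The surrounding lines may be empty, but the pattern needs a newline character on both sides, so a mention on the very first or very last line of a text is not matched.  This might seem difficult but when you use regular expressions frequently, you'll become used to reading and constructing them.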
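Before running the pattern on the real newspaper pages, it can help to try it on a tiny text. The sketch below uses a made-up four-line string (an assumption for illustration, not real OCR output) and the same regular expression as above.

```python
import re

pattern = '[^\n]*\n.*alacrity.*\n[^\n]*'

# made-up sample text with four lines
sample = "first line\nwith little alacrity and energy\nthird line\nfourth line"

matches = re.findall(pattern, sample)
for match in matches:
    print(match)

# a mention on the very first line has no newline before it,
# so the pattern finds nothing here
no_matches = re.findall(pattern, "alacrity first\nsecond line")
print(no_matches)
```

The first search returns one match made up of the first three lines of the sample, and the second search returns an empty list.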

We store the matches of the RegEx search in a list (`all_results`) and then use another for loop to print them more neatly, displaying the number of the item, followed by the match in context.

In [None]:
import re

# match each line containing "alacrity" plus the line before and after it
pattern = '[^\n]*\n.*alacrity.*\n[^\n]*'

# collect the matches for the OCRed text of every newspaper page
all_results = []
for index, row in df.iterrows():
    item = row.loc['ocr_eng']
    results = re.findall(pattern, item)
    all_results.append(results)

# print the matches for each item, numbering the items from 1
for counter, results in enumerate(all_results, start=1):
    print("Item " + str(counter) + ":")
    for match in results:
        print(match + "\n")

Look how far you've come. You have downloaded the results for a search query using an API, have converted the data into a data frame, have extracted the textual information, have matched some text using a Regular Expression search and have displayed the results in your notebook. Well done!

Now the power of using API search should be clearer.  It's very useful for searching and accessing data collections hosted online and especially so for analysing large numbers of results.  Imagine our search retrieved not seven but hundreds or thousands of results.  Using API search combined with Python RegEx search allows you to extract and analyse textual data much faster than having to navigate through it all manually in a browser.