# Preamble

Welcome to American Stories! Please run the next two cells (hint: hold Shift and press Enter three times) to install and import the required packages. After that, we will start working with the American Stories dataset.

In [None]:
###Installs
!pip install datasets
!pip install ipympl

In [None]:
#Imports
import json

from datasets import load_dataset
import tqdm as tq

# Introduction

Welcome to the intro notebook to AmericanStories. This is an expanded version of the intro notebook available here: https://colab.research.google.com/drive/1ifzTDNDtfrrTy-i7uaq3CALwIWa7GB9A?ts=648b98bf.

In this notebook, we will do three things:

1. Discuss ways to import the data
2. Explore the way the data is structured, and show simple code snippets for accessing the data
3. Provide three applications that would not be possible with keyword search based methods:

A. We show how to find out whether an *article* mentions two terms simultaneously

B. We show how to find out what the other articles on a page with a keyword
hit discuss

C. We showcase a wire-cluster pipeline. Wire-clusters are identical or similar
articles that originate in an article that went out over the newswire. With
this you can measure which places were exposed to the same newspaper content


# Part I: importing data

There are four main ways of accessing the American Stories data. A key contribution of our project is to make *articles* available, which requires linking together disjoint content regions. To take full advantage of this, we provide access to the data either at the article level or at the level of a scan, that is, one page of one edition of a single newspaper. At either level, you can query by year or get a dump of all data. In sum, there are four ways of interacting with our data:

a. Article level, for selected years

b. Article level, for all years

c. Scan level, for selected years

d. Scan level, for all years

### 1a: Article level, for selected years

We will now select all articles for the year 1900. Below, we will explore the resulting dataset in detail.

In [None]:
# let's start with deciding which years we want data for
article_level_desired_years = ["1900", ]

# now let's load our data, we have to specify the huggingface location of our
# data, the fact that we want to have a subset of years, and our desired years
dataset_article_level=load_dataset("dell-research-harvard/AmericanStories",
                                   "subset_years",
                                   year_list=article_level_desired_years
                                   )
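
The `year_list` argument in the cell above is a list of year strings. A minimal sketch for building such a list for a span of years follows. The helper `year_range_as_strings` is hypothetical (it is not part of the dataset's loading code), and the commented-out call simply mirrors the cell above.

```python
def year_range_as_strings(first_year, last_year):
    """Return the inclusive range of years as a list of strings."""
    if first_year > last_year:
        raise ValueError("first_year must not be later than last_year")
    return [str(year) for year in range(first_year, last_year + 1)]


# hypothetical example: three consecutive years
article_level_desired_years = year_range_as_strings(1898, 1900)
print(article_level_desired_years)

# the list can then be passed on exactly as in the cell above:
# dataset_article_level = load_dataset("dell-research-harvard/AmericanStories",
#                                      "subset_years",
#                                      year_list=article_level_desired_years)
```

Keep in mind that every additional year increases the download size.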

### 1b: article level, for all years
The next cell contains the code to get all articles for all years. It is commented out because running it requires a lot of bandwidth and storage. Uncomment the line and run the cell to get all data.

In [None]:
# Uncomment the next line and execute this cell to download the entire dataset
#dataset_article_level_all_years = load_dataset("dell-research-harvard/AmericanStories", "all_years")

### 1c: scan level, for selected years
The next cell contains the code to get all scans for selected years.

In [None]:
# let's start with deciding which years we want data for
scan_level_desired_years = ["1900",]

# now let's load our data, we have to specify the huggingface location of our
# data, the fact that we want to have a subset of years, and our desired years
dataset_scan_level=load_dataset("dell-research-harvard/AmericanStories",
                                "subset_years_content_regions",
                                year_list=scan_level_desired_years
                                )

### 1d: scan level, for all years
The next cell contains the code to get all scans for all years. It is commented out because running it requires a lot of bandwidth and storage. Uncomment the line and run the cell to get all data.

In [None]:
# Uncomment the next line and execute this cell to download the entire dataset
#dataset_scan_level_all_years = load_dataset("dell-research-harvard/AmericanStories", "all_years_content_regions")

# Part II: Exploring the data, with applications

We will do three things in the next section:

1. Discuss the structure of the returned dataset
2. Discuss how to access the data
3. Implement our applications

### IIa: Article level

In [None]:
# let's inspect the structure
print(dataset_article_level)

In [None]:
# We see that we have about 1.1 million articles for 1900, and that we have several self-explanatory features for each article. Let's explore the data

In [None]:
# the structure of the output is a dictionary. The keys of the dictionary are the fields we have for each article
dataset_article_level["1900"].features

In [None]:
# let's inspect a random article from 1900
dataset_article_level["1900"][5]

In [None]:
# we can see the headline, and that this appeared in the Evening Star. Below we will learn more about the Evening Star. For now, let's inspect the article
print(dataset_article_level["1900"][5]["article"])

In [None]:
# it is easy to loop through articles and check whether each one mentions
# 'spring style'. Here we will loop through the first ten articles

# let's initialize a dict for our articles. Let's collect the index of the
# article that contains our desired string as well as the article_id, so that
# we can find the articles again later
dict_of_articles_containing_spring_style = {}

# we are interested in spring style
str_of_interest = "spring style"

# let's loop!
for article_n in range(10):

  # let's grab the article data
  article = dataset_article_level["1900"][article_n]

  # let's grab the article text
  article_text = article["article"]

  # check if we see the text
  if str_of_interest in article_text:
    dict_of_articles_containing_spring_style[article_n] = article["article_id"]
  else:
    pass

# let's see which articles feature this text
print(dict_of_articles_containing_spring_style)
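
The check above is an exact substring match, so it misses hits where the capitalization differs or where the phrase is split across a line break in the OCR text. A minimal sketch of a more forgiving match follows, run on a hand-made string rather than on the dataset. The helper names `normalize_text` and `contains_phrase` are assumptions for illustration.

```python
def normalize_text(text):
    """Lowercase the text and collapse all whitespace, including newlines."""
    return " ".join(text.lower().split())


def contains_phrase(article_text, phrase):
    """Check for the phrase while ignoring case and line breaks."""
    return normalize_text(phrase) in normalize_text(article_text)


# toy article text, made up for this example
toy_article_text = "New Spring\nStyle hats have arrived."

# the plain substring test misses the phrase, the normalized one finds it
print("spring style" in toy_article_text)
print(contains_phrase(toy_article_text, "spring style"))
```

In the loop above, `contains_phrase(article_text, str_of_interest)` could replace the `in` test.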

## Application: Two strings within the *same article*

Previous methods for querying historical newspapers often relied on keyword
searches.

This made it impossible to verify if the same *article* mentioned
two terms.

Our dataset allows this. Let's work through a simple example.

In [None]:
# let's check whether an article mentions both spring style
# and infants shoes
str_of_interest_1 = "spring style"
str_of_interest_2 = "infants shoes"

# we want to be able to find the articles again later
dict_of_articles_containing_spring_style_infants_shoes = {}

# proceed like before, restricting ourselves to the first 10 articles
for article_n in range(10):

  # let's grab the article data
  article = dataset_article_level["1900"][article_n]

  # let's grab the article text. We need to worry about capitalization here
  article_text = article["article"].lower()

  # we now test for both strings being present
  if (
      str_of_interest_1 in article_text
      and
      str_of_interest_2 in article_text
  ):
    dict_of_articles_containing_spring_style_infants_shoes[article_n] = (
        article["article_id"]
    )
  else:
    pass

# let's see which articles feature this text
print(dict_of_articles_containing_spring_style_infants_shoes)
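
The same pattern extends to any number of terms. Below is a minimal sketch on a hand-made list of dictionaries that stands in for the dataset. Only the `article` and `article_id` fields come from the cells above; the toy records and the helper name `find_articles_with_all_terms` are assumptions for illustration.

```python
def find_articles_with_all_terms(articles, terms):
    """Map article index -> article_id for articles that mention every term."""
    lowered_terms = [term.lower() for term in terms]
    hits = {}
    for article_n, article in enumerate(articles):
        article_text = article["article"].lower()
        # all() is True only if every term occurs in this article
        if all(term in article_text for term in lowered_terms):
            hits[article_n] = article["article_id"]
    return hits


# toy records, made up for this example
toy_articles = [
    {"article_id": "toy_0", "article": "Spring style hats for ladies."},
    {"article_id": "toy_1", "article": "Spring Style footwear and infants shoes."},
    {"article_id": "toy_2", "article": "Infants shoes at reduced prices."},
]

print(find_articles_with_all_terms(toy_articles, ["spring style", "infants shoes"]))
```

Passing the real split, for example `dataset_article_level["1900"]`, instead of `toy_articles` should work the same way, since iterating over it yields one dictionary per article.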

In [None]:

dataset_article_level


In [None]:
# in our article we found that the Boer War was the biggest story. We can
# see which articles mention both the Boer War and Winston Churchill, who
# was there as a reporter. Let's try with all articles this time
str_of_interest_1 = "boer"
str_of_interest_2 = "churchill"

# we want to be able to find the articles again later
dict_of_articles_containing_boer_war_churchill = {}

# proceed like before, looping through all articles. This should take 3 to 4 min
for article_n in tq.tqdm(range(len(dataset_article_level["1900"]))):

  # let's grab the article data
  article = dataset_article_level["1900"][article_n]

  # let's grab the article text. We need to worry about capitalization here
  article_text = article["article"].lower()

  # we now test for both strings being present
  if (
      str_of_interest_1 in article_text
      and
      str_of_interest_2 in article_text
  ):
    dict_of_articles_containing_boer_war_churchill[article_n] = (
        article["article_id"]
    )
  else:
    pass


In [None]:
# let's see how many articles we got
print(len(dict_of_articles_containing_boer_war_churchill))

# let's print the results
print(dict_of_articles_containing_boer_war_churchill)

In [None]:
# let's inspect the first one
dataset_article_level["1900"][5538]

# this article is from the Age Herald from May 12th, and describes the progress of the war

In [None]:
# let's print the article
print(dataset_article_level["1900"][5538]["article"])

In [None]:
import pandas as pd
# It is possible to visualize the frequency of appearance over time

# let's initialize a list of our dates
list_of_dates = []

# let's loop over our article indices
for article_n in dict_of_articles_containing_boer_war_churchill.keys():

  # let's grab the date
  date = dataset_article_level["1900"][article_n]['date']

  # let's add to our list
  list_of_dates.append(date)



In [None]:
#  now let's plot the frequency by date
df = pd.DataFrame({'date': list_of_dates})
df.groupby('date').size().plot(kind='bar', figsize=(15,9))
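
A bar per day can get crowded. A minimal sketch of aggregating the hits by month with the standard library follows. It assumes that the `date` field is a string in `YYYY-MM-DD` form; the toy dates are made up, and `count_by_month` is a hypothetical helper.

```python
from collections import Counter


def count_by_month(dates):
    """Count dates per month, assuming 'YYYY-MM-DD' strings."""
    # the first seven characters of such a string are 'YYYY-MM'
    return Counter(date[:7] for date in dates)


# toy dates, made up for this example
toy_list_of_dates = [
    "1900-01-03",
    "1900-01-15",
    "1900-05-12",
    "1900-05-12",
    "1900-06-01",
]

monthly_counts = count_by_month(toy_list_of_dates)
for month in sorted(monthly_counts):
    print(month, monthly_counts[month])
```

With the real data, `count_by_month(list_of_dates)` gives the monthly totals, which can be put into a DataFrame and plotted as above.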

### IIb: Scan level

In [None]:
# let's inspect the structure
print(dataset_scan_level)

In [None]:
# let's inspect the data for 1900. Note that this data is provided as a string
# containing JSON, which we can easily parse into a dict

# get the string
raw_data_string = dataset_scan_level["1900"][0]['raw_data_string']

# convert to a dict
article_json = json.loads(raw_data_string)

# inspect which features we have
article_json.keys()

Each scan contains several pieces of information:

1. The page it appears on
2. The URL linking to the original page scan
3. Meta-information on the scan (scan and scan_ocr)
4. A dictionary of the bounding boxes that we identified on the page
5. The date of this edition
6. A dictionary of full_articles appearing on this page. Here it is
especially important to note that we include a list of the bounding
boxes that together form the article
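
To make the structure concrete, here is a minimal sketch that parses a hand-made record shaped like the description above. The keys `scan`, `jp2_url`, `full articles` and `id` are the ones used in the code cells of this notebook; the `headline` key, the URL and all values are assumptions made up for illustration.

```python
import json

# hand-made stand-in for one 'raw_data_string'; all values are made up
toy_raw_data_string = json.dumps({
    "scan": {"jp2_url": "https://example.org/toy_page.jp2"},
    "full articles": [
        {"id": "toy_article_0.json", "headline": "WAR NEWS"},
        {"id": "toy_article_1.json", "headline": "SPRING STYLES"},
    ],
})

# parse the string into a dict, as in the cell above
toy_scan = json.loads(toy_raw_data_string)

# top-level keys, number of articles on the page, and the image URL
print(sorted(toy_scan.keys()))
print(len(toy_scan["full articles"]))
print(toy_scan["scan"]["jp2_url"])
```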



In [None]:
# Each scan contains several pieces of information, here we are just accessing the
# first article
article_json['full articles'][0]

In [None]:
# note that this dictionary lists the bounding boxes the article is composed
# of, the article text, and the article id. If we want, we can use this article
# id to link articles we found above to their pages. Note that all article text
# that you can explore using the article data, you can also explore using the
# scan data

### Let's now go back to the Evening Star

We will try to find the page that the article we identified above appeared on.

In [None]:
# let's start with identifying the article id
article_id_of_interest = dict_of_articles_containing_spring_style_infants_shoes[5]
print(article_id_of_interest)

In [None]:
# we will take our article id and loop through the articles on each page to
# find the relevant page

# we want to be able to find the article again later. Let's record the index
# of the page and then the index of the article that we're looking for
page_with_hit = {}

# proceed like before, looping through all scans (we saw above how many there are).
# This should take 3
for scan_n in tq.tqdm(range(len(dataset_scan_level["1900"]))):

  # get the string
  raw_data_string = dataset_scan_level["1900"][scan_n]['raw_data_string']

  # convert to a dict
  article_json = json.loads(raw_data_string)

  # now we have to loop through articles
  for article_n in range(len(article_json["full articles"])):

    # let's grab the article
    article_data = article_json['full articles'][article_n]

    # note that the ids have '.json' attached here
    article_id = article_data["id"].split(".")[0]

    # let's see if the id matches
    if article_id == article_id_of_interest:
      page_with_hit[scan_n] = article_n
    else:
      pass

In [None]:
page_with_hit
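
The loop above scans every page for a single id. If several articles need to be located, one option is to build a lookup table once and reuse it. A minimal sketch on hand-made records follows; the helper name `build_article_index` and the toy data are assumptions for illustration, while the `full articles` and `id` keys and the `.json` suffix handling follow the cell above.

```python
import json


def build_article_index(raw_data_strings):
    """Map article id -> (scan index, article index on that scan)."""
    index = {}
    for scan_n, raw_data_string in enumerate(raw_data_strings):
        page = json.loads(raw_data_string)
        for article_n, article_data in enumerate(page["full articles"]):
            # strip the '.json' suffix, as in the loop above
            article_id = article_data["id"].split(".")[0]
            index[article_id] = (scan_n, article_n)
    return index


# two toy pages, made up for this example
toy_raw_data_strings = [
    json.dumps({"full articles": [{"id": "toy_a.json"}, {"id": "toy_b.json"}]}),
    json.dumps({"full articles": [{"id": "toy_c.json"}]}),
]

article_index = build_article_index(toy_raw_data_strings)
print(article_index["toy_c"])
```

With the real data, the input could be a generator such as `(row['raw_data_string'] for row in dataset_scan_level["1900"])`. The index holds one entry per article, so it trades memory for faster repeated lookups.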

## Application: What other articles are printed with our article

Now that we have identified the page we can ask what other articles were
printed on the page that our article of interest appears on!

In [None]:
# let's get the page our article appears on
raw_data_string = dataset_scan_level["1900"][1]['raw_data_string']

# convert to a dict
article_json = json.loads(raw_data_string)

In [None]:
# let's inspect that we have recovered our article
article_json['full articles'][5]

In [None]:
# now we can inspect what else appeared on that page that day!
article_json['full articles']
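
To look only at the neighbouring articles, the hit itself can be left out of the list. A minimal sketch on a hand-made page dictionary follows; `other_articles_on_page` is a hypothetical helper and the toy ids are made up, while the `full articles` key is the one used above.

```python
def other_articles_on_page(page, hit_article_n):
    """Return every article on the page except the one at hit_article_n."""
    return [
        article_data
        for article_n, article_data in enumerate(page["full articles"])
        if article_n != hit_article_n
    ]


# toy page, made up for this example
toy_page = {
    "full articles": [
        {"id": "toy_a.json"},
        {"id": "toy_b.json"},
        {"id": "toy_c.json"},
    ]
}

neighbours = other_articles_on_page(toy_page, 1)
print([article_data["id"] for article_data in neighbours])
```

For the page loaded above, `other_articles_on_page(article_json, 5)` would return everything except the shoes article.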

In [None]:
# finally, we can inspect the image of the original page!
article_json['scan']

# The jp2 URL is the URL of the image. Let's visualize it; this will take a bit of code

In [None]:
import requests
import cv2
import numpy as np
from matplotlib import pyplot as plt

def get_ca_scan_img(ca_url):
  img_download_session = requests.Session()
  response = img_download_session.get(ca_url)
  if response.status_code != 200:
    print(f'Error! {response.status_code}')
    print(f'Please verify that {ca_url} is a valid chronicling america url!')
    response.raise_for_status()

  data = response.content
  ca_img = cv2.imdecode(np.frombuffer(data, np.uint8), cv2.IMREAD_COLOR)
  return ca_img

# grab the url
ca_url = article_json['scan']['jp2_url']

#Get the image scan
ca_img = get_ca_scan_img(ca_url)

# Create a figure for the plot
plt.ion()
fig, ax = plt.subplots(1)
fig.set_size_inches(10, 10)

# Display image (OpenCV decodes to BGR channel order, matplotlib expects RGB)
ax.imshow(cv2.cvtColor(ca_img, cv2.COLOR_BGR2RGB))
fig.suptitle('')

plt.show()

In [None]:
# with a bit more work, this routine can of course be expanded to the edition,
# that is, all pages in that newspaper on the same day

## Application: Spell checking

There is natural variation in the quality of the scans, and of the OCR. It is
straightforward to apply a spell checker to any article. Here we briefly
demonstrate how.


In [None]:
# let's return to our article about shoes
shoes_article = dataset_article_level["1900"][5]['article']

In [None]:
# let's have a look
print(shoes_article)

The article text contains a few mistakes that we can correct. Looking at the text of the article, a few errors are apparent:
1. Some spelling mistakes exist, for example "peopie" in the first/second line
2. Some words are divided between multiple lines ("peo- pie", "rep utable", "At tention", etc).
3. Capitalization errors are somewhat common: "i'll", "...lines Of..."

In [None]:
# we have to download a spellchecking package
!pip install symspellpy


In [None]:
# let's initialize the package
import pkg_resources
from symspellpy import SymSpell, Verbosity
import string

sym_spell = SymSpell(max_dictionary_edit_distance=2, prefix_length=7)
en_dict = pkg_resources.resource_filename('symspellpy', 'frequency_dictionary_en_82_765.txt')
sym_spell.load_dictionary(en_dict, term_index=0, count_index=1)

In [None]:
# we now create a few functions that take care of the issues flagged above

# these two functions implement spelling corrections
def check_word(word):
  no_punc_word = word.strip(string.punctuation)
  if len(no_punc_word) > 0:
    suggestions = sym_spell.lookup(no_punc_word, Verbosity.CLOSEST, max_edit_distance=1, include_unknown=True, transfer_casing=True)
  else:
    return word
  return word.replace(no_punc_word, suggestions[0].term)

def spell_check(text):
  lines = text.split('\n')
  checked_lines = []
  for line in lines:
    words = line.split(' ')
    checked_line = ' '.join([check_word(word) for word in words])
    checked_lines.append(checked_line)
  return '\n'.join(checked_lines)

# this function checks capitalization
def capitalization_check(text):
  lines = text.split('\n')
  checked_lines = []
  for line in lines:
    words = line.split(' ')
    for i in range(1, len(words)):
      # endswith also handles empty tokens, which arise from double spaces
      if words[i-1].endswith(('.', '!', '?')):
        words[i] = words[i].capitalize()
      else:
        no_punc_word = words[i].strip(string.punctuation)
        if no_punc_word in sym_spell.words and no_punc_word not in ['i', "i'll"]: # Check that the word is not a proper noun
          words[i] = words[i].replace(no_punc_word, no_punc_word.lower())

    checked_lines.append(' '.join(words))
  return '\n'.join(checked_lines)

# this function corrects line breaks
def line_merge(text):
  lines = [l.split() for l in text.split('\n')]
  for i in range(len(lines) - 1):
    if len(lines[i]) == 0 or len(lines[i+1]) == 0:
      continue
    elif lines[i][-1][-1] == '-': # Automatically merge if a line ends with a hyphen
      lines[i][-1] = lines[i][-1][:-1] + lines[i+1][0]
      lines[i+1] = lines[i+1][1:]
    elif lines[i][-1].strip(string.punctuation).lower() not in sym_spell.words or lines[i+1][0].strip(string.punctuation).lower() not in sym_spell.words:
      if (lines[i][-1].strip(string.punctuation).lower() + lines[i+1][0].strip(string.punctuation).lower()) in sym_spell.words:
        lines[i][-1] += lines[i+1][0]
        lines[i+1] = lines[i+1][1:]

  return '\n'.join([' '.join(l) for l in lines])
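
To see the hyphen rule in isolation, here is a minimal sketch that keeps only the first merging rule of `line_merge` and needs no dictionary. The function name `merge_hyphenated_lines` and the toy text are assumptions for illustration. Unlike the full routine above, it cannot repair words that were split without a hyphen.

```python
def merge_hyphenated_lines(text):
    """Join a word split by a trailing hyphen with the start of the next line."""
    lines = [line.split() for line in text.split("\n")]
    for i in range(len(lines) - 1):
        # skip empty lines; there is nothing to merge
        if not lines[i] or not lines[i + 1]:
            continue
        if lines[i][-1].endswith("-"):
            # drop the hyphen and pull up the first token of the next line
            lines[i][-1] = lines[i][-1][:-1] + lines[i + 1][0]
            lines[i + 1] = lines[i + 1][1:]
    return "\n".join(" ".join(line) for line in lines)


# toy text, made up for this example
print(merge_hyphenated_lines("the peo-\nple of the city"))
```

Note that this rule also removes the hyphen from genuine compounds that happen to break at a line end, such as "well-" followed by "known".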


In [None]:
# this function applies all three methods
def postprocess(text):
  merged = line_merge(text)
  checked = spell_check(merged)
  capitalization_normalized = capitalization_check(checked)
  return capitalization_normalized

In [None]:
# now we can simply pass our text to the spelling correction routine!
print(postprocess(shoes_article))

Postprocessing can significantly help text, but can also create errors. If the dictionary does not include a correctly transcribed proper noun or anachronism, it can "correct" it to an erroneous word.
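
One way to limit such over-correction is to protect a list of known names from the spell checker. A minimal sketch follows. The helper `correct_with_allowlist` is hypothetical, and `toy_corrector` is a made-up stand-in for a real spell checker that deliberately "corrects" a proper noun to show the problem.

```python
import string


def correct_with_allowlist(text, correct_word, protected_words):
    """Apply correct_word to each word, leaving protected words untouched."""
    protected = {word.lower() for word in protected_words}
    corrected = []
    for word in text.split(" "):
        core = word.strip(string.punctuation)
        if not core or core.lower() in protected:
            # punctuation-only tokens and protected words pass through
            corrected.append(word)
        else:
            corrected.append(word.replace(core, correct_word(core)))
    return " ".join(corrected)


def toy_corrector(word):
    """Stand-in spell checker: fixes one OCR error, damages one name."""
    fixes = {"peopie": "people", "Kruger": "Kroger"}
    return fixes.get(word, word)


toy_text = "The peopie cheered Kruger."
print(correct_with_allowlist(toy_text, toy_corrector, []))
print(correct_with_allowlist(toy_text, toy_corrector, ["Kruger"]))
```

The quality of the result then depends on how complete the list of protected words is, for example a list of period-specific names compiled by hand.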