
AmericanStories

The American Stories dataset is a collection of full article texts extracted from historical U.S. newspaper images. It covers nearly 20 million scans from the public domain Chronicling America collection maintained by the Library of Congress. The dataset is designed to address the complex layouts and low OCR quality that limit existing newspaper datasets. It was created with a novel deep learning pipeline that combines layout detection, legibility classification, custom OCR, and the association of article texts spanning multiple bounding boxes, using efficient mobile-oriented architectures for high scalability.

The resulting high-quality texts support a range of uses. They can be used to pre-train large language models and improve their understanding of historical English and world knowledge. They can also be integrated into retrieval-augmented language models, making historical information more accessible, including interpretations of political events and details about people's ancestors. In addition, the structured article texts enable transformer-based methods for applications such as detecting reproduced content, significantly improving accuracy over existing OCR techniques alone.

American Stories also serves as an invaluable resource for developing multimodal layout analysis models and other multimodal applications; its vast size and silver quality make it well suited to innovation and research in this domain.

Hugging Face Dataset

The dataset is on the Hugging Face Hub. More information about the dataset can be found in the paper and the linked dataset card.

Accessing the data

Ensure that you have installed the datasets library from Hugging Face.

!pip install datasets

There are four possible configurations, depending on the use case.

from datasets import load_dataset

# Download data for the years 1809 and 1810 at the associated article level (default)
dataset = load_dataset("dell-research-harvard/AmericanStories",
    "subset_years",
    year_list=["1809", "1810"]
)

# Download and process data for all years at the article level
dataset = load_dataset("dell-research-harvard/AmericanStories",
    "all_years"
)

# Download and process data for 1809 at the scan level
dataset = load_dataset("dell-research-harvard/AmericanStories",
    "subset_years_content_regions",
    year_list=["1809"]
)

# Download and process data for all years at the scan level
dataset = load_dataset("dell-research-harvard/AmericanStories",
    "all_years_content_regions")

Colab/Jupyter Notebooks

Replication

We provide all models and scripts used to create American Stories. Processing newspaper scans is relatively simple. Follow the instructions below or refer to the "Processing Scans Example Notebook" above.

  1. Clone this repo to a relevant location and install dependencies:
git clone https://github.com/dell-research-harvard/AmericanStories.git
cd AmericanStories
pip install -r requirements.txt
  2. Download the models from this Dropbox Folder to an american_stories_models folder

  3. Place one or more newspaper scans (in .jp2 format) in a scans folder. Example scans can be downloaded here

Note: PDF-format scans are also supported, but their dependencies are not installed by default, because the pikepdf package has caused dependency conflicts on some machines and with some Python versions. PDFs are processed in the same way, but you must first install pikepdf:

pip install pikepdf
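
If you want to check what pikepdf can see inside a PDF scan before running the pipeline, here is a minimal sketch (the file paths are hypothetical, and this step is independent of the processing pipeline itself):

import pikepdf
from pikepdf import PdfImage

# Open a PDF scan and extract each embedded page image to disk
with pikepdf.open("scans/example.pdf") as pdf:
    for page_num, page in enumerate(pdf.pages):
        for name, raw in page.images.items():
            image = PdfImage(raw)
            # extract_to() picks a suitable extension and returns the written path
            out = image.extract_to(fileprefix=f"example_p{page_num}_{name.lstrip('/')}")
            print("wrote", out)
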
  4. Run process_scans.sh; output for each scan will be saved in an output folder.
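
Once the run completes, you can get a quick summary of what was produced. The sketch below assumes the pipeline writes JSON files into output/; check the folder after a run and adjust the glob pattern accordingly:

import json
from pathlib import Path

# Summarize the pipeline's output files (assumed here to be JSON)
for path in sorted(Path("output").rglob("*.json")):
    with open(path) as f:
        result = json.load(f)
    print(path, "->", len(result), "top-level entries")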