
Importing Works from the Internet Archive

bencomp edited this page May 4, 2018 · 7 revisions

FromThePage supports transcribing books hosted on the Internet Archive. This is a great way to explore documents that have already been digitized, and it's actually easier to set up than uploading scans directly to FromThePage.

Option 1: URL-based Imports

  1. Log in as a user who is authorized to own works.
  2. Click the Dashboard link (located next to the login link).
  3. On the left side of the screen, you'll see an area called "Owner Actions".
  4. Click "Import a Book from the Internet Archive".
  5. Copy and paste the URL of the Internet Archive page for the book you want to import, then press "next".
  6. Continue with the section "Import the Internet Archive Book" below.
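
For readers who want to check a URL before pasting it in, here is a minimal Python sketch of pulling the item identifier out of an Internet Archive page URL. It assumes the common `/details/<identifier>` and `/stream/<identifier>` URL shapes; the function name `ia_identifier` is made up for this example, and this is not how FromThePage itself parses the URL.

```python
from urllib.parse import urlparse


def ia_identifier(url: str) -> str:
    """Return the item identifier from an archive.org book URL.

    Assumes URLs shaped like https://archive.org/details/<identifier>
    or https://archive.org/stream/<identifier>/... (an assumption, not
    an exhaustive list of Internet Archive URL forms).
    """
    parsed = urlparse(url)
    host = parsed.netloc.lower()
    if host != "archive.org" and not host.endswith(".archive.org"):
        raise ValueError(f"not an archive.org URL: {url!r}")

    # Drop empty segments produced by leading or trailing slashes.
    parts = [segment for segment in parsed.path.split("/") if segment]
    if len(parts) < 2 or parts[0] not in ("details", "stream"):
        raise ValueError(f"no item identifier found in: {url!r}")
    return parts[1]
```

If the function raises `ValueError`, the URL is probably a search or collection page rather than the page for a single book.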

Option 2: Navigate the Internet Archive through FromThePage

Find your institution's collection in Archive.org's OAI repository.

  1. Log in as a user who is authorized to own works.
  2. Click the Dashboard link (located next to the login link).
  3. On the left side of the screen, you'll see an area called "Owner Actions".
  4. Click "Explore OAI Repositories".
  5. Click "Show All Sets" next to the Archive.org link. Expect a long wait (possibly several minutes) while FromThePage queries Archive.org for all its OAI sets. The resulting list is very long.
  6. Search the page for your institution.
  7. Click "Save for future use" next to the spec.
  8. This should redirect you to the dashboard again. There should now be a link in the owner's section saying "List works to import from your collection".
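
To illustrate what the "Show All Sets" and search steps amount to, here is a minimal Python sketch that searches an OAI-PMH `ListSets` response for an institution's name. The element names and namespace come from the OAI-PMH 2.0 standard; the sample XML, the set specs in it, and the function name `find_sets` are invented for this example and are not real Archive.org data or FromThePage code.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the OAI-PMH 2.0 standard.
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}

# Made-up sample response; a real one from Archive.org is far longer.
SAMPLE_LIST_SETS = """
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListSets>
    <set>
      <setSpec>collection:example_library</setSpec>
      <setName>Example Library</setName>
    </set>
    <set>
      <setSpec>collection:sample_museum</setSpec>
      <setName>Sample Museum Field Notes</setName>
    </set>
  </ListSets>
</OAI-PMH>
"""


def find_sets(xml_text: str, needle: str) -> list[tuple[str, str]]:
    """Return (setSpec, setName) pairs whose spec or name contains needle."""
    root = ET.fromstring(xml_text)
    wanted = needle.lower()
    matches = []
    for set_el in root.iterfind(".//oai:set", OAI_NS):
        spec = set_el.findtext("oai:setSpec", default="", namespaces=OAI_NS)
        name = set_el.findtext("oai:setName", default="", namespaces=OAI_NS)
        if wanted in spec.lower() or wanted in name.lower():
            matches.append((spec, name))
    return matches
```

The `setSpec` value is the "spec" that the "Save for future use" link stores.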

Find your work in the Archive.org collection.

  1. Click the "List works to import from your collection" link. This will query Archive.org for the works it has in that OAI set.
  2. Click the Import button beside the work you want to import.
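
The query behind the listing step can be sketched as an OAI-PMH `ListIdentifiers` request restricted to one set. The verb and the `metadataPrefix`, `set`, and `resumptionToken` parameters are part of the OAI-PMH 2.0 standard. The base URL below is an assumption about where Archive.org serves OAI, and `list_identifiers_url` is a hypothetical helper, not FromThePage's own code.

```python
from urllib.parse import urlencode

# Assumed location of Archive.org's OAI-PMH endpoint; verify before use.
OAI_BASE = "https://archive.org/services/oai.php"


def list_identifiers_url(set_spec=None, resumption_token=None, base=OAI_BASE):
    """Build a ListIdentifiers request URL for one OAI set.

    OAI-PMH treats resumptionToken as an exclusive argument: a follow-up
    request for the next page of results sends only the verb and the token.
    """
    if resumption_token:
        params = {"verb": "ListIdentifiers", "resumptionToken": resumption_token}
    elif set_spec:
        params = {
            "verb": "ListIdentifiers",
            "metadataPrefix": "oai_dc",  # Dublin Core, required of every OAI-PMH repository
            "set": set_spec,
        }
    else:
        raise ValueError("either set_spec or resumption_token is required")
    return base + "?" + urlencode(params)
```

Large sets are returned in pages, which is why the sketch accepts a resumption token for follow-up requests.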

Import the Internet Archive Book

Clicking "Import" in either of the options above imports all the relevant Archive.org information about the book, as well as information about each scanned leaf, into FromThePage. This process may take a couple of minutes, depending on how many leaves are in the scan. The import process adds the IA book to the user's staging area (accessible via the dashboard) and redirects straight to the Manage Import screen.

Review and convert the Archive.org book

The Manage Import screen shows all the pages imported from Archive.org and provides the following three features:
  • Purge Delete Scans: Archive.org classifies some scanned leaves as type="Delete". These are apparently things like color calibration cards, and Archive.org never displays them. They should be purged, so press this button.

  • Retitle from OCR: This is specific to pre-printed 20th-century daybooks. For these materials, the OCR has done a pretty good job of parsing the date that's printed at the top of each page. I've written code to re-title the pages, replacing the numeric page numbers (which are really leaf titles) with these parsed OCR dates. Press this button and wait a few minutes for the parsing to happen. The new titles may need correction, but this can save a lot of effort for diaries and journals.

  • Convert to FromThePage: This converts an Archive.org-imported book and its leaves into a FromThePage work with corresponding pages. This is the final step of the IA book import. It may also take a few minutes to run, so please be patient.
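
As a rough illustration of the first two features, the Python sketch below drops leaves typed "Delete" and then re-titles the remaining leaves from a date found near the top of their OCR text. Everything in it is an assumption made for the example: the `Leaf` record, the function names, and the date pattern (an English month name followed by a day and an optional year). FromThePage's actual implementation may work quite differently.

```python
import re
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Leaf:
    """Hypothetical stand-in for one imported Archive.org leaf."""
    leaf_num: int
    page_type: str      # e.g. "Normal" or "Delete"
    title: str          # starts out as the leaf number
    ocr_text: str = ""


MONTHS = ("January|February|March|April|May|June|July|"
          "August|September|October|November|December")
# Matches e.g. "January 3" or "January 3, 1920" (assumed daybook header style).
DATE_RE = re.compile(rf"\b({MONTHS})\s+(\d{{1,2}})(?:,\s*(\d{{4}}))?", re.IGNORECASE)


def purge_delete_leaves(leaves):
    """Keep only leaves that are not typed "Delete"."""
    return [leaf for leaf in leaves if leaf.page_type != "Delete"]


def retitle_from_ocr(leaves, header_lines=3):
    """Re-title each leaf from a date in its first few OCR lines, if any."""
    retitled = []
    for leaf in leaves:
        header = "\n".join(leaf.ocr_text.splitlines()[:header_lines])
        match = DATE_RE.search(header)
        if match:
            # Collapse stray OCR whitespace inside the matched date.
            new_title = " ".join(match.group(0).split())
            retitled.append(replace(leaf, title=new_title))
        else:
            retitled.append(leaf)  # leave the numeric title for manual correction
    return retitled
```

Leaves without a recognizable date keep their numeric titles, which mirrors the note above that re-titled pages may still need correction by hand.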

Once the converter is finished, you can access the work from the dashboard. Move on to Preparing a Work for Transcription.