# Determining IA-Scanned Books
## Problem
Internet Archive's text archive, which contains scanned books and documents, held approximately 38 million records as of 2023-07. To date, however, the scanning labor team has only been able to identify 9 million texts by querying by scanning center: [insert the link]
- This dataset was accumulated first by querying for all scan IDs from known scanning centers, which were unearthed by analyzing Open Library's data dumps (which contain 2 million records scanned at 64 unique centers), and then by cross-checking these against texts included in major subcollections of the IA text collection (American Libraries, Canadian Libraries, Books by Language, etc.).
- Through querying collections, we discovered that many of the books in IA's records were not scanned at IA scanning centers. Rather, individual users uploaded public domain books scanned for other mass digitization projects, including Google Books, HathiTrust, and Project Gutenberg.


The goal of this Jupyter notebook is to come up with a more definitive list of the books in IA's collections that were scanned by IA-employed workers at IA scanning centers.

## Experimental Design

To do so, I first queried IA's advanced search database for texts whose uploader field contains the string "@archive.org", using the query URL below: https://archive.org/advancedsearch.php?q=mediatype%3A%28texts%29&uploader%3A%28%40archive.org%29&fl%5B%5D=identifier&rows=14000000&output=csv&callback=callback&save=no
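
As a minimal sketch, the same kind of query URL can be assembled programmatically with Python's standard library rather than typed by hand. Note that in the URL as written above, the uploader clause follows an `&` and so sits outside the `q` parameter. The sketch assumes the intent was to combine both clauses inside `q` with `AND`, and that assumption should be checked against IA's search documentation. The helper name `build_search_url` is hypothetical.

```python
from urllib.parse import urlencode

BASE_URL = "https://archive.org/advancedsearch.php"


def build_search_url(query, fields=("identifier",), rows=14_000_000, output="csv"):
    """Return an advanced-search URL for the given query string."""
    params = [("q", query)]
    # One `fl[]` parameter per requested field, as in the URL above.
    params += [("fl[]", field) for field in fields]
    params += [("rows", rows), ("output", output)]
    return f"{BASE_URL}?{urlencode(params)}"


# Assumption: both conditions belong inside the single `q` parameter.
url = build_search_url("mediatype:(texts) AND uploader:(@archive.org)")
print(url)
```

Building the URL from a parameter list keeps the percent-encoding consistent and makes it easier to see which clauses actually end up inside the search query.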

Unlike the scanning center field, which is optional across all IA media, the uploader field is required metadata for every item in IA's collection. It is not usually displayed on the resource page; rather, it is stored in the item's meta.xml metadata file. The uploader field stores the email address affiliated with the account of the person responsible for uploading an item, and IA's software assigns it automatically. See the details of the field below, or visit IA's documentation:

**uploader**
- internal use only: Yes
- usage notes: The uploader field determines which account has full access to modify/edit/delete metadata and files from the item without having any special privileges granted. Any other account that wants to modify this item must have some level of administrative privilege granted by Internet Archive.
- definition: Email address of the account that uploaded the item to archive.org.
- required: Yes
- label: Item Uploader
- repeatable: No
- accepted values: Email address
- edit access: IA admin
- defined by: IA software
- example: footage@panix.com

## Hypothesis
In theory, all IA employees who scan materials at IA scanning centers should be issued an Internet Archive email address associated with their account. Therefore, this Jupyter notebook tests the hypothesis that all items among the 14 million records with the string "@archive.org" in their uploader field are also contained within the 9 million records scanned at IA scanning centers.
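
Stated in set terms, the hypothesis is that the uploader-derived identifiers form a subset of the scanning-center identifiers. The sketch below shows one way such a check could be written. The function name `check_containment` and the toy identifiers are hypothetical and stand in for the real lists.

```python
def check_containment(uploader_ids, center_ids):
    """Report whether every uploader-derived ID also appears among the center-derived IDs."""
    uploader_set = set(uploader_ids)
    center_set = set(center_ids)
    missing = uploader_set - center_set  # IDs that would contradict the hypothesis
    return {
        "holds": not missing,
        "n_overlap": len(uploader_set & center_set),
        "n_missing": len(missing),
        "missing": missing,
    }


# Toy data only; these identifiers are made up for illustration.
toy_uploader_ids = ["example_item_a", "example_item_b", "example_item_c"]
toy_center_ids = ["example_item_a", "example_item_b", "example_item_d"]

result = check_containment(toy_uploader_ids, toy_center_ids)
print(result["holds"], result["n_overlap"], result["n_missing"])
```

Converting both lists to sets first keeps the comparison fast even for millions of identifiers, and the `missing` set gives a concrete list of counterexamples to inspect if the hypothesis fails.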

## The 9 million records
We are fairly confident that the 9 million records in this CSV file were scanned at IA-based scanning centers.

These records were discovered, first, by accessing 2 million records from Open Library's data dumps and discovering and geocoding 64 unique scanning centers. Next, we queried IA's database for records associated with each scanning center name. However, we had no reason to believe that Open Library's data dumps contained a text scanned at every scanning center. So we broadened the search by querying major IA collections for the scanning center field of their contents. When a scanning center was not represented in the list gathered so far, we added it to the list of scanning centers. We performed this search for 14 IA collections:
- americana
- biodiversity
- booksbylanguage_arabic
- booksbylanguage
- china
- digitallibraryindia
- europeanlibraries
- folkscanomy
- inlibrary
- internetarchivebooks
- JaiGyan
- newspapers
- printdisabled
- toronto

In total, we discovered 9 million records affiliated with 96 scanning centers (note these have not yet been deduplicated):

'shenzhen', 'hongkong', 'china', 'cebu', 'alberta', 'sfdowntown', 'tt_sanfrancisco', 'iala', 'indiana', 'euston', 'uoft', 'la', 'nj', 'santamonica', 'tt_pts', 'boston', 'beltsville', 'capitolhill', 'nyc', 'chapelhill', 'rich', 'richmond', 'richflorida', 'ill', 'il', 'edinburgh', 'sfciviccenter', 'maryland', 'raleigh', 'provo', 'washingtondc', 'tt_georgetown', 'sheridan', 'arch', 'miss', 'durham', 'durham2', 'providence', 'tt_providence', 'lond', 'london', 'manhattan', 'honolulu', 'rexburg', 'valencia', 'gainesville', 'tt_osu', 'tt_numismatic', 'tt_getty', 'tt_harvardernstmayr', 'pretoria', 'IIIT Hyderabad', 'CCL HYDERABAD', 'sacramento', 'beijing', 'tt_bangalore', 'tt_sok', 'trent', 'guatemala', 'tt_swinburne', 'BookScanUS', 'tt_amnh', 'harrisburg', 'amherst', 'tt_stanfordlaw', 'tt_pem', 'tt_calacademy', 'tt_jakarta', 'hangzhou', 'clemson', 'Osmania University', 'clarksville', 'tt_victoria', 'poughkeepsie', 'tt_warwick', 'tt_riks', '1dollarscan (zLibro)', 'RMSC_IIITH', 'tt_stlouis', 'saltlakecity', 'tt_louisville', 'tt_oberlin', 'tt_clatsopcounty', 'brussels', 'tt_harrisburg', 'tt_statenisland', 'utah', 'hbl_storrs', 'tt_perkins', 'tt_harvardwidener', 'Hong Kong', 'AP Press Academy Archives', 'hamilton', 'tt_nybg', 'mobot', 'bali'
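
Since the center names above have not yet been deduplicated, one possible first pass is to normalize case and whitespace and then fold known spelling variants together. The sketch below illustrates the idea. The alias table is an assumption made purely for illustration: whether pairs such as 'rich' and 'richmond' or 'lond' and 'london' really refer to the same center has not been verified.

```python
# Hypothetical alias table; each mapping is an unverified guess.
CENTER_ALIASES = {
    "hong kong": "hongkong",
    "lond": "london",
    "rich": "richmond",
    "il": "ill",
}


def normalize_center(name):
    """Lowercase and trim a center name, then apply any known alias."""
    key = name.strip().lower()
    return CENTER_ALIASES.get(key, key)


def group_centers(names):
    """Group raw center names under their normalized form."""
    groups = {}
    for name in names:
        groups.setdefault(normalize_center(name), []).append(name)
    return groups


sample = ["hongkong", "Hong Kong", "lond", "london", "rich", "richmond", "BookScanUS"]
groups = group_centers(sample)
for canonical, variants in groups.items():
    print(canonical, variants)
```

Keeping the raw variants alongside each normalized name makes it possible to review every merge by hand before any records are combined.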

In [4]:
import pandas as pd

In [5]:
by_centers = pd.read_csv("/Volumes/Samsung_T5/scanning_labor_in_IA/texts-data.csv")


In [14]:
id_centers = by_centers['identifier'].tolist()

In [15]:
id_centers

['geometrysuccessi00thom',
 'berdiemobilittde00hebe',
 'beilsteinshandb28beil',
 'theoryapplicatio00adva',
 'specialeditionus00edbo_0',
 'surinamediscover00coll',
 'tclcvolume99twen00thom',
 'sweetdreams00pame',
 'sitasnakequeenof00fran',
 'sleepingwithherr00sher',
 'talldarkstranger00smit',
 'poemsaboutlove00grav',
 'ponyinpumpkinpat00bagl',
 'poohvisitsdoctor00kath',
 'takingcareofbusi00leej',
 'raisinggoodsport00sell',
 'randomthoughts00crgi',
 'shakespearecount00hill_0',
 'aycubasocioeroti00codr',
 'scienceannual199400lori',
 'parisversailles00gray',
 'nomercyhostofam00wals',
 'mythologyofcrime00vict',
 'namesofchrist00hort',
 'nativeroadscompl00kosi',
 'naturalmedicines00jell',
 'oneshenandoahwin00bunn',
 'opal00boul',
 'operation00ande',
 'publicaffairspri00bald',
 'pumpkintownornot00mcky',
 'puttingonglitzun00hatc',
 'photoshop55image00wein',
 'plantsofoceanthe00sueb',
 'pharmacologyther00grol',
 'peachpitpopulari00simp',
 'promiseoffaith00wild',
 'proudpeopleshe00alfo',
 'psych

In [20]:
# The request timed out at 1 million, so I need to repeat it :(
by_uploaders = pd.read_csv('/Volumes/Samsung_T5/scanning_labor_in_IA/ia_scanned_ids.csv')
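
When a single large request times out, one common workaround is to fetch the results in smaller pages and stitch them together. The sketch below shows only the looping logic, assuming a service that returns each page with a continuation cursor. The `fetch_page` callable and the response shape (`items`, `cursor`) are hypothetical stand-ins, not a confirmed IA interface, and the fake pages exist only so that the loop can run without network access.

```python
def collect_identifiers(fetch_page, max_pages=None):
    """Follow continuation cursors until none is returned, collecting identifiers."""
    identifiers = []
    cursor = None
    pages = 0
    while True:
        page = fetch_page(cursor)
        identifiers.extend(item["identifier"] for item in page.get("items", []))
        cursor = page.get("cursor")
        pages += 1
        # Stop when the service reports no further pages or the page cap is hit.
        if not cursor or (max_pages is not None and pages >= max_pages):
            break
    return identifiers


# Fake two-page response used only to exercise the loop.
FAKE_PAGES = {
    None: {"items": [{"identifier": "a"}, {"identifier": "b"}], "cursor": "c1"},
    "c1": {"items": [{"identifier": "c"}], "cursor": None},
}


def fake_fetch_page(cursor):
    return FAKE_PAGES[cursor]


ids = collect_identifiers(fake_fetch_page)
print(ids)
```

Passing the fetcher in as an argument keeps the paging loop testable offline. A real fetcher would need to follow whatever paging mechanism IA's search service actually documents.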

In [19]:
by_uploaders

Unnamed: 0,identifier
0,DFD29_20171225
1,the-trace-of-interbeing
2,fold1.harrisburg.archive.org
3,SalmosAlabadFELiRe
4,satrss_20170810
...,...
11694995,jstor-30065112
11694996,eoab042
11694997,State-Dept-cable-1975-179735
11694998,jstor-30065212
