# Manifest Builder
In this notebook we're crosswalking our current data structures to the [Manifest](https://github.com/hock/Manifest/wiki) supply chain data format.

Plan: manually combine scanning centers that have more than one scanning center id before bringing them into this notebook for real.

## Manifest Fields

The Manifest [Google Sheets template](https://docs.google.com/spreadsheets/d/17P3kAShGgpSUV0P8f38zAlWYvF_Vxf8OldY6yriTfns/edit?usp=sharing) has the following fields:
- index: integer
- Name: the name of the node (in our case the scanning center name)
- description: a markdown description of the location. Can include links and such.
- Category: the type(s) of location. An item can have more than one category. Each category is prefixed with # and separated by a comma, i.e., "#scanningcenter,#academiclibrary"
- Images: a url leading to an image. If we wanted to include images of the scanning centers, we could do so using this.
- Location: text description of the location, i.e., Allen County, IN
- Geocode: the latitude and longitude coordinates of the location separated by a comma, i.e., "40.72401342,-74.0064435"
- DestinationIndex: the index of any locations that the location should be connected to. For example, the destination index for the physical archive would include Datum Data, IA Hong Kong, and Innodata
- Measure: any measure associated with the location, including starttime and endtime. Each measure consists of 3 values nested in parentheses: (measure_name,measure_value,measure_unit). Measures are separated from each other by a comma. For example, "(books_scanned,3800,books),(pages_scanned,252988,pages),(median_turnover,1127.0,days),(days_operated,2267.0,days),(starttime,1445467744,utc)"
  - starttime: when the location becomes relevant to the supply chain given in UTC time, i.e., (starttime,1445467744,utc)
  - endtime: when the location stops being relevant to the supply chain given in UTC time, i.e., (endtime,1445467744,utc)
- Sources: any sources affiliated with the supply chain. In our case, we may include the actual links to texts scanned at that location in IA.
- AdditionalNotes: I don't think these appear in Manifest. This would be a good place to include all the values in the "scanningcenter" field that map to that scanning center. For example, UIUC could have this entry in AdditionalNotes: "scanningcenter values affiliated with this center include: 'il' and 'ill'"
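
The Measure and Geocode formats described above can be sketched as small string helpers. This is a minimal illustration, not part of the notebook's pipeline: the function names `format_measure`, `format_measures`, and `format_geocode` are hypothetical, and the sample values are the ones quoted in the field list.

```python
def format_measure(name, value, unit):
    """Return one measure as "(measure_name,measure_value,measure_unit)"."""
    return f"({name},{value},{unit})"


def format_measures(measures):
    """Join (name, value, unit) tuples into a single comma-separated Measure cell."""
    return ",".join(format_measure(*measure) for measure in measures)


def format_geocode(lat, long):
    """Return "lat,long" as expected by the Geocode field."""
    return f"{lat},{long}"


# Sample values taken from the field descriptions above.
measure_cell = format_measures([
    ("books_scanned", 3800, "books"),
    ("pages_scanned", 252988, "pages"),
    ("starttime", 1445467744, "utc"),
])
geocode_cell = format_geocode(40.72401342, -74.0064435)

print(measure_cell)
print(geocode_cell)
```

Keeping the formatting in one place would make it easier to change the separator or add a measure later without touching every cell that builds these strings.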



In [33]:
import pandas as pd
import datetime

In [53]:
location_key = pd.read_csv("https://raw.githubusercontent.com/ers6/ia_scanning_labor_data/refs/heads/main/metadata-analysis/metadata-records-analysis-csvs/location_key.csv")
location_key

Unnamed: 0,scan_center,lat,long,name
0,1dollarscan (zLibro),37.278389,-121.949260,1 Dollar Scan
1,indiana,41.077129,-85.143200,Allen County Public Library Geneaology Center
2,tt_amnh,40.781304,-73.974049,American Museum of Natural History
3,tt_numismatic,38.649678,-90.328482,Washington University in St. Louis
4,tt_louisville,38.257598,-85.714221,American Printing House for the Blind
...,...,...,...,...
103,tt_victoria,48.463375,-123.310025,University of Victoria
104,tt_warwick,52.381617,-1.561842,University of Warwick
105,tt_stlouis,38.649678,-90.328482,Washington University in St. Louis
106,amherst,42.321888,-72.527668,Yiddish Book Center


In [54]:
location_key['long']

0     -121.949260
1      -85.143200
2      -73.974049
3      -90.328482
4      -85.714221
          ...    
103   -123.310025
104     -1.561842
105    -90.328482
106    -72.527668
107    120.119013
Name: long, Length: 108, dtype: float64

In [55]:

center_names = []
centers = []

# Collapse rows that share a center name into one record, joining their scan_center ids with "||".
for i in range(len(location_key)):
    name = location_key.at[i, 'name']
    this_id = str(location_key.at[i, 'scan_center'])
    if name in center_names:
        # Center already recorded: append this id to its existing entry.
        for this_center in centers:
            if this_center['name'] == name:
                this_center['scanningcenter_ids'] += "||" + this_id
    else:
        centers.append({
            'name': name,
            'lat': str(location_key.at[i, 'lat']),
            'long': str(location_key.at[i, 'long']),
            'scanningcenter_ids': this_id
        })
        center_names.append(name)






In [56]:
centers_df = pd.DataFrame(centers)
centers_df

Unnamed: 0,name,lat,long,scanningcenter_ids
0,1 Dollar Scan,37.27838889,-121.9492601,1dollarscan (zLibro)
1,Allen County Public Library Geneaology Center,41.0771285,-85.1432003,indiana
2,American Museum of Natural History,40.78130431,-73.97404878,tt_amnh
3,Washington University in St. Louis,38.649678,-90.32848246,tt_numismatic||tt_stlouis
4,American Printing House for the Blind,38.25759766,-85.71422071,tt_louisville
...,...,...,...,...
71,University of Toronto,43.66441368,-79.39947221,uoft||UofT
72,University of Victoria,48.4633746,-123.3100252,tt_victoria
73,University of Warwick,52.38161719,-1.561841996,tt_warwick
74,Yiddish Book Center,42.32188825,-72.52766762,amherst
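
The grouping step that produces this table could also be written with a dictionary keyed by center name, which avoids rescanning the list of centers for every duplicate. The sketch below is an illustration only: `group_centers` is a hypothetical helper, and the sample rows are copied from the `location_key` output above. With the real data, the rows could presumably come from `location_key.to_dict("records")`.

```python
def group_centers(rows):
    """Collapse rows sharing a name into one record with "||"-joined scan_center ids."""
    grouped = {}  # dicts keep insertion order, so first-seen order is preserved
    for row in rows:
        record = grouped.get(row["name"])
        if record is None:
            # First time we see this center: keep its coordinates and id.
            grouped[row["name"]] = {
                "name": row["name"],
                "lat": str(row["lat"]),
                "long": str(row["long"]),
                "scanningcenter_ids": str(row["scan_center"]),
            }
        else:
            # Same center under another id: append the id only.
            record["scanningcenter_ids"] += "||" + str(row["scan_center"])
    return list(grouped.values())


# Sample rows copied from the location_key output above.
rows = [
    {"scan_center": "tt_numismatic", "lat": 38.649678, "long": -90.328482,
     "name": "Washington University in St. Louis"},
    {"scan_center": "tt_amnh", "lat": 40.781304, "long": -73.974049,
     "name": "American Museum of Natural History"},
    {"scan_center": "tt_stlouis", "lat": 38.649678, "long": -90.328482,
     "name": "Washington University in St. Louis"},
]
centers = group_centers(rows)
for center in centers:
    print(center["name"], center["scanningcenter_ids"])
```

Like the loop in the notebook, this keeps the coordinates of the first row seen for each name and assumes that rows with the same name describe the same place.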


In [57]:
def make_geocords(lat, long): 
    return str(lat) + ',' + str(long)

# Example search URL: https://archive.org/search?query=scanningcenter%3A%28cebu%29
def get_ia_links(center_ids):
    # Build one Internet Archive search link per "||"-separated scanning center id.
    links = []
    for this_id in center_ids.split('||'):
        links.append("https://archive.org/search?query=scanningcenter%3A%28" + this_id.replace(" ", "") + "%29")
    # Wrap each link in parentheses and separate links with commas (no trailing comma).
    return ",".join("(" + link + ")" for link in links)

In [58]:
get_ia_links(centers_df.at[5, 'scanningcenter_ids'])

'(https://archive.org/search?query=scanningcenter%3A%28tt_swinburne%29)'
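
`get_ia_links` hard-codes the percent-encoded `%3A%28` and `%29` and strips spaces from the id, so an id such as `1dollarscan (zLibro)` ends up with unescaped parentheses in the link. One possible alternative is to let `urllib.parse.quote` encode the whole query. This is a sketch under assumptions: `build_search_url` is a hypothetical name, and whether archive.org's search matches an id that contains spaces or parentheses when encoded this way has not been checked here.

```python
from urllib.parse import quote

SEARCH_BASE = "https://archive.org/search?query="


def build_search_url(center_id):
    """Percent-encode a scanningcenter query for one id (assumed query shape)."""
    query = f"scanningcenter:({center_id})"
    # safe="" also encodes ":", "(", ")" and "/", matching the hand-written URL style.
    return SEARCH_BASE + quote(query, safe="")


simple_url = build_search_url("cebu")
spaced_url = build_search_url("1dollarscan (zLibro)")
print(simple_url)
print(spaced_url)
```

For a plain id like `cebu` this yields the same URL as the example in the comment above `get_ia_links`.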

In [59]:
manifest_data = []
for i in range(len(centers_df)): 
    manifest_data.append(
        {'Index': i,
         'Name': centers_df.at[i, 'name'], 
         'Description': '',
         'Category': '',
         'Images': '',
         'Location':'',
         'Geocode': make_geocords(centers_df.at[i, 'lat'], centers_df.at[i, 'long']),
         'DestinationIndex': '',
         'Measure':'',
         'Sources': get_ia_links(centers_df.at[i, 'scanningcenter_ids']),
         'AdditionalNotes':'Scanning center ids separated by "||": ' + str(centers_df.at[i,'scanningcenter_ids'])
        }
    )

In [9]:
manifest_data

[{'Index': 0,
  'Name': '1 Dollar Scan',
  'Description': '',
  'Category': '',
  'Images': '',
  'Location': '',
  'Geocode': '37.27838889,-121.9492601',
  'DestinationIndex': '',
  'Measure': '',
  'Sources': '(https://archive.org/search?query=scanningcenter%3A%281dollarscan(zLibro)%29)',
  'AdditionalNotes': 'Scanning center ids separated by "||": 1dollarscan (zLibro)'},
 {'Index': 1,
  'Name': 'Allen County Public Library Geneaology Center',
  'Description': '',
  'Category': '',
  'Images': '',
  'Location': '',
  'Geocode': '41.0771285,-85.1432003',
  'DestinationIndex': '',
  'Measure': '',
  'Sources': '(https://archive.org/search?query=scanningcenter%3A%28indiana%29)',
  'AdditionalNotes': 'Scanning center ids separated by "||": indiana'},
 {'Index': 2,
  'Name': 'American Museum of Natural History',
  'Description': '',
  'Category': '',
  'Images': '',
  'Location': '',
  'Geocode': '40.78130431,-73.97404878',
  'DestinationIndex': '',
  'Measure': '',
  'Sources': '(https:/

In [60]:
for i in range(len(manifest_data)):
    # subsetting the location dataframe for only entries with the same entry in the same field (same center)
    locs = location_key[location_key["name"] == manifest_data[i]['Name']].reset_index()
    print(locs)
    try:
        # assuming that every center that shares the same name is the same center and should have the same geographic coordinates 
        manifest_data[i]['Geocode'] = str(locs.at[0, 'lat']) + ',' +  str(locs.at[0, 'long'])
        # print(manifest_data[i]['Geocode'])
    except KeyError:
        pass
    except ValueError:
        pass
    



   index           scan_center        lat       long           name
0      0  1dollarscan (zLibro)  37.278389 -121.94926  1 Dollar Scan
   index scan_center        lat     long  \
0      1     indiana  41.077129 -85.1432   

                                            name  
0  Allen County Public Library Geneaology Center  
   index scan_center        lat       long                                name
0      2     tt_amnh  40.781304 -73.974049  American Museum of Natural History
   index    scan_center        lat       long  \
0      3  tt_numismatic  38.649678 -90.328482   
1    105     tt_stlouis  38.649678 -90.328482   

                                 name  
0  Washington University in St. Louis  
1  Washington University in St. Louis  
   index    scan_center        lat       long  \
0      4  tt_louisville  38.257598 -85.714221   

                                    name  
0  American Printing House for the Blind  
   index   scan_center        lat        long  \
0      5  tt_

In [61]:
len(manifest_data)

76

In [62]:
manifest_df = pd.DataFrame.from_dict(manifest_data)


In [63]:
manifest_df

Unnamed: 0,Index,Name,Description,Category,Images,Location,Geocode,DestinationIndex,Measure,Sources,AdditionalNotes
0,0,1 Dollar Scan,,,,,"37.27838889,-121.9492601",,,(https://archive.org/search?query=scanningcent...,"Scanning center ids separated by ""||"": 1dollar..."
1,1,Allen County Public Library Geneaology Center,,,,,"41.0771285,-85.1432003",,,(https://archive.org/search?query=scanningcent...,"Scanning center ids separated by ""||"": indiana"
2,2,American Museum of Natural History,,,,,"40.78130431,-73.97404878",,,(https://archive.org/search?query=scanningcent...,"Scanning center ids separated by ""||"": tt_amnh"
3,3,Washington University in St. Louis,,,,,"38.649678,-90.32848246",,,(https://archive.org/search?query=scanningcent...,"Scanning center ids separated by ""||"": tt_numi..."
4,4,American Printing House for the Blind,,,,,"38.25759766,-85.71422071",,,(https://archive.org/search?query=scanningcent...,"Scanning center ids separated by ""||"": tt_loui..."
...,...,...,...,...,...,...,...,...,...,...,...
71,71,University of Toronto,,,,,"43.66441368,-79.39947221",,,(https://archive.org/search?query=scanningcent...,"Scanning center ids separated by ""||"": uoft||UofT"
72,72,University of Victoria,,,,,"48.4633746,-123.3100252",,,(https://archive.org/search?query=scanningcent...,"Scanning center ids separated by ""||"": tt_vict..."
73,73,University of Warwick,,,,,"52.38161719,-1.561841996",,,(https://archive.org/search?query=scanningcent...,"Scanning center ids separated by ""||"": tt_warwick"
74,74,Yiddish Book Center,,,,,"42.32188825,-72.52766762",,,(https://archive.org/search?query=scanningcent...,"Scanning center ids separated by ""||"": amherst"


## Category 

Adding category info to the scanning centers.

In [13]:
center_types = pd.read_csv("https://raw.githubusercontent.com/ers6/ia_scanning_labor_data/refs/heads/main/metadata-analysis/metadata-records-analysis-csvs/scan-center-type.csv")[['name', 'type']]

In [14]:
center_types = center_types.drop_duplicates().reset_index()[['name','type']]

In [15]:
center_types

Unnamed: 0,name,type
0,"Innodata Knowledge Services, Inc.",bpo
1,Hong Kong,bpo
2,University of Alberta,academic
3,Internet Archive Headquarters,hq
4,Datum Data Co. Ltd.,bpo
...,...,...
74,Press Academy of Andhra Pradesh,archive
75,Hamilton Public Library,public
76,New York Botanical Garden,museum
77,Missouri Botanical Garden,museum


In [16]:
# Tag each center with its type from center_types, formatted as a "#type" category.
for i in range(len(manifest_df)): 
    for j in range(len(center_types)): 
        if manifest_df.at[i, 'Name'] == center_types.at[j, 'name']: 
            manifest_df.at[i, 'Category'] = "#"+ str(center_types.at[j, 'type'])

# Centers whose type is missing (NaN) end up as "#nan"; reset those to an empty string.
for i in range(len(manifest_df)): 
    if manifest_df.at[i, 'Category'] == "#nan":
        manifest_df.at[i, 'Category'] = ''
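
Since the Category field allows several comma-separated `#` tags per item, the tagging could be factored into a helper that skips missing types up front instead of cleaning up `#nan` afterwards. A minimal sketch, assuming a hypothetical `format_categories` helper and a hypothetical `types_by_name` lookup built from `center_types`; the sample names and types are taken from the `center_types` output above.

```python
import math


def format_categories(types):
    """Turn a list of type strings into "#a,#b", skipping None/NaN and duplicates."""
    tags = []
    for value in types:
        if value is None or (isinstance(value, float) and math.isnan(value)):
            continue  # missing type: contribute nothing rather than "#nan"
        tag = "#" + str(value).strip()
        if tag not in tags:
            tags.append(tag)
    return ",".join(tags)


# Hypothetical lookup: center name -> list of types.
types_by_name = {
    "University of Alberta": ["academic"],
    "Hong Kong": ["bpo"],
    "New York Botanical Garden": ["museum", float("nan"), "museum"],
}

alberta = format_categories(types_by_name.get("University of Alberta", []))
garden = format_categories(types_by_name.get("New York Botanical Garden", []))
unknown = format_categories(types_by_name.get("Not In The Table", []))
print(alberta, garden, repr(unknown))
```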


## Measures 

Adding measure info. Each measure should be formatted as (measure_name,measure_value,measure_unit).



In [17]:
geocounts = pd.read_csv("https://raw.githubusercontent.com/ers6/ia_scanning_labor_data/refs/heads/main/metadata-analysis/metadata-records-analysis-csvs/geocounts.csv")[['name', 'total_scans']]

In [18]:
# total pages scanned

pages = pd.read_csv("https://raw.githubusercontent.com/ers6/ia_scanning_labor_data/refs/heads/main/metadata-analysis/metadata-records-analysis-csvs/total_pages_scanned_per_center.csv")
pages

Unnamed: 0.1,Unnamed: 0,name,pages_scanned
0,0,Allen County Public Library Geneaology Center,41400011.0
1,1,American Museum of Natural History,13456.0
2,2,American Printing House for the Blind,94959.0
3,3,Analysis and Policy Observatory (APO),2342.0
4,4,"BYU, Hawaii",339732.0
...,...,...,...
60,60,University of Victoria,581098.0
61,61,University of Warwick,38355.0
62,62,Washington University in St. Louis,2099033.0
63,63,Yiddish Book Center,188933.0


In [64]:
yearly_books = pd.read_csv("https://raw.githubusercontent.com/ers6/ia_scanning_labor_data/refs/heads/main/metadata-analysis/metadata-records-analysis-csvs/annual_scans_per_center.csv")

def get_times(yearly_books, this_center):
    endyear = max(yearly_books.loc[yearly_books['name']== this_center]['year'].tolist())
    endtime = "(endtime,"+str(datetime.datetime(endyear, 12, 31, 0, 0).strftime('%s'))+",utc)"
    startyear = min(yearly_books.loc[yearly_books['name']== this_center]['year'].tolist())
    starttime = "(starttime," + str(datetime.datetime(startyear, 1, 1, 0, 0).strftime('%s')) + ",utc)"
    return {'starttime': starttime, 'endtime': endtime}

In [19]:
def get_measures(geocounts, pages, i, manifest_df): 
    name = manifest_df.at[i, 'Name']
    try:
        total_scans = geocounts.loc[geocounts['name'] == name].reset_index()['total_scans'].iloc[0]
        # Use a new variable so the `pages` dataframe argument is not overwritten.
        page_count = pages.loc[pages['name'] == name].reset_index()['pages_scanned'].iloc[0]
        # TODO: include starttime/endtime from get_times(yearly_books, name).
        return "(books_scanned,"+ str(total_scans)+ ",books),(pages_scanned,"+str(page_count)+",pages)"
    except IndexError: 
        # No matching row in geocounts or pages: print what was found and return None.
        print(geocounts.loc[geocounts['name'] == name])
get_measures(geocounts, pages, 24, manifest_df)

'(books_scanned,195,books),(pages_scanned,60557.0,pages)'

In [20]:
measures = []
for i in range(len(manifest_df)): 
    measures.append({'name':manifest_df.at[i,'Name'],
                     'measure': get_measures(geocounts, pages, i, manifest_df)
    })

            name  total_scans
0  1 Dollar Scan           20
         name  total_scans
8  BookScanUS           42
Empty DataFrame
Columns: [name, total_scans]
Index: []
Empty DataFrame
Columns: [name, total_scans]
Index: []
                                                 name  total_scans
28  International Institute of Information Technol...         8322
Empty DataFrame
Columns: [name, total_scans]
Index: []
                         name  total_scans
36  Missouri Botanical Garden            1
                  name  total_scans
43  Osmania University         2522
                          name  total_scans
46  Perpustakaan Provinsi Bali          560
                               name  total_scans
47  Press Academy of Andhra Pradesh         4150
                                              name  total_scans
49  Regional Mega Scanning Center by IIT, Hyderbad        12555
                name  total_scans
60  Trent University            2
                         name  total_scans
67  

In [21]:
measures
measures_df = pd.DataFrame(data=measures)

In [22]:
for entry in measures: 
    # manifest_df.loc[manifest_df['Name']== entry['name']]['Measures'] = entry['measure']
    index_row = manifest_df.index[manifest_df['Name'] == entry['name']].tolist()[0]
    manifest_df.at[index_row, 'Measure'] = entry['measure']

In [23]:
manifest_df

Unnamed: 0,Index,Name,Description,Category,Images,Location,Geocode,DestinationIndex,Measure,Sources,AdditionalNotes
0,0,1 Dollar Scan,,,,,"37.27838889,-121.9492601",,,(https://archive.org/search?query=scanningcent...,"Scanning center ids separated by ""||"": 1dollar..."
1,1,Allen County Public Library Geneaology Center,,#public,,,"41.0771285,-85.1432003",,"(books_scanned,214706,books),(pages_scanned,41...",(https://archive.org/search?query=scanningcent...,"Scanning center ids separated by ""||"": indiana"
2,2,American Museum of Natural History,,#museum,,,"40.78130431,-73.97404878",,"(books_scanned,29,books),(pages_scanned,13456....",(https://archive.org/search?query=scanningcent...,"Scanning center ids separated by ""||"": tt_amnh"
3,3,Washington University in St. Louis,,#academic,,,"38.649678,-90.32848246",,"(books_scanned,18103,books),(pages_scanned,209...",(https://archive.org/search?query=scanningcent...,"Scanning center ids separated by ""||"": tt_numi..."
4,4,American Printing House for the Blind,,#archive,,,"38.25759766,-85.71422071",,"(books_scanned,2326,books),(pages_scanned,9495...",(https://archive.org/search?query=scanningcent...,"Scanning center ids separated by ""||"": tt_loui..."
...,...,...,...,...,...,...,...,...,...,...,...
71,71,University of Toronto,,#academic,,,"43.66441368,-79.39947221",,"(books_scanned,465884,books),(pages_scanned,15...",(https://archive.org/search?query=scanningcent...,"Scanning center ids separated by ""||"": uoft||UofT"
72,72,University of Victoria,,#academic,,,"48.4633746,-123.3100252",,"(books_scanned,6843,books),(pages_scanned,5810...",(https://archive.org/search?query=scanningcent...,"Scanning center ids separated by ""||"": tt_vict..."
73,73,University of Warwick,,#academic,,,"52.38161719,-1.561841996",,"(books_scanned,715,books),(pages_scanned,38355...",(https://archive.org/search?query=scanningcent...,"Scanning center ids separated by ""||"": tt_warwick"
74,74,Yiddish Book Center,,#academic,,,"42.32188825,-72.52766762",,"(books_scanned,1242,books),(pages_scanned,1889...",(https://archive.org/search?query=scanningcent...,"Scanning center ids separated by ""||"": amherst"


In [24]:
manifest_df.to_csv("/Users/elizabethschwartz/Documents/GitHub/ia_scanning_labor_data/manifest-records/manifest_doc_v1.csv")

did add a chart view but no way to say 12 measures were connected in some way 
cumulative stat with a different node for each center 
we could add a measure that has gradations in it 
map embed is coming 

IA supply chain 2010, IA supply chain 2011

collections - lists a set of manifests 
load different data layers - any kind of geojson data, inbox data, 

lib/json/cases if you point manifest to #collective 

# Annual Manifest documents 

In [25]:
yearly_books = pd.read_csv("https://raw.githubusercontent.com/ers6/ia_scanning_labor_data/refs/heads/main/metadata-analysis/metadata-records-analysis-csvs/annual_scans_per_center.csv")

In [40]:
yearly_books['name'].unique().tolist()

['Allen County Public Library Geneaology Center',
 'American Museum of Natural History',
 'American Numismatic Society',
 'American Printing House for the Blind',
 'Analysis and Policy Observatory (APO)',
 'BYU, Hawaii',
 'BYU, Idaho Family History Library',
 'BYU, Provo',
 'Boston Public Library',
 'British Library',
 'Brown University',
 'California Acaddemy of Sciences',
 'California State Library',
 'Centre for Strategic and International Studies, Jakarta',
 'Church History Library',
 'Clatsop County Historical Society',
 'Clemson University',
 'Columbia University',
 'Datum Data Co. Ltd.',
 'Duke University',
 'Family Search Library',
 'Georgetown University',
 'Getty Research Institute',
 'Getty Research Institute Valencia Warehouse',
 'Hamilton Public Library',
 'Harvard',
 'Hong Kong',
 'Hopewell Junction',
 'Innodata Knowledge Services, Inc.',
 'Internet Archive Headquarters',
 'Internet Archive Sheridan Headquarters',
 'Internt Archive Physical Archive',
 'John Hopkins Univer

Need to create 3 different measures: 
- start date time: the first year for which we have data for the scanning center. This should be rendered as a Unix timestamp and written `(starttime,xxxxx,utc)`
- end date time: the last year for which we have data. This should be rendered as a Unix timestamp and written `(endtime,xxxxx,utc)`
- books scanned per year: should be rendered as `(measure_name,(unix_time_stamp0:number0,unix_time_stamp1:number1),unit)`, i.e., `(Productivity,(1546300800:100,1577836800:200,1609459200:300,1640995200:400,1672531200:500),pages scanned per worker)`

We could gather this info for every month or even every day for each center, but I think that would quickly become too much for the JavaScript to handle. For now, let's make life easier and do it per year, assuming each center opens when its first year begins and closes when its last year ends.
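
The three measures listed above can be sketched with timezone-aware datetimes, so the timestamps are true UTC regardless of the machine's local time zone. This is an illustration under assumptions: the helper names (`year_start_utc`, `year_end_utc`, `time_measures`, `series_measure`) are hypothetical, the end of a year is taken as 31 December at 00:00, and the counts are the sample numbers from the Productivity example rather than real data.

```python
from datetime import datetime, timezone


def year_start_utc(year):
    """Unix timestamp for 1 January of `year` at 00:00 UTC."""
    return int(datetime(year, 1, 1, tzinfo=timezone.utc).timestamp())


def year_end_utc(year):
    """Unix timestamp for 31 December of `year` at 00:00 UTC."""
    return int(datetime(year, 12, 31, tzinfo=timezone.utc).timestamp())


def time_measures(years):
    """Return the starttime and endtime measures for a list of active years."""
    start = year_start_utc(min(years))
    end = year_end_utc(max(years))
    return f"(starttime,{start},utc),(endtime,{end},utc)"


def series_measure(name, counts_by_year, unit):
    """Return "(name,(ts0:n0,ts1:n1,...),unit)" with one point per year."""
    points = ",".join(
        f"{year_start_utc(year)}:{counts_by_year[year]}"
        for year in sorted(counts_by_year)
    )
    return f"({name},({points}),{unit})"


# Sample counts only, mirroring the Productivity example above.
sample_counts = {2019: 100, 2020: 200, 2021: 300}
times = time_measures([2002, 2023])
series = series_measure("Productivity", sample_counts, "pages scanned per worker")
print(times)
print(series)
```

Because the datetimes carry `tzinfo=timezone.utc`, `year_start_utc(2019)` gives `1546300800`, the first timestamp in the Productivity example.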

In [50]:
def get_times(yearly_books, this_center):
    # Note: strftime('%s') is a platform-specific extension and interprets these naive
    # datetimes in the machine's local time zone, so the values are true UTC only when
    # the local zone is UTC.
    years = yearly_books.loc[yearly_books['name'] == this_center]['year'].tolist()
    endyear = max(years)
    endtime = "(endtime,"+str(datetime.datetime(endyear, 12, 31, 0, 0).strftime('%s'))+",utc)"
    startyear = min(years)
    starttime = "(starttime," + str(datetime.datetime(startyear, 1, 1, 0, 0).strftime('%s')) + ",utc)"
    return {'starttime': starttime, 'endtime': endtime}


In [51]:
# start time
startyear = min(yearly_books.loc[yearly_books['name']== 'Yiddish Book Center']['year'].tolist())
starttime = datetime.datetime(startyear, 1, 1, 0, 0).strftime('%s')

In [52]:
for entry in yearly_books['name'].unique().tolist():
    print(get_times(yearly_books, entry))

{'starttime': '(starttime,1009864800,utc)', 'endtime': '(endtime,1704002400,utc)'}
{'starttime': '(starttime,1451628000,utc)', 'endtime': '(endtime,1514700000,utc)'}
{'starttime': '(starttime,1420092000,utc)', 'endtime': '(endtime,1704002400,utc)'}
{'starttime': '(starttime,1420092000,utc)', 'endtime': '(endtime,1704002400,utc)'}
{'starttime': '(starttime,1483250400,utc)', 'endtime': '(endtime,1514700000,utc)'}
{'starttime': '(starttime,1293861600,utc)', 'endtime': '(endtime,1420005600,utc)'}
{'starttime': '(starttime,1293861600,utc)', 'endtime': '(endtime,1483164000,utc)'}
{'starttime': '(starttime,1230789600,utc)', 'endtime': '(endtime,1704002400,utc)'}
{'starttime': '(starttime,1167631200,utc)', 'endtime': '(endtime,1704002400,utc)'}
{'starttime': '(starttime,1357020000,utc)', 'endtime': '(endtime,1704002400,utc)'}
{'starttime': '(starttime,1262325600,utc)', 'endtime': '(endtime,1704002400,utc)'}
{'starttime': '(starttime,1420092000,utc)', 'endtime': '(endtime,1546236000,utc)'}
{'st