# Manifest Builder
In this notebook we crosswalk our current data structures to the [Manifest](https://github.com/hock/Manifest/wiki) supply chain data format.

Plan: manually combine scanning centers that have more than one scanning center id before bringing the data into this notebook for real.

## Manifest Fields

The Manifest [Google Sheets template](https://docs.google.com/spreadsheets/d/17P3kAShGgpSUV0P8f38zAlWYvF_Vxf8OldY6yriTfns/edit?usp=sharing) has the following fields:
- index: integer
- Name: the name of the node (in our case the scanning center name)
- description: a markdown description of the location. Can include links and such.
- Category: the types of location. An item can have more than one category. Each category is prefixed with # and categories are separated by commas, e.g., "#scanningcenter,#academiclibrary"
- Images: a url leading to an image. If we wanted to include images of the scanning centers, we could do so using this.
- Location: text description of the location, e.g., Allen County, IN
- Geocode: the latitude and longitude of the location separated by a comma, e.g., "40.72401342,-74.0064435"
- DestinationIndex: the index of any locations that the location should be connected to. For example, the destination index for the physical archive would include Datum Data, IA Hong Kong, and Innodata
- Measure: any measures associated with the location, including starttime and endtime. Each measure consists of 3 values nested in parentheses: (measure_name,measure_value,measure_unit). Measures are separated from each other by commas. For example, "(books_scanned,3800,books),(pages_scanned,252988,pages),(median_turnover,1127.0,days),(days_operated,2267.0,days),(starttime,1445467744,utc)"
  - starttime: when the location becomes relevant to the supply chain, given as a Unix timestamp in UTC, e.g., (starttime,1445467744,utc)
  - endtime: when the location stops being relevant to the supply chain, given as a Unix timestamp in UTC, e.g., (endtime,1445467744,utc)
- Sources: any sources affiliated with the supply chain. In our case, we may include the actual links to texts scanned at that location in IA.
- AdditionalNotes: I don't think these appear on Manifest. This would be a good place to list all the values of the "scanningcenter" field that map to that scanning center. For example, UIUC could have this entry in AdditionalNotes: "scanningcenter values affiliated with this center include: 'il' and 'ill'"
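
The Measure format described above can be sketched in a few lines of Python. This is only an illustration: the helper names (`format_measure`, `format_measures`, `to_unix_seconds`) are hypothetical and are not part of Manifest or of this notebook, and the example date was chosen so that it maps to the timestamp used in the field description.

```python
from datetime import datetime, timezone

def format_measure(name, value, unit):
    """Render one measure as "(name,value,unit)"."""
    return f"({name},{value},{unit})"

def format_measures(measures):
    """Join several (name, value, unit) tuples with commas."""
    return ",".join(format_measure(*m) for m in measures)

def to_unix_seconds(dt):
    """Convert a timezone-aware datetime to whole seconds since the Unix epoch."""
    return int(dt.timestamp())

# Hypothetical start date for a scanning center, expressed in UTC.
start = datetime(2015, 10, 21, 22, 49, 4, tzinfo=timezone.utc)

measure_field = format_measures([
    ("books_scanned", 3800, "books"),
    ("pages_scanned", 252988, "pages"),
    ("starttime", to_unix_seconds(start), "utc"),
])
print(measure_field)
```

The resulting string could be written into the currently empty 'Measure' column once per-center statistics are available.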



In [3]:
import pandas as pd

In [4]:
location_key = pd.read_csv("https://raw.githubusercontent.com/ers6/ia_scanning_labor_data/main/access-ia-metadata-records/data/location_key.csv")
location_key

HTTPError: HTTP Error 404: Not Found

In [2]:
location_key['long']

NameError: name 'location_key' is not defined

In [3]:

center_names = []
centers = []

# Collapse rows that share a center name into one record, joining their ids with "||".
for i in range(len(location_key)):
    name = location_key.at[i, 'name']
    this_id = str(location_key.at[i, 'scan_center'])
    if name in center_names:
        # Center already recorded: append this scanning center id to its entry.
        for this_center in centers:
            if this_center['name'] == name:
                this_center['scanningcenter_ids'] += "||" + this_id
    else:
        # First time we see this center: record its name, coordinates, and id.
        centers.append({
            'name': name,
            'lat': str(location_key.at[i, 'lat']),
            'long': str(location_key.at[i, 'long']),
            'scanningcenter_ids': this_id
        })
        center_names.append(name)






NameError: name 'location_key' is not defined

In [4]:
centers_df = pd.DataFrame(centers)
centers_df

NameError: name 'pd' is not defined
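
The merge of rows that share a center name could probably also be done directly on the dataframe with `groupby`. The sketch below is only a possible alternative, not the method used in this notebook. It runs on a small toy dataframe (`toy_location_key`, a made-up name) with the same column names as `location_key`, and it keeps the coordinates of the first row for each name.

```python
import pandas as pd

# Toy rows shaped like location_key (columns: scan_center, lat, long, name).
toy_location_key = pd.DataFrame({
    "scan_center": ["cebu", "hongkong", "Hong Kong"],
    "lat": [10.31803327, 22.31836625, 22.31836625],
    "long": [123.9205454, 114.181248, 114.181248],
    "name": ["Innodata Knowledge Services, Inc.", "Hong Kong", "Hong Kong"],
})

# One row per center name; ids are joined with "||" in order of appearance.
toy_centers_df = (
    toy_location_key
    .groupby("name", sort=False, as_index=False)
    .agg(
        lat=("lat", "first"),
        long=("long", "first"),
        scanningcenter_ids=("scan_center", lambda ids: "||".join(ids.astype(str))),
    )
)
print(toy_centers_df)
```

Unlike the loop, this keeps `lat` and `long` as numbers rather than strings, so they would still need to be converted when building the Geocode field.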

In [51]:
def make_geocords(lat, long): 
    return str(lat) + ',' + str(long)

# https://archive.org/search?query=scanningcenter%3A%28cebu%29
def get_ia_links(center_ids):
    # Build one Internet Archive search link per scanning center id.
    links = []
    for this_id in center_ids.split('||'):
        links.append("https://archive.org/search?query=scanningcenter%3A%28" + this_id.replace(" ", "") + "%29")
    # Wrap each link in parentheses and separate links with commas (no trailing comma).
    return ",".join("(" + link + ")" for link in links)

In [52]:
get_ia_links(centers_df.at[5, 'scanningcenter_ids'])

'(https://archive.org/search?query=scanningcenter%3A%28tt_swinburne%29)'
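
Some scanning center ids contain characters such as parentheses, which end up unencoded in the link when the URL is assembled by hand. A possible variant, sketched below, percent-encodes the whole query with `urllib.parse.quote`. The function names `ia_search_url` and `sources_field` are hypothetical, spaces are still stripped as in `get_ia_links`, and whether archive.org returns the intended results for such ids has not been checked here.

```python
from urllib.parse import quote

def ia_search_url(center_id):
    """Build a search link for one scanning center id, percent-encoding the query."""
    query = "scanningcenter:(" + center_id.replace(" ", "") + ")"
    return "https://archive.org/search?query=" + quote(query, safe="")

def sources_field(center_ids):
    """Wrap each link in parentheses and join the links with commas."""
    return ",".join("(" + ia_search_url(cid) + ")" for cid in center_ids.split("||"))

print(ia_search_url("tt_swinburne"))
print(sources_field("hongkong||Hong Kong"))
```

For ids made only of letters, digits, and underscores this produces the same link as the hand-built version.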

In [57]:
manifest_data = []
for i in range(len(centers_df)): 
    manifest_data.append(
        {'Index': i,
         'Name': centers_df.at[i, 'name'], 
         'Description': '',
         'Category': '',
         'Images': '',
         'Location':'',
         'Geocode': make_geocords(centers_df.at[i, 'lat'], centers_df.at[i, 'long']),
         'DestinationIndex': '',
         'Measure':'',
         'Sources': get_ia_links(centers_df.at[i, 'scanningcenter_ids']),
         'AdditionalNotes':'Scanning center ids separated by "||": ' + str(centers_df.at[i,'scanningcenter_ids'])
        }
    )

In [58]:
manifest_data

[{'Index': 0,
  'Name': '1 Dollar Scan',
  'Description': '',
  'Category': '',
  'Images': '',
  'Location': '',
  'Geocode': '37.27838889,-121.9492601',
  'DestinationIndex': '',
  'Measure': '',
  'Sources': '(https://archive.org/search?query=scanningcenter%3A%281dollarscan(zLibro)%29)',
  'AdditionalNotes': 'Scanning center ids separated by "||": 1dollarscan (zLibro)'},
 {'Index': 1,
  'Name': 'Allen County Public Library Geneaology Center',
  'Description': '',
  'Category': '',
  'Images': '',
  'Location': '',
  'Geocode': '41.0771285,-85.1432003',
  'DestinationIndex': '',
  'Measure': '',
  'Sources': '(https://archive.org/search?query=scanningcenter%3A%28indiana%29)',
  'AdditionalNotes': 'Scanning center ids separated by "||": indiana'},
 {'Index': 2,
  'Name': 'American Museum of Natural History',
  'Description': '',
  'Category': '',
  'Images': '',
  'Location': '',
  'Geocode': '40.78130431,-73.97404878',
  'DestinationIndex': '',
  'Measure': '',
  'Sources': '(https:/

In [58]:
for i in range(len(manifest_data)):
    # subset the location dataframe to the rows whose name matches this manifest entry (same center)
    locs = location_key[location_key["name"] == manifest_data[i]['Name']].reset_index()
    print(locs)
    try:
        # assuming that every center that shares the same name is the same center and should have the same geographic coordinates 
        manifest_data[i]['Geocode'] = str(locs.at[0, 'lat']) + ',' +  str(locs.at[0, 'long'])
        # print(manifest_data[i]['Geocode'])
    except KeyError:
        pass
    except ValueError:
        pass
    



   index scan_center          lat         long  \
0      0        cebu  10.31803327  123.9205454   

                                name  
0  Innodata Knowledge Services, Inc.  
   index scan_center          lat        long       name
0      1    hongkong  22.31836625  114.181248  Hong Kong
1     76   Hong Kong  22.31836625  114.181248  Hong Kong
   index scan_center          lat          long                   name
0      2     alberta  53.52319337  -113.5271984  University of Alberta
1     88     alberta  53.52319337  -113.5271984  University of Alberta
   index      scan_center          lat          long  \
0      3       sfdowntown  37.78244229  -122.4717902   
1      5     sanfrancisco  37.78244229  -122.4717902   
2     58             arch  37.80039616  -122.4603638   
3     69  tt_sanfrancisco  37.78244229  -122.4717902   

                            name  
0  Internet Archive Headquarters  
1  Internet Archive Headquarters  
2  Internet Archive Headquarters  
3  Internet Arch
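
The loop assumes that all rows sharing a name also share coordinates, but the printed output shows the "arch" row with different coordinates from the other Internet Archive Headquarters rows. A quick check along the following lines could list such centers before the first row's coordinates are used. It is a sketch on toy rows copied from the output above, and `toy_location_key` and `conflicting_names` are made-up names.

```python
import pandas as pd

# Toy rows copied from the printed output above.
toy_location_key = pd.DataFrame({
    "scan_center": ["sfdowntown", "sanfrancisco", "arch", "cebu"],
    "lat": [37.78244229, 37.78244229, 37.80039616, 10.31803327],
    "long": [-122.4717902, -122.4717902, -122.4603638, 123.9205454],
    "name": [
        "Internet Archive Headquarters",
        "Internet Archive Headquarters",
        "Internet Archive Headquarters",
        "Innodata Knowledge Services, Inc.",
    ],
})

# Count the distinct (lat, long) pairs recorded for each center name.
distinct_coords = (
    toy_location_key
    .drop_duplicates(subset=["name", "lat", "long"])
    .groupby("name")
    .size()
)

# Names with more than one distinct coordinate pair need a manual decision.
conflicting_names = distinct_coords[distinct_coords > 1].index.tolist()
print(conflicting_names)
```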

In [60]:
len(manifest_data)

80

In [62]:
manifest_df = pd.DataFrame.from_dict(manifest_data)


In [64]:
manifest_df.to_csv("/Users/elizabethschwartz/Documents/test_manifest.csv", index=False)  # the 'Index' column already holds the row index

- A chart view was added, but there is no way to say that 12 measures are connected in some way
- Cumulative stat with a different node for each center
- We could add a measure that has gradations in it
- A map embed is coming

IA supply chain 2010, IA supply chain 2011

collections - lists a set of manifests 
load different data layers - any kind of geojson data, inbox data, 

lib/json/cases if you point manifest to #collective 