# Manifest Builder
In this notebook we crosswalk our current data structures to the [Manifest](https://github.com/hock/Manifest/wiki) supply chain data format.

Plan: manually combine scanning centers that have more than one scanning center id before bringing the data into this notebook for real.

## Manifest Fields

The Manifest [Google Sheets template](https://docs.google.com/spreadsheets/d/17P3kAShGgpSUV0P8f38zAlWYvF_Vxf8OldY6yriTfns/edit?usp=sharing) has the following fields:
- index: integer
- Name: the name of the node (in our case the scanning center name)
- description: a markdown description of the location. Can include links and such.
- Category: the type(s) of location. An item can have more than one category. Each category is prefixed with # and categories are separated by commas, e.g., "#scanningcenter,#academiclibrary"
- Images: a url leading to an image. If we wanted to include images of the scanning centers, we could do so using this.
- Location: text description of the location, i.e., Allen County, IN
- Geocode: the latitude and longitude of the location separated by a comma, e.g., "40.72401342,-74.0064435"
- DestinationIndex: the index of any locations that this location should be connected to. For example, the destination index for the physical archive would include Datum Data, IA Hong Kong, and Innodata.
- Measure: any measures associated with the location, including starttime and endtime. Each measure consists of 3 values nested in parentheses: (measure_name,measure_value,measure_unit). Measures are separated from each other by commas. For example, "(books_scanned,3800,books),(pages_scanned,252988,pages),(median_turnover,1127.0,days),(days_operated,2267.0,days),(starttime,1445467744,utc)"
  - starttime: when the location becomes relevant to the supply chain given in UTC time, i.e., (starttime,1445467744,utc)
  - endtime: when the location stops being relevant to the supply chain given in UTC time, i.e., (endtime,1445467744,utc)
- Sources: any sources affiliated with the supply chain. In our case, we may include the actual links to texts scanned at that location in IA.
- AdditionalNotes: I don't think these appear on Manifest. It would be a good place to include all the values in the "scanningcenter" field that map to that scanning center. For example, UIUC could have this entry in AdditionalNotes: 'scanningcenter values affiliated with this center include: "il" and "ill"'
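
The Category and Measure formats described above are easy to get wrong by hand, so here is a minimal sketch of how the strings could be assembled. The helper names `format_measures` and `format_categories` are hypothetical (they are not part of Manifest or of this notebook), and the start date is made up for illustration.

```python
from datetime import datetime, timezone


def format_measures(measures):
    """Join (name, value, unit) triples into a Manifest-style Measure string."""
    return ",".join(f"({name},{value},{unit})" for name, value, unit in measures)


def format_categories(categories):
    """Prefix each category with '#' and join the results with commas."""
    return ",".join(f"#{category}" for category in categories)


# Hypothetical start date, converted to a Unix timestamp in UTC.
start = int(datetime(2015, 10, 21, tzinfo=timezone.utc).timestamp())

measure = format_measures([
    ("books_scanned", 3800, "books"),
    ("pages_scanned", 252988, "pages"),
    ("starttime", start, "utc"),
])
category = format_categories(["scanningcenter", "academiclibrary"])

print(measure)
print(category)
```

Building the strings from tuples keeps the numeric values and units separate until the last step, which should make it easier to swap in real per-center measures later.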



In [26]:
import pandas as pd

In [46]:
location_key = pd.read_csv("https://raw.githubusercontent.com/ers6/ia_scanning_labor_data/main/access-ia-metadata-records/data/location_key.csv")
location_key

Unnamed: 0,scan_center,lat,long,name
0,cebu,10.31803327,123.9205454,"Innodata Knowledge Services, Inc."
1,hongkong,22.31836625,114.181248,Hong Kong
2,alberta,53.52319337,-113.5271984,University of Alberta
3,sfdowntown,37.78244229,-122.4717902,Internet Archive Headquarters
4,shenzhen,113.9856083,"22.598147246440938,",Datum Data Co. Ltd.
...,...,...,...,...
105,AP Press Academy Archives,16.50612678,80.64675882,Press Academy of Andhra Pradesh
106,hamilton,43.26097698,-79.8703973,Hamilton Public Library
107,tt_nybg,40.86181013,-73.8809266,New York Botanical Garden
108,mobot,38.6170646,-90.26278742,Missouri Botanical Garden
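
In the preview above, the shenzhen row has a `lat` value above 90 and a trailing comma in `long`, which suggests the two coordinates were swapped and one was entered with a stray character. Since Geocode is built directly from these columns, a quick range check before the crosswalk may be worthwhile. The sketch below uses a small hypothetical sample that mirrors two rows of the preview rather than the full `location_key`, and `flag_bad_coordinates` is an illustrative helper name.

```python
import pandas as pd

# Hypothetical sample mirroring two rows of the preview (not the full location_key).
sample = pd.DataFrame({
    "scan_center": ["cebu", "shenzhen"],
    "lat": ["10.31803327", "113.9856083"],
    "long": ["123.9205454", "22.598147246440938,"],
})


def flag_bad_coordinates(df):
    """Return the rows whose lat/long are non-numeric or outside valid ranges."""
    # Values that cannot be parsed (e.g. a trailing comma) become NaN.
    lat = pd.to_numeric(df["lat"], errors="coerce")
    lon = pd.to_numeric(df["long"], errors="coerce")
    # NaN fails both range checks, so unparseable values are flagged too.
    ok = lat.between(-90, 90) & lon.between(-180, 180)
    return df[~ok]


bad_rows = flag_bad_coordinates(sample)
print(bad_rows)
```

The same function could be called on `location_key` itself; any rows it returns would need to be corrected in the source CSV before their coordinates are copied into Geocode.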


In [48]:
center_names = location_key['name'].drop_duplicates().tolist()

In [49]:
# # it makes more sense to just combine the dataframe before cross walking to the manifest format
# center_names = []
# centers = pd.DataFrame(columns = ['name', 'lat', 'long', 'scanningcenter_ids'])

# for i in range(len(location_key)):
#     if location_key.at[i, 'name'] in center_names:
#         pass
#         # need to append the scanning center id here somehow ... i think this needs to be a dataframe ugh 
#     else: 
#         center_names.append(location_key.at[i, 'name'])
#         this_center_data =  pd.DataFrame.from_dict({
#                 'name': location_key.at[i, 'name'],
#                 'lat': location_key.at[i, 'lat'],
#                 'long': location_key.at[i, 'long'],
#                 'scanningcenter_ids': str(location_key.at[i, 'scan_center']) + '||'
#             })
#         print(this_center_data)
#         # centers = pd.concat([centers, this_center_data], ignore_index=True)
#  #        centers.concat(
#  #            {
#  #                'name': location_key.at[i, 'name'],
#  #                'lat': location_key.at[i, 'lat'],
#  #                'long': location_key.at[i, 'long'],
#  #                'scanningcenter_ids': str(location_key.at[i, 'scan_center']) + '||'
#  #            }
#  #        )



#  # pd.concat([new_record, df], ignore_index=True)
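
One possible way to do the combining that the commented-out loop above was reaching for is a `groupby` on `name`. This is only a sketch: it runs on a small hypothetical sample shaped like `location_key`, keeps the first row's coordinates for each name (the same assumption made later in this notebook), and joins the ids with the `||` separator used in the draft above.

```python
import pandas as pd

# Hypothetical rows shaped like location_key; the real frame comes from read_csv above.
sample_key = pd.DataFrame({
    "scan_center": ["hongkong", "Hong Kong", "cebu"],
    "lat": [22.31836625, 22.31836625, 10.31803327],
    "long": [114.181248, 114.181248, 123.9205454],
    "name": ["Hong Kong", "Hong Kong", "Innodata Knowledge Services, Inc."],
})

# One row per center name, in order of first appearance.
centers = (
    sample_key.groupby("name", sort=False)
    .agg(
        lat=("lat", "first"),    # assumes rows sharing a name share coordinates
        long=("long", "first"),
        scanningcenter_ids=("scan_center", "||".join),
    )
    .reset_index()
)
print(centers)
```

The joined `scanningcenter_ids` column could feed the AdditionalNotes field. Note that taking the first coordinates silently hides any disagreement between rows with the same name, so names whose rows carry different coordinates would still need a manual decision.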





In [51]:
center_names

['Innodata Knowledge Services, Inc.',
 'Hong Kong',
 'University of Alberta',
 'Internet Archive Headquarters',
 'Datum Data Co. Ltd.',
 'University of Toronto',
 'Allen County Public Library Geneaology Center',
 'British Library',
 'UCLA',
 'Princeton University',
 'Boston Public Library',
 'National Agricultural Library',
 'Library of Congress',
 'Columbia University',
 'UNC Chapel Hill',
 'Internt Archive Physical Archive',
 'UIUC',
 'National Library of Scotland',
 'San Francisco Public Library',
 'University of Maryland, College Park',
 'North Carolina State University',
 'BYU, Provo',
 'Smithsonian Libraries and Archives',
 'Georgetown University',
 'Internet Archive Sheridan Headquarters',
 'Duke University',
 'Brown University',
 'Natural History Museum Library, London',
 'The Archive of Contemporary Music',
 'BYU, Hawaii',
 'BYU, Idaho Family History Library',
 'Getty Research Institute Valencia Warehouse',
 'University of Florida',
 'The Ohio State University',
 'American Num

In [52]:

# Spot-check the first center only. Assumes manifest_data already exists (it is built in the next cell).
i = 0
# Keep only the location_key rows whose name matches this center.
locs = location_key[location_key["name"] == manifest_data[i]['Name']].reset_index()
# Assume every row sharing a name is the same center with the same coordinates.
manifest_data[i]['Geocode'] = str(locs.at[0, 'lat']) + ',' + str(locs.at[0, 'long'])

print(manifest_data[i]['Geocode'])

10.31803327,123.9205454


In [53]:
manifest_data = []
for i in range(len(center_names)): 
    manifest_data.append(
        {'Index': i,
         'Name': center_names[i], 
         'Description': '',
         'Category': '',
         'Images': '',
         'Location':'',
         'Geocode':'',
         'DestinationIndex': '',
         'Measure':'',
         'Sources':'',
         'AdditionalNotes':''
        }
    )

In [58]:
for i in range(len(manifest_data)):
    # Keep only the location_key rows whose name matches this center.
    locs = location_key[location_key["name"] == manifest_data[i]['Name']].reset_index()
    print(locs)
    try:
        # Assume every row sharing a name is the same center with the same coordinates, so use the first row.
        manifest_data[i]['Geocode'] = str(locs.at[0, 'lat']) + ',' + str(locs.at[0, 'long'])
    except (KeyError, ValueError):
        # No usable row for this name: leave Geocode empty.
        pass
    



   index scan_center          lat         long  \
0      0        cebu  10.31803327  123.9205454   

                                name  
0  Innodata Knowledge Services, Inc.  
   index scan_center          lat        long       name
0      1    hongkong  22.31836625  114.181248  Hong Kong
1     76   Hong Kong  22.31836625  114.181248  Hong Kong
   index scan_center          lat          long                   name
0      2     alberta  53.52319337  -113.5271984  University of Alberta
1     88     alberta  53.52319337  -113.5271984  University of Alberta
   index      scan_center          lat          long  \
0      3       sfdowntown  37.78244229  -122.4717902   
1      5     sanfrancisco  37.78244229  -122.4717902   
2     58             arch  37.80039616  -122.4603638   
3     69  tt_sanfrancisco  37.78244229  -122.4717902   

                            name  
0  Internet Archive Headquarters  
1  Internet Archive Headquarters  
2  Internet Archive Headquarters  
3  Internet Arch

In [60]:
len(manifest_data)

80

In [62]:
manifest_df = pd.DataFrame.from_dict(manifest_data)


In [64]:
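# 'Index' is already a column, so skip the DataFrame index to avoid an extra "Unnamed: 0" column.
manifest_df.to_csv("/Users/elizabethschwartz/Documents/test_manifest.csv", index=False)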