In [1]:
import altair as alt
import pandas as pd
import numpy as np
import geopandas as gpd

# Creating Exploratory Data Visualizations for the Suppliers 
- IA China (veristrong_industrial)
- Datum Data (wanli_industrial and meilin_ge)
- Innodata (hvg_mandaue)

Other suppliers are exporters of machinery or book purchases (i.e., Tuling, Shenzhen Shengada, and Better World Books) and are out of scope for now.

There's a methodological argument happening here, or maybe a suggestion: speculative data visualization is a form of speculative bibliography. We don't necessarily need to model complex statistical equations; sometimes we can just count.
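
As a rough illustration of what "just counting" can look like with the bills of lading, here is a minimal sketch. The rows in `toy_bols` are made up for illustration and are not records from the real CSV; only the column names `arrival_date` and `supplier_location_id` come from the data used in this notebook.

```python
import pandas as pd

# Toy rows for illustration only; these are NOT records from the real bills of lading.
toy_bols = pd.DataFrame(
    {
        "arrival_date": ["2011-07-23", "2011-09-10", "2018-11-02", "2023-01-08"],
        "supplier_location_id": [
            "wanli_industrial",
            "wanli_industrial",
            "veristrong_industrial",
            "hvg_mandaue",
        ],
    }
)

# "Just counting": how many shipments does each supplier location account for?
shipments_per_location = toy_bols["supplier_location_id"].value_counts()

# The same count, broken out by arrival year.
toy_bols["year"] = pd.to_datetime(toy_bols["arrival_date"]).dt.year
shipments_per_year = toy_bols.groupby(["supplier_location_id", "year"]).size()

print(shipments_per_location)
print(shipments_per_year)
```

Swapping `toy_bols` for the full bills-of-lading DataFrame should give the same kind of tally for the real suppliers.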



## IA China 
aka veristrong_industrial, hongkong


In [2]:
bols = "https://raw.githubusercontent.com/ers6/ia_bols/main/geographic-data/bills-of-lading/combined_ia_bols_manual_dedupe%20-%20deduped_results.csv" 

scans = alt.Chart("https://raw.githubusercontent.com/ers6/ia_bols/main/scan-center-counts/hongkong_scan_counts.csv").mark_area(
    color='lightblue', 
    line=True).encode(
    x=alt.X('yearmonth(month_year):T', axis=alt.Axis(title="Months")),
    y=alt.Y('books_scanned:Q', axis=alt.Axis(title="Books Scanned"))
).transform_timeunit(
    month='yearmonth(month_year)')
select_ship= alt.selection_single(
  
    on="mouseover", nearest=True, fields=["arrival_date"], empty="none"
)


shipments = alt.Chart("https://raw.githubusercontent.com/ers6/ia_bols/main/geographic-data/bills-of-lading/combined_ia_bols_manual_dedupe%20-%20deduped_results.csv").mark_rule(opacity=0.5).encode(
    x='arrival_date:T',
    size = "weight_kg:Q",
    tooltip=["arrival_date:T", "weight_kg:Q", "shipping_port:N", "port_entry:N","goods_shipped:N", "hs_code_detail:N"]
).transform_filter(
alt.datum.supplier_location_id == "veristrong_industrial"
).add_selection(select_ship)



alt.layer(
    scans, shipments
).properties(
    width=800, height=300
).interactive()
# don't load in all the files for the json/csv file 

### analyzing this viz
This visualization shows the arrival dates of shipments linked to the IA China (Veristrong Industrial) location over time, along with the books scanned per month in the background as context. Things to keep in mind about the data in this chart: 
1) the scan count data comes from book scans exclusively tagged with the scanningcenter tag "scanningcenter:hongkong"; the shipping data includes every shipment recorded for the veristrong location. 
2) we only have access to IMPORT data. From conversations with workers at IA, we know that books would have been shipped from IA to the scanning center location initially and then shipped back to the US. Therefore, the shipping data is only telling us half of the story. 

Based on this, what can we learn? 
Well, first we need to hold that the relationship between the actual shipment records, the scanning center, and the books scanned there is speculative. Unlike Datum Data and Innodata, which show up on IA's 990 tax returns and in some ephemera in the archive, we don't know for a fact that the "scanningcenter:hongkong" tag actually corresponds to the veristrong industrial location. Part of this exploratory data analysis is attempting to see if that relationship makes any sense. And based on what we're seeing here, I'd argue it mostly does. 

In support of "scanningcenter:hongkong" == "veristrong_industrial"/IA China: 
- books being uploaded from the hongkong scanning center location stop in August 2018, and a shipment that includes book scanning computer devices arrives in the US on February 24, 2019. If we saw more books scanned at this location after that import, it would suggest that the hongkong center was still operational and therefore had not returned its scanning equipment.
- likewise, the hongkong center reaches the height of its scanning activity in July 2018. A few months after that, the heaviest containers enter the US. If the weight of the containers corresponds with the number of books scanned, then this pattern seems to match up well-ish.
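
The timing comparison in the second point (peak scanning month versus heaviest arrival) could be checked with something like the sketch below. The values in `toy_scans` and `toy_shipments` are invented for illustration and are not the real counts or weights; only the column names mirror the CSVs used in this notebook.

```python
import pandas as pd

# Toy values for illustration only; NOT the real scan counts or shipment weights.
toy_scans = pd.DataFrame(
    {
        "month_year": ["2018-05", "2018-06", "2018-07", "2018-08"],
        "books_scanned": [100, 250, 400, 50],
    }
)
toy_shipments = pd.DataFrame(
    {
        "arrival_date": ["2018-09-15", "2018-11-20", "2019-02-24"],
        "weight_kg": [1200.0, 9000.0, 800.0],
    }
)

toy_scans["month"] = pd.to_datetime(toy_scans["month_year"], format="%Y-%m")
toy_shipments["arrival_date"] = pd.to_datetime(toy_shipments["arrival_date"])

# Month with the most books scanned, and arrival date of the heaviest shipment.
peak_month = toy_scans.loc[toy_scans["books_scanned"].idxmax(), "month"]
heaviest_arrival = toy_shipments.loc[
    toy_shipments["weight_kg"].idxmax(), "arrival_date"
]

# Whole calendar months between the scanning peak and the heaviest arrival.
lag_months = (heaviest_arrival.year - peak_month.year) * 12 + (
    heaviest_arrival.month - peak_month.month
)
print(peak_month.date(), heaviest_arrival.date(), lag_months)
```

A lag of a few months would be consistent with the reading above, though it would not confirm it on its own.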

Against "scanningcenter:hongkong" == "veristrong_industrial": 
There is a blip in the book scanning data where a scanning site in hongkong pops up in 2009, but it only records 2 books/month for multiple months until 2013 (where it records 500 books a month) and then drops back to 2 before the spikes in 2018. We should have shipping data to a hongkong location if this is the same scanning center, but we don't. However, it is possible that the hongkong tag was previously used for a different scanning center. IA's data tends to be messy, so I'm not too concerned about a couple hundred unaccounted-for scans.


#### so why do we care?

Why should we care about the possible location of an Internet Archive scanning center from 5 years ago? Why does it matter that this location may or may not have been in a high-rise industrial building in hongkong, or perhaps somewhere else? 

Well, there are a few reasons we should care: 
1) this suggests a potentially different kind of outsourcing that IA engaged in between working with Datum Data and Innodata, one that complicates our understanding of what scanning labor arrangements at IA look like. More on this later, but this scanning center could be one that the Internet Archive itself actually ran. (The only contractor from this period that is unaccounted for is UpWork Global, per IA's December 2018 990. Is it possible that the Internet Archive hired its own freelance scanning workers and set up its own temporary scanning center in China? If so, why?) 
2) the fact that we cannot answer this question is symptomatic of the abstraction/non-specification that scaled systems demand 
3) refusing to live in the speculative stands to further erase, further abstract, and further disembody the scanning work we KNOW happened at the hongkong scanning center. Even this kind of speculative recovery mission requires us to sit with the discomfort of abstraction that scale creates, and it provides us with a methodology to imagine what could have been. Doing that work undermines the kind of abstraction, decontextualization, and commodification upon which scale is built. Refusing that through a radical act of imagination is powerful. It's a refusal to accept that scale is a natural state of things. Messiness, not precision, is natural. My point is, we are uncomfortable with not being certain. But the illusion of certainty is something that really only makes sense in a world conducive to scale, and that's not the world we live in. 

**there's power in speculation**

**Speculative Data Visualization and The Great Perhaps**
One thing that scares me about this work is that we read data visualizations as absolute/factual (they never are), but this project MUST dwell in uncertainty. So it's very important that we know these visualizations are themselves speculative, not definitive by any stretch of the imagination. 

Should use data archaeology to talk about what we're doing here, and also speculative bibliography, except the relationships between texts we're speculating about aren't really textual; they're material connections the texts serve as witnesses to. 
http://www.digitalhumanities.org/dhq/vol/15/4/000578/000578.html#williams2019 

In [3]:
# for context, just the book scan counts
select_month = alt.selection_single(
    # selection fields take plain column names, without the ":T" type shorthand
    on="mouseover", nearest=True, fields=["month_year"], empty="none"
)


scans = alt.Chart("https://raw.githubusercontent.com/ers6/ia_bols/main/scan-center-counts/hongkong_scan_counts.csv").mark_area(
    color='lightblue',  
    line=True).encode(
    x=alt.X('yearmonth(month_year):T', axis=alt.Axis(title="Months")),
    y=alt.Y('books_scanned:Q', axis=alt.Axis(title="Books Scanned")),
    tooltip = ["month_year:T", "books_scanned:Q"]
).transform_timeunit(
    month='yearmonth(month_year)').add_selection(select_month)


alt.layer(
    scans
).properties(
    width=800, height=300
).interactive()
# don't load in all the files for the json/csv file 


In [4]:
pd.read_csv("https://raw.githubusercontent.com/ers6/ia_bols/main/geographic-data/bills-of-lading/combined_ia_bols_manual_dedupe%20-%20deduped_results.csv")

Unnamed: 0.1,Unnamed: 0,source,arrival_date,company,company_location_id,company_address,company_lat,company_lon,supplier,supplier_location_id,...,hs_code_detail,bol,port_entry,port_entry_code,port_entry_lat,port_entry_lon,shipping_port,shipping_port_code,shipping_port_lar,shipping_port_lon
0,0,pan,2011-04-22,Internet Archive,300funston,"300 Funston Avenue, San Francisco, CA 94118, USA",37.782455,-122.471569,Tuling InfoTech Co Ltd,18excellence,...,,,,,,,,,,
1,1,pan,2011-07-23,Internet Archive,300funston,"300 Funston Avenue, San Francisco, CA 94118, USA",37.782455,-122.471569,Datum Data Co Ltd,wanli_industrial,...,,,,,,,,,,
2,2,pan,2011-09-10,Internet Archive,300funston,"300 Funston Avenue, San Francisco, CA 94118, USA",37.782455,-122.471569,Datum Data Co Ltd,wanli_industrial,...,,,,,,,,,,
3,3,pan,2011-12-18,Internet Archive,300funston,"300 Funston Avenue, San Francisco, CA 94118, USA",37.782455,-122.471569,Datum Data Co Ltd,wanli_industrial,...,,,,,,,,,,
4,4,pan,2012-10-08,Internet Archive,300funston,"300 Funston Avenue, San Francisco, CA 94118, USA",37.782455,-122.471569,Datum Data Co Ltd,wanli_industrial,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
78,78,combo,2023-01-08,Internet Archive,298cherry,298 Cherry Hill Dr Latrobe Us,40.258357,-79.418523,Innodata Knowledge Services Inc,hvg_mandaue,...,"Plastics and articles thereof ; Other plates, ...",CMDUPH10265489,"Baltimore, Md",1303.0,39.286327,-76.609577,Hong Kong,58201.0,22.338010,114.130381
79,79,combo,2023-01-09,Internet Archive,298cherry,298 Cherry Hill Dr Latrobe Pa 15650,40.258357,-79.418523,Innodata Knowledge Services Inc,hvg_mandaue,...,"Paper and paperboard; articles of paper pulp, ...",FLXT00001845539A,"New York/Newark Area, Newark, Nj",4601.0,40.685215,-74.163685,Singapore,55976.0,1.463013,103.831542
80,80,combo,2023-01-18,Internet Archive,298cherry,298 Cherry Hill Dr Latrobe Latrobe Pa United S...,40.258357,-79.418523,Innodata Knowledge Services Inc,hvg_mandaue,...,"Paper and paperboard; articles of paper pulp, ...",AMAWSCEBZ0021115,"Baltimore, Md",1303.0,39.286327,-76.609577,Kao Hsiung,58309.0,22.600626,120.287198
81,81,pan,2023-02-12,Internet Archive,298cherry,"298 Cherry Hills Dr, Latrobe, PA 15650, USA",40.258357,-79.418523,Innodata Knowledge Services Inc,hvg_mandaue,...,,,,,,,,,,


In [12]:
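# Use a separate name so `bols` keeps holding the bills-of-lading URL used by the other cells
location_key = pd.read_csv("https://raw.githubusercontent.com/ers6/ia_bols/main/geographic-data/location_key.csv"
)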

In [6]:
pd.read_csv(bols).keys()

Index(['Unnamed: 0', 'source', 'arrival_date', 'company',
       'company_location_id', 'company_address', 'company_lat', 'company_lon',
       'supplier', 'supplier_location_id', 'supplier_address', 'supplier_lat',
       'supplier_lon', 'quantity', 'weight_kg', 'number_containers',
       'goods_shipped', 'container_marks', 'container_size', 'container_type',
       'hs_code_detail', 'bol', 'port_entry', 'port_entry_code',
       'port_entry_lat', 'port_entry_lon', 'shipping_port',
       'shipping_port_code', 'shipping_port_lar', 'shipping_port_lon'],
      dtype='object')

In [7]:
pd.read_csv("https://raw.githubusercontent.com/ers6/ia_bols/main/scan-center-counts/hongkong_scan_counts.csv")

Unnamed: 0,month_year,books_scanned
0,2009-08,1
1,2009-09,2
2,2009-12,1
3,2010-02,1
4,2010-06,1
...,...,...
57,2018-09,26901
58,2018-10,29563
59,2018-11,26944
60,2018-12,19350


## Datum Data Co 

Datum Data has 2 different locations, wanli_industrial and meilin_ge; we know this from the 990s, which show 2 different locations for Datum Data. I'm assuming that all books scanned by Datum Data, regardless of location, are under the "shenzhen" scanning center.

In [22]:
# datum data bols 
full_bols = pd.read_csv("https://raw.githubusercontent.com/ers6/ia_bols/main/geographic-data/bills-of-lading/combined_ia_bols_manual_dedupe%20-%20deduped_results.csv")

# keep only the wanli_industrial rows (meilin_ge shipments are not included in this export)
datum_bols = full_bols.loc[full_bols['supplier_location_id'] == 'wanli_industrial']

# writes to a local, machine-specific path
datum_bols.to_csv("/Users/elizabethschwartz/Desktop/datum_bols.csv")

In [8]:
bols = "https://raw.githubusercontent.com/ers6/ia_bols/main/geographic-data/bills-of-lading/combined_ia_bols_manual_dedupe%20-%20deduped_results.csv"
scans = alt.Chart("https://raw.githubusercontent.com/ers6/ia_bols/main/scan-center-counts/shenzhen_scan_counts.csv").mark_area(
    color='orange', 
    line=True).encode(
    x=alt.X('yearmonth(month_year):T', axis=alt.Axis(title="Months")),
    y=alt.Y('books_scanned:Q', axis=alt.Axis(title="Books Scanned"))
).transform_timeunit(
    month='yearmonth(month_year)')
select_ship= alt.selection_single(
  
    on="mouseover", nearest=True, fields=["arrival_date"], empty="none"
)

scans2 = alt.Chart("https://raw.githubusercontent.com/ers6/ia_bols/main/scan-center-counts/china-scandates.csv").mark_area(
    color='orange', 
 
    line=True).encode(
    x=alt.X('yearmonth(month_year):T', axis=alt.Axis(title="Months")),
    y=alt.Y('scandate:Q', axis=alt.Axis(title="Books Scanned"))
).transform_timeunit(
    month='yearmonth(month_year)')


shipments = alt.Chart("https://raw.githubusercontent.com/ers6/ia_bols/main/geographic-data/bills-of-lading/combined_ia_bols_manual_dedupe%20-%20deduped_results.csv").mark_rule(opacity=0.5).encode(
    x='arrival_date:T',
    size = "weight_kg:Q",
    tooltip=["arrival_date:T", "weight_kg:Q", "shipping_port:N", "port_entry:N","goods_shipped:N", "hs_code_detail:N"]
).transform_filter( 
    alt.FieldOneOfPredicate(field='supplier_location_id', oneOf=["wanli_industrial", "meilin_ge"])

).add_selection(select_ship)



alt.layer(
    scans, scans2, shipments
).properties(
    width=800, height=300
).interactive()
# don't load in all the files for the json/csv file 

### analysis of the viz

This visualization shows me that I haven't matched this up correctly.

There seems to be another scanning center that Datum Data ran briefly in late 2020/early 2021, because there is a shipment here that doesn't line up with the scan counts.
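
One way to surface shipments that don't line up is to flag any arrival that falls well after the last month with recorded scans. This is a minimal sketch under stated assumptions: the rows are toy values, not the real data, and the six-month `grace` window is a hypothetical tolerance for return shipping, not something taken from the records.

```python
import pandas as pd

# Toy values for illustration only; NOT the real scan counts or shipments.
toy_scans = pd.DataFrame(
    {"month_year": ["2015-11", "2015-12", "2016-01"], "books_scanned": [300, 280, 40]}
)
toy_shipments = pd.DataFrame(
    {
        "arrival_date": ["2015-12-03", "2016-02-10", "2021-01-15"],
        "supplier_location_id": ["wanli_industrial", "wanli_industrial", "meilin_ge"],
    }
)

toy_shipments["arrival_date"] = pd.to_datetime(toy_shipments["arrival_date"])
last_scan_month = pd.to_datetime(toy_scans["month_year"], format="%Y-%m").max()

# Hypothetical tolerance: allow some months for return shipping after scanning ends.
grace = pd.DateOffset(months=6)

# Shipments arriving after the grace window have no scan activity to pair with.
unmatched = toy_shipments[toy_shipments["arrival_date"] > last_scan_month + grace]
print(unmatched)
```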

It’s unclear whether Datum Data starts being called ‘china’ instead of ‘shenzhen’ in IA’s metadata. What’s clear is that they continue to be paid millions by IA after 2016 and that a few shipments continue. This is also around the same time the Internet Archive starts the Chinese popular books project; see the links below.

video of shenzhen center from 2011
https://archive.org/details/shenzhenscann2011 

chinese popular books project
https://archive.org/details/popularchinesebooks?tab=about 


We know that Datum Data is a scanning partner on some books scanned for the Chinese popular books project (see https://archive.org/details/zhuluojigongyuan0000mich). It's possible to search the Chinese popular books collection for items where Datum Data is in the partner field and the scanningcenter = china:


https://archive.org/search?query=partner%3A%28datum+data+popular+chinese+books%29

If Datum Data is responsible for the popular Chinese books project, this would answer a lot of questions. For one, it would explain why shipments slowed between 2016 and 2021 while IA continued to pay Datum Data millions of dollars as reported on the 990s, assuming that IA was sourcing books from China and therefore not shipping them from the US to Datum Data. 

https://archive.org/advancedsearch.php?q=%22scanningcenter%3A+china%22&fl%5B%5D=partner&rows=5600000&output=json&callback=callback&save=no

Results for scanning center china:

39151 of the 81523 items (about 48%) list Datum Data as the partner.

None of the others have a partner field at all.
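
The tally above could be reproduced from a saved search response roughly as follows. This is a sketch, not the method actually used for the figures: `toy_payload` is made-up data in the shape the advancedsearch JSON output is assumed to have (a `response` object holding a `docs` list), and `count_partners` is a hypothetical helper. If the URL includes a `callback` parameter, the response is likely wrapped in a function call and would need unwrapping before parsing.

```python
import json
from collections import Counter

# Toy payload, assumed to match the shape of the advancedsearch JSON output;
# the records themselves are made up.
toy_payload = json.loads(
    """
{"response": {"numFound": 4, "docs": [
  {"partner": "Datum Data Co. Ltd"},
  {"partner": "Datum Data Co. Ltd"},
  {},
  {}
]}}
"""
)


def count_partners(payload):
    """Tally the partner field across docs; docs without one are counted under None."""
    counts = Counter()
    for doc in payload["response"]["docs"]:
        partner = doc.get("partner")
        # The field may come back as a list; keep the first value in that case.
        if isinstance(partner, list):
            partner = partner[0] if partner else None
        counts[partner] += 1
    return counts


partner_counts = count_partners(toy_payload)
total = sum(partner_counts.values())
share_with_partner = (total - partner_counts[None]) / total
print(partner_counts, share_with_partner)
```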


collections of the books: https://archive.org/advancedsearch.php?q=%22scanningcenter%3A+china%22&fl%5B%5D=collection&rows=5600000&output=json&callback=callback&save=no

https://archive.org/advancedsearch.php?q=%22collection%3A+china%22&fl%5B%5D=scanningcenter&rows=5600000&output=json&callback=callback&save=no


## Cebu Center/ Innodata 

In [9]:
bols = "https://raw.githubusercontent.com/ers6/ia_bols/main/geographic-data/bills-of-lading/combined_ia_bols_manual_dedupe%20-%20deduped_results.csv" 

scans = alt.Chart("https://raw.githubusercontent.com/ers6/ia_bols/main/scan-center-counts/cebu_scan_counts.csv").mark_area(
    color='lightgreen', 
    line=True).encode(
    x=alt.X('yearmonth(month_year):T', axis=alt.Axis(title="Months")),
    y=alt.Y('books_scanned:Q', axis=alt.Axis(title="Books Scanned"))
).transform_timeunit(
    month='yearmonth(month_year)')
select_ship= alt.selection_single(
  
    on="mouseover", nearest=True, fields=["arrival_date"], empty="none"
)


shipments = alt.Chart("https://raw.githubusercontent.com/ers6/ia_bols/main/geographic-data/bills-of-lading/combined_ia_bols_manual_dedupe%20-%20deduped_results.csv").mark_rule(opacity=0.5).encode(
    x='arrival_date:T',
    size = "weight_kg:Q",
    tooltip=["arrival_date:T", "weight_kg:Q", "shipping_port:N", "port_entry:N","goods_shipped:N", "hs_code_detail:N"]
).transform_filter(
alt.datum.supplier_location_id == "hvg_mandaue"
).add_selection(select_ship)



alt.layer(
    scans, shipments
).properties(
    width=800, height=300
).interactive()
# don't load in all the files for the json/csv file 

In [10]:
scans = alt.Chart("https://raw.githubusercontent.com/ers6/ia_bols/main/scan-center-counts/cebu_scan_counts.csv").mark_area(
    color='lightgreen', 
    line=True).encode(
    x=alt.X('yearmonth(month_year):T', axis=alt.Axis(title="Months")),
    y=alt.Y('books_scanned:Q', axis=alt.Axis(title="Books Scanned")),
    tooltip = ['month_year:T', 'books_scanned:Q']
).transform_timeunit(
    month='yearmonth(month_year)')

scans

# pd.read_csv(bols)['supplier_location_id'].unique()

## All together now!

In [11]:
hongkong_scans = alt.Chart("https://raw.githubusercontent.com/ers6/ia_bols/main/scan-center-counts/hongkong_scan_counts.csv").mark_line(
    color='blue').encode(
    x=alt.X('yearmonth(month_year):T', axis=alt.Axis(title="Months")),
    y=alt.Y('books_scanned:Q', axis=alt.Axis(title="Books Scanned"))
).transform_timeunit(
    month='yearmonth(month_year)')


shenzhen_scans = alt.Chart("https://raw.githubusercontent.com/ers6/ia_bols/main/scan-center-counts/shenzhen_scan_counts.csv").mark_line(
    color='red').encode(
    x=alt.X('yearmonth(month_year):T', axis=alt.Axis(title="Months")),
    y=alt.Y('books_scanned:Q', axis=alt.Axis(title="Books Scanned"))
).transform_timeunit(
    month='yearmonth(month_year)')

cebu_scans = alt.Chart("https://raw.githubusercontent.com/ers6/ia_bols/main/scan-center-counts/cebu_scan_counts.csv").mark_line(
    color='green').encode(
    x=alt.X('yearmonth(month_year):T', axis=alt.Axis(title="Months")),
    y=alt.Y('books_scanned:Q', axis=alt.Axis(title="Books Scanned"))
).transform_timeunit(
    month='yearmonth(month_year)')

china_scans = alt.Chart("https://raw.githubusercontent.com/ers6/ia_bols/main/scan-center-counts/china-scandates.csv").mark_line(
    color='red').encode(
    x=alt.X('yearmonth(month_year):T', axis=alt.Axis(title="Months")),
    y=alt.Y('scandate:Q', axis=alt.Axis(title="Books Scanned"))
).transform_timeunit(
    month='yearmonth(month_year)')



hongkong_scans + shenzhen_scans + cebu_scans +china_scans
