In [2]:
import pandas as pd
import altair as alt

# Findings Draft 

This is the findings section of the article. I'm drafting it in this Jupyter notebook to bring the writing and the data together.

## Overview of Findings

Through analysis of the 9 million records combined with oral histories, we found that Internet Archive has outsourced most of its scanning operations to business process outsourcing firms in the global South and that working conditions, on the whole, are poor.

## Scanning Labor Geographies

Since the beginning of its book scanning program in 2004, Internet Archive has shifted its scanning center locations from cultural heritage institutions in the global North to business process outsourcing firms in the global South. With this shift comes a marked increase in the number of books scanned and in the rate at which workers are scanning. In spite of this, Internet Archive rarely mentions these outsourced scanning centers in its blog posts. When we asked IA management about these centers directly, they ended their involvement in this project.

### Location of Work over time 
Workers at 79 unique scanning centers created the 9 million records composing our dataset. The scanning centers include academic libraries (number), government libraries (number), public libraries (number), museums (number), archives (number), and business process outsourcing firms (number). However, out of the total records, 6 million were scanned over the past 6 years at a single business process outsourcing firm in the Philippines, Innodata.
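A minimal sketch of how each center's share of the total could be tallied from the monthly counts. It assumes a table with the `name` and `books_scanned` columns used in the charts in this notebook; the function name `share_of_scans_by_center` and the toy rows are hypothetical and only illustrate the calculation, not actual figures from the dataset.

```python
import pandas as pd


def share_of_scans_by_center(monthly: pd.DataFrame) -> pd.DataFrame:
    """Sum books scanned per center and compute each center's share of the total."""
    totals = (
        monthly.groupby("name", as_index=False)["books_scanned"]
        .sum()
        .sort_values("books_scanned", ascending=False, ignore_index=True)
    )
    totals["share"] = totals["books_scanned"] / totals["books_scanned"].sum()
    return totals


# Toy rows standing in for scans_per_center_per_month.csv (values are made up).
toy = pd.DataFrame(
    {
        "name": ["Center A", "Center A", "Center B", "Center C"],
        "month_year": ["2015-01", "2015-02", "2015-01", "2015-02"],
        "books_scanned": [600, 200, 150, 50],
    }
)

shares = share_of_scans_by_center(toy)
print(shares)
```

Run against the real CSV, the same function would put the largest center in the first row, which makes it easy to check a claim like "6 million of 9 million" directly against the data.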

In [34]:
selection = alt.selection_point(fields=['name'], bind='legend')

scans_per_month_chart = alt.Chart("https://raw.githubusercontent.com/ers6/ia_scanning_labor_data/main/csv_files/scans_per_center_per_month.csv").mark_bar().encode(
    x= alt.X('month_year:T', axis=alt.Axis(labelAngle=-4), title="Months"),
    y= alt.Y('books_scanned:Q', title="Books Scanned"),
#     color=alt.Color('name:N', legend=alt.Legend(columns=8, symbolLimit=0)),
    order=alt.Order('name:N',sort='ascending'),
#     opacity=alt.condition(selection, alt.value(1), alt.value(0.15)),
    color=alt.condition(selection, alt.Color('name:N', legend=alt.Legend(columns=8, symbolLimit=0)), alt.value("black"))
    
).add_params(selection).configure_legend(
  orient='bottom'
).properties(
    # Adjust chart width and height to match size of legend
    width=2000,
    height=400
).interactive()

scans_per_month_chart

### Summary paragraph: where scanning centers are located more broadly and trends over time

### Innodata specific paragraph


In [46]:
# selection = alt.selection_point(fields=['name'], bind='legend')
innodata_scans_chart = alt.Chart("https://raw.githubusercontent.com/ers6/ia_scanning_labor_data/main/csv_files/scans_per_center_per_month.csv").mark_bar().encode(
    x= alt.X('month_year:T', axis=alt.Axis(labelAngle=-4), title="Months"),
    y= alt.Y('books_scanned:Q', title="Books Scanned"),
    order=alt.Order('name:N'),
    color= alt.condition(alt.expr.datum['name'] == 'Innodata Knowledge Services, Inc.', alt.value('red'), alt.value('black'))
).configure_legend(
  orient='bottom'
).properties(
    # Adjust chart width and height to match size of legend
    width=600,
    height=400
).properties(
    title='Books Scanned at Innodata Knowledge Services, Inc. vs. All Books in the Dataset'
)





innodata_scans_chart



In [70]:
# selection = alt.selection_point(fields=['name'], bind='legend')

condition = ['Innodata Knowledge Services, Inc.', 'Datum Data Co. Ltd.',  'Hong Kong']


bpo_scans_vs_total_over_time = alt.Chart("https://raw.githubusercontent.com/ers6/ia_scanning_labor_data/main/csv_files/scans_per_center_per_month.csv").mark_bar().encode(
    x= alt.X('month_year:T', axis=alt.Axis(labelAngle=-4), title="Months"),
    y= alt.Y('books_scanned:Q', title="Books Scanned"),
    order=alt.Order('name:N'),
    color= alt.condition(alt.FieldOneOfPredicate(field='name', oneOf=condition), alt.value('red'), alt.value('black'))
).configure_legend(
  orient='bottom'
).properties(
    # Adjust chart width and height to match size of legend
    width=600,
    height=400
).properties(
    title='Books Scanned at Outsourced Centers vs. All Books in the Dataset'
)

bpo_scans_vs_total_over_time.save('/Users/e.schwartz/Documents/GitHub/ia_scanning_labor_data/center_visuals/bpo_scans_vs_total_over_time.json')
bpo_scans_vs_total_over_time.save('/Users/e.schwartz/Documents/GitHub/ia_scanning_labor_data/center_visuals/bpo_scans_vs_total_over_time.html')

bpo_scans_vs_total_over_time

### Blog posts paragraph and media coverage self-image: Innodata is absent

- Narratives about IA scanning labor: glorified, but focused on scanning workers in the global North.
- Innodata, Datum Data, and the Hong Kong center are conspicuously absent from these romanticized accounts of scanning workers, even though people employed at these scanning centers have scanned the vast majority of the contents of IA's text archive.
- When we asked about these outsourced centers, IA ceased communication with our team and ended its involvement in the project.

## Working conditions 

Working conditions are impossible to account for through data archaeology alone, as they are inherently tied to human beings: their bodies, minds, and material conditions. As such, our analysis of working conditions must rely on worker accounts, and these accounts come disproportionately from workers employed at scanning centers in academic libraries in the global North. While these workers' stories make up the majority of the accounts, they represent a small proportion of the total number of people who work for Internet Archive as book scanners. To make up for this, we bring in analysis of the dataset to gauge a rough idea of turnover and rate of work across all the centers. These metrics are not substitutes for the voices of actual workers, especially people of color and those in the global South.

### Overwork 

- survey data: 
- indeed 
- using the data to approximate overwork. 

We were unable to reach any scanning workers at the vast majority of the scanning centers. Because of this, we have used the dataset to approximate the rate of work at the scanning centers over time and place. The ratio of the pages scanned each month to the number of unique people operating the scanning machinery gives us a metric for the rate of work across the centers. Counting pages rather than books accounts for the fact that books of different lengths take different amounts of time to scan. We should expect to see some variance in the ratio across places and time because scanning archival materials is a much slower process than scanning circulating collections. Still, the ratios at the outsourced scanning centers, including Datum Data and Innodata, increase linearly over time, while the ratios at the other centers do not conform to a linear model (they stay relatively flat). This indicates that workers at the outsourced centers are being subjected to increasing overwork, while the rates of work at the other centers, though still likely high, remain consistent.
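A rough sketch of how the ratio and its linear trend could be computed per center. The `name`, `month_year`, and `pages_to_workers` columns match those used in the chart in this notebook; the `pages_scanned` and `unique_workers` columns, the function names, and the toy values are assumptions for illustration, not the actual pipeline behind the dataset. The fit is an ordinary least-squares line from NumPy's `polyfit`.

```python
import numpy as np
import pandas as pd


def pages_per_worker(monthly: pd.DataFrame) -> pd.DataFrame:
    """Add the ratio of pages scanned to unique workers for each center-month."""
    out = monthly.copy()
    out["pages_to_workers"] = out["pages_scanned"] / out["unique_workers"]
    return out


def linear_trend(group: pd.DataFrame) -> pd.Series:
    """Fit a least-squares line to one center's ratio over elapsed months."""
    dates = pd.to_datetime(group["month_year"])
    month_index = dates.dt.year * 12 + dates.dt.month
    x = (month_index - month_index.min()).to_numpy(dtype=float)
    y = group["pages_to_workers"].to_numpy(dtype=float)
    slope, intercept = np.polyfit(x, y, 1)
    residuals = y - (slope * x + intercept)
    total = ((y - y.mean()) ** 2).sum()
    # R^2 is undefined when the ratio never changes, so report NaN in that case.
    r_squared = 1 - (residuals ** 2).sum() / total if total > 0 else float("nan")
    return pd.Series({"slope": slope, "r_squared": r_squared})


# Toy center-months (values are made up): one rising center, one flat center.
toy = pd.DataFrame(
    {
        "name": ["Rising Center"] * 4 + ["Flat Center"] * 4,
        "month_year": ["2015-01", "2015-02", "2015-03", "2015-04"] * 2,
        "pages_scanned": [1000, 2000, 3000, 4000, 500, 520, 490, 510],
        "unique_workers": [10, 10, 10, 10, 5, 5, 5, 5],
    }
)

ratios = pages_per_worker(toy)
trends = pd.DataFrame(
    {name: linear_trend(group) for name, group in ratios.groupby("name")}
).T
print(trends)
```

A clearly positive slope with a high R² would correspond to the linear increase described above, while a slope near zero corresponds to a flat rate of work. Slope and R² alone cannot distinguish overwork from, for example, a change in the kind of material being scanned, so they are best read alongside the worker accounts.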


In [73]:
selection = alt.selection_point(fields=['name'], bind='legend')

pages_scanned_to_workers_ratio_scatters = alt.Chart("https://raw.githubusercontent.com/ers6/ia_scanning_labor_data/main/csv_files/scans_and_workers_month_stats.csv").mark_circle().encode(
    x= alt.X('month_year:T', axis=alt.Axis(labelAngle=-4), title="Months"),
    y= alt.Y('pages_to_workers:Q', title="Ratio of Pages Scanned to Workers"),
    color=alt.Color('name:N', legend=alt.Legend(columns=8, symbolLimit=0)),
    order=alt.Order('name:N',sort='ascending'),
    opacity=alt.condition(selection, alt.value(1), alt.value(0)),
    tooltip=['name:N', 'pages_to_workers:Q']
).add_params(selection).configure_legend(
  orient='bottom'
).properties(
    # Adjust chart width and height to match size of legend
    width=600,
    height=400
).interactive()

pages_scanned_to_workers_ratio_scatters

### Low pay/bad benefits

### High turnover rates