![AOY Logo](https://raw.githubusercontent.com/BrockDSL/AOYTK/main/AOY_Logo.png)

All Our Yesterdays

A toolkit to explore Web Archives
[Homepage](https://brockdsl.github.io/AOTYK)

# AOY-TK User Notebook

This notebook provides a simplified user interface for producing some basic text derivatives using the Archives Unleashed Toolkit (AUT). 


## Mount Google Drive

Mount Google Drive so that working files and data persist across multiple runs of the notebook.

In [None]:
from google.colab import drive 
drive.mount("/content/drive/")

Mounted at /content/drive/


## Colab Setup Cells
The following cells download and install the dependencies required to run the AUT, and by extension, the AOY-TK. There is quite a bit to download and install so running these cells may take a few minutes. 

In [None]:
%%capture
# Cell magics such as %%capture must be on the first line of the cell.
# This cell downloads and installs the dependencies required to run the AUT
# and sets the environment variables needed for Java, Spark and PySpark.
# It only needs to be run in Colab; when running this script from the Docker image,
# these variables will already have been set appropriately.
!apt-get update
!apt-get install -y openjdk-11-jdk-headless -qq 
!apt-get install maven -qq

!curl -L "https://archive.apache.org/dist/spark/spark-3.1.1/spark-3.1.1-bin-hadoop2.7.tgz" > spark-3.1.1-bin-hadoop2.7.tgz
!tar -xvf spark-3.1.1-bin-hadoop2.7.tgz
!pip install -q findspark

import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-1.11.0-openjdk-amd64"
os.environ["SPARK_HOME"] = "spark-3.1.1-bin-hadoop2.7"
os.environ['PYSPARK_SUBMIT_ARGS'] = '--jars aut-1.1.0-fatjar.jar --py-files aut-1.1.0.zip pyspark-shell'

In [None]:
%%capture
# download the AUT (the %%capture magic must stay on the first line of the cell)
!wget "https://github.com/archivesunleashed/aut/releases/download/aut-1.1.0/aut-1.1.0.zip"
!wget "https://github.com/archivesunleashed/aut/releases/download/aut-1.1.0/aut-1.1.0-fatjar.jar"
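
`PYSPARK_SUBMIT_ARGS` refers to the AUT jar and zip by relative file name, so it can help to confirm that both downloads are present in the current directory before Spark is started. The following is a minimal sketch and not part of the AOY-TK; the helper name `missing_submit_files` is hypothetical.

```python
import os
import shlex


def missing_submit_files(submit_args, base_dir="."):
    """Return the files named after --jars / --py-files that are not in base_dir."""
    tokens = shlex.split(submit_args)
    missing = []
    for flag in ("--jars", "--py-files"):
        if flag not in tokens:
            continue
        idx = tokens.index(flag)
        if idx + 1 >= len(tokens):
            continue
        # Both flags accept a comma-separated list of files.
        for name in tokens[idx + 1].split(","):
            if not os.path.isfile(os.path.join(base_dir, name)):
                missing.append(name)
    return missing


# Check the value set in the setup cell (empty string if it has not been set).
args = os.environ.get("PYSPARK_SUBMIT_ARGS", "")
print("Missing AUT files:", missing_submit_files(args))
```

An empty list suggests that the files named in `PYSPARK_SUBMIT_ARGS` were found; any names that are printed would need to be downloaded again.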

## Workspace Setup and Getting Data
Before running the cell below, ensure that `aoytk.py` has been uploaded to the workspace.

In [None]:
# import the toolkit
import aoytk

In [None]:
# create a DerivativeGenerator object 
dg = aoytk.DerivativeGenerator()

The following cell allows you to specify a working folder. All subsequent paths will be relative to this one. 

In [None]:
# set the default path for reading/writing files (working directory)
aoytk.display_path_select()

Text(value='', description='Folder path:')

Button(description='Submit', style=ButtonStyle())

Folder path set to: /content/drive/MyDrive/AOY/


The following cell can be used to download W/ARC files from the given URL. Files will be saved in the working folder set in the previous cell. 

A sample W/ARC file can be found at the following URL: https://github.com/archivesunleashed/aut-resources/raw/master/Sample-Data/ARCHIVEIT-227-QUARTERLY-XUGECV-20091218231727-00039-crawling06.us.archive.org-8091.warc.gz

In [None]:
# give the option to download file from a specified URL
aoytk.display_download_file()

Text(value='', description='W/ARC URL: ')

Button(description='Download W/ARC', style=ButtonStyle())
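
For readers who prefer to fetch a file without the widget, the sketch below shows one way to do the same step by hand with the Python standard library. How `aoytk.display_download_file()` works internally is not shown in this notebook, so this is an assumption about an equivalent manual download; `filename_from_url` and `download_warc` are hypothetical helper names.

```python
import os
import posixpath
import urllib.request
from urllib.parse import unquote, urlparse


def filename_from_url(url):
    """Return the last path segment of the URL, to be used as the local file name."""
    return posixpath.basename(unquote(urlparse(url).path))


def download_warc(url, folder):
    """Download the file at url into folder and return the local path."""
    os.makedirs(folder, exist_ok=True)
    target = os.path.join(folder, filename_from_url(url))
    urllib.request.urlretrieve(url, target)
    return target


# Example call (requires network access; the folder is the working folder set above):
# download_warc(
#     "https://github.com/archivesunleashed/aut-resources/raw/master/Sample-Data/"
#     "ARCHIVEIT-227-QUARTERLY-XUGECV-20091218231727-00039-crawling06.us.archive.org-8091.warc.gz",
#     "/content/drive/MyDrive/AOY/",
# )
```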

The following cell can be used to create text derivatives from the chosen W/ARC file. The derivative will be saved in the specified output folder. Note that the output folder name must be unique, because Spark will not write output to a folder that already exists. The output folder path is relative to the working folder set earlier in the notebook, so the output folder will be a sub-directory of the working folder. 

In [None]:
dg.display_derivative_creation_options()

Dropdown(description='W/ARC file:', options=('ARCHIVEIT-227-UOFTORONTO-CANPOLPINT-20060622205612-00009-crawlin…

Text(value='output/', description='Output folder:')

Dropdown(description='File type:', options=('csv', 'parquet'), value='csv')

Dropdown(description='Content:', options=('All text content', 'Text content without HTTP headers', 'Text conte…

Button(description='Create derivative', style=ButtonStyle())

Creating derivative file... (this may take several minutes)
Derivative generated, saved to: /content/drive/MyDrive/AOY//all-text-quarter/
Creating derivative file... (this may take several minutes)
Derivative generated, saved to: /content/drive/MyDrive/AOY//no-header-quarter/
Creating derivative file... (this may take several minutes)
Derivative generated, saved to: /content/drive/MyDrive/AOY//no-boiler-quarter/
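
Because each run needs an output folder that does not exist yet, it can be convenient to pick a free name before filling in the "Output folder" field. The sketch below is one possible approach and is not part of the AOY-TK; the helper name `unique_output_folder` and the numeric suffix scheme are assumptions made for illustration.

```python
import os


def unique_output_folder(working_folder, name):
    """Return a path under working_folder that does not exist yet.

    If working_folder/name is taken, a numeric suffix (-1, -2, ...) is appended.
    The folder itself is not created, since Spark creates it when writing.
    """
    # normpath collapses doubled or trailing slashes, e.g. "AOY//output/" -> "AOY/output".
    base = os.path.normpath(os.path.join(working_folder, name))
    candidate = base
    counter = 1
    while os.path.exists(candidate):
        candidate = f"{base}-{counter}"
        counter += 1
    return candidate
```

Calling `unique_output_folder("/content/drive/MyDrive/AOY/", "output/")` would return the plain `output` path on a first run and a suffixed one once that folder exists. The doubled slash in the log lines above is harmless on POSIX file systems; `os.path.normpath` simply tidies it.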
