![AOY Logo](https://raw.githubusercontent.com/BrockDSL/AOYTK/main/AOY_Logo.png)

All Our Yesterdays - A toolkit to explore web archives

[Homepage](https://brockdsl.github.io/AOTYK)

# Derivative Generation

This notebook provides a simplified user interface for producing some basic text derivatives using the Archives Unleashed Toolkit (AUT).


### Load Libraries and Configure Google Drive

The next cell will create the workspace and authorize the connection to Google Drive to save your output.

**Please be patient!** This cell can take up to 2 minutes to run.

In [6]:
print("Loading Libraries")
# This cell downloads and installs the dependencies required to run the AUT
# and sets the environment variables needed by Java, Spark and PySpark.
# It only needs to be run in Colab; when running this notebook from the Docker
# image, these variables have already been set appropriately.
!apt-get -qq update
!apt-get -qq install -y openjdk-11-jdk-headless 
!apt-get -qq install maven 

# download the AUT
!wget "https://github.com/archivesunleashed/aut/releases/download/aut-1.1.0/aut-1.1.0.zip"
!wget "https://github.com/archivesunleashed/aut/releases/download/aut-1.1.0/aut-1.1.0-fatjar.jar"

!curl -L "https://archive.apache.org/dist/spark/spark-3.1.1/spark-3.1.1-bin-hadoop2.7.tgz" > spark-3.1.1-bin-hadoop2.7.tgz
!tar -xf spark-3.1.1-bin-hadoop2.7.tgz
!pip install -q findspark

# set environment variables
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-1.11.0-openjdk-amd64"
os.environ["SPARK_HOME"] = "spark-3.1.1-bin-hadoop2.7"
os.environ['PYSPARK_SUBMIT_ARGS'] = '--jars aut-1.1.0-fatjar.jar --py-files aut-1.1.0.zip pyspark-shell'

# allow access to Google Drive for reading and writing files
from google.colab import drive 
drive.mount("/content/drive/")

# grab a copy of the aoytk helper library and import it
!wget "https://raw.githubusercontent.com/BrockDSL/AOYTK/main/aoytk.py"
import aoytk

print("...Setup Complete.")

Loading Libraries
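
If a later cell fails, it can help to confirm that the setup cell left the environment in the expected state. The following is a minimal sketch and not part of AOYTK: `missing_setup` is a hypothetical helper that reports which of the environment variables and downloaded files named in the setup cell are absent. It assumes the AUT files were downloaded into the current working directory.

```python
import os

# Environment variables and files the setup cell is expected to create.
REQUIRED_VARS = ("JAVA_HOME", "SPARK_HOME", "PYSPARK_SUBMIT_ARGS")
REQUIRED_FILES = ("aut-1.1.0-fatjar.jar", "aut-1.1.0.zip")


def missing_setup(env=None, base_dir="."):
    """Return a list of problems found; an empty list means every check passed.

    Hypothetical helper for illustration only, not an AOYTK or AUT function.
    """
    env = os.environ if env is None else env
    problems = []
    # Each variable must be present and non-empty.
    for name in REQUIRED_VARS:
        if not env.get(name):
            problems.append(f"environment variable {name} is not set")
    # Each downloaded file must exist inside base_dir.
    for filename in REQUIRED_FILES:
        if not os.path.isfile(os.path.join(base_dir, filename)):
            problems.append(f"file {filename} not found in {base_dir}")
    return problems


for problem in missing_setup():
    print("Setup problem:", problem)
```

If nothing is printed, the variables are set and both AUT files are present. This does not verify that Java or Spark actually start.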


### Set your folder

Run the next cell to start AOY and set your Google Drive location.

In [None]:
# create a DerivativeGenerator object 
dg = aoytk.DerivativeGenerator()
# set the default path for reading/writing files (working directory)
aoytk.display_path_select()

The cell above lets you specify a working folder. All subsequent paths are relative to this folder.

The following cell downloads a W/ARC file from a URL that you provide. The file is saved in the working folder set in the previous cell.

A sample W/ARC file can be found at the following URL: https://github.com/archivesunleashed/aut-resources/raw/master/Sample-Data/ARCHIVEIT-227-QUARTERLY-XUGECV-20091218231727-00039-crawling06.us.archive.org-8091.warc.gz

In [None]:
# give the option to download file from a specified URL
aoytk.display_download_file()

Text(value='', description='W/ARC URL: ')

Button(description='Download W/ARC', style=ButtonStyle())
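
For readers who prefer to fetch a file without the widget, the download step can be approximated with the Python standard library. This is a sketch under assumptions, not how `aoytk.display_download_file()` is implemented: `warc_filename_from_url` and `download_warc` are hypothetical helpers, and the local file name is assumed to be the last path segment of the URL.

```python
import os
import posixpath
import urllib.request
from urllib.parse import unquote, urlparse


def warc_filename_from_url(url):
    """Derive a local file name from the last path segment of the URL."""
    name = posixpath.basename(unquote(urlparse(url).path))
    if not name:
        raise ValueError(f"URL has no file name component: {url}")
    return name


def download_warc(url, working_folder):
    """Download url into working_folder and return the local path.

    Hypothetical helper for illustration only, not an AOYTK function.
    """
    os.makedirs(working_folder, exist_ok=True)
    destination = os.path.join(working_folder, warc_filename_from_url(url))
    urllib.request.urlretrieve(url, destination)
    return destination
```

Calling `download_warc(url, working_folder)` with the sample URL above and your own working folder would save the file under its original name, `ARCHIVEIT-227-QUARTERLY-XUGECV-20091218231727-00039-crawling06.us.archive.org-8091.warc.gz`.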

The following cell creates text derivatives from the chosen W/ARC file and saves them in the output folder you specify. The output folder name must be unique: Spark will not write to a folder that already exists or has content in it. The output folder path is relative to the working folder set earlier in the notebook, so the output folder will be a sub-directory of the working folder.

In [None]:
dg.display_derivative_creation_options()

Dropdown(description='W/ARC file:', options=('ARCHIVEIT-227-UOFTORONTO-CANPOLPINT-20060622205612-00009-crawlin…

Text(value='output/', description='Output folder:')

Dropdown(description='File type:', options=('csv', 'parquet'), value='csv')

Dropdown(description='Content:', options=('All text content', 'Text content without HTTP headers', 'Text conte…

Button(description='Create derivative', style=ButtonStyle())

Creating derivative file... (this may take several minutes)
Derivative generated, saved to: /content/drive/MyDrive/AOY//all-text-quarter/
Creating derivative file... (this may take several minutes)
Derivative generated, saved to: /content/drive/MyDrive/AOY//no-header-quarter/
Creating derivative file... (this may take several minutes)
Derivative generated, saved to: /content/drive/MyDrive/AOY//no-boiler-quarter/
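
Because Spark refuses to write to an existing folder, it can be useful to check an output folder name before starting a long-running derivative job. The sketch below shows one way to resolve the output folder against the working folder and fail early. `resolve_output_folder` is a hypothetical helper, not an AOYTK function, and it is deliberately stricter than Spark in that it rejects any existing path, even an empty folder.

```python
import os


def resolve_output_folder(working_folder, output_folder):
    """Join a relative output folder onto the working folder and check it is unused.

    Hypothetical helper for illustration only, not an AOYTK function.
    """
    # normpath also collapses doubled slashes such as "AOY//output".
    path = os.path.normpath(os.path.join(working_folder, output_folder))
    if os.path.exists(path):
        raise FileExistsError(
            f"Output folder already exists, choose another name: {path}"
        )
    return path
```

The function returns the normalized full path when the name is free and raises `FileExistsError` otherwise, so a naming clash surfaces before Spark begins processing the W/ARC file.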
