## Set up AUT and PySpark
This section downloads and installs the dependencies needed to run the Archives Unleashed Toolkit (AUT) and PySpark in Google Colab.

It is taken directly from [this notebook](https://github.com/archivesunleashed/notebooks/blob/main/PySpark%20Examples/aut_pyspark.ipynb) created by the Archives Unleashed team. 

In [1]:
%%capture
# download the AUT
!wget "https://github.com/archivesunleashed/aut/releases/download/aut-1.1.0/aut-1.1.0.zip"
!wget "https://github.com/archivesunleashed/aut/releases/download/aut-1.1.0/aut-1.1.0-fatjar.jar"

In [2]:
%%capture
# download the dependencies (Java, Apache Spark, FindSpark (for PySpark))
!apt-get update
!apt-get install -y openjdk-11-jdk-headless -qq 
!apt-get install maven -qq

!curl -L "https://archive.apache.org/dist/spark/spark-3.1.1/spark-3.1.1-bin-hadoop2.7.tgz" > spark-3.1.1-bin-hadoop2.7.tgz
!tar -xvf spark-3.1.1-bin-hadoop2.7.tgz
!pip install -q findspark

In [3]:
# create the appropriate environment variables to be able to use Java, Spark and PySpark
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-1.11.0-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.1.1-bin-hadoop2.7"
os.environ['PYSPARK_SUBMIT_ARGS'] = '--jars aut-1.1.0-fatjar.jar --py-files aut-1.1.0.zip pyspark-shell'
# initialize PySpark and the appropriate context variables for use with the AUT
import findspark
findspark.init()
import pyspark
sc = pyspark.SparkContext()
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
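
If `findspark.init()` or the `SparkContext` fails to start, one likely cause is that a path set in the cell above does not exist, for example because a download was silently captured by `%%capture`. The sketch below checks the paths used in this notebook. `missing_paths` is a hypothetical helper written for illustration and is not part of AUT or findspark.

```python
import os

def missing_paths(paths):
    """Return the entries of `paths` that do not exist on disk."""
    return [p for p in paths if not os.path.exists(p)]

# Paths assumed from the cells above; adjust them if the versions change.
expected = [
    "/usr/lib/jvm/java-1.11.0-openjdk-amd64",
    "/content/spark-3.1.1-bin-hadoop2.7",
    "aut-1.1.0-fatjar.jar",
    "aut-1.1.0.zip",
]
for path in missing_paths(expected):
    print("missing:", path)
```

Nothing is printed when every path is present.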

In [4]:
%%capture
# borrowing their sample data too for now! (from University of Toronto Libraries, Canadian Political Parties and Interest Groups, Archive-It Collection 227, Canadian Action Party, http://wayback.archive-it.org/227/20051004191340/http://canadianactionparty.ca/Default2.asp)
!mkdir data
!wget "https://github.com/archivesunleashed/aut-resources/blob/master/Sample-Data/ARCHIVEIT-227-UOFTORONTO-CANPOLPINT-20060622205612-00009-crawling025.archive.org.arc.gz?raw=true" -O data/ARCHIVEIT-227-UOFTORONTO-CANPOLPINT-20060622205612-00009-crawling025.archive.org.arc.gz
!wget "https://github.com/archivesunleashed/aut-resources/blob/master/Sample-Data/ARCHIVEIT-227-QUARTERLY-XUGECV-20091218231727-00039-crawling06.us.archive.org-8091.warc.gz?raw=true" -O data/ARCHIVEIT-227-QUARTERLY-XUGECV-20091218231727-00039-crawling06.us.archive.org-8091.warc.gz
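
Because the download cell runs under `%%capture`, a failed or redirected download produces no visible error. One quick sanity check, sketched here under the assumption that the sample files are gzip-compressed as their `.gz` names suggest, is to look at the first two bytes of each file. `looks_like_gzip` is a hypothetical helper, not part of AUT.

```python
import os

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream

def looks_like_gzip(path):
    """Return True if the file at `path` starts with the gzip magic bytes."""
    with open(path, "rb") as handle:
        return handle.read(2) == GZIP_MAGIC

# Report on whatever was downloaded into data/ (does nothing if the folder is absent).
if os.path.isdir("data"):
    for name in sorted(os.listdir("data")):
        path = os.path.join("data", name)
        if os.path.isfile(path):
            print(name, os.path.getsize(path), "bytes, gzip:", looks_like_gzip(path))
```

A file that reports `gzip: False` is probably an HTML error page rather than an archive and should be downloaded again.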

## Produce a text derivative file

Given an ARC or WARC file, produce a text derivative containing the crawl date, domain, URL, and text content of each web page, saved as CSV or Parquet.

In [13]:
# import the AUT
from aut import *
from pyspark.sql.functions import col, desc
import ipywidgets as widgets 

In [14]:
# a messy first guess at derivative generation
def generate_derivative(source_file, output_folder, file_type="csv", text_filters=0):
    # create our WebArchive object from the W/ARC file
    archive = WebArchive(sc, sqlContext, source_file)

    # pick the text column once so every option shares the same write pipeline
    if text_filters == 0: # all content
        content = remove_html("content")
    elif text_filters == 1: # remove HTTP headers
        content = remove_html(remove_http_header("content"))
    else: # remove boilerplate text
        content = extract_boilerplate(remove_http_header("content"))

    # alias the result so all three options write a column named "content"
    archive.webpages() \
        .select("crawl_date", "domain", "url", content.alias("content")) \
        .write \
        .option("timestampFormat", "yyyy/MM/dd HH:mm:ss ZZ") \
        .format(file_type) \
        .option("escape", "\"") \
        .option("encoding", "utf-8") \
        .save(output_folder)
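
Spark's default save mode raises an error when the output path already exists, so running `generate_derivative` twice with the same folder fails only after the archive has been loaded. A small pre-flight check can catch that earlier. The sketch below assumes a single W/ARC file on the local filesystem (not a glob pattern or a remote path). `check_inputs` and `WARC_SUFFIXES` are hypothetical names introduced for illustration.

```python
import os

# File extensions assumed for ARC and WARC inputs.
WARC_SUFFIXES = (".arc", ".arc.gz", ".warc", ".warc.gz")

def check_inputs(source_file, output_folder):
    """Return a list of problems; an empty list means the inputs look usable."""
    problems = []
    if not source_file:
        problems.append("no W/ARC file given")
    elif not os.path.isfile(source_file):
        problems.append("W/ARC file not found: " + source_file)
    elif not source_file.lower().endswith(WARC_SUFFIXES):
        problems.append("unexpected file extension: " + source_file)

    if not output_folder:
        problems.append("no output folder given")
    elif os.path.exists(output_folder):
        # Spark's default save mode refuses to write into an existing path.
        problems.append("output folder already exists: " + output_folder)
    return problems
```

Calling it with the same two path arguments passed to `generate_derivative`, and printing the returned list, gives a readable message instead of a Spark stack trace.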

In [23]:
# text content choices 
content_options = ["All text content", "Text content without HTTP headers", "Text content without boilerplate"]
# *simplest* implementation of input boxes 
input_text = widgets.Text(description="W/ARC file:")
out_text = widgets.Text(description="Output folder:", value="output/")
format_choice = widgets.Dropdown(description="File type:",options=["csv", "parquet"], value="csv")
content_choice = widgets.Dropdown(description="Content:", options=content_options)
display(input_text)
display(out_text)
display(format_choice)
display(content_choice)

Text(value='', description='W/ARC file:')

Text(value='output/', description='Output folder:')

Dropdown(description='File type:', options=('csv', 'parquet'), value='csv')

Dropdown(description='Content:', options=('All text content', 'Text content without HTTP headers', 'Text conte…

In [26]:
# will someday be a button? 
generate_derivative(input_text.value, out_text.value, format_choice.value, content_options.index(content_choice.value))
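
Spark writes the derivative as a folder of `part-*` files rather than a single file. For the CSV option, the rows can be read back with the standard library, as in the sketch below. It assumes the four-column layout selected in `generate_derivative` and no header row, which is Spark's default for CSV output. `read_derivative_rows` is a hypothetical helper, not part of AUT.

```python
import csv
import glob
import itertools
import os

# Page text can exceed the csv module's default 131072-character field limit.
csv.field_size_limit(10**9)

def read_derivative_rows(output_folder):
    """Yield every row from the part-*.csv files found in `output_folder`."""
    for part in sorted(glob.glob(os.path.join(output_folder, "part-*.csv"))):
        # newline="" lets the csv module handle line breaks inside quoted page text
        with open(part, newline="", encoding="utf-8") as handle:
            yield from csv.reader(handle)

# Preview the first three rows (prints nothing if the folder has no CSV parts).
for row in itertools.islice(read_derivative_rows("output/"), 3):
    crawl_date, domain, url, content = row
    print(crawl_date, domain, url, content[:80])
```

Parquet output is not covered by this sketch, since the `csv` module cannot read it.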