## Setup AUT and PySpark
This section downloads and installs the dependencies needed to run the Archives Unleashed Toolkit (AUT) and PySpark within Google Colab.

It is modified from [this notebook](https://github.com/archivesunleashed/notebooks/blob/main/PySpark%20Examples/aut_pyspark.ipynb) created by the Archives Unleashed team. 

In [1]:
%%capture
# download the AUT (the cell magic must be the first line of the cell)
!wget "https://github.com/archivesunleashed/aut/releases/download/aut-1.1.0/aut-1.1.0.zip"
!wget "https://github.com/archivesunleashed/aut/releases/download/aut-1.1.0/aut-1.1.0-fatjar.jar"

In [2]:
%%capture
# download the dependencies (Java, Apache Spark, FindSpark (for PySpark))
!apt-get update
!apt-get install -y openjdk-11-jdk-headless -qq 
!apt-get install maven -qq

!curl -L "https://archive.apache.org/dist/spark/spark-3.1.1/spark-3.1.1-bin-hadoop2.7.tgz" > spark-3.1.1-bin-hadoop2.7.tgz
!tar -xvf spark-3.1.1-bin-hadoop2.7.tgz
!pip install -q findspark

In [3]:
# create the appropriate environment variables to be able to use Java, Spark and PySpark
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-1.11.0-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.1.1-bin-hadoop2.7"
os.environ['PYSPARK_SUBMIT_ARGS'] = '--jars aut-1.1.0-fatjar.jar --py-files aut-1.1.0.zip pyspark-shell'
# initialize PySpark and the appropriate context variables for use with the AUT
import findspark
findspark.init()
import pyspark
sc = pyspark.SparkContext()
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
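
If `findspark.init()` or the `SparkContext` fails to start, a likely cause is that one of the paths set above does not exist, for example because a download was interrupted. The following is a minimal sketch of a sanity check; `missing_paths` is a hypothetical helper (not part of the AUT or findspark) and it only tests whether each path is present on disk.

```python
import os

def missing_paths(paths):
    """Return the entries of `paths` (a name -> path dict) that do not exist on disk."""
    return {name: p for name, p in paths.items() if not os.path.exists(p)}

# These values mirror the ones used in the setup cell; relative paths are
# resolved against the current working directory.
required = {
    "JAVA_HOME": "/usr/lib/jvm/java-1.11.0-openjdk-amd64",
    "SPARK_HOME": "/content/spark-3.1.1-bin-hadoop2.7",
    "AUT jar": "aut-1.1.0-fatjar.jar",
    "AUT zip": "aut-1.1.0.zip",
}

for name, p in missing_paths(required).items():
    print(f"Missing {name}: {p}")
```

If nothing is printed, all four paths were found.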

In [4]:
%%capture
# borrowing their sample data too for now! (from University of Toronto Libraries, Canadian Political Parties and Interest Groups, Archive-It Collection 227, Canadian Action Party, http://wayback.archive-it.org/227/20051004191340/http://canadianactionparty.ca/Default2.asp)
!mkdir data
!wget "https://github.com/archivesunleashed/aut-resources/blob/master/Sample-Data/ARCHIVEIT-227-UOFTORONTO-CANPOLPINT-20060622205612-00009-crawling025.archive.org.arc.gz?raw=true" -O data/ARCHIVEIT-227-UOFTORONTO-CANPOLPINT-20060622205612-00009-crawling025.archive.org.arc.gz
!wget "https://github.com/archivesunleashed/aut-resources/blob/master/Sample-Data/ARCHIVEIT-227-QUARTERLY-XUGECV-20091218231727-00039-crawling06.us.archive.org-8091.warc.gz?raw=true" -O data/ARCHIVEIT-227-QUARTERLY-XUGECV-20091218231727-00039-crawling06.us.archive.org-8091.warc.gz

In [5]:
import ipywidgets as widgets # for creating UI elements 
import os # for accessing files in folders in the generation step
path = '/content/data' # default data folder (absolute, since Colab's working directory is /content)

## Get Data

Before we can produce any derivatives, we first need to make sure we can access our W/ARC files from within the program. Data can either be temporarily saved to and accessed from the session storage in Colab, or stored in your Google Drive so that it persists between sessions.

### Mount Google Drive

To save files and have them persist between sessions, it is necessary to mount your Google Drive. 

This can be done by running the following cells. 

In [6]:
from google.colab import drive
drive.mount('drive/')

Drive already mounted at drive/; to attempt to forcibly remount, call drive.mount("drive/", force_remount=True).


To choose the folder within your Google Drive that you wish to read data from and save data to, enter its path in the text box below.

To get the path, open the `Files` pane to the left of the screen and navigate through the folder structure to the folder in your Google Drive that you wish to read from and write to. (Your Google Drive folders will be accessible under `drive/MyDrive`.) Click on the three dots to the right of the folder name and select `Copy Path`, then paste the path into the text box below.

In [8]:
txt_path = widgets.Text(description="Folder path:")
def btn_set_path(btn): 
    global path
    path = txt_path.value
    print(f"Folder path set to: {path}")
btn_txt_submit = widgets.Button(description="Submit")
btn_txt_submit.on_click(btn_set_path)
display(txt_path)
display(btn_txt_submit)

Text(value='', description='Folder path:')

Button(description='Submit', style=ButtonStyle())

Folder path set to: /content/drive/MyDrive/AOY
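
A pasted path can carry stray whitespace or point at something that is not a folder, which would only surface later when files are listed. As a rough sketch, the path could be checked before it is stored; `check_folder` is a hypothetical helper introduced here for illustration.

```python
import os

def check_folder(folder):
    """Return a normalised absolute path, or raise ValueError if it is not a folder."""
    folder = folder.strip()
    if not folder:
        raise ValueError("No folder path given")
    folder = os.path.abspath(os.path.expanduser(folder))
    if not os.path.isdir(folder):
        raise ValueError(f"Not a folder: {folder}")
    return folder
```

Inside `btn_set_path`, one option would be `path = check_folder(txt_path.value)` wrapped in a `try`/`except ValueError` that prints the message instead of setting the path.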


### Download W/ARC Files From Link

If your W/ARC files are not already available within your Google Drive folder, you can download them by pasting the URL for the file into the box below. 

In [9]:
# Fletcher's code to download a WARC file from a direct link 
import requests
def download_file(url, filepath='', filename=None, loud=True):
  # default to the last segment of the URL, without any query string
  if not filename:
    filename = url.split('/')[-1]
    if "?" in filename: 
        filename = filename.split("?")[0]
  
  r = requests.get(url, stream=True)
  # Content-Length can be missing (e.g. chunked responses); 0 disables the percentage
  content_len = int(r.headers.get('Content-Length', 0))
  if loud:
    total_bytes_dl = 0
    prog_bar = widgets.IntProgress(value=1, min=0, max=100, step=1, bar_style='info',orientation='horizontal')
    print(f'Download progress of {filename}:')
    display(prog_bar)

  with open(filepath + filename, 'wb') as fd:
      for chunk in r.iter_content(chunk_size=4096):
          fd.write(chunk)
          if loud and content_len:
            total_bytes_dl += len(chunk) # count the bytes actually received; the last chunk is usually shorter
            percent = int((total_bytes_dl / content_len) * 100.0)
            prog_bar.value = percent
  r.close()
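
The filename logic in `download_file` splits the URL on `/` before removing the query string, so a query that itself contains a slash would produce the wrong name. One possible alternative, sketched here with the standard library only, parses the URL first; `filename_from_url` is a hypothetical helper and is not part of the code above.

```python
import posixpath
from urllib.parse import urlsplit, unquote

def filename_from_url(url):
    """Return the last path segment of `url`, ignoring any query string or fragment."""
    return posixpath.basename(unquote(urlsplit(url).path))

url = ("https://github.com/archivesunleashed/aut-resources/blob/master/Sample-Data/"
       "ARCHIVEIT-227-UOFTORONTO-CANPOLPINT-20060622205612-00009-crawling025.archive.org.arc.gz?raw=true")
print(filename_from_url(url))
# ARCHIVEIT-227-UOFTORONTO-CANPOLPINT-20060622205612-00009-crawling025.archive.org.arc.gz
```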

In [11]:
txt_url = widgets.Text(description="W/ARC URL: ")
btn_download = widgets.Button(description = "Download W/ARC")
def btn_download_action(btn): 
    url = txt_url.value
    if url != '': 
        download_file(url, path + "/") # download the file to the specified folder set in the above section
    else: 
        print("Please specify a URL in the textbox above.")
btn_download.on_click(btn_download_action)
display(txt_url)
display(btn_download)

Text(value='', description='W/ARC URL: ')

Button(description='Download W/ARC', style=ButtonStyle())

Download progress of ARCHIVEIT-227-QUARTERLY-XUGECV-20091218231727-00039-crawling06.us.archive.org-8091.warc.gz:


IntProgress(value=1, bar_style='info')

## Produce a text derivative file

Given an ARC or WARC file, produce a text derivative. 

In [12]:
# import the AUT
from aut import *
from pyspark.sql.functions import col, desc

In [13]:
# a messy first guess at derivative generation
def generate_derivative(source_file, output_folder, file_type="csv", text_filters=0):
    # create our WebArchive object from the W/ARC file
    archive = WebArchive(sc, sqlContext, source_file)

    # only the text column differs between the filtering options, so pick it first
    if text_filters == 0: # all content
        text_col = remove_html("content")
    elif text_filters == 1: # remove HTTP headers
        text_col = remove_html(remove_http_header("content"))
    else: # remove boilerplate text
        text_col = extract_boilerplate(remove_http_header("content")).alias("content")

    # the write step is the same for every option
    archive.webpages() \
        .select("crawl_date", "domain", "url", text_col) \
        .write \
        .option("timestampFormat", "yyyy/MM/dd HH:mm:ss ZZ") \
        .format(file_type) \
        .option("escape", "\"") \
        .option("encoding", "utf-8") \
        .save(output_folder)
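
By default, Spark's `DataFrameWriter.save` raises an error if the output folder already exists, so running the derivative step twice with the same output folder will fail. A small sketch of one way around this is to pick an unused folder name before writing; `fresh_output_folder` is a hypothetical helper, and the `-1`, `-2`, ... suffix scheme is just an assumption for illustration.

```python
import os

def fresh_output_folder(base):
    """Return `base` if nothing exists there, else the first unused `base-1`, `base-2`, ..."""
    base = base.rstrip("/") or base  # drop a trailing slash such as in "output/"
    candidate, n = base, 0
    while os.path.exists(candidate):
        n += 1
        candidate = f"{base}-{n}"
    return candidate
```

The result could be passed as `output_folder` to `generate_derivative`. If replacing earlier results is acceptable, calling `.mode("overwrite")` on the writer before `.save(...)` is another option.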

In [37]:
def btn_create_deriv(btn): 
    # temporary copy-in, need to address scope issues later
    content_options = ["All text content", "Text content without HTTP headers", "Text content without boilerplate"]
    input_file = path + "/" + file_options.value
    output_location = path + "/" + out_text.value
    content_val = content_options.index(content_choice.value)
    generate_derivative(input_file, output_location, format_choice.value, content_val)
    print("Derivative generated, saved to: " + output_location)

In [38]:
# file picker for W/ARC files in the specified folder
data_files = [x for x in os.listdir(path) if x.endswith((".warc", ".arc", ".warc.gz", ".arc.gz"))]
file_options = widgets.Dropdown(description="W/ARC file:", options =  data_files)
out_text = widgets.Text(description="Output folder:", value="output/")
format_choice = widgets.Dropdown(description="File type:",options=["csv", "parquet"], value="csv")
# text content choices 
content_options = ["All text content", "Text content without HTTP headers", "Text content without boilerplate"]
content_choice = widgets.Dropdown(description="Content:", options=content_options)
content_val = content_options.index(content_choice.value)
button = widgets.Button(description="Create derivative")
button.on_click(btn_create_deriv)
display(file_options)
display(out_text)
display(format_choice)
display(content_choice)
display(button)

Dropdown(description='W/ARC file:', options=('ARCHIVEIT-227-QUARTERLY-XUGECV-20091218231727-00039-crawling06.u…

Text(value='output/', description='Output folder:')

Dropdown(description='File type:', options=('csv', 'parquet'), value='csv')

Dropdown(description='Content:', options=('All text content', 'Text content without HTTP headers', 'Text conte…

Button(description='Create derivative', style=ButtonStyle())

Derivative generated, saved to: /content/drive/MyDrive/AOY/output/something/
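
The location printed above is a folder rather than a single file: Spark writes the derivative as one or more `part-*` files and, with its default settings, adds an empty `_SUCCESS` marker once the job has finished. As a sketch under those assumptions, the folder can be inspected like this; `derivative_parts` is a hypothetical helper.

```python
import glob
import os

def derivative_parts(output_folder):
    """Return (finished, part_files) for a Spark output folder.

    `finished` is True when the `_SUCCESS` marker is present; `part_files`
    is the sorted list of `part-*` files (hidden checksum files are skipped
    because their names start with a dot).
    """
    finished = os.path.exists(os.path.join(output_folder, "_SUCCESS"))
    parts = sorted(glob.glob(os.path.join(output_folder, "part-*")))
    return finished, parts
```

For example, `derivative_parts(output_location)` could be called at the end of `btn_create_deriv` to report how many files were written.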


In [17]:
archive = WebArchive(sc, sqlContext, path + "/ARCHIVEIT-227-QUARTERLY-XUGECV-20091218231727-00039-crawling06.us.archive.org-8091.warc.gz")
archive.webpages() \
    .select("crawl_date", "domain", "url", "content") \
    .write \
    .option("timestampFormat", "yyyy/MM/dd HH:mm:ss ZZ") \
    .format("csv") \
    .option("escape", "\"") \
    .option("encoding", "utf-8") \
    .save("html_out/")