<a href="https://colab.research.google.com/github/archivesunleashed/notebooks/blob/master/aut_pyspark.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Working with AUT and PySpark

## Environment

## Set Up PySpark

The following cells download the Archives Unleashed Toolkit (AUT) 0.50.0 release files, then install Java 8, Maven, Apache Spark 2.4.5, and findspark.


In [0]:
%%capture

!wget "https://github.com/archivesunleashed/aut/releases/download/aut-0.50.0/aut-0.50.0.zip"
!wget "https://github.com/archivesunleashed/aut/releases/download/aut-0.50.0/aut-0.50.0-fatjar.jar"

In [2]:
!ls

aut-0.50.0-fatjar.jar  aut-0.50.0.zip  sample_data


In [0]:
%%capture

!apt-get update
!apt-get install -y openjdk-8-jdk-headless -qq 
!apt-get install maven -qq

!curl -L "https://archive.apache.org/dist/spark/spark-2.4.5/spark-2.4.5-bin-hadoop2.7.tgz" > spark-2.4.5-bin-hadoop2.7.tgz
!tar -xvf spark-2.4.5-bin-hadoop2.7.tgz
!pip install -q findspark

In [0]:
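import os

# Point PySpark at the Java and Spark installations from the previous cell.
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-2.4.5-bin-hadoop2.7"
# Load the AUT fat JAR and its Python bindings when the Spark context starts.
os.environ["PYSPARK_SUBMIT_ARGS"] = "--jars aut-0.50.0-fatjar.jar --py-files aut-0.50.0.zip pyspark-shell"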
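
If the JAR or zip file is missing, the Spark context may fail to start with a confusing error. As a minimal sketch, the submit arguments can be composed and the files checked before Spark is launched. The helper names `build_submit_args` and `missing_files` are hypothetical and not part of AUT or PySpark.

```python
import os


def build_submit_args(jar_path, py_files_path):
    """Compose a PYSPARK_SUBMIT_ARGS value for one JAR and one Python zip."""
    # The trailing "pyspark-shell" token is kept exactly as in the cell above.
    return f"--jars {jar_path} --py-files {py_files_path} pyspark-shell"


def missing_files(*paths):
    """Return the given paths that do not exist on disk."""
    return [p for p in paths if not os.path.exists(p)]


args = build_submit_args("aut-0.50.0-fatjar.jar", "aut-0.50.0.zip")
print(args)
print("Missing:", missing_files("aut-0.50.0-fatjar.jar", "aut-0.50.0.zip"))
```

An empty `Missing:` list suggests both downloads succeeded and the relative paths resolve from the current working directory.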

In [0]:
import findspark
findspark.init()  # Adds the Spark install under SPARK_HOME to sys.path.

import pyspark
from pyspark.sql import SQLContext

sc = pyspark.SparkContext()
sqlContext = SQLContext(sc)

## Data

The `data` directory created in the next cell holds sample data for use with the Archives Unleashed Toolkit. The ARC and WARC files are drawn from the Canadian Political Parties & Political Interest Groups Archive-It Collection, collected by the University of Toronto. We are grateful that they've provided this material to us.

If you use their material, please cite it as follows (in this case, for a website):

    University of Toronto Libraries, Canadian Political Parties and Interest Groups, Archive-It Collection 227, Canadian Action Party, http://wayback.archive-it.org/227/20051004191340/http://canadianactionparty.ca/Default2.asp


In [0]:
%%capture
!mkdir data
!wget "https://github.com/archivesunleashed/aut-resources/blob/master/Sample-Data/ARCHIVEIT-227-UOFTORONTO-CANPOLPINT-20060622205612-00009-crawling025.archive.org.arc.gz?raw=true" -O data/ARCHIVEIT-227-UOFTORONTO-CANPOLPINT-20060622205612-00009-crawling025.archive.org.arc.gz
!wget "https://github.com/archivesunleashed/aut-resources/blob/master/Sample-Data/ARCHIVEIT-227-QUARTERLY-XUGECV-20091218231727-00039-crawling06.us.archive.org-8091.warc.gz?raw=true" -O data/ARCHIVEIT-227-QUARTERLY-XUGECV-20091218231727-00039-crawling06.us.archive.org-8091.warc.gz

In [7]:
!ls data

ARCHIVEIT-227-QUARTERLY-XUGECV-20091218231727-00039-crawling06.us.archive.org-8091.warc.gz
ARCHIVEIT-227-UOFTORONTO-CANPOLPINT-20060622205612-00009-crawling025.archive.org.arc.gz
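
The listing above contains one ARC file and one WARC file. A rough way to tell the two formats apart is by file name, as sketched below. The function `archive_format` is a hypothetical helper and only inspects the name; it does not read the file contents.

```python
def archive_format(filename):
    """Guess the web archive format from a file name (name-based only)."""
    name = filename.lower()
    # Drop a trailing compression suffix before checking the format suffix.
    if name.endswith(".gz"):
        name = name[: -len(".gz")]
    if name.endswith(".warc"):
        return "WARC"
    if name.endswith(".arc"):
        return "ARC"
    return "unknown"


files = [
    "ARCHIVEIT-227-QUARTERLY-XUGECV-20091218231727-00039-crawling06.us.archive.org-8091.warc.gz",
    "ARCHIVEIT-227-UOFTORONTO-CANPOLPINT-20060622205612-00009-crawling025.archive.org.arc.gz",
]
for f in files:
    print(archive_format(f), f)
```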


In [0]:
from aut import *
from pyspark.sql.functions import desc

In [0]:
archive = WebArchive(sc, sqlContext, "data")

In [11]:
webpages = archive.webpages()
webpages.show(10, True)

+----------+--------------------+--------------------+--------------------+--------+--------------------+
|crawl_date|                 url|mime_type_web_server|      mime_type_tika|language|             content|
+----------+--------------------+--------------------+--------------------+--------+--------------------+
|  20060622|http://www.gca.ca...|           text/html|           text/html|      en|HTTP/1.1 200 OK
...|
|  20060622|http://www.ppforu...|           text/html|           text/html|      en|HTTP/1.1 200 OK
...|
|  20060622|http://communist-...|           text/html|           text/html|      en|HTTP/1.1 200 OK
...|
|  20060622|http://www.canada...|           text/html|           text/html|      en|HTTP/1.1 200 OK
...|
|  20060622|http://www.web.ne...|           text/html|           text/html|      en|HTTP/1.1 200 OK
...|
|  20060622|http://www.ccsd.c...|           text/html|           text/html|      fr|HTTP/1.1 200 OK
...|
|  20060622|http://www.policy...|           te
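
Each `content` value above starts with `HTTP/1.1 200 OK`, which suggests the column holds the raw HTTP response, headers included. Assuming that is the case, the sketch below separates the status line, headers, and body of one such string in plain Python. The sample response and the helper `split_http_response` are hypothetical and only illustrate the idea.

```python
import re


def split_http_response(raw):
    """Split a raw HTTP response into (status_line, headers, body)."""
    # Headers end at the first blank line; line endings may be CRLF or LF.
    parts = re.split(r"\r?\n\r?\n", raw, maxsplit=1)
    head = parts[0]
    body = parts[1] if len(parts) > 1 else ""
    lines = head.splitlines()
    status_line = lines[0] if lines else ""
    headers = {}
    for line in lines[1:]:
        name, colon, value = line.partition(":")
        if colon:  # Skip lines that are not "Name: value" pairs.
            headers[name.strip().lower()] = value.strip()
    return status_line, headers, body


# Hypothetical sample; real rows would come from the `content` column.
sample = "HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n<html><body>Hello</body></html>"
status, headers, body = split_http_response(sample)
print(status)
print(headers)
print(body)
```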

In [21]:
links = archive.links()
links.show(10, True)

+----------+--------------------+--------------------+--------------------+
|crawl_date|                 src|                dest|              anchor|
+----------+--------------------+--------------------+--------------------+
|  20060622|http://www.gca.ca...|http://www.cleann...|                    |
|  20060622|http://www.gca.ca...|http://www.quidno...|Quid Novis Intern...|
|  20060622|http://www.ppforu...|http://www.adobe....|                    |
|  20060622|http://www.ppforu...|mailto:kelly.cyr@...|           Kelly Cyr|
|  20060622|http://www.ppforu...|http://www.renouf...|   Renouf Publishing|
|  20060622|http://www.ppforu...|http://bayteksyst...|   bayteksystems.com|
|  20060622|http://communist-...|http://www.calend...|  www.calendarix.com|
|  20060622|http://communist-...|http://www.calend...|                    |
|  20060622|http://communist-...|mailto:webmaster@...|webmaster@calenda...|
|  20060622|http://www.ccsd.c...|http://www.ccsd.c...|                    |
+----------+--------------------+--------------------+--------------------+
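
A common next step with link data is to count how often one domain links to another. The sketch below shows the idea in plain Python on a few rows. The URLs are hypothetical placeholders because the output above truncates the real ones, and `domain_of` is an illustrative helper, not an AUT function.

```python
from collections import Counter
from urllib.parse import urlparse


def domain_of(url):
    """Return the host of a URL without a leading 'www.', or '' if none."""
    host = urlparse(url).hostname or ""
    return host[4:] if host.startswith("www.") else host


# Hypothetical (src, dest) pairs shaped like the rows above.
link_rows = [
    ("http://www.example.org/index.html", "http://www.example.com/a"),
    ("http://www.example.org/about.html", "http://example.com/b"),
    ("http://www.example.org/about.html", "mailto:someone@example.net"),
]

# Count domain-to-domain pairs, skipping destinations with no host (e.g. mailto:).
counts = Counter(
    (domain_of(src), domain_of(dest))
    for src, dest in link_rows
    if domain_of(dest)
)
for (src, dest), n in counts.most_common():
    print(src, "->", dest, n)
```

On the full DataFrame the same aggregation would more likely be done with Spark itself, for example by grouping and sorting with `desc`, which this notebook already imports.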

In [22]:
images = archive.images()
images.show(10, True)

+----------+--------------------+--------------------+---------+--------------------+--------------+-----+------+--------------------+--------------------+--------------------+
|crawl_date|                 url|            filename|extension|mime_type_web_server|mime_type_tika|width|height|                 md5|                sha1|               bytes|
+----------+--------------------+--------------------+---------+--------------------+--------------+-----+------+--------------------+--------------------+--------------------+
|  20060622|http://coat.ncf.c...|         smith_a.jpg|      jpg|          image/jpeg|    image/jpeg|  211|   316|f74e58e4d894d7825...|1def0d1954d7e88cc...|/9j/4AAQSkZJRgABA...|
|  20060622|http://cpcml.ca/i...|060501JakartaIndo...|      jpg|          image/jpeg|    image/jpeg|  550|   336|10bb4e6a8a6425f56...|5ee9c80690172e05b...|/9j/4AAQSkZJRgABA...|
|  20060622|http://liberal.ca...|           35082.jpg|      jpg|          image/jpeg|    image/jpeg|  150|   214|53
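
The `bytes` values above begin with `/9j/4AAQSkZJRgABA`, which looks like Base64-encoded JPEG data. Assuming the column is Base64, the sketch below decodes just the visible prefix and checks it against a few well-known file signatures. The helper `sniff_image_type` is hypothetical.

```python
import base64

# Leading bytes ("magic numbers") of some common image formats.
SIGNATURES = [
    (b"\xff\xd8\xff", "JPEG"),
    (b"\x89PNG\r\n\x1a\n", "PNG"),
    (b"GIF8", "GIF"),
]


def sniff_image_type(b64_prefix):
    """Guess an image type from the start of a Base64 string."""
    # Base64 decodes in 4-character groups, so trim to a multiple of 4.
    usable = len(b64_prefix) - len(b64_prefix) % 4
    head = base64.b64decode(b64_prefix[:usable])
    for magic, label in SIGNATURES:
        if head.startswith(magic):
            return label
    return "unknown"


# Prefix copied from the `bytes` column shown above.
print(sniff_image_type("/9j/4AAQSkZJRgABA"))  # JPEG
```

Decoding the complete value with `base64.b64decode` would give the full image bytes, which could then be written to a file.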

In [13]:
image_links = archive.image_links()
image_links.show(10, True)

+----------+--------------------+--------------------+
|crawl_date|                 src|           image_url|
+----------+--------------------+--------------------+
|  20060622|http://www.gca.ca...|http://www.gca.ca...|
|  20060622|http://www.gca.ca...|http://www.gca.ca...|
|  20060622|http://www.gca.ca...|http://www.gca.ca...|
|  20060622|http://www.gca.ca...|http://www.gca.ca...|
|  20060622|http://www.gca.ca...|http://www.gca.ca...|
|  20060622|http://www.gca.ca...|http://www.gca.ca...|
|  20060622|http://www.gca.ca...|http://www.gca.ca...|
|  20060622|http://www.gca.ca...|http://www.gca.ca...|
|  20060622|http://www.gca.ca...|http://www.gca.ca...|
|  20060622|http://www.gca.ca...|http://www.gca.ca...|
+----------+--------------------+--------------------+
only showing top 10 rows
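
The `crawl_date` column is shown as an eight-digit string such as `20060622`. As a small sketch, assuming the layout is year, month, day, it can be turned into a Python `date` for sorting or grouping. `parse_crawl_date` is a hypothetical helper name.

```python
from datetime import date, datetime


def parse_crawl_date(value):
    """Parse a YYYYMMDD crawl date string into a datetime.date."""
    return datetime.strptime(value, "%Y%m%d").date()


crawl = parse_crawl_date("20060622")
print(crawl.isoformat())  # 2006-06-22
```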



In [14]:
pdfs = archive.pdfs()
pdfs.show(10, True)

+----------+--------------------+--------------------+---------+--------------------+---------------+--------------------+--------------------+--------------------+
|crawl_date|                 url|            filename|extension|mime_type_web_server| mime_type_tika|                 md5|                sha1|               bytes|
+----------+--------------------+--------------------+---------+--------------------+---------------+--------------------+--------------------+--------------------+
|  20060622|http://partimarij...|Massicotti_Affida...|      pdf|     application/pdf|application/pdf|4daa676e867d0ac65...|5d7c895db1b592aaa...|JVBERi0xLjIKJSDi4...|
|  20060622|http://www.web.ne...|      securityqs.PDF|      pdf|     application/pdf|application/pdf|eadd48d19fd55e103...|2ab70423309828af9...|JVBERi0xLjIgDQol4...|
|  20060622|http://partimarij...|           Ewing.pdf|      pdf|     application/pdf|application/pdf|8e43fec319e76e0f5...|930cd5ecf521b2b8f...|JVBERi0xLjIKJSDi4...|
|  2006062
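
The `md5` and `sha1` columns can serve as integrity checks for extracted files. The sketch below assumes they are hexadecimal digests of the Base64-decoded `bytes` column; that is an assumption, not something this notebook confirms. Because the digests above are truncated, the example uses a hypothetical stand-in payload, and `checksums` and `matches` are illustrative helpers.

```python
import base64
import hashlib


def checksums(b64_payload):
    """Return (md5_hex, sha1_hex) of a Base64-encoded payload."""
    raw = base64.b64decode(b64_payload)
    return hashlib.md5(raw).hexdigest(), hashlib.sha1(raw).hexdigest()


def matches(b64_payload, md5_hex, sha1_hex):
    """Check a payload against the expected hexadecimal digests."""
    return checksums(b64_payload) == (md5_hex.lower(), sha1_hex.lower())


# Stand-in payload: a PDF header line, which encodes to the same leading
# characters ("JVBERi0xLjIK") seen in the `bytes` column above.
sample = base64.b64encode(b"%PDF-1.2\n").decode("ascii")
md5_hex, sha1_hex = checksums(sample)
print(sample)
print(md5_hex, sha1_hex)
print(matches(sample, md5_hex, sha1_hex))
```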

In [15]:
audio = archive.audio()
audio.show(10, True)

+----------+--------------------+--------------+---------+--------------------+--------------------+--------------------+--------------------+--------------------+
|crawl_date|                 url|      filename|extension|mime_type_web_server|      mime_type_tika|                 md5|                sha1|               bytes|
+----------+--------------------+--------------+---------+--------------------+--------------------+--------------------+--------------------+--------------------+
|  20060622|http://www.canadi...|   COLLINS1.RA|       ra|   audio/x-realaudio|audio/x-pn-realaudio|0128cb24f439f13a7...|ff1f9fdc00805d8fe...|LnJh/QAEAAAucmE0A...|
|  20060622|http://www.animal...|2006-01-13.mp3|      mp3|          audio/mpeg|          audio/mpeg|e4b3825ea1ecae26d...|990919d05d6cd4bdb...|//NAxAkWuVKkX9gQA...|
+----------+--------------------+--------------+---------+--------------------+--------------------+--------------------+--------------------+--------------------+
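
The two MIME type columns do not always agree. In the table above, `COLLINS1.RA` is reported as `audio/x-realaudio` by the web server but detected as `audio/x-pn-realaudio` by Tika, and the `video()` output in this notebook shows `HomaCBClQ.WMV` served as `text/plain`. A minimal sketch for flagging such rows in plain Python follows; the tuples are copied from this notebook's outputs, and `mime_mismatches` is a hypothetical helper.

```python
# (filename, mime_type_web_server, mime_type_tika) taken from the outputs shown.
rows = [
    ("COLLINS1.RA", "audio/x-realaudio", "audio/x-pn-realaudio"),
    ("2006-01-13.mp3", "audio/mpeg", "audio/mpeg"),
    ("HomaCBClQ.WMV", "text/plain", "video/x-ms-wmv"),
]


def mime_mismatches(rows):
    """Return rows where the server-reported and detected MIME types differ."""
    return [r for r in rows if r[1].lower() != r[2].lower()]


for filename, server, tika in mime_mismatches(rows):
    print(f"{filename}: server says {server}, Tika says {tika}")
```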



In [16]:
video = archive.video()
video.show(10, True)

+----------+--------------------+--------------------+---------+--------------------+--------------+--------------------+--------------------+--------------------+
|crawl_date|                 url|            filename|extension|mime_type_web_server|mime_type_tika|                 md5|                sha1|               bytes|
+----------+--------------------+--------------------+---------+--------------------+--------------+--------------------+--------------------+--------------------+
|  20060622|http://www.bloc.o...|            bloc.wmv|      wmv|      video/x-ms-wmv|video/x-ms-wmv|fc16dd3c9c289a7ce...|1a77f9f3d9b18d31a...|MCaydY5mzxGm2QCqA...|
|  20060622|http://www.noshar...|       HomaCBClQ.WMV|      wmv|          text/plain|video/x-ms-wmv|ef89c319f8ccd119a...|46e34725a78df33d0...|MCaydY5mzxGm2QCqA...|
|  20060622|http://www.bloc.o...|        16juin02.wmv|      wmv|      video/x-ms-wmv|video/x-ms-wmv|5b49c2b15ec631516...|b9ea03cbf9b3dcf96...|MCaydY5mzxGm2QCqA...|
|  20060622|http

In [17]:
spreadsheets = archive.spreadsheets()
spreadsheets.show(10, True)

+----------+---+--------+---------+--------------------+--------------+---+----+-----+
|crawl_date|url|filename|extension|mime_type_web_server|mime_type_tika|md5|sha1|bytes|
+----------+---+--------+---------+--------------------+--------------+---+----+-----+
+----------+---+--------+---------+--------------------+--------------+---+----+-----+



In [18]:
presentation_program_files = archive.presentation_program()
presentation_program_files.show(10, True)

+----------+--------------------+--------+---------+--------------------+--------------------+--------------------+--------------------+--------------------+
|crawl_date|                 url|filename|extension|mime_type_web_server|      mime_type_tika|                 md5|                sha1|               bytes|
+----------+--------------------+--------+---------+--------------------+--------------------+--------------------+--------------------+--------------------+
|  20060622|http://www.afn.ca...| aig.pps|      pps|application/vnd.m...|application/vnd.m...|f38d64504487dd373...|b7d60930a981e2bc2...|0M8R4KGxGuEAAAAAA...|
+----------+--------------------+--------+---------+--------------------+--------------------+--------------------+--------------------+--------------------+



In [19]:
word_processor_files = archive.word_processor()
word_processor_files.show(10, True)

+----------+--------------------+--------------------+---------+--------------------+------------------+--------------------+--------------------+--------------------+
|crawl_date|                 url|            filename|extension|mime_type_web_server|    mime_type_tika|                 md5|                sha1|               bytes|
+----------+--------------------+--------------------+---------+--------------------+------------------+--------------------+--------------------+--------------------+
|  20060622|http://canadianac...|Some_facts_about_...|      doc|  application/msword|application/msword|f35c8570d81a0f4f5...|64378b21c8ea6bce5...|0M8R4KGxGuEAAAAAA...|
|  20060622|http://www.nawl.c...|Pub_Brief_Antiter...|      doc|  application/msword|application/msword|b0528837322957073...|35f8fdc77d6e92b40...|0M8R4KGxGuEAAAAAA...|
|  20060622|http://www.equalv...|       layton-en.doc|      doc|  application/msword|application/msword|3c28c798bfcc25ffe...|f9a0f96ab31de9cdd...|0M8R4KGxGuEAAA

In [20]:
text_files = archive.text_files()
text_files.show(10, True)

+----------+--------------------+--------------------+---------+--------------------+--------------+--------------------+--------------------+--------------------+
|crawl_date|                 url|            filename|extension|mime_type_web_server|mime_type_tika|                 md5|                sha1|               bytes|
+----------+--------------------+--------------------+---------+--------------------+--------------+--------------------+--------------------+--------------------+
|  20060622|http://agoracosmo...|            html.mes|      mes|          text/plain|    text/plain|7dee99acd58abc66c...|096c51df2828f49b8...|PGJyPg0KPHA+PGZvb...|
|  20060622|http://agoracosmo...|           html.mes1|     mes1|          text/plain|    text/plain|58c9b7de5042206c3...|ede3e2b202a8f61d5...|PGJyPg0KPHA+PGZvb...|
|  20060622|http://www.noshar...|       HomaCBClQ.WMV|      WMV|          text/plain|video/x-ms-wmv|ef89c319f8ccd119a...|46e34725a78df33d0...|MCaydY5mzxGm2QCqA...|
|  20060622|http