<a href="https://colab.research.google.com/github/archivesunleashed/notebooks/blob/main/PySpark%20Examples/aut_pyspark.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Working with AUT and PySpark

## Environment

## Setup PySpark

The following commands download the AUT 1.1.0 release artifacts, install Java 11, Maven, and Spark 3.1.1, and then configure PySpark to load AUT.


In [1]:
%%capture

!wget "https://github.com/archivesunleashed/aut/releases/download/aut-1.1.0/aut-1.1.0.zip"
!wget "https://github.com/archivesunleashed/aut/releases/download/aut-1.1.0/aut-1.1.0-fatjar.jar"

In [4]:
!ls

aut-1.1.0-fatjar.jar  sample_data		 spark-3.1.1-bin-hadoop2.7.tgz
aut-1.1.0.zip	      spark-3.1.1-bin-hadoop2.7


In [3]:
%%capture

!apt-get update
!apt-get install -y openjdk-11-jdk-headless -qq 
!apt-get install maven -qq

!curl -L "https://archive.apache.org/dist/spark/spark-3.1.1/spark-3.1.1-bin-hadoop2.7.tgz" > spark-3.1.1-bin-hadoop2.7.tgz
!tar -xvf spark-3.1.1-bin-hadoop2.7.tgz
!pip install -q findspark

In [5]:
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-1.11.0-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.1.1-bin-hadoop2.7"
os.environ['PYSPARK_SUBMIT_ARGS'] = '--jars aut-1.1.0-fatjar.jar --py-files aut-1.1.0.zip pyspark-shell'
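
Before initializing Spark, it can help to confirm that the paths set above actually exist, since a wrong `JAVA_HOME` or `SPARK_HOME` tends to surface later as a less obvious error. The following is a minimal sketch using only the standard library; `missing_paths` is a hypothetical helper name, not part of AUT or findspark.

```python
import os


def missing_paths(env=os.environ, names=("JAVA_HOME", "SPARK_HOME")):
    """Return the variable names that are unset or do not point to a directory.

    Hypothetical helper for illustration; not part of AUT or findspark.
    """
    return [name for name in names if not os.path.isdir(env.get(name, ""))]


# An empty list means both variables point at existing directories.
print(missing_paths())
```

If the list is not empty, re-check the download and extraction cells before calling `findspark.init()`.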

In [6]:
import findspark
findspark.init()
import pyspark
sc = pyspark.SparkContext()
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)

## Data

The cells below download sample data that you might want to use with the Archives Unleashed Toolkit. The ARC and WARC files are drawn from the Canadian Political Parties & Political Interest Groups Archive-It Collection, collected by the University of Toronto. We are grateful that they've provided this material to us.

If you use their material, please cite it along the following lines (this example cites a single website):

    University of Toronto Libraries, Canadian Political Parties and Interest Groups, Archive-It Collection 227, Canadian Action Party, http://wayback.archive-it.org/227/20051004191340/http://canadianactionparty.ca/Default2.asp


In [7]:
%%capture
!mkdir data
!wget "https://github.com/archivesunleashed/aut-resources/blob/master/Sample-Data/ARCHIVEIT-227-UOFTORONTO-CANPOLPINT-20060622205612-00009-crawling025.archive.org.arc.gz?raw=true" -O data/ARCHIVEIT-227-UOFTORONTO-CANPOLPINT-20060622205612-00009-crawling025.archive.org.arc.gz
!wget "https://github.com/archivesunleashed/aut-resources/blob/master/Sample-Data/ARCHIVEIT-227-QUARTERLY-XUGECV-20091218231727-00039-crawling06.us.archive.org-8091.warc.gz?raw=true" -O data/ARCHIVEIT-227-QUARTERLY-XUGECV-20091218231727-00039-crawling06.us.archive.org-8091.warc.gz

In [8]:
!ls data

ARCHIVEIT-227-QUARTERLY-XUGECV-20091218231727-00039-crawling06.us.archive.org-8091.warc.gz
ARCHIVEIT-227-UOFTORONTO-CANPOLPINT-20060622205612-00009-crawling025.archive.org.arc.gz
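
The listing only shows that the files exist. Because the download URLs point at GitHub `blob` pages with `?raw=true`, a failed redirect could leave an HTML page saved under a `.gz` name. A minimal sketch that checks the gzip magic bytes follows; `looks_like_gzip` and `check_dir` are hypothetical helper names.

```python
import os

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def looks_like_gzip(path):
    """Return True if the file starts with the gzip magic bytes."""
    with open(path, "rb") as fh:
        return fh.read(2) == GZIP_MAGIC


def check_dir(directory):
    """Map each *.gz file name in `directory` to whether it looks like gzip."""
    return {
        name: looks_like_gzip(os.path.join(directory, name))
        for name in sorted(os.listdir(directory))
        if name.endswith(".gz")
    }


# Only inspect the notebook's data directory if it is present.
if os.path.isdir("data"):
    print(check_dir("data"))
```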


In [24]:
from aut import *
from pyspark.sql.functions import col, desc

In [10]:
archive = WebArchive(sc, sqlContext, "data")

In [11]:
webpages = archive.webpages()
webpages.show(10, True)

+--------------+--------------------+--------------------+--------------------+--------------------+--------+--------------------+
|    crawl_date|              domain|                 url|mime_type_web_server|      mime_type_tika|language|             content|
+--------------+--------------------+--------------------+--------------------+--------------------+--------+--------------------+
|20060622205609|              gca.ca|http://www.gca.ca...|           text/html|           text/html|      en|Green Communities...|
|20060622205609|         ppforum.com|http://www.ppforu...|           text/html|           text/html|      en|Speeches - Public...|
|20060622205609|  communist-party.ca|http://communist-...|           text/html|           text/html|      en|Calendar CPC Comi...|
|20060622205610|     canadafirst.net|http://www.canada...|           text/html|           text/html|      en|TERRORIST DESTINA...|
|20060622205610|             web.net|http://www.web.ne...|           text/html|    
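
The `crawl_date` values shown above are 14-digit timestamps in year, month, day, hour, minute, second order. A minimal sketch for turning one into a Python `datetime`, assuming every value has that 14-digit shape (`parse_crawl_date` is a hypothetical helper name):

```python
from datetime import datetime


def parse_crawl_date(value):
    """Parse a 14-digit crawl timestamp such as '20060622205609'."""
    return datetime.strptime(value, "%Y%m%d%H%M%S")


print(parse_crawl_date("20060622205609").isoformat())  # 2006-06-22T20:56:09
```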

In [27]:
webgraph = archive.webgraph()
webgraph.show(10, True)

+--------------+--------------------+--------------------+------+
|    crawl_date|                 src|                dest|anchor|
+--------------+--------------------+--------------------+------+
|20060622205609|http://www.gca.ca...|http://www.gca.ca...|      |
|20060622205609|http://www.gca.ca...|http://www.gca.ca...|      |
|20060622205609|http://www.gca.ca...|http://www.gca.ca...|      |
|20060622205609|http://www.gca.ca...|http://www.gca.ca...|      |
|20060622205609|http://www.gca.ca...|http://www.gca.ca...|      |
|20060622205609|http://www.gca.ca...|http://www.gca.ca...|      |
|20060622205609|http://www.gca.ca...|http://www.gca.ca...|      |
|20060622205609|http://www.gca.ca...|http://www.gca.ca...|      |
|20060622205609|http://www.gca.ca...|http://www.gca.ca...|      |
|20060622205609|http://www.gca.ca...|http://www.gca.ca...|      |
+--------------+--------------------+--------------------+------+
only showing top 10 rows



In [14]:
images = archive.images()
images.show(10, True)

+--------------+--------------------+--------------------+---------+--------------------+--------------+-----+------+--------------------+--------------------+--------------------+
|    crawl_date|                 url|            filename|extension|mime_type_web_server|mime_type_tika|width|height|                 md5|                sha1|               bytes|
+--------------+--------------------+--------------------+---------+--------------------+--------------+-----+------+--------------------+--------------------+--------------------+
|20060622205609|http://coat.ncf.c...|         smith_a.jpg|      jpg|          image/jpeg|    image/jpeg|  211|   316|f74e58e4d894d7825...|1def0d1954d7e88cc...|/9j/4AAQSkZJRgABA...|
|20060622205610|http://cpcml.ca/i...|060501JakartaIndo...|      jpg|          image/jpeg|    image/jpeg|  550|   336|10bb4e6a8a6425f56...|5ee9c80690172e05b...|/9j/4AAQSkZJRgABA...|
|20060622205610|http://liberal.ca...|           35082.jpg|      jpg|          image/jpeg|    im
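
The `bytes` values above start with `/9j/`, which is what JPEG data looks like once it is base64-encoded. A minimal sketch of decoding such a value back to raw bytes, assuming the column is base64 text; the helper names are hypothetical and the sample string is built here rather than taken from the archive.

```python
import base64

JPEG_MAGIC = b"\xff\xd8\xff"  # leading bytes of a JPEG file


def decode_bytes_column(value):
    """Decode a base64 string (as seen in the `bytes` column) to raw bytes."""
    return base64.b64decode(value)


def is_jpeg(raw):
    """Return True if the raw bytes start with the JPEG magic bytes."""
    return raw.startswith(JPEG_MAGIC)


# Stand-in for one cell of the `bytes` column (not real archive data).
sample = base64.b64encode(b"\xff\xd8\xff\xe0" + b"\x00" * 8).decode("ascii")
print(sample[:4], is_jpeg(decode_bytes_column(sample)))
```

In the notebook, a real value could be pulled with something like `images.select("bytes").first()["bytes"]` and the decoded bytes written to a file opened in binary mode.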

In [15]:
image_links = archive.imagegraph()
image_links.show(10, True)

+--------------+--------------------+--------------------+--------------------+
|    crawl_date|                 src|           image_url|            alt_text|
+--------------+--------------------+--------------------+--------------------+
|20060622205609|http://www.gca.ca...|http://www.gca.ca...|Picture of the Gr...|
|20060622205609|http://www.gca.ca...|http://www.gca.ca...|                    |
|20060622205609|http://www.gca.ca...|http://www.gca.ca...|          Contact Us|
|20060622205609|http://www.gca.ca...|http://www.gca.ca...|            Site Map|
|20060622205609|http://www.gca.ca...|http://www.gca.ca...|      Privacy Policy|
|20060622205609|http://www.gca.ca...|http://www.gca.ca...|                    |
|20060622205609|http://www.gca.ca...|http://www.gca.ca...|                    |
|20060622205609|http://www.gca.ca...|http://www.gca.ca...|                    |
|20060622205609|http://www.gca.ca...|http://www.gca.ca...|   About Our Members|
|20060622205609|http://www.gca.ca...|htt

In [16]:
pdfs = archive.pdfs()
pdfs.show(10, True)

+--------------+--------------------+--------------------+---------+--------------------+---------------+--------------------+--------------------+--------------------+
|    crawl_date|                 url|            filename|extension|mime_type_web_server| mime_type_tika|                 md5|                sha1|               bytes|
+--------------+--------------------+--------------------+---------+--------------------+---------------+--------------------+--------------------+--------------------+
|20060622205611|http://partimarij...|Massicotti_Affida...|      pdf|     application/pdf|application/pdf|4daa676e867d0ac65...|5d7c895db1b592aaa...|JVBERi0xLjIKJSDi4...|
|20060622205613|http://www.web.ne...|      securityqs.PDF|      pdf|     application/pdf|application/pdf|eadd48d19fd55e103...|2ab70423309828af9...|JVBERi0xLjIgDQol4...|
|20060622205615|http://partimarij...|           Ewing.pdf|      pdf|     application/pdf|application/pdf|8e43fec319e76e0f5...|930cd5ecf521b2b8f...|JVBERi0x

In [17]:
audio = archive.audio()
audio.show(10, True)

+--------------+--------------------+--------------+---------+--------------------+--------------------+--------------------+--------------------+--------------------+
|    crawl_date|                 url|      filename|extension|mime_type_web_server|      mime_type_tika|                 md5|                sha1|               bytes|
+--------------+--------------------+--------------+---------+--------------------+--------------------+--------------------+--------------------+--------------------+
|20060622205553|http://www.canadi...|   COLLINS1.RA|       ra|   audio/x-realaudio|audio/x-pn-realaudio|0128cb24f439f13a7...|ff1f9fdc00805d8fe...|LnJh/QAEAAAucmE0A...|
|20060622204615|http://www.animal...|2006-01-13.mp3|      mp3|          audio/mpeg|          audio/mpeg|e4b3825ea1ecae26d...|990919d05d6cd4bdb...|//NAxAkWuVKkX9gQA...|
+--------------+--------------------+--------------+---------+--------------------+--------------------+--------------------+--------------------+--------------

In [18]:
video = archive.video()
video.show(10, True)

+--------------+--------------------+--------------------+---------+--------------------+--------------+--------------------+--------------------+--------------------+
|    crawl_date|                 url|            filename|extension|mime_type_web_server|mime_type_tika|                 md5|                sha1|               bytes|
+--------------+--------------------+--------------------+---------+--------------------+--------------+--------------------+--------------------+--------------------+
|20060622205625|http://www.bloc.o...|            bloc.wmv|      wmv|      video/x-ms-wmv|video/x-ms-wmv|fc16dd3c9c289a7ce...|1a77f9f3d9b18d31a...|MCaydY5mzxGm2QCqA...|
|20060622205657|http://www.noshar...|       HomaCBClQ.WMV|      wmv|          text/plain|video/x-ms-wmv|ef89c319f8ccd119a...|46e34725a78df33d0...|MCaydY5mzxGm2QCqA...|
|20060622205643|http://www.bloc.o...|        16juin02.wmv|      wmv|      video/x-ms-wmv|video/x-ms-wmv|5b49c2b15ec631516...|b9ea03cbf9b3dcf96...|MCaydY5mzxGm2Q

In [19]:
spreadsheets = archive.spreadsheets()
spreadsheets.show(10, True)

+----------+---+--------+---------+--------------------+--------------+---+----+-----+
|crawl_date|url|filename|extension|mime_type_web_server|mime_type_tika|md5|sha1|bytes|
+----------+---+--------+---------+--------------------+--------------+---+----+-----+
+----------+---+--------+---------+--------------------+--------------+---+----+-----+



In [20]:
presentation_program_files = archive.presentation_program()
presentation_program_files.show(10, True)

+--------------+--------------------+--------+---------+--------------------+--------------------+--------------------+--------------------+--------------------+
|    crawl_date|                 url|filename|extension|mime_type_web_server|      mime_type_tika|                 md5|                sha1|               bytes|
+--------------+--------------------+--------+---------+--------------------+--------------------+--------------------+--------------------+--------------------+
|20060622205642|http://www.afn.ca...| aig.pps|      pps|application/vnd.m...|application/vnd.m...|f38d64504487dd373...|b7d60930a981e2bc2...|0M8R4KGxGuEAAAAAA...|
+--------------+--------------------+--------+---------+--------------------+--------------------+--------------------+--------------------+--------------------+



In [21]:
word_processor_files = archive.word_processor()
word_processor_files.show(10, True)

+--------------+--------------------+--------------------+---------+--------------------+------------------+--------------------+--------------------+--------------------+
|    crawl_date|                 url|            filename|extension|mime_type_web_server|    mime_type_tika|                 md5|                sha1|               bytes|
+--------------+--------------------+--------------------+---------+--------------------+------------------+--------------------+--------------------+--------------------+
|20060622205609|http://canadianac...|Some_facts_about_...|      doc|  application/msword|application/msword|f35c8570d81a0f4f5...|64378b21c8ea6bce5...|0M8R4KGxGuEAAAAAA...|
|20060622205609|http://www.nawl.c...|Pub_Brief_Antiter...|      doc|  application/msword|application/msword|b0528837322957073...|35f8fdc77d6e92b40...|0M8R4KGxGuEAAAAAA...|
|20060622205629|http://www.equalv...|       layton-en.doc|      doc|  application/msword|application/msword|3c28c798bfcc25ffe...|f9a0f96ab31

In [25]:
# Domain frequency.
webpages.groupBy("domain") \
  .count() \
  .sort(col("count").desc()) \
  .show(10, True)

+--------------------+-----+
|              domain|count|
+--------------------+-----+
|       equalvoice.ca| 4280|
|          liberal.ca| 1981|
|     davidsuzuki.org|  619|
|policyalternative...|  588|
|       greenparty.ca|  535|
|         fairvote.ca|  442|
|              ndp.ca|  416|
|     canadiancrc.com|   89|
|  communist-party.ca|   39|
|             ccsd.ca|   22|
+--------------------+-----+
only showing top 10 rows



In [28]:
# Domain graph.
webgraph.groupBy(
    "crawl_date",
    remove_prefix_www(extract_domain("src")).alias("src_domain"),
    remove_prefix_www(extract_domain("dest")).alias("dest_domain"))\
  .count()\
  .filter((col("dest_domain").isNotNull()) & (col("dest_domain") != ""))\
  .filter((col("src_domain").isNotNull()) & (col("src_domain") != ""))\
  .filter(col("count") > 5)\
  .orderBy(desc("count"))\
  .show(10, True)

+--------------+--------------------+--------------------+-----+
|    crawl_date|          src_domain|         dest_domain|count|
+--------------+--------------------+--------------------+-----+
|20091219001507|     davidsuzuki.org|     davidsuzuki.org|  428|
|20091219002507|       greenparty.ca|       greenparty.ca|  362|
|20091219002455|       greenparty.ca|       greenparty.ca|  358|
|20091218232854|       greenparty.ca|       greenparty.ca|  335|
|20091218231758|     canadiancrc.com|     canadiancrc.com|  321|
|20091219002518|       greenparty.ca|       greenparty.ca|  319|
|20091218232549|       greenparty.ca|       greenparty.ca|  318|
|20091219002433|       greenparty.ca|       greenparty.ca|  316|
|20091218235854|       greenparty.ca|       greenparty.ca|  276|
|20091219002742|policyalternative...|policyalternative...|  238|
+--------------+--------------------+--------------------+-----+
only showing top 10 rows
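
To make the aggregation above concrete, here is a minimal plain-Python sketch of the same idea: reduce each URL to a domain without a leading `www.`, drop links where either side has no domain, and count `(crawl_date, src_domain, dest_domain)` triples. The helpers `domain_of` and `domain_edges` are hypothetical and are not AUT's `extract_domain` or `remove_prefix_www`, whose exact behavior may differ.

```python
from collections import Counter
from urllib.parse import urlparse


def domain_of(url):
    """Return the host of `url` without a leading 'www.', or '' if it has none."""
    host = urlparse(url).hostname or ""
    return host[4:] if host.startswith("www.") else host


def domain_edges(links, min_count=0):
    """Count (crawl_date, src_domain, dest_domain) triples.

    `links` is an iterable of (crawl_date, src_url, dest_url) tuples.
    Only triples seen more than `min_count` times are returned,
    ordered from most to least frequent.
    """
    counts = Counter()
    for crawl_date, src, dest in links:
        src_domain, dest_domain = domain_of(src), domain_of(dest)
        if src_domain and dest_domain:
            counts[(crawl_date, src_domain, dest_domain)] += 1
    return [(key, n) for key, n in counts.most_common() if n > min_count]


# Made-up links for illustration only.
links = [
    ("20060622205609", "http://www.gca.ca/a", "http://www.gca.ca/b"),
    ("20060622205609", "http://www.gca.ca/a", "http://gca.ca/c"),
    ("20060622205609", "http://liberal.ca/", "mailto:someone@example.org"),
]
print(domain_edges(links))
```

The `mailto:` link is dropped because it has no host, which mirrors the two `isNotNull` and non-empty filters in the Spark version.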



In [35]:
css = archive.css()
css.show(10, True)

+--------------+--------------------+--------------------+---------+--------------------+--------------+--------------------+--------------------+--------------------+
|    crawl_date|                 url|            filename|extension|mime_type_web_server|mime_type_tika|                 md5|                sha1|             content|
+--------------+--------------------+--------------------+---------+--------------------+--------------+--------------------+--------------------+--------------------+
|20091218231950|http://greenparty...|css_d9b2f523c9c09...|      css|            text/css|    text/plain|717c98cca6f312506...|c5c52ff9828b704ac...|#aggregator .feed...|
|20091218234334|https://greenpart...|css_d9b2f523c9c09...|      css|            text/css|    text/plain|717c98cca6f312506...|c5c52ff9828b704ac...|#aggregator .feed...|
|20091219001031|http://www.e-acti...|action.retrievefi...|      css|            text/css|           N/A|d41d8cd98f00b204e...|da39a3ee5e6b4b0d3...|              
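
In the third row shown above, the truncated `md5` and `sha1` values begin with the same characters as the digests of empty input. That suggests, but does not prove, that the record has an empty body. A minimal sketch for checking a truncated value against a full digest, with hypothetical helper names:

```python
import hashlib


def digests(data):
    """Return the (md5, sha1) hex digests of `data`."""
    return hashlib.md5(data).hexdigest(), hashlib.sha1(data).hexdigest()


def matches_shown(full_hex, shown):
    """Compare a full hex digest with a value truncated by show(), e.g. 'abc...'."""
    prefix = shown[:-3] if shown.endswith("...") else shown
    return full_hex.startswith(prefix)


empty_md5, empty_sha1 = digests(b"")
print(matches_shown(empty_md5, "d41d8cd98f00b204e..."))
print(matches_shown(empty_sha1, "da39a3ee5e6b4b0d3..."))
```

A matching prefix is only a hint. Calling `show` with `truncate=False` on the `md5` and `sha1` columns would allow comparing the full values.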

In [30]:
html = archive.html()
html.show(10, True)

+--------------+--------------------+--------------------+---------+--------------------+--------------+--------------------+--------------------+--------------------+
|    crawl_date|                 url|            filename|extension|mime_type_web_server|mime_type_tika|                 md5|                sha1|             content|
+--------------+--------------------+--------------------+---------+--------------------+--------------+--------------------+--------------------+--------------------+
|20060622205609|http://www.gca.ca...|                    |     html|           text/html|     text/html|fb08072dfaeea19c3...|fe9980d9295bec0c2...|
<!DOCTYPE HTML P...|
|20060622205609|http://www.ppforu...|           index.asp|     html|           text/html|     text/html|7539b526a63262d46...|ebf13d3760939af03...|
<!DOCTYPE HTML ...|
|20060622205609|http://www.noshar...|        image010.gif|     html|           text/html|     text/html|bf8c38affb7d0285a...|838c195afb085a1a1...|<HTML>
<HEAD>


In [31]:
js = archive.js()
js.show(10, True)

+--------------+--------------------+--------------------+---------+--------------------+--------------+--------------------+--------------------+--------------------+
|    crawl_date|                 url|            filename|extension|mime_type_web_server|mime_type_tika|                 md5|                sha1|             content|
+--------------+--------------------+--------------------+---------+--------------------+--------------+--------------------+--------------------+--------------------+
|20091218234450|https://greenpart...|js_1504f0e5300da6...|       js|application/javas...|    text/plain|96cb75df55b03398f...|ee81d43ef27acec8c...|// $Id: jquery.js...|
+--------------+--------------------+--------------------+---------+--------------------+--------------+--------------------+--------------------+--------------------+



In [32]:
json = archive.json()
json.show(10, True)

+----------+---+--------+---------+--------------------+--------------+---+----+-------+
|crawl_date|url|filename|extension|mime_type_web_server|mime_type_tika|md5|sha1|content|
+----------+---+--------+---------+--------------------+--------------+---+----+-------+
+----------+---+--------+---------+--------------------+--------------+---+----+-------+



In [33]:
plain_text = archive.plain_text()
plain_text.show(10, True)

+--------------+--------------------+----------+---------+--------------------+--------------+--------------------+--------------------+--------------------+
|    crawl_date|                 url|  filename|extension|mime_type_web_server|mime_type_tika|                 md5|                sha1|             content|
+--------------+--------------------+----------+---------+--------------------+--------------+--------------------+--------------------+--------------------+
|20060622205620|http://agoracosmo...|  html.mes|      txt|          text/plain|    text/plain|7dee99acd58abc66c...|096c51df2828f49b8...|<br>
<p><font fa...|
|20060622205627|http://agoracosmo...| html.mes1|      txt|          text/plain|    text/plain|58c9b7de5042206c3...|ede3e2b202a8f61d5...|<br>
<p><font fa...|
|20060622205649|http://www.energi...|robots.txt|      txt|          text/plain|    text/plain|fb52df01c5bbd18a9...|ba580ace2af16f6b2...|User-agent: *
Dis...|
|20091219001445|http://www.cbs.co...|robots.txt|    

In [34]:
xml = archive.xml()
xml.show(10, True)

+--------------+--------------------+----------+---------+--------------------+-------------------+--------------------+--------------------+--------------------+
|    crawl_date|                 url|  filename|extension|mime_type_web_server|     mime_type_tika|                 md5|                sha1|             content|
+--------------+--------------------+----------+---------+--------------------+-------------------+--------------------+--------------------+--------------------+
|20091218232000|http://greenparty...|      feed|      rss| application/rss+xml|application/rss+xml|076ce8e27c439c53d...|f0eda8d70e33176f6...|<?xml version="1....|
|20091218232159|http://greenparty...|      feed|      rss| application/rss+xml|application/rss+xml|fe5a4834dab62aaa2...|78c8e57a25e77db94...|<?xml version="1....|
|20091218232233|http://greenparty...|2009-12-26|      rss| application/rss+xml|application/rss+xml|988c1b22efc385d49...|6e8d46808344e2b01...|<?xml version="1....|
|20091218232355|http:/