Using the Archives Unleashed Toolkit with PySpark
Table of Contents
- Introduction
- Getting Started
- DataFrames: Collection Analysis
- Using DataFrames to Count Domains
- Using DataFrames to Count Crawl Dates
- Using DataFrames to List URLs
- Turn Your WARCs into a Temporary Database Table
- Implementing at Scale
Introduction
PySpark is a valuable tool for exploring and analyzing data at scale. It is slightly different from other Python programs in that it relies on Apache Spark's underlying Scala and Java code to manipulate datasets. You can read more about this in the Spark documentation.
One notable difference between PySpark and the Scala spark-shell is that the latter makes it challenging to paste in blocks of code and execute them in the shell.
There are two ways around this. The first workaround is to create a new Python file with your script, and use spark-submit. For example, you might create a script with your text editor, save it as file.py, and then run it using the following.
spark-submit --jars /path/to/aut-0.18.0-fatjar.jar --driver-class-path /path/to/aut-0.18.0-fatjar.jar --py-files /path/to/aut-0.18.0.zip /path/to/custom/python/file.py

An easier method is to use the interactive, browser-based Jupyter Notebook to work with the Archives Unleashed Toolkit (AUT). You can see it in action below.

Jupyter Notebooks are a great tool, and we use them for all of our script prototyping. Once we want to use a script on more than one WARC file, though, we find it best to shift over to spark-submit. Our advice: once your script works on a single file in the notebook and you want to start crunching your big data, move back to spark-submit.
Getting Started
To get Jupyter running, you will need a few things installed. First, you will need Python 3.7+; we suggest using the Anaconda Distribution.
For ease of use, you may want to consider using a virtual environment:
virtualenv ~/.venv_path
source ~/.venv_path/bin/activate

If for some reason you are missing dependencies, install them with conda install or pip install.
Next, you will need to download the following AUT release files:
- aut-0.18.0-fatjar.jar
- aut-0.18.0.zip
With the dependencies downloaded, you are ready to launch your Jupyter Notebook.
PYSPARK_DRIVER_PYTHON=jupyter PYSPARK_DRIVER_PYTHON_OPTS=notebook /path/to/spark/bin/pyspark --jars /path/to/aut-0.18.0-fatjar.jar --driver-class-path /path/to/aut-0.18.0-fatjar.jar --py-files /path/to/aut-0.18.0.zip

A Jupyter Notebook should automatically load in your browser at http://localhost:8888. You may be asked for a token upon first launch, which offers a bit of security. The token is printed in the launch output and will look something like this:
[I 19:18:30.893 NotebookApp] Writing notebook server cookie secret to /run/user/1001/jupyter/notebook_cookie_secret
[I 19:18:31.111 NotebookApp] JupyterLab extension loaded from /home/nruest/bin/anaconda3/lib/python3.7/site-packages/jupyterlab
[I 19:18:31.111 NotebookApp] JupyterLab application directory is /home/nruest/bin/anaconda3/share/jupyter/lab
[I 19:18:31.112 NotebookApp] Serving notebooks from local directory: /home/nruest/Projects/au/aut
[I 19:18:31.112 NotebookApp] The Jupyter Notebook is running at:
[I 19:18:31.112 NotebookApp] http://localhost:8888/?token=87e7a47c5a015cb2b846c368722ec05c1100988fd9dcfe04
[I 19:18:31.112 NotebookApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
[C 19:18:31.140 NotebookApp]
To access the notebook, open this file in a browser:
file:///run/user/1001/jupyter/nbserver-9702-open.html
Or copy and paste one of these URLs:
http://localhost:8888/?token=87e7a47c5a015cb2b846c368722ec05c1100988fd9dcfe04
Create a new notebook by clicking “New” (near the top right of the Jupyter homepage) and selecting “Python 3” from the drop-down list.
The notebook will open in a new window. In the first cell enter:
from aut import *
archive = WebArchive(sc, sqlContext, "src/test/resources/warc/")
pages = archive.pages()
pages.printSchema()

Then hit Shift+Enter, or press the play button.
If you receive no errors and see the schema printed below the cell, you are ready to begin working with your web archives!

DataFrames: Collection Analysis
An additional benefit of using PySpark is its support for DataFrames, which present data in tabular form and make filtering straightforward. In this section, we will use DataFrames to provide an overview of a collection’s content.
For these examples, we are going to use some AUT sample data. Click the previous link, download it, and extract the zip file.
With the 0.18.0 release, AUT has the following DataFrames available for collections analysis:
- audio
- images
- image_links
- links
- pages
- pdfs
- presentation_program
- spreadsheets
- text_files
- video
- word_processor
Pages
Create a DataFrame with crawl_date, url, mime_type_web_server, and content:
from aut import *
archive = WebArchive(sc, sqlContext, "/path/to/aut-resources-master/Sample-Data/*gz")
df = archive.pages()
df.show()
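Because these are ordinary Spark DataFrames, you can also filter before displaying results. Here is a minimal sketch that keeps only pages the web server reported as text/html, using the column names listed above; the value "text/html" is just an illustrative choice.
from aut import *
archive = WebArchive(sc, sqlContext, "/path/to/aut-resources-master/Sample-Data/*gz")
df = archive.pages()
# Keep only HTML pages and show a few columns from the schema above.
df.filter(df.mime_type_web_server == "text/html") \
  .select("crawl_date", "url", "mime_type_web_server") \
  .show(10)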
Links
Create a DataFrame with crawl_date, source, destination, and anchor text (that's the text that you click on to use the hyperlink):
from aut import *
archive = WebArchive(sc, sqlContext, "/path/to/aut-resources-master/Sample-Data/*gz")
df = archive.links()
df.show()
Image Links
Create a DataFrame with source page and image url:
from aut import *
archive = WebArchive(sc, sqlContext, "/path/to/aut-resources-master/Sample-Data/*gz")
df = archive.image_links()
df.show()
Images
Create a DataFrame with image url, filename, extension, mime_type_web_server, mime_type_tika, width, height, md5, and raw bytes:
from aut import *
archive = WebArchive(sc, sqlContext, "/path/to/aut-resources-master/Sample-Data/*gz")
df = archive.images()
df.show()
PDFs
Create a DataFrame with PDF url, filename, extension, mime_type_web_server, mime_type_tika, md5, and raw bytes:
from aut import *
archive = WebArchive(sc, sqlContext, "/path/to/aut-resources-master/Sample-Data/*gz")
df = archive.pdfs()
df.show()
Audio
Create a DataFrame with audio url, filename, extension, mime_type_web_server, mime_type_tika, md5, and raw bytes:
from aut import *
archive = WebArchive(sc, sqlContext, "/path/to/aut-resources-master/Sample-Data/*gz")
df = archive.audio()
df.show()
Video
Create a DataFrame with video url, filename, extension, mime_type_web_server, mime_type_tika, md5, and raw bytes:
from aut import *
archive = WebArchive(sc, sqlContext, "/path/to/aut-resources-master/Sample-Data/*gz")
df = archive.video()
df.show()
Spreadsheets
Create a DataFrame with spreadsheet url, filename, extension, mime_type_web_server, mime_type_tika, md5, and raw bytes:
from aut import *
archive = WebArchive(sc, sqlContext, "/path/to/aut-resources-master/Sample-Data/*gz")
df = archive.spreadsheets()
df.show()

Presentation Program Files (e.g., PowerPoint)
from aut import *
archive = WebArchive(sc, sqlContext, "/path/to/aut-resources-master/Sample-Data/*gz")
df = archive.presentation_program()
df.show()

Word Processor Files (e.g., Word)
from aut import *
archive = WebArchive(sc, sqlContext, "/path/to/aut-resources-master/Sample-Data/*gz")
df = archive.word_processor()
df.show()

Plain Text Files
from aut import *
archive = WebArchive(sc, sqlContext, "/path/to/aut-resources-master/Sample-Data/*gz")
df = archive.text_files()
df.show()

Using DataFrames to Count Domains
Using the extract_domain function and pyspark.sql.functions, we can count domains with any of the above DataFrames.
For example, you can run the following to see the top 10 domains:
from aut import *
from pyspark.sql.functions import desc
archive = WebArchive(sc, sqlContext, "/home/nruest/Downloads/aut-resources-master/Sample-Data/*gz")
df = archive.pages()
df.select(extract_domain("url").alias("Domain")).groupBy("Domain").count().sort(desc("count")).show(n=10)
Using DataFrames to Count Crawl Dates
You can do this with other fields as well. For example, counting by crawl date:
from aut import *
archive = WebArchive(sc, sqlContext, "/home/nruest/Downloads/aut-resources-master/Sample-Data/*gz")
df = archive.pages()
df.groupBy("crawl_date").count().show()
Using DataFrames to List URLs
Finally, you can compile a list of the URLs in a collection with the following command:
from aut import *
archive = WebArchive(sc, sqlContext, "/home/nruest/Downloads/aut-resources-master/Sample-Data/*gz")
df = archive.pages()
df.select("url").rdd.flatMap(lambda x: x).take(10)

Turn Your WARCs into a Temporary Database Table
Using any of the above DataFrames, you can begin to integrate Spark SQL commands.
from aut import *
archive = WebArchive(sc, sqlContext, "/home/nruest/Downloads/aut-resources-master/Sample-Data/*gz")
df = archive.pages()
df.createOrReplaceTempView("warc") # create a table called "warc"
dfSQL = spark.sql('SELECT * FROM warc WHERE url LIKE "%ndp.ca%" ORDER BY crawl_date DESC')
dfSQL.show(5)
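Once the temporary table exists, any Spark SQL query against it will work. For example, here is a sketch that counts pages per crawl date entirely in SQL, using the same "warc" table and the sample data path from above.
from aut import *
archive = WebArchive(sc, sqlContext, "/path/to/aut-resources-master/Sample-Data/*gz")
archive.pages().createOrReplaceTempView("warc")
# Count captured pages per crawl date via SQL instead of the DataFrame API.
spark.sql("SELECT crawl_date, COUNT(*) AS pages FROM warc GROUP BY crawl_date ORDER BY pages DESC").show()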
Implementing at Scale
Now that you've seen what's possible, try using your own files. We recommend the following:
- Write your script in a Jupyter notebook against a single WARC or ARC file, and check that the results are what you expect.
- Once you're ready to run it at scale, copy the code out of the notebook and into a text editor, saving it as a Python script.
- You may want to swap the path variable to point at an entire directory - e.g. path = "/path/to/warcs/*.gz" - rather than just one file.
- Use the spark-submit command to execute the script.
spark-submit has more fine-grained options for controlling the amount of memory you devote to the process. Please read the documentation here.
As a reminder, spark-submit syntax looks like:
spark-submit --jars /path/to/aut-0.18.0-fatjar.jar --driver-class-path /path/to/aut-0.18.0-fatjar.jar --py-files /path/to/aut-0.18.0.zip /path/to/custom/python/file.py

Here, file.py is the Python script that you've written.
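For reference, here is a minimal sketch of what such a file.py might look like. It mirrors the domain-count example above, but creates its own SparkContext and SQLContext because spark-submit does not provide sc and sqlContext for you; the WARC path, output behaviour, and app name are placeholders you would adjust.
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql.functions import desc
from aut import *

# spark-submit does not predefine sc or sqlContext, so create them here.
sc = SparkContext(appName="aut-domain-count")
sqlContext = SQLContext(sc)

# Point at a whole directory of WARCs rather than a single file.
archive = WebArchive(sc, sqlContext, "/path/to/warcs/*.gz")
df = archive.pages()

# Print the ten most frequent domains in the collection.
df.select(extract_domain("url").alias("Domain")) \
  .groupBy("Domain") \
  .count() \
  .sort(desc("count")) \
  .show(n=10)

sc.stop()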