<a href="https://colab.research.google.com/github/archivesunleashed/notebooks/blob/master/au_parquet_test.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Working with Archives Unleashed Parquet Derivatives

In this notebook, we'll set up an environment, then load nine different DataFrames to experiment with.

**[Binary Analysis](https://github.com/archivesunleashed/aut/wiki/Bleeding-Edge#list-of-domains)**
- [Audio](https://github.com/archivesunleashed/aut-docs-new/blob/master/current/binary-analysis.md#Extract-Audio-Information)
- [Images](https://github.com/archivesunleashed/aut-docs-new/blob/master/current/image-analysis.md#Extract-Image-information)
- [PDFs](https://github.com/archivesunleashed/aut-docs-new/blob/master/current/binary-analysis.md#Extract-PDF-Information)
- [Presentation program files](https://github.com/archivesunleashed/aut-docs-new/blob/master/current/binary-analysis.md#Extract-Presentation-Program-Files-Information)
- [Spreadsheets](https://github.com/archivesunleashed/aut-docs-new/blob/master/current/binary-analysis.md#Extract-Spreadsheet-Information)
- [Videos](https://github.com/archivesunleashed/aut-docs-new/blob/master/current/binary-analysis.md#Extract-Video-Information)
- [Word processor files](https://github.com/archivesunleashed/aut-docs-new/blob/master/current/binary-analysis.md#Extract-Word-Processor-Files-Information)

**[Hyperlink Network](https://github.com/archivesunleashed/aut/wiki/Bleeding-Edge#hyperlink-network)**

**[List of Domains](https://github.com/archivesunleashed/aut/wiki/Bleeding-Edge#list-of-domains)**

# Setup Dependencies

First, we'll need to set up Java, Apache Spark, and the PySpark bindings.

In [0]:
%%capture

!apt-get update
!apt-get install -y openjdk-8-jdk-headless -qq 
!apt-get install maven -qq

!curl -L "https://archive.apache.org/dist/spark/spark-2.4.4/spark-2.4.4-bin-hadoop2.7.tgz" > spark-2.4.4-bin-hadoop2.7.tgz
!tar -xvf spark-2.4.4-bin-hadoop2.7.tgz
!pip install -q findspark

# Dataset

Next, we'll download a dataset to work with. This one comes from Bibliothèque et Archives nationales du Québec. It is the Parquet output of the Ministry of Environment of Québec (2011-2014) web archives, processed by the Archives Unleashed Toolkit.

In [0]:
%%capture

!curl -L "http://cloud.archivesunleashed.org/environment-qc-parquet.tar.gz" > environment-qc-parquet.tar.gz
!mkdir dataset
!tar -xzvf environment-qc-parquet.tar.gz
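
Before moving on, it can help to confirm that the archive unpacked where the later cells expect it. The following is a minimal sketch, assuming the tarball extracts into a `parquet/` directory (as the read paths later in this notebook suggest); `list_derivatives` is a hypothetical helper name, not part of the Archives Unleashed Toolkit.

```python
import os


def list_derivatives(base_dir="parquet"):
    """Return the sorted names of the sub-directories found under base_dir.

    An empty list means base_dir is missing or contains no directories.
    """
    if not os.path.isdir(base_dir):
        return []
    return sorted(
        name
        for name in os.listdir(base_dir)
        if os.path.isdir(os.path.join(base_dir, name))
    )


# Assumption: the derivatives were extracted to ./parquet
print(list_derivatives())
```

If the printed list is empty, the download or extraction step probably did not finish, and the `spark.read.parquet(...)` calls further down will fail.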

# Environment

Next, we'll set up our environment so we can work with the Parquet output in PySpark.

In [0]:
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-2.4.4-bin-hadoop2.7"
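
Both variables must point at directories that actually exist, otherwise starting Spark tends to fail with an error that is hard to trace back to a wrong path. A small sanity check, sketched here with the standard library only; `missing_dirs` is a hypothetical helper name introduced for illustration.

```python
import os


def missing_dirs(var_names=("JAVA_HOME", "SPARK_HOME")):
    """Return the variables that are unset or do not point at a directory."""
    return [
        name
        for name in var_names
        if not os.path.isdir(os.environ.get(name, ""))
    ]


# An empty list means both paths look usable.
print(missing_dirs())
```

A non-empty result usually means the install cell did not complete, or that the Java or Spark version in the path differs from the one that was installed.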

In [0]:
import findspark
findspark.init()
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[*]").getOrCreate()

# Datasets

Next, we'll load up our datasets to work with, and show a preview of each.

In [0]:
dataset_audio = spark.read.parquet("parquet/audio")
dataset_audio.show()

+--------------------+--------------------+---------+--------------------+--------------+--------------------+
|                 url|            filename|extension|mime_type_web_server|mime_type_tika|                 md5|
+--------------------+--------------------+---------+--------------------+--------------+--------------------+
|http://www.mddep....|connaissances-eau...|      mp3|          audio/mpeg|    audio/mpeg|f1f9bd2570b2eef6b...|
|http://www.ragedu...| operation-raton.wmv|      wma|      video/x-ms-wmv|audio/x-ms-wma|4e5763994a906b6aa...|
|http://www.ragedu...| operation-raton.wmv|      wma|      video/x-ms-wmv|audio/x-ms-wma|4e5763994a906b6aa...|
|http://www.ragedu...| operation-raton.wmv|      wma|      video/x-ms-wmv|audio/x-ms-wma|4e5763994a906b6aa...|
|http://www.ragedu...| operation-raton.wmv|      wma|      video/x-ms-wmv|audio/x-ms-wma|4e5763994a906b6aa...|
|http://www.deficl...|110314%20Steven%2...|      mp3|          audio/mpeg|    audio/mpeg|b6722993460210738...|
|

In [0]:
dataset_domains = spark.read.parquet("parquet/domains")
dataset_domains.show()

+--------------------+-----+
|              Domain|count|
+--------------------+-----+
|www.rechargeonsla...|    1|
|ville.waterloo.qc.ca|    1|
|        uclouvain.be|    1|
|  static.addinto.com|    1|
|    cbks0.google.com|    1|
|  www.crt.gouv.qc.ca|    1|
|       www.bdrvm.com|    1|
|  www.csf.gouv.qc.ca|    1|
|www.jeroenwijerin...|    1|
|www.capitale.gouv...|    1|
|www.senv.mddep.go...|    1|
|     swdlp.apple.com|    1|
|          fmv.fcm.ca|    1|
|           oaq.qc.ca|    1|
|      www.flickr.com|    1|
|cache.addthiscdn.com|    1|
|       www.google.ca|    1|
|        web.undp.org|    1|
|www.conseilinterc...|    1|
| www.cspq.gouv.qc.ca|    1|
+--------------------+-----+
only showing top 20 rows



In [0]:
dataset_images = spark.read.parquet("parquet/images")
dataset_images.show()

+--------------------+--------------------+---------+--------------------+--------------+-----+------+--------------------+
|                 url|            filename|extension|mime_type_web_server|mime_type_tika|width|height|                 md5|
+--------------------+--------------------+---------+--------------------+--------------+-----+------+--------------------+
|https://ws.sharet...|    flipboard_16.png|      png|           image/png|     image/png|   32|    32|c7ecde5871b0582fe...|
|http://www.cehq.g...|        X0002538.jpg|      jpg|          image/jpeg|    image/jpeg|  300|   225|c7eae700de547012d...|
|http://mddefp.gou...|Tmoy_classificati...|      jpg|          image/jpeg|    image/jpeg|  100|    76|c7de8ec7c6e221260...|
|http://www.cehq.g...|        X0002669.jpg|      jpg|          image/jpeg|    image/jpeg|  300|   225|c7d04c884ece4d1a4...|
|http://www.fetede...|  peche-plus-1-g.jpg|      jpg|          image/jpeg|    image/jpeg| 2000|  2997|c7cf4673777dba8ba...|
|http://

In [0]:
dataset_network = spark.read.parquet("parquet/network")
dataset_network.show()

+-----------------+--------------------+-----+
|        SrcDomain|          DestDomain|count|
+-----------------+--------------------+-----+
|mddefp.gouv.qc.ca|            oies.com|    5|
|mddefp.gouv.qc.ca|          unicef.org|    9|
|mddefp.gouv.qc.ca|conseildelafedera...|   75|
|mddefp.gouv.qc.ca|       covabar.qc.ca|   20|
|mddefp.gouv.qc.ca|            obvm.org|   10|
|mddefp.gouv.qc.ca|             loc.gov|    8|
|mddefp.gouv.qc.ca|            stm.info|    1|
|mddefp.gouv.qc.ca|budget.finances.g...|   38|
|mddefp.gouv.qc.ca|corridorappalachi...|   17|
|mddefp.gouv.qc.ca|   saint-adolphe.org|   10|
|mddefp.gouv.qc.ca|          statcan.ca|   10|
|mddefp.gouv.qc.ca|iqa.mddelcc.gouv....|    2|
|mddefp.gouv.qc.ca|     bibliotheque.qc|    1|
|mddefp.gouv.qc.ca|foruminondations2...|    2|
|mddefp.gouv.qc.ca|          ciraig.org|    2|
|mddefp.gouv.qc.ca|fondationdelafaun...|  223|
|mddefp.gouv.qc.ca|especesenperil.gc.ca|   18|
|mddefp.gouv.qc.ca|      cai.gouv.qc.ca|   51|
|mddefp.gouv.

In [0]:
dataset_pdfs = spark.read.parquet("parquet/pdf")
dataset_pdfs.show()

+--------------------+--------------------+---------+--------------------+---------------+--------------------+
|                 url|            filename|extension|mime_type_web_server| mime_type_tika|                 md5|
+--------------------+--------------------+---------+--------------------+---------------+--------------------+
|http://mddefp.gou...|Wildlife_sanctuar...|      pdf|     application/pdf|application/pdf|345f2269c7a747aac...|
|http://mddefp.gou...|Wildlife_sanctuar...|      pdf|     application/pdf|application/pdf|345f2269c7a747aac...|
|http://www.mddep....|Wildlife_sanctuar...|      pdf|     application/pdf|application/pdf|345f2269c7a747aac...|
|http://www.mddefp...|Wildlife_sanctuar...|      pdf|     application/pdf|application/pdf|345f2269c7a747aac...|
|http://www.mddep....|Wildlife_sanctuar...|      pdf|     application/pdf|application/pdf|345f2269c7a747aac...|
|http://www.mddep....|Wildlife_sanctuar...|      pdf|     application/pdf|application/pdf|345f2269c7a747

In [0]:
dataset_presentation_program = spark.read.parquet("parquet/presentation_program")
dataset_presentation_program.show()

+--------------------+--------------------+---------+--------------------+--------------------+--------------------+
|                 url|            filename|extension|mime_type_web_server|      mime_type_tika|                 md5|
+--------------------+--------------------+---------+--------------------+--------------------+--------------------+
|http://www.enviro...|           JEAP.pptx|     pptx|application/vnd.o...|application/vnd.o...|c3cf8193a17007d48...|
|http://www.mddefp...|phosphore-abitibi...|      ppt|application/vnd.m...|application/vnd.m...|f355c8d364bbd7c69...|
|http://www.mddep....|phosphore-abitibi...|      ppt|application/vnd.m...|application/vnd.m...|f355c8d364bbd7c69...|
|http://mddep.gouv...|phosphore-abitibi...|      ppt|application/vnd.m...|application/vnd.m...|f355c8d364bbd7c69...|
|http://www.mddefp...|phosphore-abitibi...|      ppt|application/vnd.m...|application/vnd.m...|f355c8d364bbd7c69...|
|http://mddep.gouv...|phosphore-abitibi...|      ppt|application

In [0]:
dataset_spreadsheets = spark.read.parquet("parquet/spreadsheets")
dataset_spreadsheets.show()

+--------------------+--------------------+---------+--------------------+--------------------+--------------------+
|                 url|            filename|extension|mime_type_web_server|      mime_type_tika|                 md5|
+--------------------+--------------------+---------+--------------------+--------------------+--------------------+
|http://mddefp.gou...|liste-etablisseme...|     xlsx|application/vnd.o...|application/vnd.o...|2aa0b7c4f741a13ac...|
|http://mddefp.gou...|30-3-2-adaptation...|     xlsm|application/vnd.m...|application/vnd.m...|b962130c2b1c169fe...|
|http://mddefp.gou...|17-1-ecocamionnag...|     xlsx|application/vnd.o...|application/vnd.o...|5dc7c1910bcd8af6c...|
|http://mddefp.gou...|9-3-financement-m...|     xlsm|application/vnd.m...|application/vnd.m...|35259010efa4b4a7d...|
|http://mddefp.gou...|    14-7-PIEVAL.xlsm|     xlsm|application/vnd.m...|application/vnd.m...|3520b8cdafd7c500e...|
|http://mddefp.gou...|ge-4-gestion-inte...|     xlsm|application

In [0]:
dataset_videos = spark.read.parquet("parquet/videos")
dataset_videos.show()

+--------------------+-------------+---------+--------------------+--------------+--------------------+
|                 url|     filename|extension|mime_type_web_server|mime_type_tika|                 md5|
+--------------------+-------------+---------+--------------------+--------------+--------------------+
|http://r2---sn-9p...|videoplayback|      mp4|           video/mp4|     video/mp4|d68b00e237cac18b5...|
|http://r1---sn-9p...|videoplayback|      mp4|           video/mp4|     video/mp4|d687fc368843346ae...|
|http://r1---sn-9p...|videoplayback|      mp4|           video/mp4|     video/mp4|d6428ab7ca3306dca...|
|http://r2---sn-9p...|videoplayback|      mp4|           video/mp4|     video/mp4|d628294e7df1c7b12...|
|https://r5---sn-t...|videoplayback|      mp4|           video/mp4|     video/mp4|d610d5763cc3d1a5a...|
|http://r1---sn-9p...|videoplayback|      mp4|           video/mp4|     video/mp4|d5e37c1c79441e40b...|
|http://r2---sn-9p...|videoplayback|      mp4|           video/m

In [0]:
dataset_word_processor = spark.read.parquet("parquet/word_processor")
dataset_word_processor.show()

+--------------------+--------------------+---------+--------------------+--------------------+--------------------+
|                 url|            filename|extension|mime_type_web_server|      mime_type_tika|                 md5|
+--------------------+--------------------+---------+--------------------+--------------------+--------------------+
|http://mddefp.gou...|     formulaire.docx|     docx|application/vnd.o...|application/vnd.o...|49437b65e78fc8d07...|
|http://www.mddep....|3-aide_particulie...|      doc|  application/msword|  application/msword|4943475a254e3b927...|
|http://www.mddefp...|         annexe5.doc|      doc|  application/msword|  application/msword|491bdb25f3ee1358b...|
|http://www.mddep....|         annexe5.doc|      doc|  application/msword|  application/msword|491bdb25f3ee1358b...|
|http://mddefp.gou...|Attestation-perso...|      doc|  application/msword|  application/msword|4868aa8e259881533...|
|http://www.mddep....|Attestation-perso...|      doc|  applicati

# Data Analysis

Now that we have all of our datasets loaded up, we can begin to work with them!


## Counting total files, and unique files

In [0]:
# Count number of rows (how many images are in the web archive collection).
dataset_images.count()

156166

In [0]:
# How do we get the number of unique images? We have an MD5 digest of each, so we can count the unique md5 values!
from pyspark.sql.functions import countDistinct
dataset_images.select(countDistinct("md5")).show()

+-------------------+
|count(DISTINCT md5)|
+-------------------+
|              18287|
+-------------------+
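
Counting distinct `md5` values works as a de-duplication step because an MD5 digest depends only on a file's bytes, so the same image served from several URLs yields the same value. Here is a minimal sketch using Python's standard `hashlib`; the URLs and payloads are made-up placeholders, not rows from this dataset. The last lines reuse the two counts reported above to estimate how often each unique image was captured on average.

```python
import hashlib


def md5_hex(data: bytes) -> str:
    """Return the hexadecimal MD5 digest of a byte string."""
    return hashlib.md5(data).hexdigest()


# Hypothetical captures: two URLs serve identical bytes, one differs.
captures = {
    "http://example.org/a/logo.png": b"same bytes",
    "http://example.org/b/logo.png": b"same bytes",
    "http://example.org/c/other.png": b"different bytes",
}

digests = {url: md5_hex(body) for url, body in captures.items()}
unique = set(digests.values())
print(len(captures), "captures,", len(unique), "unique")

# Counts taken from the two cells above: total rows and distinct digests.
total_rows, distinct_digests = 156166, 18287
ratio = total_rows / distinct_digests
print(f"average captures per unique image: {ratio:.1f}")
```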



In [0]:
# Which filenames occur most often? Group by filename, count, and sort descending.
dataset_images.groupBy('filename').count().orderBy('count', ascending=False).show(10)

+-------------+-----+
|     filename|count|
+-------------+-----+
|  carte-p.jpg| 1196|
|   carte2.jpg|  924|
|   carte1.jpg|  875|
|  carte-g.jpg|  660|
|    carte.jpg|  576|
| carte-qc.jpg|  575|
| carte-an.jpg|  575|
|  carte-G.jpg|  484|
|  carte_p.jpg|  473|
|carte_web.jpg|  431|
+-------------+-----+
only showing top 10 rows
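
A frequent filename is not necessarily a single image captured many times: different pages can each serve their own file under the same name. One way to check is to count how many distinct digests sit behind each filename. The sketch below shows the idea in plain Python on made-up `(filename, md5)` pairs; `digests_per_filename` is a hypothetical helper, and the rows are placeholders rather than values from this dataset.

```python
from collections import defaultdict


def digests_per_filename(rows):
    """Map each filename to the number of distinct md5 values seen for it."""
    seen = defaultdict(set)
    for filename, md5 in rows:
        seen[filename].add(md5)
    return {name: len(values) for name, values in seen.items()}


# Hypothetical rows: "carte.jpg" is two different images, "logo.png" is one.
rows = [
    ("carte.jpg", "digest-1"),
    ("carte.jpg", "digest-2"),
    ("logo.png", "digest-3"),
    ("logo.png", "digest-3"),
]
print(digests_per_filename(rows))
```

A filename with many distinct digests is a shared naming convention; a filename with one digest is a single file that was captured repeatedly.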



In [0]:
# Which exact images (by MD5 digest) occur most often? Show full digests, untruncated.
dataset_images.groupBy('md5').count().orderBy('count', ascending=False).show(10, False)

+--------------------------------+-----+
|md5                             |count|
+--------------------------------+-----+
|5283d313972a24f0e71c47ae3c99958b|192  |
|b09dc3225d5e1377c52c06feddc33bfe|192  |
|a4d3ddfb1a95e87650c624660d67765a|192  |
|e7d1f7750c16bc835bf1cfe1bf322d46|192  |
|89663337857f6d769fbcaed7278cc925|77   |
|497db34fffa0e278f57ae614b4b758a0|64   |
|58e5d8676dfcc4205551314d98fb2624|61   |
|100322cfd242ee75dd5a744526f08d6b|56   |
|7252e42a951b5e449ea02c517839ed6d|53   |
|65274f9eaa4c585b7c35193ebb04e0d7|53   |
+--------------------------------+-----+
only showing top 10 rows
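
For readers new to Spark, the `groupBy(...).count().orderBy(...)` chain is a frequency count followed by a descending sort. A rough plain-Python equivalent, sketched on made-up `(url, md5)` pairs, looks like this; `top_digests` is a hypothetical helper, and it only suits data small enough to fit in memory, which is the reason to use Spark for the real collection.

```python
from collections import Counter


def top_digests(rows, n=10):
    """Return the n most frequent md5 values as (md5, count) pairs."""
    counts = Counter(md5 for _url, md5 in rows)
    return counts.most_common(n)


# Hypothetical rows: the same image captured under three URLs, plus two others.
rows = [
    ("http://example.org/a/tumeur.jpg", "digest-a"),
    ("http://example.org/b/tumeur.jpg", "digest-a"),
    ("http://example.org/c/tumeur.jpg", "digest-a"),
    ("http://example.org/d/photo.jpg", "digest-b"),
    ("http://example.org/e/photo.jpg", "digest-b"),
    ("http://example.org/f/icon.png", "digest-c"),
]
print(top_digests(rows, n=2))
```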



In [0]:
dataset_images.filter("md5 = '5283d313972a24f0e71c47ae3c99958b'").show(192, False)

+---------------------------------------------------------------+----------+---------+--------------------+--------------+-----+------+--------------------------------+
|url                                                            |filename  |extension|mime_type_web_server|mime_type_tika|width|height|md5                             |
+---------------------------------------------------------------+----------+---------+--------------------+--------------+-----+------+--------------------------------+
|http://mddefp.gouv.qc.ca/poissons/assomption/tumeur.jpg        |tumeur.jpg|jpg      |image/jpeg          |image/jpeg    |310  |220   |5283d313972a24f0e71c47ae3c99958b|
|http://mddefp.gouv.qc.ca/poissons/chateauguay/tumeur.jpg       |tumeur.jpg|jpg      |image/jpeg          |image/jpeg    |310  |220   |5283d313972a24f0e71c47ae3c99958b|
|http://mddefp.gouv.qc.ca/poissons/st-francois/tumeur.jpg       |tumeur.jpg|jpg      |image/jpeg          |image/jpeg    |310  |220   |5283d313972a24f0e71c