# AUT PySpark Documentation Examples

Import `aut` Python libraries.

In [1]:
from aut import *
from pyspark.sql.functions import col, desc, explode

Let's create a variable for our W/ARCs.

In [2]:
warcs = "/home/nruest/Projects/au/sample-data/geocities"

## `.all()`

`.all()` schema

In [3]:
WebArchive(sc, sqlContext, warcs)\
  .all()\
  .printSchema

<bound method DataFrame.printSchema of DataFrame[crawl_date: string, domain: string, url: string, mime_type_web_server: string, mime_type_tika: string, content: string, bytes: binary, http_status_code: string, archive_filename: string]>
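
The output above is the repr of a bound method rather than a schema tree, which is what Python produces when a method is referenced without parentheses. A minimal sketch with a hypothetical stand-in class (`FakeFrame` is not part of `aut` or PySpark) shows the difference; on a real DataFrame, calling `.printSchema()` with parentheses prints the tree.

```python
class FakeFrame:
    """Hypothetical stand-in for a DataFrame, for illustration only."""

    def __init__(self, fields):
        self.fields = fields

    def schema_lines(self):
        # Return one line per (name, type) pair, similar in spirit to a schema tree.
        return [f" |-- {name}: {dtype}" for name, dtype in self.fields]


frame = FakeFrame([("crawl_date", "string"), ("url", "string")])

referenced = frame.schema_lines    # no parentheses: a bound method object
called = frame.schema_lines()      # parentheses: the method actually runs

print(callable(referenced))
print(called)
```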

We're going to use some of the base DataFrames frequently, so we'll create a variable to reference them.

In [4]:
all = WebArchive(sc, sqlContext, warcs)\
        .all()

Select `url` and `http_status_code`, and show 10 with non-truncated columns.

In [5]:
all.select("url", "http_status_code")\
  .show(10, False)

+------------------------------------------------------------------------------+----------------+
|url                                                                           |http_status_code|
+------------------------------------------------------------------------------+----------------+
|http://geocities.com/matt_tvet/images/bodyboardhowto_08.gif                   |200             |
|http://geocities.com/casastanca/_derived/products.htm_cmp_citrus010_vbtn_a.gif|200             |
|http://geocities.com/glamshotsph/glampics/n_julian1.jpg                       |503             |
|http://geocities.com/area51/chamber/9963/Images/choc1.jpg                     |200             |
|http://geocities.com/heigl_2k/img/thumbs/?N=D                                 |200             |
|http://geocities.com/~jask16/weapons/naval.html                               |200             |
|http://geocities.com/adams_wedding/images/Thumbnails/0087.jpg                 |200             |
|http://geocities.co

                                                                                

Select `url` and `archive_filename`, and show 10 with truncated columns.

In [6]:
all.select("url", "archive_filename")\
  .show(10, True)

+--------------------+--------------------+
|                 url|    archive_filename|
+--------------------+--------------------+
|http://geocities....|GEOCITIES-2009102...|
|http://geocities....|GEOCITIES-2009102...|
|http://geocities....|GEOCITIES-2009102...|
|http://geocities....|GEOCITIES-2009102...|
|http://geocities....|GEOCITIES-2009102...|
|http://geocities....|GEOCITIES-2009102...|
|http://geocities....|GEOCITIES-2009102...|
|http://geocities....|GEOCITIES-2009102...|
|http://geocities....|GEOCITIES-2009102...|
|http://geocities....|GEOCITIES-2009102...|
+--------------------+--------------------+
only showing top 10 rows
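
The second argument to `.show()` controls truncation. With `True`, PySpark shortens cell values longer than 20 characters to their first 17 characters followed by `...`, which is why every row above ends in `...`. The helper below, `truncate_cell`, is a hypothetical pure-Python imitation of that rule, not PySpark's own code.

```python
def truncate_cell(value: str, width: int = 20) -> str:
    """Imitate how a 20-character truncation shortens long cell values."""
    if len(value) <= width:
        return value
    # Keep width - 3 characters and mark the cut with an ellipsis.
    return value[: width - 3] + "..."


url = "http://geocities.com/matt_tvet/images/bodyboardhowto_08.gif"
print(truncate_cell(url))
print(truncate_cell("geocities.com"))
```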



Select the crawl date and the Tika MIME type, and demonstrate how to apply a [User Defined Function](https://docs.databricks.com/spark/latest/spark-sql/udf-scala.html) (UDF) to a column. In this case, `detect_mime_type_tika()` detects the [MIME type](https://en.wikipedia.org/wiki/MIME) from the [Base64](https://en.wikipedia.org/wiki/Base64) encoded `bytes` column using [Apache Tika](https://tika.apache.org/), and the result is shown as `udf_tika` next to the existing `mime_type_tika` column.

In [7]:
all.select("crawl_date", detect_mime_type_tika("bytes").alias("udf_tika"), "mime_type_tika")\
  .show(10, True)

+--------------+----------+--------------+
|    crawl_date|  udf_tika|mime_type_tika|
+--------------+----------+--------------+
|20091027143512| image/gif|     image/gif|
|20091027143512| image/gif|     image/gif|
|20091027143512| text/html|     text/html|
|20091027143512|image/jpeg|    image/jpeg|
|20091027143512| text/html|     text/html|
|20091027143512| text/html|     text/html|
|20091027143511|image/jpeg|    image/jpeg|
|20091027143512| text/html|     text/html|
|20091027143512| text/html|     text/html|
|20091027143511| image/gif|     image/gif|
+--------------+----------+--------------+
only showing top 10 rows
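
As a rough illustration of content-based MIME detection, the sketch below Base64-encodes a few bytes, decodes them again, and checks well-known file signatures. It is a simplified stand-in that only knows three signatures; Apache Tika's detection is far more thorough, and `sniff_mime_type` is a hypothetical helper, not part of `aut`.

```python
import base64

# A few well-known file signatures ("magic numbers").
SIGNATURES = {
    b"GIF87a": "image/gif",
    b"GIF89a": "image/gif",
    b"\xff\xd8\xff": "image/jpeg",
}


def sniff_mime_type(encoded: str) -> str:
    """Guess a MIME type from the leading bytes of Base64-encoded data."""
    raw = base64.b64decode(encoded)
    for magic, mime in SIGNATURES.items():
        if raw.startswith(magic):
            return mime
    return "application/octet-stream"


# Illustrative byte strings, not taken from the sample W/ARCs.
gif_b64 = base64.b64encode(b"GIF89a\x01\x00\x01\x00").decode("ascii")
jpeg_b64 = base64.b64encode(b"\xff\xd8\xff\xe0\x00\x10JFIF").decode("ascii")

print(sniff_mime_type(gif_b64))
print(sniff_mime_type(jpeg_b64))
```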



Select content, and remove HTTP headers and HTML.

In [8]:
all.filter("crawl_date is not NULL")\
  .filter(~(col("url").rlike(".*robots\\.txt$")) & (col("mime_type_web_server").rlike("text/html") | col("mime_type_web_server").rlike("application/xhtml\\+xml") | col("url").rlike("(?i).*htm$") | col("url").rlike("(?i).*html$")))\
  .filter(col("http_status_code") == 200)\
  .select(remove_html(remove_http_header("content")).alias("content"))\
  .show(10, True)

+--------------------+
|             content|
+--------------------+
|Index of /heigl_2...|
|Naval Missles Ant...|
|Diana Bellessi's ...|
|KELSO COMMUNITY A...|
|The Star Sapphire...|
|Index of /vandenf...|
|Diana Bellessi's ...|
|Index of /kkrhyth...|
|Index of /guinncj...|
|The Clinical Psyc...|
+--------------------+
only showing top 10 rows
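
One detail worth noting when matching MIME types with `rlike`: in a regular expression, `+` is a quantifier, so an unescaped `xhtml+xml` does not match the literal MIME type `application/xhtml+xml`. The sketch below uses Python's `re` module to show this; `rlike` uses Java regular expressions, which treat `+` the same way.

```python
import re

mime = "application/xhtml+xml"

unescaped = re.search("application/xhtml+xml", mime)
escaped = re.search(r"application/xhtml\+xml", mime)

print(unescaped is None)
print(escaped is not None)

# The unescaped pattern instead matches one or more "l" followed directly by "xml".
odd_match = re.search("application/xhtml+xml", "application/xhtmllxml")
print(odd_match is not None)
```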



Select crawl date, domain, url, and content, and remove boilerplate from the content with `extract_boilerplate()`.

In [9]:
all.filter("crawl_date is not NULL")\
  .filter(~(col("url").rlike(".*robots\\.txt$")) & (col("mime_type_web_server").rlike("text/html") | col("mime_type_web_server").rlike("application/xhtml\\+xml") | col("url").rlike("(?i).*htm$") | col("url").rlike("(?i).*html$")))\
  .filter(col("http_status_code") == 200)\
  .select("crawl_date", "domain", "url", extract_boilerplate("content").alias("content"))\
  .show(10, True)

+--------------+-------------+--------------------+--------------------+
|    crawl_date|       domain|                 url|             content|
+--------------+-------------+--------------------+--------------------+
|20091027143512|geocities.com|http://geocities....|                    |
|20091027143512|geocities.com|http://geocities....|AGM-84 Harpoon A ...|
|20091027143512|geocities.com|http://geocities....|enlazado el ojo a...|
|20091027143512|geocities.com|http://geocities....|                 ...|
|20091027143512|geocities.com|http://geocities....|;         Pater e...|
|20091027143512|geocities.com|http://geocities....|                    |
|20091027143512|geocities.com|http://geocities....|                    |
|20091027143512|geocities

Select MIME types and url, and filter out records that both Tika and the web server identify as images.

In [10]:
all.select("mime_type_tika", "mime_type_web_server", "url")\
  .filter(~col("mime_type_tika").like("image/%") | ~col("mime_type_web_server").like("image/%"))\
  .show(10, False)

+--------------+--------------------+-------------------------------------------------------------------------------------+
|mime_type_tika|mime_type_web_server|url                                                                                  |
+--------------+--------------------+-------------------------------------------------------------------------------------+
|text/html     |text/html           |http://geocities.com/glamshotsph/glampics/n_julian1.jpg                              |
|text/html     |text/html           |http://geocities.com/heigl_2k/img/thumbs/?N=D                                        |
|text/html     |text/html           |http://geocities.com/~jask16/weapons/naval.html                                      |
|text/html 
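
The choice between `|` and `&` matters when combining negated conditions. By De Morgan's law, `~a | ~b` equals `~(a & b)`, so it drops a row only when both columns report an image type, whereas `~a & ~b` drops a row when either does. A plain-Python sketch with made-up rows:

```python
rows = [
    # (mime_type_tika, mime_type_web_server): illustrative values only
    ("image/gif", "image/gif"),
    ("text/html", "image/jpeg"),
    ("text/html", "text/html"),
]


def is_image(mime: str) -> bool:
    return mime.startswith("image/")


# Keep a row unless BOTH columns say "image" (the `|` form).
either_not_image = [r for r in rows if not is_image(r[0]) or not is_image(r[1])]

# Keep a row only if NEITHER column says "image" (the `&` form).
neither_image = [r for r in rows if not is_image(r[0]) and not is_image(r[1])]

print(either_not_image)
print(neither_image)
```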

Filter for URLs that match a pattern, then select the source and destination domains from the results using a few UDFs.

In [11]:
url_pattern = "%http://geocities.com/babiekaos/%"

all.filter("crawl_date is not NULL")\
  .filter(~(col("url").rlike(".*robots\\.txt$")) & (col("mime_type_web_server").rlike("text/html") | col("mime_type_web_server").rlike("application/xhtml\\+xml") | col("url").rlike("(?i).*htm$") | col("url").rlike("(?i).*html$")))\
  .filter(col("http_status_code") == 200)\
  .filter(col("url").like(url_pattern))\
  .select(explode(extract_links("url", "content")).alias("links"))\
  .select(remove_prefix_www(extract_domain(col("links._1"))).alias("src"), remove_prefix_www(extract_domain(col("links._2"))).alias("dest"))\
  .groupBy("src", "dest")\
  .count()\
  .show(10, False)


+-------------+---------------+-----+
|src          |dest           |count|
+-------------+---------------+-----+
|geocities.com|perfectdrug.net|1    |
|geocities.com|eatsushi.com   |1    |
|geocities.com|sushilinks.com |1    |
|geocities.com|sushifaq.com   |1    |
|geocities.com|geocities.com  |190  |
+-------------+---------------+-----+
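
`like()` uses SQL `LIKE` syntax, where `%` matches any run of characters, `_` matches a single character, and the whole value must match. The sketch below translates such a pattern into a Python regular expression to show which URLs would pass. `like_to_regex` is a hypothetical helper that ignores escape characters, and the first test URL is made up for illustration.

```python
import re


def like_to_regex(pattern: str) -> re.Pattern:
    """Translate a SQL LIKE pattern into an anchored regular expression."""
    parts = []
    for char in pattern:
        if char == "%":
            parts.append(".*")   # any run of characters
        elif char == "_":
            parts.append(".")    # exactly one character
        else:
            parts.append(re.escape(char))
    return re.compile("^" + "".join(parts) + "$", re.DOTALL)


url_regex = like_to_regex("%http://geocities.com/babiekaos/%")

print(bool(url_regex.match("http://geocities.com/babiekaos/index.html")))
print(bool(url_regex.match("http://geocities.com/cancmay/m/my-girl.html")))
```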



Select source and destination domains, with a count, where the content contains "radio", keeping only pairs that occur more than 5 times.

In [12]:
content = "%radio%"

all.filter("crawl_date is not NULL")\
  .filter(~(col("url").rlike(".*robots\\.txt$")) & (col("mime_type_web_server").rlike("text/html") | col("mime_type_web_server").rlike("application/xhtml\\+xml") | col("url").rlike("(?i).*htm$") | col("url").rlike("(?i).*html$")))\
  .filter(col("http_status_code") == 200)\
  .filter(col("content").like(content))\
  .select(explode(extract_links("url", "content")).alias("links"))\
  .select(remove_prefix_www(extract_domain(col("links._1"))).alias("src"), remove_prefix_www(extract_domain(col("links._2"))).alias("dest"))\
  .groupBy("src", "dest")\
  .count()\
  .filter(col("count") > 5)\
  .show(10, True)


+----------------+--------------------+-----+
|             src|                dest|count|
+----------------+--------------------+-----+
|   geocities.com|   idmg.blogspot.com|    9|
|   geocities.com|         atspace.com|   14|
|   geocities.com|             utm.edu|    8|
|saibabalinks.org|    saibabalinks.org|   80|
|   geocities.com|     worldscouts.net|    6|
|   geocities.com|       terravista.pt|   15|
|   geocities.com|              cbc.ca|   19|
|   geocities.com|provincia.venezia.it|    6|
|   geocities.com|              rai.it|   25|
|   geocities.com|            astra.lu|    9|
+----------------+--------------------+-----+
only showing top 10 rows



                                                                                

Demonstrate `extract_links()` UDF on the url and content columns.

In [13]:
all.filter("crawl_date is not NULL")\
  .filter(~(col("url").rlike(".*robots\\.txt$")) & (col("mime_type_web_server").rlike("text/html") | col("mime_type_web_server").rlike("application/xhtml\\+xml") | col("url").rlike("(?i).*htm$") | col("url").rlike("(?i).*html$")))\
  .filter(col("http_status_code") == 200)\
  .select(explode(extract_links("url", "content")).alias("links"))\
  .show(10, False)

+----------------------------------------------------------------------------------------------------------------+
|links                                                                                                           |
+----------------------------------------------------------------------------------------------------------------+
|{http://geocities.com/heigl_2k/img/thumbs/?N=D, http://geocities.com/heigl_2k/img/thumbs/?N=A, Name}            |
|{http://geocities.com/heigl_2k/img/thumbs/?N=D, http://geocities.com/heigl_2k/img/thumbs/?M=A, Last modified}   |
|{http://geocities.com/heigl_2k/img/thumbs/?N=D, http://geocities.com/heigl_2k/img/thumbs/?S=A, Size}            |
|{http://geocities.com/heigl_2k/img/thumbs/?N=D, http://geocities
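
Each row above is a triple of source URL, destination URL, and anchor text. To make that shape concrete, the sketch below produces similar triples with Python's standard library. It is an illustration only, not how `extract_links()` is implemented, and `LinkCollector` is a hypothetical class.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect (source, destination, anchor text) triples from <a> tags."""

    def __init__(self, source_url):
        super().__init__()
        self.source_url = source_url
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page URL.
                self._href = urljoin(self.source_url, href)
                self._text = []

    def handle_data(self, data):
        # Only collect text while inside an <a> tag that has an href.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            anchor = "".join(self._text).strip()
            self.links.append((self.source_url, self._href, anchor))
            self._href = None


page = "http://geocities.com/heigl_2k/img/thumbs/?N=D"
collector = LinkCollector(page)
collector.feed('<a href="?N=A">Name</a> <a href="?M=A">Last modified</a>')
for triple in collector.links:
    print(triple)
```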

Demonstrate `extract_image_links()` UDF on the url and content columns.

In [14]:
all.filter("crawl_date is not NULL")\
  .filter(~(col("url").rlike(".*robots\\.txt$")) & (col("mime_type_web_server").rlike("text/html") | col("mime_type_web_server").rlike("application/xhtml\\+xml") | col("url").rlike("(?i).*htm$") | col("url").rlike("(?i).*html$")))\
  .filter(col("http_status_code") == 200)\
  .select(explode(extract_image_links("url", "content")).alias("image_links"))\
  .show(10, False)

+---------------------------------------------------------------------------------------------+
|image_links                                                                                  |
+---------------------------------------------------------------------------------------------+
|{http://geocities.com/heigl_2k/img/thumbs/?N=D, http://geocities.com/icons/blank.gif,      } |
|{http://geocities.com/heigl_2k/img/thumbs/?N=D, http://geocities.com/icons/back.gif, [DIR]}  |
|{http://geocities.com/heigl_2k/img/thumbs/?N=D, http://geocities.com/icons/image2.gif, [IMG]}|
|{http://geocities.com/heigl_2k/img/thumbs/?N=D, http://geocities.com/icons/image2.gif, [IMG]}|
|{http://geocities.com/heigl_2k/img/thumbs/?N=D, http://geocities.com/icons/image2.

## `.webgraph()`

`.webgraph()` schema

In [15]:
WebArchive(sc, sqlContext, warcs)\
  .webgraph()\
  .printSchema

<bound method DataFrame.printSchema of DataFrame[crawl_date: string, src: string, dest: string, anchor: string]>

In [16]:
webgraph = WebArchive(sc, sqlContext, warcs)\
             .webgraph()

Select source and destination domains, count their occurrences, and filter out counts of 5 or fewer.

In [17]:
webgraph.groupBy(extract_domain("src").alias("src"), extract_domain("dest").alias("dest"))\
  .count()\
  .filter(col("count") > 5)\
  .show(10, True)


+-------------+--------------------+-----+
|          src|                dest|count|
+-------------+--------------------+-----+
|geocities.com|       manager.co.th|   36|
|geocities.com|      siamrath.co.th|   35|
|geocities.com|          bsdc.or.th|   34|
|geocities.com|Ladies-of-the-Hea...|   10|
|geocities.com| realitypornpass.com|   91|
|geocities.com|     ttrehber.gov.tr|    6|
|geocities.com|            mail.com|   38|
|geocities.com|      198.65.147.231|   14|
|geocities.com|              ed.gov|    8|
|geocities.com|    kodakgallery.com|   44|
+-------------+--------------------+-----+
only showing top 10 rows



                                                                                

Similar to the above example, select source and destination domains, but also group by crawl date and apply a UDF that removes the `www` prefix.

In [18]:
webgraph.groupBy("crawl_date", remove_prefix_www(extract_domain("src")).alias("src"), remove_prefix_www(extract_domain("dest")).alias("dest"))\
  .count()\
  .filter(col("count") > 5)\
  .show(10, True)


+--------------+-------------+-------------+-----+
|    crawl_date|          src|         dest|count|
+--------------+-------------+-------------+-----+
|20091027143537|geocities.com|          com|    8|
|20091027143540|geocities.com|             |   11|
|20091027143618|geocities.com|             |    8|
|20091027143716|geocities.com|             |   21|
|20091027144043|geocities.com|    yahoo.com|    6|
|20091027144420|geocities.com|geocities.com|  918|
|20091027144511|geocities.com|    qksrv.net|   63|
|20091027144611|geocities.com|             |   24|
|20091027144707|geocities.com|   expage.com|   42|
|20091027144918|geocities.com|    uinet.org|    8|
+--------------+-------------+-------------+-----+
only showing top 10 rows
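
To make the two steps concrete, here is a rough standard-library approximation: take the host from a URL, then drop a leading `www.`. The helpers `host_of` and `strip_www` are hypothetical, the example URLs are partly made up, and AUT's `extract_domain()` and `remove_prefix_www()` may behave differently on edge cases.

```python
from urllib.parse import urlparse


def host_of(url: str) -> str:
    """Return the host part of a URL, or an empty string if there is none."""
    return urlparse(url).hostname or ""


def strip_www(host: str) -> str:
    """Drop a single leading 'www.' from a host name."""
    return host[4:] if host.startswith("www.") else host


print(strip_www(host_of("http://www.geocities.com/index.html")))
print(strip_www(host_of("http://geocities.com/~jask16/weapons/naval.html")))
print(repr(strip_www(host_of("mailto:someone"))))
```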



                                                                                

## `.webpages()`

`.webpages()` schema

In [19]:
WebArchive(sc, sqlContext, warcs)\
  .webpages()\
  .printSchema

<bound method DataFrame.printSchema of DataFrame[crawl_date: string, domain: string, url: string, mime_type_web_server: string, mime_type_tika: string, language: string, content: string]>

In [20]:
webpages = WebArchive(sc, sqlContext, warcs)\
             .webpages()

Select crawl date, domain, url, and content. No UDFs are needed here, because the columns come directly from the `.webpages()` DataFrame.

In [21]:
webpages.select("crawl_date", "domain", "url", "content")\
  .show(10, True)

+--------------+-------------+--------------------+--------------------+
|    crawl_date|       domain|                 url|             content|
+--------------+-------------+--------------------+--------------------+
|20091027143512|geocities.com|http://geocities....|Index of /heigl_2...|
|20091027143512|geocities.com|http://geocities....|Naval Missles Ant...|
|20091027143512|geocities.com|http://geocities....|Diana Bellessi's ...|
|20091027143512|geocities.com|http://geocities....|KELSO COMMUNITY A...|
|20091027143512|geocities.com|http://geocities....|The Star Sapphire...|
|20091027143512|geocities.com|http://geocities....|Index of /vandenf...|
|20091027143512|geocities.com|http://geocities....|Diana Bellessi's ...|
|20091027143512|geocities


Select crawl date, domain, url, and content where domain is NOT `geocities.com`.

In [22]:
domains = ["geocities.com"]

webpages.select("crawl_date", "domain", "url", "content")\
  .filter(~col("domain").isin(domains))\
  .show(20, True)


+--------------+-----------------+--------------------+--------------------+
|    crawl_date|           domain|                 url|             content|
+--------------+-----------------+--------------------+--------------------+
|20091027143552|     bravenet.com|http://pub12.brav...|ref=document.refe...|
|20091027143813|  warpedspace.org|http://warpedspac...|aa_Ocean -- Tesse...|
|20091027144101|       weather.bg|http://weather.bg...|Bulgarian weather...|
|20091027144158|         hello.to|http://hello.to/juju|                HEHE|
|20091027145038|sandyhartwell.com|http://lolitanu-e...|lolitanu-eng.lesb...|
|20091027145045|sandyhartwell.com|http://buttsex-en...|buttsex-eng.swex....|
|20091027145045|sandyhartwell.com|http://littleloli...|littlelolita-eng....|
|20091027145047|      webring.com|http://webspace.w...|Harry Farjeon wor...|
|20091027145048|sandyhartwell.com|http://divisio9n-...|divisio9n-of-corp...|
|20091027145046|sandyhartwell.com|http://sex3-eng.n...|sex3-eng.net.sand...|

                                                                                

Select crawl date, domain, and url where url matches the pattern `%http://geocities.com/cancmay%`.

In [23]:
url_pattern = "%http://geocities.com/cancmay%"

webpages.select("crawl_date", "domain", "url")\
  .filter(col("url").like(url_pattern))\
  .show(20, False)


+--------------+-------------+-----------------------------------------------------------+
|crawl_date    |domain       |url                                                        |
+--------------+-------------+-----------------------------------------------------------+
|20091027143245|geocities.com|http://geocities.com/cancmay/m/make-it-with-you.html       |
|20091027143246|geocities.com|http://geocities.com/cancmay/m/?N=D                        |
|20091027143246|geocities.com|http://geocities.com/cancmay/m/my-girl.html                |
|20091027143246|geocities.com|http://geocities.com/cancmay/m/my-oh-my.html               |
|20091027143247|geocities.com|http://geocities.com/cancmay/m/my-happy-ending.html        |
|20091027143248|geocities.com|http://geocities.com/cancmay/m/my-funny-valentine.html     |
|20091027143249|geocities.com|http://geocities.com/cancmay/m/mambo-italiano.html         |
|20091027143249|geocities.com|http://geocities.com/cancmay/m/man-i-feel-like-a-woman.html|

                                                                                

Select crawl date, domain, url, and content from only September or October 2009.

In [24]:
dates = r"^2009(09|10)\d\d"
 
webpages.select("crawl_date", "domain", "url", "content")\
  .filter(col("crawl_date").rlike(dates))\
  .show(20, True)

+--------------+-------------+--------------------+--------------------+
|    crawl_date|       domain|                 url|             content|
+--------------+-------------+--------------------+--------------------+
|20091027143512|geocities.com|http://geocities....|Index of /heigl_2...|
|20091027143512|geocities.com|http://geocities....|Naval Missles Ant...|
|20091027143512|geocities.com|http://geocities....|Diana Bellessi's ...|
|20091027143512|geocities.com|http://geocities....|KELSO COMMUNITY A...|
|20091027143512|geocities.com|http://geocities....|The Star Sapphire...|
|20091027143512|geocities.com|http://geocities....|Index of /vandenf...|
|20091027143512|geocities.com|http://geocities....|Diana Bellessi's ...|
|20091027143512|geocities
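
When writing date patterns for `rlike`, note that each character class matches a single character independently, so `[10][09]` accepts `00` and `19` as well as `09` and `10`, and an unanchored pattern can match in the middle of a timestamp. An explicit alternation anchored at the start of the string is stricter. A quick check with Python's `re` module, using made-up timestamps apart from the first:

```python
import re

loose = re.compile(r"2009[10][09]\d\d")
strict = re.compile(r"^2009(09|10)\d\d")

timestamps = [
    "20091027143512",  # October 2009
    "20090915120000",  # September 2009
    "20091901000000",  # month "19" is not a real month
    "20081120091001",  # 2008, but contains "20091001" further in
]

loose_hits = [t for t in timestamps if loose.search(t)]
strict_hits = [t for t in timestamps if strict.search(t)]

print(loose_hits)
print(strict_hits)
```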

Select crawl date, domain, url, and content where content has "radio" in it.

In [25]:
content = "%radio%"

webpages.select("crawl_date", "domain", "url", "content")\
  .filter(col("content").like(content))\
  .show(10, True)


+--------------+-------------+--------------------+--------------------+
|    crawl_date|       domain|                 url|             content|
+--------------+-------------+--------------------+--------------------+
|20091027143513|geocities.com|http://geocities....|The Clinical Psyc...|
|20091027143530|geocities.com|http://geocities....|HUMAN SOUNDS SNOR...|
|20091027143540|geocities.com|http://geocities....|Index of /vilsono...|
|20091027143540|geocities.com|http://geocities....|Estafas esotérica...|
|20091027143601|geocities.com|http://www.geocit...|Short Articles Ab...|
|20091027143603|geocities.com|http://www.geocit...|Shoulder Separati...|
|20091027143609|geocities.com|http://geocities....|DAVID BANNER adop...|
|20091027143622|geocities.com|http://geocities....|New Page 1 By Stu...|
|20091027143624|geocities.com|http://geocities....|Titre: Forever Lo...|
|20091027143628|geocities.com|http://geocities....|Index of /vilsono...|
+--------------+-------------+--------------------+--------------------+

                                                                                

Select crawl date, domain, url, and content where the domain is `geocities.com`, and the language is French.

In [26]:
domains = ["geocities.com"]
languages = ["fr"]

webpages.select("crawl_date", "domain", "url", "content")\
  .filter(col("domain").isin(domains))\
  .filter(col("language").isin(languages))\
  .show(20, True)


+--------------+-------------+--------------------+--------------------+
|    crawl_date|       domain|                 url|             content|
+--------------+-------------+--------------------+--------------------+
|20091027143526|geocities.com|http://geocities....|  lisa's page -blue-|
|20091027143532|geocities.com|http://geocities....|              Quotes|
|20091027143533|geocities.com|http://www.geocit...|           June June|
|20091027143534|geocities.com|http://geocities....|Titre : Titre : C...|
|20091027143536|geocities.com|http://geocities....|Titre: Une nuit m...|
|20091027143537|geocities.com|http://geocities....|Titre : Sasoi ni ...|
|20091027143539|geocities.com|http://geocities....|Titre : Titre : U...|
|20091027143540|geocities.com|http://geocities....|Auteur : Larva Ti...|
|20091027143542|geocities.com|http://geocities....|Titre : Tsuki to ...|
|20091027143550|geocities.com|http://geocities....|For the first tim...|
|20091027143552|geocities.com|http://geocities....|

Select URLs that match a regex pattern.

In [27]:
url_pattern = "http://[^/]+/[^/]+/"

webpages.select("url")\
  .filter(col("url").rlike(url_pattern))\
  .show(10, False)


[2022-05-28T15:42:55.833Z - 00263 - HdfsIO] Opening file file:/home/nruest/Projects/au/sample-data/geocities/GEOCITIES-20091027143512-00103-ia400108.us.archive.org.warc.gz (Offset: 0, length: 0, decompress: false, strategy: BlockWise [dynamic])
+-------------------------------------------------------------------------------------+
|url                                                                                  |
+-------------------------------------------------------------------------------------+
|http://geocities.com/heigl_2k/img/thumbs/?N=D                                        |
|http://geocities.com/~jask16/weapons/naval.html                                      |
|http://geocities.com/Wellesley/4130/sur.html                                         |
|http://geocities.com/kelsoonbutler/boardDocs/BoardMeetingMinutes_AGM_March12_2002.htm|
|http://geocities.com/ianaxir/star_sapphire.htm                                       |
|http://geocities.com/vandenfromcamden/LJPics/cat/ 
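
As a rough, Spark-free illustration of what `url_pattern` accepts, the sketch below applies it with Python's `re.search()`, which, like Spark's `rlike()`, looks for the pattern anywhere in the string rather than requiring a full match. The first two URLs come from the output above; the last two are hypothetical, added only to show non-matches.

```python
import re

# Same pattern as in the cell above: scheme, host, then at least one
# path segment that is followed by a slash.
url_pattern = "http://[^/]+/[^/]+/"

sample_urls = [
    "http://geocities.com/Wellesley/4130/sur.html",      # from the output above
    "http://geocities.com/~jask16/weapons/naval.html",   # from the output above
    "http://geocities.com/index.html",                   # hypothetical: no directory segment
    "http://geocities.com/",                             # hypothetical: host only
]

# Keep only the URLs in which the pattern is found.
matching = [u for u in sample_urls if re.search(url_pattern, u)]

for u in matching:
    print(u)
```

Only URLs with at least one directory level below the host are kept, so pages sitting directly at the site root are filtered out.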

Show the top 10 domains.

In [28]:
webpages.groupBy("domain")\
  .count()\
  .sort(desc("count"))\
  .show(10, False)


[2022-05-28T15:42:56.092Z - 00268 - HdfsIO] Opening file file:/home/nruest/Projects/au/sample-data/geocities/GEOCITIES-20091027142649-00105-ia400111.us.archive.org.warc.gz (Offset: 0, length: 0, decompress: false, strategy: BlockWise [dynamic])
[2022-05-28T15:42:56.092Z - 00264 - HdfsIO] Opening file file:/home/nruest/Projects/au/sample-data/geocities/GEOCITIES-20091027143512-00103-ia400108.us.archive.org.warc.gz (Offset: 0, length: 0, decompress: false, strategy: BlockWise [dynamic])
[2022-05-28T15:42:56.092Z - 00272 - HdfsIO] Opening file file:/home/nruest/Projects/au/sample-data/geocities/GEOCITIES-20091027142731-00177-ia400130.us.archive.org.warc.gz (Offset: 0, length: 0, decompress: false, strategy: BlockWise [dynamic])
[2022-05-28T15:42:56.092Z - 00265 - HdfsIO] Opening file file:/home/nruest/Projects/au/sample-data/geocities/GEOCITIES-20091027143243-00104-ia400105.us.archive.org.warc.gz (Offset: 0, length: 0, decompress: false, strategy: BlockWise [dynamic])
[2022-05-28T15:42:56



+-----------------------+------+
|domain                 |count |
+-----------------------+------+
|geocities.com          |123347|
|infocastfn.com         |430   |
|amazon.com             |205   |
|bagus.com              |133   |
|globalimagegallery.com |130   |
|yahoo.com              |127   |
|physforum.com          |124   |
|viewonbuddhism.org     |122   |
|internetarchaeology.org|121   |
|tvoe.tv                |108   |
+-----------------------+------+
only showing top 10 rows
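
For readers new to the `groupBy` / `count` / `sort` chain, here is a minimal plain-Python sketch of the same aggregation using `collections.Counter`. The `page_domains` list is hypothetical sample data, not taken from the collection, and this is only an analogy for what the DataFrame query computes, not how Spark executes it.

```python
from collections import Counter

# Hypothetical input: one domain value per web page record.
page_domains = [
    "geocities.com", "yahoo.com", "geocities.com",
    "amazon.com", "geocities.com", "yahoo.com",
]

# Counter() plays the role of groupBy("domain").count();
# most_common(10) plays the role of sort(desc("count")) plus show(10).
domain_counts = Counter(page_domains)
top_domains = domain_counts.most_common(10)

for domain, count in top_domains:
    print(domain, count)
```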





Demonstrate `detect_language()` UDF on content.

In [29]:
webpages.select("crawl_date", detect_language("content").alias("udf_language"), "language")\
  .show(10, True)

[2022-05-28T15:47:05.559Z - 00474 - HdfsIO] Opening file file:/home/nruest/Projects/au/sample-data/geocities/GEOCITIES-20091027143512-00103-ia400108.us.archive.org.warc.gz (Offset: 0, length: 0, decompress: false, strategy: BlockWise [dynamic])
+--------------+------------+--------+
|    crawl_date|udf_language|language|
+--------------+------------+--------+
|20091027143512|          nl|      nl|
|20091027143512|          en|      en|
|20091027143512|          es|      es|
|20091027143512|          en|      en|
|20091027143512|          en|      en|
|20091027143512|          en|      en|
|20091027143512|          en|      en|
|20091027143512|          hr|      hr|
|20091027143512|          en|      en|
|20091027143512|          en|      en|
+--------------+------------+--------+
only showing top 10 rows



Select crawl date, domain, url, and language where the language is either Spanish or French.

In [30]:
languages = ["es", "fr"]

webpages.filter(col("language").isin(languages))\
  .select("crawl_date", "domain", "url", "language")\
  .show(50, True)

[2022-05-28T15:47:05.950Z - 00475 - HdfsIO] Opening file file:/home/nruest/Projects/au/sample-data/geocities/GEOCITIES-20091027143512-00103-ia400108.us.archive.org.warc.gz (Offset: 0, length: 0, decompress: false, strategy: BlockWise [dynamic])


+--------------+-------------+--------------------+--------+
|    crawl_date|       domain|                 url|language|
+--------------+-------------+--------------------+--------+
|20091027143512|geocities.com|http://geocities....|      es|
|20091027143515|geocities.com|http://geocities....|      es|
|20091027143518|geocities.com|http://geocities....|      es|
|20091027143519|geocities.com|http://geocities....|      es|
|20091027143522|geocities.com|http://geocities....|      es|
|20091027143523|geocities.com|http://geocities....|      es|
|20091027143524|geocities.com|http://geocities....|      es|
|20091027143526|geocities.com|http://geocities....|      fr|
|20091027143526|geocities.com|http://geocities....|      es|
|20091027143529|geocities.com|http://geocities....|      es|
|20091027143532|geocities.com|http://geocities....|      fr|
|20091027143532|geocities.com|http://geocities....|      es|
|20091027143533|geocities.com|http://www.geocit...|      fr|
|20091027143534|geocitie

Select crawl date, domain, url, and language where the language is *neither* Spanish nor French.

In [31]:
languages = ["es", "fr"]

webpages.filter(~col("language").isin(languages))\
  .select("crawl_date", "domain", "url", "language")\
  .show(50, True)

[2022-05-28T15:47:09.932Z - 00476 - HdfsIO] Opening file file:/home/nruest/Projects/au/sample-data/geocities/GEOCITIES-20091027143512-00103-ia400108.us.archive.org.warc.gz (Offset: 0, length: 0, decompress: false, strategy: BlockWise [dynamic])
+--------------+-------------+--------------------+--------+
|    crawl_date|       domain|                 url|language|
+--------------+-------------+--------------------+--------+
|20091027143512|geocities.com|http://geocities....|      nl|
|20091027143512|geocities.com|http://geocities....|      en|
|20091027143512|geocities.com|http://geocities....|      en|
|20091027143512|geocities.com|http://geocities....|      en|
|20091027143512|geocities.com|http://geocities....|      en|
|20091027143512|geocities.com|http://geocities....|      en|
|20091027143512|geocities.com|http://geocities....|      hr|
|20091027143512|geocities.com|http://geocities....|      en|
|20091027143512|geocities.com|http://geocities....|      en|
|20091027143513|geociti
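
One detail worth keeping in mind when negating `isin()`: Spark SQL uses three-valued logic, so a null value is neither "in" nor "not in" a list, and `filter()` keeps only rows where the condition is true. The plain-Python sketch below mimics that behaviour. The helper names `isin` and `negate` and the sample rows are hypothetical, and whether the `language` column in this collection ever contains nulls is not something the output above tells us.

```python
def isin(value, allowed):
    """Mimic SQL IN: a null (None) input yields an unknown (None) result."""
    if value is None:
        return None
    return value in allowed


def negate(result):
    """Mimic SQL NOT: unknown stays unknown."""
    return None if result is None else not result


languages = ["es", "fr"]

# Hypothetical rows; the last one has no detected language.
rows = [
    {"url": "a", "language": "en"},
    {"url": "b", "language": "es"},
    {"url": "c", "language": "fr"},
    {"url": "d", "language": None},
]

# A filter keeps a row only when the condition is exactly True.
kept = [r["url"] for r in rows if isin(r["language"], languages) is True]
excluded = [r["url"] for r in rows if negate(isin(r["language"], languages)) is True]

print(kept)
print(excluded)
```

Row `d` appears in neither list, so the two filters together do not necessarily cover every row of the DataFrame.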

Select domain, where the domain is either `guggenheim.org` or `msnbc.com`.

In [32]:
domains = ["guggenheim.org", "msnbc.com"]
 
webpages.select("domain")\
  .filter(col("domain").isin(domains))\
  .show(10, False)

[2022-05-28T15:47:10.673Z - 00477 - HdfsIO] Opening file file:/home/nruest/Projects/au/sample-data/geocities/GEOCITIES-20091027143512-00103-ia400108.us.archive.org.warc.gz (Offset: 0, length: 0, decompress: false, strategy: BlockWise [dynamic])


[2022-05-28T15:48:09.310Z - 00478 - HdfsIO] Opening file file:/home/nruest/Projects/au/sample-data/geocities/GEOCITIES-20091027143243-00104-ia400105.us.archive.org.warc.gz (Offset: 0, length: 0, decompress: false, strategy: BlockWise [dynamic])
[2022-05-28T15:48:09.310Z - 00481 - HdfsIO] Opening file file:/home/nruest/Projects/au/sample-data/geocities/GEOCITIES-20091027142649-00105-ia400111.us.archive.org.warc.gz (Offset: 0, length: 0, decompress: false, strategy: BlockWise [dynamic])
[2022-05-28T15:48:09.310Z - 00480 - HdfsIO] Opening file file:/home/nruest/Projects/au/sample-data/geocities/GEOCITIES-20091027143351-00117-ia400103.us.archive.org.warc.gz (Offset: 0, length: 0, decompress: false, strategy: BlockWise [dynamic])
[2022-05-28T15:48:09.310Z - 00479 - HdfsIO] Opening file file:/home/nruest/Projects/au/sample-data/geocities/GEOCITIES-20091027143856-00108-ia400107.us.archive.org.warc.gz (Offset: 0, length: 0, decompress: false, strategy: BlockWise [dynamic])


[2022-05-28T15:50:13.179Z - 00485 - HdfsIO] Opening file file:/home/nruest/Projects/au/sample-data/geocities/GEOCITIES-20091027142731-00177-ia400130.us.archive.org.warc.gz (Offset: 0, length: 0, decompress: false, strategy: BlockWise [dynamic])
[2022-05-28T15:50:13.179Z - 00483 - HdfsIO] Opening file file:/home/nruest/Projects/au/sample-data/geocities/GEOCITIES-20091027143841-00136-ia400104.us.archive.org.warc.gz (Offset: 0, length: 0, decompress: false, strategy: BlockWise [dynamic])
[2022-05-28T15:50:13.180Z - 00484 - HdfsIO] Opening file file:/home/nruest/Projects/au/sample-data/geocities/GEOCITIES-20091027143451-00102-ia400108.us.archive.org.warc.gz (Offset: 0, length: 0, decompress: false, strategy: BlockWise [dynamic])
[2022-05-28T15:50:13.180Z - 00482 - HdfsIO] Opening file file:/home/nruest/Projects/au/sample-data/geocities/GEOCITIES-20091027143340-00105-ia400105.us.archive.org.warc.gz (Offset: 0, length: 0, decompress: false, strategy: BlockWise [dynamic])
[2022-05-28T15:50:13



+--------------+
|domain        |
+--------------+
|msnbc.com     |
|msnbc.com     |
|msnbc.com     |
|guggenheim.org|
+--------------+



                                                                                