# AUT PySpark Documentation Examples

Import the `aut` Python library, along with the PySpark SQL functions used in the examples below.

In [1]:
from aut import *
from pyspark.sql.functions import col, desc, explode

## `.all()`

`.all()` schema

In [2]:
WebArchive(sc, sqlContext, "/home/nruest/Projects/au/sample-data/geocities")\
  .all()\
  .printSchema

<bound method DataFrame.printSchema of DataFrame[crawl_date: string, url: string, mime_type_web_server: string, mime_type_tika: string, content: string, bytes: binary, http_status_code: string, archive_filename: string]>
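The output above is the repr of a bound method rather than a schema tree, because `printSchema` is referenced without being called. The plain-Python sketch below illustrates the difference. `Frame` and `print_schema` are hypothetical stand-ins for illustration only, and are not part of `aut` or PySpark.

```python
class Frame:
    """Hypothetical stand-in for a DataFrame, used only for illustration."""

    def __init__(self, columns):
        self.columns = columns

    def print_schema(self):
        # Return the text instead of printing it so the result is easy to inspect.
        return "\n".join(f"|-- {name}: {dtype}" for name, dtype in self.columns)


frame = Frame([("crawl_date", "string"), ("url", "string")])

reference = frame.print_schema   # no parentheses: a bound method object
result = frame.print_schema()    # parentheses: the method actually runs

print(reference)  # repr of the bound method, similar in shape to the output above
print(result)
```

In PySpark, calling `printSchema()` with parentheses prints the schema as a tree instead of the method repr.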

Select `url` and `http_status_code`, and show 10 with non-truncated columns.

In [3]:
WebArchive(sc, sqlContext, "/home/nruest/Projects/au/sample-data/geocities")\
  .all()\
  .select("url", "http_status_code")\
  .show(10, False)

+-------------------------------------------------------------------------------------+----------------+
|url                                                                                  |http_status_code|
+-------------------------------------------------------------------------------------+----------------+
|http://geocities.com/heigl_2k/img/thumbs/?N=D                                        |200             |
|http://geocities.com/~jask16/weapons/naval.html                                      |200             |
|http://geocities.com/Wellesley/4130/sur.html                                         |200             |
|http://geocities.com/kelsoonbutler/boardDocs/BoardMeetingMinutes_AGM_March12_2002.htm|200             |
|http://geocities.com/ianaxir/star_sapphire.htm                                       |200             |
|http://geocities.com/vandenfromcamden/LJPics/cat/                                    |200             |
|http://geocities.com/Wellesley/4130/eroica.html       

Select `url` and `archive_filename`, and show 10 with truncated columns.

In [4]:
WebArchive(sc, sqlContext, "/home/nruest/Projects/au/sample-data/geocities")\
  .all()\
  .select("url", "archive_filename")\
  .show(10, True)

+--------------------+--------------------+
|                 url|    archive_filename|
+--------------------+--------------------+
|http://geocities....|file:/home/nruest...|
|http://geocities....|file:/home/nruest...|
|http://geocities....|file:/home/nruest...|
|http://geocities....|file:/home/nruest...|
|http://geocities....|file:/home/nruest...|
|http://geocities....|file:/home/nruest...|
|http://geocities....|file:/home/nruest...|
|http://geocities....|file:/home/nruest...|
|http://geocities....|file:/home/nruest...|
|http://geocities....|file:/home/nruest...|
+--------------------+--------------------+
only showing top 10 rows



Select crawl date, MIME type, and bytes, and demonstrate how to apply a [User Defined Function](https://docs.databricks.com/spark/latest/spark-sql/udf-scala.html) (UDF) to a column. In this case, `detect_mime_type_tika()` detects the [MIME type](https://en.wikipedia.org/wiki/MIME) from the [Base64](https://en.wikipedia.org/wiki/Base64) encoded bytes column using [Apache Tika](https://tika.apache.org/).

In [5]:
WebArchive(sc, sqlContext, "/home/nruest/Projects/au/sample-data/geocities")\
  .all()\
  .select("crawl_date", detect_mime_type_tika("bytes").alias("udf_tika"), "mime_type_tika")\
  .show(10, True)

+----------+---------+--------------+
|crawl_date| udf_tika|mime_type_tika|
+----------+---------+--------------+
|  20091027|text/html|     text/html|
|  20091027|text/html|     text/html|
|  20091027|text/html|     text/html|
|  20091027|text/html|     text/html|
|  20091027|text/html|     text/html|
|  20091027|text/html|     text/html|
|  20091027|text/html|     text/html|
|  20091027|text/html|     text/html|
|  20091027|text/html|     text/html|
|  20091027|text/html|     text/html|
+----------+---------+--------------+
only showing top 10 rows
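For intuition about content-based MIME detection, here is a minimal plain-Python sketch that decodes Base64 input and compares the leading bytes against a few well-known file signatures. It is not how Apache Tika or `detect_mime_type_tika()` is implemented. The function name `sniff_mime_type`, the signature table, and the sample payload are all assumptions made for illustration.

```python
import base64

# A few well-known file signatures ("magic numbers"); real detectors know far more.
SIGNATURES = [
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"\xff\xd8\xff", "image/jpeg"),
    (b"GIF87a", "image/gif"),
    (b"GIF89a", "image/gif"),
    (b"%PDF-", "application/pdf"),
]


def sniff_mime_type(encoded: str) -> str:
    """Hypothetical helper: guess a MIME type from Base64-encoded content."""
    raw = base64.b64decode(encoded)
    for signature, mime_type in SIGNATURES:
        if raw.startswith(signature):
            return mime_type
    # Fall back to a crude check for HTML markup near the start of the payload.
    head = raw[:512].lstrip().lower()
    if head.startswith(b"<!doctype html") or head.startswith(b"<html"):
        return "text/html"
    return "application/octet-stream"


sample = base64.b64encode(b"<html><body>Hello</body></html>").decode("ascii")
print(sniff_mime_type(sample))  # text/html
```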



Select both MIME type columns and the url, and filter out records that both Apache Tika and the web server identify as images.

In [6]:
WebArchive(sc, sqlContext, "/home/nruest/Projects/au/sample-data/geocities")\
  .all()\
  .select("mime_type_tika", "mime_type_web_server", "url")\
  .filter(~col("mime_type_tika").like("image/%") | ~col("mime_type_web_server").like("image/%"))\
  .show(10, False)

+--------------+--------------------+-------------------------------------------------------------------------------------+
|mime_type_tika|mime_type_web_server|url                                                                                  |
+--------------+--------------------+-------------------------------------------------------------------------------------+
|text/html     |text/html           |http://geocities.com/heigl_2k/img/thumbs/?N=D                                        |
|text/html     |text/html           |http://geocities.com/~jask16/weapons/naval.html                                      |
|text/html     |text/html           |http://geocities.com/Wellesley/4130/sur.html                                         |
|text/html     |text/html           |http://geocities.com/kelsoonbutler/boardDocs/BoardMeetingMinutes_AGM_March12_2002.htm|
|text/html     |text/html           |http://geocities.com/ianaxir/star_sapphire.htm                                       |
|text/ht
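The filter in the cell above joins two negated conditions with `|`, so a record is dropped only when both MIME type columns start with `image/`. The plain-Python sketch below contrasts that with an `and` combination, which would drop a record when either column reports an image. The sample records are made up for illustration.

```python
# Made-up (mime_type_tika, mime_type_web_server) pairs, for illustration only.
records = [
    ("image/png", "image/png"),
    ("text/html", "image/gif"),
    ("image/jpeg", "text/plain"),
    ("text/html", "text/html"),
]


def is_image(mime_type: str) -> bool:
    return mime_type.startswith("image/")


# Mirrors `~a | ~b`: keep the record unless BOTH columns say "image".
either_not_image = [r for r in records if not is_image(r[0]) or not is_image(r[1])]

# Mirrors `~a & ~b`: keep the record only if NEITHER column says "image".
neither_image = [r for r in records if not is_image(r[0]) and not is_image(r[1])]

print(len(either_not_image))  # 3
print(neither_image)          # [('text/html', 'text/html')]
```

Which combination is appropriate depends on how much the two MIME type sources are trusted when they disagree.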

## `.webgraph()`

`.webgraph()` schema

In [7]:
WebArchive(sc, sqlContext, "/home/nruest/Projects/au/sample-data/geocities")\
  .webgraph()\
  .printSchema

<bound method DataFrame.printSchema of DataFrame[crawl_date: string, src: string, dest: string, anchor: string]>

Group by source and destination domains, count their occurrences, and keep only the pairs that occur more than 5 times.

In [8]:
WebArchive(sc, sqlContext, "/home/nruest/Projects/au/sample-data/geocities")\
  .webgraph()\
  .groupBy(extract_domain("src").alias("src"), extract_domain("dest").alias("dest"))\
  .count()\
  .filter(col("count") > 5)\
  .show(10, True)

+-----------------+--------------------+-----+
|              src|                dest|count|
+-----------------+--------------------+-----+
|    geocities.com|   service.bfast.com|  139|
|    geocities.com|   www.prizma.net.tr|   26|
|    geocities.com|      www.disney.com|   12|
|www.geocities.com|     www.bigfoot.com|   10|
|    geocities.com|   idmg.blogspot.com|   25|
|    geocities.com| img.photobucket.com|  324|
|    geocities.com|www.buyandselldb.com|16200|
|    geocities.com| www.metroactive.com|    6|
|    geocities.com|  pub16.bravenet.com|   24|
|www.geocities.com|        www.nasa.gov|    6|
+-----------------+--------------------+-----+
only showing top 10 rows
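The group, count, and filter steps above can be approximated in plain Python with `collections.Counter`, which may help when checking the logic on a handful of links. This is a sketch under stated assumptions: the `domain` helper relies on `urllib.parse` rather than on `extract_domain()`, and the link pairs are made up.

```python
from collections import Counter
from urllib.parse import urlparse

# Made-up (source URL, destination URL) pairs, for illustration only.
links = (
    [("http://geocities.com/a/index.html", "http://www.nasa.gov/")] * 6
    + [("http://geocities.com/b/index.html", "http://www.disney.com/")] * 5
)


def domain(url: str) -> str:
    """Hypothetical helper: return the host part of a URL."""
    return urlparse(url).hostname or ""


# Count each (source domain, destination domain) pair.
pair_counts = Counter((domain(src), domain(dest)) for src, dest in links)

# Keep pairs seen more than 5 times; a pair seen exactly 5 times is dropped.
frequent = {pair: n for pair, n in pair_counts.items() if n > 5}
print(frequent)  # {('geocities.com', 'www.nasa.gov'): 6}
```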



Similar to the above example, but also group by crawl date and apply the `remove_prefix_www()` UDF to strip the `www` prefix from the extracted source and destination domains.

In [9]:
WebArchive(sc, sqlContext, "/home/nruest/Projects/au/sample-data/geocities")\
  .webgraph()\
  .groupBy("crawl_date", remove_prefix_www(extract_domain("src")).alias("src"), remove_prefix_www(extract_domain("dest")).alias("dest"))\
  .count()\
  .filter(col("count") > 5)\
  .show(10, True)

+----------+-------------+--------------------+-----+
|crawl_date|          src|                dest|count|
+----------+-------------+--------------------+-----+
|  20091027|geocities.com|           tv3.co.th|  104|
|  20091027|geocities.com|           itv.co.th|   36|
|  20091027|geocities.com|     roushracing.com|    8|
|  20091027|geocities.com|             come.to|  142|
|  20091027|geocities.com|meltingpot.fortun...|   15|
|  20091027|geocities.com|         gostats.com|   11|
|  20091027|geocities.com|   www2.bravenet.com|   13|
|  20091027|geocities.com|            best.com|   26|
|  20091027|geocities.com|plugin.smileycent...|    9|
|  20091027|geocities.com|           sciam.com|    6|
+----------+-------------+--------------------+-----+
only showing top 10 rows
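As a rough illustration of what the domain normalization does, the plain-Python sketch below extracts a host and strips a leading `www.` label. The helper names `sketch_domain` and `sketch_remove_www` are hypothetical, and the regular expression is an assumption about the behaviour rather than the `aut` implementation.

```python
import re
from urllib.parse import urlparse


def sketch_domain(url: str) -> str:
    """Hypothetical helper: return the host part of a URL."""
    return urlparse(url).hostname or ""


def sketch_remove_www(host: str) -> str:
    """Hypothetical helper: strip only a leading 'www.' label."""
    return re.sub(r"^www\.", "", host)


for url in ["http://www.geocities.com/page.html", "http://www2.bravenet.com/"]:
    print(sketch_remove_www(sketch_domain(url)))
# geocities.com
# www2.bravenet.com
```

Note that a host such as `www2.bravenet.com` is left unchanged, which is consistent with the table above.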



## `.webpages()`

`.webpages()` schema

In [10]:
WebArchive(sc, sqlContext, "/home/nruest/Projects/au/sample-data/geocities")\
  .webpages()\
  .printSchema

<bound method DataFrame.printSchema of DataFrame[crawl_date: string, url: string, mime_type_web_server: string, mime_type_tika: string, language: string, content: string]>

Select crawl date, domain, url, and content by using two UDFs: `extract_domain()`, and `remove_html()`.

In [11]:
WebArchive(sc, sqlContext, "/home/nruest/Projects/au/sample-data/geocities")\
  .webpages()\
  .select("crawl_date", extract_domain("url").alias("domain"), "url", remove_html("content").alias("content"))\
  .show(10, True)

+----------+-------------+--------------------+--------------------+
|crawl_date|       domain|                 url|             content|
+----------+-------------+--------------------+--------------------+
|  20091027|geocities.com|http://geocities....|Index of /heigl_2...|
|  20091027|geocities.com|http://geocities....|Naval Missles Ant...|
|  20091027|geocities.com|http://geocities....|Diana Bellessi's ...|
|  20091027|geocities.com|http://geocities....|KELSO COMMUNITY A...|
|  20091027|geocities.com|http://geocities....|The Star Sapphire...|
|  20091027|geocities.com|http://geocities....|Index of /vandenf...|
|  20091027|geocities.com|http://geocities....|Diana Bellessi's ...|
|  20091027|geocities.com|http://geocities....|Index of /kkrhyth...|
|  20091027|geocities.com|http://geocities....|Index of /guinncj...|
|  20091027|geocities.com|http://geocities....|The Clinical Psyc...|
+----------+-------------+--------------------+--------------------+
only showing top 10 rows



Keep only the urls that match a pattern, then select the source and destination domains of the links extracted from those pages using a few UDFs.

In [12]:
url_pattern = "%http://geocities.com/babiekaos/%"

WebArchive(sc, sqlContext, "/home/nruest/Projects/au/sample-data/geocities")\
  .webpages()\
  .filter(col("url").like(url_pattern))\
  .select(explode(extract_links("url", "content")).alias("links"))\
  .select(remove_prefix_www(extract_domain(col("links._1"))).alias("src"), remove_prefix_www(extract_domain(col("links._2"))).alias("dest"))\
  .groupBy("src", "dest")\
  .count()\
  .show(10, False)

+-------------+---------------------+-----+
|src          |dest                 |count|
+-------------+---------------------+-----+
|geocities.com|sushi.perfectdrug.net|1    |
|geocities.com|eatsushi.com         |1    |
|geocities.com|sushilinks.com       |1    |
|geocities.com|sushifaq.com         |1    |
+-------------+---------------------+-----+
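The `like()` patterns used in these examples follow SQL `LIKE` semantics: `%` matches any run of characters, `_` matches exactly one character, and the pattern must cover the whole value. The sketch below translates such a pattern into a Python regular expression for quick local checks. The helper names are hypothetical, and escape characters are deliberately not handled.

```python
import re


def like_to_regex(pattern: str) -> re.Pattern:
    """Hypothetical helper: translate a SQL LIKE pattern (no escape support)."""
    parts = []
    for ch in pattern:
        if ch == "%":
            parts.append(".*")   # any run of characters, including none
        elif ch == "_":
            parts.append(".")    # exactly one character
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts), re.DOTALL)


def like(value: str, pattern: str) -> bool:
    # LIKE must match the whole value, hence fullmatch rather than search.
    return like_to_regex(pattern).fullmatch(value) is not None


url_pattern = "%http://geocities.com/babiekaos/%"
print(like("http://geocities.com/babiekaos/sushi.html", url_pattern))  # True
print(like("http://geocities.com/other/index.html", url_pattern))      # False
print(like("image/png", "image/%"))                                    # True
```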



Select crawl date, domain, url, and content where domain is `www.geocities.com`.

In [13]:
domains = ["www.geocities.com"]

WebArchive(sc, sqlContext, "/home/nruest/Projects/au/sample-data/geocities")\
  .webpages()\
  .select("crawl_date", extract_domain("url").alias("domain"), "url", remove_html(remove_http_header("content")).alias("content"))\
  .filter(col("domain").isin(domains))\
  .show(20, True)

+----------+-----------------+--------------------+--------------------+
|crawl_date|           domain|                 url|             content|
+----------+-----------------+--------------------+--------------------+
|  20091027|www.geocities.com|http://www.geocit...|Rita Rittie's Hom...|
|  20091027|www.geocities.com|http://www.geocit...|vilperi Vilperin ...|
|  20091027|www.geocities.com|http://www.geocit...|mina Min� Etusivu...|
|  20091027|www.geocities.com|http://www.geocit...|Latino_Resources ...|
|  20091027|www.geocities.com|http://www.geocit...|kerttu Kertun siv...|
|  20091027|www.geocities.com|http://www.geocit...|CyberFausto Webpa...|
|  20091027|www.geocities.com|http://www.geocit...|ipana Ipanan sivu...|
|  20091027|www.geocities.com|http://www.geocit...|page6 Bookmarks f...|
|  20091027|www.geocities.com|http://www.geocit...|page2 LCO 2148 Bo...|
|  20091027|www.geocities.com|http://www.geocit...|CyberFausto Webpa...|
|  20091027|www.geocities.com|http://www.geocit...|

Select crawl date, domain, and url where url matches the pattern `%http://geocities.com/cancmay%`.

In [14]:
url_pattern = "%http://geocities.com/cancmay%"

WebArchive(sc, sqlContext, "/home/nruest/Projects/au/sample-data/geocities")\
  .webpages()\
  .select("crawl_date", extract_domain("url").alias("domain"), "url")\
  .filter(col("url").like(url_pattern))\
  .show(20, False)

+----------+-------------+-----------------------------------------------------------+
|crawl_date|domain       |url                                                        |
+----------+-------------+-----------------------------------------------------------+
|20091027  |geocities.com|http://geocities.com/cancmay/m/make-it-with-you.html       |
|20091027  |geocities.com|http://geocities.com/cancmay/m/?N=D                        |
|20091027  |geocities.com|http://geocities.com/cancmay/m/my-girl.html                |
|20091027  |geocities.com|http://geocities.com/cancmay/m/my-oh-my.html               |
|20091027  |geocities.com|http://geocities.com/cancmay/m/my-happy-ending.html        |
|20091027  |geocities.com|http://geocities.com/cancmay/m/my-funny-valentine.html     |
|20091027  |geocities.com|http://geocities.com/cancmay/m/mambo-italiano.html         |
|20091027  |geocities.com|http://geocities.com/cancmay/m/man-i-feel-like-a-woman.html|
|20091027  |geocities.com|http://geocities.

Select crawl date, domain, url, and content from only September or October 2009.

In [15]:
dates = r"2009(09|10)\d\d"  # raw string; matches crawl dates in month 09 or 10 of 2009
 
WebArchive(sc, sqlContext, "/home/nruest/Projects/au/sample-data/geocities")\
  .webpages()\
  .select("crawl_date", extract_domain("url").alias("domain"), "url", remove_html(remove_http_header("content")).alias("content"))\
  .filter(col("crawl_date").rlike(dates))\
  .show(20, True)

+----------+-------------+--------------------+--------------------+
|crawl_date|       domain|                 url|             content|
+----------+-------------+--------------------+--------------------+
|  20091027|geocities.com|http://geocities....|Index of /heigl_2...|
|  20091027|geocities.com|http://geocities....|Naval Missles Ant...|
|  20091027|geocities.com|http://geocities....|Diana Bellessi's ...|
|  20091027|geocities.com|http://geocities....|KELSO COMMUNITY A...|
|  20091027|geocities.com|http://geocities....|The Star Sapphire...|
|  20091027|geocities.com|http://geocities....|Index of /vandenf...|
|  20091027|geocities.com|http://geocities....|Diana Bellessi's ...|
|  20091027|geocities.com|http://geocities....|Index of /kkrhyth...|
|  20091027|geocities.com|http://geocities....|Index of /guinncj...|
|  20091027|geocities.com|http://geocities....|The Clinical Psyc...|
|  20091027|geocities.com|http://geocities....|Index of /brokenk...|
|  20091027|geocities.com|http://g
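One way to sanity-check a date pattern before running it over a whole collection is to try it on a few sample `crawl_date` strings with Python's `re` module. The pattern and the sample dates below are illustrative assumptions. `re.search` is used because, like `rlike`, it accepts a match anywhere in the string.

```python
import re

# Illustrative pattern: year 2009 followed by month 09 or 10, then a two-digit day.
dates = re.compile(r"2009(09|10)\d\d")

# Made-up crawl dates in YYYYMMDD form, including values that should not match.
samples = ["20090915", "20091027", "20091127", "20091901", "20100915"]

matches = [s for s in samples if dates.search(s)]
print(matches)  # ['20090915', '20091027']
```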

Select crawl date, domain, url, and content where content has "radio" in it.

In [16]:
content = "%radio%"

WebArchive(sc, sqlContext, "/home/nruest/Projects/au/sample-data/geocities")\
  .webpages()\
  .select("crawl_date", extract_domain("url").alias("domain"), "url", remove_html(remove_http_header("content")).alias("content"))\
  .filter(col("content").like(content))\
  .show(10, True)

+----------+-----------------+--------------------+--------------------+
|crawl_date|           domain|                 url|             content|
+----------+-----------------+--------------------+--------------------+
|  20091027|    geocities.com|http://geocities....|The Clinical Psyc...|
|  20091027|    geocities.com|http://geocities....|HUMAN SOUNDS SNOR...|
|  20091027|    geocities.com|http://geocities....|Index of /vilsono...|
|  20091027|    geocities.com|http://geocities....|Estafas esot�rica...|
|  20091027|www.geocities.com|http://www.geocit...|Short Articles Ab...|
|  20091027|www.geocities.com|http://www.geocit...|Shoulder Separati...|
|  20091027|    geocities.com|http://geocities....|DAVID BANNER adop...|
|  20091027|    geocities.com|http://geocities....|New Page 1 By Stu...|
|  20091027|    geocities.com|http://geocities....|Titre: Forever Lo...|
|  20091027|    geocities.com|http://geocities....|Index of /vilsono...|
+----------+-----------------+--------------------+--------------------+

Select crawl date, domain, url, and content where the domain is `www.geocities.com` and the language is French.

In [17]:
domains = ["www.geocities.com"]
languages = ["fr"]

WebArchive(sc, sqlContext, "/home/nruest/Projects/au/sample-data/geocities")\
  .webpages()\
  .select("crawl_date", extract_domain("url").alias("domain"), "url", remove_html(remove_http_header("content")).alias("content"))\
  .filter(col("domain").isin(domains))\
  .filter(col("language").isin(languages))\
  .show(20, True)

+----------+-----------------+--------------------+--------------------+
|crawl_date|           domain|                 url|             content|
+----------+-----------------+--------------------+--------------------+
|  20091027|www.geocities.com|http://www.geocit...|           June June|
|  20091027|www.geocities.com|http://www.geocit...|       femaletrouble|
|  20091027|www.geocities.com|http://www.geocit...|surprise1 Pour pr...|
|  20091027|www.geocities.com|http://www.geocit...|rose Ma vie en ro...|
|  20091027|www.geocities.com|http://www.geocit...|eucalyptus Eucaly...|
|  20091027|www.geocities.com|http://www.geocit...|radioh Radiohead ...|
|  20091027|www.geocities.com|http://www.geocit...|ej5 pr�c�dent - M...|
|  20091027|www.geocities.com|http://www.geocit...|apier An Pierl� D...|
|  20091027|www.geocities.com|http://www.geocit...|ej9 pr�c�dent Nou...|
|  20091027|www.geocities.com|http://www.geocit...|ej94 pr�c�dent Je...|
|  20091027|www.geocities.com|http://www.geocit...|

Select source and destination domains, with a count, where the content contains "radio".

In [18]:
content = "%radio%"

WebArchive(sc, sqlContext, "/home/nruest/Projects/au/sample-data/geocities")\
  .webpages()\
  .filter(col("content").like(content))\
  .select(explode(extract_links("url", "content")).alias("links"))\
  .select(remove_prefix_www(extract_domain(col("links._1"))).alias("src"), remove_prefix_www(extract_domain(col("links._2"))).alias("dest"))\
  .groupBy("src", "dest")\
  .count()\
  .filter(col("count") > 5)\
  .show(10, True)

+----------------+--------------------+-----+
|             src|                dest|count|
+----------------+--------------------+-----+
|   geocities.com|   idmg.blogspot.com|    9|
|saibabalinks.org|    saibabalinks.org|   14|
|   geocities.com|   service.bfast.com|   15|
|   geocities.com| img.photobucket.com|   23|
|   geocities.com|     worldscouts.net|    6|
|   geocities.com|       terravista.pt|   15|
|   geocities.com|free.hostdepartme...|   30|
|   geocities.com|provincia.venezia.it|    6|
|   geocities.com|   privacy.yahoo.com|    6|
|   geocities.com|              rai.it|    7|
+----------------+--------------------+-----+
only showing top 10 rows



Select urls that match a regex pattern.

In [19]:
url_pattern = "http://[^/]+/[^/]+/"

WebArchive(sc, sqlContext, "/home/nruest/Projects/au/sample-data/geocities")\
  .webpages()\
  .select("url")\
  .filter(col("url").rlike(url_pattern))\
  .show(10, False)


+-------------------------------------------------------------------------------------+
|url                                                                                  |
+-------------------------------------------------------------------------------------+
|http://geocities.com/heigl_2k/img/thumbs/?N=D                                        |
|http://geocities.com/~jask16/weapons/naval.html                                      |
|http://geocities.com/Wellesley/4130/sur.html                                         |
|http://geocities.com/kelsoonbutler/boardDocs/BoardMeetingMinutes_AGM_March12_2002.htm|
|http://geocities.com/ianaxir/star_sapphire.htm                                       |
|http://geocities.com/vandenfromcamden/LJPics/cat/                                    |
|http://geocities.com/Wellesley/4130/eroica.html                                      |
|http://geocities.com/kkrhythm/kyotei/photo/?M=A                                      |
|http://geocities.com/guinncj/im

Show top 10 domains.

In [20]:
WebArchive(sc, sqlContext, "/home/nruest/Projects/au/sample-data/geocities")\
  .webpages()\
  .select(remove_prefix_www(extract_domain("url")).alias("domain"))\
  .groupBy("domain")\
  .count()\
  .sort(desc("count"))\
  .show(10, False)


+-----------------------+------+
|domain                 |count |
+-----------------------+------+
|geocities.com          |123109|
|infocastfn.com         |430   |
|rcm.amazon.com         |201   |
|bagus.com              |133   |
|globalimagegallery.com |130   |
|physforum.com          |124   |
|viewonbuddhism.org     |122   |
|internetarchaeology.org|121   |
|us.geocities.com       |121   |
|spb.tvoe.tv            |108   |
+-----------------------+------+
only showing top 10 rows



Select content, removing the HTTP headers and the HTML markup.

In [21]:
WebArchive(sc, sqlContext, "/home/nruest/Projects/au/sample-data/geocities")\
  .webpages()\
  .select(remove_html(remove_http_header("content")).alias("content"))\
  .show(10, True)

+--------------------+
|             content|
+--------------------+
|Index of /heigl_2...|
|Naval Missles Ant...|
|Diana Bellessi's ...|
|KELSO COMMUNITY A...|
|The Star Sapphire...|
|Index of /vandenf...|
|Diana Bellessi's ...|
|Index of /kkrhyth...|
|Index of /guinncj...|
|The Clinical Psyc...|
+--------------------+
only showing top 10 rows
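To make the two cleaning steps concrete, here is a minimal standard-library sketch that drops everything up to the first blank line (the HTTP header block) and then keeps only the text nodes of the HTML. It is an approximation for illustration, not the implementation behind `remove_http_header()` or `remove_html()`. The helper names and the sample record are assumptions.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects the non-empty text nodes of an HTML document."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        if data.strip():
            self.chunks.append(data.strip())


def strip_http_header(record: str) -> str:
    """Hypothetical helper: drop the header block, which ends at the first blank line."""
    _head, separator, body = record.partition("\r\n\r\n")
    return body if separator else record


def strip_html(markup: str) -> str:
    """Hypothetical helper: keep only the text content of the markup."""
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()
    return " ".join(parser.chunks)


# Made-up record: an HTTP response header followed by a small HTML body.
record = (
    "HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n"
    "<html><body><h1>Naval Missiles</h1><p>Anti-ship weapons</p></body></html>"
)
print(strip_html(strip_http_header(record)))  # Naval Missiles Anti-ship weapons
```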



Select crawl date, domain, url, and content. Remove the HTTP headers, then strip boilerplate from the content with `extract_boilerplate()`.

In [22]:
WebArchive(sc, sqlContext, "/home/nruest/Projects/au/sample-data/geocities")\
  .webpages()\
  .select("crawl_date", extract_domain("url").alias("domain"), "url", extract_boilerplate(remove_http_header("content")).alias("content"))\
  .show(10, True)

+----------+-------------+--------------------+--------------------+
|crawl_date|       domain|                 url|             content|
+----------+-------------+--------------------+--------------------+
|  20091027|geocities.com|http://geocities....|                    |
|  20091027|geocities.com|http://geocities....|AGM-84 Harpoon A ...|
|  20091027|geocities.com|http://geocities....|enlazado el ojo a...|
|  20091027|geocities.com|http://geocities....|                 ...|
|  20091027|geocities.com|http://geocities....|;         Pater e...|
|  20091027|geocities.com|http://geocities....|                    |
|  20091027|geocities.com|http://geocities....|                    |
|  20091027|geocities.com|http://geocities....|                    |
|  20091027|geocities.com|http://geocities....|                    |
|  20091027|geocities.com|http://geocities....|  There are   psy...|
+----------+-------------+--------------------+--------------------+
only showing top 10 rows



Demonstrate `detect_language()` UDF on content.

In [23]:
WebArchive(sc, sqlContext, "/home/nruest/Projects/au/sample-data/geocities")\
  .webpages()\
  .select("crawl_date", detect_language("content").alias("udf_language"), "language")\
  .show(10, True)

+----------+------------+--------+
|crawl_date|udf_language|language|
+----------+------------+--------+
|  20091027|          en|      nl|
|  20091027|          en|      en|
|  20091027|          es|      es|
|  20091027|          en|      en|
|  20091027|          en|      en|
|  20091027|          en|      en|
|  20091027|          en|      en|
|  20091027|          en|      hr|
|  20091027|          en|      en|
|  20091027|          en|      en|
+----------+------------+--------+
only showing top 10 rows



Demonstrate `extract_links()` UDF on the url and content columns.

In [24]:
WebArchive(sc, sqlContext, "/home/nruest/Projects/au/sample-data/geocities")\
  .webpages()\
  .select(explode(extract_links("url", "content")).alias("links"))\
  .show(10, False)

+----------------------------------------------------------------------------------------------------------------------------+
|links                                                                                                                       |
+----------------------------------------------------------------------------------------------------------------------------+
|[http://geocities.com/docmarck2000/orientation.htm, mailto:webmaster@theclinicalpsychologist.net?subject=Sugesstion, E-mail]|
|[http://geocities.com/krooyaimaha/ptld2902.htm, http://www.dhamma.org, ]                                                    |
|[http://geocities.com/krooyaimaha/ptld2902.htm, http://www.dhammathai.org, Dhammathai]                                      |
|[http://geocities.com/krooyaimaha/ptld2902.htm, http://www.yuwasong.com, Yuwasong]                                          |
|[http://geocities.com/krooyaimaha/ptld2902.htm, http://www.luangta.com, Luangta]                              
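Each row above is a triple of source url, destination url, and anchor text. The sketch below shows one way such triples could be produced with Python's standard library, resolving relative links against the page url. It is only an illustration of the idea. The class and function names, the sample page, and the `example.org` urls are assumptions and do not reflect how `extract_links()` is implemented.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects (source url, destination url, anchor text) triples from <a> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self._href = None   # destination of the <a> tag currently open, if any
        self._text = []     # text fragments seen inside that tag

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page url.
                self._href = urljoin(self.base_url, href)
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            anchor = "".join(self._text).strip()
            self.links.append((self.base_url, self._href, anchor))
            self._href = None


def sketch_extract_links(url, html):
    """Hypothetical helper: return the link triples found in one page."""
    parser = LinkExtractor(url)
    parser.feed(html)
    parser.close()
    return parser.links


# Made-up page with one absolute link (empty anchor) and one relative link.
page = '<a href="http://www.example.org/"></a> <a href="/about.html">About</a>'
for triple in sketch_extract_links("http://example.org/demo/index.htm", page):
    print(triple)
# ('http://example.org/demo/index.htm', 'http://www.example.org/', '')
# ('http://example.org/demo/index.htm', 'http://example.org/about.html', 'About')
```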

Demonstrate `extract_image_links()` UDF on the url and content columns.

In [25]:
WebArchive(sc, sqlContext, "/home/nruest/Projects/au/sample-data/geocities")\
  .webpages()\
  .select(explode(extract_image_links("url", "content")).alias("image_links"))\
  .show(10, False)

+---------------------------------------------------------------------------------------------+
|image_links                                                                                  |
+---------------------------------------------------------------------------------------------+
|[http://geocities.com/heigl_2k/img/thumbs/?N=D, http://geocities.com/icons/blank.gif,      ] |
|[http://geocities.com/heigl_2k/img/thumbs/?N=D, http://geocities.com/icons/back.gif, [DIR]]  |
|[http://geocities.com/heigl_2k/img/thumbs/?N=D, http://geocities.com/icons/image2.gif, [IMG]]|
|[http://geocities.com/heigl_2k/img/thumbs/?N=D, http://geocities.com/icons/image2.gif, [IMG]]|
|[http://geocities.com/heigl_2k/img/thumbs/?N=D, http://geocities.com/icons/image2.gif, [IMG]]|
|[http://geocities.com/heigl_2k/img/thumbs/?N=D, http://geocities.com/icons/image2.gif, [IMG]]|
|[http://geocities.com/heigl_2k/img/thumbs/?N=D, http://geocities.com/icons/image2.gif, [IMG]]|
|[http://geocities.com/heigl_2k/img/thum

Select crawl date, domain, url, and language where the language is either Spanish or French.

In [26]:
languages = ["es", "fr"]

WebArchive(sc, sqlContext, "/home/nruest/Projects/au/sample-data/geocities")\
  .webpages()\
  .filter(col("language").isin(languages))\
  .select("crawl_date", extract_domain("url").alias("domain"), "url", "language")\
  .show(50, True)

+----------+-----------------+--------------------+--------+
|crawl_date|           domain|                 url|language|
+----------+-----------------+--------------------+--------+
|  20091027|    geocities.com|http://geocities....|      es|
|  20091027|    geocities.com|http://geocities....|      es|
|  20091027|    geocities.com|http://geocities....|      es|
|  20091027|    geocities.com|http://geocities....|      es|
|  20091027|    geocities.com|http://geocities....|      es|
|  20091027|    geocities.com|http://geocities....|      es|
|  20091027|    geocities.com|http://geocities....|      es|
|  20091027|    geocities.com|http://geocities....|      fr|
|  20091027|    geocities.com|http://geocities....|      es|
|  20091027|    geocities.com|http://geocities....|      es|
|  20091027|    geocities.com|http://geocities....|      fr|
|  20091027|    geocities.com|http://geocities....|      es|
|  20091027|www.geocities.com|http://www.geocit...|      fr|
|  20091027|    geocitie

Select crawl date, domain, url, and language where the language is *neither* Spanish nor French.

In [27]:
languages = ["es", "fr"]

WebArchive(sc, sqlContext, "/home/nruest/Projects/au/sample-data/geocities")\
  .webpages()\
  .filter(~col("language").isin(languages))\
  .select("crawl_date", extract_domain("url").alias("domain"), "url", "language")\
  .show(50, True)

+----------+-----------------+--------------------+--------+
|crawl_date|           domain|                 url|language|
+----------+-----------------+--------------------+--------+
|  20091027|    geocities.com|http://geocities....|      nl|
|  20091027|    geocities.com|http://geocities....|      en|
|  20091027|    geocities.com|http://geocities....|      en|
|  20091027|    geocities.com|http://geocities....|      en|
|  20091027|    geocities.com|http://geocities....|      en|
|  20091027|    geocities.com|http://geocities....|      en|
|  20091027|    geocities.com|http://geocities....|      hr|
|  20091027|    geocities.com|http://geocities....|      en|
|  20091027|    geocities.com|http://geocities....|      en|
|  20091027|    geocities.com|http://geocities....|      en|
|  20091027|    geocities.com|http://geocities....|      bn|
|  20091027|    geocities.com|http://geocities....|      en|
|  20091027|    geocities.com|http://geocities....|      en|
|  20091027|    geocitie

Select url where the domain is *neither* `www.archive.org` nor `www.sloan.org`.

In [28]:
domains = ["www.archive.org", "www.sloan.org"]
 
WebArchive(sc, sqlContext, "/home/nruest/Projects/au/sample-data/geocities")\
  .webpages()\
  .select("url")\
  .filter(~(extract_domain("url").isin(domains)))\
  .show(10, False)

+-------------------------------------------------------------------------------------+
|url                                                                                  |
+-------------------------------------------------------------------------------------+
|http://geocities.com/heigl_2k/img/thumbs/?N=D                                        |
|http://geocities.com/~jask16/weapons/naval.html                                      |
|http://geocities.com/Wellesley/4130/sur.html                                         |
|http://geocities.com/kelsoonbutler/boardDocs/BoardMeetingMinutes_AGM_March12_2002.htm|
|http://geocities.com/ianaxir/star_sapphire.htm                                       |
|http://geocities.com/vandenfromcamden/LJPics/cat/                                    |
|http://geocities.com/Wellesley/4130/eroica.html                                      |
|http://geocities.com/kkrhythm/kyotei/photo/?M=A                                      |
|http://geocities.com/guinncj/im