# Building a corpus with title + text for a selected set of URLs

In [1]:
import org.archive.archivespark._
import org.archive.archivespark.implicits._
import org.archive.archivespark.enrich.functions._
import org.archive.archivespark.specific.warc._
import org.archive.archivespark.specific.warc.enrichfunctions._
import org.archive.archivespark.specific.warc.implicits._
import org.archive.archivespark.specific.warc.specs._
import org.apache.hadoop.io.compress.GzipCodec

## Loading the dataset

In the example, the Web archive dataset will be loaded from local WARC / CDX files, using `WarcCdxHdfsSpec`, but any other [Data Specification (DataSpec)](https://github.com/helgeho/ArchiveSpark/blob/master/docs/DataSpecs.md) can be used here as well in order to load your records from different local or remote sources.

In [2]:
val path = "/data/archiveit/ArchiveIt-Collection-2950"
val cdxPath = path + "/cdx/*.cdx.gz"
val warcPath = path + "/warc"

In [3]:
val records = ArchiveSpark.load(WarcCdxHdfsSpec(cdxPath, warcPath))

### Filtering records

We keep only webpages ([mime type](https://en.wikipedia.org/wiki/Media_type) *text/html*) that were available when they were crawled ([status code](https://en.wikipedia.org/wiki/List_of_HTTP_status_codes) == 200). Videos, images, stylesheets and all other files, as well as pages that could not be retrieved, are filtered out.

*It is important to note that this filtering is done only based on metadata, so up to this point ArchiveSpark does not even touch the actual Web archive records, which is the core efficiency feature of ArchiveSpark.*

In [4]:
val pages = records.filter(r => r.mime == "text/html" && r.status == 200)

The following counts show that the filter removes roughly 70% of the records, which makes the subsequent processing considerably more efficient:

In [5]:
records.count

52922792

In [6]:
pages.count

16062674

A peek at the first record of the filtered dataset (in pretty JSON format) shows that it indeed consists of HTML pages with successful status:

In [7]:
pages.peekJson

{
  "record" : {
    "redirectUrl" : "-",
    "timestamp" : "20111222043804",
    "digest" : "YVOEIYJ45I7QNNFBQTCPKIQAQJIE4B46",
    "originalUrl" : "http://english.cntv.cn/program/newsupdate/20110504/109544.shtml",
    "surtUrl" : "cn,cntv,english)/program/newsupdate/20110504/109544.shtml",
    "mime" : "text/html",
    "compressedSize" : 10855,
    "meta" : "-",
    "status" : 200
  }
}
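To make the metadata-only filter concrete, here is a minimal plain-Scala sketch that applies the same predicate to in-memory records, without Spark or ArchiveSpark. `MetadataFilterSketch` and `CdxMeta` are hypothetical names introduced for illustration, not ArchiveSpark classes. The first sample record is taken from the output above; the other two are made up.

```scala
object MetadataFilterSketch {
  // Hypothetical stand-in for the metadata fields shown by peekJson.
  final case class CdxMeta(surtUrl: String, timestamp: String, originalUrl: String,
                           mime: String, status: Int)

  // Same predicate as in the filter cell: successful HTML captures only.
  def isPage(r: CdxMeta): Boolean = r.mime == "text/html" && r.status == 200

  val sample: Seq[CdxMeta] = Seq(
    // taken from the peekJson output above
    CdxMeta("cn,cntv,english)/program/newsupdate/20110504/109544.shtml", "20111222043804",
      "http://english.cntv.cn/program/newsupdate/20110504/109544.shtml", "text/html", 200),
    // made-up records for illustration: an image and an unavailable page
    CdxMeta("org,example)/logo.png", "20120101000000",
      "http://example.org/logo.png", "image/png", 200),
    CdxMeta("org,example)/missing", "20120101000000",
      "http://example.org/missing", "text/html", 404)
  )

  // Only the first record survives; no record content is needed to decide this.
  val pages: Seq[CdxMeta] = sample.filter(isPage)
}
```

The predicate reads nothing but the mime type and status code, which is why this step can run on the CDX index alone.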

## Select relevant records based on a given set of URLs

In this example, the desired URLs are stored in a comma-separated file along with some additional information. So we need to load the lines of this file, split them by comma and select the URL, which is the second field:

In [8]:
val urlRecords = sc.textFile("relevant_urls.csv")

In [9]:
urlRecords.peek

True,http://15ocroatia.org/,2950,31104006172,system,2012-01-23 18:59:27+00:00,False,200,205223,2012-02-09 03:19:25+00:00,system,2013-02-15 01:55:53+00:00,,,,,,,,,,,True,1156,normal,http://15ocroatia.org/,True

In [10]:
val urls = urlRecords.map(_.split(",")).map(_(1))

In [11]:
urls.peek

http://15ocroatia.org/
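The plain `split(",")` above is sufficient for this file. If some lines may be short or have an empty URL field, a slightly more defensive variant could look like the following sketch. `UrlColumnSketch` and `urlOf` are hypothetical helper names, and the sketch assumes that no field contains a quoted comma (otherwise a proper CSV parser would be needed).

```scala
object UrlColumnSketch {
  // Returns the URL column (index 1) of a comma-separated line,
  // or None if the column is missing or empty.
  def urlOf(line: String): Option[String] = {
    // A limit of -1 keeps trailing empty fields, which split drops by default.
    val fields = line.split(",", -1)
    if (fields.length > 1 && fields(1).trim.nonEmpty) Some(fields(1).trim) else None
  }
}
```

In the notebook this could be applied with something like `urlRecords.flatMap(line => UrlColumnSketch.urlOf(line))`, which silently skips lines without a URL.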

To filter our Web archive dataset based on these URLs, we convert them to the canonical SURT format, which removes slight, negligible differences between equivalent URLs:

In [12]:
val surtUrls = urls.map(org.archive.archivespark.utils.SURT.fromUrl)

In [13]:
surtUrls.peek

org,15ocroatia)/
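To illustrate what the SURT form looks like, here is a simplified sketch in plain Scala. It is an assumption-laden approximation, not the implementation behind `SURT.fromUrl`, which applies further canonicalization rules; `SurtSketch` and `toSurt` are hypothetical names. The sketch drops the scheme, lowercases the host, strips a leading `www.` and reverses the host labels.

```scala
import java.net.URI

object SurtSketch {
  // Simplified SURT-style key: reversed, comma-separated host labels,
  // followed by ")" and the path (plus the query string, if any).
  // Illustration only; the real canonicalization handles many more cases.
  def toSurt(url: String): String = {
    val uri = new URI(url.trim)
    val host = Option(uri.getHost).getOrElse("").toLowerCase.stripPrefix("www.")
    val reversedHost = host.split('.').reverse.mkString(",")
    val path = Option(uri.getRawPath).filter(_.nonEmpty).getOrElse("/")
    val query = Option(uri.getRawQuery).map("?" + _).getOrElse("")
    reversedHost + ")" + path + query
  }
}
```

With this sketch, `http://15ocroatia.org/` and `http://WWW.15ocroatia.org` map to the same key, which is the kind of negligible difference the canonical form is meant to hide.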

Finally, we collect these URLs, convert them to a set and broadcast it across our computing environment. If the set of URLs is too big to broadcast, a `join` operation should be used instead; for an example, see the recipe on [Extracting embedded resources from webpages](Extract_Embeds.ipynb):

In [14]:
val selectedUrls = sc.broadcast(surtUrls.collect.toSet)

### Filter the pages in our dataset

In [15]:
val filtered = pages.filter(r => selectedUrls.value.contains(r.surtUrl))

In [16]:
filtered.count

38301

## Enrich the dataset with the desired information (title + text)

To access the content of an HTML page, ArchiveSpark comes with an `Html` Enrich Function:

In [17]:
filtered.enrich(Html).peekJson

{
  "record" : {
    "redirectUrl" : "-",
    "timestamp" : "20111220013011",
    "digest" : "RJKJCKKSWCM7QDOC6U2FLDKKVCOJNMU5",
    "originalUrl" : "http://blog.alexanderhiggins.com/",
    "surtUrl" : "com,alexanderhiggins,blog)/",
    "mime" : "text/html",
    "compressedSize" : 43070,
    "meta" : "-",
    "status" : 200
  },
  "payload" : {
    "string" : {
      "html" : {
        "body" : "<body> \n <!--\r\nFind out if this theme has the preset mode deactivated. If its not deactivated then lets show off a bit. Party time!\r\n~~~~~~~~~~~~~~~~~~~~~~~~ ~~~ --> \n <!--\r\nThe High Bar\r\n~~~~~~~~~~~~~~~~~~~~~~~~ ~~~ --> \n <script src=\"http://blog.alexanderhiggins.com/msg.js\" type=\"text/javascript\">\r\n</script> \n <center class=\"homealert\"> \n  <script type=\"tex...

As we can see, by default `Html` extracts the body of the page. To customize this, it provides different ways to specify which tags to extract:
* `Html.first("title")` will extract the (first) title tag instead
* `Html.all("a")` will extract all anchors / hyperlinks (the result is a list instead of a single item)
* `Html("p", 2)` will extract the third paragraph of the page (index 2 = third match)

For more details as well as additional [Enrich Functions](https://github.com/helgeho/ArchiveSpark/blob/master/docs/EnrichFuncs.md), please read the [docs](https://github.com/helgeho/ArchiveSpark/blob/master/docs/README.md).

In [18]:
filtered.enrich(Html.first("title")).peekJson

{
  "record" : {
    "redirectUrl" : "-",
    "timestamp" : "20111220013011",
    "digest" : "RJKJCKKSWCM7QDOC6U2FLDKKVCOJNMU5",
    "originalUrl" : "http://blog.alexanderhiggins.com/",
    "surtUrl" : "com,alexanderhiggins,blog)/",
    "mime" : "text/html",
    "compressedSize" : 43070,
    "meta" : "-",
    "status" : 200
  },
  "payload" : {
    "string" : {
      "html" : {
        "title" : "<title>Alexander Higgins Blog - The Latest Buzz, Analysis, and News Without the Snooze!</title>"
      }
    }
  }
}

As we are only interested in the text without the HTML tags (`<title>`), we need to use the `HtmlText` Enrich Function. This, by default, depends on the default version of `Html`, hence it would extract the text of the body, i.e., the complete text of the page. In order to change this dependency to get only the title, we can use the `.on`/`.of` method that all Enrich Functions provide. Now we can give this new Enrich Function a name (`Title`) to reuse it later:

In [19]:
val Title = HtmlText.of(Html.first("title"))

In [20]:
filtered.enrich(Title).peekJson

{
  "record" : {
    "redirectUrl" : "-",
    "timestamp" : "20111220013011",
    "digest" : "RJKJCKKSWCM7QDOC6U2FLDKKVCOJNMU5",
    "originalUrl" : "http://blog.alexanderhiggins.com/",
    "surtUrl" : "com,alexanderhiggins,blog)/",
    "mime" : "text/html",
    "compressedSize" : 43070,
    "meta" : "-",
    "status" : 200
  },
  "payload" : {
    "string" : {
      "html" : {
        "title" : {
          "text" : "Alexander Higgins Blog - The Latest Buzz, Analysis, and News Without the Snooze!"
        }
      }
    }
  }
}

In addition to the title, we would also like to have the full text of the page. This will be our final dataset, so we assign it to a new variable (`enriched`):

In [21]:
val enriched = filtered.enrich(Title).enrich(HtmlText)

In [22]:
enriched.peekJson

{
  "record" : {
    "redirectUrl" : "-",
    "timestamp" : "20111220013011",
    "digest" : "RJKJCKKSWCM7QDOC6U2FLDKKVCOJNMU5",
    "originalUrl" : "http://blog.alexanderhiggins.com/",
    "surtUrl" : "com,alexanderhiggins,blog)/",
    "mime" : "text/html",
    "compressedSize" : 43070,
    "meta" : "-",
    "status" : 200
  },
  "payload" : {
    "string" : {
      "html" : {
        "title" : {
          "text" : "Alexander Higgins Blog - The Latest Buzz, Analysis, and News Without the Snooze!"
        },
        "body" : {
          "text" : "Alexander Higgins Blog The Latest Buzz, Analysis, and News Without the Snooze! Home Headlines Authors About Subscribe, Friend or Follow Advertise Economy Environment Headlines Health Member Submitted Projects Society Technology ...

## Save the created corpus

The dataset can either be saved in JSON format as shown in the peek operations above, which is supported by ArchiveSpark, or it can be converted to some custom format and saved as raw text (using Spark's `saveAsTextFile`):

### Save as JSON
If a `.gz` extension is added to the path, ArchiveSpark will automatically compress the output using Gzip.

In [23]:
enriched.saveAsJson("title-text_dataset.json.gz")

### Save in a custom format

The Enrich Functions (`Title` and `HtmlText`) can be used as accessors to read the corresponding values, so we can create a tab-separated format as follows:

In [24]:
val tsv = enriched.map{r =>
    // replace tab and newlines with a space
    val title = r.valueOrElse(Title, "").replaceAll("[\\t\\n]", " ")
    val text = r.valueOrElse(HtmlText, "").replaceAll("[\\t\\n]", " ")
    // concatenate URL, timestamp, title and text with a tab
    Seq(r.originalUrl, r.timestamp, title, text).mkString("\t")
}

In [25]:
tsv.peek

http://blog.alexanderhiggins.com/	20111220013011	Alexander Higgins Blog - The Latest Buzz, Analysis, and News Without the Snooze!	Alexander Higgins Blog The Latest Buzz, Analysis, and News Without the Snooze! Home Headlines Authors About Subscribe, Friend or Follow Advertise Economy Environment Headlines Health Member Submitted Projects Society Technology The Alexander Higgins Show Uncategorized US Videos Web Development World Email Subscription Comments Posts table, td, th { vertical-align:top !important; } .itemblock { /*height:85px;*/ } .pinned a { background-color:#FFFFCC} .homeheadline, .pinned a { display:block;padding:5px; border:1px solid #cccccc; margin:3px; padding: 3px; width:636px; font-weight:bold; } New Feature: Click Here To Submit A Story Have a feed you ...

In [26]:
tsv.saveAsTextFile("title-text_dataset.tsv.gz", classOf[GzipCodec])
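Because tabs and newlines were replaced in the title and text, every saved line has exactly four tab-separated columns and can be parsed back reliably. A minimal sketch of such a reader in plain Scala follows; `TsvReaderSketch`, `Row` and `parse` are hypothetical names that are not part of ArchiveSpark.

```scala
object TsvReaderSketch {
  // Hypothetical row type mirroring the four columns written above.
  final case class Row(url: String, timestamp: String, title: String, text: String)

  // Splits a line into exactly four columns; returns None for malformed lines.
  def parse(line: String): Option[Row] =
    // A limit of -1 keeps empty trailing columns, e.g. a page without text.
    line.split("\t", -1) match {
      case Array(url, timestamp, title, text) => Some(Row(url, timestamp, title, text))
      case _ => None
    }
}
```

Assuming the output path used above, the corpus could then be read back with something like `sc.textFile("title-text_dataset.tsv.gz").flatMap(line => TsvReaderSketch.parse(line))`.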