# Extracting hyperlinks from webpages

In [1]:
import org.archive.archivespark._
import org.archive.archivespark.implicits._
import org.archive.archivespark.enrich.functions._
import org.archive.archivespark.specific.warc._
import org.archive.archivespark.specific.warc.enrichfunctions._
import org.archive.archivespark.specific.warc.implicits._
import org.archive.archivespark.specific.warc.specs._
import org.apache.hadoop.io.compress.GzipCodec

## Loading the dataset

In this example, the Web archive dataset is loaded from local WARC / CDX files using `WarcCdxHdfsSpec`, but any other [Data Specification (DataSpec)](https://github.com/helgeho/ArchiveSpark/blob/master/docs/DataSpecs.md) can be used instead to load your records from different local or remote sources.

In [2]:
val path = "/data/archiveit/ArchiveIt-Collection-2950"
val cdxPath = path + "/cdx/*.cdx.gz"
val warcPath = path + "/warc"

In [3]:
val records = ArchiveSpark.load(WarcCdxHdfsSpec(cdxPath, warcPath))

### Filtering irrelevant records

For link extraction we are only interested in webpages ([mime type](https://en.wikipedia.org/wiki/Media_type) *text/html*), not in videos, images, stylesheets or any other files. We are also not interested in webpages that were unavailable when they were crawled, so we keep only records with [status code](https://en.wikipedia.org/wiki/List_of_HTTP_status_codes) 200 and filter out everything else.

*It is important to note that this filtering is done only based on metadata, so up to this point ArchiveSpark does not even touch the actual Web archive records, which is the core efficiency feature of ArchiveSpark.*

In [4]:
val pages = records.filter(r => r.mime == "text/html" && r.status == 200)

By looking at the first record in the remaining dataset, we can see that it is indeed of type *text/html* and was *online* (status 200) at the time of the crawl:

In [5]:
pages.peekJson

{
  "record" : {
    "surtUrl" : "cn,cntv,english)/program/newsupdate/20110504/109544.shtml",
    "timestamp" : "20111222043804",
    "originalUrl" : "http://english.cntv.cn/program/newsupdate/20110504/109544.shtml",
    "mime" : "text/html",
    "status" : 200,
    "digest" : "YVOEIYJ45I7QNNFBQTCPKIQAQJIE4B46",
    "redirectUrl" : "-",
    "meta" : "-",
    "compressedSize" : 10855
  }
}
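The metadata-only filter can be tried out without any archive data. The following is a minimal sketch on plain Scala collections; `CdxMeta`, `isSuccessfulPage` and the sample values are hypothetical stand-ins for illustration and not part of ArchiveSpark:

```scala
// Hypothetical stand-in for the CDX metadata fields used by the filter above.
final case class CdxMeta(surtUrl: String, mime: String, status: Int)

// Same predicate as in the filter cell: HTML pages that were available at crawl time.
def isSuccessfulPage(m: CdxMeta): Boolean =
  m.mime == "text/html" && m.status == 200

// Made-up sample records, for illustration only.
val sampleMeta = Seq(
  CdxMeta("cn,cntv,english)/", "text/html", 200),
  CdxMeta("cn,cntv,english)/logo.png", "image/png", 200),
  CdxMeta("cn,cntv,english)/gone", "text/html", 404)
)

// Only the first record passes both conditions.
val keptMeta = sampleMeta.filter(isSuccessfulPage)
```

Because the predicate only reads metadata fields, evaluating it never requires the record payload.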

## Enriching metadata

This is the point when ArchiveSpark actually accesses the full records in order to enrich our metadata records with the desired information. To do so, we define the required [Enrich Functions](https://github.com/helgeho/ArchiveSpark/blob/master/docs/EnrichFuncs.md) (`Links`, `LinkUrls`, `LinkTexts`) based on existing ones (`Html`, `SURT`, `HtmlAttribute`, `HtmlText`).

`Html.all("a")` extracts all hyperlinks / anchors (tag `a`) from the pages. This results in a list of multiple values, one for each link. From these we want to extract the link target (attribute `href`) of each link. This can be done by changing the dependency of the `HtmlAttribute` Enrich Function using the `ofEach` operation ([see the docs for more details](https://github.com/helgeho/ArchiveSpark/blob/master/docs/Operations.md)). Although this again results in multiple values, there is only one for each link, so we use the single dependency operation `of` to apply `SURT` on these and convert the URLs into SURT format. Similarly, we apply `HtmlText` on each link to get the anchor text of the link (the default dependency of `HtmlText` would be the text of the whole page).

In [6]:
val Links = Html.all("a")
val LinkUrls = SURT.of(HtmlAttribute("href").ofEach(Links))
val LinkTexts = HtmlText.ofEach(Links)
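To give an intuition for the SURT format that `SURT` produces, here is a simplified sketch in plain Scala. The helper `simpleSurt` is hypothetical and not the implementation used by ArchiveSpark; it only reverses the host labels and appends the path, while real SURT canonicalization covers more cases (ports, query strings and so on):

```scala
import java.net.URI

// Simplified SURT-style key: host labels in reverse order, joined by commas,
// followed by ")" and the path. This is an illustration, not a full canonicalizer.
def simpleSurt(url: String): String = {
  val uri = new URI(url)
  val host = uri.getHost.toLowerCase.split('.').reverse.mkString(",")
  // Treat a missing path like the root path.
  val path = Option(uri.getRawPath).filter(_.nonEmpty).getOrElse("/")
  s"$host)$path"
}

val surtHome = simpleSurt("http://english.cntv.cn/")
val surtLive = simpleSurt("http://english.cntv.cn/live")
```

Reversing the host puts the top-level domain first, which is why prefix checks on the SURT string can select whole domains.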

To enrich the filtered records in our dataset with the link information, we call `enrich` on it with every Enrich Function that we explicitly want to have in the dataset. As we are not interested in the raw `a` tags, we do not enrich it with `Links` here.

In [7]:
val pagesWithLinks = pages.enrich(LinkUrls).enrich(LinkTexts)

A look at the first record shows what we get:

In [8]:
pagesWithLinks.peekJson

{
  "record" : {
    "surtUrl" : "cn,cntv,english)/program/newsupdate/20110504/109544.shtml",
    "timestamp" : "20111222043804",
    "originalUrl" : "http://english.cntv.cn/program/newsupdate/20110504/109544.shtml",
    "mime" : "text/html",
    "status" : 200,
    "digest" : "YVOEIYJ45I7QNNFBQTCPKIQAQJIE4B46",
    "redirectUrl" : "-",
    "meta" : "-",
    "compressedSize" : 10855
  },
  "payload" : {
    "string" : {
      "html" : {
        "a" : [ {
          "attributes" : {
            "href" : {
              "SURT" : "cn,cntv,english)/"
            }
          },
          "text" : "Homepage"
        }, {
          "attributes" : {
            "href" : {
              "SURT" : "cn,cntv,english)/live"
            }
          },
          "text" : "CCTV Live"
     ...

## Saving the derived corpus

If we want to save our derived corpus with the link information in the JSON format shown above, we can simply call `.saveAsJson` on it now. This preserves the entire lineage, as it implicitly documents for each value what its parent was and where it was extracted from, with the exact metadata of each record included as well. JSON is a universal format and can be read by many third-party tools to post-process this dataset.

*By adding a .gz extension to the output path, the data will be automatically compressed with GZip.*

In [9]:
pagesWithLinks.saveAsJson("pages-with-links.gz")

### Saving plain links (src, timestamp, dst, text)

Instead, we can also keep only the hyperlink information as a temporal edgelist with the source URL, the timestamp of the capture, the destination URL of each link as well as the anchor text if available.

There are two preferred ways to achieve this with ArchiveSpark:

1. We create a single value using the Enrich Function `Values` that combines destination URL and text for each link. Then, in a map, we can access these values and create our very own output format by adding additional information, like source and timestamp.
2. We create a single value using the Enrich Function `Values` for each link like before, but this time we include the source and timestamp in this value, so that we only need to [flat map the values (`flatMapValues`)](https://github.com/helgeho/ArchiveSpark/blob/master/docs/Operations.md).

#### 1. Custom `map`

In [10]:
val LinkRepresentation = Values("link-dst-text", LinkUrls, LinkTexts).onEach(Links)

In [11]:
pagesWithLinks.enrich(LinkRepresentation).peekJson

{
  "record" : {
    "surtUrl" : "cn,cntv,english)/program/newsupdate/20110504/109544.shtml",
    "timestamp" : "20111222043804",
    "originalUrl" : "http://english.cntv.cn/program/newsupdate/20110504/109544.shtml",
    "mime" : "text/html",
    "status" : 200,
    "digest" : "YVOEIYJ45I7QNNFBQTCPKIQAQJIE4B46",
    "redirectUrl" : "-",
    "meta" : "-",
    "compressedSize" : 10855
  },
  "payload" : {
    "string" : {
      "html" : {
        "a" : [ {
          "attributes" : {
            "href" : {
              "SURT" : "cn,cntv,english)/"
            }
          },
          "text" : "Homepage",
          "link-dst-text" : [ "cn,cntv,english)/", "Homepage" ]
        }, {
          "attributes" : {
            "href" : {
              "SURT" : "cn,cntv,english)/live...

In [12]:
val links = pagesWithLinks.enrich(LinkRepresentation).flatMap { record =>
    record.valueOrElse(LinkRepresentation, Seq.empty).map { case Array(dst, text) =>
        Seq(record.surtUrl, record.timestamp, dst, text).mkString("\t")
    }
}

Print the first 10 lines of this dataset to see what we get:

In [13]:
links.take(10).foreach(println)

cn,cntv,english)/program/newsupdate/20110504/109544.shtml	20111222043804	cn,cntv,english)/	Homepage
cn,cntv,english)/program/newsupdate/20110504/109544.shtml	20111222043804	cn,cntv,english)/live	CCTV Live
cn,cntv,english)/program/newsupdate/20110504/109544.shtml	20111222043804	cn,cntv,english)/newsletter	Newsletter
cn,cntv,english)/program/newsupdate/20110504/109544.shtml	20111222043804	cn,cntv,english)/english/special/application/contact/index.shtml	Feedback
cn,cntv,english)/program/newsupdate/20110504/109544.shtml	20111222043804	cn,cntv,english)/sitemap	Site Map
cn,cntv,english)/program/newsupdate/20110504/109544.shtml	20111222043804	cn,cntv,passport)/app_pass/verify/english/new/register.jsp	Sign Up
cn,cntv,english)/program/newsupdate/20110504/109544.shtml	20111222043804	cn,cntv,passport)/app_pass/verify/english/new/login.jsp	Sign In
cn,cntv,english)/program/newsupdate/20110504/109544.shtml	20111222043804	cn,cntv,english)/about	Help
cn,cntv,english)/program/newsupdate/20110504/109544

Save as text file (GZip compressed):

In [14]:
links.saveAsTextFile("links.gz", classOf[GzipCodec])
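When post-processing the saved edge list, each line has to be split back into its four fields. The following is a minimal sketch in plain Scala; `Edge`, `parseEdge` and `cleanField` are hypothetical helpers for illustration. It assumes that no field contains a tab itself, which the cells above do not guarantee for anchor texts, so `cleanField` shows one way to normalize whitespace before writing:

```scala
// Hypothetical container for one line of the temporal edge list.
final case class Edge(src: String, timestamp: String, dst: String, text: String)

// Collapse tabs, newlines and repeated spaces so a field cannot break the format.
def cleanField(s: String): String = s.replaceAll("\\s+", " ").trim

// Split on tabs; the limit -1 keeps a trailing empty anchor text as a field.
def parseEdge(line: String): Option[Edge] =
  line.split("\t", -1) match {
    case Array(src, ts, dst, text) => Some(Edge(src, ts, dst, text))
    case _                         => None // malformed or truncated line
  }

val parsedEdge = parseEdge(
  "cn,cntv,english)/program/newsupdate/20110504/109544.shtml\t20111222043804\tcn,cntv,english)/live\tCCTV Live"
)
```

Returning an `Option` lets a caller drop malformed lines with `flatMap` instead of failing on them.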

#### 2. `flatMapValues` ([ArchiveSpark Operations](https://github.com/helgeho/ArchiveSpark/blob/master/docs/Operations.md))

A pointer to values in the CDX meta record can be based on the [`Root` Enrich Function](https://github.com/helgeho/ArchiveSpark/blob/master/docs/EnrichFuncs.md):

In [15]:
val SurtURL = Root[CdxRecord].map("surtUrl") { cdx: CdxRecord => cdx.surtUrl}
val Timestamp = Root[CdxRecord].map("timestamp") { cdx: CdxRecord => cdx.timestamp}
val LinkRepresentation = Values("src-timestamp-dst-text", SurtURL, Timestamp, LinkUrls, LinkTexts).onEach(Links)

In [16]:
pagesWithLinks.enrich(SurtURL).enrich(Timestamp).enrich(LinkRepresentation).peekJson

{
  "record" : {
    "surtUrl" : "cn,cntv,english)/program/newsupdate/20110504/109544.shtml",
    "timestamp" : "20111222043804",
    "originalUrl" : "http://english.cntv.cn/program/newsupdate/20110504/109544.shtml",
    "mime" : "text/html",
    "status" : 200,
    "digest" : "YVOEIYJ45I7QNNFBQTCPKIQAQJIE4B46",
    "redirectUrl" : "-",
    "meta" : "-",
    "compressedSize" : 10855
  },
  "timestamp" : "20111222043804",
  "payload" : {
    "string" : {
      "html" : {
        "a" : [ {
          "attributes" : {
            "href" : {
              "SURT" : "cn,cntv,english)/"
            }
          },
          "text" : "Homepage",
          "src-timestamp-dst-text" : [ "cn,cntv,english)/program/newsupdate/20110504/109544.shtml", "20111222043804", "cn,cntv,english)/",...

We concatenate the link properties, delimited by tabs (`\t`), before saving them as text:

In [17]:
val links = pagesWithLinks.enrich(SurtURL).enrich(Timestamp).flatMapValues(LinkRepresentation).map(_.mkString("\t"))

In [18]:
links.take(10).foreach(println)

cn,cntv,english)/program/newsupdate/20110504/109544.shtml	20111222043804	cn,cntv,english)/	Homepage
cn,cntv,english)/program/newsupdate/20110504/109544.shtml	20111222043804	cn,cntv,english)/live	CCTV Live
cn,cntv,english)/program/newsupdate/20110504/109544.shtml	20111222043804	cn,cntv,english)/newsletter	Newsletter
cn,cntv,english)/program/newsupdate/20110504/109544.shtml	20111222043804	cn,cntv,english)/english/special/application/contact/index.shtml	Feedback
cn,cntv,english)/program/newsupdate/20110504/109544.shtml	20111222043804	cn,cntv,english)/sitemap	Site Map
cn,cntv,english)/program/newsupdate/20110504/109544.shtml	20111222043804	cn,cntv,passport)/app_pass/verify/english/new/register.jsp	Sign Up
cn,cntv,english)/program/newsupdate/20110504/109544.shtml	20111222043804	cn,cntv,passport)/app_pass/verify/english/new/login.jsp	Sign In
cn,cntv,english)/program/newsupdate/20110504/109544.shtml	20111222043804	cn,cntv,english)/about	Help
cn,cntv,english)/program/newsupdate/20110504/109544

In [19]:
links.saveAsTextFile("links1.gz", classOf[GzipCodec])

## Graph Analysis

To reduce the size of this dataset for this example, we keep only the link graph under *.de*. Note that the filter below is applied to the source pages only, so all links found on *.de* pages are considered, regardless of where they point:

In [20]:
val srcDst = pages.filter(_.surtUrl.startsWith("de,")).enrich(LinkUrls).flatMap(r => r.valueOrElse(LinkUrls, Seq.empty).map(dst => (r.surtUrl, dst)))
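The cell above checks the `de,` prefix on the source URL only. If the destinations should be restricted to *.de* as well, one option is an additional prefix check on the destination SURT. A minimal sketch on plain Scala collections, where `isDe` and the sample pairs are hypothetical and only for illustration:

```scala
// SURT strings start with the reversed host, so "de," selects the .de domain.
def isDe(surt: String): Boolean = surt.startsWith("de,")

// Made-up (source, destination) pairs in SURT format.
val samplePairs = Seq(
  ("de,example)/", "de,other)/x"),
  ("de,example)/", "com,example)/"),
  ("com,example)/", "de,example)/")
)

// Keep only links where both ends are under .de.
val deOnlyPairs = samplePairs.filter { case (src, dst) => isDe(src) && isDe(dst) }
```

The same predicate could be applied to the destinations inside the `flatMap` of the cell above before the pairs are built.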

In order to analyze an extracted Web graph with tools like [Spark's GraphX](https://spark.apache.org/graphx/), we are only interested in the URLs, which need to be assigned IDs first:

In [21]:
val urlIdMap = srcDst.flatMap{case (src, dst) => Iterator(src, dst)}.distinct.zipWithUniqueId.collectAsMap

In [22]:
val ids = sc.broadcast(urlIdMap)

In [23]:
val edges = srcDst.map{case (src, dst) => (ids.value(src), ids.value(dst))}
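The ID assignment in the three cells above can be followed on a small local collection. This is a sketch with made-up sample data and hypothetical names (`srcDstSample`, `idsSample`, `edgeSample`); it uses `zipWithIndex` on a plain `Seq`, whereas Spark's `zipWithUniqueId` yields IDs that are unique but not necessarily consecutive:

```scala
// Made-up (source, destination) pairs in SURT format.
val srcDstSample = Seq(
  ("de,example)/", "de,example)/a"),
  ("de,example)/a", "de,example)/"),
  ("de,example)/a", "de,other)/")
)

// Collect every URL once and assign it a numeric ID.
val idsSample: Map[String, Long] =
  srcDstSample
    .flatMap { case (src, dst) => Seq(src, dst) }
    .distinct
    .zipWithIndex
    .map { case (url, id) => url -> id.toLong }
    .toMap

// Translate each URL pair into an ID pair, as needed for an edge list.
val edgeSample = srcDstSample.map { case (src, dst) => (idsSample(src), idsSample(dst)) }
```

In the Spark version, `collectAsMap` brings the whole URL-to-ID map to the driver before it is broadcast, so this approach assumes that the set of distinct URLs fits into memory.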

In [24]:
import org.apache.spark.graphx._
val graph = Graph.fromEdgeTuples(edges, true)

In [25]:
graph.numVertices

257067