# Link Analysis

Link Analysis is frequently used to visualize the relationships between nodes in a graph. In this case, we study the hyperlinks between domains that contain a certain term on the Web.

## Import Dependencies

In [1]:
%AddDeps com.lucidworks.spark spark-solr 3.6.0 --transitive
%AddDeps com.google.guava guava 15.0 --transitive
%AddDeps org.jsoup jsoup 1.11.3 --transitive

Marking com.lucidworks.spark:spark-solr:3.6.0 for download
-> Failed to resolve org.restlet.jee:org.restlet.ext.servlet:2.3.0
    -> not found: /tmp/toree-tmp-dir2042813687803005472/toree_add_deps/cache/org.restlet.jee/org.restlet.ext.servlet/ivy-2.3.0.xml
    -> download error: Caught java.io.IOException: Server returned HTTP response code: 403 for URL: https://repo1.maven.org/maven2/org/restlet/jee/org.restlet.ext.servlet/2.3.0/org.restlet.ext.servlet-2.3.0.pom (Server returned HTTP response code: 403 for URL: https://repo1.maven.org/maven2/org/restlet/jee/org.restlet.ext.servlet/2.3.0/org.restlet.ext.servlet-2.3.0.pom) while downloading https://repo1.maven.org/maven2/org/restlet/jee/org.restlet.ext.servlet/2.3.0/org.restlet.ext.servlet-2.3.0.pom
-> Failed to resolve org.restlet.jee:org.restlet:2.3.0
    -> not found: /tmp/toree-tmp-dir2042813687803005472/toree_add_deps/cache/org.restlet.jee/org.restlet/ivy-2.3.0.xml
    -> download error: Caught java.io.IOException: Server returned 

In [2]:
import sys.process._

"pip install matplotlib" !

"pip install networkx" !





0

## Query Solr

First we extract links referenced by websites in the ClueWeb09b collection that contain the word "jaguar".

In [2]:
import com.lucidworks.spark.rdd.SelectSolrRDD
import com.google.common.net.InternetDomainName
import org.jsoup.Jsoup
import org.apache.hadoop.fs.{FileSystem, Path}

import scala.collection.JavaConverters._
import java.net.URL

// Solr's ZooKeeper URL
val SOLR = "192.168.1.111:9983"

// The Solr collection
val INDEX = "cw09b-url"

// The Solr query
val QUERY = "contents:jaguar"

// The number of partitions
val PARTITIONS = 8

// The limit for number of rows to process
val LIMIT = 10000

val source_urls = new SelectSolrRDD(SOLR, INDEX, sc, maxRows = Some(LIMIT))
.rows(10000)
.query(QUERY)
.repartition(PARTITIONS)
.mapPartitions(docs => {
    docs.map(doc => {
        val url = doc.get("url") + ""
        try { (InternetDomainName.from(new URL(url.substring(1, url.length - 1)).getHost).topPrivateDomain().name(), doc.get("raw") + "") }
        catch {
            case e: Exception => println("")
            ("", "")
        }
    })
    .filter(!_._2.isEmpty)
})

SOLR = 192.168.1.111:9983
INDEX = cw09b-url
QUERY = contents:jaguar
PARTITIONS = 8
LIMIT = 10000
OUT_DIR = link_analysis
source_urls = MapPartitionsRDD[417] at mapPartitions at <console>:456


lastException: Throwable = null


MapPartitionsRDD[417] at mapPartitions at <console>:456

## Compute Links

We then randomly sample 1% of the retrieved documents and extract the top three most frequently-occurring outgoing links.

In [3]:
// The output path
val OUT_DIR = "link_analysis"

// Delete old output dir
FileSystem.get(sc.hadoopConfiguration).delete(new Path(OUT_DIR), true)

val zipped_urls = source_urls.sample(withReplacement=false, fraction=0.01, seed=42)
.flatMap(record => {
        val target_urls = Jsoup.parse(record._2)
          .select("a[href]")
          .asScala
          .map(link => link.attr("abs:href"))
          .filter(!_.isEmpty)
          .map(link => {
            try { InternetDomainName.from(new URL(link).getHost).topPrivateDomain().name() }
            catch {
              case e: Exception => println("")
                ""
            }
          })
          .distinct
          .take(3)
        val src_host = (1 to target_urls.size).map(_ => record._1)
        src_host zip target_urls
      })
      .distinct
      .filter(x => x._1 != x._2)
      .map(pair => pair._1 + ";" + pair._2)
      .coalesce(1)

zipped_urls.saveAsTextFile(OUT_DIR)
zipped_urls.take(1)

OUT_DIR = link_analysis
zipped_urls = CoalescedRDD[425] at coalesce at <console>:457


Array(wikipedia.org;49ers.com)

## Generate Network Graph

The output contains a list of semi-column separated domain pairs.
You may directly feed this file into your favorite visualization tool to create a network graph.
We use Gephi with Multilevel Layout in our paper.

In [4]:
import sys.process._

// Remove the old output directory
"rm -rf network_graph.png /tmp/link_analysis" !

// Copy new output from HDFS to local filesystem
"hdfs dfs -copyToLocal link_analysis /tmp/link_analysis" !

// Draw the graph
"python draw_graph.py" !

/Nhome/raclancy/miniame: 
Type: Graph
Number of nodes: 182
Number of edges: 149
Average degree:   1.6374
The iterable function was deprecated in Matplotlib 3.1 and will be removed in 3.3. Use np.iterable instead.
  if not cb.iterable(width):




0

![](network_graph.png)