# Link Analysis

Link analysis studies the relationships between nodes in a graph and is frequently used to visualize them. Here the nodes are web domains whose pages contain a given term, and the edges are the hyperlinks between those domains.

## Import Dependencies

In [1]:
%%init_spark
launcher.conf.spark.executor.instances = 8
launcher.conf.spark.executor.cores = 32
launcher.conf.spark.executor.memory = '8G'
launcher.conf.spark.driver.memory = '4G'
launcher.jars = ["sparksolrini.jar"]

In [2]:
import sys.process._

"pip install matplotlib" !

"pip install networkx" !

Intitializing Scala interpreter ...

Spark Web UI available at http://desktop:4040
SparkContext available as 'sc' (version = 2.4.0, master = local[*], app id = local-1563129268111)
SparkSession available as 'spark'

import sys.process._
res0: Int = 0


## Query Solr

First, we retrieve the documents in the ClueWeb09b collection that contain the word "jaguar" and pair each document's top private domain with its raw HTML.

In [5]:
import com.lucidworks.spark.rdd.SelectSolrRDD
import com.google.common.net.InternetDomainName
import org.jsoup.Jsoup
import org.apache.hadoop.fs.{FileSystem, Path}

import scala.collection.JavaConverters._
import java.net.URL

// Solr's ZooKeeper URL
val SOLR = "192.168.1.111:9983"

// The Solr collection
val INDEX = "cw09b-url"

// The Solr query
val QUERY = "contents:jaguar"

// The number of partitions
val PARTITIONS = 8

// The limit for number of rows to process
val LIMIT = 10000

val source_urls = new SelectSolrRDD(SOLR, INDEX, sc, maxRows = Some(LIMIT))
  .rows(10000)
  .query(QUERY)
  .repartition(PARTITIONS)
  .mapPartitions(docs => {
    docs.map(doc => {
      // The stored URL value carries one extra character on each side, so strip both
      val url = doc.get("url") + ""
      try {
        val host = new URL(url.substring(1, url.length - 1)).getHost
        (InternetDomainName.from(host).topPrivateDomain().name(), doc.get("raw") + "")
      } catch {
        // Unparseable URLs yield an empty pair, which the filter below drops
        case _: Exception => ("", "")
      }
    })
    .filter(!_._2.isEmpty)
  })

import com.lucidworks.spark.rdd.SelectSolrRDD
import com.google.common.net.InternetDomainName
import org.jsoup.Jsoup
import org.apache.hadoop.fs.{FileSystem, Path}
import scala.collection.JavaConverters._
import java.net.URL
SOLR: String = 192.168.1.111:9983
INDEX: String = cw09b-url
QUERY: String = contents:jaguar
PARTITIONS: Int = 8
LIMIT: Int = 10000
source_urls: org.apache.spark.rdd.RDD[(String, String)] = MapPartitionsRDD[9] at mapPartitions at <console>:55

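The domain normalization in the cell above can be illustrated outside Spark. The following is a minimal Python sketch, not the notebook's implementation: `naive_registered_domain` is a hypothetical helper that only approximates Guava's `topPrivateDomain()` by keeping the last two host labels, which gives wrong answers for multi-label public suffixes such as `co.uk`. It also assumes the stored URL value is wrapped in one extra character on each side (for example square brackets), mirroring the `substring(1, url.length - 1)` call.

```python
from urllib.parse import urlparse


def naive_registered_domain(raw_url: str) -> str:
    """Return a rough registered domain for a stored URL, or '' on failure.

    Assumption: raw_url is wrapped in one extra character on each side.
    Assumption: the last two host labels form the registered domain
    (Guava consults the Public Suffix List instead).
    """
    url = raw_url[1:-1]               # strip the wrapping characters
    host = urlparse(url).hostname     # None when no host can be parsed
    if not host:
        return ""
    labels = host.split(".")
    if len(labels) < 2:
        return ""
    return ".".join(labels[-2:])


# Illustrative input only; the URL is not taken from the collection
print(naive_registered_domain("[http://en.wikipedia.org/wiki/Jaguar]"))
```

Returning an empty string on failure follows the same convention as the Scala cell, so failed records can be filtered out in a later step.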

## Compute Links

We then randomly sample 1% of the retrieved documents and, for each sampled document, keep the first three distinct domains among its outgoing links. Self-links are discarded.

In [6]:
// The output path
val OUT_DIR = "link_analysis"

// Delete old output dir
FileSystem.get(sc.hadoopConfiguration).delete(new Path(OUT_DIR), true)

val zipped_urls = source_urls.sample(withReplacement = false, fraction = 0.01, seed = 42)
  .flatMap(record => {
    // Keep the first three distinct target domains linked from this document
    val target_urls = Jsoup.parse(record._2)
      .select("a[href]")
      .asScala
      .map(link => link.attr("abs:href"))
      .filter(!_.isEmpty)
      .map(link => {
        try { InternetDomainName.from(new URL(link).getHost).topPrivateDomain().name() }
        catch {
          // Malformed URLs and hosts without a top private domain map to ""
          case _: Exception => ""
        }
      })
      .distinct
      .take(3)
    target_urls.map(target => (record._1, target))
  })
  .distinct
  .filter(x => x._2.nonEmpty && x._1 != x._2) // drop unparseable targets and self-links
  .map(pair => pair._1 + ";" + pair._2)
  .coalesce(1)

zipped_urls.saveAsTextFile(OUT_DIR)
zipped_urls.take(1)

OUT_DIR: String = link_analysis
zipped_urls: org.apache.spark.rdd.RDD[String] = CoalescedRDD[17] at coalesce at <console>:66
res1: Array[String] = Array(wikipedia.org;49ers.com)

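The per-document step above (collect anchors, map them to domains, keep the first few distinct ones, drop self-links) can be sketched in plain Python. This is an illustration under stated assumptions rather than a port of the Scala cell: `HrefCollector` and `first_targets` are hypothetical names, the default `to_domain` simply returns the hostname instead of the top private domain, and unparseable or relative links are skipped before the limit is applied.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class HrefCollector(HTMLParser):
    """Collect the href attribute of every <a> tag, in document order."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def first_targets(src_domain, html, limit=3,
                  to_domain=lambda u: urlparse(u).hostname or ""):
    """Return up to `limit` (src_domain, target) pairs, minus self-links."""
    parser = HrefCollector()
    parser.feed(html)
    targets = []
    for href in parser.hrefs:
        if len(targets) >= limit:
            break
        target = to_domain(href)      # "" for relative or unparseable links
        if target and target not in targets:
            targets.append(target)
    # Self-links are removed after the limit, as in the notebook's filter step
    return [(src_domain, t) for t in targets if t != src_domain]


# Hypothetical page content, for illustration only
page = ('<a href="http://a.example.com/x">1</a><a href="/rel">2</a>'
        '<a href="http://b.org/">3</a><a href="http://c.net/">4</a>'
        '<a href="http://d.io/">5</a>')
print(first_targets("b.org", page))
```

Because the limit is applied before self-links are removed, a page can contribute fewer than three edges, which matches the order of operations in the Scala pipeline.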

## Generate Network Graph

The output contains a list of semicolon-separated domain pairs.
You may directly feed this file into your favorite visualization tool to create a network graph.
We use Gephi with Multilevel Layout in our paper.
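
For readers who want to inspect the edge list before loading it into a visualization tool, the following Python sketch parses `source;target` lines and counts incoming links per domain. The helper names `parse_edges` and `in_degree` are hypothetical, and apart from the `wikipedia.org;49ers.com` pair shown in the output above, the sample lines are made up for illustration.

```python
from collections import Counter


def parse_edges(lines):
    """Parse 'source;target' lines into (source, target) tuples.

    Blank lines and lines without both endpoints are skipped.
    """
    edges = []
    for line in lines:
        source, sep, target = line.strip().partition(";")
        if sep and source and target:
            edges.append((source, target))
    return edges


def in_degree(edges):
    """Count how many distinct edges point at each target domain."""
    return Counter(target for _, target in set(edges))


sample = [
    "wikipedia.org;49ers.com",
    "example.com;49ers.com",   # made-up pair
    "",                        # blank line, skipped
    "broken-line",             # no delimiter, skipped
]
edges = parse_edges(sample)
print(in_degree(edges).most_common(1))
```

The same function works on the real output by passing an open file handle for `part-00000`, since iterating over a text file yields its lines.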

In [7]:
import sys.process._

// Remove the old output directory
"rm -rf network_graph.png /tmp/link_analysis" !

// Copy new output from HDFS to local filesystem
"hdfs dfs -copyToLocal link_analysis /tmp/link_analysis" !

warning:  there were three feature warnings; re-run with -feature for details

Finally, we load the edge list with NetworkX and draw the graph.

In [None]:
%%python

import networkx as nx
import matplotlib.pyplot as plt

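# Hyperlinks are directed (source;target), so load the edge list as a directed graph
g = nx.read_edgelist('/tmp/link_analysis/part-00000', delimiter=';', create_using=nx.DiGraph(), nodetype=str)

# Print a short summary of the graph size
print(g.number_of_nodes(), 'nodes,', g.number_of_edges(), 'edges')

nx.draw_networkx(g, arrows=True, node_size=20, with_labels=False)

# Save before show(); saving after the figure has been shown can produce a blank image
plt.savefig('network_graph.png')
plt.show()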

![](network_graph.png)