# WSDL Diversity Index

Nwala developed a measure called the [WSDL Diversity Index](https://ws-dl.blogspot.com/2018/05/2018-05-04-exploration-of-url-diversity.html), which can be used to measure the difference between seed lists. The idea is further developed in [Jones et al. (2018)](https://arxiv.org/abs/1806.06878). In our case we can treat each request to the SavePageNow service as a seed. What does the diversity of SavePageNow URLs look like? 

Let's load the URL dataset that was created with the URLs notebook into Spark. Since we have millions of URLs, we're going to try to keep our processing in Spark as much as possible, which means we can't really use the implementation that Nwala [released](https://github.com/anwala/url-diversity). But we can borrow heavily from it, and the calculation is relatively easy to make.

First let's start up Spark and load our URL dataset.

In [59]:
import sys

sys.path.append('../utils')

from warc_spark import init

sc, sqlc = init()
df = sqlc.read.csv('results/urls', header=True)
df.show(5)

+--------------------+--------------------+--------------------+--------------------+--------------------+-----------------+-----+
|           record_id|           warc_file|                date|                 url|          user_agent|user_agent_family|  bot|
+--------------------+--------------------+--------------------+--------------------+--------------------+-----------------+-----+
|<urn:uuid:37efc42...|warcs/liveweb-201...|2017-10-25T08:48:04Z|https://www.zoomi...|        okhttp/3.8.1|           okhttp| true|
|<urn:uuid:0c30c05...|warcs/liveweb-201...|2017-10-25T17:03:39Z|http://www.jeuxvi...|Mozilla/5.0 (Maci...|           Chrome|false|
|<urn:uuid:c44821b...|warcs/liveweb-201...|2017-10-25T20:38:03Z|https://www.zoomi...|        okhttp/3.8.1|           okhttp| true|
|<urn:uuid:cbc9b85...|warcs/liveweb-201...|2017-10-25T07:21:51Z|https://www.zoomi...|        okhttp/3.8.1|           okhttp| true|
|<urn:uuid:7c5c319...|warcs/liveweb-201...|2017-10-25T23:37:36Z|https://www.zoomi..

## Domain Diversity

For our purposes, domain diversity is of most interest because we want to see how content from the entire web is being selected in SavePageNow. For domains the calculation is basically:

    (Number of Unique Domains - 1) / (Number of Items - 1)
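
As a minimal sketch of this formula in plain Python (not Nwala's implementation, and not the Spark code used below), the hypothetical helper `diversity_index` computes the index for any list of items. Returning 0.0 for lists with fewer than two items is an assumption made here to avoid dividing by zero.

```python
def diversity_index(items):
    """Return (unique - 1) / (total - 1) for a list of hashable items."""
    total = len(items)
    if total < 2:
        # Assumption: treat empty or single-item lists as having no diversity.
        return 0.0
    return (len(set(items)) - 1) / (total - 1)


# Toy data: four requests for one domain and one request for another.
domains = ['zoominfo.com', 'jeuxvideo.com', 'zoominfo.com',
           'zoominfo.com', 'zoominfo.com']
print(diversity_index(domains))  # (2 - 1) / (5 - 1) = 0.25
```

The index is 0 when every item is the same and 1 when every item is unique.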
    
[tldextract](https://github.com/john-kurkowski/tldextract) is a nice module for parsing a hostname to get its registered domain. It's much trickier than just taking the last three parts of a hostname, and tldextract does the right thing.
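
To see why a fixed number of hostname labels is not enough, here is a small sketch using only string splitting. `last_labels` is a hypothetical helper written for this illustration; it is not part of tldextract, which instead consults the Public Suffix List.

```python
def last_labels(hostname, n):
    """Naively keep the last n dot-separated labels of a hostname."""
    return '.'.join(hostname.split('.')[-n:])


for host in ['www.example.com', 'www.bbc.co.uk']:
    print(host, last_labels(host, 2), last_labels(host, 3))

# www.example.com example.com www.example.com
# www.bbc.co.uk co.uk bbc.co.uk
```

Two labels give the right answer for `example.com` but not for `bbc.co.uk`, and three labels do the reverse, so no fixed count works for both.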

In [67]:
import tldextract

# a function that takes a row and returns the registered domain of its url and whether the request came from a bot

def domain_bot(r):
    return (tldextract.extract(r.url).registered_domain, r.bot)

results = df.rdd.map(domain_bot)
results.persist()
results.take(5)

[('zoominfo.com', 'true'),
 ('jeuxvideo.com', 'false'),
 ('zoominfo.com', 'true'),
 ('zoominfo.com', 'true'),
 ('zoominfo.com', 'true')]

Calculate the Domain WSDL Diversity Index for the entire dataset.

In [75]:
results = results.toDF(['domain', 'bot'])  # RDD.toDF takes the column names as a list
item_count = results.count()
domain_count = results.select('domain').distinct().count()
(domain_count - 1) / (item_count - 1)

0.03569198843391795

We can also look specifically at bots and humans to see if the domain diversity is any different between them.

In [82]:
import pyspark.sql.functions as f

bots = results.filter(f.col('bot') == 'true')
bot_items = bots.count()
bot_domains = bots.select('domain').distinct().count()
(bot_domains - 1) / (bot_items - 1)

0.03369296117141282

In [83]:
humans = results.filter(f.col('bot') == 'false')
human_items = humans.count()
human_domains = humans.select('domain').distinct().count()
(human_domains - 1) / (human_items - 1)

0.05788105665611979

So there is a slight difference in the diversity of domains selected by bots and people: 3.4% vs 5.8%. It would be interesting to see how these compare to other seed lists. Both seem pretty small compared to the 38.6% reported in Jones et al.

In [84]:
print('items:', item_count)
print('bots:', bot_items)
print('humans:', human_items)

items: 7457669
bots: 6292710
humans: 1164959


## URL Diversity

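We can calculate URL diversity the same way. First we need a function that will normalize our URLs per the WSDL Diversity Index measure; here that means keeping only the network location and path, and dropping the scheme, query string, and fragment. The index is then: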

    (number of unique urls - 1) / (number of items - 1)

In [88]:
from urllib.parse import urlparse

def url_bot(r):
    u = urlparse(r.url)
    return (''.join([u.netloc, u.path]), r.bot)

results = df.rdd.map(url_bot)
results.persist()
results.take(5)

[('www.zoominfo.com/p/Gail-Kessler/-1657000950', 'true'),
 ('www.jeuxvideo.com/forums/42-51-53683620-1-0-1-0-pour-etre-assure-au-niveau-de-la-sante-aux-usa.htm',
  'false'),
 ('www.zoominfo.com/p/Nur-Hanim/-1822630766', 'true'),
 ('www.zoominfo.com/p/Graeme-Randell/-1634203968', 'true'),
 ('www.zoominfo.com/p/Larry-Campos/-1862445193', 'true')]
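
To make the effect of this normalization concrete, the following sketch applies the same `netloc` plus `path` rule to a few made-up URLs. The `example.com` addresses and the `normalize` helper name are illustrative assumptions, not part of the dataset.

```python
from urllib.parse import urlparse


def normalize(url):
    """Keep only the network location and path, as url_bot does above."""
    u = urlparse(url)
    return ''.join([u.netloc, u.path])


urls = [
    'http://www.example.com/page',
    'https://www.example.com/page?utm_source=feed',
    'https://www.example.com/page#section',
    'https://www.example.com/page/',
]
for url in urls:
    print(normalize(url))

# www.example.com/page
# www.example.com/page
# www.example.com/page
# www.example.com/page/
```

The scheme, query string, and fragment are discarded, so the first three URLs count as one. The trailing slash is kept, so the fourth remains distinct. Dropping the query string also merges pages that differ only in their query parameters.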

OK, now let's calculate the overall URL diversity:

In [95]:
results = results.toDF(['url', 'bot'])  # RDD.toDF takes the column names as a list
item_count = results.count()
url_count = results.select('url').distinct().count()
(url_count - 1) / (item_count - 1)

0.7650201108443014

In [96]:
bots = results.filter(f.col('bot') == 'true')
bot_items = bots.count()
bot_urls = bots.select('url').distinct().count()
(bot_urls - 1) / (bot_items - 1)

0.811429068148551

In [98]:
humans = results.filter(f.col('bot') == 'false')
human_items = humans.count()
human_urls = humans.select('url').distinct().count()
(human_urls - 1) / (human_items - 1)

0.5326578297243334

In [99]:
print('items:', item_count)
print('bots:', bot_items)
print('humans:', human_items)

items: 7457669
bots: 6292710
humans: 1164959


So for URLs the pattern reverses: the bots are selecting much more diverse URLs than humans (81% vs 53%). This is interesting, particularly given that bots had the lower domain diversity.
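
One way both findings could hold at once is a crawler that requests many different pages from a handful of sites. The toy data below is invented purely to illustrate that pattern; it is not drawn from the SavePageNow dataset, and the `diversity` helper is a hypothetical stand-in for the Spark calculation.

```python
def diversity(items):
    """WSDL-style index: (unique - 1) / (total - 1)."""
    return (len(set(items)) - 1) / (len(items) - 1)


# Invented requests as (domain, normalized URL) pairs.
bot = [('a.com', 'a.com/%d' % i) for i in range(10)]
human = [('a.com', 'a.com/'), ('b.org', 'b.org/'), ('c.net', 'c.net/'),
         ('a.com', 'a.com/'), ('b.org', 'b.org/')]

for name, rows in [('bot', bot), ('human', human)]:
    domains = [d for d, _ in rows]
    urls = [u for _, u in rows]
    print(name, diversity(domains), diversity(urls))

# bot 0.0 1.0
# human 0.5 0.5
```

In this toy case the bot never repeats a URL but never leaves one domain, so its URL diversity is higher and its domain diversity lower than the human's.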