# Known Sites

It can be useful to get a sense of what is being archived from some well-known sites, for example newspapers like the New York Times or social media sites like Twitter. We can use Spark to filter through all our URLs and pull out the ones that match a particular pattern.

In [1]:
import sys

sys.path.append('../utils')
from warc_spark import init

sc, sqlc = init()
urls = sqlc.read.csv('results/urls', header=True)

First let's add a norm_url column to our DataFrame that contains a normalized form of the original URL. This removes tracking URL parameters and other things that would otherwise make matching hard (http vs https, www, etc).

In [2]:
from urllib.parse import urlparse
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

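def norm_url(url):
    """
    Normalizes a URL by stripping protocol, www, query string and hash fragment.
    """
    u = urlparse(url)
    host = u.netloc
    # remove a leading "www." prefix; str.lstrip('www.') would strip any
    # leading 'w' or '.' characters and mangle hosts like "web.archive.org"
    if host.startswith('www.'):
        host = host[len('www.'):]

    # keep youtube video urls that use the query string
    if 'youtube.com' in host:
        return ''.join([host, u.path, '?', u.query])
    else:
        return ''.join([host, u.path])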

# turn norm_url into a user defined function (udf) for spark
norm_url = udf(norm_url, StringType())

# create a new column (norm_url) using the udf
urls = urls.withColumn('norm_url', norm_url(urls.url))
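
As a quick sanity check outside Spark, the normalization logic can be exercised as a plain Python function. The sketch below is an approximation of the approach above, not the notebook's exact code: `normalize` is a hypothetical name and the sample URLs are made up. It also illustrates one detail worth knowing: `str.lstrip('www.')` removes any leading `w` and `.` characters rather than the literal prefix, so an explicit prefix check is the safer choice.

```python
from urllib.parse import urlparse

def normalize(url):
    """Hypothetical stand-in for the normalizer, runnable without Spark."""
    u = urlparse(url)
    host = u.netloc
    # remove only a literal leading "www." prefix
    if host.startswith('www.'):
        host = host[len('www.'):]
    # keep the query string for youtube, where it identifies the video
    # (skipping the '?' when there is no query is a choice made in this sketch)
    if 'youtube.com' in host and u.query:
        return host + u.path + '?' + u.query
    return host + u.path

# made-up sample URLs
samples = [
    'https://www.nytimes.com/section/world?utm_source=feed#top',
    'http://nytimes.com/section/world',
    'https://www.youtube.com/watch?v=abc123',
    'https://web.archive.org/web/2019/',
]
for s in samples:
    print(normalize(s))

# lstrip treats its argument as a set of characters, not a prefix
print('web.archive.org'.lstrip('www.'))
```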

## New York Times

In [3]:
urls.createOrReplaceTempView('urls')

sql = '''
      SELECT norm_url AS url, COUNT(norm_url) AS total
      FROM urls
      WHERE norm_url LIKE "nytimes.com/%"
      GROUP BY norm_url
      ORDER BY total DESC
      '''

nyt = sqlc.sql(sql)
nyt.take(5)

[Row(url='nytimes.com/', total=174),
 Row(url='nytimes.com/1990/08/12/movies/film-sure-he-s-king-but-he-comes-from-las-vegas.html', total=83),
 Row(url='nytimes.com/2017/10/24/business/senate-vote-wall-street-regulation.html', total=56),
 Row(url='nytimes.com/2018/10/21/business/what-comes-after-the-roomba.html', total=41),
 Row(url='nytimes.com/2018/10/24/us/politics/trump-phone-security.html', total=40)]
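
To make explicit what the query computes, here is a small plain-Python sketch of the same group-and-count logic using `collections.Counter`. The URL list is made up for illustration and `top_urls` is a hypothetical helper; on real data Spark does this work for us.

```python
from collections import Counter

def top_urls(norm_urls, prefix, n=5):
    """Count each normalized URL starting with prefix; return the n most common."""
    counts = Counter(u for u in norm_urls if u.startswith(prefix))
    return counts.most_common(n)

# made-up normalized URLs
norm_urls = [
    'nytimes.com/',
    'nytimes.com/a.html',
    'nytimes.com/',
    'twitter.com/search',
    'nytimes.com/a.html',
    'nytimes.com/',
]
print(top_urls(norm_urls, 'nytimes.com/'))
```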

## Generalize

It would be convenient to have a function that does this for any URL pattern and also attempts to fetch the title of each page.

First let's create the function that fetches a page from the web and returns its URL, title and text. We will use [requests](https://2.python-requests.org/en/master/) to fetch the page, [readability](https://github.com/buriy/python-readability) to separate the main content of the page from its boilerplate, and [bleach](https://github.com/mozilla/bleach) to remove all the HTML tags.

In [18]:
import re
import bleach
import requests
import readability

def get_page(url):
    resp = requests.get(url)
    result = {'url': url, 'title': '', 'text': ''}
    # return an empty title and text for anything other than a 200 OK
    if resp.status_code != 200:
        return result

    # extract the title and the main content, dropping page boilerplate
    doc = readability.Document(resp.text)
    result['title'] = doc.title()
    text = doc.summary(html_partial=True)
    # strip all HTML tags, then collapse runs of whitespace into one space
    text = bleach.clean(text, tags=[], strip=True)
    result['text'] = re.sub(r'\s\s+', ' ', text)

    return result

get_page('https://www.nytimes.com/interactive/2019/06/14/opinion/bluetooth-wireless-tracking-privacy.html')

{'url': 'https://www.nytimes.com/interactive/2019/06/14/opinion/bluetooth-wireless-tracking-privacy.html',
 'title': 'Opinion | In Stores, Secret Bluetooth Surveillance Tracks Your Every Move - The New York Times',
 'text': " Imagine you are shopping in your favorite grocery store. As you approach the dairy aisle, you are sent a push notification in your phone: “10 percent off your favorite yogurt! Click here to redeem your coupon.” You considered buying yogurt on your last trip to the store, but you decided against it. How did your phone know? Your smartphone was tracking you. The grocery store got your location data and paid a shadowy group of marketers to use that information to target you with ads. Recent reports have noted how companies use data gathered from cell towers, ambient Wi-Fi, and GPS. But the location data industry has a much more precise, and unobtrusive, tool: Bluetooth beacons. These beacons are small, inobtrusive electronic devices that are hidden throughout the gro
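
The whitespace cleanup at the end of `get_page` can be tried in isolation. This sketch uses only the standard library `re` module on a made-up string; `collapse_whitespace` is a hypothetical name. Since `\s` already matches spaces, a single substitution covers runs of spaces as well as runs of newlines and tabs.

```python
import re

def collapse_whitespace(text):
    """Replace every run of two or more whitespace characters with one space."""
    return re.sub(r'\s\s+', ' ', text)

# made-up text resembling what remains after tags are stripped
raw = 'Imagine  you are\n\n   shopping in\tyour favorite \n store.'

# a lone tab or newline is left as-is, only runs of two or more are replaced
print(repr(collapse_whitespace(raw)))
```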

Now we just need a function that takes a pattern to apply to the normalized URLs and a number of results to return, runs the query, fetches each matching page, and returns the results.

In [11]:
def pages(pattern, n=5):
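    # NOTE: pattern and n are interpolated directly into the SQL string,
    # which is open to SQL injection, so only pass trusted values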
    sql = '''
      SELECT norm_url AS url, COUNT(norm_url) AS total
      FROM urls
      WHERE norm_url RLIKE "{}"
      GROUP BY norm_url
      ORDER BY total DESC
      LIMIT {}
      '''.format(pattern, n)
    
    page_list = []
    for r in sqlc.sql(sql).collect():
        page = get_page('https://' + r.url)
        page['count'] = r.total
        page_list.append(page)
    
    return page_list
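
Because `pages` builds its SQL with string formatting, it is worth validating the inputs first. The following is a minimal sketch under stated assumptions: `check_pattern` and `check_limit` are hypothetical helpers, not part of Spark, and Python's `re` is used only as a rough syntax check, since Spark's `RLIKE` uses Java regular expressions, which differ in places.

```python
import re

def check_pattern(pattern):
    """Reject patterns that could break out of the quoted SQL string."""
    if '"' in pattern or '\\' in pattern:
        raise ValueError('pattern may not contain quotes or backslashes')
    re.compile(pattern)  # raises re.error on malformed regex syntax
    return pattern

def check_limit(n):
    """Make sure the LIMIT value is a positive integer."""
    n = int(n)
    if n < 1:
        raise ValueError('n must be at least 1')
    return n

print(check_pattern('^nytimes.com/.+'), check_limit(10))
```

Calling both helpers at the top of `pages` before formatting the query would be one way to use them. Rejecting backslashes is deliberately strict and rules out shorthand classes such as `\d`; that trade-off is a choice made in this sketch.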

Now we can test it out on the New York Times again. This time it will print the top 10 archived New York Times pages with their count, URL and title.

In [12]:
for page in pages('^nytimes.com/.+', 10):
    print(page['count'], page['url'], page['title'])

83 https://nytimes.com/1990/08/12/movies/film-sure-he-s-king-but-he-comes-from-las-vegas.html FILM; Sure He's King, But He Comes From Las Vegas - The New York Times
56 https://nytimes.com/2017/10/24/business/senate-vote-wall-street-regulation.html Consumer Bureau Loses Fight to Allow More Class-Action Suits - The New York Times
41 https://nytimes.com/2018/10/21/business/what-comes-after-the-roomba.html What Comes After the Roomba? - The New York Times
40 https://nytimes.com/2018/10/24/us/politics/trump-phone-security.html When Trump Phones Friends, the Chinese and the Russians Listen and Learn - The New York Times
39 https://nytimes.com/2018/10/24/us/fbi-white-nationalist-robert-paul-rundo-rise-above.html F.B.I. Arrests White Nationalist Leader Who Fled the Country for Central America - The New York Times
36 https://nytimes.com/2018/10/24/nyregion/clinton-obama-explosive-device.html Hillary Clinton, Barack Obama and CNN Offices Are Sent Pipe Bombs - The New York Times
35 https://nytime
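
One detail about these patterns: an unescaped `.` in a regular expression matches any character, so `^nytimes.com/` would also match a host such as `nytimesXcom`. The sketch below shows this with Python's `re` module on made-up strings; writing the dot as the character class `[.]` is one way to match it literally without backslashes. Whether this matters depends on the hosts present in the data.

```python
import re

loose = re.compile(r'^nytimes.com/.+')
strict = re.compile(r'^nytimes[.]com/.+')

# made-up normalized URLs
candidates = ['nytimes.com/section/world', 'nytimesXcom/section/world']

for c in candidates:
    # print whether each pattern matches the candidate
    print(c, bool(loose.search(c)), bool(strict.search(c)))
```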

## Twitter

In [27]:
for page in pages(r'^twitter.com/', 50):
    print(page['count'], page['url'], page['title'])

1115 https://twitter.com/search Twitter Search
578 https://twitter.com/hechaocheng 阿城守候 (@hechaocheng) | Twitter
334 https://twitter.com/intent/tweet Post a Tweet on Twitter
296 https://twitter.com/login Login on Twitter
200 https://twitter.com/undefined Twitter / Account Suspended
192 https://twitter.com/ Twitter. It's what's happening.
116 https://twitter.com/stillgray Ian Miles Cheong (@stillgray) | Twitter
113 https://twitter.com/noriko_tkgs 塚越慎子 Noriko Tsukagoshi official (@noriko_tkgs) | Twitter
108 https://twitter.com/tekumakumayucon 山口真由子 (@tekumakumayucon) | Twitter
99 https://twitter.com/SmugZebra 
98 https://twitter.com/robertbland14/status/942159533291446272 Twitter / Account Suspended
92 https://twitter.com/stillgray/status/922916100878172160 Ian Miles Cheong on Twitter: "A Yale humanities student says she accidentally got her illegal alien father detained by ICE. https://t.co/2jvugHDcOQ"
91 https://twitter.com/_jacketchan_ jackie chan (@_jacketchan_) | Twitter
91 https://