# Known Sites

It can be interesting to get a sense of what is being archived at some well-known sites, for example newspapers like the New York Times or social media sites like Twitter. We can use Spark to filter through all our URLs and pull out the ones that match a particular pattern.

In [1]:
import sys

sys.path.append('../utils')
from warc_spark import init

sc, sqlc = init()
urls = sqlc.read.csv('results/urls', header=True)

First let's add a norm_url column to our DataFrame that contains a normalized form of the original URL. This removes tracking URL parameters and other things that would otherwise make matching hard (http vs https, www, etc.).

In [2]:
from urllib.parse import urlparse
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

def norm_url(url):
    """
    Normalizes a URL by stripping protocol, www, query string and hash fragment.
    """
    u = urlparse(url)
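    host = u.netloc
    # str.lstrip('www.') would strip any leading 'w' or '.' characters,
    # not the literal prefix, so remove the prefix explicitly
    if host.startswith('www.'):
        host = host[len('www.'):]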
    
    # keep youtube video urls that use the query string
    if 'youtube.com' in host:
        return ''.join([host, u.path, '?', u.query])
    else:
        return ''.join([host, u.path])

# turn norm_url into a user defined function (udf) for spark
norm_url = udf(norm_url, StringType())

# create a new column (norm_url) using the udf
urls = urls.withColumn('norm_url', norm_url(urls.url))
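
The normalization logic can be tried out without Spark. The sketch below is a plain-Python stand-in (the helper names `strip_www` and `normalize` are illustrative and not part of the notebook). It also shows why the `www.` prefix is best removed explicitly: `str.lstrip` treats its argument as a set of characters rather than as a prefix.

```python
from urllib.parse import urlparse

def strip_www(host):
    """Remove a literal leading 'www.' from a hostname (illustrative helper)."""
    return host[len('www.'):] if host.startswith('www.') else host

def normalize(url):
    """Plain-Python sketch: drop scheme, 'www.', query string and fragment."""
    u = urlparse(url)
    return strip_www(u.netloc) + u.path

print(normalize('https://www.nytimes.com/?utm_source=feed#top'))
print(normalize('http://nytimes.com/'))

# lstrip removes any leading 'w' and '.' characters, not just the prefix
print('washingtonpost.com'.lstrip('www.'))
print(strip_www('washingtonpost.com'))
```

Both New York Times URLs collapse to the same key, which is what makes the later `GROUP BY norm_url` counts meaningful.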

## New York Times

In [3]:
urls.createOrReplaceTempView('urls')

sql = '''
      SELECT norm_url AS url, COUNT(norm_url) AS total
      FROM urls
      WHERE norm_url LIKE "nytimes.com/%"
      GROUP BY norm_url
      ORDER BY total DESC
      '''

nyt = sqlc.sql(sql)
nyt.take(5)

[Row(url='nytimes.com/', total=174),
 Row(url='nytimes.com/1990/08/12/movies/film-sure-he-s-king-but-he-comes-from-las-vegas.html', total=83),
 Row(url='nytimes.com/2017/10/24/business/senate-vote-wall-street-regulation.html', total=56),
 Row(url='nytimes.com/2018/10/21/business/what-comes-after-the-roomba.html', total=41),
 Row(url='nytimes.com/2018/10/24/us/politics/trump-phone-security.html', total=40)]

## Generalize

For convenience it would be useful to have a function that will do this with any URL pattern, and will also go and attempt to get the title of the page.

First let's create the function that fetches a page from the web. It returns the url, title and text of the page. We will use [requests](https://2.python-requests.org/en/master/) to fetch the page, [readability](https://github.com/buriy/python-readability) to separate the main content of the page from its boilerplate, and [bleach](https://github.com/mozilla/bleach) to remove all the HTML tags.

In [170]:
import re
import bleach
import requests
import readability

def get_page(url):
    result = {'url': url, 'title': '', 'text': ''}

    try:
        resp = requests.get(url)
        if resp.status_code != 200:
            return result

        doc = readability.Document(resp.text)
        result['title'] = doc.title()
        text = doc.summary(html_partial=True)
        # strip all HTML tags, then collapse runs of whitespace
        text = bleach.clean(text, tags=[], strip=True)
        result['text'] = re.sub(r'\s\s+', ' ', text)
    except Exception:
        # best effort: on any fetch or parse error return the empty result
        pass
    
    return result

get_page('https://www.nytimes.com/interactive/2019/06/14/opinion/bluetooth-wireless-tracking-privacy.html')

{'url': 'https://www.nytimes.com/interactive/2019/06/14/opinion/bluetooth-wireless-tracking-privacy.html',
 'title': 'Opinion | In Stores, Secret Bluetooth Surveillance Tracks Your Every Move - The New York Times',
 'text': " Imagine you are shopping in your favorite grocery store. As you approach the dairy aisle, you are sent a push notification in your phone: “10 percent off your favorite yogurt! Click here to redeem your coupon.” You considered buying yogurt on your last trip to the store, but you decided against it. How did your phone know? Your smartphone was tracking you. The grocery store got your location data and paid a shadowy group of marketers to use that information to target you with ads. Recent reports have noted how companies use data gathered from cell towers, ambient Wi-Fi, and GPS. But the location data industry has a much more precise, and unobtrusive, tool: Bluetooth beacons. These beacons are small, inobtrusive electronic devices that are hidden throughout the
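
The whitespace clean-up at the end of get_page can be checked in isolation. This is a minimal sketch using a made-up input string (`collapse_whitespace` is an illustrative name): the pattern `\s\s+` matches any run of two or more whitespace characters (spaces, tabs, newlines) and replaces it with one space, while a lone whitespace character is left alone.

```python
import re

def collapse_whitespace(text):
    """Replace each run of two or more whitespace characters with one space."""
    return re.sub(r'\s\s+', ' ', text)

# made-up sample, not taken from a real page
sample = 'Imagine  you are\n\n   shopping\tin a store.'
print(repr(collapse_whitespace(sample)))
```

Note that the single tab survives, so text cleaned this way can still contain isolated tabs or newlines.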

Now we just need a function that takes a pattern to apply to the normalized URLs and a number of results to return, and will go and do the query and return the results.

In [52]:
def pages(pattern, n=5, bot=None):
   
    if bot == True:
        bot_clause = 'AND bot == "true"'
    elif bot == False:
        bot_clause = 'AND bot == "false"'
    else:
        bot_clause = ''
            
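    # NOTE: the pattern is interpolated straight into the SQL string
    # (sql injection anyone?), so only pass in patterns you trust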
    sql = '''
      SELECT norm_url AS url, COUNT(norm_url) AS total
      FROM urls
      WHERE norm_url RLIKE "{}" {}
      GROUP BY norm_url
      ORDER BY total DESC
      LIMIT {}
      '''.format(pattern, bot_clause, n)
    
    page_list = []
    for r in sqlc.sql(sql).collect():
        page = get_page('https://' + r.url)
        page['count'] = r.total
        page_list.append(page)
    
    return page_list
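
Because the query is assembled with string formatting, it can help to separate query building from execution so the SQL can be inspected first. The following is only a sketch: `build_query` is a hypothetical helper, and its two checks (rejecting double quotes, coercing the limit to an integer) reduce the most obvious problems but are not a complete defence against injection.

```python
def build_query(pattern, n=5, bot=None):
    """Return the SQL string that pages() would run (illustrative helper)."""
    # the pattern sits inside a double-quoted SQL literal, so a double
    # quote in it would end the literal early
    if '"' in pattern:
        raise ValueError('pattern must not contain a double quote')

    limit = int(n)  # raises for anything that is not a whole number

    if bot is True:
        bot_clause = 'AND bot == "true"'
    elif bot is False:
        bot_clause = 'AND bot == "false"'
    else:
        bot_clause = ''

    return '''
      SELECT norm_url AS url, COUNT(norm_url) AS total
      FROM urls
      WHERE norm_url RLIKE "{}" {}
      GROUP BY norm_url
      ORDER BY total DESC
      LIMIT {}
      '''.format(pattern, bot_clause, limit)

print(build_query('^nytimes.com/.+', 10, bot=False))
```

The returned string could then be handed to `sqlc.sql(...)` exactly as in the cell above.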

Now we can test it out on the New York Times again. This time it will print the top 10 New York Times pages archived with their count, the url and the title.

In [12]:
for page in pages('^nytimes.com/.+', 10):
    print(page['count'], page['url'], page['title'])

83 https://nytimes.com/1990/08/12/movies/film-sure-he-s-king-but-he-comes-from-las-vegas.html FILM; Sure He's King, But He Comes From Las Vegas - The New York Times
56 https://nytimes.com/2017/10/24/business/senate-vote-wall-street-regulation.html Consumer Bureau Loses Fight to Allow More Class-Action Suits - The New York Times
41 https://nytimes.com/2018/10/21/business/what-comes-after-the-roomba.html What Comes After the Roomba? - The New York Times
40 https://nytimes.com/2018/10/24/us/politics/trump-phone-security.html When Trump Phones Friends, the Chinese and the Russians Listen and Learn - The New York Times
39 https://nytimes.com/2018/10/24/us/fbi-white-nationalist-robert-paul-rundo-rise-above.html F.B.I. Arrests White Nationalist Leader Who Fled the Country for Central America - The New York Times
36 https://nytimes.com/2018/10/24/nyregion/clinton-obama-explosive-device.html Hillary Clinton, Barack Obama and CNN Offices Are Sent Pipe Bombs - The New York Times
35 https://nytime

## Twitter

In [27]:
for page in pages(r'^twitter.com/', 50):
    print(page['count'], page['url'], page['title'])

1115 https://twitter.com/search Twitter Search
578 https://twitter.com/hechaocheng 阿城守候 (@hechaocheng) | Twitter
334 https://twitter.com/intent/tweet Post a Tweet on Twitter
296 https://twitter.com/login Login on Twitter
200 https://twitter.com/undefined Twitter / Account Suspended
192 https://twitter.com/ Twitter. It's what's happening.
116 https://twitter.com/stillgray Ian Miles Cheong (@stillgray) | Twitter
113 https://twitter.com/noriko_tkgs 塚越慎子 Noriko Tsukagoshi official (@noriko_tkgs) | Twitter
108 https://twitter.com/tekumakumayucon 山口真由子 (@tekumakumayucon) | Twitter
99 https://twitter.com/SmugZebra 
98 https://twitter.com/robertbland14/status/942159533291446272 Twitter / Account Suspended
92 https://twitter.com/stillgray/status/922916100878172160 Ian Miles Cheong on Twitter: "A Yale humanities student says she accidentally got her illegal alien father detained by ICE. https://t.co/2jvugHDcOQ"
91 https://twitter.com/_jacketchan_ jackie chan (@_jacketch

## Twitter (Bot/Browser)

### Bots

In [35]:
for page in pages(r'^twitter.com/', n=25, bot=True):
    print(page['count'], page['url'])

981 https://twitter.com/search
578 https://twitter.com/hechaocheng
113 https://twitter.com/noriko_tkgs
108 https://twitter.com/tekumakumayucon
99 https://twitter.com/SmugZebra
98 https://twitter.com/robertbland14/status/942159533291446272
91 https://twitter.com/_jacketchan_
91 https://twitter.com/Vuralol
91 https://twitter.com/CDisillusion
88 https://twitter.com/ravenokamura
87 https://twitter.com/Bz_Hive
85 https://twitter.com/DineshDSouza/status/966078572321562625
84 https://twitter.com/_cynthiaaileen
82 https://twitter.com/zramosx3
79 https://twitter.com/TitaniaMcGrath
77 https://twitter.com/
62 https://twitter.com/andrejpwalker/status/922628584619085824
58 https://twitter.com/rxhmxo/status/1054539811346771969
53 https://twitter.com/IainLJBrown
52 https://twitter.com/TitaniaMcGrath/status/1055205594347368448
50 https://twitter.com/i/moments
46 https://twitter.com/HNA_online
44 https://twitter.com/doubting_thomas/status/1055029654560489472
39 https://twitter.com/suttonnick/status/790

### Browser

In [36]:
for page in pages(r'^twitter.com/', n=25, bot=False):
    print(page['count'], page['url'])

302 https://twitter.com/intent/tweet
268 https://twitter.com/login
200 https://twitter.com/undefined
134 https://twitter.com/search
116 https://twitter.com/stillgray
115 https://twitter.com/
92 https://twitter.com/stillgray/status/922916100878172160
51 https://twitter.com/alanmayers/status/1034091370124902400/
50 https://twitter.com/stillgray/status/923125217102249985
49 https://twitter.com/i/live/768633364911788032
49 https://twitter.com/stillgray/status/923077500195835904
48 https://twitter.com/vinegar
47 https://twitter.com/ssnowphia
46 https://twitter.com/scrowder
44 https://twitter.com/cnnbrk
44 https://twitter.com/alanmayers
43 https://twitter.com/stillgray/status/923076226050244609
43 https://twitter.com/topmailru/status/605742757886443520
43 https://twitter.com/stillgray/status/923113854233489408
43 https://twitter.com/stillgray/status/923111226351087618
43 https://twitter.com/stillgray/status/922990020289470464
43 https://twitter.com/stillgray/status/923112257453621249
43 http

In [47]:
tweets = urls.filter(urls.norm_url == "twitter.com/stillgray/status/923077500195835904")
results = tweets.groupBy('user_agent').count().collect()
results

[Row(user_agent='savepagenow (https://github.com/pastpages/savepagenow)', count=49)]

## Wikipedia

Wikipedia automatically snapshots every version of an article. So why are people archiving Wikipedia pages at the Internet Archive?

In [63]:
for page in pages(r'^en.wikipedia.org/.*', 25):
    print(page['count'], page['url'], page['title'])

138 https://en.wikipedia.org/w/index.php Wikipedia, the free encyclopedia
99 https://en.wikipedia.org/wiki/Bob%C3%B4_(footballer,_born_1985) Bobô (footballer, born 1985) - Wikipedia
88 https://en.wikipedia.org/wiki/Jason_Scott Jason Scott - Wikipedia
80 https://en.wikipedia.org/wiki/List_of_Wikipedias List of Wikipedias - Wikipedia
22 https://en.wikipedia.org/wiki/Main_Page Wikipedia, the free encyclopedia
19 https://en.wikipedia.org/wiki/List_of_most-followed_Twitter_accounts List of most-followed Twitter accounts - Wikipedia
13 https://en.wikipedia.org/wiki/Boustrophedon Boustrophedon - Wikipedia
12 https://en.wikipedia.org/wiki/User_talk:Grumbles180-co User talk:Grumbles180-co - Wikipedia
12 https://en.wikipedia.org/wiki/Suning_County Suning County - Wikipedia
11 https://en.wikipedia.org/wiki/https://en.wikipedia.org/w/index.php 
8 https://en.wikipedia.org/wiki/Kermode_bear Kermode bear - Wikipedia
8 https://en.wikipedia.org/wiki/George_W._Bush George W. Bush - Wikipedia
8 https://

What do the browser requests look like?

In [55]:
for page in pages(r'^en.wikipedia.org/.*', 25, bot=False):
    print(page['count'], page['url'], page['title'])

138 https://en.wikipedia.org/w/index.php Wikipedia, the free encyclopedia
80 https://en.wikipedia.org/wiki/List_of_Wikipedias List of Wikipedias - Wikipedia
18 https://en.wikipedia.org/wiki/Main_Page Wikipedia, the free encyclopedia
12 https://en.wikipedia.org/wiki/User_talk:Grumbles180-co User talk:Grumbles180-co - Wikipedia
11 https://en.wikipedia.org/wiki/https://en.wikipedia.org/w/index.php 
8 https://en.wikipedia.org/wiki/Kermode_bear Kermode bear - Wikipedia
8 https://en.wikipedia.org/wiki/Ronald_Reagan Ronald Reagan - Wikipedia
8 https://en.wikipedia.org/wiki/George_W._Bush George W. Bush - Wikipedia
7 https://en.wikipedia.org/wiki/Stuart_Sutcliffe Stuart Sutcliffe - Wikipedia
7 https://en.wikipedia.org/wiki/Chuck_Norris Chuck Norris - Wikipedia
7 https://en.wikipedia.org/wiki/Mahatma_Gandhi Mahatma Gandhi - Wikipedia
7 https://en.wikipedia.org/wiki/Bill_Clinton Bill Clinton - Wikipedia
7 https://en.wikipedia.org/wiki/Coca-Cola Coca-Cola - Wikipedia
7 https://en.wikipedia.org/wi

https://en.wikipedia.org/wiki/User_talk:Grumbles180-co looks interesting. What User-Agents were requesting that?

In [56]:
results = urls.filter(urls.norm_url == "en.wikipedia.org/wiki/User_talk:Grumbles180-co")
results = results.groupBy('user_agent').count().collect()
results

[Row(user_agent='Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:62.0) Gecko/20100101 Firefox/62.0', count=12)]

All the requests seem to have come from the same browser. What do the times look like?

In [61]:
from pyspark.sql.functions import col

results = urls.filter(urls.norm_url == "en.wikipedia.org/wiki/User_talk:Grumbles180-co")
results = results.select('date').sort(col('date').asc())
results.show(12)

+--------------------+
|                date|
+--------------------+
|2018-10-25T07:20:35Z|
|2018-10-25T07:34:52Z|
|2018-10-25T07:45:58Z|
|2018-10-25T07:58:02Z|
|2018-10-25T16:57:08Z|
|2018-10-25T17:18:24Z|
|2018-10-25T17:30:05Z|
|2018-10-25T17:50:00Z|
|2018-10-25T18:25:58Z|
|2018-10-25T18:36:39Z|
|2018-10-25T18:58:38Z|
|2018-10-25T20:30:57Z|
+--------------------+



It looks like someone may have been archiving it manually and then gone to sleep or to work? Remember we only have one day of data. This activity could have started earlier and gone on longer. We can actually look in the Wayback Machine to try to see.
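
To make the pause in activity concrete, the gaps between consecutive requests can be computed directly from the timestamps shown above. A small sketch using only the standard library (the variable names are illustrative):

```python
from datetime import datetime

# the twelve request times from the table above
stamps = [
    '2018-10-25T07:20:35Z', '2018-10-25T07:34:52Z', '2018-10-25T07:45:58Z',
    '2018-10-25T07:58:02Z', '2018-10-25T16:57:08Z', '2018-10-25T17:18:24Z',
    '2018-10-25T17:30:05Z', '2018-10-25T17:50:00Z', '2018-10-25T18:25:58Z',
    '2018-10-25T18:36:39Z', '2018-10-25T18:58:38Z', '2018-10-25T20:30:57Z',
]

parsed = [datetime.strptime(s, '%Y-%m-%dT%H:%M:%SZ') for s in stamps]

# difference between each request and the one before it
gaps = [later - earlier for earlier, later in zip(parsed, parsed[1:])]

longest = max(gaps)
start = parsed[gaps.index(longest)]
print('longest gap:', longest, 'starting at', start.time())
```

Most requests are 10 to 35 minutes apart, with one break of roughly nine hours after 07:58.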

[![image](images/calendar.png)](https://web.archive.org/web/2018*/https://en.wikipedia.org/wiki/User_talk:Grumbles180-co)

It does appear that the behavior continued for a few days and then stopped. These are the only days on which this URL has been archived at the Internet Archive.

## Browser Diversity

It's interesting to get back our commonly requested pages in an order that reflects the number of User-Agents that have requested them. The number of distinct User-Agents roughly reflects the different people or bots that are requesting a page. We can adjust the pages function that we created earlier.

In [135]:
import pandas

def pages2(pattern, n=5, bot=None):
   
    if bot == True:
        bot_clause = 'AND bot == "true"'
    elif bot == False:
        bot_clause = 'AND bot == "false"'
    else:
        bot_clause = ''
            
    sql = '''
          SELECT 
            norm_url, 
            COUNT(DISTINCT(user_agent)) AS user_agent_count,
            SUM(total) AS request_count
          FROM (          
            SELECT norm_url, user_agent, COUNT(*) AS total
            FROM urls 
            WHERE norm_url RLIKE "{}" {}
            GROUP BY norm_url, user_agent
            ORDER BY total DESC
          )
          GROUP BY norm_url
          ORDER BY user_agent_count DESC, request_count DESC
          LIMIT {}
          '''.format(pattern, bot_clause, n)
    
    page_list = []
    for r in sqlc.sql(sql).collect():
        page = get_page('https://' + r.norm_url)
        page['user_agent_count'] = r.user_agent_count
        page['request_count'] = r.request_count
        page_list.append(page)
    
    return pandas.DataFrame(page_list, columns=['url', 'title', 'text', 'user_agent_count', 'request_count'])
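
The ranking performed by the SQL above (distinct user-agents first, total requests as the tie-breaker) can be illustrated without Spark. This sketch uses made-up `(norm_url, user_agent)` rows and a hypothetical helper name, purely to show the ordering logic:

```python
from collections import defaultdict

def rank_by_diversity(rows):
    """Rank URLs by distinct user-agent count, then by request count."""
    agents = defaultdict(set)
    requests_per_url = defaultdict(int)
    for url, user_agent in rows:
        agents[url].add(user_agent)
        requests_per_url[url] += 1

    ranked = sorted(agents, key=lambda u: (-len(agents[u]), -requests_per_url[u]))
    return [(u, len(agents[u]), requests_per_url[u]) for u in ranked]

# toy data: not taken from the real dataset
rows = [
    ('twitter.com/login', 'ua-a'),
    ('twitter.com/login', 'ua-b'),
    ('twitter.com/login', 'ua-c'),
    ('twitter.com/search', 'ua-a'),
    ('twitter.com/search', 'ua-a'),
    ('twitter.com/search', 'ua-a'),
    ('twitter.com/search', 'ua-a'),
    ('twitter.com/search', 'ua-b'),
]

for url, ua_count, request_count in rank_by_diversity(rows):
    print(url, ua_count, request_count)
```

In the toy data the search page has more requests, but the login page ranks first because more distinct user-agents asked for it.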

In [136]:
twitter = pages2('^twitter.com/.+', 25)

Unnamed: 0,url,title,text,user_agent_count,request_count
0,https://twitter.com/login,Login on Twitter,Welcome home! This timeline is where you’ll s...,59,296
1,https://twitter.com/intent/tweet,Post a Tweet on Twitter,New to Twitter? Sign up Get instant updates f...,55,334
2,https://twitter.com/search,Twitter Search,Welcome home! This timeline is where you’ll s...,50,1115
3,https://twitter.com/search-home,Twitter Search,Welcome home! This timeline is where you’ll s...,27,35
4,https://twitter.com/account/suspended,Twitter / Account Suspended,Welcome home! This timeline is where you’ll s...,22,68
5,https://twitter.com/intent/like,,,18,22
6,https://twitter.com/undefined,Twitter / Account Suspended,Welcome home! This timeline is where you’ll s...,13,200
7,https://twitter.com/realDonaldTrump/status/105...,"Donald J. Trump on Twitter: ""A very big part o...",Welcome home! This timeline is where you’ll s...,12,33
8,https://twitter.com/DonaldJTrumpJr/status/1055...,"Donald Trump Jr. on Twitter: ""Jim, did you or ...",Welcome home! This timeline is where you’ll s...,11,24
9,https://twitter.com/realDonaldTrump/status/923...,"Donald J. Trump on Twitter: """"Clinton campaign...",Welcome home! This timeline is where you’ll s...,10,21


In [141]:
# None means "do not truncate"; older pandas versions used -1 for this
pandas.set_option('display.max_colwidth', None)
twitter[['url', 'user_agent_count', 'request_count']]

Unnamed: 0,url,user_agent_count,request_count
0,https://twitter.com/login,59,296
1,https://twitter.com/intent/tweet,55,334
2,https://twitter.com/search,50,1115
3,https://twitter.com/search-home,27,35
4,https://twitter.com/account/suspended,22,68
5,https://twitter.com/intent/like,18,22
6,https://twitter.com/undefined,13,200
7,https://twitter.com/realDonaldTrump/status/1055418269270716418,12,33
8,https://twitter.com/DonaldJTrumpJr/status/1055428345867976704,11,24
9,https://twitter.com/realDonaldTrump/status/923147501418446849,10,21


For convenience, here is a little function that takes a normalized URL and returns the user-agents that were used to archive it, along with the number of times each one appears in the data. Let's use it to zero in on a tweet that was archived by 12 different types of user-agent.

<a href="https://twitter.com/realDonaldTrump/status/1055418269270716418"><img style="width: 500px; float: left; border: thin solid #ccc" src="images/tweet.png"></a>

In [166]:
def user_agents(norm_url):
    results = urls.filter(urls.norm_url == norm_url)
    results = results.groupBy('user_agent').count().sort(col("count").desc())
    return results.toDF('user_agent', 'count').toPandas()

user_agents('twitter.com/realDonaldTrump/status/1055418269270716418')

Unnamed: 0,user_agent,count
0,Mozilla/5.0 (compatible; archive.org_bot; Wayback Machine Live Record; +http://archive.org/details/archive.org_bot),13
1,python-requests/2.18.1,5
2,Chrome 41.0.2228.0,4
3,Firefox 36.0,2
4,Chrome 41.0.2227.0,2
5,Safari 6.0,1
6,Firefox 33.0,1
7,"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36",1
8,Chrome 41.0.2227.1,1
9,Firefox 40.1,1


Here's another convenience function that will return a Pandas DataFrame for any requests for a given normalized URL. They are sorted by time.

In [162]:
def times(norm_url):
    results = urls.filter(urls.norm_url == norm_url)
    results = results.sort(col('date').asc())
    return results.toDF(*urls.columns).toPandas()

trump_tweet = times('twitter.com/realDonaldTrump/status/1055418269270716418')

In [165]:
trump_tweet[['date', 'user_agent']]

Unnamed: 0,date,user_agent
0,2018-10-25T11:18:39Z,python-requests/2.13.0
1,2018-10-25T11:37:07Z,Mozilla/5.0 (compatible; archive.org_bot; Wayback Machine Live Record; +http://archive.org/details/archive.org_bot)
2,2018-10-25T12:00:22Z,python-requests/2.18.1
3,2018-10-25T12:01:45Z,Mozilla/5.0 (compatible; archive.org_bot; Wayback Machine Live Record; +http://archive.org/details/archive.org_bot)
4,2018-10-25T12:11:19Z,python-requests/2.18.1
5,2018-10-25T12:21:36Z,python-requests/2.18.1
6,2018-10-25T12:31:56Z,python-requests/2.18.1
7,2018-10-25T12:43:07Z,python-requests/2.18.1
8,2018-10-25T13:19:18Z,Chrome 41.0.2227.0
9,2018-10-25T13:54:52Z,Firefox 40.1


This looks like it is almost entirely bots: 

* *python-requests*
* not supplying a user-agent at all, which gets the default *Wayback Machine Live Record*
* bots minimally pretending to be browsers, e.g. *Chrome 41.0.2228.0*
* one actual browser, *Safari*
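
The grouping in the list above can be written down as a rough heuristic. This is only a sketch: the category names and the `classify` function are made up for illustration, and the rules simply mirror the four kinds of User-Agent string seen in this one table.

```python
import re

# a bare "Name version" string such as "Chrome 41.0.2228.0"
BARE_BROWSER = re.compile(r'^(Chrome|Firefox|Safari) \d+(\.\d+)*$')

def classify(user_agent):
    """Assign a User-Agent string to one of a few rough categories."""
    if user_agent.startswith('python-requests/'):
        return 'http library'
    if 'archive.org_bot' in user_agent:
        return 'wayback default'
    if BARE_BROWSER.match(user_agent):
        return 'bare browser name'
    if user_agent.startswith('Mozilla/5.0'):
        return 'full browser string'
    return 'unknown'

samples = [
    'python-requests/2.18.1',
    'Mozilla/5.0 (compatible; archive.org_bot; Wayback Machine Live Record; +http://archive.org/details/archive.org_bot)',
    'Chrome 41.0.2228.0',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36',
]

for ua in samples:
    print(classify(ua), '<-', ua[:40])
```

The `archive.org_bot` check comes before the `Mozilla/5.0` check because the Wayback default string also starts with `Mozilla/5.0`.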

Another interesting thing to note is that the tweet was created at 2018-10-25 11:18:18 and it was first archived 21 seconds later at 2018-10-25 11:18:39 (UTC). It appears that someone has a bot that is waiting for tweets from Trump and is almost instantly archiving them. Somebody buy this person a 🍺.
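
One way to double-check the creation time is from the tweet ID itself. Tweet IDs of this era are Snowflake IDs, whose upper bits hold a millisecond timestamp counted from Twitter's custom epoch (1288834974657 ms after the Unix epoch). The sketch below is an independent check rather than necessarily how the time above was obtained; `snowflake_time` is an illustrative name.

```python
from datetime import datetime, timezone

TWITTER_EPOCH_MS = 1288834974657  # Snowflake epoch, in Unix milliseconds

def snowflake_time(tweet_id):
    """Creation time encoded in a tweet ID, truncated to whole seconds."""
    ms = (tweet_id >> 22) + TWITTER_EPOCH_MS
    return datetime.fromtimestamp(ms // 1000, tz=timezone.utc)

created = snowflake_time(1055418269270716418)

# first request for this tweet in the table above
first_archived = datetime(2018, 10, 25, 11, 18, 39, tzinfo=timezone.utc)

print('created: ', created.isoformat())
print('archived:', first_archived.isoformat())
print('delay:   ', first_archived - created)
```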

## Instagram

Let's use these functions we've developed to look at Instagram. We'll start by getting the top-25 most archived Instagram URLs.

In [167]:
instagram = pages2(r'^instagram.com/.+', n=25)
instagram

Unnamed: 0,url,title,text,user_agent_count,request_count
0,https://instagram.com/p/BaoQKVlhAKP/,🐶Fricko the Shiba Inu🐶 on Instagram: “Who wants this helmet ?😎 • • • • #Fricko #🐶 #☀️ #shiba #shibainu #dog #柴犬 #赤柴 #adorable #shibalovers #shibaholics #dogoftheday…”,,9,21
1,https://instagram.com/accounts/login/,Login • Instagram,,8,22
2,https://instagram.com/directory/hashtags/,Instagram,,5,7
3,https://instagram.com/directory/profiles/,Instagram,,5,6
4,https://instagram.com/p/BapbYDpBxee/,,,4,6
5,https://instagram.com/about/us/,About Us • Instagram,"About Us\nHead of Instagram\nAdam Mosseri (@mosseri) is the Head of Instagram where he oversees all functions of the business including engineering, product and operations. A designer at heart, Adam is known for balancing sharp design thinking with thoughtful product strategy to create experiences that bring people together and encourage authentic communication.\nAdam has been at Facebook for more than ten years. He was design director for Facebook's mobile apps and then moved into product management where he led the News Feed product and engineering teams for many years. He was Head of News Feed prior to joining Instagram where he oversaw product before managing the entire organization.\nPrior to Facebook, Adam worked at TokBox as the company’s first designer. He began his career founding a design consultancy in 2003 with offices in New York and San Francisco that focused on graphic, interaction and exhibition design. Adam holds a BA from the Gallatin School of Interdisciplinary Study at NYU where he studied Information Design and Media.\nBorn and raised in New York, he now lives in San Francisco with his wife and two sons.\nFounders\nKevin Systrom (Co-founder)\nKevin Systrom (@kevin) co-founded Instagram and served as CEO for 8 years before leaving the company in September 2018 to pursue his next passion project. With Kevin’s focus on simplicity and inspiring creativity through solving problems with thoughtful product design, Instagram became the home for innovation on visual storytelling launching dozens of products including Stories and IGTV.\nPrior to founding Instagram, Kevin was part of the startup Odeo, which later became Twitter, and spent two years at Google working on products like Gmail and Google Reader. 
He graduated from Stanford University with a BS in Management Science &amp; Engineering.\nMike Krieger (Co-founder)\nMike Krieger (@mikeyk) co-founded Instagram and served as Instagram's Head of Engineering for 8 years, before leaving the company in September 2018 to explore new projects. Mike focused on building a broad range of creative products to empower the community on Instagram to connect with their interests and passions. In addition, Mike grew the engineering organization to more than 400 employees in Instagram offices located in Menlo Park, New York and San Francisco.\nA native of São Paulo, Brazil, Mike holds an MS in Symbolic Systems from Stanford University. Prior to founding Instagram, he worked at Meebo as a user experience designer and front-end engineer.",3,5
6,https://instagram.com/explore/locations/,Instagram,,3,5
7,https://instagram.com/universitystudios/,University Studios (@universitystudios) • Instagram photos and videos,,3,3
8,https://instagram.com/p/6pzghQttTe/,cindy bledsoe on Instagram: “New members starting in September .”,,2,289
9,https://instagram.com/p/6GMmOfNtYy/,cindy bledsoe on Instagram: “Meet and greet 2015! Where we meet new members and perspective members.”,,2,135


So what's going on with the one that was archived by the most user-agents?

<a href="https://www.instagram.com/p/BaoQKVlhAKP/"><img src="images/instagram1.png" style="width: 800px; float: left; border: thin solid #ccc"></a>

In [168]:
times('instagram.com/p/BaoQKVlhAKP/')[['date', 'user_agent']]

Unnamed: 0,date,user_agent
0,2017-10-24T23:57:05Z,Chrome 41.0.2228.0
1,2017-10-25T00:30:49Z,Safari 6.0
2,2017-10-25T01:04:43Z,Safari 5.1.7
3,2017-10-25T01:38:10Z,Chrome 41.0.2227.0
4,2017-10-25T02:11:32Z,Chrome 41.0.2227.1
5,2017-10-25T02:45:49Z,Firefox 40.1
6,2017-10-25T03:18:57Z,Chrome 41.0.2228.0
7,2017-10-25T03:52:10Z,Safari 7.0.3
8,2017-10-25T04:25:23Z,Firefox 36.0
9,2017-10-25T04:58:34Z,Safari 7.0.3


Those aren't legit User-Agents btw, they seem to be bots pretending to be browsers? What the heck is going on here? So weird...

## Facebook

Is there anything interesting about Facebook? Many Facebook URLs, such as those for the like button and similar widgets, show up only because they are embedded in archived content. We're going to filter those out with a negative lookahead in the regular expression.
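
A negative lookahead `(?!...)` succeeds only when the enclosed pattern does not match at that position. The behaviour can be previewed with Python's `re` module; this is a sketch with made-up URLs, and it assumes the regex engine behind Spark's `RLIKE` treats this particular pattern the same way.

```python
import re

# after "facebook.com/", reject anything whose remainder ends in ".php"
pattern = re.compile(r'^facebook.com/(?!.+\.php$).+')

# illustrative URLs, not taken from the dataset
candidates = [
    'facebook.com/plugins/like.php',
    'facebook.com/login/',
    'facebook.com/tr.php/extra',
]

for url in candidates:
    print(url, bool(pattern.match(url)))
```

Only URLs that end in `.php` are excluded; a URL with `.php` in the middle of its path still matches.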

In [180]:
fb = pages2(r'^facebook.com/(?!.+\.php$).+', n=25)
fb

Unnamed: 0,url,title,text,user_agent_count,request_count
0,https://facebook.com/pages/call_to_action/fetch_dialog_data/,,,75,187
1,https://facebook.com/ajax/bz,,,73,345
2,https://facebook.com/plugins/likebox/stream,[no-title],,72,141
3,https://facebook.com//audiencenetwork/xhr/,[no-title],"{""success"":false,""errorCode"":1003,""errorMsg"":""Invalid Request""}",40,47
4,https://facebook.com//audience_network/client_event,,,30,34
5,https://facebook.com/Porcelain-Tub-Restorations-223584184478108/reviews/,Facebook,This content isn't available right nowThis content isn't available right nowThis Facebook Page is only visible to people who meet a minimum age. Please log in to see if it's visible to you.,23,54
6,https://facebook.com/sem_campaigns/sem_pixel_test/,[no-title],,21,120
7,https://facebook.com/plugins/follow,[no-title],,19,24
8,https://facebook.com/dialog/oauth,Facebook,The provided app ID does not look like a valid app ID.,18,123
9,https://facebook.com/login/,Log into Facebook | Facebook,Press alt + / to open this menu,18,78


Ignoring the remaining obvious admin URLs, we can see ones like https://facebook.com/Porcelain-Tub-Restorations-223584184478108/reviews/ that was archived by 23 different types of user-agent. Unfortunately the content [does not seem to be viewable](https://web.archive.org/web/20160901000000*/https://www.facebook.com/Porcelain-Tub-Restorations-223584184478108/reviews) in the Wayback Machine and it is no longer on the live web.

https://facebook.com/AFPnewsenglish/photos/a.163022200402458.25949.155857464452265/744829832221689/ is still on the web but it [doesn't appear to be in](https://web.archive.org/web/*/https://www.facebook.com/AFPnewsenglish/photos/a.163022200402458.25949.155857464452265/744829832221689/?type=1&amp;relevant_count=1) the Wayback Machine. This might be because part of the URL was stripped during normalization. Let's get the original URL.

In [189]:
url = 'facebook.com/AFPnewsenglish/photos/a.163022200402458.25949.155857464452265/744829832221689/'
sqlc.sql('SELECT DISTINCT(url) FROM urls WHERE norm_url = "{}"'.format(url)).collect()

[Row(url='https://www.facebook.com/AFPnewsenglish/photos/a.163022200402458.25949.155857464452265/744829832221689/?type=1&amp;relevant_count=1')]
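
To see exactly what normalization dropped, the original URL can be pulled apart with the standard library. This is a sketch; reading the `&amp;` as an HTML-escaped ampersand is an assumption about how the URL ended up in the data.

```python
from html import unescape
from urllib.parse import urlparse, parse_qs

original = ('https://www.facebook.com/AFPnewsenglish/photos/'
            'a.163022200402458.25949.155857464452265/744829832221689/'
            '?type=1&amp;relevant_count=1')

u = urlparse(original)

# the part that norm_url discards for non-YouTube hosts
print('query as stored:', u.query)

# decode the HTML entity before splitting into parameters
params = parse_qs(unescape(u.query))
print('parameters:', params)
```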

This helps us find it, but alas it doesn't render in the Wayback Machine. Here is the live view on the web compared to what renders in Wayback.

<a href="images/fb.png"><img src="images/fb.png" style="float: left; width: 400px;"></a><a href="images/fb-wayback.png"><img src="images/fb-wayback.png" style="width: 400px;"></a>