# Princeton Web Census Interface
## About
This Jupyter notebook demonstrates how to interact with data from the Princeton Web Census.

The code relies on our utilities located in the censuslib/ directory, which you can use for your own experiments.

Each 'cell' in this notebook demonstrates a different capability of our census interface. The cells can be executed independently of one another, but you must first run the setup cell described in the next section.


## Getting started with the Web Census data:

Run the cell below to create a `Census` object, which encapsulates a connection to our PostgreSQL crawl database. This object provides the interface for interacting with web census data. The cells in the rest of this notebook show some examples of what you can do with it.

(*Note*: the keyboard shortcut to run a cell is shift+enter!)

Our available census crawls are:
* "census_2016_08_25k_id_detection_1" : A crawl of the top 25k sites, with browser state (cookies, localStorage, etc.) maintained between site visits



In [1]:
import sys
import os
sys.path.append(os.path.realpath('censuslib'))
from censuslib import census
from censuslib import utils
from collections import defaultdict
import psycopg2
import re
import dill

# Note: If you'd like to access one of our other databases, replace census_name
# with one of our other available crawls listed above
census_name = 'census_2016_08_25k_id_detection_1'

# the 'small_crawl' Census object provides the interface for interacting with
# census data
small_crawl = census.Census(census_name)

# Example API

## Check to see that a given top_url is present in the dataset
All data in the census is keyed by each 'top_url' visited. Each top_url follows the format:  

`http://example.com`  

There is never a leading '`www.`', nor is the scheme ever '`https://`'. If a site redirects to https, that will be reflected in the crawl's data.

In [None]:
top_url = 'http://netflix.com'
print(small_crawl.check_top_url(top_url))

top_url = 'http://notincensus.com'
print(small_crawl.check_top_url(top_url))
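
If you are starting from URLs in some other form, the sketch below shows one way to reduce them to the `http://example.com` format described above before calling `check_top_url`. The helper `to_top_url` is hypothetical and not part of censuslib. It only strips a leading `www.` and forces the `http://` scheme, so treat it as a starting point rather than the exact normalization used by the crawl.

```python
from urllib.parse import urlsplit


def to_top_url(url):
    """Hypothetical helper: reduce a URL to the 'http://example.com' form."""
    # Add a scheme if one is missing so that urlsplit can find the hostname
    if '://' not in url:
        url = 'http://' + url
    host = urlsplit(url).hostname or ''
    # Drop a single leading 'www.' label; other subdomains are kept
    if host.startswith('www.'):
        host = host[len('www.'):]
    # The census always uses the http scheme and no path
    return 'http://' + host


print(to_top_url('https://www.netflix.com/browse'))
print(to_top_url('espn.go.com'))
```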


## Get all third party trackers on a site
`census.get_all_third_party_trackers_by_site(top_url)` returns a list of third party resources loaded on the site's landing page (top_url) that were identified as trackers.

To determine whether a URL is a tracker, we check it against two blocklists, EasyList and EasyPrivacy, both provided by the [EasyList](https://easylist.to/) community. EasyList identifies resources used in advertising and is a popular list among ad blockers. EasyPrivacy identifies additional resources used in tracking.

In [None]:
top_url = 'http://espn.go.com'

results = small_crawl.get_all_third_party_trackers_by_site(top_url)

results
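
A list of tracker URLs is often easier to read once it is grouped by host. The sketch below assumes that `results` is a plain list of URL strings. The `sample_results` list is made-up stand-in data, and `count_by_host` is a hypothetical helper that is not part of censuslib.

```python
from collections import Counter
from urllib.parse import urlsplit

# Made-up stand-in for the list returned by the cell above
sample_results = [
    'http://cdn.tracker-one.example/a.js',
    'http://cdn.tracker-one.example/b.gif',
    'http://pixel.tracker-two.example/p.gif',
]


def count_by_host(urls):
    """Count how many of the given URLs were loaded from each hostname."""
    return Counter(urlsplit(u).hostname for u in urls)


host_counts = count_by_host(sample_results)
for host, n in host_counts.most_common():
    print(host, n)
```

To try it on real data, pass `results` instead of `sample_results`.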

## Get all third party responses on a site
For a more comprehensive view of the third party resources on a website, run this cell.

`census.get_all_third_party_responses_by_site(top_url)` returns a two-level results dict containing third party urls loaded on the given site's landing page (top_url).

The dict's structure is:

`dict[third_party_url]['is_tracker']`, contains True if third_party_url is identified on a blocklist.  
`dict[third_party_url]['is_js']`, contains True if third_party_url is a script.  
`dict[third_party_url]['is_img']`, contains True if third_party_url is an image.  
`dict[third_party_url]['url_domain']`, contains the domain of the third party.  



In [None]:
results = small_crawl.get_all_third_party_responses_by_site('http://espn.go.com')


print('Example of one of the third parties:')
third_party = results.popitem()

print('Third party URI loaded on page: ' + third_party[0])
print('Third party domain: ' + third_party[1]['url_domain'])
print('Is it a tracker?: ' + str(third_party[1]['is_tracker']))
print('Is it an image?: ' + str(third_party[1]['is_img']))
print('Is it a script?: ' + str(third_party[1]['is_js']))
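
Because every entry carries the same four keys, the results dict is easy to filter. The sketch below collects the domains that served tracking scripts. The `sample_responses` dict is made-up data that follows the structure described above, and `tracker_script_domains` is a hypothetical helper, not a censuslib function.

```python
# Made-up data following the dict structure described above
sample_responses = {
    'http://ads.example/ad.js': {
        'is_tracker': True, 'is_js': True, 'is_img': False,
        'url_domain': 'ads.example'},
    'http://ads.example/pixel.gif': {
        'is_tracker': True, 'is_js': False, 'is_img': True,
        'url_domain': 'ads.example'},
    'http://cdn.example/lib.js': {
        'is_tracker': False, 'is_js': True, 'is_img': False,
        'url_domain': 'cdn.example'},
}


def tracker_script_domains(responses):
    """Return the sorted, de-duplicated domains that served tracking scripts."""
    return sorted({
        info['url_domain']
        for info in responses.values()
        if info['is_tracker'] and info['is_js']
    })


print(tracker_script_domains(sample_responses))
```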

## Get all third party responses for a list of sites
Run the cell below to fetch third party data for multiple sites. The results are written to CSV files in your home directory. Visit the Jupyter Notebook file browser at https://webcensus.openwpm.com to view and download the files.

The resulting CSVs are:

* `tracker_js_by_site.csv` : A CSV of rows of [site, tp_domain] for third-party domains with tracking scripts on that site.

* `non_tracker_js_by_site.csv` : A CSV of rows of [site, tp_domain] for third-party domains with non-tracking scripts on that site.

* `tracker_img_by_site.csv` : A CSV of rows of [site, tp_domain] for third-party domains with tracking images (pixels, beacons, ads, etc.) on that site.

* `non_tracker_img_by_site.csv` : A CSV of rows of [site, tp_domain] for third-party domains with non-tracking images on that site.

* `tracker_other_by_site.csv` : A CSV of rows of [site, tp_domain] for domains of third-party resources that could not be identified as scripts or images, but were still identified as trackers.

* `non_tracker_other_by_site.csv` : A CSV of rows of [site, tp_domain] for domains of third-party resources that could not be identified as scripts, images, or trackers.

In [2]:
# List sites to fetch data for :
sites = ['http://cnn.com', 'http://wsj.com']

# This function will write the results to multiple CSVs
small_crawl.get_third_party_resources_for_multiple_sites(sites)

Query results written to filesystem. Check file browser at https://webcensus.openwpm.com to see results.
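
Once you have downloaded one of these files, you can load it back into Python with the standard `csv` module. The sketch below assumes each row is exactly `site,tp_domain` with no header row, which you should confirm against your own file. It reads from an in-memory string of made-up rows so that it runs without the real CSVs, and `domains_by_site` is a hypothetical helper.

```python
import csv
import io
from collections import defaultdict

# Made-up rows in the [site, tp_domain] layout described above
sample_csv = (
    'http://cnn.com,tracker-one.example\n'
    'http://cnn.com,tracker-two.example\n'
    'http://wsj.com,tracker-one.example\n'
)


def domains_by_site(fileobj):
    """Map each site to the set of third-party domains listed for it."""
    mapping = defaultdict(set)
    for row in csv.reader(fileobj):
        # Skip blank or malformed rows rather than failing
        if len(row) != 2:
            continue
        site, tp_domain = row
        mapping[site].add(tp_domain)
    return mapping


by_site = domains_by_site(io.StringIO(sample_csv))
for site in sorted(by_site):
    print(site, sorted(by_site[site]))
```

For a real file, replace `io.StringIO(sample_csv)` with an open file handle, for example `open('tracker_js_by_site.csv')`.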


## Get top_urls that load a resource from a given third party domain

`census.get_sites_with_third_party_domain(tp_domain)` returns a dictionary that maps each top_url in the census to the list of URLs belonging to tp_domain that were loaded on that top_url.

In [None]:
tp_domain = 'addthis.com'

tps_by_top = small_crawl.get_sites_with_third_party_domain(tp_domain)

print("Number of top_urls with given third party : " + str(len(tps_by_top)))
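
The returned mapping can also be used to see which sites load the most resources from the third party. The sketch below assumes the structure described above, a dict from top_url to a list of URLs. The sample data is made up, and `rank_by_resource_count` is a hypothetical helper.

```python
# Made-up data: top_url -> list of URLs loaded from the third party
sample_tps_by_top = {
    'http://site-a.example': [
        'http://cdn.thirdparty.example/widget.js',
        'http://cdn.thirdparty.example/pixel.gif',
    ],
    'http://site-b.example': [
        'http://cdn.thirdparty.example/widget.js',
    ],
}


def rank_by_resource_count(tps_by_top):
    """Return (top_url, count) pairs, largest count first."""
    counts = [(top_url, len(urls)) for top_url, urls in tps_by_top.items()]
    return sorted(counts, key=lambda pair: pair[1], reverse=True)


for top_url, count in rank_by_resource_count(sample_tps_by_top):
    print(top_url, count)
```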

In [None]:
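# A named cursor is a server-side cursor in psycopg2: rows are streamed in
# batches of `itersize` rather than loaded into memory all at once
c = small_crawl.connection.cursor("urls-rewrite")
c.itersize = 100000

c.execute('SELECT * FROM urls')
print('executed!')
counter = 0
for row in c:
    counter += 1
    # Report progress every 100 rows
    if counter % 100 == 0:
        print(counter)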


## Get "cookie sync" events on a given top_url
Note: This does not include logic for isolating "identifying cookies." Any cookie whose value is long enough (see the `cookie_length` parameter) and is shared with other domains will be reported.

In [None]:
# For a single top_url...

results = small_crawl.get_cookie_syncs_v2('http://microsoft.com', cookie_length=8)

for receiving_url in results:
    print("R: " + receiving_url)
    for sending_url, val in results[receiving_url]:
        print("\tS: " + sending_url)
        print("\tV: " + val)
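
The loop above implies that the result maps each receiving URL to an iterable of (sender, cookie value) pairs. Under that assumption, the sketch below groups the shared cookie values by receiving host. The sample data is made up, and `values_by_receiving_host` is a hypothetical helper, not a censuslib function.

```python
from collections import defaultdict
from urllib.parse import urlsplit

# Made-up data: receiving_url -> list of (sender, cookie_value) pairs
sample_syncs = {
    'http://sync.receiver.example/match?uid=abcdef123456': [
        ('sender.example', 'abcdef123456'),
    ],
    'http://sync.receiver.example/pixel?id=zyxwvu987654': [
        ('other-sender.example', 'zyxwvu987654'),
    ],
}


def values_by_receiving_host(syncs):
    """Map each receiving hostname to the set of cookie values it was sent."""
    received = defaultdict(set)
    for receiving_url, pairs in syncs.items():
        host = urlsplit(receiving_url).hostname
        for _sender, value in pairs:
            received[host].add(value)
    return received


for host, values in values_by_receiving_host(sample_syncs).items():
    print(host, sorted(values))
```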

In [None]:
# For a list of top_urls...
# Warning: this method is slow.

sites = ['http://microsoft.com', 'http://cnn.com']  # Change me!

cookie_sync_data = defaultdict(defaultdict)
for site in sites:
    print(site)
    cookie_sync_data[site] = small_crawl.get_cookie_syncs_v2(site, cookie_length=8)


In [None]:
import csv

# Write output as .dill (dill writes bytes, so open the file in binary mode)
with open('cookie_syncs.dill', 'wb') as f:
    dill.dump(cookie_sync_data, f)

# Write complete output as CSV
with open('cookie_syncs.csv', 'w') as f:
    writer = csv.writer(f)
    writer.writerow(['top_url', 'sending_domain', 'receiving_url', 'cookie_value'])
    for site in cookie_sync_data:
        for receiving_url in cookie_sync_data[site]:
            for sending_url, cookie_value in cookie_sync_data[site][receiving_url]:
                writer.writerow([site, sending_url, receiving_url, cookie_value])

# Write partial output as CSV, only identifying sending domain and receiving domain
# (rather than the full receiving URL)

cooks_just_domains = defaultdict(defaultdict)
for site in cookie_sync_data:
    cooks_just_domains[site] = defaultdict(set)
    for receiving_url in cookie_sync_data[site]:
        for sending_domain, value in cookie_sync_data[site][receiving_url]:
            cooks_just_domains[site][utils.get_host_plus_ps(receiving_url)].add(sending_domain)
with open('../wsj/cookie_syncs_v2_just_domains.csv', 'w') as f:
    writer = csv.writer(f)
    writer.writerow(['top_url', 'sending_domain', 'receiving_domain'])
    for site in cooks_just_domains:
        for receiving_domain in cooks_just_domains[site]:
            # Drop the 'NOT_FOUND' placeholder when at least one other sending domain is known
            if len(cooks_just_domains[site][receiving_domain]) > 1 and 'NOT_FOUND' in cooks_just_domains[site][receiving_domain]:
                cooks_just_domains[site][receiving_domain].discard('NOT_FOUND')
            for sending_domain in cooks_just_domains[site][receiving_domain]:
                writer.writerow([site, sending_domain, receiving_domain])

## Check a given url against a blocklist
Available blocklists:
* easylist.txt
* easyprivacy.txt

In [None]:
print(utils.is_tracker('http://tags.bkrtx.com/js/bk-coretag.js', is_js=True, is_img=False, 
                       first_party='http://verizonwireless.com', blocklist='easylist.txt'))
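
To give a rough idea of what checking a URL against a filter list involves, the sketch below handles only the simplest kind of rule, `||host^`, which matches a host and all of its subdomains. It is not how `utils.is_tracker` is implemented. Real EasyList and EasyPrivacy files also contain path rules, wildcards, options such as `$third-party`, and exception rules, all of which this sketch ignores. The function names and the sample rules are hypothetical.

```python
from urllib.parse import urlsplit


def load_domain_rules(filter_text):
    """Collect hosts from rules shaped exactly like '||host^'."""
    hosts = set()
    for line in filter_text.splitlines():
        line = line.strip()
        # Skip blanks, comments ('!'), the list header ('[') and exceptions ('@@')
        if not line or line.startswith(('!', '[', '@@')):
            continue
        if line.startswith('||') and line.endswith('^'):
            host = line[2:-1]
            # Ignore rules that carry paths, wildcards, or options
            if host and not any(ch in host for ch in '/*$^|'):
                hosts.add(host.lower())
    return hosts


def matches_domain_rule(url, hosts):
    """True if the URL's hostname, or any parent domain of it, is in hosts."""
    hostname = (urlsplit(url).hostname or '').lower()
    labels = hostname.split('.')
    return any('.'.join(labels[i:]) in hosts for i in range(len(labels)))


# Made-up rules; only the second line is a plain '||host^' rule
sample_rules = load_domain_rules(
    '! a comment\n'
    '||tracker.example^\n'
    '||ads.example^$third-party\n'
    '/banner/*\n'
)
print(matches_domain_rule('http://cdn.tracker.example/t.js', sample_rules))
print(matches_domain_rule('http://nottracker.example/', sample_rules))
```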


## Get third party scripts on a given top_url that call a particular JavaScript symbol

In [None]:
print(small_crawl.get_urls_with('http://cnn.com', 'CanvasRenderingContext2D.fillText'))

## Get the domain of a given url

In [None]:
print(utils.get_domain('http://subdomain.example.com/this/will/be/deleted.jpg'))

In [None]:
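# Close the server-side cursor opened in the earlier cell, then roll back
# to end the transaction it was running in
c.close()
small_crawl.connection.rollback()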