# Princeton Web Census Interface
## *About:*
This Jupyter notebook provides an interface for interacting with data from the Princeton Web Census.

For questions or comments, feel free to email dill.reisman@gmail.com or open an issue on [our GitHub repo](https://github.com/dreisman/WebCensusNotebook).

## *How to use:*
To execute a cell, select it by clicking on it, then either press the 'play' button in the top toolbar or use the keyboard shortcut Shift+Enter.

## *Warning:*
This interface is optimized for fast exploration of individual first parties and third parties. Actions that pull data from many first parties or third parties can be slow. For instance, we do not recommend fetching all third-party resources for more than 10,000 first parties at once unless you are prepared to wait. Sample where you can, and estimate how many first parties and third parties a command will touch before executing it in full.
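
One way to follow this advice is to draw a fixed-size random sample of site names and check its size before touching any per-site data. The sketch below uses only the standard library and a made-up list of domains; `sample_sites` and `all_sites` are illustrative names and are not part of the census library.

```python
import random

def sample_sites(sites, k, seed=0):
    """Return up to k sites chosen uniformly at random, reproducibly."""
    sites = list(sites)
    rng = random.Random(seed)  # a local RNG keeps the sample repeatable
    return rng.sample(sites, min(k, len(sites)))

# Stand-in data: hypothetical site names, not real census entries.
all_sites = ["site%d.example" % i for i in range(50000)]

subset = sample_sites(all_sites, 1000)
print("Sampled %d of %d sites" % (len(subset), len(all_sites)))
```

Fixing the seed means a rerun of the same cell works on the same subset, which makes timings and results easier to compare.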


## *Getting started:*

Run the cell below to create a `Census` object, which provides the interface for accessing the web census data for a particular web census crawl. Everything you need is encapsulated by this object.

For instance, to get information about a particular third party, use `cen.third_parties['thirdparty.com']`, where `cen` is the `Census` object created in the cell below.

For information about a first party, use `cen.first_parties['firstparty.com']`.

For information about a particular known Organization, use `cen.organizations['Org Name']`.

All objects you access provide many properties you can explore, from the Alexa rank of a first party to the third party resources that a first party embeds on the site, and which of those third party resources are trackers. The best way to learn about our data is to explore the interface.

## *Available census crawls:*

* "census_2016_10_1m_stateless": A crawl of the top 1M sites from October 2016. Browser state (cookies, localStorage, etc.) was cleared between site visits.



In [None]:
### Execute this cell first! Select it and hit shift+enter. It'll take approx. 10 seconds to initialize.
import matplotlib.pyplot as plt
%matplotlib inline
import sys
import os
sys.path.append(os.path.realpath('censuslib'))
from censuslib import census

# Note: If you'd like to access one of our other databases, replace census_name
# with one of our other available crawls listed above
census_name = 'census_2016_10_1m_stateless'

# the 'cen' Census object provides the interface for interacting with
# census data
cen = census.Census(census_name)

## Basic use of the interface
You can append `?` to the name of an object to get a description of it. Each object has many properties that can aid you in analyzing the data. Try running the cells below to learn more about the different objects you can access and their properties:
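
The `?` suffix is an IPython feature. In plain Python, the built-in `dir()` gives a rough equivalent for discovering attribute names. The sketch below shows the idea on a hypothetical stand-in class; `FirstPartyStandIn` and `public_attributes` are made up for illustration and are not part of the census library.

```python
class FirstPartyStandIn(object):
    """Hypothetical stand-in for a census object; not the real class."""

    def __init__(self, domain, alexa_rank):
        self.domain = domain
        self.alexa_rank = alexa_rank
        self._cache = {}  # private by convention, so it is filtered out below

def public_attributes(obj):
    """List the attribute names on obj that do not start with an underscore."""
    return [name for name in dir(obj) if not name.startswith("_")]

example = FirstPartyStandIn("example.com", 42)
print(public_attributes(example))
```

The same helper should work on any Python object, so it could be pointed at `cen` or at the objects it returns once the first cell has run.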

In [None]:
# The top level census object
cen?

In [None]:
# A container of all FirstParties visited in the census
cen.first_parties?

In [None]:
# A container of all ThirdParties visited in the census
cen.third_parties?

In [None]:
# A particular FirstParty visited in the census 
nytimes = cen.first_parties['nytimes.com']
nytimes?

In [None]:
# A container of ThirdParties found on a particular FirstParty
nytimes_tps = cen.first_parties['nytimes.com'].third_parties
nytimes_tps?

In [None]:
# A list of instances (URIs) of third parties found on a particular FirstParty
nytimes_tp_uris = cen.first_parties['nytimes.com'].third_party_resources
single_resource = nytimes_tp_uris[0]
single_resource?

In [None]:
# A particular ThirdParty observed in the census
optimizely = cen.third_parties['optimizely.com']
optimizely?

In [None]:
# The Organization that owns a particular ThirdParty
optimizely_org = cen.third_parties['optimizely.com'].organization
optimizely_org?

### Example: The average number of third party domains by Alexa category

In [None]:
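for category in cen.first_parties.alexa_categories:
    # Use the first 100 sites listed in the category (fewer if it has less than 100).
    sites = list(cen.first_parties.alexa_categories[category][:100])
    if not sites:
        continue
    # Count distinct third-party domains per site, then average over the sites used.
    avg = sum(len(set(tp.domain for tp in fp.third_party_resources))
              for fp in sites) / float(len(sites))

    print("Average number of third party domains on " + category + " sites: ")
    print("\t" + str(avg))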


### Example: The average number of trackers on the top 100 first parties
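
The cell in this section counts every third-party resource flagged as a tracker, so a tracker domain that loads several resources on one site is counted several times. The sketch below contrasts that count with a count of distinct tracker domains, using small made-up records. The `Resource` namedtuple only mimics the `domain` and `is_tracker` attributes used in this notebook; it is not the library's class, and the site names are hypothetical.

```python
from collections import namedtuple

# Minimal stand-in exposing only the two attributes used here.
Resource = namedtuple("Resource", ["domain", "is_tracker"])

# Made-up third-party resources for two hypothetical sites.
pages = {
    "site-a.example": [
        Resource("tracker-one.example", True),
        Resource("tracker-one.example", True),
        Resource("cdn.example", False),
    ],
    "site-b.example": [
        Resource("tracker-one.example", True),
        Resource("tracker-two.example", True),
    ],
}

def avg_tracker_resources(pages):
    """Average number of tracker resources per site (duplicates counted)."""
    total = sum(1 for resources in pages.values()
                for r in resources if r.is_tracker)
    return total / float(len(pages))

def avg_tracker_domains(pages):
    """Average number of distinct tracker domains per site."""
    total = sum(len({r.domain for r in resources if r.is_tracker})
                for resources in pages.values())
    return total / float(len(pages))

print(avg_tracker_resources(pages))  # 2.0
print(avg_tracker_domains(pages))    # 1.5
```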

In [None]:
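top_n = 100
# Count every third-party resource flagged as a tracker across the top_n sites,
# then divide by the number of sites. float() avoids integer division on Python 2.
avg = sum(1 for fp in cen.first_parties[:top_n]
          for tp in fp.third_party_resources if tp.is_tracker) / float(top_n)
avg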

### Example: All FirstParties that embed resources belonging to a particular organization

In [None]:
org_name = 'AppNexus'
first_parties_with_org = set()

# For each domain that the Organization owns, aggregate the FirstParties that embed a resource from it.
for domain in cen.organizations[org_name].domains:
    try:
        first_parties_with_org.update(cen.third_parties[domain].first_parties)
    # Not every domain was necessarily observed in the crawl, so skip any that raise CensusException.
    except census.CensusException:
        continue
        
print("Number of domains controlled by " + org_name + ": " + str(len(cen.organizations[org_name].domains)))
print("Number of sites with organization: " + str(len(first_parties_with_org)))
first_parties_with_org

# Play with the interface below!