# FLoC cohorts from randomly assigning domains to panel samples

Here we create a counterfactual to compare against the true panel data.

The same machine_ids are used as in the true panels, but only each sample's demographics are kept.

Sets of visited domains are randomly assigned to panel samples, with each domain sampled in proportion to how often it occurs in the real panel data.

By construction, domain visits, and therefore cohorts, should not be correlated with demographics.


Here's how we do this:

- First take the real panel data, where panel samples are joined with their true sessions data and then limited to samples with >= 7 domains.

- Take the distribution of n_domains (the number of domains per sample).

- Take the distribution of domains, using each domain's frequency of occurrence in the real samples, i.e. when domains are randomly sampled, their sampling weight is proportional to the number of samples in which they occur in the real panel data.

Then create fake panel data where
- each (machine, week) sample from the real panel is assigned a set of domains by randomly sampling a set size from the distribution of n_domains and then that many domains from the distribution of domains.
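
The steps above can be sketched on toy data. This is a minimal illustration, not the notebook's actual pipeline: the domain names, `real_domain_sets`, and `sample_domain_set` are hypothetical stand-ins, and it uses NumPy's `Generator` API rather than the global `np.random` functions.

```python
import numpy as np

# Toy stand-ins for the real panel samples (hypothetical values).
real_domain_sets = [
    {"a.com", "b.com", "c.com"},
    {"a.com", "b.com"},
    {"a.com", "d.com", "e.com"},
]

# Empirical distribution of set sizes (n_domains): one entry per real sample.
n_domains_values = np.array([len(s) for s in real_domain_sets])

# Domain weights: proportional to the number of samples containing each domain.
counts = {}
for domains in real_domain_sets:
    for d in domains:
        counts[d] = counts.get(d, 0) + 1
domains_list = sorted(counts)
total = sum(counts.values())
domains_p = np.array([counts[d] / total for d in domains_list])

rng = np.random.default_rng(0)

def sample_domain_set(rng):
    """Draw a set size, then draw that many distinct domains by weight."""
    n = rng.choice(n_domains_values)
    return set(rng.choice(domains_list, size=n, replace=False, p=domains_p))

# One fake domain set per real sample.
fake_domain_sets = [sample_domain_set(rng) for _ in real_domain_sets]
print(fake_domain_sets)
```

Note that with `replace=False`, a domain's chance of ending up in a set is only approximately proportional to its weight, since each draw removes a domain from the pool.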

In [5]:
from datetime import datetime
import sys
sys.path.append('..')

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import random

import floc

from comscore.data import read_weeks_machines_domains
import prefixLSH


Read in the sessions data, join it with the panel data, and build the distributions from which the fake panel data is randomly sampled:
- n_domains distribution
- domains distribution

In [None]:
# read in the pre-processed sessions data 
# this maps week,machine_id -> domains set
weeks_machines_domains_fpath = '../output/weeks_machines_domains.csv'
weeks_machines_domains_df = read_weeks_machines_domains(weeks_machines_domains_fpath)
weeks_machines_domains_df.drop(['machine_id', 'domains'], axis=1).head()

reading from ../output/weeks_machines_domains.csv...


In [None]:
all_panels_fpath = '../output/all_panels.csv'
all_panels_df = pd.read_csv(all_panels_fpath)
print('read in all panels: %s total rows' % len(all_panels_df))
print('%s panels' % all_panels_df.panel_id.nunique())
all_panels_df.drop(['machine_id'], axis=1).head()

In [None]:
weeks_machines_domains = weeks_machines_domains_df.set_index(['machine_id','week'])['domains']
all_panels_df['domains'] = all_panels_df.set_index(['machine_id','week']).index.map(weeks_machines_domains)
all_panels_df.drop(['machine_id','domains'], axis=1).head()

In [None]:
n_domains_distribution = all_panels_df.n_domains

In [None]:
# make map for domains distribution
# map: {domain: frequency}

# make a flat list of all domains, where each domain appears as many times
# as it appears across the panel samples' weekly domains sets
domains_sets = all_panels_df.domains.to_list()
good_domains = [d for domains in domains_sets for d in domains]
# check this matches the n_domains distribution data
assert(len(good_domains) == n_domains_distribution.sum())
# transform that list into the map: {domain: frequency}
domains_distribution_map = {d: 0 for d in set(good_domains)}
print('%s unique domains' % len(domains_distribution_map))
for d in good_domains:
    domains_distribution_map[d] += 1
# and then transform this into 2 parallel lists:
# domains_list has an item for each domain
# and domains_p holds the corresponding probabilities (weights),
# where the indices match those of domains_list
domains_list = list(domains_distribution_map.keys())
domains_p = [v/len(good_domains) for v in domains_distribution_map.values()]
assert(round((sum(domains_p)), 4) == 1)

In [None]:
# seed NumPy's RNG: the sampling below uses np.random, not the random module
np.random.seed(0)

def get_random_domains_list(x):
    n_domains = np.random.choice(n_domains_distribution, size=1)[0]
    return set(np.random.choice(
        domains_list,
        size=n_domains,
        replace=False, 
        p=domains_p
    ))

Create an alternative version of the panel by copying the true panel and reassigning domains.

In [None]:
all_panels_random_domains_df = all_panels_df.copy().drop(
    ['n_domains', 'domains'], axis=1
)

In [None]:
all_panels_random_domains_df.head()

In [110]:
all_panels_random_domains_df['domains'] = all_panels_random_domains_df.apply(get_random_domains_list, axis=1)
all_panels_random_domains_df['n_domains'] = all_panels_random_domains_df.domains.apply(len)
all_panels_random_domains_df.head()

KeyboardInterrupt: 

Check the distributions of domains for the true vs randomly assigned data. Distributions should be similar.

From true data

In [64]:
good_domains_value_counts = pd.Series(good_domains).value_counts()
good_domains_value_counts.head(20)

google.com        2069410
yahoo.com         1343049
bing.com          1303117
facebook.com      1301940
msn.com           1281952
youtube.com       1089638
live.com           933177
amazon.com         898409
ebay.com           375022
wikipedia.org      325491
aol.com            309125
microsoft.com      305610
pornhub.com        297105
paypal.com         267934
247-inc.net        266388
pinterest.com      251874
craigslist.org     251167
walmart.com        247218
netflix.com        242238
twitter.com        210164
dtype: int64

From randomly assigned domains

In [None]:
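# flatten the per-sample domain sets so that individual domains are counted
# (value_counts() on the column itself would operate on whole sets, not domains)
random_domains = [d for domains in all_panels_random_domains_df['domains'] for d in domains]
panels_random_domains_value_counts = pd.Series(random_domains).value_counts().head(20)
panels_random_domains_value_counts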
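
Beyond eyeballing the two top-20 lists, one option is to compare relative frequencies directly. The sketch below uses hypothetical toy counts in place of the two `value_counts()` results, and total variation distance is just one possible summary, chosen here for illustration.

```python
import pandas as pd

# Hypothetical toy counts standing in for the two value_counts() results.
true_counts = pd.Series({"a.com": 50, "b.com": 30, "c.com": 20})
random_counts = pd.Series({"a.com": 45, "b.com": 35, "c.com": 15, "d.com": 5})

# Align on the union of domains; a domain missing from one side gets share 0.
comparison = pd.DataFrame({
    "true_share": true_counts / true_counts.sum(),
    "random_share": random_counts / random_counts.sum(),
}).fillna(0.0)
comparison["abs_diff"] = (comparison.true_share - comparison.random_share).abs()

# Total variation distance: 0 means identical distributions, 1 means disjoint.
tv_distance = 0.5 * comparison.abs_diff.sum()
print(comparison.sort_values("true_share", ascending=False))
print("total variation distance: %.3f" % tv_distance)
```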

Compute simhash on domains

In [None]:
# apply simhash
all_panels_random_domains_df['simhash'] = all_panels_random_domains_df.domains.apply(floc.hashes.sim_hash_string)

In [None]:
all_panels_random_domains_df.head()
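
For intuition about the simhash step above, here is a toy sketch of the general simhash idea: each domain votes on every output bit, and the sign of the vote total decides the bit. This is not the `floc.hashes.sim_hash_string` implementation; `toy_simhash`, the use of SHA-256 bits as votes, and the 16-bit width are all assumptions made for illustration.

```python
import hashlib

def toy_simhash(domains, n_bits=16):
    """Combine per-domain hash bits into one n_bits fingerprint."""
    totals = [0] * n_bits
    for domain in domains:
        digest = int.from_bytes(hashlib.sha256(domain.encode()).digest(), "big")
        for i in range(n_bits):
            # Each domain votes +1 or -1 on bit i, depending on its own hash bit.
            totals[i] += 1 if (digest >> i) & 1 else -1
    # Output bit i is set when the votes for it are net positive.
    return sum(1 << i for i, t in enumerate(totals) if t > 0)

# Sets that share most of their domains tend to agree on most bits.
print(format(toy_simhash({"a.com", "b.com", "c.com"}), "016b"))
print(format(toy_simhash({"a.com", "b.com", "d.com"}), "016b"))
```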

##### Pre-compute cohorts for each panel

Each sample's cohort depends on the rest of the simhashes in its panel, so cohorts must be computed per panel.
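
To illustrate why the same simhash can land in different cohorts depending on its panel, here is a minimal sketch of prefix-based cohort assignment. It is not the `prefixLSH.get_cohorts_dict` implementation: `assign_prefix_cohorts`, the 4-bit simhashes, and the exact splitting rule are assumptions for illustration.

```python
def assign_prefix_cohorts(simhashes, min_k, n_bits=4):
    """Return {simhash: cohort_id} by recursively splitting on leading bits.

    A group is split on its next bit only if both halves keep at least
    min_k members; otherwise the whole group becomes one cohort.
    """
    cohorts = {}
    next_id = 1

    def split(group, bit):
        nonlocal next_id
        if bit >= 0:
            zeros = [h for h in group if not (h >> bit) & 1]
            ones = [h for h in group if (h >> bit) & 1]
            if len(zeros) >= min_k and len(ones) >= min_k:
                split(zeros, bit - 1)
                split(ones, bit - 1)
                return
        # Cannot split further without violating min_k: close the cohort.
        for h in group:
            cohorts[h] = next_id
        next_id += 1

    split(list(simhashes), n_bits - 1)
    return cohorts

# Two hypothetical panels that share some simhashes.
panel_a = [0b0000, 0b0001, 0b0100, 0b0101, 0b1000, 0b1001, 0b1100, 0b1101]
panel_b = [0b0000, 0b0001, 0b0100, 0b1000, 0b1001, 0b1100]

cohorts_a = assign_prefix_cohorts(panel_a, min_k=2)
cohorts_b = assign_prefix_cohorts(panel_b, min_k=2)

# 0b0000 and 0b0100 fall in separate cohorts in panel_a, but share one in
# panel_b, where the 0b01.. prefix has too few members to stand alone.
print(cohorts_a[0b0000], cohorts_a[0b0100])
print(cohorts_b[0b0000], cohorts_b[0b0100])
```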

In [None]:
min_k = 40
# preset all cohorts to NaN
all_panels_random_domains_df['cohort'] = np.nan

n_panels = all_panels_random_domains_df.panel_id.nunique()
for panel_id in all_panels_random_domains_df.panel_id.unique():
    t_start = datetime.now()
    print('computing cohorts for panel %s/%s' % (panel_id, n_panels))
    # restrict to this panel: cohorts depend only on simhashes within the panel
    in_panel = all_panels_random_domains_df.panel_id == panel_id
    panel_simhashes = all_panels_random_domains_df.loc[in_panel, 'simhash'].astype(int)
    cohorts_dict = prefixLSH.get_cohorts_dict(panel_simhashes, min_k=min_k)
    # assign cohorts to this panel's rows only, leaving other panels untouched
    all_panels_random_domains_df.loc[in_panel, 'cohort'] = panel_simhashes.map(cohorts_dict)
    print('took %s' % (datetime.now() - t_start))

In [None]:
all_panels_random_domains_df.head()

save output

In [None]:
all_panels_random_domains_cohorts_fpath = '../output/all_panels_random_domains_cohorts.csv'

In [None]:
print('saving to %s...' % all_panels_random_domains_cohorts_fpath)
all_panels_random_domains_df.to_csv(all_panels_random_domains_cohorts_fpath, index=False)

Script re-entry point: reload the saved output to resume from here without recomputing.

In [None]:
all_panels_random_domains_df = pd.read_csv(all_panels_random_domains_cohorts_fpath)
print('read in all panels: %s total rows' % len(all_panels_random_domains_df))
print('%s panels' % all_panels_random_domains_df.panel_id.nunique())
all_panels_random_domains_df.head()