# Sensitivity by race by domain visit frequencies

Questions we ask here:

- Are there demographic (by race) differences in browsing behavior?

t-closeness is one way of examining sensitivity.

Here we look at sensitivity by race based on overall domain visitation frequency.


Chi-squared test of homogeneity: how we do it.

We have r populations: {white, black, asian, other}.

Preprocessing: use the raw sessions data after removing bad domains, then build a map

{ machine_id: {domain: visits for each domain visited by machine} }

over all 52 weeks of data.

table:

```
machine_id, domain, visits
```

where machine_id is duplicated for each domain the machine visited.



Our category is the domain.

The category has D levels, where D is the cutoff on the number of most frequently visited domains (`TOP_N_DOMAINS` below).
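
For reference, the standard statistic for a test of homogeneity with r populations and D category levels can be written as below. The symbols $O_{ij}$ (observed visits to domain $j$ by population $i$), $E_{ij}$ (expected visits under homogeneity) and $N$ (total visits in the table) are notation assumed here, not names used elsewhere in this notebook.

$$
E_{ij} = \frac{\left(\sum_{j'=1}^{D} O_{ij'}\right)\left(\sum_{i'=1}^{r} O_{i'j}\right)}{N},
\qquad
\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{D} \frac{\left(O_{ij} - E_{ij}\right)^2}{E_{ij}}
$$

Under the null hypothesis that every population has the same distribution over domains, $\chi^2$ approximately follows a chi-squared distribution with $(r-1)(D-1)$ degrees of freedom.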


if we didn't use panel data

Total sample is the size of our comscore data samples total: 



In [1]:
from datetime import datetime
import sys
sys.path.append('..')

import matplotlib.pyplot as plt

import numpy as np
import pandas as pd

from itertools import chain
from collections import Counter


from comscore.data import read_weeks_machines_domains, read_comscore_demo_df

TOP_N_DOMAINS = 50

A core question left unanswered by our analysis of FLoC's sensitivity is whether racial groups in our dataset exhibit significantly different browsing behavior. If users' browsing history does not vary by racial group, then we should expect any clustering algorithm, including FLoC, to group users independently of race. We use a Chi-Squared test of homogeneity to test whether the distribution of visits across domains is the same for every racial group, treating each racial group as a separate population and the top domains visited in our dataset as the categorical variable of interest. To calculate domain visit frequencies, we construct a set $D_w^m$ consisting of each unique domain visited by a machine $m$ in a given week $w$.
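
The construction of $D_w^m$ and the per-group counts can be sketched on toy data as follows. This is a minimal illustration, not the preprocessing code itself: the `sessions` rows, the `race_by_machine` lookup and all values in them are made up for the example.

```python
from collections import Counter, defaultdict

# Hypothetical toy sessions: (week, machine_id, domain).
# Repeat visits within a week are deliberately included.
sessions = [
    (1, 101, "google.com"),
    (1, 101, "google.com"),
    (1, 101, "facebook.com"),
    (2, 101, "google.com"),
    (1, 202, "google.com"),
    (1, 202, "yahoo.com"),
    (2, 202, "yahoo.com"),
]
# Hypothetical demographics lookup: machine_id -> racial group.
race_by_machine = {101: "white", 202: "black"}

# D_w^m: the set of unique domains machine m visited in week w.
domain_sets = defaultdict(set)
for week, machine_id, domain in sessions:
    domain_sets[(week, machine_id)].add(domain)

# Each (week, machine) pair contributes at most one count per domain,
# so heavy repeat visits within a week do not dominate the totals.
counts_by_race = defaultdict(Counter)
for (week, machine_id), domains in domain_sets.items():
    counts_by_race[race_by_machine[machine_id]].update(domains)

for race, counts in counts_by_race.items():
    print(race, dict(counts))
```

Counting set membership rather than raw visits means a domain's count for a group is the number of machine-weeks in which it appeared.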

In [2]:
# read in the pre-processed sessions data 
# this maps week,machine_id -> domains set
weeks_machines_domains_fpath = '../output/weeks_machines_domains.csv'
weeks_machines_domains_df = read_weeks_machines_domains(weeks_machines_domains_fpath)

reading from ../output/weeks_machines_domains.csv...
... read 4877236 rows


In [3]:
comscore_demo_df = (read_comscore_demo_df(fpath="../data/comscore/2017/demographics.csv", year=2017)
           .assign(machine_id=lambda x: x.machine_id.astype(int))
          )

df = (weeks_machines_domains_df.merge(comscore_demo_df, 
               how='inner', # only include machine_ids with valid data (no nan race)
               left_on='machine_id', 
               right_on='machine_id')
      .query("n_domains > 0")
     )
# convert to list
df.domains = [list(d) for d in df.domains]

In [4]:
# Calculate frequencies for each race group
# takes a couple mins on my mac mini
freqs = []
for race in df.racial_background.unique():
    df2 = df[df.racial_background == race]
    counter = pd.Series(Counter(chain.from_iterable(df2.domains)))
    freqs.append(counter)
    
races = df.racial_background.unique()
dfs = [(pd.DataFrame(distribution)
       .assign(race=races[i])) for i, distribution in enumerate(freqs)]

for dist_df in dfs:
    # column 0 holds the raw counts (the unnamed Series becomes column 0)
    dist_df['p'] = dist_df[0] / dist_df[0].sum()
    dist_df['count'] = dist_df[0]
    # column 0 itself is dropped after the concat below
    
# domain, count, p, race
race_distributions = (pd.concat(dfs)
                      .drop(0, axis=1)
                      .reset_index()
                      .rename({'index': 'domain'}, axis=1)
                     )

# stack and compute for all
# domain, count, freq
all_distribution = (race_distributions
 .drop(['race', 'p'], axis=1)
 .groupby('domain')
 .aggregate({'count': 'sum'})
 .assign(p=lambda x: x['count'] / x['count'].sum())
)

In [6]:
# top N domains overall
top_domains = all_distribution.nlargest(TOP_N_DOMAINS, columns='p')
top_domains['p'] = top_domains['count'] / top_domains['count'].sum()

Aggregating across weeks for each racial group gives us a distribution over all domains in our dataset. For example, here are the top 5 domains visited by each racial group in the comscore data:

In [None]:
(race_distributions
 .groupby('race')
 .apply(lambda x: x.nlargest(5, columns='p'))
)

To calculate our Chi-Square statistics, we limit our analysis to the top 50 domains across all racial groups. Our contingency table consists of each domain as a "category" and each racial group as a separate population:

In [None]:
from scipy.stats import chi2_contingency

race_dist_df = race_distributions[race_distributions.domain.isin(top_domains.index)]

# contingency table of raw counts: one row per domain, one column per race
race_X = race_dist_df.pivot(index='domain', columns='race', values='count')
race_X.head()

In [None]:
# chi2_contingency derives the expected counts from the table margins,
# so no separate expected distribution is needed
exp_stat, exp_p, dof, e = chi2_contingency(race_X.values)

In [None]:
exp_stat, exp_p, dof, e
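
To make the four values returned by `chi2_contingency` concrete, here is a small sketch on a made-up table. The counts in `toy_counts` are invented for illustration and are not taken from the comscore data. The manual computation is included only to show where the expected counts and the statistic come from.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Made-up counts: rows are domains, columns are racial groups.
toy_counts = np.array([
    [120, 90, 60, 30],
    [80, 110, 40, 20],
    [50, 40, 70, 25],
])

stat, p_value, dof, expected = chi2_contingency(toy_counts)

# Expected counts under homogeneity: row total * column total / grand total.
manual_expected = (
    np.outer(toy_counts.sum(axis=1), toy_counts.sum(axis=0)) / toy_counts.sum()
)
manual_stat = ((toy_counts - manual_expected) ** 2 / manual_expected).sum()

print(f"chi2={stat:.2f}, dof={dof}, p={p_value:.3g}")
print(f"manual chi2={manual_stat:.2f}")
```

The statistic is unchanged if the table is transposed, so it does not matter whether domains or racial groups are placed on the rows.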

---

While the Chi-Squared test demonstrates that there *is* a difference in browsing behavior between racial groups, it does not show the size of that difference. To get a sense of how behavior differs, two plots of the distributions over domain frequencies are shown below. The first is grouped, which gives a better sense of the overall domains, and the second is faceted, which gives a better sense of the differences in shape:

In [None]:
from plotnine import (ggplot, geom_point, aes, stat_smooth, 
                      facet_wrap, geom_bar, element_text, 
                      theme, theme_minimal, labs, element_blank)

(race_dist_df
 .assign(race=lambda x: pd.Categorical(x.race, categories=sorted(x.race.unique())),
        domain=lambda x: pd.Categorical(x.domain, categories=x.domain.unique()),)
 .pipe(lambda x:
     ggplot(x, aes(x='reorder(domain, -p)', y='p', fill='race')) +
     geom_bar(stat='identity', position='dodge') +
     theme_minimal() +
     theme(axis_text_x=element_text(angle=90, vjust=1, hjust=1)) +
     labs(x="Domain", y="Freq", fill="Race")
))

In [None]:
(race_dist_df
 .assign(race=lambda x: pd.Categorical(x.race, categories=sorted(x.race.unique())),
        domain=lambda x: pd.Categorical(x.domain, categories=x.domain.unique()),)
 .pipe(lambda x:
     ggplot(x, aes(x='reorder(domain, -p)', y='p')) +
     geom_bar(stat='identity', position='dodge') +
     theme_minimal() +
     theme(axis_text_x=element_blank()) +
     labs(x="Domain", y="Freq") +
       facet_wrap('~race')
))

Although the distributions have broadly similar shapes across groups, the visit frequencies for the top few sites differ noticeably between groups.
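
This notebook does not compute an effect size. One common way to put a number on the size of the difference, sketched here as an assumption rather than as part of the analysis above, is Cramér's V, which rescales the chi-squared statistic to the range 0 to 1. The helper `cramers_v` and the two toy tables are hypothetical.

```python
import numpy as np

def cramers_v(table):
    """Cramér's V for a contingency table of raw counts (hypothetical helper)."""
    table = np.asarray(table, dtype=float)
    n = table.sum()
    # Expected counts under homogeneity, from the row and column totals.
    expected = np.outer(table.sum(axis=1), table.sum(axis=0)) / n
    chi2 = ((table - expected) ** 2 / expected).sum()
    # Normalise by sample size and the smaller table dimension minus one.
    return float(np.sqrt(chi2 / (n * (min(table.shape) - 1))))

identical = np.array([[10, 20], [20, 40]])  # same proportions in both columns
disjoint = np.array([[10, 0], [0, 10]])     # no overlap between the columns
print(cramers_v(identical), cramers_v(disjoint))
```

Applied to the contingency table built earlier, this would be `cramers_v(race_X.values)`. Unlike the p-value, V does not grow with sample size alone, which matters because a chi-squared test on a very large number of visits can reject homogeneity even when the differences are small.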