Setting up linters

In [5]:
%load_ext pycodestyle_magic

In [10]:
%%pycodestyle
a=None

INFO:pycodestyle:2:2: E225 missing whitespace around operator


Getting lists of people exclusive to *Tribune* and *Defender* with number of mentions > *n* (TBD)

In [12]:
data = '/oak/stanford/groups/malgeehe/celebs/chicago_results/chicago_name_paper.csv'

In [48]:
import os
import pandas as pd

In [14]:
df = pd.read_csv(data)

In [15]:
df.shape

(2761876, 3)

In [16]:
df.head()

       person                  paper  n_mentions
0    N. Clark  Chicago Daily Tribune       10358
1      - Cago  Chicago Daily Tribune        8054
2       N. Y.  Chicago Daily Tribune        6072
3   Van Buren  Chicago Daily Tribune        5497
4  W. Madison  Chicago Daily Tribune        5250


In [17]:
papers = df['paper'].unique()

In [18]:
papers

array(['Chicago Daily Tribune ', 'The Chicago Defender '], dtype=object)

In [19]:
tribune = set(df[df['paper']==papers[0]]['person'])

In [20]:
defender = set(df[df['paper']==papers[1]]['person'])

In [21]:
t_exclusive = tribune - defender

In [22]:
d_exclusive = defender - tribune

In [23]:
len(tribune), len(defender)

(2171544, 590332)

In [24]:
len(t_exclusive), len(d_exclusive)

(2108998, 527786)

Nearly all names appear in only one of the two papers: just 62,546 names are shared (2,171,544 - 2,108,998).

In [25]:
len(d_exclusive)/len(defender)

0.8940494501399213

In [26]:
len(t_exclusive)/len(tribune)

0.9711974521354391
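The set arithmetic above can be checked by hand on a toy frame. This is a minimal sketch, assuming the same three columns (`person`, `paper`, `n_mentions`); the names and counts are made up, and `exclusive_share` is a hypothetical helper, not something defined in this notebook.

```python
import pandas as pd

# Toy data (made up for illustration): same columns as the real CSV.
toy = pd.DataFrame({
    'person': ['A', 'B', 'C', 'B', 'D'],
    'paper': ['Tribune', 'Tribune', 'Tribune', 'Defender', 'Defender'],
    'n_mentions': [5, 2, 1, 3, 1],
})


def exclusive_share(frame, paper, other):
    """Fraction of `paper`'s distinct names that never appear in `other`."""
    names = set(frame.loc[frame['paper'] == paper, 'person'])
    other_names = set(frame.loc[frame['paper'] == other, 'person'])
    return len(names - other_names) / len(names)


# Tribune has {A, B, C}, Defender has {B, D}; only B is shared.
tribune_share = exclusive_share(toy, 'Tribune', 'Defender')
defender_share = exclusive_share(toy, 'Defender', 'Tribune')
```

Here two of the three Tribune names and one of the two Defender names are exclusive, which mirrors the `len(t_exclusive)/len(tribune)` and `len(d_exclusive)/len(defender)` ratios computed above.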

Let's see if this holds for names that occur more often than, say, the first quartile (25th percentile) of mention counts.

In [27]:
df[df['paper']==papers[0]]['n_mentions'].describe()

count    2.171544e+06
mean     1.771238e+00
std      1.725833e+01
min      1.000000e+00
25%      1.000000e+00
50%      1.000000e+00
75%      1.000000e+00
max      1.035800e+04
Name: n_mentions, dtype: float64

At least 75% of Tribune names occur only once: the 25th, 50th, and 75th percentiles of n_mentions are all 1.
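The quartiles only give a lower bound on the share of single-mention names. A direct count is more informative. The sketch below uses made-up counts, not the real data, to show the difference.

```python
import pandas as pd

# Made-up mention counts: nine names seen once, one name seen twelve times.
counts = pd.Series([1] * 9 + [12])

# Exact share of names with a single mention (mean of a boolean mask).
singleton_share = (counts == 1).mean()

# The 75th percentile is still 1, so describe() alone would only tell us
# that the share is at least 75%.
p75 = counts.quantile(0.75)
```

On the real frame the equivalent check would be `(df['n_mentions'] == 1).mean()`, optionally restricted to one paper first.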

So let's start with names that show up more than once.

In [28]:
tribune = set(df[(df['paper']==papers[0]) & (df['n_mentions'] > 1)]['person'])

In [31]:
defender = set(df[(df['paper']==papers[1]) & (df['n_mentions'] > 1)]['person'])

In [35]:
t_exclusive = tribune - defender

In [36]:
d_exclusive = defender - tribune

In [37]:
len(tribune), len(defender)

(270561, 91719)

In [38]:
len(t_exclusive), len(d_exclusive)

(249808, 70966)

In [39]:
len(d_exclusive)/len(defender)

0.7737328143568944

In [40]:
len(t_exclusive)/len(tribune)

0.9232964100517074
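Since the cutoff *n* is still TBD, the filter-then-diff steps above can be wrapped in a function and swept over several thresholds. This is a sketch under assumptions: `exclusivity_at_threshold` is a hypothetical helper, and the toy data is made up.

```python
import pandas as pd


def exclusivity_at_threshold(frame, paper_a, paper_b, n):
    """Exclusive-name shares for each paper, keeping rows with n_mentions > n."""
    frequent = frame[frame['n_mentions'] > n]
    a = set(frequent.loc[frequent['paper'] == paper_a, 'person'])
    b = set(frequent.loc[frequent['paper'] == paper_b, 'person'])
    # Guard against an empty set when the threshold removes every name.
    share_a = len(a - b) / len(a) if a else float('nan')
    share_b = len(b - a) / len(b) if b else float('nan')
    return share_a, share_b


# Toy data (made up for illustration).
toy = pd.DataFrame({
    'person': ['A', 'B', 'C', 'B', 'D'],
    'paper': ['Tribune', 'Tribune', 'Tribune', 'Defender', 'Defender'],
    'n_mentions': [5, 2, 1, 3, 1],
})

results = {n: exclusivity_at_threshold(toy, 'Tribune', 'Defender', n)
           for n in (0, 1, 2)}
```

Called as `exclusivity_at_threshold(df, papers[0], papers[1], 1)`, this should give the same two ratios as the cells above, since it applies the same `n_mentions > 1` filter to each paper before taking set differences.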

Let's write out the names that occur more than once: those in both papers, and those exclusive to each paper.

In [41]:
isct = tribune & defender

In [52]:
out = '/oak/stanford/groups/malgeehe/celebs/chicago_results'

In [51]:
df[df['person'].isin(isct)].to_csv(os.path.join(out,'people_in_both_papers.csv'))

In [53]:
df[df['person'].isin(t_exclusive)].to_csv(os.path.join(out,'tribune_exclusives.csv'))

In [54]:
df[df['person'].isin(d_exclusive)].to_csv(os.path.join(out,'defender_exclusives.csv'))

What proportion of total name mentions across both papers is accounted for by people who appear more than once in both papers?

In [55]:
total_mentions = df['n_mentions'].sum()

In [56]:
total_mentions

4761077

In [57]:
isct_mentions = df[df['person'].isin(isct)]['n_mentions'].sum()

In [64]:
len(isct) / df.shape[0]

0.007514095491615119

In [58]:
isct_mentions / total_mentions

0.1445481768095748
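The two ratios above measure different things: one is a share of rows, the other a share of mentions. A small worked example makes the gap concrete. The data is made up and the variable names are illustrative only.

```python
import pandas as pd

# Toy data (made up for illustration).
toy = pd.DataFrame({
    'person': ['A', 'B', 'C', 'B', 'D'],
    'paper': ['Tribune', 'Tribune', 'Tribune', 'Defender', 'Defender'],
    'n_mentions': [5, 2, 1, 3, 1],
})

# Names mentioned more than once in each paper, then their intersection.
frequent = toy[toy['n_mentions'] > 1]
tribune = set(frequent.loc[frequent['paper'] == 'Tribune', 'person'])
defender = set(frequent.loc[frequent['paper'] == 'Defender', 'person'])
shared = tribune & defender

# Share of all mentions that belong to shared names (both papers' rows count).
shared_rows = toy[toy['person'].isin(shared)]
mention_share = shared_rows['n_mentions'].sum() / toy['n_mentions'].sum()

# Shared names as a fraction of (name, paper) rows.
row_share = len(shared) / len(toy)
```

In this toy case the single shared name accounts for 5 of 12 mentions but only 1/5 of the row count. Note that `row_share` divides a count of names by a count of (name, paper) rows, and each shared name occupies two rows.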

In [67]:
isct_df = df[df['person'].isin(isct)]

In [68]:
isct_df['n_mentions'].describe()

count    41506.000000
mean        16.580856
std        108.923074
min          2.000000
25%          2.000000
50%          4.000000
75%         10.000000
max      10358.000000
Name: n_mentions, dtype: float64