Getting contexts in which top celeb mentions occur.

We'll start with the top 1000 people across the *Defender* and the *Tribune* (500 per paper in the code below), keeping the cutoff easy to change later as needed.

1. Get ranked top n celebs
2. Get paths to associated XML files
3. Pick x of those files randomly (5?) for samples per celeb
4. For each of those files, open it, parse the XML, find the sentence containing the name, and extract that
5. Create tabular output 

In [108]:
import os
import random  # needed by sample_sentence (random.choice)
from lxml import etree
import pandas as pd
from multiprocessing import Pool
from collections import Counter

In [7]:
NAMES = '/oak/stanford/groups/malgeehe/celebs/chicago_results/chicago_name_paper_year.csv'

In [8]:
df = pd.read_csv(NAMES)

In [9]:
df.head()

Unnamed: 0.1,Unnamed: 0,person,year,paper,n_mentions
0,2474291,N. Clark,1935,Chicago Daily Tribune,1015
1,17547,- Cago,1935,Chicago Daily Tribune,985
2,2474294,N. Clark,1937,Chicago Daily Tribune,939
3,2541407,N. Y.,1935,Chicago Daily Tribune,785
4,17551,- Cago,1937,Chicago Daily Tribune,777


In [10]:
df.shape

(3553244, 5)

Are the most frequent names by raw count all from the same(ish) years? It's tilted toward the 1930s.

Perhaps better to take the top N most-mentioned names in *each* year? Let's make sure that each year has enough names...
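
A quick way to check that tilt, sketched on made-up toy data (the `toy` frame and `k` are illustrative, not values from the real table): take the `k` largest rows by raw count and see which years they come from.

```python
import pandas as pd

# Toy stand-in for the person/year/n_mentions table (values are made up).
toy = pd.DataFrame({
    'person': ['A', 'B', 'C', 'D', 'E', 'F'],
    'year': [1935, 1935, 1937, 1921, 1921, 1937],
    'n_mentions': [900, 800, 700, 40, 30, 600],
})

# Years of the k largest rows by raw count; a lopsided tally means
# a raw-count cutoff would over-represent those years.
k = 4
top_years = toy.nlargest(k, 'n_mentions')['year'].value_counts()
print(top_years)
```

If a handful of years dominate the tally, ranking within each year is probably the safer choice.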

In [19]:
g = df.groupby(['year','paper']).count()

Ok, every year has well over 10k total name mentions across both papers

In [22]:
L = ['paper','year','n_mentions']

In [23]:
total_mentions = df[L].groupby(['paper','year']).sum()

In [25]:
total_mentions.reset_index(inplace = True)

Ok, and then a many-to-one merge, attaching each paper-year total to every name row:

In [27]:
test = pd.merge(df, total_mentions, on = ['paper','year'])

In [30]:
test.columns

Index(['Unnamed: 0', 'person', 'year', 'paper', 'n_mentions_x',
       'n_mentions_y'],
      dtype='object')

In [34]:
test['%_mentions'] = (test['n_mentions_x'] / test['n_mentions_y']) * 100
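
For reference, a minimal sketch of the same share computation on toy data. The frame, the `_total` suffix and the `pct_mentions` column name are assumptions for illustration. `validate='many_to_one'` makes pandas raise if the totals table ever has duplicate keys, and `groupby(...).transform('sum')` gets the same result without a merge.

```python
import pandas as pd

# Toy stand-in for the name table (values are made up).
toy = pd.DataFrame({
    'person': ['A', 'B', 'C', 'D'],
    'paper': ['P1', 'P1', 'P2', 'P2'],
    'year': [1935, 1935, 1935, 1935],
    'n_mentions': [30, 10, 3, 1],
})

# One row per (paper, year) holding the total mentions.
totals = toy.groupby(['paper', 'year'], as_index=False)['n_mentions'].sum()

# Many name rows match one totals row; validate enforces that assumption.
merged = pd.merge(toy, totals, on=['paper', 'year'],
                  suffixes=('', '_total'), validate='many_to_one')
merged['pct_mentions'] = merged['n_mentions'] / merged['n_mentions_total'] * 100

# Same shares without a merge: broadcast each group's sum back onto its rows.
group_total = toy.groupby(['paper', 'year'])['n_mentions'].transform('sum')
toy['pct_mentions'] = toy['n_mentions'] / group_total * 100
```

Explicit suffixes also avoid the harder-to-read `n_mentions_x` / `n_mentions_y` column names.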

In [35]:
test

Unnamed: 0.1,Unnamed: 0,person,year,paper,n_mentions_x,n_mentions_y,%_mentions
0,2474291,N. Clark,1935,Chicago Daily Tribune,1015,465432,0.218077
1,17547,- Cago,1935,Chicago Daily Tribune,985,465432,0.211631
2,2541407,N. Y.,1935,Chicago Daily Tribune,785,465432,0.168661
3,3348955,W. Madison,1935,Chicago Daily Tribune,680,465432,0.146101
4,2496786,N. J.,1935,Chicago Daily Tribune,621,465432,0.133424
...,...,...,...,...,...,...,...
3553239,1246437,George Florence,1926,The Chicago Defender,1,6425,0.015564
3553240,1250698,George Henry Lane,1926,The Chicago Defender,1,6425,0.015564
3553241,1250760,George Herald,1926,The Chicago Defender,1,6425,0.015564
3553242,1251742,George Hudson,1926,The Chicago Defender,1,6425,0.015564


And then top names per paper per year by `%_mentions`

In [36]:
L = ['person', 'paper', 'year', '%_mentions']

In [37]:
papers = df['paper'].unique()  # assumed definition; the defining cell is missing from this export
papers

array(['Chicago Daily Tribune ', 'The Chicago Defender '], dtype=object)

In [38]:
d = {}
for paper in papers:
    d[paper] = list(range(1919,1939))

In [42]:
L = []
N = 200  # number of top names to keep per paper per year
for k, v in d.items():
    data = test[test['paper'] == k]
    for x in v:
        # sort_values returns a new frame, which avoids the SettingWithCopyWarning
        # raised by sorting a slice in place
        temp = data[data['year'] == x].sort_values('%_mentions', ascending = False)
        L.append(temp[:N].copy())


In [44]:
out = pd.concat(L)
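
The per-paper, per-year loop above could probably be collapsed into a sort followed by `groupby(...).head(N)`. A minimal sketch on toy data (the frame and the `pct_mentions` column name are made up for illustration):

```python
import pandas as pd

# Toy stand-in for the merged table (values are made up).
toy = pd.DataFrame({
    'person': ['A', 'B', 'C', 'D', 'E', 'F'],
    'paper': ['P1', 'P1', 'P1', 'P2', 'P2', 'P2'],
    'year': [1935] * 6,
    'pct_mentions': [50.0, 30.0, 20.0, 10.0, 70.0, 20.0],
})

N = 2  # names to keep per (paper, year) group

# Sort once globally, then keep the first N rows seen in each group.
top_n = (
    toy.sort_values('pct_mentions', ascending=False)
       .groupby(['paper', 'year'], sort=False)
       .head(N)
)
print(top_n)
```

Unlike the loop, this keeps whatever years are present in the data rather than a fixed range, so filter on `year` first if the range matters.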

In [53]:
out.to_csv('/oak/stanford/groups/malgeehe/celebs/chicago_results/200_people_per_paper_per_year.csv')

Here we'll get top names by paper as a proportion of all mentions

In [67]:
df.columns

Index(['Unnamed: 0', 'person', 'year', 'paper', 'n_mentions'], dtype='object')

In [68]:
L = ['paper','n_mentions']

In [69]:
total_mentions = df[L].groupby('paper').sum()

In [70]:
total_mentions.reset_index(inplace = True)

In [76]:
L = ['person','paper','n_mentions']

In [77]:
total_people_mentions = df[L].groupby(['person','paper']).sum()

In [81]:
total_people_mentions.reset_index(inplace = True)

Ok, and then the same many-to-one merge, this time attaching each paper's total to every person row:

In [82]:
test = pd.merge(total_people_mentions, total_mentions, on = 'paper')

In [83]:
test.columns

Index(['person', 'paper', 'n_mentions_x', 'n_mentions_y'], dtype='object')

In [84]:
test['%_mentions'] = (test['n_mentions_x'] / test['n_mentions_y']) * 100

In [87]:
test.sort_values('%_mentions', ascending = False, inplace = True)

In [88]:
test.columns

Index(['person', 'paper', 'n_mentions_x', 'n_mentions_y', '%_mentions'], dtype='object')

In [89]:
papers

array(['Chicago Daily Tribune ', 'The Chicago Defender '], dtype=object)

In [92]:
top_tribune = list(test[test['paper'] == papers[0]]['person'][:500])

In [93]:
top_defender = list(test[test['paper'] == papers[1]]['person'][:500])

Ok, load up name-file relationships

In [56]:
DATA = '/oak/stanford/groups/malgeehe/celebs/chicago_results/chicago_names'

In [57]:
files = [os.path.join(DATA, x) for x in os.listdir(DATA) if x.endswith('.tsv')]

In [63]:
def get_data(tsv):
    df = pd.read_csv(tsv, sep = '\t')
    df.columns = ['path','name']
    return df

In [64]:
with Pool() as p:
    L = p.map(get_data, files)

In [65]:
name_file = pd.concat(L)

In [99]:
name_file.columns

Index(['path', 'name'], dtype='object')

In [100]:
t_check = name_file[name_file['name'].isin(top_tribune)]

In [101]:
d_check = name_file[name_file['name'].isin(top_defender)]

In [104]:
len(t_check['path'].unique())

630005

In [107]:
t_check.to_csv('/oak/stanford/groups/malgeehe/celebs/chicago_results/tribune_get_context.csv')

In [105]:
len(d_check['path'].unique())

362508

In [108]:
d_check.to_csv('/oak/stanford/groups/malgeehe/celebs/chicago_results/defender_get_context.csv')

Ok, now that we have files to check, we need to actually parse them.

# Parsing XML

In [3]:
d_check = pd.read_csv('/oak/stanford/groups/malgeehe/celebs/chicago_results/defender_get_context.csv')

In [4]:
t_check = pd.read_csv('/oak/stanford/groups/malgeehe/celebs/chicago_results/tribune_get_context.csv')

Let's choose 5 (?) random files per person to get the sample sentences.

In [5]:
d_check.columns

Index(['Unnamed: 0', 'path', 'name'], dtype='object')

In [9]:
g = d_check.groupby('name').count()

In [11]:
g.sort_values('path')

Unnamed: 0_level_0,Unnamed: 0,path
name,Unnamed: 1_level_1,Unnamed: 2_level_1
C. H. Thomas,167,167
Clayborne George,173,173
Billy Tucker,182,182
Mary F. Waring,185,185
Dewey R. Jones,185,185
...,...,...
St. Paul,11456,11456
Van Buren,16499,16499
N. J.,17181,17181
N. Y.,25162,25162


Ok, the smallest count is 167 and the highest is 26380, and this is the smaller paper, so 5-10 examples per person is no problem.

In [185]:
def sample_sentence(row):
    
    with open(row[1]['path']) as f: # row is an (index, Series) tuple from iterrows, hence row[1]
        tree = etree.parse(f)
        target_name = [x.strip() for x in row[1]['name'].split() if x] # if x to drop blanks
        # find related tokens
        people_tokens = [x.getparent() for x in tree.xpath("//NER") if x.text == 'PERSON']    
        matches = [x for x in people_tokens if x[0].text.title() in target_name]
        # find out which sentences those tokens belong to
        sents = Counter([x.getparent().getparent() for x in matches])

        # find out which sentences have full names in them
        full_names = []
        for k,v in sents.items():
            if v == len(target_name):
                full_names.append(k)

        if len(full_names) == 1:
            # get the sentence
            sample = full_names.pop()
        elif len(full_names) > 1:
            sample = random.choice(full_names)
        else:
            sample = random.choice([x.getparent().getparent() for x in matches])

        # get sentence words
        sentence = ' '.join([x.text for x in sample.iter() if x.tag == 'word'])

        d = {}
        d['name'] = row[1]['name']
        d['path'] = row[1]['path']
        d['example'] = sentence
        
    return d
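
To make the XML walk easier to follow, here is a simplified sketch on a toy document. The layout (`sentence` > `tokens` > `token` > `word`/`NER`) is an assumption inferred from the `getparent()` calls above, the helper name `sentences_with_full_name` is hypothetical, and it uses the standard library's `ElementTree` instead of `lxml` so it runs anywhere. It uses a looser test than the function above: a sentence qualifies if every part of the name appears among its PERSON tokens.

```python
import xml.etree.ElementTree as ET

# Toy document; the element layout is assumed, not copied from the corpus.
TOY_XML = """<root><document><sentences>
  <sentence id="1"><tokens>
    <token id="1"><word>Mr.</word><NER>O</NER></token>
    <token id="2"><word>Billy</word><NER>PERSON</NER></token>
    <token id="3"><word>Tucker</word><NER>PERSON</NER></token>
    <token id="4"><word>sang</word><NER>O</NER></token>
  </tokens></sentence>
  <sentence id="2"><tokens>
    <token id="1"><word>Tucker</word><NER>PERSON</NER></token>
    <token id="2"><word>left</word><NER>O</NER></token>
  </tokens></sentence>
</sentences></document></root>"""


def sentences_with_full_name(xml_text, name):
    """Return the text of sentences whose PERSON tokens cover every part of name."""
    parts = name.split()
    root = ET.fromstring(xml_text)
    hits = []
    for sent in root.iter('sentence'):
        tokens = list(sent.iter('token'))
        # words tagged PERSON in this sentence
        people = [t.findtext('word') for t in tokens if t.findtext('NER') == 'PERSON']
        if all(p in people for p in parts):
            hits.append(' '.join(t.findtext('word') for t in tokens))
    return hits


examples = sentences_with_full_name(TOY_XML, 'Billy Tucker')
print(examples)
```

Only the first toy sentence contains both name parts, so the second one (surname only) is left out.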

Subset the dataframe to 10 rows per person.
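
As an aside, pandas (1.1 or later) can draw a fixed number of rows per group in one call. A minimal sketch on toy data (the frame, `K` and the seed are illustrative assumptions):

```python
import pandas as pd

# Toy stand-in for the name/path table (values are made up).
toy = pd.DataFrame({
    'name': ['X'] * 5 + ['Y'] * 3,
    'path': ['x{}.xml'.format(i) for i in range(5)]
            + ['y{}.xml'.format(i) for i in range(3)],
})

K = 2  # rows to draw per name

# Draw K rows per name without replacement; random_state makes it repeatable.
sampled = toy.groupby('name').sample(n=K, random_state=0)
print(sampled)
```

Without replacement, this raises an error if any name has fewer than `K` rows, which is not a concern here since the smallest count is well above 10.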

In [208]:
t_people = t_check['name'].unique()

In [209]:
# for each of those people, choose 10 random rows, and get sample sentences:
N_SAMPLES = 10
L = []

for i, person in enumerate(t_people):
    person_rows = t_check[t_check['name'] == person].sample(N_SAMPLES)

    for row in person_rows.iterrows():
        L.append(sample_sentence(row))

    # lightweight progress counter
    if i % 10 == 0:
        print('\r{}'.format(i), end = '')

490

In [210]:
d = pd.DataFrame(L)

In [211]:
d.head()

Unnamed: 0,name,path,example
0,N. Clark-St,/scratch/groups/malgeehe/celebs/chicago_corenl...,"Funeral servIce from chape : , 2701 N. Clark-st ."
1,N. Clark-St,/scratch/groups/malgeehe/celebs/chicago_corenl...,W. WAiK1t : 4010 N. Clark-st .
2,N. Clark-St,/scratch/groups/malgeehe/celebs/chicago_corenl...,7720-56 lla tol tenan 6712 N. Clark-st .
3,N. Clark-St,/scratch/groups/malgeehe/celebs/chicago_corenl...,6t005 N. Clark-st .
4,N. Clark-St,/scratch/groups/malgeehe/celebs/chicago_corenl...,136 N. Clark-st .


In [212]:
d.to_csv('/oak/stanford/groups/malgeehe/celebs/chicago_results/tribune_sentence_samples.csv')