Which names tend to co-occur within the same document?

In [None]:
import os
import pandas as pd
from multiprocessing import Pool
import time

In [2]:
DATA = '/oak/stanford/groups/malgeehe/celebs/chicago_results/chicago_names'

In [3]:
files = [os.path.join(DATA,x) for x in os.listdir(DATA) if x.endswith('.tsv')]

First, read in every TSV, then narrow down to the documents that contain more than one name.

In [5]:
def load_tsv(tsv):
    df = pd.read_csv(tsv, sep = '\t')
    df.columns = ['path', 'name']
    return df

In [None]:
start = time.time()
with Pool() as p:
    L = p.map(load_tsv, files)
print(time.time()-start)

In [4]:
L = []
for i,f in enumerate(files):
    d = pd.read_csv(f, sep = '\t')
    d.columns = ['path','name']
    L.append(d)
    if i % 5 == 0:
        print('\r{}'.format(i), end = '')

300

In [None]:
df = pd.concat(L)

Find docs with more than one name

In [6]:
g = df.groupby('path').count()

In [7]:
docs = g[g['name'] > 1].index # docs with multiple people

There are too many of these to work with, so which ones fall in the relevant time period?

In [8]:
meta = pd.read_csv('/oak/stanford/groups/malgeehe/celebs/chicago_results/chicago_1919-1939_meta.csv')

In [9]:
in_period = [os.path.split(x)[1] for x in meta['fullpath']]

In [10]:
candidates = [os.path.split(x)[1].split('.xml')[0] for x in docs]

In [11]:
# strip the '_chunk...' suffix so chunked files map back to their source document
candidates = [x.split('_chunk')[0] for x in candidates]

In [12]:
candidates[0]

'CD_20151209220246_00001_491877180.txt'

In [13]:
len(candidates), len(docs)

(1868361, 1868361)

In [14]:
# compare the two filename formats before matching
in_period[0], candidates[0]

('CT_20170929192812_00001_181362810.txt',
 'CD_20151209220246_00001_491877180.txt')

In [44]:
in_period_multiple_names = list(set(in_period) & set(candidates))

In [45]:
len(in_period_multiple_names)

440078

In [51]:
fn_lengths = [len(x) for x in in_period_multiple_names]

In [40]:
df.columns

Index(['path', 'name'], dtype='object')

Impasse: we're relying on string matching to determine which of these files to extract names from. String matching appears to be horrifyingly slow.

I absolutely *should* have had all of these matched using numeric document IDs. I don't. Not sure what the best way to proceed with the matching problem is. Sherlock?
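
One possible way around the slow matching, sketched here as an assumption and not something tested on the full data: pull the bare `.txt` filename out of every path in a single vectorized pass with `Series.str.extract`, so that later matching is an exact comparison on a short key. The toy paths below are hypothetical and only mimic the filename shape seen in `candidates[0]`.

```python
import pandas as pd

# Hypothetical toy rows; the directory and suffixes are made up for illustration.
toy = pd.DataFrame({
    'path': [
        '/some/dir/CD_20151209220246_00001_491877180.txt_chunk3.xml',
        '/some/dir/CT_20170929192812_00001_181362810.txt.xml',
    ],
    'name': ['Person A', 'Person B'],
})

# One capture group: the '<prefix>_<14 digits>_<5 digits>_<9 digits>.txt' filename.
pattern = r'(C[A-Z]_[0-9]{14}_[0-9]{5}_[0-9]{9}\.txt)'

# expand=False returns a Series, which can be assigned directly as a new column.
toy['txt'] = toy['path'].str.extract(pattern, expand=False)
print(toy[['txt', 'name']])
```

Rows whose path does not match the pattern come back as `NaN` instead of raising, which makes malformed paths easy to spot afterwards.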

Ok, so maybe 60% of the documents have more than one person

In [71]:
import re
word_pattern = re.compile(r'C[A-Z]_[0-9]{14}_[0-9]{5}_[0-9]{9}\.txt')  # escape the dot so it only matches a literal '.txt'

In [36]:
L = None # maybe this will free up enough memory?

In [85]:
data = list(df['path'])

In [87]:
len(data)

18217241

In [101]:
def extract_txt(path):
    # return the bare '<id>.txt' filename embedded in the path
    # (alternative: return the tuple (path, word_pattern.search(path).group()))
    return word_pattern.search(path).group()

In [None]:
start = time.time()
with Pool() as p:
    txt = p.map(extract_txt, data)
print(time.time()-start)

In [None]:
txt[:3]

In [None]:
df['txt'] = txt

In [None]:
df.to_csv('/oak/stanford/groups/malgeehe/celebs/chicago_results/name_pairs_table.csv')

Then filter `df` down to the rows whose `txt` value is in `in_period_multiple_names`, using `isin`.
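
A minimal sketch of that filter, assuming `df` has the `txt` column added earlier. The toy frame and the `keep` set are hypothetical stand-ins for `df` and `in_period_multiple_names`; turning the list into a set first drops any duplicates before it is passed to `isin`.

```python
import pandas as pd

# Hypothetical stand-in for df after the 'txt' column has been added.
toy = pd.DataFrame({
    'txt': ['a.txt', 'b.txt', 'a.txt', 'c.txt'],
    'name': ['W', 'X', 'Y', 'Z'],
})

# Hypothetical stand-in for in_period_multiple_names.
keep = set(['a.txt', 'c.txt'])

# Boolean mask: True where the row's filename is one of the wanted documents.
mask = toy['txt'].isin(keep)
filtered = toy[mask]
print(filtered)
```

On the real data this would read `df[df['txt'].isin(set(in_period_multiple_names))]`.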

In [None]:
df.head()