For the top `n` people in each of the two papers, get the distribution of article types in which they appear.

- Get lists of top n people per paper
- Merge and dedupe those
- Use resulting list to filter the annualized data
- Merge all of the relevant rows into a new data frame
- Group by name (and year?) and count articles per article type
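
The steps above can be sketched on toy data. This is a minimal illustration, not the notebook's actual run: the column names follow those used later in this notebook, but every value, and the names `totals`, `articles`, and `keep`, are invented for the example.

```python
import pandas as pd

# Toy mention totals per person and paper (values invented for illustration).
totals = pd.DataFrame({
    'person': ['A', 'B', 'E', 'C', 'D', 'A'],
    'paper': ['P1', 'P1', 'P1', 'P2', 'P2', 'P2'],
    'n_mentions': [30, 20, 1, 50, 40, 5],
})

# Toy per-article rows, standing in for the annualized files.
articles = pd.DataFrame({
    'person': ['A', 'A', 'B', 'C', 'D', 'D', 'E'],
    'paper': ['P1', 'P1', 'P1', 'P2', 'P2', 'P2', 'P1'],
    'year': [1920, 1920, 1921, 1920, 1921, 1921, 1922],
    'doc_type': ['news', 'ad', 'news', 'news', 'ad', 'ad', 'news'],
    'n_words': [100, 50, 80, 120, 30, 40, 10],
})

n = 2
# Top n people per paper; the set merges and deduplicates the two lists.
top = (totals.sort_values('n_mentions', ascending=False)
             .groupby('paper')
             .head(n))
keep = set(top['person'])

# Filter the article rows, then count articles per person and article type.
filtered = articles[articles['person'].isin(keep)]
counts = filtered.groupby(['person', 'doc_type']).size()
```

Here `E` falls outside the top two of either paper, so its rows are dropped before counting.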

In [1]:
import os
import time
from multiprocessing import Pool
import pandas as pd

In [2]:
NAMES = '/oak/stanford/groups/malgeehe/celebs/chicago_results/names_annual'

In [3]:
TOP = '/oak/stanford/groups/malgeehe/celebs/chicago_results/chicago_name_paper.csv'

In [4]:
META = '/oak/stanford/groups/malgeehe/celebs/chicago_results/chicago_1919-1939_meta.csv'

In [5]:
names = pd.read_csv(TOP)

In [6]:
names.head()

Unnamed: 0,person,paper,n_mentions
0,N. Clark,Chicago Daily Tribune,10358
1,- Cago,Chicago Daily Tribune,8054
2,N. Y.,Chicago Daily Tribune,6072
3,Van Buren,Chicago Daily Tribune,5497
4,W. Madison,Chicago Daily Tribune,5250


In [7]:
names.shape[0]

2761876

Get top n people per paper:

In [8]:
papers = names['paper'].unique()

In [9]:
N = 1000

In [10]:
L = []
for paper in papers:
    sub = names[names['paper'] == paper].sort_values('n_mentions', ascending=False)
    L.append(list(sub['person'][:N]))

In [11]:
len(L[0]), len(L[1])

(1000, 1000)

In [12]:
combined = set(L[0] + L[1])

In [13]:
len(combined)

1908

In [14]:
combined = list(combined)
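
The combined list is shorter than 2000 because some names rank in the top 1000 of both papers. If neither list contains a repeated name (an assumption, not checked above), the overlap is 2000 - 1908 = 92 names. A small sketch of that bookkeeping, with invented names and the hypothetical variables `top_a` and `top_b`:

```python
# Hypothetical top lists for two papers (names invented for illustration).
top_a = ['Smith', 'Jones', 'Brown', 'Clark']
top_b = ['Clark', 'Davis', 'Smith', 'Evans']

union = set(top_a + top_b)        # same construction as `combined`
shared = set(top_a) & set(top_b)  # names that appear in both lists

# With no repeats inside either list, the sizes obey inclusion-exclusion:
# |A ∪ B| = |A| + |B| - |A ∩ B|
overlap_from_sizes = len(top_a) + len(top_b) - len(union)
```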

Filter the annualized data:

In [15]:
names_files = [os.path.join(NAMES,x) for x in os.listdir(NAMES) if '._' not in x]

In [19]:
names_files = sorted(names_files)

In [21]:
def get_data(path):
    start = time.time()
    data = pd.read_csv(path)
    sub = data[data['person'].isin(combined)]
    total = round(time.time() - start)
    print('completed {} in {} seconds'.format(os.path.split(path)[1], total))
    return sub[['person','paper','year','doc_type', 'n_words']]

In [22]:
data = None

with Pool() as p:
    data = p.map(get_data, names_files)

completed 1921_names.csv in 40 seconds
completed 1926_names.csv in 46 seconds
completed 1922_names.csv in 53 seconds
completed 1927_names.csv in 62 seconds
completed 1925_names.csv in 64 seconds
completed 1920_names.csv in 66 seconds
completed 1928_names.csv in 72 seconds
completed 1919_names.csv in 73 seconds
completed 1924_names.csv in 75 seconds
completed 1929_names.csv in 80 seconds
completed 1939_names.csv in 82 seconds
completed 1923_names.csv in 83 seconds
completed 1930_names.csv in 85 seconds
completed 1936_names.csv in 89 seconds
completed 1938_names.csv in 102 seconds
completed 1934_names.csv in 105 seconds
completed 1931_names.csv in 109 seconds
completed 1933_names.csv in 111 seconds
completed 1932_names.csv in 112 seconds
completed 1937_names.csv in 115 seconds
completed 1935_names.csv in 121 seconds
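
Each file takes one to two minutes to process. Since only five columns are returned, one possible speed-up is to ask `pd.read_csv` for just those columns via `usecols`. This is an untested suggestion for these files, and it assumes the annual CSVs carry additional columns beyond the five. `get_data_slim` and `COLS` are hypothetical names, and the CSV text below is invented.

```python
import io
import pandas as pd

COLS = ['person', 'paper', 'year', 'doc_type', 'n_words']

def get_data_slim(source, keep):
    """Read only the needed columns, then keep rows whose person is in `keep`."""
    data = pd.read_csv(source, usecols=COLS)
    return data[data['person'].isin(keep)]

# Tiny in-memory stand-in for one annual file (contents invented).
csv_text = (
    "person,paper,year,doc_type,n_words,extra\n"
    "A,P1,1920,news,100,x\n"
    "B,P1,1920,ad,50,y\n"
)
sub = get_data_slim(io.StringIO(csv_text), {'A'})
```

Passing the names to keep as an argument, rather than reading a global, also makes the function easier to test in isolation.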


In [23]:
df = pd.concat(data)

In [25]:
g = df.groupby(['person','paper','year','doc_type']).sum() # total number of words

In [27]:
g.to_csv('/oak/stanford/groups/malgeehe/celebs/chicago_results/person_paper_year_doctype_ntoks.csv')

In [28]:
g = df.groupby(['person','paper','year','doc_type']).count() # count number of articles of type

In [30]:
g.to_csv('/oak/stanford/groups/malgeehe/celebs/chicago_results/person_paper_year_doctype_ndocs.csv')
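
The two annualized outputs differ only in the aggregation: `.sum()` adds up `n_words` within each group, while `.count()` counts the non-null rows. Note that `.count()` leaves the column labeled `n_words` even though it then holds article counts. A toy illustration with invented values (`df_toy` is a hypothetical name):

```python
import pandas as pd

# Three articles mentioning A and one mentioning B (values invented).
df_toy = pd.DataFrame({
    'person': ['A', 'A', 'A', 'B'],
    'doc_type': ['news', 'news', 'ad', 'news'],
    'n_words': [100, 40, 50, 80],
})

grouped = df_toy.groupby(['person', 'doc_type'])['n_words']
n_toks = grouped.sum()    # total words per person and article type
n_docs = grouped.count()  # number of rows (articles) per person and type
```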

Repeat the aggregations over all years (not annualized):

In [31]:
g = df.groupby(['person','paper','doc_type']).sum() # total number of words

In [34]:
# keep only the n_words column (the summed year column is meaningless)
g['n_words'].to_csv('/oak/stanford/groups/malgeehe/celebs/chicago_results/person_paper_doctype_ntoks.csv')

In [35]:
g = df.groupby(['person','paper','doc_type']).count() # count number of articles of type

In [37]:
g = pd.DataFrame(g['n_words'])

In [40]:
g.columns = ['n_docs']

In [42]:
g.to_csv('/oak/stanford/groups/malgeehe/celebs/chicago_results/person_paper_doctype_ndocs.csv')
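
As an alternative to the separate sum and count passes above, pandas named aggregation can produce both columns in one step with explicit labels, which avoids the later renaming. This is a sketch on invented data, not a replacement tested against the real files; `df_toy` and `summary` are hypothetical names.

```python
import pandas as pd

# Toy rows with the same columns used for the grouping above (values invented).
df_toy = pd.DataFrame({
    'person': ['A', 'A', 'A', 'B'],
    'paper': ['P1', 'P1', 'P1', 'P2'],
    'doc_type': ['news', 'news', 'ad', 'news'],
    'n_words': [100, 40, 50, 80],
})

# One pass: total words and article count per person, paper, and type.
summary = df_toy.groupby(['person', 'paper', 'doc_type']).agg(
    n_toks=('n_words', 'sum'),
    n_docs=('n_words', 'count'),
)
```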