<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Set-up-directories" data-toc-modified-id="Set-up-directories-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Set up directories</a></span></li></ul></div>

***This notebook reads in the CSV of all 19th-century statistics and social science journals and creates an annual time series of their total number, with dummy indicators. Along the way it omits duplicates and cleans dates, among other steps.***

Import standard libraries

In [1]:
import glob 
import csv
import pandas as pd
import os

### Set up directories

Ensure the current working directory (CWD) is the scripts folder of the replication directory

In [2]:
os.getcwd()

'/Volumes/GoogleDrive/My Drive/02_Stanford/00_Researching/16_SocialScientization/-03_HM/00_replication/01_scripts'

In [3]:
directory = os.path.dirname(os.getcwd()) + "/"
data = directory + "00_data/"
journals_folder = data + "04_journals/"

Parse each bibliographic data file 

In [7]:
record_split = "-"*200
files = glob.glob(journals_folder+"DirectExport*")
with open(journals_folder+"19c_journals.csv", 'w', newline='') as csvf:  # newline='' as recommended for csv.writer
    writer = csv.writer(csvf)
    writer.writerow(['title', 'auth', 'lang', 'year', 'start', 'end', 'soc', 'stats'])
    for file in files:
        with open(file, 'r') as f:
            records = f.read()
            records = records.split(record_split)
            records.pop()
            record_n = -1
            for record in records: 
                record_n += 1
                line_n = -1
                lines = record.split('\n')
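                # Reset every field per record so nothing carries over from the previous one
                title = ""
                year = ""
                auth = ""
                lang = ""
                start = -99
                end = -99
                soc_sci = 0
                stats = 0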
                for line in lines:
                    line_n += 1
                    if "Title:" in line:
                        title = line.split("Title:")[-1]
                        title = title.strip()
                        title = title.replace("         ", " ")
                        if ":" in title: 
                            title += lines[line_n + 1]
                    if "Year:" in line: 
                        year = line.split(":")[-1].strip()
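                        # Only "s" is tested here: ("s" or "?") evaluates to "s", so values
                        # such as "1893-?" reach the range branch and are treated as open-ended
                        if "s" in year: 
                            start = -99
                            end = -99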
                        elif "-" in year: 
                            start = year.split("-")[0]
                            if len(year.split("-")[1]) == 4:
                                end = year.split("-")[1]
                            else: 
                                end = 1914
                        else: 
                            start = year
                            end = year
                    if "Corp Author(s):" in line:
                        auth = line.split(":")[-1].strip()
                    if "Language:" in line: 
                        lang = line.split(":")[-1].strip()
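                    if "Descriptor:" in line: 
                        # Set both flags explicitly so neither is left undefined or stale
                        soc_sci = 0
                        stats = 0
                        if "Social science" in line:
                            soc_sci = 1
                        elif "Statistics" in line:
                            stats = 1
                        # If nothing matched, check the following line, guarding against
                        # running past the end of the record
                        if soc_sci == 0 and stats == 0 and line_n + 1 < len(lines):
                            if "Social sciences" in lines[line_n + 1]:
                                soc_sci = 1
                            elif "Statistics" in lines[line_n + 1]:
                                stats = 1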
                writer.writerow([title, auth, lang, year, start, end, soc_sci, stats])
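
The year handling in the cell above can be read in isolation as a small function. The sketch below mirrors that logic under stated assumptions: `parse_year_range` and its `open_end` parameter are hypothetical names that do not appear in the notebook, and the reading of values containing "s" as decade-style entries (for example "1890s") is a guess about the data rather than something the export documents.

```python
def parse_year_range(year, open_end=1914):
    """Hypothetical helper mirroring the 'Year:' branch of the parsing cell.

    Returns (start, end). As in the notebook, parsed years stay strings,
    while the sentinel (-99) and the open-ended cutoff (1914) are ints.
    """
    year = year.strip()
    if "s" in year:
        # Assumption: values such as "1890s" cannot be dated precisely
        return -99, -99
    if "-" in year:
        parts = year.split("-")
        start = parts[0]
        # Only a four-character second part counts as an explicit end year;
        # "1879-" and "1893-?" are treated as open-ended
        end = parts[1] if len(parts[1]) == 4 else open_end
        return start, end
    # A single year is both start and end
    return year, year


print(parse_year_range("1879-2012"))
print(parse_year_range("1893-?"))
```

Keeping the rule in one function would make it easier to check edge cases such as "1893-?" without rerunning the whole export.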

Filter out publications with unidentified years and with duplicate entries

In [5]:
df = pd.read_csv(journals_folder+"19c_journals.csv")

In [6]:
df.head()

Unnamed: 0.1,Unnamed: 0,title,auth,lang,year,start,end,soc,stats
0,0,ProQuest statistical abstract of the United St...,USA; Department of Commerce and Labor; Bureau ...,English,1879-2012,1879,2012,0,1
1,1,Statistical abstract of the United States.,United States.; Bureau of Foreign and Domestic...,English,1879-,1879,1914,0,1
2,2,Statistisches Jahrbuch ... /,Baden (Germany). Statistisches Landesamt.,German,1868-1938,1868,1938,0,1
3,3,Statistique des grèves.,France.; Direction du travail.,French,1893-?,1893,1914,0,1
4,4,Statistical return (South Australia. Colonial ...,South Australia.; Colonial Secretary's Office.,English,1847,1847,1847,0,1


In [6]:
df.shape

(5196, 9)

In [7]:
df_filtered = df[df['start'] != -99]

In [8]:
df_filtered.head()

Unnamed: 0.1,Unnamed: 0,title,auth,lang,year,start,end,soc,stats
0,0,ProQuest statistical abstract of the United St...,USA; Department of Commerce and Labor; Bureau ...,English,1879-2012,1879,2012,0,1
1,1,Statistical abstract of the United States.,United States.; Bureau of Foreign and Domestic...,English,1879-,1879,1914,0,1
2,2,Statistisches Jahrbuch ... /,Baden (Germany). Statistisches Landesamt.,German,1868-1938,1868,1938,0,1
3,3,Statistique des grèves.,France.; Direction du travail.,French,1893-?,1893,1914,0,1
4,4,Statistical return (South Australia. Colonial ...,South Australia.; Colonial Secretary's Office.,English,1847,1847,1847,0,1


In [9]:
df_filtered.shape

(5196, 9)

In [10]:
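# Reassign instead of using inplace=True: df_filtered is a slice of df,
# and modifying a slice in place can trigger SettingWithCopyWarning
df_filtered = df_filtered.drop_duplicates(subset=['title', 'auth', 'start'], 
                                          keep='first', 
                                          ignore_index=True)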

In [11]:
df_filtered.shape

(5196, 9)
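
As a quick illustration of the two filtering steps above, here is a minimal sketch on a made-up frame. The titles, authors, and values are invented for the example and are not taken from the journal data; only the column names and the `-99` sentinel follow the notebook.

```python
import pandas as pd

# Toy data (hypothetical): two rows share title, auth, and start,
# and one row carries the -99 sentinel for an unidentified year
toy = pd.DataFrame({
    "title": ["Journal A", "Journal A", "Journal B"],
    "auth": ["Office X", "Office X", "Office Y"],
    "start": [1850, 1850, -99],
    "lang": ["English", "German", "French"],
})

# Step 1: drop rows whose start year is the sentinel
dated = toy[toy["start"] != -99]

# Step 2: keep the first row of each (title, auth, start) combination
kept = dated.drop_duplicates(subset=["title", "auth", "start"],
                             keep="first",
                             ignore_index=True)
print(kept)
```

Columns outside `subset` (here `lang`) play no part in the comparison, so the second "Journal A" row is dropped even though its language differs.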

In [28]:
df_filtered['year'].value_counts()

1912-?       41
1878-        38
1913-        36
1905-        33
1914-1948    30
             ..
1909-1925     1
1830          1
1891-1950     1
1869-1916     1
1898-1905     1
Name: year, Length: 1936, dtype: int64

In [12]:
# index=False avoids writing the row index as an extra "Unnamed: 0" column
df_filtered.to_csv(journals_folder+"19c_journals.csv", index=False)

Count the statistics and social science publications per year

In [391]:
stats_journals = {}
soc_journals = {}
for year in range(1803, 1915):
    # Start every year at zero before tallying
    stats_journals[year] = 0
    soc_journals[year] = 0
    for index, journal in df_filtered.iterrows():
        # range(start, end) excludes the end year, so a publication whose
        # start and end coincide is never counted
        active = year in range(journal['start'], journal['end'])
        if journal['soc'] == 1 and active:
            soc_journals[year] += 1
        if journal['stats'] == 1 and active:
            stats_journals[year] += 1
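
The nested loop above visits every journal once per year. A vectorized version of the same tally might look like the sketch below. It assumes `start` and `end` are integer columns and keeps the half-open interval of `range(start, end)`, so the end year is excluded. The name `count_active` and the toy frame are hypothetical and not part of the notebook.

```python
import pandas as pd


def count_active(df, flag, first_year=1803, last_year=1914):
    """Hypothetical helper: per-year count of rows with df[flag] == 1.

    A row counts for a year when start <= year < end, which matches
    `year in range(start, end)` for integer start and end.
    """
    subset = df[df[flag] == 1]
    return {
        year: int(((subset["start"] <= year) & (year < subset["end"])).sum())
        for year in range(first_year, last_year + 1)
    }


# Toy data (invented values) to exercise the helper
toy = pd.DataFrame({
    "start": [1803, 1805, 1804],
    "end": [1806, 1805, 1914],
    "soc": [1, 1, 0],
    "stats": [0, 0, 1],
})
soc_counts = count_active(toy, "soc")
stats_counts = count_active(toy, "stats")
print(soc_counts[1805], stats_counts[1913])
```

In the toy data the second row starts and ends in 1805, so it never contributes to any year, which is the same behavior as the loop.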

Create a data frame from the count dictionaries

In [409]:
years_df = pd.DataFrame.from_dict(stats_journals, 
                                  orient="index", 
                                  columns=["stats_journals"])

years_df['year'] = years_df.index
years_df.reset_index(drop=True, inplace=True)

In [410]:
years_df.head()

Unnamed: 0,stats_journals,year
0,14,1803
1,11,1804
2,11,1805
3,13,1806
4,13,1807


In [411]:
years_df['soc_journals'] = years_df['year'].map(soc_journals)

In [412]:
years_df

Unnamed: 0,stats_journals,year,soc_journals
0,14,1803,4
1,11,1804,4
2,11,1805,4
3,13,1806,5
4,13,1807,5
...,...,...,...
106,1955,1909,763
107,1990,1910,771
108,2059,1911,791
109,2136,1912,809


Export to Stata

In [414]:
years_df.to_stata("19c_journals.dta", write_index=False)