# Cleaning References pre-1970s
This notebook cleans the results obtained from mturk for journal articles that were mostly published before 1970. There are two data sources, which has resulted in slight differences in the structure of the raw input data.

1. AWS MTURK - a service offered by AWS, output returns as a csv file
2. fMTURK - a clone of AWS MTURK specific to scholarly publishing where the output returns as a json file

Note that the variable naming conventions differ between the two sources even though both are structured data sets, so combining them requires some minor manipulation.
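
As a rough sketch of that manipulation, the two sources can be aligned by renaming columns through a mapping before concatenating. The column names and the mapping below are invented for illustration and are not the actual field names used by either service.

```python
import pandas as pd

# Hypothetical samples: the same fields under different names in each source.
mturk_like = pd.DataFrame({"Answer.title": ["Paper A"], "Answer.year": ["1961"]})
fmturk_like = pd.DataFrame({"title": ["Paper B"], "year": ["1958"]})

# Assumed mapping from the CSV-style names to the shared names.
rename_map = {"Answer.title": "title", "Answer.year": "year"}

combined = pd.concat(
    [mturk_like.rename(columns=rename_map), fmturk_like],
    ignore_index=True,
)
print(combined)
```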

Expected output: 
1. json files of reference matches 
2. json and csv files of references collected via manual interfaces
3. csv file of all input data
4. reconciliation of all input files vs output files to see which pages and files have been digitized
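
The reconciliation in item 4 amounts to a set comparison between the pages that were submitted and the pages that came back. A minimal sketch, using made-up file names in place of the real PDF URLs:

```python
# Hypothetical page identifiers; the notebook uses PDF URLs for this purpose.
submitted = {"1001_page-1.pdf", "1001_page-2.pdf", "1002_page-1.pdf"}
returned = {"1001_page-1.pdf", "1002_page-1.pdf"}

# Pages that were sent out but have no result yet.
missing = sorted(submitted - returned)
# Results that do not correspond to any submitted page (should be empty).
unexpected = sorted(returned - submitted)

print(missing)
print(unexpected)
```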

## Initial setup

In [1]:
# libraries required; pandas, numpy and unidecode need to be installed
import pandas as pd
from unidecode import unidecode
import re
from datetime import date
import json
import numpy as np
from os import listdir
from os.path import isfile, join
# set column options
pd.set_option('display.max_colwidth', None)
pd.set_option('display.max_rows', None)

In [2]:
# change base path to point to the results of the mturk data
# the expectation is that this was directly downloaded from the respective results interface
mturk_files_out="/Users/sijiawu/Downloads/thesis_docs/mturk_process/output_files/"
mturk_files_in="/Users/sijiawu/Downloads/thesis_docs/mturk_process/input_files/"
fmturk_files_out="/Users/sijiawu/Downloads/thesis_docs/fmturk/"

In [3]:
# load in journal metadata
JOURNALS= ['AER', 'JPE', 'ECTA', 'RES', 'QJE']
#read in all processed masterlists
j_data=pd.DataFrame()
for i in JOURNALS:
    j_data=pd.concat([pd.read_excel('/Users/sijiawu/Work/Thesis/Data/Combined/'+i+'_M_sco_du.xlsx'), j_data], ignore_index=True)
# drop exact duplicate rows that appear across the masterlists
j_data=j_data.drop_duplicates().reset_index(drop=True)

# Replace the journal names with their lowercase form, without the leading "The"
j_data.loc[j_data['journal']=="Econometrica",'journal']='econometrica'
j_data.loc[j_data['journal']=='The Quarterly Journal of Economics','journal']='quarterly journal of economics'
j_data.loc[j_data['journal']=='The Review of Economic Studies','journal']='review of economic studies'
j_data.loc[j_data['journal']=='Journal of Political Economy','journal']='journal of political economy'
j_data.loc[j_data['journal']=='The American Economic Review','journal']='american economic review'

#some corrections to the issue number (the first value is apparently "3-4" that was read in as a date)
j_data.loc[j_data["number"]=="2023-03-04 00:00:00","number"]="3--4"
j_data.loc[j_data["number"]=="4-5","number"]="4--5"
j_data.loc[j_data["number"]=="1-2","number"]="1--2"

j_data.journal.unique()

j_data["id"]=j_data["URL"].str.split("/").str[-1]




In [4]:
# load in mturk based files
ref_mturk = [f for f in listdir(mturk_files_out) if isfile(join(mturk_files_out, f))]
input_fs= [f for f in listdir(mturk_files_in) if isfile(join(mturk_files_in, f))]

In [6]:
# file for recoding the column names on mturk retrieved data. I had two batches that had differing naming conventions. This ensures consistency.
# NB: the name `dict` shadows the Python built-in; it is kept because a later cell passes it to rename().
with open('./031_recon/recoding.json') as json_data:
    dict = json.load(json_data)


In [7]:
# load fmturk data which is in json format.
with open('./031_recon/response_1725501854297.json', 'r') as file:
    ref_fmturk = json.load(file)

### Helper Functions

In [8]:
# remove leading and trailing punctuation and spaces (, * " ' . : and blanks)
def strip_leading(_str):
    return _str.strip(',*" \'.:')

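# normalise spacing and spelling
def replace_d_space(_str):
    temp=re.sub("[ ]+"," ",_str)      # collapse repeated spaces
    temp=re.sub("ze ","se ",temp)     # American to British spelling, e.g. "organize " becomes "organise "
    temp=re.sub("zation","sation",temp)
    # remove spaces around hyphens; two passes are needed because the first pass
    # only removes the space on one side of " - "
    temp=re.sub("- | -","-", temp)
    temp=re.sub("- | -","-", temp)
    return temp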

# this function:
# - transliterates to ASCII and removes all leading and trailing non-alphabet characters, in case some punctuation was copied
# - collapses repeated spaces
# - strips a leading "the" if the second argument is set to True
def strip_and_convert(str_, strip_the):
    if pd.isna(str_):
        return "None"
    temp=unidecode(str_)
    try:
        l = [x.isalpha() for x in temp].index(True)
        m = [x.isalpha() for x in temp[::-1]].index(True)
        # slice on the transliterated string, since unidecode can change the length
        temp=temp[l:len(temp)-m]
    except ValueError:
        # the value contains no alphabetic character at all
        print(temp)
        temp="none"
    if strip_the and temp[0:4].lower()=="the ":
        temp=temp[4:]
    temp=re.sub(' +', ' ', temp)
    temp=re.sub(' ,', ', ', temp)
    temp=temp.strip()
    return temp

roman_numerals = {"I" : 1,
                  "V" : 5,
                  "X" : 10,
                  "L" : 50,
                  "C" : 100,
                  "D" : 500,
                  "M" : 1000
                  }

def rom_to_dec(user_input):
    int_value = 0
    for i in range(len(user_input)):
        if user_input[i] in roman_numerals:
            if i + 1 < len(user_input) and roman_numerals[user_input[i]] < roman_numerals[user_input[i + 1]]:
                int_value -= roman_numerals[user_input[i]]
            else:
                int_value += roman_numerals[user_input[i]]
        else:
            print("Invalid input.")
            return "none"

    return int_value

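# True if the string contains only roman numeral characters
def rom_match(strg, search=re.compile(r'[^IVXLCDM]').search):
    return not bool(search(strg))

# True if the string contains only digits and decimal points
def number_match(strg, search=re.compile(r'[^0-9.]').search):
    return not bool(search(strg))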


## Formatting fMturk data

Each page can have zero or more references. This process flattens the json data into a pandas dataframe with one reference per row. I also take the opportunity to attach the metadata of the citing page (task id, pdf url, journal, year and completer) to each reference.
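
A minimal sketch of the flattening step on a made-up two-task sample. The field values are invented; only the layout, with `answer` stored as a JSON string, mirrors the fMturk records. Tasks whose answer cannot be parsed are set aside rather than dropped silently.

```python
import json

# Invented sample with the same shape as an fMturk task record.
tasks = [
    {"id": "t1", "year": 1972,
     "answer": '[{"id": 1, "type": 2, "title": "A"}, {"id": 2, "type": 8}]'},
    {"id": "t2", "year": 1965, "answer": "not valid json"},
]

rows, failed = [], []
for task in tasks:
    page_info = {"tasknum": task["id"], "year_o": task["year"]}
    try:
        references = json.loads(task["answer"])
    except (TypeError, ValueError):
        failed.append(task["id"])  # keep track of unparseable tasks
        continue
    for ref in references:
        rows.append({**ref, **page_info})  # one row per reference

print(len(rows), failed)
```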

In [9]:
ref_fmturk[0]

{'id': '665a9036e7af08e1933ac1f8',
 'batchId': 'qje_remnant_b523',
 'tasknum': 0,
 'pdf_url': 'https://myawsbucket-1231.s3.eu-west-3.amazonaws.com/QJE_shards/1880567_wo_cover_page-8.pdf',
 'issue': 'unavailable',
 'volume': 'unavailable',
 'journal': 'quarterly journal of economics',
 'year': 1972,
 'author_name': 'unavailable',
 'answer': '[{"id":1,"jstor":"","type":2,"author":"D. Jorgenson and J. Stephenson","title":"The Time Structure of Investment  Behavior in United States Manufacturing, 1947-1960","journal":"Review of Economics and Statistics","volume":"XLIX","issue":"Feb","year":"1967","pages":"none"},{"id":2,"jstor":"1813210","type":8},{"id":3,"jstor":"","type":2,"author":"none","title":"Investment Behavior and Neo-classical Theory","journal":"Review of Economics and Statistics","volume":"L","issue":"Aug","year":"1968","pages":"none"},{"id":4,"jstor":"","type":2,"author":"A Alchian","title":"Information costs, pricing, and resource unemployment","journal":"western economic jour

In [10]:
answers={}
issues_tasks=[]
a=0
for i in range(len(ref_fmturk)):
    temp={
        'tasknum':ref_fmturk[i]['id'],
        'pdf_url':ref_fmturk[i]['pdf_url'],
        'id_o': ref_fmturk[i]['pdf_url'].split("/")[-1].split('_')[0],
        'page_o':ref_fmturk[i]['pdf_url'].split("/")[-1].split('_')[3].split('-')[-1].split('.')[0],
        'issue_o':None, 
        'title_o':None,
        'volume_o':None,
        'authors_o':None,
        'journal_o':ref_fmturk[i]['journal'], 
        'year_o':ref_fmturk[i]['year'],
        'completer': ref_fmturk[i]['completer']
        }
    try:
        answer=json.loads(ref_fmturk[i]['answer'])
    except (TypeError, ValueError):
        # the answer is missing or is not valid json; record the task and move on
        issues_tasks.append(ref_fmturk[i]['id'])
        continue
    # print(i)
    for j in range(len(answer)):
        a=a+1
        # print(j)
        answers[a]=answer[j]|temp

rem_refs=pd.DataFrame(answers).transpose()
rem_refs=rem_refs.fillna("none")
rem_refs['journal']=rem_refs['journal'].str.strip().str.lower()
rem_refs=rem_refs.replace(to_replace='  ', value=' ', regex=True)

### Reconciliation of Journal names

In [11]:
with open('./031_recon/fmturk_recon.json', 'r', encoding='utf-8') as file:
    recon=json.load(file)

recon_split={}

for i in recon.keys():
    for j in recon[i]:
        recon_split[j]=i

for i in rem_refs.index:
    if rem_refs.loc[i,'journal'] in recon_split.keys():
        rem_refs.loc[i,'journal_proc'] =recon_split[rem_refs.loc[i,'journal']]

# order=list(rem_refs['journal'].unique())
# order.sort

# for i in order:
#     print('"'+i+'",')

### Reconciliation of years

In [12]:
with open('./031_recon/fmturk_year_recon.json', 'r', encoding='utf-8') as file:
    recon=json.load(file)

recon_split={}

for i in recon.keys():
    for j in recon[i]:
        recon_split[j]=i
    
year_pot=[]
nones=['none','n.d.','n. d.']
for i in rem_refs.index:
    try:
        if rem_refs.loc[i,'year']in nones:
            rem_refs.loc[i,'year_proc'] =0
        elif rem_refs.loc[i,'year']=='forthcoming':
            rem_refs.loc[i,'year_proc'] =3
        else:
            temp=int(float(rem_refs.loc[i,'year'].strip()))
            rem_refs.loc[i,'year_proc'] =temp
    except:
        rem_refs.loc[i,'year_proc'] =recon_split[rem_refs.loc[i,'year'].strip()]
        # print('"'+rem_refs.loc[i,'year']+'",')

In [None]:
def year_split(x):
    if x[0]=="none":
        return 0
    if x[0]=="wrong":
        return 1
    if x[0]=="forthcoming":
        return 3
    if x[0]=="uncertain":
        return 4
    if x[0]=="various":
        return 4
    if "," in x[0]:
        return int(x[0].split(',')[-1])
    else:
        if len(x)==1:
            return int(float(x[0]))
        if len(x)==2:
            return int(float(x[1]))
    
rem_refs["year_proc_split"]=rem_refs["year_proc"].astype(str)
rem_refs["year_proc_split"]=rem_refs["year_proc_split"].str.split('-')
rem_refs["year_latest"]=rem_refs["year_proc_split"].apply(lambda x: year_split(x))
rem_refs["year_latest"].unique()

### Reconciliation of issue

In [13]:
rem_refs["issue_proc"]=rem_refs['issue'].str.lower().str.strip()

#read in the issue recon file
issue_rec_f=None
with open("./031_recon/fmturk_issue_recon.json", 'r') as f:
    issue_rec_f=json.load(f)

#reformat and expand the issue recon file
issue_rec_f_split_out={}
for key in issue_rec_f.keys():
    for k in issue_rec_f[key]:
        issue_rec_f_split_out[k]=key  

issue_corr=[]
w_count=0
for i in rem_refs.index:
    val=rem_refs.loc[i,'issue_proc']
    ind=0
    if val in issue_rec_f_split_out.keys():
        val=issue_rec_f_split_out[val]
    if val in issue_rec_f_split_out.keys():
        rem_refs.loc[i,'issue_proc']=val
    #         print(val)
    else:
        w_count=w_count+1
        issue_corr.append(val)


### Reconciliation of Volume

In [14]:
rem_refs["volume_proc"]=rem_refs['volume'].str.upper().str.strip()

#read in the volume recon file
volume_rec_f=None
with open("./031_recon/fmturk_volume_recon.json", 'r') as f:
    volume_rec_f=json.load(f)

#reformat and expand the volume recon file
volume_rec_f_split_out={}
for key in volume_rec_f.keys():
    for k in volume_rec_f[key]:
        volume_rec_f_split_out[k]=key


volume_corr=[]
w_count=0
for i in rem_refs.index:
    val=rem_refs.loc[i,'volume_proc']
    ind=0
    if val in volume_rec_f_split_out.keys():
        val=volume_rec_f_split_out[val]
    if val in volume_rec_f_split_out.keys():
        rem_refs.loc[i,'volume_proc']=val
    #         print(val)
    else:
        w_count=w_count+1
        volume_corr.append(val)
# vol_rec_f=list(set(rem_refs["volume_proc"]))
# vol_rec_f.sort()
# for i in vol_rec_f:
#     print('"'+i+'":["'+i+'"],')

### Title recon

In [15]:
#pre-process the title with previous two functions
rem_refs["title_proc"]=rem_refs["title"].fillna("none").astype(str).str.lower()
rem_refs["title_proc"]=rem_refs["title_proc"].apply(strip_leading,1)
rem_refs["title_proc"]=rem_refs["title_proc"].apply(replace_d_space,1)
title_sort=list(rem_refs.title_proc.unique())
title_sort.sort()

j_data["title_proc"]=j_data["title"].fillna("none").astype(str).str.lower()
j_data["title_proc"]=j_data["title_proc"].apply(strip_leading,1)

len(title_sort) #total titles


2950

### Some statistics and data sample

In [16]:
rem_refs['type'].value_counts()

8    2547
3    1710
2    1328
4     730
1     139
7     137
6     115
Name: type, dtype: int64

In [17]:
j_list_app=["american economic review","econometrica", "journal of political economy", "quarterly journal of economics", "review of economic studies"]

In [18]:
rem_refs[rem_refs['journal_proc'].isin(j_list_app)].shape

(172, 30)

In [19]:
len(issues_tasks) #list of tasks with issues

31

In [20]:
rem_refs.columns

Index(['id', 'jstor', 'type', 'author', 'title', 'journal', 'volume', 'issue',
       'year', 'pages', 'tasknum', 'pdf_url', 'id_o', 'page_o', 'issue_o',
       'title_o', 'volume_o', 'authors_o', 'journal_o', 'year_o', 'completer',
       'publisher', 'location', 'chapter_title', 'text_full', 'journal_proc',
       'year_proc', 'issue_proc', 'volume_proc', 'title_proc'],
      dtype='object')

In [21]:
rem_refs.head(4)

Unnamed: 0,id,jstor,type,author,title,journal,volume,issue,year,pages,...,completer,publisher,location,chapter_title,text_full,journal_proc,year_proc,issue_proc,volume_proc,title_proc
1,1,,2,D. Jorgenson and J. Stephenson,"The Time Structure of Investment Behavior in United States Manufacturing, 1947-1960",review of economics and statistics,XLIX,Feb,1967,none,...,665123709458209e9841956e,none,none,none,none,review of economics and statistics,1967.0,feb,XLIX,"the time structure of investment behavior in united states manufacturing, 1947-1960"
2,2,1813210.0,8,none,none,none,none,none,none,none,...,665123709458209e9841956e,none,none,none,none,,0.0,none,NONE,none
3,3,,2,none,Investment Behavior and Neo-classical Theory,review of economics and statistics,L,Aug,1968,none,...,665123709458209e9841956e,none,none,none,none,review of economics and statistics,1968.0,aug,L,investment behavior and neo-classical theory
4,4,,2,A Alchian,"Information costs, pricing, and resource unemployment",western economic journal,VII,June,1969,none,...,665123709458209e9841956e,none,none,none,none,,1969.0,june,VII,"information costs, pricing, and resource unemployment"


## Format and load Mturk data

In [22]:
#read in each mturk file and rename the columns
All=[]
for i in ref_mturk:
    All.append(pd.read_csv(mturk_files_out+i).rename(columns=dict))
All=pd.concat(All, ignore_index=True)
All.sort_index(axis=1, inplace=True)
All.shape

(23517, 582)

In [23]:
#check for duplicates by checking that dataframe size is the same after dropping duplicates
check=All.drop_duplicates()
check.shape

(23517, 582)

In [24]:
sum(check["RejectionTime"].isna()==False)

122

In [25]:
# check remainder: input pages that do not yet have a non-rejected result
Allin=[]
p_url=list(All[All["RejectionTime"].isna()]["Input.pdf_url"]) # this does not include rejected assignments

for i in input_fs:
    Allin.append(pd.read_csv(mturk_files_in+i))

Allin=pd.concat(Allin, ignore_index=True)
Allin=Allin.drop_duplicates()

# prep remainder for processing
tmp=Allin[Allin["pdf_url"].isin(p_url)==False].reset_index(drop=True)
tmp["journal"].value_counts()


AER                           6974
QJE                           1210
JPE                             76
ECONOMETRICA                    75
Review of Economic Studies      32
Name: journal, dtype: int64

## Pre-processing
In this section, the mturk data is restructured so that each row holds a single entered reference. Some unnecessary fields from the mturk data are also dropped. The HITId is kept as a foreign key linking back to the raw data.
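
To illustrate the restructuring, the sketch below turns one wide row into one record per reference. The row is a simplified, invented stand-in that follows the `Answer.ref.N_field` column pattern, and only two fields are shown.

```python
# Simplified, invented wide row: one HIT with two reference slots filled.
wide_row = {
    "HITId": "hit-1",
    "Answer.ref.1_title": "Paper A",
    "Answer.ref.1_year": "1950",
    "Answer.ref.2_title": "Paper B",
    "Answer.ref.2_year": "1962",
    "Answer.ref.3_title": None,  # empty slot
    "Answer.ref.3_year": None,
}

long_rows = []
for slot in range(1, 4):
    title = wide_row.get(f"Answer.ref.{slot}_title")
    if title is None:
        continue  # nothing was entered in this slot
    long_rows.append({
        "tasknum": wide_row["HITId"],  # foreign key back to the raw data
        "id": slot,
        "title": title,
        "year": wide_row.get(f"Answer.ref.{slot}_year"),
    })

print(long_rows)
```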

In [26]:
# list all the column names
# list(All.columns)

In [26]:
# list of columns to drop
lst=['HITTypeId',
 'Title',
 'Description',
 'Keywords',
 'Reward',
 'CreationTime',
 'MaxAssignments',
 'RequesterAnnotation',
 'AssignmentDurationInSeconds',
 'AutoApprovalDelayInSeconds',
 'Expiration',
 'NumberOfSimilarHITs',
 'LifetimeInSeconds',
 'AssignmentId',
 'AssignmentStatus',
 'AcceptTime',
 'AutoApprovalTime',
 'ApprovalTime',
 'RejectionTime',
 'LifetimeApprovalRate',
 'Last30DaysApprovalRate',
 'Last7DaysApprovalRate',
 'SubmitTime']

All=All.drop(lst, axis=1)

# placeholder column for the number of references on each page
All["num_refs"] = None 

### Functions used to clean individual reference entries

In [27]:



# the following six functions expect to receive a row of the data from mturk and the number of the reference on the page,
# and each returns that reference as a dictionary.
# Please see a sample entry from Mturk for the fields expected for each type of data entry.
def process_article(x, num):
    article_dict={
        "type": 2,
        "author": x["Answer.ref."+str(num)+"_author"],
        "title": x["Answer.ref."+str(num)+"_title"],
        "journal": strip_and_convert(x["Answer.ref."+str(num)+"_journal"], True),
        "year": x["Answer.ref."+str(num)+"_year"],
        "volume": x["Answer.ref."+str(num)+"_vol"],
        "issue": x["Answer.ref."+str(num)+"_issue"],
        "pages": x["Answer.ref."+str(num)+"_pages"],  
    }
    
    return article_dict

def process_book(x, num):
    book_dict={
        "type": 3,
        "author": x["Answer.ref."+str(num)+"_author"],
        "title": x["Answer.ref."+str(num)+"_title"],
        "chapter_title": x["Answer.ref."+str(num)+"_chapter_title"],
        "year": x["Answer.ref."+str(num)+"_year"],
        "volume": x["Answer.ref."+str(num)+"_vol"],
        "location": x["Answer.ref."+str(num)+"_location"],
        "publisher": x["Answer.ref."+str(num)+"_publisher"],
        "pages": x["Answer.ref."+str(num)+"_pages"],  
    }
    return book_dict

def process_other(x, num):
    other_dict={
        "type": 4,
        "author": x["Answer.ref."+str(num)+"_author"],
        "title": x["Answer.ref."+str(num)+"_title"],
        "year": x["Answer.ref."+str(num)+"_year"],
        "publisher": x["Answer.ref."+str(num)+"_publisher"],
        "text_full": x["Answer.ref."+str(num)+"_textfull"],  
    }
    return other_dict

def process_news(x, num):
    news_dict={
        "type": 6,
        "year": x["Answer.ref."+str(num)+"_year"],
        "publisher": x["Answer.ref."+str(num)+"_publisher"],
        "text_full": x["Answer.ref."+str(num)+"_textfull"],  
    }
    return news_dict

def process_laws(x, num):
    law_dict={
        "type": 7,
        "year": x["Answer.ref."+str(num)+"_year"],
        "text_full": x["Answer.ref."+str(num)+"_textfull"],  
    }
    return law_dict

def process_jstor(x, num):
    jstor_dict={
        "type": 8,
        "year": x["Answer.ref."+str(num)+"_year"],
        "jstor": x["Answer.jstor_"+str(num)],  
    }
    return jstor_dict


# function to merge dictionaries
def Merge(dict1, dict2):
    return(dict2|dict1)

In [28]:
a=0
all_ref={}

# Clean out each reference (only rows without requester feedback are processed)
for j in All[All["RequesterFeedback"].isna()].index:
    # index labels equal positions because All was concatenated with ignore_index=True
    row=All.iloc[j]
    count=0
    for i in range(30):
        item=None
        temp={
            "id": i+1,
            "tasknum":row["HITId"],
            "id_o": row["Input.ID"],
            "page_o": row["Input.page"],
            "year_o": row["Input.year"],
            "journal_o": row["Input.journal"],
            "authors_o": row["Input.authors"],
            "title_o": row["Input.title"],
            "volume_o": row["Input.vol"],
            "issue_o": row["Input.issue"],
            'completer': row['WorkerId'],
            'pdf_url':row['Input.pdf_url']
        }

        if pd.isna(row["Answer."+str(i)+".1"]):
            continue
        elif row["Answer."+str(i)+".1"]==True:
            item={'type':1}
        elif row["Answer."+str(i)+".2"]==True:
            count=count+1
            item=process_article(row,i)
        elif row["Answer."+str(i)+".3"]==True:
            count=count+1
            item=process_book(row,i)
        elif row["Answer."+str(i)+".4"]==True:
            count=count+1
            item=process_other(row,i)
        elif row["Answer."+str(i)+".5"]==True:
            item={'type':5}
        elif row["Answer."+str(i)+".6"]==True:
            item=process_news(row,i)
            count=count+1
        elif row["Answer."+str(i)+".7"]==True:
            item=process_laws(row,i)
            count=count+1
        elif row["Answer."+str(i)+".8"]==True:
            item=process_jstor(row,i)
            count=count+1

        if item is not None:
            a=a+1
            all_ref[a]=temp|item

# NB: this if ladder could be replaced by a dispatch table; it is good enough for now



## Reconciliation of Metadata Fields

This step increases the match rate of each metadata field against the masterlist. The specific fields going through reconciliation are:
- journal name
- pages
- volume
- issue
- year
- title

The output from above is a dictionary of the separated out responses in json format.

In [29]:
# transpose the data into a dataframe
ar=pd.DataFrame.from_dict(all_ref).transpose()


In [30]:
ar["type"].value_counts()

3    13723
2    12482
4    11788
1     6566
8     1016
7      856
6      320
5       23
Name: type, dtype: int64

## Reconciling Journal names

There are many misspellings of journal names; this section corrects them. Process:
- Strip leading and trailing white spaces
- Ensure "the" has been removed from the beginning of the journal name; this was done during the previous step.
- Lowercase the list, get a unique list and sort alphabetically
- Go through the list and copy the variants into an array with the corrected spelling as the key, eg "AER":["AERS","AERB" ...],
- Format each misspelling into a dictionary in the form {misspelling: correction, ....,misspelling: correction} in preparation for replacing it in the dataset.
- Create a separate column in the data that is a copy of the journal column, journal_proc, cast it to string type and replace the errors in this column.

The result is a json file where each key is a corrected journal name mapped to an array of its misspellings, and a json file where each observed name maps to the corrected journal name, including journal names that are not in error.
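
The inversion and replacement steps can be sketched as follows. The journal names and misspellings are made up for the example; the real lists live in the reconciliation file.

```python
# Corrected name -> list of observed variants (invented examples).
corrections = {
    "american economic review": ["amer. econ. rev.", "american econ review"],
    "econometrica": ["econometrika"],
}

# Invert to variant -> corrected name for direct lookup.
variant_to_name = {
    variant: name for name, variants in corrections.items() for variant in variants
}

raw = ["  Amer. Econ. Rev. ", "Econometrika", "Economica"]
cleaned = []
for value in raw:
    key = value.strip().lower()
    cleaned.append(variant_to_name.get(key, key))  # unknown names pass through

print(cleaned)
```

Names that are not in the lookup are left unchanged, which makes it easy to list them afterwards and decide whether they need a new entry.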

In [31]:
unique_journals=list(ar[ar["type"]==2]["journal"].str.strip().str.lower().unique())
unique_journals.sort()
len(unique_journals)

1886

In [32]:
# load in the journal names from file
sort_info=None
with open("./031_recon/mturk_journal_name_recon.json", 'r') as f:
    sort_info = json.load(f)
    
# def convert(o):
#     if isinstance(o, np.int64): return int(o)  
#     raise TypeError
    
# sorted_list = sorted(sort_info.items())

# sorted_dict = {}
# for key, value in sorted_list:
#     sorted_dict[key] = value

# print(sorted_dict)
# with open("journal_name_recon.json", 'w') as f:
#     json.dump(sorted_dict, f, indent = 6,default=convert) 

### Format each journal name error

In [34]:
split_out={}

for key in sort_info.keys():
    for i in sort_info[key]:
        split_out[i]=key

for i in range(len(unique_journals)):
    if unique_journals[i] in split_out.keys():
        unique_journals[i]=split_out[unique_journals[i]]
        
for i in unique_journals:
    if i not in split_out.keys():
        split_out[i]=i

### Replace the journal names for the ones we care about

In [35]:
ar["journal_proc"]=ar['journal'].fillna("none").astype(str).str.lower().str.strip()
# joi=['american economic review', 'journal of political economy','econometrica','quarterly journal of economics', 'review of economic studies']
for i in ar[ar["journal_proc"].isna()==False].index:
    if ar.loc[i,'journal_proc'] in split_out.keys():
        ar.loc[i,'journal_proc']=split_out[ar.loc[i,'journal_proc']]

## Reconcile the years
1. fillna as "none" and convert the year column to type string, store it in a new column called year_proc
2. some workers entered a month or season followed by a space and then the year; split by space or comma and take the last value entered.
3. for each value, try to cast the string year to an int, either directly or after first converting it to a float
4. append all other cases to a list called year_corr, get a unique set and reconcile manually via a dictionary called year_rec
5. Format of errors is {correct_year: [year_error1, year_error2 ...], ...}
6. Save it to a file called mturk_year_recon.json. This is the file that is read in the next code block.
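
Steps 2 to 4 can be sketched as a small parser. The function name `parse_year` and the sample inputs are illustrative; values that cannot be parsed are returned as `None` so that they can be collected for manual reconciliation.

```python
import re

def parse_year(raw):
    """Return the year as an int, or None if it needs manual review."""
    # Keep the last token, so "Spring 1954" and "June, 1954" both give "1954".
    tokens = [t for t in re.split(r"[ ,]+", raw.strip()) if t]
    if not tokens:
        return None
    try:
        return int(float(tokens[-1]))
    except (ValueError, OverflowError):
        return None

samples = ["1954", "Spring 1954", "June, 1954", "1954.0", "n.d."]
parsed = [parse_year(s) for s in samples]
unresolved = sorted({s for s, p in zip(samples, parsed) if p is None})
print(parsed, unresolved)
```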

In [36]:
year_rec=None
with open("./031_recon/mturk_year_recon.json", 'r') as f:
    year_rec = json.load(f)

In [37]:
year_rec_split_out={}
for key in year_rec.keys():
    for i in year_rec[key]:
        year_rec_split_out[i]=key
# format the year_rec dictionary into one such that it is "error":"correction" form for each error.

In [39]:
ar["year_proc"]=ar['year'].fillna("none").astype(str)
ar["year_proc"]=ar["year_proc"].str.strip().str.split(',| ').str[-1]

year_corr=[] #this contains anything that isn't resolved in year_rec

for i in ar.index:
    proc_year="none"
    year_temp=re.sub("I", "1", ar.loc[i,"year_proc"]) # sub for I issue
    try:
        proc_year=int(float(year_temp))  #convert from float to int
        if (proc_year==0):
            ar.loc[i, 'year_proc']='none'
            continue  # otherwise the 'none' marker is overwritten with 0 below
        elif (proc_year<1000)|(proc_year>2000):
            # print(proc_year)
            ar.loc[i, 'year_proc']='wrong'
            continue
        ar.loc[i, 'year_proc']=proc_year
    except (ValueError, OverflowError):
        pst=0
        if (year_temp in year_rec_split_out.keys()):
            ar.loc[i, 'year_proc']=year_rec_split_out[year_temp]
            pst=1
        elif ("-" in year_temp) or ("/" in year_temp):
            split_year=re.sub("/", "-", year_temp).split('-')
            if (len(split_year[0])==4) and (len(split_year[1])==4):
                ar.loc[i, 'year_proc']=re.sub("/", "-", year_temp)
                pst=1
        if (ar.loc[i, "year_proc"]!="none") & (pst==0):
            year_corr.append(year_temp) #append to list if could not directly convert


year_corr_u=list(set(year_corr)) #get unique list
year_corr_u.sort()

In [39]:
# for i in year_corr_u: #because all years have been resolved the year_corr is empty
#     print('"'+i+'",')

Some year fields contain several years, eg 1954-1955, in which case I take the latest year. The masterlist data does not have any entries with multiple years.
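
For instance, taking the later year of a range could look like this (a sketch with an assumed helper name, not the notebook's `year_split`):

```python
def latest_year(value):
    """Return the last year in a value such as '1954-1955'."""
    parts = value.split("-")
    return int(parts[-1])

print(latest_year("1954-1955"))
print(latest_year("1954"))
```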

In [40]:
def year_split(x):
    if x[0]=="none":
        return 0
    if x[0]=="wrong":
        return 1
    if x[0]=="forthcoming":
        return 3
    if x[0]=="uncertain":
        return 4
    if x[0]=="various":
        return 4
    if len(x)==1:
        return int(x[0])
    if len(x)==2:
        return int(x[1])
    else:
        return 0
    
ar["year_proc_split"]=ar["year_proc"].astype(str)
ar["year_proc_split"]=ar["year_proc_split"].str.split('-')
ar["year_latest"]=ar["year_proc_split"].apply(lambda x: year_split(x))
ar["year_latest"].unique()

array([1969, 1964, 1968, 1945, 1951, 1953, 1957, 1965, 1967, 1966, 1926,
       1962, 1959, 1958, 1925, 1954, 1948, 1955,    0, 1946, 1942, 1949,
       1956, 1924, 1886, 1937, 1950, 1938, 1947, 1936, 1934, 1944, 1913,
       1935, 1941, 1961, 1940, 1960, 1927, 1928, 1930, 1933, 1952, 1943,
       1932, 1656, 1920, 1845, 1921, 1939, 1911, 1888, 1895, 1903, 1892,
       1869, 1834, 1929, 1914, 1917, 1963, 1918, 1970, 1931, 1919,    1,
       1908, 1923, 1916, 1922, 1907, 1912, 1906, 1909, 1897, 1870, 1910,
       1915, 1887, 1605, 1868, 1900, 1902, 1879, 1843, 1853, 1832, 1860,
       1844, 1848, 1891, 1890, 1893, 1871, 1901, 1864, 1863, 1899, 1904,
       1862, 1858, 1880, 1794, 1796, 1795, 1778, 1883, 1874, 1803, 1810,
       1872, 1819, 1821, 1823, 1831, 1812, 1846, 1809, 1875, 1811, 1801,
       1807, 1826, 1841, 1838, 1847, 1837, 1877, 1839, 1804, 1830, 1840,
       1824, 1859, 1772, 1836, 1856, 1878,    4, 1896,    3, 1898, 1852,
       1971, 1905, 1849, 1851, 1873, 1882, 1885, 18

## Volume reconciliation

1. fillna as "none" and convert the volume column to type string, store it in a new column called volume_proc
2. I expect volume to be either a roman numeral or an integer. The helper functions defined earlier convert roman numerals to decimals, detect roman numerals and detect that a piece of text contains only numbers.
3. for each value, try to cast the string to an int, either directly or after first converting it to a float
4. append all other cases to a list called vol_corr, get a unique set and reconcile manually via a dictionary called vol_rec
5. Format of errors is {correct_volume: [volume_error1, volume_error2 ...], ...}
6. Save it to a file called mturk_volume_recon.json. This is the file that is read in the next code block.
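
A compact sketch of steps 2 and 3, written independently of the helper functions defined earlier. It assumes well-formed roman numerals and returns `None` for anything that needs manual reconciliation; the function names are illustrative.

```python
import re

ROMAN = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100, "D": 500, "M": 1000}

def roman_to_int(text):
    total = 0
    for pos, char in enumerate(text):
        value = ROMAN[char]
        # A smaller numeral before a larger one is subtracted (IV = 4).
        if pos + 1 < len(text) and value < ROMAN[text[pos + 1]]:
            total -= value
        else:
            total += value
    return total

def normalise_volume(raw):
    text = raw.strip().upper()
    if re.fullmatch(r"[0-9]+(\.[0-9]+)?", text):
        return int(float(text))
    if re.fullmatch(r"[IVXLCDM]+", text):
        return roman_to_int(text)
    return None  # left for the manual reconciliation list

print([normalise_volume(v) for v in ["49", "xlix", "12.0", "vol. 3"]])
```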

In [41]:
vol_rec=None
with open("./031_recon/mturk_volume_recon.json", 'r') as f:
    vol_rec = json.load(f) 

In [42]:
vol_rec_split_out={}
for key in vol_rec.keys():
    for k in vol_rec[key]:
        vol_rec_split_out[k]=key

In [43]:
ar["volume_proc"]=ar["volume"].fillna("none").astype(str).str.upper()
vol_out=list(ar["volume_proc"].unique())
vol_out.sort()

w_count=0
n_count=0
vol_corr=[]

for i in ar.index:
    val=ar.loc[i,'volume_proc']
    if val in vol_rec_split_out.keys():
        val=vol_rec_split_out[val]
    if number_match(val):
        ar.loc[i,'volume_proc']=int(float(val))
    elif rom_match(val):
        ar.loc[i,'volume_proc']=rom_to_dec(val)
    elif val=="NONE":
        n_count=n_count+1
    elif val in vol_rec_split_out.keys():
        ar.loc[i,'volume_proc']=vol_rec_split_out[val]
        #print(val+" "+vol_rec_split_out[val])
    else:
        w_count=w_count+1
        vol_corr.append(val)
        print('"'+val+'",')


In [44]:
# NOTE THE output is empty because all issues have been resolved
vol_corr_u=list(set(vol_corr))
vol_corr_u.sort()
for a in vol_corr_u:
    print('"'+a+'":[\"'+a+'\"],')

In [45]:
# with open("/Users/sijiawu/Work/Refs Danae/volume_recon.json", 'w') as f:
#     json.dump(vol_rec, f, indent = 6) 

## Issue reconciliation

1. fillna as "none" and convert the issue column to type string, store it in a new column called issue_proc
2. I expect issue to be a roman numeral, an integer or a float. This part uses the same functions as the previous section for converting roman numerals to decimals, detecting roman numerals and detecting that a piece of text contains only numbers.
3. for each value, I first check whether it is to be corrected against the compiled file. Then I try to cast the string value of the issue to an int if it is a number, either directly or after first converting it to a float. If it is not a number, I check for a roman numeral and convert that to an integer. If it fails all the previous conditions, I check whether it is None or a string value designated as valid in the corrections file.
4. append all other cases to a list called issue_corr, get a unique set and reconcile manually via a dictionary called issue_rec, which is saved in the file mturk_issue_recon.json
5. Format of errors is {correct_issue: [issue_error1, issue_error2 ...], ...}
6. Iteratively perform the above until all errors are resolved. Save it to mturk_issue_recon.json. This is the file that is read in the next code block.
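
The iterative part of this process (steps 4 to 6) can be sketched as: apply the current corrections, collect whatever is still unresolved, and print it in a form that can be pasted into the corrections file for the next round. The values and the `is_resolved` rule below are simplified assumptions.

```python
import json

# Current corrections: corrected value -> observed variants (invented).
issue_rec = {"1": ["NO. 1", "NO 1"], "SUPPLEMENT": ["SUPPL.", "SUPP"]}
lookup = {variant: key for key, variants in issue_rec.items() for variant in variants}

def is_resolved(value):
    # Simplified rule: plain integers, known corrected keys and "NONE" are resolved.
    return value.isdigit() or value in issue_rec or value == "NONE"

observed = ["NO. 1", "3", "SUPP", "FASC. 1", "NONE", "FASC. 1"]
corrected = [lookup.get(o, o) for o in observed]
unresolved = sorted({v for v in corrected if not is_resolved(v)})

# Template entries to review by hand and merge into the corrections file.
template = {value: [value] for value in unresolved}
print(json.dumps(template, indent=2))
```

Once the printed entries have been edited and merged into the file, rerunning the cell should produce a shorter list, until nothing is left.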


In [46]:
issue_rec=None
with open("./031_recon/mturk_issue_recon.json", 'r') as f:
    issue_rec=json.load(f) 

In [47]:
# make key-value pairs that can easily replace things.
issue_rec_split_out={}
for key in issue_rec.keys():
    for k in issue_rec[key]:
        issue_rec_split_out[k]=key 

In [48]:
ar["issue_proc"]=ar["issue"].fillna("none").astype(str).str.upper()
issue_out=list(ar["issue_proc"].unique())
issue_out.sort()

w_count=0
n_count=0
issue_corr=[]

for i in ar.index:
    val=ar.loc[i,'issue_proc']
    ind=0
    if val in issue_rec_split_out.keys():
        val=issue_rec_split_out[val]
    if number_match(val):
        ar.loc[i,'issue_proc']=int(float(val))
    elif rom_match(val):
        ar.loc[i,'issue_proc']=rom_to_dec(val)
    elif val=="NONE":
        n_count=n_count+1
        ar.loc[i,'issue_proc']=val
    elif val in issue_rec_split_out.keys():
        ar.loc[i,'issue_proc']=val
#         print(val)
    else:
        w_count=w_count+1
        issue_corr.append(val)
        # print('"'+str(val)+'"')
  


In [49]:
issue_corr_u=list(set(issue_corr))
issue_corr_u.sort()
for a in issue_corr_u:
    print('"'+a+'",')

"1,2,54",
"166-69",
"2D SERIES",
"7-9",
"FASC. 1",
"FASC. 2A",
"FASC. 3",
"FASC. 4A",
"FIRST SEMESTER",
"SONDERHEFT 41",
"SUPPLEMENT IX",


## Titles

For titles, I only compile corrections for those of journal articles. 
1. lowercase the text and fill the na values with "none". Strip leading and trailing characters that don't belong in titles. Assign these titles to a new column: title_proc
2. replace americanized spelling, replace characters that don't belong in titles
3. 46989 total references with 26603 unique titles. There are 12561 journal article references, of which 9913 are unique. I only care about those in the top 5 econ journals: there are 5074 top 5 journal references, of which 3172 have unique titles. After cleaning for duplicates and spelling mistakes, the top 5 article titles reduce to 2822 unique titles.
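
The effect of steps 1 and 2 on the number of unique titles can be illustrated with a small invented sample. The normalisation here is deliberately simpler than the notebook's helper functions.

```python
import re

def normalise_title(raw):
    title = raw.lower().strip(',*" \'.:')       # drop stray punctuation at both ends
    title = re.sub(r" +", " ", title)           # collapse repeated spaces
    title = re.sub(r"zation", "sation", title)  # align American and British spelling
    return title

titles = [
    '"The Organization of Industry."',
    "the  organisation of industry",
    "The Organization of Industry",
]
unique_raw = set(titles)
unique_clean = {normalise_title(t) for t in titles}
print(len(unique_raw), len(unique_clean))
```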

In [50]:
ar["type"].value_counts()

3    13723
2    12482
4    11788
1     6566
8     1016
7      856
6      320
5       23
Name: type, dtype: int64

In [51]:
#pre-process the title with previous two functions
ar["title_proc"]=ar["title"].fillna("none").astype(str).str.lower()
ar["title_proc"]=ar["title_proc"].apply(strip_leading,1)
ar["title_proc"]=ar["title_proc"].apply(replace_d_space,1)
title_sort=list(ar.title_proc.unique())
title_sort.sort()

j_data["title_proc"]=j_data["title"].fillna("none").astype(str).str.lower()
j_data["title_proc"]=j_data["title_proc"].apply(strip_leading,1)

len(title_sort) #total titles


22930

In [52]:
print(len(ar[(ar['type']==2) & (ar['journal_proc'].isin(j_list_app)==True)]["title_proc"]))
print(len(ar[(ar['type']==2) & (ar['journal_proc'].isin(j_list_app)==True)]["title_proc"].unique()))

5039
3156


In [54]:
#read in the title recon file
title_rec=None
with open("./031_recon/mturk_title_recon.json", 'r') as f:
    title_rec=json.load(f) 

#reformat and expand the title recon file
title_rec_split_out={}
for key in title_rec.keys():
    for k in title_rec[key]:
        title_rec_split_out[k]=key  

In [55]:
for i in ar.index:
    if ar.loc[i,"title_proc"] in title_rec_split_out.keys():
        ar.loc[i,"title_proc"]=title_rec_split_out[ar.loc[i,"title_proc"]]

## Merge two data sets
Merge the fMturk and Mturk origin data
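
If the two frames do not share exactly the same columns, concatenation leaves missing values where a column exists in only one source. A small sketch with invented columns shows the behaviour:

```python
import pandas as pd

# Invented stand-ins for the two cleaned data sets.
mturk_part = pd.DataFrame({"title_proc": ["a"], "year_latest": [1950]})
fmturk_part = pd.DataFrame(
    {"title_proc": ["b"], "year_latest": [1962], "jstor": ["1813210"]}
)

merged = pd.concat([mturk_part, fmturk_part]).reset_index(drop=True)

# Columns present in only one source are filled with NaN for the other.
print(merged)
print(merged["jstor"].isna().tolist())
```

Comparing the two column lists before merging is a cheap way to spot such gaps.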

In [56]:
t1=rem_refs.columns
t2=ar.columns

In [58]:
fullset=pd.concat([ar,rem_refs]).reset_index(drop=True)

In [59]:
fullset.to_pickle("pre_1970s")