# Author Name Recon

This notebook cleans the author names of every paper in the top five economics journals, for all available years, using JSTOR metadata. Running the cells produces a reconciliation file for the author names as they appear, in raw form, in the masterlists: for each author of an article, it records what went in, what came out, and the possible aliases. Only Miscellaneous content is excluded.

## Notes

* Some of the JPE citation files acquired from the Chicago Journals website have an error where special characters are not encoded properly. I have created a manual resolution for these files, but this should be fixed further up the pipeline. Only 33 names have this problem, and it cannot be resolved with the unidecode library. The affected files are:

{'uchicago_jpe126_1.bib',
 'uchicago_jpe126_3.bib',
 'uchicago_jpe126_5.bib',
 'uchicago_jpe126_6.bib',
 'uchicago_jpe126_S1.bib',
 'uchicago_jpe127_1.bib',
 'uchicago_jpe127_2.bib',
 'uchicago_jpe127_3.bib',
 'uchicago_jpe127_4.bib',
 'uchicago_jpe127_5.bib',
 'uchicago_jpe128_1.bib',
 'uchicago_jpe128_10.bib',
 'uchicago_jpe128_11.bib',
 'uchicago_jpe128_12.bib',
 'uchicago_jpe128_2.bib',
 'uchicago_jpe128_4.bib',
 'uchicago_jpe128_5.bib',
 'uchicago_jpe128_7.bib',
 'uchicago_jpe128_8.bib',
 'uchicago_jpe128_9.bib'} 
* Some special characters resolve to an empty string when unicodedata.normalize() is used to reduce them to ASCII. For example, the Danish ø (o with a slash) resolves to an empty string rather than o. I have written a function and an accompanying recon file that fix this.
* There are some authors who:
    * appear only as initials (even the last name is contracted)
    * are referred to only by their last name because of how well known they are
    * belong to a group so large that only the primary author is named and the others are contracted into a catch-all term like "other", "company" or "co."
    
    These cases are limited, and I recommend resolving them here as they are easily picked out. Todo: manually compile a recon file to replace these.
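
The normalization gap described in the notes above can be reproduced in a few lines. This is a minimal sketch: `to_ascii` and the `fallback` mapping are hypothetical names used for illustration, and the mapping shown is not the content of the actual recon file.

```python
import unicodedata

def to_ascii(text, fallback=None):
    """Strip accents via NFKD; use `fallback` for characters with no decomposition."""
    fallback = fallback or {}
    out = []
    for ch in text:
        if ch.isascii():
            out.append(ch)
            continue
        # NFKD splits "é" into "e" plus a combining accent; encoding to ASCII drops the accent
        stripped = unicodedata.normalize("NFKD", ch).encode("ascii", "ignore").decode("ascii")
        # "ø" has no decomposition, so `stripped` is empty and a manual mapping is needed
        out.append(stripped if stripped else fallback.get(ch, ""))
    return "".join(out)

print(to_ascii("josé"))                # accent removed
print(to_ascii("søren"))               # the ø is lost entirely
print(to_ascii("søren", {"ø": "o"}))   # the manual mapping restores it
```

Characters that survive only through the manual mapping are the ones that need an entry in the replacement file.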

**Input**: 
* each of the cleaned masterlists with file ending in \_M_sco_du.xlsx 

**Output**: 
* a JSON file of reconciled and cleaned author names and associated article URLs
* a list of UTF-8 characters and the ASCII equivalents to which they should resolve
* a list of author name corrections for JPE errors
* TODO: a list of author name corrections for contraction cases; this still needs article URL matching

## Import libraries

In [84]:
import pandas as pd
import re
import json
# set column options
pd.set_option('display.max_colwidth', None)
pd.set_option('display.max_rows', None)

## Read in input files

In [85]:
JOURNALS= ['AER', 'JPE', 'ECTA', 'RES', 'QJE']
base_path="/Users/sijiawu/Work/Thesis/Data"
input_base_path=base_path+"/013_split_data"
output_base_path=base_path+"/020_author_names_recon"
#read in all processed masterlists

#NB this is the output from notebook 013_split_data
j_data=pd.read_pickle(input_base_path+"/013_merged_proc_scopus_inception_2020_w_counts.pkl")
j_data['year']=j_data["year"].astype(int)

In [86]:
j_data.columns # verify headers

Index(['issue_url', 'author', 'title', 'journal', 'volume', 'number', 'pages',
       'year', 'ISSN', 'abstract', 'URL', 'publisher', 'content_type', 'type',
       'jid', 'author_split', 'urldate', 'reviewed-author', 'uploaded',
       'title_10', 'URL_og', 'number_og', 'title_og', 'author_og', 'pages_og',
       'j_fix', 'scopus_jid', 'scopus_id', 'scopus_authorgroup',
       'scopus_authors', 'scopus_affiliations', 'scopus_references',
       'scopus_author_full_names', 'scopus_title', 'scopus_year',
       'scopus_source_title', 'scopus_volume', 'scopus_issue', 'scopus_art_no',
       'scopus_page_start', 'scopus_page_end', 'scopus_page_count',
       'scopus_cited_by', 'scopus_doi', 'scopus_abstract', 'scopus_publisher',
       'scopus_document_type', 'scopus_publication_stage',
       'scopus_open_access', 'scopus_source', 'scopus_eid', 'scopus_title_og',
       'scopus_volume_og', 'scopus_issue_og', 'scopus_page_start_og',
       'scopus_page_end_og', 'scopus_year_og', 's_fix', 

In [87]:
len(j_data["URL"].unique()) #number of articles

62257

In [88]:
j_data[j_data['URL']=="https://www.jstor.org/stable/27804963"].values

array([['https://www.jstor.org/stable/10.2307/i27804949',
        'Victoria Ivashina and David Scharfstein',
        'loan syndication and credit cycles',
        'The American Economic Review', 100, '2', '57-61', 2010, 28282,
        nan, 'https://www.jstor.org/stable/27804963',
        'American Economic Association', 'Article', 'S', 'aer', nan,
        datetime.datetime(2023, 9, 4, 0, 0), nan, 1.0,
        'Loan Syndication and Credit Cycles',
        'https://www.jstor.org/stable/27804963', '2',
        'Loan Syndication and Credit Cycles', nan, '57-61', 1.0, 'aer',
        '10.1257/aer.100.2.57', True, True, True, True,
        'Ivashina, Victoria (26323576100); Scharfstein, David (7003408304)',
        'loan syndication and credit cycles', '2010',
        'American Economic Review', '100', '2', nan, '57', '61', 4.0,
        65.0, '10.1257/aer.100.2.57', '[No abstract available]', nan,
        'Conference paper', 'Final', nan, 'Scopus', '2-s2.0-77956122893',
        'Loan syndicat

In [89]:
# must account for duplicates due to scopus duplication - scopus has one jstor to many scopus entries
j_data[['content_type', 'URL']].drop_duplicates()['content_type'].value_counts()

content_type
Article       32678
Review        13813
MISC          12542
Comment        1419
Reply           832
Discussion      742
Rejoinder       153
Errata           78
Name: count, dtype: int64

## Create a restricted character set

In [90]:
char_set="abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ ,'.-"
chars=[*char_set]

## Constants

In [91]:
# titles that do not matter
auth_ad=[
    "c.p.a.",
    "m.e.",
    "m.d.",
    "s.j.", #society of jesus
    "s. j.", #society of jesus
    # "wm.",  #contraction of the given name William
]

# suffixes that matter for distinguishing authors
auth_ad2=[
    "2nd",
    "3rd",
    "jr.", #junior
    "yr." #misspelling of junior. Yordon case
]

## Resolve the JPE formatting issue

Generate the file using the code block at the end of the notebook and manually resolve the errors. Then rename the file to jpe_problem_names.json, the name read in below, and run the next two code blocks.

In [92]:
with open(output_base_path+'/jpe_problem_names.json') as f: 
    data = f.read() 
names_repl = json.loads(data) 

In [93]:
for i in j_data.index:
    if j_data.loc[i, "author"] in names_repl.keys():
         j_data.loc[i, "author"]=names_repl[j_data.loc[i, "author"]]
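
The row-by-row lookup above can also be written as a single mapping step. The sketch below is an alternative, not the notebook's method; `authors` and the single entry in `names_repl` are made-up stand-ins for `j_data["author"]` and the manually resolved file.

```python
import pandas as pd

# Hypothetical stand-ins for j_data["author"] and the manually resolved names_repl
authors = pd.Series(["Jos{\\'e} Example", "Jean Tirole", None], dtype="object")
names_repl = {"Jos{\\'e} Example": "Jose Example"}

# Look each name up in the mapping; keep the original value where there is no entry
fixed = authors.map(names_repl).fillna(authors)
print(fixed.tolist())
```

Names without an entry, including missing values, pass through unchanged.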

## Resolve special characters outside ascii

As above, generate the file using the code at the end of the notebook, or copy in an existing file.

In [94]:
with open(output_base_path+'/spec_char_replacement.json') as f: 
    data = f.read() 
js = json.loads(data) 

def replace_spec_chars(str_in):
    hold=""
    for o in str_in:
        if o.lower() in js.keys():
            hold=hold+js[o.lower()]
        else:
            hold=hold+o
    return hold

In [95]:
for i in j_data.index:
    if pd.isna(j_data.loc[i, "author"])==False:
        j_data.loc[i,'author']=re.sub('\xa0'," ",j_data.loc[i,'author'])
        j_data.loc[i,'author']=re.sub(r'\s+', " ", j_data.loc[i,'author'])
        if j_data.loc[i, "author"].isascii()==True:
            continue
        j_data.loc[i, "author"]=replace_spec_chars(j_data.loc[i,"author"])
        if j_data.loc[i, "author"].isascii()==False:
            print(j_data.loc[i, ["author","URL"]])



Create a new column where all author names are split by " and ".

In [96]:
j_data['author_split']=j_data['author'].str.split(" and ")
j_data['author_count']=j_data['author_split'].apply(lambda x: len(x) if isinstance(x, list) else 0)
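
To make the effect of the split concrete, here is a small sketch on a toy frame. The frame `toy` is an assumption for illustration; only the two column expressions mirror the cell above.

```python
import pandas as pd

# Toy frame standing in for j_data; the third row has no author recorded
toy = pd.DataFrame({
    "author": pd.Series(
        ["Victoria Ivashina and David Scharfstein", "Jean Tirole", None],
        dtype="object",
    )
})

# Split on the literal " and " separator; missing authors stay missing
toy["author_split"] = toy["author"].str.split(" and ")
# Count the names per row, treating a missing author as zero authors
toy["author_count"] = toy["author_split"].apply(lambda x: len(x) if isinstance(x, list) else 0)
print(toy[["author_split", "author_count"]])
```

One caveat of this approach is that any name which itself contains " and " would be split into two authors.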

## Process author names

In [97]:
#store the aliases in lists for comparison later
a1=[]
a2=[]
i1=[]
ln=[]


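#function for creating aliases
def proc_auth(auth_str):
    """Build aliases for one lower-cased author name.

    Returns [a1, a2, a3, last_name] when the name has more than one token:
      a1: first name + middle initials + last name
      a2: initials of all given names + last name
      a3: first initial + last name
    Returns a 5-element list (the name repeated four times, then 0) when no
    alias can be built: every token contains a period, or there is one token.
    Side effect: appends to the module-level lists a1, a2, i1 and ln.
    """
    suff=""
    auth_split=auth_str.split(" ")
    # a remaining comma marks a trailing designation: keep it as a suffix
    if "," in auth_str:
        suff=", "+auth_str.split(", ")[-1]
        auth_split=auth_str.split(",")[0].split(" ")
    # count the tokens that contain a period (initials or contractions)
    check=0
    for j in auth_split:
        if "." in j:
            check=check+1
    # every token is already contracted: nothing to abbreviate further
    if check==len(auth_split):
        return [auth_str, auth_str, auth_str, auth_str, 0]
    
    if len(auth_split)>1:
        init_auth=auth_split[0][0]+'. '+auth_split[-1]
        alt_auth=""
        alt_2_auth=auth_split[0]+" "
        # a1: contract the middle names only
        if len(auth_split)>2:
            for k in auth_split[1:-1]:
                alt_2_auth+=k[0]+'. '
        alt_2_auth+=auth_split[-1]
        
        # a2: contract every given name
        for k in auth_split[:-1]:
            alt_auth+=k[0]+'. '
        alt_auth+=auth_split[-1]
        # the global lists keep the suffix; the returned aliases do not
        ln.append(auth_split[-1])
        a1.append(alt_2_auth+suff)
        a2.append(alt_auth+suff)
        i1.append(init_auth+suff)
        
        return [alt_2_auth, alt_auth, init_auth, auth_split[-1]]
    else:
        return [auth_str, auth_str, auth_str, auth_str, 0]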
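
As a compact illustration of the three alias types, the sketch below rebuilds them for a single name. `make_aliases` is a hypothetical, simplified helper: it ignores suffixes, the all-initials case, and the global bookkeeping lists used by `proc_auth`.

```python
def make_aliases(name):
    """Return (a1, a2, a3) for a lower-cased 'first [middle ...] last' name."""
    parts = name.split()
    if len(parts) < 2:
        # a single token cannot be contracted any further
        return name, name, name
    first, middle, last = parts[0], parts[1:-1], parts[-1]
    middle_initials = [m[0] + "." for m in middle]
    a1 = " ".join([first] + middle_initials + [last])           # middle names contracted
    a2 = " ".join([first[0] + "."] + middle_initials + [last])  # all given names contracted
    a3 = first[0] + ". " + last                                 # first initial + last name
    return a1, a2, a3

print(make_aliases("wesley clair mitchell"))
# ('wesley c. mitchell', 'w. c. mitchell', 'w. mitchell')
```

Each step discards more information, which is why the number of unique names shrinks from a1 to a2 to a3.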


In [98]:
for i in j_data.index:
    authors=j_data.loc[i,"author_split"]

In [99]:
j_data['jid'].unique()

array(['qje', 'aer', 'ecta', 'jpe', 'res'], dtype=object)

Note: an earlier version of the code below did not handle authors with a designation such as m.e. at the end of their name. This has since been fixed.

In [100]:
a=1
all_authors=[]
all_authors_a=[]
autht=0
proc_auths_all={}
for i in j_data.index:
    authors=j_data.loc[i,"author_split"]
    proc_auths={
                "authors":{}, 
                "year":j_data.loc[i,"year"], 
                'content_type':j_data.loc[i, "content_type"],
                'jid':j_data.loc[i, "jid"]
                }
    if pd.isna(j_data.loc[i,"author"])==False:
        for j in range(len(authors)):
            sa=authors[j].lower()
            all_authors_a.append(str(sa))
            autht=autht+1

            o=[]
            
            # remove the author titles that are not necessary if present
            for k in auth_ad:
                p=", "+k
                if (p in sa):
                    # escape the title so that its periods match literally
                    sa=re.sub(re.escape(p),'',sa)
                    break

            # remove and store the suffixes of an author if present
            for k in auth_ad2:
                r=", "+k
                if sa.endswith(r):
                    # suffix separated by a comma, e.g. ", jr."
                    o.append(k)
                    # escape the suffix so that its period matches literally
                    sa=re.sub(re.escape(r),'',sa)
                    break
                elif sa.endswith(" "+k):
                    # suffix separated by a space only, e.g. " jr."
                    sa=re.sub(re.escape(" "+k),"",sa)
                    o.append(k)
                    break
                    
                        
            if "," in sa:
                reorg=sa.split(r", ")
                sa=reorg[1].strip()+ " "+ reorg[0].strip()
           
            all_authors.append(sa)
            proc_auths["authors"][j]={}
            proc_auths["authors"][j]["raw"]=authors[j]
            proc_auths["authors"][j]["init"]=sa
            proc_auths["authors"][j]['auth_suffix']=o
            aliases=proc_auth(sa)
            if len(aliases)==5:
                k=1
                # print(sa)
                # print(j_data.loc[i,"author"])
                # print(j_data.loc[i, "URL"])
                # print(j_data.loc[i, "year"])
                # print(j_data.loc[i, "journal"])

            proc_auths["authors"][j]["a1"]=aliases[0]
            proc_auths["authors"][j]["a2"]=aliases[1]
            proc_auths["authors"][j]["a3"]=aliases[2]

            if ("[" in sa) or ("(" in sa):
                k=1
                # print(sa)
                # print(j_data.loc[i,"author"])
                # print(j_data.loc[i, "year"])
                # print(j_data.loc[i, "journal"])
                # print(j_data.loc[i, "URL"])

    else:
        a=a+1
    proc_auths_all[j_data.loc[i,"URL"]]=proc_auths
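
The suffix handling and the "last, first" reordering in the loop above can be isolated into two small helpers. This is a sketch under assumptions: `split_suffix` and `reorder` are hypothetical names, and the example name is made up. Unlike the loop, `reorder` splits on the first comma only.

```python
SUFFIXES = ["2nd", "3rd", "jr.", "yr."]  # mirrors auth_ad2 above

def split_suffix(name):
    """Return (name_without_suffix, suffix_or_None) for a lower-cased name."""
    for suffix in SUFFIXES:
        # try the comma-separated form first, then the space-separated form
        for sep in (", ", " "):
            tail = sep + suffix
            if name.endswith(tail):
                return name[: -len(tail)], suffix
    return name, None

def reorder(name):
    """Turn 'last, first' into 'first last'; leave other names unchanged."""
    if ", " in name:
        last, first = name.split(", ", 1)
        return first.strip() + " " + last.strip()
    return name

name, suffix = split_suffix("doe, john a., jr.")
print(reorder(name), suffix)
```

Using `str.endswith` and slicing avoids regular expressions altogether, so the periods in "jr." never act as wildcards.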

## Save the output

In [101]:

with open(output_base_path+"/020_author_proc.json", "w") as outfile: 
    json.dump(proc_auths_all, outfile, indent=4, default=int)

j_data.to_pickle(output_base_path+"/020_merged_proc_scopus_inception_with_auth_split_2020.pkl")


In [102]:
j_data.columns

Index(['issue_url', 'author', 'title', 'journal', 'volume', 'number', 'pages',
       'year', 'ISSN', 'abstract', 'URL', 'publisher', 'content_type', 'type',
       'jid', 'author_split', 'urldate', 'reviewed-author', 'uploaded',
       'title_10', 'URL_og', 'number_og', 'title_og', 'author_og', 'pages_og',
       'j_fix', 'scopus_jid', 'scopus_id', 'scopus_authorgroup',
       'scopus_authors', 'scopus_affiliations', 'scopus_references',
       'scopus_author_full_names', 'scopus_title', 'scopus_year',
       'scopus_source_title', 'scopus_volume', 'scopus_issue', 'scopus_art_no',
       'scopus_page_start', 'scopus_page_end', 'scopus_page_count',
       'scopus_cited_by', 'scopus_doi', 'scopus_abstract', 'scopus_publisher',
       'scopus_document_type', 'scopus_publication_stage',
       'scopus_open_access', 'scopus_source', 'scopus_eid', 'scopus_title_og',
       'scopus_volume_og', 'scopus_issue_og', 'scopus_page_start_og',
       'scopus_page_end_og', 'scopus_year_og', 's_fix', 

## Compute some stats

In [103]:
auth_summary=[]
auth_t=list(set(all_authors_a))
u_auth=list(set(all_authors))
u_a1=list(set(a1))
u_a2=list(set(a2))
u_a3=list(set(i1))
u_ln=list(set(ln))
auth_summary.append({"description":"The total number of authors", "total": len(all_authors)})
auth_summary.append({"description":"The number of unique author names without editing", "total":  len(auth_t)})
auth_summary.append({"description":"The number of unique author names after initial processing", "total":  len(u_auth)})
auth_summary.append({"description":"Unique alias 1 type names (contracted middle names + last name)", "total":  len(u_a1)})
auth_summary.append({"description":"Unique alias 2 type names (contracted first names + last name)", "total":  len(u_a2)})
auth_summary.append({"description":"Unique alias 3 type names (contracted first name + last name)", "total":  len(u_a3)})
auth_summary.append({"description":"Unique last names", "total":  len(u_ln)})

pd.DataFrame(auth_summary).to_csv(output_base_path+"/summary_author_counts.csv", index=False)

# NO SIGNIFICANT OUTPUT AFTER THIS POINT

## Computing similarity to find duplicates
Methodology: with 20000 names, we can precompile the set of potential matches using last names and first names as a way to determine duplicates.
1. allocate authors to a dictionary by last names.
2. Function by process of elimination. Assumption: first letter of last name and first letter of first name will always be correct.
3. format 
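
The first two steps can be sketched as a simple blocking pass. This is an illustration only: `block_by_last_name` is a hypothetical helper, and it relies on the stated assumption that the first letter of the first name and the last name are always correct.

```python
from collections import defaultdict

def block_by_last_name(names):
    """Group lower-cased 'first ... last' names by (last name, first initial)."""
    blocks = defaultdict(set)
    for name in names:
        parts = name.split()
        if len(parts) < 2:
            continue  # single-token names cannot be blocked this way
        blocks[(parts[-1], parts[0][0])].add(name)
    # only blocks with more than one spelling are candidate duplicates
    return {key: sorted(group) for key, group in blocks.items() if len(group) > 1}

candidates = block_by_last_name(
    ["frank h. knight", "f. h. knight", "f. knight", "jean tirole"]
)
print(candidates)
# {('knight', 'f'): ['f. h. knight', 'f. knight', 'frank h. knight']}
```

Names that share a block are only candidates; deciding whether they are the same person still needs the checks listed in the ambiguity cases.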

### Three cases of ambiguity

1. same name, different person
   - in the same time period -> require affiliations
   - not in the same time period -> resolve using a significant gap in publishing years
2. similar name or alias, same person
   - has multiple resolutions, from no middle name to a middle name: require affiliations
   - possible spelling errors: correct directly in the data; requires research on individual articles for resolution, restricted to this data set only
3. has multiple resolutions from contraction: require affiliations
   - generate list of contractions and associated possibilities if full name exists
   - modify by 


## Top 20 authors ranked by output for all time

In [104]:
# after initial process
pd.DataFrame(all_authors).value_counts()[:20]

0                   
frank h. knight         113
george j. stigler        95
a. b. wolfe              92
paul a. samuelson        91
daron acemoglu           85
f. w. taussig            84
william j. baumol        83
m. bronfenbrenner        81
j. m. clark              79
jean tirole              77
joseph e. stiglitz       75
paul h. douglas          71
h. parker willis         68
wesley c. mitchell       67
milton friedman          66
franklin m. fisher       64
harry g. johnson         63
t. n. carver             60
chester w. wright        57
j. laurence laughlin     57
Name: count, dtype: int64

In [105]:
# using a1
pd.DataFrame(a1).value_counts()[:20]

0                 
frank h. knight       113
george j. stigler      95
a. b. wolfe            92
paul a. samuelson      91
j. m. clark            88
daron acemoglu         85
f. w. taussig          84
william j. baumol      83
m. bronfenbrenner      81
jean tirole            77
joseph e. stiglitz     75
paul h. douglas        71
h. p. willis           70
wesley c. mitchell     67
milton friedman        66
franklin m. fisher     64
harry g. johnson       63
frank a. fetter        60
t. n. carver           60
chester w. wright      58
Name: count, dtype: int64

In [106]:
# using a2
pd.DataFrame(a2).value_counts()[:20]

0                
f. h. knight         133
m. bronfenbrenner    101
j. m. clark           97
p. a. samuelson       96
g. j. stigler         95
a. b. wolfe           94
w. j. baumol          90
f. w. taussig         86
j. e. stiglitz        86
d. acemoglu           85
j. tirole             77
w. c. mitchell        75
h. p. willis          74
p. h. douglas         73
f. m. fisher          66
m. friedman           66
a. p. lerner          65
t. n. carver          64
f. a. fetter          63
h. g. johnson         63
Name: count, dtype: int64

In [107]:
# using a3
pd.DataFrame(i1).value_counts()[:20]

0                
f. knight            133
j. clark             112
f. fetter            111
r. gordon            101
m. bronfenbrenner    101
j. stiglitz           98
p. samuelson          98
g. stigler            97
a. wolfe              94
w. baumol             92
f. taussig            86
d. acemoglu           85
w. mitchell           83
m. feldstein          78
j. tirole             77
h. willis             74
v. smith              74
p. douglas            74
j. robinson           73
c. wright             71
Name: count, dtype: int64

## Frequency of authors by scale of output for all time

In [108]:
# after initial processing
summary=[]
seq=[0,1,2,5,10,50,100,200,1000]
nms=pd.DataFrame(pd.DataFrame(all_authors).value_counts()).reset_index()
for i in range(len(seq)-1):
#     print("between "+str(seq[i])+" exclusive and "+str(seq[i+1])+ " inclusive")
#     print(nms[(nms["count"]<=seq[i+1])&(nms["count"]>seq[i])].shape[0])
    summary.append({"occurences":str(seq[i])+" < x <= "+str(seq[i+1]), "number":nms[(nms["count"]<=seq[i+1])&(nms["count"]>seq[i])].shape[0]})
    
pd.DataFrame(summary)

| | occurences | number |
|---|---|---|
| 0 | 0 < x <= 1 | 12658 |
| 1 | 1 < x <= 2 | 3595 |
| 2 | 2 < x <= 5 | 3452 |
| 3 | 5 < x <= 10 | 1614 |
| 4 | 10 < x <= 50 | 1177 |
| 5 | 50 < x <= 100 | 33 |
| 6 | 100 < x <= 200 | 1 |
| 7 | 200 < x <= 1000 | 0 |
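
The same right-closed bins can be produced without an explicit loop using `pd.cut`. This is a sketch of an alternative, not the notebook's method, and the `counts` values are made up for illustration.

```python
import pandas as pd

# Hypothetical number of publications per author name
counts = pd.Series([1, 1, 2, 4, 7, 12, 113])
seq = [0, 1, 2, 5, 10, 50, 100, 200, 1000]

# pd.cut uses right-closed intervals by default, i.e. seq[i] < x <= seq[i+1]
binned = pd.cut(counts, bins=seq).value_counts().sort_index()
print(binned)
```

Because the result is categorical, empty bins are reported with a count of zero rather than being dropped.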


In [109]:
# first initial + last name
summary=[]
seq=[0,1,2,5,10,50,100,200,1000]
nms=pd.DataFrame(pd.DataFrame(i1).value_counts()).reset_index()
for i in range(len(seq)-1):
#     print("between "+str(seq[i])+" exclusive and "+str(seq[i+1])+ " inclusive")
#     print(nms[(nms["count"]<=seq[i+1])&(nms["count"]>seq[i])].shape[0])
    summary.append({"occurences":str(seq[i])+" < x <= "+str(seq[i+1]), "number":nms[(nms["count"]<=seq[i+1])&(nms["count"]>seq[i])].shape[0]})
    
pd.DataFrame(summary)

| | occurences | number |
|---|---|---|
| 0 | 0 < x <= 1 | 8662 |
| 1 | 1 < x <= 2 | 2838 |
| 2 | 2 < x <= 5 | 3139 |
| 3 | 5 < x <= 10 | 1641 |
| 4 | 10 < x <= 50 | 1374 |
| 5 | 50 < x <= 100 | 44 |
| 6 | 100 < x <= 200 | 5 |
| 7 | 200 < x <= 1000 | 0 |


In [110]:
# Initials and last name
summary=[]
seq=[0,1,2,5,10,50,100,200,1000]
nms=pd.DataFrame(pd.DataFrame(a2).value_counts()).reset_index()
for i in range(len(seq)-1):
    print("between "+str(seq[i])+" exclusive and "+str(seq[i+1])+ " inclusive")
    print(nms[(nms["count"]<=seq[i+1])&(nms["count"]>seq[i])].shape[0])
    summary.append({"occurences":str(seq[i])+" < x <= "+str(seq[i+1]), "number":nms[(nms["count"]<=seq[i+1])&(nms["count"]>seq[i])].shape[0]})
    
pd.DataFrame(summary)

between 0 exclusive and 1 inclusive
10725
between 1 exclusive and 2 inclusive
3251
between 2 exclusive and 5 inclusive
3351
between 5 exclusive and 10 inclusive
1640
between 10 exclusive and 50 inclusive
1257
between 50 exclusive and 100 inclusive
39
between 100 exclusive and 200 inclusive
2
between 200 exclusive and 1000 inclusive
0


| | occurences | number |
|---|---|---|
| 0 | 0 < x <= 1 | 10725 |
| 1 | 1 < x <= 2 | 3251 |
| 2 | 2 < x <= 5 | 3351 |
| 3 | 5 < x <= 10 | 1640 |
| 4 | 10 < x <= 50 | 1257 |
| 5 | 50 < x <= 100 | 39 |
| 6 | 100 < x <= 200 | 2 |
| 7 | 200 < x <= 1000 | 0 |


In [111]:
# contracted middle names
summary=[]
seq=[0,1,2,5,10,50,100,200,1000]
nms=pd.DataFrame(pd.DataFrame(a1).value_counts()).reset_index()
for i in range(len(seq)-1):
#     print("between "+str(seq[i])+" exclusive and "+str(seq[i+1])+ " inclusive")
#     print(nms[(nms["count"]<=seq[i+1])&(nms["count"]>seq[i])].shape[0])
    summary.append({"occurences":str(seq[i])+" < x <= "+str(seq[i+1]), "number":nms[(nms["count"]<=seq[i+1])&(nms["count"]>seq[i])].shape[0]})
    
pd.DataFrame(summary)

| | occurences | number |
|---|---|---|
| 0 | 0 < x <= 1 | 12285 |
| 1 | 1 < x <= 2 | 3536 |
| 2 | 2 < x <= 5 | 3419 |
| 3 | 5 < x <= 10 | 1621 |
| 4 | 10 < x <= 50 | 1193 |
| 5 | 50 < x <= 100 | 34 |
| 6 | 100 < x <= 200 | 1 |
| 7 | 200 < x <= 1000 | 0 |


## Resolving problem names for JPE

The resolved file is read in above; this section shows how to generate the unresolved file when the fix has not yet been applied to j_data. The problem names are written to a JSON file with a unique timestamp. Please resolve each name, using the printed URLs to check the author name on the article page if the name was cut off.

In [112]:
# import time
# stuff={}
# for i in j_data.index:
#     authors=j_data.loc[i,"author"]
#     if pd.notna(authors):
#         sa=authors.lower()
#         # braces and backslashes are leftovers of unconverted LaTeX escapes
#         if ("{" in sa) or ("\\" in sa):
#             print(j_data.loc[i,"author"])
#             print(j_data.loc[i, "URL"])
#             stuff[j_data.loc[i, "author"]]=j_data.loc[i, "author"]
#             print()

# with open(output_base_path+"/jpe_problem_names_"+str(time.time())+".json", "w") as outfile: 
#     json.dump(stuff, outfile, indent=4)
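
The detection step in the commented code can also be expressed as one vectorized filter. The sketch below assumes a toy frame `toy` with a made-up name and placeholder URLs; only the brace and backslash test comes from the code above.

```python
import pandas as pd

# Toy stand-in for j_data with one badly encoded (hypothetical) JPE name
toy = pd.DataFrame({
    "author": pd.Series(["Jos{\\'e} Example", "Jean Tirole", None], dtype="object"),
    "URL": ["https://example.org/1", "https://example.org/2", "https://example.org/3"],
})

# Flag names containing a brace or a backslash; missing authors count as clean
mask = toy["author"].str.contains(r"[{\\]", regex=True, na=False)
# Map each problem name to itself, ready for manual correction of the values
problem_names = {name: name for name in toy.loc[mask, "author"]}
print(problem_names)
```

The resulting dictionary has the same shape as the file read into `names_repl`, with the values still to be corrected by hand.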

## Resolving special characters

Todo: optimize this code.
As above, add the unresolvable characters directly to the generated file and rename it to match the file name read in above.

In [113]:
# import time
# import unicodedata

# spec_char_set={}

# spec_chars=[]
# def process_string(str_in):
#     # collect every non-ascii character and print the name it appears in
#     if str_in.isascii()==True:
#         return str_in
#     for o in str_in:
#         if o.isascii()==False:
#             spec_chars.append(o)
#             print(str_in)

# for i in j_data.index:
#     authors=j_data.loc[i,"author"]
#     if pd.notna(authors):
#         sa=authors.lower()
#         process_string(sa)
        
# u_spec_chars=list(set(spec_chars))
# u_spec_chars.sort()

# for o in u_spec_chars:
#     # characters without an ascii decomposition end up as an empty string
#     spec_char_set[o]=unicodedata.normalize("NFKD", o).encode('ascii', 'ignore').decode('utf-8')
#     if len(spec_char_set[o])==0:
#         print(o)
        
# print("end")
        
# with open(output_base_path+"/spec_char_replacement_"+str(time.time())+".json", "w") as outfile: 
#     json.dump(spec_char_set, outfile, indent=4)