# AER Cleaning

This notebook walks through how the AER articles were sorted into categories of articles and non-articles.

## Load Libraries

In [1]:
import pandas as pd
import time
from os import path
import sys
from pathlib import Path
from PyPDF2 import PdfFileReader, PdfFileWriter
import re
import os
from difflib import SequenceMatcher
import datetime
import numpy as np
import matplotlib.pyplot as plt

pd.set_option('display.max_rows',100)
pd.set_option('display.max_colwidth', 120)

## Load Files

Replace the file paths below to match your local file paths.

In [2]:
base_path="/Users/sijiawu/Work/Thesis/Data"

In [3]:
masters = pd.read_excel(base_path+"/Masterlists/AER_Masterlist.xlsx")
masters10 = pd.read_excel(base_path+ "/2010/AER_master.xlsx")
pivots = pd.read_excel(base_path+"/Pivots/AER_Pivots2020.xlsx")

## Create file names

In [4]:
saveas=base_path+"/Processed/AER_processed.xlsx"

## Some random checks on the masters list

My assumption is that all entries without author names are miscellaneous documents such as committee reports, forewords, and front matter. The goal of this notebook is to verify that every document without author names really is miscellaneous and then to classify it as miscellaneous (MISC). Hence, we first group the data by title to see the repetitive general content that can likely be removed.

Note: in both cases I've restricted the output to 20 rows for ease of viewing on GitHub, where there is no scroll function for output.


In [5]:
temp=masters['title'].str.lower().value_counts()
pd.DataFrame(temp[temp>1]).head(20)

title,count
new books,2013
front matter,565
discussion,542
back matter,454
notes,304
periodicals,204
volume information,112
titles of new books,106
"documents, reports, and legislation",89
report of the finance committee,66


Some repetitions are due to multiple comments. Now consider the same list restricted to entries without author names.

In [6]:
temp2=masters[masters['author'].isna()]['title'].str.lower().value_counts()
pd.DataFrame(temp2).head(20)

title,count
new books,2007
front matter,565
back matter,453
notes,301
periodicals,204
volume information,112
titles of new books,106
"documents, reports, and legislation",72
report of the finance committee,63
report of the auditor,37


In [7]:
masters10.loc[6149,'title']

'Why are Prices Sticky?: Discussion'

In [8]:
masters10.columns

Index(['stable_url', 'authors', 'title', 'abstract', 'content_type',
       'issue_url', 'pages'],
      dtype='object')

There are also many reports with unique titles because the year of the report is included in the title. Discussions no longer appear once the table is restricted to entries without authors, which indicates that they may be non-administrative documents.

The next block corrects for individual errors that were noted.

In [9]:
#Block for misspelling or renaming of data
masters10.loc[8990,'title']="Back Matter"
masters10.loc[10861,'title']="Back Matter"
masters10.loc[16376,'title']="Foreword"
masters10.loc[25807,'title']="Documents, Reports and Legislation"
masters10.loc[25815,'authors']="Alexander Marx"
masters10.loc[25720,'authors']="Review by: James Bonar"
masters10.loc[6425,'content_type']="Discussion"
masters10.loc[2284,'authors']="Victoria Ivashina and David Scharfstein"
masters10.loc[503,'authors']="Jennifer L. Doleac and Benjamin Hansen"
masters10.loc[22177,'authors']="Review by: W. L. Crum"
masters10.loc[22176,'authors']="Review by: Gardiner C. Means"
masters10.loc[24681,'authors']="Review by: Victor H. Pelz"
masters10.loc[6073,'authors']='Haizhou Huang'
masters10.loc[19384,'authors']='Review by: Anon'
masters10.loc[6149,'content_type']="Discussion"
masters10.loc[18729,'authors']='Anon'
masters10.loc[14710,'authors']='Anon'
masters10.loc[14710,'title']='Human Resources: The Wealth of a Nation by Eli Ginzberg: Erratum'
masters10.loc[24876,'authors']='Review by: Henry Pratt Fairchild'
masters10.loc[11919,'authors']='Review by: Anon'
masters10.loc[23831,'authors']='Review by: Roy G. Blakey'
masters10.loc[24620,'authors']='Review by: Ralph H. Blanchard'
masters10.loc[27402,'authors']='Review by: Anon'
masters10.loc[19927,'authors']='Anon'

masters.loc[11764, 'title']="Discussion"




In [10]:
#Block for misspelling or renaming of data, keyed by stable_url instead of row index
masters10.loc[masters10['stable_url']=='https://www.jstor.org/stable/1801690','title']="Back Matter"
masters10.loc[masters10['stable_url']=='https://www.jstor.org/stable/1910576','title']="Back Matter"
masters10.loc[masters10['stable_url']=='https://www.jstor.org/stable/1818315','title']="Foreword"
masters10.loc[masters10['stable_url']=='https://www.jstor.org/stable/1827575','title']="Documents, Reports and Legislation"
masters10.loc[masters10['stable_url']=='https://www.jstor.org/stable/1808527','authors']="Alexander Marx"
masters10.loc[masters10['stable_url']=='https://www.jstor.org/stable/1814356','authors']="Review by: James Bonar"
masters10.loc[masters10['stable_url']=='https://www.jstor.org/stable/2006616','content_type']="Discussion"
masters10.loc[masters10['stable_url']=='https://www.jstor.org/stable/27804963','authors']="Victoria Ivashina and David Scharfstein"
masters10.loc[masters10['stable_url']=='https://www.jstor.org/stable/44250460','authors']="Jennifer L. Doleac and Benjamin Hansen"
masters10.loc[masters10['stable_url']=='https://www.jstor.org/stable/1808382','authors']="Review by: W. L. Crum"
masters10.loc[masters10['stable_url']=='https://www.jstor.org/stable/1808381','authors']="Review by: Gardiner C. Means"
masters10.loc[masters10['stable_url']=='https://www.jstor.org/stable/1802635','authors']="Review by: Victor H. Pelz"
masters10.loc[masters10['stable_url']=='https://www.jstor.org/stable/2006938','authors']='Haizhou Huang'
masters10.loc[masters10['stable_url']=='https://www.jstor.org/stable/1813742','authors']='Review by: Anon'
masters10.loc[masters10['stable_url']=='https://www.jstor.org/stable/2006833','content_type']="Discussion"
masters10.loc[masters10['stable_url']=='https://www.jstor.org/stable/1801847','authors']='Anon'
masters10.loc[masters10['stable_url']=='https://www.jstor.org/stable/1809912','authors']='Anon'
masters10.loc[masters10['stable_url']=='https://www.jstor.org/stable/1809912','title']='Human Resources: The Wealth of a Nation by Eli Ginzberg: Erratum'
masters10.loc[masters10['stable_url']=='https://www.jstor.org/stable/1808685','authors']='Review by: Henry Pratt Fairchild'
masters10.loc[masters10['stable_url']=='https://www.jstor.org/stable/1812165','authors']='Review by: Anon'
masters10.loc[masters10['stable_url']=='https://www.jstor.org/stable/22','authors']='Review by: Roy G. Blakey'
masters10.loc[masters10['stable_url']=='https://www.jstor.org/stable/1804820','authors']='Review by: Ralph H. Blanchard'
masters10.loc[masters10['stable_url']=='https://www.jstor.org/stable/1802952','authors']='Review by: Anon'
masters10.loc[masters10['stable_url']=='https://www.jstor.org/stable/1807448','authors']='Anon'
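Keying corrections by URL rather than by row index keeps them valid if the rows are re-sorted or re-scraped. As a minimal sketch, the same idea can be written as a correction table that also reports entries that matched nothing. The toy frame, the example URLs, and the `corrections` list below are all made up for illustration and are not part of the real data.

```python
import pandas as pd

# Toy stand-in for masters10; URLs and values are invented for illustration.
toy = pd.DataFrame({
    "stable_url": ["https://example.org/a", "https://example.org/b", "https://example.org/c"],
    "title": ["Back Mater", "Some Article", "Forword"],
    "authors": [None, None, "A. Author"],
})

# Hypothetical correction table: (stable_url, column, corrected value).
corrections = [
    ("https://example.org/a", "title", "Back Matter"),
    ("https://example.org/c", "title", "Foreword"),
    ("https://example.org/b", "authors", "Anon"),
    ("https://example.org/missing", "title", "Never applied"),
]

unmatched = []
for url, column, value in corrections:
    mask = toy["stable_url"] == url
    if mask.any():
        toy.loc[mask, column] = value
    else:
        # A correction that matches no row is worth reviewing by hand.
        unmatched.append(url)

print(toy)
print("Unmatched corrections:", unmatched)
```

Collecting the unmatched URLs makes a mistyped or truncated URL visible instead of letting the assignment silently do nothing.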



In [11]:
masters.columns

Index(['type', 'issue_url', 'ISSN', 'URL', 'journal', 'number', 'publisher',
       'title', 'urldate', 'volume', 'year', 'abstract', 'author', 'pages',
       'reviewed-author', 'uploaded'],
      dtype='object')

## Combine the scraped list with the citations files

I have found that masterlists constructed from citation files lack the reviewed source's name even though it is present on the article's page. Some files are also missing the title. So I'm combining the old masterlists with the new ones.

In [12]:
masters["URL"]="https:"+masters["URL"].str.split(':').str[-1]
masters.drop('type', inplace=True, axis=1)
masters10["stable_url"]="https:"+masters10["stable_url"].str.split(':').str[-1]
masters10.rename(columns = {'authors':'author','stable_url':'URL','title':'title_10'}, inplace = True)
masters['pages']=masters['pages'].str.strip()  
masters.loc[masters.title.str.lower() == "back matter", 'pages'] = pd.NA  
pivots['type']=pd.NA
masters['pages']=masters["pages"].str.split('pp. ').str[-1]
masters['pages']=masters['pages'].replace(r'--','-',regex=True).str.strip()

## Format Author Names

There are two sets of data here: the master lists constructed from scraping each page and the master lists constructed from the BibTeX files for each article.

In [13]:
masters["author_split"]=masters['author'].str.split(' and ')
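To make the effect of the split concrete, here is a small sketch on a toy series (the third entry is deliberately missing). It assumes, as the cell above does, that co-authors are always joined by the literal string `" and "`.

```python
import pandas as pd

# Toy author strings; the last entry mimics a row with no author.
authors = pd.Series([
    "Victoria Ivashina and David Scharfstein",
    "Haizhou Huang",
    None,
])

author_split = authors.str.split(" and ")  # list of names per row
n_authors = author_split.str.len()         # number of names per row

print(author_split.tolist())
print(n_authors.tolist())
```

Rows with a missing author stay missing after the split, so any later code that calls `len()` on `author_split` has to run only on rows where the author is present.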


In [14]:
masters=masters.merge(masters10[['URL', 'title_10']], how='left', on='URL')
masters.loc[(pd.isna(masters["title"])==True)&(pd.isna(masters["reviewed-author"])==False), 'content_type']="Review"


In [15]:
masters.loc[(pd.isna(masters["title"])==True)&(pd.isna(masters["reviewed-author"])==False),"title"]=masters[(pd.isna(masters["title"])==True)&(pd.isna(masters["reviewed-author"])==False)]["title_10"]
masters.loc[(pd.isna(masters["title_10"])==False)&(pd.isna(masters["title"])==True)&(pd.isna(masters["reviewed-author"])==True)&(pd.isna(masters["author"])==True),"title"]=masters[(pd.isna(masters["title_10"])==False)&(pd.isna(masters["title"])==True)&(pd.isna(masters["reviewed-author"])==True)&(pd.isna(masters["author"])==True)]["title_10"]
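The core of the two cells above is "merge on URL, then fall back to the scraped title when the citation title is missing". The sketch below shows only that fallback on toy data. It is a simplification: the notebook's actual conditions also depend on `reviewed-author` and `author`, which are left out here, and the frames and URLs are invented.

```python
import pandas as pd

# Toy citation-based list (some titles missing) and toy scraped list.
bibtex = pd.DataFrame({"URL": ["u1", "u2", "u3"],
                       "title": ["Kept Title", None, None]})
scraped = pd.DataFrame({"URL": ["u1", "u2"],
                        "title_10": ["Scraped 1", "Scraped 2"]})

# A left merge keeps every citation row, with or without a scraped match.
merged = bibtex.merge(scraped, how="left", on="URL")

# Fill a missing title from the scraped title; existing titles are untouched.
merged["title"] = merged["title"].fillna(merged["title_10"])

print(merged)
```

Row `u3` has no scraped counterpart, so its title remains missing and would still need manual attention.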

In [16]:
# Remaining rows have a scraped title but no title or reviewed author: adopt the scraped title.
# If one of several listed authors also appears in the scraped title, record that name as the reviewed author.
for i in masters[(pd.isna(masters["title_10"])==False)&(pd.isna(masters["title"])==True)&(pd.isna(masters["reviewed-author"])==True)].index:
    temp=masters.loc[i]
    if len(temp['author_split'])>1:
        for j in temp['author_split']:
            if j in temp["title_10"]:
                masters.loc[i, "reviewed-author"]=j
                if "Review by:" in temp["title_10"]:
                    print("weird")
    masters.loc[i, 'title']=temp['title_10']

In [17]:
masters.loc[masters['number']==datetime.datetime(2023, 4, 5, 0, 0),"number"]="4-5"
masters.loc[masters['number']==datetime.datetime(2023, 1, 2, 0, 0),"number"]="1-2"
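The two assignments above turn datetime values in `number` back into issue labels. My reading, which is an assumption, is that double-issue labels such as "4-5" were parsed as month-day dates somewhere between the spreadsheet and pandas. Under that assumption, a generic repair could look like the sketch below; `restore_issue_number` is a hypothetical helper, not something used in this notebook.

```python
import datetime

def restore_issue_number(value):
    """Turn a date that was really a double-issue label back into 'M-D'."""
    if isinstance(value, datetime.datetime):
        # Assumes month and day encode the two issue numbers, e.g. April 5 -> "4-5".
        return f"{value.month}-{value.day}"
    return value  # ordinary issue numbers pass through unchanged

raw_numbers = [1, "2", datetime.datetime(2023, 4, 5), datetime.datetime(2023, 1, 2)]
restored = [restore_issue_number(v) for v in raw_numbers]
print(restored)
```

Because `pd.Timestamp` is a subclass of `datetime.datetime`, the same function could be applied to the column with `masters['number'].map(restore_issue_number)`.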


## Classifying Miscellaneous content

In [18]:
masters.loc[masters.title.str.lower() == "back matter", 'content_type'] = "MISC"  
masters.loc[masters.title.str.lower() == "front matter", 'content_type'] = "MISC"
masters.loc[masters.title.str.lower() == "volume matter", 'content_type'] = "MISC"
masters.loc[masters.title == "Announcements", 'content_type'] = "MISC"
masters.loc[masters.title == "Announcement", 'content_type'] = "MISC"
masters.loc[masters.title.str.lower() == "foreword", 'content_type'] = "MISC"
masters.loc[masters.title == "Periodicals", 'content_type'] = "MISC"

masters.loc[masters.title.str.lower() == "doctoral dissertations", 'content_type'] = "MISC"
masters.loc[masters.title == "Editorial Statement", 'content_type'] = "MISC"
masters.loc[masters.title.str.lower() == "list of members", 'content_type'] = "MISC" 
masters.loc[masters.title.str.lower() == "annual meetings", 'content_type'] = "MISC" 
masters.loc[masters.title.str.lower() == "biographical listing of members", 'content_type'] = "MISC" 
masters.loc[masters.title.str.lower() == "honorary members", 'content_type'] = "MISC" 
masters.loc[masters.title.str.lower().str.contains("preliminary announcement of the program"), 'content_type'] = "MISC"
masters.loc[masters["title"].str.contains("Distinguished Fellow"),'content_type']="MISC"
masters.loc[masters["title"].str.lower().str.contains(r"\[photograph\]"),'content_type']="MISC"
masters.loc[masters["title"].str.lower().str.contains("volume information"),'content_type']="MISC"
masters.loc[masters['title'].str.contains("The John Bates Clark Award"),'content_type']="MISC"
masters.loc[masters["title"].str.lower().str.contains("new books"),'content_type']="MISC"
masters.loc[masters["title"].str.lower().str.contains("titles of new books"),'content_type']="MISC"
masters.loc[masters["title"].str.lower().str.contains("new book"),'content_type']="MISC"
masters.loc[masters["title"].str.lower().str.contains("the american economic association"),'content_type']="MISC"
masters.loc[masters["title"].str.lower().str.contains("in memoriam"),'content_type']="MISC"
masters.loc[masters["title"].str.lower().str.contains("in memorium"),'content_type']="MISC"
masters.loc[masters["title"].str.lower().str.contains("memorial:"),'content_type']="MISC"
masters.loc[masters["title"].str.lower().str.contains("list of doctoral dissertations"),'content_type']="MISC"
masters.loc[masters["title"].str.lower().str.contains("notes") & masters['author'].isna(),'content_type']="MISC"
masters.loc[masters["title"].str.lower().str.contains("documents, reports and legislation"),'content_type']="MISC"
masters.loc[masters["title"].str.lower().str.contains("documents, reports, and legislation"),'content_type']="MISC"
masters.loc[masters["title"].str.lower().str.contains("editor") & masters["title"].str.lower().str.contains("introduction"),'content_type']="MISC"
masters.loc[masters['title'].str.match(r"^Editorial Note")==True, "content_type"]="MISC"
masters.loc[masters['title'].str.match(r"^Editor's Note")==True, "content_type"]="MISC"
masters.loc[masters["title"].str.lower().str.contains("classification of members"),'content_type']="MISC"
masters.loc[masters["title"].str.lower().str.contains("aer survey of members"),'content_type']="MISC"
masters.loc[masters["title"].str.lower().str.contains("annual business meeting"),'content_type']="MISC"
masters.loc[masters["title"].str.lower().str.contains("auditor") & masters["title"].str.lower().str.contains("report"),'content_type']="MISC"
masters.loc[masters["title"].str.lower().str.contains("proceedings of the") & masters["title"].str.lower().str.contains("annual meeting"),'content_type']="MISC"
masters.loc[masters["title"].str.lower().str.contains("report of the") & masters['author'].isna(),'content_type']="MISC"
masters.loc[masters['title'].str.lower().str.match(r'^report of the treasurer')==True, 'content_type']="MISC"
masters.loc[masters['title'].str.lower().str.match(r'^report of the director:')==True, 'content_type']="MISC"
masters.loc[masters['title'].str.lower().str.match(r'^report of the managing editor')==True,'content_type']="MISC"
masters.loc[masters['title'].str.lower().str.match(r'^report of the editor:')==True,'content_type']="MISC"
masters.loc[masters['title'].str.lower().str.match(r'^report of the secretary')==True,'content_type']="MISC"
masters.loc[masters["title"].str.lower().str.contains("minutes of the") & masters['author'].isna(),'content_type']="MISC"
masters.loc[masters["title"].str.lower().str.contains("minutes of business meetings") & masters['author'].isna(),'content_type']="MISC"
masters.loc[masters["title"].str.len()<3,'content_type']='MISC'
masters.loc[masters['title'].str.match(r'^Program.*')==True,'content_type']='MISC'
masters.loc[masters['title'].str.match(r'^Business Meeting.*')==True,'content_type']='MISC'
masters.loc[masters['title'].str.lower().str.match(r'introductory remarks')==True,'content_type']='MISC'

#masters[masters['title'].str.lower().str.contains("review")]['title']
masters.loc[masters['title'].str.lower().str.match(r'the committee on.*')==True,'content_type']='MISC'
masters.loc[masters['title'].str.lower().str.match(r'^report of.* representative')==True,'content_type']='MISC'
masters.loc[masters['title'].str.lower().str.match(r'^report of.*committee on')==True,'content_type']='MISC'
masters.loc[masters['title'].str.lower().str.match(r'^report of.* finance committee')==True,'content_type']='MISC'
masters.loc[masters['title'].str.lower().str.match(r'.*francis.*walker.*award')==True,'content_type']='MISC'

masters.loc[masters['title'].str.lower().str.match(r'^\[communication\]$')==True,'content_type']='MISC'
masters.loc[masters['title'].str.lower().str.match(r'^\[introduction\]$')==True,'content_type']='MISC'
masters.loc[masters['title'].str.lower().str.match(r'^introduction$')==True,'content_type']='MISC'

masters.loc[masters['author'].isna() & masters['content_type'].isna(),'content_type']='MISC' 

manual=["https://www.jstor.org/stable/1812108",
        "https://www.jstor.org/stable/1813763"]

masters.loc[masters["URL"].isin(manual)==True, "content_type"]="MISC"
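The cell above applies three kinds of rules: exact title matches, substring matches, and patterns anchored at the start of the title. As a sketch of how such rules could be kept in one place, the snippet below uses a small illustrative subset. The names `EXACT_TITLES`, `SUBSTRINGS`, `PREFIXES`, and `is_misc` are hypothetical, and unlike the notebook it lowercases every title before matching.

```python
import re

# Illustrative subsets of the rules; not the notebook's full list.
EXACT_TITLES = {"back matter", "front matter", "foreword", "periodicals"}
SUBSTRINGS = ("new book", "in memoriam", "volume information")
PREFIXES = (re.compile(r"report of the treasurer"), re.compile(r"program"))

def is_misc(title):
    """Return True if a title matches any of the toy MISC rules."""
    if not isinstance(title, str):
        return False  # missing titles are left for other rules
    t = title.lower().strip()
    if t in EXACT_TITLES:
        return True
    if any(s in t for s in SUBSTRINGS):
        return True
    # re.match anchors at the start of the string, like pandas str.match.
    return any(p.match(t) for p in PREFIXES)

samples = ["Front Matter", "New Books",
           "Report of the Treasurer for 1950", "Why Are Prices Sticky?"]
flags = [is_misc(s) for s in samples]
print(list(zip(samples, flags)))
```

A broad prefix rule such as `program` also matches titles that begin with "Programming", so rules of this kind are worth spot-checking against the titles they capture.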

In [19]:
#masters[masters["title"].str.lower().str.contains('affiliation')][['title','stable_url']]
#masters[masters['title'].str.lower().str.match(r'^report of the secretary')==True]
#masters[masters["title"].str.lower().str.contains("aer survey of members")][['title','stable_url']]

One last check. Note: after removing most of the miscellaneous content, I found that the remaining entries without author names were not articles.

In [20]:
# print(masters[masters['author'].isna() & masters['content_type'].isna()]['title'].shape[0])
# masters[masters['author'].isna() & masters['content_type'].isna()][['title','URL']].sort_values('title')

In [21]:
# pd.DataFrame(masters[masters['content_type'].isna()]['title'].str.lower().value_counts())
#masters[masters.title.str.lower().str.match(r'.*:.*') & masters.content_type.isna()].head()

In [22]:
pd.set_option('display.max_rows',masters.shape[0])
pd.set_option('display.max_colwidth', 100)
#pd.DataFrame(masters[['title', 'stable_url']][(masters['content_type']!='MISC') &(masters['authors'].str.lower().str.contains("review")==False)]).sort_values('title'


## Separating out other types

In [23]:
#masters.loc[~(masters['authors'].isna()) & masters['authors'].str.lower().str.match(r'.*review by:.*'),'content_type']='Review'
masters.loc[~(masters['author'].isna()) & (masters['reviewed-author'].isna()==False),'content_type']='Review'
masters[masters.content_type=='Review'].shape[0]

6801

In [24]:
masters.loc[masters.content_type.isna() & masters.title.str.lower().str.match(r'.*(:|\?) (|a )comment(|.*)$')==True,'content_type']='Comment'
masters.loc[masters.content_type.isna() & masters.title.str.lower().str.match(r'.*(:|\?) (|a )further comment(|.*)$')==True,'content_type']='Comment'
masters[masters['content_type']=='Comment'].shape[0]

858

In [25]:
masters.loc[masters['title'].str.lower().str.contains("erratum")|masters['title'].str.lower().str.contains("errata"), 'content_type']="Errata"

In [26]:
masters.loc[masters.content_type.isna() & masters.title.str.lower().str.match(r'.*(:|\?) (|a )reply(| to.*)$')==True,'content_type']="Reply"
masters[masters['content_type']=='Reply'].shape[0]

506

In [27]:
masters.loc[masters.content_type.isna() & masters.title.str.lower().str.match(r'.*: (|a )discussion$')==True,'content_type']="Discussion"
masters.loc[masters.content_type.isna() & masters.title.str.lower().str.match(r'(^|a )discussion(|.*)$')==True,'content_type']="Discussion"
masters.loc[masters.content_type.isna() & masters.title.str.lower().str.match(r'.*:.*(|a )discussion(|s)$')==True,'content_type']='Discussion'
masters.loc[masters.title.str.lower().str.match(r'.*--discussion.*') & masters.content_type.isna(),'content_type']='Discussion'
masters.loc[(masters['title'].str.lower().str.contains("round table")==True)&(masters['content_type']!="MISC"),'content_type']="Discussion"
masters[masters['content_type']=='Discussion'].shape[0]

718

In [28]:
masters.loc[masters.content_type.isna() & masters.title.str.lower().str.match(r'.*(:|\?) (|a )rejoinder.*$')==True,'content_type']="Rejoinder"
masters[masters['content_type']=='Rejoinder'].shape[0]

52

In [29]:
masters.loc[masters['content_type'].isna(),'content_type']="Article"
masters[masters['content_type']=='Article'].shape[0]

12797

In [30]:
# block for testing regex strings
#masters[masters.content_type.isna() & masters.title.str.lower().str.match(r'.*: (|a )discussion(|.*)$')==True] #false positive for discussion
#masters[masters.content_type.isna() & masters.title.str.lower().str.match(r'.*: (|a )comment(|.*)$')==True] comments to specific people
#masters[masters.title.str.lower().str.match(r'^(|a )note.*')]
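The patterns used in this section can also be checked outside the DataFrame. The sketch below copies the comment, reply, rejoinder, and first discussion patterns from the cells above and applies them with `re.match`, which anchors at the start of the string in the same way as pandas `str.match`. The sample titles and the `classify` helper are made up for illustration.

```python
import re

# Patterns copied from the cells above, applied in the same order.
PATTERNS = [
    ("Comment", re.compile(r'.*(:|\?) (|a )comment(|.*)$')),
    ("Comment", re.compile(r'.*(:|\?) (|a )further comment(|.*)$')),
    ("Reply", re.compile(r'.*(:|\?) (|a )reply(| to.*)$')),
    ("Discussion", re.compile(r'.*: (|a )discussion$')),
    ("Rejoinder", re.compile(r'.*(:|\?) (|a )rejoinder.*$')),
]

def classify(title):
    """Return the first matching label, or None if no pattern matches."""
    t = title.lower()
    for label, pattern in PATTERNS:
        if pattern.match(t):
            return label
    return None

titles = [
    "Why Are Prices Sticky?: Comment",
    "Is Growth Over? Comment",
    "Money and Prices: Further Comment",
    "The Demand for Money: A Reply",
    "Money and Prices: Reply to Professor Smith",
    "Why Are Prices Sticky?: Discussion",
    "Optimal Taxation: Rejoinder",
    "A Comment on Taxation",
    "Inflation: Reply and Extension",
]
labels = [classify(t) for t in titles]
for t, lab in zip(titles, labels):
    print(f"{lab!s:<11} {t}")
```

The last two titles are not matched: the patterns require a colon or question mark before the keyword, and the reply pattern only allows a trailing " to ...". Titles like these would fall through to the "Article" default.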

In [31]:
masters[masters.content_type=='Article'].shape[0] #articles in data set

12797

In [32]:
masters[(masters['content_type']=='Article') & (masters['year']>1939)].shape[0] #all articles from 1940 onward

11431

In [33]:
masters[(masters['content_type']=='Article') & (masters.year>1939) & (masters.year<2011)].shape[0] #articles between 1940 and 2010

9459

## Consider the pivots file

At times, conference papers are structured differently from normal articles. Hence, it may be necessary to distinguish conference papers from ordinary articles. The next code block separates special issues (S) from normal issues (N).

In [34]:
pivots.loc[pivots.Jstor_issue_text.str.lower().str.match(r'(.*)(supplement|proceedings|annual meeting|survey)(.*)'),'type']="S"
pivots.loc[pivots.type.isna(),'type']='N'
pivots[pivots.type=='S'].head()

Unnamed: 0,year,issue_url,Jstor_issue_text,journal,type
42,2017,https://www.jstor.org/stable/10.2307/i40178116,No. 5 PAPERS AND PROCEEDINGS OF THE One Hundred Twenty-Ninth Annual Meeting OF THE AMERICAN ECON...,amereconrevi,S
54,2016,https://www.jstor.org/stable/10.2307/i40158602,No. 5 PAPERS AND PROCEEDINGS OF THE One Hundred Twenty-Eighth Annual Meeting OF THE AMERICAN ECO...,amereconrevi,S
66,2015,https://www.jstor.org/stable/10.2307/i40156735,No. 5 PAPERS AND PROCEEDINGS OF THE One Hundred Twenty-Seventh Annual Meeting OF THE AMERICAN EC...,amereconrevi,S
78,2014,https://www.jstor.org/stable/10.2307/i40112127,No. 5 PAPERS AND PROCEEDINGS OF One Hundred Twenty-Sixth Annual Meeting OF THE AMERICAN ECONOMIC...,amereconrevi,S
87,2013,https://www.jstor.org/stable/10.2307/i23469657,No. 3 PAPERS AND PROCEEDINGS OF THE One Hundred Twenty-Fifth Annual Meeting OF THE AMERICAN ECON...,amereconrevi,S
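The special-issue rule is a keyword match on the lowercased issue text. The sketch below applies the same regular expression to a few invented issue labels so its behaviour is easy to inspect; `issue_type` is a hypothetical helper.

```python
import re

# Same pattern as in the cell above; because it starts with (.*),
# matching from the start of the string behaves like a keyword search.
SPECIAL = re.compile(r'(.*)(supplement|proceedings|annual meeting|survey)(.*)')

def issue_type(issue_text):
    """Return 'S' for special issues and 'N' for normal issues."""
    return "S" if SPECIAL.match(issue_text.lower()) else "N"

issue_texts = ["No. 5 PAPERS AND PROCEEDINGS", "No. 3", "No. 2 Supplement"]
types = [issue_type(t) for t in issue_texts]
print(list(zip(issue_texts, types)))
```

Any issue whose text mentions one of the four keywords is flagged, so an ordinary issue with "survey" in its description would also be marked S.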


Merging the pivots with masters

In [35]:
result = pd.merge(masters, pivots[['issue_url','type']], how="left", on="issue_url")
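A left merge like this one silently duplicates article rows if an `issue_url` appears more than once in the pivots, and leaves `type` empty where an issue is absent. As a hedged sketch on toy data, pandas' `validate` and `indicator` options can surface both situations. The frames below are invented and are not the real `masters` or `pivots`.

```python
import pandas as pd

# Toy stand-ins: three articles across two issues, one issue known to the pivots.
articles = pd.DataFrame({"title": ["A", "B", "C"],
                         "issue_url": ["i1", "i1", "i2"]})
issues = pd.DataFrame({"issue_url": ["i1"], "type": ["S"]})

checked = articles.merge(
    issues,
    how="left",
    on="issue_url",
    validate="many_to_one",  # raises if issue_url is duplicated in `issues`
    indicator=True,          # adds a _merge column: 'both' or 'left_only'
)

n_unmatched = int((checked["_merge"] == "left_only").sum())
print(checked)
print("Articles without a matching issue:", n_unmatched)
```

Here article "C" belongs to an issue that is missing from the toy pivots, so it is reported as unmatched and its `type` is left empty.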


In [36]:
result.to_excel(saveas, index=False)
