# JPE cleaning
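This notebook walks through how the Journal of Political Economy (JPE) records were classified into articles and non-article content (reviews, comments, replies, errata, and miscellaneous material such as front and back matter).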

## Loading libraries

In [31]:
import pandas as pd
from difflib import SequenceMatcher
import multiprocessing as mp
import time
from os import path
import os
from pathlib import Path
from PyPDF2 import PdfFileReader, PdfFileWriter
import re
import numpy as np
import matplotlib.pyplot as plt


pd.set_option('display.max_rows',100)
pd.set_option('display.max_colwidth', 120)   

## Loading Files
Replace the file paths below with your local paths, and comment out anything that does not apply to your setup (e.g. the data dump).

In [None]:
root_path="/Users/sijiawu/Work/Thesis/Data"
base_path=root_path+"/010_clean_masterlists"

In [33]:
masters = pd.read_excel(base_path+"/Masterlists/JPE_Masterlist.xlsx")
masters10 = pd.read_excel(base_path+ "/2010/JPE_master.xlsx")
pivots = pd.read_excel(base_path+"/Pivots/JPE_Pivots2020.xlsx")

## Create File names
Again, replace this with a local file path.

In [34]:
saveas=base_path+"/Processed/JPE_processed.xlsx"

## Some random checks on the masters list
My assumption is that every record without an author name is a miscellaneous document, such as a committee report, a foreword, or front matter. The goal of this notebook is to verify that all records without author names really are miscellaneous documents and then to classify them as miscellaneous (MISC). Hence, we first group the data by title to see the repetitive general content that can likely be removed.

In [35]:
pd.set_option('display.max_rows',masters.shape[0])
temp=masters['title'].str.lower().value_counts()
pd.DataFrame(temp[temp>1]).head(10)

title,count
front matter,431
back matter,322
books received,248
volume information,137
washington notes,110
journal of political economy: acknowledges the assistance of:,74
new publications,50
journal of political economy,43
back cover,29
[notes],27


Some repetitions are due to multiple comments. Now consider the same list restricted to records with no author name.

In [36]:
temp2=masters[masters['author'].isna()]['title'].str.lower().value_counts()
pd.DataFrame(temp2).head(10)

title,count
front matter,431
back matter,322
books received,248
volume information,137
washington notes,110
journal of political economy: acknowledges the assistance of:,74
new publications,50
journal of political economy,43
back cover,29
recent referees,24


In [37]:
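masters["content_type"]=None
# content_type is empty for every row at this point, so this counts all titles, with or without an author
temp2=masters[masters['content_type'].isna()]['title'].str.lower().value_counts()
#pd.DataFrame(temp2)
# any title that occurs at least 5 times is treated as recurring miscellaneous content
removal=list(temp2[temp2>=5].index)
masters.loc[masters.title.str.lower().isin(removal),'content_type']='MISC'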
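
One way to test the assumption stated above is to list the author-less rows that the frequency rule did not label. The sketch below is only an illustration: `toy` is an invented stand-in for `masters`, and the threshold is lowered from 5 to 2 so the example stays short.

```python
import pandas as pd

# Hypothetical stand-in for `masters`; titles and authors are invented.
toy = pd.DataFrame({
    "title": ["Front Matter", "front matter", "Back Matter", "Back Matter",
              "Recent Referees", "Selling Information"],
    "author": [None, None, None, None, None, "A. Author"],
})
toy["content_type"] = None

# Same frequency rule as the cell above, with a lower threshold for the toy data.
counts = toy["title"].str.lower().value_counts()
frequent = list(counts[counts >= 2].index)
toy.loc[toy["title"].str.lower().isin(frequent), "content_type"] = "MISC"

# Author-less rows that are still unlabelled are the ones that need a manual look.
leftover = toy[toy["author"].isna() & toy["content_type"].isna()]
print(leftover["title"].tolist())
```

Running the same filter on `masters` after the classification cells would show whether any author-less record escaped the MISC rules.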

## Combine the scraped list with the citations files

I have found that masterlists constructed from citation files lack the reviewed source's name, even though it is present on the article's page. Some records are also missing the title altogether. So I'm combining the old masterlists with the new ones, matching records on their URL.

In [38]:
masters["URL"]="https:"+masters["URL"].str.split(':').str[-1]
masters.drop('type', inplace=True, axis=1)
masters10["stable_url"]="https:"+masters10["stable_url"].str.split(':').str[-1]
masters10.rename(columns = {'authors':'author','stable_url':'URL','title':'title_10'}, inplace = True)
masters['pages']=masters['pages'].str.strip()  
masters.loc[masters.title.str.lower() == "back matter", 'pages'] = pd.NA  
pivots['type']=pd.NA
masters['pages']=masters["pages"].str.split('pp. ').str[-1]
masters['pages']=masters['pages'].replace(r'--','-',regex=True).str.strip()
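
To make the effect of these clean-up steps concrete, here is a minimal sketch on a single invented record. The raw values are assumptions about what the citation files look like, not actual rows from the data.

```python
import pandas as pd

# Hypothetical raw values, invented to mimic the citation-file format.
raw = pd.DataFrame({
    "URL": ["http://www.jstor.org/stable/1829099"],
    "pages": [" pp. 1515--1562 "],
})

# Keep everything after the last ':' and force an https scheme.
# This assumes the only colon in the URL is the one after the scheme.
raw["URL"] = "https:" + raw["URL"].str.split(":").str[-1]

# Strip whitespace, drop the 'pp. ' prefix, and collapse '--' to '-'.
pages = raw["pages"].str.strip().str.split("pp. ").str[-1]
raw["pages"] = pages.replace(r"--", "-", regex=True).str.strip()

print(raw.loc[0, "URL"], raw.loc[0, "pages"])
```

Normalising both tables to the same URL form matters because the URL is the merge key in the next cell.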

In [39]:
masters["author_split"]=masters['author'].str.split(' and ')
masters=masters.merge(masters10[['URL', 'title_10']], how='left', on='URL')

In [40]:
masters.loc[(pd.isna(masters["title"])==True)&(pd.isna(masters["reviewed-author"])==False), 'content_type']="Review"
masters.loc[((masters['title'].str.lower().str.contains('book reviews indexed by author of book')==True)),'content_type']='Review'

In [41]:
# Fill missing titles from title_10: first for reviews, then for records with neither an author nor a reviewed author
no_title=masters["title"].isna()
is_review=masters["reviewed-author"].notna()
masters.loc[no_title&is_review,"title"]=masters.loc[no_title&is_review,"title_10"]
no_names=masters["title_10"].notna()&no_title&~is_review&masters["author"].isna()
masters.loc[no_names,"title"]=masters.loc[no_names,"title_10"]

In [42]:
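# Remaining rows: title still missing, title_10 available, no reviewed author recorded
for i in masters[(pd.isna(masters["title_10"])==False)&(pd.isna(masters["title"])==True)&(pd.isna(masters["reviewed-author"])==True)].index:
    temp=masters.loc[i]  # i is an index label, so use .loc rather than .iloc
    indic=0
    if len(temp['author_split'])>1:
        for j in temp['author_split']:
            # a listed author whose name also appears in the title is taken to be the author of the reviewed work
            if j in temp["title_10"]:
                indic=1
                masters.loc[i, "title"]=temp["title_10"]
                masters.loc[i, "reviewed-author"]=j
                masters.loc[i, "content_type"]="Review"
                if "Review by:" in temp["title_10"]:
                    print("weird")
    # no author name found in the title: just copy the title over
    if indic==0:
        masters.loc[i, 'title']=temp['title_10']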
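
The rule inside the loop above can be read as a small function: a record with several listed authors is treated as a review when one of those names also occurs in the scraped title. The sketch below restates that rule on invented inputs; `find_reviewed_author` is a hypothetical helper, not part of the notebook.

```python
def find_reviewed_author(authors, scraped_title):
    """Return the last listed author whose name occurs in the scraped title, else None.

    Mirrors the loop above: only records with more than one listed author are checked,
    and when several names match, the last one is kept.
    """
    if not isinstance(authors, list) or len(authors) <= 1:
        return None
    hits = [name for name in authors if name in scraped_title]
    return hits[-1] if hits else None


# Invented examples: the first looks like a review, the second like a regular article.
review = find_reviewed_author(["Jane Doe", "John Roe"], "A History of Prices by John Roe")
article = find_reviewed_author(["Jane Doe", "John Roe"], "Selling Information")
single = find_reviewed_author(["John Roe"], "A History of Prices by John Roe")
print(review, article, single)
```

Note that the match is a plain substring test, so it is sensitive to how the name is spelled in each source.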

In [43]:
masters10.head()

Unnamed: 0,URL,author,title_10,abstract,content_type,issue_url,pages
0,https://www.jstor.org/stable/26549909,,JOURNAL OF POLITICAL ECONOMY,,,https://www.jstor.org/stable/10.2307/e26549908,
1,https://www.jstor.org/stable/26549910,,Journal of Political Economy,,,https://www.jstor.org/stable/10.2307/e26549908,
2,https://www.jstor.org/stable/26549911,Johannes Hörner and Andrzej Skrzypacz,Selling Information,,,https://www.jstor.org/stable/10.2307/e26549908,1515-1562
3,https://www.jstor.org/stable/26549912,Gabriel Chodorow-Reich and Loukas Karabarbounis,The Cyclicality of the Opportunity Cost of Employment,,,https://www.jstor.org/stable/10.2307/e26549908,1563-1618
4,https://www.jstor.org/stable/26549913,David Gill and Victoria Prowse,"Cognitive Ability, Character Skills, and Learning to Play Equilibrium: A Level-k Analysis",,,https://www.jstor.org/stable/10.2307/e26549908,1619-1676


In [44]:
masters[masters.title.isna()]

Unnamed: 0,issue_url,ISSN,URL,journal,number,publisher,title,urldate,volume,year,author,pages,abstract,reviewed-author,content_type,author_split,title_10


## Classifying miscellaneous documents

In [45]:
# Fuzzy-match titles against recurring miscellaneous headings; the 0.75 threshold tolerates small spelling variations.
# The misspelled targets are kept as they were run; they still match 'books received' at this threshold.
for target in ['front matter','back matter','volume information','books recieved','washington notes','books reccieved']:
    masters.loc[masters.apply(lambda k: SequenceMatcher(None, k['title'].lower(), target).ratio(), axis=1)>0.75,"content_type"]='MISC'
masters.loc[masters['title'].str.lower().str.match(r'(in )?memor(y|i(a|u)(m|l))')==True, 'content_type']='MISC'
masters.loc[masters['title'].str.lower().str.match(r'^journal of political economy(.*)')==True,'content_type']='MISC'
masters.loc[masters['title'].str.lower().str.match(r'^index to volume.*')==True,'content_type']='MISC'
masters.loc[masters['title'].str.lower().str.match(r'^new publications')==True,'content_type']='MISC'
masters.loc[masters['title'].str.lower().str.match(r'^(prefatory |\[)note(|s)(|\])$')==True,'content_type']='MISC'
masters.loc[masters['title'].str.lower().str.match(r'^(|\[)questions and answers(\]|)$')==True,'content_type']='MISC'
masters.loc[masters['title'].str.lower().str.match(r'^(|short )notice(|s)$')==True,'content_type']='MISC'
masters.loc[masters['title'].str.lower().str.match(r'^back cover(.*)')==True,'content_type']='MISC'
masters.loc[masters['title'].str.lower().str.match(r'^introduction(.*)')==True,'content_type']='MISC'
masters.loc[masters['title'].str.lower().str.match(r'^combined references(.*)')==True,'content_type']='MISC'
masters.loc[masters['title'].str.lower().str.match(r'^editor')==True, 'content_type']="MISC"
masters.loc[masters['title'].str.lower().str.match(r'^from the editor')==True, 'content_type']="MISC"
masters.loc[masters['title'].str.lower().str.contains(r'^jpe.*')==True, 'content_type']='MISC'
masters.loc[masters['title'].str.lower().str.contains(r'^preface$')==True, 'content_type']='MISC'
masters.loc[masters['title'].str.lower().str.contains(r'^the annual meetings$')==True, 'content_type']='MISC'
masters.loc[masters['title'].str.lower().str.contains(r'\[photograph\]')==True, 'content_type']="MISC"
masters.loc[masters['title'].str.lower().str.match(r'^(dissertations|john bates clark: a memorial|volume infomation|volume infromation)$')==True,'content_type']='MISC'
masters.loc[masters['author'].str.lower().str.contains('suggested ')==True, 'content_type']='MISC'

manual=["https://www.jstor.org/stable/26550496",
 "https://www.jstor.org/stable/26550454",
 "https://www.jstor.org/stable/26550440",
 "https://www.jstor.org/stable/26550429",
 "https://www.jstor.org/stable/26550405",
 "https://www.jstor.org/stable/26549923",
 "https://www.jstor.org/stable/26549931",
 "https://www.jstor.org/stable/26549919",
 "https://www.jstor.org/stable/26549907",
 "https://www.jstor.org/stable/26549896",
 "https://www.jstor.org/stable/26549885",
 "https://www.jstor.org/stable/26549875",
 "https://www.jstor.org/stable/26549865",
 "https://www.jstor.org/stable/1830706",
 "https://www.jstor.org/stable/1829984",
"https://www.jstor.org/stable/1829099",
"https://www.jstor.org/stable/1829100",
"https://www.jstor.org/stable/1829101",]

masters.loc[masters["URL"].isin(manual)==True, "content_type"]="MISC"
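
The fuzzy matching at the top of this cell relies on `difflib.SequenceMatcher.ratio()`, which returns a similarity score between 0 and 1. As a minimal sketch of how the 0.75 cut-off behaves, the hypothetical helper `is_close` below wraps the same call and is applied to a few invented titles.

```python
from difflib import SequenceMatcher


def is_close(title, target, threshold=0.75):
    """True when the similarity between the lowercased title and the target exceeds threshold."""
    return SequenceMatcher(None, title.lower(), target).ratio() > threshold


# A misspelled heading still clears the threshold; an unrelated title does not.
misspelled = is_close("Books Recieved", "books received")
exact = is_close("Front Matter", "front matter")
unrelated = is_close("Selling Information", "front matter")
print(misspelled, exact, unrelated)
```

A lower threshold would catch more spelling variants but would also raise the risk of labelling short article titles as MISC, so it is worth inspecting the matched titles whenever the threshold changes.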


In [46]:
## refer to tweet by JPE https://x.com/JPolEcon/status/1446209115735277583

## Classifying other content types

In [47]:
# check for how many articles are still unclassified
sum(masters.content_type.isna())
#masters.shape[0]

6635

In [48]:
# possible reviews that don't have author names: titles of the form "<work> by <author>"
masters.loc[(masters['title'].str.lower().str.match(r'(.*) by (.*)')==True) & (masters.author.isna()==True),'content_type']='Review'
# records that have both an author and a reviewed author are reviews
masters.loc[~(masters['author'].isna()) & (masters['reviewed-author'].isna()==False),'content_type']='Review'


In [49]:
masters.loc[masters['title'].str.lower().str.contains("erratum")|masters['title'].str.lower().str.contains("errata"), 'content_type']="Errata"

In [50]:
masters.loc[masters.content_type.isna() & masters.title.str.lower().str.match(r'.*: (|a )comment(|.*)$')==True,'content_type']='Comment'
masters[masters['content_type']=='Comment'].shape[0] #comments

166

In [51]:
masters.loc[masters.content_type.isna() & masters.title.str.lower().str.match(r'.*(:|\?) (|a )reply(| to.*)$')==True,'content_type']="Reply"
masters[masters['content_type']=='Reply'].shape[0]

112

In [52]:
masters.loc[masters.content_type.isna() & masters.title.str.lower().str.match(r'.*(:|\?) (|a )rejoinder.*$')==True,'content_type']="Rejoinder"
masters[masters['content_type']=='Rejoinder'].shape[0]

47

In [53]:
masters.loc[masters.content_type.isna() & masters.title.str.lower().str.match(r'.*: (|a )discussion$')==True,'content_type']="Discussion"
masters.loc[masters.content_type.isna() & masters.title.str.lower().str.match(r'(^|a )discussion(|.*)$')==True,'content_type']="Discussion"
masters.loc[masters.content_type.isna() & masters.title.str.lower().str.match(r'.*:.*(|a )discussion(|s)$')==True,'content_type']='Discussion'
masters[masters['content_type']=='Discussion'].shape[0]

11

In [54]:
masters.loc[masters['content_type'].isna(),'content_type']="Article"
masters[masters['content_type']=='Article'].shape[0]

6281
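
The cells above apply their rules in a fixed order, and each rule only touches rows that are still unlabelled, so the first matching rule wins and everything left over becomes an article. The sketch below restates that idea for single titles. It is a simplification under stated assumptions: `classify_title` and `RULES` are hypothetical names, only the first Discussion pattern is included, and the earlier MISC, Review, and Errata steps are left out.

```python
import re

# (label, pattern) pairs in the order the cells above apply them.
RULES = [
    ("Comment", r".*: (|a )comment(|.*)$"),
    ("Reply", r".*(:|\?) (|a )reply(| to.*)$"),
    ("Rejoinder", r".*(:|\?) (|a )rejoinder.*$"),
    ("Discussion", r".*: (|a )discussion$"),
]


def classify_title(title):
    """Return the label of the first matching rule, or 'Article' when none matches."""
    lowered = title.lower()
    for label, pattern in RULES:
        # re.match anchors at the start of the string, like pandas' str.match.
        if re.match(pattern, lowered):
            return label
    return "Article"


# Invented titles for illustration.
labels = [classify_title(t) for t in [
    "The Quantity Theory: A Comment",
    "Is Money Neutral? Reply",
    "Wage Rigidity: A Rejoinder",
    "Selling Information",
]]
print(labels)
```

Keeping the rules in one ordered list makes it easier to see which pattern takes precedence when a title could match more than one.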

In [55]:
# block for testing regex matches
#masters[masters['title'].str.lower().str.match(r'^\washington notes$')==True]
#masters[masters.content_type.isna() & masters.title.str.lower().str.match(r'.*(:|\?) (|a )rejoinder.*$')==True]
#masters[masters.content_type=='Discussion'].shape[0]
pivots.head()

Unnamed: 0,year,issue_url,Jstor_issue_text,journal,type
0,2020,uchicago_jpe128_9.bib,,jpoliecon,
1,2020,uchicago_jpe128_8.bib,,jpoliecon,
2,2020,uchicago_jpe128_7.bib,,jpoliecon,
3,2020,uchicago_jpe128_6.bib,,jpoliecon,
4,2020,uchicago_jpe128_5.bib,,jpoliecon,


## Consider the pivots file
Conference papers are sometimes structured differently from regular articles, so it may be necessary to tell the two apart. The cell below separates special issues (S) from normal issues (N): an issue is flagged as special when its JSTOR issue text contains `conference` or `s1`.

In [56]:
pivots.loc[pivots.Jstor_issue_text.isna(),"Jstor_issue_text"]="None"
pivots.loc[pivots.Jstor_issue_text.str.lower().str.match(r'.*(conference|s1).*'),'type']="S"
pivots.loc[pivots.type.isna(),'type']='N'
pivots.type.value_counts()


type
N    814
S     10
Name: count, dtype: int64
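
As an alternative sketch of the same flagging step, `str.contains` with `na=False` treats missing issue text as a non-match, which avoids writing the placeholder string "None" into the column. The frame `toy_pivots` and its issue descriptions are invented for illustration.

```python
import pandas as pd

# Invented issue descriptions; the real values come from `Jstor_issue_text`.
toy_pivots = pd.DataFrame({
    "Jstor_issue_text": [
        "Vol. 100, No. S1",
        None,
        "Papers from a Conference on Growth",
        "Vol. 101, No. 2",
    ],
})

# na=False makes rows with missing text count as "no match".
is_special = toy_pivots["Jstor_issue_text"].str.lower().str.contains(r"conference|s1", na=False)
toy_pivots["type"] = is_special.map({True: "S", False: "N"})
print(toy_pivots["type"].tolist())
```

Because `s1` is matched as a plain substring, any issue text that happens to contain those two characters would also be flagged, so the handful of S rows is worth checking by eye.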

Merge the issue type from `pivots` into `masters`, using `issue_url` as the key.

In [57]:
result = pd.merge(masters, pivots[['issue_url','type']], how="left", on="issue_url")
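
A left merge silently leaves `type` empty for records whose issue is missing from `pivots`, and it duplicates records if an issue appears twice there. The sketch below shows one way to guard against both, using the `validate` and `indicator` options of `DataFrame.merge` on invented miniature tables (`toy_masters` and `toy_pivots` are assumptions, not the real data).

```python
import pandas as pd

# Invented miniature versions of `masters` and `pivots`.
toy_masters = pd.DataFrame({
    "title": ["Paper A", "Paper B", "Paper C"],
    "issue_url": ["issue_1", "issue_1", "issue_3"],
})
toy_pivots = pd.DataFrame({"issue_url": ["issue_1", "issue_2"], "type": ["N", "S"]})

# validate raises MergeError if an issue appears more than once in the right table;
# indicator adds a `_merge` column showing which rows found a partner.
checked = toy_masters.merge(
    toy_pivots, how="left", on="issue_url", validate="many_to_one", indicator=True
)
unmatched = checked[checked["_merge"] == "left_only"]
print(unmatched["title"].tolist())
```

If `unmatched` is non-empty on the real data, the two tables probably store `issue_url` in different forms and need to be normalised before merging.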

In [58]:
result.to_excel(saveas, index=False)

In [59]:
masters.content_type.value_counts()

content_type
Article       6281
Review        6009
MISC          1705
Comment        166
Reply          112
Rejoinder       47
Errata          14
Discussion      11
Name: count, dtype: int64