# QJE Cleaning
This notebook walks through how the QJE records were classified as articles or as non-article content (miscellaneous material, reviews, errata, comments, replies, rejoinders, and discussions).

## Load Libraries

In [86]:
import pandas as pd
from difflib import SequenceMatcher
import multiprocessing as mp
import time
from os import path
import os
from pathlib import Path
from PyPDF2 import PdfFileReader, PdfFileWriter
import re
import numpy as np
import matplotlib.pyplot as plt


pd.set_option('display.max_rows',None)
pd.set_option('display.max_colwidth', 120)   

## Load Files
Change the file paths to match your local setup, and comment out any reads for files you do not have (e.g. datadump).

In [87]:
base_path="/Users/sijiawu/Work/Thesis/Data"

In [88]:
masters = pd.read_excel(base_path+"/Masterlists/QJE_Masterlist.xlsx")
masters10 = pd.read_excel(base_path+ "/2010/QJE_master.xlsx")
pivots = pd.read_excel(base_path+"/Pivots/QJE_Pivots2020.xlsx")
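
Since not every input file may be present locally, one option is to check which files exist before reading them. The sketch below is only illustrative: `existing_inputs` is a hypothetical helper, `/path/to/Data` is a placeholder, and the relative paths are copied from the cell above.

```python
from pathlib import Path

def existing_inputs(base_path, relative_paths):
    """Return the subset of relative_paths that exist as files under base_path."""
    base = Path(base_path)
    return [rel for rel in relative_paths if (base / rel).is_file()]

inputs = [
    "Masterlists/QJE_Masterlist.xlsx",
    "2010/QJE_master.xlsx",
    "Pivots/QJE_Pivots2020.xlsx",
]

# Placeholder base path (assumption); substitute the local data directory.
available = existing_inputs("/path/to/Data", inputs)
missing = [rel for rel in inputs if rel not in available]
print("missing inputs:", missing)
```

Reads for anything listed in `missing` can then be skipped or commented out.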

## Create file names
The processed output is saved to this path.

In [89]:
saveas=base_path+"/Processed/QJE_processed.xlsx"

## Some random checks on the masters list
My assumption is that every record without an author name is a miscellaneous document, such as a committee report, foreword, or front matter. The goal of this notebook is to verify that assumption and then classify those records as miscellaneous (MISC). First, we group the data by title to see the repetitive generic content that can likely be removed.

In [90]:
pd.set_option('display.max_rows',masters.shape[0])
temp=masters['title'].str.lower().value_counts()
pd.DataFrame(temp[temp>1])

title,count
front matter,416
back matter,413
volume information,178
recent publications,141
books received,100
recent publications upon economics,86
[notes and memoranda],39
the quarterly journal of economics,14
comment,4
the gas supply of boston,4


Some repetitions arise because several distinct comments share the same title. Now consider the same counts restricted to records without author names.

In [91]:
temp1=masters[masters['author'].isna()]['title'].str.lower().value_counts()
pd.DataFrame(temp1)

title,count
front matter,416
back matter,413
volume information,178
recent publications,141
books received,100
recent publications upon economics,86
[notes and memoranda],36
the quarterly journal of economics,14
[introduction],3
scientific publications of harvard university,3


In [92]:
# block for testing regex patterns
#pd.DataFrame(masters[masters['content_type'].isna()]['title'].str.lower().value_counts())
#masters[masters['title'].str.lower().str.match(r'(^|: )report of the')]
#masters[masters['title'].str.lower().str.match(r'(^|.*: )report of the')]
#masters.loc[masters['title'].str.lower().str.match(r'^combined references(.*)')==True,'content_type']='MISC'


Among records without an author, every title that appears at least three times in the list above looks miscellaneous, so the bulk of the miscellaneous content can be flagged in one pass.

In [93]:
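# Titles that occur at least three times among records without an author.
temp2=masters[masters['author'].isna()]['title'].str.lower().value_counts()
removal=list(temp2[temp2>=3].index)
# Note: the mask matches on title only, so same-titled rows that do have an author are flagged too.
masters.loc[masters.title.str.lower().isin(removal),'content_type']='MISC'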
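
To make the effect of the threshold concrete, here is a small sketch on made-up data. The toy titles and author names are invented for illustration. It compares the title-only mask used in the cell above with a stricter variant (an assumption, not what the notebook does) that also requires the author to be missing.

```python
import pandas as pd

# Invented toy data: three author-less "Front Matter" rows, four
# "[Notes and Memoranda]" rows (one of them with an author), one article.
toy = pd.DataFrame({
    "title": ["Front Matter"] * 3 + ["[Notes and Memoranda]"] * 4 + ["A Theory of Rent"],
    "author": [None, None, None, None, None, None, "A. Author", "B. Author"],
})

# Count lowercased titles among author-less rows only.
authorless_counts = toy.loc[toy["author"].isna(), "title"].str.lower().value_counts()
removal = list(authorless_counts[authorless_counts >= 3].index)

# Title-only mask, as in the cell above.
title_only = toy["title"].str.lower().isin(removal)
# Stricter variant (assumption): additionally require a missing author.
strict = title_only & toy["author"].isna()

print(int(title_only.sum()), int(strict.sum()))
```

The difference between the two masks is exactly the set of rows that share a boilerplate title but carry an author name, which may be worth inspecting by hand.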

## Combine the scraped list with the citations files

I have found that master lists constructed from citation files lack the name of the reviewed source, even though it appears on the article's page. Some records are simply missing the title. So I'm combining the old master lists with the new ones.

In [94]:
# Normalise URLs to https so the two master lists can be merged on them.
masters["URL"]="https:"+masters["URL"].str.split(':').str[-1]
masters.drop('type', inplace=True, axis=1)
masters10["stable_url"]="https:"+masters10["stable_url"].str.split(':').str[-1]
masters10.rename(columns = {'authors':'author','stable_url':'URL','title':'title_10'}, inplace = True)
# Clean the page ranges: drop the "pp. " label and collapse "--" to "-".
masters['pages']=masters['pages'].str.strip()
masters.loc[masters.title.str.lower() == "back matter", 'pages'] = pd.NA
masters['pages']=masters["pages"].str.split('pp. ').str[-1]
masters['pages']=masters['pages'].replace(r'--','-',regex=True).str.strip()
# Placeholder column, filled in when the pivots file is processed below.
pivots['type']=pd.NA
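
The string clean-up above can be illustrated on single values. The sketch below mirrors the same steps in plain Python. The helper names are hypothetical, and the raw input formats (an `http:` prefix, a `pp. ` label, a double hyphen) are assumptions about what the source files contain.

```python
import re

def normalise_url(url):
    """Keep everything after the last ':' and force an https scheme."""
    return "https:" + url.split(":")[-1]

def normalise_pages(raw):
    """Strip whitespace, drop a leading 'pp. ' label, collapse '--' to '-'."""
    text = raw.strip()
    text = text.split("pp. ")[-1]
    return re.sub(r"--", "-", text).strip()

# Assumed raw formats, for illustration only.
cleaned_pages = [normalise_pages(p) for p in ["pp. 383--384", " 1-25 ", "pp. 101-130"]]
cleaned_url = normalise_url("http://www.jstor.org/stable/1885532")
print(cleaned_pages, cleaned_url)
```

Note that splitting on the last `:` would mangle a URL that contains a port number or any other colon after the scheme, so this approach relies on the URLs having a simple form.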

In [95]:
masters["author_split"]=masters['author'].str.split(' and ')
masters=masters.merge(masters10[['URL', 'title_10']], how='left', on='URL')

In [96]:
# Any record with a reviewed author is a review (this also covers reviews with a missing title).
masters.loc[masters["reviewed-author"].notna(), 'content_type']="Review"

In [97]:
# Reviews with a missing title: fall back to the title from the 2010 master list.
review_no_title=masters["title"].isna()&masters["reviewed-author"].notna()
masters.loc[review_no_title,"title"]=masters.loc[review_no_title,"title_10"]
# Untitled records with neither an author nor a reviewed author: same fallback.
anon_no_title=masters["title_10"].notna()&masters["title"].isna()&masters["reviewed-author"].isna()&masters["author"].isna()
masters.loc[anon_no_title,"title"]=masters.loc[anon_no_title,"title_10"]

In [98]:
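# Remaining untitled records: if one of several listed authors appears in the
# 2010 title, treat the record as a review of that author's work.
untitled=masters["title_10"].notna()&masters["title"].isna()&masters["reviewed-author"].isna()
for i in masters[untitled].index:
    temp=masters.loc[i]  # i is an index label, so use label-based lookup
    indic=0
    # author_split is a list only when the author field is present
    if isinstance(temp['author_split'], list) and len(temp['author_split'])>1:
        for j in temp['author_split']:
            if j in temp["title_10"]:
                indic=1
                masters.loc[i, "title"]=temp["title_10"]
                masters.loc[i, "reviewed-author"]=j
                masters.loc[i, "content_type"]="Review"
                if "Review by:" in temp["title_10"]:
                    print("weird")  # unexpected title format, flag for manual inspection
    if indic==0:
        # no author name found in the 2010 title: just copy the title over
        masters.loc[i, 'title']=temp['title_10']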
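
The rule applied in the loop can be written as a small standalone function, which makes it easier to try on individual cases. This is a sketch under the assumption that the author field is a string joined by " and "; `detect_review` is a hypothetical name, and the example names and titles are invented.

```python
def detect_review(authors, title_10):
    """Return (reviewed_author, content_type), or (None, None) if the rule does not fire."""
    author_split = authors.split(" and ")
    reviewed = None
    # The rule only applies when several names are listed in the author field.
    if len(author_split) > 1:
        for name in author_split:
            if name in title_10:   # plain substring test, as in the loop above
                reviewed = name    # a later match overwrites an earlier one
    if reviewed is not None:
        return reviewed, "Review"
    return None, None

# Invented examples.
multi = detect_review("Jane Doe and John Roe", "Economic Essays by John Roe")
single = detect_review("Jane Doe", "Essays by Jane Doe")
print(multi, single)
```

Because the test is a plain substring match, a short or common name could match by accident, so the flagged rows are worth a manual check.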

In [99]:
masters[masters.title.isna()]

Unnamed: 0,issue_url,ISSN,URL,journal,number,publisher,title,urldate,volume,year,abstract,author,pages,reviewed-author,uploaded,content_type,author_split,title_10


## Classifying miscellaneous documents

In [100]:
masters.loc[masters.title.str.lower() == "back matter", 'pages'] = pd.NA  
# Title patterns (anchored at the start of the lowercased title) that mark miscellaneous content.
misc_patterns=[r'\[introduction\]',
r'notes and memoranda',
r'the schumpeter prize$',
r'acknowledgement',
r'editorial',
r'index, volume',
r'harvard university courses in economics for 1928-29',
r'notice',
r'subscriptions',
r'volume matter']
lower_title=masters['title'].str.lower()
for pattern in misc_patterns:
    masters.loc[lower_title.str.match(pattern)==True,'content_type']='MISC'
manual=["https://www.jstor.org/stable/1884484",
"https://www.jstor.org/stable/1885577",
"https://www.jstor.org/stable/1882057",
"https://www.jstor.org/stable/1884497"]
masters.loc[masters["URL"].isin(manual)==True, "content_type"]="MISC"
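
A point worth keeping in mind when reading these patterns: pandas `str.match` only matches at the start of each string, like `re.match`, whereas `str.contains` behaves like `re.search`. The sketch below shows the difference with the standard `re` module; the third title is invented for illustration.

```python
import re

titles = ["notice to our readers", "[notes and memoranda]", "a notice on tariffs"]

# Anchored at the start of the title, like Series.str.match.
anchored = [bool(re.match(r"notice", t)) for t in titles]
# Anywhere in the title, like Series.str.contains.
anywhere = [bool(re.search(r"notice", t)) for t in titles]

# A bracketed title is not matched by an unbracketed anchored pattern...
plain = bool(re.match(r"notes and memoranda", "[notes and memoranda]"))
# ...unless the pattern allows for the optional opening bracket.
bracket_ok = bool(re.match(r"\[?notes and memoranda", "[notes and memoranda]"))

print(anchored, anywhere, plain, bracket_ok)
```

So a pattern such as `notes and memoranda` only catches titles that begin with those words; bracketed variants need either an explicit `\[` or an earlier rule that already handles them.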


In [101]:
masters[masters['title'].str.lower().str.match(r'notice')==True]

Unnamed: 0,issue_url,ISSN,URL,journal,number,publisher,title,urldate,volume,year,abstract,author,pages,reviewed-author,uploaded,content_type,author_split,title_10
1935,https://www.jstor.org/stable/10.2307/i332451,"00335533, 15314650",https://www.jstor.org/stable/1885532,The Quarterly Journal of Economics,2,Oxford University Press,Notice to Our Readers,2023-09-12,99,1984,,,383-384,,1,MISC,,Notice to Our Readers



## Classifying other content

In [102]:
sum(masters.content_type.isna())
#masters.shape[0]

5346

In [103]:
masters.loc[masters['title'].str.lower().str.contains("erratum")|masters['title'].str.lower().str.contains("errata"), 'content_type']="Errata"

In [104]:
# masters.loc[masters['authors'].str.lower().str.match(r'^review(ed|) by(.*)')==True,'content_type']='Review' #reviews
# masters.loc[(masters['title'].str.lower().str.match(r'(.*) by (.*)')==True) & (masters.authors.isna()==True),'content_type']='Review2' 
#possible reviews that don't have author names
masters[(masters['content_type']=='Review2') | (masters['content_type']=='Review')].shape[0] #reviews

113

In [105]:
masters.loc[masters.content_type.isna() & masters.title.str.lower().str.match(r'.*(:|\?).*comment.*$')==True,'content_type']='Comment'
masters.loc[masters.content_type.isna() & masters.title.str.lower().str.match(r'.*comment$')==True,'content_type']='Comment'
masters.loc[masters.content_type.isna() & masters.title.str.lower().str.match(r'(a further|further) comment.*$')==True,'content_type']='Comment'
masters[masters['content_type']=='Comment'].shape[0]
#.shape[0] 
#comments

263

In [106]:
masters.loc[masters.content_type.isna() & masters.title.str.lower().str.match(r'.*(:|\?) (|a )reply(| to.*)$')==True,'content_type']="Reply"
masters[masters['content_type']=='Reply'].shape[0]

156

In [107]:
masters.loc[masters.content_type.isna() & masters.title.str.lower().str.match(r'.*(:|\?|).*rejoinder.*$')==True,'content_type']="Rejoinder"
masters[masters['content_type']=='Rejoinder'].shape[0]

31

In [108]:
masters.loc[masters.content_type.isna() & masters.title.str.lower().str.match(r'.*: (|a )discussion$')==True,'content_type']="Discussion"
masters.loc[masters.content_type.isna() & masters.title.str.lower().str.match(r'(^|a )discussion(|.*)$')==True,'content_type']="Discussion"
masters.loc[masters.content_type.isna() & masters.title.str.lower().str.match(r'.*:.*(|a )discussion(|s)$')==True,'content_type']='Discussion'
masters[masters['content_type']=='Discussion'].shape[0]

4

In [109]:
masters.loc[masters['content_type'].isna(),'content_type']="Article"
masters[masters['content_type']=='Article'].shape[0]

4873
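
Because each step above only fills rows whose `content_type` is still missing, the rules act as an ordered cascade in which the first matching pattern wins and everything left over becomes an article. A minimal sketch of that cascade for a single title follows. The patterns are copied from the cells above, while `classify_title` and the example titles are hypothetical. The sketch covers only the Comment, Reply, Rejoinder, and Discussion steps, not the earlier MISC, Review, and Errata assignments.

```python
import re

# (label, pattern) pairs in the order the notebook applies them.
RULES = [
    ("Comment", r".*(:|\?).*comment.*$"),
    ("Comment", r".*comment$"),
    ("Comment", r"(a further|further) comment.*$"),
    ("Reply", r".*(:|\?) (|a )reply(| to.*)$"),
    ("Rejoinder", r".*(:|\?|).*rejoinder.*$"),
    ("Discussion", r".*: (|a )discussion$"),
    ("Discussion", r"(^|a )discussion(|.*)$"),
    ("Discussion", r".*:.*(|a )discussion(|s)$"),
]

def classify_title(title):
    """Return the label of the first rule matching the lowercased title, else 'Article'."""
    lower = title.lower()
    for label, pattern in RULES:
        if re.match(pattern, lower):  # anchored at the start, like Series.str.match
            return label
    return "Article"

# Invented example titles.
examples = [
    "The Theory of Wages: Comment",
    "The Theory of Wages: A Reply",
    "Rejoinder",
    "A Discussion of Method",
    "Wages and Capital",
]
labels = [classify_title(t) for t in examples]
print(labels)
```

Since order matters, a title containing both "comment" and "reply" after a colon would be labelled by the Comment rule, which comes first.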

In [110]:
# code block for testing regex
#masters[masters['title'].str.lower().str.match(r'^\washington notes$')==True]
#masters[masters.content_type.isna() & masters.title.str.lower().str.match(r'.*(:|\?) (|a )rejoinder.*$')==True]
#masters[masters.content_type=='Discussion']

## Consider the pivots file
At times, conference papers are structured differently from regular articles, so it may be necessary to distinguish them. The next block separates special issues (S) from normal issues (N).

In [111]:
pivots.loc[pivots.Jstor_issue_text.isna(),"Jstor_issue_text"]="None"

In [112]:
pivots.loc[pivots.Jstor_issue_text.str.lower().str.match(r'(.*)(supplement|proceedings|annual meeting|survey|index)(.*)'),'type']="S"
pivots.loc[pivots.type.isna(),'type']='N'
pivots.type.value_counts()

type
N    544
S      4
Name: count, dtype: int64
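
The flagging step can be tried on a toy frame. In the sketch below the issue texts are invented. It also shows why the missing issue text is replaced with a placeholder string first: `str.match` returns a missing value for a missing input, which cannot be used directly as a boolean mask.

```python
import pandas as pd

# Invented issue descriptions; the second one is missing.
toy = pd.DataFrame({"Jstor_issue_text": ["No. 2", None, "No. 4, Supplement"]})
toy["type"] = pd.NA

# Replace missing text with a placeholder so that the mask below is purely boolean.
toy["Jstor_issue_text"] = toy["Jstor_issue_text"].fillna("None")

special = toy["Jstor_issue_text"].str.lower().str.match(
    r".*(supplement|proceedings|annual meeting|survey|index).*"
)
toy.loc[special, "type"] = "S"          # special issues
toy.loc[toy["type"].isna(), "type"] = "N"  # everything else is a normal issue

print(toy["type"].tolist())
```

The leading `.*` is what lets the anchored `str.match` find the keyword anywhere in the issue text.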

## Merging pivots and masters

In [113]:
result = pd.merge(masters, pivots[['issue_url','type']], how="left", on="issue_url")
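
A left merge silently duplicates rows if the right-hand table contains the same key more than once. One way to guard against that, shown here as an optional sketch on invented toy frames rather than as part of the notebook's pipeline, is the `validate` argument of `DataFrame.merge`.

```python
import pandas as pd

# Invented toy frames: two articles in issue u1, one in issue u2.
masters_toy = pd.DataFrame({"issue_url": ["u1", "u1", "u2"], "title": ["a", "b", "c"]})
pivots_toy = pd.DataFrame({"issue_url": ["u1", "u2"], "type": ["N", "S"]})

# many_to_one: each issue_url may appear only once on the right-hand side.
merged = masters_toy.merge(pivots_toy, how="left", on="issue_url", validate="many_to_one")

# With a duplicated key on the right, the same call raises MergeError.
dup_pivots = pd.concat([pivots_toy, pivots_toy.iloc[[0]]], ignore_index=True)
try:
    masters_toy.merge(dup_pivots, how="left", on="issue_url", validate="many_to_one")
    raised = False
except pd.errors.MergeError:
    raised = True

print(len(merged), raised)
```

Comparing the row counts before and after the merge is a cheaper check that serves the same purpose.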

In [114]:
result.to_excel(saveas, index=False)