# ECTA Cleaning

This notebook walks through how the ECTA (Econometrica) records were classified as articles or non-articles.

## Load Libraries

In [55]:
import numpy as np
import pandas as pd
import time
from os import path
import sys
from pathlib import Path
from PyPDF2 import PdfFileReader, PdfFileWriter
import re
import os
from difflib import SequenceMatcher
import datetime
import matplotlib.pyplot as plt

pd.set_option('display.max_rows',None)
pd.set_option('display.max_colwidth', 120)   

## Load Files
Replace these file paths with your local file paths.

In [56]:
base_path="/Users/sijiawu/Work/Thesis/Data"

In [57]:
masters = pd.read_excel(base_path+"/Masterlists/ECTA_Masterlist.xlsx")
masters10 = pd.read_excel(base_path+ "/2010/ECTA_master.xlsx")
pivots = pd.read_excel(base_path+"/Pivots/ECTA_Pivots2020.xlsx")

## Create File names

In [58]:
saveas=base_path+"/Processed/ECTA_processed.xlsx"

## Some random checks on the masters list

My assumption is that every record without an author name is a miscellaneous document such as a committee report, foreword, or front matter. The goal of this notebook is to verify that assumption and then classify those records as miscellaneous (MISC). First, we group the data by title to see which recurring, generic titles can likely be removed.

In [59]:
pd.set_option('display.max_rows',masters.shape[0])
temp=masters['title'].str.lower().value_counts()
pd.DataFrame(temp[temp>1]).head(10)

title,count
back matter,445
front matter,436
news notes,193
announcements,146
accepted manuscripts,116
volume information,80
submission of manuscripts to econometrica,49
forthcoming papers,42
report of the secretary,31
report of the treasurer,31


Some titles repeat because a single paper can attract several comments. Now consider the same list restricted to records without author names.

In [60]:
temp1=masters[masters['author'].isna()]['title'].str.lower().value_counts()
pd.DataFrame(temp1).head(10)

title,count
back matter,445
front matter,436
news notes,193
announcements,134
accepted manuscripts,116
volume information,80
submission of manuscripts to econometrica,49
forthcoming papers,42
news note,25
fellows of the econometric society,25
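
Comparing the two tables shows that a few titles (for example "announcements", 146 versus 134) sometimes carry an author. A minimal sketch of how such titles could be listed for manual review, using hypothetical toy rows rather than the real master list:

```python
import pandas as pd

# Toy stand-in for the master list (hypothetical rows, not real ECTA data).
toy = pd.DataFrame({
    "title": ["Announcements", "Announcements", "announcements",
              "Front Matter", "Front Matter"],
    "author": [None, "A. Author", None, None, None],
})

all_counts = toy["title"].str.lower().value_counts()
no_author_counts = toy.loc[toy["author"].isna(), "title"].str.lower().value_counts()

# Occurrences of each title that DO have an author attached.
with_author = all_counts.sub(no_author_counts, fill_value=0).astype(int)
needs_review = with_author[with_author > 0]
print(needs_review)
```

Titles that show up in `needs_review` are the ones where the "no author means miscellaneous" assumption cannot be applied blindly.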


In [61]:
# block for testing regex matching
#pd.DataFrame(masters[masters['content_type'].isna()]['title'].str.lower().value_counts())
#masters[masters['title'].str.lower().str.match(r'(^|: )report of the')]
#masters[masters['title'].str.lower().str.match(r'(^|.*: )report of the')]
#masters.loc[masters['title'].str.lower().str.match(r'^combined references(.*)')==True,'content_type']='MISC'
#masters[masters['title'].str.lower().str.match(r'.*(members|members and subscribers)$')]
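
The two commented `report of the` patterns above differ because `Series.str.match` anchors at the start of the string. The following sketch, on made-up titles, illustrates the difference between the two patterns and `str.contains`:

```python
import pandas as pd

# Hypothetical titles chosen to exercise each pattern.
titles = pd.Series([
    "report of the secretary",
    "annual meeting: report of the treasurer",
    "a report of the findings",
])

# str.match anchors at position 0, so "(^|: )" can only succeed with "^".
anchored = titles.str.match(r"(^|: )report of the")
# ".*: " lets the pattern skip a leading "<something>: " prefix.
prefixed = titles.str.match(r"(^|.*: )report of the")
# str.contains searches anywhere in the string.
anywhere = titles.str.contains(r"report of the")

print(pd.DataFrame({"title": titles, "anchored": anchored,
                    "prefixed": prefixed, "anywhere": anywhere}))
```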

Judging from the above, any title that occurs five or more times is miscellaneous. The next code block classifies those titles as MISC.

In [62]:
masters["content_type"]=None
temp2=masters[masters['content_type'].isna()==True]['title'].str.lower().value_counts()
#pd.DataFrame(temp2)
removal=list(temp2[temp2>=5].index)
masters.loc[masters.title.str.lower().isin(removal),'content_type']='MISC'
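
As an illustration of the frequency rule, here is a small sketch on hypothetical data. The threshold of 5 is the one used above; the toy titles are assumptions made up for the example.

```python
import pandas as pd

# Five copies of a generic title plus a few titles that occur rarely.
toy = pd.DataFrame({
    "title": ["Front Matter"] * 5
             + ["A Model of Growth", "a model of growth", "Back Matter"]
})
toy["content_type"] = None

counts = toy["title"].str.lower().value_counts()
frequent = counts[counts >= 5].index  # titles seen at least five times
toy.loc[toy["title"].str.lower().isin(frequent), "content_type"] = "MISC"

print(toy)
```

The rule is a heuristic: a genuine paper whose title recurs five or more times (for instance through many comments with identical titles) would also be swept into MISC, so a quick look at the flagged titles is worthwhile.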

## Combine the scraped list with the citations files

I have found that masterlists constructed from citation files lack the name of the reviewed source, even though it appears on the article's page. Some records are missing the title altogether. So I'm combining the old masterlists with the new ones.

In [63]:
masters["URL"]="https:"+masters["URL"].str.split(':').str[-1]
masters.drop('type', inplace=True, axis=1)
masters10["stable_url"]="https:"+masters10["stable_url"].str.split(':').str[-1]
masters10.rename(columns = {'authors':'author','stable_url':'URL','title':'title_10'}, inplace = True)
masters['pages']=masters['pages'].str.strip()  
masters.loc[masters.title.str.lower() == "back matter", 'pages'] = pd.NA  
masters['pages']=masters["pages"].str.split('pp. ').str[-1]
masters['pages']=masters['pages'].replace(r'--','-',regex=True).str.strip()
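
To make the URL and page clean-up concrete, the sketch below applies the same string operations to two hypothetical rows. The raw URL and page formats are assumptions; the real files may differ.

```python
import pandas as pd

# Hypothetical raw values for illustration only.
toy = pd.DataFrame({
    "URL": ["http://www.jstor.org/stable/1234", "//www.jstor.org/stable/5678"],
    "pages": [" pp. 101--120 ", "pp. 7-9"],
})

# Keep whatever follows the last ":" and prepend "https:".
toy["URL"] = "https:" + toy["URL"].str.split(":").str[-1]
# Drop the "pp. " prefix, normalise double hyphens, trim whitespace.
toy["pages"] = toy["pages"].str.strip().str.split("pp. ").str[-1]
toy["pages"] = toy["pages"].replace(r"--", "-", regex=True).str.strip()

print(toy)
```

Normalising both lists to the same URL form is what makes the later merge on `URL` line up.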

In [64]:
masters["author_split"]=masters['author'].str.split(' and ')
masters=masters.merge(masters10[['URL', 'title_10']], how='left', on='URL')

In [65]:
masters.loc[(pd.isna(masters["title"])==True)&(pd.isna(masters["reviewed-author"])==False), 'content_type']="Review"
masters.loc[((masters['title'].str.lower().str.contains('book reviews indexed by author of book')==True)),'content_type']='Review'

In [66]:
masters.loc[(pd.isna(masters["title"])==True)&(pd.isna(masters["reviewed-author"])==False),"title"]=masters[(pd.isna(masters["title"])==True)&(pd.isna(masters["reviewed-author"])==False)]["title_10"]
masters.loc[(pd.isna(masters["title_10"])==False)&(pd.isna(masters["title"])==True)&(pd.isna(masters["reviewed-author"])==True)&(pd.isna(masters["author"])==True),"title"]=masters[(pd.isna(masters["title_10"])==False)&(pd.isna(masters["title"])==True)&(pd.isna(masters["reviewed-author"])==True)&(pd.isna(masters["author"])==True)]["title_10"]
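
The long boolean expressions above are easier to follow when the mask is named once and reused. A minimal sketch of the same idea on toy frames (the rows and the variable names `new`, `old`, and `mask` are hypothetical):

```python
import pandas as pd

# "new" plays the role of the citation-based list, "old" the scraped one.
new = pd.DataFrame({
    "URL": ["u1", "u2", "u3"],
    "title": [None, "Kept Title", None],
    "reviewed-author": ["R. Author", None, None],
})
old = pd.DataFrame({"URL": ["u1", "u2"],
                    "title_10": ["Old Title 1", "Old Title 2"]})

merged = new.merge(old, how="left", on="URL")

# No title but a reviewed author: treat as a review and borrow the older title.
mask = merged["title"].isna() & merged["reviewed-author"].notna()
merged.loc[mask, "content_type"] = "Review"
merged.loc[mask, "title"] = merged.loc[mask, "title_10"]

print(merged)
```

Row `u3` has no match in the older list, so its title stays missing, which is the case the later cells have to handle.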

In [67]:
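# Remaining rows have an older title but no current title: decide whether they are reviews.
for i in masters[(pd.isna(masters["title_10"])==False)&(pd.isna(masters["title"])==True)&(pd.isna(masters["reviewed-author"])==True)].index:
    temp=masters.loc[i]  # i is an index label, so use .loc rather than .iloc
    indic=0
    authors=temp['author_split']
    # author_split is NaN (a float) when the author is missing, so check the type before len()
    if isinstance(authors, list) and len(authors)>1:
        for j in authors:
            # a listed author who also appears in the older title is taken to be the reviewed author
            if j in temp["title_10"]:
                indic=1
                masters.loc[i, "title"]=temp["title_10"]
                masters.loc[i, "reviewed-author"]=j
                masters.loc[i, "content_type"]="Review"
                if "Review by:" in temp["title_10"]:
                    print("weird")
    # no author name found in the older title: just copy the title across
    if indic==0:
        masters.loc[i, 'title']=temp['title_10']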
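
The core test in the loop above is whether one of several listed authors also appears in the older title. A simplified sketch of that check as a standalone function follows. The function name and example strings are hypothetical, and unlike the loop it returns the first match rather than keeping the last one.

```python
def find_reviewed_author(author_field, old_title):
    """Return the first listed author whose name appears in old_title, else None."""
    names = author_field.split(" and ")
    if len(names) < 2:
        # A single author cannot be both reviewer and reviewed author.
        return None
    for name in names:
        if name in old_title:
            return name
    return None


hit = find_reviewed_author("Jane Doe and John Smith",
                           "Theory of Games by John Smith")
miss = find_reviewed_author("Jane Doe", "Jane Doe on growth")
print(hit, miss)
```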

In [68]:
masters10.head()

Unnamed: 0,URL,author,title_10,abstract,content_type,issue_url,pages
0,https://www.jstor.org/stable/45238021,,Front Matter,,,https://www.jstor.org/stable/10.2307/i40226149,
1,https://www.jstor.org/stable/45238022,,[Illustration],,,https://www.jstor.org/stable/10.2307/i40226149,
2,https://www.jstor.org/stable/45238023,"Martin Beraja, Erik Hurst and Juan Ospina",THE AGGREGATE IMPLICATIONS OF REGIONAL BUSINESS CYCLES,,,https://www.jstor.org/stable/10.2307/i40226149,1789-1833
3,https://www.jstor.org/stable/45238024,Amanda Friedenberg,BARGAINING UNDER STRATEGIC UNCERTAINTY: THE ROLE OF SECOND-ORDER OPTIMISM,,,https://www.jstor.org/stable/10.2307/i40226149,1835-1865
4,https://www.jstor.org/stable/45238025,Gabriel Carroll and Georgy Egorov,STRATEGIC COMMUNICATION WITH MINIMAL VERIFICATION,,,https://www.jstor.org/stable/10.2307/i40226149,1867-1892


## Classifying miscellaneous content

In [69]:
# Fuzzy-match titles against recurring non-article headings (similarity ratio > 0.75).
# SequenceMatcher compares literal strings, not regexes, so 'news notes' is spelled out.
fuzzy_targets=['front matter','back matter','news notes','announcements','accepted manuscripts','volume information','submission of manuscripts to econometrica','forthcoming papers']
titles_lower=masters['title'].str.lower()
for target in fuzzy_targets:
    masters.loc[titles_lower.apply(lambda t: SequenceMatcher(None, t, target).ratio())>0.75,"content_type"]='MISC'
masters.loc[masters['title'].str.lower().str.match(r'(^|.*: )report of the'), 'content_type']='MISC'
masters.loc[masters['title'].str.lower().str.match(r'.*report (of|on) the(.*)(editors|fellows)'), 'content_type']='MISC'
masters.loc[masters['title'].str.lower().str.match(r'meeting of the econometric society'), 'content_type']='MISC'
masters.loc[masters['title'].str.lower().str.match(r'(^|.*: )report of the.*')==True,'content_type']="MISC"
masters.loc[(masters['title'].str.lower().str.contains('econometric society')==True)&(masters["author"].isna()==True),'content_type']="MISC"
masters.loc[masters['title'].str.lower().str.contains('econometrica')==True,'content_type']='MISC'
masters.loc[(masters['title'].str.lower().str.contains('report')==True) & (masters['author'].isna()==True),'content_type']='MISC'
masters.loc[masters['title'].str.lower().str.strip().str.match(r'treasurer(.*)report'),'content_type']='MISC'
masters.loc[masters['title'].str.lower().str.strip().str.contains(r'report from the president'),'content_type']='MISC'
masters.loc[((masters['title'].str.lower().str.contains('announcement of')==True)),'content_type']='MISC'
masters.loc[((masters['title'].str.lower().str.match(r'editor(.*)note')==True)),'content_type']='MISC'
masters.loc[((masters['title'].str.lower().str.match(r'(.*):program$')==True)),'content_type']='MISC'
masters.loc[((masters['title'].str.lower().str.strip().str.match(r'accountant(.*)opinion')==True)),'content_type']='MISC'
masters.loc[masters.apply(lambda k: SequenceMatcher(None, k['title'].lower(), 'unpublished research memoranda').ratio(), axis=1)>0.75,"content_type"]='MISC'
masters.loc[((masters['title'].str.lower().str.strip().str.match(r'^(obituary|death(s?) of members)$')==True)),'content_type']='MISC'
masters.loc[masters['title'].str.lower().str.match(r'.*fellows$'),'content_type']="MISC"
masters.loc[masters['title'].str.lower().str.contains('nomination of fellows'),'content_type']="MISC"
masters.loc[masters['title'].str.lower().str.match(r'.*editorial$'),'content_type']="MISC"
masters.loc[masters['title'].str.lower().str.match(r'(index of authors|summary of accounts)'),'content_type']='MISC'
masters.loc[masters['title'].str.lower().str.match(r'.*(members|members and subscribers)$'),'content_type']='MISC'
masters.loc[masters['title'].str.lower().str.contains(r'\[illustration\]'),'content_type']='MISC'
masters.loc[masters['title'].str.lower().str.contains(r'\[photograph.*\]'),'content_type']='MISC'
masters.loc[masters['title'].str.lower().str.contains('abstracts of papers'),'content_type']="MISC"
masters.loc[masters['title'].str.lower().str.contains('frisch medal award'),'content_type']="MISC"
masters.loc[masters['title'].str.lower().str.contains('award of frisch'),'content_type']="MISC"
masters.loc[masters['title'].str.lower().str.match(r'^membership list'),'content_type']="MISC"
masters.loc[masters['title'].str.lower().str.contains('additive preferences'),'content_type']="MISC"
masters.loc[masters['title'].str.lower().str.contains('communications'),'content_type']="MISC"
masters.loc[masters['title'].str.lower().str.contains('letters to the editor'),'content_type']="MISC"
masters.loc[masters['title'].str.lower().str.match(r'^program of.*'), 'content_type']="MISC"
masters.loc[masters['title'].str.lower().str.match(r'.*: program'),'content_type']="MISC"
masters.loc[masters['title'].str.lower().str.contains(r'call for papers'),'content_type']="MISC"
masters.loc[masters['title'].str.lower().str.match(r'election of (new |)fellow(|s)'),'content_type']="MISC"
masters.loc[masters['title'].str.lower().str.match(r'^index( of| to|$).*'),'content_type']="MISC"
masters.loc[masters['title'].str.lower().str.match(r'^introduction.*'),'content_type']="MISC"
masters.loc[masters['title'].str.lower().str.contains(r'notice of'),'content_type']="MISC"
masters.loc[masters['title'].str.lower().str.match(r'meetings (in|of)'),'content_type']="MISC"
masters.loc[masters['title'].str.lower().str.contains(r"société d'économétrie"), 'content_type']="MISC"
masters.loc[masters['title'].str.lower().str.contains(r"election results"), "content_type"]="MISC"
masters.loc[masters['title'].str.lower().str.contains(r"compte.*congres"),"content_type"]="MISC"
masters.loc[masters['title'].str.lower().str.match(r"^(abbreviations|alphabetical list of periodicals|alphabetical list of associations, societies, etc.|author index|in memoriam|monograph prizes|national science foundation grant|news notes from other journals|nomination of fellow, 1984|north american regional conference|north american summer meeting, madison, wisconsin|note of appreciation|note on membership listing|notes of appreciation|notes to financial statements|omission of july issue|pagination error|postgraduate research in econometrics|postdoctoral study in statistics|reprints desired by european members|research information|second world congress|style manual|subject index|third world congress|travel grant to .* meeting|a note of appreciation|acknowledg(e|.)ment)$"),"content_type"]="MISC"
masters.loc[masters['title'].str.lower().str.match(r"^(announcement and tentative program|announcements and notes|announcements of the december 1957|appointment of co-editor|assistantships in econometrics|attendance at the oxford meeting, september 25-29, 1936|election of vice-president|fellowships|fellowships and grants|geographical list of subscribers|la conf.rence européenne de la soci.té d'.conometrie|miscellaneous index|north american summer meeting, madison, wisconsin|officers and council|officers and new council|plans for the atlantic city|plans for special publications|salute to ragnar frisch in honor of his sixty|rules for electing fellows as revised|statements of loss and fund balance for the years ended december|suggestions for fellowship|washington meeting with international|washington meeting, september|\[program\]: tenth indian econometric)|in memoriam \[yehuda grunfeld\]|obituary notice, dickson h. leavens|resumption of editorship by professor frisch"), "content_type"]="MISC"

manual=["https://www.jstor.org/stable/2938202"]

masters.loc[masters["URL"].isin(manual)==True, 'content_type']="MISC"
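
The cell above mixes two kinds of rule: fuzzy matching with `SequenceMatcher` (ratio above 0.75) and regular expressions. The sketch below shows how the similarity ratio behaves on a few hypothetical titles. The helper name `similarity` is an assumption introduced for the example.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Case-insensitive similarity ratio between 0 and 1."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


# Identical strings (ignoring case) score exactly 1.0.
exact = similarity("Front Matter", "front matter")
# A near miss such as a singular/plural difference stays above the 0.75 cut-off.
near = similarity("news note", "news notes")
# An ordinary article title scores well below the cut-off.
far = similarity("a model of growth", "front matter")
# The target is compared literally: "(|s)" is treated as four plain characters.
literal = similarity("news notes", "news note(|s)")

print(exact, near, far, literal)
```

Because the target string is never interpreted as a regex, regex syntax inside it only lowers the score, so plain target strings are the safer choice for fuzzy rules.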



## Classifying other content

In [70]:
sum(masters.content_type.isna())
#masters.shape[0]

5573

In [71]:
# masters.loc[masters['authors'].str.lower().str.match(r'^review(ed|) by(.*)')==True,'content_type']='Review' #reviews
# masters.loc[(masters['title'].str.lower().str.match(r'(.*) by (.*)')==True) & (masters.author.isna()==True),'content_type']=None 
masters.loc[~(masters['author'].isna()) & (masters['reviewed-author'].isna()==False),'content_type']='Review'

In [72]:
masters.loc[masters['title'].str.lower().str.contains("erratum")|masters['title'].str.lower().str.contains("errata"), 'content_type']="Errata"

In [73]:
masters.loc[masters.content_type.isna() & (masters.title.str.lower().str.match(r'.*: (|a )comment(|.*)$')==True),'content_type']='Comment'
masters[masters['content_type']=='Comment'].shape[0] #comments

73

In [74]:
masters.loc[masters.content_type.isna() & (masters.title.str.lower().str.match(r'.*(:|\?) (|a )reply(| to.*)$')==True),'content_type']="Reply"
masters[masters['content_type']=='Reply'].shape[0]

43

In [75]:
masters.loc[masters.content_type.isna() & (masters.title.str.lower().str.match(r'.*(:|\?) (|a )rejoinder.*$')==True),'content_type']="Rejoinder"
masters[masters['content_type']=='Rejoinder'].shape[0]

14

In [76]:
masters.loc[masters.content_type.isna() & (masters.title.str.lower().str.match(r'.*: (|a )discussion$')==True),'content_type']="Discussion"
masters.loc[masters.content_type.isna() & (masters.title.str.lower().str.match(r'(^|a )discussion(|.*)$')==True),'content_type']="Discussion"
masters.loc[masters.content_type.isna() & (masters.title.str.lower().str.match(r'.*:.*(|a )discussion(|s)$')==True),'content_type']='Discussion'
masters[masters['content_type']=='Discussion'].shape[0]

9
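
The Comment, Reply, and Rejoinder rules only touch rows whose `content_type` is still empty, so the first rule that matches wins and everything left over becomes an article. A compact sketch of that ordering on hypothetical titles, using the same regular expressions:

```python
import pandas as pd

# Made-up titles, one per category.
toy = pd.DataFrame({"title": [
    "Optimal Taxation: A Comment",
    "Optimal Taxation: Reply",
    "Is Growth Optimal? A Rejoinder",
    "Optimal Taxation",
]})
toy["content_type"] = None
lower = toy["title"].str.lower()

rules = [
    ("Comment", r".*: (|a )comment(|.*)$"),
    ("Reply", r".*(:|\?) (|a )reply(| to.*)$"),
    ("Rejoinder", r".*(:|\?) (|a )rejoinder.*$"),
]
for label, pattern in rules:
    # Only rows that no earlier rule has labelled are eligible.
    mask = toy["content_type"].isna() & lower.str.match(pattern)
    toy.loc[mask, "content_type"] = label

# Whatever is still unlabelled is treated as a regular article.
toy["content_type"] = toy["content_type"].fillna("Article")
print(toy)
```

Keeping the rules in a list makes the order explicit, which matters because a title that matches two patterns is labelled by whichever comes first.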

In [77]:
masters.loc[masters['content_type'].isna(),'content_type']="Article"
masters[masters['content_type']=='Article'].shape[0]

5427

In [78]:
#masters[masters['title'].str.lower().str.match(r'^\washington notes$')==True]
masters[masters.content_type.isna() & masters.title.str.lower().str.match(r'.*(:|\?) (|a )rejoinder.*$')==True]

masters[masters.content_type=='Discussion'].shape[0]

9

## Consider the pivots file
At times, conference papers are structured differently from regular articles. Hence, it may be necessary to distinguish issues that contain conference papers (and other special content) from regular issues.

In [79]:
pivots.loc[pivots.Jstor_issue_text.str.lower().str.match(r'(.*)(supplement|proceedings|annual meeting|survey|index|bibliographical directory)(.*)'),'type']="S"
pivots.loc[pivots.type.isna(),'type']='N'
pivots.type.value_counts()

type
N    447
S      8
Name: count, dtype: int64
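
A small sketch of the special-issue flag on hypothetical issue descriptions. It uses `str.contains` with `na=False`, which behaves like the `(.*)…(.*)` match above but also tolerates missing issue text. The toy strings are assumptions, not real `Jstor_issue_text` values.

```python
import pandas as pd

toy = pd.DataFrame({"Jstor_issue_text": [
    "Vol. 5, No. 1, Jan., 1937",
    "Vol. 17, Supplement: Report of the Washington Meeting, Jul., 1949",
    None,
]})

pattern = r"supplement|proceedings|annual meeting|survey|index|bibliographical directory"
# na=False keeps rows with missing issue text out of the special group.
special = toy["Jstor_issue_text"].str.lower().str.contains(pattern, na=False)
toy["type"] = special.map({True: "S", False: "N"})

print(toy["type"].value_counts())
```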

Merge the issue type from the pivots file into the classified master list.

In [80]:
result = pd.merge(masters, pivots[['issue_url','type']], how="left", on="issue_url")
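
A left merge silently duplicates rows if an `issue_url` appears more than once on the right-hand side, and leaves `type` empty where no issue matches. The sketch below, on toy frames with hypothetical keys, shows how the `validate` and `indicator` options of `pd.merge` can surface both problems:

```python
import pandas as pd

# Toy article and issue tables with made-up keys.
articles = pd.DataFrame({"URL": ["u1", "u2", "u3"],
                         "issue_url": ["i1", "i1", "i2"]})
issues = pd.DataFrame({"issue_url": ["i1"], "type": ["N"]})

checked = pd.merge(
    articles, issues, how="left", on="issue_url",
    validate="many_to_one",  # raises MergeError if an issue_url repeats in `issues`
    indicator=True,          # adds a "_merge" column: "both" or "left_only"
)

# Articles whose issue was not found in the issue table.
unmatched = checked[checked["_merge"] == "left_only"]
print(unmatched)
```

Checking `unmatched` before saving would show whether any records end up without an issue type.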

In [81]:
result.to_excel(saveas, index=False)