## Matching Scopus records to JSTOR articles
If an article's affiliations, citations, or abstract are already recorded on Scopus, I exclude it from the set of PDFs sent to docParser. Matching against the Scopus data is also useful for comparing the textual accuracy of OCR parsers.

1. An exact match: the title, year, issue, pages, and journal name all match.
2. A direct match: the Scopus id is explicitly mapped to the JSTOR id via a manually compiled dictionary.
3. An approximate match: the JSTOR and Scopus titles fuzzy match with a score of at least 95%, the first listed authors fuzzy match with a score of at least 95%, and the year, issue, pages, and journal name match.
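
The three match types above can be sketched as a single classification function. This is a minimal illustration rather than the notebook's implementation: it uses `difflib.SequenceMatcher` for both the title and the author comparison (the notebook uses `fuzz.token_sort_ratio` for authors), and the record keys (`url`, `first_author`, `journal`, and so on) and the `classify_match` helper are hypothetical names chosen for the example.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] between two strings, ignoring case and outer whitespace."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


def classify_match(jstor: dict, scopus: dict, direct: dict, threshold: float = 0.95) -> str:
    """Return 'exact', 'direct', 'approximate' or 'none' for one candidate pair.

    `direct` maps a JSTOR URL to a Scopus id (the manually compiled dictionary).
    """
    keys = ("year", "issue", "pages", "journal")
    same_meta = all(jstor[k] == scopus[k] for k in keys)

    # 1. exact: title and all bibliographic fields agree
    if same_meta and jstor["title"] == scopus["title"]:
        return "exact"
    # 2. direct: the pair is listed in the manual dictionary
    if direct.get(jstor["url"]) == scopus["scopus_id"]:
        return "direct"
    # 3. approximate: fuzzy title and first-author match, other fields agree
    if (same_meta
            and similarity(jstor["title"], scopus["title"]) >= threshold
            and similarity(jstor["first_author"], scopus["first_author"]) >= threshold):
        return "approximate"
    return "none"


# Illustrative records (made-up identifiers)
jstor_rec = {"url": "u1", "title": "the market for lemons: quality uncertainty and the market mechanism",
             "first_author": "akerlof, george a.", "year": "1970", "issue": "3",
             "pages": "488-500", "journal": "qje"}
scopus_rec = dict(jstor_rec, scopus_id="s1")
print(classify_match(jstor_rec, scopus_rec, direct={}))
```

Checking the exact case first keeps the cheap equality test ahead of the fuzzy comparison, which matters when every Scopus record is compared against many JSTOR candidates.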

If the matching algorithm does not produce an approximate match that satisfies the criteria in 3., it records the best match by title similarity, which is then used to check for errors in the Scopus or JSTOR metadata. I use this best-match list to:
- Resolve errors in page numbering, assigned year, issue numbers, or trivial title errors in the Scopus and JSTOR data.
- Compile a list of Scopus ids to exclude because they are not in the top 5 or have erroneous data. Each Scopus id on this list has a documented reason for its exclusion.
- Compile the list for 2., for cases where 1. and 3. would fail. This happens when either JSTOR or Scopus is missing author names for the entry. It usually occurs in the Scopus data, in which case it is not feasible to fill in the names because there is an associated author id.
- Compile a list of Scopus ids that match miscellaneous articles, for exclusion. I don't expect to analyze miscellaneous articles; they also often lack an author or are trivial, e.g. back matter. As for JSTOR articles, reports by committee members and publisher-originated content count as miscellaneous, but errata and corrections do not.
- Compile a list of cases where many Scopus ids are assigned to the same JSTOR article or many JSTOR articles are assigned to the same Scopus id. The latter is usually an error and gets added to the exclusion list.
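
The many-to-one cases in the last bullet can be found mechanically from a list of matched pairs. The sketch below is an assumption about how one might do this with the standard library; `find_many_to_one` and the pair format are hypothetical and not part of the notebook.

```python
from collections import defaultdict


def find_many_to_one(pairs):
    """Group (jstor_url, scopus_id) pairs and report the many-to-one cases.

    Returns two dicts:
      - JSTOR URL -> sorted Scopus ids, where several Scopus ids share one JSTOR article
      - Scopus id -> sorted JSTOR URLs, where several JSTOR articles share one Scopus id
    """
    by_jstor = defaultdict(set)
    by_scopus = defaultdict(set)
    for url, sid in pairs:
        by_jstor[url].add(sid)
        by_scopus[sid].add(url)

    many_scopus_one_jstor = {u: sorted(s) for u, s in by_jstor.items() if len(s) > 1}
    many_jstor_one_scopus = {s: sorted(u) for s, u in by_scopus.items() if len(u) > 1}
    return many_scopus_one_jstor, many_jstor_one_scopus


# Made-up identifiers for illustration
pairs = [("j1", "s1"), ("j1", "s2"), ("j2", "s3"), ("j3", "s3"), ("j4", "s4")]
msoj_example, mjos_example = find_many_to_one(pairs)
print(msoj_example)
print(mjos_example)
```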

This reconciliation is compiled in a JSON file called scopus_recon.json. The two largest correction categories are the title field and the pages field, in both datasets. After googling, I correct whichever dataset entry is incorrect so that as many cases as possible become exact matches.
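
For orientation, the layout of scopus_recon.json can be inferred from how the notebook reads it further down. The key names below are the ones the code accesses; every id, URL, and value is a made-up placeholder, and the exact fields inside each entry are an assumption.

```python
import json

# Hypothetical contents; only the top-level key names come from the notebook.
example = """
{
  "scopus_fix": [{"scopus_id": "S-0001", "Page start": "101", "Page end": "120"}],
  "fix_cleaned_jstor": [{"URL": "https://example.org/jstor/1", "pages": "101-120"}],
  "ignore_misc": ["S-0002"],
  "scopus_exclusion": {"S-0003": "two articles merged into one record"},
  "match_direct": {"https://example.org/jstor/2": "S-0004"},
  "ManyScoToOneJstor": {"https://example.org/jstor/3": ["S-0005", "S-0006"]},
  "ManyJstorToOneSco": {"S-0007": ["https://example.org/jstor/4", "https://example.org/jstor/5"]}
}
"""
recon = json.loads(example)

# Scopus ids handled through the many-to-one lists, combined as in the notebook
many_match = sorted(set(list(recon["ManyJstorToOneSco"].keys())
                        + sum(recon["ManyScoToOneJstor"].values(), [])))
print(many_match)
```

Keeping the exclusion list as a mapping from id to reason means each exclusion carries its own documentation.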

Why not match the Scopus id and the JSTOR id/DOI directly? The Scopus id is often the DOI of the article, and for older articles the JSTOR URL id is recorded as the DOI. However, Scopus occasionally assigns the wrong data (title, author, pages, references, etc.) to an unrelated DOI, so we cannot assume that Scopus ids match DOIs/JSTOR ids.


### Reasons for page differences:
- Scopus is off by 1 or 2 pages because there is an extra blank page in the journal that Scopus counts but JSTOR does not.
- Scopus is off because it doesn't count the first page of the paper, which has no page number on it.
- Scopus omits the last page, cuts off early and leaves off the reference list, or lists the appendix separately.
- Scopus is off by a page.
- Scopus has two articles erroneously assigned to the same DOI/entry in its records. The references of the two articles are also combined, as if they belonged to the same article.

- JSTOR is off by 1 page because it does not count the first page, which, being a title page, has no page number on it.
- JSTOR has two articles erroneously assigned to the same entry, e.g. a comment and its reply, but indexes the entry by only the authors of the first and second articles.
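
When reviewing the best-match list, it helps to see how far apart two page ranges are before deciding which of the reasons above applies. The helper below is a rough sketch under the assumption that pages are plain Arabic numerals in the form `start-end` or a single page; the function names and the tolerance of 2 pages are illustrative choices, not part of the notebook.

```python
def parse_pages(pages: str):
    """Turn '101-120' or '101' into a (start, end) pair of integers."""
    start, _, end = pages.strip().partition("-")
    return int(start), int(end or start)


def page_difference(jstor_pages: str, scopus_pages: str):
    """Return (start_diff, end_diff), computed as Scopus minus JSTOR."""
    j_start, j_end = parse_pages(jstor_pages)
    s_start, s_end = parse_pages(scopus_pages)
    return s_start - j_start, s_end - j_end


def describe(jstor_pages: str, scopus_pages: str, tolerance: int = 2) -> str:
    """Label a pair of page ranges as 'same', 'small offset' or 'large difference'."""
    d_start, d_end = page_difference(jstor_pages, scopus_pages)
    if d_start == 0 and d_end == 0:
        return "same"
    if abs(d_start) <= tolerance and abs(d_end) <= tolerance:
        return "small offset"  # e.g. an uncounted title page or blank page
    return "large difference"  # e.g. a missing reference list or merged articles


print(page_difference("101-120", "102-120"))
print(describe("101-120", "101-110"))
```

A small offset at the start suggests an uncounted first page, while a large difference at the end is more consistent with a dropped reference list or two articles merged into one entry.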


In [94]:
import pandas as pd
from difflib import SequenceMatcher as sq
from fuzzywuzzy import fuzz
from fuzzywuzzy import process
import datetime
import pickle
import pprint
import json

In [95]:
pd.set_option('display.max_rows', None)

In [96]:
base_path="/Users/sijiawu/Work/Thesis/Data"

jid=["aer", 'ecta', 'jpe', 'res', 'qje']
cleaned={}

for i in jid:
    # print(i)
    cleaned[i]=pd.read_excel(base_path+'/Processed/'+i.upper()+'_processed.xlsx')
    # print(j_data[i].dtypes)
    cleaned[i]['volume']=cleaned[i]['volume'].astype(int)
    cleaned[i]['year']=cleaned[i]['year'].astype(int)
    cleaned[i]['pages']=cleaned[i]['pages'].str.strip()
    cleaned[i]['number']=cleaned[i]['number'].astype(str).str.strip()
    cleaned[i]=cleaned[i].drop_duplicates(subset=['URL'], keep="last").reset_index(drop=True)
    cleaned[i]['jid']=i


scopus_base = pd.read_excel(base_path+'/SCOPUS/api_output/scopus_all.xlsx')
cleaned=pd.concat(cleaned.values()).reset_index(drop=True)

scopus_base['scopus_title_og']=scopus_base['Title']

In [97]:
missing_cleaned=[
{   'issue_url':'https://academic.oup.com//qje/issue/135/3',
    'author':"Enke, Benjamin",
    'title' :'What You See Is All There Is*',
    'journal' :"The Quarterly Journal of Economics",
    'volume' :'135',
    'number' :'3',
    'pages' :'1363-1398',
    'year' :'2020',
    'ISSN' :'0033-5533',
    'abstract' :"News reports and communication are inherently constrained by space, time, and attention. As a result, news sources often condition the decision of whether to share a piece of information on the similarity between the signal and the prior belief of the audience, which generates a sample selection problem. This article experimentally studies how people form beliefs in these contexts, in particular the mechanisms behind errors in statistical reasoning. I document that a substantial fraction of experimental participants follows a simple “what you see is all there is” heuristic, according to which participants exclusively consider information that is right in front of them, and directly use the sample mean to estimate the population mean. A series of treatments aimed at identifying mechanisms suggests that for many participants, unobserved signals do not even come to mind. I provide causal evidence that the frequency of such incorrect mental models is a function of the computational complexity of the decision problem. These results point to the context dependence of what comes to mind and the resulting errors in belief updating.",
    "URL" :"https://doi.org/10.1093/qje/qjaa012",
    "publisher":"Oxford University Press",
    'content_type':'Article',
    'type':'N',
    'jid':'qje',
    'author_split':"['Enke, Benjamin']"
}]
cleaned=pd.concat([pd.DataFrame(missing_cleaned),cleaned]).reset_index(drop=True)



In [98]:
# Open and read the JSON file
with open('scopus_recon.json', 'r') as file:
    recon_scopus = json.load(file)

s_fix=recon_scopus['scopus_fix'] #done
ignore_misc_scopus_ids=recon_scopus['ignore_misc'] #done
scopus_exclusion=recon_scopus['scopus_exclusion'].keys() #done
match_direct=recon_scopus['match_direct'] #done
msoj=recon_scopus['ManyScoToOneJstor'] #
mjos=recon_scopus['ManyJstorToOneSco'] #
fix_cleaned=recon_scopus['fix_cleaned_jstor'] #done

many_match=list(set(list(mjos.keys())+sum(msoj.values(),[])))

In [99]:
# create originals for comparison
for i in ['Volume', 'Issue', 'Page start', 'Page end', 'Title', 'Year']:
    scopus_base['scopus_'+i.lower().replace(' ','_')+'_og']=scopus_base[i]

for i in ['URL', 'number', 'title', 'author', 'pages']:
    cleaned[i+'_og']=cleaned[i]


In [100]:
scopus_base['s_fix']=0
for i in s_fix:
    for j in i.keys():
        if (j!='scopus_id'):
            scopus_base.loc[scopus_base['scopus_id']==i['scopus_id'], j]=i[j]
    scopus_base.loc[scopus_base['scopus_id']==i['scopus_id'], 's_fix']=1




In [101]:
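def proc_var(df, field):
    # Normalise a column in place to stripped strings; numeric-looking values drop the float suffix (e.g. '12.0' -> '12')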
    df[field]=df[field].astype(str)
    for i in df.index:
        if df.loc[i, field]=='nan':
            continue
        else:
            try:
                df.loc[i, field]=str(int(float(df.loc[i, field]))).strip()
            except Exception as e:
                # print(e)
                df.loc[i, field]=str(df.loc[i, field]).strip()
    

In [102]:
for i in cleaned.index:
    if pd.isna(cleaned.loc[i, 'pages'])==True:
        continue
    if cleaned.loc[i, 'pages'][0:3]=="p. ":
        cleaned.loc[i, 'pages']=cleaned.loc[i, 'pages'][3:]

proc_var(cleaned,'year')
proc_var(cleaned, 'pages')

In [103]:
proc_var(scopus_base,'Volume')
proc_var(scopus_base,'Year')
proc_var(scopus_base,'Issue')
proc_var(scopus_base,'Page start')
proc_var(scopus_base,'Page end')


In [104]:
scopus_base['scopus_pages']='nan'
for i in scopus_base.index:
    if scopus_base.loc[i,"Page end"].lower().strip()=="nan":
        if scopus_base.loc[i, "Page start"]=="nan":
            print("No pages")
        else:
            scopus_base.loc[i, "scopus_pages"]=scopus_base.loc[i,"Page start"].lower().strip()
    else:
        if scopus_base.loc[i, "Page start"].lower().strip()==scopus_base.loc[i,"Page end"].lower().strip():
            scopus_base.loc[i, "scopus_pages"]=str(scopus_base.loc[i, "Page start"]).lower().strip()
        else:
            scopus_base.loc[i, "scopus_pages"]=(str(scopus_base.loc[i, "Page start"]).lower()+"-"+str(scopus_base.loc[i,"Page end"]).lower()).strip()


No pages


In [105]:
rename_scopus={
 'jid': 'scopus_jid',
 'scopus_id': 'scopus_id',
 'authorgroup': 'scopus_authorgroup',
 'authors': 'scopus_authors',
 'affiliations': 'scopus_affiliations',
 'references': 'scopus_references',
 'Author full names': 'scopus_author_full_names',
 'Title': 'scopus_title',
 'Year': 'scopus_year',
 'Source title': 'scopus_source_title',
 'Volume': 'scopus_volume',
 'Issue': 'scopus_issue',
 'Art. No.': 'scopus_art_no',
 'Page start': 'scopus_page_start',
 'Page end': 'scopus_page_end',
 'Page count': 'scopus_page_count',
 'Cited by': 'scopus_cited_by',
 'DOI': 'scopus_doi',
 'Abstract': 'scopus_abstract',
 'Publisher': 'scopus_publisher',
 'Document Type': 'scopus_document_type',
 'Publication Stage': 'scopus_publication_stage',
 'Open Access': 'scopus_open_access',
 'Source': 'scopus_source',
 'EID': 'scopus_eid'
}

scopus_base = scopus_base.rename(columns=rename_scopus)

scopus_base['scopus_title']=scopus_base['scopus_title'].str.lower().str.strip().str.replace('‘',"'").str.replace('’',"'").str.replace('"',"'").str.replace('–','-').str.replace('‐','-').str.replace('™','').str.replace('- ','-').str.replace(' -','-').str.replace('“',"'").str.replace("*","").str.replace('”',"'").str.replace('behaviour','behavior').str.strip()
scopus_base['scopus_title']=scopus_base['scopus_title'].str.strip().str.split().apply(' '.join).str.strip()
for i in scopus_base.index:
    if '†' in scopus_base.loc[i, 'scopus_title']:
        scopus_base.loc[i, 'scopus_title']=scopus_base.loc[i, 'scopus_title'].strip()[:-1].strip()
    if scopus_base.loc[i, 'scopus_title'].strip()[-1]==".":
        scopus_base.loc[i, 'scopus_title']=scopus_base.loc[i, 'scopus_title'].strip()[:-1].strip()


cleaned['title']=cleaned['title'].str.lower().str.strip().str.replace('‘',"'").str.replace('’',"'").str.replace('"',"'").str.replace('–','-').str.replace('‐','-').str.replace('™','').str.replace('- ','-').str.replace(' -','-').str.replace('“',"'").str.replace('”',"'").str.replace("*","").str.replace('behaviour','behavior').str.strip()
cleaned['title']=cleaned['title'].str.strip().str.split().apply(' '.join).str.strip()
for i in cleaned.index:
    if '†' in cleaned.loc[i, 'title']:
        cleaned.loc[i, 'title']=cleaned.loc[i, 'title'].strip()[:-1].strip()
    if str(cleaned.loc[i, 'title']).strip()[-1]==".":
        cleaned.loc[i, 'title']=cleaned.loc[i, 'title'].strip()[:-1].strip()

In [106]:
cleaned.loc[cleaned["number"]==datetime.datetime(2023, 5, 6, 0, 0),'number']='5-6'
cleaned.loc[cleaned["number"]==datetime.datetime(2023, 3, 4, 0, 0),'number']='3-4'
cleaned.loc[cleaned["number"]==datetime.datetime(2023, 1, 2, 0, 0),'number']='1-2'


cleaned['j_fix']=0
for i in fix_cleaned:
    for j in i.keys():
        if j!='URL':
            cleaned.loc[cleaned['URL']==i['URL'],j]=i[j]
    cleaned.loc[cleaned['URL']==i['URL'], 'j_fix']=1


cleaned['title']=cleaned['title'].str.lower()
scopus_base['scopus_title']=scopus_base['scopus_title'].str.lower()


In [107]:
scopus_base.loc[scopus_base['scopus_issue']=='5 Part 2', 'scopus_issue']='5'
scopus_base.loc[scopus_base['scopus_issue']=='5 Part 1', 'scopus_issue']='5'
scopus_base.loc[scopus_base['scopus_issue']=='6 PART 1', 'scopus_issue']='6'
scopus_base.loc[scopus_base['scopus_issue']=='6 PART 2', 'scopus_issue']='6'

cleaned.loc[cleaned['number']=='2023-05-06 00:00:00', 'number']='5-6'
cleaned.loc[cleaned['number']=='2023-03-04 00:00:00', 'number']='3-4'
cleaned.loc[cleaned['number']=='2023-01-02 00:00:00', 'number']='1-2'

In [108]:
#discard scopus titles that are post 2020
year_range=[]
for i in range(1940,2021):
    year_range.append(str(i))

ex_years=['2021', '2022', '2023', '2024']

scopus_plus=scopus_base[(scopus_base["scopus_year"].isin(ex_years)==True)].reset_index(drop=True)
scopus_ignore=scopus_base[(scopus_base["scopus_year"].isin(ex_years)==False)&(scopus_base["scopus_id"].isin(ignore_misc_scopus_ids)==True)].reset_index(drop=True)
scopus_wrong=scopus_base[(scopus_base['scopus_id'].isin(scopus_exclusion)==True)].reset_index(drop=True)
scopus_many=scopus_base[(scopus_base['scopus_id'].isin(many_match)==True)].reset_index(drop=True)
scopus_direct=scopus_base[scopus_base['scopus_id'].isin(list(match_direct.values()))].reset_index(drop=True)
scopus=scopus_base[(scopus_base["scopus_year"].isin(ex_years)==False)&(scopus_base['scopus_id'].isin(list(match_direct.values()))==False)&(scopus_base['scopus_id'].isin(many_match)==False)&(scopus_base["scopus_id"].isin(ignore_misc_scopus_ids)==False)&(scopus_base['scopus_id'].isin(scopus_exclusion)==False)].reset_index(drop=True)


print(scopus_base.shape)

print(scopus_plus.shape)
print(scopus_ignore.shape)
print(scopus_wrong.shape)
print(scopus_many.shape)
print(scopus_direct.shape)
print(scopus.shape)

(15653, 33)
(828, 33)
(164, 33)
(44, 33)
(45, 33)
(45, 33)
(14527, 33)


In [109]:
828+164+44+45+45+14527

15653

In [110]:
merged=pd.merge(cleaned, scopus, how='left', left_on=['title', 'year', 'jid','number','pages'], right_on=['scopus_title', 'scopus_year', 'scopus_jid', 'scopus_issue','scopus_pages']).reset_index(drop=True)

#indicator for how the row was matched: 0 = exact merge on the relevant fields, 1 = approximate match, 2 = direct match, 3 = many jstor to one scopus, 4 = many scopus to one jstor
merged["scopus_indicator"]=0

for item in match_direct.keys():
    target=scopus_direct[scopus_direct['scopus_id']==match_direct[item]]
    for j in scopus_direct.columns:
        merged.loc[merged['URL']==item,j]=target[j].values[0]
        # print(target[j].values[0])
    merged.loc[merged['URL']==item, 'scopus_indicator']=2


m_add=[]
j_fields=['title_10', 'abstract', 'year', 'pages_og', 'content_type', 'volume', 'urldate', 'number', 'URL', 'author', 'author_og', 'type', 'ISSN', 'scopus_indicator', 'journal', 'title_og', 'publisher', 'number_og', 'pages', 'issue_url', 'reviewed-author', 'author_split', 'uploaded', 'jid', 'title', 'URL_og']
for i in mjos:
    # print(i)
    for j in mjos[i]:
        m_add.append(merged[merged["URL"]==j][j_fields].to_dict("records")[0]|
                     scopus_many[scopus_many['scopus_id']==i].to_dict('records')[0]|
                     {"scopus_indicator":3})

for i in msoj:
    # print(i)
    for j in msoj[i]:
        m_add.append(merged[merged["URL"]==i][j_fields].to_dict("records")[0]|
                     scopus_many[scopus_many['scopus_id']==j].to_dict('records')[0]|
                     {"scopus_indicator":4})

add=pd.DataFrame(m_add)


In [111]:
merged=merged[merged["URL"].isin(add['URL'].to_list())==False].reset_index(drop=True)
merged=pd.concat([merged, add]).reset_index(drop=True)

In [112]:
scopus_mis={}
for i in jid:
    print(i)
    sids=list(merged[merged['jid']==i]['scopus_id'].unique())
    temp=scopus[(scopus["scopus_id"].isin(sids)==False)&(scopus["scopus_jid"]==i)].reset_index(drop=True)
    scopus_mis[i]=temp
    


aer
ecta
jpe
res
qje


In [113]:
scopus_recon={}
# for each journal
for l in jid:
    print(l)
    a=0
    ff=[]
    b=0
    check=[]

    for i in scopus_mis[l].index:
        # print(i)
        found=0
        max_r=0
        sim=0
        m_sim=0
        fuzz_max=0
        fuzz_found=0
        target=None
        # if scopus_mis[l].loc[i,'scopus_id']!='10.1086/261724':
        #     continue
        for j in merged[(merged['year']==scopus_mis[l].loc[i, 'scopus_year'])&(merged['number']==scopus_mis[l].loc[i, 'scopus_issue'])&(merged['jid']==l)].index:
            # print(merged.loc[j, 'title'])
            seq_rat=sq(None, scopus_mis[l].loc[i,'scopus_title'],merged.loc[j, 'title']).ratio()
            # fuzz_rat=0
            # for k in str(merged.loc[j,'author']).split(' and '):
            #     if fuzz.token_sort_ratio(str(scopus_mis[l].loc[i,'scopus_author_full_names']).split('(')[0], k)>fuzz_rat:
            fuzz_rat=fuzz.token_sort_ratio(str(scopus_mis[l].loc[i,'scopus_author_full_names']).split('(')[0], str(merged.loc[j,'author']).split(' and ')[0])
            # print(merged.loc[j,'title'])
            if (seq_rat>=sim):
                if (fuzz_rat>=fuzz_max) | (pd.isna(merged.loc[j,'author'])) | (pd.isna(scopus_mis[l].loc[i,'scopus_author_full_names'])):
                    if (seq_rat==sim):
                        fuzz_found+=1
                    sim=seq_rat
                    m_sim=j
                    fuzz_max=fuzz_rat
                
            if (seq_rat>0.95) & (fuzz_rat>95)& (str(scopus_mis[l].loc[i,'scopus_issue'])==str(merged.loc[j,'number'])) & (pd.isna(merged.loc[j,'scopus_id'])==True):
                # print("execute")
                # print(scopus_mis[l].loc[i,'scopus_pages'])
                # print(merged.loc[j,'pages'])
                if (seq_rat>max_r)&((scopus_mis[l].loc[i,'scopus_pages'].lower()==merged.loc[j,'pages'].lower())|((scopus_mis[l].loc[i,'scopus_pages'].lower()+'-'+scopus_mis[l].loc[i,'scopus_pages'].lower())==merged.loc[j,'pages'].lower())):
                    max_r=seq_rat
                    target=j
                    found+=1
                # print(found)
                # print(scopus_mis[l].loc[i,'scopus_title'])
                # print('----match----'+merged.loc[j, 'title']+'    '+str(seq_rat))
                # print(scopus_mis[l].loc[i,'scopus_issue']+ '    '+str(merged.loc[j,'number']))
                # print(str(scopus_mis[l].loc[i,'scopus_author_full_names']).split('(')[0]+ '    '+str(merged.loc[j,'author']).split(' and ')[0])
                a+=1
                # print('\n')
        if found>1:
            ff.append({i:target})
            for k in scopus.columns:
                merged.loc[target,k]=scopus_mis[l].loc[i,k]
        elif found==1:
            # print(target)
            for k in scopus.columns:
                merged.loc[target,k]=scopus_mis[l].loc[i,k]
            merged.loc[target, 'scopus_indicator']=1
        else:
            
            if sim!=0:
                # print(sim)
                # print(str(scopus_mis[l].loc[i,'scopus_author_full_names']).split('(')[0])
                # print(str(merged.loc[m_sim, 'author']).split(' and ')[0])
                # print(fuzz.token_sort_ratio(str(scopus_mis[l].loc[i,'scopus_author_full_names']).split('(')[0], str(merged.loc[m_sim, 'author']).split(' and ')[0]))
                # print(str(scopus_mis[l].loc[i,'scopus_issue']))
                # print(str(merged.loc[m_sim,'number']))
                # print(temp.loc[i,'scopus_title'])
                # print(sim)
                # print(m_sim)
                # print('{"scopus_id":"'+temp.loc[i,'scopus_id'] + '", "Title":"'+merged.loc[m_sim, 'title']+'"}')
                check.append({
                    "scopus_id":scopus_mis[l].loc[i,'scopus_id'],
                    "title":merged.loc[m_sim, 'title'],
                    "title_scopus":scopus_mis[l].loc[i,'scopus_title'],
                    "sim":sim, #similarity score
                    "as":scopus_mis[l].loc[i,'scopus_author_full_names'],
                    "v":scopus_mis[l].loc[i,'scopus_volume'],
                    "is":scopus_mis[l].loc[i,'scopus_issue'],
                    'i':merged.loc[m_sim, 'number'],
                    'ps':scopus_mis[l].loc[i,'scopus_pages'],
                    "a":merged.loc[m_sim, 'author'],
                    "URL":merged.loc[m_sim, 'URL'],
                    'p':merged.loc[m_sim, 'pages'],
                    'y': merged.loc[m_sim, 'year'],
                    'ys':scopus_mis[l].loc[i,'scopus_year'],
                    'jid':merged.loc[m_sim, 'jid'],

                })
                b+=1

    scopus_recon[l]={
        "found_count": a, 
        "check_count": b, 
        "conflict_match": ff, 
        "check": check
        }

aer
ecta
jpe
res
qje


In [114]:
scopus_recon

{'aer': {'found_count': 127,
  'check_count': 0,
  'conflict_match': [],
  'check': []},
 'ecta': {'found_count': 37,
  'check_count': 0,
  'conflict_match': [],
  'check': []},
 'jpe': {'found_count': 58,
  'check_count': 0,
  'conflict_match': [],
  'check': []},
 'res': {'found_count': 164,
  'check_count': 0,
  'conflict_match': [],
  'check': []},
 'qje': {'found_count': 196,
  'check_count': 0,
  'conflict_match': [],
  'check': []}}

In [115]:
# check_scopus=[]

# for i in jid:
#     if i != 'qje':
#         continue
#     for j in scopus_recon[i]['check']:
#         check_scopus.append(j)

# check_scopus

In [116]:
merged[merged['year'].isin(year_range)]['issue_url'].unique().shape

(2048,)

In [117]:
content=['Article', 'Comment', 'Reply', 'Rejoinder']
content_ex=['MISC','Discussion','Review', 'Review2', 'Errata']

In [118]:
match_summary=[]
scopus_summary=[]
for i in jid:
    sids=list(merged[(merged['jid']==i)&(merged['scopus_id'].isna()==False)]["scopus_id"].unique())
    sids_n=merged[(merged['jid']==i)&(merged['scopus_id'].isna()==False)].shape[0]
    nrm_merged=merged[(merged['jid']==i)&(merged['content_type'].isin(content)==True)&(merged['year'].isin(year_range)==True)].shape[0]

    temp=scopus[(scopus["scopus_id"].isin(sids)==False)&(scopus["scopus_jid"]==i)&(scopus["scopus_year"].isin(ex_years)==False)].reset_index(drop=True)
    scopus_mis[i]=temp

    gr_2020=scopus_plus[scopus_plus['scopus_jid']==i].shape[0]
    ign_misc=scopus_ignore[scopus_ignore['scopus_jid']==i].shape[0]
    sc_err=scopus_wrong[scopus_wrong['scopus_jid']==i].shape[0]

    scopus_count=scopus_base[(scopus_base['scopus_jid']==i)&(scopus_base['scopus_year'].isin(ex_years)==False)].shape[0]
    result=len(sids)*100/scopus_count

    result3=sids_n*100/merged[(merged['jid']==i)]['URL'].unique().shape[0]
    
    sids_nrm=list(merged[(merged['jid']==i)&(merged['year'].isin(year_range)==True)&(merged['scopus_id'].isna()==False)&(merged['content_type'].isin(content)==True)]['scopus_id'].unique())
    sids_nrm_n=merged[(merged['jid']==i)&(merged['year'].isin(year_range)==True)&(merged['scopus_id'].isna()==False)&(merged['content_type'].isin(content)==True)].shape[0]

    
    result2=sids_nrm_n*100/nrm_merged

    scopus_summary.append({
        "journal": i,
        "articles on scopus": scopus_base[(scopus_base['scopus_jid']==i)].shape[0],
        "article year > 2020": gr_2020,
        "ignored misc articles": ign_misc,
        "discarded metadata with errors": sc_err,
        "scopus match candidates": scopus_base[scopus_base['scopus_jid']==i].shape[0]-gr_2020-ign_misc-sc_err,
        "match %": f"{result:.4f}",
        "many scopus one jstor": len(merged[(merged['jid']==i)&(merged['scopus_indicator']==4)]['scopus_id'].unique()),
        "many jstor one scopus": len(merged[(merged['jid']==i)&(merged['scopus_indicator']==3)]['scopus_id'].unique()),
        "direct match": len(merged[(merged['jid']==i)&(merged['scopus_indicator']==2)]['scopus_id'].unique()),
        "scopus matched": len(merged[(merged['jid']==i)&(merged['scopus_id'].isna()==False)&(merged['scopus_indicator']==0)&(merged['s_fix']!=1)]['scopus_id'].unique()),
        "scopus match on adj": len(merged[(merged['jid']==i)&(merged['scopus_id'].isna()==False)&(merged['scopus_indicator']==0)&(merged['s_fix']==1)]['scopus_id'].unique()),
        "scopus approx. matched": len(merged[(merged['jid']==i)&(merged['scopus_indicator']==1)]['scopus_id'].unique()),
        "scopus unmatched": len(scopus[(scopus["scopus_jid"]==i)&(scopus["scopus_id"].isin(sids)==False)]['scopus_id'].unique()),
    })

    

    match_summary.append({
        "journal": i,
        "jstor <=2020": merged[(merged['jid']==i)]['URL'].unique().shape[0],
        "scopus <=2020": scopus_count,
        "matches*": sids_n,
        "match %": f"{result3:.4f}",
        "2020>= NMR >=1940": nrm_merged,
        "2020>= NMR >=1940 matches": sids_nrm_n,
        "2020>= NMR >=1940 match %": f"{result2:.4f}",
        # "unmatched scopus articles": temp.shape[0],
        # "unmatched scopus articles post 1940": temp[temp['scopus_year'].isin(year_range)==True].shape[0],
    })
    # cids=cleaned['title'].unique()
    # cids.sort()

summary=pd.DataFrame(match_summary)
summary.to_csv(base_path+"/Combined/011_scopus_match_summary.csv", index=False)

sc_summary=pd.DataFrame(scopus_summary)
sc_summary.to_csv(base_path+"/Combined/011_scopus_summary.csv", index=False)

In [119]:
summary
# *matches include duplicate matches 

| journal | jstor <=2020 | scopus <=2020 | matches* | match % | 2020>= NMR >=1940 | 2020>= NMR >=1940 matches | 2020>= NMR >=1940 match % |
|---|---|---|---|---|---|---|---|
| aer | 27564 | 4360 | 4225 | 15.328 | 12804 | 4198 | 32.7866 |
| ecta | 9339 | 1633 | 1605 | 17.186 | 5348 | 1591 | 29.7494 |
| jpe | 14345 | 1293 | 1290 | 8.9927 | 4768 | 1276 | 26.7617 |
| res | 4140 | 3037 | 3019 | 72.9227 | 3242 | 2855 | 88.0629 |
| qje | 6869 | 4502 | 4485 | 65.2933 | 3700 | 3187 | 86.1351 |


In [120]:
sc_summary

| journal | articles on scopus | article year > 2020 | ignored misc articles | discarded metadata with errors | scopus match candidates | match % | many scopus one jstor | many jstor one scopus | direct match | scopus matched | scopus match on adj | scopus approx. matched | scopus unmatched |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| aer | 4360 | 0 | 129 | 12 | 4219 | 96.7661 | 2 | 4 | 1 | 4004 | 81 | 127 | 0 |
| ecta | 1910 | 277 | 27 | 1 | 1605 | 98.2854 | 0 | 0 | 1 | 1541 | 26 | 37 | 0 |
| jpe | 1542 | 249 | 1 | 2 | 1290 | 99.768 | 0 | 0 | 5 | 1183 | 44 | 58 | 0 |
| res | 3339 | 302 | 3 | 15 | 3019 | 99.4073 | 6 | 0 | 13 | 2682 | 154 | 164 | 0 |
| qje | 4502 | 0 | 4 | 14 | 4484 | 99.6002 | 32 | 1 | 25 | 3968 | 262 | 196 | 0 |


In [121]:
sc_summary[['journal', 'articles on scopus', 'article year > 2020',
       'ignored misc articles', 'discarded metadata with errors',
       'scopus match candidates', 'match %']].to_csv(base_path+"/Combined/011_scopus_summary_p1.csv", index=False)

In [122]:
sc_summary[['journal','scopus match candidates','many scopus one jstor',
       'many jstor one scopus', 'direct match', 'scopus matched',
       'scopus match on adj', 'scopus approx. matched', 'scopus unmatched']].to_csv(base_path+"/Combined/011_scopus_summary_p2.csv", index=False)

In [123]:
merged['year']=merged['year'].astype(int)
merged.to_pickle(base_path+"/Combined/011_merged_proc_scopus_inception_2020.pkl")

In [124]:
merged.content_type.unique()

array(['Article', 'MISC', 'Comment', 'Reply', 'Errata', 'Rejoinder',
       'Discussion', 'Review', 'Review2'], dtype=object)