## This section aims to match up Scopus records and Jstor articles
If an article's affiliations, citations or abstracts are recorded on Scopus, I want to exclude them from the set of pdf's that are sent to docParser. Matching up the Scopus data is also useful for comparing the textual accuracy of OCR parsers. I use volume, issue, year and page numbers which are common to both the scopus data and the Jstor metadata to match articles. 

Then I use a sequence comparison between the journal titles of the matched articles to decide if the scopus data has been matched correctly. If the match ratio is below 70%, the title is investigated and if wrong, the scopus data for that matched article is eihter corrected or discarded. If the scopus data is missing all of affiliations, abstract and citations fields then the match is also discarded.

Finally, if the document type of scopus is different to the classification done during the cleaning section, the article is reclassified according to the Scopus document type.

In [None]:
import pandas as pd
import numpy as np
from difflib import SequenceMatcher as sq
from fuzzywuzzy import fuzz
from fuzzywuzzy import process
import datetime
import pickle

In [485]:
base_path="/Users/sijiawu/Work/Thesis/Data"

jid=["aer", 'ecta', 'jpe', 'res', 'qje']
cleaned={}

for i in jid:
    # print(i)
    cleaned[i]=pd.read_excel(base_path+'/Processed/'+i.upper()+'_processed.xlsx')
    # print(j_data[i].dtypes)
    cleaned[i]['volume']=cleaned[i]['volume'].astype(int)
    cleaned[i]['year']=cleaned[i]['year'].astype(int)
    cleaned[i]['pages']=cleaned[i]['pages'].str.strip()
    cleaned[i]['number']=cleaned[i]['number'].astype(str).str.strip()
    cleaned[i]=cleaned[i].drop_duplicates(subset=['URL'], keep="last").reset_index(drop=True)
    cleaned[i]['jid']=i


scopus = pd.read_excel(base_path+'/SCOPUS/api_output/scopus_all.xlsx')
cleaned=pd.concat(cleaned.values()).reset_index(drop=True)

In [486]:
cleaned.journal.unique()

array(['The American Economic Review', 'Econometrica',
       'Journal of Political Economy', 'The Review of Economic Studies',
       'The Quarterly Journal of Economics'], dtype=object)

In [487]:
s_fix=[{'scopus_id': '10.2307/1914237', 'Volume': 46.0, 'Issue': 3},
 {'scopus_id': '10.1086/698748', 'Volume': 126.0, 'Issue': 'S1'},
 {'scopus_id': '10.1086/698750', 'Volume': 126.0, 'Issue': 'S1'},
 {'scopus_id': '10.1086/698749', 'Volume': 126.0, 'Issue': "S1"},
 {'scopus_id': '10.1086/698751', 'Volume': 126.0, 'Issue': 'S1'},
 {'scopus_id': '10.1145/2492002.2482556', 'Volume': 127, 'Issue': 2},
 {'scopus_id': '10.1086/698752', 'Volume': 126.0, 'Issue': "S1"},
 {'scopus_id': '10.1086/698759', 'Volume': 126.0, 'Issue': "S1"},
 {'scopus_id': '10.1086/698760', 'Volume': 126.0, 'Issue': "S1"},
 {'scopus_id': '10.1086/261816', 'Volume': 100, 'Issue': 2},
 {'scopus_id': '10.2307/2295860', 'Volume': 26, 'Issue': 1},
 {'scopus_id': '10.2307/2295857', 'Volume': 26, 'Issue': 1},
 {'scopus_id': '10.2307/2295770', 'Volume': 21, 'Issue': 3},
 {'scopus_id': '10.2307/2296006', 'Volume': 21, 'Issue': 2},
 {'scopus_id': '10.2307/2967659', 'Volume': 4, 'Issue': 1},
 {'scopus_id': '10.2307/2967395', 'Volume': 6, 'Issue': 2},
 {'scopus_id': '10.1093/qje/100.Supplement.823', 'Volume': 100.0, 'Issue': "supplement"},
 {'scopus_id': '10.1093/qje/90.2.344', 'Volume': 89, 'Issue': 4 },
 {'scopus_id': '10.1093/qje/52.Supplement.5', 'Volume': 52.0, 'Issue': "supplement"},
 {'scopus_id': '10.1093/qje/52.Supplement.117', 'Volume': 52.0, 'Issue': "supplement"},
 {'scopus_id': '10.1093/qje/52.Supplement.140', 'Volume': 52.0, 'Issue': 'supplement'},
 {'scopus_id': '10.1093/qje/52.Supplement.9', 'Volume': 52.0, 'Issue': 'supplement'},
 {'scopus_id': '10.2307/2297467', 'Volume': 52.0, 'Issue': '1'},
 {'scopus_id': '10.2307/2297172', 'Volume': 46, 'Issue': '1'},

 {'scopus_id':'10.3982/ECTA17318', 'Title':'CORRIGENDUM TO “TRADING AND INFORMATION DIFFUSION IN OVER-THE-COUNTER MARKETS”'},
 {'scopus_id':'10.3982/ECTA10449', 'Title': 'CORRIGENDUM TO "COMPETING MECHANISMS IN A COMMON VALUE ENVIRONMENT"'},
 {'scopus_id':'10.2307/1914014', "Title":'Estimating the Time Costs of Highway Congestion'},
 {'scopus_id':'10.2307/2171801', 'Title':'individual income, incomplete information, and aggregate consumption'},
 {'scopus_id': '10.2307/1911189', 'Title':'Local Asymptotic Specification Error Analysis'},
 {'scopus_id':'10.2307/1914143','Title':'On the Value of Sample Separation Information'},
 {'scopus_id': '10.2307/2171792', 'Title': 'A Comment on "Learning, Mutation, and Long-Run Equilibria in Games"'},
 {'scopus_id': '10.1111/j.1468-0262.2002.00446.x', "Title": 'Rationalizing Choice Functions by Multiple Rationales'},
 {"Title":'Extending the Classical Normal Errors-in-Variables Model', 'scopus_id': '10.2307/1912823'},
 {"Title":'Several Tests for Model Specification in the Presence of Alternative Hypotheses', 'scopus_id':  '10.2307/1911522'},
 {"Title":'On Seemingly Unrelated Regressions with Error Components','scopus_id':'10.2307/1912824'},
 {"Title":'Approximating a Truncated Normal Regression with the Method of Moments','scopus_id':'10.2307/1912173'},
 {"Title":'Nonlinear Regression on Cross-Section Data' , 'scopus_id':  '10.2307/1913132'},
 {'Title':'PERCEIVED AMBIGUITY AND RELEVANT MEASURES', 'scopus_id': '10.3982/ECTA9872'},
 {'scopus_id':'10.1086/260935','Title':'Uncertainty and Exhaustible Resource Markets'},
{'scopus_id':'10.1086/261170','Title':'Environmental Regulations and Productivity Growth: The Case of Fossil-fueled Electric Power Generation'},
{'scopus_id':'10.1086/261012','Title':'Subsidies to New Energy Sources: Do They Add to Energy Stocks?'},

    {'scopus_id': '10.1086/710554', 'Title': 'erratum: chinese college admissions and school choice reforms: a theoretical analysis'},
    {'scopus_id': '10.1086/703048','Title': 'erratum: the demand for effective charter schools'},
    {'scopus_id': '10.1086/658496', 'Title': 'erratum: measurement error and the relationship between investment and q'},
 {'scopus_id': '10.1086/662074', 'Title': 'erratum: the competitive saving motive: evidence from rising sex ratios and savings rates in china'},
 {'scopus_id': '10.1086/597025', 'Title': 'erratum: moral hazard versus liquidity and optimal unemployment insurance'},
 {'scopus_id': '10.1086/250083', 'Title': 'wages, implicit contracts, and the business cyle: evidence from canadian micro data'},
 {'scopus_id': '10.1086/250041', 'Title': 'new evidence on property tax capitalization'},
  {'scopus_id': '10.1086/523713', 'Title': "erratum: 'the accident externality from driving'"},
  {'scopus_id': '10.1086/500278', 'Title': 'the political economy of corporate control and labor rents'},
   {'scopus_id': '10.1086/317683',  'Title': 'parental benefits from intergenerational coresidence: empirical evidence from rural pakistan'},
   {'scopus_id': '10.1086/317684',  'Title': 'a ricardian model with a continuum of goods under nonhomotheticpreferences: demand complementarities, income distribution, andnorth-south trade'},
   {'scopus_id': '10.1086/338746',  'Title': 'on theories explaining the success of the gravity equation'},
 {'scopus_id': '10.1086/262009',  'Title': "the economics of polygyny in sub-saharan africa: female productivity and the demand for wives in côte d'ivoire"},
 {'scopus_id': '10.1086/261761', 'Title': 'procyclical labor productivity and competing theories of the business cycle: some evidence from interwar u.s. manufacturing industries'},
 {'scopus_id': '10.1086/261869', 'Title': 'trade liberalization and the theory of endogenous protection: an econometric study of u.s. import policy'},
 {'scopus_id': '10.1086/261728', 'Title': 'tax reform and u.s. economic growth'},
 {'scopus_id': '10.1086/261985',
  'Title': 'can imperfect competition explain the difference between primal and dual productivity measures? estimates for u.s. manufacturing'},
  {'scopus_id': '10.1086/260965',
  'Title': 'equalizing discrimination and cartel pricing in transport rate regulation'},
 {'scopus_id': '10.1086/260914',
  'Title': 'completed fertility and its timing'},
 {'scopus_id': '10.1086/260744',
  'Title': 'the transition of land to urban use'},
 {'scopus_id': '10.1086/260864',
  'Title': 'economic losses from forecasting error in agriculture'},
 {'scopus_id': '10.1086/260963',
  'Title': 'interpreting economic time series'},
 {'scopus_id': '10.1086/260737',
  'Title': "an economic basis for the 'national defense argument' for aiding certain industries"},
    {'scopus_id': '10.1086/318605',
  'Title': 'how much did the liberty shipbuilders learn? new evidence for an old case study'},
 {'scopus_id': '10.1086/261107',
  'Title': 'wasteful commuting',
  },
    {'scopus_id':'10.1086/685753','Title':'Erratum: Production versus Revenue Efficiency with Limited Tax Capacity: Theory and Evidence from Pakistan'},


{'scopus_id': '10.1093/restud/rdt002',
  'Title': 'estimating ethnic preferences using ethnic housing quotas in singapore'},
{'scopus_id': '10.1111/1467-937X.00176',
  'Title': 'international trade and currency exchange'},
   {'scopus_id': '10.2307/2297131',
  'Title': 'on sums of production set frontiers'},
   {'scopus_id': '10.2307/2297356',
  'Title': "inefficiency and the demand for 'money' in a sequence economy: a correction"},
  {'scopus_id': '10.2307/2296982',
  'Title': 'more on prices vs. quantities'},

  {'scopus_id': '10.2307/2296360',
  'Title': 'the optimal linear income-tax'},
   {'scopus_id': '10.2307/2296783',
  'Title': 'majority voting and social choice'},
  {'scopus_id': '10.2307/2296504',
  'Title': 'the effect of demand on prices in british manufacturing: another view'},
  {'scopus_id': '10.2307/2296731',
  'Title': '[comment on garegnani]: a reply'},
   
   {'scopus_id': '10.2307/2296621',
  'Title': "[the existence and persistence of cycles in a non-linear model: kaldor's 1940 model re-examined]: a comment"},
  {'scopus_id': '10.2307/2296732',
  'Title': 'expedient choice of transforms in phase-diagramming'} ,
    {'scopus_id': '10.2307/2967562',
  'Title': '[economic theory and socialist economy]: a rejoinder'},
  {'scopus_id': '10.1093/restud/rdz014',
  'Title': 'corrigendum: community enforcement of trust with bounded memory'},
 
{'scopus_id': '10.1093/restud/rds039',
  'Title': 'erratum: endogenous games and mechanisms: side payments among players'},
   {'scopus_id': '10.1111/j.1467-937X.2009.00590.x',
  'Title': 'erratum: sovereign debt without default penalties'},
  {'scopus_id': '10.2307/2298038',
  'Title': 'r&d and economic growth'},
  {'scopus_id': '10.2307/2296610',
  'Title': 'learning by doing and infant industry protection: a partial equilibrium approach'},
  {'scopus_id': '10.2307/2296626',
  'Title': "note on 'the structure of utility functions'"},


{'scopus_id': '10.2307/2296467',
  'Title': 'on experimental research in oligopoly'},
  {'scopus_id': '10.2307/2296435',
  'Title': 'on putty-clay: a comment'},
   {'scopus_id': '10.2307/2974429',
  'Title': 'marginal productivity and the macro-economic theories of distribution: reply to pasinetti and robinson'},
     {'scopus_id': '10.2307/2296246',
  'Title': 'monetary and value theory: comments'},
   {'scopus_id': '10.2307/2295775',
  'Title': '[a note on a point in value and capital]: a reply'},
 {'scopus_id': '10.2307/2295782',
  'Title': '[a futher note on the theory of inflation]: a reply'},
  {'scopus_id': '10.2307/2295776',
  'Title': '[a note on a point in value and capital]: a rejoinder'},
 {'scopus_id': '10.2307/2296228',
  'Title': 'memorandum on the sterling assets of the british colonies: a comment'},
   {'scopus_id': '10.2307/2295778',
  'Title': '[the role of national income estimates in the statistical policy of an underdeveloped area]: a rejoinder'},
   {'scopus_id': '10.2307/2296089',
  'Title': '[a new view of the economics of international readjustment]: a comment'},
   {'scopus_id': '10.2307/2295758',
  'Title': '[community indifference]: a comment'},

{'scopus_id': '10.2307/2967547',
  'Title': 'taxation and production: the wicksell analysis'},

 {'scopus_id': '10.2307/2967621',
  'Title': 'notes on the elasticity of substitution: i'},
 {'scopus_id': '10.2307/2967620',
  'Title': 'taxation and returns: a rejoinder'},
  {'scopus_id': '10.2307/2967530',
  'Title': '[the power of undervalued currency]: a reply'},
 {'scopus_id': '10.2307/2967557',
  'Title': 'a proposal for making monetary management effective in the united states'},
  {'scopus_id': '10.2307/2967506',
  'Title': "further notes on elasticity of substitution: i. note on dr. machlup's article"},
   {'scopus_id': '10.2307/2967660',
  'Title': 'on the economic theory of socialism: part one'},
  {'scopus_id': '10.2307/2295980',
  'Title': "'the economist and the state'--an addendum"},
 
  {'scopus_id': '10.2307/2296572',
  'Title': 'the consumption and the output turnpike theorems in a von neumann type of model--a finite term problem'},
 {'scopus_id': '10.2307/2296247',
  'Title': 'on the invariance of demand for cash and other assets'},
   {'scopus_id': '10.2307/2296545',
  'Title': 'the general instability of a class of competitive growth processes'},
   {'scopus_id': '10.1093/restud/rdt011',
  'Title': 'r&d and productivity: estimating endogenous productivity'},
 {'scopus_id': '10.1111/1467-937X.t01-1-00029',
  'Title': 'strategic delay in a real options model of r&d competition'},
  {"scopus_id":"10.1093/restud/rdv008","Year": "2015"},
  {"scopus_id":"10.1093/restud/rdu039","Year": "2015"},
{"scopus_id":"10.1093/restud/rdu041","Year": "2015"},
{"scopus_id":"10.1093/restud/rdu040","Year": "2015"},
{"scopus_id":"10.1093/restud/rdu045","Year": "2015"},
{"scopus_id":"10.1093/restud/rdv012","Year": "2015"},
{"scopus_id":"10.1093/restud/rdu043","Year": "2015"},
{"scopus_id":"10.1093/restud/rdu037","Year": "2015"},
{"scopus_id":"10.1093/restud/rdv001","Year": "2015"},
{"scopus_id":"10.1093/restud/rdv003","Year": "2015"},
{"scopus_id":"10.1093/restud/rdu036","Year": "2015"},
{"scopus_id":"10.1093/restud/rdu044","Year": "2015"},
{"scopus_id":"10.1093/restud/rdv009","Year": "2015"},
{"scopus_id":"10.1093/restud/rdu025","Year": "2015"},
{"scopus_id":"10.1093/restud/rdv005","Year": "2015"},
{"scopus_id":"10.1093/restud/rdu024","Year": "2015"},
{"scopus_id":"10.1093/restud/rdv006","Year": "2015"},
{"scopus_id":"10.1093/restud/rdv004","Year": "2015"},
{"scopus_id":"10.1093/restud/rdu017","Year": "2015"},
{"scopus_id":"10.1093/restud/rdv011","Year": "2015"},
{"scopus_id":"10.1093/restud/rdv018","Year": "2015"},
{"scopus_id":"10.1093/restud/rdv024","Year": "2015"},
{"scopus_id":"10.1093/restud/rdv017","Year": "2015"},
{"scopus_id":"10.1093/restud/rdv015","Year": "2015"},
{"scopus_id":"10.1093/restud/rdv023","Year": "2015"},
{"scopus_id":"10.1093/restud/rdv020","Year": "2015"},
{"scopus_id":"10.1093/restud/rdv016","Year": "2015"},
{"scopus_id":"10.1093/restud/rdv019","Year": "2015"},
{"scopus_id":"10.1093/restud/rdv025","Year": "2015"},
{"scopus_id":"10.1093/restud/rdv027","Year": "2015"},
{"scopus_id":"10.1093/restud/rdv014","Year": "2015"},
{"scopus_id":"10.1093/restud/rdt038","Year": "2014"},
{"scopus_id":"10.1093/restud/rdt046","Year": "2014"},
{"scopus_id":"10.1093/restud/rdt043","Year": "2014"},
{"scopus_id":"10.1093/restud/rdt041","Year": "2014"},
{"scopus_id":"10.1093/restud/rdt044","Year": "2014"},
{"scopus_id":"10.1093/restud/rdt045","Year": "2014"},
{"scopus_id":"10.1093/restud/rdt035","Year": "2014"},
{"scopus_id":"10.1093/restud/rdt039","Year": "2014"},
{"scopus_id":"10.1093/restud/rdt042","Year": "2014"},
{"scopus_id":"10.1093/restud/rdt036","Year": "2014"},
{"scopus_id":"10.1093/restud/rdt040","Year": "2014"},
{"scopus_id":"10.1093/restud/rdt037","Year": "2014"},
{"scopus_id":"10.2307/2296270","Year": "1951"},
{"scopus_id":"10.2307/2296265","Year": "1951"},
{"scopus_id":"10.2307/2296268","Year": "1951"},
{"scopus_id":"10.2307/2296267","Year": "1951"},
{"scopus_id":"10.2307/2296271","Year": "1951"},
{"scopus_id":"10.2307/2296266","Year": "1951"},
{"scopus_id":"10.2307/2296272","Year": "1951"},
{"scopus_id":"10.2307/2296269","Year": "1951"},
{'scopus_id':'10.2307/2967529','Title':'The Power of Undervalued Currency: A Methodological Comment'},
{'scopus_id':'10.2307/2967420','Title':'[Complementarity and Interrelations of Shifts in Demand]: A Comment'},
{'scopus_id':'10.2307/2295860','Title':"[A Note on Kaldor's `Speculation and Economic Stability']: A Comment"},
{'scopus_id':'10.2307/2296500', 'Title':'the existence of optimal distributed lags'},
  {'scopus_id':'10.2307/2296392', 'Title': "correction to 'on optimal development in a multi-sector economy'"},
{'scopus_id':'10.1093/restud/rdt040','Title':'RETRACTED: Growing up in a Recession'},
     {'scopus_id': '10.2307/1882546',
  'Title': 'positive theory of capital: comments'},
   {'scopus_id': '10.1093/qje/qjx039',
  'Title': "erratum to 'the short-term impact of unconditional cash transfers to the poor: experimental evidence from kenya'"},
 {'scopus_id': '10.1093/qje/qjx025',
  'Title': "erratum to 'field of study, earnings, and self-selection'"},
   {'scopus_id': '10.1093/qje/qjx036',
  'Title': "erratum to 'leveraging lotteries for school value-added: testing and estimation'"},
   {'scopus_id': '10.1093/qje/qjv009',
  'Title': 'erratum: note on proposition 1(a-b) in dal bó, finan and rossi (2013)—strengthening state capabilities: the role of financial incentives in the call to public service, the quarterly'},
   {'scopus_id': '10.1093/qje/qjr047',
  'Title': "erratum: accountability and flexibility in public schools: evidence from boston's charters and pilots"},
  {'scopus_id': '10.1162/0033553042476134',
  'Title': 'erratum: investor protection, optimal incentives, and economic growth'},
   {'scopus_id': '10.2307/2946652',
  'Title': 'corrigendum'},
 {'scopus_id': '10.2307/2937963',
  'Title': 'the dominant-firm advantage in multiproduct industries: evidence from the u. s. airlines'},
   {'scopus_id': '10.2307/1884282',
  'Title': 'the cyclical component of u. s. economic activity'},
 {'scopus_id': '10.2307/1886022',
  'Title': 'adverse selection in the market for slaves: new orleans, 1830-1860'},
   {'scopus_id': '10.2307/1885872',
  'Title': 'the role of knowledge in r&d efficiency'},
   {'scopus_id': '10.2307/1885351',
  'Title': 'choosing new industrial capacity: on-site expansion, branching, and relocation'},





 {'scopus_id': '10.2307/1881983',
  'Title': 'bohm-bawerk on rae'},
 {'scopus_id': '10.2307/1884977',
  'Title': 'final objections to the risk theory of profit: a reply'},
   {'scopus_id': '10.2307/1883669',
  'Title': "'the variation of productive forces': a comment"},
  {'scopus_id': '10.2307/1883305',
  'Title': 'the concept of value: a rejoinder'},
 {'scopus_id': '10.2307/1883561',
  'Title': 'the insurance of bank deposits in the west: ii'},
 {'scopus_id': '10.2307/1883940',
  'Title': 'the sherman act: its design and its effects'},
 {'scopus_id': '10.2307/1883702',
  'Title': 'the monetary theory of the trade cycle and its statistical test cycle and its statistical test',
  },

{'scopus_id': '10.2307/1885821',
  'Title': "'superest ager'"}, 

  {'scopus_id': '10.2307/1883281',
  'Title': 'saving and investment: saving in process analysis'},
 {'scopus_id': '10.2307/1883288',
  'Title': "chamberlin's monopoly supply curve: reply"},
 {'scopus_id': '10.2307/1883282',
  'Title': 'saving and investment: saving and savings'},

{'scopus_id': '10.2307/1882614',
  'Title': "professor knight's capital theory: a reply"},
{'scopus_id': '10.2307/1884071',
  'Title': 'experiments in wheat control: the agricultural adjustment act, 1933'},
 {'scopus_id': '10.2307/1882349',
  'Title': 'the historical emergence of quantity theory: comments'}, 
{'scopus_id': '10.2307/1883552',
  'Title': 'more pitfalls in demand and supply curve analysis: some comments'}, 
{'scopus_id': '10.2307/1884803',
  'Title': 'the yellow dog contract: an explanation'},
 {'scopus_id': '10.2307/1883902',
  'Title': "sombart's die drei nationalökonomien by werner sombart"},
 {'scopus_id': '10.2307/1882966',
  'Title': 'note: the duration of business cycles'},
 {'scopus_id': '10.2307/1882440',
  'Title': "'superest ager'",},
 {'scopus_id': '10.2307/1883945',
  'Title': 'theories of the labor movement, as set forth in recent literature'},
{'scopus_id': '10.2307/1883240',
  'Title': '[notice on international trade, november, 1931]: a correction'},
 {'scopus_id': '10.2307/1883543',
  'Title': 'joint and overhead cost and railway rate policy'},
 {'scopus_id': '10.2307/1883551',
  'Title': 'more pitfalls in demand and supply curve analysis: a final word'},
 {'scopus_id': '10.2307/1883546',
  'Title': 'time series and the derivation of demand and supply curves a study of coffee and tea, 1850-1930'},
 {'scopus_id': '10.2307/1883402',
  'Title': "some remarks on professor hansen's view on technological unemployment: a rejoiner"},


  {'scopus_id': '10.1093/qje/54.4_Part_1.679',
  'Title': 'ad valorem and specific taxes',},
{'scopus_id': '10.2307/1883283',
  'Title': 'saving and investment: final comment'},
 {'scopus_id': '10.2307/1884088',
  'Title': 'professor chamberlin on monopolistic and imperfect competition:reply'},
 {'scopus_id': '10.2307/1882895',
  'Title': 'indemnity payments and gold movements: rejoinder'},
 {'scopus_id': '10.2307/1882894',
  'Title': 'indemnity payments and gold movements: a reply'},
{'scopus_id': '10.2307/1882619',
  'Title': 'german exchange control, 1931-1939: from an emergency measure to a totalitarian institution'},
 {'scopus_id': '10.2307/1883032',
  'Title': 'book review: residential real estate'},
 {'scopus_id': '10.2307/1883341',
  'Title': 'the use of the short-cut graphic method of multiple correlation: further comment'},

{'scopus_id': '10.2307/1885152',
  'Title': 'in defense of monopoly: further comment'},
 {'scopus_id': '10.2307/1882140',
  'Title': 'professor leontief on lord keynes: further comment'},
 {'scopus_id': '10.2307/1880675',
  'Title': 'reparation labor--a preliminary analysis'},
 {'scopus_id': '10.2307/1883474',
  'Title': 'the influence of unionism upon earnings: addendum'},
 {'scopus_id': '10.2307/1885055',
  'Title': 'multiple-plant firms: note on the allocation of output'},
 {'scopus_id': '10.2307/1884830',
  'Title': 'does the consumer benefit from price instability?: further comment'},
{'scopus_id': '10.2307/1880677',
  'Title': "'ability to pay'"},

{'scopus_id': '10.2307/1883259',
  'Title': 'strikes and lock-outs of great britain'},
 {'scopus_id': '10.2307/1884831',
  'Title': 'does the consumer benefit from price instability?: reply'},
{'scopus_id': '10.1093/qje/54.4_Part_1.673',
  'Title': 'the shifting of sales taxes'},

{'scopus_id': '10.2307/1884125',
  'Title': 'real and money wage rates: further comment'},
 {'scopus_id': '10.1093/qje/54.4_Part_1.686',
  'Title': 'ad valorem and specific taxes: rejoinder'},
 {'scopus_id': '10.2307/1883339',
  'Title': 'the scale of agricultural production: rejoinder'},
 {'scopus_id': '10.2307/1884126',
  'Title': 'real and money wage rates: rejoinder'},
 {'scopus_id': '10.2307/1882067',
  'Title': 'the planning approach in public economy: further comment'},
{'scopus_id': '10.2307/1882620',
  'Title': 'the past and future of exchange control'},
  {'scopus_id': '10.2307/1880733',
  'Title': '[peak loads and efficient pricing]: further comment'}, 
{'scopus_id': '10.2307/1880537',
  'Title': '[an optimal unemployment rate: comment]: reply'}, 
{'scopus_id': '10.2307/1882187',
  'Title': 'minimum wages and the long-run elasticity of demand for low-wage labor'},
 {'scopus_id': '10.2307/1880807',
  'Title': '[the consumer does benefit from feasible price stability]: a comment [2]'},
 {'scopus_id': '10.2307/1885931',
  'Title': 'the analysis of revenue sharing in a new approach to collective fiscal decisions'},

 {'scopus_id': '10.2307/1886080',
  'Title': '[alvin h. hansen]: tribute'},
 {'scopus_id': '10.2307/1881802',
  'Title': 'the marginalist principle in a discrete production model under uncetain demand: response'},
 {'scopus_id': '10.2307/1886079',
  'Title': '[alvin h. hansen]: caring for the real problems'},
 {'scopus_id': '10.2307/1886081',
  'Title': '[alvin h. hansen]: some reminiscences'},
 {'scopus_id': '10.2307/1882029',
  'Title': 'foreward'},
{'scopus_id': '10.2307/2936149',
  'Title': 'statistical cost analysis re-revisited: comment'},
 {'scopus_id': '10.2307/1882598',
  'Title': 'reduced forms of rational expectations models'},
 {'scopus_id': '10.2307/1885881',
  'Title': 'two-stage expenditure minimization and some welfare applications'},
  {'scopus_id': '10.2307/1880553',
  'Title': "['the new view of investment']: reply"},
 {'scopus_id': '10.2307/1879331',
  'Title': '[d.h. robertson]: reply'},
 {'scopus_id': '10.2307/1880543',
  'Title': 'the peculiar economics of professional sports: a contribution to the theory of the firm in sporting competition and in market competition'},
 {'scopus_id': '10.2307/1879635',
  'Title': '[credit risk and credit rationing]: reply'},
 {'scopus_id': '10.2307/1883214',
  'Title': 'the commodity structure of world trade: reply'},
 {'scopus_id': '10.2307/1884407',
  'Title': '[real effects of foreign surplus disposal in underdeveloped economies]: further comment'},
 {'scopus_id': '10.2307/1880826',
  'Title': '[steel, administered prices and inflation]: reply'},
 {'scopus_id': '10.2307/1879370',
  'Title': 'regional allocation of investment: an aggregative study in the theory of development programming'},
 {'scopus_id': '10.2307/1879637',
  'Title': '[the stability of growth equilibrium]: reply'},
 {'scopus_id': '10.2307/1880824',
  'Title': '[security and a financial theory of investment]: reply'},
 {'scopus_id': '10.2307/1879634',
  'Title': '[credit risk and credit rationing]: further comment'},
 {'scopus_id': '10.2307/1883212',
  'Title': 'usher and schumpeter on invention, innovation and technological change: reply'},
{'scopus_id': '10.2307/1884198',
  'Title': 'some econometrics of growth: great ratios of economics'},
 {'scopus_id': '10.2307/1885136',
  'Title': 'utility, strategy, and social decision rules: reply'},
 {'scopus_id': '10.2307/1884362',
  'Title': 'structural aspects of monetary velocity: reply'},
 {'scopus_id': '10.2307/1884360',
  'Title': 'state and regional payments mechanisms: reply'},
 {'scopus_id': '10.2307/1883210',
  'Title': 'an economic justification of protectionism: reply'},
{'scopus_id': '10.2307/1882293',
  'Title': 'economic science only--or political economy?'},
 {'scopus_id': '10.2307/1884857',
  'Title': 'the balanced budget: reply'},

{'scopus_id': '10.1093/qje/71.2.324',
  'Title': 'the soviet ural-kuznetsk combine: a study in investment criteria and industrialization policies'},
 {'scopus_id': '10.2307/1884667',
  'Title': 'on the geometry of welfare economics: a suggested diagrammatic treatment of some basic propositions'},
 {'scopus_id': '10.2307/1881924',
  'Title': 'marketing structure and economic development: reply'},

 {'scopus_id': '10.2307/1885855',
  'Title': 'the retarded acceptance of the marginal utility theory: the schumpeter prize fund'},
 {'scopus_id': '10.2307/1882107',
  'Title': 'the economic issues of compulsory health insurance: further comment'},
 {'scopus_id': '10.2307/1884154',
  'Title': 'the multiplier, flexible exchanges, and international: reply'},
 {'scopus_id': '10.2307/1881696',
  'Title': 'profit theory--where do we go from here'},
 {'scopus_id': '10.2307/1885316',
  'Title': 'keynes and the forces of history: reply'},
 {'scopus_id': '10.2307/1879504',
  'Title': 'european unification and the dollar problem: further comment'},
 {'scopus_id': '10.2307/1882225',
  'Title': "the citizen's ephemerides of the physiocrats"},
 {'scopus_id': '10.2307/1879503',
  'Title': 'european unification and the dollar problem: comment'},
 {'scopus_id': '10.2307/1879508',
  'Title': 'clandestine capital movements in balance of payments estimates: reply'},
{'scopus_id': '10.2307/1879540',
  'Title': 'money demand and the interest rate level: reply'}, 
{'scopus_id': '10.2307/1882738',
  'Title': 'reply: proportionality, divisibility, and economies of scale: two comments'},
 {'scopus_id': '10.2307/1884395',
  'Title': "'competitive' output in bilateral monopoly: comment"},
{'scopus_id':'10.2307/1883113',
 'Title':'[The French Economic Situation and the State of Finances]: Correspondence'},
{'scopus_id': '10.2307/1884624',
  'Title': 'railroads: recent books and neglected problems: a correction'},
   {'scopus_id': '10.2307/1879531',
  'Title': 'review of the troops (a chapter from the history of economic analysis)'},

{'scopus_id':'10.2307/1879532',
 'Title':'Distance Inputs and the Space-Economy 1 Part I: The Conceptual Framework'}
 ]

for i in s_fix:
    for j in i.keys():
        if j!='scopus_id':
            scopus.loc[scopus['scopus_id']==i['scopus_id'], j]=i[j]


  scopus.loc[scopus['scopus_id']==i['scopus_id'], j]=i[j]


In [488]:
def proc_var(df, field):
    for i in df.index:
        if df.loc[i, field]=='nan':
            df.loc[i, field]=pd.NA
        try:
            df.loc[i, field]=str(int(float(df.loc[i, field])))
        except:
            df.loc[i, field]=str(df.loc[i, field]).strip()
    

In [489]:
proc_var(scopus,'Volume')
proc_var(scopus,'Year')
proc_var(scopus,'Issue')
proc_var(scopus,'Page start')
proc_var(scopus,'Page end')


  df.loc[i, field]=str(int(float(df.loc[i, field])))


In [490]:

proc_var(cleaned,'year')
proc_var(cleaned, 'pages')

  df.loc[i, field]=str(int(float(df.loc[i, field])))


In [491]:
cleaned.head()

Unnamed: 0,issue_url,ISSN,URL,journal,number,publisher,title,urldate,volume,year,abstract,author,pages,reviewed-author,uploaded,author_split,title_10,content_type,type,jid
0,https://www.jstor.org/stable/10.2307/e26966476,"00028282, 19447981",https://www.jstor.org/stable/26966477,The American Economic Review,12,American Economic Association,Front Matter,2023-09-04 00:00:00,110,2020,,,,,1.0,,,MISC,N,aer
1,https://www.jstor.org/stable/10.2307/e26966476,"00028282, 19447981",https://www.jstor.org/stable/26966478,The American Economic Review,12,American Economic Association,Competition and Entry in Agricultural Markets:...,2023-09-04 00:00:00,110,2020,African agricultural markets are characterized...,Lauren Falcao Bergquist and Michael Dinerstein,3705-3747,,1.0,"['Lauren Falcao Bergquist', 'Michael Dinerstein']",,Article,N,aer
2,https://www.jstor.org/stable/10.2307/e26966476,"00028282, 19447981",https://www.jstor.org/stable/26966479,The American Economic Review,12,American Economic Association,Discounts and Deadlines in Consumer Search,2023-09-04 00:00:00,110,2020,We present a new equilibrium search model wher...,Dominic Coey and Bradley J. Larsen and Brennan...,3748-3785,,1.0,"['Dominic Coey', 'Bradley J. Larsen', 'Brennan...",,Article,N,aer
3,https://www.jstor.org/stable/10.2307/e26966476,"00028282, 19447981",https://www.jstor.org/stable/26966480,The American Economic Review,12,American Economic Association,A Model of Competing Narratives,2023-09-04 00:00:00,110,2020,We formalize the argument that political disag...,Kfir Eliaz and Ran Spiegler,3786-3816,,1.0,"['Kfir Eliaz', 'Ran Spiegler']",,Article,N,aer
4,https://www.jstor.org/stable/10.2307/e26966476,"00028282, 19447981",https://www.jstor.org/stable/26966481,The American Economic Review,12,American Economic Association,A Few Bad Apples Spoil the Barrel: An Anti-Fol...,2023-09-04 00:00:00,110,2020,We study anonymous repeated games where player...,Takuo Sugaya and Alexander Wolitzky,3817-3835,,1.0,"['Takuo Sugaya', 'Alexander Wolitzky']",,Article,N,aer


In [492]:
rename_scopus={
 'jid': 'scopus_jid',
 'scopus_id': 'scopus_id',
 'authorgroup': 'scopus_authorgroup',
 'authors': 'scopus_authors',
 'affiliations': 'scopus_affiliations',
 'references': 'scopus_references',
 'Author full names': 'scopus_author_full_names',
 'Title': 'scopus_title',
 'Year': 'scopus_year',
 'Source title': 'scopus_source_title',
 'Volume': 'scopus_volume',
 'Issue': 'scopus_issue',
 'Art. No.': 'scopus_art_no',
 'Page start': 'scopus_page_start',
 'Page end': 'scopus_page_end',
 'Page count': 'scopus_page_end',
 'Cited by': 'scopus_cited_by',
 'DOI': 'scopus_doi',
 'Abstract': 'scopus_abstract',
 'Publisher': 'scopus_publisher',
 'Document Type': 'scopus_document_type',
 'Publication Stage': 'scopus_publication_stage',
 'Open Access': 'scopus_open_access',
 'Source': 'scopus_source',
 'EID': 'scopus_eid'
}

scopus = scopus.rename(columns=rename_scopus)

scopus['scopus_title']=scopus['scopus_title'].str.lower().str.strip().str.replace('‘',"'").str.replace('’',"'").str.replace('"',"'").str.replace('–','-').str.replace('‐','-').str.replace('™','').str.replace('- ','-').str.replace(' -','-').str.replace('“',"'").str.replace("*","").str.replace('”',"'").str.replace('behaviour','behavior').str.strip()
scopus['scopus_title']=scopus['scopus_title'].str.strip().str.split().apply(' '.join).str.strip()
for i in scopus.index:
    if '†' in scopus.loc[i, 'scopus_title']:
        scopus.loc[i, 'scopus_title']=scopus.loc[i, 'scopus_title'].strip()[:-1].strip()
    if scopus.loc[i, 'scopus_title'].strip()[-1]==".":
        scopus.loc[i, 'scopus_title']=scopus.loc[i, 'scopus_title'].strip()[:-1].strip()


cleaned['title']=cleaned['title'].str.lower().str.strip().str.replace('‘',"'").str.replace('’',"'").str.replace('"',"'").str.replace('–','-').str.replace('‐','-').str.replace('™','').str.replace('- ','-').str.replace(' -','-').str.replace('“',"'").str.replace('”',"'").str.replace("*","").str.replace('behaviour','behavior').str.strip()
cleaned['title']=cleaned['title'].str.strip().str.split().apply(' '.join).str.strip()
for i in cleaned.index:
    if '†' in cleaned.loc[i, 'title']:
        cleaned.loc[i, 'title']=cleaned.loc[i, 'title'].strip()[:-1].strip()
    if str(cleaned.loc[i, 'title']).strip()[-1]==".":
        cleaned.loc[i, 'title']=cleaned.loc[i, 'title'].strip()[:-1].strip()

In [493]:
cleaned.loc[cleaned["number"]==datetime.datetime(2023, 5, 6, 0, 0),'number']='5-6'
cleaned.loc[cleaned["number"]==datetime.datetime(2023, 3, 4, 0, 0),'number']='3-4'
cleaned.loc[cleaned["number"]==datetime.datetime(2023, 1, 2, 0, 0),'number']='1-2'

fix_cleaned=[{"URL":"https://www.jstor.org/stable/40263865", "number":'index'},
{"URL":"https://www.jstor.org/stable/40263866", "number":'index'},
{"URL":"https://www.jstor.org/stable/40263867", "number":'index'},
{"URL":"https://www.jstor.org/stable/1905372", "number":'supplement: guide to econometrica'},
{"URL":"https://www.jstor.org/stable/1905373", "number":'supplement: guide to econometrica'},
{"URL":"https://www.jstor.org/stable/1905374", "number":'supplement: guide to econometrica'},
{"URL":"https://www.jstor.org/stable/1905375", "number":'supplement: guide to econometrica'},
{"URL":"https://www.jstor.org/stable/1905376", "number":'supplement: guide to econometrica'},
{"URL":"https://www.jstor.org/stable/1905377", "number":'supplement: guide to econometrica'},
{"URL":"https://www.jstor.org/stable/1905378", "number":'supplement: guide to econometrica'},
{"URL":"https://www.jstor.org/stable/1905379", "number":'supplement: guide to econometrica'},
{"URL":"https://www.jstor.org/stable/1907284", "number":'supplement: report of the washington meeting'},
{"URL":"https://www.jstor.org/stable/1907285", "number":'supplement: report of the washington meeting'},
{"URL":"https://www.jstor.org/stable/1907286", "number":'supplement: report of the washington meeting'},
{"URL":"https://www.jstor.org/stable/1907287", "number":'supplement: report of the washington meeting'},
{"URL":"https://www.jstor.org/stable/1907288", "number":'supplement: report of the washington meeting'},
{"URL":"https://www.jstor.org/stable/1907289", "number":'supplement: report of the washington meeting'},
{"URL":"https://www.jstor.org/stable/1907290", "number":'supplement: report of the washington meeting'},
{"URL":"https://www.jstor.org/stable/1907291", "number":'supplement: report of the washington meeting'},
{"URL":"https://www.jstor.org/stable/1907292", "number":'supplement: report of the washington meeting'},
{"URL":"https://www.jstor.org/stable/1907293", "number":'supplement: report of the washington meeting'},
{"URL":"https://www.jstor.org/stable/1907294", "number":'supplement: report of the washington meeting'},
{"URL":"https://www.jstor.org/stable/1907295", "number":'supplement: report of the washington meeting'},
{"URL":"https://www.jstor.org/stable/1907296", "number":'supplement: report of the washington meeting'},
{"URL":"https://www.jstor.org/stable/1907297", "number":'supplement: report of the washington meeting'},
{"URL":"https://www.jstor.org/stable/1907298", "number":'supplement: report of the washington meeting'},
{"URL":"https://www.jstor.org/stable/1907299", "number":'supplement: report of the washington meeting'},
{"URL":"https://www.jstor.org/stable/1907300", "number":'supplement: report of the washington meeting'},
{"URL":"https://www.jstor.org/stable/1907301", "number":'supplement: report of the washington meeting'},
{"URL":"https://www.jstor.org/stable/1907302", "number":'supplement: report of the washington meeting'},
{"URL":"https://www.jstor.org/stable/1907303", "number":'supplement: report of the washington meeting'},
{"URL":"https://www.jstor.org/stable/1907304", "number":'supplement: report of the washington meeting'},
{"URL":"https://www.jstor.org/stable/1907305", "number":'supplement: report of the washington meeting'},
{"URL":"https://www.jstor.org/stable/1907306", "number":'supplement: report of the washington meeting'},
{"URL":"https://www.jstor.org/stable/1907307", "number":'supplement: report of the washington meeting'},
{"URL":"https://www.jstor.org/stable/1907308", "number":'supplement: report of the washington meeting'},
{"URL":"https://www.jstor.org/stable/1907309", "number":'supplement: report of the washington meeting'},
{"URL":"https://www.jstor.org/stable/1907310", "number":'supplement: report of the washington meeting'},
{"URL":"https://www.jstor.org/stable/1907311", "number":'supplement: report of the washington meeting'},
{"URL":"https://www.jstor.org/stable/1907312", "number":'supplement: report of the washington meeting'},
{"URL":"https://www.jstor.org/stable/1907313", "number":'supplement: report of the washington meeting'},
{"URL":"https://www.jstor.org/stable/1907314", "number":'supplement: report of the washington meeting'},
{"URL":"https://www.jstor.org/stable/1907315", "number":'supplement: report of the washington meeting'},
{"URL":"https://www.jstor.org/stable/1907316", "number":'supplement: report of the washington meeting'},
{"URL":"https://www.jstor.org/stable/1907317", "number":'supplement: report of the washington meeting'},
{"URL":"https://www.jstor.org/stable/1907318", "number":'supplement: report of the washington meeting'},
{"URL":"https://www.jstor.org/stable/1907319", "number":'supplement: report of the washington meeting'},
{"URL":"https://www.jstor.org/stable/1907320", "number":'supplement: report of the washington meeting'},
{"URL":"https://www.jstor.org/stable/1907321", "number":'supplement: report of the washington meeting'},
{"URL":"https://www.jstor.org/stable/1907322", "number":'supplement: report of the washington meeting'},
{"URL":"https://www.jstor.org/stable/1907323", "number":'supplement: report of the washington meeting'},
{"URL":"https://www.jstor.org/stable/1907324", "number":'supplement: report of the washington meeting'},
{"URL":"https://www.jstor.org/stable/1907325", "number":'supplement: report of the washington meeting'},
{"URL":"https://www.jstor.org/stable/1906934", "number":'supplement'},
{"URL":"https://www.jstor.org/stable/1906935", "number":'supplement'},
{"URL":"https://www.jstor.org/stable/1906936", "number":'supplement'},
{'URL':"https://www.jstor.org/stable/1914237", "title":'Simultaneity in the Birth Rate Equation: The Effects of Education, Labor Force Participation, Income and Health'},

{  'title': 'firm wage differentials and labor market sorting: reconciling theory and evidence',
    'URL': 'https://doi.org/10.1086/695505'},
  {  'title': 'charity and the bequest motive: evidence from seventeenth-century wills',
    'URL': 'https://www.jstor.org/stable/10.1086/317685'},
  { 'title': 'parental benefits from intergenerational coresidence: empirical evidence from rural pakistan',
    'URL': 'https://www.jstor.org/stable/10.1086/317683'},
     {  'title': 'disability insurance benefits and labor supply',
   'URL': 'https://www.jstor.org/stable/10.1086/317682'},
  { 'title': 'redistributing income under proportional representation',
    'URL': 'https://www.jstor.org/stable/10.1086/317680'},
  { 'title': 'in sickness and in health: risk sharing within households in rural ethiopia',
    'URL': 'https://www.jstor.org/stable/10.1086/316098'},
  {  'title': 'are invisible hands good hands? moral hazard, competition, and the second-best in health care markets',
    'URL': 'https://www.jstor.org/stable/10.1086/317672'},
  { 'title': 'risk sharing, sorting, and early contracting',
    'URL': 'https://www.jstor.org/stable/10.1086/317675'},

 { 'title': 'putty-clay and investment: a business cycle analysis',
    'URL': 'https://www.jstor.org/stable/10.1086/317673'},
 {  'title': 'is child labor inefficient?',
    'URL': 'https://www.jstor.org/stable/10.1086/316097'},
 {'title': 'federal mandates by popular demand',
  'URL': 'https://www.jstor.org/stable/10.1086/317669'},
 {'title': 'earnings within education groups and overall productivity growth',
   'URL': 'https://www.jstor.org/stable/10.1086/316101'},
 {'title': 'homework in development economics: household production and the wealth of nations',
   'URL': 'https://www.jstor.org/stable/10.1086/316102'},
 {
  'title': 'age and the quality of work: the case of modern american painters',
  'URL': 'https://www.jstor.org/stable/10.1086/316099'},
 {
  'title': 'an alternative approach to search frictions',
  'URL': 'https://www.jstor.org/stable/10.1086/317674'},
 {
  'title': 'using consumer theory to test competing business cycle models',
  'URL': 'https://www.jstor.org/stable/10.1086/250009'},
   {
  'title': 'induced innovation in american agriculture: a reconsideration',
  'URL': 'https://www.jstor.org/stable/2138675'},
  {'title': 'luxuries are easier to postpone: a proof',
   'URL': 'https://www.jstor.org/stable/10.1086/317668'},
 { 'title': 'balladurette and juppette: a discrete analysis of scrapping subsidies',
   'URL': 'https://www.jstor.org/stable/10.1086/316096'},
 { 'title': 'the making of an oligopoly: firm survival and technological change in the evolution of the u.s. tire industry',
   'URL': 'https://www.jstor.org/stable/10.1086/316100'},
 { 'title': 'equilibrium price dispersion in retail markets for prescription drugs',
    'URL': 'https://www.jstor.org/stable/10.1086/316103'},
 { 'title': 'estimating a bargaining model with asymmetric information: evidence from medical malpractice disputes',
    'URL': 'https://www.jstor.org/stable/10.1086/317677'},
 {'title': 'extensive margins and the demand for money at low interest rates',
    'URL': 'https://www.jstor.org/stable/10.1086/317676'},
 { 'title': 'measurement error and the relationship between investment and q',
   'URL': 'https://www.jstor.org/stable/10.1086/317670'},
 {'title': 'hierarchies and the organization of knowledge in production',
    'URL': 'https://www.jstor.org/stable/10.1086/317671'},
{'URL':'https://www.jstor.org/stable/26550458','title':'The Past, Present, and Future of Economics: A Celebration of the 125-Year Anniversary of the JPE and of Chicago Economics'},
   {
  'title': 'a closed form solution for a model of precautionary saving',
  'URL': 'https://www.jstor.org/stable/2298063'},
  {
  'title': 'optimal population and capital over time: the maximum perspective',
  'URL': 'https://www.jstor.org/stable/2297172'},
  {'URL':'https://www.jstor.org/stable/2967554',
   'title':'notes on the determinateness of the utility function'},
    {  'title': 'notes on the elasticity of substitution: iii.-the elasticity of substitution and the incidence of an imperial inhabited house duty',
  'URL': 'https://www.jstor.org/stable/2967623'},
    {
  'title': 'notes on the elasticity of substitution: i',
  'URL': 'https://www.jstor.org/stable/2967621'},
      {
  'title': 'comment on samuelson and modigliani',
  'URL': 'https://www.jstor.org/stable/2974427'},
   {
  'title': "prices and the turnpike: i. the story of a mare's nest",
  'URL': 'https://www.jstor.org/stable/2295705'},
   {
  'title': "prices and the turnpike: ii. proof of a turnpike theorem: the 'no joint production' case",
  'URL': 'https://www.jstor.org/stable/2295706'},

  {  'title': 'prices and the turnpike: iii. paths of economic growth that are optimal with regard only to final states: a turnpike theorem',
  'URL': 'https://www.jstor.org/stable/2295707'},
{
  'title': 'notes on the determinateness of the utility function: ii',  
  'URL': 'https://www.jstor.org/stable/2967553'},
 {
  'title': "further notes on index numbers: ii. mr. lerner's supplementary limits for price index numbers",
  'URL': 'https://www.jstor.org/stable/2967511'},
 {
  'title': 'further notes on elasticity of substitution: iii. the question of symmetry',
  'URL': 'https://www.jstor.org/stable/2967508'},
 {
  'title': 'further notes on index numbers: i',
  'URL': 'https://www.jstor.org/stable/2967510'},
 {  'title': 'notes on the elasticity of substitution: iv. the elasticity of substitution and the elasticity of demand for one factor of production',
  'URL': 'https://www.jstor.org/stable/2967624'},
 {
  'title': 'notes on the determinateness of the utility function: i',
  'URL': 'https://www.jstor.org/stable/2967552'},
  {
  'title': "further notes on elasticity of substitution: i. note on dr. machlup's article",
  'URL': 'https://www.jstor.org/stable/2967506'},
  {
  'title': 'the chicago plan of banking reform: ii the application of the proposals in england',
  'URL': 'https://www.jstor.org/stable/2967558'},
 {
  'title': 'a symposium on the theory of the forward market: iii. mr. kaldor on the forward market',
  'URL': 'https://www.jstor.org/stable/2967409'},
  {
  'title': 'economic thought in the soviet union: ii.—economic planning and control',
  'URL': 'https://www.jstor.org/stable/2295720'},
{'URL':'https://www.jstor.org/stable/2296392', 'author':'D. Gale'},
 {
  'title': "'the variation of productive forces': a comment",
  'URL': 'https://www.jstor.org/stable/1883669'},

{
  'title': 'on decreasing cost and comparative cost: a rejoinder',
  'URL': 'https://www.jstor.org/stable/1884880'},
{
  'title': 'the positive theory of capital and its critics: iii',
  'URL': 'https://www.jstor.org/stable/1882376'},
{
  'title': 'the social point of view in economics. ii',
  'URL': 'https://www.jstor.org/stable/1883624'},
{
  'title': 'positive contributions of scientific management the elimination of some losses characteristic of present-day manufacture',
  'URL': 'https://www.jstor.org/stable/1885947'},
 {
  'title': 'some fallacies in the interpretation of social costs: a reply',
  'URL': 'https://www.jstor.org/stable/1884879'},
{
  'title': 'chapters on machinery and labor: iv. the introduction of machinery and trade-union policy',
  'URL': 'https://www.jstor.org/stable/1884618'},
   {
  'title': 'toward an understanding of the metropolis: i. some speculations regarding the economic basis of urban concentration',
  'URL': 'https://www.jstor.org/stable/1884617'},

  
 {
  'title': 'sociological elements in economic thought: i. historical',
  'URL': 'https://www.jstor.org/stable/1883862'},

  
{
  'title': 'theories of business fluctuations: i. a classification of the theories',
  'URL': 'https://www.jstor.org/stable/1885554'},
{
  'title': 'rural coöperative credit in china: a record of seven years of experimentation',
  'URL': 'https://www.jstor.org/stable/1883900'},
 {
  'title': 'carrier property consumed in operation and the regulation of profits: a discussion of the i.c.c. report on depreciation',
  'URL': 'https://www.jstor.org/stable/1882471'},

  
 {
  'title': "consumers' surplus in international trade. a supplementary note",
  'URL': 'https://www.jstor.org/stable/1882443'},

  
 {
  'title': 'the interdependence of the price-levels p, p′ and π',
  'URL': 'https://www.jstor.org/stable/1885616'},

  {
  'title': "mr. keynes's consumption function: rejoinder",
  'URL': 'https://www.jstor.org/stable/1885042'},

 {
  'title': 'book review: residential real estate',
  'URL': 'https://www.jstor.org/stable/1883032'},

{
  'title': 'paradoxes in capital theory: a symposium: changes in the rate of profit and switches of techniques',
  'URL': 'https://www.jstor.org/stable/1882911'},
 {
  'title': 'professor hansen and keynesian interest theory: comment',
  'URL': 'https://www.jstor.org/stable/1882002'},

{

 'title':'[The French Economic Situation and the State of Finances]: Correspondence',
 "URL":'https://www.jstor.org/stable/1883113'
},
{
  'title': 'chapters on machinery and labor: i. the introduction of semi-automatic bottle machines',
  'URL': 'https://www.jstor.org/stable/1882432'},
  {
    "title":"chapters on machinery and labor: ii. the introduction of automatic bottle machines",
    "URL":"https://www.jstor.org/stable/1883265"
},

 {
  'title': 'chapters on machinery and labor: iii. machinery and the displacement of skill',
  'URL': 'https://www.jstor.org/stable/1885817'},
]

for i in fix_cleaned:
    for j in i.keys():
        if j!='URL':
            cleaned.loc[cleaned['URL']==i['URL'],j]=i[j]


cleaned['title']=cleaned['title'].str.lower()
scopus['scopus_title']=scopus['scopus_title'].str.lower()


In [494]:
#discard scopus titles that are post 2020
year_range=[]
for i in range(1940,2021):
    year_range.append(str(i))

ex_years=['2021', '2022', '2023', '2024']

scopus_plus=scopus[(scopus["scopus_year"].isin(ex_years)==True)].reset_index(drop=True)
scopus=scopus[(scopus["scopus_year"].isin(ex_years)==False)].reset_index(drop=True)

In [None]:
# aer_scopus=scopus[scopus["scopus_jid"]=="aer"].reset_index(drop=True)
# ecta_scopus=scopus[scopus["scopus_jid"]=="ecta"].reset_index(drop=True)
# jpe_scopus=scopus[scopus["scopus_jid"]=="jpe"].reset_index(drop=True)
# res_scopus=scopus[scopus["scopus_jid"]=="res"].reset_index(drop=True)
# qje_scopus=scopus[scopus["scopus_jid"]=="qje"].reset_index(drop=True)

scopus_mis={}

merged=pd.merge(cleaned, scopus, how='left', left_on=['title', 'year', 'jid'], right_on=['scopus_title', 'scopus_year', 'scopus_jid' ])


Unnamed: 0,journal,total in jstor for period,articles on scopus,exact matches on title and year,match %,Non-MISC + Non-Rev + post 1940,Non-MISC + Non-Rev + post 1940 matches,NMR-post-1940 match %,unmatched scopus articles,unmatched scopus articles post 1940
0,aer,27566,4360,4056,93.03,13389,4033,30.12,305,305
1,ecta,9351,1633,1565,95.84,5377,1554,28.9,69,69
2,jpe,14346,1294,1266,97.84,4863,1262,25.95,29,29
3,res,4157,3037,2874,94.63,3281,2730,83.21,164,147
4,qje,6904,4502,4269,94.82,3722,3042,81.73,234,167


In [496]:
merged["scopus_indicator"]=0

In [497]:
merged['number'].head()

0    12
1    12
2    12
3    12
4    12
Name: number, dtype: object

In [498]:
scopus_recon={}
for l in jid:
    print(l)
    a=0
    ff=[]
    b=0
    check=[]

    for i in scopus_mis[l].index:
        found=0
        max_r=0
        sim=0
        m_sim=0
        target=None
        for j in merged[(merged['year']==scopus_mis[l].loc[i, 'scopus_year'])&(merged['jid']==l)].index:
            
            seq_rat=sq(None, scopus_mis[l].loc[i,'scopus_title'],merged.loc[j, 'title']).ratio()
            fuzz_rat=fuzz.token_sort_ratio(str(scopus_mis[l].loc[i,'scopus_author_full_names']).split('(')[0], str(merged.loc[j,'author']).split(' and ')[0])
            
            if (fuzz_rat>70) & (seq_rat>=sim):
                sim=seq_rat
                m_sim=j
            if (seq_rat>0.95) & (fuzz_rat>95)& (str(scopus_mis[l].loc[i,'scopus_issue'])==str(merged.loc[j,'number'])) & (pd.isna(merged.loc[j,'scopus_id'])==True):
                print("execute")
                if seq_rat>max_r:
                    max_r=seq_rat
                    target=j
                found+=1
                print(found)

                print(scopus_mis[l].loc[i,'scopus_title'])
                print('----match----'+merged.loc[j, 'title']+'    '+str(seq_rat))
                print(scopus_mis[l].loc[i,'scopus_issue']+ '    '+str(merged.loc[j,'number']))
                print(str(scopus_mis[l].loc[i,'scopus_author_full_names']).split('(')[0]+ '    '+str(merged.loc[j,'author']).split(' and ')[0])
                a+=1
                print('\n')
        if found>1:
            ff.append({i:target})
            for k in scopus.columns:
                merged.loc[target,k]=scopus_mis[l].loc[i,k]
        elif found==1:
            print(target)
            for k in scopus.columns:
                merged.loc[target,k]=scopus_mis[l].loc[i,k]
            merged.loc[target, 'scopus_indicator']=1
        else:
            
            if sim!=0:
                print(sim)
                print(str(scopus_mis[l].loc[i,'scopus_author_full_names']).split('(')[0])
                print(str(merged.loc[m_sim, 'author']).split(' and ')[0])
                print(fuzz.token_sort_ratio(str(scopus_mis[l].loc[i,'scopus_author_full_names']).split('(')[0], str(merged.loc[m_sim, 'author']).split(' and ')[0]))
                print(str(scopus_mis[l].loc[i,'scopus_issue']))
                print(str(merged.loc[m_sim,'number']))
                # print(temp.loc[i,'scopus_title'])
                # print(sim)
                # print(m_sim)
                # print('{"scopus_id":"'+temp.loc[i,'scopus_id'] + '", "Title":"'+merged.loc[m_sim, 'title']+'"}')
                check.append({
                    "scopus_id":scopus_mis[l].loc[i,'scopus_id'],
                    "Title":merged.loc[m_sim, 'title'],
                    "scopus_title":scopus_mis[l].loc[i,'scopus_title'],
                    "sim":sim,
                    "as":scopus_mis[l].loc[i,'scopus_author_full_names'],
                    "p":scopus_mis[l].loc[i,'scopus_page_start'],
                    "v":scopus_mis[l].loc[i,'scopus_volume'],
                    "i":scopus_mis[l].loc[i,'scopus_issue'],
                    "a":merged.loc[m_sim, 'author'],
                    "url":merged.loc[m_sim, 'URL'],
                })
                b+=1

    scopus_recon[l]={
        "found_count": a, 
        "check_count": b, 
        "conflict_match": ff, 
        "check": check
        }

aer
execute
1
does regulatory jurisdiction affect the quality of investment-a dviser regulation?
----match----does regulatory jurisdiction affect the quality of investment-adviser regulation?    0.9938650306748467
10    10
Charoenwong, Ben     Ben Charoenwong


179
execute
1
structural interpretation of vector autoregressions with incomplete identifcation: revisiting the role of oil supply and demand shocks
----match----structural interpretation of vector autoregressions with incomplete identification: revisiting the role of oil supply and demand shocks    0.9962825278810409
5    5
Baumeister, Christiane     Christiane Baumeister


243
execute
1
'acting wife': marriage market incentives and labor market investments
----match----acting wife': marriage market incentives and labor market investments    0.9928057553956835
11    11
Bursztyn, Leonardo     Leonardo Bursztyn


455
execute
1
the 'pupil' factory: specialization and the production of human capital in school
----match----the 'pupi

In [499]:
for i in jid:
    print(scopus_recon[i])

{'found_count': 108, 'check_count': 95, 'conflict_match': [], 'check': [{'scopus_id': '10.1257/aer.107.5.716', 'Title': 'the effect of state taxes on the geographical location of top earners: evidence from star scientists', 'scopus_title': 'journal of economic perspectives', 'sim': 0.30303030303030304, 'as': 'Moretti, Enrico (7005593972)', 'p': '716', 'v': '107', 'i': '5', 'a': 'Enrico Moretti and Daniel J. Wilson', 'url': 'https://www.jstor.org/stable/44871748'}, {'scopus_id': '10.1257/aer.107.5.719', 'Title': 'the economist as plumber', 'scopus_title': 'american economic journal: applied economics', 'sim': 0.4117647058823529, 'as': 'Duflo, Esther (6602205596)', 'p': '719', 'v': '107', 'i': '5', 'a': 'Esther Duflo', 'url': 'https://www.jstor.org/stable/44250353'}, {'scopus_id': '10.1257/aer.106.6.1562', 'Title': 'hal r. varian, distinguished fellow 2015', 'scopus_title': 'erratum: optimal expectations and limited medical testing: evidence from huntington disease (american economic rev

In [516]:
content=['Article', 'Comment', 'Reply', 'Rejoinder']
content_ex=['MISC','Discussion','Review', 'Review2']

In [520]:
match_summary=[]
for i in jid:
    sids=list(merged[merged['jid']==i]['scopus_id'].unique())
    sids_nrm=list(merged[(merged['jid']==i)&(merged['year'].isin(year_range)==True)&(merged['content_type'].isin(content)==True)]['scopus_id'].unique())
    nrm_merged=merged[(merged['jid']==i)&(merged['content_type'].isin(content)==True)&(merged['year'].isin(year_range)==True)].shape[0]
    
    approx=sum(merged[merged['jid']==i]["scopus_indicator"])


    temp=scopus[(scopus["scopus_id"].isin(sids)==False)&(scopus["scopus_jid"]==i)].reset_index(drop=True)
    scopus_mis[i]=temp
    
    result=len(sids)*100/scopus[(scopus['scopus_jid']==i)].shape[0]
    result2=len(sids_nrm)*100/nrm_merged

    match_summary.append({
        "journal": i,
        # "total articles to 2020": merged[merged['jid']==i].shape[0],
        "articles on scopus": scopus[(scopus['scopus_jid']==i)].shape[0],
        "exact matches on title and year": len(sids)-approx,
        "approx matches on title, exact year+issue": approx,
        "unmatched scopus articles": temp.shape[0],
        "match %": f"{result:.2f}",
        "Non-MISC + Non-Rev + post 1940": nrm_merged,
        "Non-MISC + Non-Rev + post 1940 matches": len(sids_nrm),
        "NMR-post-1940 match %": f"{result2:.2f}",
        # "unmatched scopus articles post 1940": temp[temp['scopus_year'].isin(year_range)==True].shape[0],
    })
    # cids=cleaned['title'].unique()
    # cids.sort()

summary=pd.DataFrame(match_summary)
summary.to_csv("011_scopus_match_summary.csv", index=False)

In [513]:
merged.to_pickle("011_merged_proc_scopus_inception_2020.pkl")

In [524]:
sum(summary['Non-MISC + Non-Rev + post 1940 matches'])/sum(summary['Non-MISC + Non-Rev + post 1940'])

0.43311720698254363

In [502]:
# sids=list(merged['scopus_id'].unique())
# print(len(sids))
# temp=scopus[(scopus["scopus_id"].isin(sids)==False)&(scopus["scopus_jid"]=='qje')&(scopus["scopus_year"].isin(ex_years)==False)].reset_index(drop=True)
# print(temp.shape)
# print(res_scopus.shape)

In [503]:
# for i in temp.sort_values(by=['scopus_volume']).index:
#     print(str(i)+' '+str(temp.loc[i,'scopus_year'])+' '+str(temp.loc[i,'scopus_volume'])+' '+temp.loc[i,'scopus_issue']+'  '+temp.loc[i,'scopus_page_start']+ '   '+temp.loc[i,'scopus_title']+ '   '+temp.loc[i,'scopus_id']+ '   '+str(temp.loc[i, 'scopus_author_full_names']))

In [504]:
# url='https://www.jstor.org/stable/2296392'
# sc_id=17
# print(merged.loc[merged['URL']==url,'title']==temp.loc[sc_id,'scopus_title'])
# print(merged.loc[merged['URL']==url,'title'].values[0])
# print(merged.loc[merged['URL']==url,'author'].values[0])
# print(merged.loc[merged['URL']==url,'year'].values[0])
# print(merged.loc[merged['URL']==url,'scopus_title'].values[0])
# print()

# print(temp.loc[sc_id,'scopus_title'])
# print(temp.loc[sc_id,'scopus_id'])
# print(temp.loc[sc_id,'scopus_year'])
# print(temp.loc[sc_id,'scopus_author_full_names'])
# print(temp.loc[sc_id,'scopus_affiliations'])
# print(temp.loc[sc_id,'scopus_references'])

# print(type(merged.loc[merged['URL']==url,'year'].values[0]))
# print(type(temp.loc[sc_id,'scopus_title'])) 
# print(type(temp.loc[sc_id,'scopus_year'])) 

# # {'scopus_id':'10.1093/restud/rdv008','Year':'2015'}
# # {'scopus_id':'10.2307/2967420','Title':'The Power of Undervalued Currency: A Methodological Comment'}
# # {'scopus_id':'','Title':'[Complementarity and Interrelations of Shifts in Demand]: A Comment'}
# # {'scopus_id':'10.2307/2295860','Title':'[A Note on Kaldor's `Speculation and Economic Stability']: A Comment'}

# #{'scopus_id':'10.2307/2296500', 'Title':'the existence of optimal distributed lags'}

# {'scopus_id':'10.2307/2296392', 'Title': "correction to 'on optimal development in a multi-sector economy'"}
# {'URL':'https://www.jstor.org/stable/2296392', 'author':'D. Gale'}
# {'scopus_id':'10.1093/restud/rdt040','Title':'RETRACTED: Growing up in a Recession'}