# Obtain Notebook
This notebook was used to retrieve and parse search results from Google Scholar. It relies on a customized version of `scholarly.py` (imported below from `newscholarly`), so not all parts of this code will work with a standard release.

In [1]:
#Import statements
import time
from newscholarly import scholarly
from fp.fp import FreeProxy
import pandas as pd
import numpy as np
import os
import random

### Author dataframe `winnersdf`
Reads in a .CSV file of the 2010-2019 Nobel Prize winners in the scientific categories (Physics, Economics, Medicine/Physiology, Chemistry) as a `pandas` dataframe. *Strictly speaking, the Economics prize is not one of the original Nobel Prizes: it is the "Nobel Memorial Prize in Economic Sciences", established by Sweden's central bank.* *[Wikipedia](https://en.wikipedia.org/wiki/Nobel_Memorial_Prize_in_Economic_Sciences)*

In [5]:
#previously created .CSV file of 2010 - 2019 science Nobel Prize wins
winnersdf = pd.read_csv("files/science10years.csv",index_col=[0])
winnersdf.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 95 entries, 1 to 168
Data columns (total 5 columns):
name1         95 non-null object
text1         95 non-null object
year          95 non-null int64
field         95 non-null object
additional    95 non-null object
dtypes: int64(1), object(4)
memory usage: 4.5+ KB


In [7]:
#looking at a few entries
winnersdf[92:95]

Unnamed: 0,name1,text1,year,field,additional
166,Peter A. Diamond,"Peter A. Diamond, Dale T. Mortensen and Christ...",2010,economics,"['a>, <a href=""https:', 'a> and <a href=""https..."
167,Dale T. Mortensen,"Peter A. Diamond, Dale T. Mortensen and Christ...",2010,economics,"['a>, <a href=""https:', 'a> and <a href=""https..."
168,Christopher A. Pissarides,"Peter A. Diamond, Dale T. Mortensen and Christ...",2010,economics,"['a>, <a href=""https:', 'a> and <a href=""https..."


## Initial Functions
#### Proxy Method
Function gets and sets proxy for `scholarly` package with `fp`'s `FreeProxy`.

In [4]:
def setschproxy(countries=['US','CA','BR']):
    '''
    Loops until it gets a random working proxy from FreeProxy and passes it to scholarly.use_proxy. If the search has
    taken more than 100 seconds in total without success, it sleeps for 300 seconds, resets the cumulative time and
    keeps looping. It should not loop forever because FreeProxy raises an error and terminates when it cannot connect
    to the internet.
    
    Args
        countries (list) : defaults to ['US','CA','BR'], a list of country codes passed to FreeProxy's country_id
        
    Returns
        None. On each pass of the loop it prints the cumulative time (in seconds) spent looking for a proxy since the
        last reset, followed by True or False for whether a proxy has been successfully set
    '''
    hasproxy=False
    totaltime=0
    while not hasproxy: #until hasproxy is True
        start=time.time()
        proxy = FreeProxy(rand=True, timeout=1, country_id=countries).get() #gets random proxy
        hasproxy = scholarly.use_proxy(http=proxy, https=proxy)
        end= time.time()
        totaltime+=(end-start)
        if (totaltime>100) and (hasproxy==False): #if it's taking some time
            time.sleep(300) #try giving a break
            totaltime=0 #reset totaltime
        print(totaltime, hasproxy)
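
The retry-and-rest pattern in `setschproxy` can be tried offline with stand-in callables. The sketch below is only an illustration: `find_working_proxy`, `get_proxy`, `try_proxy` and the fake proxy addresses are hypothetical names introduced here, standing in for `FreeProxy(...).get()` and `scholarly.use_proxy(...)`. Nothing in it touches the network.

```python
import time


def find_working_proxy(get_proxy, try_proxy, max_search_seconds=100,
                       rest_seconds=300, sleep=time.sleep, clock=time.time):
    """Keep requesting proxies until try_proxy accepts one, then return it.

    get_proxy and try_proxy are hypothetical stand-ins for
    FreeProxy(...).get() and scholarly.use_proxy(...).
    """
    elapsed = 0.0
    while True:
        start = clock()
        proxy = get_proxy()
        accepted = try_proxy(proxy)
        elapsed += clock() - start
        if accepted:
            return proxy
        if elapsed > max_search_seconds:
            sleep(rest_seconds)  # rest before searching again
            elapsed = 0.0        # restart the cumulative timer


# Offline demonstration: the third made-up candidate is the one that "works".
candidates = iter(["10.0.0.1:80", "10.0.0.2:80", "10.0.0.3:80"])
rests = []
chosen = find_working_proxy(
    get_proxy=lambda: next(candidates),
    try_proxy=lambda p: p.endswith("3:80"),
    sleep=rests.append,  # record rests instead of actually sleeping
)
print(chosen)
```

Passing `sleep` and `clock` as parameters is a design choice made for this sketch so the timing logic can be checked without waiting five minutes.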

#### Fill in Author object function
Uses `scholarly`'s `.search_author` and `.fill` methods to get information on an Author object using author name.

In [129]:
def fillauth(authorname):
    '''
    Attempts to get information for authorname from Google Scholar's author search and if there is a result, then more
    detailed information from that webpage.
    
    Args
        authorname (str) : the name of an author to search Google Scholar Author pages for
    
    Returns
        is_filled (bool) : whether Author is filled (True) or not (False)
        authobj (int or Author) : 1 if no Author object could be created, 0 if the search result had a different name,
                                    otherwise the Author object that matches authorname
    '''
    authorname = authorname.strip()#remove trailing and leading spaces
    setschproxy() #sets proxy
    authobj=1 #a default value for authobj to check if it's changed
    is_filled = False
    print("about to search")
    try: # tries to create Author object
        authobj = next(scholarly.search_author(authorname))
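    except Exception: #e.g. StopIteration when the search has no results; a KeyboardInterrupt still stops the run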
        print("did not find",authorname)
        
    if authobj !=1: #if authobj has changed from default value
        aname = authobj.name
        print("not yet filled",aname)
        ismatch, matchlist = matchname(original=authorname,comparison=aname) #compare author names and variations
        if ismatch==False:#if names are sufficiently different according to matchname
            authobj=0 #this is a different author and should try another name
            print("different author, try new name") #print out name match status update
            return is_filled, authobj
        else:
            setschproxy() #sets new proxy
            try:
                authobj.fill()
            except: #if .fill() didn't work, will return the basic info in authobj, but not filled
                print("did not fill",authobj.id)
        print("filled:",authobj.filled)
        is_filled = authobj.filled #sets is_filled to filled status of Author object
    return is_filled, authobj
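
The second return value of `fillauth` carries three meanings: `1` (nothing found), `0` (a result with a different name) or an Author object. A minimal sketch of how a caller might tell them apart is shown below. The helper `classify_result`, the two constant names and the `FakeAuthor` class are assumptions made for this illustration and are not part of the notebook or of `scholarly`.

```python
NOT_FOUND = 1   # the search produced no Author object
WRONG_NAME = 0  # the search returned an author whose name did not match


def classify_result(authobj):
    """Return a short label for the second value returned by fillauth."""
    # Check the type first so that an object is never compared with 0 or 1.
    if isinstance(authobj, int):
        if authobj == NOT_FOUND:
            return "not found"
        if authobj == WRONG_NAME:
            return "different author"
        raise ValueError(f"unexpected sentinel {authobj!r}")
    return "author object"


class FakeAuthor:
    """Hypothetical stand-in for the Author class of the customized package."""

    def __init__(self, name):
        self.name = name
        self.filled = False


print(classify_result(1))
print(classify_result(0))
print(classify_result(FakeAuthor("Richard Thaler")))
```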

### Gets publication information
The function tries several variants of each author's name until the search returns a matching result, fills in the author's details from their author page, and fills in details for the first `pubnum` publications listed there (3 by default). It saves each author's information as a .CSV file and each publication's information as its own .CSV file. The coauthors of those first publications are returned in the dictionary `coaudict`, where each key is an author from `authorlist` and the paired value is that author's list of coauthors.

In [123]:
def findauthpubinfo(authorlist,pubnum=3):
    '''
    Takes in list of author names and saves each (filled or unfilled) that has a search result on Google Scholar authors as
    a .CSV and tries to fill the first pubnum quantity publications listed on the author page, which is in order of number
    of citations.
    
    Args
        authorlist (list) : a list of names to find information for, from their Google Scholar Author pages
        pubnum (int) : defaults to 3, the number of first pubnum publications to fill (get detailed information for), for
                        each author
    
    Returns
        coaudict (dict) : the keys are authorlist items and the values are lists of coauthors from first pubnum quantity 
                            of publications that are not in authorlist and if the publication has more than 20 authors,
                            only the first one (the lead) is included in that key's list
        newnameslist (list) : a revised authorlist which includes alternate abbreviations or capitalizations of some names
    '''
    coaudict = dict()
    newnameslist=authorlist.copy() #a copy that does not change the original authorlist
    for i, eachname in enumerate(authorlist): #loops through each author name in the list
        coaudict[eachname]=[]#each name has an entry for coauthors of their "top publications"
        eachname=eachname.strip()#remove trailing and leading spaces
        fileid=str(i)+eachname[-4:] #creates a unique identifier for the file so it's unlikely to be overwritten
        print(fileid,eachname)
        expectries = 0
        isfilled, foundname = False, 1 #resetting variables before loop
        nameobj = 0 #resetting nameobj
        while expectries<2 and foundname==1:#tries twice with expected name
            expectries+=1
            isfilled,foundname = fillauth(eachname) #returns first item boolean is_filled, second item is the Author object or 0 or 1
            #when foundname is 0, indicates fillauth found a result but with a different name
        #end while loop
        if foundname==0 or foundname==1: #if the Author object was not created and did not find the wrong name
            othernames = listnames(eachname) #names in multiple formats
            for othername in othernames:
                newnameslist.append(othername)
            for othername in othernames: #loop over variations
                if othername==eachname:#if it's the same as the original
                    continue #go to next one
                foundname=1 #reset foundname in case it did not find the name to account for the next loop trying new names
                tries=0 #reset number of tries
                time.sleep(77)
                print("trying new name",othername)
                #if foundname is 0, the search gave a different name; probably a different person
                while tries<2 and foundname==1: #if foundname is 0 (got a different name)
                    #or foundname is an Author object, exits loop to get next alternate
                    isfilled, foundname = fillauth(othername) #tries fillauth with one different version of the name
                    tries+=1 #two attempts to get author information
                #end while loop
                if foundname!=0 and foundname!=1:#check for whether it created an Author object
                    break #leave loop with Author object for relevant name
        #separate if statements, want to test one first, then the other (not either-or)
        if foundname!=0 and foundname!=1: #if the Author object was created, with the right author name
            nameobj = foundname #then an Author object was returned from fillauth(), if not 0 or 1
            newnameslist.append(nameobj.name) #it has an acceptable name that should be added to the newnamelist
            isfilled = nameobj.filled
            if isfilled==False:
                print("filling outside of fillauth",nameobj.name,isfilled)
                filltries=0
                while (filltries<1) and (isfilled==False): #tries once to fill author information
                    filltries+=1
                    time.sleep(75)
                    setschproxy()
                    try:
                        nameobj.fill() #tries to fill Author object
                    except:
                        print("tried to fill",filltries,nameobj.name,isfilled)
                    isfilled = nameobj.filled
                print("filled outside fillauth:",isfilled)
            #after trying to fill author, or if it was already filled
            haspubs = saveauthinfo(nameobj,fileid) #saves as .CSV, returns False if it had no publications or is not filled
            if haspubs: #if nameobj has a list of publication objects
                savepubinfo(nameobj.publications,fileid,pubnum) #gets top pubnum publications and saves each as .CSV files
                for eachpub in nameobj.publications:
                    if eachpub.filled:#if the publication is filled
                        try: #now nameobj.publications has pubnum publications filled, tries to get bib["listauthors"]
                            coaudict[eachname],newnameslist = coauthors(eachpub.bib["listauthors"],coaudict[eachname],newnameslist,eachname)
                        except:
                            print("Ran into some trouble with the authors:",eachpub.bib["author"])
                    else:#assume the publication is not filled
                        continue
        else:
            print(f"Could not find info for {eachname} nor {othernames}.")
            continue #go on to next author name
    #removes duplicates from newnameslist
    newnameslist = list(set(newnameslist))
    return coaudict, newnameslist
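
The file identifier built in `findauthpubinfo` is the list position followed by the last four characters of the name. The sketch below mirrors that scheme and adds a check for names that share a four-character suffix, which matters wherever files are later matched on the suffix alone. `make_fileid` and `shared_suffixes` are hypothetical helper names, and "Hypothetical Kremer" is a made-up name used only to show a collision.

```python
from collections import defaultdict


def make_fileid(index, name):
    """Mirror the notebook's scheme: list position + last four characters."""
    return str(index) + name.strip()[-4:]


def shared_suffixes(names):
    """Group names by their last four characters; keep groups with a clash."""
    groups = defaultdict(list)
    for name in names:
        groups[name.strip()[-4:]].append(name)
    return {suffix: grp for suffix, grp in groups.items() if len(grp) > 1}


names = ["Richard H. Thaler", "Esther Duflo", "Michael Kremer"]
print([make_fileid(i, n) for i, n in enumerate(names)])
# -> ['0aler', '1uflo', '2emer']

# A made-up extra name that ends in the same four letters as an existing one.
print(shared_suffixes(names + ["Hypothetical Kremer"]))
```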

#### Sample code and output
- Getting author names from `winnersdf`
- Using `findauthpubinfo()` to fill author and publication information on Richard Thaler

In [None]:
name31list = [winner for winner in winnersdf.name1[:31]] #gets author names from Nobel Prize winners' dataframe

In [39]:
winnersdf[10:23]

Unnamed: 0,name1,text1,year,field,additional
13,Esther Duflo,"Abhijit Banerjee, Esther Duflo and Michael Kre...",2019,economics,"['a>,\xa0<a href=""https:', 'a>\xa0and\xa0<a hr..."
14,Michael Kremer,"Abhijit Banerjee, Esther Duflo and Michael Kre...",2019,economics,"['a>,\xa0<a href=""https:', 'a>\xa0and\xa0<a hr..."
16,Arthur Ashkin,Arthur Ashkin “for the optical tweezers and th...,2018,physics,"['ashkin', 'a> “for the optical tweezers and t..."
17,Gérard Mourou,Gérard Mourou and Donna Strickland “for their ...,2018,physics,"['a> and <a href=""https:', 'a>\xa0“for their m..."
18,Donna Strickland,Gérard Mourou and Donna Strickland “for their ...,2018,physics,"['a> and <a href=""https:', 'a>\xa0“for their m..."
19,Frances H. Arnold,Frances H. Arnold “for the directed evolution ...,2018,chemistry,"['arnold', 'a>\xa0“for the directed evolution ..."
20,George P. Smith,Frances H. Arnold “for the directed evolution ...,2018,chemistry,"['arnold', 'a>\xa0“for the directed evolution ..."
21,Sir Gregory P. Winter,Frances H. Arnold “for the directed evolution ...,2018,chemistry,"['arnold', 'a>\xa0“for the directed evolution ..."
22,James P. Allison,James P. Allison and Tasuku Honjo\r\r\r\n“for ...,2018,medicine,"['allison', 'a> and <a href=""https:', 'a><br']"
23,Tasuku Honjo,James P. Allison and Tasuku Honjo\r\r\r\n“for ...,2018,medicine,"['allison', 'a> and <a href=""https:', 'a><br']"


In [64]:
namertlist = ["Richard H. Thaler"] #just last author left in that year
coauth30dict, revname30list = findauthpubinfo(namertlist,5) #saves each author and their first 5 publications' info

0aler Richard H. Thaler
11.537138938903809 False
13.498255968093872 False
56.01778292655945 True
about to search
Try 0 https://scholar.google.com/citations?hl=en&view_op=search_authors&mauthors=Richard%20H.%20Thaler
OK 200
did not find Richard H. Thaler
22.418460845947266 True
about to search
Try 0 https://scholar.google.com/citations?hl=en&view_op=search_authors&mauthors=Richard%20H.%20Thaler
OK 200
did not find Richard H. Thaler
trying new name Richard Thaler
6.0188939571380615 True
about to search
Try 0 https://scholar.google.com/citations?hl=en&view_op=search_authors&mauthors=Richard%20Thaler
OK 200
not yet filled Richard Thaler
19.257790565490723 False
40.874321937561035 True
Try 0 https://scholar.google.com/citations?hl=en&user=Tvzd5GgAAAAJ&pagesize=20
OK 200
filled: True
saved 0aler filled author information, with publication basics
filling publication Tvzd5GgAAAAJ:1sJd4Hv_s6UC
13.079080581665039 False
27.86325716972351 True
Try 0 https://scholar.google.com/citations?hl=en&view_

In [65]:
coauth30dict #coauthors for Richard Thaler

{'Richard H. Thaler': ['Cass R. Sunstein', 'Werner FM De Bondt']}

In [66]:
revname30list #possible alternate names to search

['Richard H. Thaler',
 'Richard Thaler',
 'R H Thaler',
 'Richard H Thaler',
 'R. H. Thaler',
 'RH Thaler',
 'Richard H. Thaler',
 'Richard Thaler']

## Getting list of coauthors
Each publication's `listauthors` attribute (a list of names) was saved to a .CSV file as one long string; in `allpubsdf` it appears in the `bib_listauthors` column.

In [9]:
#split a string into list of names separated by commas
def stringtolist(longstring):
    '''
    Designed for strings that were a list, but then saved as a string. It removes quotes, [] and blank space, returning
    a list of items that were separated by commas: "['item1','item2']" as a string -> a list of strings item1, item2
    
    Args
        longstring (str) : a string that is formatted like a list
    
    Returns
        noquoteitems (list) : a list of the Strings that were in longstring separated by commas, without quotes '' or ""
    '''
    longstring = longstring.strip("[\"]")#could account for "Author1 and Author2" here by splitting at " and "
    items = longstring.split(',')
    noquoteitems = [item.strip().strip("\'") for item in items]
    return noquoteitems
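
Because `stringtolist` splits at every comma, a name that itself contains a comma would be broken into two items. One possible alternative, sketched here under the assumption that the saved strings are valid Python list literals, is to parse them with the standard library's `ast.literal_eval` and fall back to a plain split otherwise. `parse_saved_list` is a hypothetical name, and the second sample string is made up for illustration.

```python
import ast


def parse_saved_list(text):
    """Parse a string such as "['A B', 'C D']" back into a list of strings."""
    try:
        value = ast.literal_eval(text)
    except (ValueError, SyntaxError):
        # Not a valid Python literal: fall back to a plain comma split.
        return [part.strip().strip("'\"") for part in text.strip("[]").split(",")]
    if isinstance(value, (list, tuple)):
        return [str(item).strip() for item in value]
    return [str(value).strip()]


print(parse_saved_list("['Cass R. Sunstein', 'Werner FM De Bondt']"))
# A made-up example in which each name contains a comma.
print(parse_saved_list("['Smith, John', 'Doe, Jane']"))
```

`ast.literal_eval` only evaluates literals, so unlike `eval` it does not run arbitrary code found in the file.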

### Making an author name list and dictionaries of author names with coauthor names
This code block compares coauthor names from `allpubsdf` to author names from `winnersdf` to create a list of revised author names `revwinners` and a dictionary of author names as keys with a list of coauthor names as values, from the `coauthors` function.

In [58]:
#this only works if all authors' last four letters of their names are unique
winnernames = [win for win in winnersdf.name1[:32]] #up to last author name; Richard H. Thaler
#create dictionary of all coauthor names
coauth32dict = dict()
revwinners = [] #to be copy of winnernames to which to add variations that matched, returned by coauthors()
for winnername in winnernames:
    #winnername = winnername.strip()#remove any trailing or leading blank space
    coauth32dict[winnername]=[]
    revwinners.append(winnername) #create copy of winnernames
#get coauthor names and other info from publication object .CSV files
for coauthstring,authid in zip(allpubsdf["bib_listauthors"],allpubsdf.author_fileID):
    authlast4 = authid[-4:]
    #convert coauthnames from string to list
    coauthnames = stringtolist(coauthstring)#remove trailing and leading blank space, "" or '' and []
    for winnername in winnernames:
        if authlast4==winnername[-4:]:#where last four characters of author_fileID match a winnername's, add to dictionary entry
            coauth32winnames = coauth32dict[winnername]
            coauth32dict[winnername],revwinners=coauthors(coauthnames,coauth32winnames,revwinners,winnername)
            break #and then leave this loop to get to next authlast4
        else:
            continue

Richard H. Thaler was not added.
Added new author Cass R. Sunstein to coauthor list.
Only one author, ['Richard H. Thaler and Cass R Sunstein'], was not added.
Only one author, ['Thomas C Leonard'], was not added.
Added new author Werner FM De Bondt to coauthor list.
Added new author Richard Thaler to coauthor list.
Only one author, ['Richard Thaler'], was not added.
Arthur Ashkin was not added.
Added new author James M Dziedzic to coauthor list.
Added new author JE Bjorkholm to coauthor list.
Added new author Steven Chu to coauthor list.
Arthur Ashkin was not added.
James M Dziedzic was not added.
JE Bjorkholm was not added.
Steven Chu was not added.
Only one author, ['Arthur Ashkin'], was not added.
Arthur Ashkin was not added.
James M Dziedzic was not added.
Added new author T Yamane to coauthor list.
Arthur Ashkin was not added.
James M Dziedzic was not added.
Richard Henderson was not added.
Added new author Joyce M Baldwin to coauthor list.
Added new author Thomas A Ceska to coau

In [271]:
coauth32dict["Esther Duflo"] #testing dictionary

['Marianne Bertrand',
 'Sendhil Mullainathan',
 'Abhijit V Banerjee',
 'Rachel Glennerster',
 'Cynthia Kinnan']

In [273]:
revwinners#this is no different from original list

['James Peebles',
 'Michel Mayor',
 'Didier Queloz',
 'John B. Goodenough',
 'M. Stanley Whittingham',
 'Akira Yoshino',
 'William G. Kaelin Jr',
 'Sir Peter J. Ratcliffe',
 'Gregg L. Semenza',
 'Abhijit Banerjee',
 'Esther Duflo',
 'Michael Kremer',
 'Arthur Ashkin',
 'Gérard Mourou',
 'Donna Strickland',
 'Frances H. Arnold',
 'George P. Smith',
 'Sir Gregory P. Winter',
 'James P. Allison',
 'Tasuku Honjo',
 'William D. Nordhaus',
 'Paul M. Romer',
 'Rainer Weiss',
 'Barry C. Barish',
 'Kip S. Thorne',
 'Jacques Dubochet',
 'Joachim Frank',
 'Richard Henderson',
 'Jeffrey C. Hall',
 'Michael Rosbash',
 'Michael W. Young',
 'Richard H. Thaler']

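### Revised dictionary includes up to four coauthors for each winner
Winners with no coauthors (empty lists) are excluded. For every other winner, the `findnames` function returns the first coauthor names that are not themselves winners, and these are stored in the `first3coauth` dictionary. Despite the dictionary's name, `findnames` is called with `firstfew=4`, so each entry holds between one and four names.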

In [65]:
#now that I have a list of coauthors for each winner, getting first three coauthor names that are not in revwinners
#if only one or two, okay
first3coauth=dict()
for winn,coau32list in coauth32dict.items():
    if len(coau32list)>0:
        first3coauth[winn]=[]
        first3 = findnames(authornames=revwinners,name=winn,namelist=coau32list,firstfew=4)
        first3coauth[winn]=first3
    else:
        print("no coauthors/info for",winn)
first3coauth

no coauthors/info for James Peebles
no coauthors/info for Michel Mayor
no coauthors/info for Didier Queloz
no coauthors/info for John B. Goodenough
Added new top name Shoufeng Yang
Added new top name Peter Y Zavalij
got 2 names for M. Stanley Whittingham
no coauthors/info for Akira Yoshino
no coauthors/info for William G. Kaelin Jr
no coauthors/info for Sir Peter J. Ratcliffe
Added new top name Daniel J Klionsky
Added new top name Guang L Wang
Added new top name Bing-Hua Jiang
Added new top name Elizabeth A Rue
got 4 names for Gregg L. Semenza
Added new top name Andrew F Newman
Added new top name Rachel Glennerster
Added new top name Cynthia Kinnan
got 3 names for Abhijit Banerjee
Added new top name Marianne Bertrand
Added new top name Sendhil Mullainathan
Added new top name Rachel Glennerster
Added new top name Cynthia Kinnan
got 4 names for Esther Duflo
no coauthors/info for Michael Kremer
Added new top name James M Dziedzic
Added new top name JE Bjorkholm
Added new top name Steven C

{'M. Stanley Whittingham': ['Shoufeng Yang', 'Peter Y Zavalij'],
 'Gregg L. Semenza': ['Daniel J Klionsky',
  'Guang L Wang',
  'Bing-Hua Jiang',
  'Elizabeth A Rue'],
 'Abhijit Banerjee': ['Andrew F Newman',
  'Rachel Glennerster',
  'Cynthia Kinnan'],
 'Esther Duflo': ['Marianne Bertrand',
  'Sendhil Mullainathan',
  'Rachel Glennerster',
  'Cynthia Kinnan'],
 'Arthur Ashkin': ['James M Dziedzic',
  'JE Bjorkholm',
  'Steven Chu',
  'T Yamane'],
 'Gérard Mourou': ['Toshiki Tajima', 'Sergei V Bulanov', 'A Braun', 'G Korn'],
 'Donna Strickland': ['P Maine', 'P Bado', 'M Pessot', 'G Mourou'],
 'Frances H. Arnold': ['Todd Thorsen',
  'Richard W Roberts',
  'Stephen R Quake',
  'Anne Y Fu'],
 'Sir Gregory P. Winter': ['John McCafferty',
  'Andrew D Griffiths',
  'David J Chiswell',
  'Lutz Riechmann'],
 'James P. Allison': ['Daniel L Barber',
  'E John Wherry',
  'David Masopust',
  'Baogong Zhu'],
 'Joachim Frank': ['Michael Radermacher',
  'Pawel Penczek',
  'Jun Zhu',
  'Yanhong Li'],
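
The selection step can be pictured as "take the first few names that are not in an exclusion list". The sketch below shows that idea only: `first_few_outside` is a hypothetical function, it compares names by exact string equality, and it is not the notebook's `findnames`, whose definition is not shown here. "Someone Else" is a made-up name.

```python
def first_few_outside(candidates, excluded, limit=4):
    """Return up to `limit` distinct candidates that are not in `excluded`."""
    excluded_set = set(excluded)
    picked = []
    for name in candidates:
        if name in excluded_set or name in picked:
            continue  # skip winners and repeats
        picked.append(name)
        if len(picked) == limit:
            break
    return picked


winners = ["Esther Duflo", "Abhijit Banerjee", "Michael Kremer"]
candidates = ["Marianne Bertrand", "Sendhil Mullainathan", "Esther Duflo",
              "Rachel Glennerster", "Cynthia Kinnan", "Someone Else"]
print(first_few_outside(candidates, winners))
```

Exact matching would not recognize a variant such as "Abhijit V Banerjee" as a winner, so a real implementation needs some form of name comparison on top of this.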


In [96]:
#get the values (each is a list of names) from first3coauth dictionary and put into a list of lists
listoflists = list(first3coauth.values())

In [91]:
#with dictionary of authors listing top 1-4 coauthors, loop through to fill their top 5 publications
#for each, add to comparison list so if a name appears in more than one author's list, it won't be filled twice
completelist=[]
cocoauthdict = dict()
for firstcoauths in first3coauth.values(): #the original line ended in an empty [] subscript, which is a syntax error
    if len(firstcoauths)>1: #if there is more than one coauthor for each
        for num, coauth in enumerate(firstcoauths):
            if not(coauth in completelist) and not(coauth in revwinners):
                coauth = coauth.strip()
                codict, completelist = findauthorpubinfo(coauth,completelist,num,pubnum=5)
                completelist.append(coauth)
                cocoauthdict[coauth]=codict[coauth]
            if num>1: #if it recently tried to get info for the third name for this author
                break

0Yang Shoufeng Yang
12.17975378036499 True
about to search
Try 0 https://scholar.google.com/citations?hl=en&view_op=search_authors&mauthors=Shoufeng%20Yang%20%28%E6%9D%A8%E5%AE%88%E5%B3%B0%29
Try 1 https://scholar.google.com/citations?hl=en&view_op=search_authors&mauthors=Shoufeng%20Yang%20%28%E6%9D%A8%E5%AE%88%E5%B3%B0%29
did not find Shoufeng Yang (杨守峰)
11.928751468658447 False
37.17643475532532 False
45.2666916847229 True
about to search
Try 0 https://scholar.google.com/citations?hl=en&view_op=search_authors&mauthors=Shoufeng%20Yang%20%28%E6%9D%A8%E5%AE%88%E5%B3%B0%29
OK 200
not yet filled Shoufeng Yang (杨守峰)
8.080149173736572 False
20.520059823989868 True
Try 0 https://scholar.google.com/citations?hl=en&user=Jmkx86MAAAAJ&pagesize=20
OK 200
filled: True
saved 0Yang filled author information, with publication basics
filling publication Jmkx86MAAAAJ:u5HHmVD_uO8C
7.791932582855225 True
Try 0 https://scholar.google.com/citations?hl=en&view_op=view_citation&citation_for_view=Jmkx86MAAAAJ

KeyboardInterrupt: 

In [136]:
len(completelist)

239

In [92]:
#preserving list here
completelist

['A. F. Newman',
 'andrew f. newman',
 'Peter Zavalij',
 'Shoufeng Yang (杨守峰)',
 'daniel j klionsky',
 'Guang L Wang',
 'G. L. Wang',
 'Daniel J Klionsky',
 'guang l. wang',
 'Andrew F. Newman',
 'daniel j. klionsky',
 'Marianne Bertrand',
 'a f newman',
 'd j klionsky',
 'AF Newman',
 'Guang L. Wang',
 'dj klionsky',
 'Daniel J. Klionsky',
 'Guo-Liang Wang',
 'DJ Klionsky',
 'Shoufeng Yang',
 'D J Klionsky',
 'andrew newman',
 'af newman',
 'g l wang',
 'Daniel Klionsky',
 'g. l. wang',
 'Peter Y Zavalij',
 'gl wang',
 'Bing-Hua Jiang',
 'andrew f newman',
 'a. f. newman',
 'daniel klionsky',
 'd. j. klionsky',
 'G L Wang',
 'guang l wang',
 'Rachel Glennerster',
 'Andrew F Newman',
 'Andrew Newman',
 'GL Wang',
 'D. J. Klionsky',
 'A F Newman',
 'Cynthia Kinnan',
 'Sendhil Mullainathan',
 'Guang Wang',
 'guang wang',
 'Sendhil Mullainathan']

## Methods to get and save author, publication info

In [133]:
def findauthorpubinfo(authorname,complist,filenum,pubnum=3):
    '''
    Takes in one author name and saves (filled or unfilled) as .CSV and same for each first pubnum quantity publications.
    It returns revised complist (includes alternate capitalizations or initials + (Western-style) last name of some names) 
    and a dictionary with list of co-authors from those first pubnum quantity of publications that are not in complist.
    
    Args
        authorname (str) : A name to be searched for in Google Scholar Author pages
        complist (list) : list of names to be compared against to be sure coauthors are not already accounted for
        filenum (int) : a number to be included in the .CSV file(s) name(s)
        pubnum (int) : defaults to 3, the number of first few publications to fill complete details for
    
    Returns
        coaudict (dict) : dictionary of one item: key being authorname and value being a list of other authors credited on
                            pubnum publications that do not appear in complist
        newnameslist (list) : list of names from complist, with addition of any variations on authorname that were 
                            searched for, retrieved as a result or would be potential matches
    '''
    coaudict=dict()
    coaudict[authorname]=[]#an entry for coauthors of their "top" publications
    newnameslist=complist.copy()
    eachname = notitles(authorname)#removes prefixes and suffixes
    fileid=str(filenum)+eachname[-4:] #creates a unique identifier for the file so it's unlikely to be overwritten
    print(fileid,eachname)
    expectries = 0
    isfilled, foundname = False, 1 #set variables before loop
    nameobj = 0 #set nameobj
    while expectries<2 and foundname==1:#tries twice with expected name
        expectries+=1
        #if authorname=="Shoufeng Yang":#the following name would not be recognized by my matchname code
        #     isfilled,foundname = fillauth("Shoufeng Yang (杨守峰)")#so this was a specific manual fix
        #else:
        isfilled,foundname = fillauth(authorname) #returns first item boolean is_filled, second item is the Author object 
        # or 0 or 1 when foundname is 0, indicates fillauth found a result but with a different name
    #end while loop
    if foundname==0 or foundname==1: #if the Author object was not created and did not find the wrong name
        nametries=1 #tried with authorname already
        othernames = listnames(eachname) #names in multiple formats
        for othername in othernames:
            newnameslist.append(othername)
        for othername in othernames: #loop over variations
            if othername==eachname:#if it's the same as the original
                continue #go to next one
            foundname=1 #reset foundname in case it did not find the name to account for the next loop trying new names
            tries=0 #reset number of tries
            time.sleep(77)
            print("trying new name",othername)
            #if foundname is 0, the search gave a different name; probably a different person
            while tries<2 and foundname==1: #if foundname is 0 (got a different name)
                #or foundname is an Author object, exits loop to get next alternate
                isfilled, foundname = fillauth(othername) #tries fillauth with one different version of the name
                tries+=1 #two attempts to get author information
            #end while loop
            nametries+=1 #may want to put nametries into while tries<2 foundname==1 loop and put higher limit
            if (foundname!=0 and foundname!=1) or nametries>4:#check for whether it created Author object or tried 4 names
                break #leave loop with Author object for relevant name
    #separate if statements, want to test one first, then the other (not either-or)
    if foundname!=0 and foundname!=1: #if the Author object was created, with the right author name
        nameobj = foundname #then an Author object was returned from fillauth(), if not 0 or 1
        newnameslist.append(nameobj.name) #it has an acceptable name that should be added to the newnamelist
        isfilled = nameobj.filled
        if isfilled==False:
            print("filling outside of fillauth",nameobj.name,isfilled)
            filltries=0
            while (filltries<1) and (isfilled==False): #tries once to fill author information
                filltries+=1
                time.sleep(75)
                setschproxy()
                try:
                    nameobj.fill() #tries to fill Author object
                except:
                    print("tried to fill",filltries,nameobj.name,isfilled)
                isfilled = nameobj.filled
            print("filled outside fillauth:",isfilled)
        #after trying to fill author, or if it was already filled
        haspubs = saveauthinfo(nameobj,fileid) #saves as .CSV, returns False if it had no publications or is not filled
        if haspubs: #if nameobj has a list of publication objects
            savepubinfo(nameobj.publications,fileid,pubnum) #gets top pubnum publications and saves each as .CSV files
            for eachpub in nameobj.publications:
                if eachpub.filled:#if the publication is filled
                    try: #now nameobj.publications has pubnum publications filled, tries to get bib["listauthors"]
                        coaudict[authorname],newnameslist = coauthors(eachpub.bib["listauthors"],coaudict[authorname],
                                                                      newnameslist,eachname)
                    except:
                        print("Ran into some trouble with the author(s):",eachpub.bib["author"])
                else:#assume the publication is not filled
                    continue
    else:
        print(f"Could not find info for {eachname} nor {authorname} nor {othernames}.")
    #removes duplicates from newnameslist
    newnameslist = list(set(newnameslist))
    return coaudict, newnameslist

In [121]:
#this does what it says just fine as long as there are no prefixes or suffixes
def noinitials(firstMlast):
    '''
    Given a name with initials, returns the name without them. A name with two or fewer parts, such as I. Lima, is
    returned unchanged, because dropping the initial would leave only a last name.
    
    Args
            firstMlast (str) : a name of the format W. Juliet Tango or Walter J Tango (with or without periods)
    
    Returns
            (str) : the name without first or middle initials if it has three or more parts; otherwise the original
                    name, stripped of leading and trailing spaces
    '''
    firstMlast = firstMlast.strip()#remove any trailing or leading spaces
    namelist = firstMlast.split(' ')
    firstlast="" #initialize as String
    #looping through which parts are abbreviated or not
    if len(namelist)>2:
        for name in namelist[:-1]:#excepting last name
            namelen=len(name)
            if not(namelen==1 and name[0].isalpha()) and not(namelen==2 and name[1]=="."):#if is not initials B or B.
                firstlast=firstlast+name+" "
            else:#it is an initial
                continue
    
        return firstlast+namelist[-1] #adding on the last name
    else:
        return firstMlast #this name was one or two parts and did not want to remove one or both to create a sole last name
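
A few checks make the behaviour described in the docstring concrete. The sketch below restates the rule in a compact helper so that the block runs on its own; `drop_initials` is a hypothetical name used only for illustration and is meant to mirror `noinitials`, not replace it.

```python
def drop_initials(name):
    """Hypothetical compact restatement of noinitials, for illustration only."""
    name = name.strip()
    parts = name.split(' ')
    if len(parts) <= 2:
        return name  # dropping an initial could leave only a last name

    def is_initial(part):
        # "B" or "B." counts as an initial (the same checks noinitials uses)
        return (len(part) == 1 and part.isalpha()) or (len(part) == 2 and part[1] == ".")

    kept = [part for part in parts[:-1] if not is_initial(part)]
    return " ".join(kept + [parts[-1]])

print(drop_initials("W. Juliet Tango"))  # Juliet Tango
print(drop_initials("Walter J Tango"))   # Walter Tango
print(drop_initials("I. Lima"))          # I. Lima (two parts, returned unchanged)
```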

In [120]:
def firstinitial(firstmlast):
    '''
    Given a name in the format Alpha Bravo Lastname or Alpha B. Lastname or A. Bravo Lastname, returns the names 
    A. B. Lastname and AB Lastname. This expects names to be in the Western order of Firstname Lastname and for two last
    names to be separated by something other than a space, otherwise the first one would be treated as a middle name.
    
    Args
            firstmlast (str) : a name assumed not to be in the format ABC Lastname; would treat that like ABC LASTNAME
            
    Returns
            ABName (str) : a name with initials for all except the last name (Western-style last name)
            ABjoined (str) : two initials (or more) are fused together instead of separated
    '''
    ABjoined =""
    ABName=""
    firstmlast = firstmlast.strip()#removes trailing and leading blank space
    splitname = firstmlast.split(' ') #splits name where there is a space
    nameAsList = isinitials(splitname) #gets initials, if in form of ['UJ', 'Final'], returns ['U','J','Final']
    
    if len(nameAsList)>1:
        for namepart in nameAsList[:-1]: #loop through all except the last part of the name
            partlen=len(namepart)
            if (partlen==2 and namepart[1]=="." and namepart[0].isalpha()): #if it is already an initial like A.
                ABName = ABName+namepart.upper()+" "#add to name
                ABjoined+=namepart[0].upper()#would need the A, not the period
                continue #go on to next part
            elif partlen==1: #if it is an initial without a period
                if namepart.isalpha(): #if it's a "letter", then it's an initial already
                    ABName = ABName+namepart.upper()+". "#add to name with period
                    ABjoined+=namepart.upper()
                else:
                    continue #goes to next namepart, since this namepart is a symbol
            elif partlen>1:
                for letter in namepart:
                    if letter.isalpha(): #finds the first character that is not a symbol
                        ABName=ABName+letter.upper()+". "#adds the first letter of namepart with a period: Vi is added as V.
                        ABjoined=ABjoined+letter.upper()
                        break #leave loop after adding initials to the names (no need for other letters)
        #end for loop
        ABName = ABName+nameAsList[-1] #should already have space after any initial and if mononym, returns full name
        ABjoined = ABjoined+" "+nameAsList[-1] #would not have any space as all initials are together
    else:#it's a mononym and has no need for initials
        ABName = nameAsList[0]
        ABjoined = nameAsList[0]
    
    return ABName, ABjoined

In [119]:
def alternatenames(origin):
    '''
    Given a name, returns versions with initials instead of first names, with and without periods, without
    middle initials or without first initials (if middle name is what origin goes by).
        
    Args
        origin (str) : a name with the first name or initial followed by middle name or initial followed by last name
    
    Returns
        alternames (list) : a list of names derived from origin; without prefixes or suffixes
    '''
    origin = origin.strip() #remove trailing or leading blank space
    aName = origin #so can check if had suffixes or prefixes
    aName = notitles(aName)#removes select suffixes or prefixes
    flastname,fmlastname = firstinitial(aName) #gets initials + last name in two formats: A. B. Lastname, then AB Lastname
    nomiddle = noinitials(aName) #gets first and last names minus middle initials
    alternames=[]
    #check if origin matches any alternates - if so, not returning them
    for alter in [aName,nomiddle,flastname,fmlastname]:
        if origin!=alter:
            alternames.append(alter)
        else:
            continue
            
    return alternames

In [118]:
def withdots(namewodots):
    '''
    Takes away periods from initials if namewodots has them or adds them if missing.
    
    Args
        namewodots (str) : a name with initials in the format A Lastname or Alpha B. Lastname
    
    Returns
        nameWdots (str) : the given string with periods after initials if it had none or without periods if it had them
    '''
    partslist = namewodots.split(' ')
    nameWdots=""
    for part in partslist[:-1]:#all but last name
        partlen =len(part)
        if partlen==1 and part.isalpha(): # W -> W.
            nameWdots+=part.upper()+". "#add period if it's a single letter
        elif partlen==2 and part[1]==".":#is initial with a dot W.
            nameWdots+=part[0].upper()+" " #removes the period, makes it uppercase letter if not w. -> W
        else:#if it's not an initial
            nameWdots+=part+" "#rejoin without a period
    nameWdots+=partslist[-1]#add last name back on
    return nameWdots
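
Since `withdots` flips each initial independently, applying it twice to a name whose initials are already uppercase should give back the starting name. The sketch below restates the toggle in a standalone helper to show this; `toggle_dots` is a hypothetical name, and the block is an illustration under that assumption rather than the notebook's own code.

```python
def toggle_dots(name):
    """Hypothetical restatement of withdots: add missing periods, remove existing ones."""
    parts = name.split(' ')
    out = []
    for part in parts[:-1]:                  # the last part is treated as the last name
        if len(part) == 1 and part.isalpha():
            out.append(part.upper() + ".")   # W -> W.
        elif len(part) == 2 and part[1] == ".":
            out.append(part[0].upper())      # W. -> W
        else:
            out.append(part)                 # full given names pass through
    out.append(parts[-1])
    return " ".join(out)

print(toggle_dots("Romeo D. Umbrella"))               # Romeo D Umbrella
print(toggle_dots("R D Umbrella"))                    # R. D. Umbrella
print(toggle_dots(toggle_dots("Romeo D. Umbrella")))  # Romeo D. Umbrella
```

A name that mixes both styles, such as `A. B Charlie`, comes back as `A B. Charlie`: each initial is toggled on its own rather than normalized to one style.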

In [125]:
def listnames(name):
    '''
    Given name, returns a list of alternate versions, including the original, without trailing or leading blank space.
    There will be all lowercase and capitalized, with and without periods after initials, without middle initials, etc.
    
    Args
        name (str) : a name with given names separated by spaces and by a space between given names and last name, in
                    Western-style order
    
    Returns
        morenames (list) : a list of Strings; names generated by various functions operating on name
    '''
    name = name.strip()
    newnames = alternatenames(name)#returns a list
    newnames.append(name)
    morenames = newnames.copy()
    for newname in newnames:
        dotname = withdots(newname)#get periods or remove them using withdots
        morenames.append(dotname)
        morenames.append(newname.lower())#all lowercase
        morenames.append(dotname.lower())#variations with and without periods
    morenames = [morename for morename in set(morenames)]
    return morenames

In [115]:
testcases = ["Sir Foxtrot Lima Niner","Juliet Tango III","Murasaki","Romeo D. Umbrella","Sir A Bravo Charlie Jr"]
for testcase in testcases:
    print(listnames(testcase))

['F L Niner', 'Sir Foxtrot Lima Niner', 'F. L. Niner', 'FL Niner', 'Foxtrot Lima Niner']
['Juliet Tango III', 'Juliet Tango', 'J. Tango', 'J Tango']
['Murasaki']
['R D Umbrella', 'Romeo D Umbrella', 'Romeo D. Umbrella', 'Romeo Umbrella', 'R. D. Umbrella', 'RD Umbrella']
['AB Charlie', 'A. Bravo Charlie', 'A. B. Charlie', 'Bravo Charlie', 'Sir A. Bravo Charlie Jr', 'Sir A Bravo Charlie Jr', 'A Bravo Charlie', 'A B Charlie']
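
Because `listnames` removes duplicates by passing the list through `set`, the order of the variants printed above can differ from run to run. Where a stable order is useful (for instance when comparing saved outputs), one option is to deduplicate with `dict.fromkeys`, which keeps the order of first appearance. A minimal sketch; `unique_in_order` is a hypothetical helper, not part of the notebook:

```python
def unique_in_order(names):
    """Drop repeated names while keeping the order of first appearance."""
    return list(dict.fromkeys(names))

variants = ["Juliet Tango", "J. Tango", "Juliet Tango", "J Tango", "J. Tango"]
print(unique_in_order(variants))  # ['Juliet Tango', 'J. Tango', 'J Tango']
```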


In [115]:
def notitles(fullname):
    '''
    Removes certain prefixes and/or suffixes (Sir, Jr, III) or returns fullname stripped of trailing and leading blank 
    space.
    
    Args
        fullname (str) : a name with the first three characters matching Sir and/or the last three matching Jr or III
    
    Returns
        (str) : the name without those particular suffixes or prefixes (Sir, Jr, III)
    '''
    if fullname[:4] == "Sir ":#if starts with a prefix Sir (or could be Dame or Rev. or Hon.)
        fullname = fullname[4:] #so remove prefix
    suffix = fullname[-3:].lower()
    if suffix==" jr" or suffix=="iii": #if it has a suffix Jr or III (could also be Jr. . . or Esq.)
        fullname=fullname[:-3] # so remove the suffix and maybe a space

    return fullname.strip() #removes any newly trailing or leading spaces
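
The comments in `notitles` mention titles it does not yet handle (Dame, Rev., Hon., Jr., Esq.). One possible extension is to compare whole tokens against small lookup sets instead of slicing fixed character counts. The sketch below assumes that approach; `strip_titles`, `PREFIXES` and `SUFFIXES` are hypothetical names, and the sets are illustrative, not a complete list.

```python
# Illustrative sets only; extend them to fit the names in the data.
PREFIXES = {"sir", "dame", "rev", "hon"}
SUFFIXES = {"jr", "iii", "esq"}

def strip_titles(fullname):
    """Hypothetical token-based variant of notitles."""
    parts = fullname.split()
    # compare tokens case-insensitively and without a trailing period
    while len(parts) > 1 and parts[0].lower().rstrip(".") in PREFIXES:
        parts = parts[1:]
    while len(parts) > 1 and parts[-1].lower().rstrip(".") in SUFFIXES:
        parts = parts[:-1]
    return " ".join(parts)

print(strip_titles("Sir Peter J. Ratcliffe"))  # Peter J. Ratcliffe
print(strip_titles("William G. Kaelin Jr."))   # William G. Kaelin
print(strip_titles("Murasaki"))                # Murasaki
```

Matching whole tokens also avoids trimming a surname that merely ends in the letters `iii`, which the slice-based check would shorten.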

In [114]:
def isinitials(shortname):
    '''
    takes in a name as a list separated by ' ' and checks whether it's all capitals like AL HOTEL or if the first name is
    the first and middle initial fused together more like AB Lastname and returns them as a list ['A','B','Lastname']
    '''
    lastname = shortname[-1]#the last item should be a last name if the first item is the first initials
    if (lastname[1].isupper()==False):#if the second character in lastname is lowercase
        #then the whole name is not uppercase (so it accounts for XM Li but does not account for VI McDONALD)
        initials = shortname[0]#the initial(s) are the first item in the list of shortname
        inilen=len(initials)
        if inilen==1:#there are no other initials to separate out
            return shortname
        elif inilen==2:
            if (initials[1].isalpha() and initials[1].isupper()) and (initials[0].isalpha() and initials[0].isupper()):
                #if the characters are uppercase and not symbols
                sepname=[initials[0],initials[1]]
                for name in shortname[1:]:#adds on the rest of the name
                    sepname.append(name)
                return sepname
            else:#the two characters in the beginning of shortname are not initials like JE
                return shortname
        else: #length of first item is more than 2
            if(initials[1].islower()):#if second char is lowercase
                return shortname#then this is most likely a complete name and not fused initials
            else:#made a mistake with checking the last name and so this is a complete name, in uppercase
                #or could separate out all of the initials but more than two is unlikely, I would think
                return shortname
    else: #most likely, the entire last name is uppercase
        return shortname #returns given name list

In [126]:
def matchname(original, comparison):
    '''
    Given an original name and a new variation, checks whether they match closely enough. Always returns a boolean
    (whether they match) and the list of alternate names for original.
    This function may miss some similarities, particularly between shortened versions of names, like not matching Jon with
    Jonathan.
    '''
    #should create new method to return lists of useful names
    #remove both's trailing and leading spaces
    original = original.strip()
    comparison = comparison.strip()
    nopresuf = notitles(original)
    orignames = listnames(original)#list of unique variations, including original and nopresuf
    
    #check to see if comparison is in orignames
    #if so, it's a match
    #if not, check for comparison extra middle name/initial or first name matching original's first initial
    #but only against nopresuf version; not original
    
    if (comparison in orignames) or (comparison.lower() in orignames):
        return True, orignames #it does match exactly and returns the list of variations
    elif (comparison.lower() in [orli.lower() for orli in orignames]):
        orignames.append(comparison) #list.append returns None, so append first and return the list itself
        return True, orignames #it does match and returns the list of variations
    else:
        origparts = nopresuf.split(' ')
        comparts = comparison.split(' ')
        #assumes nopresuf is not in the format AB Lastname where A and B are first and middle initials
        olastname = origparts[-1].lower()
        clastname = comparts[-1].lower()
        if clastname!=olastname: #comparison to last name of nopresuf; if not exact, not a match
            return False, orignames #not a match
        else: #exact match for last name
            #separate out initials from comparison or original if both are uppercase unless all of lastname is uppercase
            iniorig = isinitials(origparts)
            inicomp = isinitials(comparts)
            samelen = (len(inicomp)==len(iniorig))#check if have same number of parts/initials
            if samelen: #if they do have the same number, check if initials match, then names, for each
                #original has priority!
                for o,c in zip(iniorig,inicomp):
                    inio = o[0].lower()
                    inic = c[0].lower()
                    if inio!=inic:#the same name would have all parts of the name in the same order
                        return False, orignames #first letters differ, so this pair of parts cannot match
                    elif o.lower()==c.lower():#if the entire name matches
                        continue
                    elif inio==inic: #the first initials match
                        orilen = len(o)
                        comlen = len(c)
                        if comlen==orilen: #if the comparable parts of the name are the same length
                            #if they were 1 character long, would have already matched
                            if orilen==2: #if both are two characters long
                                if o[1]!=c[1]: #if second characters do not match
                                    if o[1]=="." or c[1]==".": #check if one has a period
                                        #if one does, then they are equivalent, whichever it is: V. == Vi
                                        continue #continue to next c and o pair in the loop
                                    else:#they are not the same name, like Al and Ab, or Yi and Ye
                                        return False, orignames
                            else: #they are the same length but longer than 2 characters and are not identical
                                return False, orignames #they do not match
                        else:#different lengths, first initial matches
                            if (orilen==2 and comlen==1) or (comlen==2 and orilen==1):
                                #if one or both is an initial (A compared to A. or Al), they match
                                continue
                            elif (orilen==2 and o[1]==".") or (comlen==2 and c[1]=="."): #one is an initial with a period
                                continue
                            elif (orilen==1 or comlen==1):#one is a single initial: G matches George
                                continue
                            else: #they are different lengths, neither is an initial, so they are different names
                                return False, orignames
                    else: #all possibilities already accounted for
                        continue
                #end for loop
                return True, orignames #the two match initials and names closely enough if did not already return
            else: #inicomp and iniorig are not the same length; do not have the same number of names
                #so one may have an extra middle/first name/initial (Ant B Lastname compared to Ant Lastname)
                #(or Q Fred Lastname compared to Fred Lastname - less likely)
                #loop through one name and compare each part except the last [:-1] to each other part of the other
                #simpler check to see if middle name/initial was missing, but first initial/name matches
                #if (len(iniorig)<len(inicomp)) and (len(iniorig)>=2): #if inicomp is longer
                if True:
                    origfirst=iniorig[0].lower()
                    compfirst=inicomp[0].lower()
                    if compfirst==origfirst:
                        return True, orignames #first and last match, not accounting for different middle names
                    elif compfirst[0]==origfirst[0]:
                        orilen=len(origfirst)
                        comlen = len(compfirst)
                        if (orilen==2 and comlen==1) or (comlen==2 and orilen==1):
                            #if one or both is an initial (A compared to A. or Al), they match
                            return True, orignames
                        elif (orilen==2 and origfirst[1]==".") or (comlen==2 and compfirst[1]=="."):
                            #one is an initial with a period
                            return True, orignames
                        elif (orilen==1 or comlen==1):#one is a single initial: G matches George
                            return True, orignames
                        else: #they are different lengths, neither is an initial, so the first names are different names
                            return False, orignames
                    else:#they are not similar enough
                        return False, orignames
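
The per-part comparison inside `matchname` comes down to a short rule: two name parts are compatible when they are equal ignoring case, or when they share a first letter and at least one of them is an initial (with or without a period). The sketch below is a simplified restatement of that rule, not the exact branch structure above; `is_initial` and `parts_compatible` are hypothetical helpers and assume non-empty parts.

```python
def is_initial(part):
    """True for 'G' or 'G.' (a single letter, optionally followed by a period)."""
    return len(part) in (1, 2) and part[0].isalpha() and part[1:] in ("", ".")

def parts_compatible(a, b):
    """Simplified, hypothetical version of the part-by-part check in matchname."""
    a, b = a.lower(), b.lower()
    if a == b:
        return True               # identical apart from case
    if a[0] != b[0]:
        return False              # different first letters never match
    return is_initial(a) or is_initial(b)  # G matches George, V. matches Vi

print(parts_compatible("G", "George"))     # True
print(parts_compatible("V.", "Vi"))        # True
print(parts_compatible("Al", "Ab"))        # False
print(parts_compatible("Jon", "Jonathan")) # False, the limitation noted in the docstring
```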

## Copied returned results to save as list
List of 2019 winners and their alternate names, plus a dictionary keyed by the 2019 winners whose search returned a result. Each value is a list of at most five authors who are also listed on that winner's top five publications.

In [92]:
revname10list = namelist.copy() #namelist could be the first :11 names (or :10, without Duflo)
revlist=["J. Peebles","M. Mayor","D. Queloz",'John Goodenough', 'J. B. Goodenough',
         'Stanley Whittingham','M. S. Whittingham',"A. Yoshino",'William Kaelin', 'W. G. Kaelin','Peter Ratcliffe',
         'P. J. Ratcliffe']
for abbrevname in revlist:
    revname10list.append(abbrevname)

In [96]:
revname10list =['James Peebles',
 'Michel Mayor',
 'Didier Queloz',
 'John B. Goodenough',
 'M. Stanley Whittingham',
 'Akira Yoshino',
 'William G. Kaelin Jr',
 'Sir Peter J. Ratcliffe',
 'Gregg L. Semenza',
 'Abhijit Banerjee',
 'Esther Duflo','Michael Kremer',
 'J. Peebles',
 'M. Mayor',
 'D. Queloz',
 'John Goodenough',
 'J. B. Goodenough',
 'Stanley Whittingham',
 'M. S. Whittingham',
 'A. Yoshino',
 'William Kaelin',
 'W. G. Kaelin',
 'Peter Ratcliffe',
 'P. J. Ratcliffe'] #all these names have been searched for

In [94]:
#create dictionaries and name lists based on what was recorded already
coauth10dict = {"M. Stanley Whittingham":["Shoufeng Yang","Peter Y. Zavalij"],"Gregg L. Semenza":["Guang L. Wang",
                "Bing-Hua Jiang","Elizabeth A. Rue","Jo A. Forsythe","Narayan V. Iyer","Faton Agani","Sandra W. Leung",
                "Robert D. Koos","Hua Zhong","Angelo M. De Marzo","Erik Laughner","Michael Lim","David A. Hilton",
                "David Zagzag","Peter Buechler","William B. Isaacs","Jonathan W. Simons"],"Abhijit Banerjee":
                ["Abhijit V. Banerjee","Andrew F. Newman","Rachel Glennerster","Cynthia Kinnan"]}

In [95]:
#hand-typed list of co-authors for this author from saved .CSV files
coauth10dict["Esther Duflo"]=["Marianne Bertrand","Sendhil Mullainathan","Abhijit V. Banerjee",
                              "Rachel Glennerster","Cynthia Kinnan"]

In [146]:
coauthdict

{'Eric Betzig': ['George H Patterson',
  'Rachid Sougrat',
  'O Wolf Lindwasser',
  'Scott Olenych',
  'Juan S Bonifacino',
  'Michael W Davidson',
  'Jennifer Lippincott-Schwartz',
  'Harald F Hess',
  'Jay K Trautman'],
 'Stefan W. Hell': ['Stefan W Hell', 'Jan Wichmann'],
 'William E. Moerner': ['Anika Kinkhabwala',
  'Zongfu Yu',
  'Shanhui Fan',
  'Yuri Avlasevich',
  'Klaus Müllen',
  'WE Moerner']}

In [127]:
#get coauthor list for each dictionary item
def coauthors(authlist,coauthlist,derivlist,authname):
    '''
    adds an authlist author to coauthlist if is a name not in derivlist and if not the only author
    all are 1-D lists or arraylikes, authlist is iterable and coauthlist can be appended

    Args
        authname (str) : author name
        authlist (list) : from Publication object's bib attribute "listauthors" for the authname
        coauthlist (list) : pre-existing list of authors connected to authname
        derivlist (list) : alternate names for authname and any other authors that should not be added to coauthlist
    
    Returns 
        coauthlist (list) : same as argument, except with added co-author names from authlist
        derivlist (list) : the same as the argument, except if there was a close match, was updated to include variations
    '''
    numauths = len(authlist)
    if numauths==1:#if there is one name
        print(f"Only one author, {authlist}, was not added.")
        #if it's different, won't add to list because it's not truly a coauthor if the original author isn't even listed
        #if it's the same name, won't add because it's already been handled elsewhere
    elif numauths>20:#if there are more than 20 authors listed
        auth=authlist[0] #gets lead author name
        ismatch, namelist = matchname(original=authname,comparison=auth)#checks if authname is the lead author
        if ismatch==False: #if authname not the lead author, authname is likely one of 20+ coauthors
            if not(auth in derivlist) and not(auth in coauthlist):#not in derivlist and not in coauthlist already
                coauthlist.append(auth)#add to list
                print(f"Added only first of {numauths} authors to coauthor list: {auth}")
        else: #was a close match of authname
            derivlist.append(auth)
            for nameli in namelist:
                derivlist.append(nameli)
            print(f"Did not add any of {numauths} authors to coauthor list: {auth}")
    else: #two to 20 authors listed
        for auth in authlist: #loop through each author attached to the publication to check if already in a list
            if not(auth in derivlist) and not(auth in coauthlist):#not in derivlist and not in coauthlist already
                coauthlist.append(auth) #handle alternate spellings elsewhere
                print(f"Added new author {auth} to coauthor list.")
            else: #Author doesn't need to be added again
                print(f"{auth} was not added.")
    return coauthlist, derivlist

In [128]:
#get complete list of top firstfew coauthor names
def findnames(name,namelist,authornames,firstfew=3):
    '''
    using name and list of coauthors and list of lead authors' names, gets the first firstfew names (default three) from
    namelist that are not already in the authors' name list nor in the new list, and returns the new list
    '''
    topthreelist=[]
    amt=0
    if name in namelist:
        namelist.remove(name)#to be sure it's not adding a name that's the original author
    while amt<firstfew and namelist:  #only until get firstfew; an empty namelist would otherwise loop forever
        for coname in namelist:
            conameintop3 = False #assume not in topthreelist
            conameinauthornames=False#assume not in authornames
            eachlist = listnames(coname)
            for each in eachlist:#with eachlist, check none of its items are in authornames nor in topthreelist
                if each in topthreelist:#if coname is in topthreelist, it already exists
                    conameintop3=True
                    break
                if each in authornames: #if it's in authornames, this coname has already been accounted for
                    conameinauthornames=True
                    break
            #end for loop to check coauthor name against authornames and topthreelist
            #if in one of the lists, add eachlist variations to that list?
            if not(conameinauthornames) and not(conameintop3):#if neither coname, nor its variations 
            #were in authornames or topthreelist
                nmatches=0
                for top3name in topthreelist:#then matchname coname against names in topthreelist
                    matches,similist = matchname(original=top3name,comparison=coname)
                    #matchname returns boolean of whether they match, and list of variations of top3name
                    if matches:
                        nmatches+=1
                    if nmatches>0:
                        break
                #ends for loop comparing top3name to coname
                if nmatches==0:
                    topthreelist.append(coname)
                    print("Added new top name",coname)
                    amt+=1
            if amt==firstfew or (coname==namelist[-1]):#break loop since have desired quantity of coauthor names
                print(f"got {amt} names for {name}") #(or already gone through all the names)
                amt=firstfew #set amt to firstfew so will exit while loop as well
                break
        #end for loop
    #end while loop
    return topthreelist

In [26]:
def saveauthinfo(author_object, file_id):
    '''
    saves information from an author object in a dataframe and then csv file
    returns:
            (bool) : whether the author object was filled (and so saved with publications) or not (saved only attributes)
    '''
    authordf = convertToDf(author_object.listdata()) #dataframe of author attributes, except for publications
    #first column is name of attribute (excepting publications), second column is corresponding value
    #attributes are:
    #          author.affiliation, author.citedby, author.citedby5y, author.cites_per_year, author.coauthors, 
    #          author.email,author.hindex,author.hindex5y, author.i10index, author.i10index5y, author.id, author.interests,
    #          author.name, author.nav, author.url_picture
    if author_object.filled: #if Author object is filled, has publications
        try:
            listpublications = author_object.publications
        except: #if getting publications gives an error:
            authordf.to_csv("sfiles/"+file_id+"_co_author.csv")
            print(f"saved {file_id} author information, no publications")
            return False
        
        endindex = 100+len(listpublications) #to differentiate, index starts at 100
        pubsdf = pd.DataFrame(data=listpublications,index=range(100,endindex)) #create dataframe of publication information
        #each entry is one publication's details
        infodf = pd.concat([authordf, pubsdf]) #combine into one DataFrame publications and author information
        infodf.to_csv("sfiles/"+file_id+"_pubs_co_author.csv") #save dataframe as .CSV file
        print(f"saved {file_id} filled author information, with publication basics")
        return True
    else: #if not filled, there is no publication information, but still want to save it
        authordf.to_csv("sfiles/"+file_id+"_co_author.csv")
        print(f"saved {file_id} author information, no publications")
        return False

In [25]:
def convertToDf(datadict):
    '''
    takes a dictionary, puts each key-value pair into a list and then into a dataframe
    '''
    dataitems = [item for item in datadict.items()] #the items are converted to a list
    dataAsdf = pd.DataFrame(data=dataitems) #the first column is the keys, second column is the values
    return dataAsdf 

In [24]:
def savepubinfo(publicationlist,idcode,amt=3):
    '''
    takes list of publication objects and tries to save each one as a .CSV
    idcode (str) : an identifier for the file names of the publications' .CSV files; for example, the publication at 
                    index 3 of publicationlist will be saved as: "sfiles/someid3_pub.csv" when idcode = "someid" or 
                    idcode could be a path to be within sfiles: "sfiles/myid/code0_pub.csv" would be the location of 
                    the publication at index 0, for instance, where idcode = "myid/code"
    amt (int) : how many publications to get, defaults to 3
    '''
    #saves each filled publication as its own CSV
    pubslen = len(publicationlist)
    idx=0
    ctlist=[]
    if pubslen<amt: #if there are fewer publications than requested
        amt = pubslen #reset amt to be the number of publications that exist
    while amt>0 and idx<pubslen: #get desired amt by subtracting as long as index is not >= len(publicationlist)
        pub_object = publicationlist[idx]#gets publication object
        citestitle = [pub_object.bib["cites"],pub_object.bib["title"]]
        if (citestitle in ctlist):#if same title and number of citations pair in list already
            idx+=1 #skip rest, restart loop with next index value because likely same publication
            #alternately: fill this likely duplicate publication and increase amt by 1 as long as it's still less 
            #than pubslen in order to compensate for going through with it
        else:
            filename = "sfiles/"+idcode+str(idx)+"_pub.csv"
            pubtries=0 #reset tries
            print("filling publication",pub_object.id_citations)
            while pub_object.filled == False and pubtries<4: #four tries to fill publication object
                pubtries+=1
                setschproxy() #sets proxy
                try:
                    pub_object.fill() #tries to fill Publication object
                except:
                    continue
                if pubtries%3==0: #after a few times, takes a break
                    time.sleep(100)
            #end while loop
            print("filled:",pub_object.filled)
            if pub_object.filled:#if it was filled, then save
                amt-=1 #now needs one fewer filled publication object
                pubdf = convertToDf(pub_object.listdata())
                #save as .CSV
                pubdf.to_csv(filename)
                print("Saved publication",pub_object.bib['title'],"by",pub_object.bib['author'])
                ctlist.append(citestitle)#adds cites and name of publication to list to compare
            idx+=1 #sets next index value
    #done with method, either found sufficient amt of publications, or checked or tried entire list and did not
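
The fill loop in `savepubinfo` is a bounded-retry pattern: up to four attempts, a fresh proxy before each one, and a long pause after every third attempt. The sketch below isolates that pattern with the network-dependent pieces passed in as callables, so it can be exercised without Google Scholar. All names here are hypothetical; `fill`, `is_filled` and `before_try` stand in for `pub_object.fill`, `pub_object.filled` and `setschproxy`.

```python
import time

def fill_with_retries(fill, is_filled, before_try=None, max_tries=4,
                      pause_every=3, pause_seconds=100, sleep=time.sleep):
    """Call fill() until is_filled() is true or max_tries attempts are used up."""
    tries = 0
    while not is_filled() and tries < max_tries:
        tries += 1
        if before_try is not None:
            before_try()            # e.g. switch to a fresh proxy
        try:
            fill()
        except Exception:
            pass                    # treat any failure as "try again"
        if not is_filled() and tries % pause_every == 0:
            sleep(pause_seconds)    # back off before the remaining attempts
    return is_filled()

# Stand-in object that fails twice and succeeds on the third attempt.
state = {"calls": 0, "filled": False}

def fake_fill():
    state["calls"] += 1
    if state["calls"] < 3:
        raise RuntimeError("blocked")
    state["filled"] = True

naps = []  # records requested pauses instead of actually sleeping
ok = fill_with_retries(fake_fill, lambda: state["filled"], sleep=naps.append)
print(ok, state["calls"], naps)  # True 3 []
```

Unlike the loop in `savepubinfo`, this version also pauses after a failed third attempt and skips the pause once the object is filled; whether that trade-off suits the scraping run is a judgment call.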

## Concatenating all publications, authors data

In [148]:
#current directory for previous version of the repository
os.listdir(os.curdir)

['.git',
 '.ipynb_checkpoints',
 'draft.ipynb',
 'files',
 'LICENSE.md',
 'scholar.log',
 'sfiles']

In [45]:
spath = os.path.join(os.curdir,"sfiles/")
sfileslist = os.listdir(spath)

In [46]:
sfileslist[0:3]

['.ipynb_checkpoints', '0aler0_pub.csv', '0aler1_pub.csv']

In [47]:
#remove first entry; not publication or author .CSV
sfileslist.pop(0)

'.ipynb_checkpoints'

In [137]:
spath = os.path.join(os.curdir,"sfiles/")
scofileslist = os.listdir(spath)

In [54]:
def pubdataframe(pubfile, path="sfiles/"):
    '''
    Given the name of a file and path to the file (assuming a subfolder located in the current directory), sets the column
    labeled "0" as the index and transposes the dataframe so then that index becomes the column names.
    
    Args
        pubfile (str) : the filename of the desired .CSV file
        path (str) : defaults to "sfiles/", a folder that stores pubfile
        
    Returns
        df (pandas DataFrame) : the dataframe, transposed, with pubfile's first column now as the 
                                column names, having replaced the index
    '''
    filepath = os.path.join(os.curdir,path) #joins current directory to provided path
    filepath = os.path.join(filepath, pubfile) #joins to where file is stored (uses the argument, not a global filename)
    datafile = pd.read_csv(filepath,index_col=[0])#read in .CSV with first column set to be index
    df = datafile.set_index('0').T #set first column (of feature names) to be column names
    return df

In [55]:
#put every publication as one record into a dataframe
allpubsdf = pd.DataFrame()#empty dataframe to initialize
for filename in sfileslist:
    #all have format: sfiles/1name2_pub.csv (will be slightly different for coauthors)
    fileids = filename.split('_')
    fileid = fileids[0] # format #last# (one has first number as 10, others are single digit)
    #check if fileids has two parts and if the last is pub.csv
    if len(fileids)==2 and fileids[-1]=="pub.csv":
        pubdf = pubdataframe(filename,"sfiles/") #gets file as dataframe with attributes as columns
        pubdf["fileID"]=fileid #adds column with file name's identifier
        allpubsdf = pd.concat([allpubsdf,pubdf],axis="index") #all column names are the same; concatenate to full dataframe

allpubsdf.head()

In [None]:
allpubsdf.to_csv("csvdata/publications.csv") #save in csvdata folder

In [None]:
allpubsdf = pd.read_csv("csvdata/publications.csv",index_col=[0])

In [140]:
allpubsdf.info()

<class 'pandas.core.frame.DataFrame'>
Index: 70 entries, 1 to 1
Data columns (total 24 columns):
bib_eprint         26 non-null object
bib_cites          70 non-null object
citations_link     70 non-null object
url_scholarbib     0 non-null object
url_add_sclib      0 non-null object
bib_abstract       44 non-null object
bib_author_list    0 non-null object
bib_venue          0 non-null object
bib_year           70 non-null object
bib_gsrank         0 non-null object
bib_title          70 non-null object
bib_url            44 non-null object
bib_author         70 non-null object
bib_listauthors    70 non-null object
bib_journal        56 non-null object
bib_volume         60 non-null object
bib_number         60 non-null object
bib_publisher      64 non-null object
bib_pages          59 non-null object
source             70 non-null object
id_citations       70 non-null object
cites_per_year     70 non-null object
fileID             70 non-null object
author_fileID      70 non-null obj

In [57]:
allpubsdf["author_fileID"] = [fileid[:-1] for fileid in allpubsdf["fileID"]]
allpubsdf.author_fileID.value_counts()

4gham     5
8enza     5
2land     5
0aler     5
4rank     5
3nold     5
3oung     5
1urou     5
10uflo    5
0hkin     5
5nter     5
9rjee     5
0rson     5
6ison     5
Name: author_fileID, dtype: int64
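
Slicing with `fileid[:-1]` works here because each author has exactly five publications, numbered 0 to 4, so the trailing publication number is always one character. As a hedged alternative, the sketch below strips any run of trailing digits instead; `author_file_id` is a hypothetical helper and `3oung12` is an invented ID used only to show the multi-digit case.

```python
import re

def author_file_id(file_id):
    """Drop the trailing publication number from a fileID such as '10uflo3'."""
    return re.sub(r"\d+$", "", file_id)

# '3oung12' is a made-up ID illustrating a two-digit publication number
ids = ["0aler0", "10uflo3", "3oung12"]
print([author_file_id(i) for i in ids])
```

Leading digits such as the `10` in `10uflo3` are untouched because the pattern is anchored to the end of the string.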

In [138]:
#put every publication as one record into a dataframe
allcopubsdf = pd.DataFrame()#empty dataframe to initialize
for filename in scofileslist:
    #all have format: sfiles/1name2_pub.csv or sfiles/1name2_co_pub.csv
    fileids = filename.split('_')
    fileid = fileids[0] # format #last# (one has first number as 10, others are single digit)
    #check if fileids has three parts, if second-to-last is co and if the last is pub.csv
    if len(fileids)==3 and fileids[-1]=="pub.csv" and fileids[-2]=="co": #this indicates a co-author's filled publication
        copubdf = pubdataframe(filename,"sfiles/") #gets file as dataframe with attributes as columns
        copubdf["fileID"]=fileid #adds column with file name's identifier
        allcopubsdf = pd.concat([allcopubsdf,copubdf],axis="index") #all column names are the same; concatenate

allcopubsdf.head()

0,bib_eprint,bib_cites,citations_link,url_scholarbib,url_add_sclib,bib_abstract,bib_author_list,bib_venue,bib_year,bib_gsrank,...,bib_listauthors,bib_journal,bib_volume,bib_number,bib_publisher,bib_pages,source,id_citations,cites_per_year,fileID
1,,5176,/scholar?cites=7360008283070022454,,,An intense electromagnetic pulse can create a ...,,,1979,,...,"['Toshiki Tajima', 'John M Dawson']",Physical Review Letters,43,4,American Physical Society,267,citations,qwN3bpIAAAAJ:u5HHmVD_uO8C,"{1982: 15, 1983: 15, 1984: 14, 1985: 50, 1986:...",0jima0
1,http://eli.jinr.ru/pdf/RevModPhys_Bulanov.pdf,1854,/scholar?cites=13321283530247006292,,,The advent of ultraintense laser pulses genera...,,,2006,,...,"['Gerard A Mourou', 'Toshiki Tajima', 'Sergei ...",Reviews of modern physics,78,2,American Physical Society,309,citations,qwN3bpIAAAAJ:u-x6o8ySG0sC,"{2006: 24, 2007: 95, 2008: 135, 2009: 149, 201...",0jima1
1,,1095,/scholar?cites=9015642514743274857,,,,,,2004,,...,"['T Esirkepov', 'M Borghesi', 'SV Bulanov', 'G...",Physical review letters,92,17,American Physical Society,175003,citations,qwN3bpIAAAAJ:d1gkVwhDpl0C,"{2004: 5, 2005: 25, 2006: 36, 2007: 33, 2008: ...",0jima2
1,https://cds.cern.ch/record/277813/files/SCAN-9...,563,/scholar?cites=5913791804350952448,,,A laser pulse with a power of∼ 3 TW and a dura...,,,1995,,...,"['K Nakajima', 'D Fisher', 'T Kawakubo', 'H Na...",Physical Review Letters,74,22,American Physical Society,4428,citations,qwN3bpIAAAAJ:9yKSN-GCB0IC,"{1995: 7, 1996: 31, 1997: 48, 1998: 34, 1999: ...",0jima3
1,,533,/scholar?cites=1520635000302782633,,,,,,2009,,...,"['Andreas Henig', 'S Steinke', 'M Schnürer', '...",Physical Review Letters,103,24,American Physical Society,245003,citations,qwN3bpIAAAAJ:zdX0sdgBH_kC,"{2009: 2, 2010: 42, 2011: 65, 2012: 60, 2013: ...",0jima4
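
The two loops above differ only in how they interpret the file name. A minimal sketch of that naming rule as one function follows; `classify_pub_file` is a hypothetical helper, and `0jima0_co_pub.csv` is assumed from the `1name2_co_pub.csv` pattern in the comments rather than copied from a directory listing.

```python
def classify_pub_file(filename):
    """Return (fileID, kind) where kind is 'winner', 'coauthor' or None."""
    parts = filename.split("_")
    if parts[-1] != "pub.csv":
        return parts[0], None          # e.g. an author summary file
    if len(parts) == 2:
        return parts[0], "winner"      # <fileID>_pub.csv
    if len(parts) == 3 and parts[-2] == "co":
        return parts[0], "coauthor"    # <fileID>_co_pub.csv
    return parts[0], None

for name in ["3oung0_pub.csv", "0jima0_co_pub.csv", "3oung_pubs_author.csv"]:
    print(name, classify_pub_file(name))
```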


In [139]:
allcopubsdf.to_csv("csvdata/copublications.csv")#save complete dataframe of all co-author publications' info as .CSV

In [326]:
def combinepubsdf(pubseries):
    '''
    Given a pandas Series in which each entry is the string form of one publication's details, parses each entry into
    the attributes cites, title, year, filled, id_citations and source, and returns a dataframe with one row
    per publication and those attributes as columns.
    
    Args
        pubseries (pandas Series) : one string per publication, as read from an author's .CSV file
    
    Returns
        combinedf (pandas DataFrame) : one row per publication, one column per attribute
    '''
    combinedf = pd.DataFrame()
    for pubdetails in pubseries:
        pubdetails = pubdetails.strip('{}')#remove first and last { }
        pubinfo = pubdetails.split("\r\n")
        pubinfo = [item.strip() for item in pubinfo]#remove blank space from each
        pairs=[]
        pairidx=0
        for each in pubinfo:
            eachlist = each.split("\': ")
            if pairidx==0:#first item in list is bib: cites: 1234
                firstlist=[]
                for index, el in enumerate(eachlist):
                    noquote = el.strip("{},\'")
                    if (noquote[-5:]=="title"):
                        #print("noquote ends with title, right?",noquote)
                        nqlist = noquote.split(",")
                        for nql in nqlist[:-1]:#all but last
                            firstlist.append(nql)
                        firstlist.append("title")
                    elif(noquote[-4:]=="year"):
                        #print("noquote ends with year, right?",noquote)
                        nqlist = noquote.split(",")
                        for nql in nqlist[:-1]:#except last item
                            firstlist.append(nql)
                        firstlist.append("year")
                    else:
                        firstlist.append(noquote)
                pairs.append(firstlist)
                pairidx+=1
                #so enumerate and separate out title, cites, etc. without actually touching any text in the title
            elif len(eachlist)!=2:#should be split into two parts
                #if not two parts, might be part of title
                for el in eachlist:#add to previous item
                    pairs[pairidx-1].append(el)
            else:
                pairs.append(eachlist)
                pairidx+=1
        #print(pairs)
        #end for loop over each list of pairs; similar to ["attribute", "info1", "info2"]
        
        if len(pairs[0])>3:#if more than three items, likely has format of bib cites 123 title Titleinfo year 2001
            pair1 = pairs[0][:3]#split into parts, first part bib cite 1243
            pair2 = pairs[0][3:]
            pair2len = len(pair2)
            if pair2len==2:
                pairs.append(pair2)
                pairs[0]=pair1
            elif pair2len>2:
                lastk=pair2len-1
                for k, kitem in enumerate(pair2):
                    item = kitem.strip("{},\'").strip() #remove quotes, blank space
                    #print(kitem, item)
                    if item=="title":
                        titledex=k
                    elif item=="year":
                        yeardex=k
                        lastk=k+1
                if lastk!=pair2len-1:#if the item after year is not the last item
                    print("year was not the last attribute",pair2[lastk+1:])
                pairs.append(pair2[yeardex:yeardex+2])#year and one place after it
                titletext=""
                for titlepart in pair2[titledex+1:yeardex]:
                    titlepart = titlepart.strip("{}\'") #str.strip returns a new string, so keep the result
                    titletext+=titlepart
                pairs.append(["title",titletext])
                pairs[0]=pair1#replace first item in pairs with pair1
        pairlist=[]
        for eachpair in pairs:#loop through list
            templist=[] #the following can be re-written to make more sense
            limepot=False
            for idx, eachitem in enumerate(eachpair): #eachpair typically has two items
                stripped = eachitem.strip("{}\',")#remove certain punctuation from the strings
                stripped = stripped.strip('\"') #remove double quotes
                if stripped=="bib":#leave out bib (it makes a trio in the first eachpair, typically: bib cites 1234)
                    continue
                elif stripped=='Limepots':#special exception for this being split at \r\n as well
                    text = ""#all of this should be added to the title
                    for part in eachpair:#loop through to concatenate this together
                        textpart = part.strip("\'\",")#strip double quote also
                        text= text+" "+textpart #concatenated into one string
                    pairlist[-1][-1]+=text#add to latest addition to pairlist
                    limepot=True
                    break #go on to next eachpair from pairs
                elif len(eachpair)>2 and idx<1: #if there are more than two items and this is the first item but not "bib"
                    templist.append(stripped)#add the first item
                    text = ""#to make the second item consist of everything after the first item
                    for part in eachpair[1:]:#loop through to add all but first item
                        textpart = part.strip("{}\',") #strip out select characters
                        text+=textpart #concatenated into one string
                    templist.append(text)
                    break #go on to next eachpair from pairs
                else:
                    templist.append(stripped) #add stripped version to list within list
            #end for loop creating templist
            if not(limepot):
                pairlist.append(templist)
        #end for loop stripping out further punctuation and creating two-item lists in one list (pairlist)
        info = [duo[1] for duo in pairlist] #put data to become one column
        cols = [duo[0] for duo in pairlist] #put attributes to become the index
        thispubdf = pd.DataFrame(data=info,index=cols)#easier to set future column names to index
        thispubdf = thispubdf.T #and then transpose
        try:
            combinedf= pd.concat([combinedf,thispubdf],axis="index")#and then add every other pub to this; concatenate it all together
        except Exception as err: #catch ordinary errors only and report what went wrong
            print("that did not do it:", err)
            print(combinedf)
    
    return combinedf
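
The manual splitting above is needed because each record is stored as text. If a stored record happens to be a valid Python literal (an assumption: the exact text in the .CSV files is not shown here), `ast.literal_eval` from the standard library could parse it directly, including titles wrapped across lines. This is a different technique from the one `combinepubsdf` uses, sketched only for comparison; `parse_pub` is a hypothetical helper and the layout of `sample` is a guess built from values printed in this notebook.

```python
import ast

def parse_pub(pubdetails):
    """Parse one stored publication string into a flat dict of attributes."""
    # normalise Windows line endings before evaluating the literal
    record = ast.literal_eval(pubdetails.replace("\r\n", "\n"))
    flat = dict(record.pop("bib", {}))  # lift cites/title/year out of 'bib'
    flat.update(record)                 # add filled, id_citations, source
    return flat

# Hypothetical stored text; adjacent string literals are joined by the parser
sample = (
    "{'bib': {'cites': '16442',\r\n"
    "         'title': 'Nudge: Improving decisions about health, wealth, '\r\n"
    "                  'and happiness',\r\n"
    "         'year': '2008'},\r\n"
    " 'filled': False,\r\n"
    " 'id_citations': 'Tvzd5GgAAAAJ:1sJd4Hv_s6UC',\r\n"
    " 'source': 'citations'}"
)

print(parse_pub(sample))
```

Records that are not valid literals would raise `ValueError` or `SyntaxError`, so a fallback to the manual parser would still be needed for those.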

In [138]:
spath = os.path.join(os.curdir,"sfiles/")
sfileslist = os.listdir(spath)
sfileslist.pop(0) #remove ipynb_checkpoints (it's the first file)

'.ipynb_checkpoints'
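
`pop(0)` relies on `.ipynb_checkpoints` being the first entry returned by `os.listdir`, whose order is not guaranteed. A minimal sketch of filtering by name instead, assuming every data file ends in `.csv`; `data_files` is a hypothetical helper and the list below stands in for a real directory listing.

```python
def data_files(names):
    """Keep only .csv files, skipping hidden entries such as .ipynb_checkpoints."""
    return sorted(n for n in names if n.endswith(".csv") and not n.startswith("."))

# stand-in for os.listdir(spath)
listing = ["3oung0_pub.csv", ".ipynb_checkpoints", "3oung_pubs_author.csv"]
print(data_files(listing))
```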

In [322]:
sfileslist[-45:-38]

['3nold_pubs_author.csv',
 '3oung0_pub.csv',
 '3oung1_pub.csv',
 '3oung2_pub.csv',
 '3oung3_pub.csv',
 '3oung4_pub.csv',
 '3oung_pubs_author.csv']

In [331]:
unfilledpubs = pd.DataFrame()#put all publications into one dataframe
allauthsdf = pd.DataFrame()#put all authors' info into one dataframe
for anyfile in sfileslist:
    fileids = anyfile.split('_')
    #check if fileids has two or three parts and what the last part is
    if len(fileids)>=2 and fileids[-1]=="author.csv": #author info ends with author.csv
        fileid = fileids[0] # format #last# (one has first number as 10, others are single digit)
        apdf = pd.read_csv("sfiles/"+anyfile,index_col=[0])#read in file with first column as index
        lastauidx = apdf.loc[apdf["0"]=="filled"].index[0] #last piece of information is if the Author object was filled
        audf = apdf.loc[0:lastauidx]#using that index for whether Author was filled
        authrowdf = audf.set_index("0").T #transpose to make single-row frame with first column as column names
        authrowdf["fileID"]=fileid #adds column with file name's identifier
        if fileids[-2]=="co":#if co is the second-to-last part of fileids
            authrowdf["coauthor"]=1#adds column for coauthor categorical
        else:#if not co, create coauthor column with 0
            authrowdf["coauthor"]=0
        allauthsdf = pd.concat([allauthsdf,authrowdf])#concatenate authrowdf to dataframe of authors
        
        lastidx = apdf.last_valid_index()
        if lastidx>100: #if not, then has no publications listed
            pseries = apdf.loc[100:lastidx,"0"]#includes value at lastidx    
            compubs = combinepubsdf(pseries)#put all publications into one dataframe
            compubs["authorname"]=authrowdf["name"].iloc[0]#add column where every entry is the author's name from authrowdf
            compubs["fileID"]=fileid #adds column with file name's identifier
            if fileids[-2]=="co":#if co is the second-to-last 
                compubs["coauthor"]=1#adds column for coauthor categorical
            else:#not a co-author
                compubs["coauthor"]=0
            unfilledpubs = pd.concat([unfilledpubs,compubs],ignore_index=True)
            #concatenate compubs to other files' publications

['cites', '16442']
['title', 'Nudge: Improving decisions about health, wealth, and happiness']
['year', '2008']
['filled', 'False']
['id_citations', 'Tvzd5GgAAAAJ:1sJd4Hv_s6UC']
['source', 'citations']
['cites', '16419']
['title', 'Nudge: Improving Decisions about Health, Wealth and Happiness']
['year', '2008']
['filled', 'False']
['id_citations', 'Tvzd5GgAAAAJ:HGTzPopzzJcC']
['source', 'citations']
['cites', '16335']
['title', 'Richard H. Thaler, Cass R. Sunstein, Nudge: Improving decisions about health, wealth, and happiness']
['year', '2008']
['filled', 'False']
['id_citations', 'Tvzd5GgAAAAJ:1taIhTC69MYC']
['source', 'citations']
['cites', '9758']
['title', 'Does the stock market overreact?']
['year', '1985']
['filled', 'False']
['id_citations', 'Tvzd5GgAAAAJ:u5HHmVD_uO8C']
['source', 'citations']
['cites', '7147']
['title', 'Toward a positive theory of consumer choice']
['year', '1980']
['filled', 'False']
['id_citations', 'Tvzd5GgAAAAJ:d1gkVwhDpl0C']
['source', 'citations']
['cit

of pandas will change to not sort by default.

To accept the future behavior, pass 'sort=False'.



['cites', '939']
['title', 'Optical levitation by radiation pressure']
['year', '1971']
['filled', 'False']
['id_citations', '3QEKOmMAAAAJ:W7OEmFMy1HYC']
['source', 'citations']
['cites', '842']
['title', 'History of optical trapping and manipulation of small-neutral particle, atoms, and molecules']
['year', '2000']
['filled', 'False']
['id_citations', '3QEKOmMAAAAJ:eQOLeE2rZwMC']
['source', 'citations']
['cites', '824']
['title', 'Motion of atoms in a radiation trap']
['year', '1980']
['filled', 'False']
['id_citations', '3QEKOmMAAAAJ:Tyk-4Ss8FVUC']
['source', 'citations']
['cites', '795']
['title', 'Trapping of atoms by resonance radiation pressure']
['year', '1978']
['filled', 'False']
['id_citations', '3QEKOmMAAAAJ:YsMSGLbcyi4C']
['source', 'citations']
['cites', '788']
['title', 'Applications of laser radiation pressure']
['year', '1980']
['filled', 'False']
['id_citations', '3QEKOmMAAAAJ:Y0pCki6q_DkC']
['source', 'citations']
['cites', '629']
['title', 'Force generation of organe
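
The truncated warning text in the output above appears to be the pandas `FutureWarning` raised when `pd.concat` joins frames whose columns do not line up. One way to address it, sketched here under that assumption, is to collect the per-file frames in a list and concatenate once with an explicit `sort` argument. The two small frames are illustrative stand-ins that reuse values shown in this notebook.

```python
import pandas as pd

# Stand-ins for per-file frames whose columns only partly overlap
frames = [
    pd.DataFrame({"name": ["Arthur Ashkin"], "citedby": ["41204"]}),
    pd.DataFrame({"name": ["Marianne Bertrand"], "email": ["@chicagobooth.edu"]}),
]

# One concat at the end; sort=False keeps columns in order of first appearance
combined = pd.concat(frames, ignore_index=True, sort=False)
print(combined.columns.tolist())
print(combined.shape)
```

With `sort=False` the columns are no longer alphabetical, so the column order would differ from the outputs shown in this notebook; passing `sort=True` keeps the alphabetical order and also silences the warning.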

#### Looking at `allauthsdf`
Information from the Author objects for both prize winners and their coauthors.

In [336]:
allauthsdf.info()

<class 'pandas.core.frame.DataFrame'>
Index: 31 entries, 1 to 1
Data columns (total 17 columns):
affiliation       31 non-null object
id                31 non-null object
name              31 non-null object
citedby           31 non-null object
citedby5y         29 non-null object
cites_per_year    29 non-null object
coauthors         29 non-null object
email             29 non-null object
hindex            29 non-null object
hindex5y          29 non-null object
i10index          29 non-null object
i10index5y        29 non-null object
interests         31 non-null object
url_picture       31 non-null object
filled            31 non-null object
fileID            31 non-null object
coauthor          31 non-null int64
dtypes: int64(1), object(16)
memory usage: 4.4+ KB


In [347]:
allauthsdf.coauthor.value_counts(normalize=True)

1    0.516129
0    0.483871
Name: coauthor, dtype: float64

In [348]:
allauthsdf.head()

0,affiliation,id,name,citedby,citedby5y,cites_per_year,coauthors,email,hindex,hindex5y,i10index,i10index5y,interests,url_picture,filled,fileID,coauthor
1,"University of Chicago, Booth School of Business",Tvzd5GgAAAAJ,Richard Thaler,151057,61520.0,"{1991: 412, 1992: 487, 1993: 505, 1994: 679, 1...",[],@chicagobooth.edu,98.0,80.0,183.0,145.0,['Behavioral Economics'],https://scholar.google.com/citations?view_op=m...,True,0aler,0
1,Bell Labs retired,3QEKOmMAAAAJ,Arthur Ashkin,41204,9838.0,"{1981: 138, 1982: 143, 1983: 123, 1984: 196, 1...",[],,62.0,35.0,117.0,53.0,"['Laser trapping', 'solar power']",https://scholar.google.com/citations?view_op=m...,True,0hkin,0
1,"Norman Rostoker Chair Professor, UC Irvine",qwN3bpIAAAAJ,Toshiki Tajima,27945,7924.0,"{1982: 80, 1983: 105, 1984: 99, 1985: 191, 198...",[],@uci.edu,75.0,37.0,317.0,132.0,"['plasma physics', 'accelerators', 'lasers']",https://scholar.google.com/citations?view_op=m...,True,0jima,1
1,University of Michigan,pGGjfS8AAAAJ,Daniel J. Klionsky,90326,43675.0,"{1997: 301, 1998: 326, 1999: 328, 2000: 533, 2...",[{'affiliation': 'Department of Biomedical Sci...,@umich.edu,129.0,85.0,353.0,297.0,['Autophagy'],https://scholar.google.com/citations?view_op=m...,True,0nsky,1
1,"Professor of Economics, Booth School of Busine...",zGJKZpkAAAAJ,Marianne Bertrand,44683,,,,@chicagobooth.edu,,,,,['Marianne Bertrand'],https://scholar.google.com/citations?view_op=m...,False,0rand,1


#### Comparison to `unfilledpubs`

In [334]:
#where unfilledpubs has a missing year
unfilledpubs[unfilledpubs["year"].isna()].head()

Unnamed: 0,authorname,cites,coauthor,fileID,filled,id_citations,source,title,year
258,Rachel Glennerster,96,1,1ster,False,Vq3KWOsAAAAJ:5Ul4iDaHHb8C,citations,"& Contestabile, M.(2015). Promoting an open re...",


In [345]:
unfilledpubs.authorname.value_counts()

Peter Zavalij           20
Bing-Hua Jiang          20
Sendhil Mullainathan    20
Toshiki Tajima          20
Esther Duflo            20
Abhijit Banerjee        20
Donna Strickland        20
Rachel Glennerster      20
Stanley Whittingham     20
Frances Arnold          20
James Allison           20
Sergei V. Bulanov       20
Cass Sunstein           20
Gérard Mourou           20
Patricia Bado           20
Joachim Frank           20
Richard Thaler          20
Daniel J. Klionsky      20
Michael W. Young        20
andrew f newman         20
Cynthia Kinnan          20
Steven Chu              20
Gregg L. Semenza        20
Jun Zhu                 20
Guo-Liang Wang          20
Arthur Ashkin           20
richard henderson       20
Greg Winter             20
Shoufeng Yang (杨守峰)     20
Name: authorname, dtype: int64

In [332]:
unfilledpubs.tail()

Unnamed: 0,authorname,cites,coauthor,fileID,filled,id_citations,source,title,year
575,Abhijit Banerjee,714,0,9rjee,False,HLpqZooAAAAJ:4fKUyHm3Qg0C,citations,Reputation effects and the limits of contracti...,2000
576,Abhijit Banerjee,696,0,9rjee,False,HLpqZooAAAAJ:tOudhMTPpwUC,citations,On the road: Access to transportation infrastr...,2012
577,Abhijit Banerjee,677,0,9rjee,False,HLpqZooAAAAJ:L8Ckcad2t8MC,citations,Six randomized evaluations of microcredit: Int...,2015
578,Abhijit Banerjee,675,0,9rjee,False,HLpqZooAAAAJ:fPk4N6BV_jEC,citations,Empowerment and efficiency: Tenancy reform in ...,2002
579,Abhijit Banerjee,611,0,9rjee,False,HLpqZooAAAAJ:_OXeSy2IsFwC,citations,The experimental approach to development econo...,2009


In [333]:
unfilledpubs.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 580 entries, 0 to 579
Data columns (total 9 columns):
authorname      580 non-null object
cites           580 non-null object
coauthor        580 non-null int64
fileID          580 non-null object
filled          580 non-null object
id_citations    580 non-null object
source          580 non-null object
title           580 non-null object
year            579 non-null object
dtypes: int64(1), object(8)
memory usage: 40.9+ KB


### Edge case with publication title
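One publication's title was split across lines in the stored record, so the parser read the fragment beginning with "Limepots" as a separate attribute and created an extra `Limepots` column. The following cells identify the affected record and reproduce the fix, which appends the stray fragments back onto the title.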

In [280]:
unfilledpubs[unfilledpubs['Limepots'].notna()]

Unnamed: 0,Limepots,authorname,cites,coauthor,fileID,filled,id_citations,source,title,year
456,An Analysis of the Autobiographical Narrative ...,Michael W. Young,26,0,3oung,False,oYfxmfYAAAAJ:qjMakFHDy7sC,citations,'Our Name is Women; We are Bought with Limesti...,1983


In [308]:
#there was particular difficulty with this title splitting at a tricky place, so this tested the code
eachpair=['"Limepots', 'An Analysis of the Autobiographical Narrative of "', "'a Kalauna Woman',"]
pairlist=[['cites', '26'], ['title', "'Our Name is Women; We are Bought with Limesticks and "]]
text = ""#all of this should be added to the title
for part in eachpair:#loop through to concatenate this together
    #textpart = part.strip("{}\',") #strip out select characters
    textpart = part.strip("\'\",")#strip double quote also
    print("part:", part, "text:",textpart)
    text= text+" "+textpart #concatenated into one string
print(pairlist)
pairlist[-1][-1]+=text#add to latest addition to pairlist
print(pairlist,eachpair,type(eachpair))

part: "Limepots text: Limepots
part: An Analysis of the Autobiographical Narrative of " text: An Analysis of the Autobiographical Narrative of 
part: 'a Kalauna Woman', text: a Kalauna Woman
[['cites', '26'], ['title', "'Our Name is Women; We are Bought with Limesticks and "]]
[['cites', '26'], ['title', "'Our Name is Women; We are Bought with Limesticks and LimepotsAn Analysis of the Autobiographical Narrative of a Kalauna Woman"]] ['"Limepots', 'An Analysis of the Autobiographical Narrative of "', "'a Kalauna Woman',"] <class 'list'>


### Saving `allauthsdf` and `unfilledpubs` in csvdata folder

In [337]:
#save as .CSV files
allauthsdf.to_csv("csvdata/authorinfo.csv")
unfilledpubs.to_csv("csvdata/unfilledpubs.csv")