# Data Acquisition

**Overview**  
In this notebook, I use the scholarly package to get publication titles and years from Google Scholar for animal science professors at five universities in the US.

**The schools**  
In addition to UC Davis, I chose four universities spread geographically across the US: Cornell University in New York, Texas A&M University, Ohio State University, and Florida State University. As abbreviations and variable names throughout this project, UC Davis will be "Davis", Cornell University will be "Cornell", Texas A&M will be "TAMU", Ohio State will be "Ohio", and Florida State will be "Florida".

**The professors**  
I took all current (not emeriti) professors from the animal science departments at these universities. UC Davis had 44, Cornell had 16, Texas A&M had 52, Ohio State had 33, and Florida State had 38. In total, there were 183.

**The scholarly package**  
I used the scholarly.search_author() function to search by author. Given the name of any author, it looks for matching author profiles. If a profile exists, calling fill() on it returns an object containing information about all the publications linked to that profile. Pros and cons of this package are discussed at the end of this notebook.

In [60]:
import scholarly
import pandas as pd

def get_pubs(name, school):
    """Extract from Google Scholar all publications for a given person"""
    search_query = scholarly.search_author(name + " " + school)
    try:
        # Take the first matching profile and load its publication list
        author = next(search_query).fill()
        titles = []
        years = []
        for pub in author.publications:
            titles.append(pub.bib['title'])
            # Publications with no year get 0 so they can be removed later
            years.append(pub.bib.get('year', 0))
        df = pd.DataFrame({"title": titles, "year": years})
        df.insert(0, "school", school)
        df.insert(1, "name", name.split()[-1])  # keep the last name only
    except Exception:
        # No profile found, or the profile could not be read
        print("No profile for", name)
        df = pd.DataFrame()
    return df

In [1]:
davis_profs = ["Trish Berger", "Richard A. Blatchford", "David Bunn", "Hao Cheng",
               "Fred S. Conte", "Anna C. Denicol", "Mary E. Delany", "Edward J. DePeters",
               "John Eadie", "James Fadel", "Jackson Gross", "Matthias Hess", 
               "Kristina Horback", "Russ Hovey", "Josh Hull", "Silas Hung", 
               "Ermias Kebreab", "Annie J. King", "Kirk Klasing", "Dietmar Kueltz",  
               "Yanhong Liu", "Elizabeth A. Maga", "Maja M. Makagon", "Bernie May", 
               "Juan F. Medrano", "Deanne Meyer", "Michael J. Mienaltowski", 
               "Michael R. Miller", "Frank Mitloehner", "James D. Murray", 
               "Anita M. Oberbauer",  "James W. Oltjen", "Lee Allen Pettey", 
               "Peter H. Robinson", "Pablo J. Ross", "Roberto D. Sainz", "Andrea Schreier", 
               "Anne E. Todgham", "Cassandra B. Tucker", "Alison Van Eenennaam", 
               "Jason V. Watters", "Crystal Yang", "Huaijun Zhou", "Richard A. Zinn"]

cornell_profs = ["Yves Boisclair", "Dan Brown", "Walter Butler", "Debbie Cherney", 
                "Jerrie Gavalchin", "Julio Giordano", "Heather Huson", 
                "Patricia Johnson", "Quirine Ketterings", "Xin Gen Lei", 
                "Joseph McFadden", "Thomas Overton", "Susan Quirk", "Vimal Selvaraj",
                "Michael Thonney", "Michael Van Amburgh"]

tamu_profs = ["Ashley Arnold", "Jason Banta", "Fuller Bazer", "Rodolfo Cardoso",
             "Bruce Carpenter", "Gordon Carstens", "Alejandro Castillo", "Jason Cleere",
             "Reinaldo Cooke", "H Russell Cross", "Courtney Daigle", "Kathrin Dunlap", 
             "Davis Forrest", "Kerri Gehring", "Clare Gill", "Jason Gill", "Ron Gill", 
             "Davey Griffin", "Thomas Hairgrove", "Dan Hale", "Steve Hammack", "Andy Herring",
             "Nancy Ing", "Jenny Jennings", "Ellen Jordan", "Chris Kerth", "G Cliff Lamb", 
             "Jessica Leatherwood", "Charles Long", "Ted McCollum", "Rhonda Miller",
             "Wes Osburn", "Joe Paschal", "Shawn Ramsey", "Ron Randel", "Reid Redden", 
             "Penny Riggs", "David Riley", "Jim Sanders", "Carey Satterfield", "Jeff Savell",
             "Chris Skaggs", "Stephen B. Smith", "Matthew Taylor", "Luis Tedeschi", 
             "Daniel Waldron", "Thomas H Welsh", "Sarah White", "Travis Whitney", 
             "Tryon Wickersham", "Gary Williams", "Guoyao Wu"]

ohio_profs = ["Lisa Bielke", "Stephen Boyles", "Daniel Clark", "Kimberly Cole", 
             "Michael Cressman", "Michael Davis", "Maurice Eastridge", "Thaddeus Ezeji",
             "Jeffrey Firkins", "John Foltz", "Lyda Garcia", "Alvaro Garcia Guerra",
             "Kelly George", "Sheila Jacobi", "Justin Kieffer", "Chanhee Lee", "Kichoon Lee",
             "Michael Lilburn", "Pasha Lyvers Peffer", "Steven Moeller", "Luis Moraes", 
             "Herbert Ockerman", "Joseph Ottobre", "Monique Pairis-Garcia", "Elizabeth Parker", 
             "Tony Parker", "William Pope", "Alejandro Relling", "Ramesh Selvaraj",
             "Sandra Velleman", "William Weiss", "Macdonald Wick", "Zhongtang Yu"]

florida_profs = ["Adegbola Adesogan", "John Arthington", "Mario Binelli", "Jeremy Block", 
                "John Bromfield", "Samantha A. Brooks", "Ilaria Capua", "Chad Carr", 
                "Geoff Dahl", "Albert De Vries", "Nicholas DiLorenzo", "John P. Driver", 
                "Mauricio Elzo", "Antonio Faciola", "Luiz Ferraretto", "Timothy J. Hackmann",
                "Peter Hansen", "H. Arie Havelaar", "Matthew Hersom", "Kwang Cheol Jeong",
                "Jimena Laporta", "Raluca Mateescu", "Joel McQuagge", "Emily K. Miller-Cushon",
                "Philipe Moriel", "Corwin D. Nelson", "Pascal Oltenacu", 
                "Francisco Penagaricano", "Jose Eduardo Santos", "Jason M. Scheffler",
                "Tracy L. Scheffler", "Charlie Staples", "Saundra TenBroeck", "Todd Thrift", 
                "Lori Warren", "Carissa Wickens", "Sally Williams", "Stephanie Wohlgemuth"]

In [6]:
print("There are", len(davis_profs), "professors from Davis")
print("There are", len(cornell_profs), "professors from Cornell")
print("There are", len(tamu_profs), "professors from Texas A&M")
print("There are", len(ohio_profs), "professors from Ohio State")
print("There are", len(florida_profs), "professors from Florida State")

print("The total is", len(davis_profs) + len(cornell_profs) + len(tamu_profs) + len(ohio_profs) + len(florida_profs))


There are 44 professors from Davis
There are 16 professors from Cornell
There are 52 professors from Texas A&M
There are 33 professors from Ohio State
There are 38 professors from Florida State
The total is 183


In [242]:
def get_school(profs_list, school):
    """Processes all the professors for a given school"""
    global schooldf
    for x in profs_list:
        df = get_pubs(x, school)
        # Add this professor's rows to the combined data frame
        schooldf = pd.concat([schooldf, df])

In [250]:
# Use the first item to initialize the data frame
schooldf = get_pubs("Trish Berger", "Davis") 

In [None]:
get_school(davis_profs[1:], "Davis") # Skip the first one to avoid duplicates
get_school(cornell_profs, "Cornell")
get_school(tamu_profs, "TAMU")
get_school(ohio_profs, "Ohio")
get_school(florida_profs, "Florida")
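
The cells above grow a global data frame one professor at a time. As a minimal sketch of an alternative, the per-professor frames can be gathered in a list and concatenated once. The names `collect_pubs` and `fake_get_pubs` are hypothetical and not part of this notebook; `fake_get_pubs` is a stand-in with the same signature as `get_pubs`, so the example runs without contacting Google Scholar, and the names and titles in it are placeholders.

```python
import pandas as pd

def collect_pubs(profs_by_school, fetch):
    """Fetch every professor's publications and combine them into one frame.

    `fetch` is any function with the same signature as get_pubs(name, school).
    """
    frames = []
    for school, profs in profs_by_school.items():
        for name in profs:
            df = fetch(name, school)
            if not df.empty:  # skip professors without a profile
                frames.append(df)
    if not frames:
        return pd.DataFrame(columns=["school", "name", "title", "year"])
    return pd.concat(frames, ignore_index=True)

# Hypothetical stand-in for get_pubs; returns placeholder data only
def fake_get_pubs(name, school):
    if name == "No Profile":
        return pd.DataFrame()
    return pd.DataFrame({"school": [school], "name": [name.split()[-1]],
                         "title": ["Placeholder title"], "year": [2015]})

demo = collect_pubs({"Davis": ["Jane Doe", "No Profile"],
                     "Cornell": ["John Roe"]}, fake_get_pubs)
print(demo)
```

With the real function, the call would look like `collect_pubs({"Davis": davis_profs, "Cornell": cornell_profs}, get_pubs)`, and there would be no need to initialize the frame with the first professor.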

After collecting all the data, I will export it to a text file so that I don't need to rerun this code every time. This is because a) it takes forever to run due to built-in sleeps, b) it's rude to make so many redundant requests and c) it's good to have a local copy of my data in case Google Scholar makes changes in ways that make my script or scholarly not work anymore.

In [None]:
# Save data in case Google Scholar makes changes that break the package I'm using
schooldf.to_csv("data.txt", sep=' ', index=False, header=False)
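
Because the export writes no header row, the column names have to be supplied again when the file is read back. The sketch below round-trips a toy frame using the same `to_csv` arguments as above. It assumes the column order produced by `get_pubs` (school, name, title, year); the file name `data_demo.txt` and the rows are placeholders, not real data.

```python
import os
import tempfile

import pandas as pd

columns = ["school", "name", "title", "year"]  # assumed order from get_pubs

# Toy stand-in for schooldf; names and titles are placeholders
toy = pd.DataFrame([["Davis", "Doe", "A placeholder title with spaces", 2010],
                    ["Cornell", "Roe", "Another placeholder", 0]],
                   columns=columns)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "data_demo.txt")
    # Same arguments as the export cell; fields containing spaces get quoted
    toy.to_csv(path, sep=' ', index=False, header=False)
    # header=None because the file has no header row; names restores the columns
    reloaded = pd.read_csv(path, sep=' ', header=None, names=columns)

print(reloaded)
```

Titles contain spaces, so this relies on pandas quoting those fields on write and unquoting them on read, which is its default behavior.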

**Pros and Cons of the scholarly package**  
The major problem with this package is that it requires the person to have a Google Scholar profile set up. A Google Scholar profile is basically a collection of all papers authored by a specific person, and the person needs to log in to Google Scholar and verify it using the email address from their university. Of the 183 professors, 56 of them had profiles set up (14 at UC Davis, 2 at Cornell, 14 at Texas A&M, 7 at Ohio State, 22 at Florida State).
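
One way to tally profiles per school from the combined data frame is to count distinct names within each school. This is only a sketch on placeholder rows, assuming the columns built by `get_pubs`. Since the name column holds last names only, two professors at the same school who share a last name would be counted once.

```python
import pandas as pd

# Placeholder rows in the same shape as schooldf
toy = pd.DataFrame({
    "school": ["Davis", "Davis", "Davis", "Cornell"],
    "name":   ["Doe", "Doe", "Poe", "Roe"],
    "title":  ["t1", "t2", "t3", "t4"],
    "year":   [2001, 2005, 2012, 1999],
})

# Each distinct last name within a school counts as one profile
profiles_per_school = toy.groupby("school")["name"].nunique()
print(profiles_per_school)
print("Total profiles:", profiles_per_school.sum())
```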

The second major problem with scholarly is that it is a web scraper and doesn't use an official API. This means that this package may stop working if there are any changes to the Google Scholar site or results pages. For this reason, I exported and saved my data.

For the search term, I used the professor's full name, including middle initial if available, and included the university name. By doing this, I was able to get the correct profile most of the time (only two were wrong). I went through the results by hand, and since I have some experience in the animal science field, it was easy for me to tell when the wrong profile was identified.
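
A by-hand check like this is easier with a short listing of the first few titles for each professor. The sketch below shows one possible way to build such a listing, using placeholder rows and assuming the columns built by `get_pubs`.

```python
import pandas as pd

# Placeholder rows in the same shape as schooldf
toy = pd.DataFrame({
    "school": ["Davis", "Davis", "Davis", "Cornell"],
    "name":   ["Doe", "Doe", "Doe", "Roe"],
    "title":  ["First title", "Second title", "Third title", "Only title"],
    "year":   [2001, 2005, 2012, 1999],
})

# Keep at most two titles per professor for a quick visual check
sample = toy.groupby(["school", "name"]).head(2)
print(sample[["school", "name", "title"]])
```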

When the profile is set up, the quality of data is excellent. Because of the nature of the Google Scholar profile, once the correct profile is identified, every paper associated with that profile is authored by the correct person. This is helpful because it allows me to check for errors at the professor level (around 50 checks), whereas a traditional Google search would require checking each paper for the author (around 9000 checks).

Additionally, every paper authored by that person is returned. This means that the number of publications returned is meaningful. In contrast, a traditional Google search may return thousands of results, and I would simply cut it off after a certain number, making the number of publications arbitrary.

There were a few issues reading in some of the publication titles, usually towards the end of the results for a professor. In these cases, I noticed it was common that a year was not available, so I set those years equal to 0. Then in the next notebook, I remove any observations with a year of 0. This was a fairly effective way to remove most mistakes.
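
The year-0 filter described above can be sketched as follows. The rows are placeholders, and the conversion with `pd.to_numeric` is an assumption added here in case some years arrive as strings; the next notebook may do this differently.

```python
import pandas as pd

# Placeholder rows; a year of 0 marks a record whose year was unavailable
toy = pd.DataFrame({
    "school": ["Davis", "Davis", "Cornell"],
    "name":   ["Doe", "Doe", "Roe"],
    "title":  ["A complete record", "A garbled record", "Another complete record"],
    "year":   [2014, 0, "1998"],
})

# Coerce years to numbers; anything unreadable is treated like a missing year
years = pd.to_numeric(toy["year"], errors="coerce").fillna(0)

# Drop every observation with a year of 0
cleaned = toy[years != 0].reset_index(drop=True)
print(len(toy) - len(cleaned), "row(s) dropped")
```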