# Gathering past faculty information

Author: Andrea Mock

Having gathered faculty information for the current academic year (2020-2021), we are also interested in faculty information from past years. One way to extract past data is the Internet Archive's Wayback Machine, a site that lets you retrieve previous versions of a web page. With its help we can scrape earlier snapshots of the faculty roster page and, in a subsequent step, analyze the data for different years.
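
The snapshot addresses used in this notebook all share one pattern: `https://web.archive.org/web/<14-digit timestamp>/<original URL>`. As a minimal sketch, the snippet below builds such an address and reads the capture time back out of it. The helper names `wayback_url` and `snapshot_time` are made up for illustration and are not part of any library.

```python
from datetime import datetime

WAYBACK_PREFIX = 'https://web.archive.org/web'

def wayback_url(timestamp, original_url):
    """Build a snapshot address from a YYYYMMDDhhmmss timestamp (hypothetical helper)."""
    return f'{WAYBACK_PREFIX}/{timestamp}/{original_url}'

def snapshot_time(url):
    """Read the capture time back out of a snapshot address (hypothetical helper)."""
    # drop the prefix and the slash after it, then keep everything up to the next slash
    timestamp = url[len(WAYBACK_PREFIX) + 1:].split('/', 1)[0]
    return datetime.strptime(timestamp, '%Y%m%d%H%M%S')

roster = 'https://www.wellesley.edu/provost/facultyroster'
url_2019 = wayback_url('20200420232345', roster)
print(url_2019)
print(snapshot_time(url_2019).date())  # 2020-04-20
```

The capture date is a quick check that a snapshot falls inside the academic year it is meant to represent.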

## 1. Extracting faculty data for the 2019-2020 academic year

In [1]:
import pandas as pd
from collections import Counter

# import for plotting later
from matplotlib import pyplot as plt
plt.style.use('fivethirtyeight')

In [2]:
#import chrome webdriver
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager

In [3]:
browser = webdriver.Chrome(ChromeDriverManager().install())
browser.get('https://web.archive.org/web/20200420232345/https://www.wellesley.edu/provost/facultyroster')

[WDM] - Current google-chrome version is 88.0.4324
[WDM] - Get LATEST driver version for 88.0.4324
[WDM] - Trying to download new driver from http://chromedriver.storage.googleapis.com/88.0.4324.96/chromedriver_mac64.zip
[WDM] - Driver has been saved in cache [/Users/andreamock/.wdm/drivers/chromedriver/mac64/88.0.4324.96]


In [4]:
profs = browser.find_elements_by_xpath('//div[@class="field-item even"]')[0]

In [15]:
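# extracting relevant p elements
# ('//p' matches every <p> on the page, not only those inside `profs`; the slice keeps the roster entries)
profInfo = profs.find_elements_by_xpath('//p')[2:452]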

In [16]:
# look at the information for one professor
profInfo[-1].text

'Zitnick, Dan\nLecturer in Middle Eastern Studies\nB.A., M.A., University of Michigan'
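
The cells above use Selenium 3 style calls (`find_elements_by_xpath`, and the driver path passed positionally to `webdriver.Chrome`). Recent Selenium 4 releases no longer provide these, so the sketch below shows one way the same steps might look there. It is an assumed setup, not the code this notebook was run with. The function name `fetch_paragraphs` is hypothetical, and it returns plain strings rather than elements because the browser is closed before returning.

```python
def fetch_paragraphs(url, start, end):
    """Selenium 4 style variant of the cells above (assumed setup, hypothetical helper)."""
    # imported lazily so that defining the function does not require a browser
    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service
    from selenium.webdriver.common.by import By
    from webdriver_manager.chrome import ChromeDriverManager

    driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
    try:
        driver.get(url)
        container = driver.find_elements(By.XPATH, '//div[@class="field-item even"]')[0]
        # '//p' matches every <p> in the page, exactly as in the original cell
        paragraphs = container.find_elements(By.XPATH, '//p')[start:end]
        return [p.text for p in paragraphs]
    finally:
        driver.quit()  # always close the browser, even if scraping fails
```

Calling `fetch_paragraphs(url, 2, 452)` with the snapshot address used above should give the texts of the same slice, provided Chrome and a matching driver are available.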

In [137]:
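# restructure all data entries
def restructureData(data):
    """
    turns each scraped p element into a row of the form
    [name, title, title2, title3, education], filling missing fields with None
    """
    allInstructors = []
    for person in data:
        info = person.text.split('\n')
        if len(info) == 7: # special case when p element contains two people
            # second person: name and title on lines 4-5, education on the last line
            info1 = info[4:6] + [None, None] + [info[-1]]
            allInstructors.append(info1)
            # first person: name and title on lines 0-1, education on line 2
            info = info[:2] + [None, None] + [info[2]]
        if len(info) == 3: # name, title, education
            info = info[:2] + [None, None] + [info[-1]]
        elif len(info) == 2:
            if "B.F.A." in info[1]: # special case of person just having university info and no title
                info = [info[0]] + [None, None, None] + [info[-1]]
            else: # name and title only, no education
                info = info[:2] + [None, None, None]
        elif len(info) == 4:
            if 'University' in info[2]: # education spread over the last two lines
                info = info[:2] + [None, None] + [''.join([info[2], info[-1]])]
            else: # name, two titles, education
                info = info[:3] + [None] + [info[-1]]
        allInstructors.append(info)
    return allInstructors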
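
The padding that `restructureData` performs for the common cases can be seen on a single entry. The sketch below is a simplified stand-in, not the function above. It assumes the first line is the name, the last line is the education, and anything in between is a title, so it does not cover the special cases (two people in one element, or an entry without education). The name `normalize_entry` is made up for illustration.

```python
def normalize_entry(text):
    """Pad one roster entry to [name, title, title2, title3, education]."""
    lines = text.split('\n')
    name, education = lines[0], lines[-1]
    titles = lines[1:-1][:3]                # at most three title lines
    titles += [None] * (3 - len(titles))    # pad missing titles with None
    return [name] + titles + [education]

entry = 'Zitnick, Dan\nLecturer in Middle Eastern Studies\nB.A., M.A., University of Michigan'
print(normalize_entry(entry))
```

Every row ends up with five fields, which is what allows the list of rows to be given the five column names used in the next cell.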

In [22]:
# create dataframe with professor information
prof_df2019 = pd.DataFrame(restructureData(profInfo))
prof_df2019.columns = ['name', 'title', 'title2', 'title3', 'education']

In [23]:
# part of our dataset
prof_df2019.head()

| | name | title | title2 | title3 | education |
|---|---|---|---|---|---|
| 0 | Aadnani, Rachid | Senior Lecturer in Middle Eastern Studies | | | B.A., Universite Moulay Ismail (Morocco); M.A.... |
| 1 | Abeberese, Ama Baafra | Assistant Professor of Economics | | | B.A., Wellesley College; M.A., M.Phil., Ph.D.,... |
| 2 | Adams, Kris | Senior Music Performance Faculty in Vocal Jazz | | | B.M., Berklee College of Music; M.M., New Engl... |
| 3 | Adhikari, Prabal | Visiting Lecturer in Physics | | | B.A., Grinnell College; Ph.D., University of M... |
| 4 | Agosin, Marjorie | Professor of Spanish | | | B.A., University of Georgia; M.A., Ph.D., Indi... |


In [24]:
# example of professors with multiple titles (rows where title3 is filled in)
prof_df2019[prof_df2019['title3'].notna()]

| | name | title | title2 | title3 | education |
|---|---|---|---|---|---|
| 299 | Núñez, Megan | Nan Walsh Schow ‘54 and Howard B. Schow Profes... | Professor of Chemistry | Dean of Faculty Affairs | B.A., Smith College; Ph.D., California Institu... |
| 347 | Russell, David | Lecturer in Music | Co-Director, Wellesley Chamber Music Society | Music Performance Faculty in Cello | B.M., Eastman School; M.M., University of Akro... |


In [28]:
# save to csv file for later
prof_df2019.to_csv('faculty_info2019.csv')

Now we have collected all the faculty information for the academic year of 2019-2020. Next we will continue with the previous year.
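
One detail matters when these CSV files are loaded again later: `to_csv` writes the row index as an unnamed first column, which `read_csv` brings back as a column called `Unnamed: 0` unless it is told to treat that column as the index. The sketch below shows both cases on a one-row frame, with an in-memory buffer standing in for a file such as `faculty_info2019.csv`.

```python
import io
import pandas as pd

df = pd.DataFrame(
    [['Zitnick, Dan', 'Lecturer in Middle Eastern Studies', None, None,
      'B.A., M.A., University of Michigan']],
    columns=['name', 'title', 'title2', 'title3', 'education'],
)

buffer = io.StringIO()   # stands in for a CSV file on disk
df.to_csv(buffer)        # the row index is written as an unnamed first column

buffer.seek(0)
plain = pd.read_csv(buffer)                   # index comes back as 'Unnamed: 0'
buffer.seek(0)
restored = pd.read_csv(buffer, index_col=0)   # first column is used as the index again

print(list(plain.columns))
print(list(restored.columns))
```

Passing `index_col=0` on load, or `index=False` on save, keeps the extra column out of later analysis.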

## 2. Collecting faculty data from the 2018-2019 academic year
Taking a similar approach, we will use the Wayback Machine to extract the faculty roster for the 2018-2019 academic year.

In [29]:
# grab 2018-2019 page
browser.get('https://web.archive.org/web/20180821181030/http://www.wellesley.edu:80/provost/facultyroster')

In [37]:
# Extract relevant info
profs = browser.find_elements_by_xpath('//div[@class="field-item even"]')[0]
profInfo = profs.find_elements_by_xpath('//p')[2:445]

In [38]:
# one professor's information
profInfo[-1].text

'Daniel Zitnick\nLecturer in Middle Eastern Studies\nB.A., M.A., University of Michigan'

In [39]:
# create dataframe with professor information
prof_df2018 = pd.DataFrame(restructureData(profInfo))
prof_df2018.columns = ['name', 'title', 'title2', 'title3', 'education']

In [40]:
prof_df2018.head()

| | name | title | title2 | title3 | education |
|---|---|---|---|---|---|
| 0 | Rachid Aadnani | Senior Lecturer in Middle Eastern Studies | | | B.A., Universite Moulay Ismail (Morocco); M.A.... |
| 1 | Ama Baafra AbebereseA | Assistant Professor of Economics | | | B.A., Wellesley College; M.A., M.Phil., Ph.D.,... |
| 2 | Kris Adams | Senior Music Performance Faculty in Vocal Jazz | | | B.M., Berklee College of Music; M.M., New Engl... |
| 3 | Marjorie Agosin A2 | Professor of Spanish | | | B.A., University of Georgia; M.A., Ph.D., Indi... |
| 4 | Eliko Akahori | Senior Music Performance Faculty in Piano | Director, Music Performance Program | | B.M., Kunitachi College of Music (Japan); M.M.... |


In [41]:
# save to csv file for later
prof_df2018.to_csv('faculty_info2018.csv')
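
The names scraped for this year are written as "First Last" and some carry a trailing marker such as `A` or `A2`, while the 2019-2020 names are written as "Last, First". If the years are to be compared by name later, the two forms need to be brought together. The sketch below is one possible cleanup, under the assumption that a capital `A` (optionally followed by a digit) directly after the surname is a page marker and not part of the name. This notebook does not say what the markers mean, and the helper name `clean_name` is hypothetical.

```python
import re

# assumed marker format: optional space, capital A, optional digit, at the very end,
# directly after a lowercase letter (so surnames such as 'Adams' are left alone)
MARKER = re.compile(r'(?<=[a-z])\s?A\d?$')

def clean_name(name):
    """Strip an assumed trailing marker and rewrite 'Last, First' as 'First Last'."""
    name = MARKER.sub('', name.strip())
    if ', ' in name:
        last, first = name.split(', ', 1)
        name = f'{first} {last}'
    return name

for raw in ['Ama Baafra AbebereseA', 'Marjorie Agosin A2', 'Kris Adams', 'Aadnani, Rachid']:
    print(repr(raw), '->', repr(clean_name(raw)))
```

A rule this simple would also strip a genuine trailing initial "A", so the cleaned names are worth spot-checking before any merge.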

## 3. Collecting faculty data from the 2016-2017 academic year

In [73]:
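def extractPage(url, startEl, lastEl):
    """
    loads an archived roster page and returns the p elements in the slice [startEl:lastEl]
    (uses the global `browser` created above)
    """
    browser.get(url)
    profs = browser.find_elements_by_xpath('//div[@class="field-item even"]')[0]
    profInfo = profs.find_elements_by_xpath('//p')[startEl:lastEl]
    return profInfo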

In [43]:
def createAndSaveDf(info, filename):
    """
    creates a dataframe from the extracted faculty data, saves it to a csv file and returns the dataframe
    """
    df = pd.DataFrame(restructureData(info))
    df.columns = ['name', 'title', 'title2', 'title3', 'education']
    df.to_csv(filename)
    return df

In [98]:
info = extractPage('https://web.archive.org/web/20170608030112/http://www.wellesley.edu/provost/facultyroster', 7, 478)

In [99]:
df_2016 = createAndSaveDf(info, 'faculty_info2016.csv')

In [83]:
df_2016.head()

| | name | title | title2 | title3 | education |
|---|---|---|---|---|---|
| 0 | Rachid Aadnani | Lecturer in Middle Eastern Studies | | | B.A., Universite Moulay Ismail (Morocco); M.A.... |
| 1 | Ama Baafra AbebereseA1 | Assistant Professor of Economics | | | B.A., Wellesley College; M.A., M.Phil., Ph.D.,... |
| 2 | Katherine Adams | Instructor in Physical Education, Recreation a... | | | B.S., Nazareth College; MS., University of Mas... |
| 3 | Kris Adams | Senior Music Performance Faculty in Vocal Jazz | | | B.M., Berklee College of Music; M.M., New Engl... |
| 4 | Marjorie Agosin | Professor of Spanish | | | B.A., University of Georgia; M.A., Ph.D., Indi... |


## 4. Collecting faculty data from the 2015-2016 academic year

In [92]:
info2015 = extractPage('https://web.archive.org/web/20160224060536/http://www.wellesley.edu/provost/facultyroster',7,487)

In [95]:
df_2015 = createAndSaveDf(info2015, 'faculty_info2015.csv')
df_2015.head()

| | name | title | title2 | title3 | education |
|---|---|---|---|---|---|
| 0 | Rachid Aadnani | Lecturer in Middle Eastern Studies | | | B.A., Universite Moulay Ismail (Morocco); M.A.... |
| 1 | Ama Baafra Abeberese | Assistant Professor of Economics | | | B.A., Wellesley College; M.A., M.Phil., Ph.D.,... |
| 2 | Kris Adams | Senior Music Performance Faculty in Vocal Jazz | | | B.M., Berklee College of Music; M.M., New Engl... |
| 3 | Marjorie AgosinA2 | Luella LaMer Slaner Professor in Latin America... | Professor of Spanish | | B.A., University of Georgia; M.A., Ph.D., Indi... |
| 4 | Eliko Akahori | Music Performance Faculty in Piano | Director, Music Performance Program | | B.M., Kunitachi College of Music (Japan); M.M.... |


## 5. Collecting faculty data from the 2014-2015 academic year

In [109]:
info2014 = extractPage('https://web.archive.org/web/20141230183413/http://www.wellesley.edu/provost/facultyroster',1,479)

In [111]:
df_2014 = createAndSaveDf(info2014, 'faculty_info2014.csv')
df_2014.head()

| | name | title | title2 | title3 | education |
|---|---|---|---|---|---|
| 0 | Rachid Aadnani | Lecturer in Middle Eastern Studies | | | B.A., Universite Moulay Ismail (Morocco); M.A.... |
| 1 | Ama Baafra AbebereseA1 | Assistant Professor of Economics | | | B.A., Wellesley College; M.A., M.Phil., Ph.D.,... |
| 2 | Kris Adams | Senior Music Performance Faculty in Vocal Jazz | | | B.M., Berklee College of Music; M.M., New Engl... |
| 3 | Marjorie Agosin | Luella LaMer Slaner Professor in Latin America... | Professor of Spanish | | B.A., University of Georgia; M.A., Ph.D., Indi... |
| 4 | Eliko Akahori | Music Performance Faculty in Piano | Coach/Accompanist | | B.M., Kunitachi College of Music (Japan); M.M.... |


## 6. Collecting faculty data from the 2013-2014 academic year

In [116]:
info2013 = extractPage('https://web.archive.org/web/20140811170954/http://www.wellesley.edu/provost/facultyroster',1,483)

In [138]:
df_2013 = createAndSaveDf(info2013, 'faculty_info2013.csv')
df_2013.head()

| | name | title | title2 | title3 | education |
|---|---|---|---|---|---|
| 0 | Rachid Aadnani | Lecturer in Middle Eastern Studies | | | B.A., Universite Moulay Ismail (Morocco); M.A.... |
| 1 | Ama Baafra Abeberese | Assistant Professor of Economics | | | B.A., Wellesley College; M.A., M.Phil., Ph.D.,... |
| 2 | Kris Adams | Senior Music Performance Faculty in Vocal Jazz | | | B.M., Berklee College of Music; M.M., New Engl... |
| 3 | Marjorie Agosin | Luella LaMer Slaner Professor in Latin America... | Professor of Spanish | | B.A., University of Georgia; M.A., Ph.D., Indi... |
| 4 | Eliko Akahori | Music Performance Faculty in Piano | Coach/Accompanist | | B.M., Kunitachi College of Music (Japan); M.M.... |


## 7. Collecting faculty data from the 2012-2013 academic year

In [149]:
info2012 = extractPage('https://web.archive.org/web/20130518012000/http://www.wellesley.edu/provost/facultyroster',1,455)

In [152]:
df_2012 = createAndSaveDf(info2012, 'faculty_info2012.csv')
df_2012.head()

| | name | title | title2 | title3 | education |
|---|---|---|---|---|---|
| 0 | Rachid Aadnani | Lecturer in Middle Eastern Studies | | | B.A., Universite Moulay Ismail (Morocco); M.A.... |
| 1 | Brandon Abbs | Visiting Lecturer in Psychology | | | B.A., University of Maryland; Ph.D., Universit... |
| 2 | Rana Abdul-Aziz | Lecturer in Middle Eastern Studies | | | B.A., M.A., Tufts University |
| 3 | Kris Adams | Senior Music Performance Faculty in Vocal Jazz | | | B.M., Berklee College of Music; M.M., New Engl... |
| 4 | Marjorie Agosin | Luella LaMer Slaner Professor in Latin America... | Professor of Spanish | | B.A., University of Georgia; M.A., Ph.D., Indi... |
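
Sections 3 to 7 repeat the same two calls with different arguments: a snapshot timestamp, a slice start, a slice end and an output file name. As a sketch, these arguments can be collected in one table so that the pattern is visible in one place. The timestamps, slice bounds and file names are the ones used above. The `Snapshot` class and `SNAPSHOTS` list are made up for illustration.

```python
from typing import NamedTuple

ARCHIVE = 'https://web.archive.org/web/{stamp}/http://www.wellesley.edu/provost/facultyroster'

class Snapshot(NamedTuple):
    year: int    # year used in the csv file name
    stamp: str   # Wayback Machine timestamp of the snapshot
    start: int   # index of the first relevant p element
    end: int     # index just past the last relevant p element

    @property
    def url(self):
        return ARCHIVE.format(stamp=self.stamp)

    @property
    def filename(self):
        return f'faculty_info{self.year}.csv'

SNAPSHOTS = [
    Snapshot(2016, '20170608030112', 7, 478),
    Snapshot(2015, '20160224060536', 7, 487),
    Snapshot(2014, '20141230183413', 1, 479),
    Snapshot(2013, '20140811170954', 1, 483),
    Snapshot(2012, '20130518012000', 1, 455),
]

for snap in SNAPSHOTS:
    print(snap.filename, snap.url, snap.start, snap.end)
    # with a live browser, the two helpers above would be called as:
    # createAndSaveDf(extractPage(snap.url, snap.start, snap.end), snap.filename)
```

The slice bounds differ per snapshot because the number of p elements before and after the roster changes from year to year, so they still have to be found by inspecting each page.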


## 8. Collecting faculty data from the 2017-2018 academic year
One year is formatted very differently from the others. Instead of one p element per professor, the data sits in div elements, and each div holds a single text snippet, so the information for one professor is spread across several div elements. Because some professors have more lines of information than others, we cannot assume that every third element starts a new professor. However, the entries are separated by empty div elements, which we can use as the cutoff between one professor's information and the next.
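
The idea of cutting a flat sequence at blank separators can be tried on plain strings first. The sketch below uses `itertools.groupby` on a short hand-written list in which a single space plays the role of the empty div. It illustrates the grouping idea only, and the function name `group_by_blank` is made up for illustration.

```python
from itertools import groupby

def group_by_blank(snippets):
    """Split a flat list of strings into groups separated by blank entries."""
    groups = []
    # groupby yields runs of consecutive items that share the same key (blank or not)
    for is_blank, run in groupby(snippets, key=lambda s: not s.strip()):
        if not is_blank:
            groups.append(list(run))
    return groups

snippets = [
    'Rachid Aadnani',
    'Senior Lecturer in Middle Eastern Studies',
    'B.A., Universite Moulay Ismail (Morocco); M.A., Dartmouth College; Ph.D., Binghamton University',
    ' ',
    'Ama Baafra Abeberese',
    'Assistant Professor of Economics',
    'B.A., Wellesley College; M.A., M.Phil., Ph.D., Columbia University',
]
groups = group_by_blank(snippets)
print(len(groups))
print(groups[1][0])
```

Grouping this way keeps the final entry even when the list does not end with a separator, and it works for entries of any length.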

In [153]:
# extract data from the web
browser.get('https://web.archive.org/web/20180722031240/https://www.wellesley.edu/provost/facultyroster')
profs = browser.find_elements_by_xpath('//div[@class="field-item even"]')[0]
profInfo = profs.find_elements_by_xpath('//div')

In [154]:
len(profInfo) # so many div elements!

2243

In [160]:
# first faculty information 
profInfo[29].text

'Rachid Aadnani'

In [171]:
# last current faculty information 
profInfo[1896].text

'B.A., M.A., University of Michigan'

In [182]:
# only save current faculty info 
profInfo = profs.find_elements_by_xpath('//div')[29:1897]

In [189]:
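def splitProfInfo(info):
    """
    groups the div texts into one list per professor, using the blank
    div elements (text == ' ') as separators
    """
    profList = []
    prof = []
    for element in info:
        text = element.text
        if text != ' ': # still inside the current professor's entry
            prof.append(text)
        else: # blank div: the current professor's entry is complete
            profList.append(prof)
            prof = []
    if prof: # keep the last professor when the slice does not end on a blank div
        profList.append(prof)
    return profList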

In [190]:
cleanProf = splitProfInfo(profInfo)

In [193]:
cleanProf[:2]

[['Rachid Aadnani',
  'Senior Lecturer in Middle Eastern Studies',
  'B.A., Universite Moulay Ismail (Morocco); M.A., Dartmouth College; Ph.D., Binghamton University'],
 ['Ama Baafra Abeberese',
  'Assistant Professor of Economics',
  'B.A., Wellesley College; M.A., M.Phil., Ph.D., Columbia University']]

In [192]:
# restructure all data entries
def restructureProfInfo(profInfo):
    """
    same logic as restructureData, but each entry is already a list of strings
    (one list per professor) instead of a p element
    """
    allInstructors = []
    for info in profInfo:
        if len(info) == 7: # special case when one entry contains two people
            info1 = info[4:6] + [None, None] + [info[-1]]
            allInstructors.append(info1)
            info = info[:2] + [None, None] + [info[2]]
        if len(info) == 3: # name, title, education
            info = info[:2] + [None, None] + [info[-1]]
        elif len(info) == 2:
            if "B.F.A." in info[1]: # special case of person just having university info and no title
                info = [info[0]] + [None, None, None] + [info[-1]]
            else: # name and title only, no education
                info = info[:2] + [None, None, None]
        elif len(info) == 4:
            if 'University' in info[2]: # education spread over the last two lines
                info = info[:2] + [None, None] + [''.join([info[2], info[-1]])]
            else: # name, two titles, education
                info = info[:3] + [None] + [info[-1]]
        allInstructors.append(info)
    return allInstructors

In [196]:
restructuredProf = restructureProfInfo(cleanProf)
prof_2017 = pd.DataFrame(restructuredProf)
prof_2017.columns = ['name', 'title', 'title2', 'title3', 'education']

In [197]:
prof_2017.head()

| | name | title | title2 | title3 | education |
|---|---|---|---|---|---|
| 0 | Rachid Aadnani | Senior Lecturer in Middle Eastern Studies | | | B.A., Universite Moulay Ismail (Morocco); M.A.... |
| 1 | Ama Baafra Abeberese | Assistant Professor of Economics | | | B.A., Wellesley College; M.A., M.Phil., Ph.D.,... |
| 2 | Kris Adams | Senior Music Performance Faculty in Vocal Jazz | | | B.M., Berklee College of Music; M.M., New Engl... |
| 3 | Katherine Adams | Instructor in Physical Education, Recreation a... | | | B.S., Nazareth College; MS., University of Mas... |
| 4 | Marjorie Agosin | Professor of Spanish | | | B.A., University of Georgia; M.A., Ph.D., Indi... |


In [198]:
# save data to csv
prof_2017.to_csv('faculty_info2017.csv')