# Scraping the Wellesley Hive Using Selenium
__Author:__ Francisca Moya Jimenez

02/20/2020

To answer the question "What are Wellesley College international alumnae doing now?", we decided to scrape the Wellesley Hive, a social network for Wellesley College students and alumni, for information on international alumni. We were especially interested in collecting each alum's graduation date, current location, job experience, and education. This notebook uses Selenium to log into the Wellesley Hive, search for users who are international alums, load the matching user cards, and visit each user's profile to scrape the information of interest.

### 1. Login
We use Selenium to navigate to the Wellesley Hive's login page and log into the portal. 

In [1]:
from selenium import webdriver

In [2]:
from selenium.webdriver.chrome.options import Options
options = Options()
options.page_load_strategy = 'eager'

In [3]:
DRIVER_PATH = '/Users/francisca/Desktop/JanuaryProject/chromedriver'
driver = webdriver.Chrome(executable_path=DRIVER_PATH, options=options)

In [4]:
def login(username, password):
    """Logs into the Wellesley Hive with the given Wellesley College credentials"""
    
    driver.get('https://login.wellesley.edu/module.php/core/loginuserpass.php?AuthState=_2e7d2c248c37bca87ca5f9503042055f14f1e744a8%3Ahttps%3A%2F%2Flogin.wellesley.edu%2Fsaml2%2Fidp%2FSSOService.php%3Fspentityid%3Dhttps%253A%252F%252Fwww.peoplegrove.com%252Fsaml%26cookieTime%3D1610746096%26RelayState%3Dwellesley')
    driver.find_element_by_name('username').send_keys(username)
    driver.find_element_by_name('password').send_keys(password)
    driver.find_element_by_id('regularsubmit').submit()

In [None]:
# Login credentials have been taken out for privacy
login('username', 'password')
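
Because the credentials are removed from the notebook, one option is to never type them into a cell at all and read them at run time instead. The sketch below is an assumption about how this could be done, not part of the original workflow: the helper `get_credentials` and the environment variable names `HIVE_USERNAME` and `HIVE_PASSWORD` are hypothetical.

```python
import os
from getpass import getpass


def get_credentials(env=None, ask_user=input, ask_secret=getpass):
    """Return (username, password) without hard-coding them in the notebook.

    HIVE_USERNAME and HIVE_PASSWORD are hypothetical variable names chosen
    for this sketch. Any value missing from the environment is prompted for;
    getpass hides the password while it is typed.
    """
    env = os.environ if env is None else env
    username = env.get('HIVE_USERNAME') or ask_user('Username: ')
    password = env.get('HIVE_PASSWORD') or ask_secret('Password: ')
    return username, password
```

With a helper like this, the call in the cell above could become `login(*get_credentials())`.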

### 2. Searching for International Alums
The Hive has a search feature that can be used to find users based on specific criteria, such as class year or extracurricular involvement. One of these criteria allows us to search users based on their international student status, but many international alums choose not to display this information on their profile. After trying different combinations of criteria, we decided to search for alums who had participated in the Slater International Student Organization. The majority of the students who participate in Slater are international students, and a few are domestic students who have ties to the international community. This search yielded 229 exact results.

Selenium was used to select the desired criteria, which included users who are alums and were involved in Slater during their time at Wellesley.

In [None]:
import time

In [None]:
# Navigate to search page
driver.get('https://hive.wellesley.edu/hub/the-hive/person')
time.sleep(3)
driver.find_element_by_id('remainingFilter').click()
time.sleep(3)
categories = driver.find_elements_by_class_name('content-header')

# Select alumni filter
userType = categories[2]
userType.find_element_by_tag_name('em').click()
time.sleep(3)
alumna = driver.find_element_by_xpath('//*[@id="remainingFilterFilter"]/div/div/div/div/div[2]/div/div[3]/div/div/div[2]/div/div/div/div/label[1]/span[1]/input')
alumna.click()
time.sleep(3)

# Select Slater filter
involvement = categories[15]
involvement.find_element_by_tag_name('em').click()
time.sleep(3)
slater = driver.find_element_by_xpath('//*[@id="remainingFilterFilter"]/div/div/div/div/div[2]/div/div[16]/div/div/div[2]/div/div/div/div/label[203]/span[1]/input')
slater.click()
submit = driver.find_element_by_xpath('//*[@id="remainingFilterFilter"]/div/div/div/div/div[4]/div/button')
submit.click()
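
The fixed `time.sleep(3)` pauses in the cell above can wait longer than necessary on a fast connection and not long enough on a slow one. A minimal sketch of the alternative, polling until a condition holds, is shown below. The helper `wait_until` and its parameters are hypothetical names used for illustration; Selenium provides its own `WebDriverWait` class for this purpose, which would be the natural choice in practice.

```python
import time


def wait_until(condition, timeout=10.0, poll=0.5):
    """Call `condition` repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds
    have passed without the condition being met.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError('condition not met within %s seconds' % timeout)
        time.sleep(poll)


# Possible use with the notebook's driver (an empty list is falsy, so this
# keeps polling until at least one filter category has been rendered):
# categories = wait_until(
#     lambda: driver.find_elements_by_class_name('content-header'))
```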

### 3. Loading User Search Profiles
The Wellesley Hive loads its search results asynchronously, so we need to scroll down in order to load all of the users that match our search. The results page uses infinite scroll: after it shows the exact matches for a search, it starts showing "similar matches", which are users who do not meet all of the given criteria. Our search returned 229 exact matches, and these are the only profiles we are interested in.

We will define a function that scrolls down until at least 232 user cards are loaded, a threshold slightly above the number of exact matches.

In [None]:
def scroller():
    """Scrolls down on the Wellesley Hive website until 232 user cards are loaded"""
    SCROLL_PAUSE_TIME = 5

    # Get scroll height
    last_height = driver.execute_script("return document.body.scrollHeight")

    # Keep scrolling until at least 232 user cards are loaded
    while len(driver.find_elements_by_class_name('ant-card-body')) < 232:
        
        # Scroll down to bottom
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

        # Wait to load page
        time.sleep(SCROLL_PAUSE_TIME)

        # Calculate new scroll height and compare with last scroll height
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break
        last_height = new_height

In [None]:
# Scroll all the way down to get at least 232 alums
scroller()
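
The stopping rule used by `scroller` (stop once enough cards are loaded, or once the page height stops growing) can also be written independently of the browser, which makes it possible to check the logic without a live session. The sketch below assumes three caller-supplied functions; `scroll_until`, `count_cards`, `scroll_down`, and `page_height` are hypothetical names and are not part of Selenium or of the original notebook.

```python
import time


def scroll_until(count_cards, scroll_down, page_height, target, pause=5):
    """Scroll until `target` cards are loaded or the page stops growing.

    Returns the number of cards loaded when scrolling stops.
    """
    last_height = page_height()
    while count_cards() < target:
        scroll_down()
        time.sleep(pause)  # give the page time to load more cards
        new_height = page_height()
        if new_height == last_height:
            break  # nothing new was loaded, so the end was reached
        last_height = new_height
    return count_cards()


# With the notebook's driver, the three functions could be supplied as:
# scroll_until(
#     count_cards=lambda: len(driver.find_elements_by_class_name('ant-card-body')),
#     scroll_down=lambda: driver.execute_script(
#         "window.scrollTo(0, document.body.scrollHeight);"),
#     page_height=lambda: driver.execute_script(
#         "return document.body.scrollHeight"),
#     target=232,
# )
```

Passing the target as a parameter also keeps the number of cards to load in one place if the search is rerun and the count of exact matches changes.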

### 4. Collecting Alum Data
Using Selenium, we will click on all of the international alums' profiles and retrieve information from their profiles. 

The function below collects the personal details, professional experience, and educational background for a profile, if the information is available on a user's profile.

In [None]:
def getProfileInfo():
    """Returns a dictionary containing name, graduation year, current location, 
    professional experience, and educational background for a given Wellesley
    Hive user"""
    
    # Header: name, graduation year, and current location.
    # The last four characters of the title text are dropped from the name.
    name = driver.find_element_by_class_name("profile__top__content__title").text[:-4]
    try:
        header = driver.find_elements_by_css_selector(".profile__card--font-sm.profile__card__icon-with-text__text")
        year = header[0].text
        location = header[1].text
    except Exception:
        # Year or location is not shown on this profile
        year = ''
        location = ''
    
    # Professional experience
    workExp = []
    try:
        work = driver.find_element_by_id("workHistory")
        for elem in work.find_elements_by_class_name("profile__section__list__item"):
            position = elem.find_element_by_class_name("profile__section__list__item__content__title").text
            workplace = elem.find_element_by_class_name("flex-grow").text
            info = (position,workplace)
            workExp.append(info)
    except Exception:
        # Work history is missing or incomplete; keep whatever was collected
        pass
    
    # Education
    edExp = []
    try:
        ed = driver.find_element_by_id("schools")
        for elem in ed.find_elements_by_css_selector(".profile__section__list__item.profile__schools__item"):
            institution = elem.find_element_by_class_name("flex-grow").text
            level = elem.find_element_by_class_name("profile__section__list__item__content__subtitle").text
            majors = elem.find_element_by_css_selector(".profile__section__list__item__content__subtitle.profile__section__list__item__content__subtitle2").text
            info2 = (institution,level,majors)
            edExp.append(info2)
    except Exception:
        # Education is missing or incomplete; keep whatever was collected
        pass
        
    return {'name':name, 'year':year, 'location':location, 'work':workExp, 'education':edExp}

We will retrieve the profile information for only the first 229 users, which are the exact matches. The remaining users are similar matches, which means that they do not meet all of our criteria (graduated before Spring 2021 and participated in Slater).

In [None]:
alums = []
scroll = 350
counter = 0

for i in range(229):
    # Re-locate the cards on every pass, since references found before
    # navigating away and back may no longer be valid
    allUsers = driver.find_elements_by_class_name('person-card__name-block')
    alum = allUsers[i]

    # Go to profile
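    try:
        alum.click()
        time.sleep(3)
        alums.append(getProfileInfo())
    except Exception:
        # The click or the first scrape failed (for example, the profile was
        # still loading): wait a little longer and try the scrape once more
        time.sleep(2)
        alums.append(getProfileInfo())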

    # Go back
    driver.back()
    time.sleep(2) 
    
    counter += 1
    
    if counter%4 == 0:
        driver.execute_script("window.scrollTo(0,"+str(scroll)+");")
        scroll += 350
        time.sleep(2)

In [None]:
import json

with open('alums.json','w') as outfile:
    json.dump(alums, outfile)
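
One detail worth knowing when reading `alums.json` back: JSON has no tuple type, so the `(position, workplace)` and `(institution, level, majors)` tuples built in `getProfileInfo` are restored as lists. The sketch below illustrates this with a placeholder record that has the same shape as the function's return value; the values are invented labels, not scraped data.

```python
import json

# Placeholder record shaped like getProfileInfo's return value.
# The values are illustrative labels only, not real scraped data.
record = {
    'name': 'NAME',
    'year': 'YEAR',
    'location': 'LOCATION',
    'work': [('POSITION', 'WORKPLACE')],
    'education': [('INSTITUTION', 'LEVEL', 'MAJORS')],
}

# Round-trip through JSON, as json.dump followed by json.load would do
restored = json.loads(json.dumps([record]))

# Tuples are serialized as JSON arrays, so they are restored as lists
work_entry = restored[0]['work'][0]

# Convert back to tuples if downstream code relies on them
work_as_tuples = [tuple(item) for item in restored[0]['work']]
```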