# LinkedIn Scraper

Author: Andrea Mock

In the following notebook we will scrape data from the LinkedIn profiles of Wellesley alums to investigate their professional career paths.

Our notebook has the following structure:
1. Log in and navigate to the alumni page
2. Gather LinkedIn URLs
3. Scrape individual pages

## Part 1 - Log in and navigate to the correct page
To start our scraping process we first need to open the LinkedIn website and log in. Once logged in, we can navigate to the Wellesley page, where we can access the profiles of all of the alums.

In [5]:
#import chrome webdriver
from selenium import webdriver
import time

In [170]:
browser = webdriver.Chrome()

In [171]:
def login(username, password, browser):
    """Open the LinkedIn login page and submit the given credentials."""
    # Open login page
    browser.get('https://www.linkedin.com/login?fromSignIn=true&trk=guest_homepage-basic_nav-header-signin')

    # Enter login info (the string "id" is the value of Selenium's By.ID locator)
    username_field = browser.find_element("id", "username")
    username_field.send_keys(username)

    password_field = browser.find_element("id", "password")
    password_field.send_keys(password)

    # Submit the form to log in
    password_field.submit()

In [172]:
# replace the placeholders with your own LinkedIn credentials
login("your_email_here", "your_password_here", browser)
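
Typing credentials directly into a cell leaves them in the saved notebook. As a minimal sketch of one alternative, the helper below reads them from environment variables and falls back to an interactive prompt. The function name `load_credentials` and the variable names `LINKEDIN_USER` and `LINKEDIN_PASS` are assumptions made for this example, not part of the original notebook.

```python
import os
from getpass import getpass


def load_credentials(env=os.environ, ask=input, ask_secret=getpass):
    """Return (username, password) without storing them in the notebook.

    LINKEDIN_USER and LINKEDIN_PASS are hypothetical variable names
    chosen for this sketch; any names would work.
    """
    username = env.get("LINKEDIN_USER")
    password = env.get("LINKEDIN_PASS")
    if not username:
        username = ask("LinkedIn email: ")
    if not password:
        # getpass hides the typed password from the cell output
        password = ask_secret("LinkedIn password: ")
    return username, password
```

The result could then be passed on with `username, password = load_credentials()` followed by `login(username, password, browser)`.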

In [38]:
# navigate to Wellesley's LinkedIn page for CS alums
browser.get("https://www.linkedin.com/school/wellesley-college/people/?facetFieldOfStudy=100189")

In [173]:
# navigate to Wellesley's LinkedIn page for Econ alums
browser.get("https://www.linkedin.com/school/wellesley-college/people/?facetFieldOfStudy=100990")

Like many popular social media sites, LinkedIn uses infinite scrolling: more profiles load only as you scroll down. We therefore keep scrolling until the page height stops growing, at which point all profiles have been loaded.

In [162]:
def scroller(browser):
    SCROLL_PAUSE_TIME = 1

    # Get scroll height
    last_height = browser.execute_script("return document.body.scrollHeight")

    while True:
        # Scroll down to bottom
        browser.execute_script("window.scrollTo(0, document.body.scrollHeight);")

        # Wait to load page
        time.sleep(SCROLL_PAUSE_TIME)

        # Calculate new scroll height and compare with last scroll height
        new_height = browser.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break
        last_height = new_height

In [174]:
scroller(browser)
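
The loop above ends only when the page height stops changing. As a hedged variation, the sketch below adds an upper bound on the number of scrolls so that a page which keeps growing cannot keep the cell running forever. The name `scroll_to_bottom` and the default of 200 rounds are assumptions for illustration, not values taken from LinkedIn.

```python
import time


def scroll_to_bottom(browser, pause=1.0, max_rounds=200):
    """Scroll until the page height stops growing or max_rounds is reached.

    Returns the number of scrolls performed. max_rounds is an assumed
    safety limit for this sketch.
    """
    last_height = browser.execute_script("return document.body.scrollHeight")
    for round_number in range(1, max_rounds + 1):
        # Scroll down to the bottom and give the page time to load more profiles
        browser.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)

        # Stop as soon as scrolling no longer increases the page height
        new_height = browser.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            return round_number
        last_height = new_height
    return max_rounds
```

A slow connection may need a longer `pause`, since a page that has not finished loading looks the same to this check as a page with no more profiles.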

## Part 2 - Gather alum URLs
After scrolling down, we can access the link to each alum's profile, which is contained in the small profile card shown for each individual. Using the browser's inspector, we found the class name of the elements we were looking for; with that class name we can collect all of the URLs and save them in a list.

In [164]:
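def gatherUrls(browser):
    list_of_hrefs = []

    # each profile card title contains the link to the alum's profile
    # ("class name" and "tag name" are the values of Selenium's By.CLASS_NAME and By.TAG_NAME)
    content_blocks = browser.find_elements("class name", "artdeco-entity-lockup__title")

    for block in content_blocks:
        elements = block.find_elements("tag name", "a")
        for el in elements:
            list_of_hrefs.append(el.get_attribute("href"))
    return list_of_hrefs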
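
The collected links may contain empty entries, tracking parameters, or the same profile more than once; whether they do depends on the page markup, so treat that as an assumption. A minimal sketch of a clean-up step, using a hypothetical helper `clean_profile_urls` and made-up example URLs:

```python
from urllib.parse import urlsplit, urlunsplit


def clean_profile_urls(hrefs):
    """Drop empty links, strip query strings and fragments, and remove duplicates.

    The order of first appearance is preserved.
    """
    seen = set()
    cleaned = []
    for href in hrefs:
        if not href:
            continue  # get_attribute("href") returns None when the attribute is missing
        parts = urlsplit(href)
        url = urlunsplit((parts.scheme, parts.netloc, parts.path.rstrip("/"), "", ""))
        if url not in seen:
            seen.add(url)
            cleaned.append(url)
    return cleaned


example = [
    "https://www.linkedin.com/in/jane-doe/",
    "https://www.linkedin.com/in/jane-doe?trk=x",
    None,
    "https://www.linkedin.com/in/ann",
]
cleaned_example = clean_profile_urls(example)
```

Here `cleaned_example` holds two URLs, one for each distinct profile path.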

In [181]:
def saveUrlsToFile(fileName, urls):
    # write one url per line
    with open(fileName, 'w') as outfile:
        outfile.write('\n'.join(urls))
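
Saving the URLs is only useful if they can be read back in a later session without scrolling through the page again. A small sketch of the matching reader, assuming the file holds one URL per line; `load_urls_from_file` is a name introduced here for illustration:

```python
def load_urls_from_file(file_name):
    """Read a file with one URL per line and return the URLs as a list."""
    with open(file_name) as infile:
        # skip blank lines and strip the trailing newline from each URL
        return [line.strip() for line in infile if line.strip()]
```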

In [1]:
# save the urls for alums who were econ majors
econUrls = gatherUrls(browser)
#econUrls[:10]

In [176]:
list_of_hrefs = gatherUrls(browser)

In [12]:
# total of 773 CS alum profiles
len(list_of_hrefs)

773

In [2]:
# some of the LinkedIn urls for CS majors
#list_of_hrefs[:10]

In [183]:
def gatherNames(browser):
    # gather the names of all alums shown on the page
    # ("xpath" is the value of Selenium's By.XPATH)
    alum_names = browser.find_elements("xpath", '//a[@class="ember-view link-without-visited-state"]')
    return [alum.text for alum in alum_names]

## Part 3 - Gathering individual page data
After obtaining all of the URLs, we can use the library scrape_linkedin, which allows us to scrape the information from a person's LinkedIn page. The documentation for scrape_linkedin can be found on GitHub at https://github.com/austinoboyle/scrape-linkedin-selenium

In [19]:
from scrape_linkedin import ProfileScraper

In [99]:
def createProfileDict(profileLink):
    """
    Given a link to a LinkedIn profile, scrape it with ProfileScraper and return the profile as a dictionary.
    """
    with ProfileScraper(cookie='your_cookie_here') as scraper:
        profile = scraper.scrape(url=profileLink)
        profileDict = profile.to_dict()
    return profileDict

In [3]:
# showing how our profile scraping works
profile_example = createProfileDict("example profile link here")
#profile_example

In [27]:
def profileList(url_list):
    """
    Scrape the profile behind every url in url_list and return the
    resulting profile dictionaries as a list.
    """
    return [createProfileDict(url) for url in url_list ]

In the following cells we gather the data in chunks, both to avoid detection by LinkedIn and to keep the browser from becoming unresponsive.
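
The chunking can also be written as a loop rather than one cell per slice. The sketch below is one possible shape under stated assumptions: `scrape_in_chunks`, the chunk size of 50, and the 60 second pause are illustrative choices, and `scrape_one` stands for any function that turns a URL into a profile dictionary, such as `createProfileDict`. Failed URLs are recorded instead of aborting the whole chunk.

```python
import time


def scrape_in_chunks(urls, scrape_one, chunk_size=50, pause=60.0):
    """Scrape urls chunk by chunk, pausing between chunks.

    Returns (profiles, failures): profiles holds the successful results,
    failures holds (url, error message) pairs that can be retried later.
    """
    profiles, failures = [], []
    for start in range(0, len(urls), chunk_size):
        # Slices exclude the end index, so consecutive chunks
        # neither overlap nor skip a URL.
        for url in urls[start:start + chunk_size]:
            try:
                profiles.append(scrape_one(url))
            except Exception as error:  # for example a Selenium click error
                failures.append((url, str(error)))
        # Pause between chunks, but not after the last one
        if start + chunk_size < len(urls):
            time.sleep(pause)
    return profiles, failures
```

With the names used in this notebook, a call might look like `allProfiles, failed = scrape_in_chunks(list_of_hrefs, createProfileDict)`. Because Python slices exclude their end index, `list_of_hrefs[:100]` followed by `list_of_hrefs[100:150]` covers every URL exactly once.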

In [None]:
allProfiles = profileList(list_of_hrefs)

In [53]:
allProfiles_100 = profileList(list_of_hrefs[:99])

Unable to determine current company...continuing
Unable to determine current company...continuing
Unable to determine current company...continuing
Unable to determine current company...continuing
Unable to determine current company...continuing
Unable to determine current company...continuing
Unable to determine current company...continuing
Unable to determine current company...continuing
Unable to determine current company...continuing
Unable to determine current company...continuing
Unable to determine current company...continuing
Unable to determine current company...continuing
Unable to determine current company...continuing
Unable to determine current company...continuing
Unable to determine current company...continuing
Unable to determine current company...continuing
Unable to determine current company...continuing
Unable to determine current company...continuing
Unable to determine current company...continuing
Unable to determine current company...continuing
Unable to determine 

In [57]:
allProfiles_150 = profileList(list_of_hrefs[100:150])

Unable to determine current company...continuing
Unable to determine current company...continuing
Unable to determine current company...continuing
Unable to determine current company...continuing
Unable to determine current company...continuing
Unable to determine current company...continuing
Unable to determine current company...continuing
Unable to determine current company...continuing
Unable to determine current company...continuing
Unable to determine current company...continuing
Unable to determine current company...continuing
Unable to determine current company...continuing
Unable to determine current company...continuing


In [59]:
allProfiles_200 = profileList(list_of_hrefs[151:200])

Unable to determine current company...continuing
Unable to determine current company...continuing
Unable to determine current company...continuing
Unable to determine current company...continuing
Unable to determine current company...continuing
Unable to determine current company...continuing
Unable to determine current company...continuing
Unable to determine current company...continuing
Unable to determine current company...continuing
Unable to determine current company...continuing


In [64]:
allProfiles_250 = profileList(list_of_hrefs[201:250])

Unable to determine current company...continuing
Unable to determine current company...continuing
Unable to determine current company...continuing
Unable to determine current company...continuing
Unable to determine current company...continuing
Unable to determine current company...continuing
Unable to determine current company...continuing
Unable to determine current company...continuing
Unable to determine current company...continuing


In [65]:
allProfiles_300 = profileList(list_of_hrefs[251:300])

Unable to determine current company...continuing
Unable to determine current company...continuing
Unable to determine current company...continuing
Unable to determine current company...continuing
Unable to determine current company...continuing
Unable to determine current company...continuing
Unable to determine current company...continuing
Unable to determine current company...continuing
Unable to determine current company...continuing


In [68]:
allProfiles_350 = profileList(list_of_hrefs[301:350])

Unable to determine current company...continuing
Unable to determine current company...continuing
Unable to determine current company...continuing
Unable to determine current company...continuing


In [69]:
allProfiles_400 = profileList(list_of_hrefs[351:400])

Unable to determine current company...continuing
Unable to determine current company...continuing
Unable to determine current company...continuing
Unable to determine current company...continuing
Unable to determine current company...continuing
Unable to determine current company...continuing
Unable to determine current company...continuing
Unable to determine current company...continuing
Message: element click intercepted: Element <a data-control-name="contact_see_more" href="/in/anna-m-pfoertsch/detail/contact-info/" id="ember96" class="ember-view">...</a> is not clickable at point (786, 161). Other element would receive the click: <section id="ember160" class="pv-highlights-section pv-profile-section artdeco-container-card artdeco-card ember-view">...</section>
  (Session info: chrome=86.0.4240.198)

Unable to determine current company...continuing
Unable to determine current company...continuing
Unable to determine current company...continuing
Unable to determine current company...

In [100]:
allProfiles_450_1 = profileList(list_of_hrefs[411:415])

In [101]:
allProfiles_450_2 = profileList(list_of_hrefs[416:450])

Unable to determine current company...continuing


In [104]:
allProfiles_500 = profileList(list_of_hrefs[451:500])

Unable to determine current company...continuing
Unable to determine current company...continuing


In [105]:
allProfiles_550 = profileList(list_of_hrefs[501:550])

Unable to determine current company...continuing
Unable to determine current company...continuing
Unable to determine current company...continuing
Unable to determine current company...continuing
Unable to determine current company...continuing
Unable to determine current company...continuing
Unable to determine current company...continuing
Unable to determine current company...continuing
Unable to determine current company...continuing
Unable to determine current company...continuing
Unable to determine current company...continuing
Unable to determine current company...continuing
Unable to determine current company...continuing


In [112]:
allProfiles_600_1 = profileList(list_of_hrefs[551:560])

In [126]:
allProfiles_600_2 = profileList(list_of_hrefs[561:580])

Unable to determine current company...continuing
Unable to determine current company...continuing


In [127]:
allProfiles_600_3 = profileList(list_of_hrefs[581:600])

Unable to determine current company...continuing


In [131]:
allProfiles_650_1 = profileList(list_of_hrefs[601:625])

Message: 

Unable to determine current company...continuing
Unable to determine current company...continuing


In [132]:
allProfiles_650_2 = profileList(list_of_hrefs[626:650])

Unable to determine current company...continuing


In [137]:
allProfiles_700_1 = profileList(list_of_hrefs[651:675])

Unable to determine current company...continuing
Unable to determine current company...continuing
Unable to determine current company...continuing
Unable to determine current company...continuing


In [138]:
allProfiles_700 = profileList(list_of_hrefs[676:700])

Unable to determine current company...continuing
Unable to determine current company...continuing
Unable to determine current company...continuing
Unable to determine current company...continuing
Unable to determine current company...continuing
Unable to determine current company...continuing


In [4]:
#allProfiles_800

In [147]:
allProfiles_750 = profileList(list_of_hrefs[701:750])

Unable to determine current company...continuing
Unable to determine current company...continuing


In [None]:
allProfiles_800 = profileList(list_of_hrefs[751:])

Unable to determine current company...continuing
Unable to determine current company...continuing
Unable to determine current company...continuing


In [142]:
merged700 = allProfiles_700 + allProfiles_700_1 + allProfiles_800

In [184]:
import json

# save data to a json file
def saveDictFile(fileName, data):
    with open(fileName, 'w') as outfile:
        json.dump(data, outfile)
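
Since the profiles are collected in many separate lists, it can help to merge them in one step and to check that the saved file reads back correctly. A minimal sketch, where `merge_chunks` and `load_dict_file` are hypothetical helpers that are not part of the original notebook:

```python
import json


def merge_chunks(*chunks):
    """Concatenate any number of profile lists, keeping their order."""
    merged = []
    for chunk in chunks:
        merged.extend(chunk)
    return merged


def load_dict_file(file_name):
    """Load profile data that was previously saved as JSON."""
    with open(file_name) as infile:
        return json.load(infile)
```

Passing the chunk lists in index order, for example `merge_chunks(allProfiles_100, allProfiles_150)`, keeps the merged profiles aligned with the order of `list_of_hrefs`, apart from any URLs that were skipped.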