# LinkedIn Web-Scraping Toolkit

This code asks for your LinkedIn credentials, the city/region where you are interested in working, and the keywords in titles/taglines for all people you want to 'view'. It will then:

1. look at all* current job postings in your city of interest and save the posting organizations/companies to a CSV file (in the same directory as this notebook)
2. use that list of companies and search each company's people for roles that match your keywords (e.g. "Purchasing Manager")
3. save the names, job titles, locations, and LinkedIn URLs for those people in an Excel file
4. view the profiles of any people from (3) that are located in your city of interest

The idea is to build a list of companies and/or people to network with. Viewing their profiles may also prompt relevant people, some of whom may be considering hiring someone like you, to view your profile in return.


Prior to running this program, make sure:
1. Chrome is installed (see [chrome website](https://www.google.co.uk/chrome/?brand=CHBD&gclid=Cj0KCQiA2o_fBRC8ARIsAIOyQ-mBWe_td_tlfyeh_TWRbyDCe7zo6R65xYhObb42egIYBfkRnlW4_MUaAtdvEALw_wcB&gclsrc=aw.ds))
2. the chromedriver.exe file is in the same folder as this file (download the Python one at [SeleniumHQ](https://www.seleniumhq.org/download/))
3. Python and all the required third-party libraries (pandas, selenium, xlsxwriter) are installed on your machine; os, time, random, and re ship with Python
4. search_terms.csv is saved in the same folder as this file, with search terms listed in column A starting from cell A1

*Note that LinkedIn limits the job postings to the first 40 pages of results, ranked by how relevant LinkedIn thinks the jobs are to you (based on previous use of their website, your profile experience, etc.).
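
The prerequisites above can be checked before the browser is launched. The following is a minimal sketch, not part of the original notebook: the function name `missing_prerequisites` and its default arguments are assumptions made for illustration, and it only verifies that the files exist and the libraries can be found, not that they are the right versions.

```python
import importlib.util
import os

def missing_prerequisites(folder,
                          files=('chromedriver.exe', 'search_terms.csv'),
                          modules=('pandas', 'selenium')):
    """Return a list of problems found; an empty list means every check passed."""
    problems = []
    # check that each required file sits in the given folder
    for name in files:
        if not os.path.isfile(os.path.join(folder, name)):
            problems.append('missing file: ' + name)
    # check that each required library can be located without importing it
    for mod in modules:
        if importlib.util.find_spec(mod) is None:
            problems.append('missing library: ' + mod)
    return problems
```

Calling `missing_prerequisites(os.getcwd())` in the first cell and printing the result would surface a missing file or library before any scraping starts.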

In [None]:
##import libraries and inputs
import pandas as pd
import os
from selenium import webdriver
import time
#os.chdir('C:\\Users\\craig\\OneDrive - Dufrain Consulting\\analytics_craig\\tasks\\201804\\linkedin_scraping')
import random
import re
#import linkedin_creds #this needs phased out

#Switch jupyter to enable multiple outputs per cell
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

#Change Jupyter cell width to 100% of browser
from IPython.display import display, HTML
display(HTML("<style> .container { width:100% !important; }</style>"))
#display(HTML('<style> div.prompt {display:none} </style>'))

In [None]:
#define function for scrolling down a page (that will be used at least twice in this notebook)
def scrolljs():
    pg_height=driver.execute_script('return document.body.scrollHeight') #get height of page
    scroll_step=pg_height/5 #define size of scroll increments
    for i in range(1,6): #initiate the loop
        #print(i) #print current iteration (for debugging only)
        driver.execute_script('window.scrollTo(0,'+str(scroll_step*i)+');') #scroll another 1/5th down
        time.sleep(.5) #give the js a chance to load html content
    driver.execute_script('window.scrollTo(0,0);') #return to top of page

### Log in to the Website
Use the webdriver to open the website and log in with the user's credentials
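
Typing a password into a plain `input` prompt leaves it visible in the notebook output. One possible alternative, sketched here under assumptions, is to read the credentials from environment variables and fall back to a hidden prompt. The variable names `LINKEDIN_USER` and `LINKEDIN_PASS` and the helper `get_credentials` are hypothetical and do not appear in the original notebook.

```python
import getpass
import os

def get_credentials(env=os.environ, prompt_user=input, prompt_secret=getpass.getpass):
    """Return (username, password), preferring environment variables over prompts."""
    # LINKEDIN_USER / LINKEDIN_PASS are hypothetical variable names chosen for this sketch
    username = env.get('LINKEDIN_USER') or prompt_user('what is your username? ')
    # getpass.getpass hides the typed characters instead of echoing them
    password = env.get('LINKEDIN_PASS') or prompt_secret('what is your password? ')
    return username, password
```

A call such as `username_input, password_input = get_credentials()` would then supply the two strings that the login cell sends to the form fields.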

In [None]:
username_input=input('what is your username?') #prompt user to input their username
password_input=input('what is your password?') #prompt user to input their password

In [None]:
#Open LinkedIn with chrome webdriver and login
#note that webdriver must be installed

#initiate the webdriver
options=webdriver.ChromeOptions()
options.add_argument('--ignore-certificate-errors-spki-list')
options.add_argument('--ignore-ssl-errors')
driver=webdriver.Chrome(chrome_options=options)

#login to linkedin
driver.get('https://www.linkedin.com/') #navigate to the page
username=driver.find_element_by_id("login-email") #username form field
password=driver.find_element_by_id("login-password") #password form field
username.send_keys(username_input) #send username string to username field
password.send_keys(password_input) #send pw to pw field
driver.find_element_by_id("login-submit").click() #locate and click submit button

In [None]:
city=input('What city/region do you want to work in?')

driver.get('https://www.linkedin.com/jobs/search/?location='+city) #go to search results for your city

### Gather Relevant Companies
After spending time on Glassdoor and looking around the web for other credible sources of company data by region, I concluded that LinkedIn itself is a good source: search for jobs by region, then scrape the company that posted each job ad.
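
Because the same company usually posts many jobs, the scraped names need cleaning before they are saved. The sketch below shows one way to do it with plain Python; `clean_company_names` is a hypothetical helper, and unlike `list(set(...))` it keeps the order in which companies were first seen.

```python
def clean_company_names(raw_names):
    """Strip whitespace, drop empty names, and remove duplicates while keeping order."""
    seen = set()
    cleaned = []
    for name in raw_names:
        name = name.strip()  # scraped text can carry stray whitespace
        if name and name not in seen:  # skip blanks and names already collected
            seen.add(name)
            cleaned.append(name)
    return cleaned
```

Keeping first-seen order means the companies LinkedIn ranked as most relevant stay at the top of the saved file.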

In [None]:
split_btn=driver.find_element_by_xpath('//*[@class="jobs-search-dropdown__trigger-icon"]')
split_btn.click()
try:
    classic_btn=driver.find_element_by_xpath('//*[@class="jobs-search-dropdown__option-button jobs-search-dropdown__option-button--single "]')
    classic_btn.click()
except:
    print('already in classic mode')

In [None]:
results_end=False #initiate results_end as False
companies_list=[] #create empty companies list container
p=0 #initiate page at 0
while results_end==False:
    p=p+1
    print('starting page='+str(p))
    scrolljs() #scroll down so that all job cards on the page are loaded
    companies=driver.find_elements_by_xpath('//h4[@class="job-card-search__company-name"]') #find all companies on page

    #harvest all companies on page
    for c in companies:
        companies_list.append(c.text)

    #click 'next' results
    try: #see if the 'Next' link is found on the current page
        nextpg_btn=driver.find_element_by_xpath('//*[@class="artdeco-pagination__button artdeco-pagination__button--next artdeco-button artdeco-button--muted artdeco-button--icon-right artdeco-button--1 artdeco-button--tertiary ember-view"]')
        nextpg_btn.click()
        time.sleep(random.randint(2,10))
    except: #if the 'Next' link can't be found, set results_end to True so the while loop stops
        results_end=True

In [None]:
#Deduplicate and save scraped companies to a csv file
companies_list=list(set(companies_list)) #deduplicate the list
companies_list=[c for c in companies_list if c!=''] #get rid of the '' companies
pd.DataFrame({'company':companies_list}).to_csv('companies.csv',index=False, header=False) #export to csv

### Loop through Companies to Find People
For each company:
1. Find Relevant People in the Company (using search terms for tagline/title)
2. View their Profile if they are in your city of interest
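
Company names and search terms are placed directly into URLs in the loop below, so names containing characters such as `&` or spaces may be misread by the server. A minimal sketch of percent-encoding them first, using the standard library's `urllib.parse.quote`, is given here. The helper names `company_search_url` and `add_title_filter` are assumptions for illustration; the URL pattern is the one used in the notebook.

```python
from urllib.parse import quote

def company_search_url(company):
    """Build the company search URL with the company name percent-encoded."""
    return ('https://www.linkedin.com/search/results/companies/?keywords='
            + quote(company, safe='')  # safe='' also encodes '/' inside names
            + '&origin=SWITCH_SEARCH_VERTICAL')

def add_title_filter(results_url, title):
    """Append a percent-encoded title filter to an existing results URL."""
    return results_url + '&title=' + quote(title, safe='')
```

For example, `company_search_url('Marks & Spencer')` encodes the ampersand as `%26`, so it is not mistaken for a separator between query parameters.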

In [None]:
#get search terms for the scrape
df=pd.read_csv('search_terms.csv',header=None) #read csv
search_terms=df.iloc[:,0].tolist() #convert to list

#get companies to scrape
df=pd.read_csv('companies.csv',header=None) #read csv
companies=df.iloc[:,0].tolist() #convert to list

#create a repository dataframe to which all scraped data will be written
repo=pd.DataFrame()

#Loop through companies in list (LOOP #1.0)
for company in companies:

    url_company_search='https://www.linkedin.com/search/results/companies/?keywords='+company+'&origin=SWITCH_SEARCH_VERTICAL'
    driver.get(url_company_search) #navigate to the search results page

    #locate first element in search results
    link=driver.find_element_by_xpath("//div[@class='blended-srp-results-js pt0 pb4 ph0 container-with-shadow']/ul/li[1]/div/div/div[2]/a")
    company_link=link.get_attribute("href") #get the target of the link

    #store name and industry of the company 
    company_name=driver.find_element_by_xpath("//div[@class='blended-srp-results-js pt0 pb4 ph0 container-with-shadow']/ul/li[1]/div/div/div[2]/a/h3").text
    try:
        company_industry=driver.find_element_by_xpath("//div[@class='blended-srp-results-js pt0 pb4 ph0 container-with-shadow']/ul/li[1]/div/div/div[2]/p").text
    except:
        company_industry='Unknown'
    driver.get(company_link) #browse to the company link's target

    #locate element with link to go to employees of the company
    link=driver.find_element_by_xpath('//span[@class="org-company-employees-snackbar__see-all-employees-link"]/a')
    employees_link=link.get_attribute("href") #get target url of emps link
    driver.get(employees_link) #go to the link's destination url

    #store text of the element string that shows # of employees globally
    employees_num=driver.find_element_by_xpath('//h3[@class="search-results__total t-14 t-black--light t-normal pl5 pt4 clear-both"]').text
    employees_num2=re.findall('[0-9,]+',employees_num)
    employees_num2=re.sub('[,]','',employees_num2[0])
    employees_link_uk=employees_link+'&facetGeoRegion=%5B"gb%3A0"%5D' #get url of UK employees
    driver.get(employees_link_uk) #browse to uk employees page

    #store text of the element string that shows # of employees in UK
    try:
        employees_num_uk=driver.find_element_by_xpath('//h3[@class="search-results__total t-14 t-black--light t-normal pl5 pt4 clear-both"]').text
    except:
        employees_num_uk='0'
    employees_num_uk2=re.findall('[0-9,]+',employees_num_uk)
    employees_num_uk2=re.sub('[,]','',employees_num_uk2[0])

    #print company's linkedin profile details to window
    print('company name='+company_name+\
          '\nindustry='+company_industry+\
          '\nlink='+company_link+\
          '\nglobal employees='+employees_num2+\
          '\nuk employees='+employees_num_uk2)
    
    
    #Loop through all search result pages and scrape the name, title, and location (LOOP #2.0)
    for t in search_terms:
        
        results_url=employees_link_uk+'&title='+t
        driver.get(results_url)
        shutdown=False
        i=0
        #Loop through all result pages for this search iteration (LOOP #3.0)
        while shutdown==False:

            #store current url for later use
            url_current=driver.current_url

            #find all employees in page
            scrolljs()
            try:
                #find names on the page (or LinkedIn Member which has a different class as shown below)
                emps=driver.find_elements_by_xpath('//span[@class="name actor-name"]|//span[@class="actor-name"]')
            except:
                #prepare for next iteration or stop if no more result pages!
                try:
                    driver.find_element_by_xpath('//button[@class="artdeco-pagination__button artdeco-pagination__button--next artdeco-button artdeco-button--muted artdeco-button--icon-right artdeco-button--1 artdeco-button--tertiary ember-view"]').click()
                    time.sleep(random.randint(4,15))
                    i=i+1
                except:
                    shutdown=True

                continue
                
            emps_list=[]

            for emp in emps: #loop #3.1
                emps_list.append(emp.text.strip())
            emps_list

            #find all taglines/titles in url string
            jtitles=driver.find_elements_by_xpath('//p[@class="subline-level-1 t-14 t-black t-normal search-result__truncate"]')
            jtitles_list=[]

            for jtitle in jtitles: #loop #3.2
                jtitles_list.append(jtitle.text.strip())
            jtitles_list

            #find all locations in url string
            locations=driver.find_elements_by_xpath('//p[@class="subline-level-2 t-12 t-black--light t-normal search-result__truncate"]')
            locations_list=[]
            for location in locations: #loop #3.3
                locations_list.append(location.text.strip())
            locations_list

            #find all employee url links in page
            plinks=driver.find_elements_by_xpath('//div[@class="search-result__info pt3 pb4 ph0"]/a')
            urls_list=[]

            for plink in plinks: #loop #3.4
                url=plink.get_attribute('href')
                urls_list.append(url)
            urls_list
            
            j=0
            for url in urls_list: #loop #3.5
                #view url/profile if in city of interest
                if city in locations_list[j]:
                    driver.get(url)
                    time.sleep(random.randint(5,15))
                j=j+1

            #combine these lists into a dataframe
            pgdata=pd.DataFrame({'company':company,
                               'employee':emps_list,
                               'job_tagline':jtitles_list,
                               'location':locations_list,
                               'profile_url':urls_list})
            repo=pd.concat([repo,pgdata]) #DataFrame.append was removed in pandas 2.0

            #print info to log
            print('criteria=("'+t+'")'+\
                  ' iteration='+str(i)+' '+\
                  str(time.strftime('%H:%M:%S'))+\
                  ' rows added='+str(len(pgdata))+\
                  ' repo.shape='+str(repo.shape))

            #return to parent results page
            driver.get(url_current)
            scrolljs()

            #prepare for next iteration or stop if no more result pages!
            try:
                driver.find_element_by_xpath('//button[@class="artdeco-pagination__button artdeco-pagination__button--next artdeco-button artdeco-button--muted artdeco-button--icon-right artdeco-button--1 artdeco-button--tertiary ember-view"]').click()
                time.sleep(random.randint(4,15))
                i=i+1
            except:
                shutdown=True

### Tidy up and Write Results to Excel
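
The country column in the cell below is derived from the trailing run of letters and spaces in each location string. As a rough illustration of what that pattern captures, here is the same regular expression applied with the `re` module; `country_from_location` is a hypothetical helper, and the sample locations in the note after it are made up.

```python
import re

def country_from_location(location):
    """Return the trailing run of word characters and spaces, or None if there is none."""
    match = re.search(r'([\w ]+)$', location)
    return match.group(1).strip() if match else None
```

A location such as `'London, United Kingdom'` yields `'United Kingdom'`, while a location with no comma, such as `'Greater Manchester Area'`, is returned whole, so the column is only an approximation of the country.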

In [None]:
#quit driver
driver.quit()

#get country from location field
repo['country']=repo.location.str.extract(r'([\w ]+)$',expand=False).str.strip() #expand=False returns a Series, so .str works
#deduplicate results
repo.drop_duplicates(keep='first',inplace=True)

#output to excel (with urls click-ready)
#create directory for outputs if doesn't exist
os.makedirs('outputs',exist_ok=True)
#build the path once so the saved file and the printed message always match
out_path=os.path.join('outputs',time.strftime('%d%b%y_%H%M')+'_leadprospects.xlsx')
#xlsxwriter converts url strings to clickable links by default (strings_to_urls=True)
#fixed sheet name: company names can exceed Excel's 31-character sheet name limit
repo.to_excel(out_path,
              sheet_name='lead_prospects',
              columns=['company','employee','job_tagline','location','country',
                       'profile_url'],
              index=False,
              engine='xlsxwriter')
print('To see the results, check the file that has just been saved to: '+out_path)
input('press Enter to close this window')