# Introduction to Web Scraping with Python

#### CEMFI Undergraduate Summer Internship 2021

#### Instructor: Cay

# Course Outline

1. Introduction: Python and Jupyter Notebooks
1. Web Scraping Example 1
1. **Web Scraping: Example 2** 

### Our goal

In this last hour of class we will deal with a case that requires a bit more work than our previous example.

In particular, we will **focus** on how we can interact with the webpage (i.e. as if someone were using the browser).

### Selenium

Selenium is a library that automates the interaction with web browsers. 


It can be used to validate web applications across different browsers and platforms. 


It can also be used to do web scraping to extract useful information contained in webpages.

### When will we need Selenium?

With yesterday's approach (i.e. `requests`) we cannot interact with the page in a flexible way.


We could only change the `url` and then request its content. 


With Selenium, we can accomplish different tasks, such as:
- Scrolling
- Clicking on buttons
- Filling forms
- etc...

## Example: jobs from LinkedIn

To illustrate the use of Selenium let's suppose that our goal is to scrape jobs from LinkedIn.

More specifically, suppose we are interested in recent offers posted for the city of Madrid.

### Visual Inspection

We first go to the website to check what it looks like.

One thing we want to pay attention to is the behavior of the **url**. 

Note that it changes when we choose a given filter, but not when we scroll down the page.
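
To make the role of the url concrete, the sketch below splits the search url used in this example into its query parameters with Python's standard library. Each filter chosen on the page shows up as one `key=value` pair. What each LinkedIn parameter means is not documented here, so any interpretation of names such as `f_TPR` would be a guess.

```python
from urllib.parse import urlsplit, parse_qs

# the search url used in this example, split over several lines for readability
url = (
    "https://www.linkedin.com/jobs/jobs-in-madrid"
    "?keywords=&location=Madrid%2C%20Community%20of%20Madrid%2C%20Spain"
    "&locationId=&geoId=100994331&sortBy=R&f_TPR=r604800&f_PP=103374081"
    "&f_JT=F&f_E=4&position=1&pageNum=0"
)

# everything after "?" is the query string, where the filters live
query = urlsplit(url).query

# parse_qs returns a list of values per key; each key appears once here, so keep the first
filters = {key: values[0] for key, values in parse_qs(query, keep_blank_values=True).items()}

for key, value in filters.items():
    print(f"{key:12} = {value!r}")
```

Changing a filter in the browser changes one of these values, whereas scrolling leaves all of them untouched.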

### `requests` is not enough

Let's do exactly what we did in the previous example to illustrate what we would be missing.

In [None]:
# Again, in Python we do many imports
import requests

In [None]:
# send get request
url = "https://www.linkedin.com/jobs/jobs-in-madrid?keywords=&location=Madrid%2C%20Community%20of%20Madrid%2C%20Spain&locationId=&geoId=100994331&sortBy=R&f_TPR=r604800&f_PP=103374081&f_JT=F&f_E=4&position=1&pageNum=0"
result = requests.get(url)

In [None]:
# check status code
result.status_code

In [None]:
# that's the standard way to import BeautifulSoup
from bs4 import BeautifulSoup

In [None]:
# BeautifulSoup parses the HTML content and python now understands the object "soup"
source_code = BeautifulSoup(result.content, 'html.parser')

In [None]:
# container with all jobs
results_list = source_code.find("ul", {"class":"jobs-search__results-list"})

In [None]:
# find all jobs inside the container
job_postings = results_list.find_all("li")

In [None]:
# first job listed
job_postings[0]

In [None]:
# how many jobs?
len(job_postings)

#### Why is the number of jobs so small?

That's because the HTML code of this url only has information on the first jobs that appear in the search page.

Let's go back to the website and see what's happening.

Since the url does not change, we need to go beyond `requests`.

## Selenium

### Setup

#### Installing Selenium 

We will need to install selenium, which can be done by following the same approach we took yesterday: `pip install`

In [None]:
%pip install selenium

#### Downloading the webdriver

To work with Selenium, another crucial ingredient is to have a **webdriver**. 

The **webdriver** is what will be automatically opening up your browser and interacting with it. 

The specific webdriver you need depends on which browser you prefer to use. 

Here I show how to do it with *Google Chrome*, but you can download webdrivers for other browsers as well.

For Google Chrome, you can download the webdriver at: https://chromedriver.chromium.org/downloads. 
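
Before starting the browser it can help to confirm that the webdriver file is where we think it is. The sketch below is one way to do that with the standard library. It assumes the file is called `chromedriver` (or `chromedriver.exe` on Windows); the function name `locate_chromedriver` is made up for this illustration.

```python
from pathlib import Path
import shutil


def locate_chromedriver(folder):
    """Return the path of the chromedriver file inside `folder`.

    If it is not there, fall back to searching the system PATH.
    Returns None when nothing is found.
    """
    for name in ("chromedriver", "chromedriver.exe"):
        candidate = Path(folder) / name
        if candidate.is_file():
            return str(candidate)
    # shutil.which looks for an executable with this name on the PATH
    return shutil.which("chromedriver")


# example: look inside a folder called "chromedriver" in the working directory
print(locate_chromedriver(Path.cwd() / "chromedriver"))
```

If this prints `None`, the driver was not found in that folder or on the PATH.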

In [None]:
# import packages: note that with selenium we don't use requests
from selenium import webdriver
from bs4 import BeautifulSoup
import time
from random import randint

In [None]:
# directory where you have the webdriver saved
# (here: a file "chromedriver" inside a folder "chromedriver" in the working directory)
import os
my_directory = os.path.join(os.getcwd(), "chromedriver", "chromedriver")
my_directory

In [None]:
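# access selenium driver
# (Selenium 4 syntax; with Selenium 3 use webdriver.Chrome(executable_path=my_directory))
from selenium.webdriver.chrome.service import Service
driver = webdriver.Chrome(service=Service(executable_path=my_directory))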

In [None]:
# driver loads the url we want

# use the exact same url we used with requests
driver.get(url)
time.sleep(1)

Let's see how many jobs we can scrape without scrolling down.

Notice that we are not using `requests` but we still use BeautifulSoup to parse the HTML of the page.

In [None]:
# BeautifulSoup gets the source code of the page the driver is at
source_code = BeautifulSoup(driver.page_source, "lxml") 

You could have used **"html.parser"** (like in our previous code) instead of **"lxml"**. [Here](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser) is a discussion of the advantages and disadvantages of each one. Note that **"lxml"** requires the `lxml` package to be installed.

In [None]:
# a container with all jobs
results_list = source_code.find("ul", {"class":"jobs-search__results-list"})

In [None]:
# find all jobs inside the container
job_postings = results_list.find_all("li")

In [None]:
# how many jobs?
len(job_postings)

The number is similar to before, since we only loaded the page and called `BeautifulSoup` right away.

#### Scrolling down 

We can use Selenium to scroll down the page so that it loads more jobs.

First, we do it manually to show it works.

In [None]:
# get source code after manual scroll down
source_code = BeautifulSoup(driver.page_source, "lxml") 

In [None]:
# a container with all jobs
results_list = source_code.find("ul", {"class":"jobs-search__results-list"})

In [None]:
# find all jobs inside the container
job_postings = results_list.find_all("li")

In [None]:
# how many jobs?
len(job_postings)

Yes, more jobs are loaded.

We only need an automated way to scroll down.

The cell below shows a general way to do it for any website.

In [None]:
# general way to scroll down up to when there are no more jobs to load

# time to wait for page to load correctly
scroll_pause_time = 2

# this command gets the height of the page 
last_height = driver.execute_script("return document.body.scrollHeight")

# while True is a loop that never stops (the condition is always satisfied) 
while True:
    # Scroll down to bottom
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

    # Wait to load page
    time.sleep(scroll_pause_time)

    # Calculate new scroll height and compare with last scroll height
    new_height = driver.execute_script("return document.body.scrollHeight")
    
    # test whether it is still possible to scroll down more or if it is the end
    if new_height == last_height:
        # If heights are the same it will exit the loop
        break
    last_height = new_height

The code only stops because, after some time, LinkedIn removes the scroll down option and a `See more jobs` button appears.
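
On a page that never stops loading content, the loop above would run forever. A minimal sketch of a safer variant is to wrap it in a function with a cap on the number of scrolls. The function name `scroll_to_bottom` and the parameter `max_rounds` are made up for this illustration; the only thing assumed about `driver` is that it has the `execute_script` method used above.

```python
import time


def scroll_to_bottom(driver, pause=2.0, max_rounds=50):
    """Scroll down until the page height stops growing or max_rounds is reached.

    Returns the number of scrolls that actually made the page longer.
    """
    get_height = "return document.body.scrollHeight"
    last_height = driver.execute_script(get_height)
    rounds = 0

    while rounds < max_rounds:
        # scroll to the bottom and give the page time to load
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)

        # if the height did not change, there was nothing more to load
        new_height = driver.execute_script(get_height)
        if new_height == last_height:
            break

        last_height = new_height
        rounds += 1

    return rounds
```

In the notebook this would be called as `scroll_to_bottom(driver)`, and the returned number tells us how many times new content was loaded.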

Let's check whether we have more job listings than before in the source code.

In [None]:
# get source code after automated scroll down
source_code = BeautifulSoup(driver.page_source, "lxml") 

In [None]:
# a container with all jobs
results_list = source_code.find("ul", {"class":"jobs-search__results-list"})

In [None]:
# find all jobs inside the container
job_postings = results_list.find_all("li")

In [None]:
# how many jobs?
len(job_postings)

Yes, we do have more jobs.

How can we adjust the code to get even more jobs?

#### Clicking on buttons

Notice that there is a `See more jobs` button at the end of the page.

We can use Selenium to click on it so that more jobs appear.

First, let's do it manually so that we make sure this action delivers the result we want (i.e. more jobs in the source HTML).

In [None]:
# get source code after manually clicking the button
source_code = BeautifulSoup(driver.page_source, "lxml") 

In [None]:
# a container with all jobs
results_list = source_code.find("ul", {"class":"jobs-search__results-list"})

In [None]:
# find all jobs inside the container
job_postings = results_list.find_all("li")

In [None]:
# how many jobs?
len(job_postings)

Yes, the number of jobs does increase.

We can automate this action with the following code:

In [None]:
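# one of the ways to identify elements in the page (xpath)
# (find_element_by_xpath was removed in recent Selenium versions; find_element(By.XPATH, ...) is the equivalent call)
from selenium.webdriver.common.by import By
see_more_button = driver.find_element(By.XPATH, "//button[@aria-label='Load more results']")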

In [None]:
# the object we just created 
see_more_button

Now we can click on it as many times as we want. As an example, here I click on it three times.

In [None]:
# as an example, I click on it three times
for i in range(3):
    see_more_button.click()
    time.sleep(randint(1,2))
    # Scroll down to bottom
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(randint(1,2))

Let's check whether we have even more jobs now.

In [None]:
# get source code after the automated clicks
source_code = BeautifulSoup(driver.page_source, "lxml") 

In [None]:
# a container with all jobs
results_list = source_code.find("ul", {"class":"jobs-search__results-list"})

In [None]:
# find all jobs inside the container
job_postings = results_list.find_all("li")

In [None]:
# how many jobs?
len(job_postings)

Yes, we do. 

In general, what we want is to:
- Scroll down until `See more jobs` button appears.
- Click on `See more jobs` until it is no longer possible.

Below is a way to click on a button as long as it exists (and is clickable).

If the element is not found or not clickable the code stops and throws an error.

In [None]:
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

In [None]:
while True:
    
    # this explicitly waits for the element to be clickable
    see_more_button = WebDriverWait(driver, 4).until(EC.element_to_be_clickable((By.XPATH, "//button[@aria-label='Load more results']")))
    time.sleep(randint(1,2))    
    driver.execute_script("arguments[0].click();", see_more_button)
    
    # Scroll down to bottom just to facilitate visualization
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(randint(1,2))

We get the error `TimeoutException`.

This tells us that, even after waiting for as long as we allowed Selenium to wait, no clickable element matching `"//button[@aria-label='Load more results']"` was found.
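
To see where this error comes from, it helps to picture what an explicit wait does: it checks a condition again and again until the condition holds or the time runs out. The sketch below illustrates that idea with the standard library only. It is not Selenium's actual implementation; `wait_until` and `poll_every` are names made up for this illustration, and Python's built-in `TimeoutError` stands in for Selenium's `TimeoutException`.

```python
import time


def wait_until(condition, timeout=4.0, poll_every=0.5):
    """Call condition() repeatedly until it returns something truthy.

    Returns that value, or raises TimeoutError once `timeout` seconds have passed.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        # wait a little before checking again
        time.sleep(poll_every)
```

In the loop above, the condition is "the button exists and is clickable". Once LinkedIn stops showing the button, the condition is never met, so the wait ends with a timeout error.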

Now we should have all jobs listed that fulfill our initial search parameters.

In [None]:
# get source code after loading all jobs
source_code = BeautifulSoup(driver.page_source, "lxml") 

In [None]:
# a container with all jobs
results_list = source_code.find("ul", {"class":"jobs-search__results-list"})

In [None]:
# find all jobs inside the container
job_postings = results_list.find_all("li")

In [None]:
# how many jobs?
len(job_postings)

One last adjustment: instead of letting the code stop with an error message, we can tell it to simply exit the while loop when there is no `See more jobs` button left on the page.

This can be achieved with *exception handling* (`try` / `except`).

In [None]:
from selenium.common.exceptions import TimeoutException

In [None]:
while True: 
    
    # on each iteration try to do this
    try: 
        # this explicitly waits for the element to be clickable
        see_more_button = WebDriverWait(driver, 4).until(EC.element_to_be_clickable((By.XPATH, "//button[@aria-label='Load more results']")))
        time.sleep(randint(1,2))    
        driver.execute_script("arguments[0].click();", see_more_button)

        # Scroll down to bottom just to facilitate visualization
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(randint(1,2))
    
    # if it encounters TimeoutException do not throw error, simply exit the while loop
    except TimeoutException:
        break

# Scroll down to bottom
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
time.sleep(randint(1,2))

Other exceptions are also common and useful to know. I list them below.

In [None]:
from selenium.common.exceptions import NoSuchElementException
from selenium.common.exceptions import ElementNotInteractableException
from selenium.common.exceptions import ElementClickInterceptedException
from selenium.common.exceptions import StaleElementReferenceException

The more exceptions you catch, the less likely it is that the code will *break*.

On the other hand, the code becomes more of a *black box* and you may lose data without noticing.

Thus, it is recommended to only include the exceptions that are indeed needed for each particular application.
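
One way to keep the code from becoming a black box is to catch only the exceptions we expect and to record why the loop ended. The sketch below shows the pattern with the standard library only. The names `repeat_until` and `make_flaky_action` are made up for this illustration, and the built-in `TimeoutError` stands in for Selenium's `TimeoutException`.

```python
def repeat_until(action, stop_on, max_rounds=100):
    """Call action() until it raises one of the exceptions in `stop_on`.

    Returns (number of successful calls, reason for stopping).
    Any exception not listed in `stop_on` is NOT caught, so we still notice it.
    """
    rounds = 0
    while rounds < max_rounds:
        try:
            action()
        except stop_on as error:
            return rounds, type(error).__name__
        rounds += 1
    return rounds, "max_rounds reached"


def make_flaky_action(successes):
    """Hypothetical stand-in for a button click: works `successes` times, then times out."""
    remaining = [successes]

    def action():
        if remaining[0] == 0:
            raise TimeoutError("no button found")
        remaining[0] -= 1

    return action


clicks, reason = repeat_until(make_flaky_action(3), stop_on=(TimeoutError,))
print(clicks, reason)  # 3 TimeoutError
```

With Selenium, `action` would be a function that waits for the button and clicks it, and `stop_on` would be `(TimeoutException,)`.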

### Obtain information from each job

The focus here is how to extract information, not what to do with it later on.

Following this principle, the most natural piece of information of each job we should extract is its url.

Once we have all job urls we can export a csv file to be used later on.

In [None]:
# example of job url
job_postings[0]

In [None]:
# get only the url with href attribute
job_postings[0].a.get('href')

Each job has an id number associated with it.

We don't need to save the entire url, since only the id is enough to enter the page describing the job.

In [None]:
# split the url at "?ref": the first piece is the address, the second holds the remaining parameters
job_postings[0].a.get('href').split("?ref")

In [None]:
# keep only the first piece
job_postings[0].a.get('href').split("?ref")[0]

In [None]:
# split the first piece at every "-"
job_postings[0].a.get('href').split("?ref")[0].split("-")

In [None]:
# the job id is the last element
job_postings[0].a.get('href').split("?ref")[0].split("-")[-1]

In [None]:
# build a short url from the job id
job_id  = job_postings[0].a.get('href').split("?ref")[0].split("-")[-1]
job_url = "https://es.linkedin.com/jobs/view" + "/" + job_id
job_url
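
The chain of `split` calls works, but it is easy to get wrong when a url looks slightly different. As an alternative, here is a minimal sketch that uses the standard library to separate the path from the query string first. The function name `job_id_from_href` and the sample url are made up for this illustration; the sketch assumes, as above, that the id is the number at the end of the path.

```python
from urllib.parse import urlsplit


def job_id_from_href(href):
    """Return the numeric id at the end of a job url, or None if there is none."""
    # the path is the part of the url before "?"
    path = urlsplit(href).path.rstrip("/")
    # last piece of the path, e.g. "data-analyst-at-example-company-1234567890"
    last_segment = path.rsplit("/", 1)[-1]
    # whatever comes after the last "-"
    candidate = last_segment.rsplit("-", 1)[-1]
    return candidate if candidate.isdigit() else None


# hypothetical url, only to show the shape of the input
sample_href = "https://es.linkedin.com/jobs/view/data-analyst-at-example-company-1234567890?refId=abc"
print(job_id_from_href(sample_href))  # 1234567890
```

Returning `None` when no number is found makes it easy to spot listings whose url does not follow the expected pattern.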

#### Get job title and company

We can also save job title and company in our csv file.

#### Job Title

In [None]:
# find the tag that contains the job title
job_postings[0].find("h3",{"class":"base-search-card__title"})

In [None]:
# get only the text inside the tag
job_postings[0].find("h3",{"class":"base-search-card__title"}).text

In [None]:
# remove line breaks and surrounding spaces
job_postings[0].find("h3",{"class":"base-search-card__title"}).text.replace("\n","").strip()

#### Company Name

In [None]:
# find the tag that contains the company name
job_postings[0].find("h4",{"class":"base-search-card__subtitle"})

In [None]:
# get the text and remove line breaks and surrounding spaces
job_postings[0].find("h4",{"class":"base-search-card__subtitle"}).text.replace("\n","").strip()
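
Since the same cleaning step is applied to both the title and the company name, it can be put in a small helper. The sketch below is one possible version; the name `clean_text` is made up for this illustration. It differs slightly from `.replace("\n","").strip()`: it also collapses repeated spaces inside the text, and it turns a line break between two words into a single space instead of gluing the words together.

```python
def clean_text(raw):
    """Collapse all whitespace (spaces, tabs, line breaks) into single spaces."""
    # str.split() with no argument splits on any run of whitespace
    return " ".join(raw.split())


print(clean_text("\n        Data   Analyst\n      "))  # Data Analyst
```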

### Do it for all jobs

In [None]:
# list to store info
job_info_list = []


for job in job_postings:
    
    # job title
    title_tag = job.find("h3",{"class":"base-search-card__title"})
    title     = title_tag.text.replace("\n","").strip() if title_tag else ""
    
    # company name 
    company_tag = job.find("h4",{"class":"base-search-card__subtitle"})
    company     = company_tag.text.replace("\n","").strip() if company_tag else ""
    
    # job url (guard against listings without a link)
    link_tag = job.find("a")
    href     = link_tag.get('href') if link_tag else None
    job_id   = href.split("?ref")[0].split("-")[-1] if href else ""
    job_url  = "https://es.linkedin.com/jobs/view" + "/" + job_id if href else ""
     
    # save info
    job_info_list.append( [title, company, job_url]  )

In [None]:
# take a look at a few of them
job_info_list[:5]
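
A list of lists like this one maps directly onto the rows of a csv file. As a minimal sketch, the block below writes such rows with Python's built-in `csv` module, which takes care of quoting values that contain commas. The rows in `sample_rows` and the function name `rows_to_csv` are made up for this illustration.

```python
import csv
import io

# hypothetical rows with the same structure as job_info_list
sample_rows = [
    ["Data Analyst", "Example, S.A.", "https://es.linkedin.com/jobs/view/1234567890"],
    ["Economist", "Another Example", "https://es.linkedin.com/jobs/view/1234567891"],
]


def rows_to_csv(rows, header=("job_title", "company_name", "job_url")):
    """Return the rows as csv text, with a header line on top."""
    buffer = io.StringIO(newline="")
    writer = csv.writer(buffer)
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


print(rows_to_csv(sample_rows))
```

To write to disk instead, the same `csv.writer` can be given a file opened with `open("linkedin_jobs.csv", "w", newline="")`.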

## Put it all together

In [None]:
#################
#### imports ####
#################

from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.common.exceptions import TimeoutException
import pandas as pd
from bs4 import BeautifulSoup
import time
from random import randint


###################
##### control #####
###################

# record the starting time to measure how long the script takes
start_time = time.time()


#################
##### paths #####
#################

# access selenium driver (my_directory is the path to the webdriver defined above)
# (Selenium 4 syntax; with Selenium 3 use webdriver.Chrome(executable_path=my_directory))
driver = webdriver.Chrome(service=Service(executable_path=my_directory))


######################
##### source url #####
######################

# base url you want to scrape: try different cities, filters, etc.
url = "https://www.linkedin.com/jobs/jobs-in-madrid?keywords=&location=Madrid%2C%20Community%20of%20Madrid%2C%20Spain&locationId=&geoId=100994331&sortBy=R&f_TPR=r604800&f_PP=103374081&f_JT=F&f_E=4&position=1&pageNum=0"

# driver navigates to url and waits
driver.get(url)
time.sleep(randint(1,3))


##############################
##### interact with page #####
##############################

### 1. scroll down until see more button appears ###

# general way to scroll down up to when there are no more jobs to load

# time to wait for page to load correctly
scroll_pause_time = 1.5

# this command gets the height of the page 
last_height = driver.execute_script("return document.body.scrollHeight")

# while True is a loop that never stops (the condition is always satisfied) 
while True:
    # Scroll down to bottom
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

    # Wait to load page
    time.sleep(scroll_pause_time)

    # Calculate new scroll height and compare with last scroll height
    new_height = driver.execute_script("return document.body.scrollHeight")
    
    # test whether it is still possible to scroll down more or if it is the end
    if new_height == last_height:
        # If heights are the same it will exit the loop
        break
    last_height = new_height
    

### 2. click see more button until all jobs are loaded  ###
    
while True: 
    
    # on each iteration try to do this
    try: 
        # this explicitly waits for the element to be clickable
        time.sleep(randint(1,2))    
        see_more_button = WebDriverWait(driver, 4).until(EC.element_to_be_clickable((By.XPATH, "//button[@aria-label='Load more results']")))   
        driver.execute_script("arguments[0].click();", see_more_button)
        
        # Scroll down to bottom just to facilitate visualization
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(randint(1,2)) 
    
    # if it encounters TimeoutException do not throw error, simply exit the while loop
    except TimeoutException:
        break

# Scroll down to bottom (not necessary here, just to facilitate inspection)
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")


##############################
######## get job info ########
##############################

# now we're back to what we know from BeautifulSoup, done with selenium

# get source code after loading all jobs
source_code = BeautifulSoup(driver.page_source, "lxml") 

# container with all jobs in page
results_list = source_code.find("ul", {"class":"jobs-search__results-list"})

# create iterable object with all jobs in page
job_postings = results_list.find_all("li")

# list to store info
job_info_list = []

# loop over all jobs found in result
for job in job_postings:
    
    # job title
    title_tag = job.find("h3",{"class":"base-search-card__title"})
    title     = title_tag.text.replace("\n","").strip() if title_tag else ""
    
    # company name 
    company_tag = job.find("h4",{"class":"base-search-card__subtitle"})
    company     = company_tag.text.replace("\n","").strip() if company_tag else ""
    
    # job url (guard against listings without a link)
    link_tag = job.find("a")
    href     = link_tag.get('href') if link_tag else None
    job_id   = href.split("?ref")[0].split("-")[-1] if href else ""
    job_url  = "https://es.linkedin.com/jobs/view" + "/" + job_id if href else ""
     
    # save info
    job_info_list.append( [title, company, job_url]  )
    

    
######################
##### export csv #####
######################

# 1. to dataframe
# transform our list into a Pandas dataframe object
df = pd.DataFrame(job_info_list, columns=['job_title', 'company_name', 'job_url'] )
# 2. to csv
df.to_csv('linkedin_jobs.csv', index=False)


###################
##### control #####
###################

time_elapsed = time.time() - start_time  # now minus when it started
minutes = int(time_elapsed / 60)         # whole minutes
seconds = time_elapsed % 60              # % is the remainder operator
print("--- %.0f minutes and %.0f seconds to scrape %.0f job listings  ---" 
      % ( minutes, seconds, len(job_info_list) ) )

In [None]:
df

In [None]:
df.company_name.nunique()

In [None]:
df.shape[0] / df.company_name.nunique()

In [None]:
df.company_name.value_counts()[:10]

In [None]:
df.loc[df.company_name=="Amazon Web Services (AWS)", "company_name"] = "Amazon"

In [None]:
df.company_name.value_counts()[:10]

In [None]:
df.company_name.value_counts(normalize=True)[:10]

In [None]:
import matplotlib.pyplot as plt

n_firms = 8

fig, ax = plt.subplots(figsize=(9,6))

firms = df.company_name.value_counts(normalize=True)[:n_firms].index

job_offers = df.company_name.value_counts(normalize=True)[:n_firms]

ax.bar(firms,job_offers)

ax.tick_params(axis='x', rotation=45, labelsize=13)
ax.set_title("Fraction of jobs by company (top employers)", size=16)

plt.show()

### What's next

Next, you could write another script that visits each scraped url and gets more specific information about each job, in order to study characteristics of local labor markets:
- Which industries are offering more jobs?
- What are the most relevant skills to get a job and how does this vary by city/country?
- Salaries.
- etc...
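
As a starting point for that next step, the sketch below loops over a list of urls and pauses for a random time between visits, in the same spirit as the `time.sleep(randint(1,2))` calls used above. The function name `visit_all` and the `fetch` argument are made up for this illustration: `fetch` is any function that takes a url and returns what you want to keep from that page.

```python
import time
from random import uniform


def visit_all(urls, fetch, min_pause=1.0, max_pause=3.0):
    """Call fetch(url) for every url, waiting a random time between visits.

    Returns a list of (url, result) pairs in the same order as `urls`.
    """
    results = []
    for url in urls:
        results.append((url, fetch(url)))
        # random pause so that requests are not sent in a rapid burst
        time.sleep(uniform(min_pause, max_pause))
    return results
```

With Selenium, `fetch` could be a function that calls `driver.get(url)` and then parses `driver.page_source` with BeautifulSoup, as we did for the search page.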

### End of course

I hope you learned something useful for your future research projects.

All the best to everyone!