## Web Scraping with Beautiful Soup and Python

##### Scrape HTML Content from a Page

The `requests` library lets us send HTTP requests from Python. We can use it to download the HTML of a web page.

In [4]:
# import requests and BeautifulSoup
import requests
from bs4 import BeautifulSoup

# get url
url = 'https://www.monster.com/jobs/search/?q=Data-Scientist&where=Seattle-WA&rad=200&tm=14'
page = requests.get(url)

# create BeautifulSoup object
soup = BeautifulSoup(page.content, 'html.parser')
print('Beautiful Soup object created!')

Beautiful Soup object created!
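
The cell above assumes the request succeeds. If the server answers with an error status or never responds, `page.content` will not hold the expected HTML. The following is a minimal sketch of a more defensive fetch. The helper name `fetch_soup` and the 10-second default timeout are assumptions made for illustration, not part of the original notebook.

```python
import requests
from bs4 import BeautifulSoup


def fetch_soup(url, timeout=10):
    """Download a page and parse it (hypothetical helper, for illustration)."""
    # timeout stops the call from waiting indefinitely on a slow server
    response = requests.get(url, timeout=timeout)
    # raise requests.HTTPError for 4xx/5xx responses instead of parsing an error page
    response.raise_for_status()
    return BeautifulSoup(response.content, 'html.parser')
```

With this helper, `soup = fetch_soup(url)` could stand in for the separate `requests.get` and `BeautifulSoup` calls.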


In [6]:
# find specific element by ID
results = soup.find(id='ResultsContainer')

We can call `.prettify()` on any Beautiful Soup object before printing it. The method returns the object's HTML as an indented string, which here is everything contained within the `<div>` whose ID is `ResultsContainer`.

In [None]:
# prettify Beautiful Soup object
print(results.prettify())
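
Note that `find()` returns `None` when no element matches, so calling `.prettify()` on the result would raise an `AttributeError` if the page layout changed. The sketch below shows this behaviour on a small, made-up HTML snippet (the markup and the ID `NoSuchId` are assumptions for illustration only).

```python
from bs4 import BeautifulSoup

# Hypothetical HTML standing in for a downloaded page
html = """
<html><body>
  <div id="ResultsContainer"><section class="card-content">Job A</section></div>
</body></html>
"""
soup = BeautifulSoup(html, 'html.parser')

results = soup.find(id='ResultsContainer')  # a Tag object
missing = soup.find(id='NoSuchId')          # None, because nothing matches

# Guard against None before calling Tag methods
if results is not None:
    print(results.prettify())
```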

##### Find Elements by HTML Class Name

Each job posting is wrapped in a `<section>` element with the class `card-content`, so we can select only the job postings from the Beautiful Soup object called `results`.

In [None]:
# selecting job postings
job_elems = results.find_all('section', class_='card-content')

for job_elem in job_elems:
    print(job_elem, end='\n'*2)
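
As a rough illustration of how class matching behaves, the sketch below runs `find_all` on a small, made-up snippet and compares it with the equivalent CSS selector passed to `select()`. The HTML, including the extra `featured` class, is an assumption for demonstration purposes.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: two job cards and one unrelated section
html = """
<div id="ResultsContainer">
  <section class="card-content featured"><h2 class="title">Job A</h2></section>
  <section class="card-content"><h2 class="title">Job B</h2></section>
  <section class="other"><h2 class="title">Not a job card</h2></section>
</div>
"""
results = BeautifulSoup(html, 'html.parser')

# class_ matches any element whose class list contains 'card-content'
by_class = results.find_all('section', class_='card-content')

# The same query written as a CSS selector
by_selector = results.select('section.card-content')

titles = [section.h2.text for section in by_class]
```

Both calls should return the same two sections. The first card matches even though it carries a second class.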

In [31]:
# use find to extract elements by HTML class name
for job_elem in job_elems:
    title_elem = job_elem.find('h2', class_='title')
    company_elem = job_elem.find('div', class_='company')
    location_elem = job_elem.find('div', class_='location')
    print(title_elem)
    print(company_elem)
    print(location_elem)
    # retrieve the links in each job posting
    for link in job_elem.find_all('a'):
        print(link.get('href'))
    print()

<h2 class="title"><a data-bypass="true" data-m_impr_a_placement_id="JSR2CW" data-m_impr_j_cid="559" data-m_impr_j_coc="" data-m_impr_j_jawsid="462237602" data-m_impr_j_jobid="0" data-m_impr_j_jpm="1" data-m_impr_j_jpt="2" data-m_impr_j_lat="47.6063" data-m_impr_j_lid="647" data-m_impr_j_long="-122.3308" data-m_impr_j_occid="11866" data-m_impr_j_p="1" data-m_impr_j_postingid="565d7e3d-2733-4ee7-b3d8-cbe1385b886e" data-m_impr_j_pvc="appcastpx2" data-m_impr_s_t="t" data-m_impr_uuid="435fc66f-ee45-44d5-bf83-dd725fd14050" href="https://job-openings.monster.com/cmts-ic5-ml-cognitive-services-nlp-data-scientist-joinoci-seattle-wa-us-oracle/565d7e3d-2733-4ee7-b3d8-cbe1385b886e" onclick="clickJobTitle('plid=647&amp;pcid=559&amp;poccid=11866','Data Scientist',''); clickJobTitleSiteCat('{&quot;events.event48&quot;:&quot;true&quot;,&quot;eVar25&quot;:&quot;CMTS IC5 - ML Cognitive Services NLP Data Scientist (JoinOCI)&quot;,&quot;eVar66&quot;:&quot;Monster&quot;,&quot;eVar67&quot;:&quot;JSR2CW&quot

##### Extract Text from HTML Elements

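To extract the job title, company name, and job location, we add `.text` to each Beautiful Soup element, which returns only the **text content** of the HTML elements it contains. Calling `.strip()` on the result removes the surrounding whitespace, and postings where any of the three elements is missing are skipped.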

In [35]:
# extract text from HTML elements
for job_elem in job_elems:
    title_elem = job_elem.find('h2', class_='title')
    company_elem = job_elem.find('div', class_='company')
    location_elem = job_elem.find('div', class_='location')
    if None in (title_elem, company_elem, location_elem):
        continue
    print(title_elem.text.strip())
    print(company_elem.text.strip())
    print(location_elem.text.strip())
    # retrieve the links in each job posting
    for link in job_elem.find_all('a'):
        print(link.get('href'))
    print()

CMTS IC5 - ML Cognitive Services NLP Data Scientist (JoinOCI)
Oracle
Seattle, WA
https://job-openings.monster.com/cmts-ic5-ml-cognitive-services-nlp-data-scientist-joinoci-seattle-wa-us-oracle/565d7e3d-2733-4ee7-b3d8-cbe1385b886e

Senior Data Scientist
XPO Logistics
Portland, OR
https://job-openings.monster.com/senior-data-scientist-portland-or-us-xpo-logistics/222108792

Data Scientist
The Judge Group
Redmond, WA
https://job-openings.monster.com/data-scientist-redmond-wa-us-the-judge-group/222342189

Senior Software Dev Engineer, Data Science and Machine Learning
Expedia.com
Seattle, WA
https://job-openings.monster.com/senior-software-dev-engineer-data-science-machine-learning-seattle-wa-us-expedia-com/200c0f79-0c63-45bb-9636-b668969a4e62

VP of Product Management Data Science
Behavox
Seattle, WA
https://job-openings.monster.com/vp-of-product-management-data-science-seattle-wa-us-behavox/a82d4e97-4a2b-4629-b774-e84e34602504

Principal Cognitive-Decision Data Scientist (JoinOCI)
Oracle
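
To make the difference between the raw and the stripped text concrete, here is a small sketch on made-up markup. It also shows `get_text(strip=True)` as an alternative to `.text.strip()`, and one possible way to fall back to a placeholder instead of skipping a posting. The HTML and the `'N/A'` placeholder are assumptions for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical job card with extra whitespace and no location element
html = """
<section class="card-content">
  <h2 class="title"><a href="https://example.com/job/1">
    Data Scientist
  </a></h2>
  <div class="company"><span>Example Corp</span></div>
</section>
"""
card = BeautifulSoup(html, 'html.parser').find('section', class_='card-content')

title_elem = card.find('h2', class_='title')
raw = title_elem.text                    # keeps the newlines and indentation
clean = title_elem.get_text(strip=True)  # whitespace removed

# find() returns None for the missing element, so check before reading text
location_elem = card.find('div', class_='location')
location = location_elem.get_text(strip=True) if location_elem is not None else 'N/A'

print(repr(raw))
print(clean, '|', location)
```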

##### Extract Attributes from HTML Elements

In [37]:
# alternative: filter headings by their text, then read each link's href attribute
# (the `text and` guard skips headings whose .string is None)
data_scientist_jobs = results.find_all('h2', string=lambda text: text and 'data scientist' in text.lower())

for ds_job in data_scientist_jobs:
    link = ds_job.find('a')['href']
    print(ds_job.text.strip())
    print(f'Apply here: {link}\n')

CMTS IC5 - ML Cognitive Services NLP Data Scientist (JoinOCI)
Apply here: https://job-openings.monster.com/cmts-ic5-ml-cognitive-services-nlp-data-scientist-joinoci-seattle-wa-us-oracle/565d7e3d-2733-4ee7-b3d8-cbe1385b886e

Senior Data Scientist
Apply here: https://job-openings.monster.com/senior-data-scientist-portland-or-us-xpo-logistics/222108792

Data Scientist
Apply here: https://job-openings.monster.com/data-scientist-redmond-wa-us-the-judge-group/222342189

Principal Cognitive-Decision Data Scientist (JoinOCI)
Apply here: https://job-openings.monster.com/principal-cognitive-decision-data-scientist-joinoci-seattle-wa-us-oracle/2aaf5b2e-016e-41e6-8869-d93c43a03e09

Principal "Cognitive" Data Scientist (JoinOCI)
Apply here: https://job-openings.monster.com/principal-cognitive-data-scientist-joinoci-seattle-wa-us-oracle/4f3e3441-d08e-4fd9-b0a6-928be9d49d1b

Data Scientist Product Strategy
Apply here: https://job-openings.monster.com/data-scientist-product-strategy-seattle-wa-us-stri
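
The sketch below isolates the two ideas used in this cell: filtering by string with a function, and reading an attribute from a tag. It runs on made-up markup, and the helper name `mentions_data_scientist` is an assumption for illustration. Square-bracket access such as `tag['href']` raises a `KeyError` when the attribute is absent, whereas `tag.get('href')` returns `None`.

```python
from bs4 import BeautifulSoup

# Hypothetical headings: one with a link, one without an href, one empty
html = """
<h2 class="title"><a href="https://example.com/job/1" data-bypass="true">Data Scientist</a></h2>
<h2 class="title"><a>Data Scientist (no link)</a></h2>
<h2 class="title"></h2>
"""
soup = BeautifulSoup(html, 'html.parser')


def mentions_data_scientist(text):
    # An empty heading has no string, so guard against None
    return text is not None and 'data scientist' in text.lower()


matches = soup.find_all('h2', string=mentions_data_scientist)

# .get() tolerates a missing attribute; ['href'] would raise KeyError here
links = [h2.find('a').get('href') for h2 in matches]

# .attrs exposes all attributes of a tag as a dictionary
first_attrs = matches[0].find('a').attrs
```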