# 2. Beautiful soup
Credit goes to the following tutorial: https://realpython.com/beautiful-soup-web-scraper-python/

## 2.1 Requests
Use the `requests` library to perform an HTTP GET request and retrieve the page's HTML.

In [2]:
from selenium import webdriver
from bs4 import BeautifulSoup
import pandas as pd

In [4]:
import requests

URL = 'http://pythonjobs.github.io/'
page = requests.get(URL)
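`requests.get` returns a response object even when the server answers with an error status, so it can help to check the status before parsing. A minimal sketch, assuming a hypothetical helper named `fetch_html` and an arbitrary 10-second timeout (neither is part of the original notebook):

```python
import requests


def fetch_html(url, timeout=10):
    """Return the raw response body for `url`, raising on HTTP errors.

    `fetch_html` and the 10-second default are illustrative assumptions.
    """
    response = requests.get(url, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx/5xx responses.
    response.raise_for_status()
    return response.content
```

With this helper, `fetch_html(URL)` could be passed to `BeautifulSoup` in place of `page.content`.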

In [27]:
import pprint
pp = pprint.PrettyPrinter()

**Static websites - return HTML**

As printed below.

In [28]:
pp.pprint(page.content)

(b'<!doctype html>\n<!-- https://github.com/paulirish/html5-boilerplate/blob'
 b'/master/index.html -->\n<!-- paulirish.com/2008/conditional-stylesheets-v'
 b's-css-hacks-answer-neither/ -->\n<!--[if lt IE 7 ]> <html lang="en" class'
 b'="no-js ie6"> <![endif]-->\n<!--[if IE 7 ]>    <html lang="en" class="no-'
 b'js ie7"> <![endif]-->\n<!--[if IE 8 ]>    <html lang="en" class="no-js ie'
 b'8"> <![endif]-->\n<!--[if (gte IE 9)|!(IE)]><!--> <html lang="en" class="'
 b'no-js"> <!--<![endif]-->\n<head>\n  <!-- meta element for compatibility mo'
 b'de needs to be before\n        all elements except title & meta\n        m'
 b'sdn.microsoft.com/en-us/library/cc288325(VS.85).aspx -->\n  <meta charset'
 b'="utf-8">\n  \n  <!-- Always force latest IE rendering engine (even in int'
 b'ranet) & Chrome Frame\n       Remove this if you use the .htaccess -->\n  '
 b'<meta http-equiv="X-UA-Compatible" content="IE=edge">\n  <title>The Free '
 b'Python Job Board</title>\n  <meta name="description" cont

## 2.2 BeautifulSoup parse
Use BeautifulSoup to parse the page's content.

In [5]:
soup = BeautifulSoup(page.content, 'html.parser')

In [30]:
print(type(soup))

<class 'bs4.BeautifulSoup'>


In [31]:
results = soup.find(id='main')
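`find` returns a single `Tag`, or `None` when nothing matches, so a quick check before using the result can save a confusing error later. A small sketch on a hypothetical inline page (the HTML string and the `sidebar` id are made up for illustration):

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for a downloaded page.
html = '<html><body><div id="main"><h1>Most Recent Jobs</h1></div></body></html>'
soup = BeautifulSoup(html, 'html.parser')

main = soup.find(id='main')        # a Tag, because the id exists
missing = soup.find(id='sidebar')  # None, because no element has this id

if main is not None:
    print(main.h1.text)
print(missing)
```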

In [32]:
print(results)

<div id="main" role="main">
<section id="content">
<nav class="main">
<span id="text_search_container">
<i class="i-search"></i>
<input id="text_search" name="search" placeholder="Search" type="text"/>
</span>
<span id="location">
<a class="backlink" href="map.html">
<i class="i-globe"></i>
            Jobs by Location
        </a>
</span>
<div id="filter">
<h3>Filter jobs by tag</h3>
<ul>
<li>
<a href="/tags/aws.html">aws</a>
</li>
<li>
<a href="/tags/backend.html">backend</a>
</li>
<li>
<a href="/tags/cms.html">cms</a>
</li>
<li>
<a href="/tags/devops.html">devops</a>
</li>
<li>
<a href="/tags/django.html">django</a>
</li>
<li>
<a href="/tags/flask.html">flask</a>
</li>
<li>
<a href="/tags/nosql.html">nosql</a>
</li>
<li>
<a href="/tags/postgresql.html">postgresql</a>
</li>
<li>
<a href="/tags/python.html">python</a>
</li>
<li>
<a href="/tags/sql.html">sql</a>
</li>
</ul>
</div>
</nav>
<h1 id="list-title">Most Recent Jobs</h1>
<div class="search_info hidden" id="search_info"></div>
<

## 2.3 Prettify
Use the ```prettify``` method to print the nested tags with one tag per line and consistent indentation, which makes the structure easier to read.

In [33]:
print(results.prettify())

<div id="main" role="main">
 <section id="content">
  <nav class="main">
   <span id="text_search_container">
    <i class="i-search">
    </i>
    <input id="text_search" name="search" placeholder="Search" type="text"/>
   </span>
   <span id="location">
    <a class="backlink" href="map.html">
     <i class="i-globe">
     </i>
     Jobs by Location
    </a>
   </span>
   <div id="filter">
    <h3>
     Filter jobs by tag
    </h3>
    <ul>
     <li>
      <a href="/tags/aws.html">
       aws
      </a>
     </li>
     <li>
      <a href="/tags/backend.html">
       backend
      </a>
     </li>
     <li>
      <a href="/tags/cms.html">
       cms
      </a>
     </li>
     <li>
      <a href="/tags/devops.html">
       devops
      </a>
     </li>
     <li>
      <a href="/tags/django.html">
       django
      </a>
     </li>
     <li>
      <a href="/tags/flask.html">
       flask
      </a>
     </li>
     <li>
      <a href="/tags/nosql.html">
       nosql
      </a>
     </li>


## 2.4 Separate entries
Each job entry is wrapped in a `<div class="job">` tag, so let's collect the entries with the `find_all` method.

In [34]:
job_elems = results.find_all('div', class_='job')

In [37]:
print(type(job_elems))

<class 'bs4.element.ResultSet'>


In [40]:
for job_elem in job_elems:
    print(type(job_elem))
    print(job_elem, end='\n'*2)

<class 'bs4.element.Tag'>
<div class="job" data-order="0" data-slug="autumn_compass_engineer" data-tags="python,aws,devops">
<a class="go_button" href="/jobs/autumn_compass_engineer.html">
		    	Read more <i class="i-right"></i>
</a>
<h1><a href="/jobs/autumn_compass_engineer.html">Software Engineer (Data Operations)</a></h1>
<span class="info"><i class="i-globe"></i> Sydney, Australia / Remote</span>
<span class="info"><i class="i-calendar"></i> Tue, 15 Sep 2020</span>
<span class="info"><i class="i-chair"></i> Permanent</span>
<span class="info"><i class="i-company"></i> Autumn Compass</span>
<p class="detail"> About The Role We are looking for an experienced Software Engineer or SRE to join the Data Platforms team at Autumn Compass. The team is in charge of our distributed cloud compute infrastructure, data processing...</p>
<div class="search_match"></div>
</div>

<class 'bs4.element.Tag'>
<div class="job" data-order="1" data-slug="NIH-developer-engineer" data-tags="python,nosql,s

Within each entry, the `<h1>` tag holds the job title and the four `<span class="info">` tags hold the location, posting date, role type and company name.

In [50]:
for job_elem in job_elems:
    # Each job_elem is a bs4 Tag object.
    # You can use the same search methods on it as on the soup.
    title_elem = job_elem.find('h1')
    location_elem, posted_date, role_type, company_elem = job_elem.find_all('span', class_='info')
    print(title_elem)
    print(location_elem)
    print(posted_date)
    print(role_type)
    print(company_elem)
    print()

<h1><a href="/jobs/autumn_compass_engineer.html">Software Engineer (Data Operations)</a></h1>
<span class="info"><i class="i-globe"></i> Sydney, Australia / Remote</span>
<span class="info"><i class="i-calendar"></i> Tue, 15 Sep 2020</span>
<span class="info"><i class="i-chair"></i> Permanent</span>
<span class="info"><i class="i-company"></i> Autumn Compass</span>

<h1><a href="/jobs/NIH-developer-engineer.html">Developer / Engineer</a></h1>
<span class="info"><i class="i-globe"></i> Maryland / DC Metro Area</span>
<span class="info"><i class="i-calendar"></i> Tue, 12 May 2020</span>
<span class="info"><i class="i-chair"></i> permanent</span>
<span class="info"><i class="i-company"></i> National Institutes of Health contracting company.</span>

<h1><a href="/jobs/bambus_vienna_django_dev_20200302.html">Senior Backend Developer (Python/Django)</a></h1>
<span class="info"><i class="i-globe"></i> Vienna, Austria</span>
<span class="info"><i class="i-calendar"></i> Mon, 02 Mar 2020</span>

Use ```.text.strip()``` to extract each element's text and trim its leading and trailing whitespace. Note that one job doesn't fit this structure.
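Unpacking exactly four `<span class="info">` tags raises a `ValueError` as soon as a job has more or fewer of them. One possible way around this, sketched below, is to key each span on the class of its `<i>` icon, as seen in the output above. The helper name `parse_info`, the field names and the sample job card are assumptions made for illustration.

```python
from bs4 import BeautifulSoup

# Icon class -> field name, taken from the <i> tags in the output above.
ICON_FIELDS = {
    'i-globe': 'location',
    'i-calendar': 'posted',
    'i-chair': 'role_type',
    'i-company': 'company',
}


def parse_info(job_elem):
    """Map each <span class="info"> to a field name via its icon class."""
    info = {}
    for span in job_elem.find_all('span', class_='info'):
        icon = span.find('i')
        if icon is None:
            continue
        for css_class in icon.get('class', []):
            if css_class in ICON_FIELDS:
                info[ICON_FIELDS[css_class]] = span.text.strip()
    return info


# Hypothetical job card with only two of the four spans.
html = '''
<div class="job">
<span class="info"><i class="i-globe"></i> Vienna, Austria</span>
<span class="info"><i class="i-calendar"></i> Mon, 02 Mar 2020</span>
</div>
'''
job = BeautifulSoup(html, 'html.parser').find('div', class_='job')
info = parse_info(job)
print(info.get('location'), '|', info.get('company', 'n/a'))
```

Missing fields are simply absent from the dictionary, so `info.get(...)` with a default can stand in for them.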

In [60]:
for job_elem in job_elems:
    # Each job_elem is a bs4 Tag object.
    # You can use the same search methods on it as on the soup.
    title_elem = job_elem.find('h1')
    location_elem, posted_date, role_type, company_elem = job_elem.find_all('span', class_='info')
    url = job_elem.find('a', class_='go_button')['href']
    print(title_elem.text.strip())
    print("Location: " + location_elem.text.strip())
    print("Posted on: " + posted_date.text.strip())
    print("Role type: " + role_type.text.strip())
    print("Company: " + company_elem.text.strip())
    print("Link: " + URL + url)
    print()

Software Engineer (Data Operations)
Location: Sydney, Australia / Remote
Posted on: Tue, 15 Sep 2020
Role type: Permanent
Company: Autumn Compass
Link: http://pythonjobs.github.io//jobs/autumn_compass_engineer.html

Developer / Engineer
Location: Maryland / DC Metro Area
Posted on: Tue, 12 May 2020
Role type: permanent
Company: National Institutes of Health contracting company.
Link: http://pythonjobs.github.io//jobs/NIH-developer-engineer.html

Senior Backend Developer (Python/Django)
Location: Vienna, Austria
Posted on: Mon, 02 Mar 2020
Role type: permanent
Company: Bambus.io
Link: http://pythonjobs.github.io//jobs/bambus_vienna_django_dev_20200302.html
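The links printed above contain a double slash (`.io//jobs/...`) because `URL` ends with `/` and each `href` starts with `/`. A minimal sketch of one way to avoid this, using `urljoin` from the standard library; the `href` value is copied from the first entry above:

```python
from urllib.parse import urljoin

URL = 'http://pythonjobs.github.io/'
href = '/jobs/autumn_compass_engineer.html'  # value of the go_button href

print(URL + href)          # plain concatenation keeps both slashes
print(urljoin(URL, href))  # resolves the path against the base URL
```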



## 2.5 Find specific entry
Let's try to find jobs titled "Software Engineer".

In [54]:
senior_jobs = results.find_all('span', string='Software Engineer')

In [55]:
print(senior_jobs)

[]


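The search returns nothing because ```string=``` matches only when an element's text is exactly the given string; differences in whitespace or capitalisation are enough to prevent a match. On this page the titles also sit inside `<h1>` tags rather than `<span>` tags, and carry extra text such as "(Data Operations)". Passing a function to ```string=``` allows a case-insensitive substring match instead.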
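The difference between an exact match and a substring match can be sketched on a few inline titles. The HTML below is a hypothetical sample, and `mentions_engineer` is an illustrative helper name. The `None` check is a precaution: a tag without a single string has `.string` set to `None`, which the matcher function may receive.

```python
from bs4 import BeautifulSoup

# Hypothetical sample in the same <h1><a> shape as the job titles.
html = '''
<div id="main">
<h1><a href="/jobs/a.html">Software Engineer (Data Operations)</a></h1>
<h1><a href="/jobs/b.html">Developer / Engineer</a></h1>
<h1><a href="/jobs/c.html">Senior Backend Developer</a></h1>
</div>
'''
soup = BeautifulSoup(html, 'html.parser')


def mentions_engineer(text):
    """Case-insensitive substring test that tolerates a missing string."""
    return text is not None and 'engineer' in text.lower()


exact = soup.find_all('h1', string='Software Engineer')  # exact text only
partial = soup.find_all('h1', string=mentions_engineer)  # substring match

print(len(exact), len(partial))
```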

In [78]:
engineer_jobs = results.find_all('h1', string=lambda text: 'engineer' in text.lower())

In [80]:
print(len(engineer_jobs))

2


## 2.1 Requests
Use the `requests` library to perform an HTTP GET request and retrieve the page's HTML.

In [2]:
import requests

URL = 'https://remote.co/remote-jobs/developer/'
page = requests.get(URL)

In [2]:
import requests

URL = 'https://remote.co/remote-jobs/search/?search_keywords=software+developer'
page = requests.get(URL)

In [3]:
import pprint
pp = pprint.PrettyPrinter()

**Static websites - return HTML**

As printed below.

In [4]:
pp.pprint(page.content)

(b'<!DOCTYPE html>\n<html lang="en-US" prefix="og: http://ogp.me/ns#" class='
 b'"no-js no-svg">\n<head>\n<title>\n  Remote Developer Jobs - Remote.co</titl'
 b'e><link rel="stylesheet" href="https://remote.co/wp-content/cache/min/1/293c'
 b'273e3521fee26af57382e991bb10.css" media="all" data-minify="1" />\n\n<link '
 b'rel="shortcut icon" href="/wp-content/uploads/2017/02/retina_favicon_32.png"'
 b' type="image/x-icon" />\n\n<meta charset="UTF-8">\n<meta name="viewport" co'
 b'ntent="width=device-width, initial-scale=1, shrink-to-fit=no, user-scalable='
 b'0">\n\n<!--link rel="stylesheet" href="https://stackpath.bootstrapcdn.com/'
 b'bootstrap/4.3.1/css/bootstrap.min.css" integrity="sha384-ggOyR0iXCbMQv3Xipma'
 b'34MD+dH/1fQ784/j6cY/iJTQUOhcWr7x9JvoRxT2MZw1T" crossorigin="anonymous"--'
 b'>\n\n<link href="https://fonts.googleapis.com/css?family=Open+Sans:400,400'
 b'i,700,700i|Raleway:400,700|Montserrat:400,500,600&#038;display=swap" rel="st'
 b'ylesheet">\n\n\n<!-- Social Warfare v4.1

## 2.2 BeautifulSoup parse
Use BeautifulSoup to parse the page's content.

In [3]:
soup = BeautifulSoup(page.content, 'html.parser')

In [4]:
print(type(soup))

<class 'bs4.BeautifulSoup'>


In [5]:
results = soup.find('div', class_='card bg-light mb-3 rounded-0')

In [4]:
results = soup.find('ul', class_='job_listings')

<ul class="job_listings">
</ul>

In [6]:
results_list = soup.find_all('div', class_='card bg-light mb-3 rounded-0')

In [8]:
print(type(results))

<class 'bs4.element.Tag'>


In [9]:
print(type(results_list))

<class 'bs4.element.ResultSet'>


In [6]:
print(results)

<div class="card bg-light mb-3 rounded-0">
<div class="card-body">
<div class="d-flex align-items-center mb-3">
<h2 class="text-uppercase mb-0 mr-2 raleway" style="-webkit-box-flex:0;flex-grow:0;">Remote Developer Jobs</h2><div style="background:#00a2e1;-webkit-box-flex:1;flex-grow:1;height:3px;"></div>
</div>
<div class="card bg-white m-0">
<div class="card-body p-0">
<p class="p-3 m-0 border-bottom">
<a href="/remote-jobs/" style="font-size:18px;">
<em>
                                                  See all Remote Jobs &gt;
                                              </em>
</a>
</p>
<a class="card m-0 border-left-0 border-right-0 border-top-0 border-bottom" href="/job/software-engineer-security-2/">
<div class="card border-0 p-3 job-card bg-white">
<div class="row no-gutters align-items-center">
<div class="col-lg-1 col-md-2 position-static d-none d-md-block pr-md-3">
<img alt="Fathom Health" class="card-img" src="https://remoteco.s3.amazonaws.com/wp-content/uploads/2020/11/2223

In [8]:
print(len(results_list))

1


## 2.3 Prettify
Use the ```prettify``` method to print the nested tags with one tag per line and consistent indentation, which makes the structure easier to read.

In [10]:
print(results.prettify())

<div class="card bg-light mb-3 rounded-0">
 <div class="card-body">
  <div class="d-flex align-items-center mb-3">
   <h2 class="text-uppercase mb-0 mr-2 raleway" style="-webkit-box-flex:0;flex-grow:0;">
    Remote Developer Jobs
   </h2>
   <div style="background:#00a2e1;-webkit-box-flex:1;flex-grow:1;height:3px;">
   </div>
  </div>
  <div class="card bg-white m-0">
   <div class="card-body p-0">
    <p class="p-3 m-0 border-bottom">
     <a href="/remote-jobs/" style="font-size:18px;">
      <em>
       See all Remote Jobs &gt;
      </em>
     </a>
    </p>
    <a class="card m-0 border-left-0 border-right-0 border-top-0 border-bottom" href="/job/senior-software-engineer-frontend-2/">
     <div class="card border-0 p-3 job-card bg-white">
      <div class="row no-gutters align-items-center">
       <div class="col-lg-1 col-md-2 position-static d-none d-md-block pr-md-3">
        <img alt="Astronomer" class="card-img" src="https://remote.co/wp-content/uploads/2018/06/astronomer-150x

## 2.4 Separate entries
Each job entry is wrapped in an `<a>` tag with the class `card m-0 border-left-0 border-right-0 border-top-0 border-bottom`, so let's collect the entries with the `find_all` method.

In [11]:
job_elems = results.find_all('a', class_='card m-0 border-left-0 border-right-0 border-top-0 border-bottom')

In [13]:
print(type(job_elems))

<class 'bs4.element.ResultSet'>
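Passing a string with several classes to `class_` matches the `class` attribute as one exact string, so the search can break if the site reorders its classes. The sketch below compares three ways of matching on a hypothetical two-link sample; the markup is made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the same two classes in different orders.
html = '''
<a class="card border-bottom" href="/job/one/">One</a>
<a class="border-bottom card" href="/job/two/">Two</a>
'''
soup = BeautifulSoup(html, 'html.parser')

exact = soup.find_all('a', class_='card border-bottom')  # exact attribute string
single = soup.find_all('a', class_='card')               # any tag carrying "card"
by_css = soup.select('a.card.border-bottom')             # both classes, any order

print(len(exact), len(single), len(by_css))
```

A CSS selector via `select` is one option when the order of classes is not guaranteed.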


In [12]:
for job_elem in job_elems:
    print(job_elem, end='\n'*2)

<a class="card m-0 border-left-0 border-right-0 border-top-0 border-bottom" href="/job/senior-software-engineer-frontend-2/">
<div class="card border-0 p-3 job-card bg-white">
<div class="row no-gutters align-items-center">
<div class="col-lg-1 col-md-2 position-static d-none d-md-block pr-md-3">
<img alt="Astronomer" class="card-img" src="https://remote.co/wp-content/uploads/2018/06/astronomer-150x150.jpg"/>
</div>
<div class="col position-static">
<div class="card-body px-3 py-0 pl-md-0">
<p class="m-0"><span class="font-weight-bold larger">Senior Software Engineer - Frontend</span><span class="float-right d-none d-md-inline text-secondary"><small><date>15 hours ago</date></small></span></p>
<p class="m-0 text-secondary">
                                      Astronomer 
                                                   
                                                                          </p>
</div>
</div>
</div>
</div>
</a>

<a class="card m-0 border-left-0 border-right-0 bor

Within each entry, the `<span class="font-weight-bold larger">` tag holds the job title, the `<p class="m-0 text-secondary">` tag holds the company name and the `<span class="float-right d-none d-md-inline text-secondary">` tag holds the posting time.

In [13]:
for job_elem in job_elems:
    # Each job_elem is a bs4 Tag object.
    # You can use the same search methods on it as on the soup.
    title_elem = job_elem.find('span', class_='font-weight-bold larger')
    posted_days_before = job_elem.find('span', class_='float-right d-none d-md-inline text-secondary')
    company_elem = job_elem.find('p', class_='m-0 text-secondary')
    print(title_elem)
    print(company_elem)
    print(posted_days_before)
    print()

<span class="font-weight-bold larger">Senior Software Engineer - Frontend</span>
<p class="m-0 text-secondary">
                                      Astronomer 
                                                   
                                                                          </p>
<span class="float-right d-none d-md-inline text-secondary"><small><date>15 hours ago</date></small></span>

<span class="font-weight-bold larger">Full Stack Developer</span>
<p class="m-0 text-secondary">
                                      Uhuru Network 
                                                   
                                                                          </p>
<span class="float-right d-none d-md-inline text-secondary"><small><date>1 day ago</date></small></span>

<span class="font-weight-bold larger">Full Stack Engineer</span>
<p class="m-0 text-secondary">
                                      Interview Schedule 
                                                   
     

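Use ```.text.strip()``` to extract each element's text and trim its leading and trailing whitespace. Some entries don't fit this structure: their company text also carries a region (for example "| International") separated by long runs of whitespace, which `strip()` leaves in place.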
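Note that `strip()` only trims the two ends of a string, so runs of whitespace inside the text survive. A minimal sketch of collapsing them as well, using a hypothetical helper named `squash_whitespace`; the sample string only imitates the shape of a company entry that also carries a region.

```python
def squash_whitespace(text):
    """Collapse every run of whitespace into a single space and trim the ends."""
    return ' '.join(text.split())


# Shaped like the company text of an entry that also carries a region.
raw = '\n   ActiveState Software \n        \n          | International\n   '

print(repr(raw.strip()))
print(squash_whitespace(raw))
```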

In [14]:
URL_prefix = "https://remote.co"
for job_elem in job_elems:
    # Each job_elem is a bs4 Tag object.
    # You can use the same search methods on it as on the soup.
    title_elem = job_elem.find('span', class_='font-weight-bold larger')
    posted_days_before = job_elem.find('span', class_='float-right d-none d-md-inline text-secondary')
    company_elem = job_elem.find('p', class_='m-0 text-secondary')
    url = job_elem['href']
    print(title_elem.text.strip())
    print("Company: " + company_elem.text.strip())
    print("Posted: " + posted_days_before.text.strip())
    print("Link: " + URL_prefix + url)
    print()

Senior Software Engineer - Frontend
Company: Astronomer
Posted: 15 hours ago
Link: https://remote.co/job/senior-software-engineer-frontend-2/

Full Stack Developer
Company: Uhuru Network
Posted: 1 day ago
Link: https://remote.co/job/full-stack-developer-47/

Full Stack Engineer
Company: Interview Schedule
Posted: 3 days ago
Link: https://remote.co/job/full-stack-engineer-31/

Java Language Specialist - Build Engineer
Company: ActiveState Software 
                                                   
                                                                
                                                                                   | International
Posted: 3 days ago
Link: https://remote.co/job/java-language-specialist-build-engineer/

Senior UI Engineer
Company: Urban Outfitters Inc.
Posted: 4 days ago
Link: https://remote.co/job/senior-ui-engineer-4/

Senior Web Developer
Company: Lightbend 
                                                   
                              

## 2.5 Find specific entry
Let's try to find jobs with "Engineer" in the title.

In [29]:
senior_jobs = results.find_all('span', string='Engineer')

In [30]:
print(senior_jobs)

[]


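The search returns nothing because ```string=``` matches only when an element's text is exactly "Engineer", and no title consists of that single word. Differences in whitespace or capitalisation also prevent an exact match. Passing a function to ```string=``` allows a case-insensitive substring match instead.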
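As an alternative to a function, `string=` also accepts a compiled regular expression, which is searched for anywhere in the text. A short sketch on hypothetical titles (the three spans are made up for illustration):

```python
import re

from bs4 import BeautifulSoup

# Hypothetical titles in the same <span> shape as the listing above.
html = '''
<span class="font-weight-bold larger">Senior Software Engineer - Frontend</span>
<span class="font-weight-bold larger">Full Stack Developer</span>
<span class="font-weight-bold larger">Full Stack ENGINEER</span>
'''
soup = BeautifulSoup(html, 'html.parser')

# re.IGNORECASE makes the match independent of capitalisation.
pattern = re.compile(r'engineer', re.IGNORECASE)
matches = soup.find_all('span', string=pattern)

for span in matches:
    print(span.text)
```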

In [33]:
engineer_jobs = results.find_all('span', string=lambda text: 'engineer' in text.lower())

In [34]:
print(len(engineer_jobs))

35


## 2.1 Requests
Use the `requests` library to perform an HTTP GET request and retrieve the page's HTML.

In [2]:
import requests

URL = 'https://hk.indeed.com/jobs?q=software+developer&l=hong+kong'
page = requests.get(URL)

In [37]:
import pprint
pp = pprint.PrettyPrinter()

**Static websites - return HTML**

As printed below.

In [38]:
pp.pprint(page.content)

(b'<!DOCTYPE html>\n<html lang="en" dir="ltr">\n<head>\n<meta http-equiv="cont'
 b'ent-type" content="text/html;charset=UTF-8">\n<script type="text/javascri'
 b'pt" src="//d3fw5vlhllyvee.cloudfront.net/s/d309c77/en_HK.js"></script>\n<'
 b'link href="//d3fw5vlhllyvee.cloudfront.net/s/7e3bf04/jobsearch_all.css" rel='
 b'"stylesheet" type="text/css">\n<link rel="alternate" type="application/rs'
 b's+xml" title="Software Developer Jobs in Hong Kong" href="https://hk.indeed.'
 b'com/rss?q=software+developer&l=hong+kong">\n<link rel="alternate" media="'
 b'only screen and (max-width: 640px)" href="/m/jobs?q=software+developer&l=hon'
 b'g+kong">\n<link rel="alternate" media="handheld" href="/m/jobs?q=software'
 b'+developer&l=hong+kong">\n\n<script type="text/javascript">\n\nif (typeof wi'
 b"ndow['closureReadyCallbacks'] == 'undefined') {\nwindow['closureReadyCall"
 b"backs'] = [];\n}\n\nfunction call_when_jsall_loaded(cb) {\nif (window['closu"
 b"reReady']) {\ncb();\n} else {\nwindow['closu

 b't}html body.janus table #vjs-container .indeed-apply-button:hover:focus:acti'
 b've,html body.janus table #vjs-container .state-picker-button:hover:focus:act'
 b'ive,html body.janus table #vjs-container .view-apply-button:hover:focus:acti'
 b've{box-shadow:none !important;outline:0 !important}:root body.janus table #v'
 b'js-container button.state-picker-button:focus{box-shadow:inset 0 1px 0.25rem'
 b' rgba(0,0,0,0.1),0 0 0 2px #fff,0 0 0 3px #2557a7 !important}:root body.janu'
 b's table #vjs-container button.state-picker-button:active{box-shadow:inset 0 '
 b'1px 0.25rem rgba(0,0,0,0.1),0 0 0 2px #fff,0 0 0 3px #2557a7 !important}:roo'
 b't body.janus table #vjs-container button.state-picker-button:hover:focus{out'
 b'line:0 !important;box-shadow:inset 0 1px 0.25rem rgba(0,0,0,0.1),0 0 0 2px #'
 b'fff,0 0 0 3px #2557a7 !important}:root body.janus table #vjs-container butto'
 b'n.state-picker-button:hover:focus:active{box-shadow:none !important;outline:'
 b'0 !important}html body #i

 b"ip-rule='evenodd' d='M4 1.563C4 1.252 4.252 1 4.563 1h10.875c.31 0 .562.252."
 b'562.563V6.38a.563.563 0 01-.178.41L12.4 10l3.422 3.208a.563.563 0 01.178.41v'
 b'4.82c0 .31-.252.562-.563.562H4.563A.562.562 0 014 18.437V13.62c0-.156.064-.3'
 b'04.178-.41L7.6 10 4.178 6.792A.562.562 0 014 6.382v-4.82zM6.4 16.75v-2.443l3'
 b'.6-3.375 3.6 3.375v2.443H6.4z\' fill=\'%237461E7\'/%3E%3C/svg%3E") no-repea'
 b't !important}.jobCardShelfItem.earlyApply .jobCardShelfIcon svg{opacity:0}.s'
 b'erpvj-earlyApplyMessage-icon::before{background:url("data:image/svg+xml;char'
 b"set=utf8,%3Csvg width='20' height='20' fill='none' xmlns='http://www.w3.org/"
 b"2000/svg'%3E%3Cpath fill-rule='evenodd' clip-rule='evenodd' d='M4 1.563C4 1."
 b'252 4.252 1 4.563 1h10.875c.31 0 .562.252.562.563V6.38a.563.563 0 01-.178.41'
 b'L12.4 10l3.422 3.208a.563.563 0 01.178.41v4.82c0 .31-.252.562-.563.562H4.563'
 b'A.562.562 0 014 18.437V13.62c0-.156.064-.304.178-.41L7.6 10 4.178 6.792A.562'
 b".562 0 014 6.382v-4.82zM6.

 b'div class="result-link-bar-container">\n<div class="result-link-bar"><spa'
 b'n class="date ">8 days ago</span><span id="tt_set_6" class="tt_set"><div cla'
 b'ss="job-reaction"><button class="job-reaction-kebab" aria-haspopup="true" ar'
 b'ia-expanded="false" data-ol-has-click-handler tabindex="0" aria-label="save '
 b'or dislike" onclick="toggleKebabMenu(\'907f3eb21679af74\', false, event); '
 b'return false;"></button><span class="job-reaction-kebab-menu"><button class='
 b'"job-reaction-kebab-item job-reaction-save" onclick="changeJobState(\'907'
 b'f3eb21679af74\', \'save\', \'linkbar\', false, \'\');return false;" data-ol'
 b'-has-click-handler><svg focusable="false" viewBox="0 0 24 24" height="16" wi'
 b'dth="16"><g><path fill="#2d2d2d" d="M16.5,3A6,6,0,0,0,12,5.09,6,6,0,0,0,7.5,'
 b'3,5.45,5.45,0,0,0,2,8.5C2,12.28,5.4,15.36,10.55,20L12,21.35,13.45,20C18.6,15'
 b'.36,22,12.28,22,8.5A5.45,5.45,0,0,0,16.5,3ZM12.1,18.55l-0.1.1-0.1-.1C7.14,14'
 b'.24,4,11.39,4,8.5A3.42,3.42,0,0,1,

## 2.2 BeautifulSoup parse
Use BeautifulSoup to parse the page's content.

In [16]:
soup = BeautifulSoup(page.content, 'html.parser')

In [40]:
print(type(soup))

<class 'bs4.BeautifulSoup'>


In [18]:
results = soup.find(id='resultsCol')

## 2.3 Prettify
Use the ```prettify``` method to print the nested tags with one tag per line and consistent indentation, which makes the structure easier to read.

In [19]:
print(results.prettify())

<td id="resultsCol">
 <div id="resultsColTopSpace">
 </div>
 <div class="messageContainer">
  <script type="text/javascript">
   function setRefineByCookie(refineByTypes) {
        var expires = new Date();
        expires.setTime(expires.getTime() + (10 * 1000));
        for (var i = 0; i < refineByTypes.length; i++) {
          setCookie(refineByTypes[i], "1", expires);
        }
      }
  </script>
 </div>
 <style type="text/css">
  #increased_radius_result {
        font-size: 16px;
        font-style: italic;
    }
    #original_radius_result{
        font-size: 13px;
        font-style: italic;
        color: #666666;
    }
 </style>
 <div class="resultsTop">
  <div class="mosaic-zone" id="mosaic-zone-aboveJobCards">
   <div class="mosaic mosaic-provider-serpreportjob" id="mosaic-provider-serpreportjob">
    <span>
     <div class="mosaic-reportcontent-content">
     </div>
    </span>
   </div>
  </div>
  <script type="text/javascript">
   try {
                    window.mosaic

## 2.4 Separate entries
Each job entry is wrapped in a `<div>` tag with the class `jobsearch-SerpJobCard`, so let's collect the entries with the `find_all` method.

In [46]:
job_elems = results.find_all('div', class_='jobsearch-SerpJobCard')

In [47]:
print(type(job_elems))

<class 'bs4.element.ResultSet'>


In [48]:
for job_elem in job_elems:
    print(job_elem, end='\n'*2)

<div class="jobsearch-SerpJobCard unifiedRow row result" data-ci="344380629" data-empn="5012426528945402" data-jk="69d6f98f66964878" id="pj_69d6f98f66964878">
<style>
.jobcard_logo{margin:6px 0}.jobcard_logo img{width:auto;max-width:80px;max-height:30px}.jasxrefreshcombotst .jobcard_logo img{max-height:2rem;max-width:100%}
</style>
<div class="jobcard_logo">
<a href="/cmp/Protiviti" onmousedown="this.href = appendParamsOnce(this.href, 'tk=1en3pe6vn7gkq800&amp;campaignid=femp&amp;from=femp');" rel="noopener" target="_blank">
<img alt="Protiviti logo" src="https://d2q79iu7y748jz.cloudfront.net/s/_logo/8f528d72e74e9073b73f94fe27f67e44"/>
</a>
</div>
<h2 class="title">
<a class="jobtitle turnstileLink" data-tn-element="jobTitle" href="/pagead/clk?mo=r&amp;ad=-6NYlbfkN0CjL_p_YojAjV80YmrtNOEa-36diKkMq8UMd180quyDDP1a4cdJOl9fhlG4sGMQLw7MixlusmKCi9UobrbkWVGA50msSS2D2AbIyovLZcV0hcs0F80W1YgPsCPSMlzuWf6L9DrRfQNG3q5Z8-bWWxh3Hg-CSEWqSqit3rd32R1CenPjsPzGj6ZvSaWXWtF5gbOJT8gErVUu2L9Sblck7rIBnfO_ji8I6wR

Within each entry, the `<a class="jobtitle turnstileLink">`, `<span class="company">`, `<div class="location accessible-contrast-color-location">` and `<div class="summary">` tags hold the job title, company name, location and summary.

In [54]:
for job_elem in job_elems:
    # Each job_elem is a bs4 Tag object.
    # You can use the same search methods on it as on the soup.
    title_elem = job_elem.find('a', class_='jobtitle turnstileLink')
    company_elem = job_elem.find('span', class_='company')
    location_elem = job_elem.find('div', class_='location accessible-contrast-color-location')
    summary_elem = job_elem.find('div', class_='summary')
    print(title_elem)
    print(company_elem)
    print(location_elem)
    print(summary_elem)
    print()

<a class="jobtitle turnstileLink" data-tn-element="jobTitle" href="/pagead/clk?mo=r&amp;ad=-6NYlbfkN0CjL_p_YojAjV80YmrtNOEa-36diKkMq8UMd180quyDDP1a4cdJOl9fhlG4sGMQLw7MixlusmKCi9UobrbkWVGA50msSS2D2AbIyovLZcV0hcs0F80W1YgPsCPSMlzuWf6L9DrRfQNG3q5Z8-bWWxh3Hg-CSEWqSqit3rd32R1CenPjsPzGj6ZvSaWXWtF5gbOJT8gErVUu2L9Sblck7rIBnfO_ji8I6wRJGAygWT7l_nr542EqfQDLrU_qrWMpEsPR6io_83-S46dwfIoK52uFXzWiTCG2InLvJZjdD_8ZK8M_WFSyilJrIHrboF0dY5AUbbDPypcjEqU7cr49mZEWptXIa6DC3UvAjxd9SUQv4CpNuaMftDP86GIYZgx4XbnIAoga9oczmoqFydmSn6gTpAXjjStIQFlTpv2h4E69aXca-YguBSv2nOD4tStO26wj6M0OTcS7eYrvgLtLISDPVy4WHd1tnNF_8I2rXYJX2qMoLg8Irp2xD-UwE5wOxlNbJ6gVTHX-kizrggydgxsMXSOdEXsQ5RbzOSdwyQ98szMR0715JscREdUZvbO4IuBAUUdkqIRuu88RR13JpDqk0YwfFHzOAlHcLCC-6eZBlsw6itz0DHBjB9TJsQO8f54PPJ4oLrjYNPUZYFC9cbU38ZIF6DG9bFQ7ErsoyXpc7wDTWRL5SXNRu-85m8BDsCfy9JVGYv0904WT8-LIQPXdR7ij1qHBpdiEudYDEYnqls2EcyEktQxsIjle2j101P2HpiQBa5mniyZOv6dxzvhL&amp;p=0&amp;fvj=0&amp;vjs=3" id="sja0" onclick="setRefineByCookie([]); sjoc('sja0', 0); convCtr('SJ'); rclk(

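Use ```.text.strip()``` to extract each element's text and trim its leading and trailing whitespace. At least one job doesn't fit this structure: one of the `find` calls returns `None` for it, so accessing `.text` raises the `AttributeError` shown at the end of the output.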

In [60]:
url_prefix = 'https://hk.indeed.com'
for job_elem in job_elems:
    # Each job_elem is a bs4 Tag object.
    # You can use the same search methods on it as on the soup.
    title_elem = job_elem.find('a', class_='jobtitle turnstileLink')
    company_elem = job_elem.find('span', class_='company')
    location_elem = job_elem.find('div', class_='location accessible-contrast-color-location')
    summary_elem = job_elem.find('div', class_='summary')
    url = job_elem.find('a', class_='jobtitle turnstileLink')['href']
    if url.startswith('/'):
        url = url_prefix + url
    print(title_elem.text.strip())
    print("Company: " + company_elem.text.strip())
    print("At: " + location_elem.text.strip())
    print("Summary: " + summary_elem.text.strip())
    print("Link: " + url)
    print()

Software Developer (.Net) - Insurance - 25-35k
Company: Protiviti
At: Central and Western District, Hong Kong Island
Summary: Solid experience in Software Development, financial background is highly preferred.
Work closely with teammates on system design and new software development,…
Link: https://hk.indeed.com/pagead/clk?mo=r&ad=-6NYlbfkN0CjL_p_YojAjV80YmrtNOEa-36diKkMq8UMd180quyDDP1a4cdJOl9fhlG4sGMQLw7MixlusmKCi9UobrbkWVGA50msSS2D2AbIyovLZcV0hcs0F80W1YgPsCPSMlzuWf6L9DrRfQNG3q5Z8-bWWxh3Hg-CSEWqSqit3rd32R1CenPjsPzGj6ZvSaWXWtF5gbOJT8gErVUu2L9Sblck7rIBnfO_ji8I6wRJGAygWT7l_nr542EqfQDLrU_qrWMpEsPR6io_83-S46dwfIoK52uFXzWiTCG2InLvJZjdD_8ZK8M_WFSyilJrIHrboF0dY5AUbbDPypcjEqU7cr49mZEWptXIa6DC3UvAjxd9SUQv4CpNuaMftDP86GIYZgx4XbnIAoga9oczmoqFydmSn6gTpAXjjStIQFlTpv2h4E69aXca-YguBSv2nOD4tStO26wj6M0OTcS7eYrvgLtLISDPVy4WHd1tnNF_8I2rXYJX2qMoLg8Irp2xD-UwE5wOxlNbJ6gVTHX-kizrggydgxsMXSOdEXsQ5RbzOSdwyQ98szMR0715JscREdUZvbO4IuBAUUdkqIRuu88RR13JpDqk0YwfFHzOAlHcLCC-6eZBlsw6itz0DHBjB9TJsQO8f54PPJ4oLrjYNPUZYFC

AttributeError: 'NoneType' object has no attribute 'text'
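The `AttributeError` above appears when `find` returns `None` for a card that lacks one of the expected elements. One way to keep the loop running is a small guard, sketched here with a hypothetical helper `text_or_default` and a made-up job card that has no summary:

```python
from bs4 import BeautifulSoup


def text_or_default(elem, default='N/A'):
    """Return the stripped text of `elem`, or `default` when find() gave None."""
    return elem.text.strip() if elem is not None else default


# Hypothetical job card that has a company but no summary.
html = '<div class="card"><span class="company"> Protiviti </span></div>'
card = BeautifulSoup(html, 'html.parser').find('div', class_='card')

company = text_or_default(card.find('span', class_='company'))
summary = text_or_default(card.find('div', class_='summary'))
print("Company: " + company)
print("Summary: " + summary)
```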

## 2.5 Find specific entry
Let's try to find graduate jobs by searching for "Grad".

In [66]:
grad_jobs = results.find_all('a', string='Grad')

In [67]:
print(grad_jobs)

[]


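The search returns nothing because ```string=``` matches only when an element's text is exactly "Grad"; differences in whitespace or capitalisation also prevent an exact match. Instead, collect every job-title link and filter on its `title` attribute.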

In [80]:
grad_jobs = results.find_all('a', class_='jobtitle turnstileLink') #, string=lambda text: 'grad' in text.lower())

In [81]:
print(grad_jobs)

[<a class="jobtitle turnstileLink" data-tn-element="jobTitle" href="/pagead/clk?mo=r&amp;ad=-6NYlbfkN0CjL_p_YojAjV80YmrtNOEa-36diKkMq8UMd180quyDDP1a4cdJOl9fhlG4sGMQLw7MixlusmKCi9UobrbkWVGA50msSS2D2AbIyovLZcV0hcs0F80W1YgPsCPSMlzuWf6L9DrRfQNG3q5Z8-bWWxh3Hg-CSEWqSqit3rd32R1CenPjsPzGj6ZvSaWXWtF5gbOJT8gErVUu2L9Sblck7rIBnfO_ji8I6wRJGAygWT7l_nr542EqfQDLrU_qrWMpEsPR6io_83-S46dwfIoK52uFXzWiTCG2InLvJZjdD_8ZK8M_WFSyilJrIHrboF0dY5AUbbDPypcjEqU7cr49mZEWptXIa6DC3UvAjxd9SUQv4CpNuaMftDP86GIYZgx4XbnIAoga9oczmoqFydmSn6gTpAXjjStIQFlTpv2h4E69aXca-YguBSv2nOD4tStO26wj6M0OTcS7eYrvgLtLISDPVy4WHd1tnNF_8I2rXYJX2qMoLg8Irp2xD-UwE5wOxlNbJ6gVTHX-kizrggydgxsMXSOdEXsQ5RbzOSdwyQ98szMR0715JscREdUZvbO4IuBAUUdkqIRuu88RR13JpDqk0YwfFHzOAlHcLCC-6eZBlsw6itz0DHBjB9TJsQO8f54PPJ4oLrjYNPUZYFC9cbU38ZIF6DG9bFQ7ErsoyXpc7wDTWRL5SXNRu-85m8BDsCfy9JVGYv0904WT8-LIQPXdR7ij1qHBpdiEudYDEYnqls2EcyEktQxsIjle2j101P2HpiQBa5mniyZOv6dxzvhL&amp;p=0&amp;fvj=0&amp;vjs=3" id="sja0" onclick="setRefineByCookie([]); sjoc('sja0', 0); convCtr('SJ'); rclk

In [85]:
grad_job_list = []
for job in grad_jobs:
    if 'grad' in job['title'].lower():
        print(job['title'])
        grad_job_list.append(job)

Software Developer [Fresh Grad Welcome]
Graduate C++ Software Developer | 2021 Intake


In [87]:
print(len(grad_job_list))

2


In [3]:
import requests

URL = 'https://hk.jobsdb.com/hk/search-jobs/software-developer/1'
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')
results = soup.find('div', class_='FYwKg _36UVG_6')
job_elems = results.find('div', class_='FYwKg').find_all('div', recursive=False)

In [16]:
print(results)

<div class="FYwKg _36UVG_6"><div class="FYwKg" data-automation="jobListing"><div data-search-sol-meta='{"searchRequestToken":"ca61f65b-3e83-40ce-81da-daf3ef69fc0b","token":"0~ca61f65b-3e83-40ce-81da-daf3ef69fc0b","jobId":"jobsdb-hk-job-100003008103020","section":"MAIN","sectionRank":1,"jobAdType":"ORGANIC","tags":{"mordor__flights":"mordor_45","jobsdb:userGroup":"CC","jobsdb:s_vi":""}}'><div class="FYwKg _17IyL_6 _2-ij9_6 _3Vcu7_6 MtsXR_6"><div class=""><article class="FYwKg _3j_fQ _2mOt7_6 _1A6vC_6 _29sNS _58veS_6" data-automation="job-card-0"><div class="FYwKg _1A6vC_6 _31UWZ _1gtjJ _2cWXo _1Swh0"><div class="FYwKg _31UWZ _27u74_6"><div class="FYwKg _31UWZ fB92N_6 _1pAdR_6 FLByR_6 _2QIfI_6 _2cWXo _1Swh0 HdpOi"><div class="FYwKg"><div class="FYwKg _36UVG_6" data-automation="job-card-logo"><img alt="The Hong Kong General Chamber of Commerce's logo" class="FYwKg _1qfRc_6" src="https://image-service-cdn.seek.com.au/1537dbb76528c523b90cd2084e6b3c1a93909301/ee4dce1061f3f616224767ad58cb2fc7

In [5]:
print(len(job_elems))

30
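Class names such as `FYwKg _36UVG_6` look machine-generated and may change between deployments (this is an assumption, not something the page states). The `data-automation` attributes visible in the output above might be a steadier hook. A sketch on hypothetical, heavily trimmed markup:

```python
import re

from bs4 import BeautifulSoup

# Hypothetical, heavily trimmed markup in the shape of the output above.
html = '''
<div data-automation="jobListing">
  <div><article data-automation="job-card-0">
    <div data-automation="job-card-logo"></div>
  </article></div>
  <div><article data-automation="job-card-1"></article></div>
</div>
'''
soup = BeautifulSoup(html, 'html.parser')

listing = soup.find('div', attrs={'data-automation': 'jobListing'})
# Anchor the pattern so values like "job-card-logo" are excluded.
cards = listing.find_all(attrs={'data-automation': re.compile(r'^job-card-\d+$')})
print(len(cards))
```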


In [15]:
print(job_elems[1].prettify())

<div data-search-sol-meta='{"searchRequestToken":"f436f257-248c-4a2c-a7de-f4bd75ab7be8","jobId":"jobsdb-hk-job-100003008110786","section":"MAIN","sectionRank":2,"jobAdType":"ORGANIC","tags":{"jobsdb:s_vi":"","jobsdb:userGroup":"CC"}}'>
 <div class="FYwKg _17IyL_6 _2-ij9_6 _3Vcu7_6 MtsXR_6">
  <div class="">
   <article class="FYwKg _3j_fQ _2mOt7_6 _1A6vC_6 _29sNS _58veS_6" data-automation="job-card-1">
    <div class="FYwKg _1A6vC_6 _31UWZ _1gtjJ _2cWXo _1Swh0">
     <img alt="Derivco Hong Kong's banner" class="FYwKg _1GAuD _1quaz_6 _3CnZK_6" data-automation="job-card-banner" src="https://content.jobsdbcdn.com/Content/CmsContent/Logo/HK/JobsDBFiles/CompanyLogo/banner-m/52976m.png"/>
     <div class="FYwKg _31UWZ _27u74_6">
      <div class="FYwKg _31UWZ fB92N_6 _1pAdR_6 FLByR_6 _2QIfI_6 _2cWXo _1Swh0 HdpOi">
       <div class="FYwKg">
        <div class="FYwKg _36UVG_6" data-automation="job-card-logo">
         <img alt="Derivco Hong Kong's logo" class="FYwKg _1qfRc_6" src="https://con

In [26]:
for job_elem in job_elems:
    # Each job_elem is a bs4 Tag (not a new BeautifulSoup object),
    # but it supports the same find/find_all methods used above.
    title_elem = job_elem.find('div', class_='FYwKg _2j8fZ_6 sIMFL_6 _1JtWu_6')
    company_elem = job_elem.find('span', class_='FYwKg _1GAuD C6ZIU_6 _6ufcS_6 _27Shq_6 _29m7__6')
    location_elem = job_elem.find('span', class_='FYwKg _1gtjJ _1GAuD _29LNX')
    
    url = job_elem.find('a', class_='DvvsL_6 _1p9OP')['href']
    
    print(title_elem.text.strip())
    print(company_elem.text.strip())
    if location_elem is not None:
        print(location_elem.text.strip())
    print(url)

Software Developer/ Analyst Programmer (JAVA)- 18-45k
Ambition
/hk/en/job/software-developer-analyst-programmer-100003008110836?sectionRank=1&jobId=100003008110836&token=0~f436f257-248c-4a2c-a7de-f4bd75ab7be8
Software Developer C#
Derivco Hong Kong
Tsim Sha Tsui
/hk/en/job/software-developer-csharp-100003008110786?sectionRank=2&jobId=100003008110786&token=0~f436f257-248c-4a2c-a7de-f4bd75ab7be8
Application Developer
Derivco Hong Kong
Tsim Sha Tsui
/hk/en/job/application-developer-100003008110783?sectionRank=3&jobId=100003008110783&token=0~f436f257-248c-4a2c-a7de-f4bd75ab7be8
Software Developer
MaCaPS International Limited
Kwun Tong
/hk/en/job/software-developer-100003008110725?sectionRank=4&jobId=100003008110725&token=0~f436f257-248c-4a2c-a7de-f4bd75ab7be8
Mobile Application Developer
PrimeCredit Ltd
Wan Chai
/hk/en/job/mobile-application-developer-100003008110567?sectionRank=5&jobId=100003008110567&token=0~f436f257-248c-4a2c-a7de-f4bd75ab7be8
Software Developer
Maxson Network Limited
K
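
The `href` values printed above are relative paths, so they need the site's base URL before they can be passed to `requests.get`. A minimal sketch using the standard library's `urljoin`; `BASE_URL` is an assumption taken from the search URL used earlier, and `relative_url` is copied from one of the printed links.

```python
from urllib.parse import urljoin

# Assumed base, taken from the search URL used in the earlier cell.
BASE_URL = 'https://hk.jobsdb.com'

# One of the relative links printed above.
relative_url = (
    '/hk/en/job/software-developer-100003008110725'
    '?sectionRank=4&jobId=100003008110725'
    '&token=0~f436f257-248c-4a2c-a7de-f4bd75ab7be8'
)

# urljoin resolves the path against the base and keeps the query string.
absolute_url = urljoin(BASE_URL, relative_url)
print(absolute_url)
```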

In [18]:
import requests

URL = 'https://hk.jobsdb.com/hk/en/job/job-100003008110565?token=0~8b5650b2-187c-48da-a9b7-c534bfb762d5&sectionRank=2&jobId=jobsdb-hk-job-100003008110565'
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')
results = soup.find('div', class_='FYwKg d7v3r _3BZ6E_6 _2FwxQ_6')

In [20]:
print(results.text)

Job Highlightsprogrammer, C#, C++, MS SQL5 day work, free shuttle, medical, kowloon bay mtrsystem design, development, implementationJob DescriptionJob Responsibilities

Assist software development team to develop computer software;
Prepare system documentation;
Support system implementation and training;
Provide technical support

 Qualification

Possess a Bachelor's degree or above in
Computer Science or Computer Engineering; Hands on experience in C# on MS Windows platform;
Knowledge of C++ and MS SQL development a plus;
Clear communication and documentation skills in both English and Chinese;
Ability to deliver quality work under tight schedules;
Must be analytical, self directed and work well in a team environment

Remarks

Candidates with 2+ years of relevant working experience will be considered as Advanced Software Developer.

Interested parties please submit detailed resume including present and expected salary to us.  All data collected will only be used
 for recruitment purp
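
In the output above, headings and list items run together ("Job Highlightsprogrammer, ...") because `.text` concatenates every text node with no separator. As a sketch, `get_text` with a `separator` and `strip=True` keeps the pieces apart. The HTML and the `job-description` class are hypothetical stand-ins for the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup shaped like a job description block.
html = """
<div class="job-description">
  <h4>Job Highlights</h4>
  <ul><li>programmer, C#, C++, MS SQL</li><li>5 day work</li></ul>
  <h4>Job Description</h4>
  <p>Assist software development team to develop computer software;</p>
</div>
"""
block = BeautifulSoup(html, 'html.parser').find('div', class_='job-description')

# strip=True trims each text node and drops whitespace-only ones;
# separator is placed between the remaining pieces.
lines = block.get_text(separator='\n', strip=True).split('\n')
for line in lines:
    print(line)
```

On the real page the same call would be `results.get_text(separator='\n', strip=True)`.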