# Read website information really, really fast: web scraping in Python

# Table of contents

1. [Terminology](#nt)

## Terminology
<a id="nt"> </a>

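 There is a lot of technical jargon here. Here are some definitions:
 
 * BeautifulSoup : the Python library used here to parse HTML and extract elements from it.
 * Requests : the Python library used here to send HTTP(S) requests.
 * Web page : a single document, usually written in HTML, that a browser displays and that is reachable at its own URL.
 * Website : a collection of related web pages under the same domain name.
 * Web server : the computer (or program) that stores a website and sends its pages to whoever requests them.
 * Web browser : the program (Firefox, Chrome, ...) that requests web pages and displays them.
 * HTML : HyperText Markup Language, the standard markup language of the web.  
 * DOM : Document Object Model. The tree-structured representation of an HTML document that the web browser builds, and that scripts can read and modify.
 * Static/Dynamic websites : in essence, a request to a static website returns the final content as HTML ; a dynamic website returns JavaScript code that your web browser runs locally to build the content. More details below.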
 

## Static website

There is a lot of information on the web. We would like to collect and parse all of it, but life is short. Hence, we web-scrape. How? When you read a web page, you are reading the rendered output of code, written for example in `HTML`. Web scraping means writing a program that systematically collects chosen parts of that page's code. How hard the task is depends on how complex that code is. Assuming the code is `HTML`, there are basically two cases:
 1. the server of a _static_ website returns all of the page's `HTML` code that you will get to see as a user. All the information is there at once.
 2. the server of a _dynamic_ website usually returns `JavaScript` code. To save resources, many dynamic websites use `HTML` templates that are filled in with varying data. The `JavaScript` is run locally by the browser, which then produces the final page.
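
To make the distinction concrete, here is a minimal sketch with two hand-written HTML snippets. Both snippets and the id `jobs` are invented for illustration and do not come from any real site. The point is that a parser only sees what the server sent: in the "dynamic" snippet, the job title would appear only after a browser ran the script.

```python
from bs4 import BeautifulSoup

# Hypothetical response from a static site: the data is already in the HTML.
static_html = """
<html><body>
  <div id="jobs"><h2>Economist</h2></div>
</body></html>
"""

# Hypothetical response from a dynamic site: an empty container plus a script
# that a browser would run to fill it in.
dynamic_html = """
<html><body>
  <div id="jobs"></div>
  <script>document.getElementById("jobs").textContent = "Economist";</script>
</body></html>
"""

def jobs_text(html):
    """Return the text found inside the element with id="jobs"."""
    soup = BeautifulSoup(html, "html.parser")
    return soup.find(id="jobs").get_text(strip=True)

static_text = jobs_text(static_html)
dynamic_text = jobs_text(dynamic_html)
print(repr(static_text))   # the title is there
print(repr(dynamic_text))  # empty: nobody ran the script
```

This is why the rest of this section sticks to the static case.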
 
Let's start with the first case. We use the example from https://realpython.com/beautiful-soup-web-scraper-python/ : the **Monster Job** site.

### Get the URL and the HTML code it leads to

The `requests` library lets you make `HTTP` requests from Python. Normally you would type the URL of the website you want into the address bar of your Firefox or Chrome window (these days you often just type a word, because the search engine is built into the address bar). `requests` does that for you in Python. Hence, it is the first step of web scraping! More at https://realpython.com/python-requests/

In [74]:
import requests #the library to make requests on the web (i.e, to look for things!)
from pprint import pprint  #to print (a bit more) prettily your request output.
from bs4 import BeautifulSoup #the library to process the output of requests. See below.

In [75]:
URL = 'https://www.monster.com/jobs/search/?q=Economist&where=USA'
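
The address above carries the search terms in its query string, the part after the `?`. As a small aside, here is a sketch of how such a URL can be assembled from a dictionary with the standard library, and taken apart again. The names `base_url`, `params` and `built_url` are only illustrative.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

base_url = "https://www.monster.com/jobs/search/"
params = {"q": "Economist", "where": "USA"}

# Build the full URL from the base address and the search parameters.
built_url = f"{base_url}?{urlencode(params)}"
print(built_url)

# Going the other way: split an existing URL back into its parameters.
query = parse_qs(urlsplit(built_url).query)
print(query)
```

Changing the values in `params` is then enough to search for another job or another place. `requests.get` also accepts a `params` dictionary and builds the query string for you.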

In [76]:
page = requests.get(URL) #request the URL from the web server.

Play around with the `page` object to see what `requests.get` returns. Look at `status_code` next: a value of 200 means that your request was successful and that the server responded with the data you were requesting.

**Question** : **what does it mean if the output is 404? You've seen this a lot already...**

In [77]:
page.status_code #what is the status code of the HTTP request? 

200
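
Status codes come in families, and the first digit tells you which one. The sketch below uses the standard library's `http.HTTPStatus` to turn a code into its official phrase; the helper `describe_status` and the family labels are my own illustration, not part of `requests`.

```python
from http import HTTPStatus

def describe_status(code):
    """Return a short description such as '200 OK (success)'."""
    if 100 <= code < 200:
        family = "informational"
    elif 200 <= code < 300:
        family = "success"
    elif 300 <= code < 400:
        family = "redirection"
    elif 400 <= code < 500:
        family = "client error"
    else:
        family = "server error"
    # HTTPStatus(code) raises ValueError for a code it does not know.
    return f"{code} {HTTPStatus(code).phrase} ({family})"

for code in (200, 301, 500):
    print(describe_status(code))
```

In a real script you would typically check `page.status_code` (or call `page.raise_for_status()`) before parsing anything.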

Let's look at the HTML content of the page.

In [83]:
pprint(page.content) #what does the HTML look like ? 

(b'<!DOCTYPE html>\r\n<html xmlns="https://www.w3.org/1999/xhtml" xml:lang="e'
 b'n" lang="en">\r\n<head>\r\n    \r\n            <link rel="preconnect" href'
 b'="https://coda.newjobs.com" />\r\n            <link rel="preconnect" href='
 b'"https://js-seeker.newjobs.com" />\r\n            <link rel="preconnect" h'
 b'ref="https://css-seeker.newjobs.com" />\r\n            <link rel="preconne'
 b'ct" href="https://securemedia.newjobs.com" />\r\n            <link rel="pr'
 b'econnect" href="https://logs2.jobs.com" />\r\n            <link rel="preco'
 b'nnect" href="https://job-openings.monster.com" />\r\n            <link rel'
 b'="preconnect" href="https://apis.google.com" />\r\n            <link rel="'
 b'preconnect" href="https://www.google.com" />\r\n            <link rel="pre'
 b'connect" href="https://accounts.google.com" />\r\n            <link rel="p'
 b'reconnect" href="https://content.googleapis.com" />\r\n            <link r'
 b'el="preconnect" href="https://ssl.gstatic.com" />

 b"his.addClass('hidden');\r\n                        var title = $this.find("
 b"'h3').html();\r\n                        var $childs = $this.find('ul li')"
 b";\r\n\r\n                        var $parentUl = $this.closest('ul');\r\n "
 b'                       $parentUl.append(\'<li class="cmsNavContainer drop'
 b'down-sub"></li>\');\r\n                        var $newCmsContainer = $pare'
 b"ntUl.find('li.cmsNavContainer.dropdown-sub');\r\n                        $"
 b'newCmsContainer.append(\'<a href="#" class="dropdown-toggle" data-toggle='
 b'"dropdown" role="button" aria-haspopup="true" aria-expanded="false">\' + '
 b'title + \' <span class="caret"></span></a>\');\r\n                        $n'
 b'ewCmsContainer.append(\'<ul class="dropdown-menu dropdown-menu-sub"></ul>'
 b"');\r\n                        $newCmsContainer.find('ul').append($childs)"
 b';\r\n\r\n                        $this.remove();\r\n                    }\r'
 b'\n                }\r\n\r\n                //CMS 

 b'ot;eVar25&quot;:&quot;Program Planning \\u0026 Control Analyst III&quot;,'
 b'&quot;eVar66&quot;:&quot;Monster&quot;,&quot;eVar67&quot;:&quot;JSR2CW&quot;'
 b',&quot;eVar26&quot;:&quot;xsphhoneyx_Honeywell&quot;,&quot;eVar31&quot;:&quo'
 b't;Torrance_CA_&quot;,&quot;prop22&quot;:&quot;Employee&quot;,&quot;prop24&qu'
 b'ot;:&quot;2020-03-20T12:00&quot;,&quot;eVar53&quot;:&quot;1300093001001&quot'
 b';,&quot;eVar50&quot;:&quot;Duration&quot;,&quot;eVar74&quot;:&quot;regular&q'
 b'uot;}&#39;)">Program Planning &amp; Control Analyst III\r\n</a></h2>\r\n    '
 b'                            </header>\r\n                                <'
 b'div class="company">\r\n                                        <span clas'
 b's="name">Honeywell</span>\r\n\r\n                                    <ul cla'
 b'ss="list-inline">\r\n\r\n                                    </ul>\r\n    '
 b'                            </div>\r\n                                <div'
 b' class="location">\r\n              

We have our HTML content. It's messy: let's turn to the **cleaning** and **parsing** parts.

### Clean and analyze the HTML

The `BeautifulSoup` package exists exactly for that. You give it the HTML content, and it returns a structure that you can search and read. Let's create a `BeautifulSoup` object; the second argument tells it which parser to use (here Python's built-in `html.parser`).

In [84]:
soup = BeautifulSoup(page.content, 'html.parser') #parse the HTML with BeautifulSoup
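
If you want to see what such an object gives you without downloading anything, here is a minimal sketch on a tiny hand-written page. The HTML and the name `sample_soup` are made up for illustration; the bytes literal stands in for `page.content`.

```python
from bs4 import BeautifulSoup

# A tiny hand-written page standing in for page.content (bytes, like the real response).
sample_html = b"<html><head><title>Job search</title></head><body><p>Hello</p></body></html>"

sample_soup = BeautifulSoup(sample_html, "html.parser")

print(sample_soup.title.text)   # text of the <title> element
print(sample_soup.p.text)       # text of the first <p> element
print(sample_soup.prettify())   # the same HTML, indented one tag per line
```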

There are two ways to retrieve elements from that object: by `id`, or by tag name and class. An `HTML` element can carry an `id` attribute, which should be unique within the page. You can hence pick one element by passing its id to `find`. Here, for example, the element that contains all the listings has the id `ResultsContainer`.

In [85]:
results = soup.find(id='ResultsContainer') #extract one element
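
One thing worth knowing about `find`: when nothing matches, it returns `None` rather than raising an error. A small sketch on invented HTML (the ids `Footer` and `NoSuchId` are only for illustration):

```python
from bs4 import BeautifulSoup

sample_html = """
<div id="ResultsContainer"><p>3 jobs found</p></div>
<div id="Footer"><p>Contact us</p></div>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

found = sample_soup.find(id="ResultsContainer")   # a Tag object
missing = sample_soup.find(id="NoSuchId")         # no such element

print(found.p.text)
print(missing)
```

So if a site changes its layout and `results` comes back as `None`, the next call on it fails with an `AttributeError`; checking for `None` first gives a clearer message.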

In [86]:
print(results.prettify())

<div class="mux-custom-scroll" data-extend="left" data-mux="customScroll" data-target="html" id="ResultsContainer">
 <div class="scrollable" id="ResultsScrollable">
  <script type="application/ld+json">
   {"@context":"https://schema.org","@type":"ItemList","mainEntityOfPage":{
            "@type":"CollectionPage","@id":"https://www.monster.com/jobs/search/?q=Economist&amp;where=USA"
            }
            ,"itemListElement":[

                 {"@type":"ListItem","position":1,"url":"https://job-openings.monster.com/transportation-economist-mill-creek-building-odot-or-us-oregon-department-of-transportation/216739002"}
                    ,
                 {"@type":"ListItem","position":2,"url":""}
                    ,
                 {"@type":"ListItem","position":3,"url":"https://job-openings.monster.com/senior-economist-albany-ny-us-nys-office-of-the-state-comptroller/216493284"}
                    ,
                 {"@type":"ListItem","position":4,"url":"https://j

So that is a part of the HTML: all the results of the search on the job site. Every job posting is wrapped in a `section` element whose class is "card-content". Within it you have sub-elements: the title of the job, the location, the company. These are what you are interested in, and you can extract them.

Three pieces of HTML matter here: the `section` and `div` elements, and the `class` attribute. Think of it like this: inside the `section` whose class (a kind of label) is "card-content", there is a `header` whose class is "card-header", which holds the `h2` whose class is "title"; there are also the `div` whose class is "company" and the `div` whose class is "location". Hence you infer that "card-content" identifies the card of a single job!

Hence, you want to ask Python to "pick each `section` whose class is "card-content"" and put them in a sort of list.

In [87]:
job_elems = results.find_all('section', class_='card-content')
print(type(job_elems))

<class 'bs4.element.ResultSet'>


You get this sort of list, a `ResultSet`, which is an object specific to the BeautifulSoup package. You can index it and loop over it like a regular list.
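
Here is a self-contained sketch of the same idea on a hand-written imitation of the card structure. The companies, links and the empty second card are invented; only the tag and class names follow the page above.

```python
from bs4 import BeautifulSoup

sample_html = """
<div id="ResultsContainer">
  <section class="card-content">
    <h2 class="title"><a href="https://example.com/job/1">Economist</a></h2>
    <div class="company">Example Corp</div>
    <div class="location">Albany, NY</div>
  </section>
  <section class="card-content"></section>
  <section class="card-content">
    <h2 class="title"><a href="https://example.com/job/2">Analyst</a></h2>
    <div class="company">Sample Inc</div>
    <div class="location">Seattle, WA</div>
  </section>
</div>
"""
sample_results = BeautifulSoup(sample_html, "html.parser").find(id="ResultsContainer")

cards = sample_results.find_all("section", class_="card-content")
print(len(cards))                  # a ResultSet supports len(), indexing and loops
print(cards[0].find("h2").text)    # title of the first card
print(cards[1].find("h2"))         # the empty card has no <h2>, so find() gives None
```

Note the underscore in `class_`: `class` is a reserved word in Python, so BeautifulSoup uses `class_` for the keyword argument.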


In [88]:
job_elems[2]

<section class="card-content" data-jobid="216493284" data-postingid="9f860997-6841-4cc5-942d-7856190cc6cb" onclick="MKImpressionTrackingMouseDownHijack(this, event)">
<div class="flex-row">
<div class="mux-company-logo thumbnail"></div>
<div class="summary">
<header class="card-header">
<h2 class="title"><a data-bypass="true" data-m_impr_a_placement_id="JSR2CW" data-m_impr_j_cid="559" data-m_impr_j_coc="xtime_nyoscx" data-m_impr_j_jawsid="425967836" data-m_impr_j_jobid="216493284" data-m_impr_j_jpm="1" data-m_impr_j_jpt="1" data-m_impr_j_lat="42.6525" data-m_impr_j_lid="544" data-m_impr_j_long="-73.7563" data-m_impr_j_occid="11866" data-m_impr_j_p="2" data-m_impr_j_postingid="9f860997-6841-4cc5-942d-7856190cc6cb" data-m_impr_j_pvc="monster" data-m_impr_s_t="t" data-m_impr_uuid="75217ea2-2cef-4480-8abe-de16764851b7" href="https://job-openings.monster.com/senior-economist-albany-ny-us-nys-office-of-the-state-comptroller/216493284" onclick="clickJobTitle('plid=544&amp;pcid=559&amp;poccid=

You picked the element at index 2, i.e. the third item of the list, since Python counts from 0. You're getting closer to something clean, but you want to get rid of all the HTML you don't care about. You essentially want the job title, the location and the company. You are going to do exactly the same as before, but narrowing the window: in each job, you want only the elements "title", "location" and "company".

In [89]:
for job_elem in job_elems:
    title=job_elem.find('h2', class_='title')
    location=job_elem.find('div', class_='location')
    company=job_elem.find('div', class_='company')
    print(title)
    print(location)
    print(company)
    print()

<h2 class="title"><a data-bypass="true" data-m_impr_a_placement_id="JSR2CW" data-m_impr_j_cid="11" data-m_impr_j_coc="xw301558987wx" data-m_impr_j_jawsid="427022700" data-m_impr_j_jobid="216739002" data-m_impr_j_jpm="1" data-m_impr_j_jpt="1" data-m_impr_j_lat="0" data-m_impr_j_lid="573" data-m_impr_j_long="0" data-m_impr_j_occid="11892" data-m_impr_j_p="1" data-m_impr_j_postingid="f3991871-0cf7-4a34-bffc-9674024247db" data-m_impr_j_pvc="monster" data-m_impr_s_t="t" data-m_impr_uuid="f97f7172-f9be-4241-820d-3ce00786b919" href="https://job-openings.monster.com/transportation-economist-mill-creek-building-odot-or-us-oregon-department-of-transportation/216739002" onclick="clickJobTitle('plid=573&amp;pcid=11&amp;poccid=11892','Economist',''); clickJobTitleSiteCat('{&quot;events.event48&quot;:&quot;true&quot;,&quot;eVar25&quot;:&quot;Transportation Economist&quot;,&quot;eVar66&quot;:&quot;Monster&quot;,&quot;eVar67&quot;:&quot;JSR2CW&quot;,&quot;eVar26&quot;:&quot;xw301558987wx_Oregon Depart

Looks more trimmed. You still want to clean that up. **How do you extract text from the HTML element?** 

### Extract text from the parsed HTML

In [90]:
for job_elem in job_elems:
    title_elem = job_elem.find('h2', class_='title')
    company_elem = job_elem.find('div', class_='company')
    location_elem = job_elem.find('div', class_='location')
    if None in (title_elem, company_elem, location_elem): #tell Python to skip that iteration if the job listing does not contain the information you want
        continue
    print(title_elem.text)
    print(company_elem.text)
    print(location_elem.text)
    print()


Transportation Economist


Oregon Department of Transportation





Mill Creek Building - ODOT, OR



Senior Economist


NYS Office of the State Comptroller





Albany, NY



Sr. Transportation Economist (Sr. Business Intelligence Planner)


Washington Metropolitan Area Transit Authority





Washington, DC



INTERNATIONAL ECONOMIST


UNIVERSITY OF PENNSYLVANIA





Philadelphia, PA



Economist


NYISO





Rensselaer, NY



Senior Equity Research Associate - Finance, Economics


CyberCoders





Cleveland, OH



ECONOMICS EDUCATION SME


GEX, Inc.





ATKINSON, NH



ECONOMIC DEVELOPMENT DIRECTOR


TOWN OF DOVER





Mount Snow, VT



Project Manager, Office of Economic Development


University of Nevada, Las Vegas





Las Vegas, NV



VP Real Estate Technology


PenFed Credit Union





San Antonio, TX



Economist


Amazon Corporate LLC





Seattle, WA



Director Product Marketing


New York Power Authority





White Plains, NY



Sales Pl
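
The many blank lines in the output above come from the whitespace and line breaks that sit inside the HTML elements. A minimal sketch of how they can be trimmed, on an invented card (the company name and link are placeholders):

```python
from bs4 import BeautifulSoup

sample_html = """
<section class="card-content">
  <h2 class="title">
    <a href="https://example.com/job/1">Transportation Economist
</a></h2>
  <div class="company">
    <span class="name">Example Department</span>
  </div>
</section>
"""
card = BeautifulSoup(sample_html, "html.parser").find("section")

raw_title = card.find("h2", class_="title").text
clean_title = raw_title.strip()                      # plain Python str.strip()
clean_company = card.find("div", class_="company").get_text(strip=True)

print(repr(raw_title))      # shows the surrounding whitespace
print(repr(clean_title))
print(repr(clean_company))
```

Either `.text.strip()` or `.get_text(strip=True)` would do in the loop above; which one you pick is mostly a matter of taste.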

Now it looks **absolutely beautiful**! You can do more with that. Assume you are only interested in economist jobs. You can tell BeautifulSoup to keep only the elements whose job title contains this word.

In [91]:
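# keep the <h2> elements whose text contains "economist"; text can be None, hence the guard
econ_jobs = results.find_all('h2',
                             string=lambda text: text is not None and 'economist' in text.lower())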

In [93]:
print(len(econ_jobs))

6
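
To see how the `string` filter behaves, here is a small sketch on invented titles. The function `mentions_economist` is a hypothetical helper that does the same job as the lambda, with a guard for the case where an `<h2>` has no single string to test.

```python
from bs4 import BeautifulSoup

sample_html = """
<h2 class="title"><a href="#">Senior Economist</a></h2>
<h2 class="title"><a href="#">Data Analyst</a></h2>
<h2 class="title"><a href="#">ECONOMIST</a> <span>new</span></h2>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

def mentions_economist(text):
    # text may be None when the <h2> has several children and so no single string
    return text is not None and "economist" in text.lower()

matches = sample_soup.find_all("h2", string=mentions_economist)
print([h2.get_text(strip=True) for h2 in matches])
```

The third title is not matched even though it contains the word: it has more than one child, so BeautifulSoup has no single string to compare. Keep that limitation in mind if a count looks too low.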


### Extract attributes from the parsed HTML

Now you also want the link to each job (your goal is to apply!). The link is in the element with the "title" class, inside its `<a>` tag. The `<a>` tag has several attributes; one of them, `href`, holds the link. Fetch it.

In [95]:
for e_job in econ_jobs:
    link = e_job.find('a')['href']
    print(f"Apply here: {link}\n")

Apply here: https://job-openings.monster.com/transportation-economist-mill-creek-building-odot-or-us-oregon-department-of-transportation/216739002

Apply here: https://job-openings.monster.com/senior-economist-albany-ny-us-nys-office-of-the-state-comptroller/216493284

Apply here: https://job-openings.monster.com/sr-transportation-economist-sr-business-intelligence-planner-washington-dc-us-washington-metropolitan-area-transit-authority/216685632

Apply here: https://job-openings.monster.com/international-economist-philadelphia-pa-us-university-of-pennsylvania/216313117

Apply here: https://job-openings.monster.com/economist-rensselaer-ny-us-nyiso/216090939

Apply here: https://job-openings.monster.com/economist-seattle-wa-us-amazon-corporate-llc/216838335
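
Square brackets such as `['href']` raise a `KeyError` when the attribute is missing, and `find('a')` returns `None` when there is no link at all. A slightly more defensive sketch, on invented titles:

```python
from bs4 import BeautifulSoup

sample_html = """
<h2 class="title"><a href="https://example.com/job/1">Economist</a></h2>
<h2 class="title">Economist (link missing)</h2>
"""
titles = BeautifulSoup(sample_html, "html.parser").find_all("h2")

links = []
for h2 in titles:
    anchor = h2.find("a")
    # .get() returns None instead of raising KeyError when the attribute is absent
    href = anchor.get("href") if anchor is not None else None
    links.append(href)

print(links)
```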



Beautiful, right?  😃