# Read website info really, really fast: web scraping in Python

## I. Newbie terminology

 This tutorial uses a lot of technical jargon. Here are the terms to know:
 
 * BeautifulSoup
 * Requests
 * Web page
 * Website
 * Web server
 * Web browser
 * HTML
 * DOM
 * Static/Dynamic websites
 

## II. Static website: Monster Jobs

There's a lot of information on the web. We would like to collect and parse all of it, but life is short. Hence, we web-scrape. How? When you read a web page, you read the rendered output of some code, written for example in `HTML`. Web scraping means writing a program that systematically collects chosen parts of that page's code. How hard the task is depends on how complex that code is. Assuming the page is written in `HTML`, there are basically two cases:
 1. The server of a _static_ website returns all of the page's `HTML` code that you will see as a user. All the information is there at once.
 2. The server of a _dynamic_ website may return little of the final content directly. Instead of storing every page in full, many dynamic websites use `HTML` templates that are filled in with varying datasets, and the server sends `JavaScript` code that the browser runs locally to produce the page you see.
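
The difference between the two cases can be sketched with two made-up server responses. Both strings below are hypothetical and are not real Monster markup; the helper `contains_listing` is also just an illustrative name. The point is only that a job title is present in the raw response of a static page, while a dynamic page may return an empty container plus a script.

```python
# Two hypothetical server responses for the same search (not real site markup).
static_html = """
<html><body>
  <div id="results"><h2>Lead Economist</h2></div>
</body></html>
"""

dynamic_html = """
<html><body>
  <div id="results"></div>
  <script src="/load-jobs.js"></script>
</body></html>
"""


def contains_listing(html: str, title: str) -> bool:
    """Return True if the job title appears in the raw HTML sent by the server."""
    return title in html


print(contains_listing(static_html, "Lead Economist"))   # True
print(contains_listing(dynamic_html, "Lead Economist"))  # False
```

In the second case the title would only show up after a browser has run the script, which is why a plain `HTTP` request is not enough for such pages.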
 
Let's start with the first case. We use the Monster job site, the example from the Real Python tutorial (see https://realpython.com/beautiful-soup-web-scraper-python/).

### Get the URL and the HTML code it leads to

The `requests` library lets you make `HTTP` requests from Python, which makes it the first step of web scraping. More at https://realpython.com/python-requests/

In [57]:
import requests
import pprint
from bs4 import BeautifulSoup

In [11]:
URL = 'https://www.monster.com/jobs/search/?q=Economist&where=NYC'
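
It can help to see how this URL is put together before requesting it. The sketch below uses only the standard library to split the URL into its parts and to rebuild it with other search terms. The replacement values `Statistician` and `Boston` are made up for illustration, and whether the site returns useful results for them is an assumption.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

URL = 'https://www.monster.com/jobs/search/?q=Economist&where=NYC'

# Split the URL into scheme, host, path and query string.
parts = urlsplit(URL)
print(parts.netloc)  # www.monster.com
print(parts.path)    # /jobs/search/

# The query string holds the search parameters.
params = parse_qs(parts.query)
print(params)        # {'q': ['Economist'], 'where': ['NYC']}

# Rebuild the URL with different (hypothetical) search terms.
new_query = urlencode({'q': 'Statistician', 'where': 'Boston'})
new_url = parts._replace(query=new_query).geturl()
print(new_url)
```

`requests.get` also accepts a `params` dictionary, so the query string does not have to be written by hand.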

In [12]:
page = requests.get(URL)

Play around with the `page` object to see what `requests.get` returns. For example, look at `status_code` next: a value of 200 means your request was successful and the server responded with the data you were requesting.

In [17]:
page.status_code

200
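
Other status codes are possible, so a scraper usually checks the code before parsing. As a minimal sketch, the standard library's `http.HTTPStatus` can translate a standard code into its name; the helper `describe_status` is a hypothetical name introduced here, and treating the 200 to 299 range as success is the usual `HTTP` convention.

```python
from http import HTTPStatus


def describe_status(code: int) -> str:
    """Return a short description of a standard HTTP status code."""
    status = HTTPStatus(code)  # raises ValueError for non-standard codes
    outcome = "success" if 200 <= code < 300 else "not a success"
    return f"{code} {status.phrase}: {outcome}"


print(describe_status(200))  # 200 OK: success
print(describe_status(404))  # 404 Not Found: not a success
```

With `requests` itself, `page.ok` gives a quick boolean check and `page.raise_for_status()` raises an exception for error codes.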

Let's look at the HTML content of the page.

In [53]:
pp = pprint.PrettyPrinter(indent=4)
pp.pprint(page.content)

(b'<!DOCTYPE html>\r\n<html xmlns="https://www.w3.org/1999/xhtml" xml:lang="e'
 b'n" lang="en">\r\n<head>\r\n    \r\n            <link rel="preconnect" href'
 b'="https://coda.newjobs.com" />\r\n            <link rel="preconnect" href='
 b'"https://js-seeker.newjobs.com" />\r\n            <link rel="preconnect" h'
 b'ref="https://css-seeker.newjobs.com" />\r\n            <link rel="preconne'
 b'ct" href="https://securemedia.newjobs.com" />\r\n            <link rel="pr'
 b'econnect" href="https://logs2.jobs.com" />\r\n            <link rel="preco'
 b'nnect" href="https://job-openings.monster.com" />\r\n            <link rel'
 b'="preconnect" href="https://apis.google.com" />\r\n            <link rel="'
 b'preconnect" href="https://www.google.com" />\r\n            <link rel="pre'
 b'connect" href="https://accounts.google.com" />\r\n            <link rel="p'
 b'reconnect" href="https://content.googleapis.com" />\r\n            <link r'
 b'el="preconnect" href="https://ssl.gstatic.com" />

 b'                <li role="separator" class="divider"></li>\r\n\r\n          '
 b'                          <li id="mobile-nav-0" class="menu__item">\r\n   '
 b'                                     <a class="menu__link" role="menuitem" d'
 b'ata-submenu="submenu-0" href="#" aria-haspopup="true" aria-expanded="false">'
 b'Find Jobs</a>\r\n                                        <ul data-menu="su'
 b'bmenu-0" class="menu__level">\r\n                                         '
 b'   \r\n    <li id="mobile-subnav-2" class="menu__item">\r\n        <a class='
 b'"menu__link bold-submenu" href="#" role="menuitem">Find Jobs</a>\r\n    </'
 b'li>\r\n    <li id="mobile-subnav-3" class="menu__item">\r\n        <a class='
 b'"menu__link " href="https://www.monster.com/jobs/" role="menuitem">Browse Jo'
 b'bs</a>\r\n    </li>\r\n    <li id="mobile-subnav-4" class="menu__item">\r'
 b'\n        <a class="menu__link " href="https://www.monster.com/jobs/advan'
 b'ced-search/" role="menuitem">Advanced S

 b'                              </div>\r\n                            <div c'
 b'lass="summary">\r\n                                <header class="card-hea'
 b'der">\r\n                                    <h2 class="title"><a data-byp'
 b'ass="true" data-m_impr_a_placement_id="JSR2CW" data-m_impr_j_cid="660" data-'
 b'm_impr_j_coc="xinfotechnolx" data-m_impr_j_jawsid="423756447" data-m_impr_j_'
 b'jobid="216200095" data-m_impr_j_jpm="1" data-m_impr_j_jpt="1" data-m_impr_j_'
 b'lat="40.727" data-m_impr_j_lid="549" data-m_impr_j_long="-73.6328" data-m_im'
 b'pr_j_occid="11979" data-m_impr_j_p="23" data-m_impr_j_postingid="9c195e8d-0c'
 b'b4-40d3-a36e-9e6cad92a29d" data-m_impr_j_pvc="monster" data-m_impr_s_t="t" d'
 b'ata-m_impr_uuid="f166c902-9483-43d8-b646-d9716f43c963" href="https://job-ope'
 b'nings.monster.com/data-analyst-garden-city-ny-us-diversant-llc/216200095" on'
 b'Click="clickJobTitle(&#39;plid=549&amp;pcid=660&amp;poccid=11979&#39;,&#39;E'
 b'conomist&#39;,&#39;&#39;); clic

We have our HTML content. It's messy, so let's turn to cleaning and parsing.
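
Part of what makes the output above hard to read is that `page.content` is a `bytes` object, so line breaks are printed as `\r\n` instead of being applied. The sketch below decodes the first few bytes of that output into text. Using UTF-8 as the encoding is an assumption here; in a real session `page.text` gives the decoded string directly.

```python
# The beginning of the response shown above, as raw bytes.
raw = b'<!DOCTYPE html>\r\n<html xmlns="https://www.w3.org/1999/xhtml" xml:lang="en" lang="en">\r\n<head>\r\n'

# Decode the bytes into a string (UTF-8 is assumed, not read from the response).
text = raw.decode('utf-8')

# Split on line breaks to get one entry per line of HTML.
lines = text.splitlines()
for line in lines:
    print(line)
```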

### Clean and analyze the HTML

The `BeautifulSoup` package in Python is made exactly for that. You give it the HTML content, and it returns a structure that is easy to read and search. Let's create a `BeautifulSoup` object; the second argument tells it which parser to use (here `html.parser`, the one that ships with Python).

In [58]:
soup = BeautifulSoup(page.content, 'html.parser')

There are two ways to retrieve elements from that object. The first uses the `id` attribute: in `HTML`, an element can carry an `id`, which is meant to be unique within the page. You can therefore pick an element by passing its id to `find`. Here, for example, the element that contains all the listings has the id `ResultsContainer`.

In [60]:
results = soup.find(id='ResultsContainer')
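
To see what `find(id=...)` returns without contacting the site, here is a minimal sketch on a small hand-written snippet. The markup and job titles are made up for illustration; only the id `ResultsContainer` comes from the page above. The names `demo_soup`, `demo_results`, `missing` and `titles` are hypothetical and chosen so they do not overwrite the notebook's variables.

```python
from bs4 import BeautifulSoup

# Hypothetical, heavily simplified stand-in for the real page.
html = """
<html><body>
  <div id="ResultsContainer">
    <h2 class="title">Lead Economist</h2>
    <h2 class="title">Data Analyst</h2>
  </div>
</body></html>
"""

demo_soup = BeautifulSoup(html, 'html.parser')

# find returns the first matching element, or None if nothing matches.
demo_results = demo_soup.find(id='ResultsContainer')
missing = demo_soup.find(id='NoSuchId')

# Once the container is found, search inside it only.
titles = [h2.get_text(strip=True) for h2 in demo_results.find_all('h2')]

print(demo_results.name)  # div
print(missing)            # None
print(titles)             # ['Lead Economist', 'Data Analyst']
```

Because `find` returns `None` for a missing id, it is worth checking the result before calling methods on it.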

In [62]:
results.prettify()

'<div class="mux-custom-scroll" data-extend="left" data-mux="customScroll" data-target="html" id="ResultsContainer">\n <div class="scrollable" id="ResultsScrollable">\n  <script type="application/ld+json">\n   {"@context":"https://schema.org","@type":"ItemList","mainEntityOfPage":{\r\n            "@type":"CollectionPage","@id":"https://www.monster.com/jobs/search/?q=Economist&amp;where=NYC"\r\n            }\r\n            ,"itemListElement":[\r\n\r\n                 {"@type":"ListItem","position":1,"url":"https://job-openings.monster.com/lead-economist-thematic-research-econometrics-new-york-ny-us-mastercard/3d012c6c-7a55-4e86-80d9-ce72d9f48644"}\r\n                    ,\r\n                 {"@type":"ListItem","position":2,"url":""}\r\n                    ,\r\n                 {"@type":"ListItem","position":3,"url":"https://job-openings.monster.com/private-wealth-management-administrative-assistant-program-new-york-ny-us-advantage-xpo/216773985"}\r\n                    ,\r\n           