Data Science Fundamentals: Python |
[Table of Contents](../../index.ipynb)
- - - 
<!--NAVIGATION-->
Real World Examples: **[Web Scraping](./01_rw_web_scraping.ipynb)** | [Automation](../automation/02_rw_automation.ipynb) | [Messaging](../messaging/03_rw_messaging.ipynb) | [CSV](../csv/04_rw_csv.ipynb) | [Games](../games/05_games.ipynb) | [Mobile](../mobile/06_mobile.ipynb) | [Computer Vision](../computer_vision/08_computer_vision.ipynb) | [Chatbot](../chatbot/10_chatbot.ipynb) | [Built-In Database](../database/11_database.ipynb)
- - -
Life Examples: [COVID-19](../COVID-19/COVID-19_visualizations-plotly.ipynb) | [Police Brutality](https://maminian.github.io/brutality-map/) | [Spanish Flu](../spanishflu/index.ipynb)

## Real World: Scrape Data from nearly Any Website

Two top options for web scraping in Python are [Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/) and [Scrapy](https://docs.scrapy.org/en/latest/). These examples use Beautiful Soup.

Beautiful Soup has been used in hundreds of different projects. There's no way I can list them all, but I want to highlight a few high-profile projects. Beautiful Soup isn't what makes these projects interesting, but it did make their completion easier:

- ["Movable Type"](https://www.nytimes.com/2007/10/25/arts/design/25vide.html), a work of digital art on display in the lobby of the New York Times building, uses Beautiful Soup to scrape news feeds.
- [Jiabao Lin's DXY-COVID-19-Crawler](https://github.com/BlankerL/DXY-COVID-19-Crawler) uses Beautiful Soup to scrape a Chinese medical site for information about COVID-19, making it easier for researchers to track the spread of the virus. (Source: "How open source software is fighting COVID-19")
- Reddit uses Beautiful Soup to [parse a page that's been linked to and find a representative image](https://github.com/reddit-archive/reddit/blob/85f9cff3e2ab9bb8f19b96acd8da4ebacc079f04/r2/r2/lib/media.py).
- Alexander Harrowell uses Beautiful Soup to track the [business activities of an arms merchant](http://www.harrowell.org.uk/viktormap.html).
- The Lawrence Journal-World uses Beautiful Soup to [gather statewide election results](https://www.b-list.org/weblog/2010/nov/02/news-done-broke/).

### Install BeautifulSoup, LXML and Future

In [None]:
Python 3.7.7 (default, Mar 10 2020, 15:43:33) 
[Clang 11.0.0 (clang-1100.0.33.17)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> help("modules")

In the list of modules, search for **bs4** (the import name of Beautiful Soup), **lxml**, and **future**.
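
The same check can be done programmatically. This is a minimal sketch, assuming the import names `bs4`, `lxml`, and `future` for the three packages; the helper name `find_missing_modules` is hypothetical and not part of any library:

```python
import importlib.util

# Import names of the packages this notebook relies on.
# Note that Beautiful Soup is imported as "bs4".
REQUIRED_MODULES = ["bs4", "lxml", "future"]


def find_missing_modules(names):
    """Return the module names from `names` that cannot be imported."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = find_missing_modules(REQUIRED_MODULES)
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All required modules are installed.")
```

`importlib.util.find_spec` only looks the module up; it does not import it, so the check is cheap and has no side effects.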

If you get an error from Command Line you might want to submit this line 


So that this code works with either Python 2 or Python 3, you will need one helper library. Run in the terminal:

In [None]:
pip install future

### Extracting Names and URLs from an HTML page

We will go from a saved search results page (an HTML file) to a CSV file of names and URLs.

To begin, import the Beautiful Soup library, open the HTML file and pass it to Beautiful Soup, and then print the “pretty” version in the terminal.

In [3]:
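from bs4 import BeautifulSoup

# Open the saved page in a `with` block so the file is closed after parsing
with open("files/43rd-congress.html", encoding="utf-8") as html_file:
    soup = BeautifulSoup(html_file, features="lxml")

print(soup.prettify())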

<!DOCTYPE html>
<!-- saved from url=(0053)https://bioguideretro.congress.gov/Home/SearchResults -->
<html>
 <head>
  <meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
  <meta content="width=device-width, initial-scale=1.0" name="viewport"/>
  <script type="text/javascript">
   window.NREUM||(NREUM={}),__nr_require=function(t,e,n){function r(n){if(!e[n]){var o=e[n]={exports:{}};t[n][0].call(o.exports,function(e){var o=t[n][1][e];return r(o||e)},o,o.exports)}return e[n].exports}if("function"==typeof __nr_require)return __nr_require;for(var o=0;o<n.length;o++)r(n[o]);return r}({1:[function(t,e,n){function r(t){try{s.console&&console.log(t)}catch(e){}}var o,i=t("ee"),a=t(24),s={};try{o=localStorage.getItem("__nr_flags").split(","),console&&"function"==typeof console.log&&(s.console=!0,o.indexOf("dev")!==-1&&(s.dev=!0),o.indexOf("nr_dev")!==-1&&(s.nrDev=!0))}catch(c){}s.nrDev&&i.on("internal-error",function(t){r(t.stack)}),s.dev&&i.on("fn-err",function(t,e,n){r(n.stack)})

Both the names and the URLs are, fortunately, embedded in the `<a>` tags, so we need to isolate all of the `<a>` tags. We can do this by updating the code:

In [4]:
from bs4 import BeautifulSoup

with open("files/43rd-congress.html", encoding="utf-8") as html_file:
    soup = BeautifulSoup(html_file, features="lxml")

# Collect every <a> tag in the document
links = soup.find_all('a')

for link in links:
    print(link)

<a href="https://bioguideretro.congress.gov/"><img src="./43rd-congress_files/index.jpg"/></a>
<a class="headerLinkWhiteNone" href="https://bioguideretro.congress.gov/">New Search</a>
<a class="headerLinkWhite" href="https://history.house.gov/">House History</a>
<a class="headerLinkWhite" href="https://www.senate.gov/art/art_hist_home.htm">Senate History</a>
<a class="headerLinkWhite" href="https://bioguideretro.congress.gov/Home/Copyright">Copyright Information</a>
<a class="headerLinkWhite" href="https://bioguideretro.congress.gov/Home/Privacy">Privacy</a>
<a class="page-link" href="https://bioguideretro.congress.gov/Home/SearchResults?page=2">2</a>
<a class="page-link" href="https://bioguideretro.congress.gov/Home/SearchResults?page=3">3</a>
<a class="page-link" href="https://bioguideretro.congress.gov/Home/SearchResults?page=4">4</a>
<a class="page-link" href="https://bioguideretro.congress.gov/Home/SearchResults?page=5">5</a>
<a class="page-link" href="https://bioguideretro.congre
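
The goal stated for this section is a CSV file of names and URLs, so the remaining step is to pull the text and `href` out of each `<a>` tag and write them as rows. The following is a minimal sketch of that step, not the notebook's original code. The sample markup, the `example.com` URLs, and the helper names `links_to_rows` and `rows_to_csv` are all made up for illustration, and the sketch uses the built-in `html.parser` so it does not depend on lxml:

```python
import csv
import io

from bs4 import BeautifulSoup

# Hypothetical markup standing in for the saved search results page
SAMPLE_HTML = """
<html><body>
  <a class="headerLinkWhite" href="https://history.house.gov/">House History</a>
  <a href="https://example.com/member/1">DOE, Jane</a>
  <a href="https://example.com/member/2">ROE, Richard</a>
  <a>No link here</a>
</body></html>
"""


def links_to_rows(html):
    """Return (text, href) pairs for every <a> tag that has an href."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    # href=True skips anchors that carry no href attribute
    for link in soup.find_all("a", href=True):
        rows.append((link.get_text(strip=True), link["href"]))
    return rows


def rows_to_csv(rows):
    """Serialize the rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["Name", "URL"])
    writer.writerows(rows)
    return buffer.getvalue()


rows = links_to_rows(SAMPLE_HTML)
print(rows_to_csv(rows))
```

On the real page you would likely also filter out navigation and pagination links (for example by their `class` attribute) before writing the rows, and write to a file opened with `newline=""` instead of an in-memory buffer.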

- - - 

## Using Python to Find Python Jobs

Instead of parsing a file that has been saved locally, you can also scrape pages directly from the web.

In [34]:
import requests

URL = 'https://www.monster.com/jobs/search/?q=Software-Developer&where=Charlotte'
page = requests.get(URL)
print(page)

<Response [200]>
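
`<Response [200]>` means the request succeeded. A site can also answer with an error status or fail to respond at all, so it may help to wrap the download in a small helper. This is a sketch under assumptions: the function name `fetch_html` and its `get` parameter are hypothetical, and `get` exists only so the function can be exercised without network access:

```python
import requests


def fetch_html(url, timeout=10, get=requests.get):
    """Download `url` and return the response body as text.

    `get` defaults to requests.get; it is a parameter only so the function
    can be tested with a stand-in (an assumption of this sketch).
    """
    response = get(url, timeout=timeout)
    # Raise requests.HTTPError for 4xx/5xx responses instead of parsing an error page
    response.raise_for_status()
    return response.text


# Example call (performs a real network request, so it is left commented out):
# html = fetch_html("https://www.monster.com/jobs/search/?q=Software-Developer&where=Charlotte")
```

Passing `timeout` keeps the script from hanging indefinitely if the server never answers.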


### Using Python to Scrape Job Details

In [40]:
import requests
from bs4 import BeautifulSoup

URL = 'https://www.monster.com/jobs/search/?q=Software-Developer&where=Seattle'
page = requests.get(URL)

soup = BeautifulSoup(page.content, 'html.parser')

results = soup.find(id='ResultsContainer')

print(results.prettify())

<div class="mux-custom-scroll" data-extend="left" data-mux="customScroll" data-target="html" id="ResultsContainer">
 <div class="scrollable" id="ResultsScrollable">
  <script type="application/ld+json">
   {"@context":"https://schema.org","@type":"ItemList","mainEntityOfPage":{
            "@type":"CollectionPage","@id":"https://www.monster.com/jobs/search/?q=Software-Developer&amp;where=Seattle"
            }
            ,"itemListElement":[

                 {"@type":"ListItem","position":1,"url":"https://job-openings.monster.com/software-developer-automotive-cloud-redmond-wa-us-vw-automotive-cloud/217699144"}
                    ,
                 {"@type":"ListItem","position":2,"url":""}
                    ,
                 {"@type":"ListItem","position":3,"url":"https://job-openings.monster.com/software-developer-full-stack-seattle-wa-us-cybercoders/217611623"}
                    ,
                 {"@type":"ListItem","position":4,"url":"https://job-openings.monster
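
The output above shows that the results container embeds a `<script type="application/ld+json">` block listing the job URLs. One possible way to read those URLs is to parse that block as JSON. The following is a minimal sketch, assuming the page keeps the `itemListElement` structure shown above; the sample markup, the `example.com` URLs, and the helper name `job_urls_from_json_ld` are hypothetical:

```python
import json

from bs4 import BeautifulSoup

# Hypothetical markup modeled on the structure printed above
SAMPLE_HTML = """
<div id="ResultsContainer">
  <script type="application/ld+json">
  {"@context": "https://schema.org", "@type": "ItemList",
   "itemListElement": [
     {"@type": "ListItem", "position": 1, "url": "https://example.com/jobs/1"},
     {"@type": "ListItem", "position": 2, "url": ""},
     {"@type": "ListItem", "position": 3, "url": "https://example.com/jobs/3"}
   ]}
  </script>
</div>
"""


def job_urls_from_json_ld(html):
    """Return the non-empty job URLs listed in the page's JSON-LD block."""
    soup = BeautifulSoup(html, "html.parser")
    script = soup.find("script", attrs={"type": "application/ld+json"})
    if script is None or not script.string:
        return []
    data = json.loads(script.string)
    # Some entries carry an empty "url", as in position 2 of the output above
    return [item["url"] for item in data.get("itemListElement", []) if item.get("url")]


urls = job_urls_from_json_ld(SAMPLE_HTML)
for url in urls:
    print(url)
```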

### Use Python to Search and Apply for Python Jobs

In [46]:
import requests
from bs4 import BeautifulSoup

# Adjacent string literals are joined without adding stray whitespace to the URL
URL = (
    "https://www.monster.com/jobs/search/?q=Software-Developer"
    "&where=Australia"
)
page = requests.get(URL)

soup = BeautifulSoup(page.content, "html.parser")
results = soup.find(id="ResultsContainer")

# Look for Python jobs; `t` is None when an <h2> has no single text string
python_jobs = results.find_all("h2", string=lambda t: t and "python" in t.lower())
for p_job in python_jobs:
    link = p_job.find("a")["href"]
    print(p_job.text.strip())
    print(f"Apply here: {link}\n")

# Print out all available jobs from the scraped webpage
job_elems = results.find_all("section", class_="card-content")
for job_elem in job_elems:
    title_elem = job_elem.find("h2", class_="title")
    company_elem = job_elem.find("div", class_="company")
    location_elem = job_elem.find("div", class_="location")
    # Skip cards that are missing any of the three fields
    if None in (title_elem, company_elem, location_elem):
        continue
    print(title_elem.text.strip())
    print(company_elem.text.strip())
    print(location_elem.text.strip())
    print()

Python Developer
Apply here: https://job-openings.monster.com/python-developer-woodlands-wa-us-lancesoft-inc/4755ec59-d0db-4ce9-8385-b4df7c1e9f7c

SQL BI (SSRS, SSIS) developer for Blackboard - NYC
LanceSoft Inc
New york, WA

Python Developer
LanceSoft Inc
Woodlands, WA

Junior QA Analyst - Melbourne, Victoria
Mediaocean
Melbourne, VIC

Test Analyst
Dialog Group
Canberra, ACT

Senior Sales Engineer
Zuora
Melbourne, VIC

Data Warehouse Tester
Dialog Group
Brisbane, QLD

Customer Experience Technical Analyst - Sydney, New South Wales
Mediaocean
Sydney, NSW

Test Analyst / Senior Test Analyst
Dialog Group
Melbourne, VIC

Senior Practice Manager - IES (WA)
Blue Ocean Ventures
New York, WA

Enterprise Account Executive
Zuora
Melbourne, VIC

Software Development Engineer/Software Developer - New initiative project from t
Amazon Corporate LLC
Seattle, WA
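
The live page can change or become unavailable, so it can be useful to try the same parsing logic on a small offline sample and collect the results in a list instead of printing them. This is a minimal sketch, assuming the `card-content`, `title`, `company`, and `location` class names used above; the sample markup, company names, URLs, and the helper name `parse_jobs` are made up for illustration:

```python
from bs4 import BeautifulSoup

# Hypothetical markup that mimics the job cards scraped above
SAMPLE_HTML = """
<div id="ResultsContainer">
  <section class="card-content">
    <h2 class="title"><a href="https://example.com/jobs/1">Python Developer</a></h2>
    <div class="company">Example Corp</div>
    <div class="location">Melbourne, VIC</div>
  </section>
  <section class="card-content">
    <!-- Placeholder card with no job details -->
  </section>
  <section class="card-content">
    <h2 class="title"><a href="https://example.com/jobs/2">Test Analyst</a></h2>
    <div class="company">Sample Group</div>
    <div class="location">Canberra, ACT</div>
  </section>
</div>
"""


def parse_jobs(html):
    """Return one dict per complete job card found in the results container."""
    soup = BeautifulSoup(html, "html.parser")
    results = soup.find(id="ResultsContainer")
    if results is None:
        # The container is absent, for example if the site changed its layout
        return []
    jobs = []
    for card in results.find_all("section", class_="card-content"):
        title = card.find("h2", class_="title")
        company = card.find("div", class_="company")
        location = card.find("div", class_="location")
        if title is None or company is None or location is None:
            continue  # skip cards that are missing a field
        link = title.find("a", href=True)
        jobs.append({
            "title": title.get_text(strip=True),
            "company": company.get_text(strip=True),
            "location": location.get_text(strip=True),
            "url": link["href"] if link is not None else None,
        })
    return jobs


jobs = parse_jobs(SAMPLE_HTML)
python_jobs = [job for job in jobs if "python" in job["title"].lower()]
for job in python_jobs:
    print(f"{job['title']} at {job['company']}: {job['url']}")
```

Returning a list of dictionaries keeps the parsing separate from the output, so the same rows could later be printed, filtered, or written to a CSV file.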




- - -

Copyright © 2020 Qualex Consulting Services Incorporated.