## Overview

Web scraping is the term for using a program to download and process content from the web.
In this chapter, you will learn about several modules that make it easy to scrape web pages in Python.
* `webbrowser` Comes with Python and opens a browser to a specific page.
* `requests` Downloads files and web pages from the internet.
* `bs4` Parses HTML, the format that web pages are written in.
* `selenium` Launches and controls a web browser. The `selenium` module is able to fill in forms and simulate mouse clicks in this browser.

## Project: mapIt.py with the webbrowser Module

In [1]:
import webbrowser
webbrowser.open('https://inventwithpython.com/')

True
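The cell above only opens a fixed page. A mapIt-style script would presumably build the URL from a street address first. The following is a minimal sketch under stated assumptions: the Google Maps URL prefix and the helper names `build_map_url` and `open_map` are illustrative choices, not something defined earlier in this chapter.

```python
import webbrowser
from urllib.parse import quote_plus

# Assumed URL prefix for a map lookup; swap in whichever map service you prefer.
MAP_URL_PREFIX = 'https://www.google.com/maps/place/'


def build_map_url(address):
    """Return a map URL for the address (hypothetical helper)."""
    # quote_plus() replaces spaces with '+' and escapes characters unsafe in URLs.
    return MAP_URL_PREFIX + quote_plus(address)


def open_map(address):
    """Open the default browser at the map page for the address."""
    return webbrowser.open(build_map_url(address))
```

Calling `open_map('870 Valencia St')` would open the browser, just as the `webbrowser.open()` call above does, and return `True` when a browser could be launched.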

## Downloading Files from the Web with the requests Module

The `requests` module lets you easily download files from the web without having to worry about complicated issues such as network errors, connection problems, and data compression.

## Downloading a Web Page with the requests.get() Function

In [4]:
import requests

In [25]:
res = requests.get('https://automatetheboringstuff.com/files/rj.txt')

In [6]:
type(res)

requests.models.Response

In [7]:
requests.codes.ok

200

In [9]:
res.status_code

200

In [10]:
print(res.text[:250])

The Project Gutenberg EBook of Romeo and Juliet, by William Shakespeare

This eBook is for the use of anyone anywhere at no cost and with
almost no restrictions whatsoever.  You may copy it, give it away or
re-use it under the terms of the Projec


In [29]:
len(res.text)

178978

## Checking for Errors

As you’ve seen, the `Response` object has a `status_code` attribute that can be checked against `requests.codes.ok` (a variable that has the integer value `200`) to see whether the download succeeded. A simpler way to check for success is to call the `raise_for_status()` method on the `Response` object. This will raise an exception if there was an error downloading the file and will do nothing if the download succeeded. 
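The explicit comparison described above could be written as follows. This is a minimal sketch; `download_succeeded` is a hypothetical helper name, not part of the `requests` module.

```python
import requests


def download_succeeded(res):
    """Return True if the Response's status code equals requests.codes.ok (200)."""
    return res.status_code == requests.codes.ok
```

With the `res` object from the earlier cells, whose `status_code` was `200`, `download_succeeded(res)` would be expected to return `True`.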

In [12]:
res = requests.get('https://inventwithpython.com/page_that_does_not_exist')

In [13]:
res.raise_for_status()

HTTPError: 404 Client Error: Not Found for url: https://inventwithpython.com/page_that_does_not_exist

Always call `raise_for_status()` after calling `requests.get()`. You want to be sure that the download has actually worked before your program continues.
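If a failed download should not crash your program, one option is to wrap the `raise_for_status()` call in `try` and `except` statements. The sketch below assumes you only want to report the problem and carry on; `report_status` is a hypothetical helper name.

```python
import requests


def report_status(res):
    """Return True if the download succeeded; otherwise print the error and return False."""
    try:
        res.raise_for_status()  # raises HTTPError for 4xx and 5xx status codes
    except requests.exceptions.HTTPError as exc:
        print('There was a problem: %s' % exc)
        return False
    return True
```

For the `page_that_does_not_exist` request above, `report_status(res)` would print the 404 message instead of stopping the program with a traceback.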

## Saving Downloaded Files to the Hard Drive

You can save the web page to a file on your hard drive with the standard `open()` function and `write()` method.
First, you must open the file in write binary mode by passing the string `'wb'` as the second argument to `open()`. Even though the page is plain text, writing binary data saves the bytes exactly as they were downloaded, which preserves the Unicode encoding of the text.


In [26]:
playFile = open('RomeoAndJuliet.txt', 'wb')

In [27]:
for chunk in res.iter_content(100000):
    playFile.write(chunk)

In [28]:
playFile.close()

To review, here’s the complete process for downloading and saving a file:
* Call `requests.get()` to download the file.
* Call `open()` with `'wb'` to create a new file in write binary mode.
* Loop over the chunks returned by the `Response` object’s `iter_content()` method.
* Call `write()` on each iteration to write the content to the file.
* Call `close()` to close the file.
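The steps above can be gathered into one function. This is a sketch rather than the chapter's own code: it swaps the explicit `close()` call for a `with` statement, which closes the file automatically, and the names `save_response` and `download_file` are hypothetical.

```python
import requests


def save_response(res, filename, chunk_size=100000):
    """Write the body of a Response to filename in binary mode; return bytes written."""
    res.raise_for_status()  # stop here if the download failed
    bytes_written = 0
    with open(filename, 'wb') as out_file:  # closed automatically when the block ends
        for chunk in res.iter_content(chunk_size):
            bytes_written += out_file.write(chunk)
    return bytes_written


def download_file(url, filename):
    """Download url with requests.get() and save it to filename."""
    return save_response(requests.get(url), filename)
```

For example, `download_file('https://automatetheboringstuff.com/files/rj.txt', 'RomeoAndJuliet.txt')` would repeat the download and save shown in the cells above.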

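That’s all there is to the `requests` module! The `for` loop and `iter_content()` call may seem complicated compared to the `open()`/`write()`/`close()` workflow you’ve been using to write text files, but writing the data in chunks keeps each write small even when you download massive files. Note that, by default, `requests.get()` reads the entire response body into memory before it returns; to limit memory use during the download itself, pass `stream=True` to `requests.get()`.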

## HTML

## Parsing HTML with the bs4 Module