<a href="https://colab.research.google.com/github/carloslme/automating-boring-stuff/blob/main/Chapter_11_Web_Scraping.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Web scraping is the term for using a program to download and process content from the Web.

In this chapter, you will learn about several modules that make it easy to scrape web pages in Python:

* `webbrowser`: Comes with Python and opens a browser to a specific page.
* Requests: Downloads files and web pages from the Internet.
* Beautiful Soup: Parses HTML, the format that web pages are written in.
* Selenium: Launches and controls a web browser. Selenium can fill in forms and simulate mouse clicks in this browser.

In [None]:
!pip install requests



In [None]:
import requests

In [None]:
res = requests.get('https://automatetheboringstuff.com/files/rj.txt')

In [None]:
type(res)

requests.models.Response

In [None]:
res.status_code == requests.codes.ok

True

In [None]:
len(res.text)

178978

In [None]:
print(res.text[:250])

The Project Gutenberg EBook of Romeo and Juliet, by William Shakespeare

This eBook is for the use of anyone anywhere at no cost and with
almost no restrictions whatsoever.  You may copy it, give it away or
re-use it under the terms of the Projec


# Checking for Errors
The Response object has a `status_code` attribute that can be checked against `requests.codes.ok` to see whether the download succeeded. A simpler way to check for success is to call the `raise_for_status()` method on the Response object. This raises an exception if there was an error downloading the file and does nothing if the download succeeded.

In [None]:
res = requests.get('http://inventwithpython.com/page_that_does_not_exist') 
res.raise_for_status()

HTTPError: ignored

The raise_for_status() method is a good way to ensure that a program halts if a bad download occurs. This is a good thing: You want your program to stop as soon as some unexpected error happens. If a failed download isn’t a deal breaker for your program, you can wrap the raise_for_status() line with try and except statements to handle this error case without crashing.

In [None]:
import requests
res = requests.get('http://inventwithpython.com/page_that_does_not_exist') 
try: 
  res.raise_for_status() 
except Exception as exc: 
  print('There was a problem: %s' % (exc))

There was a problem: 404 Client Error: Not Found for url: http://inventwithpython.com/page_that_does_not_exist
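
If only HTTP failures should be tolerated, the handler can be narrowed to `requests.exceptions.HTTPError` so that unrelated bugs still surface. The following is a minimal sketch of that idea, not part of the original example: the helper name `describe_status` is hypothetical, and the Response object is built by hand with a 404 status purely so the snippet runs without a network connection.

```python
import requests


def describe_status(res):
    """Return 'ok' if the response succeeded, else a short error message."""
    try:
        res.raise_for_status()
    except requests.exceptions.HTTPError as exc:
        # Only HTTP 4xx/5xx failures are handled here; other exceptions propagate.
        return 'There was a problem: %s' % exc
    return 'ok'


# Hand-built stand-in for a failed download (assumption: for illustration only;
# real Response objects come from requests.get()).
fake_res = requests.models.Response()
fake_res.status_code = 404
fake_res.reason = 'Not Found'
fake_res.url = 'http://inventwithpython.com/page_that_does_not_exist'

message = describe_status(fake_res)
print(message)
```

In a real program, `describe_status(requests.get(url))` would be the typical call; connection errors and timeouts are deliberately left unhandled in this sketch.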


# Saving Downloaded Files to the Hard Drive
You can save the web page to a file on your hard drive with the standard `open()` function and `write()` method. There are some slight differences, though. First, you must open the file in write binary mode by passing the string `'wb'` as the second argument to `open()`. Even if the page is in plaintext (such as the Romeo and Juliet text you downloaded earlier), you need to write binary data instead of text data in order to maintain the Unicode encoding of the text.

To write the web page to a file, you can use a for loop with the Response object’s `iter_content()` method.

In [None]:
import requests 

res = requests.get('https://automatetheboringstuff.com/files/rj.txt') # get to download the file
res.raise_for_status() # ensure that the program stops if a bad download occurs
with open('RomeoAndJuliet.txt', 'wb') as playFile: # 'wb' opens the file in write binary mode
  for chunk in res.iter_content(100000): # write the download in chunks of up to 100,000 bytes
    playFile.write(chunk)
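
The same chunked-write pattern can be wrapped in a small function. This is a sketch under assumptions, not the chapter's code: the name `save_chunks` is hypothetical, and a plain list of `bytes` objects stands in for `res.iter_content(100000)` so that the example runs offline. It also illustrates why binary mode preserves the encoding: the bytes on disk are exactly the bytes received, even when a chunk boundary falls inside a multi-byte character.

```python
import os
import tempfile


def save_chunks(chunks, path):
    """Write an iterable of bytes objects to path and return the byte count."""
    total = 0
    with open(path, 'wb') as out_file:  # 'wb' writes the bytes unchanged
        for chunk in chunks:
            total += out_file.write(chunk)
    return total


# Stand-in for res.iter_content(100000): UTF-8 bytes split into two chunks.
text = 'Romeo and Juliet, caf\u00e9'
data = text.encode('utf-8')
chunks = [data[:-1], data[-1:]]  # the split falls inside the two-byte character

path = os.path.join(tempfile.mkdtemp(), 'sample.txt')
written = save_chunks(chunks, path)

# Reading the file back as UTF-8 text recovers the original string.
with open(path, encoding='utf-8') as in_file:
    round_trip = in_file.read()

print(written, round_trip == text)
```

With a real download, the call would look like `save_chunks(res.iter_content(100000), 'RomeoAndJuliet.txt')`.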

In [None]:
with open('RomeoAndJuliet.txt', 'r') as text:
  print(text.read())

# HTML
Hypertext Markup Language (HTML) is the format that web pages are written in.

An HTML file is a plaintext file with the .html file extension. The text in these files is surrounded by tags, which are words enclosed in angle brackets. The tags tell the browser how to format the web page. A starting tag and closing tag can enclose some text to form an element. The text (or inner HTML) is the content between the starting and closing tags.

There are many different tags in HTML. Some of these tags have extra properties in the form of attributes within the angle brackets.

Some elements have an id attribute that is used to uniquely identify the element in the page. You will often instruct your programs to seek out an element by its id attribute, so figuring out an element’s id attribute using the browser’s developer tools is a common task in writing web scraping programs.
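
To make this vocabulary concrete, the sketch below walks a one-line HTML fragment with Python's standard `html.parser` module and records each starting tag, its attributes, the enclosed text, and each closing tag. It is only an illustration of the terms above, not the parsing approach this chapter uses; the class name `TagLister` and the sample fragment are assumptions chosen for the example.

```python
from html.parser import HTMLParser


class TagLister(HTMLParser):
    """Record the tags, attributes, and text seen while reading HTML."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs
        self.events.append(('start', tag, dict(attrs)))

    def handle_data(self, data):
        self.events.append(('text', data))

    def handle_endtag(self, tag):
        self.events.append(('end', tag))


lister = TagLister()
lister.feed('<p>By <span id="author">Al Sweigart</span></p>')
lister.close()

for event in lister.events:
    print(event)
```

The `span` element here carries an `id` attribute of `author`, and its inner HTML is the text `Al Sweigart`.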

# Parsing HTML with the BeautifulSoup Module
Beautiful Soup is a module for extracting information from an HTML page (and is much better for this purpose than regular expressions). The BeautifulSoup module’s name is bs4 (for Beautiful Soup, version 4).

In [1]:
!pip install beautifulsoup4



In [3]:
import bs4
import requests

In [4]:
res = requests.get('http://nostarch.com/automatestuff/')
res.raise_for_status()

In [9]:
with open('htmlFile.html', 'wb') as htmlFile:
  for chunk in res.iter_content(100000):
    htmlFile.write(chunk)


In [11]:
open('htmlFile.html','r').read()

'<!DOCTYPE html>\n<html lang="en" dir="ltr" xmlns:og="http://ogp.me/ns#">\n<head>\n<script src="/cdn-cgi/apps/head/j5v88GAcO1Pymf91CQYvgLZqNao.js"></script><link rel="profile" href="https://www.w3.org/1999/xhtml/vocab" />\n<meta name="viewport" content="width=device-width, initial-scale=1.0">\n<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />\n<link rel="shortcut icon" href="https://nostarch.com/sites/default/files/favicon.ico" type="image/vnd.microsoft.icon" />\n<meta name="description" content="The second edition of the best-selling Python book, Automate the Boring Stuff with Python, 2nd Edition (100,000+ copies sold in print alone) uses Python 3 to teach even the technically uninclined how to write programs that do in minutes what would take hours to do by hand." />\n<meta name="generator" content="Drupal 7 (http://drupal.org)" />\n<link rel="image_src" href="https://nostarch.com/sites/default/files/Automate_coversmall_0.png" />\n<link rel="canonical" href="https

## Creating a BeautifulSoup Object from HTML
The `bs4.BeautifulSoup()` function needs to be called with a string containing the HTML it will parse. It returns a BeautifulSoup object.

In [1]:
import requests, bs4

res = requests.get('http://nostarch.com')
res.raise_for_status()
noStarchSoup = bs4.BeautifulSoup(res.text, 'html.parser') # name the parser explicitly to avoid a warning
type(noStarchSoup)

bs4.BeautifulSoup

You can also load an HTML file from your hard drive by passing a File object to `bs4.BeautifulSoup()`.

In [3]:
with open('htmlFile.html') as exampleFile:
  exampleSoup = bs4.BeautifulSoup(exampleFile, 'html.parser')
print(type(exampleSoup))

<class 'bs4.BeautifulSoup'>


## Finding an Element with the `select()` Method
You can retrieve a web page element from a BeautifulSoup object by calling the `select()` method and passing a string of a CSS selector for the element you are looking for. Selectors are like regular expressions: They specify a pattern to look for, in this case, in HTML pages instead of general text strings.

Here's a short introduction to selectors.

| Selector passed to the `select()` method | Will match... |
|------------------------------------------|---------------|
| `soup.select('div')` | All elements named `<div>` |
| `soup.select('#author')` | The element with an `id` attribute of `author` |
| `soup.select('.notice')` | All elements that use a CSS `class` attribute named `notice` |
| `soup.select('div span')` | All elements named `<span>` that are within an element named `<div>` |
| `soup.select('div > span')` | All elements named `<span>` that are directly within an element named `<div>`, with no other element in between |
| `soup.select('input[name]')` | All elements named `<input>` that have a `name` attribute with any value |
| `soup.select('input[type="button"]')` | All elements named `<input>` that have an attribute named `type` with value `button` |

The various selector patterns can be combined to make sophisticated matches. For example, `soup.select('p #author')` will match any element that has an `id` attribute of `author`, as long as it is also inside a `<p>` element.
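
The patterns in the table can be tried on a small document. The sketch below assumes `beautifulsoup4` is installed and uses a made-up HTML string (not a page from this chapter) parsed with Python's built-in `'html.parser'`; it prints how many elements each selector matches.

```python
import bs4

# Hypothetical sample markup chosen so that each selector has something to match.
html = (
    '<div><span>direct</span><p><span id="author">nested</span></p></div>'
    '<p class="notice">note</p>'
    '<input name="q"><input type="button" value="Go">'
)
soup = bs4.BeautifulSoup(html, 'html.parser')

selectors = ['div', '#author', '.notice', 'div span', 'div > span',
             'input[name]', 'input[type="button"]', 'p #author']

# select() returns a list, so len() gives the number of matching elements.
counts = {sel: len(soup.select(sel)) for sel in selectors}
for sel, count in counts.items():
    print(sel, count)
```

Comparing the counts for `'div span'` and `'div > span'` shows the difference between "anywhere inside" and "directly inside": the nested `<span>` sits inside a `<p>`, so only the first pattern reaches it.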

The `select()` method will return a list of Tag objects, which is how Beautiful Soup represents an HTML element. The list will contain one Tag object for every match in the BeautifulSoup object’s HTML. Tag values can be passed to the `str()` function to show the HTML tags they represent. Tag values also have an attrs attribute that shows all the HTML attributes of the tag as a dictionary.

In [28]:
# Create html manually
with open('example.html', 'w') as exampleFile:
  exampleFile.write('<!-- This is the example.html example file. --> <html><head><title>The Website Title</title></head> <body> <p>Download my <strong>Python</strong> book from <a href="http://inventwithpython.com">my website</a>. </p> <p class="slogan">Learn Python the easy way!</p> <p>By <span id="author">Al Sweigart</span></p> </body></html>')

In [29]:
import bs4
with open('example.html') as exampleFile:
  exampleContent = exampleFile.read() # read once; a second read() would return an empty string
print(type(exampleContent))

<class 'str'>


In [31]:
exampleSoup = bs4.BeautifulSoup('<!-- This is the example.html example file. --> <html><head><title>The Website Title</title></head> <body> <p>Download my <strong>Python</strong> book from <a href="http://inventwithpython.com">my website</a>. </p> <p class="slogan">Learn Python the easy way!</p> <p>By <span id="author">Al Sweigart</span></p> </body></html>', 'html.parser')
elems = exampleSoup.select('#author')
print(type(elems))
print(len(elems))

<class 'list'>
1


In [32]:
type(elems[0])

bs4.element.Tag

In [34]:
elems[0].getText()

'Al Sweigart'

In [35]:
str(elems[0])

'<span id="author">Al Sweigart</span>'

In [36]:
elems[0].attrs

{'id': 'author'}