# Saving a web page to scrape later

For many scraping jobs, it makes sense to first save a copy of the web page (or pages) that you want to scrape and then operate on the local files you've saved. This is a good practice for a couple of reasons: You won't be bombarding your target server with requests every time you fiddle with your script and rerun it, and you'll have a copy saved in case the page (or pages) disappears.

Here's one way to accomplish that. (If you haven't run through [the notebook on using `requests` to fetch web pages](02.%20Fetching%20HTML%20with%20requests.ipynb), do that first.)

We'll need the `requests` and `bs4` libraries, so let's start by importing them:

In [12]:
import requests
import bs4

## Fetch the page and write to file

Let's grab the Texas death row page: `'https://www.tdcj.texas.gov/death_row/dr_offenders_on_dr.html'`

In [5]:
dr_page = requests.get('https://www.tdcj.texas.gov/death_row/dr_offenders_on_dr.html')

In [6]:
# take a peek at the HTML
dr_page.text

'<!doctype html>\r\n<html lang="en-US"><!-- InstanceBegin template="/Templates/generic_inside.dwt" codeOutsideHTMLIsLocked="false" -->\r\n<head>\r\n<meta charset="utf-8">\r\n<meta name="viewport" content="width=device-width, initial-scale=1">\r\n<!-- stylesheet: global -->\r\n<link rel="stylesheet" href="/stylesheets/global.css">\r\n<!-- stylesheet: page-specific -->\r\n<link rel="stylesheet" href="/stylesheets/content.css">\r\n<link rel="stylesheet" href="/stylesheets/menu_style.css">\r\n<!-- InstanceBeginEditable name="stylesheets" -->\r\n\r\n<!-- InstanceEndEditable -->\r\n<!-- jQuery library (if CDN fails, use local copy) -->\r\n<script type="text/javascript" src="//ajax.googleapis.com/ajax/libs/jquery/3.2.1/jquery.min.js"></script>\r\n<script type="text/javascript"> window.jQuery || document.write(\'<script src="/javascripts/jquery.min.js"><\\/script>\') </script>\r\n<!-- javascripts -->\r\n<script type="text/javascript" src="/javascripts/google_analytics.js"></script>\r\n<script 
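Before saving anything, it's worth confirming that the request actually succeeded. Otherwise you might end up with an error page sitting on disk. Here's a minimal sketch, assuming you only want to keep responses that came back with a 200 status. `html_if_ok` is a hypothetical helper name (it's not part of `requests`), and it relies only on the response's `status_code` and `text` attributes:

```python
def html_if_ok(response):
    """Return the response's HTML, or raise an error if the request failed.

    `response` is assumed to be the object that requests.get() hands back.
    """
    if response.status_code != 200:
        raise RuntimeError(f'Request failed with status {response.status_code}')
    return response.text
```

You'd call it as `html_if_ok(dr_page)`. The response object also has a built-in `raise_for_status()` method, which raises an exception for 4xx and 5xx responses.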

Now, instead of continuing on with our scraping journey, we'll use some built-in Python tools to write this to file:

In [7]:
# define a name for the file we're saving to
HTML_FILE_NAME = 'death-row-page.html'

In [8]:
# open that file in write mode and write the page's HTML into it
with open(HTML_FILE_NAME, 'w') as outfile:
    outfile.write(dr_page.text)

The `with` block is just a handy way to deal with opening and closing files -- note that everything under the `with` line is indented.
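To see what the `with` block is saving us from, here's a rough sketch of the same job done the long way and then the `with` way. It writes to a throwaway temporary file (an assumption for the sake of the example) so it won't touch the page we just saved:

```python
import os
import tempfile

# a throwaway file so this example doesn't touch your saved page
path = os.path.join(tempfile.mkdtemp(), 'example.html')

# the long way: open, write, and make sure the file gets closed
outfile = open(path, 'w')
try:
    outfile.write('<p>hello</p>')
finally:
    outfile.close()

# the `with` way: same result, and the file is closed for us
with open(path, 'w') as outfile:
    outfile.write('<p>hello</p>')

print(outfile.closed)  # True
```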

The `open()` function is used to open files for reading or writing. The first _argument_ that you hand this function is the name of the file you're going to be working on -- we defined it above and attached it to the `HTML_FILE_NAME` variable, which is totally arbitrary. (We could have called it `HTML_BANANAGRAM` if we wanted to.)

The `'w'` means that we're opening the file in "write" mode. We're also tagging the opened file with a variable name using the `as` keyword -- `outfile` is an arbitrary variable name that I came up with.
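One thing to know about write mode: if the file already exists, opening it with `'w'` wipes out whatever was in it. Here's a small sketch showing that, along with `'a'` (append) mode for comparison. The file name and contents are made up for the example:

```python
import os
import tempfile

# a throwaway file so this example doesn't touch your saved page
path = os.path.join(tempfile.mkdtemp(), 'example.txt')

with open(path, 'w') as outfile:
    outfile.write('first version')

# opening in 'w' mode again replaces what was there
with open(path, 'w') as outfile:
    outfile.write('second version')

# 'a' (append) mode adds to the end instead
with open(path, 'a') as outfile:
    outfile.write(' plus more')

with open(path, 'r') as infile:
    contents = infile.read()

print(contents)  # second version plus more
```

So rerunning the save cell will quietly replace your saved copy with whatever the server returns that day.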

But then we'll use that variable name to do things to the file we've opened. In this case, we want to use the file object's `write()` method to write some content to the file.

What content? The HTML of the page we grabbed, which is accessible through the `.text` attribute.
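The `.text` attribute is a regular Python string, and when Python writes a string to a file it has to encode it. If you don't say how, it uses your system's default, which can vary from machine to machine. Here's a hedged sketch that spells out the encoding, assuming the page really is UTF-8 (as its `<meta charset="utf-8">` tag suggests). The sample HTML is made up:

```python
import os
import tempfile

# a throwaway file so this example doesn't touch your saved page
path = os.path.join(tempfile.mkdtemp(), 'example.html')

# a made-up snippet with a non-ASCII character in it
sample_html = '<p>José</p>'

# say explicitly how to encode the text on the way out ...
with open(path, 'w', encoding='utf-8') as outfile:
    outfile.write(sample_html)

# ... and how to decode it on the way back in
with open(path, 'r', encoding='utf-8') as infile:
    round_trip = infile.read()

print(round_trip == sample_html)  # True
```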

In human words, this block of code is saying: "Open a file called `death-row-page.html` and write the HTML of that death row page you grabbed earlier into it."
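Since the whole point is to avoid hitting the server over and over, you could go a step further and only download the page if you don't already have a copy. Here's one possible sketch. `fetch_once` is a hypothetical helper, and its `fetch` argument is assumed to be any function that takes a URL and returns the page's HTML as a string:

```python
from pathlib import Path


def fetch_once(url, file_name, fetch):
    """Return the HTML for `url`, downloading it only if there's no saved copy."""
    path = Path(file_name)

    # already saved? read the local copy and skip the request
    if path.exists():
        return path.read_text(encoding='utf-8')

    # otherwise, fetch the page and save it for next time
    html = fetch(url)
    path.write_text(html, encoding='utf-8')
    return html
```

With `requests`, the call might look like `fetch_once(url, HTML_FILE_NAME, lambda u: requests.get(u).text)`. If you want a fresh copy, delete the saved file and run it again.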

## Reading the HTML from a saved web page

At some point after you've saved your page to file, you'll want to scrape it. To read the HTML into a variable, we'll use a `with` block again, but this time we'll specify "read mode" (`'r'`) and use the `read()` method instead of the `write()` method:

In [9]:
with open(HTML_FILE_NAME, 'r') as infile:
    html = infile.read()

In [10]:
html

'<!doctype html>\n<html lang="en-US"><!-- InstanceBegin template="/Templates/generic_inside.dwt" codeOutsideHTMLIsLocked="false" -->\n<head>\n<meta charset="utf-8">\n<meta name="viewport" content="width=device-width, initial-scale=1">\n<!-- stylesheet: global -->\n<link rel="stylesheet" href="/stylesheets/global.css">\n<!-- stylesheet: page-specific -->\n<link rel="stylesheet" href="/stylesheets/content.css">\n<link rel="stylesheet" href="/stylesheets/menu_style.css">\n<!-- InstanceBeginEditable name="stylesheets" -->\n\n<!-- InstanceEndEditable -->\n<!-- jQuery library (if CDN fails, use local copy) -->\n<script type="text/javascript" src="//ajax.googleapis.com/ajax/libs/jquery/3.2.1/jquery.min.js"></script>\n<script type="text/javascript"> window.jQuery || document.write(\'<script src="/javascripts/jquery.min.js"><\\/script>\') </script>\n<!-- javascripts -->\n<script type="text/javascript" src="/javascripts/google_analytics.js"></script>\n<script type="text/javascript" src="/javascr
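You might notice that the `\r\n` line endings in the original response show up as plain `\n` here. That's Python's text mode at work: by default it translates line endings when it reads a file. A small sketch with a throwaway file (the contents are made up) shows the difference:

```python
import os
import tempfile

# a throwaway file so this example doesn't touch your saved page
path = os.path.join(tempfile.mkdtemp(), 'example.html')

# write raw bytes so we know exactly what's on disk
with open(path, 'wb') as outfile:
    outfile.write(b'<html>\r\n</html>')

# default text mode: '\r\n' comes back as '\n'
with open(path, 'r') as infile:
    translated = infile.read()

# newline='' turns the translation off
with open(path, 'r', newline='') as infile:
    untranslated = infile.read()

print(repr(translated))    # '<html>\n</html>'
print(repr(untranslated))  # '<html>\r\n</html>'
```

For scraping, the translated version is usually fine, since HTML parsers don't care which line endings you hand them.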

Now it's just a matter of turning that HTML into soup -- [see this notebook for more details](03.%20Parsing%20HTML%20with%20BeautifulSoup.ipynb) -- and parsing the results.

In [13]:
soup = bs4.BeautifulSoup(html, 'html.parser')