Created by R. David Beales for the [Kelvin Smith Library](https://case.edu/library/) at [Case Western Reserve University](https://case.edu) under [Creative Commons CC BY License](https://creativecommons.org/licenses/by/4.0/)<br />
For questions/comments/improvements, email rdb104@case.edu.<br />
___

# Web Scraping: Making a Request and Receiving a Response

**Description:** This lesson introduces the basic web scraping workflow using the `requests` library for Python.  

**Use Case:** For Learners (Additional explanation, not ideal for researchers)

**Difficulty:** Beginner

**Completion time:** 15 minutes

**Knowledge Required:** Basic Python

**Knowledge Recommended:** HTML Structure

**Data Format:** `html`, `txt`, `py` 

**Libraries Used:** `requests` 
___

## Introduction
Welcome to your first web scraping project.  This project will use the `requests` package to introduce the basic web scraping workflow.

We will be using [Books to Scrape](https://books.toscrape.com/) as our test website for this tutorial.  The site exists just to provide a platform for people to practice web scraping.  You may want to keep the website open in another browser tab so you can compare the data we are getting with the website as a regular user will see it.

In this project you will:

1. Send a request to a web server.
2. Check for a response.
3. View the content of that response.
4. Save that content to a file.


## Ultra Quick Jupyter Notebook Tips

These Jupyter Notebooks have markdown cells, which you just read, and code cells, which you can run and edit.  

Code cells appear in a light grey box.  You can click on the text in a code cell to edit it.  You can run a code cell by clicking the Run button at the top of the page (pictured) or by pressing Shift + Enter after clicking on the cell.

![The Run Button](img/runcellbutton.png) 

Finally, all the code cells in a notebook must be run in order.  Make sure you start at the top of each lesson and run each cell in order.

### HTTP Requests: Sending a request and checking for a response.

HTTP is a protocol for fetching resources like HTML documents, images, video, ads (yuck), and other content that make up a web page.  HTTP is a client-server protocol, meaning that the client, usually a web browser, initiates a request and the server sends back a response that contains the requested data.  This is where the Python package `requests` gets its name: from the request half of the request/response exchange.

Web scraping doesn't use a web browser to initiate an HTTP request. We are going to use a Python script instead.  

Before we can begin using a Python package, we have to import it.  Run the cell below to import the `requests` package.

In [None]:
import requests  #https://requests.readthedocs.io/

Now that the `requests` package has been imported we can use the various excellent methods that are built into the package. The most common method for web scraping is the `get` method.

`requests.get` will send a `GET` request to a web address that you specify.  This simple example will get everything the web server returns for that URL.  There are tools for selecting exactly what you want to scrape, which we will explore in a later lesson.

Try running the code below.
What response do you get?

In [None]:
requests.get('https://books.toscrape.com/')

Python sent a request to the URL we specified, https://books.toscrape.com in this case.  The response we got back is the first piece of information you need when web scraping.  Could we connect?!

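You should see `<Response [200]>`.  The 200 is one of what are called **HTTP response status codes** and it means our request was a success.  Excellent!

A code in the 400s or 500s would mean the request failed.  Codes in the 400s point to a problem with the request itself (for example, 404 means the page was not found), while codes in the 500s point to a problem on the server.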

If you want to learn more about the status codes, you can check out the developer documentation from Mozilla here: https://developer.mozilla.org/en-US/docs/Web/HTTP/Status.
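
To make the status code ranges a little more concrete, here is a minimal sketch that sorts a code into its general category using only Python's standard library.  The function name `describe_status` is made up for this lesson; it is not part of `requests`.

```python
from http import HTTPStatus  # standard-library table of HTTP status codes


def describe_status(code):
    """Return a short description of an HTTP status code.

    Hypothetical helper written for illustration only.
    """
    if 200 <= code < 300:
        category = "success"
    elif 300 <= code < 400:
        category = "redirection"
    elif 400 <= code < 500:
        category = "client error"
    elif 500 <= code < 600:
        category = "server error"
    else:
        category = "informational or unknown"

    # Look up the standard phrase; fall back if the code is not in the table.
    try:
        phrase = HTTPStatus(code).phrase
    except ValueError:
        phrase = "Unrecognized"

    return f"{code} {phrase} ({category})"


for code in (200, 404, 500):
    print(describe_status(code))
```

Running this should print one line per code, such as `200 OK (success)`.  You could pass in any status code you receive from a web server.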

### Viewing the content of a response.

The status code is only one piece of information in the response object.  Now that we've made a successful connection to a website, we can start to look at the other information in the response object returned by our `get` request.  

The first thing we'll need to do is store the response object in a variable using the `=` assignment operator so we can look at the content without repeatedly sending requests to the web server.

The code below is the same as our first `get` request, but the response is stored in the variable `results`.  You could name this variable anything you wanted.  In this example we chose `results` as a shorthand for the response object.  If you change the variable name here, you'll have to change it in all the following code examples as well.  So, just leave it as `results` for now.

When you're ready, run the code cell below.

In [None]:
results = requests.get('https://books.toscrape.com/')

You'll notice that this time you didn't get an HTTP status code as a response.  When you store a response object in a variable, Python won't display any information from the response object unless you ask it to show a piece of the response stored there.

If we want to check the HTTP status code again, we can use the variable `results` and the `status_code` attribute of the response object.

In [None]:
results.status_code

The response object has many other attributes as well:
* .headers - access header information sent by the server along with the response
* .text - access the response body as text for text-based responses, such as html, json, yaml, etc.
* .content - access the response body as bytes for non-text responses, such as images, spreadsheets, zipfiles, etc.
* .encoding - shows what text encoding `requests` is using in the `.text` attribute
* .url - shows the url that was used in the request, can be useful when encoding urls with various parameters
* .json() - built-in JSON decoder for responses that contain JSON; unlike the others, this is a method, so it needs parentheses

Try viewing these different attributes by editing the code cell below and running it again to see what you get.  All the attributes follow the same format: `results`, where we stored our response object, a period, then the attribute (e.g., `text`, `headers`, `url`).

For example, `results.text`


In [None]:
results.text
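
If you would like to see several attributes at once, the sketch below gathers a few of them into a dictionary.  The function `summarize_response` and the stand-in object `fake` are hypothetical and use made-up values so the example runs without a network connection.

```python
from types import SimpleNamespace  # simple way to build a stand-in object


def summarize_response(resp):
    """Collect a few attributes of a response object into a dictionary.

    Hypothetical helper: works with any object that has status_code, url,
    encoding, headers, and text attributes.
    """
    return {
        "status_code": resp.status_code,
        "url": resp.url,
        "encoding": resp.encoding,
        "content_type": resp.headers.get("Content-Type"),
        "text_length": len(resp.text),
    }


# Stand-in for a real response object; every value here is made up.
fake = SimpleNamespace(
    status_code=200,
    url="https://books.toscrape.com/",
    encoding="utf-8",
    headers={"Content-Type": "text/html"},
    text="<html><title>Example</title></html>",
)

summary = summarize_response(fake)
for name, value in summary.items():
    print(f"{name}: {value}")
```

In this notebook you could try `summarize_response(results)` instead of the stand-in, since `results` has all of the attributes the function reads.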

### Saving the Response Object 

Now that we have a response object and have taken a quick look at its attributes, we need to save it to a file so that we can come back to it in the future and examine it.  We could scrape the web again each time we want to do some analysis, but that would be a waste of computing resources, and the content on the web is not static.  Social media posts are deleted.  Items are removed from shops.  The format of websites changes, making your web scraper obsolete.  So it is best to scrape once and save the data locally.  Saving it also allows us to easily share the data we are using with others.

We are going to use a `with` statement to simplify the process of saving the response object.  The `with` statement in Python manages a resource for you: it opens the file, runs the indented code, and then closes the file automatically, even if an error occurs along the way.  We will open the file, write the response content to it, and close the file, and all these operations will happen within the `with` statement.  You can learn more about `with` statements and writing to files here: https://www.freecodecamp.org/news/with-open-in-python-with-statement-syntax-example/

You can see the structure of the `with` statement in the code cell below.  

`with` begins the with statement.  
`open` tells the computer to open a file using the filename you specify, in this case, `scrape.txt`, and the `w` indicates we want to write to the file.  
`as` creates a variable that contains the file information, in this case, `file`.  
`file.write` will save the string you put in the parentheses, such as `results.text` or `results.url`.  Attributes that are not strings, such as `results.headers`, must be converted first, for example with `str(results.headers)`.

The filename, `scrape.txt`, and the attribute can be changed to create different files with different content.

When you run the code below you will see the file appear in the list of directories/files on the left.  

In [None]:
with open('scrape.txt','w') as file:
    file.write(results.text)
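
One optional refinement is to name the text encoding when you open the file, so that characters such as the `£` sign are saved the same way on every computer.  The sketch below is a minimal example under that assumption: `sample_html` is a made-up stand-in for `results.text`, and the file is written to a temporary folder so nothing is left behind.

```python
import os
import tempfile

# Made-up stand-in for results.text; the real value would be the full page.
sample_html = "<html><body><h1>All products</h1><p>Price: £51.77</p></body></html>"

# Work inside a temporary folder that is deleted when the block ends.
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "scrape.txt")

    # Write the text, stating the encoding explicitly.
    with open(path, "w", encoding="utf-8") as file:
        file.write(sample_html)

    # Read the file back to confirm nothing was lost.
    with open(path, "r", encoding="utf-8") as file:
        saved_html = file.read()

print(saved_html == sample_html)
```

If the saved text matches the original, the last line prints `True`.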

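Try changing the filename below to `scrapeheaders.txt` and the attribute below to `str(results.headers)`, then run the code cell again to create another file.  (The `str()` is needed because `file.write` only accepts strings, and `results.headers` is a dictionary-like object.)  Did you get a different file with different content?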

In [None]:
with open('scrape.txt','w') as file:
    file.write(results.text)
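
Because `file.write` only accepts strings, a dictionary-like value such as the headers has to be converted before it can be saved.  One possible approach, sketched below, is to save it as JSON so it can be loaded back later.  The `sample_headers` dictionary is a made-up stand-in for `results.headers`, and the temporary folder is used only to keep the example tidy.

```python
import json
import os
import tempfile

# Made-up stand-in for results.headers.
sample_headers = {"Content-Type": "text/html", "Content-Length": "51294"}

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "scrapeheaders.txt")

    with open(path, "w", encoding="utf-8") as file:
        # dict() makes a plain dictionary; json.dumps turns it into a string.
        file.write(json.dumps(dict(sample_headers), indent=2))

    # Load the file back into a dictionary to check the round trip.
    with open(path, "r", encoding="utf-8") as file:
        loaded_headers = json.load(file)

print(loaded_headers == sample_headers)
```

With a real response, `dict(results.headers)` would take the place of `dict(sample_headers)`.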