# Getting Page Content 

Now that we have determined that we need to scrape the web, how the pages are generated, and what tools we have available, the next step is to get the information from the web pages to our machine. First, we need to import the modules we need. The `requests` module should be all that you need, but the `urllib` modules are included for comparison.

In [52]:
import urllib.request
import urllib.error
import urllib.parse
import urllib.robotparser
import requests

## Can I Scrape This Site?

The first thing you should do is check whether you're allowed to scrape a website. We'll start by checking whether we can scrape some pages from http://quotes.toscrape.com/. Requests does not have a module for checking `robots.txt`, so we'll have to use `urllib.robotparser`. There was an effort to make a plugin called [`requests-robotstxt`](https://github.com/ambv/requests-robotstxt), but it has been inactive since 2017 and was just a proof of concept.

In [53]:
rfp = urllib.robotparser.RobotFileParser()
rfp.set_url('http://quotes.toscrape.com/robots.txt')
rfp.read()
rfp.can_fetch("*", "http://quotes.toscrape.com/")

True

It looks like we have permission! We could check this manually as well, but 1) we want our bots to be well behaved, just like their creators, and 2) it appears that the site doesn't even have a `robots.txt` file! In the code below, the request for it comes back with a `404` status. Looking at the [list of HTTP status codes](https://en.wikipedia.org/wiki/List_of_HTTP_status_codes), we see that this means "Not Found." `RobotFileParser` treats a `404` for `robots.txt` as "everything is allowed," which is why `can_fetch()` returned `True`.

If this is the case, you might want to do a bit more searching to see whether the site restricts scraping some other way. Otherwise, [this Server Fault thread](https://serverfault.com/questions/154820/what-happens-if-a-website-does-not-have-a-robots-txt-file) seems to indicate that no `robots.txt` file means we're good to go.

In [54]:
response = requests.get('http://quotes.toscrape.com/robots.txt')
response

<Response [404]>

Let's take a look at another site that does have a `robots.txt` file. Understandably, Wikipedia gets a lot of traffic. Let's see which bots are allowed to scrape the main page.

In [55]:
rfp = urllib.robotparser.RobotFileParser('https://en.wikipedia.org/robots.txt')
rfp.read()

bots = [
    'MJ12bot', 'Mediapartners-Google*', 'IsraBot', 'Orthogaffe', 'UbiCrawler', 'DOC', 
    'Zao', 'sitecheck.internetseer.com', 'Zealbot', 'MSIECrawler', 'SiteSnagger',
    'WebStripper', 'WebCopier', 'Fetch', 'Offline Explorer', 'Teleport', 'TeleportPro', 
    'WebZIP', 'linko', 'HTTrack', 'Microsoft.URL.Control', 'Xenu', 'larbin', 'libwww', 
    'ZyBORG', 'Download Ninja', 'fast', 'wget', 'grub-client', 'k2spider', 'NPBot', 
    'WebReaper', '*'
]
allowed = [
    bot for bot in bots 
    if rfp.can_fetch(bot, 'https://en.wikipedia.org/')
]
print(allowed)
print(rfp.can_fetch('*', 'https://en.wikipedia.org/w/'))

['IsraBot', 'Orthogaffe', '*']
False


As we can see, of the bots we checked, only `IsraBot`, `Orthogaffe`, and `*` are allowed to scrape. Fortunately for us, our bot is most likely going to follow the rules for `*`. The other two are bots built by Wikipedia. Also worth noting: we're allowed to scrape `/`, but not `/w/`. If we try to look at that page, we'll see that `/w/` just redirects us to `/` anyway, so we're not missing much.
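
To see how rules like these are evaluated without touching the network, here is a minimal sketch that feeds `RobotFileParser.parse()` a made-up `robots.txt`. The rules, the `BadBot` name, the `is_allowed()` helper, and the `example.com` URLs are all assumptions for illustration; this is not Wikipedia's actual file.

```python
import urllib.robotparser

# Hypothetical robots.txt: one bot is banned outright, everyone else
# may fetch anything except paths under /w/.
SAMPLE_ROBOTS_TXT = """\
User-agent: BadBot
Disallow: /

User-agent: *
Disallow: /w/
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if `user_agent` may fetch `url` under the given rules."""
    parser = urllib.robotparser.RobotFileParser()
    # parse() takes the file's lines, so nothing is downloaded here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(is_allowed(SAMPLE_ROBOTS_TXT, 'BadBot', 'https://example.com/'))   # False
print(is_allowed(SAMPLE_ROBOTS_TXT, '*', 'https://example.com/'))        # True
print(is_allowed(SAMPLE_ROBOTS_TXT, '*', 'https://example.com/w/'))      # False
# An empty rule set allows everything.
print(is_allowed('', 'my-scraper', 'https://example.com/'))              # True
```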

## Getting a Page

Getting a page with `requests` is pretty straightforward. In fact, you already saw a quick example when we got the `404` response earlier. The simplest way to **get** a web page is to use the `requests.get()` function. In its simplest form, all you have to do is pass the function the URL you want.


In [88]:
response = requests.get('http://quotes.toscrape.com/')

From here, there are several ways to extract the response data. The best one depends on the structure of the data.

* [`response.text`](https://requests.readthedocs.io/en/master/user/quickstart/#response-content): Decoded text
* [`response.content`](https://requests.readthedocs.io/en/master/user/quickstart/#binary-response-content): Binary response content
* [`response.json()`](https://requests.readthedocs.io/en/master/user/quickstart/#json-response-content): JSON response content. Saves you from having to run `json.loads()`
* [`response.raw.read()`](https://requests.readthedocs.io/en/master/user/quickstart/#raw-response-content): Raw response content

Below you can see the difference between `.text` and `.content`. BeautifulSoup will be able to accept either.

In [121]:
print(response.text[:100])

<!DOCTYPE html>
<html lang="en">
<head>
	<meta charset="UTF-8">
	<title>Quotes to Scrape</title>
   


In [123]:
print(response.content[:100])

b'<!DOCTYPE html>\n<html lang="en">\n<head>\n\t<meta charset="UTF-8">\n\t<title>Quotes to Scrape</title>\n   '
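
The relationship between the two can be sketched without `requests` at all: `.content` is the raw bytes, and `.text` is, roughly, those bytes decoded with the response's encoding. The sample payload below is made up, and `requests` also guesses an encoding when the server does not declare one, which this sketch skips.

```python
# Hypothetical payload standing in for response.content.
raw = '<title>Caf\u00e9 Quotes</title>'.encode('utf-8')

# Roughly what response.text gives you when response.encoding is 'utf-8'.
text = raw.decode('utf-8')

# Decoding with the wrong encoding does not fail here; it silently garbles
# the accented character instead.
mojibake = raw.decode('latin-1')

print(type(raw).__name__, len(raw))    # bytes 27
print(type(text).__name__, len(text))  # str 26
print(text == mojibake)                # False
```

The byte string is one element longer than the text because the accented character takes two bytes in UTF-8.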


### Looking at Header Data

**HTTP Status Codes**

A full list of status codes can be found [here](https://www.w3.org/Protocols/rfc2616/rfc2616-sec10.html). The codes are broken up into five categories. They can be summed up as:

* 1xx: Informational. The request is not yet complete.
* 2xx: Success! The data was posted or the page was retrieved.
* 3xx: Redirection. The server is either sending you somewhere else or you need to find a different way to request the page.
* 4xx: Client Error. There is an error on your end.
* 5xx: Server Error. There is an error on the server's end.

Some commonly encountered codes are:

* 200 `OK`: The request has succeeded
* 301 `Moved Permanently`: The page you are looking for is somewhere else and we'll take you to it. 
* 302 `Found`: The content you're looking for is actually elsewhere. This is common after logging in.
* 400 `Bad Request`: Something is wrong with your request.
* 401 `Unauthorized`: You are not authenticated to reach the content.
* 403 `Forbidden`: You do not have permission to reach the content.
* 404 `Not Found`: The content doesn't exist.
* 500 `Internal Server Error`: The server isn't configured correctly or there was another error.
* 502 `Bad Gateway`: The server, acting as a gateway or proxy, received an invalid response from the upstream server.
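
As a small illustration of the categories above, the sketch below maps a status code to its category and standard reason phrase using the standard library's `http.HTTPStatus`. The `describe_status()` helper and the `CATEGORIES` table are hypothetical names made up for this example.

```python
from http import HTTPStatus

# The five categories listed above, keyed by the first digit of the code.
CATEGORIES = {
    1: 'Informational',
    2: 'Success',
    3: 'Redirection',
    4: 'Client Error',
    5: 'Server Error',
}


def describe_status(code):
    """Return a string like '404 Not Found (Client Error)'."""
    category = CATEGORIES.get(code // 100, 'Unknown')
    try:
        phrase = HTTPStatus(code).phrase
    except ValueError:
        # Not a status code the standard library knows about.
        phrase = 'Unknown'
    return f'{code} {phrase} ({category})'


for code in (200, 302, 404, 502):
    print(describe_status(code))
# 200 OK (Success)
# 302 Found (Redirection)
# 404 Not Found (Client Error)
# 502 Bad Gateway (Server Error)
```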


In [149]:
attrs = ['url', 'status_code', 'headers', 'elapsed', 'encoding', 'ok']
for attr in attrs:
    print(f"{attr}: {getattr(response, attr)}\n")

url: http://quotes.toscrape.com/

status_code: 200

headers: {'Server': 'nginx/1.14.0 (Ubuntu)', 'Date': 'Mon, 10 Feb 2020 01:49:55 GMT', 'Content-Type': 'text/html; charset=utf-8', 'Transfer-Encoding': 'chunked', 'Connection': 'keep-alive', 'X-Upstream': 'spidyquotes-master_web', 'Content-Encoding': 'gzip'}

elapsed: 0:00:00.241843

encoding: utf-8

ok: True



## Sending Data

Sometimes you need to send data along with your request. This might mean setting search parameters in a query string or submitting form data.

In [176]:
params = {'group': 'PyRVA', 'speakers': ['Brian Cohan', 'Sam Portillo']}
response = requests.get('http://quotes.toscrape.com/', params=params)
print(response.status_code)
print(response.url)

200
http://quotes.toscrape.com/?group=PyRVA&speakers=Brian+Cohan&speakers=Sam+Portillo
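
The query string in that URL can be reproduced with the standard library, which helps show what `params` turns into. This is only a sketch of the encoding step using `urllib.parse.urlencode`; it is not a claim about how `requests` builds the URL internally.

```python
from urllib.parse import urlencode

params = {'group': 'PyRVA', 'speakers': ['Brian Cohan', 'Sam Portillo']}

# doseq=True expands a list value into one key=value pair per item;
# spaces are encoded as '+'.
query = urlencode(params, doseq=True)
url = f'http://quotes.toscrape.com/?{query}'
print(url)
# http://quotes.toscrape.com/?group=PyRVA&speakers=Brian+Cohan&speakers=Sam+Portillo
```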


In [177]:
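# Before logging in, the page shows a Login link and no Logout link.
response = requests.get('http://quotes.toscrape.com/')
assert 'Login' in response.text and 'Logout' not in response.text

# Submit the login form; `data` is sent as form fields in the request body.
data = {'username': 'PyRVA', 'password': 'Rocks'}
response = requests.post('http://quotes.toscrape.com/login', data=data)
assert 'Login' not in response.text and 'Logout' in response.text
# The login endpoint answered with a redirect, which requests followed.
print(response.history)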

[<Response [302]>]


## Working With Sessions

In [183]:
# Log in with a standalone request.
data = {'username': 'PyRVA', 'password': 'Rocks'}
response = requests.post('http://quotes.toscrape.com/login', data=data)
assert 'Login' not in response.text and 'Logout' in response.text

# A separate top-level call does not reuse the login's cookies.
response = requests.get('http://quotes.toscrape.com/')
if 'Logout' in response.text:
    print("Still Logged In")
else:
    print("Not Logged In Anymore")

Not Logged In Anymore


In [188]:
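# A Session keeps cookies between requests, so the login carries over.
session = requests.Session()
data = {'username': 'PyRVA', 'password': 'Rocks'}
response = session.post('http://quotes.toscrape.com/login', data=data)
assert 'Login' not in response.text and 'Logout' in response.text

# Same session, so the cookies set at login are sent with this request.
response = session.get('http://quotes.toscrape.com/')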
if 'Logout' in response.text:
    print("Still Logged In")
else:
    print("Not Logged In Anymore")
session.close()

Still Logged In


In [187]:
# As a context manager, the session is closed automatically on exit.
with requests.Session() as session:
    data = {'username': 'PyRVA', 'password': 'Rocks'}
    response = session.post('http://quotes.toscrape.com/login', data=data)
    assert 'Login' not in response.text and 'Logout' in response.text

    response = session.get('http://quotes.toscrape.com/')
    if 'Logout' in response.text:
        print("Still Logged In")
    else:
        print("Not Logged In Anymore")

Still Logged In
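
The difference between the cells above comes down to cookies. The sketch below shows the mechanism without sending anything over the network: a cookie stored on a `Session` is attached to requests prepared through it, while a bare request has nothing to attach. The cookie name and value are made up, standing in for whatever the login response actually sets.

```python
import requests

url = 'http://quotes.toscrape.com/'

with requests.Session() as session:
    # Hypothetical cookie standing in for the one set by the login response.
    session.cookies.set('session', 'abc123',
                        domain='quotes.toscrape.com', path='/')

    # prepare_request() builds the request without sending it.
    prepared = session.prepare_request(requests.Request('GET', url))
    session_cookie_header = prepared.headers.get('Cookie')

# A bare request has no cookie jar to draw on, so no Cookie header is added.
bare_cookie_header = requests.Request('GET', url).prepare().headers.get('Cookie')

print(session_cookie_header)
print(bare_cookie_header)
```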


## Modifying Header Data

In [192]:
response = requests.get('http://quotes.toscrape.com/')
print(response.request.headers)

headers = {'User-Agent': 'pretending to be Chrome', 'event': 'PyRVA'}
response = requests.get('http://quotes.toscrape.com/', headers=headers)
print(response.request.headers)

{'User-Agent': 'python-requests/2.22.0', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive'}
{'User-Agent': 'pretending to be Chrome', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive', 'event': 'PyRVA'}
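
If the same headers are needed on every request, one option is to set them once on a `Session` instead of passing `headers=` each time. The sketch below reuses the header values from the cell above and only prepares the request, so nothing is sent over the network.

```python
import requests

with requests.Session() as session:
    # Set once; merged into every request made through this session.
    session.headers.update({'User-Agent': 'pretending to be Chrome',
                            'event': 'PyRVA'})

    # prepare_request() merges session headers into the request without sending it.
    prepared = session.prepare_request(
        requests.Request('GET', 'http://quotes.toscrape.com/'))
    merged_headers = dict(prepared.headers)

print(merged_headers['User-Agent'])
print(merged_headers['event'])
print(merged_headers['Accept'])
```

The custom values override or extend the session defaults, while untouched defaults such as `Accept` are still included.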


## Sending a Bad Request to Timeout

In [197]:
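# Maximum number of seconds to wait for the server before giving up.
timeout = 2
try:
    # Port 81 does not respond here, so the request runs into the timeout.
    requests.get('http://quotes.toscrape.com:81', timeout=timeout)
except requests.exceptions.Timeout:
    print(f"Server took more than {timeout} seconds to respond.")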

Server took more than 2 seconds to respond.


## Working Behind A Proxy


    # Placeholder proxy addresses, keyed by URL scheme; replace with your own.
    proxies = {
      'http': 'http://10.10.1.10:3128',
      'https': 'http://10.10.1.10:1080',
    }

    response = requests.get('http://quotes.toscrape.com/', proxies=proxies)
