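&emsp;&emsp;HTTP requests are usually sent with either the urllib or the requests library. This section covers both in turn (the author prefers requests). Before starting, define two URLs, one for `GET` requests and one for `POST` requests: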

In [1]:
get_url = 'http://httpbin.org/get'
post_url = 'http://httpbin.org/post'

> `httpbin.org` provides a simple HTTP request and response service.

## 2.1 urllib

&emsp;&emsp;`urllib` is Python's built-in HTTP request library. It contains the following modules:
- urllib.request: request module
- urllib.error: exception handling module
- urllib.parse: URL parsing module
- urllib.robotparser: `robots.txt` parsing module

In [2]:
import socket
from urllib import request
from urllib import parse
from urllib import error
from urllib import robotparser
from http import cookiejar

### 2.1.1 request

&emsp;&emsp;The `request` module provides the `urlopen()` function for opening a URL. A simple example:

In [3]:
with request.urlopen('https://api.douban.com/v2/book/2224879') as f:
    data = f.read()  # read the response body
    print('Status: {0} {1}'.format(f.status, f.reason))  # print the status code and reason
    print('Headers:')
    for k, v in f.getheaders():  # print the response headers
        print('\t{0}: {1}'.format(k, v))
    print('Data:\n', data.decode('utf-8'))  # print the response body

Status: 200 OK
Headers:
	Date: Sat, 10 Nov 2018 14:26:25 GMT
	Content-Type: application/json; charset=utf-8
	Content-Length: 1179
	Connection: close
	Vary: Accept-Encoding
	X-Ratelimit-Remaining2: 99
	X-Ratelimit-Limit2: 100
	Expires: Sun, 1 Jan 2006 01:00:00 GMT
	Pragma: no-cache
	Cache-Control: must-revalidate, no-cache, private
	Set-Cookie: bid=x7CXEz9xPqk; Expires=Sun, 10-Nov-19 14:26:25 GMT; Domain=.douban.com; Path=/
	X-DOUBAN-NEWBID: x7CXEz9xPqk
	X-DAE-Node: anson32
	X-DAE-App: book
	Server: dae
	X-Frame-Options: SAMEORIGIN
Data:
 {"rating":{"max":10,"numRaters":159,"average":"9.3","min":0},"subtitle":"一卷本","pubdate":"1962","image":"https://img3.doubanio.com\/view\/subject\/m\/public\/s3254653.jpg","binding":"精装","images":{"small":"https://img3.doubanio.com\/view\/subject\/s\/public\/s3254653.jpg","large":"https://img3.doubanio.com\/view\/subject\/l\/public\/s3254653.jpg","medium":"https://img3.doubanio.com\/view\/subject\/m\/public\/s3254653.jpg"},"alt":"https:\/\/book.douban.c

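&emsp;&emsp;In the example above, `urlopen()` opens the URL, then `read()`, `status`, and `getheaders()` return the response body, the status code, and the response headers respectively.

&emsp;&emsp;`urlopen()` has three commonly used parameters: `url`, `data`, and `timeout`. The following uses `http://httpbin.org/post` to demonstrate the `data` parameter.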

In [4]:
form_data = {  # named form_data rather than dict, which would shadow the built-in type
    'word': 'hello'
}
data = bytes(parse.urlencode(form_data), encoding='utf8')
print(data)

b'word=hello'


In [5]:
response = request.urlopen(post_url, data=data)
print(response.read().decode('utf-8'))

{
  "args": {}, 
  "data": "", 
  "files": {}, 
  "form": {
    "word": "hello"
  }, 
  "headers": {
    "Accept-Encoding": "identity", 
    "Connection": "close", 
    "Content-Length": "10", 
    "Content-Type": "application/x-www-form-urlencoded", 
    "Host": "httpbin.org", 
    "User-Agent": "Python-urllib/3.7"
  }, 
  "json": null, 
  "origin": "27.46.137.115", 
  "url": "http://httpbin.org/post"
}



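&emsp;&emsp;In the example above, `bytes(parse.urlencode(...))` first converts the dictionary into URL-encoded bytes, which are then passed as the `data` argument of `urlopen()`. That completes a `POST` request.

> When `data` is not set, the request uses the `GET` method by default; when it is set, the request uses `POST`.

&emsp;&emsp;When the network is poor or the server misbehaves, requests can be slow or fail, so a request should be given a timeout rather than leaving the program waiting indefinitely. For example: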
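
&emsp;&emsp;The rule above can be checked without sending anything, by inspecting a `Request` object. The following is a minimal sketch; the URL is only a placeholder, and the explicit `method='PUT'` case is included as an illustration of how the default can be overridden.

```python
from urllib import request

# Placeholder URL; nothing is sent, the Request objects are only inspected.
url = 'http://httpbin.org/post'

without_data = request.Request(url)
with_data = request.Request(url, data=b'word=hello')
# An explicit `method` overrides the default chosen from `data`.
forced_put = request.Request(url, data=b'word=hello', method='PUT')

print(without_data.get_method())  # GET
print(with_data.get_method())     # POST
print(forced_put.get_method())    # PUT
```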

In [6]:
try:
    response = request.urlopen(get_url, timeout=0.1)
    print(response.read().decode('utf-8'))
except Exception as e:
    print(e)

<urlopen error timed out>


&emsp;&emsp;To get past a target site's anti-crawler checks, requests are usually sent with some header fields set (such as `User-Agent`):

In [7]:
req = request.Request(url=get_url)
req.add_header('User-Agent', 'Mozilla/6.0 (iPhone; CPU iPhone OS 8_0 like Mac OS X) AppleWebKit/536.26 (KHTML, like Gecko) Version/8.0 Mobile/10A5376e Safari/8536.25')
with request.urlopen(req) as f:
    print('Status: {0} {1}'.format(f.status, f.reason))
    print('Data:\n', f.read().decode('utf-8'))

Status: 200 OK
Data:
 {
  "args": {}, 
  "headers": {
    "Accept-Encoding": "identity", 
    "Connection": "close", 
    "Host": "httpbin.org", 
    "User-Agent": "Mozilla/6.0 (iPhone; CPU iPhone OS 8_0 like Mac OS X) AppleWebKit/536.26 (KHTML, like Gecko) Version/8.0 Mobile/10A5376e Safari/8536.25"
  }, 
  "origin": "27.46.137.115", 
  "url": "http://httpbin.org/get"
}



&emsp;&emsp;Besides `add_header()`, headers can also be supplied by defining a dictionary and passing it as the `headers` argument:

In [8]:
headers = {
    'User-Agent': 'Mozilla/6.0 (iPhone; CPU iPhone OS 8_0 like Mac OS X) AppleWebKit/536.26 (KHTML, like Gecko) Version/8.0 Mobile/10A5376e Safari/8536.25'
}

In [9]:
req = request.Request(post_url, headers=headers, data=data)
with request.urlopen(req) as f:
    print('Status: {0} {1}'.format(f.status, f.reason))

Status: 200 OK
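
&emsp;&emsp;Both ways of setting headers end up in the same place on the `Request` object. A small sketch, using a made-up `demo-agent/1.0` value, shows one detail that is easy to trip over: urllib stores header names through `str.capitalize()`, so lookups must use that form.

```python
from urllib import request

# 'demo-agent/1.0' is an illustrative value, not a real browser string.
req = request.Request('http://httpbin.org/get',
                      headers={'User-Agent': 'demo-agent/1.0'})
req.add_header('Accept-Language', 'en')

# Header names are stored capitalized: 'User-agent', 'Accept-language'.
print(req.has_header('User-agent'))   # True
print(req.has_header('User-Agent'))   # False
print(req.get_header('User-agent'))
print(req.header_items())
```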


### 2.1.2 handler

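&emsp;&emsp;This section introduces two handlers: ProxyHandler, which configures a proxy, and HTTPCookieProcessor, which handles cookies.

&emsp;&emsp;A site may track how many requests an IP makes in a given period and block it if there are too many. In that case a proxy is needed; urllib configures one through `request.ProxyHandler()`: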

In [10]:
proxy_handler = request.ProxyHandler({
    'http': 'http://61.135.217.7',
    'https': 'https://61.178.127.14'
})

In [11]:
opener = request.build_opener(proxy_handler)
with opener.open(post_url) as f:
    print('Status: {0} {1}'.format(f.status, f.reason))

Status: 200 OK


&emsp;&emsp;Proxy authentication can be handled with ProxyBasicAuthHandler:

``` python
url = 'http://www.example.com/login.html'
proxy_handler = request.ProxyHandler({'http': 'http://www.example.com:3128/'})
proxy_auth_handler = request.ProxyBasicAuthHandler()
proxy_auth_handler.add_password('realm', 'host', 'username', 'password')
opener = request.build_opener(proxy_handler, proxy_auth_handler)
with opener.open(url) as f:
    pass
```

&emsp;&emsp;Cookies hold the login information we commonly rely on, and crawling some sites requires sending cookies with the request. Here `http.cookiejar` is used to obtain and store cookies:

In [12]:
cookie = cookiejar.CookieJar()
handler = request.HTTPCookieProcessor(cookie)
opener = request.build_opener(handler)
with opener.open('http://www.baidu.com'):
    for item in cookie:
        print(item.name + '=' + item.value)

BAIDUID=D45627B5DED40DACA0F9BCFCF20DB981:FG=1
BIDUPSID=D45627B5DED40DACA0F9BCFCF20DB981
H_PS_PSSID=26524_1429_21126_27400_22159
PSTM=1541859990
delPer=0
BDSVRTM=0
BD_HOME=0
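
&emsp;&emsp;The cell above only reads cookies from memory. To keep them between runs, one option is `MozillaCookieJar`, which saves to and loads from a text file. The sketch below builds a cookie by hand so it runs offline; the helper `make_cookie`, the cookie values, and the temporary file path are all assumptions for illustration. In practice the jar would be filled by `HTTPCookieProcessor`.

```python
import os
import tempfile
from http import cookiejar


def make_cookie(name, value, domain):
    """Build a session cookie by hand (illustrative helper, not a library API)."""
    return cookiejar.Cookie(
        version=0, name=name, value=value,
        port=None, port_specified=False,
        domain=domain, domain_specified=False, domain_initial_dot=False,
        path='/', path_specified=True,
        secure=False, expires=None, discard=True,
        comment=None, comment_url=None, rest={},
    )


path = os.path.join(tempfile.mkdtemp(), 'cookies.txt')

jar = cookiejar.MozillaCookieJar(path)
jar.set_cookie(make_cookie('number', '12345678', 'httpbin.org'))
# Session cookies are skipped unless ignore_discard=True.
jar.save(ignore_discard=True, ignore_expires=True)

restored = cookiejar.MozillaCookieJar(path)
restored.load(ignore_discard=True, ignore_expires=True)
restored_cookies = {c.name: c.value for c in restored}
print(restored_cookies)
```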


### 2.1.3 error

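&emsp;&emsp;The `timeout` example in 2.1.1 caught a generic `Exception`. The exception raised there comes from the `urllib.error` module, and it can be handled more precisely: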

In [13]:
try:
    response = request.urlopen('http://www.pythonsite.com/', timeout=0.001)
except error.URLError as e:
    print(type(e.reason))
    if isinstance(e.reason, socket.timeout):
        print('Time Out')

<class 'socket.timeout'>
Time Out


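&emsp;&emsp;`urllib.error` defines two exceptions: URLError and HTTPError. URLError carries a single attribute, `reason`, so only the cause of the error can be printed when catching it. HTTPError has three attributes, `code`, `reason`, and `headers`, which give the status code, the cause, and the response headers. HTTPError is a subclass of URLError, so it should be caught first. For example: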

In [14]:
try:
    response = request.urlopen('http://pythonsite.com/1111.html')
except error.HTTPError as e:
    print(e.reason)
    print(e.code)
    print(e.headers)
except error.URLError as e:
    print(e.reason)
else:
    print('request successful')

Not Found
404
Date: Sat, 10 Nov 2018 14:26:29 GMT
Server: Apache
Vary: Accept-Encoding
Content-Length: 207
Connection: close
Content-Type: text/html; charset=iso-8859-1
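
&emsp;&emsp;The order of the `except` clauses matters because HTTPError is a subclass of URLError. The sketch below shows this offline: the two exceptions are constructed by hand purely for illustration (normally `urlopen()` raises them), and `describe` is a hypothetical helper.

```python
import io
from email.message import Message
from urllib import error


def describe(exc):
    """Return a short label, checking the more specific HTTPError first."""
    if isinstance(exc, error.HTTPError):
        return 'HTTP {0}: {1}'.format(exc.code, exc.reason)
    if isinstance(exc, error.URLError):
        return 'URL error: {0}'.format(exc.reason)
    return 'other: {0}'.format(exc)


# Built by hand for illustration only; example.com is a placeholder.
http_exc = error.HTTPError('http://example.com/missing', 404, 'Not Found',
                           Message(), io.BytesIO())
url_exc = error.URLError('timed out')

print(issubclass(error.HTTPError, error.URLError))  # True
print(describe(http_exc))
print(describe(url_exc))
http_exc.close()
```

&emsp;&emsp;If the URLError check came first, it would also match HTTPError and the status code would never be reported.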




### 2.1.4 parse

&emsp;&emsp;The `urllib.parse` module parses URLs. Its `urlparse()` function splits a URL into components:

In [15]:
result = parse.urlparse('http://www.baidu.com/index.html;user?id=5#comment')
print(result)

ParseResult(scheme='http', netloc='www.baidu.com', path='/index.html', params='user', query='id=5', fragment='comment')
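
&emsp;&emsp;The `ParseResult` shown above is a named tuple, so its parts can be read by name or by index. As a small sketch, the following also uses `_replace()` and `geturl()` to change one component and rebuild the URL; the new query value `id=6` is arbitrary.

```python
from urllib import parse

result = parse.urlparse('http://www.baidu.com/index.html;user?id=5#comment')

# Fields are available by name or by position.
print(result.scheme, result.netloc, result.path)
print(result[4])  # same value as result.query

# _replace() returns a modified copy; geturl() reassembles the string.
changed = result._replace(query='id=6', fragment='')
print(changed.geturl())
```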


&emsp;&emsp;`urlunparse()` does the opposite and assembles a URL from its components:

In [16]:
data = ['http', 'www.baidu.com', 'index.html', 'user', 'a=123', 'commit']
print(parse.urlunparse(data))

http://www.baidu.com/index.html;user?a=123#commit


&emsp;&emsp;`urljoin()` also builds URLs, by resolving a relative reference against a base URL:

In [17]:
print(parse.urljoin('http://www.baidu.com', 'FAQ.html'))

http://www.baidu.com/FAQ.html
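
&emsp;&emsp;How `urljoin()` resolves the second argument depends on its form, which matters when following links found in a page. A short sketch, with an assumed base URL under `/docs/`:

```python
from urllib import parse

base = 'http://www.baidu.com/docs/index.html'  # assumed base page

relative = parse.urljoin(base, 'FAQ.html')      # sibling of index.html
rooted = parse.urljoin(base, '/FAQ.html')       # relative to the site root
parent = parse.urljoin(base, '../FAQ.html')     # one directory up
absolute = parse.urljoin(base, 'https://www.example.com/a')  # base is ignored

for url in (relative, rooted, parent, absolute):
    print(url)
```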


&emsp;&emsp;`urlencode()` converts a dictionary into URL query parameters:

In [18]:
params = {
    'name': '尧德胜',
    'age': 23,
}
base_url = 'http://www.baidu.com?'

print(base_url + parse.urlencode(params))

http://www.baidu.com?name=%E5%B0%A7%E5%BE%B7%E8%83%9C&age=23
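
&emsp;&emsp;The encoding can be reversed with `parse_qs()`, which is handy for checking what a query string actually contains. The sketch below round-trips the same parameters; the `tag` example with `doseq=True` is an added illustration of list values.

```python
from urllib import parse

query = parse.urlencode({'name': '尧德胜', 'age': 23})
print(query)

# parse_qs() decodes percent-escapes; every value comes back as a list of strings.
decoded = parse.parse_qs(query)
print(decoded)

# doseq=True expands a list into repeated keys.
multi = parse.urlencode({'tag': ['a', 'b']}, doseq=True)
print(multi)
```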


### 2.1.5 robotparser

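&emsp;&emsp;The `urllib.robotparser` module parses robots.txt (the Robots protocol). A simple example:

> The Robots protocol, also called the crawler protocol, tells crawlers which pages may be fetched. The file usually sits in the site's root directory.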

In [19]:
rp = robotparser.RobotFileParser()  # create a RobotFileParser object

rp.set_url('http://www.jianshu.com/robots.txt')  # set the URL of the target site's robots.txt
rp.read()  # fetch and parse robots.txt
print(rp.can_fetch('*', 'https://www.jianshu.com/p/2b0ed045e535'))  # check whether the page may be crawled

False
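
&emsp;&emsp;To see how the rules are evaluated without depending on a live site, the parser can also be fed lines directly through `parse()`. The rules and the `example.com` URLs below are made up for illustration.

```python
from urllib import robotparser

# Hypothetical rules, supplied inline instead of being downloaded.
rules = [
    'User-agent: *',
    'Disallow: /private/',
]

rp = robotparser.RobotFileParser()
rp.parse(rules)

blocked = rp.can_fetch('*', 'http://example.com/private/data.html')
allowed = rp.can_fetch('*', 'http://example.com/public/index.html')
print(blocked)  # False
print(allowed)  # True
```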


## 2.2 requests

&emsp;&emsp;This section introduces requests, which is more powerful and convenient than urllib:

In [20]:
import requests

### 2.2.1 GET requests

&emsp;&emsp;urllib's `urlopen()` sends a `GET` request by default, whereas requests sends `GET` requests through an explicitly named `get()` function:

In [21]:
with requests.get(get_url) as r:
    print('Status: {0} {1}'.format(r.status_code, r.reason))  # print the status code and reason
    print('Encoding: {}'.format(r.encoding))
    print('Headers: \n{}'.format(r.headers))  # print the response headers
    print('Data: \n{}'.format(r.text))  # print the response body

Status: 200 OK
Encoding: None
Headers: 
{'Connection': 'keep-alive', 'Server': 'gunicorn/19.9.0', 'Date': 'Sat, 10 Nov 2018 14:26:32 GMT', 'Content-Type': 'application/json', 'Content-Length': '266', 'Access-Control-Allow-Origin': '*', 'Access-Control-Allow-Credentials': 'true', 'Via': '1.1 vegur'}
Data: 
{
  "args": {}, 
  "headers": {
    "Accept": "*/*", 
    "Accept-Encoding": "gzip, deflate", 
    "Connection": "close", 
    "Host": "httpbin.org", 
    "User-Agent": "python-requests/2.20.0"
  }, 
  "origin": "27.46.137.115", 
  "url": "http://httpbin.org/get"
}



&emsp;&emsp;The returned data is clearly JSON, so the `json()` method can be used to convert the JSON string into a dictionary:

In [22]:
with requests.get(get_url) as r:
    json = r.json() 
    print(json)
    print(type(json)) 

{'args': {}, 'headers': {'Accept': '*/*', 'Accept-Encoding': 'gzip, deflate', 'Connection': 'close', 'Host': 'httpbin.org', 'User-Agent': 'python-requests/2.20.0'}, 'origin': '27.46.137.115', 'url': 'http://httpbin.org/get'}
<class 'dict'>


&emsp;&emsp;To attach extra information to a `GET` request, use the `params` argument:

In [23]:
data = {
    'name': 'gaiusyao',
    'id': '42'
}

In [24]:
with requests.get(get_url, params=data) as r:
    print('Data: \n{}'.format(r.text))

Data: 
{
  "args": {
    "id": "42", 
    "name": "gaiusyao"
  }, 
  "headers": {
    "Accept": "*/*", 
    "Accept-Encoding": "gzip, deflate", 
    "Connection": "close", 
    "Host": "httpbin.org", 
    "User-Agent": "python-requests/2.20.0"
  }, 
  "origin": "27.46.137.115", 
  "url": "http://httpbin.org/get?name=gaiusyao&id=42"
}
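
&emsp;&emsp;The final URL can be inspected before anything is sent by preparing the request by hand. This is a sketch using `requests.Request(...).prepare()`; the `tag` parameter with a list value is an added illustration.

```python
import requests

# Nothing is sent; the request is only prepared and inspected.
prepared = requests.Request(
    'GET', 'http://httpbin.org/get',
    params={'name': 'gaiusyao', 'id': '42'},
).prepare()
print(prepared.url)

# A list value becomes a repeated key.
repeated = requests.Request(
    'GET', 'http://httpbin.org/get',
    params={'tag': ['a', 'b']},
).prepare()
print(repeated.url)
```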



&emsp;&emsp;Pass the `headers` argument to set request headers:

In [25]:
with requests.get(get_url, headers=headers) as r:
    print('Status: {0} {1}'.format(r.status_code, r.reason))

Status: 200 OK


### 2.2.2 POST requests

&emsp;&emsp;requests sends `POST` requests with the `post()` function; as with urllib, the payload goes in the `data` argument:

In [26]:
with requests.post(post_url, data = data) as r:
    print('Status: {0} {1}'.format(r.status_code, r.reason))
    print('Data: \n{}'.format(r.text))

Status: 200 OK
Data: 
{
  "args": {}, 
  "data": "", 
  "files": {}, 
  "form": {
    "id": "42", 
    "name": "gaiusyao"
  }, 
  "headers": {
    "Accept": "*/*", 
    "Accept-Encoding": "gzip, deflate", 
    "Connection": "close", 
    "Content-Length": "19", 
    "Content-Type": "application/x-www-form-urlencoded", 
    "Host": "httpbin.org", 
    "User-Agent": "python-requests/2.20.0"
  }, 
  "json": null, 
  "origin": "27.46.137.115", 
  "url": "http://httpbin.org/post"
}
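
&emsp;&emsp;The output above shows the payload arriving under `form` with a form-encoded `Content-Type`. For APIs that expect a JSON body, requests also accepts a `json` argument. The sketch below compares the two by preparing requests without sending them; it is meant as an illustration, not as something the examples above rely on.

```python
import json

import requests

payload = {'name': 'gaiusyao', 'id': '42'}

# Prepared only, never sent.
form_req = requests.Request('POST', 'http://httpbin.org/post', data=payload).prepare()
json_req = requests.Request('POST', 'http://httpbin.org/post', json=payload).prepare()

print(form_req.headers['Content-Type'], form_req.body)
print(json_req.headers['Content-Type'], json_req.body)
```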



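### 2.2.3 Responses

&emsp;&emsp;After a request is sent, the server's response comes back. Besides `text` for the response body, the response object offers many other attributes and methods: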

In [27]:
res = requests.get('http://www.jianshu.com', headers=headers)

In [28]:
# Get the URL of the response (after any redirects)
res.url

'https://www.jianshu.com/'

In [29]:
# Get the status code
res.status_code

200

In [30]:
# Get the response headers
res.headers

{'Date': 'Sat, 10 Nov 2018 14:26:36 GMT', 'Content-Type': 'text/html; charset=utf-8', 'Transfer-Encoding': 'chunked', 'Connection': 'keep-alive', 'Server': 'Tengine', 'X-Frame-Options': 'DENY', 'X-XSS-Protection': '1; mode=block', 'X-Content-Type-Options': 'nosniff', 'ETag': 'W/"ffafde6206e174a08045a8f234939ed8"', 'Cache-Control': 'max-age=0, private, must-revalidate', 'Set-Cookie': 'locale=zh-CN; path=/, _m7e_session=7bcd81ab19d94c9724dafc617c87cd5e; path=/; expires=Sat, 10 Nov 2018 20:26:36 -0000; secure; HttpOnly', 'X-Request-Id': '78624101-fd8d-4ce1-81e0-d0fa1217fe50', 'X-Runtime': '0.012926', 'Strict-Transport-Security': 'max-age=31536000; includeSubDomains; preload', 'Content-Encoding': 'gzip', 'X-Via': '1.1 pingwangtong38:5 (Cdn Cache Server V2.0), 1.1 PSgdjywtsn16:6 (Cdn Cache Server V2.0)'}

In [31]:
# Get the cookies
res.cookies

<RequestsCookieJar[Cookie(version=0, name='_m7e_session', value='7bcd81ab19d94c9724dafc617c87cd5e', port=None, port_specified=False, domain='www.jianshu.com', domain_specified=False, domain_initial_dot=False, path='/', path_specified=True, secure=True, expires=1541881596, discard=False, comment=None, comment_url=None, rest={'HttpOnly': None}, rfc2109=False), Cookie(version=0, name='locale', value='zh-CN', port=None, port_specified=False, domain='www.jianshu.com', domain_specified=False, domain_initial_dot=False, path='/', path_specified=True, secure=False, expires=None, discard=True, comment=None, comment_url=None, rest={}, rfc2109=False)]>

In [32]:
# Get the request history (redirects)
res.history

[<Response [301]>]

### 2.2.4 Cookie

&emsp;&emsp;Start with a simple example that reads cookies:

In [33]:
with requests.get('http://www.jianshu.com', headers=headers) as r:
    for k, v in r.cookies.items():
        print('\t{0}: {1}'.format(k, v))

	_m7e_session: 4b1494dfb47dd67395a204190fb92b8a
	locale: zh-CN


&emsp;&emsp;Cookies can also be placed in the headers before sending the request:

In [34]:
c_headers = {
    'Cookie': '__yadk_uid=KJqOgEJ7RRTgmhU3x0gwXUmSB2SGF6Bv; remember_user_token=W1s2NTMzODI1XSwiJDJhJDExJGprOXAzcTQwQ09oUTQ1RW9GSEVCbi4iLCIxNTQxNzUzNTQzLjcyNTMxOCJd--c82c1473b9fb81ef55c8432ce297f13b41f223a8; sensorsdata2015jssdkcross=%7B%22distinct_id%22%3A%226533825%22%2C%22%24device_id%22%3A%221665ba3af58853-0563eb383564ca-5701631-1327104-1665ba3af5aaa9%22%2C%22props%22%3A%7B%22%24latest_traffic_source_type%22%3A%22%E7%9B%B4%E6%8E%A5%E6%B5%81%E9%87%8F%22%2C%22%24latest_referrer%22%3A%22%22%2C%22%24latest_referrer_host%22%3A%22%22%2C%22%24latest_search_keyword%22%3A%22%E6%9C%AA%E5%8F%96%E5%88%B0%E5%80%BC_%E7%9B%B4%E6%8E%A5%E6%89%93%E5%BC%80%22%7D%2C%22first_id%22%3A%221665ba3af58853-0563eb383564ca-5701631-1327104-1665ba3af5aaa9%22%7D; read_mode=day; default_font=font2; locale=zh-CN; _m7e_session=66694739e9ada77d06a7683f09d1f4ae; Hm_lvt_0c0e9d9b1e7d617b3e6842e85b9fb068=1541605221,1541686521,1541725091,1541843383; Hm_lpvt_0c0e9d9b1e7d617b3e6842e85b9fb068=1541855281',
    'User-Agent': 'Mozilla/6.0 (iPhone; CPU iPhone OS 8_0 like Mac OS X) AppleWebKit/536.26 (KHTML, like Gecko) Version/8.0 Mobile/10A5376e Safari/8536.25'
}

In [35]:
with requests.get('http://www.jianshu.com', headers=c_headers) as r:
    for k, v in r.cookies.items():
        print('\t{0}: {1}'.format(k, v))

	_m7e_session: f54e0f76804cbc06653cc0f0ddf710c1
	locale: zh-CN


### 2.2.5 Session persistence

&emsp;&emsp;A `Session` object makes it easy to maintain a session without having to manage cookies by hand:

In [36]:
with requests.Session() as s:
    s.get('http://httpbin.org/cookies/set/number/12345678')
    r = s.get('http://httpbin.org/cookies')
    print(r.text)

{
  "cookies": {
    "number": "12345678"
  }
}



In [37]:
requests.get('http://httpbin.org/cookies/set/number/123456789')
r = requests.get('http://httpbin.org/cookies')
print(r.text)

{
  "cookies": {}
}



&emsp;&emsp;Without a Session object, the `get('http://httpbin.org/cookies')` step cannot see the cookie set earlier.
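
&emsp;&emsp;What the session carries over can be seen offline by preparing a request through it. In this sketch the cookie is set by hand rather than by a server, and `demo-agent/1.0` is a made-up header value; the point is only that session-level headers and cookies are merged into each outgoing request.

```python
import requests

with requests.Session() as s:
    # State stored on the session (values are illustrative).
    s.headers.update({'User-Agent': 'demo-agent/1.0'})
    s.cookies.set('number', '12345678')

    # prepare_request() merges session state into the request; nothing is sent.
    prepared = s.prepare_request(
        requests.Request('GET', 'http://httpbin.org/cookies'))
    sent_user_agent = prepared.headers['User-Agent']
    sent_cookie = prepared.headers.get('Cookie')

print(sent_user_agent)
print(sent_cookie)
```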

### 2.2.6 Proxy settings

&emsp;&emsp;A proxy is configured through the `proxies` argument:

In [38]:
proxies = {
    'http': 'http://61.135.217.7',
    'https': 'https://61.178.127.14'
}

In [39]:
with requests.get(get_url, proxies=proxies) as r:
    print('Status: {0} {1}'.format(r.status_code, r.reason))

Status: 200 OK


&emsp;&emsp;If the proxy requires HTTP Basic Auth, use a URL of the form `http://user:password@host:port` to configure it.
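
&emsp;&emsp;A sketch of such a proxy URL and how it breaks down, using only `urllib.parse`. The address `10.10.1.10:3128` and the credentials are placeholders. Characters such as `@` or `:` inside a username or password need to be percent-encoded first, otherwise the URL is ambiguous.

```python
from urllib import parse

# Placeholder proxy address and credentials.
proxy_url = 'http://user:password@10.10.1.10:3128'
proxies = {'http': proxy_url, 'https': proxy_url}

parts = parse.urlsplit(proxy_url)
print(parts.username, parts.password, parts.hostname, parts.port)

# Percent-encode credentials that contain reserved characters.
safe_user = parse.quote('user@corp', safe='')
print('http://{0}:password@10.10.1.10:3128'.format(safe_user))
```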

### 2.2.7 Timeout settings

&emsp;&emsp;requests also uses a `timeout` argument (without it, the request waits indefinitely):

In [40]:
try:
    with requests.get(get_url, timeout=1) as r:
        print('Status: {0} {1}'.format(r.status_code, r.reason))
except Exception as e:
    print(e)

Status: 200 OK


&emsp;&emsp;A request actually has two phases, connecting and reading, and a tuple can set a timeout for each:

In [41]:
try:
    with requests.get(get_url, timeout=(2, 4)) as r:
        print('Status: {0} {1}'.format(r.status_code, r.reason))
except Exception as e:
    print(e)

Status: 200 OK


### 2.2.8 Authentication

&emsp;&emsp;requests has built-in support for authentication:

``` python
r = requests.get('http://localhost:5000', auth=('username', 'password'))
print('Status: {0} {1}'.format(r.status_code, r.reason))
```
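
&emsp;&emsp;A `(username, password)` tuple passed as `auth` is treated as HTTP Basic authentication, which amounts to an `Authorization` header holding the Base64 of `username:password`. The sketch below rebuilds that header with the standard library only; `basic_auth_header` is a hypothetical helper and assumes ASCII credentials.

```python
import base64


def basic_auth_header(username, password):
    """Build the value of an HTTP Basic Authorization header (ASCII credentials assumed)."""
    raw = '{0}:{1}'.format(username, password).encode('utf-8')
    return 'Basic ' + base64.b64encode(raw).decode('ascii')


header = basic_auth_header('username', 'password')
print(header)
```

&emsp;&emsp;Base64 is an encoding, not encryption, so Basic authentication should only be sent over HTTPS.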

> For more, see the [requests official documentation](http://www.python-requests.org/en/master/)