# Fetching Web Data

Use the requests and BeautifulSoup (bs4) libraries to access web pages, then scrape and analyze the data they return.

In [8]:
import requests
import pprint as pp

### Basic requests

In [27]:
r = requests.get('http://httpbin.org/get')

In [55]:
# Status code of the response: 200 means OK, 404 means Not Found.
r.status_code

200
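Besides comparing `status_code` by hand, requests offers `r.ok` and `r.raise_for_status()`. The sketch below builds bare `Response` objects locally and sets their status codes by hand, purely so the behaviour can be seen without network access; real code would get the response from `requests.get`. The helper name `describe` is made up for this example.

```python
import requests

def describe(resp):
    """Return a short label for a response based on its status code."""
    try:
        # Raises requests.HTTPError for 4xx and 5xx codes, does nothing otherwise.
        resp.raise_for_status()
    except requests.HTTPError:
        return "error %d" % resp.status_code
    return "ok %d" % resp.status_code

# Hand-built responses, used only for illustration.
ok_resp = requests.Response()
ok_resp.status_code = 200

missing_resp = requests.Response()
missing_resp.status_code = 404

print(describe(ok_resp))
print(describe(missing_resp))
print(ok_resp.ok, missing_resp.ok)
```

Calling `raise_for_status()` right after a request is a common way to stop early instead of parsing an error page as if it were data.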

In [56]:
# Cookies the server set on this response; servers use cookies to identify and track the client.
r.cookies

<RequestsCookieJar[]>

In [57]:
# HTTP headers returned by the server; every response carries them.
r.headers

{'Connection': 'keep-alive', 'Server': 'gunicorn/19.8.1', 'Date': 'Wed, 25 Jul 2018 04:29:27 GMT', 'Content-Type': 'application/json', 'Content-Length': '208', 'Access-Control-Allow-Origin': '*', 'Access-Control-Allow-Credentials': 'true', 'Via': '1.1 vegur'}

In [53]:
r.headers.get("Connection")

'keep-alive'
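`r.headers` is a case-insensitive mapping, so the lookup above does not depend on how the server capitalised the header name. A minimal sketch using `requests.structures.CaseInsensitiveDict` directly, filled with two of the headers from the output above; the `X-Missing` name is invented to show the default.

```python
from requests.structures import CaseInsensitiveDict

# Two of the headers shown above, stored the way requests stores them.
headers = CaseInsensitiveDict({
    "Connection": "keep-alive",
    "Content-Type": "application/json",
})

# Lookups ignore case, so both of these find an entry.
print(headers["connection"])
print(headers.get("CONTENT-TYPE"))

# A missing header returns the default instead of raising KeyError.
print(headers.get("X-Missing", "not sent"))
```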

In [58]:
# Body of the HTTP response as raw bytes.
r.content

b'{"args":{},"headers":{"Accept":"*/*","Accept-Encoding":"gzip, deflate","Connection":"close","Host":"httpbin.org","User-Agent":"python-requests/2.19.1"},"origin":"35.193.18.41","url":"http://httpbin.org/get"}\n'

In [59]:
# Body of the HTTP response decoded to a string.
r.text

'{"args":{},"headers":{"Accept":"*/*","Accept-Encoding":"gzip, deflate","Connection":"close","Host":"httpbin.org","User-Agent":"python-requests/2.19.1"},"origin":"35.193.18.41","url":"http://httpbin.org/get"}\n'
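The two outputs differ only in type: `r.content` is bytes and `r.text` is the same payload decoded to `str`. For a JSON body, `r.json()` goes one step further and returns Python objects. The sketch below mimics those three views with the standard library on a shortened copy of the body above, assuming UTF-8 encoding.

```python
import json

# Raw bytes, a shortened copy of the r.content shown above.
content = b'{"args":{},"origin":"35.193.18.41","url":"http://httpbin.org/get"}\n'

# r.text is the same payload decoded to str (UTF-8 assumed here).
text = content.decode("utf-8")

# r.json() parses the body; json.loads does the same job on the text.
data = json.loads(text)

print(type(content).__name__, type(text).__name__)
print(data["url"])
```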

#### Fetching another URL

In [6]:
g = requests.get('http://google.com')

In [50]:
g.cookies

<RequestsCookieJar[Cookie(version=0, name='1P_JAR', value='2018-07-25-04', port=None, port_specified=False, domain='.google.com', domain_specified=True, domain_initial_dot=True, path='/', path_specified=True, secure=False, expires=1535083891, discard=False, comment=None, comment_url=None, rest={}, rfc2109=False), Cookie(version=0, name='NID', value='135=qfi76jat1X5Fqbo4LVJvONvC9ik5-24E_I-940-gWR_EVaHXbZN3rzGTtLzFEkXfiz1WvZUV7OLxeQttbjODpuL3jroDzsocmdzzKpB-vf3SYmzP2clp17rchR_ia-Zf', port=None, port_specified=False, domain='.google.com', domain_specified=True, domain_initial_dot=True, path='/', path_specified=True, secure=False, expires=1548303091, discard=False, comment=None, comment_url=None, rest={'HttpOnly': None}, rfc2109=False)]>

In [10]:
pp.pprint(g.text)

('<!doctype html><html itemscope="" itemtype="http://schema.org/WebPage" '
 'lang="en"><head><meta content="Search the world\'s information, including '
 'webpages, images, videos and more. Google has many special features to help '
 'you find exactly what you\'re looking for." name="description"><meta '
 'content="noodp" name="robots"><meta content="text/html; charset=UTF-8" '
 'http-equiv="Content-Type"><meta '
 'content="/images/branding/googleg/1x/googleg_standard_color_128dp.png" '
 'itemprop="image"><title>Google</title><script '
 'nonce="QA9aPd6GP21fPJsbP9VV6w==">(function(){window.google={kEI:\'c_hXW7z_BsfdjwSHop-QAw\',kEXPI:\'0,201847,1151900,57,1654,246,58,1016,40,242,709,1004,636,773,156,139,446,2339139,208,219,32,329294,1294,12383,2349,2506,32691,15248,867,316,10445,1402,6381,3335,2,2,6801,364,553,664,326,1776,113,1149,1052,3191,224,502,5,349,1362,130,130,1028,241,2473,1365,444,131,1119,2,1306,310,296,1825,59,2,1,3,1297,1712,1376,505,730,377,1719,608,50,636,8,302,1267,774,1
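Raw HTML like this is hard to read, which is where a parser comes in. With bs4 installed, `BeautifulSoup(g.text, 'html.parser').title.string` is the usual short route to the page title. The sketch below swaps in the standard library's `html.parser` so that it runs without extra packages; the `TitleParser` class and the tiny HTML string are stand-ins made up for this example.

```python
from html.parser import HTMLParser

class TitleParser(HTMLParser):
    """Collect the text inside the <title> element."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        # Only keep text that appears between <title> and </title>.
        if self.in_title:
            self.title += data

# A tiny stand-in for g.text; the real page is far longer.
html = "<!doctype html><html><head><title>Google</title></head><body></body></html>"

parser = TitleParser()
parser.feed(html)
print(parser.title)
```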

### Requests with parameters

- With GET, parameters travel in the URL query string, in the form url?param=value.

In [13]:
payload = {'q': '杨彦星'}
r = requests.get('http://www.so.com/s', params=payload)
r.url

'https://www.so.com/s?q=%E6%9D%A8%E5%BD%A6%E6%98%9F'
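The percent signs in that URL come from encoding the non-ASCII value as UTF-8 and escaping each byte. A small sketch with `urllib.parse` reproduces the query string and decodes it again; it illustrates the encoding only and is not a claim about how requests builds the URL internally.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

params = {"q": "杨彦星"}

# Non-ASCII text is UTF-8 encoded, then each byte is percent-escaped.
query = urlencode(params)
url = "https://www.so.com/s?" + query
print(url)

# Decoding the query string gives the original value back.
decoded = parse_qs(urlsplit(url).query)
print(decoded["q"][0])
```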

- With POST, parameters do not appear in the URL; they are sent in the request body.

In [17]:
payload =  {'a':'杨','b':'hello'}
r = requests.post("http://httpbin.org/post", data=payload)
pp.pprint(r.text)

('{"args":{},"data":"","files":{},"form":{"a":"\\u6768","b":"hello"},"headers":{"Accept":"*/*","Accept-Encoding":"gzip, '
 'deflate","Connection":"close","Content-Length":"19","Content-Type":"application/x-www-form-urlencoded","Host":"httpbin.org","User-Agent":"python-requests/2.19.1"},"json":null,"origin":"35.193.18.41","url":"http://httpbin.org/post"}\n')
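To see where the parameters actually go, the request can be built without being sent. The sketch below uses `requests.Request(...).prepare()` and inspects the result; the body length matches the `Content-Length` of 19 reported by httpbin above.

```python
import requests

payload = {"a": "杨", "b": "hello"}

# Build the request without sending it, so the wire format can be inspected.
prepared = requests.Request(
    "POST", "http://httpbin.org/post", data=payload
).prepare()

print(prepared.url)                       # the URL carries no query string
print(prepared.body)                      # form-encoded parameters live here
print(prepared.headers["Content-Type"])
print(prepared.headers["Content-Length"])
```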


- The data argument accepts more than a dict; it also takes a string, such as serialized JSON.

In [21]:
payload = {'a': '杨', 'b': 'hello'}

import json
r = requests.post('http://httpbin.org/post', data=json.dumps(payload))

In [23]:
r.text

'{"args":{},"data":"{\\"a\\": \\"\\\\u6768\\", \\"b\\": \\"hello\\"}","files":{},"form":{},"headers":{"Accept":"*/*","Accept-Encoding":"gzip, deflate","Connection":"close","Content-Length":"29","Host":"httpbin.org","User-Agent":"python-requests/2.19.1"},"json":{"a":"\\u6768","b":"hello"},"origin":"35.193.18.41","url":"http://httpbin.org/post"}\n'
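Note that the echoed headers above contain no `Content-Type`: a string passed through `data` is sent as-is. requests also has a `json` argument that serializes the dict and labels the body. The sketch below compares the two by preparing each request locally without sending it.

```python
import json
import requests

payload = {"a": "杨", "b": "hello"}
url = "http://httpbin.org/post"

# data= with a pre-serialized string: sent as-is, no Content-Type is added.
as_data = requests.Request("POST", url, data=json.dumps(payload)).prepare()

# json= serializes the dict and labels the body as JSON.
as_json = requests.Request("POST", url, json=payload).prepare()

print(as_data.headers.get("Content-Type"))
print(as_json.headers.get("Content-Type"))
print(json.loads(as_json.body) == payload)
```

When the server insists on a JSON content type, `json=payload` saves setting the header by hand.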

- To send a file with POST, which is what uploading an image or a document to a site amounts to, use the files argument.

In [None]:
url = 'http://httpbin.org/post'
# Open the file in binary mode and close it once the upload finishes.
with open('touxiang.png', 'rb') as f:
    r = requests.post(url, files={'file': f})
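A file upload is sent as a `multipart/form-data` body. The sketch below prepares such a request locally, with an in-memory `io.BytesIO` standing in for the real image so that no file or network is needed; the byte content and the `image/png` label are placeholders chosen for this example.

```python
import io
import requests

# In-memory bytes stand in for open('touxiang.png', 'rb'); the content is a placeholder.
fake_image = io.BytesIO(b"not really a PNG")

prepared = requests.Request(
    "POST",
    "http://httpbin.org/post",
    # (filename, file object, content type) gives explicit control over the part.
    files={"file": ("touxiang.png", fake_image, "image/png")},
).prepare()

content_type = prepared.headers["Content-Type"]
print(content_type.split(";")[0])
print(b'filename="touxiang.png"' in prepared.body)
```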

- **To customize headers, pass them through the headers argument.**

In [24]:
import json

url = 'https://api.github.com/some/endpoint'
payload = {'some':'data'}
headers = {'content-type':'application/json'}
 
r = requests.post(url, data=json.dumps(payload), headers=headers)

In [25]:
r.text

'{"message":"Not Found","documentation_url":"https://developer.github.com/v3"}'
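When the same headers are needed on many requests, a `requests.Session` can hold them as defaults. The sketch below merges session-level and request-level headers with `prepare_request` and sends nothing; the `Accept` default is a hypothetical choice for this example, and the endpoint is the same placeholder URL as above.

```python
import json
import requests

session = requests.Session()
# Hypothetical default that every request from this session should carry.
session.headers.update({"Accept": "application/json"})

request = requests.Request(
    "POST",
    "https://api.github.com/some/endpoint",
    data=json.dumps({"some": "data"}),
    headers={"content-type": "application/json"},
)

# prepare_request merges session-level and request-level headers without sending.
prepared = session.prepare_request(request)

print(prepared.headers["Accept"])
print(prepared.headers["Content-Type"])  # header names are case-insensitive
print("User-Agent" in prepared.headers)  # requests adds its own User-Agent
```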

## Advanced usage

In [61]:
# Set a timeout, in seconds.
requests.get('http://github.com', timeout=1)

<Response [200]>
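When the timeout expires, requests raises an exception rather than returning a response, so calls that may be slow are usually wrapped in `try`/`except`. A minimal sketch follows; the `fetch_status` helper, its injectable `get` parameter, and the `always_slow` stand-in are all made up here so the behaviour can be shown without a real slow server.

```python
import requests

def fetch_status(url, timeout=(3.05, 10), get=requests.get):
    """Return the status code, or None if the request timed out.

    `get` is injectable only so the function can be exercised offline.
    """
    try:
        # A tuple sets the connect and read timeouts separately, in seconds.
        return get(url, timeout=timeout).status_code
    except requests.exceptions.Timeout:
        # Covers both ConnectTimeout and ReadTimeout.
        return None

def always_slow(url, timeout):
    """Hypothetical stand-in for a server that never answers in time."""
    raise requests.exceptions.ReadTimeout("simulated timeout for %s" % url)

print(fetch_status("http://github.com", get=always_slow))
```

Without a `timeout` argument, requests waits indefinitely, so setting one is a sensible default for scraping jobs.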

In [62]:
# Send custom cookies with the request.

url = 'http://httpbin.org/cookies'
cookies = {'cookies_are': 'working'}
r = requests.get(url, cookies=cookies)
r.text

'{"cookies":{"cookies_are":"working"}}\n'
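On the wire, the cookies dict becomes a single `Cookie` request header, which httpbin then echoes back. The sketch below prepares the same request locally to show that header without contacting the server.

```python
import requests

prepared = requests.Request(
    "GET",
    "http://httpbin.org/cookies",
    cookies={"cookies_are": "working"},
).prepare()

# The dict is serialized into one Cookie request header.
print(prepared.headers["Cookie"])
```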