# Web Crawler in Python

## Basic concepts of a crawler

### Send a request to the server and receive response data
<img src="images/http.png"/>
- The client sends a request to the starting URL
- The web server returns response data for that URL

### Process response data
<img src="images/html-parser-json.png"/>
- Receive the data from the web server
- Choose a handy library, or write custom code, to parse the received data
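
The two steps above (request a page, then parse the response) can be sketched as a small breadth-first crawl loop. This is only an illustration in Python 3: `FAKE_WEB`, `fetch`, `extract_links` and `crawl` are hypothetical names, the pages live in a dictionary instead of on a real server, and a regular expression stands in for a proper HTML parser.

```python
import re
from collections import deque

# Hypothetical in-memory "web": URL -> HTML body. A real crawler would
# fetch these pages over HTTP instead of reading them from a dict.
FAKE_WEB = {
    "http://example.test/": '<a href="http://example.test/a">A</a> '
                            '<a href="http://example.test/b">B</a>',
    "http://example.test/a": '<a href="http://example.test/b">B</a>',
    "http://example.test/b": '<a href="http://example.test/">home</a>',
}


def fetch(url):
    """Stand-in for an HTTP request: return the page body, or None."""
    return FAKE_WEB.get(url)


def extract_links(html):
    """Crude stand-in for an HTML parser: pull out href="..." values."""
    return re.findall(r'href="([^"]+)"', html)


def crawl(start_url):
    """Visit pages breadth-first, never visiting the same URL twice."""
    visited = []
    queue = deque([start_url])
    while queue:
        url = queue.popleft()
        if url in visited:
            continue
        html = fetch(url)                   # step 1: request and response
        if html is None:
            continue
        visited.append(url)
        queue.extend(extract_links(html))   # step 2: parse, queue new links
    return visited


print(crawl("http://example.test/"))
```

The `visited` list is what keeps the loop from running forever when pages link back to each other.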


## Request and Response

### HTTP Request Method

- HTTP 1.1: Method definitions

> OPTIONS

> GET

> HEAD

> POST

> PUT

> DELETE

> TRACE

> CONNECT

Reference: http://www.w3.org/Protocols/rfc2616/rfc2616-sec9.html



- Two HTTP Request Methods: GET vs. POST
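> Imagine them as a postcard (GET: the contents are visible in the URL) and an envelope (POST: the contents are carried in the request body)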

- GET: Requests data from a specified resource.

> GET requests can be cached

> GET requests remain in the browser history

> GET requests can be bookmarked

> GET requests should never be used when dealing with sensitive data
>> Bad example: http://www.justfortest.com/login.html?username=HaHaHa&password=UCCU

> GET requests have length restrictions

> GET requests should be used only to retrieve data

- POST: Submits data to be processed to a specified resource.

> POST requests are not cached by default

> POST requests do not remain in the browser history

> POST requests cannot be bookmarked

> POST requests have no protocol-level restriction on data length (servers may impose their own limits)
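
As a rough illustration of the difference, the sketch below builds one GET and one POST request object with the Python 3 standard library and inspects where the parameters end up. Nothing is sent over the network; the parameter names are made up for the example.

```python
from urllib.parse import urlencode
from urllib.request import Request

params = {"username": "HaHaHa", "page": "1"}
query = urlencode(params)

# GET: the parameters are part of the URL itself.
get_req = Request("http://httpbin.org/get?" + query)

# POST: the URL stays clean; the parameters travel in the request body.
post_req = Request("http://httpbin.org/post", data=query.encode("ascii"))

for req in (get_req, post_req):
    print(req.get_method(), req.full_url, req.data)
```

Because the GET parameters live in the URL, they end up in browser history, bookmarks and server logs, which is why sensitive data belongs in a POST body.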

### Request - Response: Common Response Data Formats

> HTML

> XML

> JSON

> ...
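
Each format needs its own parser. A minimal Python 3 sketch for JSON and XML, using only the standard library and made-up sample bodies (HTML parsing is covered in the BeautifulSoup4 section):

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical response bodies carrying the same data in two formats.
json_body = '{"title": "PyCon", "year": 2015}'
xml_body = "<event><title>PyCon</title><year>2015</year></event>"

# JSON maps directly onto Python dicts, lists, strings and numbers.
event = json.loads(json_body)
print(event["title"], event["year"])

# XML is parsed into an element tree; all values come back as strings.
root = ET.fromstring(xml_body)
print(root.findtext("title"), int(root.findtext("year")))
```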

---

## Python libraries for web crawling

### Install the libraries with the commands below

In [1]:
!pip install requests
!pip install beautifulsoup4



### Requests: HTTP for Humans 
- Requests is an Apache2 Licensed HTTP library
- Python’s built-in urllib2 module (urllib.request in Python 3) provides most of the HTTP capabilities you need, but Requests wraps them in a much simpler API
- Documentation: http://docs.python-requests.org/en/latest/

### Import Requests and List Its Members

In [2]:
import requests

dir(requests)

['ConnectionError',
 'HTTPError',
 'NullHandler',
 'PreparedRequest',
 'Request',
 'RequestException',
 'Response',
 'Session',
 'Timeout',
 'TooManyRedirects',
 'URLRequired',
 '__author__',
 '__build__',
 '__builtins__',
 '__copyright__',
 '__doc__',
 '__file__',
 '__license__',
 '__name__',
 '__package__',
 '__path__',
 '__title__',
 '__version__',
 'adapters',
 'api',
 'auth',
 'certs',
 'codes',
 'compat',
 'cookies',
 'delete',
 'exceptions',
 'get',
 'head',
 'hooks',
 'logging',
 'models',
 'options',
 'packages',
 'patch',
 'post',
 'put',
 'request',
 'session',
 'sessions',
 'status_codes',
 'structures',
 'utils']

---

### Available HTTP Request Methods

In [3]:
response = requests.options("http://httpbin.org/get")
response = requests.get("http://httpbin.org/get")
response = requests.head("http://httpbin.org/get")
response = requests.post("http://httpbin.org/post")
response = requests.put("http://httpbin.org/put")
response = requests.delete("http://httpbin.org/delete")

---

### Make A Request

- Make an HTTP GET request

In [4]:
response = requests.get('https://api.github.com/events')
print(response)

<Response [200]>




- Make an HTTP POST request

In [5]:
response = requests.post("http://httpbin.org/post", data={"key": "value"})
print(response)

<Response [200]>


---

### Passing Parameters In URLs

#### Passing parameters: the basic case

In [6]:
payload = {'key1': 'value1', 'key2': 'value2'}
response = requests.get("http://httpbin.org/get", params=payload)
print(response.url)

http://httpbin.org/get?key2=value2&key1=value1


#### Any dictionary key whose value is None will not be added to the URL’s query string.

In [7]:
payload = {'key1': 'value1', 'key2': None}
response = requests.get("http://httpbin.org/get", params=payload)
print(response.url)

http://httpbin.org/get?key1=value1


#### Pass a list of items as a value

In [8]:
payload = {'key1': 'value1', 'key2': ['value2', 'value3']}
response = requests.get("http://httpbin.org/get", params=payload)
print(response.url)

http://httpbin.org/get?key2=value2&key2=value3&key1=value1
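
To see roughly what happens to `params`, the three cases above can be approximated with the standard library. This is a Python 3 sketch, not Requests' actual implementation; `build_query` is a hypothetical helper.

```python
from urllib.parse import urlencode


def build_query(params):
    """Drop None values, then encode; list values become repeated keys."""
    cleaned = {k: v for k, v in params.items() if v is not None}
    return urlencode(cleaned, doseq=True)


print(build_query({"key1": "value1", "key2": "value2"}))
print(build_query({"key1": "value1", "key2": None}))
print(build_query({"key1": "value1", "key2": ["value2", "value3"]}))
```

The order of the keys can differ from the outputs shown above, which came from a Python 2 dictionary; the order of query parameters does not change their meaning.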


#### Custom Headers

In [9]:
url = 'https://api.github.com/some/endpoint'
headers = {'user-agent': 'my-app/0.0.1'}
response = requests.get(url, headers=headers)



#### Passing parameters in a POST request

In [10]:
payload = {'key1': 'value1', 'key2': 'value2'}
r = requests.post("http://httpbin.org/post", data=payload)
print(r.text)

{
  "args": {}, 
  "data": "", 
  "files": {}, 
  "form": {
    "key1": "value1", 
    "key2": "value2"
  }, 
  "headers": {
    "Accept": "*/*", 
    "Accept-Encoding": "gzip, deflate", 
    "Content-Length": "23", 
    "Content-Type": "application/x-www-form-urlencoded", 
    "Host": "httpbin.org", 
    "User-Agent": "python-requests/2.8.1"
  }, 
  "json": null, 
  "origin": "203.74.156.134", 
  "url": "http://httpbin.org/post"
}
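
The `form` and `Content-Length` fields in that response follow directly from how the dictionary is encoded. A small Python 3 sketch with the standard library, reusing the same payload:

```python
from urllib.parse import parse_qs, urlencode

payload = {"key1": "value1", "key2": "value2"}

# The body sent with Content-Type: application/x-www-form-urlencoded
body = urlencode(payload).encode("ascii")
print(body)
print(len(body))

# A server decodes the body back into form fields (values come back as lists).
form = parse_qs(body.decode("ascii"))
print(form)
```

The body is 23 bytes long, which matches the `Content-Length` header reported by httpbin above.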



#### POST a Multipart-Encoded File

In [11]:
url = 'http://httpbin.org/post'

# Open the file in binary mode; the with-block closes it after the upload
with open('doc/test.txt', 'rb') as f:
    files = {'filename': f}
    r = requests.post(url, files=files)
print(r.text)

{
  "args": {}, 
  "data": "", 
  "files": {
    "filename": "Hello!\nHour of code.\n\n"
  }, 
  "form": {}, 
  "headers": {
    "Accept": "*/*", 
    "Accept-Encoding": "gzip, deflate", 
    "Content-Length": "170", 
    "Content-Type": "multipart/form-data; boundary=0b4dcf1c65174c37aec315aa7d1aad63", 
    "Host": "httpbin.org", 
    "User-Agent": "python-requests/2.8.1"
  }, 
  "json": null, 
  "origin": "203.74.156.134", 
  "url": "http://httpbin.org/post"
}



#### You can also set the filename, content_type and headers explicitly

In [12]:
url = 'http://httpbin.org/post'

# Tuple layout: (filename, file object, content type, extra part headers)
with open('doc/test.txt', 'rb') as f:
    files = {'filename': ('report.xls', f, 'text/plain', {'Expires': '0'})}
    r = requests.post(url, files=files)
print(r.text)

{
  "args": {}, 
  "data": "", 
  "files": {
    "filename": "Hello!\nHour of code.\n\n"
  }, 
  "form": {}, 
  "headers": {
    "Accept": "*/*", 
    "Accept-Encoding": "gzip, deflate", 
    "Content-Length": "210", 
    "Content-Type": "multipart/form-data; boundary=f42dabbd452f4a9bac90cdf3c3f6320d", 
    "Host": "httpbin.org", 
    "User-Agent": "python-requests/2.8.1"
  }, 
  "json": null, 
  "origin": "203.74.156.134", 
  "url": "http://httpbin.org/post"
}



#### Send strings to be received as files

In [13]:
url = 'http://httpbin.org/post'
files = {'filename': ('test.doc', 'some,data,to,send\nanother,row,to,send\n')}

r = requests.post(url, files=files)
print(r.text)

{
  "args": {}, 
  "data": "", 
  "files": {
    "filename": "some,data,to,send\nanother,row,to,send\n"
  }, 
  "form": {}, 
  "headers": {
    "Accept": "*/*", 
    "Accept-Encoding": "gzip, deflate", 
    "Content-Length": "186", 
    "Content-Type": "multipart/form-data; boundary=b07fcbd6b502421592157536c56f4e53", 
    "Host": "httpbin.org", 
    "User-Agent": "python-requests/2.8.1"
  }, 
  "json": null, 
  "origin": "203.74.156.134", 
  "url": "http://httpbin.org/post"
}
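
For a feel of what a multipart body looks like on the wire, here is a hand-rolled Python 3 sketch for a single file part. `encode_multipart` is a hypothetical helper written for illustration; it covers only the simplest case and is not how Requests builds the body internally.

```python
def encode_multipart(field, filename, data, boundary):
    """Build a single-file multipart/form-data body (illustrative only)."""
    lines = [
        b"--" + boundary,                                # opening delimiter
        b'Content-Disposition: form-data; name="' + field
        + b'"; filename="' + filename + b'"',            # part header
        b"",                                             # blank line ends headers
        data,                                            # the file content
        b"--" + boundary + b"--",                        # closing delimiter
        b"",
    ]
    return b"\r\n".join(lines)


# Boundary copied from the Content-Type header in the output above.
boundary = b"b07fcbd6b502421592157536c56f4e53"
body = encode_multipart(
    b"filename",
    b"test.doc",
    b"some,data,to,send\nanother,row,to,send\n",
    boundary,
)
print(body.decode("ascii"))
print(len(body))
```

With that 32-character boundary the body comes to 186 bytes, the same as the `Content-Length` reported for the string upload above.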



---

### Response Attributes and Methods

In [14]:
response = requests.get('https://api.github.com/events')
dir(response)



['__attrs__',
 '__bool__',
 '__class__',
 '__delattr__',
 '__dict__',
 '__doc__',
 '__format__',
 '__getattribute__',
 '__getstate__',
 '__hash__',
 '__init__',
 '__iter__',
 '__module__',
 '__new__',
 '__nonzero__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__setstate__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 '_content',
 '_content_consumed',
 'apparent_encoding',
 'close',
 'connection',
 'content',
 'cookies',
 'elapsed',
 'encoding',
 'headers',
 'history',
 'is_permanent_redirect',
 'is_redirect',
 'iter_content',
 'iter_lines',
 'json',
 'links',
 'ok',
 'raise_for_status',
 'raw',
 'reason',
 'request',
 'status_code',
 'text',
 'url']

#### Content of the server’s response

In [15]:
response.text
type(response.text)

unicode

#### Binary response content

In [16]:
response.content
type(response.content)

str
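
The outputs above come from Python 2, where text is `unicode` and raw bytes are `str`. Under Python 3 the same attributes are `str` (`response.text`) and `bytes` (`response.content`). The relationship between the two can be sketched without any network call; the sample string is made up.

```python
# What arrives on the wire is bytes; text is those bytes decoded.
raw = "caf\u00e9 \u2713".encode("utf-8")   # plays the role of response.content
text = raw.decode("utf-8")                  # plays the role of response.text

print(type(raw).__name__, len(raw))
print(type(text).__name__, len(text))
```

The lengths differ because non-ASCII characters take more than one byte in UTF-8. Use the binary content for images and other non-text downloads.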

#### JSON Response Content

In [19]:
try:
    content  = response.json()
except ValueError as e:
    print(e)  # No JSON object could be decoded
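
The same try/except pattern can be tried offline with the standard `json` module. In this Python 3 sketch, `parse_json` is a hypothetical helper and the sample bodies are invented.

```python
import json


def parse_json(body):
    """Return the decoded object, or None if body is not valid JSON."""
    try:
        return json.loads(body)
    except ValueError as e:  # json.JSONDecodeError is a subclass of ValueError
        print("not JSON:", e)
        return None


print(parse_json('[{"type": "PushEvent"}]'))
print(parse_json("<html>not json</html>"))
```

Catching `ValueError` matters when crawling, because a server can answer with an HTML error page even when JSON was expected.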

#### Response Status Codes

In [23]:
r = requests.get('http://httpbin.org/get')
print(r.status_code)

# Requests also comes with a built-in status code lookup object for easy reference:
print(r.status_code == requests.codes.ok)

200
True


In [24]:
bad_r = requests.get('http://httpbin.org/status/404')
bad_r.status_code

# raise_for_status() raises HTTPError for 4xx and 5xx responses
bad_r.raise_for_status()

HTTPError: 404 Client Error: NOT FOUND for url: http://httpbin.org/status/404

In [27]:
# Since the status_code for r was 200, raise_for_status() returns None:
print(r.raise_for_status())

None
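
The behaviour shown above (nothing happens for 200, an exception for 404) can be imitated in a few lines. This is a Python 3 sketch of the idea only: `StatusError` and `check_status` are hypothetical names, not part of Requests.

```python
from http import HTTPStatus


class StatusError(Exception):
    """Hypothetical stand-in for the HTTPError raised by Requests."""


def check_status(code):
    """Raise for 4xx/5xx status codes, return None otherwise."""
    if 400 <= code < 600:
        kind = "Client" if code < 500 else "Server"
        phrase = HTTPStatus(code).phrase
        raise StatusError("%d %s Error: %s" % (code, kind, phrase))
    return None


print(check_status(200))
try:
    check_status(404)
except StatusError as e:
    print(e)
```

In a crawler, a check like this right after each request keeps error pages from being parsed as if they were real content.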


#### Response Headers

In [28]:
print(r.headers)

{'Content-Length': '238', 'Server': 'nginx', 'Connection': 'keep-alive', 'Access-Control-Allow-Credentials': 'true', 'Date': 'Thu, 03 Dec 2015 09:54:46 GMT', 'Access-Control-Allow-Origin': '*', 'Content-Type': 'application/json'}
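
HTTP header names are case-insensitive, and Requests exposes `r.headers` as a case-insensitive mapping. The standard library's `email.message.Message` behaves similarly, so the idea can be tried offline. A Python 3 sketch using two of the header values from the output above:

```python
from email.message import Message

headers = Message()
headers["Content-Type"] = "application/json"
headers["Content-Length"] = "238"

# Lookups work regardless of how the header name is capitalised.
print(headers["content-type"])
print(int(headers["CONTENT-LENGTH"]))
print(headers["X-Missing"])   # absent headers give None instead of raising
```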


---

- DEMO: Get all hyperlinks on the PyCon Taiwan front page
<img src='images/pycon.png'/>


### BeautifulSoup4: Sits Atop an HTML or XML Parser

- A Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree. It commonly saves programmers hours or days of work.

In [None]:
from bs4 import BeautifulSoup

# `response` must hold the HTML page to parse, e.g. the front page fetched with requests.get()
soup = BeautifulSoup(response.text, 'html.parser')

In [None]:
type(soup)

In [None]:
dir(soup)

In [None]:
soup.title

In [None]:
soup.title.name

In [None]:
soup.title.string

In [None]:
soup.find_all('a')

In [None]:
for link in soup.find_all('a'):
    print(link.get('href'))

In [None]:
soup.get_text()
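
The `href` values collected this way are often relative (for example `/schedule`), so a crawler usually resolves them against the page URL before following them. A dependency-free Python 3 sketch of both steps, using the standard library's `html.parser` in place of BeautifulSoup; the HTML snippet, the base URL and the `LinkParser` class are made up for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkParser(HTMLParser):
    """Collect the href of every <a> tag as an absolute URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:  # skip anchors without an href attribute
                self.links.append(urljoin(self.base_url, href))


html = (
    '<html><body>'
    '<a href="/schedule">Schedule</a> '
    '<a href="https://example.test/sponsors">Sponsors</a> '
    '<a>no href</a>'
    '</body></html>'
)

parser = LinkParser("https://example.test/2015/")
parser.feed(html)
print(parser.links)
```

With BeautifulSoup the same resolution step would be `urljoin(page_url, link.get('href'))` inside the loop shown earlier, skipping links whose `href` is `None`.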