# Basic concepts of a crawler

## Send a request to the server and receive response data
<img src="images/http.png"/>
- The client sends a request to a starting URL
- The web server returns response data for that URL

## Process the response data
<img src="images/html-parser-json.png"/>
- Receive the data from the web server
- Choose an off-the-shelf library, or write a custom parser, to parse the received data
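
As a minimal sketch of the custom-parser option above, the Python 3 standard library's `html.parser.HTMLParser` can be subclassed to pull hyperlinks out of received HTML. The class name `LinkCollector` and the sample markup are assumptions made for illustration; a real crawler would feed it the text of an actual response.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


# Sample markup standing in for data received from a web server
html = '<html><body><a href="/about">About</a> <a href="http://example.com/">Ex</a></body></html>'

collector = LinkCollector()
collector.feed(html)
print(collector.links)
```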


---

---

---

# Request and Response

## HTTP Request Methods

### HTTP 1.1: Method definitions

> OPTIONS

> GET

> HEAD

> POST

> PUT

> DELETE

> TRACE

> CONNECT

Reference: http://www.w3.org/Protocols/rfc2616/rfc2616-sec9.html


### Two HTTP Request Methods: GET vs. POST
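> Think of GET as a postcard (the data travels in the open, in the URL) and POST as an envelope (the data travels inside the request body)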

### GET: Requests data from a specified resource.

> GET requests can be cached

> GET requests remain in the browser history

> GET requests can be bookmarked

> GET requests should never be used when dealing with sensitive data
>> Bad example: http://www.justfortest.com/login.html?username=HaHaHa&password=UCCU

> GET requests have length restrictions (browsers and servers cap the URL length)

> GET requests should be used only to retrieve data

### POST: Submits data to be processed to a specified resource.

> POST requests are not cached by default

> POST requests do not remain in the browser history

> POST requests cannot be bookmarked

> POST requests have no protocol-level restriction on data length (servers may still impose one)
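
The difference between the two methods can be sketched without sending anything over the network. This example (Python 3, standard library only) builds one request of each kind; the field names and values are placeholders chosen for illustration.

```python
from urllib.parse import urlencode
from urllib.request import Request

fields = {"username": "alice", "note": "hello world"}
encoded = urlencode(fields)  # form-style encoding of the fields

# GET: the data becomes part of the URL (the "postcard")
get_request = Request("http://httpbin.org/get?" + encoded)

# POST: the same data travels in the request body (the "envelope")
post_request = Request("http://httpbin.org/post", data=encoded.encode("ascii"))

print(get_request.get_method(), get_request.full_url)
print(post_request.get_method(), post_request.full_url, post_request.data)
```

Because the GET data is part of the URL, it ends up wherever URLs end up: history, bookmarks, and logs. The POST URL stays the same no matter what is submitted.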

## Request - Response

> HTML

> XML

> JSON

> ...
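
A crawler usually has to decide how to decode a body based on the format the server says it is sending. The sketch below assumes the format is taken from the `Content-Type` header; `decode_body` is a hypothetical helper, and the sample bodies are made up for illustration (Python 3, standard library only).

```python
import json
import xml.etree.ElementTree as ET


def decode_body(content_type, body):
    """Decode a response body according to its Content-Type (hypothetical helper)."""
    if "json" in content_type:
        return json.loads(body)
    if "xml" in content_type:
        return ET.fromstring(body)
    # HTML and anything else is returned as plain text for a later parsing step
    return body


data = decode_body("application/json", '{"name": "crawler"}')
root = decode_body("application/xml", "<item><name>crawler</name></item>")
page = decode_body("text/html", "<p>crawler</p>")

print(data["name"], root.find("name").text, page)
```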

---

---

---

# Python libraries for web crawlers

## Requests: HTTP for Humans 
- Requests is an Apache2 Licensed HTTP library
- Powered by urllib3 (bundled inside older Requests releases, installed as a regular dependency by newer ones)
- Document: http://docs.python-requests.org/en/latest/

## Install the library with the command below:

In [None]:
!pip install requests

## Import Requests and List Its Members

In [None]:
import requests

dir(requests)

## Available HTTP Request Methods

In [None]:
response = requests.options("http://httpbin.org/get")
response = requests.get("http://httpbin.org/get")
response = requests.head("http://httpbin.org/get")
response = requests.post("http://httpbin.org/post")
response = requests.put("http://httpbin.org/put")
response = requests.delete("http://httpbin.org/delete")

---

## Make A Request

- Make an HTTP GET request

In [None]:
response = requests.get('https://api.github.com/events')
print(response)

- Make an HTTP POST request

In [None]:
response = requests.post("http://httpbin.org/post", data={"key": "value"})
print(response)

---

## Passing Parameters In URLs

### The basic case: pass a dictionary of parameters

In [None]:
payload = {'key1': 'value1', 'key2': 'value2'}
response = requests.get("http://httpbin.org/get", params=payload)
print(response.url)

### Any dictionary key whose value is None will not be added to the URL’s query string.

In [None]:
payload = {'key1': 'value1', 'key2': None}
response = requests.get("http://httpbin.org/get", params=payload)
print(response.url)

### Pass a list of items as a value

In [None]:
payload = {'key1': 'value1', 'key2': ['value2', 'value3']}
response = requests.get("http://httpbin.org/get", params=payload)
print(response.url)
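
The three cases above can also be checked offline. As a minimal sketch, a `requests.Request` can be prepared without being sent, and the prepared request exposes the final URL. `build_url` is a hypothetical helper introduced only for this illustration.

```python
import requests

url = "http://httpbin.org/get"


def build_url(params):
    # Prepare the request without sending it, then read the final URL
    return requests.Request("GET", url, params=params).prepare().url


print(build_url({"key1": "value1", "key2": "value2"}))
print(build_url({"key1": "value1", "key2": None}))
print(build_url({"key1": "value1", "key2": ["value2", "value3"]}))
```

The `None` value is dropped from the query string, and the list is expanded into one `key2=...` pair per item.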

### Custom Headers

In [None]:
url = 'https://api.github.com/some/endpoint'
headers = {'user-agent': 'my-app/0.0.1'}
response = requests.get(url, headers=headers)

### Passing parameters in a POST request

In [None]:
payload = {'key1': 'value1', 'key2': 'value2'}
r = requests.post("http://httpbin.org/post", data=payload)
print(r.text)

### POST a Multipart-Encoded File

In [None]:
url = 'http://httpbin.org/post'

# Open the file in binary mode; the with block closes it after the upload
with open('doc/test.txt', 'rb') as f:
    files = {'filename': f}
    r = requests.post(url, files=files)
print(r.text)

- You can also set the filename, content type, and headers explicitly

In [None]:
url = 'http://httpbin.org/post'

# The tuple is (filename, file object, content type, extra headers)
with open('doc/test.txt', 'rb') as f:
    files = {'filename': ('report.xls', f, 'text/plain', {'Expires': '0'})}
    r = requests.post(url, files=files)
print(r.text)

- Send strings to be received as files

In [None]:
url = 'http://httpbin.org/post'
files = {'filename': ('test.doc', 'some,data,to,send\nanother,row,to,send\n')}

r = requests.post(url, files=files)
print(r.text)

### Send your own cookies to the server

In [None]:
url = 'http://httpbin.org/cookies'
cookies = dict(cookies_are='working')

response = requests.get(url, cookies=cookies)
print(response.text)

---

## Response Methods

In [None]:
response = requests.get('https://api.github.com/events')
dir(response)

### Content of the server’s response

In [None]:
response.text
type(response.text)

### Binary response content

In [None]:
response.content
type(response.content)

### JSON Response Content

In [None]:
try:
    content = response.json()
except ValueError as e:
    print(e)  # Raised when the body cannot be decoded as JSON
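
When a crawler hits many pages, it is often handy to wrap this pattern in a small function. The sketch below relies only on the behaviour shown above (a `ValueError` when the body is not JSON). `json_or_none` is a hypothetical helper, and `FakeResponse` is a stand-in class so the example runs without a network connection.

```python
import json


def json_or_none(response):
    """Return the decoded JSON body, or None if the body is not valid JSON."""
    try:
        return response.json()
    except ValueError:
        return None


class FakeResponse:
    """Stand-in for a Requests response so the sketch runs offline."""

    def __init__(self, text):
        self.text = text

    def json(self):
        return json.loads(self.text)


print(json_or_none(FakeResponse('[{"id": 1}]')))
print(json_or_none(FakeResponse("<html>not json</html>")))
```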

### Response Status Codes

In [None]:
response = requests.get('http://httpbin.org/get')
print('response.status_code: {}'.format(response.status_code))

# Requests also comes with a built-in status code lookup object for easy reference:
print(response.status_code == requests.codes.ok)

In [None]:
bad_response = requests.get('http://httpbin.org/status/404')
print('bad_response.status_code: {}'.format(bad_response.status_code))

# For a bad response (a 4XX client error or 5XX server error), Response.raise_for_status() raises an HTTPError:
try:
    bad_response.raise_for_status()
except requests.exceptions.HTTPError as e:
    print(e)

In [None]:
# Since the status_code of response was 200, raise_for_status() raises nothing and returns None:
print('response.status_code: {}'.format(response.status_code))
print(response.raise_for_status())
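
The two outcomes can be folded into one helper. This is a minimal sketch: `describe` is a hypothetical function, and the `Response` objects are built by hand purely so the example runs offline. In real code they would come from `requests.get()` or a similar call.

```python
import requests


def describe(response):
    """Summarise a response's status (hypothetical helper)."""
    try:
        response.raise_for_status()
    except requests.exceptions.HTTPError as error:
        return "failed: {}".format(error)
    return "ok: {}".format(response.status_code)


# Responses built by hand purely for offline illustration
good = requests.Response()
good.status_code = requests.codes.ok

bad = requests.Response()
bad.status_code = requests.codes.not_found

print(describe(good))
print(describe(bad))
```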

### Response Headers

In [None]:
print(r.headers)

---

---

---

## BeautifulSoup4: Sits Atop an HTML or XML Parser

- A Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree. It commonly saves programmers hours or days of work.

- It also supports a number of third-party Python parsers, one of which is lxml.

## Install the libraries with the command below:

In [1]:
!pip install beautifulsoup4 lxml

Collecting lxml
  Using cached lxml-3.5.0.tar.gz
Building wheels for collected packages: lxml
  Running setup.py bdist_wheel for lxml


## Import BeautifulSoup4 and List Its Members

In [None]:
from bs4 import BeautifulSoup

dir(BeautifulSoup)

## Typical Usage

- Python's html.parser
#### soup = BeautifulSoup(html_markup, 'html.parser')

- lxml's HTML parser
#### soup = BeautifulSoup(html_markup, "lxml")

- lxml's XML parser
#### soup = BeautifulSoup(xml_markup, "lxml-xml")
#### soup = BeautifulSoup(xml_markup, "xml")
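
As a minimal sketch of the first form, the example below parses an inline HTML string with Python's built-in `html.parser`, so it runs without lxml installed. The markup is made up for illustration.

```python
from bs4 import BeautifulSoup

html_markup = """
<html>
  <head><title>Sample page</title></head>
  <body>
    <a href="/first">First</a>
    <a href="http://example.com/second">Second</a>
  </body>
</html>
"""

# 'html.parser' ships with Python, so no extra install is needed
soup = BeautifulSoup(html_markup, "html.parser")

print(soup.title.string)
print([link.get("href") for link in soup.find_all("a")])
```

Swapping `"html.parser"` for `"lxml"` selects lxml's HTML parser, provided lxml is installed.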

### Parsing XML

- lxml is currently the only supported XML parser

In [3]:
type(soup)

bs4.BeautifulSoup

In [None]:
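from bs4 import BeautifulSoup
import lxml
import requests

response = requests.get("https://tw.pycon.org/2016")

# Build the parse tree from the page's HTML using lxml's HTML parser
soup = BeautifulSoup(response.text, "lxml")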

### DEMO: Get all hyperlinks on the PyCon Taiwan front page
<img src='images/pycon.png'/>

In [4]:
dir(soup)

['ASCII_SPACES',
 'DEFAULT_BUILDER_FEATURES',
 'HTML_FORMATTERS',
 'ROOT_TAG_NAME',
 'XML_FORMATTERS',
 '__call__',
 '__class__',
 '__contains__',
 '__copy__',
 '__delattr__',
 '__delitem__',
 '__dict__',
 '__doc__',
 '__eq__',
 '__format__',
 '__getattr__',
 '__getattribute__',
 '__getitem__',
 '__getstate__',
 '__hash__',
 '__init__',
 '__iter__',
 '__len__',
 '__module__',
 '__ne__',
 '__new__',
 '__nonzero__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__setitem__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__unicode__',
 '__weakref__',
 '_all_strings',
 '_attr_value_as_string',
 '_attribute_checker',
 '_feed',
 '_find_all',
 '_find_one',
 '_formatter_for_name',
 '_is_xml',
 '_lastRecursiveChild',
 '_last_descendant',
 '_most_recent_element',
 '_popToTag',
 '_select_debug',
 '_selector_combinators',
 '_should_pretty_print',
 '_tag_name_matches_and',
 'append',
 'attribselect_re',
 'attrs',
 'builder',
 'can_be_empty_element',
 'childGenerator',
 'childr

In [5]:
soup.title

<title>PyCon Taiwan 2016</title>

In [6]:
soup.title.name

'title'

In [7]:
soup.title.string

u'PyCon Taiwan 2016'

In [8]:
soup.find_all('a')

[<a href="/2016/">\n<img alt="logo" class="logo hidden-xs" src="/2016/static/images/logo-tiny--reverse.png"/>\n</a>,
 <a class="social-btn" href="https://www.facebook.com/pycontw" target="_blank">\n<i class="fa fa-facebook fa-lg"></i>\n</a>,
 <a class="social-btn" href="https://twitter.com/pycontw" target="_blank">\n<i class="fa fa-twitter fa-lg"></i>\n</a>,
 <a href="https://docs.google.com/forms/d/1JoHhAj6NeXg98OFAAvAHQ6Dcrh9Dxp2PttHKwxNbiFs/viewform" target="_blank">our volunteer form</a>,
 <a href="http://www.meetup.com/Taipei-py/" target="_blank"><div class="pull-left">Taipei.py\xa0-\xa0</div><div>Taipei\xa0Python\xa0User\xa0Group</div></a>,
 <a href="http://www.meetup.com/pythonhug/" target="_blank"><div class="pull-left">PyHUG\xa0-\xa0</div><div>Python\xa0Hsinchu\xa0User\xa0Group</div></a>,
 <a href="http://www.meetup.com/PyLadiesTW/" target="_blank">PyLadies\xa0Taiwan</a>,
 <a href="http://www.meetup.com/Tainan-py-Python-Tainan-User-Group/" target="_blank"><div class="pull-left

In [9]:
for link in soup.find_all('a'):
    print(link.get('href'))

/2016/
https://www.facebook.com/pycontw
https://twitter.com/pycontw
https://docs.google.com/forms/d/1JoHhAj6NeXg98OFAAvAHQ6Dcrh9Dxp2PttHKwxNbiFs/viewform
http://www.meetup.com/Taipei-py/
http://www.meetup.com/pythonhug/
http://www.meetup.com/PyLadiesTW/
http://www.meetup.com/Tainan-py-Python-Tainan-User-Group/
http://www.meetup.com/Kaohsiung-Python-Meetup
http://www.meetup.com/Taichung-Python-Meetup/
http://djangogirls.org/taipei
http://www.meetup.com/Hualien-Py/
mailto:sponsorship@pycon.tw
https://tw.pycon.org/2015apac/en/sponsors/
http://aktsk.com.tw/
http://ocf.tw
http://www.wolftea.com
http://www.gliacloud.com
https://www.facebook.com/pycontw
https://twitter.com/pycontw
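
The list above mixes relative paths, `mailto:` addresses, and repeated URLs, so a crawler normally cleans it up before following anything. A minimal sketch using only the Python 3 standard library, assuming the page was fetched from `https://tw.pycon.org/2016` and using a few of the href values printed above:

```python
from urllib.parse import urljoin, urlparse

base_url = "https://tw.pycon.org/2016"

# A few href values like those printed above
hrefs = [
    "/2016/",
    "https://www.facebook.com/pycontw",
    "mailto:sponsorship@pycon.tw",
    "https://www.facebook.com/pycontw",
]

seen = set()
crawlable = []
for href in hrefs:
    absolute = urljoin(base_url, href)  # resolves relative paths against the page URL
    if urlparse(absolute).scheme not in ("http", "https"):
        continue  # skip mailto: and other non-web links
    if absolute not in seen:
        seen.add(absolute)
        crawlable.append(absolute)

print(crawlable)
```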


In [None]:
soup.get_text()