# Data Analysis with Python

> Web Scraping with Python: Transferring Data

Kuo, Yao-Jen <yaojenkuo@datainpoint.com> from [DATAINPOINT](https://www.datainpoint.com/)

## About web scraping

## What is web scraping?

> Web scraping is extracting data directly from the World Wide Web using the Hypertext Transfer Protocol (HTTP). While web scraping can be done manually by a human (via a browser), the term typically refers to automated processes implemented using a bot or web crawler. It is a form of copying, in which specific data is gathered and copied from the web, typically into a central local database or spreadsheet, for later retrieval or analysis.

Source: <https://en.wikipedia.org/wiki/Web_scraping>

## Core missions involved in web scraping

- Transferring data
- Parsing data

## What is transferring data?

Data is transferred over the network protocol that sits between a browser (or a web scraping script) and a server, and that enables the exchange of hypermedia documents on the Web. The technical name for this protocol is the **HyperText Transfer Protocol (HTTP)**.

## Two directions involved in transferring data

1. To a server: requesting data
2. From a server: responding with data

## Types of HTTP request methods to a server involved in web scraping

- GET method
- POST method

## What is a GET method?

> The GET method requests a representation of the specified resource, e.g. viewing an Instagram photo.

## What is a POST method?

> The POST method is used to submit an entity to the specified resource, often causing a change in state or side effects on the server, e.g. sharing a photo on Instagram.
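The difference between the two methods can be sketched with the standard library alone. This is a minimal local demo, not a real service: the `PhotoHandler` class, the `photos` list, and the `/photos` path are all made up for illustration.

```python
import http.client
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

photos = ["sunset.jpg"]  # hypothetical server-side state


class PhotoHandler(BaseHTTPRequestHandler):
    def _send_json(self, payload):
        body = json.dumps(payload).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def do_GET(self):
        # GET only reads the resource; server state is untouched.
        self._send_json(photos)

    def do_POST(self):
        # POST submits an entity, which changes the server state.
        length = int(self.headers.get("Content-Length", 0))
        photos.append(self.rfile.read(length).decode("utf-8"))
        self._send_json(photos)

    def log_message(self, *args):
        pass  # keep the demo output quiet


server = HTTPServer(("127.0.0.1", 0), PhotoHandler)  # port 0: any free port
threading.Thread(target=server.serve_forever, daemon=True).start()


def send(method, body=None):
    """Send one request to the local demo server and decode the JSON reply."""
    conn = http.client.HTTPConnection("127.0.0.1", server.server_port)
    conn.request(method, "/photos", body=body)
    payload = json.loads(conn.getresponse().read())
    conn.close()
    return payload


after_get = send("GET")
after_post = send("POST", body=b"beach.jpg")
server.shutdown()
server.server_close()

print(after_get)
print(after_post)
```

The GET leaves the list as it was, while the POST adds an item to it on the server.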

## Transferring data between a browser and a server looks simple and straightforward

- We enter a Uniform Resource Locator (URL).
- We fill in a form and submit it.
- We interact with various user-interface components.
- And so on.

## Developer tools

## We can actually see the process of transferring data via the developer tools of a browser

> A set of web developer tools built directly into the browser. It can help us edit pages on-the-fly and diagnose problems quickly, which ultimately helps us build better websites, faster.

![Imgur](https://i.imgur.com/3Synk8m.png?1)

## We use the Network panel to see the details of data transfer

1. Open Developer Tools.
2. Click Network.
3. Refresh the page.

![Imgur](https://i.imgur.com/OG0Huwj.png?1)

## Each item (file) in the list is a complete request and response cycle

- Request
    - Headers
    - Method
- Response
    - Headers
    - Body
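To make this structure concrete, here is a small sketch that splits a hand-written, simplified HTTP exchange into its parts. The host, header values, and body are invented for illustration, and `split_message` is a hypothetical helper, not part of any library.

```python
raw_request = (
    "GET /search?q=macbook HTTP/1.1\r\n"
    "Host: example.com\r\n"
    "User-Agent: demo-script\r\n"
    "\r\n"
)
raw_response = (
    "HTTP/1.1 200 OK\r\n"
    "Content-Type: application/json\r\n"
    "\r\n"
    '{"totalRows": 0}'
)


def split_message(raw):
    """Split a raw HTTP message into start line, headers dict, and body."""
    head, _, body = raw.partition("\r\n\r\n")  # a blank line ends the headers
    start_line, *header_lines = head.split("\r\n")
    headers = dict(line.split(": ", 1) for line in header_lines)
    return start_line, headers, body


request_line, request_headers, _ = split_message(raw_request)
method = request_line.split(" ")[0]  # the request method comes first
status_line, response_headers, response_body = split_message(raw_response)

print(method, request_headers)
print(status_line, response_headers, response_body)
```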

## A lot of data is transferred while a web page renders

- **XHR (XMLHttpRequest)**
- JS
- CSS
- Img
- Media
- Font
- **Doc (HTML documents)**
- WS
- Manifest
- Other

## We can turn off the browser's JavaScript to find out where the data comes from

- If data disappears, check **XHR**.
- If data still exists, check **Doc**.
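The same check can be approximated in a script: if the text we want is already in the raw HTML document, it comes from **Doc**; otherwise it is probably loaded later by JavaScript through **XHR**. A rough sketch, where `where_to_look` and both HTML snippets are made up for illustration:

```python
def where_to_look(raw_html, expected_text):
    """Guess which Network filter holds the data we want."""
    if expected_text in raw_html:
        return "Doc"  # the data is part of the HTML document itself
    return "XHR"      # the data is likely fetched by JavaScript afterwards


# Hypothetical page whose content is rendered on the server.
server_rendered = "<html><body><h1>Movie title</h1></body></html>"
# Hypothetical page that is an empty shell filled in by JavaScript.
client_rendered = (
    "<html><body><div id='app'></div>"
    "<script src='app.js'></script></body></html>"
)

print(where_to_look(server_rendered, "Movie title"))
print(where_to_look(client_rendered, "Movie title"))
```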

## A Chrome browser plug-in to turn JavaScript off

<https://chrome.google.com/webstore/detail/quick-javascript-switcher/geddoclleiomckbhadiaipdggiiccfje>

## Let's see how it works

- <https://ecshweb.pchome.com.tw/search/v3.3/>: Data disappears, check **XHR**.
- <https://emap.pcsc.com.tw/>: Data disappears, check **XHR**.
- <https://www.imdb.com/>: Data still exists, check **Doc**.

## Once we've found what we need, check its details

- Headers
    - General
    - Response Headers
    - Request Headers
    - Query String Parameters (if any)
    - Form Data (if any)
- Preview (response body rendered in the browser)
- Response (body)
- Cookies (if any)

![Imgur](https://i.imgur.com/cTva78r.png?1)

![Imgur](https://i.imgur.com/LMVp0m7.png?1)
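The Query String Parameters shown in the developer tools are simply the part of the request URL after the `?`. The sketch below uses only the standard library to split a URL into its base and its parameters. The URL is assumed to be what the Network panel shows for a product search, so treat it as an example value.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit

# Assumed example of a request URL copied from the Network panel.
copied_url = (
    "https://ecshweb.pchome.com.tw/search/v3.3/all/results"
    "?q=macbook&page=1&sort=rnk%2Fdc"
)

parts = urlsplit(copied_url)
request_url = f"{parts.scheme}://{parts.netloc}{parts.path}"
query_str_params = dict(parse_qsl(parts.query))  # percent-decoding included

print(request_url)
print(query_str_params)

# Going the other way: a dict of parameters becomes a query string again.
print(urlencode(query_str_params))
```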

## But how do we transfer data between Python and a server?

We use a third-party library called [Requests](https://requests.readthedocs.io/en/master/).

## Requests

## What is Requests?

> Requests is an elegant and simple HTTP library for Python, built for human beings.

Source: <https://requests.readthedocs.io/en/master/>

In [1]:
import requests

## If Requests is not installed, we will encounter a `ModuleNotFoundError`

```
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ModuleNotFoundError: No module named 'requests'
```

## Use `pip install` in the terminal to install Requests

```bash
pip install requests
```
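To check whether a module is available without triggering the error, one option is to ask the import system directly. A small sketch using the standard library; `is_installed` is a hypothetical helper name:

```python
import importlib.util


def is_installed(module_name):
    """Return True if the module can be imported in this environment."""
    return importlib.util.find_spec(module_name) is not None


print(is_installed("json"))      # a standard library module
print(is_installed("requests"))  # depends on your environment
```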

## Check version and its installation file path

- `__version__` attribute
- `__file__` attribute

In [2]:
print(requests.__version__)
print(requests.__file__)

2.23.0
/Users/kuoyaojen/pyda/lib/python3.6/site-packages/requests/__init__.py


## Requesting data with functions

- `requests.get(request_url, params=query_str_params)`: Send a GET request to `request_url`, with the `query_str_params` dictionary encoded into the URL's query string.
- `requests.post(request_url, data=form_data)`: Send a POST request to `request_url`, with the `form_data` dictionary form-encoded into the request body.

In [3]:
request_url = "https://ecshweb.pchome.com.tw/search/v3.3/all/results"
query_str_params = {
    'q': 'macbook',
    'page': 1,
    'sort': 'rnk/dc'
}
response = requests.get(request_url, params=query_str_params)

In [4]:
request_url = "https://emap.pcsc.com.tw/EMapSDK.aspx"
form_data = {
    'commandid': 'SearchStore',
    'city': '台北市',
    'town': '大安區'
}
response = requests.post(request_url, data=form_data)

In [5]:
request_url = "https://www.imdb.com/title/tt10048342"
response = requests.get(request_url)
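To see where `params` and `data` end up without sending anything over the network, we can build the requests and inspect them before they are sent. This sketch assumes Requests is installed and reuses the URLs and values from the cells above:

```python
import requests

# GET: params are encoded into the URL's query string.
get_request = requests.Request(
    "GET",
    "https://ecshweb.pchome.com.tw/search/v3.3/all/results",
    params={"q": "macbook", "page": 1, "sort": "rnk/dc"},
).prepare()
print(get_request.method)
print(get_request.url)

# POST: data is form-encoded into the request body instead.
post_request = requests.Request(
    "POST",
    "https://emap.pcsc.com.tw/EMapSDK.aspx",
    data={"commandid": "SearchStore", "city": "台北市", "town": "大安區"},
).prepare()
print(post_request.headers["Content-Type"])
print(post_request.body)
```

The printed URL and body should correspond to the Query String Parameters and Form Data sections in the developer tools.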

## Common attributes and methods of the `Response` type

- `status_code` attribute: the HTTP response status code, e.g. `200` for a successful request.
- `text` attribute: the response body as a Python `str`.
- `json()` method: parses a JSON response body into a Python data structure.
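A quick way to try these three without depending on a real website is to point Requests at a throwaway local server. In this sketch the `JSONHandler` class and its payload are invented for the demo, and Requests is assumed to be installed.

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

import requests


class JSONHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Always answer with the same made-up JSON payload.
        body = json.dumps({"prods": [{"name": "macbook"}]}).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo output quiet


server = HTTPServer(("127.0.0.1", 0), JSONHandler)  # port 0: any free port
threading.Thread(target=server.serve_forever, daemon=True).start()

session = requests.Session()
session.trust_env = False  # ignore proxy settings for this local demo
response = session.get(f"http://127.0.0.1:{server.server_port}/", timeout=5)
server.shutdown()
server.server_close()

print(response.status_code)  # status code as an int
print(type(response.text))   # body decoded to a str
print(response.json())       # body parsed from JSON
```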

## Next step: parsing data accordingly

- JSON format: call the `response.json()` method to get a Python data structure.
- XML format: use a parser such as `xml.etree.ElementTree` to convert `response.text` into an element tree.
- HTML format: use a parser such as BeautifulSoup to convert `response.text` into a `soup` object.
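As a small preview of the JSON and XML cases, the sketch below parses two made-up response bodies with the standard library. The sample strings stand in for `response.text` and are assumptions, not real server output. HTML parsing with a third-party parser is left out here.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical response bodies standing in for response.text.
json_text = '{"prods": [{"name": "macbook", "price": 999}]}'
xml_text = "<stores><store><name>Demo Store</name></store></stores>"

# JSON: the text becomes nested dicts and lists.
parsed_json = json.loads(json_text)
first_product = parsed_json["prods"][0]["name"]

# XML: the text becomes a tree of elements we can search by tag.
root = ET.fromstring(xml_text)
store_names = [element.text for element in root.iter("name")]

print(first_product)
print(store_names)
```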