# Programming and Data Analysis

> Web Scraping with Python

Kuo, Yao-Jen <yaojenkuo@ntu.edu.tw> from [DATAINPOINT](https://www.datainpoint.com/)

In [1]:
import requests
import json
import xml.etree.ElementTree as ET
from bs4 import BeautifulSoup

## About web scraping

## What is web scraping?

> Web scraping is extracting data from the World Wide Web directly using the Hypertext Transfer Protocol (HTTP). While web scraping can be done manually by a human (via a browser), the term typically refers to automated processes implemented using a bot or web crawler. It is a form of copying, in which specific data is gathered and copied from the web, typically into a central local database or spreadsheet, for later retrieval or analysis.

Source: <https://en.wikipedia.org/wiki/Web_scraping>

## Core tasks in web scraping

- Transferring data.
- Parsing data.

## What is transferring data?

Transferring data means exchanging hypermedia documents between a browser (or a web scraping script) and a server on the Web. The underlying network protocol for this exchange is the **HyperText Transfer Protocol (HTTP)**.

## Two directions of data transfer

1. To a server: requesting data.
2. From a server: responding with data.

## Types of HTTP request methods to a server involved in web scraping

- GET method.
- POST method.

Source: <https://developer.mozilla.org/en-US/docs/Web/HTTP/Methods>

## What is a GET method

> The GET method requests a representation of the specified resource, e.g. reading a blog post.

## What is a POST method

> The POST method is used to submit an entity to the specified resource, often causing a change in state or side effects on the server, e.g. writing a blog post.
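
As a rough illustration of the difference, the sketch below builds one GET and one POST request with the standard library's `urllib` without sending either. The `https://example.com/posts` URL and the `id` and `title` fields are made up for this example.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical blog endpoint, used only for illustration; nothing is sent.
base_url = "https://example.com/posts"

# GET: the parameters travel in the URL as a query string.
get_request = Request(base_url + "?" + urlencode({"id": 1}))

# POST: the submitted fields travel in the request body as bytes.
post_request = Request(base_url, data=urlencode({"title": "Hello"}).encode("utf-8"))

print(get_request.get_method(), get_request.full_url)
print(post_request.get_method(), post_request.data)
```

The GET request carries everything in its URL, while the POST request keeps the URL clean and puts the submitted fields in the body.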

## Transferring data between browser and server looks simple and straightforward

- We enter a Uniform Resource Locator (URL).
- We fill in a form and submit it.
- We interact with various user-interface components.
- And so on.

## Developer tools

## We can actually see the process of transferring data via the developer tools of a browser

Developer tools are a set of web development tools built directly into the browser. They help us edit pages on the fly and diagnose problems quickly, which ultimately helps us build better websites, faster.

![Imgur](https://i.imgur.com/3Synk8m.png?1)

## We use Network to see the details of data transfer

1. Open Developer Tools.
2. Click Network.
3. Refresh website.

![Imgur](https://i.imgur.com/OG0Huwj.png?1)

## Each file listed is one complete request-response cycle

- Request
    - Headers
    - Method
- Response
    - Body

## There is a lot of data transferring during web page rendering

- **XHR(XMLHttpRequest)**
- JS
- CSS
- Img
- Media
- Font
- **Doc(HTML documents)**
- WS
- Manifest
- Other

## We can turn off the browser's JavaScript to work out where to find the data

- If data disappears, check **XHR**.
- If data still exists, check **Doc**.
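
A script that downloads a page receives only the raw document and does not run its JavaScript, so the same check can be approximated in code. The sketch below uses two made-up HTML strings and a hypothetical helper named `where_to_look`; a plain substring test is only a heuristic.

```python
def where_to_look(raw_html: str, expected_text: str) -> str:
    """Suggest which Network filter to inspect for a value seen in the browser."""
    # If the value is already in the raw document, JavaScript is not needed.
    return "Doc" if expected_text in raw_html else "XHR"


# Made-up documents: one ships its data in the HTML, the other fills it in later.
static_page = "<html><body><h1>The Shawshank Redemption</h1></body></html>"
dynamic_page = "<html><body><div id='app'></div><script src='app.js'></script></body></html>"

print(where_to_look(static_page, "The Shawshank Redemption"))   # Doc
print(where_to_look(dynamic_page, "The Shawshank Redemption"))  # XHR
```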

## Check which file contains the data we want to scrape

- Preview(Response body rendered in browser)
- Response(body)

## A Chrome browser plug-in to turn JavaScript off

<https://chrome.google.com/webstore/detail/quick-javascript-switcher/geddoclleiomckbhadiaipdggiiccfje>

## Let's see how it works

- <https://ecshweb.pchome.com.tw/search/v3.3>: Data disappears, check **XHR**.
- <https://emap.pcsc.com.tw>: Data disappears, check **XHR**.
- <https://www.imdb.com/chart/top>: Data still exists, check **Doc**.

## Once we've found what we need, check its details

- Headers
    - General
        - Request URL
        - Request Method
    - Request Headers
    - Query String Parameters(if any)
    - Form Data(if any)

![Imgur](https://i.imgur.com/cTva78r.png?1)

![Imgur](https://i.imgur.com/LMVp0m7.png?1)
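
The Request URL shown under General often already includes the query string parameters. As a small sketch, the standard library can split such a URL back into its base and its parameters. The full URL below is assembled from the PChome search endpoint and parameters used later in this notebook, so treat it as an assumption about what the browser shows.

```python
from urllib.parse import parse_qsl, urlsplit

# Assumed example of a Request URL copied from the Network tab.
request_url = "https://ecshweb.pchome.com.tw/search/v3.3/all/results?q=macbook&page=1&sort=rnk/dc"

parts = urlsplit(request_url)
base_url = f"{parts.scheme}://{parts.netloc}{parts.path}"
# Every value comes back as a string, including the page number.
query_string_parameters = dict(parse_qsl(parts.query))

print(base_url)
print(query_string_parameters)
```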

## But how do we transfer data between Python and server?

We use a third-party library called [Requests](https://requests.readthedocs.io/en/master/).

## Requests

## What is Requests?

> Requests is an elegant and simple HTTP library for Python, built for human beings.

Source: <https://requests.readthedocs.io/en/master>

In [2]:
import requests

## If Requests is not installed, we will encounter a `ModuleNotFoundError`

```
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ModuleNotFoundError: No module named 'requests'
```

## Use `pip install` in the terminal to install Requests

```bash
pip install requests
```
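
To check for the package before importing it, one option is `importlib.util.find_spec` from the standard library. This is a minimal sketch; the helper name `is_installed` is made up here.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module can be located without importing it."""
    return importlib.util.find_spec(module_name) is not None


print(is_installed("json"))      # part of the standard library
print(is_installed("requests"))  # depends on whether Requests has been installed
```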

## Check version and its installation file path

- `__version__` attribute
- `__file__` attribute

In [3]:
print(requests.__version__)
print(requests.__file__)

2.25.1
/Users/kuoyaojen/opt/miniconda3/envs/pyds/lib/python3.8/site-packages/requests/__init__.py


## Requesting data with functions

- `requests.get(request_url)`: Make a request with GET method.
- `requests.post(request_url)`: Make a request with POST method.

## Sending requests with keyword arguments

- `params=query_string_parameters`
- `data=form_data`
- `headers=request_headers`
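
To see what each keyword argument does without contacting a server, one option is to build the request and inspect it before sending. The sketch below assumes Requests is installed and uses its `Request` class and `prepare()` method; the endpoints and values are taken from the cells that follow, and nothing is sent.

```python
import requests

# params= is encoded into the query string of the URL; headers= is attached as-is.
get_request = requests.Request(
    "GET",
    "https://ecshweb.pchome.com.tw/search/v3.3/all/results",
    params={"q": "macbook", "page": 1},
    headers={"accept-language": "en-US,en;q=1.0"},
).prepare()
print(get_request.method, get_request.url)
print(get_request.headers["accept-language"])

# data= is form-encoded into the request body.
post_request = requests.Request(
    "POST",
    "https://emap.pcsc.com.tw/EMapSDK.aspx",
    data={"commandid": "SearchStore"},
).prepare()
print(post_request.method, post_request.body)
```

Inspecting a prepared request this way makes it easy to compare against the Request URL and Form Data shown in the developer tools.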

In [4]:
request_url = "https://ecshweb.pchome.com.tw/search/v3.3/all/results"
query_string_parameters = {
    'q': 'macbook',
    'page': 1,
    'sort': 'rnk/dc'
}
response = requests.get(request_url, params=query_string_parameters)

In [5]:
request_url = "https://emap.pcsc.com.tw/EMapSDK.aspx"
form_data = {
    'commandid': 'SearchStore',
    'city': '台北市',
    'town': '大安區'
}
response = requests.post(request_url, data=form_data)

In [6]:
request_url = "https://www.imdb.com/chart/top"
request_headers = {
    "accept-language": "en-US,en;q=1.0"
}
response = requests.get(request_url, headers=request_headers)

## Common attributes and methods to use on `Response` type

- `status_code` attribute to validate HTTP response status codes.
- `text` attribute to extract the response content as a Python `str`.
- `json()` method to parse a JSON format and convert to a Python data structure.

## Next step: parsing data accordingly

- JSON format: call `response.json()` method to generate a Python data structure.
- XML format: use a parser to convert `response.text` to an `xml.etree.ElementTree.Element` object.
- HTML format: use a parser to convert `response.text` to a `bs4.BeautifulSoup` object.

## JSON format

## What is JSON?

> JavaScript Object Notation (JSON) is a standard text-based format for representing structured data based on JavaScript object syntax. It is commonly used for transmitting data in web applications (e.g., sending some data from the server to the client's browser, so it can be displayed on a web page, or vice versa).

Source: [mozilla.org](https://developer.mozilla.org/en-US/docs/Learn/JavaScript/Objects/JSON)

## How do we check whether a response is in JSON format?

- Look at Preview (response body rendered in browser).
- Look at Response (body).

## JSON exists as a string

- Key-Value Storage layout, quite similar to a Python `dict`.
- Array layout, quite similar to a Python `list`.
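
A tiny made-up JSON string shows both layouts at once. The keys `totalRows` and `prods` mirror the response explored below, while the product fields `name` and `price` are placeholders.

```python
import json

# Hand-written sample; only the outer key names follow the real response.
json_text = '{"totalRows": 2, "prods": [{"name": "item A", "price": 100}, {"name": "item B", "price": 200}]}'

parsed = json.loads(json_text)
print(type(parsed))           # the key-value layout becomes a dict
print(type(parsed["prods"]))  # the array layout becomes a list

# Once parsed, ordinary dict and list operations apply.
names = [prod["name"] for prod in parsed["prods"]]
print(names)
```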

## Let's see what JSON looks like

<https://ecshweb.pchome.com.tw/search/v3.3/>: Data disappears after turning JavaScript off, check **XHR**.

In [7]:
request_url = "https://ecshweb.pchome.com.tw/search/v3.3/all/results"
query_str_params = {
    'q': 'macbook',
    'page': 1,
    'sort': 'rnk/dc'
}
response = requests.get(request_url, params=query_str_params)

## Calling the `json` method of `Response` to get the parsed structure

In [8]:
json_format = response.json()
print(type(json_format))
print(json_format.keys())

<class 'dict'>
dict_keys(['QTime', 'totalRows', 'totalPage', 'range', 'cateName', 'q', 'subq', 'token', 'isMust', 'prods'])


## The `json` method of `Response` is roughly equivalent to calling the `loads` function from the standard library `json` on `response.text`

In [9]:
json_format = json.loads(response.text)
print(type(json_format))
print(json_format.keys())

<class 'dict'>
dict_keys(['QTime', 'totalRows', 'totalPage', 'range', 'cateName', 'q', 'subq', 'token', 'isMust', 'prods'])


## Values inside JSON can also use the array layout

In [10]:
print(type(json_format['prods']))
print(len(json_format['prods']))

<class 'list'>
20


## XML format

## What is XML?

> XML (Extensible Markup Language) is a markup language without predefined tags to use. Instead, we define our own tags for our needs. This is a powerful way to store data in a format that can be stored, searched, and shared.

Source: <https://developer.mozilla.org/en-US/docs/Web/XML>

## How do we check whether a response is in XML format?

Look into Response (body) and check whether an XML declaration appears at the top.

## XML exists as a string

Self-defined tags with a hierarchical tree layout.
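
For illustration, here is a small hand-written XML string parsed with the standard library. The tag names `stores`, `store`, `name`, and `city` are invented for this sketch, which is exactly what self-defined tags allow.

```python
import xml.etree.ElementTree as ET

# Hand-written sample with made-up tag names.
xml_text = """<?xml version="1.0"?>
<stores>
    <store>
        <name>Store A</name>
        <city>Taipei</city>
    </store>
</stores>"""

root = ET.fromstring(xml_text)
first_store = root[0]
print(root.tag)                             # stores
print([child.tag for child in first_store]) # ['name', 'city']
print(first_store[0].text)                  # Store A
```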

## Let's see what XML looks like

<https://emap.pcsc.com.tw/EMapSDK.aspx>: Data disappears after turning JavaScript off, check **XHR**.

In [11]:
request_url = "https://emap.pcsc.com.tw/EMapSDK.aspx"
form_data = {
    'commandid': 'SearchStore',
    'city': '台北市',
    'town': '大安區'
}
response = requests.post(request_url, data=form_data)

## Use the `fromstring` function from `ET` to get the parsed structure

In [12]:
root = ET.fromstring(response.text)
print(type(root))

<class 'xml.etree.ElementTree.Element'>


## Use XPath to extract data from specific tags

> XPath stands for XML Path Language. It provides a flexible way of addressing (pointing to) different parts of an XML document.

Source: <https://developer.mozilla.org/en-US/docs/Web/XPath>

In [13]:
# The XPath for POIName
poinames = [e.text for e in root.findall('.//POIName')]
print(poinames)

['大台', '大信', '大敦', '中廣', '仁安', '仕吉', '台科一', '永康', '禾光', '立仁', '光忠', '吉孝', '吉忠', '合旺', '合維', '安居', '安松', '佑安', '技安', '辛亥', '卓聯', '和平東', '和安', '和金', '和泰', '和樂', '延吉', '昇隆', '東門', '欣安和', '欣隆昌', '花市', '金信', '金華', '長星', '阿波羅', '信中', '信安', '信義', '信興', '建安', '建忠', '建南', '建綸', '恆安', '科技站', '科建', '科興', '師大', '泰利', '國館', '崇光', '康福', '教育大學', '統合', '統家', '統領', '通化', '頂好', '頂安', '頂東', '喜悅', '富陽', '復忠', '復昌', '復維', '敦仁', '敦禾', '敦安', '敦信', '敦南', '敦頂', '敦隆', '敦維', '敦親', '森美', '華電', '愛國', '新北科', '新東帝', '新泰順', '新國聯', '溫州', '溫東', '瑞升', '瑞安', '義村', '誠安', '福亭', '鳳翔', '樂安', '樂利', '樂和', '樂隆', '黎元', '豫銘', '錢忠', '靜安', '龍和', '龍延', '龍門', '龍泉', '龍淵', '龍普', '濟南', '臨江', '臨通', '豐安', '懷生', '羅鑫', '麟光', '鑫忠孝', '鑫泰', '鑫通', '鑫富民', '鑫復']


In [14]:
# The XPath for Address
addresses = [e.text for e in root.findall('.//Address')]
print(addresses)

['台北市大安區羅斯福路三段333巷18號1樓20號1樓(部分)', '台北市大安區信義路三段33號', '台北市大安區敦化南路二段63巷7號1樓', '台北市大安區仁愛路三段25-1號27號', '台北市大安區仁愛路四段27巷1號', '台北市大安區忠孝東路四段223巷42號', '台北市大安區基隆路四段43號1樓', '台北市大安區永康街43號', '台北市大安區和平東路二段63號1樓', '台北市大安區安和路二段74巷1號', '台北市大安區復興南路一段107巷5弄1號1樓', '台北市大安區忠孝東路四段299號', '台北市大安區延吉街72號', '台北市大安區復興南路二段151巷41號', '台北市大安區四維路170巷8號1樓', '台北市大安區安居街33號', '台北市大安區安東街50之2號50之3號50之4號', '台北市大安區忠孝東路三段217巷1弄2號', '台北市大安區和平東路三段97號97之1號1樓', '台北市大安區辛亥路二段57號', '台北市大安區羅斯福路四段1號1樓卓聯大樓', '台北市大安區和平東路一段129之1號', '台北市大安區和平東路三段230號', '台北市大安區和平東路一段91號', '台北市大安區和平東路一段169號', '台北市大安區和平東路三段228巷45號1樓', '台北市大安區延吉街237號', '台北市大安區敦化南路二段238號', '台北市大安區信義路二段198巷6號1樓', '台北市大安區安和路一段47號', '台北市大安區基隆路二段142之1號及142之2號', '台北市大安區建國南路一段274號', '台北市大安區金山南路二段18號1樓', '台北市大安區金華街140號1樓', '台北市大安區基隆路三段85號', '台北市大安區忠孝東路四段222號224號1樓', '台北市大安區信義路三段101號', '台北市大安區大安路一段218號', '台北市大安區信義路四段265巷12弄1號', '台北市大安區信義路四段32號', '台北市大安區敦化南路一段187巷29號', '台北市大安區忠孝東路三段249號', '台北市大安區建國南路二段151巷6之8號', '台北市大安區仁愛路四段151巷33號忠孝東路四段216巷32弄19號21號', '台北市大安區永康街2巷12號1樓', '台北市大安區復興南路二段20
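
Two separate lists are handy, but keeping each name next to its address is often more useful. The sketch below assumes every store sits in its own parent element. Only `POIName` and `Address` come from this notebook; the `Stores` and `Store` wrapper tags and the sample values are hypothetical, so check the actual response body before relying on them.

```python
import xml.etree.ElementTree as ET

# Hand-written sample; the wrapper tags Stores and Store are hypothetical.
sample_xml = """<Stores>
    <Store><POIName>Store A</POIName><Address>1 Example Road</Address></Store>
    <Store><POIName>Store B</POIName><Address>2 Example Road</Address></Store>
</Stores>"""

sample_root = ET.fromstring(sample_xml)

# Walk each record once so that name and address stay together.
stores = [
    {"name": store.findtext("POIName"), "address": store.findtext("Address")}
    for store in sample_root.findall(".//Store")
]
print(stores)
```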

## HTML format

## What is HTML?

> HTML (HyperText Markup Language) is a descriptive language that specifies webpage structure. An HTML document is a plaintext document structured with elements. Elements are surrounded by matching opening and closing tags. Each tag begins and ends with angle brackets (<>).

Source: <https://developer.mozilla.org/en-US/docs/Glossary/HTML>

## How do we check whether a response is in HTML format?

Look into Response (body) and check whether an HTML document type is declared.

## HTML exists as a string

Predefined tags with a hierarchical tree layout.

## Let's see what HTML looks like

<https://www.imdb.com/>: Data still exists after turning JavaScript off, check **Doc**.

In [15]:
request_url = "https://www.imdb.com/title/tt0111161"
response = requests.get(request_url, headers={"accept-language": "en-US,en;q=1.0"})

## Use the `BeautifulSoup` class from `bs4` to get the parsed structure

In [16]:
soup = BeautifulSoup(response.text, "html.parser")  # name the parser explicitly to avoid a bs4 warning
print(type(soup))

<class 'bs4.BeautifulSoup'>


## Use CSS Selector to extract data from specific tags

> A CSS selector is the part of a CSS rule that describes what elements in a document the rule will match. The matching elements will have the rule's specified style applied to them.

Source: <https://developer.mozilla.org/en-US/docs/Glossary/CSS_Selector>

## A CSS selector can be mixed and matched with

1. Tag names, e.g. `h1`, `a`
2. Class attribute in tags, e.g. `.character`
3. Id attribute in tags, e.g. `#title-overview-widget`
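
The three building blocks can be tried on a small hand-written HTML string. This sketch assumes `bs4` is installed; the markup is made up, reusing the class and id names from the list above, and `html.parser` is the parser bundled with Python.

```python
from bs4 import BeautifulSoup

# Hand-written sample markup, not a real IMDb page.
html_text = """
<div id="title-overview-widget">
  <h1>The Shawshank Redemption</h1>
  <a href="/cast">Full cast</a>
  <p class="character">Andy Dufresne</p>
  <p class="character">Red</p>
</div>
"""

sample_soup = BeautifulSoup(html_text, "html.parser")

title_text = sample_soup.select("h1")[0].text                              # tag name
character_names = [e.text for e in sample_soup.select(".character")]       # class attribute
cast_link = sample_soup.select("#title-overview-widget a")[0].get("href")  # id combined with a tag name

print(title_text)
print(character_names)
print(cast_link)
```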

## Finding the right CSS selector is not easy unless we are seasoned front-end engineers

A Chrome browser plug-in to help us find the specific CSS selector of element(s): <https://chrome.google.com/webstore/detail/selectorgadget/mhjhnkcfbdhnjickkkdbjoemdmbfginb>

In [17]:
# The CSS Selector for title
title = soup.select('h1')[0].text.strip()
print(title)

The Shawshank Redemption


In [18]:
# The CSS Selector for rating
rating = float(soup.select('a div span')[0].text)
print(rating)

9.3


In [19]:
# The CSS Selector for poster
poster = soup.select('img')[0].get('src')
print(poster)

https://m.media-amazon.com/images/M/MV5BMDFkYTc0MGEtZmNhMC00ZDIzLWFmNTEtODM1ZmRlYWMwMWFmXkEyXkFqcGdeQXVyMTMxODk2OTU@._V1_QL75_UX190_CR0,0,190,281_.jpg


In [20]:
# The CSS Selector for cast
request_url = "https://www.imdb.com/title/tt0111161/fullcredits"
response = requests.get(request_url)
soup = BeautifulSoup(response.text, "html.parser")  # name the parser explicitly to avoid a bs4 warning
cast = [e.text.strip() for e in soup.select('.primary_photo+ td a')]
print(cast)

['Tim Robbins', 'Morgan Freeman', 'Bob Gunton', 'William Sadler', 'Clancy Brown', 'Gil Bellows', 'Mark Rolston', 'James Whitmore', 'Jeffrey DeMunn', 'Larry Brandenburg', 'Neil Giuntoli', 'Brian Libby', 'David Proval', 'Joseph Ragno', 'Jude Ciccolella', 'Paul McCrane', 'Renee Blaine', 'Scott Mann', 'John Horton', 'Gordon Greene', 'Alfonso Freeman', 'Vincent Foster', 'John E. Summers', 'Frank Medrano', 'Mack Miles', 'Alan R. Kessler', 'Morgan Lund', 'Cornell Wallace', 'Gary Lee Davis', 'Neil Summers', 'Ned Bellamy', 'Joe Pecoraro', 'Harold E. Cope Jr.', 'Brian Delate', 'Don McManus', 'Donald Zinn', 'Dorothy Silver', 'Robert Haley', 'Dana Snyder', 'John D. Craig', 'Ken Magee', 'Eugene C. DePasquale', 'Bill Bolender', 'Ron Newell', 'John R. Woodward', 'Chuck Brauchler', 'Dion Anderson', 'Claire Slemmer', 'James Kisicki', 'Rohn Thomas', 'Charlie Kearns', 'Rob Reider', 'Brian Brophy', 'Paul Kennedy', 'James Babson', 'Dennis Baker', 'Fred Culbertson', 'Richard Doone', 'Shane Grove', 'Rita Hay

In [21]:
# The CSS Selector for characters
characters = [e.text.strip() for e in soup.select('.character a:nth-child(1)')]
print(characters)

['Andy Dufresne', "Ellis Boyd 'Red' Redding", 'Warden Norton', 'Heywood', 'Captain Hadley', 'Tommy', 'Bogs Diamond', 'Brooks Hatlen', '1946 D.A.', 'Skeet', 'Jigger', 'Floyd', 'Snooze', 'Ernie', 'Guard Mert', 'Guard Trout', "Andy Dufresne's Wife", 'Glenn Quentin', '1946 Judge', '1947 Parole Hearings Man', 'Fresh Fish Con', 'Hungry Fish Con', 'New Fish Guard', 'Fat Ass', 'Tyrell', 'Laundry Bob', 'Laundry Truck Driver', 'Laundry Leonard', 'Rooster', 'Pete', 'Guard Youngblood', 'Projectionist', 'Hole Guard', 'Guard Dekins', 'Guard Wiley', 'Moresby Batter', '1954 Landlady', '1954 Food-Way Manager', '1954 Food-Way Woman', '1957 Parole Hearings Man', 'Ned Grimes', 'Mail Caller', 'Elmo Blatch', 'Elderly Hole Guard', 'Bullhorn Tower Guard', 'Man Missing Guard', 'Head Bull Haig', 'Bank Teller', 'Bank Manager', 'Bugle Editor', '1966 D.A.', 'Duty Guard', '1967 Parole Hearings Man', '1967 Food-Way Manager', 'Con', 'Old Man on Bus', 'Police Officer', 'Con', 'Inmate', 'Gilda Mundson Farrell', 'Bank T
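
Since `cast` and `characters` are parallel lists, they can be paired with `zip`. The sketch below uses only the first three entries from the outputs above, stored under the illustrative names `sample_cast` and `sample_characters`. Note that `zip` stops at the shorter list, so a length check is a sensible guard when the two selectors match different numbers of elements.

```python
# First three entries copied from the printed outputs.
sample_cast = ["Tim Robbins", "Morgan Freeman", "Bob Gunton"]
sample_characters = ["Andy Dufresne", "Ellis Boyd 'Red' Redding", "Warden Norton"]

# zip silently truncates to the shorter list, so flag any mismatch first.
if len(sample_cast) != len(sample_characters):
    print("Warning: lists differ in length; some pairs may be misaligned.")

credits = dict(zip(sample_cast, sample_characters))
print(credits["Morgan Freeman"])
```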