# Web Crawling 2019-05-15

## In-class Exercises

### Web API: AQI DATA

```python
import requests

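# verify=False skips TLS certificate verification; use it only when the site's certificate cannot be validated
r = requests.get("https://opendata.epa.gov.tw/ws/Data/AQI/?$format=json", verify=False)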
```

- How many sites in Taiwan?
- How many sites in Taipei City?
- Highest PM2.5 site name?
- Lowest PM2.5 site name?

### Non Web API: IMDB DATA

```python
import requests

r = requests.get("https://www.imdb.com/title/tt4154796/releaseinfo/")
```

- Parsing release information of Avengers: Endgame (2019)
- Summarizing number of countries grouped by release date
- When is Taiwan's release date?

## Python Environment

- [Download Miniconda](https://docs.conda.io/en/latest/miniconda.html)
- Installing jupyter

```bash
pip install jupyter
```

- Check environments

```bash
conda env list
```

- Creating an environment for web crawling

```bash
# crawler as the env name
conda create --name crawler python=3
```

- Activate environment

```bash
conda activate crawler
```

- Install required packages for web crawling: `requirements.txt`
    - `requests`: Getting data
    - `beautifulsoup4`/`pyquery`: Parsing data
    - `selenium`: Browser automation
    - `numpy`/`pandas`: Data wrangling
    - `ipykernel`: Connecting jupyter with environments

```
requests
beautifulsoup4
pyquery
selenium
numpy
pandas
ipykernel
```

- Installing required packages

```bash
pip install -r requirements.txt
```

- Check Jupyter kernels

```bash
jupyter kernelspec list
```

- Connecting kernel with environment

```bash
# Run this command while "crawler" is activated
python -m ipykernel install --user --name crawler --display-name "crawler"
```
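
Once the packages are installed, a quick check from inside the `crawler` kernel can confirm that they are importable. The sketch below is only an illustration: `missing_packages` and `CRAWLER_IMPORTS` are hypothetical names, and the list uses import names rather than pip names (`beautifulsoup4` is imported as `bs4`).

```python
from importlib.util import find_spec

# Import names for the packages in requirements.txt.
# Note: the pip package beautifulsoup4 is imported as bs4.
CRAWLER_IMPORTS = ["requests", "bs4", "pyquery", "selenium", "numpy", "pandas", "ipykernel"]


def missing_packages(import_names):
    """Return the import names that cannot be found in the current environment."""
    return [name for name in import_names if find_spec(name) is None]


missing = missing_packages(CRAWLER_IMPORTS)
print("Missing:", missing if missing else "none")
```

If the list is not empty, the kernel is probably not running in the `crawler` environment, or `pip install -r requirements.txt` was run in a different one.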

## Hello world

In [12]:
import requests

r = requests.get("http://data.nba.net/prod/v1/20190514/scoreboard.json") # GET
print(r.status_code)

200


In [13]:
today_scoreboard = r.json()
print(type(today_scoreboard))

<class 'dict'>


In [14]:
today_scoreboard.keys()

dict_keys(['_internal', 'numGames', 'games'])

In [15]:
western_final_g1 = today_scoreboard["games"][0]
western_final_g1.keys()

dict_keys(['seasonStageId', 'seasonYear', 'gameId', 'arena', 'isGameActivated', 'statusNum', 'extendedStatusNum', 'startTimeEastern', 'startTimeUTC', 'startDateEastern', 'clock', 'isBuzzerBeater', 'isPreviewArticleAvail', 'isRecapArticleAvail', 'tickets', 'hasGameBookPdf', 'isStartTimeTBD', 'nugget', 'attendance', 'gameDuration', 'tags', 'playoffs', 'period', 'vTeam', 'hTeam', 'watch'])

In [16]:
western_final_g1["vTeam"]

{'teamId': '1610612757',
 'triCode': 'POR',
 'win': '53',
 'loss': '29',
 'seriesWin': '0',
 'seriesLoss': '0',
 'score': '53',
 'linescore': [{'score': '23'},
  {'score': '22'},
  {'score': '8'},
  {'score': '0'}]}

In [17]:
western_final_g1["hTeam"]

{'teamId': '1610612744',
 'triCode': 'GSW',
 'win': '57',
 'loss': '25',
 'seriesWin': '0',
 'seriesLoss': '0',
 'score': '67',
 'linescore': [{'score': '27'},
  {'score': '27'},
  {'score': '13'},
  {'score': '0'}]}
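
Note that the API returns every score as a string, so values need converting before any arithmetic. A minimal sketch, using trimmed copies of the two team dicts shown above (the helper names `quarter_scores` and `scoreline` are made up for this example):

```python
# Trimmed copies of the vTeam / hTeam dicts shown above (only the keys used here).
v_team = {
    "triCode": "POR",
    "score": "53",
    "linescore": [{"score": "23"}, {"score": "22"}, {"score": "8"}, {"score": "0"}],
}
h_team = {
    "triCode": "GSW",
    "score": "67",
    "linescore": [{"score": "27"}, {"score": "27"}, {"score": "13"}, {"score": "0"}],
}


def quarter_scores(team):
    """Convert the per-quarter string scores to integers."""
    return [int(quarter["score"]) for quarter in team["linescore"]]


def scoreline(visitor, home):
    """Format a 'VISITOR score - score HOME' summary line."""
    return "{} {} - {} {}".format(
        visitor["triCode"], int(visitor["score"]), int(home["score"]), home["triCode"]
    )


print(scoreline(v_team, h_team))
print(quarter_scores(v_team), sum(quarter_scores(v_team)))
```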

## Web Scraping Methods

- Web API
    - `.json`
    - `requests`
- Non Web API
    - `.html`
    - `requests` + parser (`beautifulsoup4`, `pyquery`)
- Browser Automation
    - `selenium`

## JSONView

Renders JSON data in a more readable format in the browser.

<https://chrome.google.com/webstore/detail/jsonview/chklaanhfefbnpoihckbnefhakgolnmc>

## Web API Examples

- [data.nba.net](http://data.nba.net/prod/v1/today.json)
- [Air Quality Index (AQI)](https://opendata.epa.gov.tw/ws/Data/AQI/?$format=json)
- [PCHome Online Shopping](https://ecshweb.pchome.com.tw/search/v3.3/all/results?q=macbook&page=1&sort=sale/dc)

### Important attrs / methods

- `r = requests.get(URL)`
- `r.status_code`
- `r.json()`: `dict` / `list`
- `r.content`: `bytes`
- `r.text`: `str`
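
The relationship between the three views of a response body can be imitated offline with the standard library. This is only an approximation of what `requests` does (it also guesses the encoding from the response headers); the byte string below is a hand-written stand-in for `r.content`.

```python
import json

# Stand-in for r.content: raw bytes as they would arrive over the network.
content = '[{"SiteName": "大園", "PM2.5": "49"}]'.encode("utf-8")

text = content.decode("utf-8")  # roughly what r.text gives: a str
data = json.loads(text)         # roughly what r.json() gives: a list or dict

print(type(content).__name__, type(text).__name__, type(data).__name__)
print(data[0]["SiteName"])
```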

In [19]:
import requests

r = requests.get("https://opendata.epa.gov.tw/ws/Data/AQI/?$format=json", verify=False)
print(type(r))



<class 'requests.models.Response'>


In [21]:
print(r.status_code)

200


In [22]:
print(r.json())

[{'SiteName': '屏東(琉球)', 'County': '屏東縣', 'AQI': '', 'Pollutant': '', 'Status': '設備維護', 'SO2': '', 'CO': '', 'CO_8hr': '0.2', 'O3': '28', 'O3_8hr': '22', 'PM10': '', 'PM2.5': '', 'NO2': '', 'NOx': '', 'NO': '', 'WindSpeed': '2.3', 'WindDirec': '176', 'PublishTime': '2019-05-15 11:00', 'PM2.5_AVG': '', 'PM10_AVG': '15', 'SO2_AVG': '2', 'Longitude': '120.377222', 'Latitude': '22.352222'}, {'SiteName': '苗栗(後龍)', 'County': '苗栗縣', 'AQI': '78', 'Pollutant': '細懸浮微粒', 'Status': '普通', 'SO2': '', 'CO': '', 'CO_8hr': '0.3', 'O3': '', 'O3_8hr': '13', 'PM10': '', 'PM2.5': '', 'NO2': '', 'NOx': '', 'NO': '', 'WindSpeed': '2.2', 'WindDirec': '280', 'PublishTime': '2019-05-15 11:00', 'PM2.5_AVG': '26', 'PM10_AVG': '41', 'SO2_AVG': '3', 'Longitude': '120.786028', 'Latitude': '24.616369'}, {'SiteName': '彰化(大城)', 'County': '彰化縣', 'AQI': '65', 'Pollutant': '細懸浮微粒', 'Status': '普通', 'SO2': '2.5', 'CO': '0.24', 'CO_8hr': '0.2', 'O3': '39', 'O3_8hr': '23', 'PM10': '19', 'PM2.5': '10', 'NO2': '1.6', 'NOx': '3.4

In [None]:
# How many sites in Taiwan?
# How many sites in Taipei City?
# Highest PM2.5 site name?
# Lowest PM2.5 site name?

In [30]:
import requests

r = requests.get("https://opendata.epa.gov.tw/ws/Data/AQI/?$format=json", verify=False)
aqi_data = r.json()



In [33]:
# How many sites in Taiwan?
len(aqi_data)

81

In [48]:
# How many sites in Taipei City?
ans = 0
for i in aqi_data:
    if i["County"] == "臺北市":
        ans += 1
print(ans)

7
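
The same count can be obtained for every county at once with `collections.Counter`. The sketch below runs on a small hand-made sample in the shape of the AQI response, not on the live data, so the numbers are illustrative only.

```python
from collections import Counter

# Hand-made sample in the shape of the AQI response; not a full or live dataset.
sample = [
    {"SiteName": "屏東(琉球)", "County": "屏東縣"},
    {"SiteName": "苗栗(後龍)", "County": "苗栗縣"},
    {"SiteName": "中山", "County": "臺北市"},
    {"SiteName": "萬華", "County": "臺北市"},
]

# Count how many sites fall in each county in a single pass.
sites_per_county = Counter(site["County"] for site in sample)

print(sites_per_county["臺北市"])
print(sites_per_county.most_common(1))
```

With the real response, `sample` would be replaced by `aqi_data`.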


In [53]:
import numpy as np

site_names = [aqi_data[i]['SiteName'] for i in range(len(aqi_data))]
pm25 = np.zeros(len(site_names))
for i in range(len(aqi_data)):
    if aqi_data[i]['PM2.5'] == '' or aqi_data[i]['PM2.5'] == 'ND':
        pm25[i] = np.nan
    else:
        pm25[i] = float(aqi_data[i]['PM2.5'])

In [62]:
max_pm25 = pm25[~np.isnan(pm25)].max()
min_pm25 = pm25[~np.isnan(pm25)].min()
print(max_pm25)
print(min_pm25)

49.0
2.0


In [63]:
# Highest PM2.5 site name? 
for idx, val in enumerate(pm25):
    if val == max_pm25:
        print(site_names[idx])

大園


In [67]:
np.array(site_names)[pm25 == max_pm25]

array(['大園'], dtype='<U6')

In [64]:
# Lowest PM2.5 site name? 
for idx, val in enumerate(pm25):
    if val == min_pm25:
        print(site_names[idx])

竹山
新營


In [68]:
np.array(site_names)[pm25 == min_pm25]

array(['竹山', '新營'], dtype='<U6')
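
For comparison, the same lookup can be sketched without `numpy`. The helpers `parse_pm25` and `sites_at_extreme` are hypothetical names, and the sample is hand-made: its PM2.5 values reuse the results above, plus one invented `ND` record to exercise the missing-value branch.

```python
def parse_pm25(raw):
    """Convert the API's string value to float; '' and 'ND' mean no reading."""
    if raw in ("", "ND"):
        return None
    return float(raw)


def sites_at_extreme(records, pick):
    """Return (value, [site names]) for the extreme chosen by pick (min or max)."""
    readings = [(parse_pm25(r["PM2.5"]), r["SiteName"]) for r in records]
    valid = [(value, name) for value, name in readings if value is not None]
    extreme = pick(value for value, _ in valid)
    # Keep every site that ties for the extreme value.
    return extreme, [name for value, name in valid if value == extreme]


# Hand-made sample; the last record is invented to show the 'ND' case.
sample = [
    {"SiteName": "屏東(琉球)", "PM2.5": ""},
    {"SiteName": "彰化(大城)", "PM2.5": "10"},
    {"SiteName": "大園", "PM2.5": "49"},
    {"SiteName": "竹山", "PM2.5": "2"},
    {"SiteName": "新營", "PM2.5": "2"},
    {"SiteName": "Hypothetical site", "PM2.5": "ND"},
]

print(sites_at_extreme(sample, max))
print(sites_at_extreme(sample, min))
```

Returning a list of names rather than a single name keeps ties visible, as with 竹山 and 新營 above.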

## Non Web API

- `requests`: Getting data
- `beautifulsoup4` / `pyquery`: Parsing data
- [Selector Gadget](https://chrome.google.com/webstore/detail/selectorgadget/mhjhnkcfbdhnjickkkdbjoemdmbfginb)

### Examples

- [Avengers: Endgame (2019)](https://www.imdb.com/title/tt4154796)
- [Yahoo! Kimo Stock Market](https://tw.stock.yahoo.com/d/i/rank.php?t=pri&e=tse&n=100)
- [PTT BBS: NBA board](https://www.ptt.cc/bbs/NBA/index.html)

In [70]:
import requests

r = requests.get("https://www.imdb.com/title/tt4154796")
html_str = r.text
print(len(html_str))

241126


## `beautifulsoup4`

- `from bs4 import BeautifulSoup`
- `soup = BeautifulSoup(html_str)`
- `soup.find()`: Returns the first tag that matches a tag name
- `soup.find_all()`: Returns a list of all tags that match a tag name
- `soup.select()`: Returns a list of all tags that match a CSS selector

- `bs4.BeautifulSoup`
    - `bs4.element.Tag`
        - `.text`
        - `.get("<ATTR>")`

In [71]:
from bs4 import BeautifulSoup

soup = BeautifulSoup(html_str)

In [81]:
type(soup)

bs4.BeautifulSoup

In [82]:
soup.find("h1")

<h1 class="">Avengers: Endgame <span id="titleYear">(<a href="/year/2019/">2019</a>)</span> </h1>

In [83]:
print(type(soup.find("h1")))

<class 'bs4.element.Tag'>


In [84]:
soup.find("h1").text

'Avengers: Endgame\xa0(2019) '

In [88]:
soup.find("a")

<a class="navbarSprite" href="/" id="home_img" title="Home"></a>

In [86]:
soup.find("a").get("class")

['navbarSprite']

In [87]:
soup.find("a").get("href")

'/'

In [89]:
soup.find("a").get("id")

'home_img'

In [76]:
soup.find_all("h2")

[<h2><div class="checkin-error">Error</div></h2>,
 <h2><div class="checkin-success">Added to Your Check-Ins.</div></h2>,
 <h2>Videos</h2>,
 <h2>Photos</h2>,
 <h2 class="rec_heading_wrapper">
 <span class="rec_heading" data-spec="p13nsims:tt4154796">More Like This </span>
 </h2>,
 <h2>Cast</h2>,
 <h2>Storyline</h2>,
 <h2>Details</h2>,
 <h2>Did You Know?</h2>,
 <h2>Frequently Asked Questions</h2>,
 <h2>User Reviews</h2>,
 <h2>Contribute to This Page</h2>]

In [79]:
soup.select(".ratingValue span")

[<span itemprop="ratingValue">8.8</span>,
 <span class="grey">/</span>,
 <span class="grey" itemprop="bestRating">10</span>]

In [95]:
soup.find_all("img")[2].get("src")

'https://m.media-amazon.com/images/M/MV5BMTc5MDE2ODcwNV5BMl5BanBnXkFtZTgwMzI2NzQ2NzM@._V1_UX182_CR0,0,182,268_AL_.jpg'

## pyquery

- `from pyquery import PyQuery as pq`
- `d = pq(html_str)`
- `d("<SELECTOR>")`: Finding specific data via CSS Selector
- `pyquery.PyQuery`
    - `.items()`: Iterates over the matches as `PyQuery` objects
        - `.text()`
        - `.attr("<ATTR>")`

In [97]:
from pyquery import PyQuery as pq

d = pq(html_str)
type(d)

pyquery.pyquery.PyQuery

In [112]:
for i in d(".ratingValue span").items():
    print(i.text())

8.8
/
10


In [122]:
for i in d(".poster img").items():
    print(i.attr("src"))

https://m.media-amazon.com/images/M/MV5BMTc5MDE2ODcwNV5BMl5BanBnXkFtZTgwMzI2NzQ2NzM@._V1_UX182_CR0,0,182,268_AL_.jpg


In [125]:
for i in soup.find_all("time"):
    print(i.text.strip())

3h 1min
181 min


In [129]:
for i in soup.select(".primary_photo+ td a"):
    print(i.text.strip())

Robert Downey Jr.
Chris Evans
Mark Ruffalo
Chris Hemsworth
Scarlett Johansson
Jeremy Renner
Don Cheadle
Paul Rudd
Benedict Cumberbatch
Chadwick Boseman
Brie Larson
Tom Holland
Karen Gillan
Zoe Saldana
Evangeline Lilly


In [131]:
for i in soup.select(".primary_photo+ td a"):
    route = i.get("href")
    print("https://www.imdb.com" + route)

https://www.imdb.com/name/nm0000375/
https://www.imdb.com/name/nm0262635/
https://www.imdb.com/name/nm0749263/
https://www.imdb.com/name/nm1165110/
https://www.imdb.com/name/nm0424060/
https://www.imdb.com/name/nm0719637/
https://www.imdb.com/name/nm0000332/
https://www.imdb.com/name/nm0748620/
https://www.imdb.com/name/nm1212722/
https://www.imdb.com/name/nm1569276/
https://www.imdb.com/name/nm0488953/
https://www.imdb.com/name/nm4043618/
https://www.imdb.com/name/nm2394794/
https://www.imdb.com/name/nm0757855/
https://www.imdb.com/name/nm1431940/


In [132]:
for i in d("time").items():
    print(i.text())

3h 1min
181 min


In [133]:
for i in d(".primary_photo+ td a").items():
    print(i.text())

Robert Downey Jr.
Chris Evans
Mark Ruffalo
Chris Hemsworth
Scarlett Johansson
Jeremy Renner
Don Cheadle
Paul Rudd
Benedict Cumberbatch
Chadwick Boseman
Brie Larson
Tom Holland
Karen Gillan
Zoe Saldana
Evangeline Lilly


In [135]:
for i in d(".primary_photo+ td a").items():
    route = i.attr("href")
    print("https://www.imdb.com" + route)

https://www.imdb.com/name/nm0000375/
https://www.imdb.com/name/nm0262635/
https://www.imdb.com/name/nm0749263/
https://www.imdb.com/name/nm1165110/
https://www.imdb.com/name/nm0424060/
https://www.imdb.com/name/nm0719637/
https://www.imdb.com/name/nm0000332/
https://www.imdb.com/name/nm0748620/
https://www.imdb.com/name/nm1212722/
https://www.imdb.com/name/nm1569276/
https://www.imdb.com/name/nm0488953/
https://www.imdb.com/name/nm4043618/
https://www.imdb.com/name/nm2394794/
https://www.imdb.com/name/nm0757855/
https://www.imdb.com/name/nm1431940/


In [137]:
import requests
from bs4 import BeautifulSoup
from pyquery import PyQuery
import pandas as pd

r = requests.get("https://www.imdb.com/title/tt4154796/releaseinfo/")
html_str = r.text

In [153]:
soup = BeautifulSoup(html_str)
# Parsing release information of Avengers: Endgame (2019)
countries = [i.text.strip() for i in soup.select(".release-date-item__country-name a")]
release_dates = [i.text for i in soup.select(".release-date-item__date")]
# Summarizing number of countries grouped by release date
df = pd.DataFrame()
df["country"] = countries
df["release_date"] = release_dates
print(df.groupby("release_date")["country"].count())
# When is Taiwan's release date?
print(df[df["country"] == "Taiwan"]["release_date"].values[0])

release_date
22 April 2019     1
23 April 2019     1
24 April 2019    31
25 April 2019    21
26 April 2019    14
28 April 2019     1
29 April 2019     1
Name: country, dtype: int64
24 April 2019


In [158]:
d = PyQuery(html_str)
# Parsing release information of Avengers: Endgame (2019)
countries = [i.text() for i in d(".release-date-item__country-name a").items()]
release_dates = [i.text() for i in d(".release-date-item__date").items()]
# Summarizing number of countries grouped by release date
df = pd.DataFrame()
df["country"] = countries
df["release_date"] = release_dates
print(df.groupby("release_date")["country"].count())
# When is Taiwan's release date?
print(df[df["country"] == "Taiwan"]["release_date"].values[0])

release_date
22 April 2019     1
23 April 2019     1
24 April 2019    31
25 April 2019    21
26 April 2019    14
28 April 2019     1
29 April 2019     1
Name: country, dtype: int64
24 April 2019
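
The release dates above are plain strings, so grouping sorts them alphabetically; "1 May 2019" would sort before "22 April 2019". A minimal sketch of parsing them into real dates first, assuming the "day month-name year" format shown in the output and an English locale for `%B`. The rows are a hand-made sample, and only the Taiwan date comes from the output above.

```python
from collections import Counter
from datetime import datetime

# Hand-made sample rows; only the Taiwan date is taken from the output above.
rows = [
    ("Taiwan", "24 April 2019"),
    ("Country A", "24 April 2019"),
    ("Country B", "26 April 2019"),
    ("Country C", "22 April 2019"),
]


def parse_release_date(text):
    """Parse a string such as '24 April 2019' into a datetime.date."""
    return datetime.strptime(text.strip(), "%d %B %Y").date()


# Count countries per release date, then print in chronological order.
counts = Counter(parse_release_date(date_text) for _, date_text in rows)
for day in sorted(counts):
    print(day.isoformat(), counts[day])
```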


In [174]:
def get_imdb_movie_data(movie_url):
    r = requests.get(movie_url)
    html_str = r.text
    d = pq(html_str)
    rating = float([i.text() for i in d(".ratingValue span").items()][0])
    r = requests.get(movie_url + "releaseinfo")
    html_str = r.text
    d = pq(html_str)
    countries = [i.text() for i in d(".release-date-item__country-name a").items()]
    release_dates = [i.text() for i in d(".release-date-item__date").items()]
    release_dates_dict = {
        "country": countries,
        "release_date": release_dates
    }
    return rating, release_dates_dict

In [178]:
def get_movie(movie_title):
    q_url = "https://www.imdb.com/find?q={}&s=tt&ttype=ft&ref_=fn_ft".format(movie_title)
    r = requests.get(q_url)
    html_str = r.text
    soup = BeautifulSoup(html_str)
    movie_url = soup.select(".result_text a")[0].get("href")
    movie_url = "https://www.imdb.com" + movie_url
    movie = get_imdb_movie_data(movie_url)
    return movie
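
Both functions above build URLs by string concatenation, which is fragile when a title contains special characters or when a link carries a query string. One possible way to make this more defensive uses `urllib.parse` from the standard library. The helper names are hypothetical, and the example `href` with a `?ref_=...` suffix is an assumed shape for a search-result link, not something confirmed by these notes.

```python
from urllib.parse import urlencode, urljoin, urlsplit

BASE_URL = "https://www.imdb.com"


def build_search_url(movie_title):
    """Build the search URL with the title percent-encoded."""
    query = urlencode({"q": movie_title, "s": "tt", "ttype": "ft", "ref_": "fn_ft"})
    return "{}/find?{}".format(BASE_URL, query)


def release_info_url(href):
    """Turn a title link into its release-info URL, dropping any query string."""
    path = urlsplit(href).path
    if not path.endswith("/"):
        path += "/"
    return urljoin(BASE_URL, path + "releaseinfo")


print(build_search_url("Avengers: Endgame"))
# Assumed shape of a search-result href, used only for illustration.
print(release_info_url("/title/tt4154796/?ref_=fn_ft_tt_1"))
```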

In [179]:
get_movie("Avengers: Endgame")

(8.8,
 {'country': ['USA',
   'Russia',
   'United Arab Emirates',
   'Austria',
   'Australia',
   'Belgium',
   'China',
   'Colombia',
   'Cyprus',
   'Germany',
   'Denmark',
   'Egypt',
   'Finland',
   'France',
   'Greece',
   'Hong Kong',
   'Indonesia',
   'Israel',
   'Italy',
   'South Korea',
   'Lebanon',
   'Malaysia',
   'Netherlands',
   'Norway',
   'New Zealand',
   'Philippines',
   'Paraguay',
   'Saudi Arabia',
   'Sweden',
   'Singapore',
   'Thailand',
   'Taiwan',
   'Kosovo',
   'Argentina',
   'Brazil',
   'Costa Rica',
   'Spain',
   'UK',
   'Hungary',
   'Ireland',
   'Cambodia',
   'Kuwait',
   'Lithuania',
   'Montenegro',
   'Nigeria',
   'Peru',
   'Poland',
   'Portugal',
   'Romania',
   'Serbia',
   'Slovakia',
   'Turkey',
   'Ukraine',
   'Uruguay',
   'Bangladesh',
   'Bulgaria',
   'Canada',
   'Estonia',
   'India',
   'Japan',
   'Sri Lanka',
   'Lithuania',
   'Morocco',
   'Mexico',
   'Nepal',
   'Pakistan',
   'USA',
   'Vietnam',
   'Armen

## R Environment

```r
pkgs <- c("jsonlite", "rvest", "httr", "magrittr")
install.packages(pkgs)
```

### Web API

- `jsonlite::fromJSON()`
    - Object: `list` as output
    - Array of Objects: `data.frame` as output

### Non Web API

- `rvest`
    - `read_html()`: Fetches and parses a page, similar to `requests.get()` followed by `BeautifulSoup()`
    - `html_nodes()`: Selects nodes by CSS selector, similar to `soup.select()`
    - `html_text()`: Extracts the text content of tags
    - `html_attr()`: Extracts an attribute value from tags