## Python Environment
- [Download Miniconda](https://docs.conda.io/en/latest/miniconda.html)
- Installing jupyter
```bash
pip install jupyter
```
- Check environments
```bash
conda env list
```
- Creating a environment for web crawling
```bash
# crawler as the env name
conda create --name crawler python=3
```
- Activate environment
```bash
conda activate crawler
```
- Install required packages for web crawling: `requirements.txt`
    - requests: Getting data
    - beautifulsoup/pyquery: Parsing data
    - selenium: Browser automation
    - numpy/pandas: Data wrangling
    - ipykernel: Connecting jupyter with environments
```
requests
beautifulsoup4
pyquery
selenium
numpy
pandas
ipykernel
```
- Installing required packages
```bash
pip install -r requirements.txt
```
- Check Jupyker kernels
```bash
jupyter kernelspec list
```
- Connecting kernel with environment
```bash
python -m ipykernel install --user --name crawler --display-name "crawler"
```

In [19]:
import requests

r = requests.get("http://data.nba.net/prod/v1/20190514/scoreboard.json") # get json data
print(r.status_code)

200


In [20]:
today_scoreboard = r.json()
print(type(today_scoreboard))

<class 'dict'>


In [21]:
today_scoreboard.keys()

dict_keys(['_internal', 'numGames', 'games'])

In [22]:
western_final_g1 = today_scoreboard["games"][0]
western_final_g1.keys()

dict_keys(['seasonStageId', 'seasonYear', 'gameId', 'arena', 'isGameActivated', 'statusNum', 'extendedStatusNum', 'startTimeEastern', 'startTimeUTC', 'startDateEastern', 'clock', 'isBuzzerBeater', 'isPreviewArticleAvail', 'isRecapArticleAvail', 'tickets', 'hasGameBookPdf', 'isStartTimeTBD', 'nugget', 'attendance', 'gameDuration', 'tags', 'playoffs', 'period', 'vTeam', 'hTeam', 'watch'])

In [23]:
print(western_final_g1["hTeam"]["score"])
print(western_final_g1["vTeam"]["score"])

75
67


## Web Scraping Methods

- Web API
    - `.json`
    - `requests`
- Non Web API
    - `.html`
    - `requests` + Parser(beautifulsoup4, pyquery)

In [25]:
import requests

r = requests.get("https://ecshweb.pchome.com.tw/search/v3.3/all/results?q=macbook&page=1&sort=sale/dc", verify=False) # get json data
print(r.status_code)
print(type(r))


200
<class 'requests.models.Response'>


In [27]:
print(r.json())

{'QTime': 105, 'totalRows': 14470, 'totalPage': 100, 'range': {'min': '', 'max': ''}, 'cateName': '', 'q': 'macbook', 'subq': '', 'token': ['macbook'], 'prods': [{'Id': 'DYAJBG-19009S5Q7', 'cateId': 'DYAJCW', 'picS': '/items/DYAJBG19009S5Q7/000002_1553827164.jpg', 'picB': '/items/DYAJBG19009S5Q7/000001_1554886523.jpg', 'name': 'MacBook Air 13-inch: 1.8GHz dual-core Intel Core i5, 128GB (MQD32TA/A)', 'describe': '降2千★再搭原廠滑鼠MacBook Air 13第五代 i5 / 8GB / 128GB / 1.8GHz dual\\r\\n降2千★再搭原廠滑鼠活動日期：2019 04 12(五) 15:00 -2019 06 21(五) 23:59\r\nmacbook air  128gb (市價$31900) + \r\nmagic mouse 2 (市價$2290) \r\n數量有限，售完為止\r\n網路價$34190．驚喜優惠價↘$２９９００\r\n\r\n● intel core  i5 處理器\r\n● intel hd graphics 6000\r\n● ssd 儲存裝置\r\n● 長達 12 小時電池續航力\r\n● 802.11 ac wi-fi\r\n● multi - touch 觸控式軌跡板\r\n● 最長可達 30 天待機時間\r\n● 節能低耗又兼具高效能的設計\r\n\r\n注意事項', 'price': 29900, 'originPrice': 29900, 'author': '', 'brand': '', 'publishDate': '', 'sellerId': '', 'isPChome': 1, 'isNC17': 0, 'couponActid': [], 'BU': 'ec'}, {'Id': 'DYAJB

In [28]:
print(r.content)

b'{"QTime":105,"totalRows":14470,"totalPage":100,"range":{"min":"","max":""},"cateName":"","q":"macbook","subq":"","token":["macbook"],"prods":[{"Id":"DYAJBG-19009S5Q7","cateId":"DYAJCW","picS":"\\/items\\/DYAJBG19009S5Q7\\/000002_1553827164.jpg","picB":"\\/items\\/DYAJBG19009S5Q7\\/000001_1554886523.jpg","name":"MacBook Air 13-inch: 1.8GHz dual-core Intel Core i5, 128GB (MQD32TA\\/A)","describe":"\\u964d2\\u5343\\u2605\\u518d\\u642d\\u539f\\u5ee0\\u6ed1\\u9f20MacBook Air 13\\u7b2c\\u4e94\\u4ee3 i5 \\/ 8GB \\/ 128GB \\/ 1.8GHz dual\\\\r\\\\n\\u964d2\\u5343\\u2605\\u518d\\u642d\\u539f\\u5ee0\\u6ed1\\u9f20\\u6d3b\\u52d5\\u65e5\\u671f\\uff1a2019 04 12(\\u4e94) 15:00 -2019 06 21(\\u4e94) 23:59\\r\\nmacbook air  128gb (\\u5e02\\u50f9$31900) + \\r\\nmagic mouse 2 (\\u5e02\\u50f9$2290) \\r\\n\\u6578\\u91cf\\u6709\\u9650\\uff0c\\u552e\\u5b8c\\u70ba\\u6b62\\r\\n\\u7db2\\u8def\\u50f9$34190\\uff0e\\u9a5a\\u559c\\u512a\\u60e0\\u50f9\\u2198$\\uff12\\uff19\\uff19\\uff10\\uff10\\r\\n\\r\\n\\u25cf int

In [30]:
# 非結構化資料
import requests

test = requests.get("https://ecshweb.pchome.com.tw/search/v3.3/all/categories?q=macbook", verify=False) # get json data
print(test.status_code)
print(type(test))
print(test.text)

200
<class 'requests.models.Response'>
[{"Id":"Q","name":"\u8cfc\u7269\u4e2d\u5fc3","qty":3090,"nodes":[{"Id":"QAAO","name":"Apple\/\u914d\u4ef6","qty":941,"nodes":[{"Id":"QAAO0E","qty":337},{"Id":"QAAO0D","qty":106},{"Id":"QAAO6E","qty":71},{"Id":"QAAO6C","qty":64},{"Id":"QAAO6D","qty":62},{"Id":"QAAO6B","qty":38},{"Id":"QAAO0Y","qty":33},{"Id":"QAAO5I","qty":24},{"Id":"QAAO6A","qty":24},{"Id":"QAAOAP","qty":19},{"Id":"QAAO5F","qty":15},{"Id":"QAAO2F","qty":13},{"Id":"QAAO1K","qty":11},{"Id":"QAAO1B","qty":6},{"Id":"QAAO32","qty":6},{"Id":"QAAOBT","qty":6},{"Id":"QAAO1E","qty":5},{"Id":"QAAO5H","qty":5},{"Id":"QAAO0U","qty":4},{"Id":"QAAO21","qty":4},{"Id":"QAAO40","qty":4},{"Id":"QAAO5D","qty":4},{"Id":"QAAO5G","qty":4},{"Id":"QAAO71","qty":4},{"Id":"QAAO78","qty":4},{"Id":"QAAO9O","qty":4},{"Id":"QAAOBR","qty":4},{"Id":"QAAO2T","qty":3},{"Id":"QAAO37","qty":3},{"Id":"QAAO3B","qty":3},{"Id":"QAAO59","qty":3},{"Id":"QAAO5K","qty":3},{"Id":"QAAO72","qty":3},{"Id":"QAAO76","qty":3},{"Id



### Exercise 1: AQI

In [2]:
import requests
r = requests.get("https://opendata.epa.gov.tw/ws/Data/AQI/?$format=json", verify=False) # get json data
print(r.status_code)

200




In [4]:
# How many sites in Taiwan?
todayAQI = r.json()
#todayAQI[:10]
print("總共有" + str(len(todayAQI)) + "間測站")

總共有81間測站


In [37]:
#todayAQI[:50]

In [5]:
# How many sited in Taipei City?
site_Taipei_City = []
for i in range(len(todayAQI)):
    if todayAQI[i]["County"] == '臺北市':
        site_Taipei_City.append(todayAQI[i]["County"])

print("台北市共有" + str(len(site_Taipei_City)) + "間測站")

台北市共有7間測站


In [9]:
# Highest PM2.5 site name?
site_names = [todayAQI[i]["SiteName"] for i in range(len(todayAQI))]
pm25 = []
for i in range(len(todayAQI)):
    if todayAQI[i]["PM2.5"] == '' or todayAQI[i]["PM2.5"] == 'ND':
        pm25.append(0)
    else:
        pm25.append(float(todayAQI[i]["PM2.5"]))
        
import pandas as pd
ser = pd.Series(pm25, index=site_names)

for i in range(len(todayAQI)):
    if todayAQI[i]["PM2.5"] == max(pm25):
        print(todayAQI[i]["SiteName"])

48.0


In [41]:
# Lowest PM2.5 site name?
site_names = [todayAQI[i]["SiteName"] for i in range(len(todayAQI))]
pm25 = []
for i in range(len(todayAQI)):
    if todayAQI[i]["PM2.5"] == '' or todayAQI[i]["PM2.5"] == 'ND':
        pm25.append(0)
    else:
        pm25.append(float(todayAQI[i]["PM2.5"]))
        
import pandas as pd
ser = pd.Series(pm25, index=site_names)
print(max(ser.sort_values()))

49.0


## Non Web API

- `requests` Getting data
- `beautifulsoup4` / `pyquery` Parsing data

### Examples

- [Avengers: Endgame (2019)](https://www.imdb.com/title/tt4154796)
- [Yahoo! 奇摩股市](https://tw.stock.yahoo.com/d/i/rank.php?t=pri&e=tse&n=100)
- [PTT 實業坊](https://www.ptt.cc/bbs/NBA/index.html)

In [11]:
import requests
r = requests.get("https://www.imdb.com/title/tt4154796/", verify=False) # get nonWEBAPI data
print(r.status_code)
html_str = r.text
print(len(html_str))



200
240075


In [12]:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_str)

In [16]:
soup.find("h1")

<h1 class="">Avengers: Endgame <span id="titleYear">(<a href="/year/2019/">2019</a>)</span> </h1>

In [23]:
soup.find("a").get("id")

'home_img'

In [22]:
soup.find_all("h2")
print(type(soup.find_all("h2")))
print(type(soup.find("h2")))

<class 'bs4.element.ResultSet'>
<class 'bs4.element.Tag'>


In [19]:
soup.select(".ratingValue span")

[<span itemprop="ratingValue">8.8</span>,
 <span class="grey">/</span>,
 <span class="grey" itemprop="bestRating">10</span>]

## pyquery

- `from pyquery import PyQuery as pq`
- `d = pq(html_str)`
- `d("<SELECTOR>")`: Finding specific data via CSS Selector
- PyQuery
    - HtmlElement.items()
        - .text()
        - .attr("<ATTR>")

In [24]:
from pyquery import PyQuery as pq
d = pq(html_str)
type(d)

for i in d(".ratingValue span").items():
    print(i.text())
    
for i in d("img").items():
    print(i)

pyquery.pyquery.PyQuery

## Demo
#### 片長、卡司

In [27]:
import requests
r = requests.get("https://www.imdb.com/title/tt4154796/", verify=False) # get nonWEBAPI data
print(r.status_code)
html_str = r.text
print(len(html_str))
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_str)



200
242381


In [28]:
# 片長
for i in soup.find_all("time"):
    print(i.text.strip())

3h 1min
181 min


In [32]:
# 卡司表
for i in soup.select(".primary_photo+ td a"):
    route = i.get("href")
    print(i.text.strip(), "https://www.imdb.com/" + route)

Robert Downey Jr. https://www.imdb.com//name/nm0000375/

Chris Evans https://www.imdb.com//name/nm0262635/

Mark Ruffalo https://www.imdb.com//name/nm0749263/

Chris Hemsworth https://www.imdb.com//name/nm1165110/

Scarlett Johansson https://www.imdb.com//name/nm0424060/

Jeremy Renner https://www.imdb.com//name/nm0719637/

Don Cheadle https://www.imdb.com//name/nm0000332/

Paul Rudd https://www.imdb.com//name/nm0748620/

Benedict Cumberbatch https://www.imdb.com//name/nm1212722/

Chadwick Boseman https://www.imdb.com//name/nm1569276/

Brie Larson https://www.imdb.com//name/nm0488953/

Tom Holland https://www.imdb.com//name/nm4043618/

Karen Gillan https://www.imdb.com//name/nm2394794/

Zoe Saldana https://www.imdb.com//name/nm0757855/

Evangeline Lilly https://www.imdb.com//name/nm1431940/



## 練習題
- 把所有Released Date抓出來，並做Summary

In [1]:
import requests
r = requests.get("https://www.imdb.com/title/tt4154796/releaseinfo?ref_=tt_ov_inf", verify=False) # get nonWEBAPI data
print(r.status_code)
html_str = r.text
print(len(html_str))
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_str)



200
109850


In [11]:
date = []
for i in soup.select(".release-date-item__date"):
    d = i.text.strip()
    date.append(d)

country = []
for i in soup.select(".release-date-item__country-name a"):
    d = i.text.strip()
    country.append(d)

import pandas as pd
ser = pd.Series(date, index=country)
print("Parsing Release Information of Avengers: Endgame(2019)")
print(ser.head())
# ==================================================================
print()
print("Method I")
print("Summarizing Number of Countries Grouped by Released Date")
print(ser.value_counts())
print()
print()
# ==============================
print("Method II")
df = pd.DataFrame()
df['Country'] = country
df['Released Date'] = date
summary = df.groupby('Released Date').count()
print(summary)
# ==================================================================
print()
print("The Date Released in Taiwan is " + str(ser['Taiwan']))

Parsing Release Information of Avengers: Endgame(2019)
USA                     22 April 2019
Russia                  23 April 2019
United Arab Emirates    24 April 2019
Austria                 24 April 2019
Australia               24 April 2019
dtype: object

Method I
Summarizing Number of Countries Grouped by Released Date
24 April 2019    31
25 April 2019    21
26 April 2019    14
22 April 2019     1
28 April 2019     1
23 April 2019     1
29 April 2019     1
dtype: int64


Method II
               Country
Released Date         
22 April 2019        1
23 April 2019        1
24 April 2019       31
25 April 2019       21
26 April 2019       14
28 April 2019        1
29 April 2019        1

The Date Released in Taiwan is 24 April 2019


In [None]:
from bs4 import BeautifulSoup

def get_imdb_movie_data(movie_url):
    r = requests.get(movie_url, verify=False) 
    html_str = r.text
    soup = BeautifulSoup(html_str)
