# Web Crawling

## In-Class Exercises

### Web API: AQI DATA

```python
import requests

r = requests.get("https://opendata.epa.gov.tw/ws/Data/AQI/?$format=json", verify=False)
```

- How many monitoring sites are there in Taiwan?
- How many monitoring sites are there in Taipei City?
- Which site has the highest PM2.5 reading?
- Which site has the lowest PM2.5 reading?

### Non Web API: IMDB DATA

```python
import requests

r = requests.get("https://www.imdb.com/title/tt4154796/releaseinfo/")
```

- Parse the release information for Avengers: Endgame (2019)
- Count the number of countries per release date
- What is Taiwan's release date?

### Selenium: IMDB DATA

```python
# get_movie_info()
#movie_titles = ["The Avengers (2012)", "Avengers: Age of Ultron (2015)", "Avengers: Infinity War (2018)", "Avengers: Endgame (2019)"]
#get_movie_info(movie_titles) # return list / dict(key: movie_title)
#print(rating)
#print(movie_time)
#print(genre)
#print(cast)
#print(countries)
#print(release_dates)
```

## 2019-05-15

## Python Environment

- [Download Miniconda](https://docs.conda.io/en/latest/miniconda.html)
- Installing jupyter

```bash
pip install jupyter
```

- Check environments

```bash
conda env list
```

- Create an environment for web crawling

```bash
# crawler as the env name
conda create --name crawler python=3
```

- Activate environment

```bash
conda activate crawler
```

- Install required packages for web crawling: `requirements.txt`
    - `requests`: Getting data
    - `beautifulsoup4`/`pyquery`: Parsing data
    - `selenium`: Browser automation
    - `numpy`/`pandas`: Data wrangling
    - `ipykernel`: Connecting jupyter with environments

```
requests
beautifulsoup4
pyquery
selenium
numpy
pandas
ipykernel
```

- Installing required packages

```bash
pip install -r requirements.txt
```

- Check Jupyter kernels

```bash
jupyter kernelspec list
```

- Connecting kernel with environment

```bash
# Run this command while "crawler" is activated
python -m ipykernel install --user --name crawler --display-name "crawler"
```

## Hello world

In [12]:
import requests

r = requests.get("http://data.nba.net/prod/v1/20190514/scoreboard.json") # GET
print(r.status_code)

200


In [13]:
today_scoreboard = r.json()
print(type(today_scoreboard))

<class 'dict'>


In [14]:
today_scoreboard.keys()

dict_keys(['_internal', 'numGames', 'games'])

In [15]:
western_final_g1 = today_scoreboard["games"][0]
western_final_g1.keys()

dict_keys(['seasonStageId', 'seasonYear', 'gameId', 'arena', 'isGameActivated', 'statusNum', 'extendedStatusNum', 'startTimeEastern', 'startTimeUTC', 'startDateEastern', 'clock', 'isBuzzerBeater', 'isPreviewArticleAvail', 'isRecapArticleAvail', 'tickets', 'hasGameBookPdf', 'isStartTimeTBD', 'nugget', 'attendance', 'gameDuration', 'tags', 'playoffs', 'period', 'vTeam', 'hTeam', 'watch'])

In [16]:
western_final_g1["vTeam"]

{'teamId': '1610612757',
 'triCode': 'POR',
 'win': '53',
 'loss': '29',
 'seriesWin': '0',
 'seriesLoss': '0',
 'score': '53',
 'linescore': [{'score': '23'},
  {'score': '22'},
  {'score': '8'},
  {'score': '0'}]}

In [17]:
western_final_g1["hTeam"]

{'teamId': '1610612744',
 'triCode': 'GSW',
 'win': '57',
 'loss': '25',
 'seriesWin': '0',
 'seriesLoss': '0',
 'score': '67',
 'linescore': [{'score': '27'},
  {'score': '27'},
  {'score': '13'},
  {'score': '0'}]}
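
Every number in the scoreboard, including `score` and each `linescore` entry, comes back as a string, so it has to be converted before any arithmetic. A minimal sketch, reusing the `vTeam` values printed above as a literal dict (the helper name `total_from_linescore` is made up for illustration):

```python
# Values copied from the vTeam output above; the API returns every number as a string.
v_team = {
    "triCode": "POR",
    "score": "53",
    "linescore": [{"score": "23"}, {"score": "22"}, {"score": "8"}, {"score": "0"}],
}


def total_from_linescore(team):
    """Sum the per-quarter scores after converting each one to int."""
    return sum(int(quarter["score"]) for quarter in team["linescore"])


quarter_total = total_from_linescore(v_team)
# Compare the summed quarters with the reported total score.
print(v_team["triCode"], quarter_total, quarter_total == int(v_team["score"]))
```

The same helper works on `hTeam`, since both dicts share the same shape.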

## Web Scraping Methods

- Web API
    - `.json`
    - `requests`
- Non Web API
    - `.html`
    - `requests` + parser (`beautifulsoup4`, `pyquery`)
- Browser Automation
    - `selenium`

## JSONView

Renders JSON data in the browser in a cleaner, more readable layout.

<https://chrome.google.com/webstore/detail/jsonview/chklaanhfefbnpoihckbnefhakgolnmc>

## Web API Examples

- [data.nba.net](http://data.nba.net/prod/v1/today.json)
- [Air Quality Index (AQI)](https://opendata.epa.gov.tw/ws/Data/AQI/?$format=json)
- [PChome Online Shopping (PChome 線上購物)](https://ecshweb.pchome.com.tw/search/v3.3/all/results?q=macbook&page=1&sort=sale/dc)

### Important attrs / methods

- `r = requests.get(URL)`
- `r.status_code`
- `r.json()`: `dict` / `list`
- `r.content`: `bytes`
- `r.text`: `str`
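
How `r.content`, `r.text`, and `r.json()` relate to each other can be sketched without a network call. The payload below is a made-up one-record stand-in for the AQI response, and UTF-8 is assumed as the encoding; with a real response, `requests` works the encoding out itself.

```python
import json

# Hypothetical one-record payload standing in for r.content (raw bytes off the wire).
content = '[{"SiteName": "大園", "PM2.5": "49"}]'.encode("utf-8")

text = content.decode("utf-8")  # plays the role of r.text: bytes decoded to str
data = json.loads(text)         # plays the role of r.json(): str parsed into list/dict

print(type(content).__name__, type(text).__name__, type(data).__name__)
print(data[0]["SiteName"])
```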

In [19]:
import requests

r = requests.get("https://opendata.epa.gov.tw/ws/Data/AQI/?$format=json", verify=False)
print(type(r))



<class 'requests.models.Response'>


In [21]:
print(r.status_code)

200


In [22]:
print(r.json())

[{'SiteName': '屏東(琉球)', 'County': '屏東縣', 'AQI': '', 'Pollutant': '', 'Status': '設備維護', 'SO2': '', 'CO': '', 'CO_8hr': '0.2', 'O3': '28', 'O3_8hr': '22', 'PM10': '', 'PM2.5': '', 'NO2': '', 'NOx': '', 'NO': '', 'WindSpeed': '2.3', 'WindDirec': '176', 'PublishTime': '2019-05-15 11:00', 'PM2.5_AVG': '', 'PM10_AVG': '15', 'SO2_AVG': '2', 'Longitude': '120.377222', 'Latitude': '22.352222'}, {'SiteName': '苗栗(後龍)', 'County': '苗栗縣', 'AQI': '78', 'Pollutant': '細懸浮微粒', 'Status': '普通', 'SO2': '', 'CO': '', 'CO_8hr': '0.3', 'O3': '', 'O3_8hr': '13', 'PM10': '', 'PM2.5': '', 'NO2': '', 'NOx': '', 'NO': '', 'WindSpeed': '2.2', 'WindDirec': '280', 'PublishTime': '2019-05-15 11:00', 'PM2.5_AVG': '26', 'PM10_AVG': '41', 'SO2_AVG': '3', 'Longitude': '120.786028', 'Latitude': '24.616369'}, {'SiteName': '彰化(大城)', 'County': '彰化縣', 'AQI': '65', 'Pollutant': '細懸浮微粒', 'Status': '普通', 'SO2': '2.5', 'CO': '0.24', 'CO_8hr': '0.2', 'O3': '39', 'O3_8hr': '23', 'PM10': '19', 'PM2.5': '10', 'NO2': '1.6', 'NOx': '3.4

In [None]:
# How many sites in Taiwan?
# How many sites in Taipei City?
# Highest PM2.5 site name?
# Lowest PM2.5 site name?

In [30]:
import requests

r = requests.get("https://opendata.epa.gov.tw/ws/Data/AQI/?$format=json", verify=False)
aqi_data = r.json()



In [33]:
# How many sites in Taiwan?
len(aqi_data)

81

In [48]:
# How many sites in Taipei City?
ans = 0
for i in aqi_data:
    if i["County"] == "臺北市":
        ans += 1
print(ans)

7


In [53]:
import numpy as np

site_names = [aqi_data[i]['SiteName'] for i in range(len(aqi_data))]
pm25 = np.zeros(len(site_names))
for i in range(len(aqi_data)):
    if aqi_data[i]['PM2.5'] == '' or aqi_data[i]['PM2.5'] == 'ND':
        pm25[i] = np.nan
    else:
        pm25[i] = float(aqi_data[i]['PM2.5'])

In [62]:
max_pm25 = pm25[~np.isnan(pm25)].max()
min_pm25 = pm25[~np.isnan(pm25)].min()
print(max_pm25)
print(min_pm25)

49.0
2.0


In [63]:
# Highest PM2.5 site name? 
for idx, val in enumerate(pm25):
    if val == max_pm25:
        print(site_names[idx])

大園


In [67]:
np.array(site_names)[pm25 == max_pm25]

array(['大園'], dtype='<U6')

In [64]:
# Lowest PM2.5 site name? 
for idx, val in enumerate(pm25):
    if val == min_pm25:
        print(site_names[idx])

竹山
新營


In [68]:
np.array(site_names)[pm25 == min_pm25]

array(['竹山', '新營'], dtype='<U6')
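
The same four questions can also be answered without NumPy. The sketch below uses a small made-up sample in place of `aqi_data`, and treats `''` and `'ND'` as missing readings, as the cells above do. The helper `to_float` is hypothetical.

```python
# Invented sample; the real aqi_data has one dict per monitoring site.
sample = [
    {"SiteName": "大園", "County": "桃園市", "PM2.5": "49"},
    {"SiteName": "竹山", "County": "南投縣", "PM2.5": "2"},
    {"SiteName": "新營", "County": "臺南市", "PM2.5": "2"},
    {"SiteName": "士林", "County": "臺北市", "PM2.5": "ND"},
    {"SiteName": "中山", "County": "臺北市", "PM2.5": ""},
]


def to_float(raw):
    """Return the reading as float, or None when the site reports no value."""
    return None if raw in ("", "ND") else float(raw)


n_sites = len(sample)
n_taipei = sum(1 for site in sample if site["County"] == "臺北市")

# Keep only the sites that actually reported a PM2.5 value.
readings = {}
for site in sample:
    value = to_float(site["PM2.5"])
    if value is not None:
        readings[site["SiteName"]] = value

# Lists rather than single names, because ties are possible.
highest = [name for name, v in readings.items() if v == max(readings.values())]
lowest = [name for name, v in readings.items() if v == min(readings.values())]
print(n_sites, n_taipei, highest, lowest)
```

Dropping missing readings before calling `max` and `min` plays the same role as the `~np.isnan(pm25)` mask in the NumPy version.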

## Non Web API

- `requests`: Getting data
- `beautifulsoup4` / `pyquery`: Parsing data
- [Selector Gadget](https://chrome.google.com/webstore/detail/selectorgadget/mhjhnkcfbdhnjickkkdbjoemdmbfginb)

### Examples

- [Avengers: Endgame (2019)](https://www.imdb.com/title/tt4154796)
- [Yahoo! 奇摩股市](https://tw.stock.yahoo.com/d/i/rank.php?t=pri&e=tse&n=100)
- [PTT 實業坊](https://www.ptt.cc/bbs/NBA/index.html)

In [70]:
import requests

r = requests.get("https://www.imdb.com/title/tt4154796")
html_str = r.text
print(len(html_str))

241126


## `beautifulsoup4`

- `from bs4 import BeautifulSoup`
- `soup = BeautifulSoup(html_str, "html.parser")`
- `soup.find()`: returns the first tag that matches a tag name
- `soup.find_all()`: returns a list of every tag that matches a tag name
- `soup.select()`: returns a list of every tag that matches a CSS selector

- `bs4.BeautifulSoup`
    - `bs4.element.Tag`
        - `.text`
        - `.get("<ATTR>")`

In [71]:
from bs4 import BeautifulSoup

soup = BeautifulSoup(html_str, "html.parser")  # name the parser explicitly to avoid the bs4 warning

In [81]:
type(soup)

bs4.BeautifulSoup

In [82]:
soup.find("h1")

<h1 class="">Avengers: Endgame <span id="titleYear">(<a href="/year/2019/">2019</a>)</span> </h1>

In [83]:
print(type(soup.find("h1")))

<class 'bs4.element.Tag'>


In [84]:
soup.find("h1").text

'Avengers: Endgame\xa0(2019) '
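
The `\xa0` in that string is a non-breaking space carried over from the page markup. One way to tidy it, sketched with the standard library only (choosing NFKC normalization is an assumption about what "clean" should mean here):

```python
import unicodedata

raw_title = "Avengers: Endgame\xa0(2019) "  # value shown in the output above

# NFKC maps the non-breaking space (U+00A0) to a plain space; strip() drops the trailing one.
clean_title = unicodedata.normalize("NFKC", raw_title).strip()
print(repr(clean_title))
```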

In [88]:
soup.find("a")

<a class="navbarSprite" href="/" id="home_img" title="Home"></a>

In [86]:
soup.find("a").get("class")

['navbarSprite']

In [87]:
soup.find("a").get("href")

'/'

In [89]:
soup.find("a").get("id")

'home_img'

In [76]:
soup.find_all("h2")

[<h2><div class="checkin-error">Error</div></h2>,
 <h2><div class="checkin-success">Added to Your Check-Ins.</div></h2>,
 <h2>Videos</h2>,
 <h2>Photos</h2>,
 <h2 class="rec_heading_wrapper">
 <span class="rec_heading" data-spec="p13nsims:tt4154796">More Like This </span>
 </h2>,
 <h2>Cast</h2>,
 <h2>Storyline</h2>,
 <h2>Details</h2>,
 <h2>Did You Know?</h2>,
 <h2>Frequently Asked Questions</h2>,
 <h2>User Reviews</h2>,
 <h2>Contribute to This Page</h2>]

In [79]:
soup.select(".ratingValue span")

[<span itemprop="ratingValue">8.8</span>,
 <span class="grey">/</span>,
 <span class="grey" itemprop="bestRating">10</span>]

In [95]:
soup.find_all("img")[2].get("src")

'https://m.media-amazon.com/images/M/MV5BMTc5MDE2ODcwNV5BMl5BanBnXkFtZTgwMzI2NzQ2NzM@._V1_UX182_CR0,0,182,268_AL_.jpg'

## pyquery

- `from pyquery import PyQuery as pq`
- `d = pq(html_str)`
- `d("<SELECTOR>")`: Finding specific data via CSS Selector
- `pyquery.PyQuery`
    - `.items()`: iterates over the matches, yielding each one as a `PyQuery` object
        - `.text()`
        - `.attr("<ATTR>")`

In [97]:
from pyquery import PyQuery as pq

d = pq(html_str)
type(d)

pyquery.pyquery.PyQuery

In [112]:
for i in d(".ratingValue span").items():
    print(i.text())

8.8
/
10


In [122]:
for i in d(".poster img").items():
    print(i.attr("src"))

https://m.media-amazon.com/images/M/MV5BMTc5MDE2ODcwNV5BMl5BanBnXkFtZTgwMzI2NzQ2NzM@._V1_UX182_CR0,0,182,268_AL_.jpg


In [125]:
for i in soup.find_all("time"):
    print(i.text.strip())

3h 1min
181 min


In [129]:
for i in soup.select(".primary_photo+ td a"):
    print(i.text.strip())

Robert Downey Jr.
Chris Evans
Mark Ruffalo
Chris Hemsworth
Scarlett Johansson
Jeremy Renner
Don Cheadle
Paul Rudd
Benedict Cumberbatch
Chadwick Boseman
Brie Larson
Tom Holland
Karen Gillan
Zoe Saldana
Evangeline Lilly


In [131]:
for i in soup.select(".primary_photo+ td a"):
    route = i.get("href")
    print("https://www.imdb.com" + route)

https://www.imdb.com/name/nm0000375/
https://www.imdb.com/name/nm0262635/
https://www.imdb.com/name/nm0749263/
https://www.imdb.com/name/nm1165110/
https://www.imdb.com/name/nm0424060/
https://www.imdb.com/name/nm0719637/
https://www.imdb.com/name/nm0000332/
https://www.imdb.com/name/nm0748620/
https://www.imdb.com/name/nm1212722/
https://www.imdb.com/name/nm1569276/
https://www.imdb.com/name/nm0488953/
https://www.imdb.com/name/nm4043618/
https://www.imdb.com/name/nm2394794/
https://www.imdb.com/name/nm0757855/
https://www.imdb.com/name/nm1431940/


In [132]:
for i in d("time").items():
    print(i.text())

3h 1min
181 min


In [133]:
for i in d(".primary_photo+ td a").items():
    print(i.text())

Robert Downey Jr.
Chris Evans
Mark Ruffalo
Chris Hemsworth
Scarlett Johansson
Jeremy Renner
Don Cheadle
Paul Rudd
Benedict Cumberbatch
Chadwick Boseman
Brie Larson
Tom Holland
Karen Gillan
Zoe Saldana
Evangeline Lilly


In [135]:
for i in d(".primary_photo+ td a").items():
    route = i.attr("href")
    print("https://www.imdb.com" + route)

https://www.imdb.com/name/nm0000375/
https://www.imdb.com/name/nm0262635/
https://www.imdb.com/name/nm0749263/
https://www.imdb.com/name/nm1165110/
https://www.imdb.com/name/nm0424060/
https://www.imdb.com/name/nm0719637/
https://www.imdb.com/name/nm0000332/
https://www.imdb.com/name/nm0748620/
https://www.imdb.com/name/nm1212722/
https://www.imdb.com/name/nm1569276/
https://www.imdb.com/name/nm0488953/
https://www.imdb.com/name/nm4043618/
https://www.imdb.com/name/nm2394794/
https://www.imdb.com/name/nm0757855/
https://www.imdb.com/name/nm1431940/
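
String concatenation works here because every `href` is a site-relative path. `urllib.parse.urljoin` from the standard library is a sketch of a more forgiving alternative, since it also copes with `href` values that are already absolute. The second `href` below is written out in absolute form purely to show that case.

```python
from urllib.parse import urljoin

base_url = "https://www.imdb.com"
hrefs = [
    "/name/nm0000375/",                      # relative path, as in the output above
    "https://www.imdb.com/name/nm0262635/",  # absolute form, for illustration only
]

# urljoin resolves relative paths against the base and leaves absolute URLs alone.
full_urls = [urljoin(base_url, href) for href in hrefs]
for url in full_urls:
    print(url)
```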


In [137]:
import requests
from bs4 import BeautifulSoup
from pyquery import PyQuery
import pandas as pd

r = requests.get("https://www.imdb.com/title/tt4154796/releaseinfo/")
html_str = r.text

In [153]:
soup = BeautifulSoup(html_str, "html.parser")
# Parsing release information of Avengers: Endgame (2019)
countries = [i.text.strip() for i in soup.select(".release-date-item__country-name a")]
release_dates = [i.text for i in soup.select(".release-date-item__date")]
# Summarizing number of countries grouped by release date
df = pd.DataFrame()
df["country"] = countries
df["release_date"] = release_dates
print(df.groupby("release_date")["country"].count())
# When is Taiwan's release date?
print(df[df["country"] == "Taiwan"]["release_date"].values[0])

release_date
22 April 2019     1
23 April 2019     1
24 April 2019    31
25 April 2019    21
26 April 2019    14
28 April 2019     1
29 April 2019     1
Name: country, dtype: int64
24 April 2019


In [158]:
d = PyQuery(html_str)
# Parsing release information of Avengers: Endgame (2019)
countries = [i.text() for i in d(".release-date-item__country-name a").items()]
release_dates = [i.text() for i in d(".release-date-item__date").items()]
# Summarizing number of countries grouped by release date
df = pd.DataFrame()
df["country"] = countries
df["release_date"] = release_dates
print(df.groupby("release_date")["country"].count())
# When is Taiwan's release date?
print(df[df["country"] == "Taiwan"]["release_date"].values[0])

release_date
22 April 2019     1
23 April 2019     1
24 April 2019    31
25 April 2019    21
26 April 2019    14
28 April 2019     1
29 April 2019     1
Name: country, dtype: int64
24 April 2019
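
Grouping on the raw strings sorts the dates as text, which only looks chronological here because every date falls in late April. A sketch of parsing the strings into real dates first, using a few invented rows (the `"%d %B %Y"` format assumes English month names, as in the scraped output):

```python
from collections import Counter
from datetime import datetime

# Invented rows in the same "day month year" shape as the scraped release dates.
release_dates = ["24 April 2019", "2 May 2019", "24 April 2019", "26 April 2019"]

parsed = [datetime.strptime(d, "%d %B %Y").date() for d in release_dates]
counts = Counter(parsed)

# Sorting date objects is chronological; sorting the raw strings would put
# "2 May 2019" ahead of "24 April 2019".
ordered = sorted(counts)
for day in ordered:
    print(day.isoformat(), counts[day])
```

With pandas, `pd.to_datetime` on the `release_date` column before grouping serves the same purpose.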


In [174]:
def get_imdb_movie_data(movie_url):
    r = requests.get(movie_url)
    html_str = r.text
    d = pq(html_str)
    rating = float([i.text() for i in d(".ratingValue span").items()][0])
    r = requests.get(movie_url + "releaseinfo")
    html_str = r.text
    d = pq(html_str)
    countries = [i.text() for i in d(".release-date-item__country-name a").items()]
    release_dates = [i.text() for i in d(".release-date-item__date").items()]
    release_dates_dict = {
        "country": countries,
        "release_date": release_dates
    }
    return rating, release_dates_dict

In [178]:
def get_movie(movie_title):
    q_url = "https://www.imdb.com/find?q={}&s=tt&ttype=ft&ref_=fn_ft".format(movie_title)
    r = requests.get(q_url)
    html_str = r.text
    soup = BeautifulSoup(html_str, "html.parser")
    movie_url = soup.select(".result_text a")[0].get("href")
    movie_url = "https://www.imdb.com" + movie_url
    movie = get_imdb_movie_data(movie_url)
    return movie
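
The title is dropped into the query string as-is, so a title containing `&` or `#` would change the meaning of the URL. A sketch of building the same search URL with `urllib.parse.urlencode`; the helper name `build_find_url` is made up, and the parameters are the ones already present in `q_url` above.

```python
from urllib.parse import urlencode


def build_find_url(movie_title):
    """Build the IMDb title-search URL with the query percent-encoded."""
    params = {"q": movie_title, "s": "tt", "ttype": "ft", "ref_": "fn_ft"}
    return "https://www.imdb.com/find?" + urlencode(params)


find_url = build_find_url("Avengers: Endgame")
print(find_url)
```

`requests.get(url, params=...)` accepts the same kind of dict and does the encoding itself, which is another option.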

In [179]:
get_movie("Avengers: Endgame")

(8.8,
 {'country': ['USA',
   'Russia',
   'United Arab Emirates',
   'Austria',
   'Australia',
   'Belgium',
   'China',
   'Colombia',
   'Cyprus',
   'Germany',
   'Denmark',
   'Egypt',
   'Finland',
   'France',
   'Greece',
   'Hong Kong',
   'Indonesia',
   'Israel',
   'Italy',
   'South Korea',
   'Lebanon',
   'Malaysia',
   'Netherlands',
   'Norway',
   'New Zealand',
   'Philippines',
   'Paraguay',
   'Saudi Arabia',
   'Sweden',
   'Singapore',
   'Thailand',
   'Taiwan',
   'Kosovo',
   'Argentina',
   'Brazil',
   'Costa Rica',
   'Spain',
   'UK',
   'Hungary',
   'Ireland',
   'Cambodia',
   'Kuwait',
   'Lithuania',
   'Montenegro',
   'Nigeria',
   'Peru',
   'Poland',
   'Portugal',
   'Romania',
   'Serbia',
   'Slovakia',
   'Turkey',
   'Ukraine',
   'Uruguay',
   'Bangladesh',
   'Bulgaria',
   'Canada',
   'Estonia',
   'India',
   'Japan',
   'Sri Lanka',
   'Lithuania',
   'Morocco',
   'Mexico',
   'Nepal',
   'Pakistan',
   'USA',
   'Vietnam',
   'Armen

## R Environment

```r
pkgs <- c("jsonlite", "rvest", "httr", "magrittr")
install.packages(pkgs)
```

### Web API

- `jsonlite::fromJSON()`
    - Object: `list` as output
    - Array of Objects: `data.frame` as output

### Non Web API

- `rvest`
    - `read_html()`: Fetches a URL and parses the HTML, roughly `requests.get()` followed by `BeautifulSoup()`
    - `html_nodes()`: Performs the same behavior as `soup.select()`
    - `html_text()`: Extracts text part from tags
    - `html_attr()`: Extracts attr part from tags

## 2019-05-16

## Practice

In [30]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

def get_stock_df(url):
    r = requests.get(url)
    html_str = r.text
    soup = BeautifulSoup(html_str, "html.parser")
    tickers = [i.text.split()[0] for i in soup.select(".name a")]
    companies = [i.text.split()[1] for i in soup.select(".name a")]
    prices = [float(i.text) for i in soup.select(".name+ td")]
    volumes = [int(i.text.replace(",", "")) for i in soup.select("td:nth-child(9)")]
    mkt_values = [float(i.text)*100000000 for i in soup.select("td:nth-child(10)")]
    stock_df = pd.DataFrame()
    stock_df["ticker"] = tickers
    stock_df["company"] = companies
    stock_df["price"] = prices
    stock_df["volume"] = volumes
    stock_df["mkt_value"] = mkt_values
    return stock_df

tse_url = "https://tw.stock.yahoo.com/d/i/rank.php?t=pri&e=tse&n=100"
otc_url = "https://tw.stock.yahoo.com/d/i/rank.php?t=pri&e=otc&n=100"
tse = get_stock_df(tse_url)
tse["type"] = "上市"
otc = get_stock_df(otc_url)
otc["type"] = "上櫃"
# DataFrame.append was removed in pandas 2.0; pd.concat stacks the two frames and renumbers the index
df = pd.concat([tse, otc], ignore_index=True)
df.head()
# ...
# pd.DataFrame: ticker(3008)/company(大立光)/price/volume/mkt_value/type
# 200 x 6

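|   | ticker | company | price | volume | mkt_value | type |
|---|--------|---------|-------|--------|-----------|------|
| 0 | 3008 | 大立光 | 4210.0 | 263 | 1109710000.0 | 上市 |
| 1 | 6409 | 旭隼 | 613.0 | 13 | 8020000.0 | 上市 |
| 2 | 5269 | 祥碩 | 514.0 | 153 | 78740000.0 | 上市 |
| 3 | 6415 | 矽力-KY | 482.5 | 112 | 54430000.0 | 上市 |
| 4 | 2207 | 和泰車 | 439.5 | 136 | 59840000.0 | 上市 |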


In [31]:
df.tail()

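|     | ticker | company | price | volume | mkt_value | type |
|-----|--------|---------|-------|--------|-----------|------|
| 195 | 3455 | 由田 | 74.3 | 939 | 69540000.0 | 上櫃 |
| 196 | 2235 | 謚源 | 73.8 | 89 | 6560000.0 | 上櫃 |
| 197 | 6279 | 胡連 | 73.7 | 41 | 3040000.0 | 上櫃 |
| 198 | 6576 | 逸達 | 73.4 | 160 | 11750000.0 | 上櫃 |
| 199 | 5474 | 聰泰 | 73.3 | 6 | 440000.0 | 上櫃 |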


In [32]:
df.shape

(200, 6)

In [33]:
df.index

RangeIndex(start=0, stop=200, step=1)

In [36]:
df.to_csv("stock.csv", index=False)

In [39]:
df.to_json("stock.json", force_ascii=False, orient="records")

## Selenium

<https://selenium-python.readthedocs.io/>

- `selenium` package
- Update Chrome / Firefox
- Download ChromeDriver / Geckodriver
    - ChromeDriver: <https://chromedriver.storage.googleapis.com/index.html?path=74.0.3729.6/>
    - Geckodriver: <https://github.com/mozilla/geckodriver/releases/tag/v0.24.0>
- Common driver method calls
    - `webdriver.BROWSER(executable_path="PATH_TO_YOUR_DRIVER")` to initiate driver: Chrome / Firefox
    - `driver.get(URL)` to get to specific URL
    - `driver.back()` to move backward
    - `driver.forward()` to move forward
    - `driver.close()` to close driver
    - `driver.find_element_by_XXX()` to select a specific **element** by different locators (Selenium 3 API; Selenium 4 uses `driver.find_element(By.XXX, ...)` instead)
        - find_element_by_id
        - find_element_by_name
        - find_element_by_xpath
        - find_element_by_link_text
        - find_element_by_partial_link_text
        - find_element_by_tag_name
        - find_element_by_class_name
        - find_element_by_css_selector
    - `driver.find_elements_by_XXX()` to select specific **elements** by different locators
        - find_elements_by_name
        - find_elements_by_xpath
        - find_elements_by_link_text
        - find_elements_by_partial_link_text
        - find_elements_by_tag_name
        - find_elements_by_class_name
        - find_elements_by_css_selector
- Common element methods/attributes
    - `.send_keys()`
    - `.click()`
    - `.text`

In [19]:
# test
from selenium import webdriver

driver = webdriver.Chrome(executable_path="chromedriver.exe")

In [9]:
# test
from selenium import webdriver

driver = webdriver.Firefox(executable_path="geckodriver.exe")

In [10]:
driver.get("https://www.imdb.com")

In [11]:
driver.get("https://www.bbc.com")

In [12]:
driver.back()

In [13]:
driver.forward()

In [14]:
driver.close()

## XPath locator

- XPath Helper: <https://chrome.google.com/webstore/detail/xpath-helper/hgimnogjllphhhkhlmebbmlgjoejdpjl>

In [52]:
from selenium import webdriver

driver = webdriver.Firefox(executable_path="geckodriver.exe")
driver.get("https://www.imdb.com/")
elem = driver.find_element_by_xpath("//input[@id='navbar-query']")
elem.send_keys("Captain Marvel")
elem = driver.find_element_by_xpath("//button[@id='navbar-submit-button']")
elem.click()
elem = driver.find_element_by_xpath("//ul[@class='findTitleSubfilterList']/li[1]/a")
elem.click()
elem = driver.find_element_by_xpath("//div[@class='findSection'][1]/table[@class='findList']/tbody/tr[@class='findResult odd'][1]/td[@class='result_text']/a")
elem.click()
elem = driver.find_element_by_xpath("//strong/span")
rating = float(elem.text)
elem = driver.find_elements_by_xpath("//div[@class='subtext']/a")
genre = [i.text for i in elem]
elem = driver.find_elements_by_xpath("//td[2]/a")
cast = [i.text for i in elem]
genre.pop()
elem = driver.find_element_by_xpath("//div[@class='subtext']/a[4]")
elem.click()
elem = driver.find_elements_by_xpath("//tr[@class='ipl-zebra-list__item release-date-item']/td[@class='release-date-item__country-name']/a")
countries = [i.text for i in elem]
elem = driver.find_elements_by_xpath("//tr[@class='ipl-zebra-list__item release-date-item']/td[@class='release-date-item__date']")
release_dates = [i.text for i in elem]
driver.close()

In [53]:
print(rating)
print(genre)
print(cast)
print(countries)
print(release_dates)

7.1
['Action', 'Adventure', 'Sci-Fi']
['Brie Larson', 'Samuel L. Jackson', 'Ben Mendelsohn', 'Jude Law', 'Annette Bening', 'Lashana Lynch', 'Clark Gregg', 'Rune Temte', 'Gemma Chan', 'Algenis Perez Soto', 'Djimon Hounsou', 'Lee Pace', 'Chuku Modu', 'Matthew Maher', 'Akira Akbar']
['UK', 'USA', 'Belgium', 'Colombia', 'Denmark', 'Estonia', 'Finland', 'France', 'Indonesia', 'Italy', 'South Korea', 'Kuwait', 'Morocco', 'Malaysia', 'Netherlands', 'Norway', 'Norway', 'Philippines', 'Portugal', 'Serbia', 'Sweden', 'Slovenia', 'Taiwan', 'Argentina', 'Australia', 'Bulgaria', 'Brazil', 'Czech Republic', 'Germany', 'Georgia', 'Greece', 'Hungary', 'Israel', 'Kazakhstan', 'Lebanon', 'Peru', 'Russia', 'Saudi Arabia', 'Singapore', 'Slovakia', 'Ukraine', 'Uruguay', 'Bangladesh', 'Canada', 'China', 'Spain', 'UK', 'Hong Kong', 'Ireland', 'India', 'Sri Lanka', 'Lithuania', 'Mexico', 'Nepal', 'Poland', 'Romania', 'Turkey', 'USA', 'Vietnam', 'Japan', 'Nigeria']
['27 February 2019', '4 March 2019', '6 March

In [59]:
def get_movie_info(movie_titles):
    driver = webdriver.Firefox(executable_path="geckodriver.exe")
    movie_info = {}
    for movie_title in movie_titles:
        driver.get("https://www.imdb.com/")
        elem = driver.find_element_by_xpath("//input[@id='navbar-query']")
        elem.send_keys(movie_title)
        elem = driver.find_element_by_xpath("//button[@id='navbar-submit-button']")
        elem.click()
        elem = driver.find_element_by_xpath("//ul[@class='findTitleSubfilterList']/li[1]/a")
        elem.click()
        elem = driver.find_element_by_xpath("//div[@class='findSection'][1]/table[@class='findList']/tbody/tr[@class='findResult odd'][1]/td[@class='result_text']/a")
        elem.click()
        elem = driver.find_element_by_xpath("//strong/span")
        rating = float(elem.text)
        elem = driver.find_elements_by_xpath("//div[@class='subtext']/a")
        genre = [i.text for i in elem]
        elem = driver.find_elements_by_xpath("//td[2]/a")
        cast = [i.text for i in elem]
        genre.pop()
        elem = driver.find_element_by_xpath("//time")
        movie_time = elem.text
        elem = driver.find_element_by_xpath("//div[@class='subtext']/a[4]")
        elem.click()
        elem = driver.find_elements_by_xpath("//tr[@class='ipl-zebra-list__item release-date-item']/td[@class='release-date-item__country-name']/a")
        countries = [i.text for i in elem]
        elem = driver.find_elements_by_xpath("//tr[@class='ipl-zebra-list__item release-date-item']/td[@class='release-date-item__date']")
        release_dates = [i.text for i in elem]
        movie = {
            "rating": rating,
            "genre": genre,
            "cast": cast,
            "movieTime": movie_time,
            "releaseCountries": countries,
            "releaseDates": release_dates
        }
        movie_info[movie_title] = movie
    driver.close()
    return movie_info
    
movie_titles = ["The Avengers (2012)", "Avengers: Age of Ultron (2015)", "Avengers: Infinity War (2018)", "Avengers: Endgame (2019)"]
results = get_movie_info(movie_titles) # return list / dict(key: movie_title)

In [60]:
results.keys()

dict_keys(['The Avengers (2012)', 'Avengers: Age of Ultron (2015)', 'Avengers: Infinity War (2018)', 'Avengers: Endgame (2019)'])

In [61]:
results["Avengers: Age of Ultron (2015)"]

{'rating': 7.3,
 'genre': ['Action', 'Adventure', 'Sci-Fi'],
 'cast': ['Robert Downey Jr.',
  'Chris Hemsworth',
  'Mark Ruffalo',
  'Chris Evans',
  'Scarlett Johansson',
  'Jeremy Renner',
  'James Spader',
  'Samuel L. Jackson',
  'Don Cheadle',
  'Aaron Taylor-Johnson',
  'Elizabeth Olsen',
  'Paul Bettany',
  'Cobie Smulders',
  'Anthony Mackie',
  'Hayley Atwell'],
 'movieTime': '2h 21min',
 'releaseCountries': ['USA',
  'Belgium',
  'Switzerland',
  'Switzerland',
  'Finland',
  'France',
  'Indonesia',
  'Italy',
  'Netherlands',
  'Norway',
  'Philippines',
  'Sweden',
  'Taiwan',
  'Albania',
  'Argentina',
  'Austria',
  'Australia',
  'Azerbaijan',
  'Brazil',
  'Belarus',
  'Switzerland',
  'Germany',
  'Denmark',
  'UK',
  'Georgia',
  'Hong Kong',
  'Ireland',
  'Israel',
  'Iceland',
  'South Korea',
  'Kazakhstan',
  'New Zealand',
  'Romania',
  'Russia',
  'Singapore',
  'Ukraine',
  'Bulgaria',
  'India',
  'Vietnam',
  'South Africa',
  'Egypt',
  'Spain',
  'Kuwai

In [65]:
def get_movie_info(*args):
    driver = webdriver.Firefox(executable_path="geckodriver.exe")
    movie_info = {}
    for movie_title in args:
        driver.get("https://www.imdb.com/")
        elem = driver.find_element_by_xpath("//input[@id='navbar-query']")
        elem.send_keys(movie_title)
        elem = driver.find_element_by_xpath("//button[@id='navbar-submit-button']")
        elem.click()
        elem = driver.find_element_by_xpath("//ul[@class='findTitleSubfilterList']/li[1]/a")
        elem.click()
        elem = driver.find_element_by_xpath("//div[@class='findSection'][1]/table[@class='findList']/tbody/tr[@class='findResult odd'][1]/td[@class='result_text']/a")
        elem.click()
        elem = driver.find_element_by_xpath("//strong/span")
        rating = float(elem.text)
        elem = driver.find_elements_by_xpath("//div[@class='subtext']/a")
        genre = [i.text for i in elem]
        elem = driver.find_elements_by_xpath("//td[2]/a")
        cast = [i.text for i in elem]
        genre.pop()
        elem = driver.find_element_by_xpath("//time")
        movie_time = elem.text
        elem = driver.find_element_by_xpath("//div[@class='subtext']/a[4]")
        elem.click()
        elem = driver.find_elements_by_xpath("//tr[@class='ipl-zebra-list__item release-date-item']/td[@class='release-date-item__country-name']/a")
        countries = [i.text for i in elem]
        elem = driver.find_elements_by_xpath("//tr[@class='ipl-zebra-list__item release-date-item']/td[@class='release-date-item__date']")
        release_dates = [i.text for i in elem]
        movie = {
            "rating": rating,
            "genre": genre,
            "cast": cast,
            "movieTime": movie_time,
            "releaseCountries": countries,
            "releaseDates": release_dates
        }
        movie_info[movie_title] = movie
    driver.close()
    return movie_info
    
results = get_movie_info("Avengers: Infinity War (2018)", "Avengers: Endgame (2019)") # return list / dict(key: movie_title)

In [64]:
import json

with open("movie.json", "w") as f:
    json.dump(results, f)
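
`json.dump` escapes non-ASCII characters by default, which matters once titles or names contain them. A sketch with a made-up record showing `ensure_ascii=False` and `indent`; the Chinese title key and the file name `movie_demo.json` are hypothetical, and the file is written to a temporary directory.

```python
import json
import os
import tempfile

# Made-up record keyed by a non-ASCII title, shaped like one entry of `results`.
demo = {"復仇者聯盟: 終局之戰": {"rating": 8.8, "genre": ["Action", "Adventure", "Sci-Fi"]}}

path = os.path.join(tempfile.mkdtemp(), "movie_demo.json")
with open(path, "w", encoding="utf-8") as f:
    # ensure_ascii=False keeps the characters readable; indent pretty-prints the file.
    json.dump(demo, f, ensure_ascii=False, indent=2)

with open(path, encoding="utf-8") as f:
    raw = f.read()

loaded = json.loads(raw)
print(raw)
```

Passing `encoding="utf-8"` to `open` keeps the result independent of the platform's default encoding.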

## Headless Selenium

In [54]:
from selenium.webdriver.firefox.options import Options
from selenium import webdriver

options = Options()
options.headless = True
driver = webdriver.Firefox(executable_path="geckodriver.exe", options=options)
driver.get("https://www.imdb.com/")
print("Getting IMDB Homepage...")
elem = driver.find_element_by_xpath("//input[@id='navbar-query']")
movie_title = "Captain Marvel"
print("Searching {}...".format(movie_title))
elem.send_keys(movie_title)
elem = driver.find_element_by_xpath("//button[@id='navbar-submit-button']")
elem.click()
elem = driver.find_element_by_xpath("//ul[@class='findTitleSubfilterList']/li[1]/a")
elem.click()
elem = driver.find_element_by_xpath("//div[@class='findSection'][1]/table[@class='findList']/tbody/tr[@class='findResult odd'][1]/td[@class='result_text']/a")
elem.click()
print("Parsing movie info...")
elem = driver.find_element_by_xpath("//strong/span")
rating = float(elem.text)
elem = driver.find_elements_by_xpath("//div[@class='subtext']/a")
genre = [i.text for i in elem]
elem = driver.find_elements_by_xpath("//td[2]/a")
cast = [i.text for i in elem]
genre.pop()
elem = driver.find_element_by_xpath("//div[@class='subtext']/a[4]")
elem.click()
print("Parsing release dates for {}".format(movie_title))
elem = driver.find_elements_by_xpath("//tr[@class='ipl-zebra-list__item release-date-item']/td[@class='release-date-item__country-name']/a")
countries = [i.text for i in elem]
elem = driver.find_elements_by_xpath("//tr[@class='ipl-zebra-list__item release-date-item']/td[@class='release-date-item__date']")
release_dates = [i.text for i in elem]
print("Done parsing {}.".format(movie_title))
driver.close()

Getting IMDB Homepage...
Searching Captain Marvel...
Parsing movie info...
Parsing release dates for Captain Marvel
Done parsing Captain Marvel.


In [56]:
print(rating)
print(genre)
print(cast)
print(countries)
print(release_dates)

7.1
['Action', 'Adventure', 'Sci-Fi']
['Brie Larson', 'Samuel L. Jackson', 'Ben Mendelsohn', 'Jude Law', 'Annette Bening', 'Lashana Lynch', 'Clark Gregg', 'Rune Temte', 'Gemma Chan', 'Algenis Perez Soto', 'Djimon Hounsou', 'Lee Pace', 'Chuku Modu', 'Matthew Maher', 'Akira Akbar']
['UK', 'USA', 'Belgium', 'Colombia', 'Denmark', 'Estonia', 'Finland', 'France', 'Indonesia', 'Italy', 'South Korea', 'Kuwait', 'Morocco', 'Malaysia', 'Netherlands', 'Norway', 'Norway', 'Philippines', 'Portugal', 'Serbia', 'Sweden', 'Slovenia', 'Taiwan', 'Argentina', 'Australia', 'Bulgaria', 'Brazil', 'Czech Republic', 'Germany', 'Georgia', 'Greece', 'Hungary', 'Israel', 'Kazakhstan', 'Lebanon', 'Peru', 'Russia', 'Saudi Arabia', 'Singapore', 'Slovakia', 'Ukraine', 'Uruguay', 'Bangladesh', 'Canada', 'China', 'Spain', 'UK', 'Hong Kong', 'Ireland', 'India', 'Sri Lanka', 'Lithuania', 'Mexico', 'Nepal', 'Poland', 'Romania', 'Turkey', 'USA', 'Vietnam', 'Japan', 'Nigeria']
['27 February 2019', '4 March 2019', '6 March

In [83]:
driver = webdriver.Chrome(executable_path="chromedriver.exe")
driver.get("http://www.mafengwo.cn/jd/10769/gonglve.html")
# ...

In [87]:
driver.execute_script('next = document.querySelector(".pg-next"); next.click();')

In [75]:
elem = driver.find_elements_by_tag_name("a")
for e in elem:
    print("{} : {}".format(e.text, e.get_attribute("href")))

 : http://www.mafengwo.cn/
首页 : http://www.mafengwo.cn/
目的地 : http://www.mafengwo.cn/mdd/
旅游攻略 : http://www.mafengwo.cn/gonglve/
去旅行 : http://www.mafengwo.cn/sales/
 : http://www.mafengwo.cn/sales/
 : http://www.mafengwo.cn/sales/0-0-0-0-0-0-0-0.html?group=4
 : http://www.mafengwo.cn/localdeals/
 : http://www.mafengwo.cn/sales/0-0-0-5-0-0-0-0.html
 : http://www.mafengwo.cn/sales/visa/
机票 : http://www.mafengwo.cn/flight/
订酒店 : http://www.mafengwo.cn/hotel/
 : http://www.mafengwo.cn/wenda/
 : http://www.mafengwo.cn/mall/things.php
 : http://www.mafengwo.cn/club/
 : http://www.mafengwo.cn/together/
 : http://www.mafengwo.cn/group/
 : http://www.mafengwo.cn/rudder/
 : http://www.mafengwo.cn/auction/
 : http://www.mafengwo.cn/photo_pk/prev.php
 : http://www.mafengwo.cn/focus/
 : http://www.mafengwo.cn/mall/virtual_goods.php
 : http://www.mafengwo.cn/app/intro/gonglve.php
 : javascript:
 : https://passport.mafengwo.cn/weibo
 : https://passport.mafengwo.cn/qq
 : https://passport.mafengwo.cn/w

 : http://t.qq.com/mafengwovip
 : http://1213600479.qzone.qq.com/
 : http://www.mafengwo.cn/
京ICP备11015476号 : http://www.miibeian.gov.cn/
京公网安备11010502013401号 : http://www.beian.gov.cn/portal/registerSystemInfo?recordcode=11010502013401
京ICP证110318号 : http://images.mafengwo.net/images/about/icp2.jpg
营业执照 : https://n1-q.mafengwo.net/s12/M00/A5/45/wKgED1xJi3uAA7KLAAf_CkKLHRQ87.jpeg
帮助中心 : http://www.mafengwo.cn/sales/uhelp/doc
 : https://search.szfw.org/cert/l/CX20140627008255008321
 : https://ss.knet.cn/verifyseal.dll?sn=e130816110100420286o93000000&ct=df&a=1&pa=787189
 : http://www.itrust.org.cn/Home/Index/itrust_certifi/wm/1669928206.html
 : None
 : None
 : None
 : None


In [30]:
len(elem)

195

In [82]:
driver.close()

In [33]:
for e in elem:
    print(e.get_attribute("href"), "||", e.text)

http://www.mafengwo.cn/ || 
http://www.mafengwo.cn/ || 首页
http://www.mafengwo.cn/mdd/ || 目的地
http://www.mafengwo.cn/gonglve/ || 旅游攻略
http://www.mafengwo.cn/sales/ || 去旅行
http://www.mafengwo.cn/sales/ || 
http://www.mafengwo.cn/sales/0-0-0-0-0-0-0-0.html?group=4 || 
http://www.mafengwo.cn/localdeals/ || 
http://www.mafengwo.cn/sales/0-0-0-5-0-0-0-0.html || 
http://www.mafengwo.cn/sales/visa/ || 
http://www.mafengwo.cn/flight/ || 机票
http://www.mafengwo.cn/hotel/ || 订酒店
http://www.mafengwo.cn/wenda/ || 
http://www.mafengwo.cn/mall/things.php || 
http://www.mafengwo.cn/club/ || 
http://www.mafengwo.cn/together/ || 
http://www.mafengwo.cn/group/ || 
http://www.mafengwo.cn/rudder/ || 
http://www.mafengwo.cn/auction/ || 
http://www.mafengwo.cn/photo_pk/prev.php || 
http://www.mafengwo.cn/focus/ || 
http://www.mafengwo.cn/mall/virtual_goods.php || 
http://www.mafengwo.cn/app/intro/gonglve.php || 
javascript: || 
https://passport.mafengwo.cn/weibo || 
https://passport.mafengwo.cn/qq || 
https:/

https://n1-q.mafengwo.net/s12/M00/A5/45/wKgED1xJi3uAA7KLAAf_CkKLHRQ87.jpeg || 营业执照
http://www.mafengwo.cn/sales/uhelp/doc || 帮助中心
https://search.szfw.org/cert/l/CX20140627008255008321 || 
https://ss.knet.cn/verifyseal.dll?sn=e130816110100420286o93000000&ct=df&a=1&pa=787189 || 
http://www.itrust.org.cn/Home/Index/itrust_certifi/wm/1669928206.html || 
None || 
None || 
None || 
None || 


In [28]:
e.get_attribute("href")
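
Many of the anchors printed above have an empty label, a `None` href, or a `javascript:` pseudo-link, and the same URL shows up more than once. A minimal sketch of one way to tidy such a list follows. `clean_links` is a hypothetical helper that works on plain `(href, text)` tuples, so it can be tried without a browser.

```python
def clean_links(pairs):
    """Keep http(s) links only, one entry per URL, preferring a non-empty label."""
    links = {}
    for href, text in pairs:
        if not href or not href.startswith(("http://", "https://")):
            continue  # skips None and "javascript:" pseudo-links
        text = (text or "").strip()
        if href not in links or (text and not links[href]):
            links[href] = text
    return list(links.items())

sample = [
    ("http://www.mafengwo.cn/", ""),
    ("http://www.mafengwo.cn/", "首页"),
    ("javascript:", ""),
    (None, ""),
    ("http://www.mafengwo.cn/mdd/", "目的地"),
]
print(clean_links(sample))
# [('http://www.mafengwo.cn/', '首页'), ('http://www.mafengwo.cn/mdd/', '目的地')]
```

With Selenium, the input could be built as `[(e.get_attribute("href"), e.text) for e in elem]` while the browser window is still open.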

## GET with cookies

- [EditThisCookie](https://chrome.google.com/webstore/detail/editthiscookie/fngmhnnpilhplaeedifhccceomclgfbg): a Chrome extension for viewing and editing a site's cookies

In [88]:
import requests

r = requests.get("https://www.ptt.cc/bbs/Gossiping/index.html")
html_str = r.text

In [89]:
print(html_str)

<!DOCTYPE html>
<html>
	<head>
		<meta charset="utf-8">
		

<meta name="viewport" content="width=device-width, initial-scale=1">

<title>批踢踢實業坊</title>

<link rel="stylesheet" type="text/css" href="//images.ptt.cc/bbs/v2.25/bbs-common.css">
<link rel="stylesheet" type="text/css" href="//images.ptt.cc/bbs/v2.25/bbs-base.css" media="screen">
<link rel="stylesheet" type="text/css" href="//images.ptt.cc/bbs/v2.25/bbs-custom.css">
<link rel="stylesheet" type="text/css" href="//images.ptt.cc/bbs/v2.25/pushstream.css" media="screen">
<link rel="stylesheet" type="text/css" href="//images.ptt.cc/bbs/v2.25/bbs-print.css" media="print">




	</head>
    <body>
		
<div class="bbs-screen bbs-content">
    <div class="over18-notice">
        <p>本網站已依網站內容分級規定處理</p>

        <p>警告︰您即將進入之看板內容需滿十八歲方可瀏覽。</p>

        <p>若您尚未年滿十八歲，請點選離開。若您已滿十八歲，亦不可將本區之內容派發、傳閱、出售、出租、交給或借予年齡未滿18歲的人士瀏覽，或將本網站內容向該人士出示、播放或放映。</p>
    </div>
</div>

<div class="bbs-screen bbs-content center clear">
    <form action="/ask/over18"

In [99]:
import requests

cookies = dict(over18='1')
r = requests.get("https://www.ptt.cc/bbs/Gossiping/index.html", cookies=cookies)

In [100]:
html_str = r.text
print(html_str)

<!DOCTYPE html>
<html>
	<head>
		<meta charset="utf-8">
		

<meta name="viewport" content="width=device-width, initial-scale=1">

<title>看板 Gossiping 文章列表 - 批踢踢實業坊</title>

<link rel="stylesheet" type="text/css" href="//images.ptt.cc/bbs/v2.25/bbs-common.css">
<link rel="stylesheet" type="text/css" href="//images.ptt.cc/bbs/v2.25/bbs-base.css" media="screen">
<link rel="stylesheet" type="text/css" href="//images.ptt.cc/bbs/v2.25/bbs-custom.css">
<link rel="stylesheet" type="text/css" href="//images.ptt.cc/bbs/v2.25/pushstream.css" media="screen">
<link rel="stylesheet" type="text/css" href="//images.ptt.cc/bbs/v2.25/bbs-print.css" media="print">




	</head>
    <body>
		
<div id="topbar-container">
	<div id="topbar" class="bbs-content">
		<a id="logo" href="/bbs/">批踢踢實業坊</a>
		<span>&rsaquo;</span>
		<a class="board" href="/bbs/Gossiping/index.html"><span class="board-label">看板 </span>Gossiping</a>
		<a class="right small" href="/about.html">關於我們</a>
		<a class="right small" href="/co
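
Passing `cookies=dict(over18='1')` makes the request carry a `Cookie: over18=1` header, which is why the second response is the article list instead of the age-confirmation page. The sketch below builds that header by hand so it can be inspected without sending anything. It uses the standard library's `urllib.request` instead of `requests`. `request_with_cookies` is a hypothetical helper and does not handle cookie values that need quoting.

```python
from urllib.request import Request

def request_with_cookies(url, cookies):
    """Build (but do not send) a GET request that carries the given cookies."""
    cookie_header = "; ".join("{}={}".format(k, v) for k, v in cookies.items())
    return Request(url, headers={"Cookie": cookie_header})

req = request_with_cookies("https://www.ptt.cc/bbs/Gossiping/index.html", {"over18": "1"})
print(req.get_header("Cookie"))  # over18=1
```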

In [104]:
cookies = dict(COOKIE_LANGUAGE='en')
r = requests.get("http://www.fantasy-sky.com/ContentList.aspx?section=002", cookies=cookies)
html_str = r.text
print(html_str)



<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head id="ctl00_Head1"><meta name="og:type" content="article" /><meta name="og:site_name" content="Fantasy Sky" /><meta name="og:image" content="http://www.fantasy-sky.com:83/magazine/2/201905/article/17421_636921374700370352.jpg" /><meta name="og:url" content="http://www.fantasy-sky.com:83/ContentList.aspx?section=002" /><meta name="og:description" content="A Dog's Way Home
" /><script type="text/javascript">var prevIssue = new Array(); prevIssue[0] = 5; prevIssue[1] = 4; prevIssue[2] = 3; prevIssue[3] = 2; prevIssue[4] = 1; </script><title>
	Fantasy Sky – The Inflight Magazine of China Airlines / 華航雜誌 – 中華航空機上雜誌
</title><meta http-equiv="Content-Type" content="text/html; charset=utf-8" /><meta http-equiv="X-UA-Compatible" content="IE=Edge,chrome=1" /><meta http-equiv="X-UA-Compatible" content="IE=EmulateIE7" /><m

In [105]:
from bs4 import BeautifulSoup

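# Name the parser explicitly so the result does not depend on which parsers are installed
soup = BeautifulSoup(html_str, "html.parser")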
movie_titles = [i.text for i in soup.select(".movies-name")]
print(movie_titles)

["A Dog's Way Home", 'The Kid Who Would Be King', 'Mary Poppins Returns', 'Aquaman', 'Glass', 'The Mule', 'The Upside', 'Destroyer', 'Spider-Man™: Into the Spider-Verse', 'Vice', 'Creed II', 'Ben Is Back', 'Robin Hood', 'Mortal Engines', 'The Scoundrels', 'Last Letter', 'A Land Imagined', 'Dying To Survive', 'Project Gutenberg', 'Hidden Man', 'High Flash', 'To My 19 Year-Old', 'How to Train Our Dragon', 'The House Where The Mermaid Sleeps', 'Million Dollar Man', 'Hichki', 'Territory Of Love', 'Rampant', 'Unstoppable', 'Intimate Strangers', 'Newton', 'Dogman', 'Hibiki', 'Code Blue: The Movie', "Samurai's Promise"]


In [124]:
import requests
from pyquery import PyQuery as pq
from bs4 import BeautifulSoup
import pandas as pd

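def get_imdb_movie_data(movie_url):
    """Return (rating, runtime in minutes) scraped from an IMDb title page."""
    r = requests.get(movie_url)
    html_str = r.text
    d = pq(html_str)
    # The rating is the first <span> inside a <strong>
    rating = float([i.text() for i in d("strong span").items()][0])
    # The runtime is the second <time> element; its first token is the number of minutes
    movie_time = int([i.text() for i in d("time").items()][1].split()[0])
    return rating, movie_time

def get_movie(movie_title):
    """Search IMDb for a title and return (rating, runtime) of the first hit."""
    # Pass the query through params so requests URL-encodes the title
    params = {"q": movie_title, "s": "tt", "ttype": "ft", "ref_": "fn_ft"}
    r = requests.get("https://www.imdb.com/find", params=params)
    html_str = r.text
    soup = BeautifulSoup(html_str, "html.parser")
    movie_url = soup.select(".result_text a")[0].get("href")
    movie_url = "https://www.imdb.com" + movie_url
    rating, movie_time = get_imdb_movie_data(movie_url)
    return rating, movie_time

def get_ci_movie_titles(ci_urls):
    """Collect the movie titles listed on the given Fantasy Sky pages."""
    cookies = dict(COOKIE_LANGUAGE='en')  # request the English version of the site
    ci_movie_titles = []
    for ci_url in ci_urls:
        r = requests.get(ci_url, cookies=cookies)
        html_str = r.text
        soup = BeautifulSoup(html_str, "html.parser")
        movie_titles = [i.text for i in soup.select(".movies-name")]
        ci_movie_titles += movie_titles
    return ci_movie_titles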
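
Titles such as "A Dog's Way Home", or any title containing `&`, are only safe inside a query string once they are percent-encoded. A minimal sketch of what the encoded search URL looks like, using the standard library; `build_imdb_search_url` is a hypothetical helper, and the query parameters are the ones used in `get_movie`.

```python
from urllib.parse import urlencode

def build_imdb_search_url(movie_title):
    """Return the IMDb title-search URL with the movie title percent-encoded."""
    query = urlencode({"q": movie_title, "s": "tt", "ttype": "ft", "ref_": "fn_ft"})
    return "https://www.imdb.com/find?" + query

print(build_imdb_search_url("A Dog's Way Home"))
# https://www.imdb.com/find?q=A+Dog%27s+Way+Home&s=tt&ttype=ft&ref_=fn_ft
```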

In [116]:
ci_urls = ["http://www.fantasy-sky.com/ContentList.aspx?section=002&category=0020{}".format(i) for i in range(1, 5)]
ci_movie_titles = get_ci_movie_titles(ci_urls)

In [117]:
len(ci_movie_titles)

162

In [128]:
movie_ratings = []
movie_times = []
for idx, val in enumerate(ci_movie_titles):
    print("###### {}th movie: {} ######".format(idx+1, val))
    try:
        rating, movie_time = get_movie(val)
        movie_ratings.append(rating)
        movie_times.append(movie_time)
    except Exception:
        # Search or parsing failed: append None so the lists stay aligned with ci_movie_titles
        movie_ratings.append(None)
        movie_times.append(None)

###### 1th movie: A Dog's Way Home ######
###### 2th movie: The Kid Who Would Be King ######
###### 3th movie: Mary Poppins Returns ######
###### 4th movie: Aquaman ######
###### 5th movie: Glass ######
###### 6th movie: The Mule ######
###### 7th movie: The Upside ######
###### 8th movie: Destroyer ######
###### 9th movie: Spider-Man™: Into the Spider-Verse ######
###### 10th movie: Vice ######
###### 11th movie: Creed II ######
###### 12th movie: Ben Is Back ######
###### 13th movie: Robin Hood ######
###### 14th movie: Mortal Engines ######
###### 15th movie: The Scoundrels ######
###### 16th movie: Last Letter ######
###### 17th movie: A Land Imagined ######
###### 18th movie: Dying To Survive ######
###### 19th movie: Project Gutenberg ######
###### 20th movie: Hidden Man ######
###### 21th movie: High Flash ######
###### 22th movie: To My 19 Year-Old ######
###### 23th movie: How to Train Our Dragon ######
###### 24th movie: The House Where The Mermaid Sleeps ######
###### 25th m
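
When a lookup fails, the loop above only records `None`, so it is hard to tell afterwards which titles failed and why. A minimal sketch of a variant that also keeps the error per title follows. `collect_movie_data` and `fake_fetch` are hypothetical names; `fake_fetch` stands in for `get_movie` and returns made-up values so the sketch runs without network access.

```python
def collect_movie_data(titles, fetch):
    """Call fetch(title) for every title; return (ratings, times, failures)."""
    ratings, times, failures = [], [], {}
    for title in titles:
        try:
            rating, movie_time = fetch(title)
        except Exception as exc:  # Ctrl+C (KeyboardInterrupt) still stops the loop
            rating, movie_time = None, None
            failures[title] = repr(exc)
        ratings.append(rating)
        times.append(movie_time)
    return ratings, times, failures

# Stand-in for get_movie; the rating and runtime are made-up values
def fake_fetch(title):
    if title == "Unknown Title":
        raise IndexError("no search result")
    return 7.0, 120

ratings, times, failures = collect_movie_data(["Aquaman", "Unknown Title"], fake_fetch)
print(ratings, times, list(failures))
```

In the notebook, the call would be `collect_movie_data(ci_movie_titles, get_movie)`.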

In [None]:
movie_df = pd.DataFrame()
movie_df["movie_title"] = ci_movie_titles
movie_df["movie_rating"] = movie_ratings
movie_df["movie_time"] = movie_times
movie_df.head()

In [1]:
import requests

r = requests.get("https://www.backpackers.com.tw/forum/forumdisplay.php?f=57")
print(r.status_code)

ConnectionError: HTTPSConnectionPool(host='www.backpackers.com.tw', port=443): Max retries exceeded with url: /forum/forumdisplay.php?f=57 (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x000001C24C6834E0>: Failed to establish a new connection: [WinError 10060] 連線嘗試失敗，因為連線對象有一段時間並未正確回應，或是連線建立失敗，因為連線的主機無法回應。'))
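
A `ConnectionError` like the one above is sometimes transient, so one common reaction is to retry a few times with a growing pause before giving up. A minimal, generic sketch follows. `with_retries` is a hypothetical helper, and `flaky` simulates a network call that fails twice. Retrying will not help if the host is unreachable or is refusing the client.

```python
import time

def with_retries(func, attempts=3, delay=1.0, exceptions=(OSError,), sleep=time.sleep):
    """Call func() until it succeeds or attempts run out; re-raise the last error."""
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except exceptions:
            if attempt == attempts:
                raise
            sleep(delay * attempt)  # wait a little longer after each failure

# Stand-in for a flaky network call: fails twice, then succeeds
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("simulated connection failure")
    return 200

print(with_retries(flaky, delay=0))  # 200
```

With `requests`, the call could look like `with_retries(lambda: requests.get(url, timeout=10))`; the exceptions in `requests.exceptions` derive from `OSError`, so the default `exceptions` tuple covers them.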