# Data Analysis with Python

> Scraping web data with Python

[數據交點](https://www.datainpoint.com/) | 郭耀仁 <yaojenkuo@datainpoint.com>

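## Exercise guide

- The first code cell loads the modules (packages) that may be needed, along with `unittest` for unit testing.
- If an exercise needs to load a file, that file is stored in the same folder as the exercise, so it can be loaded relative to the working directory.
- Each exercise already defines the function or class name and its parameter names; we only need to write the body.
- The `"""docstring"""` of each function or class describes how it is tested.
- Reading the `"""docstring"""` shows how the inputs relate to the expected outputs, which helps in understanding the exercise.
- Write the body of the function or class between the two single-line comments `### BEGIN SOLUTION` and `### END SOLUTION`.
- To run the tests, choose Kernel -> Restart Kernel And Run All Cells -> Restart from the top menu.
- You can run the tests after each exercise, or after finishing all of them.
- An exercise session that stays idle for more than 10 minutes is disconnected automatically; click the exercise link again to restart it.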
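
The docstring convention described above can be sketched with a toy exercise. `add_one` is a hypothetical function that is not part of this exercise set, and running the examples through the standard-library `doctest` module is an assumption made for illustration; the notebook itself grades with `unittest`.

```python
import doctest


def add_one(x):
    """
    >>> add_one(1)
    2
    >>> add_one(41)
    42
    """
    ### BEGIN SOLUTION
    return x + 1
    ### END SOLUTION


# Parse the examples out of the docstring and run them against the function.
parser = doctest.DocTestParser()
test = parser.get_doctest(add_one.__doc__, {"add_one": add_one}, "add_one", None, 0)
runner = doctest.DocTestRunner(verbose=False)
results = runner.run(test)
print(results.failed, results.attempted)
```

Each `>>>` line is an input and the line below it is the expected output, so reading the docstring shows what the body has to return.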

In [1]:
import unittest
import os
import requests
import json
import xml.etree.ElementTree as ET
from bs4 import BeautifulSoup
import numpy as np
import pandas as pd

## 01. Define a function named `extract_team_names` that is able to extract the team names of NBA given a JSON file.

- Expected inputs: a JSON file.
- Expected outputs: a list.

In [2]:
def extract_team_names(json_file):
    """
    >>> team_names = extract_team_names('teams.json')
    >>> len(team_names)
    30
    >>> "Boston Celtics" in team_names
    True
    >>> "Brooklyn Nets" in team_names
    True
    >>> "Los Angeles Lakers" in team_names
    True
    >>> "Phoenix Suns" in team_names
    True
    """
    ### BEGIN SOLUTION
    with open(json_file) as jf:
        teams_dict = json.load(jf)
    nba_teams = teams_dict['league']['standard']
    team_names = [e['fullName'] for e in nba_teams if e['isNBAFranchise']]
    return team_names
    ### END SOLUTION
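
The nested lookups in the solution can be traced on a tiny inline document. The two records below are made up for illustration and mirror only the keys the solution reads (`league`, `standard`, `fullName`, `isNBAFranchise`); the real `teams.json` is assumed to hold more fields and more teams.

```python
import json

# Hypothetical sample that mimics the keys the solution reads from teams.json.
sample = """
{
  "league": {
    "standard": [
      {"fullName": "Boston Celtics", "isNBAFranchise": true},
      {"fullName": "Example Exhibition Team", "isNBAFranchise": false}
    ]
  }
}
"""

teams_dict = json.loads(sample)
nba_teams = teams_dict["league"]["standard"]
# Keep only the entries flagged as NBA franchises.
team_names = [e["fullName"] for e in nba_teams if e["isNBAFranchise"]]
print(team_names)
```

The `isNBAFranchise` filter is what keeps non-franchise entries out of the 30 expected names.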

## 02. Define a function named `extract_store_addresses` that is able to extract the store addresses of 7-11 convenience stores in Xinyi District, Taipei given an XML file.

- Expected inputs: an XML file.
- Expected outputs: a list.

In [3]:
def extract_store_addresses(xml_file):
    """
    >>> store_addresses = extract_store_addresses('stores.xml')
    >>> len(store_addresses)
    74
    >>> '台北市信義區信義路五段7號35樓' in store_addresses
    True
    >>> '台北市信義區吳興街156巷2弄2號4號1樓' in store_addresses
    True
    >>> '台北市信義區忠孝東路五段68號24樓' in store_addresses
    True
    >>> '台北市信義區虎林街85號' in store_addresses
    True
    """
    ### BEGIN SOLUTION
    # ET.parse returns an ElementTree; findall searches from its root element.
    tree = ET.parse(xml_file)
    address_xpath = './/Address'
    store_addresses = [e.text for e in tree.findall(address_xpath)]
    return store_addresses
    ### END SOLUTION
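
How the `.//Address` path behaves can be sketched on an inline document. Only the `Address` tag name comes from the solution above; the `Stores`, `Store`, and `Name` elements are hypothetical stand-ins for whatever `stores.xml` actually contains.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; only the <Address> tag name comes from the solution.
sample = """
<Stores>
  <Store>
    <Name>Store A</Name>
    <Address>台北市信義區虎林街85號</Address>
  </Store>
  <Store>
    <Name>Store B</Name>
    <Address>台北市信義區信義路五段7號35樓</Address>
  </Store>
</Stores>
"""

root = ET.fromstring(sample)
# './/Address' matches <Address> elements at any depth below the root.
store_addresses = [e.text for e in root.findall(".//Address")]
print(len(store_addresses))
```

Because the path starts with `.//`, the nesting depth of `Address` does not need to be known in advance.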

## 03. Define a function named `extract_movie_titles` that is able to extract the titles of top rated movies from IMDb.com given an HTML file.

- Expected inputs: an HTML file.
- Expected outputs: a list.

In [4]:
def extract_movie_titles(html_file):
    """
    >>> movie_titles = extract_movie_titles('movies.html')
    >>> len(movie_titles)
    250
    >>> 'The Shawshank Redemption' in movie_titles
    True
    >>> 'The Godfather' in movie_titles
    True
    >>> 'The Dark Knight' in movie_titles
    True
    >>> 'Forrest Gump' in movie_titles
    True
    """
    ### BEGIN SOLUTION
    with open(html_file) as hf:
        soup = BeautifulSoup(hf)
    movie_titles = [e.text for e in soup.select('.titleColumn a')]
    return movie_titles
    ### END SOLUTION

## 04. Define a function named `extract_movie_ratings` that is able to extract the ratings of top rated movies from IMDb.com given an HTML file.

- Expected inputs: an HTML file.
- Expected outputs: a list.

In [5]:
def extract_movie_ratings(html_file):
    """
    >>> movie_ratings = extract_movie_ratings('movies.html')
    >>> len(movie_ratings)
    250
    >>> max(movie_ratings)
    9.2
    >>> min(movie_ratings)
    8.0
    >>> sum(movie_ratings) / len(movie_ratings)
    8.253999999999975
    """
    ### BEGIN SOLUTION
    with open(html_file) as hf:
        soup = BeautifulSoup(hf)
    movie_ratings = [float(e.text) for e in soup.select('strong')]
    return movie_ratings
    ### END SOLUTION
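
The expected mean in the docstring, `8.253999999999975`, carries floating-point rounding noise, which is presumably why the unit test compares it with `assertAlmostEqual` rather than exact equality. A minimal sketch of the same idea, using made-up ratings rather than the IMDb data:

```python
import math

# Hypothetical ratings, not taken from movies.html.
ratings = [9.2, 9.1, 9.0, 9.0, 8.9]
mean_rating = sum(ratings) / len(ratings)

# Summing floats accumulates rounding error, so compare with a tolerance
# instead of ==. unittest's assertAlmostEqual applies a similar idea by
# rounding the difference to 7 decimal places by default.
print(math.isclose(mean_rating, 9.04))
```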

## 05. Define a function named `extract_movie_table` that is able to extract the rankings, titles, release years, and ratings of top rated movies from IMDb.com given an HTML file.

- Expected inputs: an HTML file.
- Expected outputs: a (250, 4) DataFrame.

In [6]:
def extract_movie_table(html_file):
    """
    >>> movie_table = extract_movie_table('movies.html')
    >>> print(movie_table)
         ranking                     title  release_year  rating
    0          1  The Shawshank Redemption          1994     9.2
    1          2             The Godfather          1972     9.1
    2          3    The Godfather: Part II          1974     9.0
    3          4           The Dark Knight          2008     9.0
    4          5              12 Angry Men          1957     8.9
    ..       ...                       ...           ...     ...
    245      246        The Princess Bride          1987     8.0
    246      247     The Battle of Algiers          1966     8.0
    247      248              Winter Sleep          2014     8.0
    248      249                Tangerines          2013     8.0
    249      250            The Terminator          1984     8.0

    [250 rows x 4 columns]
    >>> print(type(movie_table))
    <class 'pandas.core.frame.DataFrame'>
    >>> print(movie_table.shape)
    (250, 4)
    >>> print(movie_table["release_year"].min())
    1921
    >>> print(movie_table["release_year"].max())
    2020
    """
    ### BEGIN SOLUTION
    with open(html_file) as hf:
        soup = BeautifulSoup(hf)
    movie_titles = [e.text for e in soup.select('.titleColumn a')]
    release_years = [int(e.text.replace("(", "").replace(")", "")) for e in soup.select('.secondaryInfo')]
    movie_ratings = [float(e.text) for e in soup.select('strong')]
    movie_df = pd.DataFrame()
    movie_df["ranking"] = list(range(1, len(movie_titles)+1))
    movie_df["title"] = movie_titles
    movie_df["release_year"] = release_years
    movie_df["rating"] = movie_ratings
    return movie_df
    ### END SOLUTION

## 06. Define a function named `extract_dark_knight_release_dates` that is able to extract the titles and release dates from <https://simple.wikipedia.org/wiki/The_Dark_Knight_Trilogy>.

- Expected inputs: a URL.
- Expected outputs: 2 lists.

In [7]:
def extract_dark_knight_release_dates(request_url):
    """
    >>> titles, release_dates = extract_dark_knight_release_dates("https://simple.wikipedia.org/wiki/The_Dark_Knight_Trilogy")
    >>> print(titles)
    ['Batman Begins', 'The Dark Knight', 'The Dark Knight Rises']
    >>> print(release_dates)
    ['June 15, 2005', 'July 18, 2008', 'July 20, 2012']
    """
    ### BEGIN SOLUTION
    response = requests.get(request_url)
    soup = BeautifulSoup(response.text)
    titles = [e.text for e in soup.select("tbody tr td i")][:3]
    release_dates = [e.text.strip() for e in soup.select("td:nth-child(2)")][:3]
    return titles, release_dates
    ### END SOLUTION

## 07. Define a function named `extract_dark_knight_box_offices` that is able to extract the titles and worldwide box offices from <https://simple.wikipedia.org/wiki/The_Dark_Knight_Trilogy>.

- Expected inputs: a URL.
- Expected outputs: 2 lists.

In [8]:
def extract_dark_knight_box_offices(request_url):
    """
    >>> titles, box_offices = extract_dark_knight_box_offices("https://simple.wikipedia.org/wiki/The_Dark_Knight_Trilogy")
    >>> print(titles)
    ['Batman Begins', 'The Dark Knight', 'The Dark Knight Rises']
    >>> print(box_offices)
    ['$374,218,673', '$1,004,934,033', '$1,084,939,099']
    """
    ### BEGIN SOLUTION
    response = requests.get(request_url)
    soup = BeautifulSoup(response.text)
    titles = [e.text for e in soup.select("tbody tr td i")][:3]
    box_offices = [e.text.strip() for e in soup.select("td:nth-child(6)")]
    return titles, box_offices
    ### END SOLUTION

## 08. Define a function named `extract_dark_knight_trilogy` that is able to extract the specified information from <https://simple.wikipedia.org/wiki/The_Dark_Knight_Trilogy> as a DataFrame.

- Expected inputs: a URL.
- Expected outputs: a (3, 3) DataFrame.

```
                   title release_date  box_office
0          Batman Begins   2005-06-15   374218673
1        The Dark Knight   2008-07-18  1004934033
2  The Dark Knight Rises   2012-07-20  1084939099
```

In [9]:
def extract_dark_knight_trilogy(request_url):
    """
    >>> dark_knight_trilogy = extract_dark_knight_trilogy("https://simple.wikipedia.org/wiki/The_Dark_Knight_Trilogy")
    >>> print(type(dark_knight_trilogy))
    <class 'pandas.core.frame.DataFrame'>
    >>> print(dark_knight_trilogy.shape)
    (3, 3)
    """
    ### BEGIN SOLUTION
    response = requests.get(request_url)
    soup = BeautifulSoup(response.text)
    titles = [e.text for e in soup.select("tbody tr td i")][:3]
    box_offices = [e.text.strip() for e in soup.select("td:nth-child(6)")]
    box_offices = [int(i.replace("$", "").replace(",", "")) for i in box_offices]
    release_dates = [e.text.strip() for e in soup.select("td:nth-child(2)")][:3]
    split_release_dates = [i.split() for i in release_dates]
    month_dict = {
        "June": "06",
        "July": "07"
    }
    release_dates_to_df = []
    for lst in split_release_dates:
        month_part = month_dict[lst[0]]
        day_part = lst[1].replace(",", "")
        year_part = lst[2]
        release_date = "{}-{}-{}".format(year_part, month_part, day_part)
        release_dates_to_df.append(release_date)
    out_df = pd.DataFrame()
    out_df["title"] = titles
    out_df["release_date"] = release_dates_to_df
    out_df["box_office"] = box_offices
    return out_df
    ### END SOLUTION
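
The cleaning steps in this solution can also be sketched with the standard library alone. This is an alternative under stated assumptions, not the solution's method: `datetime.strptime` stands in for the hand-written `month_dict`, which covers only June and July, and the input strings are copied from the expected outputs of exercises 06 and 07.

```python
from datetime import datetime

# Values copied from the docstrings of exercises 06 and 07.
release_dates = ["June 15, 2005", "July 18, 2008", "July 20, 2012"]
box_offices = ["$374,218,673", "$1,004,934,033", "$1,084,939,099"]

# %B parses the full English month name (under the default C locale).
iso_dates = [
    datetime.strptime(d, "%B %d, %Y").strftime("%Y-%m-%d") for d in release_dates
]
# Strip the currency symbol and thousands separators before converting to int.
box_office_ints = [int(b.replace("$", "").replace(",", "")) for b in box_offices]

print(iso_dates)
print(box_office_ints)
```

Parsing with a format string also zero-pads single-digit days, which plain string formatting of the split parts would not do.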

## 09. Define a function named `extract_lord_of_the_rings_trilogy` that is able to extract the specified information from <https://en.wikipedia.org/wiki/The_Lord_of_the_Rings_(film_series)> as a DataFrame.

- Expected inputs: a URL.
- Expected outputs: a (3, 3) DataFrame.

```
                        title release_date  box_office
0  The Fellowship of the Ring   2001-12-19   897690072
1              The Two Towers   2002-12-18   947495095
2      The Return of the King   2003-12-17  1146030912
```

In [10]:
def extract_lord_of_the_rings_trilogy(request_url):
    """
    >>> lord_of_the_rings_trilogy = extract_lord_of_the_rings_trilogy("https://en.wikipedia.org/wiki/The_Lord_of_the_Rings_(film_series)")
    >>> print(type(lord_of_the_rings_trilogy))
    <class 'pandas.core.frame.DataFrame'>
    >>> print(lord_of_the_rings_trilogy.shape)
    (3, 3)
    """
    ### BEGIN SOLUTION
    response = requests.get(request_url)
    soup = BeautifulSoup(response.text)
    titles = [e.text for e in soup.select("i a")][1:4]
    box_offices = [e.text.strip() for e in soup.select("tr~ tr+ tr td:nth-child(5)")]
    box_offices = [int(i.replace("$", "").replace(",", "")) for i in box_offices]
    release_dates = [e.text.strip().replace("\xa0", " ") for e in soup.select("td+ td:nth-child(2)")][3:6]
    split_release_dates = [i.split() for i in release_dates]
    release_dates_to_df = [lst[3].replace("(", "").replace(")", "") for lst in split_release_dates]
    out_df = pd.DataFrame()
    out_df["title"] = titles
    out_df["release_date"] = release_dates_to_df
    out_df["box_office"] = box_offices
    return out_df
    ### END SOLUTION
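
The release-date handling above relies on the cell text carrying the ISO date as its fourth whitespace-separated token, wrapped in parentheses. The sample string below is an assumption about that cell format, inferred from the solution's `lst[3]` lookup and the expected output rather than copied from the page:

```python
# Assumed cell text: non-breaking spaces (\xa0) inside the written-out date,
# followed by the ISO date in parentheses.
raw_cell = "19\xa0December\xa02001 (2001-12-19)\n"

cleaned = raw_cell.strip().replace("\xa0", " ")
tokens = cleaned.split()
# tokens -> ['19', 'December', '2001', '(2001-12-19)']
release_date = tokens[3].replace("(", "").replace(")", "")
print(release_date)
```

If the page layout changes so that the parenthesised date is no longer the fourth token, the index lookup will need to be adjusted.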

## 10. Define a function named `create_trilogy_df` that is able to concatenate the 2 DataFrames obtained from previous exercises.

- Expected inputs: None.
- Expected outputs: a (6, 4) DataFrame.

```
                        title release_date  box_office           director
0  The Fellowship of the Ring   2001-12-19   897690072      Peter Jackson
1              The Two Towers   2002-12-18   947495095      Peter Jackson
2      The Return of the King   2003-12-17  1146030912      Peter Jackson
3               Batman Begins   2005-06-15   374218673  Christopher Nolan
4             The Dark Knight   2008-07-18  1004934033  Christopher Nolan
5       The Dark Knight Rises   2012-07-20  1084939099  Christopher Nolan
```

In [11]:
def create_trilogy_df():
    """
    >>> trilogy_df = create_trilogy_df()
    >>> print(type(trilogy_df))
    <class 'pandas.core.frame.DataFrame'>
    >>> print(trilogy_df.shape)
    (6, 4)
    """
    ### BEGIN SOLUTION
    dark_knight_trilogy = extract_dark_knight_trilogy("https://simple.wikipedia.org/wiki/The_Dark_Knight_Trilogy")
    lord_of_the_rings_trilogy = extract_lord_of_the_rings_trilogy("https://en.wikipedia.org/wiki/The_Lord_of_the_Rings_(film_series)")
    out_df = pd.concat([lord_of_the_rings_trilogy, dark_knight_trilogy], ignore_index=True)
    directors = ["Peter Jackson"] * 3 + ["Christopher Nolan"] * 3
    col_loc = out_df.shape[1]
    out_df.insert(col_loc, "director", directors)
    return out_df
    ### END SOLUTION

## Run the tests!

Kernel -> Restart & Run All -> Restart and Run All Cells.

In [12]:
class TestWebScraping(unittest.TestCase):
    def test_01_extract_team_names(self):
        team_names = extract_team_names('teams.json')
        self.assertEqual(len(team_names), 30)
        self.assertTrue("Boston Celtics" in team_names)
        self.assertTrue("Brooklyn Nets" in team_names)
        self.assertTrue("Los Angeles Lakers" in team_names)
        self.assertTrue("Phoenix Suns" in team_names)
    def test_02_extract_store_addresses(self):
        store_addresses = extract_store_addresses('stores.xml')
        self.assertEqual(len(store_addresses), 74)
        self.assertTrue('台北市信義區信義路五段7號35樓' in store_addresses)
        self.assertTrue('台北市信義區吳興街156巷2弄2號4號1樓' in store_addresses)
        self.assertTrue('台北市信義區忠孝東路五段68號24樓' in store_addresses)
        self.assertTrue('台北市信義區虎林街85號' in store_addresses)
    def test_03_extract_movie_titles(self):
        movie_titles = extract_movie_titles('movies.html')
        self.assertEqual(len(movie_titles), 250)
        self.assertTrue('The Shawshank Redemption' in movie_titles)
        self.assertTrue('The Godfather' in movie_titles)
        self.assertTrue('The Dark Knight' in movie_titles)
        self.assertTrue('Forrest Gump' in movie_titles)
    def test_04_extract_movie_ratings(self):
        movie_ratings = extract_movie_ratings('movies.html')
        self.assertEqual(len(movie_ratings), 250)
        self.assertAlmostEqual(max(movie_ratings), 9.2)
        self.assertAlmostEqual(min(movie_ratings), 8.0)
        self.assertAlmostEqual(sum(movie_ratings) / len(movie_ratings), 8.253999999999975)
    def test_05_extract_movie_table(self):
        movie_table = extract_movie_table('movies.html')
        self.assertIsInstance(movie_table, pd.core.frame.DataFrame)
        self.assertEqual(movie_table.shape, (250, 4))
        self.assertEqual(movie_table["release_year"].min(), 1921)
        self.assertEqual(movie_table["release_year"].max(), 2020)
    def test_06_extract_dark_knight_release_dates(self):
        titles, release_dates = extract_dark_knight_release_dates("https://simple.wikipedia.org/wiki/The_Dark_Knight_Trilogy")
        self.assertEqual(titles, ['Batman Begins', 'The Dark Knight', 'The Dark Knight Rises'])
        self.assertEqual(release_dates, ['June 15, 2005', 'July 18, 2008', 'July 20, 2012'])
    def test_07_extract_dark_knight_box_offices(self):
        titles, box_offices = extract_dark_knight_box_offices("https://simple.wikipedia.org/wiki/The_Dark_Knight_Trilogy")
        self.assertEqual(titles, ['Batman Begins', 'The Dark Knight', 'The Dark Knight Rises'])
        self.assertEqual(box_offices, ['$374,218,673', '$1,004,934,033', '$1,084,939,099'])
    def test_08_extract_dark_knight_trilogy(self):
        dark_knight_trilogy = extract_dark_knight_trilogy("https://simple.wikipedia.org/wiki/The_Dark_Knight_Trilogy")
        self.assertIsInstance(dark_knight_trilogy, pd.core.frame.DataFrame)
        self.assertEqual(dark_knight_trilogy.shape, (3, 3))
    def test_09_extract_lord_of_the_rings_trilogy(self):
        lord_of_the_rings_trilogy = extract_lord_of_the_rings_trilogy("https://en.wikipedia.org/wiki/The_Lord_of_the_Rings_(film_series)")
        self.assertIsInstance(lord_of_the_rings_trilogy, pd.core.frame.DataFrame)
        self.assertEqual(lord_of_the_rings_trilogy.shape, (3, 3))
    def test_10_create_trilogy_df(self):
        trilogy_df = create_trilogy_df()
        self.assertIsInstance(trilogy_df, pd.core.frame.DataFrame)
        self.assertEqual(trilogy_df.shape, (6, 4))
        
suite = unittest.TestLoader().loadTestsFromTestCase(TestWebScraping)
runner = unittest.TextTestRunner(verbosity=2)
test_results = runner.run(suite)
number_of_failures = len(test_results.failures)
number_of_errors = len(test_results.errors)
number_of_test_runs = test_results.testsRun
number_of_successes = number_of_test_runs - (number_of_failures + number_of_errors)
cwd = os.getcwd()
folder_name = os.path.basename(cwd)  # portable across path separators
with open("../exercise_index.json", "r") as content:
    exercise_index = json.load(content)
chapter_name = exercise_index[folder_name]

test_01_extract_team_names (__main__.TestWebScraping) ... ok
test_02_extract_store_addresses (__main__.TestWebScraping) ... ok
test_03_extract_movie_titles (__main__.TestWebScraping) ... ok
test_04_extract_movie_ratings (__main__.TestWebScraping) ... ok
test_05_extract_movie_table (__main__.TestWebScraping) ... ok
test_06_extract_dark_knight_release_dates (__main__.TestWebScraping) ... ok
test_07_extract_dark_knight_box_offices (__main__.TestWebScraping) ... ok
test_08_extract_dark_knight_trilogy (__main__.TestWebScraping) ... ok
test_09_extract_lord_of_the_rings_trilogy (__main__.TestWebScraping) ... ok
test_10_create_trilogy_df (__main__.TestWebScraping) ... ok

----------------------------------------------------------------------
Ran 10 tests in 4.469s

OK


In [13]:
print('In the "{}" chapter, you answered {} of {} Python exercises correctly.'.format(chapter_name, number_of_successes, number_of_test_runs))

In the "Scraping web data with Python" chapter, you answered 10 of 10 Python exercises correctly.
