# Data Analysis with Python

> Web Scraping with Python

Kuo, Yao-Jen <yaojenkuo@datainpoint.com> from [DATAINPOINT](https://www.datainpoint.com/)

## Instructions

- We've imported the necessary modules/libraries at the beginning of each exercise.
- We've put the necessary files (if any) in the working directory of each exercise.
- We've defined the names of functions/inputs/arguments for you.
- Write your solution between the comments `### BEGIN SOLUTION` and `### END SOLUTION`.
- Run the tests to check whether your solutions are right: Kernel -> Restart & Run All -> Restart and Run All Cells.
- You can run the tests after each question or after finishing all questions.

In [1]:
import unittest
import json
import xml.etree.ElementTree as ET
from bs4 import BeautifulSoup

## 00. Define a function named `extract_team_names` that extracts the names of NBA teams from a given JSON file.

- Expected inputs: a JSON file.
- Expected outputs: a list.

In [2]:
def extract_team_names(json_file):
    """
    >>> team_names = extract_team_names('teams.json')
    >>> len(team_names)
    30
    >>> "Boston Celtics" in team_names
    True
    >>> "Brooklyn Nets" in team_names
    True
    >>> "Los Angeles Lakers" in team_names
    True
    >>> "Phoenix Suns" in team_names
    True
    """
    ### BEGIN SOLUTION
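    # Load the whole JSON document into a dictionary
    with open(json_file) as jf:
        teams_dict = json.load(jf)
    # Team records live under league -> standard
    nba_teams = teams_dict['league']['standard']
    # Keep only the full names of entries flagged as NBA franchises
    team_names = [e['fullName'] for e in nba_teams if e['isNBAFranchise']]
    return team_names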
    ### END SOLUTION
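
The filtering step above can be tried on a small in-memory document before opening the real file. The sketch below is only an illustration: the nesting (`league` -> `standard`) and the keys `fullName` and `isNBAFranchise` are taken from the solution, while the sample records themselves, including "Example Exhibition Team", are hypothetical.

```python
import json

# Hypothetical sample that mimics the shape the solution assumes for teams.json;
# the records are made up for illustration.
sample = """
{
  "league": {
    "standard": [
      {"fullName": "Boston Celtics", "isNBAFranchise": true},
      {"fullName": "Example Exhibition Team", "isNBAFranchise": false},
      {"fullName": "Phoenix Suns", "isNBAFranchise": true}
    ]
  }
}
"""

teams_dict = json.loads(sample)
entries = teams_dict["league"]["standard"]

# Keep the full name only when the entry is flagged as an NBA franchise
team_names = [e["fullName"] for e in entries if e["isNBAFranchise"]]
print(team_names)
```

The `if e["isNBAFranchise"]` condition is what drops non-franchise entries, which is presumably why the expected length is exactly 30.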

## 01. Define a function named `extract_store_addresses` that extracts the addresses of 7-11 convenience stores in Xinyi District, Taipei from a given XML file.

- Expected inputs: an XML file.
- Expected outputs: a list.

In [3]:
def extract_store_addresses(xml_file):
    """
    >>> store_addresses = extract_store_addresses('stores.xml')
    >>> len(store_addresses)
    74
    >>> '台北市信義區信義路五段7號35樓' in store_addresses
    True
    >>> '台北市信義區吳興街156巷2弄2號4號1樓' in store_addresses
    True
    >>> '台北市信義區忠孝東路五段68號24樓' in store_addresses
    True
    >>> '台北市信義區虎林街85號' in store_addresses
    True
    """
    ### BEGIN SOLUTION
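    # ET.parse returns an ElementTree; findall searches from its root element
    tree = ET.parse(xml_file)
    # Match every <Address> element at any depth
    address_xpath = './/Address'
    store_addresses = [e.text for e in tree.findall(address_xpath)]
    return store_addresses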
    ### END SOLUTION
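
As a rough illustration of how the `.//Address` path behaves, the sketch below parses a small XML string instead of a file. Only the `Address` tag name comes from the solution; the wrapper tags `Stores`, `Store`, and `Name` are assumptions made up for this example and may not match `stores.xml`.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: only the <Address> tag name is taken from the exercise.
sample = """
<Stores>
  <Store>
    <Name>Example A</Name>
    <Address>台北市信義區虎林街85號</Address>
  </Store>
  <Store>
    <Name>Example B</Name>
    <Address>台北市信義區信義路五段7號35樓</Address>
  </Store>
</Stores>
"""

# fromstring returns the root Element directly (no ElementTree wrapper)
root = ET.fromstring(sample.strip())

# './/Address' matches Address elements at any depth below the root
addresses = [e.text for e in root.findall(".//Address")]
print(addresses)
```

Because the path starts with `.//`, the nesting depth of `Address` does not matter, so the same expression works whether or not the real file wraps each store in extra elements.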

## 02. Define a function named `extract_movie_titles` that extracts the titles of top-rated movies from IMDb.com given an HTML file.

- Expected inputs: an HTML file.
- Expected outputs: a list.

In [4]:
def extract_movie_titles(html_file):
    """
    >>> movie_titles = extract_movie_titles('movies.html')
    >>> len(movie_titles)
    250
    >>> 'The Shawshank Redemption' in movie_titles
    True
    >>> 'The Godfather' in movie_titles
    True
    >>> 'The Dark Knight' in movie_titles
    True
    >>> 'Forrest Gump' in movie_titles
    True
    """
    ### BEGIN SOLUTION
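    with open(html_file) as hf:
        # Name the parser explicitly so bs4 does not have to guess one
        soup = BeautifulSoup(hf, 'html.parser')
    # Each title is the link inside an element with class "titleColumn"
    movie_titles = [e.text for e in soup.select('.titleColumn a')]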
    return movie_titles
    ### END SOLUTION
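
To see what the `.titleColumn a` selector picks up, here is a minimal sketch on a hand-written HTML fragment. The markup is an assumption loosely modelled on a table layout, not a copy of `movies.html`; only the selector itself comes from the solution. It requires the `bs4` package, as the exercise does.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment; the real movies.html is likely far more elaborate.
sample = """
<table>
  <tr>
    <td class="titleColumn">1. <a href="#">The Shawshank Redemption</a> <span>(1994)</span></td>
  </tr>
  <tr>
    <td class="titleColumn">2. <a href="#">The Godfather</a> <span>(1972)</span></td>
  </tr>
  <tr>
    <td class="other"><a href="#">Not a title</a></td>
  </tr>
</table>
"""

soup = BeautifulSoup(sample, "html.parser")

# '.titleColumn a' is a descendant selector: <a> tags inside class="titleColumn"
titles = [a.text for a in soup.select(".titleColumn a")]
print(titles)
```

The link in the `other` cell is skipped because it has no ancestor with the `titleColumn` class, and taking `.text` of the `<a>` tag leaves out the rank and year that sit beside it.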

## 03. Define a function named `extract_movie_ratings` that extracts the ratings of top-rated movies from IMDb.com given an HTML file.

- Expected inputs: an HTML file.
- Expected outputs: a list.

In [5]:
def extract_movie_ratings(html_file):
    """
    >>> movie_ratings = extract_movie_ratings('movies.html')
    >>> len(movie_ratings)
    250
    >>> max(movie_ratings)
    9.2
    >>> min(movie_ratings)
    8.0
    >>> sum(movie_ratings) / len(movie_ratings)
    8.253999999999975
    """
    ### BEGIN SOLUTION
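    with open(html_file) as hf:
        # Name the parser explicitly so bs4 does not have to guess one
        soup = BeautifulSoup(hf, 'html.parser')
    # Ratings are the text of the <strong> tags; convert each one to float
    movie_ratings = [float(e.text) for e in soup.select('strong')]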
    return movie_ratings
    ### END SOLUTION
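
The mean in the docstring above, `8.253999999999975`, carries floating-point rounding from summing 250 values, so an exact comparison can be fragile. The sketch below shows a tolerance-based comparison with the standard library. The three ratings are hypothetical and are not taken from `movies.html`.

```python
import math

# Hypothetical ratings used only to illustrate the comparison
ratings = [9.2, 9.1, 8.0]
mean = sum(ratings) / len(ratings)

# Compare within a tolerance instead of testing exact equality
close_enough = math.isclose(mean, 8.7666666, abs_tol=1e-6)
print(close_enough)

# Rounding is another way to get a stable value for display
rounded_mean = round(mean, 2)
print(rounded_mean)
```

`unittest`'s `assertAlmostEqual` applies a similar idea by rounding the difference to 7 decimal places by default.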

## Run tests!

Kernel -> Restart & Run All -> Restart and Run All Cells.

In [6]:
class TestWebScraping(unittest.TestCase):
    def test_00_extract_team_names(self):
        team_names = extract_team_names('teams.json')
        self.assertEqual(len(team_names), 30)
        self.assertIn("Boston Celtics", team_names)
        self.assertIn("Brooklyn Nets", team_names)
        self.assertIn("Los Angeles Lakers", team_names)
        self.assertIn("Phoenix Suns", team_names)
    def test_01_extract_store_addresses(self):
        store_addresses = extract_store_addresses('stores.xml')
        self.assertEqual(len(store_addresses), 74)
        self.assertIn('台北市信義區信義路五段7號35樓', store_addresses)
        self.assertIn('台北市信義區吳興街156巷2弄2號4號1樓', store_addresses)
        self.assertIn('台北市信義區忠孝東路五段68號24樓', store_addresses)
        self.assertIn('台北市信義區虎林街85號', store_addresses)
    def test_02_extract_movie_titles(self):
        movie_titles = extract_movie_titles('movies.html')
        self.assertEqual(len(movie_titles), 250)
        self.assertIn('The Shawshank Redemption', movie_titles)
        self.assertIn('The Godfather', movie_titles)
        self.assertIn('The Dark Knight', movie_titles)
        self.assertIn('Forrest Gump', movie_titles)
    def test_03_extract_movie_ratings(self):
        movie_ratings = extract_movie_ratings('movies.html')
        self.assertEqual(len(movie_ratings), 250)
        self.assertAlmostEqual(max(movie_ratings), 9.2)
        self.assertAlmostEqual(min(movie_ratings), 8.0)
        self.assertAlmostEqual(sum(movie_ratings) / len(movie_ratings), 8.253999999999975)
        
suite = unittest.TestLoader().loadTestsFromTestCase(TestWebScraping)
runner = unittest.TextTestRunner(verbosity=2)
test_results = runner.run(suite)
number_of_failures = len(test_results.failures)
number_of_errors = len(test_results.errors)
number_of_test_runs = test_results.testsRun
number_of_successes = number_of_test_runs - (number_of_failures + number_of_errors)

test_00_extract_team_names (__main__.TestWebScraping) ... ok
test_01_extract_store_addresses (__main__.TestWebScraping) ... ok
test_02_extract_movie_titles (__main__.TestWebScraping) ... ok
test_03_extract_movie_ratings (__main__.TestWebScraping) ... ok

----------------------------------------------------------------------
Ran 4 tests in 1.219s

OK


In [7]:
print("You've got {} successes among {} questions.".format(number_of_successes, number_of_test_runs))

You've got 4 successes among 4 questions.
