Before you turn this problem in, make sure everything runs as expected. First, **restart the kernel** (in the menubar, select Kernel$\rightarrow$Restart) and then **run all cells** (in the menubar, select Cell$\rightarrow$Run All).

Make sure you fill in any place that says `YOUR CODE HERE` or "YOUR ANSWER HERE", as well as your name and collaborators below:

In [1]:
NAME = "Caleb Andree"
COLLABORATORS = ""

---

# Lab 8 Web Scraping [Total: 4 points]

The purpose of this assignment is for you to engage with a concrete web scraping task. This will be accomplished through a coding assignment. You will carry out this task in the present notebook, and use the notebook to document the various steps of the exercise and to answer all questions.


## Required skills

This lab will let you practice the following skills:
- Download HTML
- Parse HTML

A few additional resources can be found here:
- https://docs.python-requests.org/
- https://beautiful-soup-4.readthedocs.io


## Table of Contents
<ul>
    <li><a href="#Submission-checklist">Submission checklist</a></li>
    <li><a href="#Q1">Question 1</a></li>
    <li><a href="#Q2">Question 2</a></li>
    <li><a href="#Q3">Bonus Question</a></li>
</ul>

## Submission checklist

**Points**: 1

Before submitting make sure that:

1. Your name is included above, along with the names of any collaborators you worked with.
2. All Markdown cells you edited are rendering correctly, especially the ones with answers.
3. You have removed any `raise NotImplementedError()` line from your code cells.

## Q1

**Points**: 1

Write a function `gettextbytag` that extracts all the text of a given HTML tag within an HTML file. Your function should accept 2 arguments -- the name of an HTML file as a string, and the name of an HTML tag as a string.

It should return a list of strings containing the text of each occurrence of the tag in the file.

For example, if the file `q1file.html` is:

```html
<!DOCTYPE html>
<html>
<head>
<title>Page Title</title>
</head>
<body>

<h1>This is a Heading</h1>
<p class="firstpara">This is first paragraph.</p>
<p>This is second paragraph.</p>
<p id="third">This is third paragraph.</p>

</body>
</html>
```

and the tag is `"h1"`, then your function should return
```
['This is a Heading']
```
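
When a tag occurs more than once, the returned list has one entry per occurrence. Here is a minimal sketch of that behavior, parsing the sample page from an in-memory string rather than from `q1file.html`. The helper name `text_by_tag` and the constant `SAMPLE_HTML` are made up for this illustration and are not part of the assignment:

```python
from bs4 import BeautifulSoup

# Same markup as the q1file.html example above, held in a string (illustrative only)
SAMPLE_HTML = """<!DOCTYPE html>
<html>
<head>
<title>Page Title</title>
</head>
<body>

<h1>This is a Heading</h1>
<p class="firstpara">This is first paragraph.</p>
<p>This is second paragraph.</p>
<p id="third">This is third paragraph.</p>

</body>
</html>
"""


def text_by_tag(html, tag_name):
    """Return the text of every occurrence of tag_name in an HTML string."""
    soup = BeautifulSoup(html, "html.parser")
    return [tag.get_text() for tag in soup.find_all(tag_name)]


paragraphs = text_by_tag(SAMPLE_HTML, "p")
print(paragraphs)
# ['This is first paragraph.', 'This is second paragraph.', 'This is third paragraph.']
```

Attributes such as `class` or `id` do not affect the match: all three `<p>` elements are returned, in document order.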

In [2]:
from bs4 import BeautifulSoup

# YOUR CODE HERE
def gettextbytag(file, tag_name):
    # Use a context manager so the file is closed once parsing is done
    with open(file, 'r') as html_file:
        soup = BeautifulSoup(html_file, "html.parser")
    # Collect the text of every occurrence of the requested tag
    return [tag.get_text() for tag in soup.find_all(tag_name)]

Use the cell below to run your function and see what it returns. You may want to test different tags to see if your code returns the correct answer.

In [3]:
gettextbytag("q1file.html", "h1")

['This is a Heading']

In [4]:
q1CORRECT_ANSWER = ["This is a Heading"]

# Call the student's function
q1STUDENT_ANSWER = gettextbytag("q1file.html", "h1")

# Check if the student's answer matches the correct answer
try:
    assert q1STUDENT_ANSWER == q1CORRECT_ANSWER
    print("All tests passed! 👍")
except AssertionError:
    print("Error: your solution does not match the correct one.")

All tests passed! 👍


## Q2

__Points__: 1

Write a function called `getbooksprice` that scrapes the prices of books from the front page of the website [books.toscrape.com](https://books.toscrape.com).

Your function should take no parameters. It should fetch the front page of the website and return a Python list of float values holding the price (in pounds) of each book on it.

**Hint**: if you see the symbol `Â` printed in the price, set the encoding on the response, like this:
```
    response = requests.get("https://books.toscrape.com/")
    response.encoding = 'utf-8'
```
before extracting the text from the response object.
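
The stray character in front of the pound sign is what you get when UTF-8 bytes are decoded with a single-byte encoding. A likely cause, assuming the server declares no charset in its `Content-Type` header, is that `requests` falls back to ISO-8859-1 for text responses. The sketch below reproduces the effect without any network access; the variable names are illustrative only:

```python
# Bytes as a server would send them for a UTF-8 encoded page
raw_bytes = "£51.77".encode("utf-8")

# Decoding with the wrong (single-byte) encoding splits the two-byte
# pound sign into two separate characters
wrong_text = raw_bytes.decode("iso-8859-1")
print(wrong_text)  # Â£51.77

# Decoding with the right encoding restores the original text
right_text = raw_bytes.decode("utf-8")
print(right_text)  # £51.77

# Once the text is correct, stripping the currency symbol leaves a number
price = float(right_text.strip("£"))
print(price)  # 51.77
```

Setting `response.encoding = 'utf-8'` before reading `response.text` makes `requests` take the second path.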

In [5]:
from bs4 import BeautifulSoup
import requests 

# YOUR CODE HERE
def getbooksprice():
    response = requests.get("https://books.toscrape.com/")
    response.encoding = 'utf-8'
    
    price_list = []
    soup = BeautifulSoup(response.text, "html.parser") 
    for price in soup.find_all('p', class_='price_color'):
        price_text = price.get_text().strip('£')
        price_as_float = float(price_text)
        price_list.append(price_as_float)
    return price_list
 
    
#raise NotImplementedError()

Use the cell below to run your function and see what it returns.

In [6]:
getbooksprice()

[51.77,
 53.74,
 50.1,
 47.82,
 54.23,
 22.65,
 33.34,
 17.93,
 22.6,
 52.15,
 13.99,
 20.66,
 17.46,
 52.29,
 35.02,
 57.25,
 23.88,
 37.59,
 51.33,
 45.17]

In [7]:
q2CORRECT_ANS = [51.77, 53.74, 50.1, 47.82, 54.23, 22.65, 33.34, 17.93, 22.6, 52.15, 13.99, 20.66, 17.46, 52.29, 35.02, 57.25, 23.88, 37.59, 51.33, 45.17]
q2STUDENT_ANS = getbooksprice()

assert type(q2STUDENT_ANS) is list, f"Error: Your solution returned a {type(q2STUDENT_ANS)}, while it should return a list"
assert q2CORRECT_ANS == q2STUDENT_ANS, f"Error: Your solution returned {q2STUDENT_ANS!r}. Correct answer: {q2CORRECT_ANS!r}"

# Print a summary
print("All tests passed! 👍")

All tests passed! 👍


## Q3

__Points__: 1

Write a function called `getbookspricebonus` that scrapes the prices of all the books in the website catalog, not just the front page. Your function should fetch one page at a time, scrape the book prices, and append them to the final array or list. Besides the front page, there are 50 additional catalog pages, each with 20 books, so the array you return should contain exactly 1020 prices (20 + 50 × 20).

**Hint 1**: if you see the symbol `Â` printed in the price, set the encoding on the response, like this:
```
    response = requests.get("https://books.toscrape.com/")
    response.encoding = 'utf-8'
```
before extracting the text from the response object.

**Hint 2**: Each additional catalog page is located at the following URL:

    https://books.toscrape.com/catalogue/page-XX.html
    
Where `XX` ranges from 1 to 50.
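
The page URLs can be generated from the pattern above with `str.format`. This is a small sketch that only builds the list of addresses and makes no requests; the names `BASE_URL` and `catalog_page_urls` are hypothetical:

```python
# URL pattern from Hint 2, with {} standing in for XX
BASE_URL = "https://books.toscrape.com/catalogue/page-{}.html"


def catalog_page_urls(first=1, last=50):
    """Build the catalog page URLs for page numbers first..last (inclusive)."""
    # range() excludes its upper bound, hence last + 1
    return [BASE_URL.format(number) for number in range(first, last + 1)]


urls = catalog_page_urls()
print(len(urls))   # 50
print(urls[0])     # https://books.toscrape.com/catalogue/page-1.html
print(urls[-1])    # https://books.toscrape.com/catalogue/page-50.html

# Expected number of prices: 20 from the front page plus 20 per catalog page
expected_total = 20 + len(urls) * 20
print(expected_total)  # 1020
```

Note that the page number is not zero-padded: page 1 is `page-1.html`, not `page-01.html`.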

In [8]:
from bs4 import BeautifulSoup
import requests 
import numpy as np

# YOUR CODE HERE
def getbookspricebonus():
    base_url = "https://books.toscrape.com/catalogue/page-{}.html"
    price_list = []

    front_page_response = requests.get("https://books.toscrape.com/")
    front_page_response.encoding = 'utf-8'

    if front_page_response.status_code == 200:
        front_page_soup = BeautifulSoup(front_page_response.text, "html.parser")
        for price in front_page_soup.find_all('p', class_='price_color'):
            price_text = price.get_text().strip('£')
            price_as_float = float(price_text)
            price_list.append(price_as_float)
            
    for page_number in range(1, 51):
        page_url = base_url.format(page_number)
        response = requests.get(page_url)
        response.encoding = 'utf-8'

        if response.status_code == 200:
            soup = BeautifulSoup(response.text, "html.parser")
            for price in soup.find_all('p', class_='price_color'):
                price_text = price.get_text().strip('£')
                price_as_float = float(price_text)
                price_list.append(price_as_float)
        else:
            print(f"Failed to fetch page {page_number}")

    return np.array(price_list)


#raise NotImplementedError()

Use the cell below to run your function and see what it returns.

In [9]:
getbookspricebonus()

array([51.77, 53.74, 50.1 , ..., 16.97, 53.98, 26.08])

In [10]:
import numpy as np
from numpy.testing import assert_array_equal

q3CORRECT_ANS = np.loadtxt('.solutionq3.txt')
q3STUDENT_ANS = getbookspricebonus()

assert type(q3STUDENT_ANS) is np.ndarray, f"Error: Your solution returned a {type(q3STUDENT_ANS)}, while it should return a numpy Array"
assert_array_equal(q3STUDENT_ANS, q3CORRECT_ANS) 

# Print a summary
print("All tests passed! 👍")

All tests passed! 👍
