In [1]:
import requests
from bs4 import BeautifulSoup
import numpy as np
import pandas as pd

# A Basic Example
## 1. Searching HTML with `BeautifulSoup`

BeautifulSoup is a Python library for parsing HTML documents. We can use it to extract data from HTML we fetch from the web, but first we'll try it out on a simple HTML example.

In [2]:
sample_html = """
<html>
<body>
    <h1> BeautifulSoup </h1>
    <p> BeautifulSoup is a Python library for parsing HTML documents. We can use it to extract data from HTML we fetch
        from the web.
        Here, we're just using it to parse some simple sample HTML. </p>

    <h2 class="important"> Searching the tree </h2>
    <p id="searching_description" style="color: red"> BeautifulSoup allows us to search the HTML tree in lots of different ways: by tag, by
        id, by CSS class, and so on. </p>
    <ol>
        <li> By tag: we could search for every li </li>
        <li> By id: we could search for the p tag with id="searching_description" </li>
        <li class="important"> By class: we could search for every tag with a given class </li>
    </ol>

</body>
</html>
"""

At present, `sample_html` is just a string. To search through it, we'll create a `BeautifulSoup` object: a searchable representation of the document.

In [3]:
soup = BeautifulSoup(sample_html, 'html.parser') # Parsing the HTML using BeautifulSoup and the built-in HTML parser 

We can now search the `soup` object to extract useful information. Let's try searching by tag, id, and class:

**First, by tag:** let's find the text of every list element (`<li>`) on the page:

In [4]:
list_elements = soup.find_all('li') # Finding all the list elements in the HTML
for element in list_elements:
    print(element.text) # Printing the text of each list element

 By tag: we could search for every li 
 By id: we could search for the p tag with id="searching_description" 
 By class: we could search for every tag with a given class 
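
The printed text above keeps the whitespace that surrounds each `<li>` in the HTML. As a minimal sketch, assuming we only want the visible words, `get_text(strip=True)` trims it. The `demo_html` string and variable names here are illustrative and not part of the workshop page.

```python
from bs4 import BeautifulSoup

# Illustrative HTML with padded text, similar to the sample above
demo_html = "<ol><li> By tag </li><li> By id </li></ol>"
demo_soup = BeautifulSoup(demo_html, "html.parser")

# get_text(strip=True) removes leading and trailing whitespace from the text
items = [li.get_text(strip=True) for li in demo_soup.find_all("li")]
print(items)
```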


**Next, by id:** let's find what the paragraph (`<p>`) with the `id` `searching_description` says, and take a look at its style.

In [5]:
# Instead of find_all, we can use find to get the first element that matches the search criteria
description_element = soup.find('p', {'id': 'searching_description'}) # Finding the paragraph with id="searching_description"

print('text:', description_element.text) # Printing the text of the paragraph with id="searching_description"
print('style:', description_element['style']) # Printing the value of the style attribute of the paragraph with id="searching_description"

text:  BeautifulSoup allows us to search the HTML tree in lots of different ways: by tag, by
        id, by CSS class, and so on. 
style: color: red


**Finally, by class:** let's find all elements with the `class` `important`.

In [6]:
important_elements = soup.find_all(class_='important') # Finding all the elements with class="important"
print(f"There are {len(important_elements)} elements with class='important'")
for element in important_elements:
    print(element) # Printing each element with class="important"

There are 2 elements with class='important'
<h2 class="important"> Searching the tree </h2>
<li class="important"> By class: we could search for every tag with a given class </li>
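
The same three searches can also be written as CSS selectors with BeautifulSoup's `select` and `select_one` methods. This is a sketch on a cut-down copy of the sample HTML (the `demo_html` string is illustrative), not a replacement for `find` and `find_all`.

```python
from bs4 import BeautifulSoup

# A cut-down, illustrative version of the sample HTML
demo_html = """
<h2 class="important"> Searching the tree </h2>
<p id="searching_description"> Some text </p>
<ol>
  <li> first </li>
  <li class="important"> second </li>
</ol>
"""
demo_soup = BeautifulSoup(demo_html, "html.parser")

by_tag = demo_soup.select("li")                         # every <li>
by_id = demo_soup.select_one("#searching_description")  # the element with this id
by_class = demo_soup.select(".important")               # every element with this class
important_items = demo_soup.select("li.important")      # tag and class combined

print(len(by_tag), by_id.name, len(by_class), len(important_items))
```

Selectors are handy when a search combines several conditions, such as "an `<li>` that also has the `important` class".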


## 2. Getting HTML from the web with `Requests`

<div style="display: flex; align-items: flex-start;">
    <div style="flex: 0 2 auto;">
        <img src="https://raw.githubusercontent.com/FM-ds/ScrapingWorkshop/main/notebook_images/sample_html_safari.png" width="400px">
    </div>
    <div style="flex: 1 1 auto; margin-top: 10px; margin-right: 150px">
        <p>We just extracted information from HTML which was defined locally in a string <code>sample_html</code>. Usually, we care about HTML found on the internet. As a simple example, the page defined in <code>sample_html</code> is available at <a href="http://www.fmcevoy.io/ScrapingWorkshop/sample_html">http://www.fmcevoy.io/ScrapingWorkshop/sample_html</a></p>
        <p>To download HTML (and any other resources) from the internet, we can use the <code>requests</code> module.</p>
    </div>
</div>


In [7]:
req = requests.get('http://www.fmcevoy.io/ScrapingWorkshop/sample_html') # Making a request for the sample HTML hosted on the web
req.text # looking at the text of the response

'<html>\n<body>\n    <h1> BeautifulSoup </h1>\n    <p> BeautifulSoup is a Python library for parsing HTML documents. We can use it to extract data from HTML we fetch\n        from the web.\n        Here, we\'re just using it to parse some simple sample HTML. </p>\n\n    <h2 class="important"> Searching the tree </h2>\n    <p id="searching_description" style="color: red"> BeautifulSoup allows us to search the HTML tree in lots of different ways: by tag, by\n        id, by CSS class, and so on. </p>\n    <ol>\n        <li> By tag: we could search for every li </li>\n        <li> By id: we could search for the p tag with id="searching_description" </li>\n        <li class="important"> By class: we could search for every tag with a given class </li>\n    </ol>\n\n<body>\n</html>'
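
The request above assumes the download succeeds. A more defensive sketch might set a timeout and check the status code before using the text. The helper name `fetch_html` and the 10-second default are assumptions made for illustration.

```python
import requests

def fetch_html(url, timeout=10):
    """Return the page text, or None if the request fails (hypothetical helper)."""
    try:
        response = requests.get(url, timeout=timeout)
        response.raise_for_status()  # Raises an error for 4xx and 5xx responses
    except requests.RequestException as error:
        # Covers connection problems, timeouts, bad URLs and HTTP errors
        print(f"Request failed: {error}")
        return None
    return response.text
```

Calling `fetch_html('http://www.fmcevoy.io/ScrapingWorkshop/sample_html')` should return the same text as `req.text` when the site is reachable.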

We can then make a traversable `soup` object using the response and search as we did before.

In [8]:
soup = BeautifulSoup(req.text, 'html.parser') # Instead of using the sample HTML, we're using the HTML from the web that we fetched in the previous step
soup.find_all('li') # Finding all the list elements in the HTML

[<li> By tag: we could search for every li </li>,
 <li> By id: we could search for the p tag with id="searching_description" </li>,
 <li class="important"> By class: we could search for every tag with a given class </li>]

# A Useful Example: Scraping Prices from M&S

<div style="display: flex; align-items: flex-start;">
    <div style="flex: 0 1.75 auto; ">
        <img src="https://raw.githubusercontent.com/FM-ds/ScrapingWorkshop/main/notebook_images/ms_plants.png" width="500px">
    </div>
    <div style="flex: 1 1 auto; margin-top: 10px; margin-right: 150px">
        Scraping the sample HTML was a basic introduction. Let's move on to collecting some real-world data.

Marks and Spencer (M&S) is a high-end British supermarket. Let's see if we can collect prices from one of their pages. Josh and I scrape around 100K prices every day to investigate inflation, but there are lots of other uses for this data as well:

- **Price Comparisons**: You could build a supermarket-vs-supermarket price comparison tool
- **Market Research**: You could track your competitors' pricing
- **Procurement**: You could watch for sales before buying

Today, we'll try scraping prices from just one page, the M&S plants page: https://www.marksandspencer.com/l/flowers-and-plants/plants?sort=best_seller+desc
    </div>
</div>




## Identifying Structure with Inspect Element

Before we write any code, we should use the browser's Inspect Element tool to see how prices, pictures and more are defined in the HTML.

<img src="https://raw.githubusercontent.com/FM-ds/ScrapingWorkshop/main/notebook_images/ms_inspect_element.png" width="800vw">

There are lots of elements and intimidating details, but two things stand out:

- Product details are contained in `<div>` elements with a `data-tagg` attribute that starts with `product-card-`
- Product titles are found in `<h2>` elements within the `product-card-` divs
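
To see how these two facts translate into a search, here is a minimal sketch on mock markup. The `mock_html` string is a hypothetical simplification (the real page is far more complex), and the `data-tagg` values are made up for illustration.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical, simplified markup standing in for the real product grid
mock_html = """
<div data-tagg="product-card-1"><h2>Oriental Lily</h2></div>
<div data-tagg="product-card-2"><h2>Rose Trough</h2></div>
<div data-tagg="banner"><h2>Not a product</h2></div>
"""
mock_soup = BeautifulSoup(mock_html, "html.parser")

# A compiled regex as the attribute value keeps only divs whose
# data-tagg starts with "product-card-"
cards = mock_soup.find_all("div", attrs={"data-tagg": re.compile(r"^product-card-")})

# Each card's title is the text of its <h2>
titles = [card.find("h2").get_text(strip=True) for card in cards]
print(titles)
```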



## Extracting Product Info

Let's put these two facts together to find all the product titles.
First, we'll download the HTML from the M&S site, and then we'll search for all the product cards to find their titles.

In [9]:
url = "https://www.marksandspencer.com/l/flowers-and-plants/plants?sort=best_seller+desc"
req = requests.get(url) # Making a request for the M&S plants page
soup = BeautifulSoup(req.text, 'html.parser') # Parsing the HTML using BeautifulSoup and the built-in HTML parser

### Product Titles

To find all the product titles, we'll first build a list of all the divs that contain product details by searching for divs whose `data-tagg` attribute contains "product-card".

For each of these product divs, we'll then search for the title (an `<h2>` element) and print it.

In [24]:
# Find all divs whose data-tagg attribute contains "product-card" (possibly alongside other values)
product_divs = soup.find_all('div', attrs={'data-tagg': lambda x: x and "product-card" in x}) # The lambda skips divs without the attribute
for product_div in product_divs[:5]: # Looping through the first 5 product divs
    name = product_div.find('h2').text # Finding the h2 tag within the div and getting its text
    print(name) # Printing the name of the product


Oriental Lily
Spring Flowering Basket
Rose Trough
Yellow Miniature Phalaenopsis Orchid in Ceramic Pot
White Miniature Phalaenopsis Orchid in Ceramic Pot


### Product Prices

Returning to inspect element, we can see that prices are defined in `<span>` elements within the product divs. Each product div contains many spans, so a simple approach is to look through all of them and keep the first one whose text starts with a `£` sign.

<img src="https://raw.githubusercontent.com/FM-ds/ScrapingWorkshop/main/notebook_images/ms_inspect_price.png" width="800vw">


In [25]:
# Find all divs whose data-tagg attribute contains "product-card" (possibly alongside other values)
product_divs = soup.find_all('div', attrs={'data-tagg': lambda x: x and "product-card" in x})
for product_div in product_divs[:5]: # Looping through the first 5 product divs
    name = product_div.find('h2').text # Finding the h2 tag within the div and getting its text

    # Finding the price of the product
    ## Start with no price so a product without one doesn't reuse the previous product's price
    price = None
    ## Look through all the spans in the product div and keep the first whose text starts with "£"
    for span in product_div.find_all('span'):
        if span.text.strip().startswith("£"):
            price = span.text.strip()
            break # Once we've found the price, we can stop looking

    print(f"{name}: {price}") # Printing the name and price of the product

Oriental Lily: £25
Spring Flowering Basket: £30
Rose Trough: £25
Yellow Miniature Phalaenopsis Orchid in Ceramic Pot: £20
White Miniature Phalaenopsis Orchid in Ceramic Pot: £20
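
The prices printed above are strings such as `£25`, which can't be averaged or compared directly. As a rough sketch, assuming each string holds a single price with an optional thousands comma (price ranges are not handled), a small helper can convert them to numbers. The name `parse_price` is made up for this example.

```python
def parse_price(price_text):
    """Convert a price string such as '£25' or '£1,250.50' to a float."""
    # Trim whitespace, drop the pound sign, then remove thousands separators
    cleaned = price_text.strip().lstrip("£").replace(",", "")
    return float(cleaned)

# A few of the prices scraped above
prices = [parse_price(p) for p in ["£25", "£30", "£20"]]
print(sum(prices) / len(prices))
```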


### Finding Reviews

Finally, we'll find each product's average rating and review count.

<img src="https://raw.githubusercontent.com/FM-ds/ScrapingWorkshop/main/notebook_images/ms_inspect_rating.png" width="800vw">

Going back to inspect element, we can see that the rating and review count are held in spans inside a `<button>` element:
- We can find the rating (4.6) by looking for the span whose text is numeric
- We can find the review count by looking for the span that has text-decoration="underline"

In the code, we take a shortcut: the first span in the button holds the rating and the last span holds the review count.

In [30]:
product_divs = soup.find_all('div', attrs={'data-tagg': lambda x: x and "product-card" in x}) # Finding all divs whose data-tagg attribute contains "product-card"
for product_div in product_divs[:5]: # Looping through the first 5 product divs
    name = product_div.find('h2').text # Finding the h2 tag within the div and getting its text

    # Finding the price of the product
    ## Start with no price so a product without one doesn't reuse the previous product's price
    price = None
    ## Look through all the spans in the product div and keep the first whose text starts with "£"
    for span in product_div.find_all('span'):
        if span.text.strip().startswith("£"):
            price = span.text.strip()
            break # Once we've found the price, we can stop looking

    # Finding Ratings
    # Next, we'll find the first button element and read the rating and review count from its spans
    rating = None
    rating_count = None
    rating_btn = product_div.find('button')
    if rating_btn:
        spans = rating_btn.find_all('span')
        if spans: # Guarding against a button that contains no spans
            rating = spans[0].text # The first span contains the rating
            rating_count = spans[-1].text # The last span contains the rating count

    print(f"{name}: {price}") # Printing the name and price of the product
    print(f"    Rating: {rating} ({rating_count})") # Printing the rating and rating count of the product

Oriental Lily: £25
    Rating: 4.6 (1759 reviews)
Spring Flowering Basket: £30
    Rating: 4.4 (968 reviews)
Rose Trough: £25
    Rating: 4.5 (1020 reviews)
Yellow Miniature Phalaenopsis Orchid in Ceramic Pot: £20
    Rating: 4.6 (573 reviews)
White Miniature Phalaenopsis Orchid in Ceramic Pot: £20
    Rating: 4.7 (477 reviews)
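
The rating and review count printed above are still text. A possible next step, sketched here under the assumption that the count always looks like `1759 reviews`, is to convert both to numbers while keeping `None` for products without reviews. The helper name `parse_review` is hypothetical.

```python
import re

def parse_review(rating_text, count_text):
    """Turn scraped strings like '4.6' and '1759 reviews' into (float, int)."""
    rating = float(rating_text) if rating_text else None
    # Take the first run of digits (allowing thousands commas) from the count text
    match = re.search(r"\d[\d,]*", count_text or "")
    count = int(match.group().replace(",", "")) if match else None
    return rating, count

print(parse_review("4.6", "1759 reviews"))
print(parse_review(None, None))
```

Cleaned values like these can be collected as one dictionary per product, and a list of such dictionaries can be passed to `pd.DataFrame` for analysis.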
