#### **Beautiful Soup**
- Used to parse HTML and XML documents
    - **`parse`**:
        - Take raw markup text → turn it into structured data you can easily use.
    - **`HTML parsing`** → extract data from web pages
    - **`XML parsing`** → extract data from structured XML documents

##### Importing BeautifulSoup

In [1]:
from bs4 import BeautifulSoup

#### **Create a `soup` object:**
-  A **`soup object`** is how BeautifulSoup stores HTML or XML so you can easily search and extract parts of it.
-  **In Simple Words**
   - HTML = messy text
   - BeautifulSoup reads it → **creates a “soup object.”**
   - That object turns the HTML into a **tree structure** you can work with.
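
As a small illustration of the tree idea, the sketch below parses an inline string instead of a file. The markup and the names `tiny_html` and `tiny_soup` are made up for this example and are not part of the notebook's data.

```python
from bs4 import BeautifulSoup

# Hypothetical inline markup, used only for this illustration
tiny_html = "<html><body><p>Hello <b>world</b></p></body></html>"
tiny_soup = BeautifulSoup(tiny_html, "html.parser")

# Each tag becomes a node that can be reached from its parent
print(tiny_soup.body.p.b)          # the <b> node nested inside <p>
print(tiny_soup.body.p.b.string)   # the text held by that node
print(tiny_soup.b.parent.name)     # name of the tag that encloses <b>
```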

In [2]:
with open('db/html-doc.html') as file:
    soup = BeautifulSoup(file, 'html.parser')

In [3]:
type(soup)

bs4.BeautifulSoup

In [4]:
print(soup)

<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p></body></html>



#### **Common BeautifulSoup Attributes**

| **Attribute**              | **Description**                              |
|-----------------------------|---------------------------------------------|
| **soup.tagname**           | Accesses the first occurrence of a tag      |
| **.name**                  | Returns the tag’s name as a string          |
| **.text**                  | Gets all the text inside a tag              |
| **.attrs**                 | Dictionary of a tag’s attributes            |
| **tag['attribute']**       | Accesses a specific attribute value         |
| **.parent**                | The parent tag                              |
| **.parents**               | Generator for all ancestors up the tree     |
| **.children** / **.contents** | Direct children (tags and strings) as an iterator / a list |
| **.descendants**           | All nested tags and strings, recursively    |
| **.next_sibling**          | Next sibling tag or string                  |
| **.previous_sibling**      | Previous sibling tag or string              |
| **.string**                | Tag’s text if it has only one text child    |
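
A few of the attributes in the table can be tried on a tiny document. This is only a sketch: the `snippet` markup and the names `nav_soup` and `first_link` are assumptions made for the example.

```python
from bs4 import BeautifulSoup

# Hypothetical two-link paragraph, used only for this illustration
snippet = '<p class="story"><a id="link1">Elsie</a>, <a id="link2">Lacie</a></p>'
nav_soup = BeautifulSoup(snippet, "html.parser")

first_link = nav_soup.a
print(first_link.name)                        # tag name as a string
print(first_link.attrs)                       # dictionary of attributes
print(first_link.string)                      # the single text child
print(repr(first_link.next_sibling))          # the string between the two tags
print(first_link.next_sibling.next_sibling)   # the second <a> tag
print(first_link.previous_sibling)            # None: nothing comes before it
print([p.name for p in first_link.parents])   # ancestors, innermost first
```

Note that siblings can be plain strings as well as tags, which is why reaching the second `<a>` takes two `.next_sibling` steps here.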



In [5]:
# prettify
print(soup.prettify())

<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    Elsie
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;
and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>



#### Navigating using tag names:
- Directly access tags as attributes of the soup object
- **Syntax:** **`soup.tagname`**
- **NOTE:** Direct attribute access only returns the first occurrence.


In [6]:
print(soup.html)

<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p></body></html>


In [7]:
print(soup.body)

<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p></body>


In [8]:
# If the tag doesn't exist
print(soup.div)

None
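
Because a missing tag comes back as `None`, chaining further attribute access onto it fails. The sketch below shows one way to guard against that; the document and the name `guard_soup` are made up for the example.

```python
from bs4 import BeautifulSoup

# Hypothetical document without any <div>
guard_soup = BeautifulSoup("<html><body><p>Hi</p></body></html>", "html.parser")

div = guard_soup.div               # no <div> here, so this is None
if div is None:
    print("no <div> found")

try:
    guard_soup.div.p               # equivalent to None.p
except AttributeError as err:
    chained_error = err
    print(f"chaining failed: {err}")
```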


#### Navigating Deeper
- You can chain tags together:

In [9]:
print(soup.body.a)

<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>


In [10]:
print(soup.title.text)

The Dormouse's story


In [11]:
type(soup.title.text)

str

In [12]:
# Applying string methods
title = soup.title.text
print(title)
print(f'Upper() Method: {title.upper()}')
print(f'Split() Method: {title.split()}')

The Dormouse's story
Upper() Method: THE DORMOUSE'S STORY
Split() Method: ['The', "Dormouse's", 'story']


In [13]:
# .name of the soup object itself
soup.name

'[document]'

In [14]:
# .parent: the tag that directly contains the first <a>
soup.a.parent

<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

#### **`.children vs .descendants`** attributes


| **Attribute**    | **Description**                                     | **Returns**                                  | **Example Output**                             |
|------------------|-----------------------------------------------------|----------------------------------------------|------------------------------------------------|
| **.children**    | Only the **immediate child** tags or strings.       | Iterator of direct children (1 level deep)   | `<p>First paragraph</p>`<br>`<p>Second...</p>` |
| **.descendants** | **All tags and strings recursively** inside a tag.  | Iterator of all nested content at any depth  | `<p>First paragraph</p>`<br>`First paragraph`<br>`<b>bold</b>`<br>`bold` |


In [15]:
# .children attribute
children = soup.body.children
children

<list_iterator at 0x1d3ea697760>

In [16]:
for child in children:
    print(child)



<p class="title"><b>The Dormouse's story</b></p>


<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>


<p class="story">...</p>
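
The blank lines in the output above are newline strings that sit between the tags, since `.children` yields strings as well as tags. A minimal sketch of filtering them out, assuming a made-up `body_html` string and using `isinstance` against `Tag`:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical markup with newlines between the tags
body_html = "<body>\n<p>one</p>\n<p>two <b>bold</b></p>\n</body>"
tree = BeautifulSoup(body_html, "html.parser")

all_children = list(tree.body.children)
tag_children = [c for c in tree.body.children if isinstance(c, Tag)]
all_descendants = list(tree.body.descendants)

print(len(all_children))                 # tags plus newline strings
print([t.name for t in tag_children])    # tags only
print(len(all_descendants))              # every nested tag and string
```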


In [17]:
for child in soup.head.children:
    print(child)

<title>The Dormouse's story</title>


In [18]:
# .descendants attribute
soup.body.descendants

<generator object Tag.descendants at 0x000001D3EA6A2400>

In [19]:
for descendant in soup.body.descendants:
    print(descendant)



<p class="title"><b>The Dormouse's story</b></p>
<b>The Dormouse's story</b>
The Dormouse's story


<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
Once upon a time there were three little sisters; and their names were

<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
Elsie
,

<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
Lacie
 and

<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
Tillie
;
and they lived at the bottom of a well.


<p class="story">...</p>
...


#### **Common BeautifulSoup Methods**

| **Method**            | **Description**                                 |
|-----------------------|-------------------------------------------------|
| **.prettify()**       | Pretty-prints the HTML tree                     |
| **.find()**           | Finds the first matching tag                    |
| **.find_all()**       | Finds all matching tags (returns a list-like `ResultSet`) |
| **.get_text()**       | Extracts text inside a tag                      |
| **.get()**            | Retrieves an attribute value from a tag         |
| **.select()**         | Finds tags using CSS selectors                  |
| **.select_one()**     | Finds a single tag using a CSS selector         |
| **.decode()**         | Converts a tag or soup object back to HTML text |



In [20]:
# get_text() method: extract the plain text inside a tag
soup.body.get_text()

"\nThe Dormouse's story\nOnce upon a time there were three little sisters; and their names were\nElsie,\nLacie and\nTillie;\nand they lived at the bottom of a well.\n..."

In [21]:
print(soup.body.get_text().strip())

The Dormouse's story
Once upon a time there were three little sisters; and their names were
Elsie,
Lacie and
Tillie;
and they lived at the bottom of a well.
...
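
Instead of calling `.strip()` on the result, `get_text()` also accepts `separator` and `strip` arguments that clean up each piece of text as it is joined. A small sketch on a made-up snippet (`text_soup` is a name assumed for the example):

```python
from bs4 import BeautifulSoup

# Hypothetical markup with stray spaces around each piece of text
text_soup = BeautifulSoup("<p> Hello <b> big </b> world </p>", "html.parser")

raw_text = text_soup.p.get_text()
clean_text = text_soup.p.get_text(separator=" ", strip=True)

print(repr(raw_text))     # whitespace kept exactly as in the markup
print(repr(clean_text))   # each piece stripped, then joined with one space
```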


In [22]:
soup.title.get_text()

"The Dormouse's story"

### **Searching the tree**

In [23]:
soup.a

<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

#### **find() Method:**
- The **`.find()`** method searches the HTML document and returns the `first` tag that matches your search criteria.
- **Syntax:** **`soup.find(name=None, attrs={}, recursive=True, string=None, **kwargs)`**
    - **Parameters:**
        - **`name:`** Tag name to search for `(e.g. "div", "a", "span")`
        - **`attrs:`** Dictionary of HTML attributes to match `(e.g. {"class": "player"})`
        - **`recursive:`** Whether to search all descendants or only direct children `(True by default)`
        - **`string:`** Search tag contents that exactly match a given string
        - **`**kwargs:`** Shorthand for filtering attributes directly `(e.g., class_="info")`
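
The effect of `recursive` and the `None` result for a failed search can be seen on a small nested document. This is a sketch only; the markup and the names `find_soup` and `outer_div` are assumptions for the example.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the <p> is a grandchild of the <div>
find_soup = BeautifulSoup(
    "<div><section><p id='deep'>nested</p></section></div>", "html.parser"
)
outer_div = find_soup.div

deep_match = outer_div.find("p")                      # searches all levels
shallow_match = outer_div.find("p", recursive=False)  # direct children only
no_match = find_soup.find("span")                     # nothing to find

print(deep_match)
print(shallow_match)
print(no_match)
```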

In [24]:
# name
soup.find('a')

<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

In [25]:
# Find the anchor tag with class:
soup.find("a", class_="sister")

# Or using attrs:
# soup.find('a', attrs={'class': 'sister'})

<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

In [26]:
# Find by ID
soup.find('a', id="link2")

<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>

In [27]:
# Find element containing exact text
soup.find('a', string="Lacie")

<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>

#### **find_all() Method:**
- The **`.find_all()`** method searches the parsed HTML and **returns a list of tags** that match the criteria you specify.
- **Syntax:** **`soup.find_all(name=None, attrs={}, recursive=True, string=None, limit=None, **kwargs)`**
    - **Parameters:**
        - **`name:`** Tag name to search for `(e.g. "div", "a", "span")`
        - **`attrs:`** Dictionary of HTML attributes to match `(e.g. {"class": "player"})`
        - **`recursive:`** Whether to search all levels or only direct children  `(True by default)`
        - **`string:`** Search tag contents that exactly match a given string
        - **`limit:`** Maximum number of results to return
        - **`**kwargs:`** Shortcut for filtering by attributes `(like class_, id, href, etc.)`
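
Besides exact strings, the filters accept `True` (attribute must be present) and compiled regular expressions. The sketch below uses a made-up `links_html` string to show both; the variable names are assumptions for the example.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup: two links with href, one without
links_html = (
    '<a href="http://example.com/elsie">Elsie</a>'
    '<a href="http://example.com/lacie">Lacie</a>'
    '<a>no href</a>'
)
links_soup = BeautifulSoup(links_html, "html.parser")

# href=True keeps only tags that have an href attribute at all
with_href = links_soup.find_all("a", href=True)

# A regex on string matches tags whose text fits the pattern
starts_with_l = links_soup.find_all("a", string=re.compile(r"^L"))

print(len(with_href))
print(starts_with_l)
```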

In [28]:
# name
soup.find_all('a')

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

In [29]:
# Find the anchor tag with class:
# soup.find_all("a", class_="sister")

# Or using attrs:
soup.find_all('a', attrs={'class': 'sister'})

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

In [30]:
# Find all tags by multiple tag names
soup.find_all(["a", "title", "p"])


[<title>The Dormouse's story</title>,
 <p class="title"><b>The Dormouse's story</b></p>,
 <p class="story">Once upon a time there were three little sisters; and their names were
 <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
 and they lived at the bottom of a well.</p>,
 <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>,
 <p class="story">...</p>]

In [31]:
# Find a p tag that contains an exact string
soup.find_all("p", string="The Dormouse's story")

[<p class="title"><b>The Dormouse's story</b></p>]

In [32]:
# Limit results to first 2 matches
soup.find_all("a", class_="sister", limit=2)

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

In [33]:
#  Bonus: Custom filtering using a function
def has_id_and_has_class(tag):
    return tag.has_attr('id') and tag.has_attr('class')

soup.find_all(has_id_and_has_class)

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

In [34]:
type(soup.find_all('a'))

bs4.element.ResultSet
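
A `ResultSet` behaves like a list of tags, so it supports `len()`, indexing, and iteration, but tag attributes such as `.text` belong to the individual items, not to the collection. A short sketch with made-up markup (`result_set` and `texts` are names assumed for the example):

```python
from bs4 import BeautifulSoup

# Hypothetical markup with two links
result_set = BeautifulSoup("<a>one</a><a>two</a>", "html.parser").find_all("a")

print(isinstance(result_set, list))   # ResultSet is a list subclass
print(len(result_set))
print(result_set[0].get_text())       # index first, then use tag methods

# Collect the text of every tag with a list comprehension
texts = [tag.get_text() for tag in result_set]
print(texts)

try:
    result_set.text                   # the collection itself has no .text
except AttributeError:
    print("call .text on each tag, not on the ResultSet")
```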

#### **Access Attributes of an Anchor Tag**
- **`Access Single Attribute:`**
    - **Bracket notation**:
        - **Ex.** print(a_tag["class"])
        - If the attribute **does not exist**, bracket notation raises a `KeyError`.
    - **.get() Method**
        - **Ex.** print(a_tag.get("href"))
        - If the attribute **does not exist**, **.get()** returns `None` instead of throwing an error.
- **`Access All Attributes:`**
    - **Ex.** print(a_tag.attrs)


- Use **`tag["attr"]`** or **`tag.get("attr")`** to access anchor tag attributes.
- Use **`tag.attrs`** for all attributes.
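
The difference between the two access styles shows up when an attribute is missing. In the sketch below, the tag and the names `a_tag`, `missing_title`, and `fallback_title` are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical anchor tag without a title attribute
a_tag = BeautifulSoup(
    '<a href="http://example.com/elsie" class="sister big">Elsie</a>',
    "html.parser",
).a

missing_title = a_tag.get("title")            # None, no exception
fallback_title = a_tag.get("title", "n/a")    # default value instead of None
print(missing_title, fallback_title)

try:
    a_tag["title"]                            # bracket access on a missing key
except KeyError:
    print("bracket notation raises KeyError")

print(a_tag["class"])                         # class is multi-valued, so a list
```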

In [35]:
# First <a> tag, used in the attribute examples below
soup.a

<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

In [36]:
# Access Single Attribute:
soup.a['id'] # using bracket notation

'link1'

In [37]:
# Access Single Attribute:
soup.a.get('class') # using get() method

['sister']

In [38]:
# Access All Attributes:
soup.a.attrs

{'href': 'http://example.com/elsie', 'class': ['sister'], 'id': 'link1'}

##### **`select()` method:**
- **Returns a list** of all elements matching a given CSS selector.
- If no elements are found, it returns an **empty list**.
- **Useful** when you expect **multiple elements** to match the selector.
- **Syntax:** **`soup.select(selector, namespaces=None, limit=None, **kwargs)`**
    - **Parameters:**
        - **`selector:`** A string containing a CSS selector `(e.g., "div.item", "#main")`. Required.
        - **`namespaces:`** A dictionary of namespace prefixes (mainly used for XML). Optional.
        - **`limit:`** Integer to limit the number of matches. Optional.
        - **`**kwargs:`** Additional keyword arguments (rarely used).

--------------------------------------------

##### **`select_one()` method:**
- **Returns only the first element** that matches the CSS selector.
- If no match is found, it returns **`None`**.
- Useful when you're sure or only care about a **single match**.
- **Syntax:** **`soup.select_one(selector, namespaces=None, **kwargs)`**
    - **Parameters:**
        - **`selector:`** A string containing a **CSS selector** `(e.g., "div.item", "#main")`. Required.
        - **`namespaces:`** A dictionary of namespace prefixes (mainly used for XML). Optional.
        - **`**kwargs:`** Additional keyword arguments (rarely used).

--------------------------------------------

- **`NOTE:`**
Both methods accept standard CSS selector syntax, such as:
    - `.class-name`
    - `#id-name`
    - `tag`
    - `tag > child-tag`
    - `tag[attr=value]`, etc.
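
Each selector form listed above can be tried on a small document. This is a sketch under the assumption of a made-up `sel_html` snippet; `sel_soup` and the result names are likewise chosen only for the example.

```python
from bs4 import BeautifulSoup

# Hypothetical markup modelled on the sisters example
sel_html = (
    '<p class="story">'
    '<a id="link1" class="sister" href="http://example.com/elsie">Elsie</a>'
    '<a id="link2" class="sister" href="http://example.com/lacie">Lacie</a>'
    '</p>'
)
sel_soup = BeautifulSoup(sel_html, "html.parser")

by_id = sel_soup.select_one("#link2")                              # #id-name
by_class = [a["id"] for a in sel_soup.select(".sister")]           # .class-name
by_attr = [a["id"] for a in sel_soup.select('a[href$="elsie"]')]   # attribute test
first_child_link = sel_soup.select_one("p > a")                    # tag > child-tag
no_match = sel_soup.select_one("div")                              # no such tag

print(by_id.get_text(), by_class, by_attr, first_child_link["id"], no_match)
```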

--------------------------------------------

##### **`.select()` vs `.select_one()`?**
These methods let you search for elements using CSS selectors, just like in web development (HTML + CSS):
- **`.select()`** – Returns all matching elements (like **.find_all()**) → list
- **`.select_one()`** – Returns only the first matching element (like **.find()**) → element or None

In [39]:
# select() method

from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p></body></html>
"""
soup = BeautifulSoup(html, 'html.parser')

# print(soup.prettify())

print(soup.select("a.sister"))

print(soup.select("a.sister", limit=2))

# Returns an empty list when no element matches the selector
print(soup.select("a.brother")) 


# select() to get all <a> tags with class "sister"
all_tags = soup.select("a.sister")
for tag in all_tags:
    print(f" text: {tag.get_text()}, links: {tag['href']}")

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
[]
 text: Elsie, links: http://example.com/elsie
 text: Lacie, links: http://example.com/lacie
 text: Tillie, links: http://example.com/tillie


In [40]:
# select_one()
print(soup.select_one("a.sister"))


<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>


### CSS Combinators Table

| Combinator | Syntax     | Description                                      | Example in BeautifulSoup               |
|------------|------------|--------------------------------------------------|----------------------------------------|
| Descendant | `A B`      | Selects all `B` inside `A` at any level         | `soup.select("div p")`                 |
| Child      | `A > B`    | Selects `B` if it is a **direct child** of `A`  | `soup.select("ul > li")`               |
| Adjacent sibling | `A + B` | Selects `B` if it **immediately follows** its sibling `A` | `soup.select("h2 + p")`      |
| General sibling  | `A ~ B` | Selects all `B` siblings **after** `A`          | `soup.select("h2 ~ p")`                |


In [41]:
# Example HTML for demonstrating CSS combinators

html = """

<div class="div_1">
    <h2>Heading</h2>
    <p>Paragraph 1</p>
    <p>Paragraph 2</p>
    <section>
        <p>Nested Paragraph</p>
    </section>
</div>

<div class="div_2">
    <h3>Heading of div 2</h3>
    <p>Paragraph of div 2</p>
    <p>Paragraph of div 2</p>
    <h2>Heading 2 of div 2</h2>
    <section>
        <p>Nested Paragraph of div 2</p>
    </section>
</div>
"""

css_soup = BeautifulSoup(html, 'html.parser')


In [42]:
# 1. Descendant Combinator (A B): Finds all <p> tags anywhere inside <div>.
p_tags = css_soup.select("div p")
print(p_tags)
for tag in p_tags:
    print(f"{ tag.text }")

[<p>Paragraph 1</p>, <p>Paragraph 2</p>, <p>Nested Paragraph</p>, <p>Paragraph of div 2</p>, <p>Paragraph of div 2</p>, <p>Nested Paragraph of div 2</p>]
Paragraph 1
Paragraph 2
Nested Paragraph
Paragraph of div 2
Paragraph of div 2
Nested Paragraph of div 2


In [43]:
# 2. Child Combinator (A > B): Finds only <p> tags that are direct children of <div>, not nested.
p_tags = css_soup.select("div > p")
print(p_tags)
for tag in p_tags:
    print(f"{ tag.text }")

[<p>Paragraph 1</p>, <p>Paragraph 2</p>, <p>Paragraph of div 2</p>, <p>Paragraph of div 2</p>]
Paragraph 1
Paragraph 2
Paragraph of div 2
Paragraph of div 2


In [44]:
# 3. Adjacent Sibling Combinator (A + B): Finds each <section> that comes immediately after an <h2>
css_soup.select("h2 + section")

[<section>
 <p>Nested Paragraph of div 2</p>
 </section>]

In [45]:
# 4. General Sibling Combinator (A ~ B): Finds all <p> that are siblings of <h2> and come after it.
css_soup.select("h2 ~ p") 

[<p>Paragraph 1</p>, <p>Paragraph 2</p>]

In [46]:

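# General sibling combinator on <div>: every <div> that follows another <div> under the same parent
css_soup.select("div ~ div")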

[<div class="div_2">
 <h3>Heading of div 2</h3>
 <p>Paragraph of div 2</p>
 <p>Paragraph of div 2</p>
 <h2>Heading 2 of div 2</h2>
 <section>
 <p>Nested Paragraph of div 2</p>
 </section>
 </div>]

### **Fetch webpage with `Requests`**
- **URL**: `https://www.bbc.com/sport/football/premier-league/top-scorers`

In [47]:
# GET request
import requests

url = "https://www.bbc.com/sport/football/premier-league/top-scorers"

In [48]:
response = requests.get(url)

In [49]:
# Check for errors
response.raise_for_status()

In [50]:
# raise_for_status() returns None when no HTTP error occurred
print(response.raise_for_status())

None


In [51]:
response.status_code

200
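
The request and error check above can be wrapped in a small helper so that a failed request does not stop the notebook. This is a sketch, not part of the original workflow: `fetch_html` is a hypothetical helper name, and the 10-second `timeout` is an assumed value.

```python
import requests

def fetch_html(url, timeout=10):
    """Return the page body as text, or None if the request fails.

    Hypothetical helper for illustration; the timeout value is an assumption.
    """
    try:
        response = requests.get(url, timeout=timeout)
        response.raise_for_status()   # raises HTTPError for 4xx/5xx responses
    except requests.exceptions.RequestException as err:
        print(f"request failed: {err}")
        return None
    return response.text

# A string without a scheme is rejected before any network traffic happens
result = fetch_html("not-a-url")
print(result)
```

With a real page, the returned text can be passed straight to `BeautifulSoup(html, 'html.parser')` once it has been checked for `None`.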

In [52]:
response.text[:500]

'<!DOCTYPE html><html lang="en-GB" class="no-js"><head><meta charSet="utf-8" /><meta name="viewport" content="width=device-width, initial-scale=1" /><title data-rh="true">Premier League Top Scorers - BBC Sport</title><meta data-rh="true" name="description" content="Premier League top scorers. Showing assists, time on pitch and the shots on and off target."/><meta data-rh="true" name="theme-color" content="#FFFFFF"/><meta data-rh="true" property="og:description" content="Premier League top scorers'

In [53]:
type(response.text[:500])

str

In [54]:
# Raw response body as bytes
response.content[:500]

b'<!DOCTYPE html><html lang="en-GB" class="no-js"><head><meta charSet="utf-8" /><meta name="viewport" content="width=device-width, initial-scale=1" /><title data-rh="true">Premier League Top Scorers - BBC Sport</title><meta data-rh="true" name="description" content="Premier League top scorers. Showing assists, time on pitch and the shots on and off target."/><meta data-rh="true" name="theme-color" content="#FFFFFF"/><meta data-rh="true" property="og:description" content="Premier League top scorers'

In [55]:
type(response.content[:500])

bytes

In [56]:
soup = BeautifulSoup(response.content, 'html.parser')

In [57]:
print(soup.prettify())

<!DOCTYPE html>
<html class="no-js" lang="en-GB">
 <head>
  <meta charset="utf-8"/>
  <meta content="width=device-width, initial-scale=1" name="viewport"/>
  <title data-rh="true">
   Premier League Top Scorers - BBC Sport
  </title>
  <meta content="Premier League top scorers. Showing assists, time on pitch and the shots on and off target." data-rh="true" name="description"/>
  <meta content="#FFFFFF" data-rh="true" name="theme-color"/>
  <meta content="Premier League top scorers. Showing assists, time on pitch and the shots on and off target." data-rh="true" property="og:description"/>
  <meta content="https://static.files.bbci.co.uk/core/website/assets/static/sport/bbc-sport-logo.0da9386782.png" data-rh="true" property="og:image"/>
  <meta content="BBC Sport" data-rh="true" property="og:site_name"/>
  <meta content="Premier League Top Scorers - BBC Sport" data-rh="true" property="og:title"/>
  <meta content="article" data-rh="true" property="og:type"/>
  <meta content="https://www.b

In [58]:
soup.find_all('a')

[<a class="ssrcss-col7sy-NavigationLink-LogoLink eki2hvo11" href="https://www.bbc.com"><span class="ssrcss-13kjque-LogoIconWrapper eki2hvo6"><svg fill="currentColor" height="32" viewbox="0 0 112 32" width="112" xmlns="http://www.w3.org/2000/svg"><path d="M111.99999,4.44444577e-05 L111.99999,32.0000444 L79.9999905,32.0000444 L79.9999905,4.44444577e-05 L111.99999,4.44444577e-05 Z M72.0000119,-3.55271368e-15 L72.0000119,32 L40.0000119,32 L40.0000119,-3.55271368e-15 L72.0000119,-3.55271368e-15 Z M32,-3.55271368e-15 L32,32 L-1.13686838e-13,32 L-1.13686838e-13,-3.55271368e-15 L32,-3.55271368e-15 Z M97.469329,6.80826869 C96.0294397,6.80826869 94.7294393,7.02226876 93.5693278,7.44982444 C92.4089942,7.87782457 91.4137717,8.49471364 90.5841047,9.30049166 C89.7538823,10.1067141 89.1188821,11.07327 88.6785486,12.199937 C88.2378818,13.3269373 88.0177706,14.5896043 88.0177706,15.9876048 C88.0177706,17.4188274 88.2296596,18.7062722 88.6531042,19.8493837 C89.0763265,20.9929396 89.6861045,21.9591621 90