# Web Scraping with BeautifulSoup - Interactive Exercises

In this notebook, you'll learn how to parse HTML with **BeautifulSoup**, extract text, and search a document using different filters. Each exercise has an expandable hint, and the final cell contains solutions for self-checking.

## Exercise 1: Import Libraries

Import the required libraries for web scraping:
- `BeautifulSoup` from `bs4`
- `urllib.request`
- `re` for regular expressions

In [None]:
# TODO: Import required libraries


<details>
<summary>Hint</summary>

Use:
```python
from bs4 import BeautifulSoup
import urllib.request
import re
```
</details>

## Exercise 2: Load HTML from a URL

Use `urllib.request.urlopen()` to fetch HTML from the URL `https://raw.githubusercontent.com/BigDataGal/Data-Mania-Demos/master/IoT-2018.html`. Store the HTML content in a variable called `html`.

In [None]:
# TODO: Fetch HTML content and store in 'html'


<details>
<summary>Hint</summary>

Use a `with` statement:
```python
url = 'https://raw.githubusercontent.com/BigDataGal/Data-Mania-Demos/master/IoT-2018.html'
with urllib.request.urlopen(url) as response:
    html = response.read()
```
</details>
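
The `with` block in the hint leaves `html` as a `bytes` object, and a failed request raises an exception. As a minimal sketch, a hypothetical helper `fetch_html` could decode the body and report network failures. The helper name, the 10-second timeout, and the UTF-8 fallback are assumptions, not part of the exercise:

```python
import urllib.error
import urllib.request


def fetch_html(url, timeout=10.0):
    """Return the body at `url` as text (hypothetical helper, not part of the exercise)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            raw = response.read()  # bytes, not str
            # Use the charset declared in the response headers, falling back to UTF-8.
            charset = response.headers.get_content_charset() or "utf-8"
    except urllib.error.URLError as err:
        # URLError also covers HTTPError (e.g. a 404 response).
        raise RuntimeError(f"Could not fetch {url}: {err.reason}") from err
    return raw.decode(charset, errors="replace")
```

Calling `fetch_html(url)` with the exercise URL should return the page as a `str`. BeautifulSoup accepts either `bytes` or `str`, so the decoding step is optional for the later exercises.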

## Exercise 3: Create a BeautifulSoup object

Create a `soup` object by parsing `html` with Python's built-in `html.parser`.

In [None]:
# TODO: Create BeautifulSoup object


<details>
<summary>Hint</summary>

```python
soup = BeautifulSoup(html, 'html.parser')
print(type(soup))
```
</details>
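
To see what the parsed object offers without touching the network, the same constructor can be tried on a small inline string. This is only a sketch: `sample_html` and `sample_soup` are made-up names, and the markup is not the course page.

```python
from bs4 import BeautifulSoup

# A tiny hand-written document (an assumption for illustration only).
sample_html = (
    "<html><head><title>Demo</title></head>"
    "<body><p class='intro'>Hello, <b>soup</b>!</p></body></html>"
)

sample_soup = BeautifulSoup(sample_html, "html.parser")

print(type(sample_soup).__name__)  # BeautifulSoup
print(sample_soup.title.string)    # Demo
print(sample_soup.p["class"])      # ['intro'] (class is a multi-valued attribute)
print(sample_soup.b.get_text())    # soup
```

Attribute-style access such as `sample_soup.title` returns the first matching tag, which is handy for quick inspection before moving on to `find_all()`.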

## Exercise 4: Prettify HTML

Print the first 100 characters of a prettified version of the HTML using `soup.prettify()`.

In [None]:
# TODO: Print first 100 characters of prettified HTML


<details>
<summary>Hint</summary>

```python
print(soup.prettify()[0:100])
```
</details>
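
`prettify()` returns an ordinary string, which is why it can be sliced with `[0:100]`. The sketch below uses a made-up one-line snippet (an assumption, not the course page) to show the effect:

```python
from bs4 import BeautifulSoup

# Hypothetical one-line snippet, used only to show what prettify() does.
compact = "<ul><li>One</li><li>Two</li></ul>"
pretty = BeautifulSoup(compact, "html.parser").prettify()

print(type(pretty).__name__)  # str
print(pretty)                 # tags and strings spread over indented lines
print(pretty[:4])             # <ul>
```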

## Exercise 5: Extract text from HTML

Use the `get_text()` method to extract all text content from the HTML and store it in `text_only`.

In [None]:
# TODO: Extract text content


<details>
<summary>Hint</summary>

```python
text_only = soup.get_text()
print(text_only)
```
</details>
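
By default `get_text()` concatenates every string exactly as it appears, so words from adjacent tags can run together. A small sketch on a made-up snippet (the names `snippet` and `demo` are assumptions) shows how the `separator` and `strip` arguments change the result:

```python
from bs4 import BeautifulSoup

# Hypothetical markup chosen so that two paragraphs sit next to each other.
snippet = "<div><h1> Title </h1><p>First <b>bold</b> line.</p><p>Second line.</p></div>"
demo = BeautifulSoup(snippet, "html.parser")

# Default: strings are joined with no separator and no trimming.
print(repr(demo.get_text()))
# ' Title First bold line.Second line.'

# Strip each string and join the pieces with a visible separator.
print(repr(demo.get_text(separator=" | ", strip=True)))
# 'Title | First | bold | line. | Second line.'
```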

## Exercise 6: Searching and retrieving data

Use `soup.find_all()` to search for HTML elements. Practice the following:
1. Find all `<li>` tags.
2. Find the tags whose `id` is `link 7`.
3. Find all `<ol>` tags.
4. Find both `<ol>` and `<b>` tags.
5. Find all tags whose names contain the letter 't', using a regular expression.
6. Find all tags by passing the Boolean `True`.
7. Print all hyperlinks (`href`) from `<a>` tags.
8. Find all strings containing 'data', using a regular expression.

In [None]:
# TODO: Use find_all() for the exercises above


<details>
<summary>Hint</summary>

Example code snippets for each task:
```python
# 1. Find all <li> tags
soup.find_all('li')

# 2. Find tags with id='link 7'
soup.find_all(id='link 7')

# 3. Find all <ol> tags
soup.find_all('ol')

# 4. Find both <ol> and <b> tags
soup.find_all(['ol','b'])

# 5. Regex: tags whose names contain 't'
import re
pattern = re.compile('t')
for tag in soup.find_all(pattern):
    print(tag.name)

# 6. All tags using True
for tag in soup.find_all(True):
    print(tag.name)

# 7. Print hrefs
for link in soup.find_all('a'):
    print(link.get('href'))

# 8. Strings containing 'data'
soup.find_all(string=re.compile('data'))
```
</details>
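
The filters above can also be tried offline on a miniature page where the expected matches are easy to verify by eye. This is a sketch under stated assumptions: `mini_html` is hand-written, and its tag names and `id` only mimic the exercise page.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical miniature page; not the document used in the exercises.
mini_html = """
<html><body>
<ol><li>Big data</li><li><b>IoT</b> trends</li></ol>
<p>More on <a id="link 7" href="https://example.com/data">data science</a>.</p>
</body></html>
"""
mini = BeautifulSoup(mini_html, "html.parser")

print(len(mini.find_all("li")))                           # 2
print([t.name for t in mini.find_all(["ol", "b"])])       # ['ol', 'b']
# A compiled regex is matched against tag names, not tag contents.
print([t.name for t in mini.find_all(re.compile("t"))])   # ['html']
print([t.name for t in mini.find_all(True)])
# ['html', 'body', 'ol', 'li', 'li', 'b', 'p', 'a']
print([a.get("href") for a in mini.find_all("a")])        # ['https://example.com/data']
print(mini.find_all(id="link 7")[0].get_text())           # data science
# string= searches text nodes only, so the href containing 'data' is ignored.
print(mini.find_all(string=re.compile("data")))           # ['Big data', 'data science']
```

Note that the regular expression in task 5 matches only `html` here, because it is the only tag name containing a 't'; the text inside the tags plays no part in that filter.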

## Solutions (Collapsed)

<details>
<summary>Click to view solutions</summary>

```python
# Exercise 1
from bs4 import BeautifulSoup
import urllib.request
import re

# Exercise 2
url = 'https://raw.githubusercontent.com/BigDataGal/Data-Mania-Demos/master/IoT-2018.html'
with urllib.request.urlopen(url) as response:
    html = response.read()

# Exercise 3
soup = BeautifulSoup(html, 'html.parser')
print(type(soup))

# Exercise 4
print(soup.prettify()[0:100])

# Exercise 5
text_only = soup.get_text()
print(text_only)

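# Exercise 6
# print() is used because a notebook cell only displays its last expression.
print(soup.find_all('li'))                       # 1. all <li> tags
print(soup.find_all(id='link 7'))                # 2. tags with id='link 7'
print(soup.find_all('ol'))                       # 3. all <ol> tags
print(soup.find_all(['ol', 'b']))                # 4. <ol> and <b> tags
pattern = re.compile('t')                        # 5. tag names containing 't'
for tag in soup.find_all(pattern):
    print(tag.name)
for tag in soup.find_all(True):                  # 6. every tag
    print(tag.name)
for link in soup.find_all('a'):                  # 7. hyperlinks
    print(link.get('href'))
print(soup.find_all(string=re.compile('data')))  # 8. strings containing 'data'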
```
</details>