# Web Scraping and Data Parsing with BeautifulSoup

In this notebook, we will learn how to parse HTML data using **BeautifulSoup**. We will cover:
1. Parsing HTML data
2. Extracting data from a parse tree
3. Searching and retrieving data using various filters

## 1. Importing required libraries

We first import `BeautifulSoup` for parsing, `urllib.request` for fetching HTML data, and `re` for regular expressions.

In [None]:
from bs4 import BeautifulSoup
import urllib.request
import re

## 2. Loading HTML data from a URL

We use `urllib.request.urlopen()` to open a web page and read its HTML content.

In [None]:
url = 'https://raw.githubusercontent.com/BigDataGal/Data-Mania-Demos/master/IoT-2018.html'
with urllib.request.urlopen(url) as response:
    html = response.read()
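
Network requests can fail, so it may help to wrap the fetch in a small helper. The following is a minimal sketch, not part of the original notebook: the function name `fetch_html` and the 10-second timeout are assumptions chosen for illustration.

```python
import urllib.request
from urllib.error import URLError


def fetch_html(url, timeout=10):
    """Return the raw bytes at `url`, or None if the request fails."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read()
    except URLError as err:
        # HTTPError is a subclass of URLError, so HTTP status errors land here too
        print(f"Could not fetch {url}: {err}")
        return None


# A data: URL is resolved by urllib without network access,
# which makes it a convenient offline check of the helper
sample = fetch_html("data:text/plain,hello")
print(sample)
```

With this helper, `html = fetch_html(url)` could stand in for the cell above, followed by a check that `html` is not `None` before parsing.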

## 3. Creating a BeautifulSoup object

We create a **BeautifulSoup** object to parse the HTML content. The second argument explicitly selects `html.parser`, the parser built into Python's standard library.

In [None]:
soup = BeautifulSoup(html, 'html.parser')
type(soup)

## 4. Parsing and Prettifying HTML

Using the `prettify()` method, we can see a formatted view of the HTML document. Here, we print only the first 100 characters as a preview.

In [None]:
print(soup.prettify()[0:100])

## 5. Getting data from a parse tree

To extract only the text content (without HTML tags) from the page, we can use the `get_text()` method.

In [None]:
text_only = soup.get_text()
print(text_only)
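
By default, `get_text()` joins text fragments with no separator, which can run words from adjacent tags together. The sketch below shows the `separator` and `strip` arguments on a small made-up snippet; the snippet and the name `demo` are assumptions for illustration and are not taken from the page loaded above.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet used only to illustrate get_text() options
snippet = "<div><h1>IoT Articles</h1><p>First <b>bold</b> item</p></div>"
demo = BeautifulSoup(snippet, "html.parser")

# Default: text fragments are concatenated with nothing in between
plain = demo.get_text()

# separator inserts a string between fragments; strip=True trims each fragment
spaced = demo.get_text(separator=" ", strip=True)

print(repr(plain))
print(repr(spaced))
```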

## 6. Searching and retrieving data from a parse tree

BeautifulSoup provides the `find_all()` method to search for tags or strings in the HTML tree.
You can filter using several types of arguments:
- **Name argument:** Search for tags by their tag name
- **Keyword argument:** Search for tags by their attribute values
- **String argument:** Pass a plain string to match a tag name exactly
- **Lists:** Search for several tag names at once
- **Boolean values:** Pass `True` to match every tag
- **Strings:** Pass `'a'` to collect weblinks (`<a>` tags), or use `string=` to match text content
- **Regular expressions:** Search for patterns in tag names or text
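
The filter types listed above can be tried offline on a small document. This is a minimal sketch under stated assumptions: the HTML snippet, its `id` values, and the variable names are made up for illustration and do not come from the page loaded earlier.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical document used only to exercise each filter type
snippet = """
<html><body>
<h1 id="top">Reading list</h1>
<ol>
  <li><a href="https://example.com/a" id="link 1">Big data basics</a></li>
  <li><a href="https://example.com/b" id="link 2">IoT trends</a></li>
</ol>
<p><b>Note:</b> links open in a new tab.</p>
</body></html>
"""
demo = BeautifulSoup(snippet, "html.parser")

by_name = demo.find_all("li")                  # name argument (a plain string)
by_attr = demo.find_all(id="link 2")           # keyword argument on an attribute
by_list = demo.find_all(["ol", "b"])           # list: any of these tag names
by_true = demo.find_all(True)                  # Boolean: every tag
# Patterns are applied with search(), so '^h' anchors to the start of the tag name
by_regex = demo.find_all(re.compile("^h"))
by_string = demo.find_all(string=re.compile("data"))  # matches text, not tags

print([tag.name for tag in by_regex])
print([str(s) for s in by_string])
```

Note that the `string=` search returns text nodes rather than tags, while every other call here returns tags.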

### 6.1 Retrieving tags by tag name (Name argument)

In [None]:
# Find all <li> tags
soup.find_all('li')

### 6.2 Retrieving tags by attributes (Keyword argument)

In [None]:
# Find tags with id='link 7'
soup.find_all(id='link 7')
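
Some attribute names cannot be written as Python keyword arguments: `class` is a reserved word, and names such as `data-topic` contain a hyphen. The sketch below shows the `class_` keyword and the `attrs` dictionary for those cases; the snippet and its attribute values are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet; the class names and data-topic value are illustrative only
snippet = (
    '<p class="intro lead" data-topic="iot">Hello</p>'
    '<p class="body">World</p>'
    '<a id="link 7" href="#">Seven</a>'
)
demo = BeautifulSoup(snippet, "html.parser")

by_id = demo.find_all(id="link 7")                    # ordinary keyword argument
by_class = demo.find_all(class_="intro")              # class_ avoids the reserved word
by_data = demo.find_all(attrs={"data-topic": "iot"})  # dict for hyphenated names

print(by_id, by_class, by_data)
```

A tag with several classes, such as `class="intro lead"`, matches when any one of its classes equals the value passed to `class_`.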

### 6.3 Retrieving tags by exact tag-name string (String argument)

In [None]:
# A plain string is matched against the tag name exactly, so this finds all <ol> tags
soup.find_all('ol')

### 6.4 Retrieving multiple tags using lists

In [None]:
# Find both <ol> and <b> tags
soup.find_all(['ol', 'b'])

### 6.5 Retrieving tags using regular expressions

In [None]:
# Match every tag whose name contains the letter 't'
pattern = re.compile('t')
for tag in soup.find_all(pattern):
    print(tag.name)

### 6.6 Retrieving all tags using a Boolean value

In [None]:
# True matches every tag in the document (text strings are not included)
for tag in soup.find_all(True):
    print(tag.name)

### 6.7 Retrieving weblinks using string objects

In [None]:
# Print the href of every <a> tag; .get() returns None when the attribute is missing
for link in soup.find_all('a'):
    print(link.get('href'))
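
Pages often contain relative links, and some `<a>` tags have no `href` at all. The sketch below skips tags without an `href` and resolves the rest against a base address with `urllib.parse.urljoin`. The base URL and the snippet are hypothetical and chosen only for illustration.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

# Hypothetical base address and snippet, for illustration only
base_url = "https://example.com/articles/index.html"
snippet = (
    '<a href="/about">About</a>'
    '<a href="post-1.html">Post</a>'
    '<a name="anchor-only">No link</a>'
    '<a href="https://other.example.org/x">External</a>'
)
demo = BeautifulSoup(snippet, "html.parser")

links = []
# href=True keeps only <a> tags that actually carry an href attribute
for a in demo.find_all("a", href=True):
    # urljoin leaves absolute URLs unchanged and resolves relative ones
    links.append(urljoin(base_url, a["href"]))

for link in links:
    print(link)
```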

### 6.8 Retrieving strings using regular expressions

In [None]:
# Find all text strings that contain 'data' (the match is case-sensitive)
soup.find_all(string=re.compile('data'))
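
A `string=` search returns the matching text nodes, not the tags that hold them. The sketch below shows how to get from a text node to its enclosing tag with `.parent`, and how `re.IGNORECASE` widens the match. The snippet is made up for illustration.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical snippet, for illustration only
snippet = (
    "<ul>"
    "<li>Big data tools</li>"
    "<li><a href='#'>Open data portals</a></li>"
    "<li>Data science</li>"
    "<li>IoT sensors</li>"
    "</ul>"
)
demo = BeautifulSoup(snippet, "html.parser")

# Case-sensitive: 'Data science' is not matched
matches = demo.find_all(string=re.compile("data"))

# Each result is a text node; .parent is the tag that directly contains it
parents = [s.parent.name for s in matches]

# Case-insensitive variant also picks up 'Data science'
matches_any_case = demo.find_all(string=re.compile("data", re.IGNORECASE))

print([str(s) for s in matches])
print(parents)
print(len(matches_any_case))
```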