# Module 6: Web Scraping with Python and BeautifulSoup
In this notebook, we explore how to extract textual data from web pages using Python libraries like `requests` and `BeautifulSoup`. We cover the process from sending a request to organizing the extracted data in a structured format using `pandas`.

## 1. What is Web Scraping?
**Web scraping** is the automated method of accessing and extracting data from websites. It is widely used in fields such as data analysis, NLP, and machine learning.

**Why scrape?**
- Collecting data for analysis (e.g., product reviews, stock prices)
- Training AI models
- Monitoring websites

**Ethical Considerations:**
- Always review the site’s `robots.txt`
- Respect the terms of service
- Avoid sending high volumes of requests; pause between requests so you do not overload the server
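
The `robots.txt` check can be automated with the standard library's `urllib.robotparser`. The sketch below is only an illustration: the rules in `robots_lines` and the `example.com` URLs are made up for the example, not taken from any real site. In practice you would point the parser at the site's real file with `set_url()` and `read()`.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the example runs offline
robots_lines = [
    "User-agent: *",
    "Crawl-delay: 2",
    "Disallow: /private/",
]

parser = RobotFileParser()
parser.parse(robots_lines)  # parse rules from a list of lines

# Ask whether a generic crawler ("*") may fetch each URL
allowed_home = parser.can_fetch("*", "http://example.com/")
allowed_private = parser.can_fetch("*", "http://example.com/private/data")

# Seconds the site asks crawlers to wait between requests (None if unspecified)
delay = parser.crawl_delay("*")

print(allowed_home, allowed_private, delay)
```

If `crawl_delay` returns a number, calling `time.sleep(delay)` between requests is a simple way to honor it.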

In [16]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

## 2. Fetch HTML Content from a Website
We use the `requests` library to get the HTML content from a sample site: `http://quotes.toscrape.com/`.

In [8]:
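url = 'http://quotes.toscrape.com/'
response = requests.get(url, timeout=10)  # give up after 10 s instead of hanging
response.raise_for_status()  # raise an error on 4xx/5xx responses
print("Status Code:", response.status_code)
html = response.text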

Status Code: 200
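
Network requests can fail for reasons outside our control (timeouts, DNS errors, 4xx/5xx responses). As a minimal sketch, the fetch step can be wrapped in a helper that reports the failure instead of crashing the notebook. The function name `fetch_html` and the 10-second default timeout are illustrative choices, not part of the original notebook.

```python
import requests

def fetch_html(url, timeout=10.0):
    """Return the page HTML as text, or None if the request fails."""
    try:
        response = requests.get(url, timeout=timeout)
        response.raise_for_status()  # turn 4xx/5xx status codes into exceptions
    except requests.RequestException as exc:
        # RequestException is the base class for errors raised by requests
        print(f"Request failed: {exc}")
        return None
    return response.text
```

A call such as `html = fetch_html('http://quotes.toscrape.com/')` would then return either the page source or `None`, which the caller can check before parsing.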


## 3. Parse HTML Using BeautifulSoup
The `BeautifulSoup` library helps us parse the HTML content and navigate its structure.

In [10]:
soup = BeautifulSoup(html, 'html.parser')
print(soup.title.text)  # Display the title of the web page

Quotes to Scrape
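
To see how navigation works beyond `soup.title`, here is a small sketch on a hand-written HTML snippet. The markup in `sample_html` is invented for illustration and is not the real page source.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, kept inline so the example runs without a network call
sample_html = """
<html>
  <head><title>Sample Page</title></head>
  <body>
    <h1 id="main">Hello</h1>
    <a href="/page/2/">Next</a>
  </body>
</html>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

page_title = sample_soup.title.get_text()                # text inside <title>
heading = sample_soup.find("h1", id="main").get_text()   # first <h1> with id="main"
link_target = sample_soup.find("a")["href"]              # attribute access works like a dict

print(page_title, heading, link_target)
```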


## 4. Extract Quote Blocks
First, we locate every `div` element with the class `quote`. Each of these blocks wraps one quote together with its author and associated tags.

In [11]:
quotes = soup.find_all('div', class_='quote')
print(f"Found {len(quotes)} quotes on the page")

Found 10 quotes on the page
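
`find_all` is not the only way to locate elements; BeautifulSoup also accepts CSS selectors through `select`. The sketch below compares the two on an invented snippet (the markup in `sample_html` is an assumption made for the example).

```python
from bs4 import BeautifulSoup

# Hypothetical markup: two quote blocks and one unrelated div
sample_html = """
<div class="quote"><span class="text">First</span></div>
<div class="quote"><span class="text">Second</span></div>
<div class="other">Not a quote</div>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

by_find_all = sample_soup.find_all("div", class_="quote")  # keyword-argument filter
by_select = sample_soup.select("div.quote")                # equivalent CSS selector

print(len(by_find_all), len(by_select))
```

Both calls return the same two elements here; which one to use is largely a matter of taste, though CSS selectors tend to be more compact for nested lookups.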


## 5. Extract and Structure the Data
Each quote block contains the quote text, the author's name, and a list of tags. We'll extract all three and store them as one dictionary per quote in a list.

In [14]:
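data = []
for q in quotes:
    # Each field lives in its own tag inside the quote block
    text = q.find('span', class_='text').text
    author = q.find('small', class_='author').text
    # A quote can carry several tags, so collect them all
    tags = [tag.text for tag in q.find_all('a', class_='tag')]
    data.append({
        'Quote': text,
        'Author': author,
        'Tags': ", ".join(tags)  # one comma-separated string per quote
    })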
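
One caveat with the loop above: `find` returns `None` when an element is missing, so calling `.text` on it would raise an `AttributeError`. A more defensive variant might look like the sketch below. The helper name `parse_quote` and the sample markup are hypothetical, chosen only to show the pattern.

```python
from bs4 import BeautifulSoup

def parse_quote(block):
    """Extract one quote record, using None for any missing field."""
    text_el = block.find("span", class_="text")
    author_el = block.find("small", class_="author")
    tags = [t.get_text(strip=True) for t in block.find_all("a", class_="tag")]
    return {
        "Quote": text_el.get_text(strip=True) if text_el else None,
        "Author": author_el.get_text(strip=True) if author_el else None,
        "Tags": ", ".join(tags),
    }

# Hypothetical markup: the second block has no author and no tags
sample_html = """
<div class="quote">
  <span class="text">Placeholder quote</span>
  <small class="author">Author A</small>
  <a class="tag" href="#">life</a>
  <a class="tag" href="#">humor</a>
</div>
<div class="quote">
  <span class="text">Quote without an author</span>
</div>
"""

blocks = BeautifulSoup(sample_html, "html.parser").find_all("div", class_="quote")
records = [parse_quote(b) for b in blocks]
print(records)
```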

## 6. Convert Extracted Data to DataFrame
We will use `pandas` to convert the extracted data into a table-like format for easy manipulation.

In [15]:
df = pd.DataFrame(data)
df.head()

| | Quote | Author | Tags |
|---|---|---|---|
| 0 | “The world as we have created it is a process ... | Albert Einstein | change, deep-thoughts, thinking, world |
| 1 | “It is our choices, Harry, that show what we t... | J.K. Rowling | abilities, choices |
| 2 | “There are only two ways to live your life. On... | Albert Einstein | inspirational, life, live, miracle, miracles |
| 3 | “The person, be it gentleman or lady, who has ... | Jane Austen | aliteracy, books, classic, humor |
| 4 | “Imperfection is beauty, madness is genius and... | Marilyn Monroe | be-yourself, inspirational |
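
Once the data is in a DataFrame, the usual `pandas` tools apply. As a small sketch, the block below counts quotes per author and serializes the table to CSV. The rows are placeholders invented for the example (not scraped data), and an in-memory buffer stands in for a file on disk.

```python
import io
import pandas as pd

# Placeholder rows with the same columns as the scraped data
rows = [
    {"Quote": "Placeholder quote one", "Author": "Author A", "Tags": "life, humor"},
    {"Quote": "Placeholder quote two", "Author": "Author B", "Tags": "books"},
    {"Quote": "Placeholder quote three", "Author": "Author A", "Tags": ""},
]
sample_df = pd.DataFrame(rows)

# How many quotes does each author have?
quotes_per_author = sample_df["Author"].value_counts()

# Write CSV without the index column, so reading it back adds no extra column
buffer = io.StringIO()
sample_df.to_csv(buffer, index=False)
csv_text = buffer.getvalue()

print(quotes_per_author)
print(csv_text)
```

Passing a filename such as `'quotes.csv'` instead of the buffer would write the same content to disk.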


## 7. Summary
- We fetched the HTML with `requests`.
- We parsed it with `BeautifulSoup`.
- We extracted the relevant data: quotes, authors, and tags.
- We stored the data in a structured format using `pandas`.

**Try it yourself:**
Explore scraping other structured content like tables or try a different page on the same site!
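
As a starting point for moving to a different page, the sketch below finds the link to the next page and resolves it against the current URL. It assumes the "Next" button is an `<a>` inside an `<li class="next">` element; that selector and the inline sample markup are assumptions, so inspect the real page source before relying on them. The helper name `next_page_url` is hypothetical.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

def next_page_url(html, base_url):
    """Return the absolute URL of the next page, or None if there is none."""
    soup = BeautifulSoup(html, "html.parser")
    link = soup.select_one("li.next a")  # assumed markup for the "Next" button
    if link is None or not link.get("href"):
        return None
    # Resolve a relative href such as "/page/2/" against the current URL
    return urljoin(base_url, link["href"])

# Hypothetical pager markup for a middle page and for the last page
sample_page = '<ul class="pager"><li class="next"><a href="/page/2/">Next</a></li></ul>'
last_page = '<ul class="pager"><li class="previous"><a href="/page/9/">Previous</a></li></ul>'

print(next_page_url(sample_page, "http://quotes.toscrape.com/"))
print(next_page_url(last_page, "http://quotes.toscrape.com/page/10/"))
```

Calling this in a loop, with a short pause between requests, would let the extraction steps from this notebook run over every page until the helper returns `None`.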