## Web Scraping

Web scraping is the automated extraction of data from websites. It is a practical way to gather information that is available only as web pages.

**Beautiful Soup** is a popular Python library that makes it easy to scrape information from web pages.

---

### Importance of Web Scraping

#### 1. Data Collection and Aggregation
- **Market Research**: Gather insights about competitors, market trends, and customer preferences.
- **Price Monitoring**: E-commerce platforms can dynamically adjust prices based on competitor data.
- **News Aggregation**: Pull articles from various sources for centralized, real-time news coverage.

#### 2. Business Intelligence and Analytics
- **Customer Sentiment Analysis**: Extract reviews and social media comments to improve products/services.
- **Trend Analysis**: Identify industry patterns using scraped data from multiple sources.

#### 3. Content Extraction
- **Academic Research**: Automate data collection for analysis in research papers.
- **Data Journalism**: Support investigative journalism with structured and verifiable data.

#### 4. Lead Generation
- **Contact Information**: Extract emails and phone numbers for marketing campaigns.
- **Job Listings**: Aggregate listings across sites to assist job seekers and recruitment platforms.

#### 5. SEO and SEM Strategies
- **Keyword Research**: Analyze keywords used by competitors for SEO optimization.
- **Backlink Analysis**: Discover competitor backlink sources to improve your domain authority.

#### 6. Automating Repetitive Tasks
- **Data Entry**: Reduce manual effort and error by automating data capture.
- **Monitoring**: Track changes and updates to websites automatically.

#### 7. Personal Projects and Learning
- **Portfolio Projects**: Showcase your skills in data collection and processing.
- **Learning and Experimentation**: Practice and explore Python, HTML parsing, and data analysis.

---

### Ethical and Legal Considerations

While web scraping is powerful, it's essential to use it **responsibly and ethically**:

- **Respect Terms of Service**: Always check if the website permits scraping.
- **Respect `robots.txt`**: Follow the site’s crawling policies.
- **Data Privacy**: Avoid scraping personal data unless consent has been given, and comply with laws such as **GDPR**.
- **Server Load**: Don’t overload servers with rapid, repeated requests; heavy scraping can degrade a site in the same way a denial-of-service attack does.

---

> ⚠️ Always scrape **politely** and **ethically**. Set a descriptive User-Agent, add delays between requests, and handle retries and errors gracefully.
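One way to honor `robots.txt` before fetching is Python's standard-library `urllib.robotparser`. The sketch below parses a made-up policy from a string so it runs offline; the rules, the `my-scraper` user-agent name, and the `polite_urls` helper are illustrative assumptions, not anything prescribed by a particular site.

```python
import time
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real one is served at <site>/robots.txt.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

USER_AGENT = "my-scraper"  # assumed name for this example

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def polite_urls(urls, delay_seconds=0.0):
    """Yield only the URLs the policy allows, pausing after each one."""
    for url in urls:
        if parser.can_fetch(USER_AGENT, url):
            yield url
            time.sleep(delay_seconds)


candidates = [
    "http://example.com/catalogue/",
    "http://example.com/private/report.html",
]
allowed = list(polite_urls(candidates))
print(allowed)

# The policy's requested delay, if any, can be read back and used as the pause.
print(parser.crawl_delay(USER_AGENT))
```

For a live site, `parser.set_url(...)` followed by `parser.read()` would load the real file instead of the inline string.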


## Introduction to Beautiful Soup

**Beautiful Soup** is a Python library used to parse HTML and XML documents. It provides simple methods to **search**, **navigate**, and **modify** the parse tree. Beautiful Soup is designed to be easy and beginner-friendly, making it an excellent tool for web scraping projects.

With Beautiful Soup, you can extract data from websites to:
- Create reports
- Visualize data
- Perform analysis

---

### What is Data Parsing?

**Data parsing** refers to the process of converting raw data (such as HTML) into a different format that's easier to work with.

#### What Does a Data Parser Do?

A **data parser**:
1. Receives data in a certain format (e.g., HTML).
2. Reads the data and stores it as a string.
3. Extracts relevant information from the string.
4. Optionally cleans or processes the data.
5. Outputs it in formats like JSON, CSV, YAML, or stores it in databases.
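The five steps above can be sketched on a tiny in-memory page. The HTML snippet, the class names `product`, `name`, and `price`, and the `parse_products` helper are invented for illustration; real pages will need their own selectors.

```python
import json

from bs4 import BeautifulSoup

# 1. Receive data in a known format (hypothetical HTML).
RAW_HTML = """
<ul>
  <li class="product"><span class="name"> Notebook </span><span class="price">$4.50</span></li>
  <li class="product"><span class="name">Pen</span><span class="price">$1.20</span></li>
</ul>
"""


def parse_products(html):
    # 2. Read the string into a parse tree.
    soup = BeautifulSoup(html, "html.parser")
    products = []
    for item in soup.find_all("li", class_="product"):
        # 3. Extract the relevant pieces.
        name = item.find("span", class_="name").get_text()
        price = item.find("span", class_="price").get_text()
        # 4. Clean them: trim whitespace and turn the price into a number.
        products.append({
            "name": name.strip(),
            "price": float(price.strip().lstrip("$")),
        })
    return products


# 5. Output in a convenient format, here JSON.
products = parse_products(RAW_HTML)
print(json.dumps(products, indent=2))
```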

---

### Example: Why Parsing is Useful

**Imagine you're building a price comparison tool**:
- It scrapes data from multiple e-commerce sites.
- Collects and compares product prices in real time.
- Helps users find the best deals.
- Increases your traffic and affiliate sales.
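The comparison step of such a tool might look like the sketch below. The store names and prices are made up, all prices are assumed to be in the same currency, and `parse_price` assumes prices arrive as strings with a currency symbol, as they often do in scraped HTML.

```python
from decimal import Decimal

# Hypothetical scraped results: store name -> price string for one product.
scraped_prices = {
    "store-a": "£51.77",
    "store-b": "£49.99",
    "store-c": "£53.10",
}


def parse_price(text):
    """Keep only digits and the decimal point, then convert to Decimal."""
    cleaned = "".join(ch for ch in text if ch.isdigit() or ch == ".")
    return Decimal(cleaned)


def best_deal(prices):
    """Return the (store, price) pair with the lowest price."""
    parsed = {store: parse_price(p) for store, p in prices.items()}
    store = min(parsed, key=parsed.get)
    return store, parsed[store]


store, price = best_deal(scraped_prices)
print(f"Cheapest: {store} at {price}")
```

`Decimal` is used rather than `float` so that prices compare exactly as written.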

---

### Step-by-Step Guide to Web Scraping Using Beautiful Soup

#### Step 1: Install Required Libraries

Install using `pip`:
```bash
pip install beautifulsoup4
pip install requests
```

#### Step 2: Import Libraries
```python 
import requests
from bs4 import BeautifulSoup
```

#### Step 3: Send an HTTP Request
```python 
url = 'http://example.com'
response = requests.get(url)
```
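In practice the request can fail or hang, so it may be worth adding a timeout and checking the status code. The sketch below wraps the call in a hypothetical `fetch_html` helper; the User-Agent string and the 10-second timeout are arbitrary choices for illustration.

```python
import requests

# Assumed value; replace with something that identifies your own scraper.
HEADERS = {"User-Agent": "my-scraper/0.1 (contact: you@example.com)"}


def fetch_html(url, session=None, timeout=10):
    """Return the page body as bytes, or None if the request fails."""
    session = session or requests.Session()
    try:
        response = session.get(url, headers=HEADERS, timeout=timeout)
        response.raise_for_status()  # raises HTTPError on 4xx/5xx responses
    except requests.RequestException as exc:
        print(f"Request to {url} failed: {exc}")
        return None
    return response.content
```

A call such as `html = fetch_html('http://example.com')` then yields either the raw bytes to hand to Beautiful Soup or `None`, which the caller can skip.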

#### Step 4: Parse the HTML Content
```python
soup = BeautifulSoup(response.content, 'html.parser')
```

#### Step 5: Extract Data
- Example: Extract all `<h1>` tags
```python
h1_tags = soup.find_all('h1')
for h1 in h1_tags:
    print(h1.text)
```

#### Step 6: More Advanced Data Extraction
- Find by Tag Name

```python
title = soup.title
print(title.text)
```

- Find by Class Name

```python
articles = soup.find_all('div', class_='article')
for article in articles:
    print(article.text)
```

- Find by ID 
```python
main_content = soup.find(id='main-content')
print(main_content.text)
```

- Extract Attributes
```python
img_tags = soup.find_all('img')
for img in img_tags:
    print(img['src'])
```
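The snippets above assume every element and attribute exists. `find()` returns `None` when nothing matches, and indexing a missing attribute raises `KeyError`, so a more defensive version may be useful on real pages. The HTML below is a made-up fragment for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: one image has no src, and there is no #main-content.
html = """
<div id="sidebar">
  <img src="/static/logo.png" alt="Logo">
  <img alt="Placeholder without a source">
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# find() returns None for a missing element, so check before reading its text.
main_content = soup.find(id="main-content")
main_text = main_content.get_text(strip=True) if main_content is not None else ""

# Tag.get() returns None instead of raising KeyError for a missing attribute.
sources = [img.get("src") for img in soup.find_all("img")]
present = [src for src in sources if src]

print(repr(main_text))
print(sources)
print(present)
```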

### Example Project: Scraping an Online Bookstore

#### Step 1: Import Libraries

```python
import requests                # Fetch content from the web.
from bs4 import BeautifulSoup  # Parse HTML content.
import csv                     # Save data to a CSV file.
```

#### Step 2: Send HTTP Request

```python
response = requests.get("http://books.toscrape.com/")
response
```

Output:

```
<Response [200]>
```

#### Step 3: Parse HTML Content

```python
soup = BeautifulSoup(response.content, 'html.parser')
```

| Part                 | Meaning                                                                                                                                                                        |
| -------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| `response.content`   | This is the **raw HTML content** (in bytes) returned from the website by the `requests.get()` call. It's what the browser would see when loading the page.                     |
| `'html.parser'`      | This is the **parser** used by Beautiful Soup to process the HTML. It tells Beautiful Soup how to interpret the structure of the HTML document.                                |
| `BeautifulSoup(...)` | This creates a **BeautifulSoup object**, which represents the entire HTML document. This object provides powerful methods to extract elements like `<div>`, `<p>`, `<a>`, etc. |
| `soup = ...`         | Stores the parsed result in a variable named `soup`. This is your main entry point to navigate and extract information from the webpage.                                       |


#### Step 4: Find Book Elements

```python
# Find product containers using find_all()
books = soup.find_all('article', class_='product_pod')
```
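The same containers can also be matched with a CSS selector through `select()`, which some readers may find more compact than `find_all()`. The markup below is a trimmed, hand-written imitation of a product listing, not a copy of the real page.

```python
from bs4 import BeautifulSoup

html = """
<section>
  <article class="product_pod"><h3><a title="Book One" href="one.html">Book...</a></h3></article>
  <article class="product_pod"><h3><a title="Book Two" href="two.html">Book...</a></h3></article>
  <article class="other"><h3><a title="Not a product" href="x.html">x</a></h3></article>
</section>
"""
soup = BeautifulSoup(html, "html.parser")

# find_all() with a tag name and a class...
by_find_all = soup.find_all("article", class_="product_pod")

# ...matches the same elements as the CSS selector "article.product_pod".
by_select = soup.select("article.product_pod")

# A selector can also descend to nested elements in one step.
titles = [a["title"] for a in soup.select("article.product_pod h3 a")]
print(len(by_find_all), len(by_select))
print(titles)
```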


#### Step 5: Extract Book Data

```python
data = []
for book in books:
    title = book.h3.a['title']
    price = book.find('p', class_='price_color').text
    availability = book.find('p', class_="instock availability").text.strip()
    data.append([title, price, availability])
data
```

Output (truncated):

```
[['A Light in the Attic', '£51.77', 'In stock'],
 ['Tipping the Velvet', '£53.74', 'In stock'],
 ['Soumission', '£50.10', 'In stock'],
 ['Sharp Objects', '£47.82', 'In stock'],
 ['Sapiens: A Brief History of Humankind', '£54.23', 'In stock'],
 ['The Requiem Red', '£22.65', 'In stock'],
 ['The Dirty Little Secrets of Getting Your Dream Job', '£33.34', 'In stock'],
 ['The Coming Woman: A Novel Based on the Life of the Infamous Feminist, Victoria Woodhull',
  '£17.93',
  'In stock'],
 ['The Boys in the Boat: Nine Americans and Their Epic Quest for Gold at the 1936 Berlin Olympics',
  '£22.60',
  'In stock'],
 ['The Black Maria', '£52.15', 'In stock'],
 ['Starving Hearts (Triangular Trade Trilogy, #1)', '£13.99', 'In stock'],
 ["Shakespeare's Sonnets", '£20.66', 'In stock'],
 ['Set Me Free', '£17.46', 'In stock'],
 ["Scott Pilgrim's Precious Little Life (Scott Pilgrim #1)",
  '£52.29',
  'In stock'],
 ['Rip it Up and Start Again', '£35.02', 'In stock'],
 ['Our Band Could Be Your Life: Scen
```

#### Step 6: Write Data to CSV

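```python
# newline='' prevents blank rows on Windows; utf-8 keeps the £ sign intact.
with open('bookstore.csv', 'w', newline='', encoding='utf-8') as file:
    writer = csv.writer(file)
    writer.writerow(["Title", "Price", "Availability"])
    writer.writerows(data)
print("Data has been written to bookstore.csv")
```

Output:

```
Data has been written to bookstore.csv
```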
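To confirm the file was written as intended, it can be read back with `csv.DictReader`. The sketch below uses a temporary file and two sample rows copied from the scraped output so that it runs on its own; in the tutorial, `bookstore.csv` would be opened directly.

```python
import csv
import os
import tempfile

# Two rows copied from the scraped output, used here as sample data.
rows = [
    ["A Light in the Attic", "£51.77", "In stock"],
    ["Tipping the Velvet", "£53.74", "In stock"],
]

path = os.path.join(tempfile.mkdtemp(), "bookstore_sample.csv")

# Write the header and the sample rows.
with open(path, "w", newline="", encoding="utf-8") as file:
    writer = csv.writer(file)
    writer.writerow(["Title", "Price", "Availability"])
    writer.writerows(rows)

# Read the file back; each row becomes a dict keyed by the header.
with open(path, newline="", encoding="utf-8") as file:
    books_read = list(csv.DictReader(file))

for book in books_read:
    print(book["Title"], "->", book["Price"])
```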


#### Advanced Extraction Techniques

- Filtering by tag and class: `soup.find_all('div', class_='some_class')`
- Extracting links, images, and nested elements

```python
images = soup.find_all('img')
for img in images:
    print(img['src'])  # Print the source URL of each image
```

Output (truncated):

```
media/cache/2c/da/2cdad67c44b002e7ead0cc35693c0e8b.jpg
media/cache/26/0c/260c6ae16bce31c8f8c95daddd9f4a1c.jpg
media/cache/3e/ef/3eef99c9d9adef34639f510662022830.jpg
media/cache/32/51/3251cf3a3412f53f339e42cac2134093.jpg
media/cache/be/a5/bea5697f2534a2f86a3ef27b5a8c12a6.jpg
media/cache/68/33/68339b4c9bc034267e1da611ab3b34f8.jpg
media/cache/92/27/92274a95b7c251fea59a2b8a78275ab4.jpg
media/cache/3d/54/3d54940e57e662c4dd1f3ff00c78cc64.jpg
media/cache/66/88/66883b91f6804b2323c8369331cb7dd1.jpg
media/cache/58/46/5846057e28022268153beff6d352b06c.jpg
media/cache/be/f4/bef44da28c98f905a3ebec0b87be8530.jpg
media/cache/10/48/1048f63d3b5061cd2f424d20b3f9b666.jpg
media/cache/5b/88/5b88c52633f53cacf162c15f4f823153.jpg
media/cache/94/b1/94b1b8b244bce9677c2f29ccc890d4d2.jpg
media/cache/81/c4/81c4a973364e17d01f217e1188253d5e.jpg
media/cache/54/60/54607fe8945897cdcced0044103b10b6.jpg
media/cache/55/33/553310a7162dfbc2c6d19a84da0df9e1.jpg
media/cache/09/a3/09a3aef48557576e1a85ba7efea8ecb7.jpg
media/cach
```

### Scrape Quotes

```python
# Send an HTTP request
url = "http://quotes.toscrape.com/"
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')
```

```python
title = soup.title.text
print(title)  # Print the title of the page
```

Output:

```
Quotes to Scrape
```


```python
# Find all quotes
# Each quote is inside a <div class="quote"> block.
quotes = soup.find_all('div', class_='quote')
```

```python
# Extract quotes, authors, and tags
quotes_data = []
for quote in quotes:
    text = quote.find('span', class_='text').text
    author = quote.find('small', class_='author').text
    tags = [tag.text for tag in quote.find_all('a', class_='tag')]
    quotes_data.append({
        'text': text,
        'author': author,
        'tags': tags
    })
```

```python
# Save the data to a CSV file
with open('quotes.csv', 'w', newline='', encoding='utf-8') as file:
    writer = csv.writer(file)
    writer.writerow(['Text', 'Author', 'Tags'])

    for quote in quotes_data:
        writer.writerow([quote['text'], quote['author'], ', '.join(quote['tags'])])
print("Quotes have been written to quotes.csv")
```

Output:

```
Quotes have been written to quotes.csv
```
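Joining the tags into one comma-separated string flattens the list. If the list structure matters, JSON is one alternative to CSV. The sketch below uses a single hand-written record in the same shape as `quotes_data`; the quote text and author are placeholders.

```python
import json

# Same structure as quotes_data above, with placeholder content.
sample_quotes = [
    {"text": "“An example quote.”", "author": "Example Author", "tags": ["life", "humor"]},
]

# ensure_ascii=False keeps characters such as curly quotes readable.
serialized = json.dumps(sample_quotes, ensure_ascii=False, indent=2)
print(serialized)

# Loading the JSON back restores the tags as a real list.
restored = json.loads(serialized)
print(restored[0]["tags"])
```

To write a file instead of a string, `json.dump(sample_quotes, file, ensure_ascii=False)` takes an open file object in the same way `csv.writer` does.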


On quotes.toscrape.com, the link to the next page sits in markup like this:

```html
<li class="next">
  <a href="/page/2/">Next →</a>
</li>
```
- The `href="/page/2/"` value is a relative URL, not the full link.
- It contains only the path after the domain, so it must be combined with the site's base URL before it can be requested.
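One way to turn a relative link into a full URL is `urllib.parse.urljoin` from the standard library, which resolves both root-relative and page-relative paths. It is shown here only as a sketch of an alternative to plain string concatenation.

```python
from urllib.parse import urljoin

# A root-relative href (leading slash) replaces the path of the base URL.
next_page = urljoin("http://quotes.toscrape.com/", "/page/2/")
print(next_page)

# Joining from a deeper page still resolves against the site root.
from_page_two = urljoin("http://quotes.toscrape.com/page/2/", "/page/3/")
print(from_page_two)

# A href without a leading slash is resolved relative to the current page,
# as with the image paths printed in the bookstore example.
image_url = urljoin(
    "http://books.toscrape.com/",
    "media/cache/2c/da/2cdad67c44b002e7ead0cc35693c0e8b.jpg",
)
print(image_url)
```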

#### Challenge: Scraping Multiple Pages

```python
import requests
from bs4 import BeautifulSoup
import csv

base_url = "http://quotes.toscrape.com"  # constant part of the site
url = "/"  # changing part like /, /page/2/

quotes_data = []

while url:
    full_url = base_url + url
    print(f"Scraping page: {full_url}")  # Message for each page

    # Send HTTP request to current page
    response = requests.get(full_url)
    soup = BeautifulSoup(response.content, 'html.parser')

    # Extract quotes, authors, tags
    quotes = soup.find_all('div', class_='quote')
    for quote in quotes:
        text = quote.find('span', class_='text').text
        author = quote.find('small', class_='author').text
        tags = [tag.text for tag in quote.find_all('a', class_='tag')]
        quotes_data.append({
            'text': text,
            'author': author,
            'tags': tags
        })

    # Check for the next page
    next_btn = soup.find('li', class_='next')
    if next_btn:
        url = next_btn.a['href']
    else:
        url = None

# Write to CSV
with open('quotes_all_pages.csv', 'w', newline='', encoding='utf-8') as file:
    writer = csv.writer(file)
    writer.writerow(['Text', 'Author', 'Tags'])
    for quote in quotes_data:
        writer.writerow([quote['text'], quote['author'], ', '.join(quote['tags'])])

print("All quotes scraped and saved to quotes_all_pages.csv.")
```


Output:

```
Scraping page: http://quotes.toscrape.com/
Scraping page: http://quotes.toscrape.com/page/2/
Scraping page: http://quotes.toscrape.com/page/3/
Scraping page: http://quotes.toscrape.com/page/4/
Scraping page: http://quotes.toscrape.com/page/5/
Scraping page: http://quotes.toscrape.com/page/6/
Scraping page: http://quotes.toscrape.com/page/7/
Scraping page: http://quotes.toscrape.com/page/8/
Scraping page: http://quotes.toscrape.com/page/9/
Scraping page: http://quotes.toscrape.com/page/10/
All quotes scraped and saved to quotes_all_pages.csv.
```
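The loop above trusts the site to eventually stop offering a next link. A slightly more defensive variant caps the number of pages and pauses between requests. In the sketch below, `crawl`, `max_pages`, and `delay` are hypothetical names, and the two in-memory pages stand in for real HTTP responses so that the example runs offline.

```python
import time
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Fake site: URL -> HTML, standing in for requests.get(url).content.
PAGES = {
    "http://example.com/": (
        '<div class="quote"><span class="text">First</span></div>'
        '<li class="next"><a href="/page/2/">Next</a></li>'
    ),
    "http://example.com/page/2/": (
        '<div class="quote"><span class="text">Second</span></div>'
    ),
}


def crawl(start_url, fetch, max_pages=50, delay=0.0):
    """Follow 'next' links from start_url, visiting at most max_pages pages."""
    texts, url, visited = [], start_url, 0
    while url and visited < max_pages:
        soup = BeautifulSoup(fetch(url), "html.parser")
        visited += 1
        for quote in soup.find_all("div", class_="quote"):
            texts.append(quote.find("span", class_="text").text)
        next_btn = soup.find("li", class_="next")
        # urljoin resolves the relative href against the current page URL.
        url = urljoin(url, next_btn.a["href"]) if next_btn is not None else None
        if url:
            time.sleep(delay)  # pause before requesting the next page
    return texts


all_texts = crawl("http://example.com/", fetch=PAGES.__getitem__)
print(all_texts)
```

Against the real site, `fetch` could be a small function that calls `requests.get(url).content`, with `delay` set to a second or more.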
