# Session 27: Web Scraping with Requests and BeautifulSoup (Part 1)

**Unit 3: Data Collection and Cleaning**
**Hour: 27**
**Mode: Practical Lab**

---

### 1. Objective

This is our first hands-on lab for data collection. We will learn the two-step process for scraping a simple website:
1.  Use the `requests` library to download the HTML content of a web page.
2.  Use the `BeautifulSoup` library to parse the HTML and make it searchable.

We will be scraping quotes from [quotes.toscrape.com](http://quotes.toscrape.com/), a website designed specifically for this purpose.

### 2. Setup

We need to import the two key libraries for this task.

In [None]:
import requests
from bs4 import BeautifulSoup

### 3. Step 1: Making the Request

We use the `requests.get()` function to send an HTTP GET request to the website's URL. This is the same kind of request your browser sends when you type in a web address.

In [None]:
URL = 'http://quotes.toscrape.com/'
# timeout (in seconds) stops the call from hanging if the server never responds
response = requests.get(URL, timeout=10)

The `response` object contains the server's response. We can check the `status_code` to see if our request was successful.

*   `200` means OK.
*   `404` means Not Found.
*   `500` means Internal Server Error.

In [None]:
response.status_code
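
The three codes above are only a small sample. As a minimal sketch, Python's standard-library `http.HTTPStatus` enum can translate a standard code into its official phrase. The helper name `describe_status` is made up for this illustration.

```python
from http import HTTPStatus


def describe_status(code):
    """Return a readable label such as '404 Not Found' for an HTTP status code."""
    try:
        status = HTTPStatus(code)
    except ValueError:
        # Codes outside the standard set are not members of the enum
        return f"{code} (unrecognized status code)"
    return f"{status.value} {status.phrase}"


for code in (200, 404, 500):
    print(describe_status(code))
```

With the lab's own response, the call would be `describe_status(response.status_code)`.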

The raw HTML content, decoded to a string, is available in the `.text` attribute of the response.

In [None]:
# Let's look at the first 500 characters of the HTML
html_content = response.text
print(html_content[:500])

This is messy and hard to read. We need a parser.
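
Before moving on to parsing, Step 1 can be wrapped into one reusable function. This is only a sketch: the name `fetch_html` and the 10-second timeout are assumptions for illustration, not part of the lab. `raise_for_status()` is a `requests` method that raises `requests.HTTPError` for 4xx and 5xx responses, so a failed download stops the script instead of handing an error page to the parser.

```python
import requests


def fetch_html(url, timeout=10):
    """Download a page and return its HTML as a string.

    Raises requests.HTTPError if the server answers with a 4xx or 5xx status.
    """
    # timeout (in seconds) keeps the call from waiting forever on a silent server
    response = requests.get(url, timeout=timeout)
    # Turn error status codes into exceptions instead of checking them by hand
    response.raise_for_status()
    return response.text
```

Calling `fetch_html('http://quotes.toscrape.com/')` should return the same string that `response.text` held above.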

### 4. Step 2: Parsing with BeautifulSoup

We create a `BeautifulSoup` object by passing it our `html_content` and telling it which parser to use. `'html.parser'` ships with Python's standard library, so it needs no extra installation.

In [None]:
soup = BeautifulSoup(html_content, 'html.parser')

The `soup` object is now a structured, searchable representation of the page's HTML. We can use its methods to find specific HTML elements.

For example, let's find the `<title>` tag of the page.

In [None]:
page_title = soup.title.text # .text extracts only the text content
print(page_title)
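
To see what "structured" means without depending on a live site, the sketch below parses a tiny hand-written HTML string. The markup and the `sample_` names are invented for illustration.

```python
from bs4 import BeautifulSoup

# Hand-written HTML, small enough to read at a glance
sample_html = """
<html>
  <head><title>Sample Page</title></head>
  <body>
    <h1>Hello</h1>
    <p class="intro">First paragraph.</p>
  </body>
</html>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

title_tag = sample_soup.title
print(title_tag)                # the whole tag: <title>Sample Page</title>
print(title_tag.name)           # the tag's name: title
print(title_tag.text)           # only the text inside: Sample Page

# Attributes are read like dictionary keys; class can hold several values,
# so BeautifulSoup returns it as a list
print(sample_soup.p["class"])   # ['intro']
```

Accessing a tag as an attribute (`sample_soup.title`, `sample_soup.p`) returns the first tag with that name, which is why it works well for elements that appear once per page.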

### 5. Finding a Single Element

To decide what to search for, we first need to inspect the page's HTML (in your browser, right-click an element and choose "Inspect").

After inspecting [quotes.toscrape.com](http://quotes.toscrape.com/), we see that each quote is contained within a `<div>` that has the class `quote`.

Let's find the **first** quote on the page using the `.find()` method.

In [None]:
# Find the first div with class='quote'
first_quote_div = soup.find('div', class_='quote')

print(first_quote_div.prettify()) # .prettify() formats the HTML nicely
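
The keyword is spelled `class_` because `class` is a reserved word in Python. As a sketch on a hand-written snippet (the HTML is invented to mimic the structure described above), the three BeautifulSoup calls below select the same element:

```python
from bs4 import BeautifulSoup

# Invented markup with two containers that share the same class
snippet = """
<div class="quote"><span class="text">First</span></div>
<div class="quote"><span class="text">Second</span></div>
"""
demo_soup = BeautifulSoup(snippet, "html.parser")

by_keyword = demo_soup.find("div", class_="quote")          # keyword argument
by_attrs = demo_soup.find("div", attrs={"class": "quote"})  # attribute dictionary
by_css = demo_soup.select_one("div.quote")                  # CSS selector

# Each call stops at the first matching <div> and ignores the second one
print(by_keyword == by_attrs == by_css)
print(by_keyword.span.text)
```

Which form to use is mostly a matter of taste; the CSS selector version tends to be shorter once selections get nested.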

Now that we've isolated the quote's container, we can search *within* it to get the text and the author.

The quote text is in a `<span>` with `class='text'`. The author is in a `<small>` with `class='author'`.

In [None]:
quote_text = first_quote_div.find('span', class_='text').text
author_name = first_quote_div.find('small', class_='author').text

print(f"Quote: {quote_text}")
print(f"Author: {author_name}")
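
One caveat with chained calls like these: `.find()` returns `None` when nothing matches, so `.text` would raise an `AttributeError` if the page layout ever changed. A defensive sketch follows, using a hypothetical helper `extract_quote` and hand-written HTML that imitates the structure above:

```python
from bs4 import BeautifulSoup


def extract_quote(quote_div):
    """Return the quote text and author from one quote container.

    Missing pieces come back as None instead of raising AttributeError.
    """
    text_tag = quote_div.find("span", class_="text")
    author_tag = quote_div.find("small", class_="author")
    return {
        "text": text_tag.get_text(strip=True) if text_tag is not None else None,
        "author": author_tag.get_text(strip=True) if author_tag is not None else None,
    }


# Invented test markup: the second container has no author tag
complete = BeautifulSoup(
    '<div class="quote"><span class="text">Stay curious.</span>'
    '<small class="author">A. Writer</small></div>',
    "html.parser",
).find("div", class_="quote")
incomplete = BeautifulSoup(
    '<div class="quote"><span class="text">No author here.</span></div>',
    "html.parser",
).find("div", class_="quote")

print(extract_quote(complete))
print(extract_quote(incomplete))
```

In the lab itself, the equivalent call would be `extract_quote(first_quote_div)`.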

### 6. Conclusion

In this lab, you have learned the fundamentals of web scraping:
1.  Using `requests` to download the HTML content of a page.
2.  Using `BeautifulSoup` to parse the HTML.
3.  Using the `.find()` method to locate the first occurrence of a specific HTML element based on its tag and class.
4.  Extracting the `.text` from a found element.

**Next Session:** We will build on this by learning how to find *all* matching elements on a page and how to organize the scraped data into a structured Pandas DataFrame.