# BeautifulSoup For Data Collection

## 1. What is BeautifulSoup?

- BeautifulSoup is a Python library used to parse HTML and XML documents.
- It creates a parse tree from page content, making it easy to extract data.
- It is often used with requests to scrape websites.
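
To make the "parse tree" idea concrete, here is a minimal sketch that parses a tiny HTML string (made up for illustration, not taken from the book site) and walks the resulting tree.

```python
from bs4 import BeautifulSoup

# A small made-up document, used only to illustrate the parse tree.
html = (
    "<html><head><title>Demo</title></head>"
    "<body><p class='intro'>Hello <b>world</b></p></body></html>"
)
soup = BeautifulSoup(html, "html.parser")

title_text = soup.title.string        # text inside <title>
paragraph = soup.find("p")            # first <p> tag in the tree
bold_text = paragraph.b.get_text()    # text of the <b> nested inside that <p>
parent_name = paragraph.parent.name   # name of the tag that contains the <p>
para_text = paragraph.get_text()      # all text inside the <p>, nested tags included

print(title_text, bold_text, parent_name, para_text)
```

Every tag becomes an object that knows its children and its parent, which is why extracting data is mostly a matter of moving through the tree.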

## 2. Installing BeautifulSoup

- Install both beautifulsoup4 and a parser like lxml:

In [1]:
# In the Anaconda Prompt, type: conda install beautifulsoup4 lxml and press Enter.
# Why do we install both packages, beautifulsoup4 and lxml?
# beautifulsoup4 is the package used for web scraping.
# lxml is a separate package that parses the HTML so the data can be extracted. In most cases we use "html.parser", but sometimes "lxml".
# Watch the video in Harry bhai's data science course carefully to see the order in which the cells must be run; otherwise everything will break.

## 3. Creating a BeautifulSoup Object

- `response.text` (or, as in the cells below, the contents of a saved HTML file): the HTML content to parse.
- `"lxml"`: a fast and powerful parser (you can also use `"html.parser"`).
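
As a rough sketch of how the parser choice can be handled, the helper below tries `"lxml"` first and falls back to the built-in `"html.parser"` when lxml is not installed. The name `make_soup` is a hypothetical helper for illustration, not part of BeautifulSoup.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(html):
    """Hypothetical helper: prefer lxml, fall back to html.parser."""
    try:
        return BeautifulSoup(html, "lxml")
    except FeatureNotFound:
        # Raised by BeautifulSoup when the requested parser is not installed.
        return BeautifulSoup(html, "html.parser")


soup = make_soup("<html><body><h1>Books</h1></body></html>")
heading = soup.find("h1").get_text()
print(heading)
```

Both parsers return the same kind of soup object, so the rest of the scraping code does not change.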

In [2]:
from bs4 import BeautifulSoup

In [3]:
# Read page1.html from the htmls folder
with open("htmls/page1.html") as f:
    content = f.read()

In [4]:
# Build the soup from the HTML content of page1.html. Everything else is extracted through this soup object.
soup = BeautifulSoup(content, "html.parser")

In [5]:
# To get every book name and price, open https://books.toscrape.com/catalogue/page-1.html, go to the first book name, and use Inspect Element on it.
# Inspect Element shows that the book name sits inside an h3 tag, so the first step is to extract all h3 tags from the page. Let's go through the methods one by one.

### find_all() Method:

- Finds all matching elements:

In [6]:
# Get all h3 tags (that is, all book names) from page1.html; the output is still in HTML form:
h3s = soup.find_all("h3")

In [7]:
h3s

[<h3><a href="a-light-in-the-attic_1000/index.html" title="A Light in the Attic">A Light in the ...</a></h3>,
 <h3><a href="tipping-the-velvet_999/index.html" title="Tipping the Velvet">Tipping the Velvet</a></h3>,
 <h3><a href="soumission_998/index.html" title="Soumission">Soumission</a></h3>,
 <h3><a href="sharp-objects_997/index.html" title="Sharp Objects">Sharp Objects</a></h3>,
 <h3><a href="sapiens-a-brief-history-of-humankind_996/index.html" title="Sapiens: A Brief History of Humankind">Sapiens: A Brief History ...</a></h3>,
 <h3><a href="the-requiem-red_995/index.html" title="The Requiem Red">The Requiem Red</a></h3>,
 <h3><a href="the-dirty-little-secrets-of-getting-your-dream-job_994/index.html" title="The Dirty Little Secrets of Getting Your Dream Job">The Dirty Little Secrets ...</a></h3>,
 <h3><a href="the-coming-woman-a-novel-based-on-the-life-of-the-infamous-feminist-victoria-woodhull_993/index.html" title="The Coming Woman: A Novel Based on the Life of the Infamous Femi

In [8]:
for h3 in h3s:
    print(h3.find("a")['title'])

A Light in the Attic
Tipping the Velvet
Soumission
Sharp Objects
Sapiens: A Brief History of Humankind
The Requiem Red
The Dirty Little Secrets of Getting Your Dream Job
The Coming Woman: A Novel Based on the Life of the Infamous Feminist, Victoria Woodhull
The Boys in the Boat: Nine Americans and Their Epic Quest for Gold at the 1936 Berlin Olympics
The Black Maria
Starving Hearts (Triangular Trade Trilogy, #1)
Shakespeare's Sonnets
Set Me Free
Scott Pilgrim's Precious Little Life (Scott Pilgrim #1)
Rip it Up and Start Again
Our Band Could Be Your Life: Scenes from the American Indie Underground, 1981-1991
Olio
Mesaerion: The Best Science Fiction Stories 1800-1849
Libertarianism for Beginners
It's Only the Himalayas


### Explanation of the code above:

Let's break it down **step by step in the simplest way**:

---

### 🧩 Code:

```python
for h3 in h3s:
    print(h3.find("a")['title'])
```

---

### 💡 Explanation:

1. **`h3s`** →
   It is a list of all `<h3>` HTML tags you have found earlier using BeautifulSoup.
   Example:

   ```html
   <h3><a title="Restaurant ABC" href="link1.html"></a></h3>
   <h3><a title="Restaurant XYZ" href="link2.html"></a></h3>
   ```

2. **`for h3 in h3s:`** →
   This means we are looping through each `<h3>` tag one by one.

3. **`h3.find("a")`** →
   This looks **inside the `<h3>` tag** to find the first `<a>` tag (anchor tag).
   `<a>` tag in HTML is used for **links** (hyperlinks).

   Example:

   ```html
   <h3><a title="Restaurant ABC" href="link1.html"></a></h3>
   ```

   Here, `<a>` is the anchor tag.

4. **`['title']`** →
   This extracts the **value of the "title" attribute** from the `<a>` tag.

   Example:
   From `<a title="Restaurant ABC" href="link1.html">`,
   it will return `"Restaurant ABC"`.

5. **`print(...)`** →
   Finally, it prints the value of each title.

---

### ✅ Final Output Example:

If the HTML looks like this:

```html
<h3><a title="KFC" href="kfc.html"></a></h3>
<h3><a title="Dominos" href="dominos.html"></a></h3>
```

Then the code prints:

```
KFC
Dominos
```

---

### 🧠 Summary:

| Code Part      | Meaning                       | Real-Life Example         |
| -------------- | ----------------------------- | ------------------------- |
| `h3s`          | List of all `<h3>` tags       | List of restaurant titles |
| `h3.find("a")` | Find link tag inside `<h3>`   | Finds the `<a>` link      |
| `['title']`    | Gets value of title attribute | "KFC", "Dominos"          |
| `print()`      | Shows result                  | Displays restaurant names |
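
One caveat about the pattern explained above: `h3.find("a")` returns `None` when there is no `<a>` tag, and `['title']` raises `KeyError` when the attribute is missing. The sketch below shows one defensive variant on made-up HTML; falling back to the link text is an assumption for illustration, not something the original notebook does.

```python
from bs4 import BeautifulSoup

# Made-up HTML covering three cases: title present, title missing, no link at all.
html = """
<h3><a href="book-1.html" title="Book One">Book O...</a></h3>
<h3><a href="book-2.html">Book Two</a></h3>
<h3>No link here</h3>
"""
soup = BeautifulSoup(html, "html.parser")

titles = []
for h3 in soup.find_all("h3"):
    link = h3.find("a")  # None when the <h3> contains no <a>
    if link is None:
        continue
    # .get() returns None instead of raising KeyError for a missing attribute;
    # in that case fall back to the visible link text.
    titles.append(link.get("title") or link.get_text(strip=True))

print(titles)
```

On the book pages every `<h3>` has a link with a `title`, so the short form works there; the defensive form matters on messier pages.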

In [12]:
# These earlier cells must already have been run, in this order:
# from bs4 import BeautifulSoup

# with open("htmls/page1.html") as f:
#     content = f.read()

# soup = BeautifulSoup(content, "html.parser")


# To extract the product price, start at the parent tag and work down step by step to the element that holds the price.
# Use Inspect Element: hovering over the markup highlights the matching part of the page.
articles = soup.select("article.product_pod")  # All book articles; print `articles` to see them

# Then take the articles one by one and pull the title and price out of each.

for article in articles:
    # Title of this book. find_all() is not needed here because we are already inside one article.product_pod.
    title = article.find("h3").find("a")["title"]
    # Text of the first <p class="price_color"> tag inside the article (see the "explanation for bs4 code" file).
    # The text comes out as Â£51.77 and so on, so split on £ to drop the stray Â and keep index 1: 51.77, 53.74, ...
    price = article.select_one("p.price_color").text.split("£")[1]
    print(price)

51.77
53.74
50.10
47.82
54.23
22.65
33.34
17.93
22.60
52.15
13.99
20.66
17.46
52.29
35.02
57.25
23.88
37.59
51.33
45.17
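
The prices above are still strings. As a small sketch of an alternative to splitting on the currency sign, the helper below pulls the number out with a regular expression and converts it to `float`; `parse_price` is a hypothetical name used only for illustration.

```python
import re


def parse_price(text):
    """Hypothetical helper: return the first decimal number in text as a float."""
    match = re.search(r"\d+(?:\.\d+)?", text)
    if match is None:
        raise ValueError(f"no price found in {text!r}")
    return float(match.group())


clean = parse_price("£51.77")     # price text as it should look
garbled = parse_price("Â£53.74")  # price text with the stray encoding character
print(clean, garbled)
```

Because the pattern only looks for digits, it gives the same result whether or not the stray character is present.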


In [24]:
items = []  # Empty list; each book's title, price, and rating is appended to it at the end of the loop
# As before, start at the parent tag and work down step by step to the elements that hold the data (check with Inspect Element).
articles = soup.select("article.product_pod")  # All book articles; print `articles` to see them

# Then take the articles one by one and pull the title, price, and rating out of each.

for article in articles:
    # Title of this book. find_all() is not needed here because we are already inside one article.product_pod.
    title = article.find("h3").find("a")["title"]
    # The price text comes out as Â£51.77 and so on; split on £ to drop the stray Â and keep index 1 (51.77).
    price = article.select_one("p.price_color").text.split("£")[1]

    # Also add the rating for each book on page1.html.
    # The class attribute is a list such as ['star-rating', 'Three'], so index 1 is the rating word.
    rating_element = article.select_one("p.star-rating")
    rating = rating_element["class"][1]

    items.append([title, price, rating])

In [25]:
# items looks like [['A Light in the Attic', '51.77', 'Three'], ['Tipping the Velvet', '53.74', 'One'], ...]
items  # All book names, prices, and ratings from page1.html

[['A Light in the Attic', '51.77', 'Three'],
 ['Tipping the Velvet', '53.74', 'One'],
 ['Soumission', '50.10', 'One'],
 ['Sharp Objects', '47.82', 'Four'],
 ['Sapiens: A Brief History of Humankind', '54.23', 'Five'],
 ['The Requiem Red', '22.65', 'One'],
 ['The Dirty Little Secrets of Getting Your Dream Job', '33.34', 'Four'],
 ['The Coming Woman: A Novel Based on the Life of the Infamous Feminist, Victoria Woodhull',
  '17.93',
  'Three'],
 ['The Boys in the Boat: Nine Americans and Their Epic Quest for Gold at the 1936 Berlin Olympics',
  '22.60',
  'Four'],
 ['The Black Maria', '52.15', 'One'],
 ['Starving Hearts (Triangular Trade Trilogy, #1)', '13.99', 'Two'],
 ["Shakespeare's Sonnets", '20.66', 'Four'],
 ['Set Me Free', '17.46', 'Five'],
 ["Scott Pilgrim's Precious Little Life (Scott Pilgrim #1)", '52.29', 'Five'],
 ['Rip it Up and Start Again', '35.02', 'Five'],
 ['Our Band Could Be Your Life: Scenes from the American Indie Underground, 1981-1991',
  '57.25',
  'Three'],
 ['Olio
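
The ratings above are words ('One' to 'Five'). If numbers are easier to work with, a lookup table is one option. The sketch below runs on a made-up snippet shaped like the page's markup; `RATING_WORDS` and `rating_to_number` are hypothetical names, not part of the original notebook.

```python
from bs4 import BeautifulSoup

# Assumed mapping from the rating words seen in the output to integers.
RATING_WORDS = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}


def rating_to_number(classes):
    """Hypothetical helper: return the numeric rating from a list of CSS classes, or None."""
    for name in classes:
        if name in RATING_WORDS:
            return RATING_WORDS[name]
    return None


# Made-up snippet imitating one book article.
html = '<article class="product_pod"><p class="star-rating Three"></p></article>'
soup = BeautifulSoup(html, "html.parser")

rating_element = soup.select_one("p.star-rating")
rating = rating_to_number(rating_element["class"])  # class is a list: ['star-rating', 'Three']
print(rating)
```

Searching the class list for a known word, rather than taking index 1, keeps working even if the order of the classes changes.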

In [26]:
# The items list above can also be converted into a DataFrame with the pandas library:
import pandas as pd

In [27]:
df = pd.DataFrame(items,columns=['Book Title','Price','Rating'])

In [28]:
df

|   | Book Title | Price | Rating |
| - | ---------- | ----- | ------ |
| 0 | A Light in the Attic | 51.77 | Three |
| 1 | Tipping the Velvet | 53.74 | One |
| 2 | Soumission | 50.1 | One |
| 3 | Sharp Objects | 47.82 | Four |
| 4 | Sapiens: A Brief History of Humankind | 54.23 | Five |
| 5 | The Requiem Red | 22.65 | One |
| 6 | The Dirty Little Secrets of Getting Your Dream... | 33.34 | Four |
| 7 | The Coming Woman: A Novel Based on the Life of... | 17.93 | Three |
| 8 | The Boys in the Boat: Nine Americans and Their... | 22.6 | Four |
| 9 | The Black Maria | 52.15 | One |


In [29]:
# This is what web scraping is: we took the data from a single HTML page (htmls/page1.html) and arranged it as a structured table in a DataFrame.
# From here the sky is the limit. After scraping, you can also save the data to Excel or CSV.
# Before that, save all the pages of the website in this folder structure: htmls/page1.html, and so on.
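
Once several pages are saved in one folder, the same extraction can run over all of them. The sketch below assumes files named `page*.html` and demonstrates the idea on two tiny made-up pages written to a temporary folder; `scrape_page`, `scrape_folder`, and `SAMPLE` are hypothetical names for illustration.

```python
import tempfile
from pathlib import Path

from bs4 import BeautifulSoup


def scrape_page(html):
    """Return [title, price] rows for every book article in one HTML page."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for article in soup.select("article.product_pod"):
        title = article.find("h3").find("a")["title"]
        price = article.select_one("p.price_color").text.split("£")[1]
        rows.append([title, price])
    return rows


def scrape_folder(folder):
    """Run scrape_page over every page*.html file in folder, in sorted name order."""
    rows = []
    for path in sorted(Path(folder).glob("page*.html")):
        rows.extend(scrape_page(path.read_text(encoding="utf-8")))
    return rows


# Made-up page template so the example runs without the real htmls folder.
SAMPLE = (
    '<article class="product_pod"><h3><a title="{t}" href="#">x</a></h3>'
    '<p class="price_color">£{p}</p></article>'
)

with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "page1.html").write_text(SAMPLE.format(t="Book A", p="10.00"), encoding="utf-8")
    Path(tmp, "page2.html").write_text(SAMPLE.format(t="Book B", p="12.50"), encoding="utf-8")
    all_rows = scrape_folder(tmp)

print(all_rows)
```

Note that a plain name sort places `page10.html` before `page2.html`; that only matters if the row order is important.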

In [30]:
# Export the DataFrame to a CSV file:
df.to_csv("data.csv", index=False)  # index=False keeps the index column out of the CSV file

In [31]:
# If you know a little Excel, you can open the file there and apply a filter to show only the 5-star books. Otherwise, just ask ChatGPT.
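
The same filter can also be applied in pandas before exporting. This is a minimal sketch using three rows copied from the output above; converting `Price` to a number is an extra step assumed here so that sorting works numerically.

```python
import pandas as pd

# Three rows copied from the scraped output, enough to show the filter.
rows = [
    ["A Light in the Attic", "51.77", "Three"],
    ["Sapiens: A Brief History of Humankind", "54.23", "Five"],
    ["Set Me Free", "17.46", "Five"],
]
df = pd.DataFrame(rows, columns=["Book Title", "Price", "Rating"])

# Prices were scraped as strings; convert them so comparisons and sorting are numeric.
df["Price"] = pd.to_numeric(df["Price"])

# Keep only the five-star books, cheapest first.
five_star = df[df["Rating"] == "Five"].sort_values("Price")
print(five_star)
```

The filtered frame can then be written out with `to_csv` in the same way as the full one.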

## 9. Summary
- BeautifulSoup helps parse and navigate HTML easily.
- Use .find(), .find_all(), .select(), and .select_one() to locate data.
- Always inspect the website's structure before writing scraping logic.
- Combine BeautifulSoup with requests for full scraping workflows.
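
To compare the four lookup methods side by side, here is a short sketch on a made-up HTML list (the markup is illustrative, not from the book site).

```python
from bs4 import BeautifulSoup

# Made-up markup for comparing the lookup methods.
html = """
<ul id="books">
  <li class="book"><a href="a.html" title="Book A">A</a></li>
  <li class="book"><a href="b.html" title="Book B">B</a></li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

first_link = soup.find("a")                  # first matching tag, or None
all_links = soup.find_all("a")               # list of every matching tag
first_css = soup.select_one("li.book a")     # first match for a CSS selector, or None
all_css = soup.select("ul#books li.book a")  # list of every match for a CSS selector
missing = soup.find("table")                 # no <table> in the markup, so this is None

print(first_link["title"], len(all_links), first_css["href"], [a["title"] for a in all_css], missing)
```

`find()` and `find_all()` match by tag name and attributes, while `select_one()` and `select()` take CSS selectors, which is convenient when the path copied from Inspect Element is already in selector form.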