This notebook contains:
✔ Real web scraping (website: https://books.toscrape.com)
✔ Cleans messy fields (price, availability, text, categories)
✔ Handling missing data
✔ Cleaning HTML artifacts
✔ Regex validation
✔ Final clean dataset export(CSV)

In [1]:
import requests
from bs4 import BeautifulSoup #Python library used for web scraping and parsing HTML or XML
import pandas as pd
import numpy as np
import re

Fetching & parsing HTML

In [2]:
def get_page(url):
    try:
        html = requests.get(url, timeout=10) #wiating time 10 sec b4 failed attemp and moving to eaxception
        html.raise_for_status() #Does nothing if the page loads correctly.Throws an exception if the page returns an error
        return BeautifulSoup(html.text, "html.parser") #BeautifulSoup reads HTML text and converts it into a structured object.
    except Exception as e:
        print(f"Failed to recieve {url} → {e}")
        return None
    

Extracting Product Data from Category Pages

In [3]:
base = "https://books.toscrape.com/catalogue/page-{}.html" #{} will be replaced with page#
def scrape_page(page_num):
    url = base.format(page_num)
    soup = get_page(url)
    if soup is None:
        return[]
    prods = []
    for prod in soup.select(".product_pod"):  #.product_pod is the CSS selector for every book item on the page.
        title = prod.h3.a["title"]
        price = prod.select_one(".price_color").text
        stock = prod.select_one(".instock.availability").text.strip()
        link = prod.h3.a["href"].replace("../../", "https://books.toscrape.com/catalogue/")
        prods.append({
        "title": title,
        "price": price,
        "stock": stock,
        "link": link})
        return prods

Multiple page scrapping

In [4]:
all_prod = []
for i in range(1,11):  #first 10 page scrapping
    print(f"Scrapping page {i}:")
    items = scrape_page(i)
    all_prod.extend(items)
df_raw = pd.DataFrame(all_prod)
df_raw.head()

Scrapping page 1:
Scrapping page 2:
Scrapping page 3:
Scrapping page 4:
Scrapping page 5:
Scrapping page 6:
Scrapping page 7:
Scrapping page 8:
Scrapping page 9:
Scrapping page 10:


Unnamed: 0,title,price,stock,link
0,A Light in the Attic,Â£51.77,In stock,a-light-in-the-attic_1000/index.html
1,In Her Wake,Â£12.84,In stock,in-her-wake_980/index.html
2,Slow States of Collapse: Poems,Â£57.31,In stock,slow-states-of-collapse-poems_960/index.html
3,The Nameless City (The Nameless City #1),Â£38.16,In stock,the-nameless-city-the-nameless-city-1_940/inde...
4,"Princess Jellyfish 2-in-1 Omnibus, Vol. 01 (Pr...",Â£13.61,In stock,princess-jellyfish-2-in-1-omnibus-vol-01-princ...


Scrapping full product details:This function takes a product page URL, downloads it, and extracts:
1-Category
2-Description
3-Rating


In [5]:
def scrape_prod_details(url):
    soup = get_page(url)
    if soup is None:
        return{"Category": None, "Description": None, "Rtaing": None}
        print(soup.select_one("#product_description"))
#Category:
#Breadcrumb links are a secondary navigation system on a website. They provide a clear path back to previous sections by using a series of linked navigation elements
    breadcrumb= soup.select(".breadcrumb li a")  #".breadcrumb li a" is a CSS selector.It finds all <a> links inside the breadcrumb menu
    category = breadcrumb[2].text.strip() if len(breadcrumb) >=3 else None

#Description:
    desc_tag = soup.slect_one("prod_description") # find product description
    if desc_tag:
        desc = desc_tag.find_next("p").text.strip() #returning first paragraph of decription
    else:
        desc = None
        
#Rating:
    rating_map = {"one":1,"two":2,"three":3}
    #all_prod = []
    #for page in range (1,11):
     #   print(f"Scrapping page{page}:")
      #  products = scrap_page(page)
        # all_prod.extend(products))
    rating_tag = soup.select_one(".star_rating")
    rating_class = rating_tag["class"][1] if rating_tag else None
    rating = rating_map.get(rating_class, None)
    return{
        "category": category,
        "description": desc,
        "rating": rating
    }

detail_list = []
for link in df_raw["link"]:
    detail_list.append(scrape_prod_details(link))
df_details = pd.DataFrame(detail_list)
df = pd.concat([df_raw,df_details], axis = 1)
df.head()
    


Failed to recieve a-light-in-the-attic_1000/index.html → Invalid URL 'a-light-in-the-attic_1000/index.html': No scheme supplied. Perhaps you meant https://a-light-in-the-attic_1000/index.html?
Failed to recieve in-her-wake_980/index.html → Invalid URL 'in-her-wake_980/index.html': No scheme supplied. Perhaps you meant https://in-her-wake_980/index.html?
Failed to recieve slow-states-of-collapse-poems_960/index.html → Invalid URL 'slow-states-of-collapse-poems_960/index.html': No scheme supplied. Perhaps you meant https://slow-states-of-collapse-poems_960/index.html?
Failed to recieve the-nameless-city-the-nameless-city-1_940/index.html → Invalid URL 'the-nameless-city-the-nameless-city-1_940/index.html': No scheme supplied. Perhaps you meant https://the-nameless-city-the-nameless-city-1_940/index.html?
Failed to recieve princess-jellyfish-2-in-1-omnibus-vol-01-princess-jellyfish-2-in-1-omnibus-1_920/index.html → Invalid URL 'princess-jellyfish-2-in-1-omnibus-vol-01-princess-jellyfish-2

Unnamed: 0,title,price,stock,link,Category,Description,Rtaing
0,A Light in the Attic,Â£51.77,In stock,a-light-in-the-attic_1000/index.html,,,
1,In Her Wake,Â£12.84,In stock,in-her-wake_980/index.html,,,
2,Slow States of Collapse: Poems,Â£57.31,In stock,slow-states-of-collapse-poems_960/index.html,,,
3,The Nameless City (The Nameless City #1),Â£38.16,In stock,the-nameless-city-the-nameless-city-1_940/inde...,,,
4,"Princess Jellyfish 2-in-1 Omnibus, Vol. 01 (Pr...",Â£13.61,In stock,princess-jellyfish-2-in-1-omnibus-vol-01-princ...,,,


Cheking why we got "None" in Desription, Rating and Category columns for all

In [6]:
print(soup.select_one("#product_description"))


NameError: name 'soup' is not defined

Fixing the function

In [7]:
def scrape_prod_details(url):
    soup = get_page(url)
    if soup is None:
        return {"category": None, "description": None, "rating": None}

    # Category
    breadcrumb = soup.select(".breadcrumb li a")
    category = breadcrumb[2].text.strip() if len(breadcrumb) >= 3 else None

    # Description
    desc_tag = soup.select_one("#product_description")
    desc = desc_tag.find_next("p").text.strip() if desc_tag else None

    # Rating
    ratings_map = {"One":1,"Two":2,"Three":3,"Four":4,"Five":5}
    rating_tag = soup.select_one(".star-rating")
    rating_class = rating_tag["class"][1] if rating_tag and len(rating_tag["class"])>1 else None
    rating = ratings_map.get(rating_class, None)

    return {
        "category": category,
        "description": desc,
        "rating": rating
    }


#Scrape all products with full URLs
detail_list = []

base_url = "https://books.toscrape.com/catalogue/"

for link in df_raw["link"]:
    full_link = base_url + link
    detail_list.append(scrape_prod_details(full_link))

df_details = pd.DataFrame(detail_list)
df = pd.concat([df_raw, df_details], axis=1)
df.head()


Unnamed: 0,title,price,stock,link,category,description,rating
0,A Light in the Attic,Â£51.77,In stock,a-light-in-the-attic_1000/index.html,Poetry,It's hard to imagine a world without A Light i...,3
1,In Her Wake,Â£12.84,In stock,in-her-wake_980/index.html,Thriller,A perfect life â¦ until she discovered it was...,1
2,Slow States of Collapse: Poems,Â£57.31,In stock,slow-states-of-collapse-poems_960/index.html,Poetry,The eagerly anticipated debut from one of Cana...,3
3,The Nameless City (The Nameless City #1),Â£38.16,In stock,the-nameless-city-the-nameless-city-1_940/inde...,Sequential Art,Every nation that invades the City gives it a ...,4
4,"Princess Jellyfish 2-in-1 Omnibus, Vol. 01 (Pr...",Â£13.61,In stock,princess-jellyfish-2-in-1-omnibus-vol-01-princ...,Sequential Art,THE LONG-AWAITED STORY OF FANGIRLS TAKING ON T...,5


Data cleaning

In [12]:
#Price
def clean_price(price):
    if pd.isna(price):
        return np.nan
    number = re.findall(r"[\d\.]+", price)
    return float(number[0]) if number else np.nan
df["Price"] = df["price"].apply(clean_price)

def clean_stock(stock):
    if pd.isna(stock):
        return np.nan
    stock_clean = stock.strip()  # remove spaces/newlines
    num = re.findall(r"\d+", stock_clean)
    return int(num[0]) if num else 1
df["Stock"] = df["stock"].apply(clean_stock)

#Description
def clean_text(t):
    if pd.isna(t):
        return t
    t = re.sub(r"\s+"," ",t)

# "re.sub" is a function that:
#Searches for all matches of pattern in string
#Replaces each match with replacement
#Returns a new string (does not modify in place)
    return t.strip()
df["Description"] = df["description"].apply(clean_text)

#Title
df["Title"] = (df["title"].str.replace(r"[\n\r\t]"," ",regex=True)
              .str.replace(r"\s+"," ", regex = True)
              .str.strip())
    

Validation the fields

In [13]:
df["Price_Validation"] = df["price"].str.contains("£", na = False)
df["Rtaing_Validation"] = df["rating"].between(1,5,inclusive = "both")

#Removing duplicates
df_new = df.drop_duplicates(subset = ["Title","Price"])
df_new.head()

Unnamed: 0,title,price,stock,link,category,description,rating,Price,Description,Title,Price_Validation,Rtaing_Validation,Stock
0,A Light in the Attic,Â£51.77,In stock,a-light-in-the-attic_1000/index.html,Poetry,It's hard to imagine a world without A Light i...,3,51.77,It's hard to imagine a world without A Light i...,A Light in the Attic,True,True,1
1,In Her Wake,Â£12.84,In stock,in-her-wake_980/index.html,Thriller,A perfect life â¦ until she discovered it was...,1,12.84,A perfect life â¦ until she discovered it was...,In Her Wake,True,True,1
2,Slow States of Collapse: Poems,Â£57.31,In stock,slow-states-of-collapse-poems_960/index.html,Poetry,The eagerly anticipated debut from one of Cana...,3,57.31,The eagerly anticipated debut from one of Cana...,Slow States of Collapse: Poems,True,True,1
3,The Nameless City (The Nameless City #1),Â£38.16,In stock,the-nameless-city-the-nameless-city-1_940/inde...,Sequential Art,Every nation that invades the City gives it a ...,4,38.16,Every nation that invades the City gives it a ...,The Nameless City (The Nameless City #1),True,True,1
4,"Princess Jellyfish 2-in-1 Omnibus, Vol. 01 (Pr...",Â£13.61,In stock,princess-jellyfish-2-in-1-omnibus-vol-01-princ...,Sequential Art,THE LONG-AWAITED STORY OF FANGIRLS TAKING ON T...,5,13.61,THE LONG-AWAITED STORY OF FANGIRLS TAKING ON T...,"Princess Jellyfish 2-in-1 Omnibus, Vol. 01 (Pr...",True,True,1


Eliminating all extra Columns

In [16]:
DF = df_new[[
    "Title",
    "category",
    "Price",
    "Stock",
    "rating",
    "Description",
    "link"
]]

DF.head()


Unnamed: 0,Title,category,Price,Stock,rating,Description,link
0,A Light in the Attic,Poetry,51.77,1,3,It's hard to imagine a world without A Light i...,a-light-in-the-attic_1000/index.html
1,In Her Wake,Thriller,12.84,1,1,A perfect life â¦ until she discovered it was...,in-her-wake_980/index.html
2,Slow States of Collapse: Poems,Poetry,57.31,1,3,The eagerly anticipated debut from one of Cana...,slow-states-of-collapse-poems_960/index.html
3,The Nameless City (The Nameless City #1),Sequential Art,38.16,1,4,Every nation that invades the City gives it a ...,the-nameless-city-the-nameless-city-1_940/inde...
4,"Princess Jellyfish 2-in-1 Omnibus, Vol. 01 (Pr...",Sequential Art,13.61,1,5,THE LONG-AWAITED STORY OF FANGIRLS TAKING ON T...,princess-jellyfish-2-in-1-omnibus-vol-01-princ...


Exporting CSV

In [18]:
DF.to_csv("books_dataset.csv", index=False)
print("clean_books_dataset.csv saved!")


clean_books_dataset.csv saved!
