 Title: Web Scraping – CodeAlpha Internship


In this project, I used Python with requests and BeautifulSoup to scrape book data (title, price, availability, and star rating) from "http://books.toscrape.com/catalogue/page-1.html". The data was cleaned and stored in CSV format for analysis.

In [19]:
%pip install requests beautifulsoup4 pandas




In [20]:
import requests
from bs4 import BeautifulSoup
import pandas as pd


In [21]:
url = "http://books.toscrape.com/catalogue/page-1.html"
response = requests.get(url)

# Check if request was successful
if response.status_code == 200:
    print("Page loaded successfully")
else:
    print("Failed to retrieve the page")


Page loaded successfully


In [23]:
soup = BeautifulSoup(response.text, 'html.parser')


Here we scrape only page 1.

In [24]:
books = soup.find_all('article', class_='product_pod')
print(f"Found {len(books)} books on the page")


Found 20 books on the page


In [25]:
book_list = []

for book in books:
    title = book.h3.a['title']
    price = book.find('p', class_='price_color').text
    availability = book.find('p', class_='instock availability').text.strip()

    # Rating is in the class of <p> tag, like "star-rating Three"
    rating_class = book.find('p', class_='star-rating')['class']
    rating = rating_class[1]  # second class is rating (One, Two, Three, etc.)

    book_list.append({
        'Title': title,
        'Price': price,
        'Availability': availability,
        'Rating': rating
    })

# Check first 3 books
book_list[:3]


[{'Title': 'A Light in the Attic',
  'Price': 'Â£51.77',
  'Availability': 'In stock',
  'Rating': 'Three'},
 {'Title': 'Tipping the Velvet',
  'Price': 'Â£53.74',
  'Availability': 'In stock',
  'Rating': 'One'},
 {'Title': 'Soumission',
  'Price': 'Â£50.10',
  'Availability': 'In stock',
  'Rating': 'One'}]
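
The stray "Â" in front of the pound sign is most likely an encoding mismatch: the page bytes are UTF-8, but `response.text` appears to have been decoded as ISO-8859-1, which is what requests falls back to for text responses that declare no charset. The minimal sketch below reproduces the effect with plain strings; the variable names are illustrative and not part of the notebook.

```python
# Reproduce the mojibake seen in the scraped prices, using only the standard library.
pound_price = '£51.77'

# '£' is encoded as the two bytes 0xC2 0xA3 in UTF-8.
utf8_bytes = pound_price.encode('utf-8')

# Decoding those bytes with the wrong codec turns 0xC2 into 'Â' and 0xA3 into '£'.
wrong = utf8_bytes.decode('iso-8859-1')

# Decoding with the codec that matches the bytes recovers the original text.
right = utf8_bytes.decode('utf-8')

print(repr(wrong))
print(repr(right))
```

In the notebook itself, two common ways to avoid the problem are to set `response.encoding = 'utf-8'` before reading `response.text`, or to pass `response.content` (the raw bytes) to `BeautifulSoup` so that it detects the encoding itself. Both assume the page really is UTF-8.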

In [26]:
df = pd.DataFrame(book_list)
df.to_csv('books_page_1.csv', index=False)
print("Data saved to books_page_1.csv")


Data saved to books_page_1.csv
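
For analysis, the price and rating are easier to work with as numbers than as strings. The following is a minimal sketch of one way to convert them, assuming prices always look like a currency symbol followed by a decimal number and ratings are the words One to Five. The names `RATING_WORDS`, `clean_price`, and `clean_book` are hypothetical helpers, not part of the notebook.

```python
# Map the rating words used in the page's CSS classes to integers.
RATING_WORDS = {'One': 1, 'Two': 2, 'Three': 3, 'Four': 4, 'Five': 5}


def clean_price(raw_price):
    """Convert a scraped price such as 'Â£51.77' or '£51.77' to a float."""
    # Keep only digits and the decimal point; this drops the currency
    # symbol and any stray encoding character in front of it.
    digits = ''.join(ch for ch in raw_price if ch.isdigit() or ch == '.')
    return float(digits)


def clean_book(record):
    """Return a copy of one scraped record with numeric Price and Rating."""
    return {
        'Title': record['Title'],
        'Price': clean_price(record['Price']),
        'Availability': record['Availability'],
        # Unknown rating words become None instead of raising an error.
        'Rating': RATING_WORDS.get(record['Rating']),
    }


sample = {
    'Title': 'A Light in the Attic',
    'Price': 'Â£51.77',
    'Availability': 'In stock',
    'Rating': 'Three',
}
print(clean_book(sample))
```

Applied to the notebook's data, this could be used as `pd.DataFrame([clean_book(b) for b in book_list])` before calling `to_csv`.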


In [27]:
from google.colab import files
files.download('books_page_1.csv')



To scrape multiple pages (up to 50), use the following code.


In [28]:
book_list = []
pages_scraped = 0  # count only the pages that were actually retrieved

for page_num in range(1, 51):
    url = f"http://books.toscrape.com/catalogue/page-{page_num}.html"
    response = requests.get(url)
    if response.status_code != 200:
        print(f"Page {page_num} not found, stopping.")
        break

    soup = BeautifulSoup(response.text, 'html.parser')
    books = soup.find_all('article', class_='product_pod')

    for book in books:
        title = book.h3.a['title']
        price = book.find('p', class_='price_color').text
        availability = book.find('p', class_='instock availability').text.strip()
        # Rating is the second class of the <p> tag, like "star-rating Three"
        rating_class = book.find('p', class_='star-rating')['class']
        rating = rating_class[1]

        book_list.append({
            'Title': title,
            'Price': price,
            'Availability': availability,
            'Rating': rating
        })

    pages_scraped += 1

df = pd.DataFrame(book_list)
df.to_csv('all_books.csv', index=False)
print(f"Scraped {len(book_list)} books across {pages_scraped} pages.")


Scraped 1000 books across 50 pages.
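
When requesting many pages in a row, it can help to separate the page loop from the code that fetches each page and to pause briefly between requests. The sketch below shows one possible shape for that loop. `crawl_pages`, `fake_fetch`, and `delay_seconds` are hypothetical names introduced here, and `fake_fetch` only stands in for a real function that would call `requests.get` and return the page HTML, or `None` on failure.

```python
import time

BASE_URL = "http://books.toscrape.com/catalogue/page-{}.html"


def crawl_pages(fetch, max_pages=50, delay_seconds=1.0):
    """Collect HTML for consecutive pages until fetch() returns None.

    fetch is a callable that takes a URL and returns the page HTML as a
    string, or None when the page could not be retrieved.
    """
    pages = []
    for page_num in range(1, max_pages + 1):
        html = fetch(BASE_URL.format(page_num))
        if html is None:
            # Stop at the first page that cannot be retrieved.
            break
        pages.append(html)
        time.sleep(delay_seconds)  # short pause between requests
    return pages


def fake_fetch(url):
    """Stand-in fetcher for demonstration: only pages 1 and 2 exist."""
    if url.endswith(("page-1.html", "page-2.html")):
        return "<html>ok</html>"
    return None


pages = crawl_pages(fake_fetch, delay_seconds=0)
print(f"Retrieved {len(pages)} pages")
```

Because the function returns only the pages that were actually retrieved, `len(pages)` gives an accurate page count even when the loop stops early.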


In [30]:
from google.colab import files
files.download('all_books.csv')

