This notebook demonstrates how to scrape book cover images and titles from [Books to Scrape](http://books.toscrape.com) and save the data into a CSV. The CSV will have two columns: 
→ URL of the book cover image  
→ the metadata (title of the book)

We focus only on **5-star rated books**.

In [None]:
import requests
from bs4 import BeautifulSoup
import csv
import re

In [None]:
base_url = "http://books.toscrape.com/catalogue/page-{}.html"

In [None]:
books = []
page = 1

I use a loop to visit every catalogue page, since a single request only covers page 1. To show the current scraping status, the loop prints a line after each page. It keeps going until there are no more pages (the HTTP status code is not 200, or no books are found). For each book on a page:

1. Check the star rating and keep only 5-star books.  
2. Extract the title and clean it with a regex that removes every character outside the printable ASCII range (accented letters are dropped too).  
3. Extract the relative image URL and convert it to an absolute URL.  
4. Store the image URL and the cleaned title in the list.
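
As a side note on step 3, here is a minimal sketch of resolving a relative image path with `urllib.parse.urljoin` from the standard library, as an alternative to string replacement. The helper name `absolute_image_url` and the example path are hypothetical and only serve as an illustration; they are not taken from the site.

```python
from urllib.parse import urljoin


def absolute_image_url(page_url: str, img_src: str) -> str:
    """Resolve an <img src> value against the URL of the page it was found on."""
    return urljoin(page_url, img_src)


# Hypothetical values, for illustration only
page_url = "http://books.toscrape.com/catalogue/page-1.html"
img_src = "../media/cache/ab/cd/example.jpg"

print(absolute_image_url(page_url, img_src))
# http://books.toscrape.com/media/cache/ab/cd/example.jpg
```

`urljoin` handles any number of `../` segments and leaves already absolute URLs unchanged, so it does not depend on how deep the page sits in the site.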

In [None]:
while True:
    url = base_url.format(page)
    response = requests.get(url, timeout=10)  # timeout so a stalled request cannot hang the loop
    
    if response.status_code != 200:
        break  # stop when the page does not exist (i.e. past the last page)
    
    response.encoding = 'utf-8'
    soup = BeautifulSoup(response.text, "html.parser")
    
    book_elements = soup.select(".product_pod")
    if not book_elements:
        break  # stop if there are no more books to scrape
    
    for book in book_elements:
        rating_class = book.select_one(".star-rating")['class']
        if "Five" in rating_class:
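            title = book.h3.a['title']
            # Round trip through UTF-8; only drops code points that cannot be encoded (lone surrogates)
            title = title.encode('utf-8', errors='ignore').decode('utf-8')
            # Remove every character outside printable ASCII (0x20-0x7E)
            title = re.sub(r'[^\x20-\x7E]+', '', title).strip()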
            
            img_relative_url = book.select_one("img")['src']
            img_url = "http://books.toscrape.com/" + img_relative_url.replace("../", "")
            
            books.append({
                "content": img_url,
                "metadata": title
            })
    
    print(f"Scraped page {page}...")  # show the scraping status
    page += 1


Once all 5-star books have been collected, we save them into a CSV file. 

In [None]:
csv_filename = "books_5stars_covers.csv"

with open(csv_filename, "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["content", "metadata"])
    writer.writeheader()
    writer.writerows(books)
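
To check that the file round-trips cleanly, the rows can be written and read back with `csv.DictReader`. The sketch below assumes the same two column names as above, but uses a made-up sample row and a temporary file instead of the real scrape results. The helper name `write_and_read_back` is hypothetical.

```python
import csv
import os
import tempfile

# Made-up sample row; the title contains a comma to exercise CSV quoting
sample_rows = [
    {
        "content": "http://books.toscrape.com/media/cache/ab/cd/example.jpg",
        "metadata": "Example Title, With a Comma",
    },
]


def write_and_read_back(rows, path):
    """Write rows with DictWriter, then load them again with DictReader."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["content", "metadata"])
        writer.writeheader()
        writer.writerows(rows)
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))


with tempfile.TemporaryDirectory() as tmp:
    loaded = write_and_read_back(sample_rows, os.path.join(tmp, "check.csv"))

print(len(loaded), loaded[0]["metadata"])
# 1 Example Title, With a Comma
```

The comma in the title survives because the `csv` module quotes the field on write and unquotes it on read.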

# Notes
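→ The `re.sub(r'[^\x20-\x7E]+', '', title)` step removes every character outside the printable ASCII range (`\x20` to `\x7E`). This gets rid of stray symbols, but it also drops legitimate non-ASCII characters such as accented letters.  
→ The response encoding is set to UTF-8 before parsing, so that multi-byte characters are not decoded with the wrong charset and turned into sequences like `â`.  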
→ The script automatically loops through all pages, so it scales to the entire website.  
→ The CSV is ready for corpus use, with image URLs in the `content` column and the book title as metadata.
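
The effect of the cleaning regex on accented titles can be checked with a short sketch. It compares the regex used in this notebook with one possible alternative that transliterates to ASCII via `unicodedata.normalize`. The function names and the sample title are hypothetical, and the alternative is a suggestion, not part of the scraper above.

```python
import re
import unicodedata


def strip_non_ascii(title: str) -> str:
    """Same regex as in the scraper: delete everything outside 0x20-0x7E."""
    return re.sub(r'[^\x20-\x7E]+', '', title).strip()


def transliterate(title: str) -> str:
    """Alternative: split accented letters into base letter + accent, then drop the accent."""
    decomposed = unicodedata.normalize("NFKD", title)
    ascii_only = decomposed.encode("ascii", errors="ignore").decode("ascii")
    return re.sub(r"\s+", " ", ascii_only).strip()


sample = "Caf\u00e9 Ol\u00e9"  # made-up title: "Café Olé"

print(strip_non_ascii(sample))  # Caf Ol
print(transliterate(sample))    # Cafe Ole
```

Transliteration keeps the base letters readable, but characters without an ASCII decomposition (for example typographic quotes) are still dropped.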