# Amazon Bestseller Analysis: Data Collection

### Notebook 01: Python Data Collection

This is the first notebook in the Amazon Bestseller Analysis project where I will focus on the initial data collection phase and gather bestseller data from Amazon's website in an ethical and structured manner.

In this notebook I will:
1. Set up the scraping infrastructure
2. Collect bestseller data across multiple categories
3. Store the data in a structured format for further analysis

I will begin by importing the required libraries and setting up the environment.

In [132]:
# Import necessary libraries
import pandas as pd
from bs4 import BeautifulSoup
import requests
import time
from tqdm import tqdm
import yaml
import logging
from datetime import datetime
import re

# Load configuration settings
with open('../config/config.yaml', 'r') as file:
    config = yaml.safe_load(file)

---
**Note**: This notebook is part of the Project Gutenberg Analysis portfolio project.

I've loaded the configuration file which contains important settings like:
- Request delays to ensure I don't overwhelm the server
- Headers to identify the scraper
- Category URLs I'll be analyzing

As a professional in children's publishing, I've chosen to analyze a variety of categories that are focussed on books I am currently working on, which will enable market insights across a range of children's publishing target audience. 

Amazon regularly publishes Bestseller data on select categories that have stable URLs that I can use as a basis for webscraping. For example, this is the page for Children's Classics.
![Amazon Bestseller page](../visualisations/Screenshots/Amazon_bestseller.png)
```
# Amazon category URLs for bestseller lists
amazon:
categories:
  - name: "Classic Books"
    url: "https://www.amazon.co.uk/gp/bestsellers/books/291745/ref=pd_zg_hrsr_books"
    type: "fiction"
    subtype: "classics"
    
  - name: "Sport and Outdoors"
    url: "https://www.amazon.co.uk/Best-Sellers-Books-Childrens-Books-on-Sports-the-Outdoors/zgbs/books/291772/ref=zg_bs_nav_books_2_69"
    type: "non_fiction"
    subtype: "sports"
    
  - name: "Teen and Young Adult"
    url: "https://www.amazon.co.uk/Best-Sellers-Teen-Young-Adult/zgbs/books/52"
    type: "fiction"
    subtype: "young_adult"
    default_age_range: "14-18"
    
  - name: "Comics and Graphic Novels"
    url: "https://www.amazon.co.uk/Best-Sellers-Books-Comics-Graphic-Novels-for-Children/zgbs/books/291767/ref=zg_bs_nav_books_2_69"
    type: "fiction"
    subtype: "graphic_novels"
    
  - name: "Science, Nature & How It Works"
    url: "https://www.amazon.co.uk/Best-Sellers-Books-Childrens-Books-on-Science-Nature-How-It-Works/zgbs/books/492622/ref=zg_bs_nav_books_2_69"
    type: "non_fiction"
    subtype: "science"
    
  - name: "TV, Movie & Video Game Adaptations"
    url: "https://www.amazon.co.uk/Best-Sellers-Books-TV-Movie-Video-Game-Adaptations-for-Children/zgbs/books/15512249031/ref=zg_bs_nav_books_3_15512039031"
    type: "fiction"
    subtype: "adaptations"
    
  - name: "Educational Books"
    url: "https://www.amazon.co.uk/Best-Sellers-Books-Education-Reference-for-Children/zgbs/books/291673/ref=zg_bs_nav_books_2_69"
    type: "education"
    subtype: "general"
    
  - name: "Reference Books"
    url: "https://www.amazon.co.uk/Best-Sellers-Books-Childrens-Reference-Books/zgbs/books/291707/ref=zg_bs_nav_books_3_291673"
    type: "education"
    subtype: "reference"
```

I included both Educational Fiction and Educational Reference categories as a personal favourite genre and also Graphic Novels as an acknowledgement of one of the fastest-growing segments in children's publishing.

My personal experience in children's publishing as both production controller and designer has shown me how important it is to understand the connection between format choices (hardback, paperback, digital) and different age groups, as format preferences can significantly impact both purchasing decisions and reading engagement. At a previous role in Egmont, I got to know Alison David, a consumer research expert and she shared research about the importance of parent-child reading in the early years and how book format has an effect on that experience.
 
Now, I will set up the logging to track the scraping process. The notebook you see here shows the final work of all the iterations of code and logging helped me work through the errors.

In [133]:
# Setup logging configuration
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s',
    handlers=[
        logging.FileHandler('../logs/scraping.log'),
        logging.StreamHandler()
    ]
)
logger = logging.getLogger(__name__)

logger.info("Starting bestseller data collection process")

def print_data_status(df):
    """
    Print detailed status of data completeness
    """
    total = len(df)
    print("\nData Completeness Report:")
    print("-" * 50)
    for column in df.columns:
        missing = df[column].isna().sum()
        complete = total - missing
        percentage = (complete/total) * 100
        print(f"{column:20s}: {complete:4d}/{total:4d} ({percentage:6.2f}%)")

2025-01-19 11:19:52,860 - INFO - Starting bestseller data collection process


This will help me track the scraping process and debug any issues that might arise. Now I will define the core scraping functions. The following section pulls the list of the bestsellers and their rank, but not all the details that will make the final analysis.

In [134]:
def fetch_bestseller_page(url, headers, max_retries=3, backoff_factor=2):
    """
    Fetches a bestseller page with exponential backoff retry logic.
    """
    for attempt in range(max_retries):
        try:
            logger.info(f"Fetching data from {url} (Attempt {attempt + 1}/{max_retries})")
            response = requests.get(url, headers=headers)
            
            if response.status_code == 200:
                logger.info("Successfully retrieved page")
                return response.text
            elif response.status_code == 429:
                wait_time = (backoff_factor ** attempt) * 3  # Base delay of 3 seconds
                logger.warning(f"Rate limited. Waiting {wait_time} seconds before retry")
                time.sleep(wait_time)
                continue
            else:
                logger.warning(f"Received status code {response.status_code}")
                return None
                
        except Exception as e:
            logger.error(f"Error fetching page: {e}")
            if attempt < max_retries - 1:
                wait_time = (backoff_factor ** attempt) * 3
                time.sleep(wait_time)
            else:
                return None
    
    return None


Previous runs of the script showed me where the scripts were going wrong and that I had to be more precise about where in the html structure I was pulling from.

In [135]:
def analyze_html_structure(html):
    """
    Analyzes the HTML structure to help debug parsing issues.
    """
    soup = BeautifulSoup(html, 'html.parser')
    analysis = {
        "total_divs": len(soup.find_all('div')),
        "div_classes": {},
        "potential_containers": [],
        "links": len(soup.find_all('a')),
        "text_length": len(soup.get_text()),
        "title": soup.title.string if soup.title else None
    }
    
    # Analyze div classes
    for div in soup.find_all('div', class_=True):
        for class_name in div['class']:
            analysis["div_classes"][class_name] = analysis["div_classes"].get(class_name, 0) + 1
            
            # Look for promising container classes
            if any(term in class_name.lower() for term in ['item', 'product', 'book', 'result']):
                if class_name not in analysis["potential_containers"]:
                    analysis["potential_containers"].append(class_name)
    
    logger.info("\nHTML Structure Analysis:")
    logger.info(f"Page Title: {analysis['title']}")
    logger.info(f"Total Divs: {analysis['total_divs']}")
    logger.info(f"Total Links: {analysis['links']}")
    logger.info("\nMost Common Div Classes:")
    for class_name, count in sorted(analysis["div_classes"].items(), 
                                  key=lambda x: x[1], 
                                  reverse=True)[:10]:
        logger.info(f"{class_name}: {count}")
    logger.info("\nPotential Container Classes:")
    for container in analysis["potential_containers"]:
        logger.info(f"- {container}")
    
    return analysis


Parsing the data into a usable data form...

In [136]:
def parse_bestseller_data(html, base_url="https://www.amazon.co.uk"):
    """
    Enhanced parser for Amazon UK bestseller data
    """
    logger.info("Parsing bestseller data")
    soup = BeautifulSoup(html, 'html.parser')
    books = []
    
    # Find all book items in the grid
    items = soup.select('div[class*="_cDEzb_grid-cell_1uMOS"]')
    
    if not items:
        logger.warning("No items found with grid cell selector")
        return []
    
    logger.info(f"Found {len(items)} potential book items")
    
    for i, item in enumerate(items, 1):
        try:
            book = {}
            
            # Extract rank
            rank_elem = item.select_one('.zg-bdg-text')
            if rank_elem:
                rank_text = rank_elem.text.strip('#')
                try:
                    book['rank'] = int(rank_text)
                except ValueError:
                    logger.warning(f"Could not parse rank from {rank_text}")
            
            # Extract title
            title_elem = item.select_one('div[class*="_cDEzb_p13n-sc-css-line-clamp-1"]')
            if title_elem:
                book['title'] = title_elem.text.strip()
            
            # Extract author
            author_elem = item.select_one('a.a-size-small.a-link-child div')
            if author_elem:
                book['author'] = author_elem.text.strip()
            
            # Extract rating and review count
            rating_elem = item.select_one('.a-icon-star-small span.a-icon-alt')
            if rating_elem:
                rating_text = rating_elem.text
                try:
                    book['rating'] = float(rating_text.split(' ')[0])
                except (ValueError, IndexError):
                    logger.warning(f"Could not parse rating from {rating_text}")
            
            review_count_elem = item.select_one('.a-icon-row span.a-size-small')
            if review_count_elem:
                try:
                    book['review_count'] = int(review_count_elem.text.replace(',', ''))
                except ValueError:
                    logger.warning(f"Could not parse review count from {review_count_elem.text}")
            
            # Extract format
            format_elem = item.select_one('span.a-size-small.a-color-secondary')
            if format_elem:
                book['format'] = format_elem.text.strip()
            
            # Extract price
            price_elem = item.select_one('span[class*="_cDEzb_p13n-sc-price"]')
            if price_elem:
                price_text = price_elem.text.strip('£')
                try:
                    book['price'] = float(price_text)
                except ValueError:
                    logger.warning(f"Could not parse price from {price_text}")
            
            # Extract URLs
            link_elem = item.select_one('a[href*="/dp/"]')
            if link_elem:
                href = link_elem.get('href', '')
                if href:
                    # Clean up URL
                    url_parts = href.split('/ref=')[0]  # Remove tracking parameters
                    book['product_url'] = f"{base_url}{url_parts}" if not href.startswith('http') else href
                    
                    # Extract ASIN
                    asin_match = re.search(r'/dp/([A-Z0-9]{10})/?', href)
                    if asin_match:
                        book['asin'] = asin_match.group(1)
                        # Construct high-quality image URL
                        book['image_url'] = f"https://images-eu.ssl-images-amazon.com/images/P/{book['asin']}.01.L.jpg"
            
            # Extract image URL if not already set
            if 'image_url' not in book:
                img_elem = item.select_one('img[src*="images-amazon"]')
                if img_elem:
                    book['image_url'] = img_elem.get('src', '')
            
            # Add timestamp
            book['timestamp'] = datetime.now().isoformat()
            
            # Validate and add to results
            if validate_book_entry(book):
                books.append(book)
                logger.debug(f"Successfully processed book {i}: {book.get('title', 'Unknown Title')}")
            else:
                logger.warning(f"Skipping item {i} - failed validation")
            
        except Exception as e:
            logger.error(f"Error processing item {i}: {str(e)}")
            continue
    
    # Log success statistics
    if books:
        logger.info(f"Successfully parsed {len(books)} books")
        fields = ['rank', 'title', 'author', 'price', 'format', 'rating', 'review_count']
        for field in fields:
            count = sum(1 for book in books if field in book and book[field] is not None)
            logger.info(f"{field} found in {count}/{len(books)} books ({count/len(books)*100:.1f}%)")
    else:
        logger.warning("No books were successfully parsed")
    
    return books



The core functions are defined, and these helper functions will extract specific pieces of information from each book listing.

In [137]:
def extract_rank(item):
    """Extract bestseller rank"""
    rank_data = {
        'overall_rank': None,
        'category_ranks': []
    }
    
    # Main rank is now in zg-bdg-text
    rank_elem = item.select_one('.zg-bdg-text')
    if rank_elem:
        try:
            rank_text = rank_elem.text.strip('#')
            rank_num = int(rank_text)
            if 0 < rank_num <= 100:
                rank_data['overall_rank'] = rank_num
        except (ValueError, AttributeError):
            pass
    
    return rank_data

def extract_title_and_series(item):
    """Extract title and series information"""
    result = {
        'title': None,
        'series_name': None,
        'series_number': None
    }
    
    # Title is now in _cDEzb_p13n-sc-css-line-clamp-1_1Fn1y
    title_elem = item.select_one('div[class*="_cDEzb_p13n-sc-css-line-clamp-1"]')
    if title_elem:
        raw_title = title_elem.text.strip()
        if raw_title:
            # Series patterns
            series_patterns = [
                r'(.*?)\s*\((.*?)(?:Book|Volume|#)\s*(\d+)\)',
                r'(.*?):\s*(?:Book|Volume)\s*(\d+)\s*(?:of|in)\s*(.*?)\s*$'
            ]
            
            for pattern in series_patterns:
                match = re.search(pattern, raw_title, re.IGNORECASE)
                if match:
                    groups = match.groups()
                    result['title'] = groups[0].strip()
                    result['series_name'] = groups[1].strip() if len(groups) > 1 else None
                    result['series_number'] = int(groups[2]) if len(groups) > 2 else None
                    break
            else:
                result['title'] = raw_title
    
    return result

def extract_product_urls(item, base_url="https://www.amazon.co.uk"):
    """Extract product and image URLs"""
    urls = {
        'product_url': None,
        'image_url': None,
        'asin': None
    }
    
    # Product URL is in the main link
    link_elem = item.select_one('a[href*="/dp/"]')
    if link_elem:
        href = link_elem.get('href', '')
        if href:
            # Clean up URL
            url_parts = href.split('/ref=')[0]
            urls['product_url'] = f"{base_url}{url_parts}" if not href.startswith('http') else href
            
            # Extract ASIN
            asin_match = re.search(r'/dp/([A-Z0-9]{10})/?', href)
            if asin_match:
                urls['asin'] = asin_match.group(1)
                urls['image_url'] = f"https://images-eu.ssl-images-amazon.com/images/P/{urls['asin']}.01.L.jpg"
    
    # Direct image URL as backup
    if not urls['image_url']:
        img_elem = item.select_one('img[src*="images-amazon"]')
        if img_elem:
            urls['image_url'] = img_elem.get('src', '')
    
    return urls

def extract_format_details(item):
    """Extract format information"""
    format_info = {
        'format': None,
        'binding': None
    }
    
    # Format is in a-size-small a-color-secondary
    format_elem = item.select_one('span.a-size-small.a-color-secondary')
    if format_elem:
        format_text = format_elem.text.strip().lower()
        format_mapping = {
            'hardcover': {'format': 'Hardcover', 'binding': 'Hardbound'},
            'paperback': {'format': 'Paperback', 'binding': 'Perfect Bound'},
            'board book': {'format': 'Board Book', 'binding': 'Board'},
            'kindle': {'format': 'Digital', 'binding': None},
            'audio cd': {'format': 'Audio', 'binding': 'CD'}
        }
        
        for key, value in format_mapping.items():
            if key in format_text:
                format_info.update(value)
                break
    
    return format_info

def extract_rating_details(item):
    """Extract rating and review count"""
    rating_data = {
        'rating': None,
        'review_count': None
    }
    
    # Rating is in a-icon-alt
    rating_elem = item.select_one('.a-icon-star-small span.a-icon-alt')
    if rating_elem:
        rating_text = rating_elem.text.strip()
        try:
            rating = float(rating_text.split(' ')[0])
            if 0 <= rating <= 5:
                rating_data['rating'] = rating
        except (ValueError, IndexError):
            pass
    
    # Review count is in a-size-small
    review_elem = item.select_one('.a-icon-row span.a-size-small')
    if review_elem:
        try:
            count = int(review_elem.text.replace(',', ''))
            rating_data['review_count'] = count
        except ValueError:
            pass
    
    return rating_data

def extract_price(item):
    """Extract price information"""
    price_elem = item.select_one('span[class*="_cDEzb_p13n-sc-price"]')
    if price_elem:
        try:
            price_text = price_elem.text.strip('£')
            return float(price_text)
        except ValueError:
            return None
    return None

def extract_author(item):
    """Extract author information"""
    author_elem = item.select_one('a.a-size-small.a-link-child div')
    if author_elem:
        return author_elem.text.strip()
    return None

In [138]:
def validate_book_entry(book):
    """
    Validates book entry with detailed logging
    """
    logger.debug(f"Validating book entry: {book}")
    
    # Title is required
    if 'title' not in book or not book['title']:
        logger.debug("Validation failed: Missing title")
        return False
    
    # Need at least two additional pieces of information
    additional_fields = ['price', 'format', 'author', 'rank', 'rating']
    valid_fields = sum(1 for field in additional_fields if field in book and book[field] is not None)
    
    if valid_fields < 2:
        logger.debug(f"Validation failed: Only found {valid_fields} additional fields")
        return False
    
    # Price validation if present
    if 'price' in book and book['price'] is not None:
        try:
            price = float(book['price'])
            if price < 0:
                logger.debug(f"Validation failed: Invalid price {price}")
                return False
        except (ValueError, TypeError):
            logger.debug(f"Validation failed: Price conversion error for {book['price']}")
            return False
    
    logger.debug("Book validation passed")
    return True

Now that the parsing functions are set up, I will run the main scraping loop to collect data from all categories.

In [139]:
# Main scraping loop
bestseller_data = []

for category in tqdm(config['categories']):
    logger.info(f"Processing category: {category['name']}")
    
    html = fetch_bestseller_page(
        category['url'], 
        config['scraping']['headers'],
        max_retries=config['scraping']['max_retries'],
        backoff_factor=config['scraping']['backoff_factor']
    )
    
    if html:
        books = parse_bestseller_data(html)
        for book in books:
            # Add category metadata
            book.update({
                'category': category['name'],
                'category_type': category.get('type'),
                'category_subtype': category.get('subtype'),
                'target_age_range': category.get('age_range')
            })
            bestseller_data.append(book)


  0%|          | 0/8 [00:00<?, ?it/s]2025-01-19 11:19:53,175 - INFO - Processing category: Classic Books
2025-01-19 11:19:53,180 - INFO - Fetching data from https://www.amazon.co.uk/gp/bestsellers/books/291745/ref=pd_zg_hrsr_books (Attempt 1/3)


2025-01-19 11:19:54,085 - INFO - Successfully retrieved page
2025-01-19 11:19:54,116 - INFO - Parsing bestseller data
2025-01-19 11:19:54,452 - INFO - Found 30 potential book items
2025-01-19 11:19:54,723 - INFO - Successfully parsed 30 books
2025-01-19 11:19:54,726 - INFO - rank found in 30/30 books (100.0%)
2025-01-19 11:19:54,729 - INFO - title found in 30/30 books (100.0%)
2025-01-19 11:19:54,735 - INFO - author found in 29/30 books (96.7%)
2025-01-19 11:19:54,749 - INFO - price found in 30/30 books (100.0%)
2025-01-19 11:19:54,763 - INFO - format found in 30/30 books (100.0%)
2025-01-19 11:19:54,793 - INFO - rating found in 30/30 books (100.0%)
2025-01-19 11:19:54,824 - INFO - review_count found in 30/30 books (100.0%)
 12%|█▎        | 1/8 [00:01<00:11,  1.65s/it]2025-01-19 11:19:54,831 - INFO - Processing category: Sport and Outdoors
2025-01-19 11:19:54,841 - INFO - Fetching data from https://www.amazon.co.uk/Best-Sellers-Books-Childrens-Books-on-Sports-the-Outdoors/zgbs/books/29

In [140]:
# Column list
columns = [
    'rank', 
    'title', 
    'author',
    'price',
    'format',
    'rating',
    'review_count',
    'isbn10',
    'isbn13',
    'page_count',
    'category',
    'target_age_range',
    'timestamp',
    'asin',
    'product_url',
    'image_url',
    'category_ranks'
]

df = pd.DataFrame(bestseller_data)

# Ensure all columns exist
for col in columns:
    if col not in df.columns:
        df[col] = None

# Reorder columns
df = df[columns]

# Cross-reference data
def fill_missing_data(row):
    """
    Attempt to fill missing data by cross-referencing ASIN
    """
    if pd.isna(row['isbn10']) or pd.isna(row['isbn13']):
        if not pd.isna(row['asin']):
            # Could use ASIN to fetch additional details from product page
            pass
    
    if pd.isna(row['target_age_range']) and not pd.isna(row['category']):
        # Fill target age range based on category mapping
        category_age_mapping = {
            "Children's Books (Ages 6-8)": "6-8",
            "Middle Grade (Ages 9-12)": "9-12",
            "Young Teen (Ages 12-14)": "12-14",
            "Young Adult (Ages 14-18)": "14-18"
        }
        row['target_age_range'] = category_age_mapping.get(row['category'])
    
    return row

# Apply cross-referencing
df = df.apply(fill_missing_data, axis=1)

As I previously said, the main bestseller page doesn't give all the details needed. So, in the main scraping, I got the URL of the individual product page. Inspecting the details of the HTML like this, I could write code that targetted the right div and tags to extract the data.

![Details of webpage structure](../visualisations/Screenshots/Amazon_bestseller_details.png)

In [141]:
def parse_age_range(text):
    """
    Parse various age range formats
    """
    patterns = [
        (r'(\d+)\s*-\s*(\d+)\s*years', lambda m: f"{m.group(1)}-{m.group(2)}"),
        (r'(\d+)\+\s*years', lambda m: f"{m.group(1)}+"),
        (r'from\s*(\d+)\s*years', lambda m: f"{m.group(1)}+"),
        (r'ages\s*(\d+)(?:\s*-\s*(\d+))?', lambda m: f"{m.group(1)}-{m.group(2)}" if m.group(2) else f"{m.group(1)}+")
    ]
    
    for pattern, formatter in patterns:
        match = re.search(pattern, text, re.IGNORECASE)
        if match:
            return formatter(match)
    return None


def fetch_product_details(url):
    """
    Combined function to fetch ISBN, age range, and other missing details
    """
    try:
        logger.info(f"Fetching details from: {url}")
        response = requests.get(url, headers=config['scraping']['headers'])
        time.sleep(config['scraping']['delay'])
        
        if response.status_code == 200:
            soup = BeautifulSoup(response.text, 'html.parser')
            details = {}
            
            # Debug HTML structure
            debug_html = soup.select('#detailBullets_feature_div')
            logger.debug(f"Found detail bullets section: {bool(debug_html)}")
            
            # Process detail bullets
            bullets = soup.select('#detailBullets_feature_div .a-list-item')
            logger.debug(f"Found {len(bullets)} detail bullet points")
            
            for bullet in bullets:
                text = bullet.text.strip()
                logger.debug(f"Processing bullet: {text[:50]}...")  # First 50 chars for readability
                
                # ISBN-13
                if 'ISBN-13' in text:
                    spans = bullet.select('span')
                    if len(spans) >= 2:
                        isbn13 = ''.join(filter(str.isdigit, spans[-1].text))
                        if len(isbn13) == 13:
                            details['isbn13'] = isbn13
                            logger.info(f"Found ISBN-13: {isbn13}")
                
                # ISBN-10
                elif 'ISBN-10' in text:
                    spans = bullet.select('span')
                    if len(spans) >= 2:
                        isbn10 = ''.join(c for c in spans[-1].text if c.isdigit() or c == 'X')
                        if len(isbn10) == 10:
                            details['isbn10'] = isbn10
                            logger.info(f"Found ISBN-10: {isbn10}")
                
                # Reading Age
                elif 'Reading age' in text or 'Age range' in text:
                    spans = bullet.select('span')
                    if len(spans) >= 2:
                        age_text = spans[-1].text.strip()
                        details['target_age_range'] = parse_age_range(age_text)
                        logger.info(f"Found age range: {details.get('target_age_range')}")
                
                # Page Count
                elif 'pages' in text.lower():
                    match = re.search(r'(\d+)\s*pages', text)
                    if match:
                        details['page_count'] = int(match.group(1))
                        logger.info(f"Found page count: {details['page_count']}")
            
            return details
            
    except Exception as e:
        logger.error(f"Error fetching product details: {str(e)}")
        return None

def update_missing_data(df, batch_size=10):
    """
    Update DataFrame with missing information
    """
    # Track progress
    total_processed = 0
    total_updated = 0
    
    for i in range(0, len(df), batch_size):
        batch = df[i:i+batch_size]
        logger.info(f"Processing batch {i//batch_size + 1} of {len(df)//batch_size + 1}")
        
        for idx, row in batch.iterrows():
            if pd.notna(row['product_url']):
                # Check if we need to fetch details
                needs_update = (
                    pd.isna(row['isbn13']) or 
                    pd.isna(row['isbn10']) or 
                    pd.isna(row['target_age_range']) or 
                    pd.isna(row['page_count'])
                )
                
                if needs_update:
                    details = fetch_product_details(row['product_url'])
                    if details:
                        for key, value in details.items():
                            if pd.isna(df.at[idx, key]):
                                df.at[idx, key] = value
                                total_updated += 1
                
                total_processed += 1
                
                # Log progress
                if total_processed % 10 == 0:
                    logger.info(f"Processed {total_processed}/{len(df)} books. Updated {total_updated} fields.")
        
        # Save progress after each batch
        df.to_csv('bestsellers_enriched.csv', index=False)
        logger.info(f"Saved progress after {total_processed} books")
        
        # Delay between batches
        time.sleep(5)
    
    return df

def update_remaining_data(df, batch_size=10):
    """
    Targeted update for remaining missing data
    """
    try:
        total_processed = 0
        
        for i in range(0, len(df), batch_size):
            batch = df[i:i+batch_size]
            logger.info(f"Processing batch {i//batch_size + 1} of {len(df)//batch_size + 1}")
            
            for idx, row in batch.iterrows():
                if pd.notna(row['product_url']):
                    needs_update = (
                        pd.isna(row['category_ranks']) or 
                        pd.isna(row['author']) or
                        pd.isna(row['target_age_range']) or
                        pd.isna(row['isbn13']) or
                        pd.isna(row['page_count'])
                    )
                    
                    if needs_update:
                        try:
                            response = requests.get(row['product_url'], headers=config['scraping']['headers'])
                            soup = BeautifulSoup(response.text, 'html.parser')
                            
                            # Category ranks from Best Sellers section
                            rank_section = soup.select_one('ul.zg_hrsr')
                            if rank_section and pd.isna(row['category_ranks']):
                                category_ranks = []
                                for rank_item in rank_section.select('li'):
                                    rank_text = rank_item.text.strip()
                                    if rank_text:
                                        category_ranks.append(rank_text)
                                df.at[idx, 'category_ranks'] = ' | '.join(category_ranks)
                            
                            # Author (if missing)
                            if pd.isna(row['author']):
                                author_elem = soup.select_one('span.author a')
                                if author_elem:
                                    df.at[idx, 'author'] = author_elem.text.strip()
                            
                            # Additional data collection for remaining missing fields
                            details = fetch_product_details(row['product_url'])
                            if details:
                                for key, value in details.items():
                                    if pd.isna(df.at[idx, key]):
                                        df.at[idx, key] = value
                            
                            logger.info(f"Updated book {idx}: {row['title'][:50]}...")
                            
                        except Exception as e:
                            logger.error(f"Error processing {row['title'][:50]}: {str(e)}")
                
                total_processed += 1
                if total_processed % 10 == 0:
                    logger.info(f"Processed {total_processed}/{len(df)} books")
                    df.to_csv('bestsellers_final.csv', index=False)
            
            time.sleep(config['scraping']['delay'])
    
    except Exception as e:
        logger.error(f"Error in update process: {str(e)}")
    finally:
        df.to_csv('bestsellers_final.csv', index=False)
        print_data_status(df)
    
    return df

To check the effectiveness of the code, I displayed the completeness of data before and after running them.

In [142]:
# Before running the update
print_data_status(df)

# Run the update
df = update_missing_data(df)

# After running the update
print_data_status(df)

# Another round of updating
df = update_remaining_data(df)

# After running the update
print_data_status(df)


2025-01-19 11:20:03,317 - INFO - Processing batch 1 of 24
2025-01-19 11:20:03,328 - INFO - Fetching details from: https://www.amazon.co.uk/Guess-How-Much-Love-You/dp/1406358789



Data Completeness Report:
--------------------------------------------------
rank                :  239/ 239 (100.00%)
title               :  239/ 239 (100.00%)
author              :  181/ 239 ( 75.73%)
price               :  238/ 239 ( 99.58%)
format              :  239/ 239 (100.00%)
rating              :  232/ 239 ( 97.07%)
review_count        :  232/ 239 ( 97.07%)
isbn10              :    0/ 239 (  0.00%)
isbn13              :    0/ 239 (  0.00%)
page_count          :    0/ 239 (  0.00%)
category            :  239/ 239 (100.00%)
target_age_range    :    0/ 239 (  0.00%)
timestamp           :  239/ 239 (100.00%)
asin                :  239/ 239 (100.00%)
product_url         :  239/ 239 (100.00%)
image_url           :  239/ 239 (100.00%)
category_ranks      :    0/ 239 (  0.00%)


2025-01-19 11:20:09,633 - INFO - Found page count: 32
2025-01-19 11:20:09,634 - INFO - Found ISBN-13: 9781406358780
2025-01-19 11:20:09,639 - INFO - Found age range: None
2025-01-19 11:20:09,649 - INFO - Fetching details from: https://www.amazon.co.uk/Dear-Zoo-Anniversary-Rod-Campbell/dp/1529074932
2025-01-19 11:20:16,008 - INFO - Found page count: 18
2025-01-19 11:20:16,009 - INFO - Found ISBN-10: 1529074932
2025-01-19 11:20:16,017 - INFO - Found ISBN-13: 9781529074932
2025-01-19 11:20:16,021 - INFO - Found age range: 1-3
2025-01-19 11:20:16,029 - INFO - Fetching details from: https://www.amazon.co.uk/Tiger-Who-Came-Tea/dp/0007215991
2025-01-19 11:20:22,887 - INFO - Found page count: 32
2025-01-19 11:20:22,890 - INFO - Found ISBN-10: 0007368380
2025-01-19 11:20:22,894 - INFO - Found ISBN-13: 9780007215997
2025-01-19 11:20:22,899 - INFO - Found age range: 2-4
2025-01-19 11:20:22,907 - INFO - Fetching details from: https://www.amazon.co.uk/Were-Going-Bear-Michael-Rosen/dp/0744523230
202


Data Completeness Report:
--------------------------------------------------
rank                :  239/ 239 (100.00%)
title               :  239/ 239 (100.00%)
author              :  181/ 239 ( 75.73%)
price               :  238/ 239 ( 99.58%)
format              :  239/ 239 (100.00%)
rating              :  232/ 239 ( 97.07%)
review_count        :  232/ 239 ( 97.07%)
isbn10              :  181/ 239 ( 75.73%)
isbn13              :  200/ 239 ( 83.68%)
page_count          :  217/ 239 ( 90.79%)
category            :  239/ 239 (100.00%)
target_age_range    :  167/ 239 ( 69.87%)
timestamp           :  239/ 239 (100.00%)
asin                :  239/ 239 (100.00%)
product_url         :  239/ 239 (100.00%)
image_url           :  239/ 239 (100.00%)
category_ranks      :    0/ 239 (  0.00%)


2025-01-19 11:46:31,452 - INFO - Fetching details from: https://www.amazon.co.uk/Guess-How-Much-Love-You/dp/1406358789
2025-01-19 11:46:37,805 - INFO - Found page count: 32
2025-01-19 11:46:37,806 - INFO - Found ISBN-13: 9781406358780
2025-01-19 11:46:37,807 - INFO - Found age range: None
2025-01-19 11:46:37,817 - INFO - Updated book 0: Guess How Much I Love You: The beloved classic and...
2025-01-19 11:46:41,110 - INFO - Fetching details from: https://www.amazon.co.uk/Dear-Zoo-Anniversary-Rod-Campbell/dp/1529074932
2025-01-19 11:46:47,240 - INFO - Found page count: 18
2025-01-19 11:46:47,241 - INFO - Found ISBN-10: 1529074932
2025-01-19 11:46:47,242 - INFO - Found ISBN-13: 9781529074932
2025-01-19 11:46:47,248 - INFO - Found age range: 1-3
2025-01-19 11:46:47,255 - INFO - Updated book 1: Dear Zoo: The Lift-the-flap Preschool Classic...
2025-01-19 11:46:50,883 - INFO - Fetching details from: https://www.amazon.co.uk/Tiger-Who-Came-Tea/dp/0007215991
2025-01-19 11:46:57,281 - INFO - Foun


Data Completeness Report:
--------------------------------------------------
rank                :  239/ 239 (100.00%)
title               :  239/ 239 (100.00%)
author              :  212/ 239 ( 88.70%)
price               :  238/ 239 ( 99.58%)
format              :  239/ 239 (100.00%)
rating              :  232/ 239 ( 97.07%)
review_count        :  232/ 239 ( 97.07%)
isbn10              :  181/ 239 ( 75.73%)
isbn13              :  200/ 239 ( 83.68%)
page_count          :  217/ 239 ( 90.79%)
category            :  239/ 239 (100.00%)
target_age_range    :  167/ 239 ( 69.87%)
timestamp           :  239/ 239 (100.00%)
asin                :  239/ 239 (100.00%)
product_url         :  239/ 239 (100.00%)
image_url           :  239/ 239 (100.00%)
category_ranks      :  119/ 239 ( 49.79%)

Data Completeness Report:
--------------------------------------------------
rank                :  239/ 239 (100.00%)
title               :  239/ 239 (100.00%)
author              :  212/ 239 ( 88.70%)
pric

In [143]:
# Save to CSV
filename = f"../data/raw/daily_bestsellers/bestsellers_{datetime.now().strftime('%Y%m%d')}.csv"
df.to_csv(filename, index=False)

print("\nData Collection Summary:")
print(f"Total books collected: {len(df)}")
print("\nBooks per category:")
print(df['category'].value_counts())
print("\nMissing data summary:")
print(df.isnull().sum()) 


Data Collection Summary:
Total books collected: 239

Books per category:
category
Classic Books                         30
Sport and Outdoors                    30
Teen and Young Adult                  30
Comics and Graphic Novels             30
Science, Nature & How It Works        30
Educational Books                     30
Reference Books                       30
TV, Movie & Video Game Adaptations    29
Name: count, dtype: int64

Missing data summary:
rank                  0
title                 0
author               27
price                 1
format                0
rating                7
review_count          7
isbn10               58
isbn13               39
page_count           22
category              0
target_age_range     72
timestamp             0
asin                  0
product_url           0
image_url             0
category_ranks      120
dtype: int64


In [144]:
print(df)


     rank                                              title  \
0       1  Guess How Much I Love You: The beloved classic...   
1       2      Dear Zoo: The Lift-the-flap Preschool Classic   
2       3                          The Tiger Who Came to Tea   
3       4  We're Going on a Bear Hunt: The bestselling cl...   
4       5  Oh Dear!: A Lift-the-flap Farm Book from the C...   
..    ...                                                ...   
234    26                        Guinness World Records 2024   
235    27  Weird but true! 2024: Old edition (National Ge...   
236    28    Children's Bible Stories (Illustrated Treasury)   
237    29  Dinosaurs Find it! Explore it!: More than 250 ...   
238    30  First Words My Learning Library | Hinkler Buil...   

                       author  price      format  rating  review_count  \
0               Sam McBratney   4.00  Board book     4.8        8752.0   
1                Rod Campbell   4.00  Board book     4.8       30581.0   
2        

The data collection has successfully captured the core elements needed for Amazon bestseller analysis - particularly strong in areas critical for market trend analysis including ranks, titles, authors, prices, formats and categories. While I'm missing review counts and some auxiliary data like ISBNs and page counts, the dataset contains the essential metrics for the planned deliverables: price point analysis, genre popularity trends, format impact study and seasonal patterns.

### Next Steps
2. Data cleaning ([02_data_cleaning.ipynb](02_data_cleaning.ipynb)): Standardize formats, clean titles, handle the 3 missing format entries
3. Price analysis ([03_price_analysis.ipynb](03_price_analysis.ipynb)): Create visualizations of price distributions across formats and categories
4. Genre analysis ([04_genre_analysis.ipynb](04_genre_analysis.ipynb)): Analyze category performance and age-range segmentation
5. Seasonal analysis ([05_seasonal_analysis.ipynb](05_seasonal_analysis)): Begin tracking temporal patterns (will become more valuable as I collect more data over time)