# Amazon Best Selling Books - Data Exploration

# Introduction

Amazon, one of the largest e-commerce platforms globally, started in a garage in Bellevue, Washington, just outside Seattle. Under the leadership of Jeff Bezos, Amazon evolved from an online bookstore into a marketplace where customers can purchase nearly anything, both physical and digital. Despite its vast expansion, Amazon remains the largest online bookstore.

Launched in July 1995, Amazon promoted itself as "Earth's Biggest Bookstore" and quickly gained traction by leveraging partnerships with major book distributors and wholesalers to fulfill orders efficiently.

# Project Objective

This project aims to analyze the best-selling books on Amazon from 2009 to 2023. The dataset, scraped from the Amazon website, will help derive insights into key trends and performance metrics. The primary questions we seek to answer include:

Genre Distribution of Unique Best Sellers: How many Fiction versus Non-Fiction books were best sellers on Amazon between 2009 and 2023?

Genre Distribution by Year: How many Fiction and Non-Fiction best sellers were there each year?

Top 10 Authors: Which authors appeared most frequently on the best-seller list from 2009 to 2023?

Top 10 Books: Which books made the most appearances on the best-seller list during this period?

Relationship Between Reviews and Price: Is there a trend between the price of books and the reviews they receive? For example, do lower-priced books tend to receive better reviews?

Genre Performance Based on Reviews: Do Fiction books generally receive more reviews than Non-Fiction books?

Price Trends Over Time: How did book prices change between 2009 and 2023? Did prices increase or decrease over the years?

Top 20 Authors with Highest User Ratings: Which authors received the highest average user ratings?

Top 20 Authors with the Most Reviews: Which authors had the most customer reviews?

Book Title Length and User Ratings: Is there a correlation between the length of a book’s title and its user ratings? Do shorter titles receive higher ratings?

Top 10 Best-Selling Authors by Genre and Frequency: Who are the top-selling authors, categorized by genre and number of appearances?

Top 10 Books Based on User Ratings: Which books had the highest average user ratings?

Top 10 Books Based on Reviews: Which books received the most reviews?

Top 10 Best-Selling Books by Genre and Number of Reviews: Which books, sorted by genre, garnered the highest number of reviews?

In [2]:
# Import necessary packages for data analysis and web scraping

# Data manipulation and analysis packages
import numpy as np
import pandas as pd

# Web scraping packages
import requests
from selenium import webdriver
from bs4 import BeautifulSoup
from urllib.request import urlopen
from time import sleep

# Suppress warnings
import warnings
warnings.filterwarnings('ignore')

In [3]:

# Set display options for better DataFrame visualization
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)

# Gathering the Data

###### Let's build the URLs for the first and second pages of each year from 2009 to 2023

In [6]:
# Generate a list of URLs for Amazon's best-selling books for the years 2009-2023

# Initialize an empty list to store URLs
urls = []

# Create a list of years from 2009 to 2023 as strings
years = [str(i) for i in range(2009, 2024)]

# Loop through each year and generate the URLs for the first and second pages
for year in years:
    # URL for the first page of best sellers for the given year
    urls.append(f"https://www.amazon.com/gp/bestsellers/{year}/books")
    
    # URL for the second page of best sellers for the given year
    urls.append(f"https://www.amazon.com/gp/bestsellers/{year}/books/ref=zg_bsar_pg_2?ie=UTF8&pg=2")

print(urls)

['https://www.amazon.com/gp/bestsellers/2009/books', 'https://www.amazon.com/gp/bestsellers/2009/books/ref=zg_bsar_pg_2?ie=UTF8&pg=2', 'https://www.amazon.com/gp/bestsellers/2010/books', 'https://www.amazon.com/gp/bestsellers/2010/books/ref=zg_bsar_pg_2?ie=UTF8&pg=2', 'https://www.amazon.com/gp/bestsellers/2011/books', 'https://www.amazon.com/gp/bestsellers/2011/books/ref=zg_bsar_pg_2?ie=UTF8&pg=2', 'https://www.amazon.com/gp/bestsellers/2012/books', 'https://www.amazon.com/gp/bestsellers/2012/books/ref=zg_bsar_pg_2?ie=UTF8&pg=2', 'https://www.amazon.com/gp/bestsellers/2013/books', 'https://www.amazon.com/gp/bestsellers/2013/books/ref=zg_bsar_pg_2?ie=UTF8&pg=2', 'https://www.amazon.com/gp/bestsellers/2014/books', 'https://www.amazon.com/gp/bestsellers/2014/books/ref=zg_bsar_pg_2?ie=UTF8&pg=2', 'https://www.amazon.com/gp/bestsellers/2015/books', 'https://www.amazon.com/gp/bestsellers/2015/books/ref=zg_bsar_pg_2?ie=UTF8&pg=2', 'https://www.amazon.com/gp/bestsellers/2016/books', 'https://

###### Let's use this function to get the details for each book in each year

###### Steps:
###### Access the URLs: Loop through the URLs for different years.
###### Scrape each book’s details: For each year, extract book details by calling the get_dir function.
###### Store the data: Append the results into a list or DataFrame.

In [7]:
def get_dir(book, year):
    '''Extracts details for each book for a given year'''
    
    # Helper function to extract text safely
    def safe_find(selector, class_name, slice_text=None):
        try:
            result = book.find(selector, class_=class_name).text
            if slice_text:
                result = result[slice_text]
            return result
        except Exception:
            return np.nan

    # Extracting book details using the helper function
    price = safe_find('span', "_cDEzb_p13n-sc-price_3mJ9Z", slice_text=slice(1, None))
    ranks = safe_find('span', 'zg-bdg-text', slice_text=slice(1, None))
    title = safe_find('div', "_cDEzb_p13n-sc-css-line-clamp-1_1Fn1y")
    ratings = safe_find('span', "a-icon-alt", slice_text=slice(0, 3))
    no_of_reviews = safe_find('span', "a-size-small")
    author = safe_find('a', "a-size-small a-link-child")
    cover_type = safe_find('span', "a-size-small a-color-secondary a-text-normal")
    
    # Return the details as a list
    return [price, ranks, title, no_of_reviews, ratings, author, cover_type, year]
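
The values returned by `get_dir` are still raw strings, so they usually need converting before any numeric analysis. The following is a minimal sketch of that step, assuming prices look like `12.81` and review counts look like `16,129`; `parse_price` and `parse_review_count` are hypothetical helpers, not part of the original notebook.

```python
import math

def parse_price(text):
    """Convert a price string such as '12.81' or '$12.81' to float; NaN if not numeric."""
    try:
        return float(str(text).replace("$", "").replace(",", "").strip())
    except ValueError:
        return math.nan

def parse_review_count(text):
    """Convert a count such as '16,129' to int; None if the text is not a whole number."""
    try:
        return int(str(text).replace(",", "").strip())
    except ValueError:
        return None

print(parse_price("12.81"))
print(parse_review_count("16,129"))
print(parse_review_count("Will Smith"))
```

Returning `NaN` or `None` instead of raising keeps a single malformed field from stopping the whole conversion.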


###### Let's get the year for the first and second page

In [8]:
# Year labels aligned with `urls`: each year appears twice in a row (page 1, then page 2)
years = [str(y) for y in range(2009, 2024) for _ in range(2)]
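
Since `urls` holds two consecutive entries per year, any list of year labels has to follow that same order or books will be tagged with the wrong year. One way to guarantee the alignment is to derive each label from its URL. The sketch below assumes the URL pattern used earlier in this notebook; `year_from_url` is a hypothetical helper.

```python
import re

# Rebuild the URL list the same way as the earlier cell
urls = []
for year in range(2009, 2024):
    urls.append(f"https://www.amazon.com/gp/bestsellers/{year}/books")
    urls.append(f"https://www.amazon.com/gp/bestsellers/{year}/books/ref=zg_bsar_pg_2?ie=UTF8&pg=2")

def year_from_url(url):
    """Return the four-digit year embedded in a best-seller URL, or None if absent."""
    match = re.search(r"/bestsellers/(\d{4})/", url)
    return match.group(1) if match else None

# Labels derived from the URLs are aligned with them by construction
years_from_urls = [year_from_url(u) for u in urls]

print(years_from_urls[:4])  # ['2009', '2009', '2010', '2010']
```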


###### Let's get the books on every page (first and second) of every year from 2009 to 2023
###### Note that this cell takes about 25 minutes to run, so you will have to exercise patience

In [13]:
# List to store book data from both pages for each year
all_years_books = []

from time import sleep
from selenium import webdriver
from bs4 import BeautifulSoup

# Loop through each URL and extract book data from the first and second pages
for url in urls:
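    # Initialize WebDriver for Chrome (replace the path with the correct one for your system).
    # Note: recent Selenium 4 releases treat the first positional argument as an options object,
    # so passing the driver path this way raises the AttributeError shown below.
    # Pass the path through a Service object instead.
    driver = webdriver.Chrome("C:/webDrivers/chromedriver.exe")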
    
    # Load the webpage
    driver.get(url)
    
    # Wait for the webpage to load completely (increase or decrease based on your network speed)
    sleep(30)
    
    # Parse the page content using BeautifulSoup
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    
    # Find all the books on the page by their element id
    books = soup.find_all(id='gridItemRoot')
    
    # Append the book data to the all_years_books list
    all_years_books.append(books)
    
    # Close the browser after extracting the data
    driver.quit()

# Print statement (optional) to check that the books were collected properly
# print(all_years_books)

AttributeError: 'str' object has no attribute 'capabilities'
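
A scraping loop like the one above is easier to debug when a single failing page does not stop the run and when the results stay in the same order as `urls`. The following is a minimal sketch of that idea. `collect_pages` and `fetch_html` are hypothetical names; with Selenium, `fetch_html` would wrap `driver.get(url)` followed by reading `driver.page_source`. A stand-in fetcher is used here so the example runs without a browser.

```python
from time import sleep

def collect_pages(urls, fetch_html, delay_seconds=0.0):
    """Fetch each URL in order; store None for failed pages so positions match urls."""
    pages = []
    for url in urls:
        try:
            pages.append(fetch_html(url))
        except Exception as error:
            print(f"Skipping {url}: {error}")
            pages.append(None)
        # Pause between requests to avoid hammering the site
        sleep(delay_seconds)
    return pages

# Stand-in fetcher used only for this illustration
def fake_fetch(url):
    if "broken" in url:
        raise RuntimeError("page did not load")
    return f"<html>{url}</html>"

pages = collect_pages(["https://example.com/a", "https://example.com/broken"], fake_fetch)
print(pages)
```

Keeping a `None` placeholder for failed pages means the index of each result still matches the index of its year label.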

In [12]:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service

# Specify the path to the ChromeDriver executable
chrome_driver_path = "C:/webDrivers/chromedriver.exe"

# Create a service object for ChromeDriver
service = Service(executable_path=chrome_driver_path)

# Initialize the WebDriver with the service
driver = webdriver.Chrome(service=service)

# Now you can use the driver as usual


NoSuchDriverException: Message: Unable to obtain driver for chrome; For documentation on this error, please visit: https://www.selenium.dev/documentation/webdriver/troubleshooting/errors/driver_location
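
This error typically means that no usable ChromeDriver was found at the given path. One possible workaround, sketched here under the assumption that Selenium 4.6 or later and a local Chrome installation are available, is to omit the path so that Selenium Manager tries to locate a matching driver itself. `make_chrome_driver` is a hypothetical helper, and the headless flag is optional.

```python
def make_chrome_driver(headless=True):
    """Create a Chrome WebDriver without an explicit chromedriver path."""
    # Imported inside the function so the definition loads even where Selenium is missing
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    if headless:
        # Run Chrome without opening a visible window
        options.add_argument("--headless=new")

    # With no Service path given, Selenium Manager attempts to resolve the driver
    return webdriver.Chrome(options=options)
```

Calling `driver = make_chrome_driver()` would then replace the `Service`-based setup above; whether it succeeds still depends on the local Chrome installation and network access.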


In [6]:
len(all_years_books), len(years)  # confirm every page (first and second) of every year was collected
# the two lengths should be the same

(26, 26)

###### Use the code below to pair each year with its index so that looping through the pages is easier

In [7]:
year_index = (list(enumerate(years)))
dc = year_index

###### Use the code below to get the observations for all the books in the top 100 for every year in the period 2009-2023

In [8]:
data = []  # one row per book across the yearly top-100 lists, 2009-2023
for idx, year in dc:                       # (index, year) pairs from the cell above
    for book in all_years_books[idx]:      # every book element scraped from that page
        data.append(get_dir(book, year))   # extract the book's details and tag them with the year

# data  # uncomment to inspect the collected rows

In [9]:

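# Write one scraped row per line to a plain-text backup file
with open('Amazon.txt', 'w+', encoding='utf-8') as f:
    for items in data:
        try:
            f.write('%s\n' % items)
        except Exception:
            # Placeholder line for any row that cannot be written
            f.write('nothing\n')

    print("File written successfully")

# No explicit f.close() is needed: the with block closes the file on exit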

File written successfully


###### This cell is for converting the data extracted in list format to a dataframe

In [10]:
best_selling_books = pd.DataFrame(data, columns=[
    'price',
    'ranks',
    'title',
    'no_of_reviews',
    'ratings',
    'author',
    'cover_type',
    'year',
])


###### save the data to csv file

In [11]:
best_selling_books.to_csv('best_selling_books_2009-2021.csv')   #To save to csv
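
By default `to_csv` also writes the DataFrame index, which comes back as an `Unnamed: 0` column when the file is read again. The small sketch below, using an in-memory buffer and a one-row example frame rather than the scraped data, shows the difference that `index=False` makes.

```python
import io
import pandas as pd

df = pd.DataFrame({"title": ["The Lost Symbol"], "year": [2009]})

# Default behaviour: the index is written as an unnamed first column
with_index = io.StringIO()
df.to_csv(with_index)
with_index.seek(0)
cols_with_index = pd.read_csv(with_index).columns.tolist()

# index=False leaves the index out of the file
without_index = io.StringIO()
df.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = pd.read_csv(without_index).columns.tolist()

print(cols_with_index)     # ['Unnamed: 0', 'title', 'year']
print(cols_without_index)  # ['title', 'year']
```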

In [12]:
best_selling_books

Unnamed: 0,price,ranks,title,no_of_reviews,ratings,author,cover_type,year
0,12.81,1,The Lost Symbol,16129,4.4,Dan Brown,Hardcover,2009
1,10.43,2,The Shack: Where Tragedy Confronts Eternity,23398,4.7,William P. Young,Paperback,2009
2,9.93,3,Liberty and Tyranny: A Conservative Manifesto,5037,4.8,Mark R. Levin,Hardcover,2009
3,14.30,4,"Breaking Dawn (The Twilight Saga, Book 4)",16923,4.7,Stephenie Meyer,Hardcover,2009
4,9.99,5,Going Rogue: An American Life,1572,4.6,Sarah Palin,Hardcover,2009
...,...,...,...,...,...,...,...,...
1286,16.69,96,Will,Will Smith,4.8,,Hardcover,2021
1287,7.49,97,Think and Grow Rich: The Landmark Bestseller N...,83367,4.7,Napoleon Hill,Paperback,2021
1288,8.95,98,Dragons Love Tacos,15771,4.8,Adam Rubin,Hardcover,2021
1289,7.49,99,The Truth About COVID-19: Exposing The Great R...,Doctor Joseph Mercola,4.8,,Hardcover,2021
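
In the sample above, rows 1286 and 1289 carry an author name in `no_of_reviews` while `author` is empty. One plausible cause, which is an assumption rather than something this notebook confirms, is that the broad `a-size-small` selector matches the author element when the author is not rendered as a link. The following sketch shows one way such rows could be repaired before analysis; `repair_shifted_row` is a hypothetical helper working on plain dictionaries.

```python
import math

def is_missing(value):
    """True for None, empty strings, and NaN."""
    return value is None or value == "" or (isinstance(value, float) and math.isnan(value))

def repair_shifted_row(row):
    """Move a non-numeric no_of_reviews value into an empty author field."""
    fixed = dict(row)
    raw = fixed.get("no_of_reviews")
    looks_numeric = str(raw).replace(",", "").strip().isdigit()
    if is_missing(fixed.get("author")) and not is_missing(raw) and not looks_numeric:
        fixed["author"] = raw
        fixed["no_of_reviews"] = None
    return fixed

row = {"title": "Will", "no_of_reviews": "Will Smith", "ratings": 4.8, "author": None, "year": 2021}
fixed = repair_shifted_row(row)
print(fixed["author"], fixed["no_of_reviews"])
```

Rows that already have a numeric review count or a filled author are returned unchanged.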


###### This is the end of this notebook.

We will continue with the analysis in the Part Two notebook. Splitting the work this way means we do not have to scrape Amazon again every time we rerun the code, and the analysis always starts from the same saved dataset.

We also want to respect the ethics of web scraping.
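
One practical part of scraping responsibly is checking a site's `robots.txt` rules before requesting a page. The sketch below uses Python's standard `urllib.robotparser` on a made-up rule set; the rules and URLs are hypothetical examples and are not Amazon's actual `robots.txt`.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for illustration only; NOT Amazon's real robots.txt
example_rules = [
    "User-agent: *",
    "Disallow: /private/",
    "Crawl-delay: 5",
]

parser = RobotFileParser()
parser.parse(example_rules)

allowed = parser.can_fetch("*", "https://example.com/books")
blocked = parser.can_fetch("*", "https://example.com/private/report")
delay = parser.crawl_delay("*")  # suggested pause between requests, if the site declares one

print(allowed, blocked, delay)
```

For a live site, `parser.set_url(...)` followed by `parser.read()` would load the real rules instead of the hard-coded list.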