# Amazon Web Scraping

## Introduction

The goal of this project is to scrape the Amazon website to get details of products based on the search results.

## Import Libraries

Libraries required for the project are imported.

In [3]:
# For webscraping
from bs4 import BeautifulSoup
# Chrome driver
from selenium import webdriver
# Chrome driver manager
from webdriver_manager.chrome import ChromeDriverManager
# For current data and time
from datetime import datetime
# For writing results to CSV file
import csv

## Initialize Webdriver

Web driver for Google Chrome is initialized since the scraping will be done on Google Chrome.

In [4]:
# Initialize webdriver for chrome
driver = webdriver.Chrome(ChromeDriverManager().install())



Current google-chrome version is 96.0.4664
Get LATEST chromedriver version for 96.0.4664 google-chrome
Driver [C:\Users\iammi\.wdm\drivers\chromedriver\win32\96.0.4664.45\chromedriver.exe] found in cache
  driver = webdriver.Chrome(ChromeDriverManager().install())


## Extract Data Collection 

Below steps are followed for extracting data from the website.

In [5]:
url = 'https://www.amazon.in/s?k=iphone+13'

# Navigate to the URL
driver.get(url)

In [9]:
# Initialize BeautifulSoup
soup = BeautifulSoup(driver.page_source, 'html.parser')

In [10]:
# Get product data
product_data = soup.find_all('div', {'data-component-type': 's-search-result'})
# No. of products whose data extracted
len(product_data)

18

## Prototype Product Information

Since the identification of collection of records has been completed, the next step is to prototype the extraction of information from a single product item.

In [11]:
# Consider a single record
product = product_data[0]

In [12]:
# Extract product name
product_name = product.h2.a.text.strip()
print(product_name)

Apple iPhone 13 Mini (128GB) - Blue


In [13]:
# Extract product link
product_link = 'https://www.amazon.in' + product.h2.a.get('href')
print(product_link)

https://www.amazon.in/gp/slredirect/picassoRedirect.html/ref=pa_sp_atf_aps_sr_pg1_1?ie=UTF8&adId=A08694961VBUR5DCN1K0W&url=%2FApple-iPhone-13-Mini-128GB%2Fdp%2FB09G99NBNQ%2Fref%3Dsr_1_1_sspa%3Fkeywords%3Diphone%2B13%26qid%3D1640945983%26sr%3D8-1-spons%26psc%3D1&qualifier=1640945983&id=176615068607060&widgetName=sp_atf


In [15]:
# Extract product price
product_price = product.find('span', 'a-price-whole').text
print(product_price)

69,900


In [16]:
# Extract product ratings
product_ratings = product.find('span', 'a-icon-alt').text.strip(' out of 5 stars')
print(product_ratings)

4.4


In [17]:
# Extract no. of reviews
product_num_of_reviews = product.find('span', 'a-size-base').text
print(product_num_of_reviews)

280


## Formulate a Function for Extracting Product Information

A function is created for extracting information related to a product.

In [18]:
# Function for extracting product data
def extract_data(product):
    name = product.h2.a.text.strip()
    price = product.find('span', 'a-price-whole').text
    ratings = product.find('span', 'a-icon-alt').text.strip(' out of 5 stars')
    num_of_reviews = product.find('span', 'a-size-base').text
    link = 'https://www.amazon.in' + product.h2.a.get('href')
    
    info = (name, price, ratings, num_of_reviews, link)
    return info

In [20]:
# Testing extract_data function
print(extract_data(product_data[1]))

('New Apple iPhone 12 Mini (256GB) - (Product) RED', '₹64,999', '4.', '6,218', 'https://www.amazon.in/gp/slredirect/picassoRedirect.html/ref=pa_sp_atf_aps_sr_pg1_1?ie=UTF8&adId=A03142682ROKNU8C01UJ5&url=%2FNew-Apple-iPhone-Mini-256GB%2Fdp%2FB08L5WKDFF%2Fref%3Dsr_1_2_sspa%3Fkeywords%3Diphone%2B13%26qid%3D1640945983%26sr%3D8-2-spons%26psc%3D1&qualifier=1640945983&id=176615068607060&widgetName=sp_atf')


The function is working as intended for a single product item. Now the working of this function for all the products needs to be checked.

In [21]:
# For storing multiple product information
product_info = []

# Get product data
product_data = soup.find_all('div', {'data-component-type': 's-search-result'})

# Extract information of each product and store them in product_info
for product in product_data:
    product_info.append(extract_data(product))

# Print product information
for product in product_info:
    for info in product:
        print(info)
    print()

AttributeError: 'NoneType' object has no attribute 'text'

An exception has occured since ratings and reviews of a product is not available. 

## Exception Handling

Since there are more products with no ratings and reviews, exception handling must be carried out such that the program should not crash while running.

In [22]:
# Function for extracting product data with exception handling
def extract_data(product):
    
    try:
        name = product.h2.a.text.strip()
        price = product.find('span', 'a-price-whole').text
        link = 'https://www.amazon.in' + product.h2.a.get('href')
    except:
        return
    
    try: 
        ratings = product.find('span', 'a-icon-alt').text.strip(' out of 5 stars')
        num_of_reviews = product.find('span', 'a-size-base').text
    except:
        ratings = ''
        num_of_reviews = ''
    
    info = (name, price, ratings, num_of_reviews, link)
    return info

Testing of this function is carried out.

In [23]:
# Extract information of each product and store them in a list
for product in product_data:
    info = extract_data(product)
    # Product info is added to list only if info is not empty
    if info:
        product_info.append(info)

# Print product information
for product in product_info:
    for info in product:
        print(info)
    print()

Apple iPhone 13 Mini (128GB) - Blue
₹69,900
4.4
280
https://www.amazon.in/gp/slredirect/picassoRedirect.html/ref=pa_sp_atf_aps_sr_pg1_1?ie=UTF8&adId=A08694961VBUR5DCN1K0W&url=%2FApple-iPhone-13-Mini-128GB%2Fdp%2FB09G99NBNQ%2Fref%3Dsr_1_1_sspa%3Fkeywords%3Diphone%2B13%26qid%3D1640945983%26sr%3D8-1-spons%26psc%3D1&qualifier=1640945983&id=176615068607060&widgetName=sp_atf

New Apple iPhone 12 Mini (256GB) - (Product) RED
₹64,999
4.
6,218
https://www.amazon.in/gp/slredirect/picassoRedirect.html/ref=pa_sp_atf_aps_sr_pg1_1?ie=UTF8&adId=A03142682ROKNU8C01UJ5&url=%2FNew-Apple-iPhone-Mini-256GB%2Fdp%2FB08L5WKDFF%2Fref%3Dsr_1_2_sspa%3Fkeywords%3Diphone%2B13%26qid%3D1640945983%26sr%3D8-2-spons%26psc%3D1&qualifier=1640945983&id=176615068607060&widgetName=sp_atf

Apple iPhone 13 (256GB) - (Product) RED
₹89,900
4.6
553
https://www.amazon.in/Apple-iPhone-13-256GB-Product/dp/B09G9HDN4Q/ref=sr_1_3?keywords=iphone+13&qid=1640945983&sr=8-3

Apple iPhone 13 Mini (128GB) - (Product) RED
₹69,900
4.4
280
htt

The function is found to be working fine.

## Formulate a Function for Getting URL for given Search Input

The URL of the search result page depends directly on the search input given by the user. Thus a function is created to generate the required URL.

In [24]:
# Function for generating URL based on user search input
def get_url(search_input):
    search_input_mod = search_input.replace(' ', '+')
    url = 'https://www.amazon.in/s?k=' + search_input_mod
    return url

## Formulate a Function for Creating Soup for All Pages

The search result in Amazon extends upto 20 pages. The code needs to be written in order to get all the information of products stored across 20 pages.

In [25]:
product_info = []

search_input = 'iphone 13'
url = get_url(search_input)

# To get product data in each page of 20 pages long search result
for page_num in range(1, 21):
    url_full = url + '&page=' + str(page_num)
    driver.get(url_full)
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    
    product_data = soup.find_all('div', {'data-component-type': 's-search-result'})
    for product in product_data:
        info = extract_data(product)
        if info:
            product_info.append(info)

for product in product_info:
    for info in product:
        print(info)
    print()

Apple iPhone 13 Mini (128GB) - Blue
₹69,900
4.4
280
https://www.amazon.in/gp/slredirect/picassoRedirect.html/ref=pa_sp_atf_aps_sr_pg1_1?ie=UTF8&adId=A033021222S2YH4TY3CLK&url=%2FApple-iPhone-13-Mini-128GB%2Fdp%2FB09G99NBNQ%2Fref%3Dsr_1_1_sspa%3Fkeywords%3Diphone%2B13%26qid%3D1640946212%26sr%3D8-1-spons%26psc%3D1&qualifier=1640946212&id=2074233375148366&widgetName=sp_atf

New Apple iPhone 12 Mini (256GB) - (Product) RED
₹64,999
4.
6,218
https://www.amazon.in/gp/slredirect/picassoRedirect.html/ref=pa_sp_atf_aps_sr_pg1_1?ie=UTF8&adId=A03142682ROKNU8C01UJ5&url=%2FNew-Apple-iPhone-Mini-256GB%2Fdp%2FB08L5WKDFF%2Fref%3Dsr_1_2_sspa%3Fkeywords%3Diphone%2B13%26qid%3D1640946212%26sr%3D8-2-spons%26psc%3D1&qualifier=1640946212&id=2074233375148366&widgetName=sp_atf

Apple iPhone 13 (256GB) - (Product) RED
₹89,900
4.6
553
https://www.amazon.in/Apple-iPhone-13-256GB-Product/dp/B09G9HDN4Q/ref=sr_1_3?keywords=iphone+13&qid=1640946212&sr=8-3

Apple iPhone 13 Mini (128GB) - (Product) RED
₹69,900
4.4
280
h

## Formulate a Function for Creating a Spreadsheet with Product Information

A function is created for writing information of all products into a CSV file.

In [26]:
# Function for generating CSV file with product information
def generate_csv(search_input, product_info):
    search_input_mod = search_input.replace(' ', '_')
    current_datetime = datetime.now().strftime("%y%m%d%H%M%S")
    filename = search_input_mod + '_' + current_datetime + '.csv'
    
    with open(filename, 'w', newline = '', encoding = 'utf-8') as file:
        writer = csv.writer(file)
        writer.writerow(['Product Name', 'Price (In Rupees)', 'Rating (Out of 5)', 'No. of Reviews', 'Product Link'])
        writer.writerows(product_info)
        print("Product info has been written to " + filename)

## Create Final Code

The last step is to put together all the components and build the final version of the code

In [2]:
# For webscraping
from bs4 import BeautifulSoup
# Chrome driver
from selenium import webdriver
# Chrome driver manager
from webdriver_manager.chrome import ChromeDriverManager
# For current data and time
from datetime import datetime
# For writing results to CSV file
import csv

# Function for generating CSV file with product information
def generate_csv(search_input, product_info):
    search_input_mod = search_input.replace(' ', '_')
    current_datetime = datetime.now().strftime("%y%m%d%H%M%S")
    filename = search_input_mod + '_' + current_datetime + '.csv'
    
    with open(filename, 'w', newline = '', encoding = 'utf-8') as file:
        writer = csv.writer(file)
        writer.writerow(['Product Name', 'Price', 'Rating (Out of 5)', 'No. of Reviews', 'Product Link'])
        writer.writerows(product_info)
        print("Product info has been written to " + filename)

# Function for generating URL based on user search input
def get_url(search_input):
    search_input_mod = search_input.replace(' ', '+')
    url = 'https://www.amazon.in/s?k=' + search_input_mod
    return url

# Function for extracting product data with exception handling
def extract_data(product):
    
    try:
        name = product.h2.a.text.strip()
        price = product.find('span', 'a-price-whole').text
        link = 'https://www.amazon.in' + product.h2.a.get('href')
    except:
        return
    
    try: 
        ratings = product.find('span', 'a-icon-alt').text.strip(' out of 5 stars')
        num_of_reviews = product.find('span', 'a-size-base').text
    except:
        ratings = ''
        num_of_reviews = ''
    
    info = (name, price, ratings, num_of_reviews, link)
    return info

def main():
    # Run main program
    # Initialize webdriver for chrome
    driver = webdriver.Chrome(ChromeDriverManager().install())

    print('________________________')
    print('AMAZON.IN SEARCH RESULTS')
    print('************************')
    print('This program exports the')
    print('results of user search')
    print('query into a CSV file')

    continue_search = True

    while continue_search:
        product_info = []

        search_input = input('\nSearch for: ')
        url = get_url(search_input)

        # To get product data in each page of 20 pages long search result
        for page_num in range(1, 21):
            url_full = url + '&page=' + str(page_num)
            driver.get(url_full)
            soup = BeautifulSoup(driver.page_source, 'html.parser')
    
            product_data = soup.find_all('div', {'data-component-type': 's-search-result'})
            for product in product_data:
                info = extract_data(product)
                if info:
                    product_info.append(info)

        generate_csv(search_input, product_info)

        while True:
            continue_response = input('\nWould you like to search again? (Y/N): ')
            if continue_response.lower() == 'y':
                continue_search = True
                break
            elif continue_response.lower() == 'n':
                continue_search = False
                break
            else:
                print("Enter a valid response.")
                continue

if __name__ == "__main__":
    main()



Current google-chrome version is 96.0.4664
Get LATEST chromedriver version for 96.0.4664 google-chrome
Driver [C:\Users\iammi\.wdm\drivers\chromedriver\win32\96.0.4664.45\chromedriver.exe] found in cache
  driver = webdriver.Chrome(ChromeDriverManager().install())


________________________
AMAZON.IN SEARCH RESULTS
************************
This program exports the
results of user search
query into a CSV file

Search for: samsung galaxy
Product info has been written to samsung_galaxy_211231161246.csv

Would you like to search again? (Y/N): y

Search for: oneplus nord 2
Product info has been written to oneplus_nord_2_211231161331.csv

Would you like to search again? (Y/N): n
