
# Amazon Web Scraping using BeautifulSoup



This is an academic project for scraping item listings from Amazon. The data is not used for any commercial purpose. Before scraping any website, check your local laws related to web scraping.

In [None]:
# Installing BeautifulSoup (the PyPI package is named beautifulsoup4; bs4 is a thin wrapper around it)
!pip install beautifulsoup4

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [41]:
# Import libraries (only the ones this notebook actually uses)
from bs4 import BeautifulSoup
import requests
import pandas as pd


In [42]:
# Replace the User-Agent value with your own browser's User-Agent string
headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.0.0 Safari/537.36",
    "Accept-Encoding": "gzip, deflate",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "DNT": "1",
    "Connection": "close",
    "Upgrade-Insecure-Requests": "1",
}


In [43]:

# Generate the search-result URLs for the first 20 pages of a search term
def generate_url(search_term):
  urlList = []
  template = 'https://www.amazon.com/s?k={}'
  # Amazon search URLs separate words with '+'
  search_term = search_term.replace(' ', '+')
  # Add the query
  url = template.format(search_term)
  # Add placeholders for the page number
  url += '&page={}&qid=1656114267&ref=sr_pg_{}'
  # Fill in page numbers 1 to 20
  for i in range(1, 21):
    urlList.append(url.format(i, i))
  return urlList

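# Fetch a URL and return its parsed HTML as a BeautifulSoup object
def getSoup(url):
    # A timeout keeps the notebook from hanging on an unresponsive request
    page = requests.get(url, headers=headers, timeout=30)
    soup1 = BeautifulSoup(page.content, "html.parser")
    # Re-parse the prettified markup, which puts each tag and string on its own line
    soup2 = BeautifulSoup(soup1.prettify(), "html.parser")
    return soup2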

# Extract title, price, rating, review count and URL from one listed product
def get_item_data(item):

  # Product title and URL
  title = item.h2.a.text.strip()
  url = 'https://www.amazon.com' + item.h2.a.get('href')

  # Product rating and number of customer reviews
  try:
    # Rating text looks like "4.3 out of 5 stars"; keep only the leading number
    rating = item.find('span', {'class': 'a-icon-alt'}).getText().strip().split()[0]
    customerReview = item.find('span', {'class': 'a-size-base s-underline-text'}).getText().strip()
  except (AttributeError, IndexError):
    # Either element is missing or empty: treat the product as unrated
    rating = '0'
    customerReview = '0'

  # Product price, without the leading currency symbol
  try:
    price = item.find('span', {'class': 'a-price'}).find('span', {'class': 'a-offscreen'}).text.strip()[1:]
  except AttributeError:
    price = '0'

  result = (title, price, rating, customerReview, url)
  return result
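
The `generate_url` function above builds its URLs by replacing spaces with `+`, which leaves other special characters in the search term unescaped. A minimal sketch of an alternative, assuming the same `k` and `page` query parameters, uses `urllib.parse.urlencode` from the standard library. The name `build_search_urls` and its `pages` parameter are hypothetical and not part of the original notebook, and the `qid` and `ref` parameters are left out on the untested assumption that the search works without them.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.amazon.com/s"

def build_search_urls(search_term, pages=20):
    """Return one search URL per result page (hypothetical helper)."""
    urls = []
    for page in range(1, pages + 1):
        # urlencode turns spaces into '+' and percent-encodes other special characters
        query = urlencode({"k": search_term, "page": page})
        urls.append(f"{BASE_URL}?{query}")
    return urls

urls = build_search_urls("mobile phones", pages=3)
print(urls[0])
```

The returned list can be used in place of `url_List` if this variant suits the search terms being used.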

In [44]:
# Searching for mobile phones

url_List = generate_url('mobile phones')
itemList = []

# Getting item data from all the pages
for url in url_List:
  # Reuse getSoup instead of repeating the request and parsing steps
  soup = getSoup(url)
  totalItems = soup.find_all('div', {'data-component-type': 's-search-result'})
  for item in totalItems:
    itemData = get_item_data(item)
    if itemData:
      itemList.append(itemData)


If the search returns fewer than 20 pages of results, reduce the loop count in `generate_url`. Otherwise the items from page 1 are repeated, resulting in duplicates.
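
One way to avoid hard-coding the page count is to stop as soon as a page contributes nothing new. The sketch below assumes each page has already been reduced to a list of item tuples like those returned by `get_item_data`. The function name `collect_unique_items` is hypothetical, and the sample pages are made-up placeholders, not scraped data.

```python
def collect_unique_items(pages):
    """Collect items page by page, stopping at the first page with no new items.

    `pages` is any iterable of lists of hashable items (for example tuples).
    """
    seen = set()
    collected = []
    for page_items in pages:
        added = 0
        for item in page_items:
            if item not in seen:
                seen.add(item)
                collected.append(item)
                added += 1
        if added == 0:
            # Nothing new on this page: assume the results have started repeating
            break
    return collected

sample_pages = [
    [("Phone A", "100"), ("Phone B", "200")],
    [("Phone C", "300")],
    [("Phone A", "100"), ("Phone B", "200")],  # page 1 repeated
    [("Phone D", "400")],                      # never reached
]
items = collect_unique_items(sample_pages)
print(len(items))
```

If `pages` is a generator that fetches each page lazily, the pages after the first repeat are never requested.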

In [45]:
# Creating a dataset
dataset = pd.DataFrame(itemList, columns = ['Title', 'Price($)', 'Rating(Out of 5)', 'Customer_Review','URL'] )
dataset

Unnamed: 0,Title,Price($),Rating(Out of 5),Customer_Review,URL
0,"SAMSUNG Galaxy A13 5G Cell Phone, Factory Unlo...",249.99,4.3,1523,https://www.amazon.com/gp/slredirect/picassoRe...
1,"SAMSUNG Galaxy S20 FE 5G Cell Phone, Factory U...",499.99,4.6,10726,https://www.amazon.com/Samsung-Factory-Unlocke...
2,"Samsung Galaxy A12 (SM-A125F/DS) Dual SIM,128 ...",187.50,4.1,1032,https://www.amazon.com/Samsung-SM-A125F-Factor...
3,Moto G stylus | 2020 | Unlocked | Made for US ...,252.00,4.5,6413,https://www.amazon.com/Stylus-Unlocked-Motorol...
4,"Apple iPhone 8, 64GB, Gold - Unlocked (Renewed)",169.99,4.4,45107,https://www.amazon.com/Apple-iPhone-Fully-Unlo...
...,...,...,...,...,...
421,PLMOKN Mobile Phone Live Selfie Ring ​Lights L...,16.99,0,0,https://www.amazon.com/gp/slredirect/picassoRe...
422,Mobile Phone Bracket Gravity Linkage Elastic B...,16.90,0,0,https://www.amazon.com/gp/slredirect/picassoRe...
423,Cel-Fi GO X | 100 dB 4G/5G Cell Phone Signal B...,1099.99,4.6,317,https://www.amazon.com/gp/slredirect/picassoRe...
424,SNKINE New 360° Rotatable and Retractable Car ...,16.99,0,0,https://www.amazon.com/gp/slredirect/picassoRe...


In [46]:
# Get dataset information
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 426 entries, 0 to 425
Data columns (total 5 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   Title             426 non-null    object
 1   Price($)          426 non-null    object
 2   Rating(Out of 5)  426 non-null    object
 3   Customer_Review   426 non-null    object
 4   URL               426 non-null    object
dtypes: object(5)
memory usage: 16.8+ KB


In [47]:
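# Dropping duplicate rows, if any
# Note: keep=False removes every copy of a duplicated row, including the first one;
# use keep='first' instead to retain one copy of each.
dataset = dataset.drop_duplicates(subset=["Title", "Price($)", "Rating(Out of 5)", "Customer_Review"], keep=False)
dataset = dataset.reset_index(drop=True)
dataset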

Unnamed: 0,Title,Price($),Rating(Out of 5),Customer_Review,URL
0,"SAMSUNG Galaxy A13 5G Cell Phone, Factory Unlo...",249.99,4.3,1523,https://www.amazon.com/gp/slredirect/picassoRe...
1,"SAMSUNG Galaxy S20 FE 5G Cell Phone, Factory U...",499.99,4.6,10726,https://www.amazon.com/Samsung-Factory-Unlocke...
2,"Samsung Galaxy A12 (SM-A125F/DS) Dual SIM,128 ...",187.50,4.1,1032,https://www.amazon.com/Samsung-SM-A125F-Factor...
3,Moto G stylus | 2020 | Unlocked | Made for US ...,252.00,4.5,6413,https://www.amazon.com/Stylus-Unlocked-Motorol...
4,"Apple iPhone 8, 64GB, Gold - Unlocked (Renewed)",169.99,4.4,45107,https://www.amazon.com/Apple-iPhone-Fully-Unlo...
...,...,...,...,...,...
184,"Rugged Smartphone, Blackview BV4900 Pro Waterp...",149.99,4.0,487,https://www.amazon.com/Smartphone-Blackview-BV...
185,"Smartphone,5.45 inch HD Full Screen face Recog...",78.99,2.8,63,https://www.amazon.com/Smartphone-Screen-Recog...
186,"Unlocked Rugged Phone, Ulefone Armor X10 Andro...",127.49,4.4,4,https://www.amazon.com/Unlocked-Ulefone-Androi...
187,SNKINE New 360° Rotatable and Retractable Car ...,16.99,0,0,https://www.amazon.com/gp/slredirect/picassoRe...
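
The `keep` argument of `drop_duplicates` decides which copies of a duplicated row survive. A small sketch on made-up rows (not scraped data) shows the difference between `keep=False`, as used above, and `keep="first"`:

```python
import pandas as pd

# Toy rows for illustration only; "Phone A" appears twice
toy = pd.DataFrame(
    {"Title": ["Phone A", "Phone B", "Phone A"], "Price($)": ["100", "200", "100"]}
)

# keep=False drops every row that has a duplicate, including the first copy
dropped_all = toy.drop_duplicates(subset=["Title", "Price($)"], keep=False)

# keep="first" retains one copy of each duplicated row
kept_first = toy.drop_duplicates(subset=["Title", "Price($)"], keep="first")

print(len(dropped_all), len(kept_first))
```

With `keep=False`, a product that shows up on several result pages disappears from the dataset entirely, so the choice depends on whether repeated listings should be kept once or discarded.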


In [None]:
# Converting the Price($) column from string to float (after removing thousands separators)
dataset['Price($)'] = dataset['Price($)'].str.replace(',','')
dataset['Price($)'] = dataset['Price($)'].astype(float)

# Converting the Customer_Review column from string to int (after removing thousands separators)
dataset['Customer_Review'] = dataset['Customer_Review'].str.replace(',', '')
dataset['Customer_Review'] = dataset['Customer_Review'].astype(int)

# Converting the 'Rating(Out of 5)' column from string to float
dataset['Rating(Out of 5)'] = dataset['Rating(Out of 5)'].astype(float)
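
The `astype` conversions above raise an error if any scraped value is not a clean number. A minimal sketch of a more forgiving variant, using `pandas.to_numeric` with `errors="coerce"` on made-up values (the `"n/a"` entry is a hypothetical bad value, not something observed in the scraped data):

```python
import pandas as pd

# Made-up price strings; the last one cannot be parsed as a number
raw = pd.Series(["1,099.99", "16.90", "n/a"])

# Strip thousands separators, then convert; unparseable values become NaN
clean = pd.to_numeric(raw.str.replace(",", "", regex=False), errors="coerce")

print(clean.isna().sum())
```

Rows that end up as NaN can then be inspected or dropped before sorting.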

In [51]:
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 189 entries, 0 to 188
Data columns (total 5 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Title             189 non-null    object 
 1   Price($)          189 non-null    float64
 2   Rating(Out of 5)  189 non-null    float64
 3   Customer_Review   189 non-null    int64  
 4   URL               189 non-null    object 
dtypes: float64(2), int64(1), object(2)
memory usage: 7.5+ KB


In [49]:
# Sorting rows by number of customer reviews, then by rating, both in descending order
dataset = dataset.sort_values(by=['Customer_Review', 'Rating(Out of 5)'],ascending=False)
dataset = dataset.reset_index(drop=True)
dataset

Unnamed: 0,Title,Price($),Rating(Out of 5),Customer_Review,URL
0,"Apple iPhone 7, 32GB, Gold - Fully Unlocked (R...",95.86,4.2,48681,https://www.amazon.com/Apple-iPhone-Fully-Unlo...
1,"Apple iPhone 8, 64GB, Gold - Unlocked (Renewed)",169.99,4.4,45107,https://www.amazon.com/Apple-iPhone-Fully-Unlo...
2,"OnePlus Nord CE 2,​ 5G Unlocked Android Smartp...",347.00,4.3,18699,https://www.amazon.com/OnePlus-Unlocked-Androi...
3,"Samsung Galaxy Note 9, 128GB, Lavender Purple ...",214.99,4.4,15036,https://www.amazon.com/Samsung-Unlocked-Warran...
4,"Samsung Galaxy S10+, 128GB, Prism Black - Unlo...",217.00,4.4,12616,https://www.amazon.com/Samsung-Factory-Unlocke...
...,...,...,...,...,...
184,Evil Eye Beaded Phone Charm Pearl Beaded Phone...,7.90,0.0,0,https://www.amazon.com/Beaded-Mobile-Bracelet-...
185,WWJ 4G Smartphone SIM Free Android Phones Unlo...,118.95,0.0,0,https://www.amazon.com/WWJ-Smartphone-Android-...
186,"Unlocked Phone KXD 6C Android Cell Phones 5.5""...",74.99,0.0,0,https://www.amazon.com/Unlocked-KXD-6C-Android...
187,Realme C35 128GB 4GB RAM Factory Unlocked (GSM...,204.00,0.0,0,https://www.amazon.com/Realme-128GB-Factory-Un...


In [50]:
# Saving dataset to Excel Format
dataset.to_excel("Item_Listing.xlsx",index = False)

References:

*   https://beautiful-soup-4.readthedocs.io/en/latest/
*   https://github.com/jhnwr/amazon-pagination
*   https://github.com/AlexTheAnalyst/PortfolioProjects/blob/main/Amazon%20Web%20Scraper%20Project.ipynb




