<a href="https://colab.research.google.com/github/Vishalksinghh/Web_Scraping/blob/main/Amazon_Web_Scraping.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# <center>Scraping items on Amazon using Selenium Python</center>

![banner-image](https://i.imgur.com/75ngH1z.jpeg)


[Amazon.com](https://www.amazon.com) is an American multinational technology company which focuses on e-commerce, cloud computing, digital streaming, and artificial intelligence. It has been referred to as "one of the most influential economic and cultural forces in the world," and is one of the world's most valuable brands. It is one of the Big Five American information technology companies, alongside Alphabet, Apple, Meta, and Microsoft.

In this project we'll retrieve information about items on Amazon using _web scraping_: the process of extracting information from a website in an automated fashion using code. To achieve that, we'll use the Python library [Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) together with selenium.webdriver and kora.selenium to fetch, parse, and extract the information we need from the web page.



## Web Scraping 

>### Q1. What is Web Scraping?
>Web scraping is the process of extracting and parsing data from websites in an automated fashion using a computer program. It's a useful technique for creating datasets for research and learning. While web scraping often involves parsing and processing [HTML documents](https://developer.mozilla.org/en-US/docs/Web/HTML), some platforms also offer [REST APIs](https://www.smashingmagazine.com/2018/01/understanding-using-rest-api/) to retrieve information in a machine-readable format like [JSON](https://www.digitalocean.com/community/tutorials/an-introduction-to-json).
>
>### Q2. How does web scraping work?
![](https://i.imgur.com/iv6RhmW.png)
>
>To understand web scraping, it’s important to first understand that web pages are built with text-based mark-up languages – the most common being `HTML`.
>
>A mark-up language defines the structure of a website’s content. Because mark-up languages share universal components and tags, it is much easier for web scrapers to pull out the information they need.
>Once the HTML is parsed, the scraper extracts the necessary data and stores it.  
>
>**Note:** Not all websites allow web scraping, especially when users' personal information is involved, so we should always make sure that we do not collect more than we need and do not take information that belongs to someone else.
>Websites generally have protections in place, and they may block our access if they see us scraping a large amount of data.
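
To make the parse-then-extract idea concrete, here is a minimal sketch that uses only Python's standard-library `html.parser` module. The `SAMPLE_HTML` string and the `HeadingCollector` class are made up for illustration; the rest of this project uses Beautiful Soup on real page source instead.

```python
from html.parser import HTMLParser

# Hypothetical HTML standing in for a downloaded page.
SAMPLE_HTML = """
<html><body>
  <div class="item"><h2>Smart Watch A</h2><span class="price">1,499</span></div>
  <div class="item"><h2>Smart Watch B</h2><span class="price">2,999</span></div>
</body></html>
"""


class HeadingCollector(HTMLParser):
    """Collect the text found inside every <h2> element."""

    def __init__(self):
        super().__init__()
        self.in_heading = False
        self.headings = []

    def handle_starttag(self, tag, attrs):
        # Start a new, empty heading whenever an <h2> opens.
        if tag == "h2":
            self.in_heading = True
            self.headings.append("")

    def handle_endtag(self, tag):
        if tag == "h2":
            self.in_heading = False

    def handle_data(self, data):
        # Only keep text that appears between <h2> and </h2>.
        if self.in_heading:
            self.headings[-1] += data


parser = HeadingCollector()
parser.feed(SAMPLE_HTML)
print(parser.headings)
```

The parser walks the tags in order, and the scraper decides which ones carry the data it wants to store.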

### Project Goal

The project goal is to build a web scraper that retrieves information about products on Amazon for a given search term and assembles it into a single CSV file. The format of the output CSV file is shown below:

![](https://i.imgur.com/SkJo9yB.jpeg)



Here's an outline of the steps we'll follow:

1. Install and import required libraries
2. Parse the HTML source code using Beautiful Soup
3. Extract each item's description, price, rating, review count, and URL from the web page
4. Compile extracted information into Python lists and dictionaries
5. Save the extracted information to a CSV file.





# Install and import required project libraries

In [None]:
!pip install kora --upgrade --quiet
!pip install pandas --upgrade --quiet
!pip install beautifulsoup4 --upgrade --quiet

In [None]:
# Set up Selenium Webdriver on Colab
from kora.selenium import wd
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By

In [None]:
options = Options()
options.add_argument('--no-sandbox')
options.add_argument('--headless')
options.add_argument('--disable-dev-shm-usage')
options.add_argument('--disable-blink-features=AutomationControlled')
driver = webdriver.Chrome('chromedriver',options=options)

In [None]:
url = 'https://www.amazon.com'

# Using the .get() method of the driver to load the url
driver.get(url) 

# Once the page loads successfully, We can use the .title attribute to access the textual title of the webpage
print(driver.title)

Amazon.com. Spend less. Smile more.


In [None]:
def get_url(search_term):

  """Generate a url from search term"""

  template  = 'https://www.amazon.in/s?k={}'
  search_term = search_term.replace(' ', '+')

  return template.format(search_term) 

### Getting url from search term


![](https://imgur.com/g6PgHJL.jpeg)

In [None]:
url = get_url('smart wathces for men')

In [None]:
print(url)

https://www.amazon.in/s?k=smart+wathces+for+men
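
Replacing spaces with `+` works for simple terms, but a search term containing characters such as `&` would need fuller encoding. As a hedged sketch, the standard library's `urllib.parse.quote_plus` handles both cases; `build_search_url` is a hypothetical helper name, not part of the notebook.

```python
from urllib.parse import quote_plus


def build_search_url(search_term, base="https://www.amazon.in/s?k="):
    """Return a search URL with the term encoded for use in a query string."""
    # quote_plus turns spaces into '+' and escapes reserved characters.
    return base + quote_plus(search_term)


simple_url = build_search_url("smart watches for men")
special_url = build_search_url("boat & noise")
print(simple_url)
print(special_url)
```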


In [None]:
driver.get(url)

# Extract the collection

### **Let's create a soup object which will parse HTML content from the page source** 
*Try to identify something unique about a record that will enable us to extract all the records from this page as a collection. The best way to find this is to use the document inspector: right-click the item we want to inspect, such as the heading, and click Inspect.*

![](https://imgur.com/blB8RLG.jpeg)

In [None]:
from bs4 import BeautifulSoup

In [None]:
soup = BeautifulSoup(driver.page_source, 'html.parser')

In [None]:
results = soup.find_all('div', {'data-component-type': 's-search-result'})

In [None]:
len(results)

16

# Prototype the record

Let's fetch the required data from the first item in the results list. Later on, we will use this record prototype to fetch information for all items in the results list.

In [None]:
item = results[0]

In [None]:
atag = item.h2.a

In [None]:
description = atag.text.strip()

In [None]:
url = 'https://www.amazon.in' + atag.get('href')

In [None]:
price_parent = item.find('span', 'a-price')

In [None]:
price = price_parent.find('span', 'a-offscreen').text

In [None]:
rating = item.i.text

In [None]:
review_count = item.find('span', {'class': 'a-size-base s-underline-text'}).text

# Generalize the pattern

We'll write a function to extract the description, price, rating, review count, and URL.

In [None]:
def extract_record(item):
  """Extract and return data from a single record"""

  # description and url
  atag = item.h2.a
  description = atag.text.strip()
  url = 'https://www.amazon.in' + atag.get('href')

  # price
  price_parent = item.find('span', 'a-price')
  price = price_parent.find('span', 'a-offscreen').text

  # rating and review count
  rating = item.i.text
  review_count = item.find('span', {'class': 'a-size-base s-underline-text'}).text

  return{'Description': description,
         'Price': price,
         'Rating': rating,
         'Review_count': review_count,
         'URL': url 
         }

In [None]:
records = [extract_record(item) for item in results]

**_This raises an error because our code assumes that every piece of information is available for every result. Some records, however, have no price, rating, or review count. For those, the lookup returns `None`, and reading `.text` from `None` raises an `AttributeError`._**

![](https://imgur.com/d1jo2WG.jpeg)

# Error Handling

In [None]:
def extract_record(item):
  """Extract and return data from a single record"""

  # description and url
  atag = item.h2.a
  description = atag.text.strip()
  url = 'https://www.amazon.in' + atag.get('href')
  

  try:
    # price
    price_parent = item.find('span', 'a-price')
    price = price_parent.find('span', 'a-offscreen').text
  except AttributeError:
    # skip results that have no price
    return

  try:
    # rating and review count
    rating = item.i.text
    review_count = item.find('span', {'class': 'a-size-base s-underline-text'}).text
  except AttributeError:
    rating = ''
    review_count = ''

  return{'Description': description,
         'Price': price,
         'Rating': rating,
         'Review_count': review_count,
         'URL': url 
         }
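
An alternative to wrapping each lookup in `try`/`except` is a small helper that checks for `None` before reading the text. The sketch below runs on made-up markup, not on a real Amazon page; `text_or_default`, `SAMPLE_HTML`, and the `result` class name are hypothetical and chosen only for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the second result has no price element.
SAMPLE_HTML = """
<div class="result"><h2>Watch A</h2><span class="a-offscreen">1,499</span></div>
<div class="result"><h2>Watch B</h2></div>
"""


def text_or_default(tag, default=""):
    """Return the stripped text of a tag, or a default when find() gave None."""
    return tag.get_text(strip=True) if tag is not None else default


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
rows = []
for result in soup.find_all("div", class_="result"):
    title = text_or_default(result.find("h2"))
    # A missing price becomes None instead of raising AttributeError.
    price = text_or_default(result.find("span", class_="a-offscreen"), default=None)
    rows.append((title, price))

print(rows)
```

Whether to skip a record or keep it with an empty field is a design choice; the function above in this notebook skips records without a price and keeps those without a rating.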

In [None]:
records = [extract_record(item) for item in results]

In [None]:
records[:5]

[{'Description': 'boAt Xtend Smartwatch with Alexa Built-in, 1.69” HD Display, Multiple Watch Faces, Stress Monitor, Heart & SpO2 Monitoring, 14 Sports Modes, Sleep Monitor, 5 ATM & 7 Days Battery(Pitch Black)',
  'Price': '₹2,999',
  'Rating': '4.2 out of 5 stars',
  'Review_count': '97,713',
  'URL': 'https://www.amazon.in/boAt-Smartwatch-Multiple-Monitoring-Resistance/dp/B096VF5YYF/ref=sr_1_1?keywords=smart+watches+for+men&qid=1661092984&sr=8-1'},
 {'Description': 'Noise ColorFit Pulse Spo2 Smart Watch with 10 days battery life, 60+ Watch Faces, 1.4" Full Touch HD Display Smartwatch, 24*7 Heart Rate Monitor Smart Band, Sleep Monitoring Smart Watches for Men and Women & IP68 Waterproof (Jet Black)',
  'Price': '₹1,499',
  'Rating': '4.0 out of 5 stars',
  'Review_count': '53,098',
  'URL': 'https://www.amazon.in/Noise-ColorFit-Smartwatch-Monitoring-Waterproof/dp/B097R25DP7/ref=sr_1_2?keywords=smart+watches+for+men&qid=1661092984&sr=8-2'},
 {'Description': "Fire-Boltt India's No 1 Sma
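
The scraped values are all strings, which makes sorting or filtering by price or rating awkward. As a sketch, assuming the formats seen in this output (`'₹2,999'`, `'4.2 out of 5 stars'`, `'97,713'`), the hypothetical helpers below convert them to numbers with the standard-library `re` module.

```python
import re


def parse_price(text):
    """Convert a string such as '₹2,999' to 2999.0; None if no number is found."""
    match = re.search(r"\d[\d,]*(?:\.\d+)?", text)
    return float(match.group().replace(",", "")) if match else None


def parse_rating(text):
    """Convert a string such as '4.2 out of 5 stars' to 4.2; None if empty."""
    match = re.match(r"\s*(\d+(?:\.\d+)?)", text)
    return float(match.group(1)) if match else None


def parse_review_count(text):
    """Convert a string such as '97,713' to 97713; None if it has no digits."""
    digits = re.sub(r"[^\d]", "", text)
    return int(digits) if digits else None


print(parse_price("₹2,999"), parse_rating("4.2 out of 5 stars"), parse_review_count("97,713"))
```

Records whose rating or review count was stored as an empty string come back as `None`, so they can be filtered out before any analysis.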

# Getting the next page

We'll modify the `get_url` function used above so that it can also request the next pages of results.

In [None]:
def get_url(search_term):
  """Generate a url from search term"""
  template  = 'https://www.amazon.in/s?k={}'
  search_term = search_term.replace(' ', '+')

  # add term query to url
  url = template.format(search_term)

  # add page query placeholder
  url += '&page={}'

  return url
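
The page number is filled in later, once per page, while looping over the results. As an illustrative alternative to a `{}` placeholder, the sketch below builds each page URL with the standard library's `urllib.parse.urlencode`, which also takes care of encoding the search term. The `page_urls` name is hypothetical, and the `k` and `page` query parameters are taken from the URLs used in this notebook.

```python
from urllib.parse import urlencode


def page_urls(search_term, pages, base="https://www.amazon.in/s"):
    """Yield one search URL per page number, from 1 through `pages`."""
    for page in range(1, pages + 1):
        yield base + "?" + urlencode({"k": search_term, "page": page})


urls = list(page_urls("smart watches", 3))
for page_url in urls:
    print(page_url)
```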

# Putting All code together

Let's combine the code written above into a final script and create a `main` function that takes the `search_term`, the product or item we wish to search for on [Amazon](https://www.Amazon.com).

In [None]:
import csv
import pandas as pd
from bs4 import BeautifulSoup

from kora.selenium import wd
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By



def get_url(search_term):
  """Generate a url from search term"""
  template  = 'https://www.amazon.in/s?k={}'
  search_term = search_term.replace(' ', '+')

  # add term query to url
  url = template.format(search_term)

  # add page query placeholder
  url += '&page={}'

  return url


def extract_record(item):
  """Extract and return data from a single record"""

  # description and url
  atag = item.h2.a
  description = atag.text.strip()
  url = 'https://www.amazon.in' + atag.get('href')
  

  try:
    # price
    price_parent = item.find('span', 'a-price')
    price = price_parent.find('span', 'a-offscreen').text
  except AttributeError:
    # skip results that have no price
    return

  try:
    # rating and review count
    rating = item.i.text
    review_count = item.find('span', {'class': 'a-size-base s-underline-text'}).text
  except AttributeError:
    rating = ''
    review_count = ''

  return{'Description': description,
         'Price': price,
         'Rating': rating,
         'Review_count': review_count,
         'URL': url 
         } 

def main(search_term):
  """Run main program routine"""

  # startup the webdriver  
  options = Options()
  options.add_argument('--no-sandbox')
  options.add_argument('--headless')
  options.add_argument('--disable-dev-shm-usage')
  options.add_argument('--disable-blink-features=AutomationControlled')
  driver = webdriver.Chrome('chromedriver',options=options)

  records = []
  url = get_url(search_term)

  for page in range(1, 21):
    driver.get(url.format(page))
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    results = soup.find_all('div', {'data-component-type': 's-search-result'})

    for item in results:
      record = extract_record(item)
      if record:
        records.append(record)


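  # shut down the browser once all pages have been read
  driver.quit()

  # save the data to a csv file
  search_records = pd.DataFrame(records)
  search_records.to_csv('records.csv', index=False)

  # return the records so they can be inspected in the notebook
  return records

In [None]:
records = main('smart wathces for men')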

In [None]:
search_records = pd.DataFrame(records)
search_records

| Unnamed: 0 | Description | Price | Rating | Review_count | URL |
|---|---|---|---|---|---|
| 0 | boAt Xtend Smartwatch with Alexa Built-in, 1.6... | ₹2,999 | 4.2 out of 5 stars | 97713 | https://www.amazon.in/boAt-Smartwatch-Multiple... |
| 1 | Noise ColorFit Pulse Spo2 Smart Watch with 10 ... | ₹1,499 | 4.0 out of 5 stars | 53098 | https://www.amazon.in/Noise-ColorFit-Smartwatc... |
| 2 | Fire-Boltt India's No 1 Smartwatch Brand Talk ... | ₹2,999 | 4.3 out of 5 stars | 10228 | https://www.amazon.in/Fire-Boltt-Smartwatch-Bl... |
| 3 | boAt Flash Edition Smart Watch with Activity T... | ₹2,499 | 4.0 out of 5 stars | 23067 | https://www.amazon.in/boAt-Flash-Smartwatch-Re... |
| 4 | boAt Xtend Smartwatch with Alexa Built-in, 1.6... | ₹2,999 | 4.2 out of 5 stars | 97713 | https://www.amazon.in/boAt-Display-Multiple-Mo... |
| 5 | boAt Xtend Smart Watch with Alexa Built-in, 1.... | ₹2,999 | 4.2 out of 5 stars | 97713 | https://www.amazon.in/boAt-Smartwatch-Multiple... |
| 6 | boAt Wave Lite Smartwatch with 1.69" HD Displa... | ₹1,799 | 4.0 out of 5 stars | 9627 | https://www.amazon.in/boAt-Wave-Lite-Smartwatc... |
| 7 | Noise ColorFit Pulse Grand Smart Watch with 1.... | ₹1,999 | 4.0 out of 5 stars | 15202 | https://www.amazon.in/Noise-ColorFit-Display-M... |
| 8 | boAt Wave Call Smart Watch with Bluetooth Call... | ₹2,999 | 4.0 out of 5 stars | 1231 | https://www.amazon.in/boAt-Wave-Call-Bluetooth... |
| 9 | Fire-Boltt Ninja 2 SpO2 Full Touch Smartwatch ... | ₹1,699 | 4.2 out of 5 stars | 32226 | https://www.amazon.in/Fire-Boltt-Smartwatch-Wo... |
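
Prices such as `₹2,999` contain a comma, so the CSV writer has to quote those fields for the file to read back correctly. pandas does this automatically; the sketch below shows the same behaviour with the standard-library `csv` module, writing to an in-memory buffer instead of a file. The two sample records and their `example.com` URLs are made up for illustration.

```python
import csv
import io

# Hypothetical records in the same shape as the scraper's output.
records = [
    {"Description": "Watch A", "Price": "₹2,999", "Rating": "4.2 out of 5 stars",
     "Review_count": "97,713", "URL": "https://example.com/a"},
    {"Description": "Watch B", "Price": "₹1,499", "Rating": "",
     "Review_count": "", "URL": "https://example.com/b"},
]

# Write the records as CSV text; fields containing commas are quoted.
buffer = io.StringIO(newline="")
writer = csv.DictWriter(buffer, fieldnames=list(records[0].keys()))
writer.writeheader()
writer.writerows(records)
csv_text = buffer.getvalue()

# Read the text back and confirm that nothing was split or lost.
loaded = list(csv.DictReader(io.StringIO(csv_text, newline="")))
print(loaded == records)
```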


# Summary

Here is what we covered:


*   Downloaded the webpage using the `driver.get()` method of Selenium
*   Parsed the HTML source code using Beautiful Soup
*   Extracted the product information with Python functions
*   Saved the extracted information to a CSV file





# Future Work


*   We can fetch information about any Amazon product by passing its search term to the `main()` function created above

*   We can import sqlalchemy and analyze the data, using SQL queries on the rating, price, and review_count columns to find the best product in the list
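
As a rough sketch of the second idea, the query below ranks a few made-up rows by rating and then by price. It uses the standard-library `sqlite3` module rather than sqlalchemy so that it runs without extra installs. The table name `products`, the sample rows, and the 10,000-review threshold are all assumptions for illustration, and the numeric columns assume the scraped strings have already been converted to numbers.

```python
import sqlite3

# Hypothetical, already-cleaned rows: description, price, rating, review_count.
rows = [
    ("Watch A", 2999.0, 4.2, 97713),
    ("Watch B", 1499.0, 4.0, 53098),
    ("Watch C", 2999.0, 4.3, 10228),
    ("Watch D", 999.0, 4.8, 12),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE products "
    "(description TEXT, price REAL, rating REAL, review_count INTEGER)"
)
conn.executemany("INSERT INTO products VALUES (?, ?, ?, ?)", rows)

# Keep well-reviewed products, then prefer higher ratings and lower prices.
query = """
    SELECT description, price, rating, review_count
    FROM products
    WHERE review_count >= 10000
    ORDER BY rating DESC, price ASC
    LIMIT 2
"""
best = conn.execute(query).fetchall()
conn.close()
print(best)
```

The review-count filter keeps a product with a handful of perfect ratings from outranking one with thousands of reviews.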







# References



* [Introduction to Web Scraping and REST APIs](https://jovian.ai/vishal20jun/python-web-scraping-and-rest-api-53308)
* [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)
* [Selenium tutorial](https://www.browserstack.com/guide/python-selenium-to-run-web-automation-test)
* [Selenium Tutorial for Beginners with Deployment to AWS Lambda](https://www.youtube.com/watch?v=FcW-AXsirBE&t=2904s)







