## Automated Web Scraper For Amazon.com

In this tutorial we will build an automated web scraper that extracts data from amazon.com, which we can then use for any data analysis, data science or machine learning project.

Before we get started, let me make it clear that Amazon has tight security on its platform, and some of the things you can easily do on other web pages won't work on the Amazon platform.

Previously, we could have used Beautiful Soup and Requests to easily get titles from the page, but things have changed a little. We will still use Beautiful Soup, but in a different way.

Let's see how we can do things differently now.

### Installation
We will be using:

1. Selenium
2. BeautifulSoup
3. WebDrivers

Uncomment the following cells to run the installation.

In [1]:
#!pip install selenium
#!pip install msedge-selenium-tools
#!pip install bs4

In [2]:
# For Chrome
# Install the version that corresponds to the version of your browser
#!pip install chromedriver_binary==87.0.4280.88

### Setting up the Selenium WebDriver for Mozilla Firefox

1. Go to https://github.com/mozilla/geckodriver/releases
2. Download GeckoDriver for your OS (mine is the win64 zip) and extract it
3. Right-click the extracted GeckoDriver file > go into Properties > copy the location
4. Pass that location to webdriver.Firefox as executable_path, as in the cells below

In [1]:
from selenium import webdriver
from selenium.webdriver.firefox.firefox_binary import FirefoxBinary

# Example: create a browser instance by pointing Selenium at the downloaded GeckoDriver
# driver = webdriver.Firefox(executable_path=r'E:\Data Science Project\Selinium Driver\geckodriver.exe')

# Tell the browser to open a page: driver.get('<website URL>')
# driver.get('http://www.youtube.com')

# path = r'E:\Data Science Project\Selinium Driver\geckodriver.exe'

# If Firefox is installed in a non-default location, pass its binary explicitly
# binary = FirefoxBinary('path/to/installed firefox binary')
# browser = webdriver.Firefox(firefox_binary=binary)
# browser = webdriver.Firefox()

# For Chrome
# import chromedriver_binary

# For Microsoft Edge
# from msedge.selenium_tools import Edge, EdgeOptions

import csv

In [2]:
url= 'https://www.amazon.in/'

In [3]:
# Create an instance of the web browser
driver = webdriver.Firefox(executable_path=r'E:\Data Science Project\Selinium Driver\geckodriver.exe')
# Tell the browser to open the given website: driver.get('<website URL>')
driver.get(url)
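
In Selenium 4, the `executable_path` argument on `webdriver.Firefox` was deprecated and later removed; the driver location is passed through a `Service` object instead. A minimal sketch, assuming a Selenium 4 install (the helper name `make_firefox_driver` is made up for this example and is not part of the original notebook):

```python
def make_firefox_driver(driver_path):
    """Start Firefox with an explicit GeckoDriver location (Selenium 4 style)."""
    # Imported inside the function so the helper can be defined without Selenium being loaded
    from selenium import webdriver
    from selenium.webdriver.firefox.service import Service

    # The Service object carries the path to the GeckoDriver executable
    service = Service(executable_path=driver_path)
    return webdriver.Firefox(service=service)


# Usage (opens a real browser window, so it is left commented out):
# driver = make_firefox_driver(r'E:\Data Science Project\Selinium Driver\geckodriver.exe')
# driver.get('https://www.amazon.in/')
# driver.quit()
```

If the cell above fails with an unexpected keyword argument error, this form is worth trying before anything else.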

If we type any keyword into the Amazon search bar, we see that the search term is embedded in the search URL. Using this pattern, we can create a generic function that builds the URL our driver needs to retrieve.

#### We will define a function that we can later reuse just by passing in the keyword to search for on Amazon, as shown below

In [4]:
def my_url(keyword):
    # Original search URL: 'https://www.amazon.com/s?k=phone+case&ref=nb_sb_noss_1'
    # Replace 'phone+case' with {} to make the URL generic
    temp = 'https://www.amazon.in/s?k={}&ref=nb_sb_noss_1'  # a template URL
    keyword = keyword.replace(' ', '+')
    return temp.format(keyword)

In [5]:
#we can pass our keyword in the function to get the correct url of the keyword on amazon.com

url=my_url('mobile') 
url

'https://www.amazon.in/s?k=mobile&ref=nb_sb_noss_1'

In [6]:
#we can even pass multiple keywords
url1=my_url('samsung mobile')
url1

'https://www.amazon.in/s?k=samsung+mobile&ref=nb_sb_noss_1'
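
Replacing spaces with `+` covers the keywords used here, but a keyword containing characters such as `&` would still break the query string. A minimal sketch, assuming the same URL template as `my_url`, that leaves the encoding to the standard library's `urllib.parse.quote_plus` (the name `build_search_url` is made up for this example):

```python
from urllib.parse import quote_plus


def build_search_url(keyword, base='https://www.amazon.in/s'):
    """Build a search URL, encoding the keyword for use in a query string."""
    # quote_plus turns spaces into '+' and escapes reserved characters such as '&'
    return '{}?k={}&ref=nb_sb_noss_1'.format(base, quote_plus(keyword))


print(build_search_url('samsung mobile'))
print(build_search_url('pen & paper'))
```

For plain keywords this gives the same URL as `my_url`, so it can be swapped in without changing the rest of the notebook.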

In [7]:
#this will open in your browser and return the page for your keyword
driver.get(url)

#### BeautifulSoup

Now let's study the Amazon results page a little.

The page is quite structured, although a few records need special handling. What we want to do is extract data from each record. There are also multiple pages (e.g. a single keyword search can return 1-20 pages).

Just like we did earlier on, we need to access the HTML of the page in order to extract the required data. We will create a Beautiful Soup object for this.

In [8]:
from bs4 import BeautifulSoup

soup = BeautifulSoup(driver.page_source,'html.parser')

Just like we did earlier on, we can find a tag that contains the data we want. We can right-click on, say, a title on the search results page and choose Inspect. We find that each div tag with data-component-type="s-search-result" contains the data for one record.

We can select them as shown below:

In [9]:
soup_results=soup.find_all('div',{'data-component-type':'s-search-result'})

In [10]:
len(soup_results)

18

We will now work through a single record to build a template that we can base a generic extraction function on.

In [11]:
# we will assign our first result to an obj

obj=soup_results[0]

In [12]:
atag = obj.h2.a  # the <a> tag inside the record's <h2>; it holds the title text and the link

In [13]:
des = atag.text.strip()

In [14]:
 #we can see below that we have the title correctly scraped
des

'Tecno Spark Go 2021 (Galaxy Blue 2GB RAM, 32GB Storage)|5000mAh| 6.52" Display Smartphone'

In [15]:
# let's now build the full product URL; the href already starts with '/', so the base has no trailing slash

url = 'https://www.amazon.in' + atag.get('href')
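
String concatenation works as long as every href is a site-relative path. A minimal sketch using the standard library's `urllib.parse.urljoin`, which copes with a trailing slash on the base and with hrefs that are already absolute (the helper name `absolute_url` and the sample paths are made up for this example):

```python
from urllib.parse import urljoin

BASE_URL = 'https://www.amazon.in/'


def absolute_url(href, base=BASE_URL):
    """Resolve an href from the results page against the site's base URL."""
    return urljoin(base, href)


# A site-relative path is attached to the base without doubling the slash
print(absolute_url('/dp/EXAMPLE123'))
# An href that is already absolute is returned unchanged
print(absolute_url('https://www.amazon.in/dp/EXAMPLE123'))
```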

### Get the Price

In [16]:
# let's get the price the same way we got the title, this time looking for the tag that contains the price of the item

# the 'span' with class 'a-price' wraps the price; inside it, the 'span' with class 'a-offscreen' holds the actual price text

parent=obj.find('span','a-price')

price=parent.find('span','a-offscreen').text

price

'₹7,299'
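
The scraped price is a string with a currency symbol and thousands separators, so it cannot be summed or averaged as it stands. A minimal sketch of one way to turn it into a number, assuming prices look like the one above (the helper name `parse_price` is made up for this example):

```python
import re
from decimal import Decimal


def parse_price(text):
    """Convert a price string such as '₹7,299' to a Decimal, or None if it has no number."""
    # Grab the first run of digits, allowing commas and an optional decimal part
    match = re.search(r'\d[\d,]*(?:\.\d+)?', text)
    if match is None:
        return None
    return Decimal(match.group().replace(',', ''))


print(parse_price('₹7,299'))
```

`Decimal` is used rather than `float` so that prices keep their exact value; converting with `float(...)` instead would also work for rough analysis.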

### Get the Rating

In [17]:
# we will do the same thing for the rating


rate=obj.find('span','a-icon-alt').text
rate

'4.3 out of 5 stars'

In [18]:
# alternatively, the first <i> tag in this record holds the same rating text
obj.i.text

'4.3 out of 5 stars'
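
The rating arrives as a sentence rather than a number. A minimal sketch for pulling out the numeric part, assuming the text always starts with the score as it does above (the helper name `parse_rating` is made up for this example):

```python
def parse_rating(text):
    """Return the leading number of a string like '4.3 out of 5 stars' as a float."""
    try:
        return float(text.split()[0])
    except (IndexError, ValueError):
        # Empty text, or text that does not start with a number
        return None


print(parse_rating('4.3 out of 5 stars'))
```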

### Get the Review Count

In [31]:
# we can get the number of customers who have reviewed the item as well

counts_review=obj.find('span',{'class':'a-size-base','dir':'auto'}).text
counts_review

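AttributeError: 'NoneType' object has no attribute 'text'

For this record obj.find matched nothing and returned None, so calling .text on it fails. The generic function below wraps this lookup in a try/except for that reason.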

In [20]:
obj.img

<img alt='Sponsored Ad - Tecno Spark Go 2021 (Galaxy Blue 2GB RAM, 32GB Storage)|5000mAh| 6.52" Display Smartphone' class="s-image" data-image-index="1" data-image-latency="s-product-image" data-image-load="" data-image-source-density="1" src="https://m.media-amazon.com/images/I/71tFDYqv1zL._AC_UY218_.jpg" srcset="https://m.media-amazon.com/images/I/71tFDYqv1zL._AC_UY218_.jpg 1x, https://m.media-amazon.com/images/I/71tFDYqv1zL._AC_UY327_QL65_.jpg 1.5x, https://m.media-amazon.com/images/I/71tFDYqv1zL._AC_UY436_QL65_.jpg 2x, https://m.media-amazon.com/images/I/71tFDYqv1zL._AC_UY545_QL65_.jpg 2.5x, https://m.media-amazon.com/images/I/71tFDYqv1zL._AC_UY654_QL65_.jpg 3x"/>

## Generic Function
We will now combine the snippets above into generic functions that extract all the data at once.

In [34]:
from selenium import webdriver
import chromedriver_binary
from bs4 import BeautifulSoup
# for Microsoft Edge
from msedge.selenium_tools import Edge, EdgeOptions
import csv

# We will be using functions to achieve this

def my_url(keyword):
    temp = 'https://www.amazon.in/s?k={}&ref=nb_sb_noss_1'
    keyword = keyword.replace(' ', '+')
    
    # Add Term Query To URL
    url = temp.format(keyword)
    
    # Add Page Query Placeholder ('page=' needs the '=' for the page number to take effect)
    url += '&page={}'
    
    return url

def extract_record(obj):
    atag = obj.h2.a
    description = atag.text.strip()
    url = 'https://www.amazon.in' + atag.get('href')
    
    # Some items on Amazon may lack one of the fields we are looking for
    # (e.g. a rating or a price), which would raise an error, so we add error handlers.
    # No price probably means the item is out of stock or unavailable, so we skip it;
    # no reviews yet is fine, and we still want to extract the item.
    try:
        parent = obj.find('span', 'a-price')
        price = parent.find('span', 'a-offscreen').text
    except AttributeError:
        # catch the error so we can move on to the next item instead of stopping the program
        return

    try:
        rate = obj.find('span', 'a-icon-alt').text
        counts_review = obj.find('span', {'class': 'a-size-base', 'dir': 'auto'}).text
    except AttributeError:
        # assign empty strings to the rating and the review count
        # (both are blanked if either lookup fails)
        rate = ''
        counts_review = ''

    image = obj.find('img', {'class': 's-image'}).get('src')

    # collect all these items in a tuple and return it
    result = (description, price, rate, counts_review, url, image)
    return result

def main(keyword):
    '''Run Main Program Routine'''
    # Startup The Webdriver
    driver = webdriver.Chrome()
#     options = EdgeOptions()
#     options.use_chromium = True
#     driver = Edge(options=options)

    records = []  # an empty list to hold all of our extracted records
    url = my_url(keyword)

    for page in range(1, 50):  # pages 1 to 49
        driver.get(url.format(page))
        soup = BeautifulSoup(driver.page_source, 'html.parser')
        results = soup.find_all('div', {'data-component-type': 's-search-result'})  # same as we did above

        # extract_record returns None for items without a price, so keep only non-empty records
        for item in results:
            record = extract_record(item)
            if record:
                records.append(record)

    driver.quit()

    # Save Results To CSV File (once, after all pages have been scraped)
    with open('Results.csv', 'w', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        writer.writerow(['Description', 'Price', 'Rating', 'Reviews Count', 'URL', 'Image link'])
        writer.writerows(records)


In [35]:
#we can search for any item to extract data
main('mobile')
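
The page loop and the CSV step can also be kept separate from the browser code, which makes them easy to try out without opening Amazon at all. A minimal sketch, assuming a page past the last one yields no records (an assumption about the site, not something verified here); the names `scrape_pages`, `save_records` and `fetch_records` are made up for this example:

```python
import csv

HEADER = ['Description', 'Price', 'Rating', 'Reviews Count', 'URL', 'Image link']


def scrape_pages(fetch_records, max_pages=20):
    """Collect records page by page, stopping at the first page that returns nothing.

    fetch_records is any function that takes a page number and returns a list
    of record tuples, e.g. a wrapper around driver.get and extract_record.
    """
    records = []
    for page in range(1, max_pages + 1):
        page_records = fetch_records(page)
        if not page_records:
            break  # assume there are no further pages
        records.extend(page_records)
    return records


def save_records(records, path):
    """Write the header and all records to a CSV file in a single pass."""
    with open(path, 'w', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        writer.writerow(HEADER)
        writer.writerows(records)
```

Passing the fetching step in as a function means the loop can be exercised with a stand-in that returns canned tuples before pointing it at a real browser.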

You can scrape as many data points as you like with any keyword, oh yes!!

You can then load the dataset you have scraped and use it for your project.

In [23]:
import pandas as pd

In [36]:
df=pd.read_csv(r"E:\Data Science Project\Untitled Folder\Results.csv")
df

Unnamed: 0,Description,Price,Rating,Reviews Count,URL,Image link
0,"Tecno Spark Go 2021 (Galaxy Blue 2GB RAM, 32GB...","₹7,299",,,https://www.amazon.in/gp/slredirect/picassoRed...,https://m.media-amazon.com/images/I/71tFDYqv1z...
1,"Redmi 9A (Nature Green, 2GB RAM, 32GB Storage)...","₹6,999",,,https://www.amazon.in/Redmi-9A-2GB-32GB-Storag...,https://m.media-amazon.com/images/I/71sxlhYhKW...
2,"Redmi 9 (Sky Blue, 4GB RAM, 64GB Storage) | 2....","₹9,499",,,https://www.amazon.in/Redmi-Sky-Blue-64GB-Stor...,https://m.media-amazon.com/images/I/71A9Vo1Bat...
3,Redmi 9 Power (Mighty Black 4GB RAM 64GB Stora...,"₹11,499",,,https://www.amazon.in/Test-Exclusive_2020_1112...,https://m.media-amazon.com/images/I/71hEzQGO5q...
4,"OPPO A31 (Fantasy White, 6GB RAM, 128GB Storag...","₹12,490",,,https://www.amazon.in/Oppo-Fantasy-Storage-Add...,https://m.media-amazon.com/images/I/61CnyJ-IbM...
...,...,...,...,...,...,...
871,"Redmi 9A (Midnight Black, 2GB RAM, 32GB Storag...","₹6,999",,,https://www.amazon.in/Redmi-9A-Midnight-2GB-32...,https://m.media-amazon.com/images/I/71sxlhYhKW...
872,"Samsung Galaxy M31 (Ocean Blue, 8GB RAM, 128GB...","₹16,999",,,https://www.amazon.in/Samsung-Galaxy-Ocean-128...,https://m.media-amazon.com/images/I/71-Su4Wr0H...
873,Redmi 9A (Sea Blue 2GB RAM 32GB Storage) | 2GH...,"₹6,999",,,https://www.amazon.in/Redmi-9A-2GB-32GB-Storag...,https://m.media-amazon.com/images/I/71sxlhYhKW...
874,"Tecno Spark 7T(Jewel Blue, 4GB RAM, 64GB Stora...","₹9,499",,,https://www.amazon.in/Tecno-Spark-Storage-Batt...,https://m.media-amazon.com/images/I/81aWyRY67S...


In [37]:
df.describe()

Unnamed: 0,Rating,Reviews Count
count,0.0,0.0
mean,,
std,,
min,,
25%,,
50%,,
75%,,
max,,


In [38]:
df.isnull().sum()

Description        0
Price              0
Rating           876
Reviews Count    876
URL                0
Image link         0
dtype: int64
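
In `extract_record`, the rating and the review count share one `try` block, so when the review-count lookup finds nothing (as it did in the earlier cell that raised `AttributeError`), the rating is blanked too. That is one plausible reason both columns are empty here. A minimal sketch that looks the two fields up independently, assuming `obj` is one search-result tag as in `extract_record` (the helper names are made up for this example):

```python
def text_or_blank(tag):
    """Return the stripped text of a tag, or '' when the lookup returned None."""
    return tag.text.strip() if tag is not None else ''


def extract_rating_fields(obj):
    """Look up the rating and the review count separately.

    A missing review count no longer wipes out a rating that was found.
    """
    rate = text_or_blank(obj.find('span', 'a-icon-alt'))
    counts_review = text_or_blank(
        obj.find('span', {'class': 'a-size-base', 'dir': 'auto'})
    )
    return rate, counts_review
```

Whether the review-count selector itself still matches the live page is a separate question; inspecting a result in the browser, as done for the title, is the way to check.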

There Was an Error
> Solution: The problem is with a path string such as "C:\Users\Eric\Desktop\beeline.txt"
Here, \U in "C:\Users... starts an eight-character Unicode escape, such as \U00014321. In that string, the escape is followed by the character 's', which is invalid.  
You either need to double all backslashes:  
"C:\\Users\\Eric\\Desktop\\beeline.txt"  
Or prefix the string with r (to produce a raw string), which is why the path passed to pd.read_csv above starts with r:  
r"C:\Users\Eric\Desktop\beeline.txt"  
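
A short check, using the same example path, that the two fixes produce the same string, plus a third option (forward slashes) that Windows also accepts:

```python
from pathlib import PureWindowsPath

raw_path = r"C:\Users\Eric\Desktop\beeline.txt"          # raw string: backslashes kept as-is
escaped_path = "C:\\Users\\Eric\\Desktop\\beeline.txt"   # each backslash doubled
forward_path = "C:/Users/Eric/Desktop/beeline.txt"       # forward slashes need no escaping

# The raw and escaped spellings are the same string
print(raw_path == escaped_path)

# As Windows paths, the forward-slash spelling points at the same file
print(PureWindowsPath(raw_path) == PureWindowsPath(forward_path))
```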

