Scrape Amazon

Practical applications of web scraping include:

Gathering resumes of candidates with a specific skill,
Extracting tweets with specific hashtags from Twitter,
Lead generation in marketing,
Scraping product details and reviews from e-commerce websites such as Amazon, which is the focus of this tutorial.
Beyond these use cases, web scraping is widely used in natural language processing to extract text from websites for training deep learning models.

Potential Challenges of Web Scraping

One challenge you will come across while scraping information from websites is that every website is structured differently. Each site uses its own templates, so a scraper written for one site rarely generalizes to another.

Another challenge is longevity. Since web developers keep updating their websites, you cannot rely on one scraper for long. Even minor modifications to a page can break the scraper and prevent it from fetching the data.

There are several ways to address these challenges. One is to follow continuous integration and continuous delivery (CI/CD) practices and maintain the scraper constantly, since websites change over time.
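
As one way to make that maintenance concrete, a small check can report which of the selectors a scraper depends on have stopped matching. The sketch below is only an illustration: the selector list, the missing_selectors helper, and the two HTML snippets are hypothetical and not taken from any real page.

```python
from bs4 import BeautifulSoup

# Selectors the scraper depends on (hypothetical; adjust to the real page).
REQUIRED_SELECTORS = ["div.product", "span.price"]

def missing_selectors(html, selectors=REQUIRED_SELECTORS):
    """Return the selectors that no longer match anything in the page."""
    soup = BeautifulSoup(html, "html.parser")
    return [sel for sel in selectors if soup.select_one(sel) is None]

# Two made-up versions of the same page: before and after a redesign.
old_layout = '<div class="product"><span class="price">100</span></div>'
new_layout = '<div class="product"><span class="cost">100</span></div>'

print(missing_selectors(old_layout))
print(missing_selectors(new_layout))
```

Running such a check on a schedule, for example as part of a CI job, would flag a layout change before it silently corrupts the dataset.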

Another, more realistic approach is to use the Application Programming Interfaces (APIs) offered by many websites and platforms. For example, Facebook and Twitter provide APIs designed for developers who want to experiment with their data, for instance to extract information about all friends and mutual friends and draw a connection graph from it. APIs typically return data as JSON or XML, whereas in standard web scraping you mainly deal with data in HTML format.
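
To illustrate the difference in format, here is a minimal sketch that reads the same information from a JSON payload and from an XML payload using only the standard library. Both payloads are made up for this example; real APIs define their own fields and structure.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical API responses; real APIs define their own fields.
json_payload = '{"user": "alice", "friends": ["bob", "carol"]}'
xml_payload = "<user name='alice'><friend>bob</friend><friend>carol</friend></user>"

# JSON maps directly onto Python dicts and lists.
record = json.loads(json_payload)
json_friends = record["friends"]

# XML is parsed into an element tree that can be searched by tag name.
root = ET.fromstring(xml_payload)
xml_friends = [node.text for node in root.findall("friend")]

print(json_friends)
print(xml_friends)
```

In both cases the data arrives already structured, so no tag hunting is needed, which is why an API is usually more stable than scraping HTML.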

What is Beautiful Soup?

Beautiful Soup is a pure Python library for extracting structured data from a website. It allows you to parse data from HTML and XML files. It acts as a helper module and lets you interact with HTML in much the same way as you would inspect a web page with a browser's developer tools, but from code.

It usually saves programmers hours or days of work, since it works with your favorite parsers, such as lxml and html5lib, to provide idiomatic Python ways of navigating, searching, and modifying the parse tree.
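
A minimal sketch of those three operations on a tiny, made-up HTML document follows. It uses Python's built-in html.parser so that no extra parser has to be installed; the markup and variable names are only for illustration.

```python
from bs4 import BeautifulSoup

# A small hypothetical document.
html = """
<html><body>
  <h1 id="title">Best Sellers</h1>
  <ul>
    <li class="book">Book A</li>
    <li class="book">Book B</li>
  </ul>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# Navigating: attribute access walks to the first matching tag.
heading = soup.h1.get_text()

# Searching: find_all returns every tag that matches the filter.
books = [li.get_text() for li in soup.find_all("li", class_="book")]

# Modifying: assigning to .string replaces the tag's text in the tree.
soup.h1.string = "Top Books"

print(heading, books, soup.h1.get_text())
```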

Another powerful and useful feature of Beautiful Soup is that it automatically converts incoming documents to Unicode and outgoing documents to UTF-8. As a developer, you do not have to think about encodings unless the document doesn't specify one and Beautiful Soup is unable to detect one.
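
The following sketch shows that behavior on a made-up document whose bytes are encoded as ISO-8859-1. The sample markup is an assumption for illustration; the value reported by original_encoding depends on what Beautiful Soup detects, so it is printed rather than relied upon.

```python
from bs4 import BeautifulSoup

# Bytes in ISO-8859-1, with the encoding declared in a meta tag (hypothetical page).
raw = (
    '<html><head><meta charset="iso-8859-1"></head>'
    "<body><p>café</p></body></html>"
).encode("iso-8859-1")

soup = BeautifulSoup(raw, "html.parser")

# Inside the tree, text is an ordinary Python str (Unicode).
text = soup.p.get_text()

# The encoding Beautiful Soup settled on for the input bytes.
print(soup.original_encoding)

# Serializing the tree back to bytes, here explicitly as UTF-8.
utf8_bytes = soup.encode("utf-8")
print(text)
```

If detection picks the wrong encoding, the from_encoding argument of the BeautifulSoup constructor lets you state it yourself.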

Its speed depends largely on the underlying parser: lxml is generally the fastest of the supported parsers, while Python's built-in html.parser needs no extra installation.

In [None]:
!pip3 install beautifulsoup4


In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
import re
import time
from datetime import datetime
import matplotlib.dates as mdates
import matplotlib.ticker as ticker
from urllib.request import urlopen
from bs4 import BeautifulSoup
import requests


Scraping the Amazon Best Selling Books

The URL that you are going to scrape is https://www.amazon.in/gp/bestsellers/books/. The page argument can be modified to access the data for each page. Hence, to build the full dataset you will need to loop through all the pages, but first you need to find out how many pages the website has.
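
As a small sketch of that loop, the per-page URLs can be built up front. The query pattern mirrors the one used inside get_data in this tutorial; the page_url helper and BASE_URL constant are names introduced here for illustration, and Amazon may change the pattern at any time.

```python
BASE_URL = "https://www.amazon.in/gp/bestsellers/books/"

def page_url(page_no):
    """Build the URL for one page of the best-seller list."""
    return f"{BASE_URL}ref=zg_bs_pg_{page_no}?ie=UTF8&pg={page_no}"

# Number of pages to fetch; check the website for the actual count.
no_pages = 2
urls = [page_url(n) for n in range(1, no_pages + 1)]

for url in urls:
    print(url)
```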

To connect to the URL and fetch the HTML content, the following steps are required:

Define a get_data function that takes the page number as an argument,
Define a user-agent, which helps avoid being detected as a scraper,
Pass the URL to requests.get along with the user-agent header,
Extract the content from the response returned by requests.get,
Parse the fetched page with Beautiful Soup and assign the result to the soup variable.

The next important step is to identify the parent tag under which all the data you need resides. The data that you are going to extract is:

Book Name
Author
Rating
Customers Rated
Price
The image below shows where the parent tag is located; when you hover over it, all the required elements are highlighted.

In [None]:
no_pages = 2

def get_data(pageNo):
    # Browser-like headers make the request less likely to be rejected as an automated client.
    headers = {"User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:66.0) Gecko/20100101 Firefox/66.0", "Accept-Encoding":"gzip, deflate", "Accept":"text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8", "DNT":"1","Connection":"close", "Upgrade-Insecure-Requests":"1"}

    r = requests.get('https://www.amazon.in/gp/bestsellers/books/ref=zg_bs_pg_'+str(pageNo)+'?ie=UTF8&pg='+str(pageNo), headers=headers)
    content = r.content
    # Name the parser explicitly so the result does not depend on which parsers are installed.
    soup = BeautifulSoup(content, 'html.parser')

    alls = []
    # Each best-seller entry lives in a div with this class combination.
    for d in soup.find_all('div', attrs={'class':'a-section a-spacing-none aok-relative'}):
        name = d.find('span', attrs={'class':'zg-text-center-align'})
        # The book title is taken from the alt text of the cover image.
        n = name.find_all('img', alt=True) if name is not None else []
        author = d.find('a', attrs={'class':'a-size-small a-link-child'})
        rating = d.find('span', attrs={'class':'a-icon-alt'})
        users_rated = d.find('a', attrs={'class':'a-size-small a-link-normal'})
        price = d.find('span', attrs={'class':'p13n-sc-price'})

        all1=[]

        if n:
            all1.append(n[0]['alt'])
        else:
            all1.append("unknown-product")

        if author is None:
            # Some entries show the author as plain text instead of a link.
            author = d.find('span', attrs={'class':'a-size-small a-color-base'})
        if author is not None:
            all1.append(author.text)
        else:
            all1.append('0')

        if rating is not None:
            all1.append(rating.text)
        else:
            all1.append('-1')

        if users_rated is not None:
            all1.append(users_rated.text)
        else:
            all1.append('0')

        if price is not None:
            all1.append(price.text)
        else:
            all1.append('0')
        alls.append(all1)
    return alls
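
The repeated "is not None" checks in get_data could be factored into a small helper. The sketch below shows one possible shape; text_or_default is a hypothetical name, and the HTML snippet only imitates a product card using the class names from get_data rather than reproducing Amazon's real markup.

```python
from bs4 import BeautifulSoup

def text_or_default(tag, default):
    """Return the stripped text of a tag, or a default when the tag is missing."""
    return tag.get_text(strip=True) if tag is not None else default

# Hypothetical markup that imitates one product card (no price element on purpose).
card_html = """
<div class="a-section a-spacing-none aok-relative">
  <a class="a-size-small a-link-child">Some Author</a>
  <span class="a-icon-alt">4.5 out of 5 stars</span>
</div>
"""
card = BeautifulSoup(card_html, "html.parser").div

author = text_or_default(card.find("a", attrs={"class": "a-size-small a-link-child"}), "0")
rating = text_or_default(card.find("span", attrs={"class": "a-icon-alt"}), "-1")
price = text_or_default(card.find("span", attrs={"class": "p13n-sc-price"}), "0")

print(author, rating, price)
```

Keeping the same placeholder values ('0' and '-1') means the later cleaning cells would work unchanged.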


In [None]:
results = []
for i in range(1, no_pages+1):
    results.append(get_data(i))
flatten = lambda l: [item for sublist in l for item in sublist]
df = pd.DataFrame(flatten(results),columns=['Book Name','Author','Rating','Customers_Rated', 'Price'])
df.to_csv('amazon_products.csv', index=False, encoding='utf-8')
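
The flatten lambda above turns a list of per-page lists into one list of rows. An equivalent standard-library alternative is itertools.chain.from_iterable, sketched here on made-up rows so it can be run without fetching anything.

```python
from itertools import chain

# Two hypothetical pages, each holding a list of rows.
pages = [
    [["Book A", "Author A"], ["Book B", "Author B"]],
    [["Book C", "Author C"]],
]

# chain.from_iterable concatenates the per-page lists into one list of rows.
rows = list(chain.from_iterable(pages))
print(rows)
```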


In [None]:
df = pd.read_csv("amazon_products.csv")


In [None]:
df.shape


In [None]:
df.head(61)


In [None]:
df['Rating'] = df['Rating'].apply(lambda x: x.split()[0])


In [None]:
df['Rating'] = pd.to_numeric(df['Rating'])
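
The two rating cells above keep the first whitespace-separated token of each rating string and convert it to a number. The same logic as a plain function, shown on sample strings; parse_rating is a name introduced here for illustration.

```python
def parse_rating(text):
    """Return the leading number of a rating string as a float."""
    return float(text.split()[0])

# A scraped rating and the '-1' placeholder used when no rating was found.
print(parse_rating("4.5 out of 5 stars"))
print(parse_rating("-1"))
```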


In [None]:
df["Price"] = df["Price"].str.replace('₹', '')


In [None]:
df["Price"] = df["Price"].str.replace(',', '')


In [None]:
df['Price'] = df['Price'].apply(lambda x: x.split('.')[0])


In [None]:
df['Price'] = df['Price'].astype(int)
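
The price clean-up in the cells above (strip the rupee sign, strip commas, drop the decimals, cast to int) can also be expressed as one function. This is a sketch under the assumption that prices look like the sample strings below; clean_price is a hypothetical helper, and an empty string would still raise an error.

```python
import re

def clean_price(text):
    """Convert a price string such as '₹ 1,299.00' to whole rupees."""
    # Keep only digits and the decimal point (drops the currency sign, commas, spaces).
    digits = re.sub(r"[^\d.]", "", text)
    # Discard the fractional part, as the notebook cells do.
    return int(digits.split(".")[0])

print(clean_price("₹ 1,299.00"))
print(clean_price("0"))
```

With pandas, it could be applied as df['Price'].apply(clean_price) in place of the four separate cells.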


In [None]:
df["Customers_Rated"] = df["Customers_Rated"].str.replace(',', '')


In [None]:
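# errors='coerce' turns values that cannot be parsed into NaN instead of leaving the column as strings
df['Customers_Rated'] = pd.to_numeric(df['Customers_Rated'], errors='coerce')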


In [None]:
df.head()


In [None]:
df.dtypes


In [None]:
df.replace(str(0), np.nan, inplace=True)
df.replace(0, np.nan, inplace=True)


In [None]:
count_nan = len(df) - df.count()


In [None]:
count_nan


In [None]:
df = df.dropna()
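
The missing-value handling above can be tried on a toy DataFrame. In this sketch the placeholder values are replaced column by column rather than across the whole frame, which is a slight variation on the cells above; the data and variable names are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy data using the same placeholders as the scraper: '0' for text, 0 for numbers.
toy = pd.DataFrame({
    "Author": ["A", "0", "C"],
    "Price": [250, 0, 399],
})

# Turn the placeholders into NaN so pandas treats them as missing.
toy["Author"] = toy["Author"].replace("0", np.nan)
toy["Price"] = toy["Price"].replace(0, np.nan)

# Count the missing values per column, then drop incomplete rows.
missing_per_column = toy.isna().sum()
clean = toy.dropna()

print(missing_per_column)
print(clean)
```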


In [None]:
data = df.sort_values(["Price"], axis=0, ascending=False)[:15]


In [None]:
from bokeh.models import ColumnDataSource
from bokeh.transform import dodge
import math
from bokeh.io import curdoc
curdoc().clear()
from bokeh.io import push_notebook, show, output_notebook
from bokeh.layouts import row
from bokeh.plotting import figure
from bokeh.transform import factor_cmap
from bokeh.models import Legend
output_notebook()


In [None]:
# Bokeh 3.x removed plot_width/plot_height; width/height are used here instead
p = figure(x_range=data.iloc[:,1], width=800, height=550, title="Authors Highest Priced Book", toolbar_location=None, tools="")

p.vbar(x=data.iloc[:,1], top=data.iloc[:,4], width=0.9)

p.xgrid.grid_line_color = None
p.y_range.start = 0
p.xaxis.major_label_orientation = math.pi/2


In [None]:
show(p)


In [None]:
data = df[df['Customers_Rated'] > 1000]


In [None]:
data = data.sort_values(['Rating'],axis=0, ascending=False)[:15]


In [None]:
p = figure(x_range=data.iloc[:,0], width=800, height=600, title="Top Rated Books with more than 1000 Customers Rating", toolbar_location=None, tools="")

p.vbar(x=data.iloc[:,0], top=data.iloc[:,2], width=0.9)

p.xgrid.grid_line_color = None
p.y_range.start = 0
p.xaxis.major_label_orientation = math.pi/2


In [None]:
show(p)


In [None]:
p = figure(x_range=data.iloc[:,1], width=800, height=600, title="Authors of Top Rated Books with more than 1000 Customers Rating", toolbar_location=None, tools="")

p.vbar(x=data.iloc[:,1], top=data.iloc[:,2], width=0.9)

p.xgrid.grid_line_color = None
p.y_range.start = 0
p.xaxis.major_label_orientation = math.pi/2


In [None]:
show(p)


In [None]:
data = df.sort_values(["Customers_Rated"], axis=0, ascending=False)[:20]


In [None]:
from bokeh.transform import factor_cmap
from bokeh.palettes import d3

# Category20 provides 20 distinct colors, one for each of the 20 rows selected above
palette = d3['Category20'][20]


In [None]:
index_cmap = factor_cmap('Author', palette=palette,
                         factors=data["Author"])


In [None]:
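p = figure(width=700, height=700, title = "Top Authors: Rating vs. Customers Rated")
# legend_field groups the legend by the values in the 'Author' column (the older legend= keyword was removed in Bokeh 3.x)
p.scatter('Rating','Customers_Rated',source=data,fill_alpha=0.6, fill_color=index_cmap,size=20,legend_field='Author')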
p.xaxis.axis_label = 'RATING'
p.yaxis.axis_label = 'CUSTOMERS RATED'
p.legend.location = 'top_left'


In [None]:
show(p)
