# Web Scraping Data from Amazon

Web scraping, web harvesting, or web data extraction is data scraping used for extracting data from websites. The web scraping software may directly access the World Wide Web using the Hypertext Transfer Protocol or a web browser. 

While web scraping can be done manually by a software user, the term typically refers to automated processes implemented using a bot or web crawler. It is a form of copying in which specific data is gathered and copied from the web, typically into a central local database or spreadsheet, for later retrieval or analysis.

We will use a few Python libraries for this purpose:

- BeautifulSoup
- Requests
- Pandas

The website we are going to be scraping is https://www.amazon.in/gp/bestsellers/luggage.

In [1]:
#Importing the libraries 
from bs4 import BeautifulSoup
import requests
import pandas as pd

Using the Requests library, we send a request to the server to retrieve the HTML page. BeautifulSoup then parses the response so that the data is easier to inspect.

BeautifulSoup is a powerful library that provides many methods for navigating an HTML page to reach the data we are looking for.

In [2]:
# Request the bestsellers page and parse the returned HTML
response = requests.get('https://www.amazon.in/gp/bestsellers/luggage')
soup = BeautifulSoup(response.text, 'html.parser')
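The request above relies on the default settings of `requests.get`. A slightly more defensive variant is sketched below; it is an illustration rather than part of the original notebook. The helper name `fetch_soup`, the header values, and the 10-second timeout are all assumptions, and a site such as Amazon may still answer automated requests with an error or a CAPTCHA page.

```python
import requests
from bs4 import BeautifulSoup

URL = "https://www.amazon.in/gp/bestsellers/luggage"

# Hypothetical header values; any realistic browser string could be used.
HEADERS = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/120.0 Safari/537.36",
    "Accept-Language": "en-IN,en;q=0.9",
}


def fetch_soup(url, headers=HEADERS, timeout=10):
    """Download a page and parse it, raising on HTTP errors."""
    response = requests.get(url, headers=headers, timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError on 4xx/5xx
    return BeautifulSoup(response.text, "html.parser")


# soup = fetch_soup(URL)  # uncomment to run against the live site
```

Calling `raise_for_status()` stops the notebook early when the request fails, instead of silently parsing an error page.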

In the cells below, we first grab a `div` from the parsed page and then collect every `div` inside it that carries the product-title classes, keeping only the text. We then create an empty list, `Product`, and use a `for` loop to append the text of each match, calling `strip()` to remove surrounding whitespace and newlines.

In [3]:
# Container <div> used as the starting point for the searches below
xyz = soup.find('div', text = "")

In [4]:
product = xyz.find_all('div', class_='p13n-sc-truncate p13n-sc-line-clamp-2')

In [5]:
Product = []
for prod in product:
    Product.append(prod.text.strip())
    print(prod.text.strip())

NAPA HIDE Black Leather Wallet for Men
WILDHORN® Carter Leather Wallet for Men (Black Croco)
Storite PU Leather 9 Slot Vertical Credit Debit Card Holder Money Wallet Zipper Coin Purse for Men Women - Chocolate Brown
GLUN Bolt Electronic Portable Fishing Hook Type Digital LED Screen Luggage Weighing Scale, 50 kg/110 Lb (Black)
URBAN FOREST Black Leather Men's Card Holder With Pen Combo (UBF126BLK10208)
M MEDLER Epoch Nylon 55 litres Waterproof Strolley Duffle Bag- 2 Wheels - Luggage Bag - (Navy Blue)
WildHorn® RFID Protected Genuine High Quality Leather Wallet for Men (Black MATT)
Trajectory Men's and Women's Neck Pillow Travel headrest Accessory in Black with Plane Flight Bus and Office (Black)
American Tourister Casual Backpack
Urban Forest Oliver Black RFID Blocking Leather Wallet for Men
GoTrippin Metal Luggage Weighing Scale Digital (Silver_ELS)
SAFARI 15 Ltrs Sea Blue Casual/School/College Backpack (DAYPACKNEO15CBSEB) & SAFARI 15 Ltrs Cherry Red Casual/School/College Backpack (DAY
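One detail of the search above is worth illustrating. When `class_` is given a string containing several class names, BeautifulSoup compares it with the exact value of the `class` attribute, so the order of the names matters. The sketch below shows this on a small, made-up HTML fragment (the markup and product names are hypothetical, not taken from Amazon) and contrasts it with a CSS selector passed to `select`, which matches the classes in any order.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, loosely modelled on the class names used above.
SAMPLE_HTML = """
<div id="container">
  <div class="p13n-sc-truncate p13n-sc-line-clamp-2">
      Leather Wallet for Men
  </div>
  <div class="p13n-sc-line-clamp-2 p13n-sc-truncate">
      Casual Backpack
  </div>
</div>
"""

sample_soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Exact-string match on the class attribute: only the first inner div qualifies.
exact = [
    tag.get_text(strip=True)
    for tag in sample_soup.find_all(
        "div", class_="p13n-sc-truncate p13n-sc-line-clamp-2"
    )
]

# CSS selector: matches any div carrying both classes, in either order.
both = [
    tag.get_text(strip=True)
    for tag in sample_soup.select("div.p13n-sc-truncate.p13n-sc-line-clamp-2")
]

print(exact)
print(both)
```

If the class order on the live page ever changes, the selector form is the one more likely to keep working.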

Next, we create an empty list, `Price`, and use a `for` loop to append the text of each price element, stripping surrounding whitespace and removing the rupee sign (₹).

In [6]:
price = xyz.find_all('a', class_ = 'a-link-normal a-text-normal')

In [7]:
Price = []
for pri in price:
    Price.append(pri.text.strip().replace("₹", ""))
    print(pri.text.strip().replace("₹", ""))

320.00 - 640.00
407.00 - 2,099.00
449.00 - 849.00
299.00
455.00 - 699.00
569.00 - 640.00
299.00 - 899.00
298.00 - 598.00
1,099.00 - 2,300.00
455.00 - 699.00
790.00
295.00 - 658.00
364.00 - 475.00
320.00 - 3,999.00
499.00
219.00 - 303.00
299.00 - 930.00
199.00 - 503.00
2,969.00 - 8,630.00
449.00 - 949.00
339.00 - 369.00
1,399.00
495.00 - 499.00
195.00 - 999.00
899.00
3,699.00 - 7,098.00
799.00 - 899.00
174.00 - 299.00
322.00 - 619.00
305.00
269.00 - 949.00
1 offer from 449.00
366.00 - 431.00
337.00
398.00 - 698.00
1,399.00
449.00
357.00 - 359.00
702.00 - 867.00
497.00 - 1,409.00
449.00 - 899.00
261.00 - 711.00
170.00 - 282.00
395.00 - 495.00
163.00
940.00 - 1,945.00
359.00 - 718.00
3,599.00
2,759.00
229.00 - 459.00
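The prices collected above are still strings, and many of them are ranges such as `407.00 - 2,099.00`. If numbers are needed for analysis, one possible approach is sketched below. The helper `parse_price_range` is hypothetical; it simply returns the lowest and highest amount found in a string.

```python
import re

# An amount is a run of digits and commas followed by a decimal part.
_AMOUNT = re.compile(r"\d[\d,]*\.\d+")


def parse_price_range(text):
    """Return (low, high) as floats, or (None, None) if no amount is found."""
    amounts = [float(match.replace(",", "")) for match in _AMOUNT.findall(text)]
    if not amounts:
        return (None, None)
    return (min(amounts), max(amounts))


# Sample strings taken from the output above.
for sample in ["407.00 - 2,099.00", "299.00", "1 offer from 449.00"]:
    print(sample, "->", parse_price_range(sample))
```

Because the pattern requires a decimal part, the leading `1` in `1 offer from 449.00` is ignored and only the amount is kept.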


Let's go ahead and pull out some more data that may be useful. We are going to extract:

- The product rating given by users
- The number of people who have rated the product

We will follow the same process and then store the data in a `CSV` file or a `dataframe`, as required.

In [8]:
rating = xyz.find_all('span', class_ = 'a-icon-alt')

In [9]:
Rating = []
for rate in rating:
    Rating.append(rate.text[0:3])
    print(rate.text[0:3])

4.0
4.0
4.1
3.9
4.3
3.6
4.0
3.9
4.1
4.3
4.4
4.0
3.9
4.1
4.1
3.8
4.3
3.7
4.2
4.0
4.6
4.0
4.1
3.9
4.3
4.2
4.2
3.8
4.2
3.9
4.0
4.1
4.1
4.2
4.0
4.0
4.4
4.3
4.0
4.1
4.1
4.5
4.0
4.3
3.9
4.0
4.1
4.3
4.2
3.9
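Slicing with `[0:3]` works as long as every rating starts with a value of the form `4.0`. A regular expression is a little more tolerant. The sketch below assumes the element text looks like `4.0 out of 5 stars`, which is a guess about the page rather than something shown in the output above; the name `parse_rating` is hypothetical.

```python
import re


def parse_rating(text):
    """Return the first number in `text` as a float, or None if there is none."""
    match = re.search(r"\d+(?:\.\d+)?", text)
    return float(match.group()) if match else None


print(parse_rating("4.0 out of 5 stars"))
print(parse_rating("No rating yet"))
```

Returning `None` for missing ratings keeps the list the same length as the other columns.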


In [10]:
reviewercount = xyz.find_all('div', class_ = 'a-icon-row')

In [11]:
RatingCount = []
for reviewer in reviewercount:
    RatingCount.append(reviewer.text[20:].replace("\n", ""))
    print(reviewer.text[20:].replace("\n", ""))

20,223
15,736
3,010
11,063
28,149
14,004
22,272
4,530
33,966
5,388
5,125
10,370
6,778
7,491
4,328
2,537
5,090
5,670
8,122
8,285
325
4,471
6,938
2,766
2,052
5,344
3,744
4,481
2,632
1,877
5,323
1,593
1,320
532
4,158
451
2,229
2,746
5,707
3,719
2,111
696
381
662
1,212
943
6,691
4,047
2,485
715
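The fixed offset in `reviewer.text[20:]` depends on the exact length of the text that precedes the count. A less position-dependent alternative is sketched below: it takes the last comma-grouped integer in the string. The helper `parse_count` and the sample strings are assumptions for illustration.

```python
import re


def parse_count(text):
    """Return the last integer in `text` (commas allowed), or None."""
    numbers = re.findall(r"\d[\d,]*", text)
    if not numbers:
        return None
    return int(numbers[-1].replace(",", ""))


# Hypothetical element text: a rating label followed by the count.
print(parse_count("4.0 out of 5 stars\n 20,223"))
print(parse_count("325"))
```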


Finally, we put all the information into a single dataframe. Voilà!

In [12]:
df = pd.DataFrame({'Product': Product,'Price (INR)': Price, 'Rating' : Rating, 'Count' : RatingCount})
df

| | Product | Price | Rating | Count |
|---|---|---|---|---|
| 0 | NAPA HIDE Black Leather Wallet for Men | 320.00 - 640.00 | 4.0 | 20223 |
| 1 | WILDHORN® Carter Leather Wallet for Men (Black... | 407.00 - 2,099.00 | 4.0 | 15736 |
| 2 | Storite PU Leather 9 Slot Vertical Credit Debi... | 449.00 - 849.00 | 4.1 | 3010 |
| 3 | GLUN Bolt Electronic Portable Fishing Hook Typ... | 299.00 | 3.9 | 11063 |
| 4 | URBAN FOREST Black Leather Men's Card Holder W... | 455.00 - 699.00 | 4.3 | 28149 |
| 5 | M MEDLER Epoch Nylon 55 litres Waterproof Stro... | 569.00 - 640.00 | 3.6 | 14004 |
| 6 | WildHorn® RFID Protected Genuine High Quality ... | 299.00 - 899.00 | 4.0 | 22272 |
| 7 | Trajectory Men's and Women's Neck Pillow Trave... | 298.00 - 598.00 | 3.9 | 4530 |
| 8 | American Tourister Casual Backpack | 1,099.00 - 2,300.00 | 4.1 | 33966 |
| 9 | Urban Forest Oliver Black RFID Blocking Leathe... | 455.00 - 699.00 | 4.3 | 5388 |
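`pd.DataFrame` raises a `ValueError` when the lists passed to it have different lengths, which can happen if one of the searches returns fewer matches than the others. The sketch below checks the lengths first and then converts the rating and count columns to numbers. The function name `build_frame` and the sample values are hypothetical.

```python
import pandas as pd


def build_frame(products, prices, ratings, counts):
    """Build the dataframe after checking that all columns line up."""
    columns = {
        "Product": products,
        "Price (INR)": prices,
        "Rating": ratings,
        "Count": counts,
    }
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f"Column lengths differ: {lengths}")

    frame = pd.DataFrame(columns)
    # Convert text columns to numbers; unparseable values become NaN.
    frame["Rating"] = pd.to_numeric(frame["Rating"], errors="coerce")
    frame["Count"] = pd.to_numeric(
        frame["Count"].str.replace(",", "", regex=False), errors="coerce"
    )
    return frame


# Made-up rows in the same shape as the scraped lists.
sample = build_frame(
    ["Wallet", "Backpack"],
    ["320.00 - 640.00", "1,399.00"],
    ["4.0", "3.9"],
    ["20,223", "325"],
)
print(sample.dtypes)
```

With numeric columns, operations such as sorting by `Count` or averaging `Rating` behave as expected.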


We save the dataframe as a CSV file so that it can be used by senior management or by people who are less familiar with coding.

In [13]:
df.to_csv('AmazonWebscraping.csv', index = False)
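To check the saved file, it can be read back with `pd.read_csv`. The sketch below uses an in-memory buffer and made-up rows instead of the real file, and passes `thousands=","` so that counts such as `20,223` are parsed as integers.

```python
import io

import pandas as pd

# Made-up rows standing in for the scraped data.
frame = pd.DataFrame(
    {"Product": ["Wallet", "Backpack"], "Count": ["20,223", "325"]}
)

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
frame.to_csv(buffer, index=False)
buffer.seek(0)

# Read the CSV back, treating "," as a thousands separator in numbers.
restored = pd.read_csv(buffer, thousands=",")
print(restored)
```

For the real file, the buffer would be replaced by the path `'AmazonWebscraping.csv'`.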

In [14]:
import jovian  # not imported earlier in this notebook

jovian.commit()

[jovian] Updating notebook "bhatnagar91/webscrapingamazon" on https://jovian.ai
[jovian] Committed successfully! https://jovian.ai/bhatnagar91/webscrapingamazon


'https://jovian.ai/bhatnagar91/webscrapingamazon'