# Scraping the Best Selling Smartphones on Amazon

![Amazon.com Best Sellers](https://i.imgur.com/l3DRy0O.png)

**Amazon.in is the Indian subsidiary of Amazon, the American multinational technology company specializing in e-commerce, cloud computing, and artificial intelligence. The platform is among the best in the industry and offers a wide variety of items for purchase.
Amazon lists its best-selling smartphones in the Amazon Best Sellers section. The page ranks smartphones by sales and is updated hourly. In this project, we retrieve the Amazon best-selling smartphones using web scraping.**

Here is an outline of the steps we will follow:

* Install and import libraries.
* Download and parse the HTML source of the Best Sellers mobiles page using Requests and Beautiful Soup.
* Repeat the previous step for each of the first 8 pages.
* Extract information from each page.
* Combine the extracted information from each page into a Python dictionary.
* Create a dictionary of lists for all the items.
* Convert the dictionary into a Pandas DataFrame.
* Save the Pandas DataFrame to a CSV file.

By the end of the project, we’ll create a CSV file in the following format:

#### Description, Rating, No of Reviews, Price, Link of the product
Samsung Galaxy M33 5G (Deep Ocean Blue, 6GB, 128GB Storage),4.0,12843,,https://www.amazon.in/Samsung-Storage-6000mAh-Purchased-Separately/dp/B09TWDYSWQ/ref=zg_bs_1805560031_sccl_1/259-2762699-7186560?pd_rd_i=B09TWDYSWQ&psc=1
Samsung Galaxy M33 5G (Mystique Green, 6GB, 128GB Storage),4.0,12843,14499,https://www.amazon.in/Samsung-Mystique-Storage-Purchased-Separately/dp/B09TWGDY4W/ref=zg_bs_1805560031_sccl_2/259-2762699-7186560?pd_rd_i=B09TWGDY4W&psc=1
realme narzo 50A (Oxygen Blue , 4GB RAM + 64 GB Storage) Helio G85 Processor,4.2,37829,9999,https://www.amazon.in/realme-Oxygen-Storage-Processor-Battery/dp/B09FKD67CS/ref=zg_bs_1805560031_sccl_4/259-2762699-7186560?pd_rd_i=B09FKD67CS&psc=1

## How to Run the Code

You can execute the code using the “Run” button at the top of the page and selecting “Run on Binder”. You can make changes and save your version of the notebook in Jovian by executing the following cells.

In [20]:
import jovian

In [21]:
jovian.commit(project="Scraping Best Selling Mobiles on Amazon")

[jovian] Updating notebook "ankyhunk-bg4/scraping-best-selling-mobiles-on-amazon" on https://jovian.ai/
[jovian] Committed successfully! https://jovian.ai/ankyhunk-bg4/scraping-best-selling-mobiles-on-amazon


'https://jovian.ai/ankyhunk-bg4/scraping-best-selling-mobiles-on-amazon'

## First import the required libraries.

In [22]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import time

## Fetch and parse the HTML Source from the web page using requests and BeautifulSoup.

In [23]:
def get_mobile_page(page_url):
    # Browser-like request headers; the User-Agent identifies the requesting application, operating system, and version.
    HEADERS = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:66.0) Gecko/20100101 Firefox/66.0",
        "Accept-Encoding": "gzip, deflate",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "DNT": "1",
        "Connection": "close",
        "Upgrade-Insecure-Requests": "1",
    }
    response = requests.get(page_url, headers=HEADERS)
    # Pause after each request to reduce the chance of being denied access.
    time.sleep(10)
    # A status code of 200 means the request succeeded; anything else raises an exception.
    if response.status_code != 200:
        raise Exception("Failed to load page {}".format(page_url))
    # Parse the response text with Beautiful Soup's built-in HTML parser.
    mobile_doc = BeautifulSoup(response.text, 'html.parser')
    return mobile_doc
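
A single failed request currently stops the whole run. As a minimal sketch of one way to soften that, the status check can be wrapped in a small retry loop. The function name `fetch_html`, its parameters, and the `FakeResponse` stand-in are hypothetical and not part of the notebook; `get` is assumed to be any callable shaped like `requests.get`, which lets the logic be tried without network access.

```python
import time


def fetch_html(get, url, headers=None, retries=3, delay=10):
    """Return the page text, retrying when the status code is not 200.

    `get` is any callable shaped like requests.get (an assumption made so
    the function can be exercised without network access).
    """
    last_status = None
    for attempt in range(1, retries + 1):
        response = get(url, headers=headers)
        if response.status_code == 200:
            return response.text
        last_status = response.status_code
        if attempt < retries:
            # Wait longer after each failed attempt.
            time.sleep(delay * attempt)
    raise RuntimeError(
        "Failed to load page {} (last status {})".format(url, last_status)
    )


class FakeResponse:
    """Stand-in for a requests.Response, for demonstration only."""

    def __init__(self, status_code, text=""):
        self.status_code = status_code
        self.text = text


# Simulate one refused request followed by a successful one.
_replies = iter([FakeResponse(503), FakeResponse(200, "<html></html>")])
html = fetch_html(lambda url, headers=None: next(_replies),
                  "https://example.com/page", delay=0)
print(html)
```

With the real library, the equivalent call would pass `requests.get` as the first argument together with the header dictionary defined above.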

## Pass the page numbers from 1-8 to get a list of URLs for accessing web pages.

In [24]:
def get_page_url():
    pages = []
    # There are only 8 pages available for best-selling smartphones.
    for i in range(1, 9):
        # Insert the page number into both placeholders of the URL.
        url = 'https://www.amazon.in/gp/bestsellers/electronics/1805560031/ref=zg_bs_pg_{0}?ie=UTF8&pg={0}'.format(i)
        # Add the URL to the list of pages.
        pages.append(url)
    return pages
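
The URL pattern can be checked without touching the network. The sketch below builds the same eight URLs with a named placeholder; `BASE_URL` and `build_page_urls` are illustrative names, not part of the notebook, and the page range is left as a parameter in case the number of pages changes.

```python
BASE_URL = ("https://www.amazon.in/gp/bestsellers/electronics/1805560031/"
            "ref=zg_bs_pg_{page}?ie=UTF8&pg={page}")


def build_page_urls(first_page=1, last_page=8):
    """Return one Best Sellers URL per page number, both ends inclusive."""
    return [BASE_URL.format(page=page)
            for page in range(first_page, last_page + 1)]


urls = build_page_urls()
print(len(urls))
print(urls[0])
print(urls[-1])
```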

## Extract the details of each item using tags. 

![Inspect](https://i.imgur.com/j1RnU3W.png)

### Get the Mobile Description

In [25]:
def get_mobile_desc(doc):
    #The div class contains the details of each item. This can be checked using Inspect. There are 30 per page.
    item_tag=doc.find_all('div',{'class':'a-column a-span12 a-text-center _cDEzb_grid-column_2hIsc'})
    description=[]
    #The item tag contains all the details of an item.
    for tag in item_tag:
        #The second span tag contains description of item.
        desc_tag=tag.find_all('span')[1]
        # Here, we are accessing the text before '|' of the description.
        description.append(desc_tag.text.split('|')[0].strip())
    return description
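
To see how the tag lookups behave, the following minimal sketch runs the same `find_all` and `split('|')` steps on a small hand-written HTML string. The markup, the `grid-item` class name, and the product names are made up for illustration; the real page uses the longer class string shown above, and its structure may change over time.

```python
from bs4 import BeautifulSoup

# Made-up markup that mimics the shape described above; the real page differs.
SAMPLE_HTML = """
<div class="grid-item">
  <a href="/dp/EXAMPLE1">
    <span>#1</span>
    <span>Phone A (Blue, 4GB RAM) | 5000mAh Battery</span>
  </a>
</div>
<div class="grid-item">
  <a href="/dp/EXAMPLE2">
    <span>#2</span>
    <span>Phone B (Black, 2GB RAM)</span>
  </a>
</div>
"""

doc = BeautifulSoup(SAMPLE_HTML, "html.parser")

# One <div> per item, selected by its class attribute.
items = doc.find_all("div", {"class": "grid-item"})

# The second <span> in each item holds the description; keep the text before '|'.
descriptions = [item.find_all("span")[1].text.split("|")[0].strip()
                for item in items]
print(descriptions)
```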

### Get the Mobile Rating

In [26]:
def get_mobile_rating(doc):
    item_tag=doc.find_all('div',{'class':'a-column a-span12 a-text-center _cDEzb_grid-column_2hIsc'})
    
    rating=[]
    for tag in item_tag:
        #The rating tag within item tag is found
        rating_tag=tag.find('span',{'class':"a-icon-alt"})
        #The if condition checks whether rating for item is present.
        if rating_tag:
            # The rating tag holds text such as '4.5 out of 5 stars', so we keep only the leading number (4.5).
            rating.append(float(rating_tag.text.split(' ')[0]))
        else:
            # No rating is present for this item, so we record 0.0.
            rating.append(0.0)
    return rating
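
The text-to-number step can be tried on its own. This is a small sketch, assuming the rating text always starts with the number; `parse_rating` is a hypothetical helper, and it falls back to the same zero placeholder used above when the text is missing or does not start with a number.

```python
import re


def parse_rating(text):
    """Return the leading number of a string such as '4.5 out of 5 stars'.

    Returns 0.0 when the text is missing or does not start with a number.
    """
    if not text:
        return 0.0
    match = re.match(r"\s*(\d+(?:\.\d+)?)", text)
    return float(match.group(1)) if match else 0.0


print(parse_rating("4.5 out of 5 stars"))
print(parse_rating(None))
```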

### Get the Mobile Review

In [27]:
def get_mobile_review(doc):
    item_tag=doc.find_all('div',{'class':'a-column a-span12 a-text-center _cDEzb_grid-column_2hIsc'})
    
    review=[]
    for tag in item_tag:
        #The review tag within item tag is found
        review_tag=tag.find('span',{'class':"a-size-small"})
        #The if condition checks whether review for item is present.
        if review_tag:
            #The review tag contains the Total No of Reviews. We are removing the ',' in the number.   
            review.append(int(review_tag.text.replace(',','')))
        else:
            # No review count is present for this item, so we record 0.
            review.append(0)
    return review
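
Calling `int()` on text that is not purely numeric raises a `ValueError`, so a guarded conversion may be worth considering if the matched span ever holds something other than a count. The helper below is a hypothetical sketch, not the notebook's method; the non-numeric input is an invented example.

```python
def parse_review_count(text):
    """Convert text such as '12,843' to 12843.

    Returns 0 for missing or non-numeric text.
    """
    if not text:
        return 0
    digits = text.replace(",", "").strip()
    return int(digits) if digits.isdecimal() else 0


print(parse_review_count("12,843"))
print(parse_review_count("New"))
```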

### Get the Mobile Price

In [28]:
def get_mobile_price(doc):
    item_tag=doc.find_all('div',{'class':'a-column a-span12 a-text-center _cDEzb_grid-column_2hIsc'})
    price=[]
    for tag in item_tag:
        #The price tag within item tag is found  
        price_tag=tag.find('span',{'class':"a-size-base a-color-price"})
        
        #The if condition checks whether the price is present.    
        if price_tag:  
            #From the price tag, we remove the '₹' symbol and the ',' from the number and cast to int to remove decimals.
            price.append(int(float(price_tag.text.replace('₹','').replace(',',''))))
        else:
            # No price is present for this item, so we record 0.
            price.append(0)
    return price
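
As a worked example of the clean-up described in the comments, the sketch below strips the currency symbol and thousands separators and drops the decimals. `parse_price` is an illustrative name, and the sample strings are assumptions about how the price text looks.

```python
def parse_price(text):
    """Convert text such as '₹14,499.00' to the integer 14499.

    Returns 0 when the text is missing or is not a single number.
    """
    if not text:
        return 0
    cleaned = text.replace("₹", "").replace(",", "").strip()
    try:
        # float() handles the decimals, int() then truncates them.
        return int(float(cleaned))
    except ValueError:
        return 0


print(parse_price("₹14,499.00"))
print(parse_price(None))
```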

### Get the Mobile Link

In [29]:
def get_mobile_link(doc):
    item_tag=doc.find_all('div',{'class':'a-column a-span12 a-text-center _cDEzb_grid-column_2hIsc'})
    link=[]
    #The Base URL is specified.
    base_url = 'https://www.amazon.in'
    for tag in item_tag:
        #The link of the item is found in 'a' tag in the item_tag variable.
        link_tag=tag.find('a')
        #The base_url is concatenated with the link of the item found in the link tag.
        link.append(base_url+link_tag['href'])
    return link
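
String concatenation works as long as every `href` is a relative path. A slightly more defensive option, sketched here as an alternative rather than the notebook's method, is `urllib.parse.urljoin` from the standard library, which also leaves absolute URLs intact. `absolute_link` is a hypothetical helper, and the shortened `/dp/...` path is for illustration only.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.amazon.in"


def absolute_link(href):
    """Join a relative or absolute href onto the base URL."""
    if not href:
        return ""
    return urljoin(BASE_URL, href)


# A relative path is joined onto the base URL.
relative = absolute_link("/dp/B09TWDYSWQ")
# An href that is already absolute is returned unchanged.
absolute = absolute_link("https://www.amazon.in/dp/B09TWDYSWQ")
print(relative)
print(absolute)
```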

### Combine the details into a dictionary.

In [30]:
def parse_mobile_details(doc):
    #The dictionary containing the required details is created.
    mobile_dict={
        'Description':get_mobile_desc(doc), 
        'Rating':get_mobile_rating(doc),
        'Reviews':get_mobile_review(doc), 
        'Price':get_mobile_price(doc), 
        'Link': get_mobile_link(doc)
    }
        
    return mobile_dict
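
Because each column is extracted by a separate function, the five lists must end up the same length for the later DataFrame conversion to succeed. A quick check such as the following can catch a mismatch early; `check_equal_lengths` is a hypothetical helper and the sample data is made up.

```python
def check_equal_lengths(details):
    """Raise ValueError unless every list in the dictionary has the same length."""
    lengths = {key: len(values) for key, values in details.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError("Column lengths differ: {}".format(lengths))
    return lengths


sample = {
    "Description": ["Phone A", "Phone B"],
    "Rating": [4.2, 0.0],
    "Reviews": [1234, 0],
    "Price": [9999, 0],
    "Link": ["https://www.amazon.in/dp/EXAMPLE1",
             "https://www.amazon.in/dp/EXAMPLE2"],
}
lengths = check_equal_lengths(sample)
print(lengths)
```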

## Create a Dictionary of Lists for all the items.

## Convert the Dictionary into a Pandas Dataframe using Pandas library.

In [31]:
def scrape_mobiles():
    # The page URLs generated by get_page_url() are stored in urls.
    urls = get_page_url()
    top_mobiles = {'Description': [], 'Rating': [], 'Reviews': [], 'Price': [], 'Link': []}
    # Loop over each page URL.
    for mobile_url in urls:
        # Fetch and parse the page HTML with get_mobile_page.
        mobile_doc = get_mobile_page(mobile_url)
        # Extract the details of every mobile on the page once.
        mobile_details = parse_mobile_details(mobile_doc)
        # Extend each list in the dictionary with this page's values.
        for key in top_mobiles:
            top_mobiles[key].extend(mobile_details[key])

    # The final dictionary is converted to a DataFrame.
    return pd.DataFrame(top_mobiles)
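
To see the accumulation step in isolation, the sketch below merges two made-up per-page dictionaries into one dictionary of lists. `merge_pages` is an illustrative helper; only two of the five columns are used to keep the example short.

```python
def merge_pages(page_dicts, columns):
    """Concatenate the per-page lists column by column, preserving page order."""
    merged = {column: [] for column in columns}
    for page in page_dicts:
        for column in columns:
            merged[column].extend(page[column])
    return merged


page_1 = {"Description": ["Phone A", "Phone B"], "Price": [9999, 0]}
page_2 = {"Description": ["Phone C"], "Price": [6299]}

merged = merge_pages([page_1, page_2], ["Description", "Price"])
print(merged)
```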

In [32]:
mobiles_df=scrape_mobiles()
mobiles_df.head()

|   | Description | Rating | Reviews | Price | Link |
|---|---|---|---|---|---|
| 0 | Redmi 10A (Slate Grey, 4GB RAM, 64GB Storage) | 3.9 | 5245 | 0 | https://www.amazon.in/Redmi-Storage-Battery-Fi... |
| 1 | Samsung Galaxy M13 (Aqua Green, 4GB, 64GB Stor... | 4.0 | 4015 | 9499 | https://www.amazon.in/Samsung-Galaxy-Storage-6... |
| 2 | Redmi A1 (Black, 2GB RAM, 32GB Storage) | 3.7 | 68 | 6299 | https://www.amazon.in/Redmi-Storage-Battery-Le... |
| 3 | Redmi 10A (Sea Blue, 4GB RAM, 64GB Storage) | 3.9 | 5245 | 0 | https://www.amazon.in/Redmi-Storage-Battery-Fi... |
| 4 | Redmi Note 11 (Horizon Blue, 4GB RAM, 64GB Sto... | 4.1 | 39415 | 12099 | https://www.amazon.in/Redmi-Horizon-Qualcomm%C... |


## Save the Pandas Dataframe to a CSV file.

In [33]:
# Write the DataFrame to a CSV file without the row index.
mobiles_df.to_csv('Amazon_mobiles.csv', index=False)
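
The call above leaves the row index out of the file. The small round trip below, using an in-memory buffer and made-up data, illustrates why that matters: when the index is written, reading the file back yields an extra unnamed column.

```python
import io

import pandas as pd

df = pd.DataFrame({"Description": ["Phone A", "Phone B"], "Price": [9999, 0]})

# Default behaviour: the row index is written as an unnamed first column.
with_index = io.StringIO()
df.to_csv(with_index)
with_index.seek(0)
columns_with_index = list(pd.read_csv(with_index).columns)

# Suppressing the index writes only the data columns.
without_index = io.StringIO()
df.to_csv(without_index, index=False)
without_index.seek(0)
columns_without_index = list(pd.read_csv(without_index).columns)

print(columns_with_index)
print(columns_without_index)
```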

## Wrapping up the project and Submission

### Summary
* The scraping was done with the Python libraries Requests and Beautiful Soup for extraction, and Pandas for saving the data.
* Scraped the Description, Rating, Reviews, Price, and Link of the best-selling smartphones, 30 per page across 8 pages of the Amazon website.
* Saved all the scraped data to a CSV file with 30 rows per web page, for a total of 240 rows and 5 columns.

### Future work
* Extract more details for each smartphone by following its product link.
* Optimize the code.
* Improve the documentation.
* Explore scraping with Selenium and Scrapy, as the web page is dynamic and its data changes frequently.

### References
Learn more about libraries used:
* __[requests](https://requests.readthedocs.io/en/latest/)__
* __[Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)__
* __[time](https://docs.python.org/3/library/time.html)__
* __[Pandas](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html)__

Tutorials on Web Scraping:
* __[Python Web Scraping](https://www.youtube.com/watch?v=RKsLLG-bzEY&t=1104s)__
* __[Amazon Web Scraping](https://www.datacamp.com/tutorial/amazon-web-scraping-using-beautifulsoup)__
* __[5 Web Scraping Projects](https://amankharwal.medium.com/5-web-scraping-projects-with-python-4bcc25ff039)__

In [34]:
jovian.commit(files=[], outputs=["Amazon_mobiles.csv"])

[jovian] Updating notebook "ankyhunk-bg4/scraping-best-selling-mobiles-on-amazon" on https://jovian.ai/
[jovian] Uploading additional outputs...
[jovian] Committed successfully! https://jovian.ai/ankyhunk-bg4/scraping-best-selling-mobiles-on-amazon


'https://jovian.ai/ankyhunk-bg4/scraping-best-selling-mobiles-on-amazon'