# Flipkart Web Scraping

Web scraping is one of the most common ways to collect data. In simple terms, it means fetching the HTML of a web page and scanning it for the data you need.

**[Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/)** is one of the most widely used Python packages for parsing scraped HTML.  
In this notebook we'll use Beautiful Soup to collect the name, rating, price and key specifications of the TVs listed on a Flipkart search results page, and build a dataset out of them.  

## Importing modules
The **[Requests](https://requests.readthedocs.io/en/master/)** module is used to fetch the HTML for a given URL.

**Note**: *Not all web pages can be requested this way. For example, most social media sites do not allow scraping because of privacy concerns. Collecting data from such sites requires access through their developer APIs.*
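One common way to check whether a site permits automated access to a path is its `robots.txt` file. The sketch below uses the standard library's `urllib.robotparser` on a made-up set of rules; the rules and the `example.com` URLs are assumptions for illustration only and are not Flipkart's actual policy.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used only for illustration.
# These are NOT the real rules of any site; read the real file before scraping.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /account/
Disallow: /checkout/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(SAMPLE_ROBOTS_TXT)

# can_fetch(user_agent, url) reports whether the rules allow the request
search_allowed = parser.can_fetch("*", "https://example.com/search?q=tv")
account_allowed = parser.can_fetch("*", "https://example.com/account/orders")
print(search_allowed, account_allowed)
```

For a live site, `RobotFileParser.set_url(...)` followed by `read()` downloads the file instead of parsing a string.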

In [210]:
# ctrl + / to comment the code
from time import sleep
from random import random
import pandas as pd
import requests
from tqdm.notebook import tqdm
from bs4 import BeautifulSoup as bs
from urllib.parse import urljoin
import urllib.parse as urlparse
from urllib.parse import parse_qs

In [211]:
# Constants
# link='https://www.flipkart.com/search?q=tv&as=on&as-show=on&otracker=AS_Query_TrendingAutoSuggest_8_0_na_na_na&otracker1=AS_Query_TrendingAutoSuggest_8_0_na_na_na&as-pos=8&as-type=TRENDING&suggestionId=tv&requestId=9c9fa553-b7e5-454b-a65b-bbb7a9c74a29'
link = 'https://www.flipkart.com/search?q=tv&as=on&as-show=on&otracker=AS_Query_TrendingAutoSuggest_8_0_na_na_na&otracker1=AS_Query_TrendingAutoSuggest_8_0_na_na_na&as-pos=8&as-type=TRENDING&suggestionId=tv&requestId=9c9fa553-b7e5-454b-a65b-bbb7a9c74a29&page=1'
page = requests.get(link)

In [226]:
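# Build the URL of each results page by changing the `page` query parameter.
# The loop variable is named page_no so it does not overwrite the `page` response above.
for page_no in range(1, 11):
    link = 'https://www.flipkart.com/search?q=tv&as=on&as-show=on&otracker=AS_Query_TrendingAutoSuggest_8_0_na_na_na&otracker1=AS_Query_TrendingAutoSuggest_8_0_na_na_na&as-pos=8&as-type=TRENDING&suggestionId=tv&requestId=9c9fa553-b7e5-454b-a65b-bbb7a9c74a29&page=' + str(page_no)
    print(link)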

https://www.flipkart.com/search?q=tv&as=on&as-show=on&otracker=AS_Query_TrendingAutoSuggest_8_0_na_na_na&otracker1=AS_Query_TrendingAutoSuggest_8_0_na_na_na&as-pos=8&as-type=TRENDING&suggestionId=tv&requestId=9c9fa553-b7e5-454b-a65b-bbb7a9c74a29&page=1
https://www.flipkart.com/search?q=tv&as=on&as-show=on&otracker=AS_Query_TrendingAutoSuggest_8_0_na_na_na&otracker1=AS_Query_TrendingAutoSuggest_8_0_na_na_na&as-pos=8&as-type=TRENDING&suggestionId=tv&requestId=9c9fa553-b7e5-454b-a65b-bbb7a9c74a29&page=2
https://www.flipkart.com/search?q=tv&as=on&as-show=on&otracker=AS_Query_TrendingAutoSuggest_8_0_na_na_na&otracker1=AS_Query_TrendingAutoSuggest_8_0_na_na_na&as-pos=8&as-type=TRENDING&suggestionId=tv&requestId=9c9fa553-b7e5-454b-a65b-bbb7a9c74a29&page=3
https://www.flipkart.com/search?q=tv&as=on&as-show=on&otracker=AS_Query_TrendingAutoSuggest_8_0_na_na_na&otracker1=AS_Query_TrendingAutoSuggest_8_0_na_na_na&as-pos=8&as-type=TRENDING&suggestionId=tv&requestId=9c9fa553-b7e5-454b-a65b-bbb7a9c7
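The page URLs can also be assembled from their parts rather than by string concatenation. This is a minimal sketch using `urllib.parse`; it assumes that only the `q` and `page` parameters matter and that the tracking parameters in the original link are optional, which has not been verified here. The helper name `search_page_url` is hypothetical.

```python
from urllib.parse import urlencode, urlparse, parse_qs

BASE_URL = "https://www.flipkart.com/search"


def search_page_url(query: str, page: int) -> str:
    """Return the search URL for one results page (hypothetical helper)."""
    return f"{BASE_URL}?{urlencode({'q': query, 'page': page})}"


# URLs for the first three result pages
urls = [search_page_url("tv", page_no) for page_no in range(1, 4)]
for url in urls:
    print(url)

# parse_qs recovers the parameters from a URL, which helps when checking links
params = parse_qs(urlparse(urls[1]).query)
print(params)
```

When these pages are actually requested in a loop, a short pause such as `sleep(1 + random())` between requests keeps the load on the server low; both functions are already imported at the top of the notebook.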

In [212]:
soup = bs(page.content, 'html.parser')
# prettify() returns the parsed HTML as an indented, readable string
print(soup.prettify())

<!DOCTYPE html>
<html lang="en">
 <head>
  <link href="https://rukminim1.flixcart.com" rel="preconnect"/>
  <link href="//static-assets-web.flixcart.com/fk-p-linchpin-web/fk-cp-zion/css/app_modules.chunk.905c37.css" rel="stylesheet"/>
  <link href="//static-assets-web.flixcart.com/fk-p-linchpin-web/fk-cp-zion/css/app.chunk.a47a6a.css" rel="stylesheet"/>
  <meta content="text/html; charset=utf-8" http-equiv="Content-type"/>
  <meta content="IE=Edge" http-equiv="X-UA-Compatible"/>
  <meta content="102988293558" property="fb:page_id"/>
  <meta content="658873552,624500995,100000233612389" property="fb:admins"/>
  <meta content="noodp" name="robots"/>
  <link href="https:///www/promos/new/20150528-140547-favicon-retina.ico" rel="shortcut icon"/>
  <link href="/osdd.xml?v=2" rel="search" type="application/opensearchdescription+xml"/>
  <meta content="website" property="og:type"/>
  <meta content="Flipkart.com" name="og_site_name" property="og:site_name"/>
  <link href="/apple-touch-icon-57x

Extracting the Name of the Product

In [213]:
name = soup.find('div', class_="_4rR01T")
print(name)
name.text

<div class="_4rR01T">Longway 81 cm (32 inch) HD Ready LED Smart TV</div>


'Longway 81 cm (32 inch) HD Ready LED Smart TV'

Extracting the Rating of the Product

In [214]:
#get rating of a product
rating = soup.find('div', class_="_3LWZlK")
print(rating)
name.text, rating.text

<div class="_3LWZlK">5</div>


('Longway 81 cm (32 inch) HD Ready LED Smart TV', '5')
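`find` returns `None` when no element matches, so calling `.text` on the result raises an `AttributeError` for listings that lack a rating, or after the site changes its generated class names. The sketch below guards against that on a small hand-written HTML sample; the sample markup, the product names in it and the helper `text_or_default` are assumptions for illustration.

```python
from bs4 import BeautifulSoup

# Hand-written sample that mimics the class names used above.
# The second card deliberately has no rating element.
SAMPLE_HTML = """
<div class="_3pLy-c row">
  <div class="_4rR01T">Example TV A</div>
  <div class="_3LWZlK">4.3</div>
</div>
<div class="_3pLy-c row">
  <div class="_4rR01T">Example TV B</div>
</div>
"""


def text_or_default(parent, tag, css_class, default=None):
    """Return the stripped text of the first match, or `default` if none exists."""
    node = parent.find(tag, class_=css_class)
    return node.get_text(strip=True) if node is not None else default


sample_soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
rows = []
for card in sample_soup.find_all("div", class_="_3pLy-c"):
    rows.append((text_or_default(card, "div", "_4rR01T"),
                 text_or_default(card, "div", "_3LWZlK")))
print(rows)
```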

Extracting Other specifications of the product

In [215]:
#get other details and specifications of the product
specification = soup.find('div', class_="fMghEO")
print(specification)
specification.text

<div class="fMghEO"><ul class="_1xgFaf"><li class="rgWa7D">HD Ready 1366 x 768 Pixels</li><li class="rgWa7D">1 Year Limited Domestic Brand Warranty</li></ul></div>


'HD Ready 1366 x 768 Pixels1 Year Limited Domestic Brand Warranty'

In [217]:
# Collect the individual <li> specification entries
spec = specification.find_all('li', class_='rgWa7D')
print(name.text)
print(rating.text)
print(spec[0].text)
print(spec[1].text)
# print(spec[2].text)

Longway 81 cm (32 inch) HD Ready LED Smart TV
5
HD Ready 1366 x 768 Pixels
1 Year Limited Domestic Brand Warranty


Extracting Price of the Product

In [219]:
#get price of the product
price=soup.find('div', class_='_30jeq3 _1_WHN1')
# print(price)

print(name.text)
print(rating.text)
print(spec[0].text)
print(spec[1].text)
# print(spec[2].text)
print(price.text)

Longway 81 cm (32 inch) HD Ready LED Smart TV
5
HD Ready 1366 x 768 Pixels
1 Year Limited Domestic Brand Warranty
₹8,290


Extracting Multiple Products

In [220]:
# Lists that store one value per product

products = []              # Product names
prices = []                # Prices
ratings = []               # Ratings
os = []                    # First specification entry (operating system, when listed)
hd = []                    # Second specification entry (usually the resolution)
wr = []                    # Third specification entry (usually the warranty), or 'None'

In [221]:
for data in soup.find_all('div', class_='_3pLy-c row'):
    names = data.find('div', attrs={'class': '_4rR01T'})
    price = data.find('div', attrs={'class': '_30jeq3 _1_WHN1'})
    rating = data.find('div', attrs={'class': '_3LWZlK'})
    specification = data.find('div', attrs={'class': 'fMghEO'})

    # Specification entries are read by position. Listings without an
    # operating-system line therefore put their resolution in os_.
    col = specification.find_all('li', attrs={'class': 'rgWa7D'})
    os_ = col[0].text
    hd_ = col[1].text
    wr_ = col[2].text if len(col) > 2 else 'None'

    products.append(names.text)   # Add product name to list
    prices.append(price.text)     # Add price to list
    os.append(os_)                # Add first specification entry to list
    hd.append(hd_)                # Add second specification entry to list
    wr.append(wr_)                # Add third specification entry (or 'None') to list
    ratings.append(rating.text)   # Add rating to list
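Because the loop above assigns specifications by position, a listing with only two entries (resolution and warranty, as in the first result) ends up with its resolution in the `os` list. A possible alternative is to classify each entry by its text. The sketch below assumes the wording seen in this page's results ("Operating System: ...", "... Pixels", "... Warranty"); the function `classify_specs` and its keyword rules are hypothetical and may not hold for other listings.

```python
def classify_specs(items):
    """Sort specification strings into named fields by keyword (heuristic sketch)."""
    record = {"operating_system": None, "resolution": None, "warranty": None}
    for item in items:
        text = item.strip()
        if text.startswith("Operating System"):
            # Keep only the part after the colon, e.g. "Tizen"
            record["operating_system"] = text.split(":", 1)[-1].strip()
        elif "Pixels" in text:
            record["resolution"] = text
        elif "Warranty" in text:
            record["warranty"] = text
    return record


# Entries taken from the first two results shown in this notebook
two_items = classify_specs([
    "HD Ready 1366 x 768 Pixels",
    "1 Year Limited Domestic Brand Warranty",
])
three_items = classify_specs([
    "Operating System: Tizen",
    "Ultra HD (4K) 3840 x 2160 Pixels",
    "1 Year Comprehensive Warranty on Product and 1 Year Additional on Panel",
])
print(two_items)
print(three_items)
```

Inside the scraping loop, the input would be `[li.text for li in col]`.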

In [222]:
# printing the length of list
print(len(products))
print(len(ratings))
print(len(prices))
print(len(os))
print(len(hd))
print(len(wr))

24
24
24
24
24
24


In [231]:
n=1
print(products[n])
print(ratings[n])
print(prices[n])
print(os[n])
print(wr[n])
print(hd[n])

SAMSUNG The Frame 138 cm (55 inch) QLED Ultra HD (4K) Smart Tizen TV Get Frame TV Bezel Worth Rs. 8.98...
4.3
₹86,990
Operating System: Tizen
1 Year Comprehensive Warranty on Product and 1 Year Additional on Panel
Ultra HD (4K) 3840 x 2160 Pixels


In [224]:
# print(products[n])
# print(ratings[n])
# print(prices[n])

# df = pd.DataFrame()
prices

['₹8,290',
 '₹86,990',
 '₹13,990',
 '₹13,999',
 '₹8,399',
 '₹13,990',
 '₹15,999',
 '₹54,990',
 '₹21,999',
 '₹9,499',
 '₹8,499',
 '₹10,990',
 '₹14,999',
 '₹24,999',
 '₹21,999',
 '₹25,499',
 '₹39,999',
 '₹28,999',
 '₹24,999',
 '₹11,499',
 '₹13,999',
 '₹6,699',
 '₹32,990',
 '₹31,990']
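The prices are scraped as strings such as `'₹8,290'`, which cannot be sorted or averaged numerically. A minimal sketch of converting them to integers follows; the helper `parse_price` is hypothetical and assumes whole-rupee prices with no decimal part, as in the list above.

```python
def parse_price(price_text: str) -> int:
    """Convert a price string like '₹8,290' to an integer number of rupees."""
    # Keep only the digits, dropping the currency symbol and thousands separators
    digits = "".join(ch for ch in price_text if ch.isdigit())
    if not digits:
        raise ValueError(f"no digits found in price: {price_text!r}")
    return int(digits)


sample_prices = ['₹8,290', '₹86,990', '₹13,990']
numeric_prices = [parse_price(p) for p in sample_prices]
print(numeric_prices)
```

Once the data is in a DataFrame, the same function can be applied to a column with `df['prices'].map(parse_price)`.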


# Collecting the data into a DataFrame

In [227]:
dataset = []
for idx in range(0, len(products)):
    dataset.append({'product_name' : products[idx], 
                    'ratings' : ratings[idx], 
                    'prices' : prices[idx], 
                    'Operating System': os[idx], 
                    'Warranty': wr[idx], 
                    'picture_quality': hd[idx]})

In [230]:
# dataset

In [228]:
df = pd.DataFrame(dataset)
with pd.option_context('display.max_colwidth', None):  # None disables truncation; -1 is deprecated
    display(df.head(5))
    display(df.tail(5))

| | product_name | ratings | prices | Operating System | Warranty | picture_quality |
|---|---|---|---|---|---|---|
| 0 | Longway 81 cm (32 inch) HD Ready LED Smart TV | 5.0 | ₹8,290 | HD Ready 1366 x 768 Pixels | | 1 Year Limited Domestic Brand Warranty |
| 1 | SAMSUNG The Frame 138 cm (55 inch) QLED Ultra HD (4K) Smart Tizen TV Get Frame TV Bezel Worth Rs. 8.98... | 4.3 | ₹86,990 | Operating System: Tizen | 1 Year Comprehensive Warranty on Product and 1 Year Additional on Panel | Ultra HD (4K) 3840 x 2160 Pixels |
| 2 | SAMSUNG 80 cm (32 Inch) HD Ready LED Smart Tizen TV with 2022 Model | 4.4 | ₹13,990 | Operating System: Tizen | 1 Year Comprehensive Warranty on Product and 1 Year Additional on Panel | HD Ready 1366 x 768 Pixels |
| 3 | Mi 5A 80 cm (32 inch) HD Ready LED Smart Android TV with Dolby Audio (2022 Model) | 4.4 | ₹13,999 | Operating System: Android | 1 Year Warranty on Product and 2 Years Warranty on Panel. OEM warranty activation starts from the date of delivery. | HD Ready 1366 x 768 Pixels |
| 4 | Thomson Alpha 80 cm (32 inch) HD Ready LED Smart Linux TV with 30 W Sound Output & Bezel-Less Design | 4.5 | ₹8,399 | Operating System: Linux | 1 Year Warranty on Product and 6 Months on Accessories | HD Ready 1366 x 768 Pixels |

| | product_name | ratings | prices | Operating System | Warranty | picture_quality |
|---|---|---|---|---|---|---|
| 19 | acer I Series 80 cm (32 inch) HD Ready LED Smart Android TV with Android 11, 1.5GB RAM (2022 Model) | 4.4 | ₹11,499 | Operating System: Android | 1 Year Warranty | HD Ready 1366 x 768 Pixels |
| 20 | realme 80 cm (32 inch) HD Ready LED Smart Android TV | 4.3 | ₹13,999 | Operating System: Android | 1 Year Domestic Warranty, 2 Years on Panel | HD Ready 1366 x 768 Pixels |
| 21 | BeethoSOL 80 cm (32 inch) HD Ready LED TV | 4.3 | ₹6,699 | HD Ready 1366 x 768 Pixels | | 1 Year Warranty |
| 22 | SAMSUNG Crystal 4K Neo Series 108 cm (43 inch) Ultra HD (4K) LED Smart Tizen TV with (Black) (2022 Mod... | 4.4 | ₹32,990 | Operating System: Tizen | 1 Year Comprehensive Warranty on Product and 1 Year Additional on Panel | Ultra HD (4K) Crystal 4K FE UHD (3840 x 2160) Pixels |
| 23 | LG UQ7500 108 cm (43 inch) Ultra HD (4K) LED Smart WebOS TV 2022 Edition | 4.4 | ₹31,990 | Operating System: WebOS | 1 Year LG India Comprehensive Warranty and additional 1 year Warranty is applicable on Panel/Module from the date of purchase. | Ultra HD (4K) 3840 x 2160 Pixels |


In [233]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 24 entries, 0 to 23
Data columns (total 6 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   product_name      24 non-null     object
 1   ratings           24 non-null     object
 2   prices            24 non-null     object
 3   Operating System  24 non-null     object
 4   Warranty          24 non-null     object
 5   picture_quality   24 non-null     object
dtypes: object(6)
memory usage: 1.2+ KB


In [176]:
print(f"Count of products:{df.shape[0]}")

Count of products:24


# Serialize the DataFrame to a CSV file

In [177]:
df.to_csv("flipkart_TV_dataset.csv", index=False)