# Webscraping using BeautifulSoup 

Webscraping of ACs from https://www.flipkart.com/search?q=ac using BeautifulSoup<br><br>
The project is divided into 5 parts:
- 1: Importing Libraries 
- 2: Scrapping first page
- 3: Scrapping all the pages
- 4: Data Cleaning
    - 4.1: Creating dataframe
    - 4.2 Creating new Columns from Name and Features Column
    - 4.3 Cleaning Brand column
- 5: Saving data in csv
<br><br><br>
Details of columns saved in csv:
Brand: Brand of AC <br>
Price: Price of AC <br>
Ton: Cooling Capacity of AC in ton <br>
BEE_rating: Bureau of Energy Efficiency star rating given out of 5 for AC <br>
Avg_Rating: Average star rating given by users to the AC <br>
Reviews: Total number of reviews given by users to the AC <br>
Rated_reviews: Total number of reviews alongwith star rating given by users to the AC <br>
Type: Window or Split<br>
Inverter: Whether AC uses inverter technology or not<br>
Convertible: Whether AC is has convertible cooling options or not<br>
Condenser Coil: Material of condenser coil of AC<br>
Power Consumption: Power consumption of in Watts<br>
Noise level: Noise leve of AC in decibel(dB)<br>
Refrigerant: DuPont refrigerant name of AC<br>
Ambient Temperature: Ambient temperature in degree celsius outside for working of AC on its max performance<br>
WiFi Enabled: Whether AC is WiFi enabled or not<br>
Warranty: Warranty of AC and its parts<br>

## Part 1: Importing Libraries

In [1]:
#importing libraries
import requests #to make website get request
from bs4 import BeautifulSoup as bs # to scrape through website contents
import re # for regex
import time
import datetime 
import pandas as pd
import numpy as np
pd.set_option('display.max_colwidth', None)

## Part 2: Scrapping first page

In [2]:
#intialising website string
flipkart_url="https://www.flipkart.com/search?q=" + "ac"

In [3]:
#making get request
html_text=requests.get(flipkart_url).text

In [4]:
#scraping through url page contents
soup=bs(html_text,'html.parser')

In [5]:
# finding total number of pages
pages=int(soup.find('div',class_='_2MImiq').span.text.split()[-1])
pages

33

## Part 3: Scrapping all the pages

In [6]:
def page_urls(page):    
    """function to make list of urls strings to be scraped
        Args:
        page (int): max number of pages to be scraped
        Returns
        a(list): list of urls strings to be scraped
        
    """
    a=[]
    for i in range(1,page+1):
        a.append(flipkart_url+"&page="+str(i))
    return a

In [7]:
page_urls(pages)# call to function page_urls with 33 as argument

['https://www.flipkart.com/search?q=ac&page=1',
 'https://www.flipkart.com/search?q=ac&page=2',
 'https://www.flipkart.com/search?q=ac&page=3',
 'https://www.flipkart.com/search?q=ac&page=4',
 'https://www.flipkart.com/search?q=ac&page=5',
 'https://www.flipkart.com/search?q=ac&page=6',
 'https://www.flipkart.com/search?q=ac&page=7',
 'https://www.flipkart.com/search?q=ac&page=8',
 'https://www.flipkart.com/search?q=ac&page=9',
 'https://www.flipkart.com/search?q=ac&page=10',
 'https://www.flipkart.com/search?q=ac&page=11',
 'https://www.flipkart.com/search?q=ac&page=12',
 'https://www.flipkart.com/search?q=ac&page=13',
 'https://www.flipkart.com/search?q=ac&page=14',
 'https://www.flipkart.com/search?q=ac&page=15',
 'https://www.flipkart.com/search?q=ac&page=16',
 'https://www.flipkart.com/search?q=ac&page=17',
 'https://www.flipkart.com/search?q=ac&page=18',
 'https://www.flipkart.com/search?q=ac&page=19',
 'https://www.flipkart.com/search?q=ac&page=20',
 'https://www.flipkart.com/se

In [8]:
# intializing various feature lists which will be finally converted into dataframe columns
brand_name,ton,star_rating,price,user_rating,a_c_n,reviews,ratings,feature_list_main=[],[],[],[],[],[],[],[],[]

In [9]:
def ac_page(acs):
    """function to append various feature lists from the data scrapped form particular class of particular page url
        Args:
        acs (bs4.element.ResultSet): scraped relevant data of all ACs in particular page url
                
    """
    
    c=0 #counter for AC items in particular page url
    for ac in acs:
        c+=1
        print(f"Scraping AC no. - {c}") #printing counter for AC items
        ac_name1=ac.find('div',class_="_4rR01T").text.title() # scrapping AC name and coverting into title case
        a_c_n.append(ac_name1) # appending to the name list
        acn=re.findall(r'.*?(?=\d)',ac_name1) # using regex lookahead to search to find brand name in AC name 
        brand_name.append(acn[0])# appending first item as brand name
        ac_name=ac.find('div',class_="_4rR01T").text.split()# splitting AC name into list
        ac_name=ac_name[1:] # slicing list starting from
        p,u,r=str(),str(),str() # intialising string variables
        for j in range (len(ac_name)):
            if ac_name[j]=='Ton':ton.append(ac_name[j-1])#appending ton list
            if j!=0 and ac_name[j]=='Star':star_rating.append(ac_name[j-1])# appending bee star rating list
        
        p=ac.find('div',class_="_30jeq3 _1_WHN1").text #searching for AC price
        if len(p)!=0:price.append(p)# appending price list
        else:price.append("Not Available")
            
        u=ac.find('div',class_='gUuXy-')# searching for user star rating
        if u is None:user_rating.append("Not Available")#appending Not Available to user_rating if object not found
        else:user_rating.append(u.div.text) # appending to user_rating
       
            
        r=ac.find('span',class_='_2_R_DZ')# searching for user rating and reviews class
        if r is None:
            reviews.append("Not Available")#appending Not Available if object not found
            ratings.append("Not Available")
        else:
            r1=r.span.text.split()#splitting the variable
            reviews.append(r1[0])#appending to reviews list
            ratings.append(r1[-2])#appending to ratings list 
            
            

        
        features=ac.find_all('li',class_='rgWa7D')# finding all features of AC
        feature_list=[]
        for f in features:
            feature_list.append(f.text)#creating feature list for particular ac
        feature_list_main.append(feature_list)#appending features of one AC into main feature list

    

In [10]:
c1=0 #counter for particular page url
for i in page_urls(pages):
    c1+=1
    print(f"Scraping ACs from page No. - {c1}")
    time.sleep(5)
    html_text_page=requests.get(i).text#making get request for particular page url
    soup_page=bs(html_text_page,'html.parser')#scraping through page url contents
    acs=soup_page.find_all('div',class_="_3pLy-c row")# scrapping all AC classes
    ac_page(acs)# making call to ac_page function 

Scraping ACs from page No. - 1
Scraping AC no. - 1
Scraping AC no. - 2
Scraping AC no. - 3
Scraping AC no. - 4
Scraping AC no. - 5
Scraping AC no. - 6
Scraping AC no. - 7
Scraping AC no. - 8
Scraping AC no. - 9
Scraping AC no. - 10
Scraping AC no. - 11
Scraping AC no. - 12
Scraping AC no. - 13
Scraping AC no. - 14
Scraping AC no. - 15
Scraping AC no. - 16
Scraping AC no. - 17
Scraping AC no. - 18
Scraping AC no. - 19
Scraping AC no. - 20
Scraping AC no. - 21
Scraping AC no. - 22
Scraping AC no. - 23
Scraping AC no. - 24
Scraping ACs from page No. - 2
Scraping AC no. - 1
Scraping AC no. - 2
Scraping AC no. - 3
Scraping AC no. - 4
Scraping AC no. - 5
Scraping AC no. - 6
Scraping AC no. - 7
Scraping AC no. - 8
Scraping AC no. - 9
Scraping AC no. - 10
Scraping AC no. - 11
Scraping AC no. - 12
Scraping AC no. - 13
Scraping AC no. - 14
Scraping AC no. - 15
Scraping AC no. - 16
Scraping AC no. - 17
Scraping AC no. - 18
Scraping AC no. - 19
Scraping AC no. - 20
Scraping AC no. - 21
Scraping AC

Scraping AC no. - 1
Scraping AC no. - 2
Scraping AC no. - 3
Scraping AC no. - 4
Scraping AC no. - 5
Scraping AC no. - 6
Scraping AC no. - 7
Scraping AC no. - 8
Scraping AC no. - 9
Scraping AC no. - 10
Scraping AC no. - 11
Scraping AC no. - 12
Scraping AC no. - 13
Scraping AC no. - 14
Scraping AC no. - 15
Scraping AC no. - 16
Scraping AC no. - 17
Scraping AC no. - 18
Scraping AC no. - 19
Scraping AC no. - 20
Scraping AC no. - 21
Scraping AC no. - 22
Scraping AC no. - 23
Scraping AC no. - 24
Scraping ACs from page No. - 18
Scraping AC no. - 1
Scraping AC no. - 2
Scraping AC no. - 3
Scraping AC no. - 4
Scraping AC no. - 5
Scraping AC no. - 6
Scraping AC no. - 7
Scraping AC no. - 8
Scraping AC no. - 9
Scraping AC no. - 10
Scraping AC no. - 11
Scraping AC no. - 12
Scraping AC no. - 13
Scraping AC no. - 14
Scraping AC no. - 15
Scraping AC no. - 16
Scraping AC no. - 17
Scraping AC no. - 18
Scraping AC no. - 19
Scraping AC no. - 20
Scraping AC no. - 21
Scraping AC no. - 22
Scraping AC no. - 23

Scraping AC no. - 1
Scraping AC no. - 2
Scraping AC no. - 3
Scraping AC no. - 4
Scraping AC no. - 5
Scraping AC no. - 6
Scraping AC no. - 7
Scraping AC no. - 8
Scraping AC no. - 9
Scraping AC no. - 10
Scraping AC no. - 11
Scraping AC no. - 12
Scraping AC no. - 13
Scraping AC no. - 14
Scraping AC no. - 15
Scraping AC no. - 16


## Part 4: Data Cleaning

### 4.1: Creating dataframe

In [11]:
#creating dataframe from various feature lists
df1 = pd.DataFrame(list(zip(a_c_n,brand_name,price,ton,star_rating,user_rating,reviews,ratings,feature_list_main)),
               columns =['Name','Brand','Price','Ton','BEE_rating','Avg_Rating','Reviews','Rated_reviews','Features'])
df1

Unnamed: 0,Name,Brand,Price,Ton,BEE_rating,Avg_Rating,Reviews,Rated_reviews,Features
0,Voltas 1.5 Ton 5 Star Split Inverter Ac - White,Voltas,"₹37,999",1.5,5,4.1,1405,138,"[Condenser Coil: Copper, Power Consumption: 1450 W, Noise level: 46 dB, Refrigerant: R32, Ambient Temperature: 52 DegreeC, Wi-Fi Enabled: No, 1 Year Warranty on Product and 5 Years on Compressor]"
1,Whirlpool 4 In 1 Convertible Cooling 1.5 Ton 5 Star Split Inverter Ac - White,Whirlpool,"₹36,490",1.5,5,4,4598,524,"[Condenser Coil: Copper, Power Consumption: 1325 W, Noise level: 35 dB, Refrigerant: R - 32, Wi-Fi Enabled: No, 1 year comprehensive+5year on compressor]"
2,Voltas 1.5 Ton 3 Star Split Inverter Adjustible Ac - White,Voltas,"₹32,399",1.5,3,4.1,715,64,"[Condenser Coil: Copper, Power Consumption: 1675 W, Noise level: 47 dB, Refrigerant: R32, Ambient Temperature: 52 DegreeC, Wi-Fi Enabled: No, 1 Year Warranty on Product and 10 Years on Compressor]"
3,Lg 1.5 Ton 3 Star Split Dual Inverter Convertible 5-In-1 Cooling Hd Filter With Anti-Virus Protection ...,Lg,"₹36,499",1.5,3,4,3175,350,"[Condenser Coil: Copper, Power Consumption: 962.65 kWh, Noise level: 26 dB, Refrigerant: R - 32, Wi-Fi Enabled: No, 1 Year on Product, 5 Years on PCB and 10 Years on Compressor with Gas Charging]"
4,Whirlpool 4 In 1 Convertible Cooling 1.5 Ton 3 Star Split Inverter Ac - White,Whirlpool,"₹32,990",1.5,3,4,4598,524,"[Condenser Coil: Copper, Power Consumption: 1570 W, Noise level: 37.5 dB, Refrigerant: R - 32, Wi-Fi Enabled: No, 1 year comprehensive+5year on compressor]"
...,...,...,...,...,...,...,...,...,...
761,Lloyd 1.5 Ton 3 Star Split Ac - White,Lloyd,"₹32,990",1,3,3.9,23,4,"[Condenser Coil: Copper, Power Consumption: 1700 W, Noise level: 38 dB, Wi-Fi Enabled: No, 1 Year Warranty and 1 Year Compressor Warranty on Lloyd]"
762,Samsung 1 Ton 3 Star Split Ac - White,Samsung,"₹30,699",1.5,3,3.8,8,0,"[Condenser Coil: Copper, Power Consumption: 711.4 kWh, Noise level: 42 dB, Wi-Fi Enabled: No, 1 Year Warranty on Product, 5 Years on Condenser and 10 Years on Compressor]"
763,Voltas 1.5 Ton 5 Star Window Ac - White,Voltas,"₹34,449",2,3,4.3,2104,248,"[Condenser Coil: Copper, Power Consumption: 1525 W, Noise level: 54 dB, Refrigerant: R-32, Wi-Fi Enabled: No, 1 Year on Product and 4 Years on Compressor]"
764,Haier 2 Ton 3 Star Hot And Cold Tower Inverter Ac With Wi-Fi Connect - Silver,Haier,"₹1,23,500",2,3,3.3,4,2,[Wi-Fi Enabled: Yes]


In [12]:
#checking for null values
df1.isnull().sum()

Name             0
Brand            0
Price            0
Ton              0
BEE_rating       0
Avg_Rating       0
Reviews          0
Rated_reviews    0
Features         0
dtype: int64

In [13]:
df1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 766 entries, 0 to 765
Data columns (total 9 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   Name           766 non-null    object
 1   Brand          766 non-null    object
 2   Price          766 non-null    object
 3   Ton            766 non-null    object
 4   BEE_rating     766 non-null    object
 5   Avg_Rating     766 non-null    object
 6   Reviews        766 non-null    object
 7   Rated_reviews  766 non-null    object
 8   Features       766 non-null    object
dtypes: object(9)
memory usage: 54.0+ KB


In [14]:
df1['Features'] = df1['Features'].astype(str)# changing datatype of Features column to str from list
#removing unnecessary characters from Price and Features column
df1['Price']=df1['Price'].str.replace(",", "").str.replace("₹", "")
df1['Features']=df1['Features'].str.replace("[", "").str.replace("]", "").str.replace("'", "").str.replace(" - ", "").str.replace("-", "")
df1.head(5)

Unnamed: 0,Name,Brand,Price,Ton,BEE_rating,Avg_Rating,Reviews,Rated_reviews,Features
0,Voltas 1.5 Ton 5 Star Split Inverter Ac - White,Voltas,37999,1.5,5,4.1,1405,138,"Condenser Coil: Copper, Power Consumption: 1450 W, Noise level: 46 dB, Refrigerant: R32, Ambient Temperature: 52 DegreeC, WiFi Enabled: No, 1 Year Warranty on Product and 5 Years on Compressor"
1,Whirlpool 4 In 1 Convertible Cooling 1.5 Ton 5 Star Split Inverter Ac - White,Whirlpool,36490,1.5,5,4.0,4598,524,"Condenser Coil: Copper, Power Consumption: 1325 W, Noise level: 35 dB, Refrigerant: R32, WiFi Enabled: No, 1 year comprehensive+5year on compressor"
2,Voltas 1.5 Ton 3 Star Split Inverter Adjustible Ac - White,Voltas,32399,1.5,3,4.1,715,64,"Condenser Coil: Copper, Power Consumption: 1675 W, Noise level: 47 dB, Refrigerant: R32, Ambient Temperature: 52 DegreeC, WiFi Enabled: No, 1 Year Warranty on Product and 10 Years on Compressor"
3,Lg 1.5 Ton 3 Star Split Dual Inverter Convertible 5-In-1 Cooling Hd Filter With Anti-Virus Protection ...,Lg,36499,1.5,3,4.0,3175,350,"Condenser Coil: Copper, Power Consumption: 962.65 kWh, Noise level: 26 dB, Refrigerant: R32, WiFi Enabled: No, 1 Year on Product, 5 Years on PCB and 10 Years on Compressor with Gas Charging"
4,Whirlpool 4 In 1 Convertible Cooling 1.5 Ton 3 Star Split Inverter Ac - White,Whirlpool,32990,1.5,3,4.0,4598,524,"Condenser Coil: Copper, Power Consumption: 1570 W, Noise level: 37.5 dB, Refrigerant: R32, WiFi Enabled: No, 1 year comprehensive+5year on compressor"


### 4.2 Creating new Columns from Name and Features Column

In [15]:
#searching for word window or split in Name column
search1,search2,search3 = [],[],[]
for values in df1['Name']:
    if 'Split' in values:search1.append('Split')
    elif 'Window' in values:search1.append('Window')
    else: search1.append('Not Available')

#searching for word Inverter in Name column 
for values in df1['Name']:
    if 'Inverter' in values:search2.append('Yes')
    elif 'Non Inverter' in values:search2.append('No')
    else: search2.append('No')

#searching for word Convertible in Name column
for values in df1['Name']:
    if 'Convertible' in values:search3.append('Yes')
    elif 'Non Convertible' in values:search2.append('No')
    else: search3.append('No')

df1['Type'] = search1#creating new column Type
df1['Inverter'] = search2#creating new column Inverter
df1['Convertible'] = search3#creating new column Convertible
del df1['Name']#deleting Name column

In [16]:
def features_explode(a):
    """function to make new columns from Features column
        Args:
        a (str): words to be searched in Feature column , this same word would become header of new column
        
        
    """    
    search = []  
    for values in df1['Features']:
        e=re.findall(r'(?<=)'+a+": "'(\w+)',values)# using regex lookbehind 
        if len(e)==0:e.append('Not Available')#append Not Available if not found
        search.append(e[0])

    df1[a] = search#creating new column with header name a

In [17]:
features_explode("Condenser Coil")# call to function with Condenser Coil as argumnet
features_explode("Power Consumption")# call to function with Power Consumption as argumnet
features_explode("Noise level")# call to function with Noise level as argumnet
features_explode("Refrigerant")# call to function with Refrigerant as argumnet
features_explode("Ambient Temperature")# call to function with Ambient Temperature as argumnet
features_explode("WiFi Enabled")# call to function with Wi-Fi Enabled as argumnet
#warranty column
search = []  
for values in df1['Features']:
    e = re.findall(r"\d+ Year.+",values)#searching for word Year in Features column
    if len(e)==0:e.append('Not Available')
    search.append(e[0])

df1['Warranty'] = search # creating new column with header name Warranty

del df1['Features'] # delting Features column 

In [18]:
df1.head(5)

Unnamed: 0,Brand,Price,Ton,BEE_rating,Avg_Rating,Reviews,Rated_reviews,Type,Inverter,Convertible,Condenser Coil,Power Consumption,Noise level,Refrigerant,Ambient Temperature,WiFi Enabled,Warranty
0,Voltas,37999,1.5,5,4.1,1405,138,Split,Yes,No,Copper,1450,46,R32,52,No,1 Year Warranty on Product and 5 Years on Compressor
1,Whirlpool,36490,1.5,5,4.0,4598,524,Split,Yes,Yes,Copper,1325,35,R32,Not Available,No,Not Available
2,Voltas,32399,1.5,3,4.1,715,64,Split,Yes,No,Copper,1675,47,R32,52,No,1 Year Warranty on Product and 10 Years on Compressor
3,Lg,36499,1.5,3,4.0,3175,350,Split,Yes,Yes,Copper,962,26,R32,Not Available,No,"1 Year on Product, 5 Years on PCB and 10 Years on Compressor with Gas Charging"
4,Whirlpool,32990,1.5,3,4.0,4598,524,Split,Yes,Yes,Copper,1570,37,R32,Not Available,No,Not Available


### 4.3 Cleaning Brand column

In [19]:
df1.Brand.unique()

array(['Voltas ', 'Whirlpool ', 'Lg ', 'Blue Star Convertible ',
       'Samsung ', 'Carrier Flexicool Convertible ', 'Panasonic ',
       'Lg Super Convertible ', 'Realme Techlife ', 'Lloyd ', 'Daikin ',
       'Hitachi ', 'Onida ', 'Croma ', 'Marq By Flipkart Convertible ',
       'Godrej ', 'Blue Star ', 'Carrier ', 'Midea ',
       'Samsung Super Convertible ', 'Ifb ', 'Lg Convertible ',
       'Hisense ', 'Motorola Multi-Convertible ', 'Haier ', 'Thomson ',
       'Marq By Flipkart ', 'Motorola ', 'Candy ', 'Nokia ', 'Lumx ',
       'Toshiba ', 'Lloyd . ', 'O General ', 'Sansui ', 'Livpure ', 'Vg ',
       'Micromax ', 'Samsung Windfree ', 'Impex ',
       'Vizio Vision Beyond Imagination ', 'Hyundai ', 'Gazhal '],
      dtype=object)

In [20]:
#removing unnecessary words and spaces from Brand column
df1['Brand']=df1['Brand'].str.replace("Convertible", "").str.replace("Super", "").str.replace("Windfree", "").str.replace("Multi-","").str.replace("Flexicool","")
df1['Brand']=df1['Brand'].str.strip()

In [21]:
df1.Brand.unique()

array(['Voltas', 'Whirlpool', 'Lg', 'Blue Star', 'Samsung', 'Carrier',
       'Panasonic', 'Realme Techlife', 'Lloyd', 'Daikin', 'Hitachi',
       'Onida', 'Croma', 'Marq By Flipkart', 'Godrej', 'Midea', 'Ifb',
       'Hisense', 'Motorola', 'Haier', 'Thomson', 'Candy', 'Nokia',
       'Lumx', 'Toshiba', 'Lloyd .', 'O General', 'Sansui', 'Livpure',
       'Vg', 'Micromax', 'Impex', 'Vizio Vision Beyond Imagination',
       'Hyundai', 'Gazhal'], dtype=object)

## Part 5: Saving data in csv

In [22]:
# Saving the dataframe into CSV file
df1.to_csv("Flipkart_AC_Webscraping.csv",index=False)

In [23]:
currDate = datetime.datetime.now() 
print(f"Scrapping done on {currDate}")

Scrapping done on 2022-06-23 11:56:28.352433
