-------------------------------------------------------------------------------------------------------------------------------

# Web Scraping

Import the basic libraries used for web scraping.
<li>Requests - A Python library used to send an HTTP request to a website and store the response object in a variable.</li>
<li>BeautifulSoup - A Python library used to extract data from an HTML or XML document.</li>
<li>Pandas - A Python library used for data analysis. Here it is used to build a dataframe from the scraped values.</li>

In [1]:
from bs4 import BeautifulSoup
import requests
import pandas as pd

In [2]:
url = 'https://www.flipkart.com/mobiles/pr?sid=tyy,4io&otracker=categorytree'
r = requests.get(url)
s = BeautifulSoup(r.content, 'html.parser')

The status_code attribute of the response indicates whether the request for the above URL received a successful response. Some common values:
<li>200 - "OK"</li>
<li>404 - "NOT FOUND"</li>
<li>403 - "FORBIDDEN"</li>
<li>500 - "INTERNAL SERVER ERROR"</li>

In [3]:
r.status_code

200
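The codes listed above can also be looked up programmatically. The following is a minimal sketch that uses only the standard library; the helper name `describe_status` is hypothetical and not part of the original notebook.

```python
from http import HTTPStatus


def describe_status(code: int) -> str:
    """Return a short description of an HTTP status code (hypothetical helper)."""
    try:
        status = HTTPStatus(code)
    except ValueError:
        # The standard library does not know this code.
        return f"{code} - UNKNOWN"
    if status < 400:
        kind = "no error"
    elif status < 500:
        kind = "client error"
    else:
        kind = "server error"
    return f"{code} - {status.phrase.upper()} ({kind})"


for code in (200, 404, 403, 500):
    print(describe_status(code))
```

When working with Requests directly, calling `r.raise_for_status()` is a common way to stop early on a 4xx or 5xx response instead of inspecting the code by hand.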

-------------------------------------------------------------------------------------------------------------------------------

The variable "s" contains the parsed HTML of the page. A specific chunk of it can be accessed by identifying its tag with the "find" or "find_all" method. To narrow the search, the class or id of the tag can be passed to the method.<br>
<li>.find() - Returns the first occurrence that matches the specified parameter values</li>
<li>.find_all() - Returns all occurrences that match the specified parameter values</li>
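To make the difference between the two methods concrete, here is a small sketch run against a made-up HTML snippet. The markup and the class name `item` are assumptions for illustration only and are not taken from the scraped page.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML, made up for illustration (not from the scraped page).
html = """
<div class="item">First</div>
<div class="item">Second</div>
<span class="item">Third</span>
"""
soup = BeautifulSoup(html, "html.parser")

# find() returns a single Tag, or None when nothing matches.
first = soup.find("div", {"class": "item"})
# find_all() returns a list-like ResultSet, which may be empty.
every = soup.find_all("div", {"class": "item"})

print(first.get_text())                         # First
print([tag.get_text() for tag in every])        # ['First', 'Second']
print(soup.find("div", {"class": "missing"}))   # None
```

Because `find()` can return `None`, calling `.get_text()` on its result raises an `AttributeError` when the tag or class is wrong, which is one reason to test the selectors on a single product first.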

In [4]:
# Collect every product card on the page (one div per listed phone).
MoblieDetails = []
for card in s.find_all('div',{'class':'_1UoZlX'}):
    MoblieDetails.append(card)

In [5]:
print(MoblieDetails[0].prettify())

<div class="_1UoZlX">
 <a class="_31qSD5" href="/redmi-8-onyx-black-64-gb/p/itmebd23d8a2ed1b?pid=MOBFKPYDZJQHGJXA&amp;lid=LSTMOBFKPYDZJQHGJXAPMDXOB&amp;marketplace=FLIPKART&amp;srno=b_1_1&amp;otracker=browse&amp;fm=organic&amp;iid=330c5703-3380-4940-a6c8-5613395a6b2f.MOBFKPYDZJQHGJXA.SEARCH&amp;ssid=04xavwjhkw0000001595947281529" rel="noopener noreferrer" target="_blank">
  <div class="_3SQWE6">
   <div class="_1OCn9C">
    <div>
     <div class="_3BTv9X" style="height:200px;width:200px">
      <img alt="Redmi 8 (Onyx Black, 64 GB)" class="_1Nyybr" src="//img1a.flixcart.com/www/linchpin/fk-cp-zion/img/placeholder_9951d0.svg"/>
     </div>
    </div>
   </div>
   <div class="_2lesQu">
    <div class="_1O_CiZ">
     <span class="_1iHA1p">
      <div class="_2kFyHg">
       <label>
        <input class="_3uUUD5" readonly="" type="checkbox"/>
        <div class="_1p7h2j">
        </div>
       </label>
      </div>
     </span>
     <label class="_10TB-Q">
      <span>
       Add to Compar

Before creating the lists that will hold the values, it is good practice to verify on a single product that we have specified the right tags, classes and ids.

In [6]:
Check = MoblieDetails[0]

ModelName = Check.find('div',{'class':'_3wU53n'}).get_text()
DiscountedPrice = Check.find('div',{'class':'_1vC4OE _2rQ-NK'}).get_text()
Price = Check.find('div',{'class':'_3auQ3N _2GcJzG'}).get_text()
RateRev = Check.find('span',{'class':'_38sUEc'}).get_text()
Stars = Check.find('div',{'class':'hGSR34'}).get_text()

In [7]:
print("*"*len(RateRev))
print(ModelName)
print(DiscountedPrice)
print(Price)
print(RateRev)
print(Stars)
print("*"*len(RateRev))

*********************************
Redmi 8 (Onyx Black, 64 GB)
₹9,799
₹10,999
6,70,107 Ratings & 49,811 Reviews
4.4
*********************************


Finally, we can scrape the data from the site into a number of lists.<br>
While analyzing the site, we note the following points:<br>
<li>Every price includes the Rupee (₹) symbol and comma separators, which need to be removed.</li>
<li>A few products do not offer any discount, so for such products we use the displayed price as the list price.</li>
<li>Ratings and Reviews come as a single string that we need to split.</li>
<li>Each product is given a star rating reflecting its popularity/performance. The site also suggests additional products. As we are concerned with only 24 products, we need to be sure to scrape only the stars that belong to these 24 products.</li>
<li>We need to extract the RAM and ROM specifications from the details. While scraping the details, we must also make sure to return a placeholder value ('0' in the code below) if no specifications are available.</li><br>
(<b>Note: We are only displaying the first 5 records</b>)

In [8]:
ModelName = s.find_all('div',{'class':'_3wU53n'})
ModelName = [pt.get_text() for pt in ModelName]
ModelName[:5] 

['Redmi 8 (Onyx Black, 64 GB)',
 'Redmi 8 (Sapphire Blue, 64 GB)',
 'Realme 6 (Comet Blue, 64 GB)',
 'Realme Narzo 10A (So White, 32 GB)',
 'Realme Narzo 10A (So White, 64 GB)']

In [10]:
SalesPrice = s.find_all('div',{'class':'_1vC4OE _2rQ-NK'})
SalesPrice = [pt.get_text().replace("₹",'').replace(",",'') for pt in SalesPrice]
SalesPrice[:5]

['9799', '9799', '15999', '8999', '9999']

In [11]:
ListPrice = []
for i in s.find_all('div',{'class':'_1uv9Cb'})[:24]:
    if len(list(i.children)) == 1:
        # Only one price element: no discount, so the displayed price is the list price.
        ListPrice.append(i.find('div',{'class':'_1vC4OE _2rQ-NK'}).text.replace("₹",'').replace(",",''))
    else:
        # Discounted product: take the original price element.
        ListPrice.append(i.find('div',{'class':'_3auQ3N _2GcJzG'}).text.replace("₹",'').replace(",",''))
ListPrice[:5]

['10999', '10999', '17999', '9999', '10999']
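The price cleanup and the "no discount" fallback can also be expressed as two small functions that return integers directly. This is only a sketch: the names `parse_rupees` and `list_price` are hypothetical, and passing `None` for a missing original price is an assumption about how the caller represents an undiscounted product.

```python
from typing import Optional


def parse_rupees(text: str) -> int:
    """Convert a price string such as '₹9,799' to an integer number of rupees."""
    return int(text.replace("₹", "").replace(",", "").strip())


def list_price(sale_text: str, original_text: Optional[str]) -> int:
    """Return the original price, or the sale price when there is no discount."""
    if original_text is None:
        # No separate original price is shown, so the sale price is the list price.
        return parse_rupees(sale_text)
    return parse_rupees(original_text)


print(list_price("₹9,799", "₹10,999"))  # 10999
print(list_price("₹9,799", None))       # 9799
```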

In [12]:
Ratings = s.find_all('span',{'class':'_38sUEc'})
Ratings = [pt.get_text().replace("\xa0"," ").split(" & ")[0].replace(" Ratings","").replace(",",'') for pt in Ratings]
Ratings[:5]

['670107', '670107', '29837', '42472', '16351']

In [13]:
Reviews = s.find_all('span',{'class':'_38sUEc'})
Reviews = [pt.get_text().replace("\xa0"," ").split(" & ")[1].replace(" Reviews","").replace(",",'') for pt in Reviews]
Reviews[:5]

['49811', '49811', '2944', '2903', '1069']
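Ratings and reviews come from the same string, so both can be parsed in a single pass. The sketch below uses a regular expression instead of chained `replace` and `split` calls; `parse_ratings_reviews` is a hypothetical name, and the function assumes the text contains exactly two numbers, ratings first.

```python
import re
from typing import Tuple


def parse_ratings_reviews(text: str) -> Tuple[int, int]:
    """Split '6,70,107 Ratings & 49,811 Reviews' into (ratings, reviews)."""
    # Grab every run of digits and commas, then drop the commas.
    numbers = [int(n.replace(",", "")) for n in re.findall(r"\d[\d,]*", text)]
    if len(numbers) != 2:
        raise ValueError(f"unexpected ratings/reviews text: {text!r}")
    return numbers[0], numbers[1]


# The non-breaking spaces (\xa0) need no special handling here.
print(parse_ratings_reviews("6,70,107 Ratings\xa0&\xa049,811 Reviews"))  # (670107, 49811)
```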

In [14]:
Stars = s.find_all('div',{'class':'hGSR34'})
Stars = [pt.get_text() for pt in Stars]
Stars = Stars[:24]
Stars[:5]

['4.4', '4.4', '4.4', '4.6', '4.6']
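Slicing to the first 24 stars works when every listed product has a rating. An alternative, sketched here under assumptions, is to search inside each product card so that values from suggested products never enter the list and a missing field becomes `None` instead of shifting later values. The HTML below is made up; it only reuses the class names from the cells above, and `text_or_none` is a hypothetical helper.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that reuses the class names from the cells above;
# the real page structure may differ.
html = """
<div class="_1UoZlX">
  <div class="_3wU53n">Phone A</div>
  <div class="hGSR34">4.4</div>
</div>
<div class="_1UoZlX">
  <div class="_3wU53n">Phone B</div>
</div>
<div class="hGSR34">3.9</div>
"""
soup = BeautifulSoup(html, "html.parser")


def text_or_none(card, name, css_class):
    """Return the stripped text of the first matching tag inside card, or None."""
    tag = card.find(name, {"class": css_class})
    return tag.get_text(strip=True) if tag is not None else None


records = []
for card in soup.find_all("div", {"class": "_1UoZlX"}):
    records.append({
        "MobileName": text_or_none(card, "div", "_3wU53n"),
        "Stars": text_or_none(card, "div", "hGSR34"),
    })

print(records)
```

In this sketch the stray `3.9` outside the cards is ignored, and "Phone B" gets `None` for its stars.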

In [15]:
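RAM = []
for text in s.find_all('div',{'class':'_3ULzGw'}):
    # Splitting on 'RAM' yields exactly two parts when RAM is mentioned once.
    length = len(text.get_text().split('RAM'))
    if length == 2:
        # The RAM size is the first '|'-separated field, e.g. '4 GB RAM'.
        RAM.append(text.get_text().split(" |")[0].replace(" GB RAM",""))
    else:
        # No RAM information available: fall back to '0'.
        RAM.append('0')
RAM[:5]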

['4', '4', '6', '3', '4']

In [16]:
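ROM = []
for spec in s.find_all('div',{'class':'_3ULzGw'}):
    # Keep only the text before 'ROM', e.g. '4 GB RAM | 64 GB '.
    text = spec.get_text().split('ROM')[0]
    length = len(text.split('|'))
    if length == 1:
        # No RAM field: the ROM size is the only field, e.g. '64 GB '.
        ROM.append(text.replace(" GB ",""))
    elif length == 2:
        # 'RAM | ROM' layout: the ROM size is the second field.
        ROM.append(text.split(" | ")[1].replace(" GB ",""))
    else:
        ROM.append('0')
ROM[:5]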

['64', '64', '64', '32', '64']
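The RAM and ROM values can also be pulled from the specification text with regular expressions, which does not depend on the order or number of `|`-separated fields. This is a sketch, not the notebook's method: `parse_memory` is a hypothetical name, the sample strings are assumptions about what the specification text looks like, and sizes are assumed to be given in GB.

```python
import re
from typing import Tuple


def parse_memory(spec: str) -> Tuple[int, int]:
    """Return (ram_gb, rom_gb) from a spec line, using 0 for a missing value."""
    ram = re.search(r"(\d+)\s*GB\s*RAM", spec)
    rom = re.search(r"(\d+)\s*GB\s*ROM", spec)
    return (
        int(ram.group(1)) if ram else 0,
        int(rom.group(1)) if rom else 0,
    )


# Assumed example strings, for illustration only.
print(parse_memory("4 GB RAM | 64 GB ROM | Expandable Upto 512 GB"))  # (4, 64)
print(parse_memory("64 GB ROM"))                                      # (0, 64)
print(parse_memory("No memory details"))                              # (0, 0)
```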

-------------------------------------------------------------------------------------------------------------------------------

Once all the data has been scraped into the lists, we can build a data frame from them. While building the data frame, we convert the string data to integers and floats, and replace any null values with 0.

In [17]:
import pandas as pd

MobilePhones = pd.DataFrame({
        "MobileName": pd.Series(ModelName).fillna('NAN'),
        "RAM_GB": pd.Series(RAM).fillna(0).astype('int'),
        "ROM_GB": pd.Series(ROM).fillna(0).astype('int'),
        "Ratings": pd.Series(Ratings).fillna(0).astype('int'),
        "Reviews": pd.Series(Reviews).fillna(0).astype('int'),
        "Stars": pd.Series(Stars).fillna(0).astype('float'),
        "ListPrice": pd.Series(ListPrice).fillna(0).astype('int'),
        "SalesPrice": pd.Series(SalesPrice).fillna(0).astype('int')
        })

In [18]:
MobilePhones

                             MobileName  RAM_GB  ROM_GB  Ratings  Reviews  Stars  ListPrice  SalesPrice
0           Redmi 8 (Onyx Black, 64 GB)       4      64   670107    49811    4.4      10999        9799
1        Redmi 8 (Sapphire Blue, 64 GB)       4      64   670107    49811    4.4      10999        9799
2          Realme 6 (Comet Blue, 64 GB)       6      64    29837     2944    4.4      17999       15999
3    Realme Narzo 10A (So White, 32 GB)       3      32    42472     2903    4.6       9999        8999
4    Realme Narzo 10A (So White, 64 GB)       4      64    16351     1069    4.6      10999        9999
5     Realme Narzo 10A (So Blue, 32 GB)       3      32    42472     2903    4.6       9999        8999
6     Realme Narzo 10A (So Blue, 64 GB)       4      64    16351     1069    4.6      10999        9999
7         Realme 6 (Comet White, 64 GB)       6      64    29837     2944    4.4      17999       15999
8  Realme Narzo 10 (That Green, 128 GB)       4     128    73440     5690    4.5      12999       11999
9   Realme Narzo 10 (That Blue, 128 GB)       4     128    73440     5690    4.5      12999       11999


In [19]:
print(MobilePhones.dtypes)

MobileName     object
RAM_GB          int32
ROM_GB          int32
Ratings         int32
Reviews         int32
Stars         float64
ListPrice       int32
SalesPrice      int32
dtype: object
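The `fillna(0).astype('int')` pattern used above only handles true nulls; a stray non-numeric string would make `astype` raise. One possible hardening, shown here as a sketch with made-up values, is to pass the column through `pd.to_numeric` with `errors="coerce"` first. Requesting `int64` explicitly also makes the resulting dtype independent of the platform default.

```python
import pandas as pd

# Hypothetical scraped values; 'N/A' stands in for a field that could not be parsed.
raw = pd.Series(["4", "N/A", None, "6"])

# errors="coerce" turns anything non-numeric into NaN instead of raising,
# so fillna(0) can supply the default before the integer cast.
clean = pd.to_numeric(raw, errors="coerce").fillna(0).astype("int64")

print(clean.tolist())  # [4, 0, 0, 6]
print(clean.dtype)     # int64
```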


-------------------------------------------------------------------------------------------------------------------------------