Credits : https://github.com/bansalkanav/Machine_Learning_and_Deep_Learning

# Web Scraping using BeautifulSoup

In this project we will learn to use the following two libraries:

1. `requests` - Fetches the HTML code from a given URL
2. `BeautifulSoup` - Parses the HTML and extracts (scrapes) data from it

In [1]:
# Installing BeautifulSoup

! pip install bs4





In [1]:
# Loading required libraries
import numpy as np

import requests
from bs4 import BeautifulSoup

In [2]:
# Identify the URL

URL = 'https://www.flipkart.com/search?q=laptops'

# URL = "https://www.flipkart.com/search?q=laptops"

In [3]:
# Loading the WebPage in Memory using requests library

request_header = {'Content-Type': 'text/html; charset=UTF-8', 
                  'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/119.0', 
                  'Accept-Encoding': 'gzip, deflate, br'}

response = requests.get(URL, headers=request_header)

# Check the Status Code of the Page
print(response.status_code)

200
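
A status code of 200 means the request succeeded. As a minimal sketch of a more defensive fetch, the helper below adds a timeout and turns error status codes into exceptions. The name `fetch_html`, the 10-second default timeout, and the optional `session` parameter are illustrative assumptions, not part of the original notebook. It relies on `Response.raise_for_status()`, which raises `requests.HTTPError` for 4xx and 5xx responses.

```python
import requests


def fetch_html(url, headers=None, timeout=10, session=None):
    """Return the HTML of `url`, raising requests.HTTPError on a 4xx/5xx status.

    `fetch_html`, `timeout=10` and `session` are illustrative choices,
    not part of the original notebook.
    """
    # Use the supplied session if there is one, else the module-level API.
    http = session if session is not None else requests
    # A timeout stops the call from hanging forever on an unresponsive server.
    response = http.get(url, headers=headers, timeout=timeout)
    # Raise an exception for error statuses instead of parsing an error page.
    response.raise_for_status()
    return response.text
```

With this helper, `html_code = fetch_html(URL, headers=request_header)` could stand in for the `requests.get` call and the manual status check.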


In [4]:
response.headers

{'server': 'nginx', 'date': 'Thu, 02 Nov 2023 07:25:15 GMT', 'content-type': 'text/html; charset=utf-8', 'transfer-encoding': 'chunked', 'content-security-policy': "script-src 'self' 'unsafe-eval' https://*.flixcart.com https://*.flixcart.net https://flipkart.d1.sc.omtrdc.net https://dpm.demdex.net https://tnc.phonepe.com https://captcha.px-cdn.net https://js-agent.newrelic.com https://bam.nr-data.net https://www.googletagmanager.com 'nonce-13360362059258659049'; style-src 'self' 'unsafe-inline' https://*.flixcart.com https://tnc.phonepe.com https://*.flixcart.net; img-src 'self' data: blob: https://*.flixcart.com https://*.flixcart.net https://images.ixigo.com https://flipkart.d1.sc.omtrdc.net https://www.facebook.com https://*.fkapi.net https://googleads.g.doubleclick.net https://www.google.com https://www.google.co.in https://www.googleadservices.com https://sp.analytics.yahoo.com https://bat.bing.com https://bat.r.msn.com https://pay.payzippy.com https://1.pay.payzippy.com https://

In [5]:
# Extracting the HTML Code of the WebPage

html_code = response.text

In [6]:
html_code

'<!doctype html><html lang="en"><head><link href="https://rukminim2.flixcart.com" rel="preconnect"/><link rel="stylesheet" href="//static-assets-web.flixcart.com/fk-p-linchpin-web/fk-cp-zion/css/app_modules.chunk.905c37.css"/><link rel="stylesheet" href="//static-assets-web.flixcart.com/fk-p-linchpin-web/fk-cp-zion/css/app.chunk.ccbde3.css"/><meta http-equiv="Content-type" content="text/html; charset=utf-8"/><meta http-equiv="X-UA-Compatible" content="IE=Edge"/><meta property="fb:page_id" content="102988293558"/><meta property="fb:admins" content="658873552,624500995,100000233612389"/><meta name="robots" content="noodp"/><link rel="shortcut icon" href="https://static-assets-web.flixcart.com/www/promos/new/20150528-140547-favicon-retina.ico"/><link type="application/opensearchdescription+xml" rel="search" href="/osdd.xml?v=2"/><meta property="og:type" content="website"/><meta name="og_site_name" property="og:site_name" content="Flipkart.com"/><link rel="apple-touch-icon" sizes="57x57" h

## Is This Gibberish? 😱

**Understanding HTML Syntax**
```HTML
<html>
    <head>
        <title>Home Page</title>
    </head>
    <body>
        <h1>This is a heading</h1>
        <p>A dummy paragraph with some dummy text.</p>
    </body>
</html>
```

**A Few Important HTML Tags**
| Tag | Description |
|:---|:---|
| `<h1>` to `<h6>` | Defines HTML headings |
| `<p>` | Defines a paragraph |
| `<audio>` | Defines embedded audio content |
| `<video>` | Defines embedded video content |
| `<img>` | Defines an image |
| `<a>` | Defines a hyperlink |
| `<b>` | Defines bold text |
| `<br>` | Defines a single line break |
| `<form>` | Defines an HTML form for user input |
| `<button>` | Defines a button |
| `<div>` | Defines a section in a document |
| `<ol>` | Defines an ordered list |
| `<ul>` | Defines an unordered list |
| `<li>` | Defines a list item |

**Important Global HTML Attributes**
1. **id** - A unique identifier for an element. An element can have only one ID, and an ID should be used by only one element on a page.
2. **class** - A label that can be shared by multiple elements. Class names are case sensitive, and one element can have multiple classes.
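
To make the difference concrete, the sketch below parses a small hand-written snippet and looks elements up by `id` and by `class`. The markup and the names `main`, `card`, `featured`, and `note` are made up for illustration. BeautifulSoup returns `id` as a single string and `class` as a list, because an element can carry several classes.

```python
from bs4 import BeautifulSoup

# Hand-written markup for illustration only; not taken from Flipkart.
html = """
<div id="main" class="card featured">
  <p class="card">First</p>
  <p class="note">Second</p>
</div>
"""
soup_demo = BeautifulSoup(html, "html.parser")

# Look up the single element that carries this id.
main = soup_demo.find(id="main")
print(main["id"])     # the id is a plain string
print(main["class"])  # the classes come back as a list

# A class search matches every element that has the class among its classes.
cards = soup_demo.find_all(class_="card")
print([tag.name for tag in cards])
```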


## Introduction to BeautifulSoup

In [7]:
# Parse the HTML code using the bs4 library

soup = BeautifulSoup(html_code, 'html.parser')

In [8]:
print(type(html_code))

<class 'str'>


In [9]:
print(type(soup))

<class 'bs4.BeautifulSoup'>
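
Once parsed, the document can be navigated by tag name. As a small sketch, the snippet below parses the sample page from the HTML syntax example above and reads a few tags through attribute access, which returns the first tag with that name. The variable names `sample_html` and `sample_soup` are introduced here for illustration.

```python
from bs4 import BeautifulSoup

# The sample page from the "Understanding HTML Syntax" example.
sample_html = """<html>
    <head>
        <title>Home Page</title>
    </head>
    <body>
        <h1>This is a heading</h1>
        <p>A dummy paragraph with some dummy text.</p>
    </body>
</html>"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# soup.<tag> is shorthand for soup.find('<tag>').
print(sample_soup.title.text)
print(sample_soup.h1.text)
print(sample_soup.p.text)
```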


In [10]:
print(soup.prettify())

<!DOCTYPE html>
<html lang="en">
 <head>
  <link href="https://rukminim2.flixcart.com" rel="preconnect"/>
  <link href="//static-assets-web.flixcart.com/fk-p-linchpin-web/fk-cp-zion/css/app_modules.chunk.905c37.css" rel="stylesheet"/>
  <link href="//static-assets-web.flixcart.com/fk-p-linchpin-web/fk-cp-zion/css/app.chunk.ccbde3.css" rel="stylesheet"/>
  <meta content="text/html; charset=utf-8" http-equiv="Content-type"/>
  <meta content="IE=Edge" http-equiv="X-UA-Compatible"/>
  <meta content="102988293558" property="fb:page_id"/>
  <meta content="658873552,624500995,100000233612389" property="fb:admins"/>
  <meta content="noodp" name="robots"/>
  <link href="https://static-assets-web.flixcart.com/www/promos/new/20150528-140547-favicon-retina.ico" rel="shortcut icon"/>
  <link href="/osdd.xml?v=2" rel="search" type="application/opensearchdescription+xml"/>
  <meta content="website" property="og:type"/>
  <meta content="Flipkart.com" name="og_site_name" property="og:site_name"/>
  <li

## Identifying the Data for Scraping

<img src="images/01_product_page.PNG">

**Steps**
1. Identify the URL of the page to scrape (`URL = '?'`).
2. Extract HTML code from the URL.
```python
import requests
response = requests.get(URL)
```
3. Parse the HTML code into a BeautifulSoup object.
```python
from bs4 import BeautifulSoup
html_code = response.text
soup = BeautifulSoup(html_code, 'html.parser')
```
4. Inspect the Web Page and find the HTML tag for the data that you want to extract.
5. Using the BeautifulSoup object, extract the tags you are interested in.
```python
# Return the first occurrence of the identified tag
tag = soup.find('div', attrs={'class' : '12_X_y'})
print(tag.text)

# Return all occurrences of the identified tag
all_tags = soup.find_all('div', attrs={'class' : '12_X_y'})
for tag in all_tags:
    print(tag.text)
```
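
One detail worth keeping in mind: `find()` returns `None` when nothing matches, so calling `.text` on the result would raise an `AttributeError`, while `find_all()` returns an empty result. The sketch below shows this on a one-line hand-written snippet. The class name `no_such_class` is hypothetical.

```python
from bs4 import BeautifulSoup

# Hand-written snippet for illustration.
snippet = '<div class="12_X_y">Hello</div>'
demo = BeautifulSoup(snippet, "html.parser")

found = demo.find("div", attrs={"class": "12_X_y"})
missing = demo.find("div", attrs={"class": "no_such_class"})

print(found.text)
print(missing)  # find() gives None when there is no match

# Guard against None before reading .text.
safe_text = missing.text if missing is not None else None

# find_all() never returns None; it returns an empty result instead.
all_missing = demo.find_all("div", attrs={"class": "no_such_class"})
print(len(all_missing))
```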


**Let's identify the tags and classes of the features listed below; based on them, we will scrape the relevant data from the Flipkart website.**

<img src="images/02_product.PNG">


**URL** 
- `https://www.flipkart.com/search?q=laptops`

**Product Title**  
- Tag - div 
- Class - _4rR01T

**Product Price**
- Tag - div 
- Class - _30jeq3 _1_WHN1  

**Product Rating** 
- Tag - div 
- Class - _3LWZlK  

**Number of Product Reviews and Ratings**
- Tag - span 
- Class - _2_R_DZ

**Product Feature List** 
- Tag - ul 
- Class - _1xgFaf  


### Extracting Data using `find()`

In [11]:
# Product Title

title = soup.find('div', attrs={'class' : '_4rR01T'})

print(type(title))

print(title.text)

<class 'bs4.element.Tag'>
Primebook Wifi MT8183 - (4 GB/64 GB EMMC Storage/Prime OS) PB Wifi Thin and Light Laptop


In [12]:
# Product Price

price = soup.find('div', attrs={'class' : '_30jeq3 _1_WHN1'})

print(price.text)

₹8,990
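
Note that passing `'_30jeq3 _1_WHN1'` as the class value matches the `class` attribute as an exact string, so the same two classes written in a different order would not match. As a sketch on hand-written markup (the second `div`, with the classes reversed, is invented for illustration), a CSS selector passed to `select()` matches both classes regardless of order.

```python
from bs4 import BeautifulSoup

# Hand-written markup; the second div lists the same two classes in reverse order.
html = """
<div class="_30jeq3 _1_WHN1">₹8,990</div>
<div class="_1_WHN1 _30jeq3">₹51,990</div>
"""
price_demo = BeautifulSoup(html, "html.parser")

# Exact string match on the class attribute: only the first div qualifies.
exact = price_demo.find_all("div", attrs={"class": "_30jeq3 _1_WHN1"})
print([tag.text for tag in exact])

# CSS selector: any div that has both classes, in any order.
by_selector = price_demo.select("div._30jeq3._1_WHN1")
print([tag.text for tag in by_selector])
```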


In [13]:
# Product Rating

customer_rating = soup.find('div', attrs={'class' : '_3LWZlK'})

print(customer_rating.text)

4.2


In [14]:
# Number of Ratings and Reviews

customer_review = soup.find('span', attrs={'class' : '_2_R_DZ'})

print(customer_review.text)

773 Ratings & 207 Reviews
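
The scraped value is a single string that holds two numbers. A minimal sketch of splitting it into integers with a regular expression is shown below. The function name `parse_counts` is hypothetical, and the pattern assumes the text always follows the "N Ratings & M Reviews" shape seen in the output above.

```python
import re


def parse_counts(text):
    """Return (ratings, reviews) as integers, or None if the text does not match.

    Assumes the 'N Ratings & M Reviews' shape; `parse_counts` is a hypothetical name.
    """
    match = re.search(r"([\d,]+)\s+Ratings\s*&\s*([\d,]+)\s+Reviews", text)
    if match is None:
        return None
    # Drop thousands separators such as the comma in '1,768' before converting.
    ratings, reviews = (int(group.replace(",", "")) for group in match.groups())
    return ratings, reviews


print(parse_counts("773 Ratings & 207 Reviews"))
print(parse_counts("1,768 Ratings & 177 Reviews"))
```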


In [15]:
# Product Feature List

feature_list = soup.find('ul', attrs = {'class' : '_1xgFaf'})

print(feature_list.text)

MediaTek MT8183 Processor4 GB LPDDR4 RAMAndroid Operating System29.46 cm (11.6 Inch) Display1 Year Pick and Drop Warranty
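
In the output above the individual features run together because `.text` concatenates the text of every child element without a separator. The sketch below shows two ways to keep them apart. It uses hand-written markup and assumes each feature sits in its own `<li>` inside the `<ul>`, which should be checked against the real page.

```python
from bs4 import BeautifulSoup

# Hand-written markup; assumes one <li> per feature inside the <ul>.
html = """
<ul class="_1xgFaf">
  <li>MediaTek MT8183 Processor</li>
  <li>4 GB LPDDR4 RAM</li>
  <li>Android Operating System</li>
</ul>
"""
feature_demo = BeautifulSoup(html, "html.parser").find("ul", attrs={"class": "_1xgFaf"})

# Option 1: one list entry per <li>.
items = [li.get_text(strip=True) for li in feature_demo.find_all("li")]
print(items)

# Option 2: a single string with an explicit separator between the pieces.
joined = feature_demo.get_text(separator=" | ", strip=True)
print(joined)
```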


### Extracting Data using `find_all()`

In [16]:
# Find All Product Titles

titles = soup.find_all('div', attrs={'class' : '_4rR01T'})

print(type(titles))
print(type(titles[1]))

for title in titles:
    print(title.text)

<class 'bs4.element.ResultSet'>
<class 'bs4.element.Tag'>
Primebook Wifi MT8183 - (4 GB/64 GB EMMC Storage/Prime OS) PB Wifi Thin and Light Laptop
ASUS TUF Gaming F15 - AI Powered Gaming Core i5 11th Gen 11260H - (8 GB/512 GB SSD/Windows 11 Home/4 G...
HP 2023 Athlon Dual Core 3050U - (8 GB/512 GB SSD/Windows 11 Home) 15s-ey1509AU Thin and Light Laptop
APPLE 2022 MacBook AIR M2 - (8 GB/256 GB SSD/Mac OS Monterey) MLY33HN/A
HP 255 G9 840T7PA Athlon Dual Core 3050U - (4 GB/256 GB SSD/DOS) 255 G8 Thin and Light Laptop
Lenovo Lenovo V15 Celeron Dual Core 4th Gen - (8 GB/256 GB SSD/Windows 11 Home) 82QYA00MIN Laptop
Infinix INBook Y1 Plus Intel Core i3 10th Gen 1005G1 - (8 GB/512 GB SSD/Windows 11 Home) XL28 Thin and...
Primebook S Wifi MT8183 - (4 GB/128 GB EMMC Storage/Prime OS) PB S Wifi Thin and Light Laptop
HP Ryzen 5 Hexa Core 5500U - (8 GB/512 GB SSD/Windows 11 Home) 15s- eq2144au Thin and Light Laptop
Acer Extensa (2023) Intel Core i3 12th Gen N305 - (8 GB/256 GB SSD/Windows 11 Home

In [17]:
# Find All Prices

prices = soup.find_all('div', attrs={'class' : '_30jeq3 _1_WHN1'})

for price in prices:
    print(price.text)

₹8,990
₹51,990
₹26,990
₹87,990
₹19,490
₹19,990
₹23,990
₹10,990
₹38,990
₹26,990
₹12,490
₹11,990
₹28,990
₹29,990
₹36,450
₹22,990
₹48,990
₹94,990
₹35,990
₹27,990
₹15,606
₹24,990
₹47,990
₹34,990


In [18]:
# Find All Customer Ratings

customer_ratings = soup.find_all('div', attrs={'class' : '_3LWZlK'})

for customer_rating in customer_ratings:
    print(customer_rating.text)

4.2
4.3
4.1
4.7
3.9
4
4.2
4.2
4.3
4.2
4.1
4.2
4.2
4.2
4.2
4.2
4.5
4.4
4.2
4.2
4.1
4.2
4
4.2
4.2
4
5
4.2
5
4
3.9
5
5
4.5
5
4
4.2
5
5


In [19]:
# Find all Customer Reviews

customer_reviews = soup.find_all('span', attrs={'class' : '_2_R_DZ'})

for customer_review in customer_reviews:
    print(customer_review.text)

773 Ratings & 207 Reviews
1,768 Ratings & 177 Reviews
2,776 Ratings & 227 Reviews
3,207 Ratings & 252 Reviews
401 Ratings & 32 Reviews
1,152 Ratings & 90 Reviews
3,042 Ratings & 387 Reviews
324 Ratings & 97 Reviews
5,643 Ratings & 470 Reviews
712 Ratings & 83 Reviews
1,176 Ratings & 350 Reviews
2,180 Ratings & 690 Reviews
2,979 Ratings & 277 Reviews
418 Ratings & 69 Reviews
533 Ratings & 62 Reviews
3,042 Ratings & 387 Reviews
420 Ratings & 42 Reviews
20 Ratings & 1 Reviews
787 Ratings & 75 Reviews
3,295 Ratings & 288 Reviews
1,176 Ratings & 350 Reviews
194 Ratings & 38 Reviews
134 Ratings & 12 Reviews
2,344 Ratings & 186 Reviews


In [20]:
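# Find all Product Features

features = soup.find_all('ul', attrs={'class' : '_1xgFaf'})

for feature in features:
    print(feature.text)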

## How Many Total Pages?

**URL for Page 2** - https://www.flipkart.com/search?q=laptops&page=2

**URL for Page 5** - https://www.flipkart.com/search?q=laptops&page=5

**URL for Page 8** - https://www.flipkart.com/search?q=laptops&page=8

**URL for Page 3** - https://www.flipkart.com/search?q=laptops&page=3

**URL for Page 10** - https://www.flipkart.com/search?q=laptops&page=10

**URL for Page 9** - https://www.flipkart.com/search?q=laptops&page=9

**If we know the base URL, can we generate the URL of each individual page?**

In [21]:
# Code

'''
URL = https://www.flipkart.com/search?q=laptops&page=9
'''

# range(1, pages) stops before `pages`, so this prints the URLs for pages 1 to 79
pages = 80
for i in range(1, pages):
    print('https://www.flipkart.com/search?q=laptops&page={}'.format(i))

https://www.flipkart.com/search?q=laptops&page=1
https://www.flipkart.com/search?q=laptops&page=2
https://www.flipkart.com/search?q=laptops&page=3
https://www.flipkart.com/search?q=laptops&page=4
https://www.flipkart.com/search?q=laptops&page=5
https://www.flipkart.com/search?q=laptops&page=6
https://www.flipkart.com/search?q=laptops&page=7
https://www.flipkart.com/search?q=laptops&page=8
https://www.flipkart.com/search?q=laptops&page=9
https://www.flipkart.com/search?q=laptops&page=10
https://www.flipkart.com/search?q=laptops&page=11
https://www.flipkart.com/search?q=laptops&page=12
https://www.flipkart.com/search?q=laptops&page=13
https://www.flipkart.com/search?q=laptops&page=14
https://www.flipkart.com/search?q=laptops&page=15
https://www.flipkart.com/search?q=laptops&page=16
https://www.flipkart.com/search?q=laptops&page=17
https://www.flipkart.com/search?q=laptops&page=18
https://www.flipkart.com/search?q=laptops&page=19
https://www.flipkart.com/search?q=laptops&page=20
https://w
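
Because `range(1, pages)` excludes its end value, the loop above covers pages 1 to 79 when `pages = 80`. As an alternative sketch, the helper below takes an inclusive page range and builds the query string with `urllib.parse.urlencode`. The names `BASE_URL` and `page_urls` are introduced here for illustration.

```python
from urllib.parse import urlencode

# Base search URL without the query string (name chosen for this sketch).
BASE_URL = "https://www.flipkart.com/search"


def page_urls(query, first_page, last_page):
    """Return the search URLs for pages first_page..last_page, both inclusive."""
    return [
        f"{BASE_URL}?{urlencode({'q': query, 'page': page})}"
        for page in range(first_page, last_page + 1)
    ]


urls = page_urls("laptops", 1, 3)
for url in urls:
    print(url)
```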

## Iterate Over Each Page & Scrape the Data

<img src="images/03_items_1.PNG">


**URL** 
- `https://www.flipkart.com/search?q=laptops`

**Product Title**  
- Tag - div 
- Class - _4rR01T

**Product Price**
- Tag - div 
- Class - _30jeq3 _1_WHN1  

**Product Rating** 
- Tag - div 
- Class - _3LWZlK  

**Number of Product Reviews and Ratings**
- Tag - span 
- Class - _2_R_DZ

**Product Feature List** 
- Tag - ul 
- Class - _1xgFaf  


In [22]:
%%time

title_data = []
price_data = []
rating_data = []
review_data = []
feature_data = []

for i in range(1, 10):
    URL = 'https://www.flipkart.com/search?q=laptops&page={}'.format(i)
    request_header = {'Content-Type': 'text/html; charset=UTF-8', 
                      'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/119.0', 
                      'Accept-Encoding': 'gzip, deflate, br'}
    
    response = requests.get(URL, headers=request_header)
    html_code = response.text
    
    soup = BeautifulSoup(html_code, 'html.parser')
    
    # Scrape all titles from the current page
    titles = soup.find_all('div', attrs={'class' : '_4rR01T'})
    for title in titles:
        title_data.append(title.text)

    # prices
    prices = soup.find_all('div', attrs={'class' : '_30jeq3 _1_WHN1'})
    for price in prices:
        price_data.append(price.text)

    # ratings
    customer_ratings = soup.find_all('div', attrs={'class' : '_3LWZlK'})
    for customer_rating in customer_ratings:
        rating_data.append(customer_rating.text)
        
    # reviews
    customer_reviews = soup.find_all('span', attrs={'class' : '_2_R_DZ'})
    for customer_review in customer_reviews:
        review_data.append(customer_review.text)
    
    # features
    features = soup.find_all('ul', attrs={'class' : '_1xgFaf'})
    for feature in features:
        feature_data.append(feature.text)

CPU times: total: 1.27 s
Wall time: 5.1 s


In [23]:
print("N U M B E R 's")
print("Product Title:", len(title_data))
print("Product Price:", len(price_data))
print("Product Rating:", len(rating_data))
print("Product Review:", len(review_data))
print("Product Feature:", len(feature_data))

N U M B E R 's
Product Title: 216
Product Price: 216
Product Rating: 347
Product Review: 212
Product Feature: 216


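### More Ratings? Where is the problem? 🤔

Each list should hold one entry per product (216), yet there are 347 ratings and only 212 reviews. A page-wide `find_all()` collects every element with the given class, so the rating class evidently matches more elements than there are products, while products without reviews leave no entry at all. The lists therefore cannot be lined up row by row.

<img src="images/04_problem.PNG">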

## Iterate Over Each Page & Scrape the Data Again (Correct Way)

**This time let's do the following:**
1. Iterate over each page.
2. Each page contains multiple product containers. Identify these containers with their tag and class.
3. Iterate over each container and extract relevant product data.

<img src="images/05_items_2.PNG">
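
The difference between the two approaches can be sketched on a tiny hand-written page. The markup below is invented for illustration: it reuses the title and rating classes from above and the container class `_2kHMtA` used in the next cell, gives the second product no rating, and places two rating badges outside any product container.

```python
from bs4 import BeautifulSoup

# Hand-written markup: product B has no rating, and two stray rating
# badges sit outside the product containers.
html = """
<div class="_2kHMtA"><div class="_4rR01T">Laptop A</div><div class="_3LWZlK">4.2</div></div>
<div class="_2kHMtA"><div class="_4rR01T">Laptop B</div></div>
<div class="_3LWZlK">5</div>
<div class="_3LWZlK">4</div>
"""
page = BeautifulSoup(html, "html.parser")

# Page-wide search: the lists end up with different lengths.
titles = [t.text for t in page.find_all("div", attrs={"class": "_4rR01T"})]
ratings = [r.text for r in page.find_all("div", attrs={"class": "_3LWZlK"})]
print(len(titles), len(ratings))

# Container-wise search: one row per product, with None where a value is missing.
rows = []
for container in page.find_all("div", attrs={"class": "_2kHMtA"}):
    title = container.find("div", attrs={"class": "_4rR01T"})
    rating = container.find("div", attrs={"class": "_3LWZlK"})
    rows.append((title.text, rating.text if rating is not None else None))
print(rows)
```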

In [24]:
%%time

title_data = []
price_data = []
rating_data = []
review_data = []
feature_data = []

pages = 80
print("Progress: ", end="")

for i in range(1, pages):
    URL = 'https://www.flipkart.com/search?q=laptops&page={}'.format(i)
    request_header = {'Content-Type': 'text/html; charset=UTF-8', 
                      'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/119.0', 
                      'Accept-Encoding': 'gzip, deflate, br'}
    
    response = requests.get(URL, headers=request_header)
    html_code = response.text
    
    soup = BeautifulSoup(html_code, 'html.parser')

    # Iterate over all the products on the current page
    for container in soup.find_all('div', attrs={'class' : '_2kHMtA'}):
        # For each container, scrape the following:
        # 1. Product Title
        # 2. Product Price
        # 3. Product Customer Rating
        # 4. Product Customer Review
        # 5. Product Features

        # 1. Product Title
        product = container.find('div', attrs={'class' : '_4rR01T'})
        if product is None:
            title_data.append(np.nan)
        else:
            title_data.append(product.text)

        # 2. Product Price
        price = container.find('div', attrs={'class' : '_30jeq3 _1_WHN1'})
        if price is None:
            price_data.append(np.nan)
        else:
            price_data.append(price.text)

        # 3. Product Customer Rating
        rating = container.find('div', attrs={'class' : '_3LWZlK'})
        if rating is None:
            rating_data.append(np.nan)
        else:
            rating_data.append(rating.text)

        # 4. Product Customer Review
        customer_review = container.find('span', attrs={'class' : '_2_R_DZ'})
        if customer_review is None:
            review_data.append(np.nan)
        else:
            review_data.append(customer_review.text)

        # 5. Product Features
        feature = container.find('ul', attrs={'class' : '_1xgFaf'})
        if feature is None:
            feature_data.append(np.nan)
        else:
            feature_data.append(feature.text)

    print(".", end="")

print()

Progress: ...............................................................................
CPU times: total: 7.41 s
Wall time: 33.8 s
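
The five "find, then check for `None`" blocks in the loop all follow the same pattern, so they could be folded into one small helper. The sketch below is one way to do that. The function name `text_or_nan` is hypothetical, and the container is hand-written for illustration. It uses the lowercase `np.nan` spelling.

```python
import numpy as np
from bs4 import BeautifulSoup


def text_or_nan(container, tag, class_name):
    """Return the text of the first matching tag inside `container`, or NaN if there is none."""
    element = container.find(tag, attrs={"class": class_name})
    return element.text if element is not None else np.nan


# Hand-written container with a title but no rating.
card = BeautifulSoup(
    '<div class="_2kHMtA"><div class="_4rR01T">Laptop B</div></div>', "html.parser"
)
print(text_or_nan(card, "div", "_4rR01T"))
print(text_or_nan(card, "div", "_3LWZlK"))
```

Inside the loop, each block would then shrink to a single line such as `title_data.append(text_or_nan(container, 'div', '_4rR01T'))`.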


In [25]:
print("N U M B E R 's")
print("Product Title:", len(title_data))
print("Product Price:", len(price_data))
print("Product Rating:", len(rating_data))
print("Product Review:", len(review_data))
print("Product Feature:", len(feature_data))

N U M B E R 's
Product Title: 984
Product Price: 984
Product Rating: 984
Product Review: 984
Product Feature: 984


# Create a DataFrame & Export it in CSV Format

In [26]:
import pandas as pd

In [27]:
df = pd.DataFrame({'Product Title' : title_data, 
                   'Product Price' : price_data, 
                   'Product Rating' : rating_data, 
                   'Product Review' : review_data,
                   'Product Feature' : feature_data})

In [28]:
df.shape

(984, 5)

In [29]:
df.head()

| | Product Title | Product Price | Product Rating | Product Review | Product Feature |
|:---|:---|:---|:---|:---|:---|
| 0 | Primebook Wifi MT8183 - (4 GB/64 GB EMMC Stora... | ₹8,990 | 4.2 | 773 Ratings & 207 Reviews | MediaTek MT8183 Processor4 GB LPDDR4 RAMAndroi... |
| 1 | ASUS TUF Gaming F15 - AI Powered Gaming Core i... | ₹51,990 | 4.3 | 1,768 Ratings & 177 Reviews | Intel Core i5 Processor (11th Gen)8 GB DDR4 RA... |
| 2 | HP 2023 Athlon Dual Core 3050U - (8 GB/512 GB ... | ₹26,990 | 4.1 | 2,776 Ratings & 227 Reviews | AMD Athlon Dual Core Processor8 GB DDR4 RAMWin... |
| 3 | APPLE 2022 MacBook AIR M2 - (8 GB/256 GB SSD/M... | ₹87,990 | 4.7 | 3,207 Ratings & 252 Reviews | Apple M2 Processor8 GB Unified Memory RAMMac O... |
| 4 | Infinix INBook Y1 Plus Intel Core i3 10th Gen ... | ₹23,990 | 4.2 | 3,042 Ratings & 387 Reviews | Intel Core i3 Processor (10th Gen)8 GB LPDDR4X... |


In [30]:
df.tail()

| | Product Title | Product Price | Product Rating | Product Review | Product Feature |
|:---|:---|:---|:---|:---|:---|
| 979 | ASUS TUF Gaming F15 Core i5 10th Gen i5-10300H... | ₹67,990 | 4.5 | 177 Ratings & 15 Reviews | Intel Core i5 Processor (10th Gen)8 GB DDR4 RA... |
| 980 | Wings Nuvobook S1 Aluminium Alloy Metal Body I... | ₹24,990 | 4.2 | 194 Ratings & 38 Reviews | Intel Core i3 Processor (11th Gen)8 GB DDR4 RA... |
| 981 | Lenovo IdeaPad 1 Athlon Dual Core 7120U - (8 G... | ₹26,990 | 4.2 | 593 Ratings & 60 Reviews | AMD Athlon Dual Core Processor8 GB LPDDR5 RAMW... |
| 982 | HP Pavilion Intel Core i3 11th Gen 1125G4 - (8... | ₹54,990 | 4.3 | 215 Ratings & 15 Reviews | Intel Core i3 Processor (11th Gen)8 GB DDR4 RA... |
| 983 | MSI GF63 Core i5 12th Gen 12450H - (16 GB/1 TB... | ₹68,222 | 4.2 | 58 Ratings & 7 Reviews | Intel Core i5 Processor (12th Gen)16 GB DDR4 R... |


In [31]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 984 entries, 0 to 983
Data columns (total 5 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   Product Title    984 non-null    object
 1   Product Price    983 non-null    object
 2   Product Rating   865 non-null    object
 3   Product Review   865 non-null    object
 4   Product Feature  984 non-null    object
dtypes: object(5)
memory usage: 38.6+ KB
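
Every column has dtype `object` because the scraped values are strings, and the non-null counts show that some products have no price, rating, or review. As a rough sketch of a possible next step, the snippet below converts price and rating to numbers on a three-row sample built in the same shape as the scraped data. The sample frame and the new column names `Price (INR)` and `Rating` are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Three rows in the same shape as the scraped data; the last one has no rating.
sample = pd.DataFrame({
    "Product Price": ["₹8,990", "₹51,990", "₹26,990"],
    "Product Rating": ["4.2", "4.3", np.nan],
})

# Strip everything that is not a digit (currency sign, commas), then convert.
sample["Price (INR)"] = pd.to_numeric(
    sample["Product Price"].str.replace(r"[^\d]", "", regex=True)
)

# Missing ratings stay NaN after the conversion.
sample["Rating"] = pd.to_numeric(sample["Product Rating"])

print(sample.dtypes)
```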


In [33]:
df.to_csv('data/laptop_raw_data.csv', index = False)
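
`to_csv` raises an `OSError` if the target folder does not exist, so the `data` directory has to be created before the export. A small sketch of a wrapper that takes care of this is shown below. The function name `save_csv` is hypothetical.

```python
from pathlib import Path

import pandas as pd


def save_csv(frame, path):
    """Write `frame` to `path` as CSV, creating the parent folder if needed."""
    path = Path(path)
    # Create 'data/' (or any missing parent folders); do nothing if they exist.
    path.parent.mkdir(parents=True, exist_ok=True)
    frame.to_csv(path, index=False)
    return path
```

Calling `save_csv(df, 'data/laptop_raw_data.csv')` would then write the same file as the cell above.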