# Web Scraping Smartphone Data from Flipkart
This notebook demonstrates how to collect smartphone data from Flipkart using Python. The process involves automated web scraping with Selenium and BeautifulSoup, followed by data cleaning and saving the results to a CSV file.

**Steps performed in this notebook:**
1. **Import Libraries:** Essential libraries for web automation, HTML parsing, and data manipulation are imported.
2. **Web Scraping:** Selenium automates the browser and retrieves the HTML of 12 pages of Flipkart smartphone listings. BeautifulSoup parses the HTML to extract each phone's name, rating, selling price (`prices`), and list price (`actual_prices`), plus the first specification line. That line is stored under `processors`, although in the output below it holds RAM and storage details rather than the processor.
3. **Data Cleaning:** The price strings are stripped of the `₹` symbol and thousands separators and converted to NumPy integers. The discount is then computed as the list price minus the selling price.
4. **Validation:** The lengths of all extracted lists are printed to confirm that every column has the same number of entries.
5. **DataFrame Creation:** The collected data is organized into a pandas DataFrame for easy analysis.
6. **Export:** The final dataset is exported to a CSV file for further use.

This workflow enables efficient collection and structuring of product data for analysis or machine learning tasks.

In [1]:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup
import pandas as pd
import time


phones = {
    'name': [],
    'ratings': [],
    'prices': [],
    'actual_prices': [],
    'discounts': [],  # filled in after scraping, from the two price columns
    'processors': [],
}


driver = webdriver.Chrome()

try:
    for page in range(1, 13):  # result pages 1 through 12
        driver.get(f'https://www.flipkart.com/search?q=smartphones&page={page}')
        time.sleep(3)  # Allow some time for the page to load

        # Parse the rendered page source
        soup = BeautifulSoup(driver.page_source, 'lxml')

        # Each product card on the results page
        for card in soup.find_all('div', 'tUxRFH'):
            fields = {
                'name': card.find('div', 'KzDlHZ'),
                'ratings': card.find('div', 'XQDdHH'),
                'prices': card.find('div', 'Nx9bqj _4b5DiR'),
                'actual_prices': card.find('div', 'yRaY8j ZYYwLA'),
                'processors': card.find('li', 'J+igdf'),
            }
            # Skip cards missing any field so the lists stay the same length
            # (indexing [0] on an empty find_all result would raise IndexError)
            if any(tag is None for tag in fields.values()):
                continue
            for key, tag in fields.items():
                phones[key].append(tag.text.strip())
finally:
    driver.quit()  # Close the browser even if scraping fails
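
The parsing step can be tried without launching a browser. The following is a minimal sketch, assuming a hand-written HTML fragment that reuses three of the class names from the cell above. The fragment, the phone names in it, and the `text_or_none` helper are hypothetical and are not Flipkart markup or part of the original notebook. It uses Python's built-in `html.parser` instead of `lxml` so that it needs no extra parser dependency.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: two product cards, the second has no rating element
SAMPLE_HTML = """
<div class="tUxRFH">
  <div class="KzDlHZ">Phone A (Black, 128 GB)</div>
  <div class="XQDdHH">4.3</div>
  <div class="Nx9bqj _4b5DiR">₹11,999</div>
</div>
<div class="tUxRFH">
  <div class="KzDlHZ">Phone B (Blue, 64 GB)</div>
  <div class="Nx9bqj _4b5DiR">₹6,999</div>
</div>
"""


def text_or_none(card, tag_name, css_class):
    """Return the stripped text of the first matching tag, or None if it is absent."""
    tag = card.find(tag_name, css_class)
    return tag.text.strip() if tag is not None else None


soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')

rows = [
    {
        'name': text_or_none(card, 'div', 'KzDlHZ'),
        'rating': text_or_none(card, 'div', 'XQDdHH'),
        'price': text_or_none(card, 'div', 'Nx9bqj _4b5DiR'),
    }
    for card in soup.find_all('div', 'tUxRFH')
]

for row in rows:
    print(row)
```

Returning `None` for a missing element keeps one entry per card in every column, at the cost of having to handle `None` in later cleaning steps.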

In [2]:
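import numpy as np

# Rebuild the list on every run so re-executing the cell does not append duplicates
phones['discounts'] = []
for actual_text, price_text in zip(phones['actual_prices'], phones['prices']):
    # '₹8,999' -> 8999: drop the currency symbol and the thousands separators
    actual_price = np.int64(actual_text.split('₹')[1].replace(',', ''))
    price = np.int64(price_text.split('₹')[1].replace(',', ''))
    phones['discounts'].append(actual_price - price)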
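
The string-to-integer conversion above can be isolated in a small helper. This is a sketch only: `parse_rupees` is a hypothetical name, not part of the notebook, and it uses a regular expression rather than `split('₹')` so that a string without a price returns `None` instead of raising an error. The two sample values are taken from the first row of the output shown further down.

```python
import re


def parse_rupees(text):
    """Return the integer amount in a string such as '₹8,999', or None if it has no digits."""
    match = re.search(r'\d[\d,]*', text)
    if match is None:
        return None
    return int(match.group().replace(',', ''))


actual_price = parse_rupees('₹8,999')
price = parse_rupees('₹5,899')

print(actual_price - price)  # 3100
print(parse_rupees('Not available'))  # None
```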

In [None]:
print(len(phones['name']))
print(len(phones['ratings']))
print(len(phones['prices']))
print(len(phones['actual_prices']))
print(len(phones['processors']))
print(len(phones['discounts']))
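
The same check can be made to fail loudly instead of relying on a visual comparison of the printed numbers. Since `pd.DataFrame` raises a `ValueError` when the lists in a dict have different lengths, catching the mismatch first gives a clearer message. A minimal sketch, assuming a hypothetical helper named `check_equal_lengths` and made-up sample data:

```python
def check_equal_lengths(columns):
    """Return the length of each column, raising ValueError if they differ."""
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f'Column lengths differ: {lengths}')
    return lengths


# Hypothetical sample data standing in for the scraped `phones` dict
consistent = {'name': ['A', 'B'], 'prices': ['₹1,000', '₹2,000']}
print(check_equal_lengths(consistent))  # {'name': 2, 'prices': 2}

inconsistent = {'name': ['A', 'B'], 'prices': ['₹1,000']}
try:
    check_equal_lengths(inconsistent)
except ValueError as error:
    print(error)
```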



In [3]:
phones_df = pd.DataFrame(phones)
phones_df

,name,ratings,prices,actual_prices,discounts,processors
0,POCO C71 - Locked with Airtel Prepaid (Cool B...,3.9,"₹5,899","₹8,999",3100,4 GB RAM | 64 GB ROM | Expandable Upto 2 TB
1,vivo T4 Lite 5G Charger in the Box (Titanium G...,4.4,"₹10,999","₹14,999",4000,6 GB RAM | 128 GB ROM | Expandable Upto 2 TB
2,"vivo T4R 5G (Arctic White, 128 GB)",4.5,"₹19,499","₹23,499",4000,8 GB RAM | 128 GB ROM
3,"POCO C71 (Desert Gold, 128 GB)",4.1,"₹6,999","₹9,999",3000,6 GB RAM | 128 GB ROM | Expandable Upto 2 TB
4,"Motorola g45 5G (Brilliant Green, 128 GB)",4.3,"₹11,999","₹14,999",3000,8 GB RAM | 128 GB ROM | Expandable Upto 1 TB
...,...,...,...,...,...,...
283,"MOTOROLA G96 5G (Pantone Dresden Blue, 256 GB)",4.4,"₹19,999","₹22,999",3000,8 GB RAM | 256 GB ROM
284,"realme Narzo 80 Lite 5G (Crystal Purple, 128 GB)",4.6,"₹11,846","₹13,999",2153,4 GB RAM | 128 GB ROM
285,"realme Narzo 80 Lite 5G (Onyx Black, 128 GB)",4.6,"₹11,899","₹13,999",2100,4 GB RAM | 128 GB ROM
286,"realme P3 Ultra 5G (Orion Red, 256 GB)",4.4,"₹27,999","₹33,999",6000,8 GB RAM | 256 GB ROM


In [4]:
phones_df.to_csv('phones_data.csv', index=False)
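
The exported file keeps `ratings`, `prices`, and `actual_prices` as strings such as `₹5,899`, so they usually need converting before numerical analysis. The sketch below shows one possible conversion with pandas. `sample_df` is a hypothetical two-row stand-in built from the first two rows of the output above, and the conversion is an illustration rather than a step the notebook performs.

```python
import pandas as pd

# Hypothetical stand-in for phones_df, using values from the first two output rows
sample_df = pd.DataFrame({
    'ratings': ['3.9', '4.4'],
    'prices': ['₹5,899', '₹10,999'],
    'actual_prices': ['₹8,999', '₹14,999'],
})

numeric_df = sample_df.copy()

# Ratings become floats; anything unparseable becomes NaN instead of raising
numeric_df['ratings'] = pd.to_numeric(numeric_df['ratings'], errors='coerce')

# Strip every non-digit character (currency symbol, commas), then cast to integers
for column in ['prices', 'actual_prices']:
    numeric_df[column] = (
        numeric_df[column]
        .str.replace(r'[^\d]', '', regex=True)
        .astype('int64')
    )

print(numeric_df.dtypes)
```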