# Scraping Data 

### Two e-commerce sites: Amazon and Daraz
- The Sony DSC-W800 Digital Camera (Black) is the product common to both sites

### Selenium library used for web scraping from the two e-commerce websites above
- Selenium is a suitable choice for scraping e-commerce sites with dynamic content, interactive elements, or user authentication, so it is used here instead of BeautifulSoup

In [1]:
from selenium import webdriver
from selenium.webdriver.common.by import By
import pandas as pd

In [2]:
driver = webdriver.Chrome()

In [3]:
amazon_url = 'https://www.amazon.com/'
driver.get(amazon_url)
driver.maximize_window()

In [4]:
amazon_input_search = driver.find_element(By.ID,'twotabsearchtextbox')
amazon_input_search.clear()
amazon_search_btn = driver.find_element(By.XPATH,"(//input[@type='submit'])[1]")

In [5]:
amazon_input_search.send_keys("Sony DSC-W800 20.1 MP Digital Camera (Black)")
amazon_search_btn.click()

In [6]:
# click() returns None, so its result is not assigned to a variable
driver.find_element(By.XPATH, "(//*[@class='a-size-medium a-color-base a-text-normal'])[1]").click()
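
The click on the search button starts a page load, and `find_element` raises `NoSuchElementException` if the element has not rendered yet. Selenium ships `WebDriverWait` for this purpose; the sketch below only illustrates the underlying polling idea in plain Python. The helper name `wait_until` and its parameters are hypothetical and not part of Selenium.

```python
import time


def wait_until(action, timeout=10.0, interval=0.5, retry_on=(Exception,)):
    """Call action() repeatedly until it succeeds or the timeout expires.

    Hypothetical helper for illustration; not part of Selenium.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            return action()
        except retry_on:
            # Give up and re-raise the last error once the deadline has passed
            if time.monotonic() >= deadline:
                raise
            time.sleep(interval)


# Possible use with the driver from this notebook (not executed here):
# title = wait_until(lambda: driver.find_element(By.ID, "productTitle"))
```

Passing a narrower exception type through `retry_on` avoids hiding unrelated errors while polling.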

In [7]:
product_name_amazon  =  driver.find_element(By.XPATH , "//span[@id='productTitle']").text
product_price_amazon  = driver.find_element(By.XPATH , "//span[@class='a-price-whole']").text 
product_amazon_rating = driver.find_element(By.ID , 'acrCustomerReviewText').text

In [8]:
daraz_url = 'https://www.daraz.pk/'
driver.get(daraz_url)

In [9]:
daraz_input_search = driver.find_element(By.CLASS_NAME,'search-box__input--O34g')
daraz_input_search.clear()
daraz_search_btn = driver.find_element(By.CLASS_NAME,'search-box__button--1oH7')

In [10]:
daraz_input_search.send_keys("Sony DSC-W800 20.1 MP Digital Camera (Black)")
daraz_search_btn.click()

In [11]:
# click() returns None, so its result is not assigned to a variable
driver.find_element(By.ID, 'id-a-link').click()

In [12]:
product_name_daraz  =  driver.find_element(By.CLASS_NAME , "pdp-mod-product-badge-title").text
product_price_daraz  = driver.find_element(By.XPATH , "//span[@class=' pdp-price pdp-price_type_normal pdp-price_color_orange pdp-price_size_xl']").text
product_rating_daraz = driver.find_element(By.LINK_TEXT , 'No Ratings').text

In [13]:
print("Amazon Product Name:", product_name_amazon)
print("Amazon Product Price: $",product_price_amazon)
print("Amazon Product Rating:",product_amazon_rating)

print("Daraz Product Name:", product_name_daraz)
print("Daraz Product Price: RS",product_price_daraz)
print("Daraz Product Rating:", product_rating_daraz)

Amazon Product Name: Sony Cyber-Shot DSC-W800 Digital Camera (Black)
Amazon Product Price: $ 170
Amazon Product Rating: 220 ratings
Daraz Product Name: Sony DSC-W800 20.1 MP Digital Camera (Black)
Daraz Product Price: RS Rs. 25,000
Daraz Product Rating: No Ratings
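
The two rating strings above ("220 ratings" and "No Ratings") cannot be compared directly. A minimal sketch of turning them into integer counts follows; the function name `parse_rating_count` is hypothetical, and treating text without digits as zero ratings is an assumption.

```python
import re


def parse_rating_count(rating_text):
    """Return the number of ratings found in text such as '220 ratings'.

    Returns 0 when the text contains no digits (for example 'No Ratings').
    """
    match = re.search(r"\d[\d,]*", rating_text)
    if match is None:
        return 0
    # Drop thousands separators before converting, e.g. '1,234' -> 1234
    return int(match.group().replace(",", ""))


print(parse_rating_count("220 ratings"))  # 220
print(parse_rating_count("No Ratings"))   # 0
```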


### Script to Clean Data using Pandas
#### With only two records, data cleaning is not required here. With more data, we would:
- Remove duplicates from the DataFrame
- Replace empty values with NaN

The cleaning code for a dataset with hundreds of records is given at the end.

In [14]:
data = {
    'Website': ['Amazon', 'Daraz'],
    'Product Name': [product_name_amazon, product_name_daraz],
    'Product Price': [product_price_amazon, product_price_daraz],
    'Product Rating': [product_amazon_rating, product_rating_daraz],
}

In [15]:
df = pd.DataFrame(data)

In [16]:
df

|   | Website | Product Name | Product Price | Product Rating |
|---|---------|--------------|---------------|----------------|
| 0 | Amazon | Sony Cyber-Shot DSC-W800 Digital Camera (Black) | 170 | 220 ratings |
| 1 | Daraz | Sony DSC-W800 20.1 MP Digital Camera (Black) | Rs. 25,000 | No Ratings |


### Comparison of Prices and Recommending a Website
- Converting USD to PKR for the Amazon product
- Comparing both products by rating and price

In [17]:
# USD to PKR exchange rate (hard-coded)
usd_to_pkr_exchange_rate = 278.97
# The scraped price is a string, so convert it to float before multiplying
product_price_amazon = float(product_price_amazon)
amazon_price_pkr = product_price_amazon * usd_to_pkr_exchange_rate
# Print Amazon product details
print("Amazon Product Name:", product_name_amazon)
print("Amazon Product Price (USD): $", product_price_amazon)
print("Amazon Product Price (PKR): RS", amazon_price_pkr)

Amazon Product Name: Sony Cyber-Shot DSC-W800 Digital Camera (Black)
Amazon Product Price (USD): $ 170.0
Amazon Product Price (PKR): RS 47424.9


In [18]:
# The Daraz price is a string such as 'Rs. 25,000', so it is converted to int before comparison with the Amazon price
def convert_daraz_price(price_string):
    # Keep only the digit characters (drops 'Rs.', spaces, and commas), then convert to int
    cleaned_price = ''.join(filter(str.isdigit, price_string))
    return int(cleaned_price)

# Clean and convert Daraz prices
df.loc[df['Website'] == 'Daraz', 'Product Price'] = df.loc[df['Website'] == 'Daraz', 'Product Price'].apply(convert_daraz_price)
df

|   | Website | Product Name | Product Price | Product Rating |
|---|---------|--------------|---------------|----------------|
| 0 | Amazon | Sony Cyber-Shot DSC-W800 Digital Camera (Black) | 170 | 220 ratings |
| 1 | Daraz | Sony DSC-W800 20.1 MP Digital Camera (Black) | 25000 | No Ratings |
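
Keeping only the digit characters works for whole-number prices, but a price with a fractional part such as 'Rs. 25,000.50' would become 2500050. A hedged alternative that preserves the decimal point is sketched below; the function name `parse_price` is hypothetical, and the sample strings other than 'Rs. 25,000' and '170' are illustrative rather than scraped values.

```python
import re
from decimal import Decimal


def parse_price(price_text):
    """Extract a price such as 'Rs. 25,000' or '$170.99' as a Decimal."""
    # First run of digits, with optional thousands commas and decimal part
    match = re.search(r"\d[\d,]*(?:\.\d+)?", price_text)
    if match is None:
        raise ValueError(f"no numeric price found in {price_text!r}")
    return Decimal(match.group().replace(",", ""))


print(parse_price("Rs. 25,000"))     # 25000
print(parse_price("Rs. 25,000.50"))  # 25000.50
```

Raising `ValueError` on text without digits makes missing prices visible instead of silently turning them into a number.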


In [19]:
daraz_price_pkr = int(df.at[1, 'Product Price'])  # Convert to int

In [20]:
if amazon_price_pkr < daraz_price_pkr:
    print("Amazon is more favorable.")
elif daraz_price_pkr < amazon_price_pkr:
    print("Daraz is more favorable, as it is cheaper and costs less for delivery.")
else:
    print("Both websites offer the same price.")

Daraz is more favorable, as it is cheaper and costs less for delivery.
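
The same comparison can be wrapped in a function so it can be reused for other products. This is a minimal sketch: the name `recommend_site` and its return values are hypothetical, and it compares price only, ignoring ratings and delivery costs.

```python
def recommend_site(amazon_price_usd, daraz_price_pkr, usd_to_pkr_rate):
    """Compare the two listings in PKR and name the cheaper site."""
    amazon_price_pkr = amazon_price_usd * usd_to_pkr_rate
    if amazon_price_pkr < daraz_price_pkr:
        return "Amazon"
    if daraz_price_pkr < amazon_price_pkr:
        return "Daraz"
    return "Same price"


# Values taken from this notebook: $170 on Amazon, Rs. 25,000 on Daraz
print(recommend_site(170.0, 25000, 278.97))  # Daraz
```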


### Data Cleaning if There Were Hundreds of Products

In [21]:
# Handle missing values (if any)
df = df.fillna('N/A') 

In [22]:
# Remove duplicate entries if any 
df = df.drop_duplicates()

In [23]:
df

|   | Website | Product Name | Product Price | Product Rating |
|---|---------|--------------|---------------|----------------|
| 0 | Amazon | Sony Cyber-Shot DSC-W800 Digital Camera (Black) | 170 | 220 ratings |
| 1 | Daraz | Sony DSC-W800 20.1 MP Digital Camera (Black) | 25000 | No Ratings |
