# Lab 5: Web Scraping with Selenium

## Overview
In this lab, we will use Selenium to automate web scraping tasks. We will scrape product data from Amazon and AliExpress, preprocess and clean the data, perform exploratory data analysis, and develop a simple prediction algorithm.

## Key Topics
- **Selenium Overview**: Introduction to Selenium for browser automation.
- **Scraper Class**: Implementation of a `Scraper` class for web scraping.
- **Amazon Scraping**: Extract product details, descriptions, and reviews from Amazon.
- **AliExpress Scraping**: Extract product details from AliExpress.
- **Data Preprocessing**: Clean and preprocess the scraped data.
- **Data Merging**: Merge data from multiple sources.
- **Exploratory Data Analysis**: Analyze data distributions and relationships.
- **Prediction Algorithm**: Develop and apply a prediction algorithm for product sales.
- **Data Visualization**: Visualize the results of the prediction algorithm.
- **Interactive Data Exploration**: Use interactive elements to explore the data.

## Objectives
- Understand the basics of Selenium for web scraping.
- Implement a web scraper using the `Scraper` class.
- Scrape and preprocess data from Amazon and AliExpress.
- Perform exploratory data analysis and visualize the results.
- Develop a simple prediction algorithm and apply it to the data.
- Explore the data interactively.

## Steps
1. **Setup and Installations**: Install required packages.
2. **Import Libraries**: Import necessary libraries for web scraping and data analysis.
3. **Scraper Class Implementation**: Define and use the `Scraper` class.
4. **Amazon Data Scraping**: Scrape product details and reviews from Amazon.
5. **AliExpress Data Scraping**: Scrape product details from AliExpress.
6. **Data Preprocessing**: Clean and preprocess the scraped data.
7. **Data Merging**: Merge and standardize data from multiple sources.
8. **Exploratory Data Analysis**: Perform and visualize exploratory data analysis.
9. **Prediction Algorithm**: Develop and apply a prediction algorithm.
10. **Interactive Exploration**: Use interactive widgets to explore the data.

By the end of this lab, you will have a comprehensive understanding of web scraping with Selenium and how to preprocess, analyze, and visualize the scraped data.



### Install required packages

In [None]:
%pip install pandas matplotlib numpy beautifulsoup4 selenium

### Import Required Libraries

In [None]:
# Import Required Libraries
import pandas as pd
import numpy as np

from bs4 import BeautifulSoup
import matplotlib.pyplot as plt

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from information_types import Product
# Ensure inline plotting for Jupyter Notebook
%matplotlib inline

### Overview of Selenium

Selenium is a powerful tool for controlling web browsers through programs and performing browser automation. It is primarily used for testing web applications but can also be used for web scraping and automating repetitive web-based tasks.

#### Key Features of Selenium:
- **Cross-Browser Support**: Selenium supports multiple browsers like Chrome, Firefox, Safari, and Edge.
- **Automation**: It can automate web interactions such as clicking buttons, filling forms, and navigating through pages.
- **Dynamic Content Handling**: Selenium can handle dynamic content that is loaded via JavaScript, which is not possible with static HTML parsers.
- **Browser Simulation**: Selenium drives a real browser, so pages are requested and rendered the way they would be for a human visitor. This can make the scraper less likely to be blocked by simple bot checks, although websites can still detect automated browsers.

The last two points are the main reason we use Selenium in this notebook. BeautifulSoup and Requests are great for static web pages, but they cannot handle dynamic content that is loaded via JavaScript. In addition, a website may detect that requests come from a bot and block them. Selenium helps with both limitations by driving a real browser the way a user would.


## Why Not Just Use BeautifulSoup with Requests?

While BeautifulSoup and Requests are excellent tools for web scraping, they have limitations compared to Selenium:

- **Static Content**: BeautifulSoup and Requests can only scrape static content that is available in the initial HTML response. They cannot handle dynamic content loaded via JavaScript.
- **JavaScript Execution**: Selenium can execute JavaScript, allowing it to interact with elements that are dynamically generated or modified after the initial page load.
- **Complex Interactions**: Selenium can perform complex interactions like clicking buttons, filling forms, and navigating through multiple pages, which is challenging with just BeautifulSoup and Requests.
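
To make the static-content limitation concrete, here is a minimal sketch that parses a small, made-up HTML page with BeautifulSoup. The page and its `reviews` list are hypothetical; the point is only that content a script would insert in a browser never appears when the raw HTML is parsed.

```python
from bs4 import BeautifulSoup

# Hypothetical page: the review list is empty in the HTML and would only be
# filled in by JavaScript once a real browser runs the script.
RAW_HTML = """
<html><body>
  <h1 id="title">Laptop</h1>
  <ul id="reviews"></ul>
  <script>
    document.getElementById("reviews").innerHTML = "<li>Great laptop</li>";
  </script>
</body></html>
"""

soup = BeautifulSoup(RAW_HTML, "html.parser")

# Static content is available directly from the HTML
title = soup.find("h1", {"id": "title"}).text

# The script is never executed by the parser, so the list stays empty
review_items = soup.find("ul", {"id": "reviews"}).find_all("li")

print(title)
print(len(review_items))
```

With Selenium, the browser executes the script first, so `driver.page_source` would contain the list item that the static parse above cannot see.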

### Scraper class
The `Scraper` class is designed to facilitate web scraping using Selenium with Chrome WebDriver. 

#### Key Features:
- **Headless Mode**: Allows the scraper to run without opening a browser window, which is useful for running scripts on servers or in the background.
- **Image Loading Control**: Option to disable image loading for faster scraping.
- **Customizable Options**: Additional Chrome options can be set for the WebDriver.
- **Window Size**: Sets the window size for the browser.

#### Initialization Parameters:
- `headless` (bool): If `True`, runs the scraper in headless mode.
- `load_images` (bool): If `False`, disables image loading.
- `options` (Options): Chrome options for the WebDriver.
- `window_size` (tuple): Sets the window size for the browser.

In [None]:
class Scraper_show:
    def __init__(self, headless = True,
        load_images = False, # for faster scraping we can turn off image loading
        options = None, # Chrome options; a fresh Options() is created when None
        window_size = (700,900)):

        # create the options per instance: an Options() default in the signature
        # would be shared (and modified) by every scraper that is created
        if options is None:
            options = Options()

        if headless:
            # if headless is True, we can run the scraper 
            # without opening a browser
            options.add_argument("--headless")
            options.add_argument("--disable-gpu")
            
        # the value 2 tells Chrome to block images
        prefs = {"profile.managed_default_content_settings.images": 2}
        if not load_images:
            options.add_experimental_option("prefs", prefs)
        # adding some more options to the web surfer
        options.add_argument("--no-sandbox")
        options.add_argument("--disable-dev-shm-usage")
        # creating the websurfer using chrome
        self.driver = webdriver.Chrome(options=options)
        self.driver.set_window_size(*window_size)

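The `Scraper_show` class above shows how the scraper is set up. The cells below use the `Scraper` class imported from `information_types`. Here is an example of how to use it to open a website: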


In [None]:
from information_types import Scraper
scraper = Scraper(headless=False,load_images=True) # creating the scraper object

In [None]:
scraper.driver.get("https://www.english-hatter.nl") # opening the website

### Closing the browser

The command below closes the browser. Because we still need a scraper instance to collect data in the following sections, we create a new one right after closing it.


In [None]:
scraper.driver.quit() # closing the browser

In [None]:
scraper = Scraper(headless=False,load_images=True)

### Use Case: Amazon

We want to collect the products that Amazon shows for a given search term. We will use the search term `"laptop"` and scrape the first page of the search results. The code below extracts the product name, price, and product URL for each result on the page.

In [None]:

# go to the url containing the search results
scraper.driver.get("https://www.amazon.nl/s?k=laptop&language=en_GB")

# get the html data in a format that we can use easily
html_data = BeautifulSoup(scraper.driver.page_source, 'html.parser')

# Based on the structure of the html data, we can find the products
products = html_data.find_all('div', {'data-component-type': 's-search-result'})

# loop through the products and get the required information
for product in products:
    # the title is in a span tag with the class a-text-normal
    title_element = product.find('span', {'class': 'a-text-normal'})
    print(title_element.text if title_element else 'Title not available')
    # get all the elements that are related to the price
    price_symbol_element = product.find('span', {'class': 'a-price-symbol'})
    # fall back to an empty string so the concatenation below cannot fail
    price_symbol = price_symbol_element.text if price_symbol_element else ''
    price = product.find('span', {'class': 'a-price-whole'})
    price_decimal = product.find('span', {'class': 'a-price-fraction'})
    if price and price_decimal:
        full_price = price.text.replace(',', '') + price_decimal.text
        print(full_price + price_symbol)
    else:
        print('Price not available')
    # the link to the product page; not every result is guaranteed to have one
    link_element = product.find("a", {"class": "a-link-normal s-no-outline"})
    print(link_element['href'] if link_element else 'URL not available')
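
The cell above builds the price by removing commas and appending the fraction, which assumes one particular number format. As a hedged alternative, the sketch below parses a price string that may use either a comma or a dot as the decimal separator. The `parse_price` helper is hypothetical, and its rule (a final separator followed by one or two digits is the decimal separator) is an assumption that fits typical two-decimal prices, not a general solution.

```python
import re
from typing import Optional


def parse_price(text: str) -> Optional[float]:
    """Turn a price string such as '€ 1.299,00' into a float, or None."""
    # keep only digits and the two possible separators
    cleaned = re.sub(r"[^\d.,]", "", text)
    if not re.search(r"\d", cleaned):
        return None

    # position of the last separator, or -1 if there is none
    last_sep = max(cleaned.rfind(","), cleaned.rfind("."))
    digits_after = len(cleaned) - last_sep - 1

    if last_sep != -1 and digits_after in (1, 2):
        # assumption: one or two trailing digits are the decimal part
        whole, fraction = cleaned[:last_sep], cleaned[last_sep + 1:]
    else:
        # otherwise every separator is treated as a thousands separator
        whole, fraction = cleaned, ""

    whole = re.sub(r"[.,]", "", whole)
    return float(f"{whole or '0'}.{fraction or '0'}")


for raw in ["1,299.00", "1.299,00", "€ 12,50", "Price not available"]:
    print(raw, "->", parse_price(raw))
```

Returning `None` instead of a placeholder string keeps missing prices easy to filter out later in pandas.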

The link is the most important piece of information, because it lets us open the product page and scrape additional details such as the product description, specifications, and reviews.
Next, we focus on extracting that information from the product page.
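
Links found in search results are often relative to the site and carry long tracking parameters. The sketch below shows one way to turn such a link into a short absolute URL before visiting it. The `to_product_url` helper is hypothetical, and it assumes that product links contain `/dp/` followed by a 10-character product ID, as in the product URL used in this notebook, and that the shortened form opens the same page.

```python
import re
from urllib.parse import urljoin

BASE_URL = "https://www.amazon.nl"


def to_product_url(href: str, base_url: str = BASE_URL) -> str:
    """Make a link absolute and, if possible, reduce it to /dp/<product id>."""
    # urljoin leaves absolute links alone and prefixes relative ones
    absolute = urljoin(base_url, href)

    # assumption: the product ID is the 10 characters that follow /dp/
    match = re.search(r"/dp/([A-Z0-9]{10})", absolute)
    if match:
        return f"{base_url}/dp/{match.group(1)}"

    # no recognizable product ID: keep the absolute link unchanged
    return absolute


print(to_product_url("/Aqara-sensor/dp/B0BXWZMQJ3/ref=sr_1_1?keywords=sensors"))
print(to_product_url("/gp/help/customer"))
```

Shorter URLs are also easier to store and to compare when removing duplicate products later on.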

### Use Case: Amazon Product Page

In [None]:
scraper.driver.get("https://www.amazon.nl/Aqara-Aanwezigheids-zonepositionering-multi-persoon-val-detectie/dp/B0BXWZMQJ3/ref=sr_1_1_sspa?__mk_nl_NL=%C3%85M%C3%85%C5%BD%C3%95%C3%91&crid=21JVASD4BUSV7&dib=eyJ2IjoiMSJ9.NPyacmd0m85ZeCNArVPM9Bon6Y8TLrSQcO9BdCAO0vDsR2arYYQrWnOVgQK5_U968FCJnBVFq6IaJqXyvgpQNZgCCpdhZVI68wwMkuBDdTN7BE4mdQelc53TzCAMMio3UkQKa-x_IZ6i39k3zpA_IGomSlEcAiDewcidR7y8WKv-vdhlCqK_mT3z7aob5B8-pwYrwiAreNvD-WpE_TWeRcBRUT2bgDrGfoWt2PIe-ADqIag3e2do1CRWBanpEE9wAGnVNMfaVC1dzKbcaHCnlWaoJ3ccRbgMqwqsylaB3uc.KnvdkEs2tzfp7_Z-JoAal-rD1uenvIzuky1B7BtI12I&dib_tag=se&keywords=sensors&qid=1732036910&sprefix=sensors%2Caps%2C97&sr=8-1-spons&sp_csd=d2lkZ2V0TmFtZT1zcF9hdGY&psc=1&language=en_GB")

In [None]:

# once again we get the html data in a format that we can use easily
page_data = BeautifulSoup(scraper.driver.page_source, 'html.parser')

# get the elements containing the price information
price_element = page_data.find("span", {"class": "a-price-whole"})
price_decimal_element = page_data.find("span", {"class": "a-price-fraction"})
price_symbol_element = page_data.find("span", {"class": "a-price-symbol"})

# if the elements are found, we can get the price information
if price_element and price_decimal_element and price_symbol_element:
    price = price_element.text.replace(',', '')
    price_decimal = price_decimal_element.text
    full_price = price + price_decimal
    price_symbol = price_symbol_element.text
    print(full_price + price_symbol)
else:
    print("Price information not found")

# get the product name
product_name_element = page_data.find("span", {"id": "productTitle"})
if product_name_element:
    print(product_name_element.text.strip())
else:
    print("Product name not found")

# get the product description (the bullet points on the product page)
labels_element = page_data.find("ul", {"class": "a-unordered-list a-vertical a-spacing-mini"})
labels = []

# if the labels are found, we can get the product description
if labels_element:
    list_items = labels_element.find_all("li", {"class": "a-spacing-mini"})
    for item in list_items:
        label_element = item.find("span", {"class": "a-list-item"})
        if label_element:
            labels.append(label_element.text.strip())
    print(labels)
else:
    print("Labels not found")

# get the overall rating from the title attribute of the rating element
rating = None  # stays None when no rating can be read from the page
rating_element = page_data.select_one("#acrPopover")

if rating_element:
    text_rating = rating_element.get("title")
    if isinstance(text_rating, str):
        # the first word of the title is expected to be the score
        rating = float(text_rating.split(" ")[0].replace(",", "."))

print(rating)
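
The rating is read from a text label, so the conversion depends on how that label is written. The sketch below is a slightly more defensive version of the same idea: it takes the first number in the text and accepts both a comma and a dot as the decimal separator. The `parse_rating` helper and the sample labels are hypothetical.

```python
import re
from typing import Optional


def parse_rating(text: Optional[str]) -> Optional[float]:
    """Return the first number found in a rating label, or None."""
    if not text:
        return None
    # first integer or decimal number, with either ',' or '.' as separator
    match = re.search(r"\d+(?:[.,]\d+)?", text)
    if match is None:
        return None
    return float(match.group(0).replace(",", "."))


for label in ["4.5 out of 5 stars", "4,5 van 5 sterren", "No rating yet", None]:
    print(repr(label), "->", parse_rating(label))
```

Unlike splitting on the first space, this does not raise an error when the label starts with a word instead of a number.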

### Amazon reviews

We want to see what people are saying about the product, so we scrape its reviews. For each review, the code below extracts the following fields:
- title
- rating
- review text

In [None]:
# from the page we were on we can search for the results
reviews = page_data.find_all("div", {"data-hook": "review"})

# loop through the reviews and get the required information
for review in reviews:
    title_element = review.find("a", {"data-hook": "review-title"})
    if title_element:
        title = title_element.text.strip().split('\n', 1)[-1]
    else:
        title_element = review.find("span", {"class": "cr-original-review-content"})
        title = title_element.text.strip().split('\n', 1)[-1] if title_element else "Title not found"
    
    review_text_element = review.find("span", {"data-hook": "review-body"})
    if review_text_element:
        review_text = review_text_element.text.strip().replace('Read more', '').strip()
    else:
        review_text_element = review.find("span", {"class": "cr-original-review-content"})
        review_text = review_text_element.text.strip().replace('Read more', '').strip() if review_text_element else "Review text not found"
    
    rating_element = review.find("i", {"data-hook": "review-star-rating"})
    if not rating_element:
        rating_element = review.find("span", {"class": "a-icon-alt"})
    if rating_element:
        rating_text = rating_element.text.strip()
        rating_value = float(rating_text.split()[0].replace(',', '.'))
    else:
        rating_value = 0.0
    
    print(f"Title: {title}")
    print(f"Rating: {rating_value}")
    print(f"Review: {review_text}")

### AliExpress Scraping

I have created functions to scrape the product details from AliExpress. They work similarly to the Amazon functions; the only differences are the class names and the structure of the webpage.

In [None]:
from get_product_data_AEx import get_product_page_data_AE, get_product_urls_AE

# Collect the product URLs for the search term

product_to_search = "Butt plug"
scraper = Scraper(headless=False, load_images=True)

link_list: list[str] = get_product_urls_AE(scraper=scraper, search_query=product_to_search, max_page_number=1)

# Initialize an empty list to store products
products:list[Product] = []

# Loop over the products and get the required information
for url in link_list:
    product = get_product_page_data_AE(scraper, url)
    products.append(product)
    print(product)

Product.save_product_data(products, "aliexpress_products.csv")
print("Data saved to aliexpress_products.csv")

Now that we have the AliExpress data, we also want the Amazon data. The code from the previous examples has been extracted into functions, which we now use to scrape the Amazon data for a specific search term.

### Scrape all Amazon Data

In [None]:
from get_product_data_az import get_product_data_az, get_product_urls_az
product_to_search = "Butt plug"
all_products: list[Product] = []
product_urls: list[str] = get_product_urls_az(scraper=scraper, search_query=product_to_search, max_page_number=1)
for url in product_urls:
    product_data = get_product_data_az(scraper=scraper, url=url)
    all_products.append(product_data)

Product.save_product_data(all_products, "amazon_products.csv")
scraper.driver.quit()

### Preprocess and Clean Data

Before we can do anything useful with the data, we need to preprocess and clean it. This involves tasks like removing duplicates, handling missing values, and converting data to the correct format. 

In [None]:
import ast

# Preprocess and Clean Data
ae_df = pd.read_csv("aliexpress_products.csv")
# astype(str) keeps this working even if pandas already read the column as numbers
ae_df["price"] = ae_df["price"].astype(str).str.replace("€", "").str.replace(",", ".").astype(float)

# Convert the string representation of the list back to a list
def safe_literal_eval(val):
    try:
        return ast.literal_eval(val)
    except (ValueError, SyntaxError):
        return []

ae_df["about_product"] = ae_df["about_product"].apply(lambda x: safe_literal_eval(x) if isinstance(x, str) else x)

# Instead of using a list of about_product, we can use a single string
# (missing values and anything that is not a list become an empty string)
ae_df["about_product"] = ae_df["about_product"].apply(lambda x: " ".join(map(str, x)) if isinstance(x, list) else "")
print(ae_df.head())

# Preprocess and Clean Data for amazon data
az_df = pd.read_csv("amazon_products.csv")
az_df["price"] = az_df["price"].astype(str).str.replace("€", "").str.replace(",", ".").astype(float)
print(az_df.head())

full_df = pd.concat([ae_df, az_df], ignore_index=True)

# Handle missing reviews and make sure the column contains strings
full_df["reviews"] = full_df["reviews"].fillna("").astype(str)

# Remove newline characters from reviews
full_df["reviews"] = full_df["reviews"].str.replace("\n", " ")

# Remove emojis and all other non-ASCII characters from reviews
# (note that this also removes accented letters such as é)
full_df["reviews"] = full_df["reviews"].str.encode('ascii', 'ignore').str.decode('ascii')

# Fill the remaining missing values in non-numeric columns only, so that
# numeric columns such as price keep their numeric type for the analysis
numeric_columns = full_df.select_dtypes(include="number").columns
text_columns = full_df.columns.difference(numeric_columns)
full_df[text_columns] = full_df[text_columns].fillna('')

print(full_df.head())
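
The `ast.literal_eval` step above is needed because a CSV file stores every cell as text. The sketch below reproduces the round trip with the standard library only. It assumes the list was written in its Python string form, which is a guess about how `Product.save_product_data` stores `about_product`; the sample features are made up.

```python
import ast
import csv
import io

features = ["16 GB RAM", "512 GB SSD"]

# Write one row to an in-memory CSV; the list is stored as str(features)
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["about_product"])
writer.writerow([features])

# Read it back: the cell is now plain text that only looks like a list
buffer.seek(0)
row = next(csv.DictReader(buffer))
raw_value = row["about_product"]
print(type(raw_value).__name__, raw_value)

# literal_eval safely turns that text back into a real list
restored = ast.literal_eval(raw_value)
print(" ".join(restored))
```

`ast.literal_eval` only accepts Python literals, so it is a safer choice here than `eval`, which would execute arbitrary code found in the scraped text.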


### Exploratory Data Analysis
Perform exploratory analysis to understand data distributions and relationships.

In [None]:
# Exploratory Data Analysis

# Summary statistics for numerical columns
# (print is needed because only the last expression of a cell is displayed)
print(full_df.describe())

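# Distribution of product prices
# (coerce to numbers and drop missing values so the histogram can be drawn)
prices = pd.to_numeric(full_df['price'], errors='coerce').dropna()
plt.figure(figsize=(10, 6))
plt.hist(prices, bins=30, edgecolor='k', alpha=0.7)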
plt.title('Distribution of Product Prices')
plt.xlabel('Price')
plt.ylabel('Frequency')
plt.show()


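# Distribution of review text length
# The reviews column holds the review text, not a count, so we use the number
# of characters in that text as a rough measure of how much feedback a product has
review_length = full_df['reviews'].astype(str).str.len()
plt.figure(figsize=(10, 6))
plt.hist(review_length, bins=30, edgecolor='k', alpha=0.7)
plt.title('Distribution of Review Text Length')
plt.xlabel('Review Text Length (characters)')
plt.ylabel('Frequency')
plt.show()

# Scatter plot of price vs review text length
plt.figure(figsize=(10, 6))
plt.scatter(pd.to_numeric(full_df['price'], errors='coerce'), review_length, alpha=0.7)
plt.title('Price vs Review Text Length')
plt.xlabel('Price')
plt.ylabel('Review Text Length (characters)')
plt.show()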
