# Web Scraping for Data Science Course

### Made by: Andrija Botica, Daria Milić, Karlo Nevešćanin
### Professor: dr. sc. Toni Perković

## Introduction

Our idea was to use web scraping to gather data from Croatian stores.

The data can then be used to find the cheapest price for a given product.

The data will also be used in the Human-Computer Interaction course, where we will implement a full-stack web application.

## Getting Started

We used Python, BeautifulSoup and Selenium to scrape data from Croatian stores.

We scraped data from the following stores:
- Konzum - Webshop
- Ribola - Wolt
- Studenac - Wolt
- Tommy - Wolt


To start scraping we need to install the following libraries:
- requests
- bs4
- lxml

In [None]:
pip install requests bs4 lxml

Now we need to import them in our code (`lxml` does not need to be imported; BeautifulSoup uses it as the parser when we pass `"lxml"`).

In [2]:
import requests
from bs4 import BeautifulSoup

The Konzum webshop groups its products into categories, and every category has its own URL, so we scrape them one by one.

We define a `categories` list that will hold the category link elements found on the home page.
From it we build two more lists: `categories_links`, with the relative link of each category, and `categories_urls`, with the base URL prepended to each link.

In [3]:
# Konzum webshop
URL = "https://www.konzum.hr"

# Variables
categories = []
categories_links = []
categories_urls = []

Now let's request the Konzum webshop home page and extract the categories.

In [4]:
r = requests.get(URL)
soup = BeautifulSoup(r.content, "lxml")
table = soup.find("section", attrs={"class": "py-3"})
categories = table.find_all("a", attrs={"class": "category-box__link"})

We build the URL of each category with two `for` loops.

In [5]:
for category in categories:
    categories_links.append(category['href'])

for links in categories_links:
    categories_urls.append(URL + links)
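
Plain string concatenation works as long as every `href` is a root-relative path. A slightly more defensive variant, sketched below, uses `urllib.parse.urljoin` from the standard library, which also copes with links that are already absolute. The helper name `build_category_urls` is an illustrative assumption and not part of our original scraper.

```python
from urllib.parse import urljoin


def build_category_urls(base_url, hrefs):
    """Join each href with the base URL, skipping empty values and duplicates."""
    urls = []
    for href in hrefs:
        if not href:
            continue  # some <a> tags may have no href at all
        full_url = urljoin(base_url, href)
        if full_url not in urls:
            urls.append(full_url)
    return urls
```

With the variables defined earlier, this could be called as `categories_urls = build_category_urls(URL, categories_links)`.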

To keep count of the scraped products we use the `total_products` variable.

In [6]:
total_products = 0

Now we can scrape the data from the categories.

For each category URL in `categories_urls` we visit its subcategories page by page and extract the product name, price and image URL, together with the name of the store.

In [None]:
for category_url in categories_urls:
    page = requests.get(category_url).text
    pageSoup = BeautifulSoup(page, 'lxml')
    subCategories = pageSoup.find('ul', class_='plain-list mb-3')
    # Skip categories without a subcategory list instead of failing on None
    if subCategories is None:
        continue

    for aTag in subCategories.find_all('a'):
        subCategoryLink = aTag.get('href')
        if not subCategoryLink:
            continue
        subCategoryURL = URL + subCategoryLink

        # Request page after page until one fails or contains no products
        page_number = 1
        while True:
            finalPage = requests.get(f'{subCategoryURL}?page={page_number}')
            if finalPage.status_code != 200:
                break
            finalSoup = BeautifulSoup(finalPage.text, 'lxml')
            allItems = finalSoup.find('div', class_='col-12 col-md-12 col-lg-10')
            if allItems is None:
                break

            productsList = allItems.find('div', class_='product-list product-list--md-5 js-product-layout-container product-list--grid')
            if productsList is None:
                break

            articles = productsList.find_all('article', class_='product-item product-default')
            if not articles:
                break

            for article in articles:
                imageTag = article.find('img')
                articleTitleTag = article.find('h4', class_='product-default__title')
                articleNameTag = articleTitleTag.find('a', class_='link-to-product') if articleTitleTag else None
                # The price is split across two spans: whole part and fractional part
                articleEuroTag = article.find('span', class_='price--kn')
                articleCentTag = article.find('span', class_='price--li')
                # Skip products that are missing any of the expected tags
                if None in (imageTag, articleNameTag, articleEuroTag, articleCentTag):
                    continue

                articleImageURL = imageTag.get('src')
                articleName = articleNameTag.get_text(strip=True)
                if not (articleImageURL and articleName):
                    continue

                total_products += 1
                print(f'Name: {articleName}', f'Price: {articleEuroTag.text}.{articleCentTag.text}', f'Image URL: {articleImageURL}', 'Store: Konzum')
            page_number += 1

print(f'Total number of products: {total_products}')
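
The pagination idea used above (request `?page=N` until a page fails or comes back without products) can be separated from the HTML parsing. The generator below is a minimal sketch of that idea. `fetch_page` is a hypothetical callable that we would have to write around `requests` and BeautifulSoup, and `max_pages` is an assumed safety cap that the original loop does not have.

```python
def iter_pages(fetch_page, start=1, max_pages=500):
    """Yield (page_number, items) until a page comes back empty.

    fetch_page is a hypothetical callable: it takes a page number and returns
    a list of items, or an empty list / None when nothing is left.
    max_pages is an assumed cap that protects against endless loops.
    """
    page_number = start
    while page_number < start + max_pages:
        items = fetch_page(page_number)
        if not items:
            break  # empty page: we are past the last one
        yield page_number, items
        page_number += 1
```

Keeping the stop condition in one place makes it easier to reuse the same loop for every subcategory.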

We wrote this code after inspecting the HTML of the website: we simply tell BeautifulSoup which tags contain the data we need.
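
The loop above prints the price as two text fragments joined with a dot. If the price is needed as a number for later comparison, the two fragments could be combined as in the sketch below. The function name `combine_price` is our own illustrative choice, and the sketch assumes that both fragments contain only digits plus optional whitespace.

```python
from decimal import Decimal

DIGITS = "0123456789"


def combine_price(whole_text, fraction_text):
    """Combine the whole and fractional parts of a price into a Decimal.

    Returns None when either part contains no digits.
    """
    whole = "".join(ch for ch in whole_text if ch in DIGITS)
    fraction = "".join(ch for ch in fraction_text if ch in DIGITS)
    if not whole or not fraction:
        return None
    return Decimal(f"{whole}.{fraction}")
```

`Decimal` is used instead of `float` so that prices are stored exactly as shown on the page.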

## Selenium

This was just one scraping example. We also scraped other stores, some of which required `Selenium` because their content is loaded dynamically.

The following stores required `Selenium`:
- Ribola - Wolt
- Studenac - Wolt
- Tommy - Wolt


Our scraping code is available at: https://github.com/abotica/data-science-scraping

To start scraping we need to install the following libraries:
- selenium (used for scraping)
- pandas (used for creating dataset)

In [None]:
pip install selenium pandas

Now we need to import them in our code.

In [None]:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time
import pandas as pd
import os
import re

In this example, we will scrape **Studenac - Wolt**.

We start by opening the Wolt Studenac page with Selenium and dismissing any pop-ups that may appear.

In [None]:
WOLT_URL = "https://wolt.com/hr/hrv/split/venue/studenac-kralja-zvonimira-t300/items/"
URL = "https://wolt.com/hr/hrv/split/venue/studenac-kralja-zvonimira-t300"
Store_name = "Wolt-Studenac"
categories_links = []

driver = webdriver.Chrome()
driver.get(URL)

try:
    consent_button = WebDriverWait(driver, 10).until(
        EC.element_to_be_clickable((By.CSS_SELECTOR, "div.deorxlo button.cbc_TextButton_rootCss_7cfd4"))
    )
    consent_button.click()
    time.sleep(1)
except Exception as e:
    print(f"Consent button not found or not clickable: {e}")

try:
    x_button = WebDriverWait(driver, 10).until(
        EC.element_to_be_clickable((By.CSS_SELECTOR, "button[class='cbc_IconButton_root_7cfd4 c1hmjr97']"))
    )
    x_button.click()
    time.sleep(1) 
except Exception as e:
    print(f"X button not found or not clickable: {e}")

As with Konzum, each store has several product categories, and almost every category has a few subcategories.

Every subcategory has its own URL, so we need to scrape them one by one.

The following cells show how to get the URL of each subcategory, which is then passed to the `scrape_products()` function.

First, we extract the URL of each category on the page and store it in `categories_links`.

Next, we filter these links to remove any irrelevant ones.

In [None]:
subMenu = driver.find_element(By.CSS_SELECTOR, "div[data-test-id='navigation-bar']")
head_nav_aTag = driver.find_elements(By.CSS_SELECTOR, "a[data-test-id='navigation-bar-link']")
if head_nav_aTag:
    for head_nav in head_nav_aTag:
        categories_links.append(head_nav.get_attribute('href'))

# Keep only links to item pages; get_attribute() can return None, so guard against it
filtered_categories_urls = [url for url in categories_links if url and WOLT_URL in url]
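
The filter above keeps any link that contains the items URL. A slightly stricter variant is sketched below: it keeps only links that start with the given prefix and also drops duplicates while preserving order. The name `filter_links` is an illustrative assumption, and switching from a substring check to `startswith` is a deliberate change from our original code.

```python
def filter_links(links, prefix):
    """Keep links that start with prefix, dropping None values and duplicates."""
    seen = set()
    kept = []
    for link in links:
        if link and link.startswith(prefix) and link not in seen:
            seen.add(link)
            kept.append(link)
    return kept
```

It could be used as `filtered_categories_urls = filter_links(categories_links, WOLT_URL)`.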

For each category URL, we attempt to find subcategories.

If a category has no subcategories, the category URL will be forwarded to the `scrape_products()` function.

If subcategories are present, we extract the URLs of each subcategory and call the `scrape_products()` function for each subcategory.

The `scrape_products()` function is defined in a later cell.

In [None]:
if filtered_categories_urls:
    for filtered_category_url in filtered_categories_urls:
        page_url = filtered_category_url.replace("https://wolt.com", "")
        a_Tag = driver.find_element(By.XPATH, f'//a[contains(@href, "{page_url}")]')
        a_Tag_parent = a_Tag.find_element(By.XPATH, "./ancestor::div[1]")
        a_Tag.click()
        time.sleep(1)
        temporary = a_Tag_parent.find_element(By.XPATH, "./ancestor::div[1]")
        if 'a1qapeeb rljt8w0' in temporary.get_attribute('class'):
            subpage_divs = a_Tag_parent.find_elements(By.XPATH, "./following-sibling::div[1]//a[@data-test-id='navigation-bar-link']")
            if subpage_divs:
                for subpage_div in subpage_divs:
                    subpage_url = subpage_div.get_attribute('href')
                    subcategory_name = subpage_div.find_element(By.XPATH, "./div[@data-test-id='NavigationListItem-title']")
                    # scrape_products(subpage_url, subcategory_name.text, Store_name)
            else:
                print("Error with getting a tags")
        else:
            category_name =  a_Tag.find_element(By.XPATH, "./div[@data-test-id='NavigationListItem-title']")
            # scrape_products(filtered_category_url, category_name.text, Store_name)
            print("Else")

        driver.execute_script("arguments[0].scrollIntoView({ block: 'start', inline: 'nearest'});", a_Tag_parent)
        time.sleep(1)

Now we can scrape the data from the (sub)categories using the `scrape_products()` function.

First, we handle any pop-ups that may appear.

Then, we scrape the data (product name, product price, product image URL). Because the products are loaded dynamically, we need to scroll in order to load all of them.

After each pass, the script scrolls to the last product scraped so far, which triggers loading of the products that follow it. It repeats this until a pass finds no new products, which means all products on the page have been scraped.
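
The scroll-until-nothing-new loop can be sketched independently of Selenium. In the sketch below, `read_visible` and `scroll_to` are hypothetical callables that stand in for the `find_elements` and `scrollIntoView` calls, and `max_rounds` is an assumed cap that our original function does not have.

```python
def collect_until_stable(read_visible, scroll_to, max_rounds=200):
    """Repeatedly read visible items and scroll until nothing new appears.

    read_visible() returns the names currently present in the page;
    scroll_to(name) scrolls that item into view. Both are hypothetical
    stand-ins for the browser calls. max_rounds is an assumed safety cap.
    """
    collected = []
    seen = set()
    for _ in range(max_rounds):
        before = len(collected)
        for name in read_visible():
            if name not in seen:
                seen.add(name)
                collected.append(name)
        if len(collected) == before:
            break  # a full pass found nothing new, so the page is exhausted
        scroll_to(collected[-1])
    return collected
```

One consequence of tracking products by name is that two different products with the same name are stored only once.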


Once all products have been scraped, the data is stored in a CSV file. This CSV file will later be used for analysis.

The data is appended to the CSV file if it already exists, ensuring that all scraped data from different categories and stores is consolidated into a single file. This approach allows for easy data manipulation and analysis in subsequent steps.

In [None]:
def scrape_products(url, cat_name, Store_Name):

    driver = webdriver.Chrome()
    driver.get(url)
    product_list = []
    previous_length = 0
    visited_pages = set()

    print(f'Web page: {url}')
    try:
        consent_button = WebDriverWait(driver, 10).until(
            EC.element_to_be_clickable((By.CSS_SELECTOR, "div.deorxlo button.cbc_TextButton_rootCss_7cfd4"))
        )
        consent_button.click()
        time.sleep(1)
    except Exception as e:
        print(f"Consent button not found or not clickable: {e}")

    while True:
        initial_html = driver.page_source
        names_h3 = driver.find_elements(By.CSS_SELECTOR, "[data-test-id='ImageCentricProductCard.Title']")
        seen_products = set([product[0] for product in product_list])
        for name_div in names_h3:
            name = name_div.text.strip()
            if name not in seen_products:
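                # Defaults, so a failed lookup never reuses data from a previous product
                price = "Price not found"
                image_url = ""
                try:
                    # The name is interpolated into a double-quoted XPath string,
                    # so a name that itself contains a double quote will not match
                    name_xpath = f'//h3[contains(normalize-space(text()), "{name}")]'
                    price_div = driver.find_element(By.XPATH, name_xpath + '/../preceding-sibling::div[1]/span')
                    price = price_div.text.strip()
                    image_url = price_div.find_element(By.XPATH, "./../../preceding-sibling::div[1]/span/img").get_attribute('src')
                except Exception as e:
                    print(f"Price or image not found for {name}: {e}")
                product_list.append((name, price, image_url))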
                seen_products.add(name)
                

        
        # Find and scroll to the last product element
        if product_list:
            # Names are already stripped when stored; "\'" equals "'", so no replace is needed
            last_product_name = product_list[-1][0]
            element = driver.find_element(By.XPATH, f'//h3[contains(normalize-space(text()), "{last_product_name}")]')
            driver.execute_script("arguments[0].scrollIntoView({ block: 'start', inline: 'nearest'});", element)
            time.sleep(1.5)
        else:
            print("No products found.")
            break

        # Check if new products were loaded
        if len(product_list) == previous_length:
            break
        previous_length = len(product_list)

    
    print(f'Number of products: {len(product_list)}')
    print(product_list)
        
    driver.quit()

    print("\n")

    # Create a pandas DataFrame from the product list
    df = pd.DataFrame(product_list, columns=['Product Name', 'Price', 'ImageURL']) 
    df['PageURL'] = url
    df['Category'] = cat_name
    df['Store'] = Store_Name
    df = df[['Product Name', 'Price', 'Category', 'Store', 'ImageURL', 'PageURL']]

    # Save the DataFrame to a CSV file, appending if the file exists
    file_exists = os.path.isfile('products.csv')
    df.to_csv('products.csv', mode='a', header=not file_exists, index=False)
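
The append-with-header-once pattern used above does not depend on pandas. As a minimal sketch, the same idea can be written with the standard `csv` module. The function name `append_rows` is an illustrative assumption, and the extra check for an empty file is an addition that our original code does not make.

```python
import csv
import os

FIELDS = ["Product Name", "Price", "Category", "Store", "ImageURL", "PageURL"]


def append_rows(path, rows):
    """Append dict rows to a CSV file, writing the header only for a new or empty file."""
    write_header = not os.path.isfile(path) or os.path.getsize(path) == 0
    with open(path, "a", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDS)
        if write_header:
            writer.writeheader()
        writer.writerows(rows)
```

Because the file is opened in append mode, running the scraper twice adds the same products twice, so the file should be deleted or deduplicated before a fresh run.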


We wrote this code after inspecting the HTML of the website: we simply tell Selenium which elements contain the data we need.
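
The XPath lookups in `scrape_products()` place the product name inside a double-quoted XPath string, which breaks for names that contain a double quote. XPath 1.0 has no escape character, so the usual workaround is to pick the other quote type or to build the string with `concat()`. The helper below is a sketch of that workaround; the name `xpath_literal` is our own and it is not part of Selenium.

```python
def xpath_literal(text):
    """Return text as an XPath 1.0 string literal, whatever quotes it contains."""
    if '"' not in text:
        return f'"{text}"'
    if "'" not in text:
        return f"'{text}'"
    # Both quote types present: split on double quotes and glue with concat()
    parts = text.split('"')
    pieces = []
    for index, part in enumerate(parts):
        if part:
            pieces.append(f'"{part}"')
        if index < len(parts) - 1:
            pieces.append("'\"'")  # a double quote wrapped in single quotes
    return "concat(" + ", ".join(pieces) + ")"
```

The result already includes its own quotes, so it would be interpolated without extra ones, for example `f'//h3[contains(normalize-space(text()), {xpath_literal(name)})]'`.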


## Analysis

After successfully scraping the data we created a CSV file containing the data from all the stores.
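
As a first step towards the goal stated in the introduction, the sketch below shows one way to find the cheapest store for each product from rows shaped like our CSV. It rests on two assumptions that still have to be checked against the real data: that the price text contains a single number with a dot or comma as the decimal separator, and that the same product has exactly the same name in every store. Both function names are illustrative.

```python
import re
from decimal import Decimal


def parse_price(text):
    """Extract a Decimal from price text; the accepted formats are an assumption."""
    match = re.search(r"\d+(?:[.,]\d+)?", text or "")
    if not match:
        return None
    return Decimal(match.group().replace(",", "."))


def cheapest_by_product(rows):
    """Map each product name to its cheapest (price, store) pair."""
    best = {}
    for row in rows:
        price = parse_price(row.get("Price"))
        if price is None:
            continue  # skips placeholders such as "Price not found"
        name = row["Product Name"]
        if name not in best or price < best[name][0]:
            best[name] = (price, row["Store"])
    return best
```

The rows can come from `csv.DictReader` opened on `products.csv`, since the column names match the ones written by `scrape_products()`.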

**THE LINK TO THE CSV FILE, THE REST OF THE ANALYSIS AND THE CONCLUSION GO HERE**
