<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Capstone Project: Harmony
## 1.1 Web scraping - Lian Seng Hin
> Authors: Eugene Matthew Cheong
---

## Introduction

This project is designed to assist interior designers by optimizing the selection of interior design products. It compiles a detailed catalogue of items such as tiles, laminates, and paints, and employs cosine similarity calculations to find and suggest products that closely match desired colors. This feature enhances the ability to match color palettes with precision, catering to client preferences and design requirements.

Additionally, the system includes an image matching feature, which allows designers to upload images provided by clients or photos of spaces to be redesigned. It automatically identifies and suggests corresponding products from the catalogue. This capability ensures that designers can efficiently align client expectations with the available inventory, thereby streamlining the design process.

---

## Persona
George has spent eight years designing beautiful homes, adjusting to different client preferences. Despite his experience, he often finds it difficult to start new projects because of the vast array of products and colours available. He also struggles to understand what clients want from their text messages alone. George needs a way to simplify the beginning of his design projects and better grasp client needs.

---

## Problem Statement
How can we help interior designers recommend designs more efficiently?

---

## Approach
The recommendation system is tailored to support interior designers by suggesting products based on an initial color or product choice. It begins by analyzing the color characteristics of the chosen item, then uses cosine similarity calculations to identify products within our catalogue that have similar color properties.
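To make the cosine similarity step concrete, here is a minimal sketch that ranks a few products by how closely their colour matches a target colour. The catalogue entries, the RGB values, and the function names `cosine_similarity` and `rank_by_colour` are all hypothetical and chosen for illustration; the actual feature representation used in the modelling notebooks may differ.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # Convention chosen here for all-zero vectors (pure black)
    return dot / (norm_a * norm_b)

# Hypothetical catalogue: product name -> average RGB colour
catalogue = {
    "warm-beige-tile": (210, 190, 160),
    "slate-grey-tile": (90, 100, 110),
    "terracotta-tile": (200, 100, 70),
}

def rank_by_colour(target_rgb, products):
    """Return product names ordered from most to least similar colour."""
    return sorted(
        products,
        key=lambda name: cosine_similarity(target_rgb, products[name]),
        reverse=True,
    )

ranked = rank_by_colour((205, 185, 150), catalogue)
print(ranked)
```

Cosine similarity compares the direction of two vectors and ignores their length, so a dark grey and a light grey with the same RGB proportions score as identical. Whether that is acceptable depends on the colour representation chosen.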

Once a base color or product is selected, the system generates a color palette that harmonizes with the initial choice. This palette serves as a guide for interior designers, enabling them to recommend a range of products that not only match but also complement the core color scheme.

This approach ensures a cohesive aesthetic across the design project, allowing designers to confidently match products with the overall style and color preferences of their clients. The documentation provided here explains the technical and practical aspects of this approach, offering a clear pathway for designers to leverage the system's capabilities effectively.

---

## Table of Contents ##

#### 1. Web Scraping

- [1.1 Scraping Lian Seng Hin Website](1.1_web_scraping_liansenghin.ipynb)
- [1.2 Scraping Hafary Website](1.2_web_scraping_hafary.ipynb)
- [1.3 Scraping Lamitak Website](1.3_web_scraping_lamitak.ipynb)
- [1.4 Scraping Nippon Website](1.4_web_scraping_nippon.ipynb)
- [1.5 Consolidate All Product Database](1.5_consolidate_product_database.ipynb)

#### 2. Preprocessing

- [2.1 Processing Canva Palettes](2.1_processing_canva_palette.ipynb)

#### 3. Modelling

- [3.1 Matching Input Photo to Products](3.1_matching_input_photo_to_products.ipynb)
- [3.2 Recommending Canva Palette to Products](3.2_recommending_canva_palette_to_product.ipynb)
- [3.3 Recommending Colours and Colour Palettes with Llama3](3.3_recommending_colours_and_colour_palettes_with_llama3.ipynb)

---

# Import Modules

In [None]:
import os
import time
import shutil

import requests
from bs4 import BeautifulSoup

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns


# Websites to scrape
- https://liansenghin.com.sg/product-category/tiles/
- https://liansenghin.com.sg/product-category/uncategorized/

In [None]:
#Links to scrape
links_list = ["https://liansenghin.com.sg/product-category/tiles/",
              "https://liansenghin.com.sg/product-category/uncategorized/"]

In [None]:
#Setting variable for location of the images
data_img_folder = "../datasets/images"
liansenghin_img_folder =  os.path.join(data_img_folder,"liansenghin")

# Scraping for Lian Seng Hin tile products

### Function to scrape information required per Lian Seng Hin page

In [None]:
# Function to scrape images and labels from a single page
def scrape_page(url,input_folder):

    # Create a directory to store images
    os.makedirs(input_folder, exist_ok=True)

    response = requests.get(url)
    if response.status_code != 200:
        print(f"Failed to fetch {url}")
        return

    soup = BeautifulSoup(response.content, 'html.parser')

    # Find image containers within the specified class
    image_containers = soup.find_all('div', class_='modus-column-custom')

    # Extract image URLs and labels
    for container in image_containers:
        label_tag = container.find('div', class_="ct-product-right")
        link_tag = label_tag.find('a') if label_tag else None
        if link_tag is None or not link_tag.get('href'):
            continue  # Skip containers without a product link

        # The product slug is the last path segment of the product URL
        label = link_tag['href'].rstrip("/").split("/")[-1]

        for image_tag in container.find_all('img'):
            image_url = image_tag.get('src')
            if not image_url:
                continue

            print(f"Image Label Found: {label}")
            print(f"Image URL Found: {image_url}")

            # Download the image and save it under its label.
            # Comment this out to list the images without downloading them.
            download_image(image_url, label, input_folder)


### Function to download images with given URL

In [None]:
# Function to download image and save with label
def download_image(url, label, input_folder):
    try:
        response = requests.get(url, timeout=30)
        response.raise_for_status()  # Treat HTTP errors as failures
        filename = f"{label}.jpg"
        image_filepath = os.path.join(input_folder, filename)
        with open(image_filepath, 'wb') as f:
            f.write(response.content)
        print(f"Image saved: {image_filepath}")
    except (requests.RequestException, OSError) as error:
        print(f'Unable to save image: {url} ({error})')
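The download function above saves every image with a `.jpg` extension, whatever the source file type is. As a possible refinement, the sketch below derives the extension from the image URL and falls back to `.jpg` when none is present. The helper name `filename_for` and the example URLs are hypothetical.

```python
import os
from urllib.parse import urlparse

def filename_for(url, label, default_ext=".jpg"):
    """Build '<label><ext>' using the extension found in the URL path."""
    path = urlparse(url).path                # Drops any ?query or #fragment
    ext = os.path.splitext(path)[1].lower()  # '' when the path has no extension
    return f"{label}{ext or default_ext}"

print(filename_for("https://example.com/uploads/a-tile.PNG?v=2", "a-tile"))
print(filename_for("https://example.com/img/no-extension", "plain-tile"))
```

Because the later cells recover the product name with `os.path.splitext(imagefile)[0]`, filenames with other extensions would still map back to the same product URL.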

### Function to scrape Lian Seng Hin Tile pages

In [None]:
# Main function to iterate through pages and scrape
def liansenghin_scrape(base_url):
    page_number = 1
    # Iterate through pages
    while True:
        page_url = f"{base_url}page/{page_number}/"
        response = requests.get(page_url)
        if response.status_code != 200:
            print("No more pages. Exiting.")
            break

        print(f"Scraping page {page_number}...")
        scrape_page(page_url,liansenghin_img_folder)
        page_number += 1
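The pagination loop above stops at the first page that does not return status 200. The sketch below separates the URL generation from the network check so the stopping logic can be tried without touching the website. The names `iter_page_urls`, `is_available`, and `max_pages` are assumptions made for this example; the `max_pages` cap is not in the original and only guards against a site that answers 200 for every page number.

```python
def iter_page_urls(base_url, is_available, max_pages=500):
    """Yield paginated URLs until `is_available(url)` returns False."""
    for page_number in range(1, max_pages + 1):
        page_url = f"{base_url}page/{page_number}/"
        if not is_available(page_url):
            break
        yield page_url

# Stand-in for the network check: pretend only pages 1 to 3 exist
def fake_is_available(url):
    return url.endswith(("page/1/", "page/2/", "page/3/"))

pages = list(iter_page_urls("https://example.com/tiles/", fake_is_available))
print(pages)
```

In the notebook, `is_available` could wrap a call such as `requests.get(url, timeout=10).status_code == 200`.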

In [None]:

start_time = time.time()

for link in links_list:
  print(f"Scraping current link {link}")
  liansenghin_scrape(link)


end_time = time.time()

runtime = end_time - start_time
print("Scraping Runtime:", runtime, "seconds")


# Gathering the details of the products in their sub-pages

### Gathering Category tags

Getting the list of image file names in the liansenghin_img_folder to get the product URL

In [None]:
listdir = os.listdir(liansenghin_img_folder)

In [None]:
print(listdir)

### Base URL that the website uses to store its tile products

In [None]:
base_product_url = "https://liansenghin.com.sg/product/"

### To list out the product URL

Taking the product name from the file and adding to the base url.

In [None]:
product_list = []

for imagefile in listdir:
  imagename = os.path.splitext(imagefile)[0]
  product_url = f"{base_product_url}{imagename}/"
  current_product = {"Model Name" : imagename,
                     "Product URL" : product_url,
                     "Filename" : imagefile,
                     "Company" : "Lian Seng Hin",
                     "Type" : "Tiles",
                     "Application" : "Floor" }
  product_list.append(current_product)

In [None]:
product_list

Converting to Dataframe

In [None]:
product_df = pd.DataFrame(product_list)
product_df

### Function to populate "Category Tags" with the category tags scraped from each product page

In [None]:
def scrape_page_categories(row):
    print("Processing row:", row.name)
    url = row['Product URL']

    response = requests.get(url)
    if response.status_code != 200:
        print(f"Failed to fetch {url}")
        return None

    soup = BeautifulSoup(response.content, 'html.parser')

    # Find the product meta spans that list the categories
    product_meta_details = soup.find_all('span', class_='posted_in')

    # Extract categories from each span element
    categories = []
    for details in product_meta_details:
        # Find all 'a' tags within the span
        category_links = details.find_all('a')
        for link in category_links:
            categories.append(link.get_text())

    categories_string = ', '.join(categories)
    return categories_string

In [None]:
product_df['Category Tags'] = product_df.apply(scrape_page_categories, axis=1)

# Remove rows where the function returned None
product_df = product_df.dropna(subset=['Category Tags'])

### Function to populate "Origin Country" with the country of manufacture scraped from each product page

In [None]:
def scrape_page_country(row):
    print("Processing row:", row.name)
    url = row['Product URL']

    response = requests.get(url)
    if response.status_code != 200:
        print(f"Failed to fetch {url}")
        return None

    soup = BeautifulSoup(response.content, 'html.parser')

    # Find the attribute table row that holds the country of origin
    product_attributes_country = soup.find_all('tr', class_="woocommerce-product-attributes-item woocommerce-product-attributes-item--attribute_pa_country")

    # Take the text of the first <p> in the row; stays None if the row is absent
    country = None
    for details in product_attributes_country:
        paragraph = details.find('p')
        if paragraph is not None:
            country = paragraph.text.strip()

    return country

In [None]:
start_time = time.time()

product_df['Origin Country'] = product_df.apply(scrape_page_country, axis=1)

end_time = time.time()

runtime = end_time - start_time
print("Scraping Runtime:", runtime, "seconds")

### Function to populate "Dimension (cm)" with the tile dimensions scraped from each product page

In [None]:
def scrape_page_dimension(row):
    print("Processing row:", row.name)
    url = row['Product URL']

    response = requests.get(url)
    if response.status_code != 200:
        print(f"Failed to fetch {url}")
        return None

    soup = BeautifulSoup(response.content, 'html.parser')

    # Find the attribute table row that holds the tile size
    product_attributes_dimension = soup.find_all('tr', class_="woocommerce-product-attributes-item woocommerce-product-attributes-item--attribute_pa_size")

    # Take the text of the first <p> in the row; stays None if the row is absent
    dimension = None
    for details in product_attributes_dimension:
        paragraph = details.find('p')
        if paragraph is not None:
            dimension = paragraph.text.strip()

    return dimension

In [None]:
start_time = time.time()

product_df['Dimension (cm)'] = product_df.apply(scrape_page_dimension, axis=1)

end_time = time.time()

runtime = end_time - start_time
print("Scraping Runtime:", runtime, "seconds")

Remove the "CM" unit from the dimensions so that the width and height can be split.

In [None]:
product_df['Dimension (cm)'] = product_df['Dimension (cm)'].str.replace("CM", "").str.strip()

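Change every "X" to "x" so that all dimensions use the same separator before splitting. This also turns "HEXAGON" into "HExAGON", which is corrected further down.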

In [None]:
product_df['Dimension (cm)'] = product_df['Dimension (cm)'].str.replace("X", "x")

Split the dimension into width and height columns.

In [None]:
product_df[['Width (cm)', 'Height (cm)']] = product_df['Dimension (cm)'].str.split(" x ", expand=True)

The 'Dimension (cm)' column is no longer needed.

In [None]:
product_df.drop(columns='Dimension (cm)', inplace=True)

Restore "HEXAGON", which the earlier "X" to "x" replacement turned into "HExAGON".

In [None]:
product_df['Width (cm)'] = product_df['Width (cm)'].str.replace("HExAGON", "HEXAGON")
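The replace-and-split steps above rely on every dimension string using the same spacing and letter case. As an alternative, the sketch below parses the width and height with a single case-insensitive regular expression, which leaves words such as "HEXAGON" untouched and returns `None` for anything that is not of the form "W x H". The function name `parse_dimension` and the sample strings are assumptions; the formats actually present on the website may vary.

```python
import re

# Matches e.g. "60 X 60 CM", "7.5x30cm", "30 x 60"
DIMENSION_PATTERN = re.compile(
    r"^\s*(\d+(?:\.\d+)?)\s*x\s*(\d+(?:\.\d+)?)\s*(?:cm)?\s*$",
    re.IGNORECASE,
)

def parse_dimension(text):
    """Return (width, height) as floats, or None if text is not 'W x H [CM]'."""
    if not isinstance(text, str):
        return None
    match = DIMENSION_PATTERN.match(text)
    if match is None:
        return None
    return float(match.group(1)), float(match.group(2))

for sample in ["60 X 60 CM", "7.5x30cm", "HEXAGON", None]:
    print(repr(sample), "->", parse_dimension(sample))
```

Applied to the DataFrame, this could be used as `product_df['Dimension (cm)'].apply(parse_dimension)` before the column is dropped, with the non-matching rows reviewed separately.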

In [None]:
product_df

In [None]:
product_df.info()

Some values are not numerical. These will be reviewed later to decide whether the information is needed.

In [None]:
product_df['Width (cm)'].unique()

In [None]:
product_df['Height (cm)'].unique()

# Export Dataframe to CSV

In [None]:
archive_dataset_path = "../datasets/archive_dataset/"
file_path = '../datasets/liansenghin_df.csv'

Archive the old CSV before saving the current product list.

In [None]:
if not os.path.exists(archive_dataset_path):
    os.makedirs(archive_dataset_path)  # Create the archive folder if it doesn't exist

# Check if the file exists
if os.path.isfile(file_path):
    # Move the file to the archive folder
    shutil.move(file_path, os.path.join(archive_dataset_path, f"liansenghin_df_archived_{pd.Timestamp.now().strftime('%Y%m%d%H%M%S')}.csv"))

In [None]:
product_df.to_csv(file_path)

After scraping, I noticed that some images are incorrect: they show a room instead of the tile. I will update these images later.

# Find missing files and update to the correct image

In [None]:
product_df = pd.read_csv(file_path)

In [None]:
missing_image_list = []

for filename in product_df['Filename']:
  full_image_filepath = os.path.join(liansenghin_img_folder, filename)
  if not os.path.exists(full_image_filepath):
    # Record every product whose image file is absent from the folder
    missing_image_list.append(full_image_filepath)
    print(f"Error finding image path: {full_image_filepath}")

# Moving old product images to the archive when they are no longer in the CSV

When the catalogue is updated, images of products that are no longer listed are archived so that they are not included in the recommendations.

In [None]:
archive_img_path = os.path.join(liansenghin_img_folder,"archived")
if not os.path.exists(archive_img_path):
    os.makedirs(archive_img_path)  # Create the archive folder if it doesn't exist

# Iterate over all files currently in the image folder
for image in os.listdir(liansenghin_img_folder):
    image_path = os.path.join(liansenghin_img_folder, image)

    # Skip sub-folders such as the archive folder itself
    if not os.path.isfile(image_path):
        continue

    # Archive the image if its filename is not listed in the DataFrame
    if not product_df['Filename'].astype(str).eq(image).any():
        try:
            shutil.move(image_path, os.path.join(archive_img_path, image))
            print(f'Image moved to archived: {image_path}')
        except OSError as error:
            print(f'Error: Unable to move image: {image_path} ({error})')
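The archiving step can be tried safely on a throwaway folder before running it on the real images. The sketch below wraps the same idea in a function and demonstrates it in a temporary directory. The function name `archive_unlisted_images` and the sample filenames are hypothetical, and the list of filenames stands in for `product_df['Filename']`.

```python
import shutil
import tempfile
from pathlib import Path

def archive_unlisted_images(image_folder, listed_filenames, archive_name="archived"):
    """Move files not named in listed_filenames into an archive sub-folder."""
    image_folder = Path(image_folder)
    archive_folder = image_folder / archive_name
    archive_folder.mkdir(exist_ok=True)
    keep = set(listed_filenames)  # Exact filename matches only
    moved = []
    for path in sorted(image_folder.iterdir()):
        if path.is_file() and path.name not in keep:
            shutil.move(str(path), str(archive_folder / path.name))
            moved.append(path.name)
    return moved

# Demonstration in a temporary directory with three empty image files
with tempfile.TemporaryDirectory() as tmp:
    for name in ("a.jpg", "b.jpg", "c.jpg"):
        (Path(tmp) / name).write_bytes(b"")
    moved = archive_unlisted_images(tmp, ["a.jpg", "c.jpg"])
    remaining = sorted(p.name for p in Path(tmp).iterdir() if p.is_file())

print("Moved:", moved)
print("Remaining:", remaining)
```

Using a set of exact filenames avoids partial matches, where one filename happens to be contained in another.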


---

### Next Notebook: [1.2 Scraping Hafary Website](1.2_web_scraping_hafary.ipynb)