<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Capstone Project: Harmony
## 1.5 Consolidate Product Database
> Authors: Eugene Matthew Cheong
---

## Table of Contents ##

#### 1. Web Scraping

- [1.1 Scraping Lian Seng Hin Website](1.1_web_scraping_liansenghin.ipynb)
- [1.2 Scraping Hafary Website](1.2_web_scraping_hafary.ipynb)
- [1.3 Scraping Lamitak Website](1.3_web_scraping_lamitak.ipynb)
- [1.4 Scraping Nippon Website](1.4_web_scraping_nippon.ipynb)
- [1.5 Consolidate All Product Database](1.5_consolidate_product_database.ipynb)

#### 2. Preprocessing

- [2.1 Processing Canva Palettes](2.1_processing_canva_palette.ipynb)

#### 3. Modelling

- [3.1 Matching Input Photo to Products](3.1_matching_input_photo_to_products.ipynb)
- [3.2 Recommending Canva Palette to Products](3.2_recommending_canva_palette_to_product.ipynb)
- [3.3 Recommending Colours and Colour Palettes with Llama3](3.3_recommending_colours_and_colour_palettes_with_llama3.ipynb)

---

# Import Modules

In [94]:
import os
import re
import time
import h5py

from skimage.io import imread
from skimage.transform import resize
from skimage.color import gray2rgb
from skimage.color import rgb2lab, lab2rgb, rgb2hsv, hsv2rgb

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt


from scipy.spatial.distance import euclidean, cosine, cityblock

from keras.applications.vgg16 import VGG16, preprocess_input
from keras.preprocessing import image
from keras.models import Model

# Functions for saving and loading h5 files

In [95]:
def save_to_hdf5(processed_images, filename):
    with h5py.File(filename, 'w') as f:
        for i, (img, flat) in enumerate(processed_images):
            f.create_dataset(f'image_{i}', data=img)
            f.create_dataset(f'flat_{i}', data=flat)

In [96]:
def load_from_hdf5(filename):
    with h5py.File(filename, 'r') as f:
        # Initialize containers for images and flattened data
        images = []
        flattened_data = []

        # Iterate over items in HDF5 file and load them
        for i in range(len(f.keys()) // 2):  # Assuming each image has two corresponding keys (image and flat)
            img_key = f'image_{i}'
            flat_key = f'flat_{i}'
            
            # Load and append to respective lists
            images.append(np.array(f[img_key]))  # Convert to numpy array if necessary
            flattened_data.append(np.array(f[flat_key]))

    return images, flattened_data

In [97]:
data_folder = "../datasets/"

# Importing and cleaning up all the catalogue dataframes

### List to reorder the dataframes

In [98]:
columns = ['Model Name', 'Company', 'Type', 'Origin Country', 'Application', 'Filename']

In [99]:
hafary_df = pd.read_csv('../datasets/hafary_df.csv').drop(['Unnamed: 0'],axis=1).drop(columns=['Height (cm)', 'Width (cm)', 'Product URL', 'Category Tags'])
lamitak_df = pd.read_csv('../datasets/lamitak_df.csv').drop(['Unnamed: 0'],axis=1).drop(columns=['Height (cm)', 'Width (cm)', 'Product URL', 'Category Tags'])
liansenghin_df = pd.read_csv('../datasets/liansenghin_df.csv').drop(['Unnamed: 0'],axis=1).drop(columns=['Height (cm)', 'Width (cm)', 'Product URL', 'Category Tags'])
nippon_df = pd.read_csv('../datasets/nippon_df.csv').drop(['Unnamed: 0'],axis=1).drop(columns=['Product URL', 'Category Tags'])

These 2 catalogues did not have information of where the products were made, so I decided to fill it in it with 'Singapore' for the time being.

In [100]:
hafary_df = hafary_df[columns]

lamitak_df = lamitak_df[columns]
lamitak_df['Origin Country'] = lamitak_df['Origin Country'].fillna('Singapore')

liansenghin_df = liansenghin_df[columns]

nippon_df = nippon_df[columns]
nippon_df['Origin Country'] = lamitak_df['Origin Country'].fillna('Singapore')

In [101]:
lamitak_df

Unnamed: 0,Model Name,Company,Type,Origin Country,Application,Filename
0,WY 4276X,lamitak,Laminate,Singapore,Carpentry,lamitak_-_rye_savannah_wood_wy_4276x_thumbnail...
1,WY 4274X,lamitak,Laminate,Singapore,Carpentry,lamitak_-_bronze_savannah_wood_wy_4274x_thumbn...
2,WY 4275X,lamitak,Laminate,Singapore,Carpentry,lamitak_-_russet_savannah_wood_wy_4275x_thumbn...
3,WY 4273X,lamitak,Laminate,Singapore,Carpentry,lamitak_-_barley_savannah_wood_wy_4273x_thumbn...
4,WY 7202D,lamitak,Laminate,Singapore,Carpentry,lamitak_-_bleached_barn_wood_wy_7202d_thumbnai...
...,...,...,...,...,...,...
508,CC 47101S,lamitak,Laminate,Singapore,Carpentry,cc_47101s_nude_core_4x8_base_thumbnail_1.jpg
509,ART 11011 D,lamitak,Laminate,Singapore,Carpentry,art_11011d_montignoso_due_4x10_base_thumbnail_...
510,ART 11012 D,lamitak,Laminate,Singapore,Carpentry,art_11012d_montignoso_uno_4x10_base_thumbnail_...
511,ART 1010 XM,lamitak,Laminate,Singapore,Carpentry,art_1010xm_lunigiana_due_4x8_base_thumbnail_1.jpg


In [102]:
liansenghin_df

Unnamed: 0,Model Name,Company,Type,Origin Country,Application,Filename
0,06-eucalyptus,Lian Seng Hin,Tiles,Spain,Floor,06-eucalyptus.jpg
1,01-ckh603111b-60,Lian Seng Hin,Tiles,China,Floor,01-ckh603111b-60.jpg
2,01-broadway-ash-grey,Lian Seng Hin,Tiles,Malaysia,Floor,01-broadway-ash-grey.jpg
3,09-art-blanco,Lian Seng Hin,Tiles,Spain,Floor,09-art-blanco.jpg
4,01-bari-ivory-m-60,Lian Seng Hin,Tiles,China,Floor,01-bari-ivory-m-60.jpg
...,...,...,...,...,...,...
2451,01-bari-silver-m-60,Lian Seng Hin,Tiles,China,Floor,01-bari-silver-m-60.jpg
2452,01-assen-grey-mate,Lian Seng Hin,Tiles,Spain,Floor,01-assen-grey-mate.jpg
2453,03-330a-gloss,Lian Seng Hin,Tiles,China,Floor,03-330a-gloss.jpg
2454,10-stella-azul,Lian Seng Hin,Tiles,Spain,Floor,10-stella-azul.jpg


In [103]:
nippon_df

Unnamed: 0,Model Name,Company,Type,Origin Country,Application,Filename
0,Angel Pink,Nippon,Paint,Singapore,Wall,1162.png
1,Apple White,Nippon,Paint,Singapore,Wall,1140.png
2,Apple White,Nippon,Paint,Singapore,Wall,9070.png
3,Barley White,Nippon,Paint,Singapore,Wall,1141.png
4,Barley White,Nippon,Paint,Singapore,Wall,9071.png
...,...,...,...,...,...,...
2503,Innermost Thoughts,Nippon,Paint,,Wall,NP AC 3455 A.png
2504,Twisted,Nippon,Paint,,Wall,NP AC 3456 A.png
2505,Mystic,Nippon,Paint,,Wall,NP AC 3457 A.png
2506,Blindfold,Nippon,Paint,,Wall,NP AC 3458 A.png


## Cleaning and removing missing data in nippon_df

There was some errors that were caused when using the recommendation system and noticed that there were missing values in the dataframe. Based on observations, it looks like the scraping function could not pick up the Model name for some products. Will be removing them for now.

In [104]:
nippon_df['Model Name'].isnull().sum()

746

In [105]:
nippon_df['Filename'].isnull().sum()

0

In [106]:
nippon_df = nippon_df.dropna()

There were images that were named 'nan.png' as well. Will be removing them as well.

In [107]:
nippon_df = nippon_df[nippon_df['Filename'] != 'nan.png']

In [108]:
nippon_df.head()

Unnamed: 0,Model Name,Company,Type,Origin Country,Application,Filename
0,Angel Pink,Nippon,Paint,Singapore,Wall,1162.png
1,Apple White,Nippon,Paint,Singapore,Wall,1140.png
2,Apple White,Nippon,Paint,Singapore,Wall,9070.png
3,Barley White,Nippon,Paint,Singapore,Wall,1141.png
4,Barley White,Nippon,Paint,Singapore,Wall,9071.png


In [109]:
nippon_df.shape

(505, 6)

### Combined all the dataframes together

In [110]:
all_products_df = pd.concat([hafary_df, liansenghin_df, lamitak_df, nippon_df], axis=0)

In [111]:
all_products_df

Unnamed: 0,Model Name,Company,Type,Origin Country,Application,Filename
0,For Treccia F,Hafary,Tiles,Italy,"Wall, Floor",for-treccia-f_r-500-500-60-user-data-collectio...
1,ST26996N(M),Hafary,Tiles,Malaysia,"Floor, Wall",st26996n(m)_r-500-500-60-user-data-collections...
2,ST26996N(M),Hafary,Tiles,Malaysia,"Floor, Wall",st26996n(m)_r-500-500-60-user-data-collections...
3,QSY8003A,Hafary,Tiles,China,"Floor, Wall",qsy8003a_r-500-500-60-user-data-collections-40...
4,QSY8017A,Hafary,Tiles,China,"Floor, Wall",qsy8017a_r-500-500-60-user-data-collections-40...
...,...,...,...,...,...,...
508,Cumulus Clouds,Nippon,Paint,Singapore,Wall,NP N 3313 P.png
509,Hail Rain,Nippon,Paint,Singapore,Wall,NP N 3314 P.png
510,Painted Glass,Nippon,Paint,Singapore,Wall,NP N 3315 P.png
511,Blue Pool,Nippon,Paint,Singapore,Wall,NP N 1972 P.png


In [112]:
all_img_folder =  os.path.join(data_folder,"images","all_images")

In [113]:
def sanitize_and_rename(filename, folder_path):
    # Split the filename into name and extension
    name, ext = os.path.splitext(filename)
    
    # Define the pattern to match any character that is not a letter, digit, or underscore
    pattern = r'[^a-zA-Z0-9_]'
    
    # Replace all matching characters with an underscore in the name part only
    sanitized_name = re.sub(pattern, '-', name)
    
    # Reconstruct the sanitized filename
    sanitized_filename = sanitized_name + ext
    
    # Construct full paths
    old_path = os.path.join(folder_path, filename)
    new_path = os.path.join(folder_path, sanitized_filename)
    
    # Rename the file if it exists
    if os.path.exists(old_path):
        os.rename(old_path, new_path)
    
    return sanitized_filename

In [114]:
all_products_df['Filename'] = all_products_df['Filename'].apply(lambda x: sanitize_and_rename(x, all_img_folder))

# Export combined products to "all_products_df.csv"

In [115]:
all_products_df.to_csv('../datasets/all_products_df.csv', index=False)

# Import "all_products_df.csv"

In [116]:
all_products_df = pd.read_csv('../datasets/all_products_df.csv')
image_folder = '../datasets/images/all_images'

In [117]:
all_products_df

Unnamed: 0,Model Name,Company,Type,Origin Country,Application,Filename
0,For Treccia F,Hafary,Tiles,Italy,"Wall, Floor",for-treccia-f_r-500-500-60-user-data-collectio...
1,ST26996N(M),Hafary,Tiles,Malaysia,"Floor, Wall",st26996n-m-_r-500-500-60-user-data-collections...
2,ST26996N(M),Hafary,Tiles,Malaysia,"Floor, Wall",st26996n-m-_r-500-500-60-user-data-collections...
3,QSY8003A,Hafary,Tiles,China,"Floor, Wall",qsy8003a_r-500-500-60-user-data-collections-40...
4,QSY8017A,Hafary,Tiles,China,"Floor, Wall",qsy8017a_r-500-500-60-user-data-collections-40...
...,...,...,...,...,...,...
12414,Cumulus Clouds,Nippon,Paint,Singapore,Wall,NP-N-3313-P.png
12415,Hail Rain,Nippon,Paint,Singapore,Wall,NP-N-3314-P.png
12416,Painted Glass,Nippon,Paint,Singapore,Wall,NP-N-3315-P.png
12417,Blue Pool,Nippon,Paint,Singapore,Wall,NP-N-1972-P.png


# Preprocessing all the images so that it'll be easier to load in later on than calculating individual images every time 

In [118]:
# Function to preprocess and flatten images
def preprocess(image_path):
    img = imread(image_path)
    if  img.shape[2] == 4:  # Check if the image has an alpha channel
        img = img[:, :, :3]  # Remove the alpha channel if present

    # Convert the RGB image to Lab color space
    #img_lab = rgb2hsv(img)
    img_lab = rgb2lab(img)

    # Resize the image to a fixed size (e.g., 256x256)
    img_resized = resize(img_lab, (256, 256), anti_aliasing=True)
    
    # Optionally, you can flatten the image if needed for further processing
    img_flattened = img_resized.flatten()

    # Return both the resized Lab image and the flattened version
    return img_resized, img_flattened

In [119]:
def safe_cosine(u, v):
    if np.linalg.norm(u) == 0 or np.linalg.norm(v) == 0:
        return 1.0  # Use 1.0 to indicate maximum dissimilarity or undefined similarity
    else:
        return cosine(u, v)

In [120]:
image_list = []

for i in list(all_products_df['Filename']):
  full_image_filepath = os.path.join(image_folder,i)
  if os.path.exists(full_image_filepath):
    image_list.append(os.path.join(image_folder,i))
  else:
    print(f"Error finding image path: {full_image_filepath}")

Error finding image path: ../datasets/images/all_images/salou-ab-c-beige_r-500-500-60-user-data-collections-2781-15965-1648027707.jpg
Error finding image path: ../datasets/images/all_images/salou-ab-c-esmeralda_r-500-500-60-user-data-collections-2781-15966-1648027708.jpg
Error finding image path: ../datasets/images/all_images/salou-ab-c-beige_r-500-500-60-user-data-collections-2781-15967-1648027718.jpg
Error finding image path: ../datasets/images/all_images/salou-ab-c-beige_r-500-500-60-user-data-collections-2781-15968-1648027721.jpg
Error finding image path: ../datasets/images/all_images/salou-ab-c-blanco_r-500-500-60-user-data-collections-2781-15969-1648027725.jpg
Error finding image path: ../datasets/images/all_images/salou-ab-c-blanco_r-500-500-60-user-data-collections-2781-15970-1648029067.jpg
Error finding image path: ../datasets/images/all_images/salou-ab-c-negro_r-500-500-60-user-data-collections-2781-15971-1648027737.jpg
Error finding image path: ../datasets/images/all_images/

In [121]:
processed_images = [preprocess(img_path) for img_path in image_list]

# Saving preprocessed all_images to a h5 file

In [122]:
save_to_hdf5(processed_images, os.path.join(data_folder,'h5','preprocessed_all_images.h5'))

---

### Next Notebook: [2.1 Processing Canva Palettes](2.1_processing_canva_palette.ipynb)