<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Capstone Project: Harmony
## 1.5 Consolidate Product Database
> Authors: Eugene Matthew Cheong
---

## Table of Contents ##

#### 1. Web Scraping

- [1.1 Scraping Lian Seng Hin Website](1.1_web_scraping_liansenghin.ipynb)
- [1.2 Scraping Hafary Website](1.2_web_scraping_hafary.ipynb)
- [1.3 Scraping Lamitak Website](1.3_web_scraping_lamitak.ipynb)
- [1.4 Scraping Nippon Website](1.4_web_scraping_nippon.ipynb)
- [1.5 Consolidate All Product Database](1.5_consolidate_product_database.ipynb)

#### 2. Preprocessing

- [2.1 Processing Canva Palettes](2.1_processing_canva_palette.ipynb)

#### 3. Modelling

- [3.1 Matching Input Photo to Products](3.1_matching_input_photo_to_products.ipynb)
- [3.2 Recommending Canva Palette to Products](3.2_recommending_canva_palette_to_product.ipynb)
- [3.3 Recommending Colours and Colour Palettes with Llama3](3.3_recommending_colours_and_colour_palettes_with_llama3.ipynb)

---

# Import Modules

In [None]:
import os
import time
import h5py

from skimage.io import imread
from skimage.transform import resize
from skimage.color import gray2rgb
from skimage.color import rgb2lab, lab2rgb, rgb2hsv, hsv2rgb

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt


from scipy.spatial.distance import euclidean, cosine, cityblock

from keras.applications.vgg16 import VGG16, preprocess_input
from keras.preprocessing import image
from keras.models import Model

# Functions for saving and loading h5 files

In [None]:
def save_to_hdf5(processed_images, filename):
    with h5py.File(filename, 'w') as f:
        for i, (img, flat) in enumerate(processed_images):
            f.create_dataset(f'image_{i}', data=img)
            f.create_dataset(f'flat_{i}', data=flat)

In [None]:
def load_from_hdf5(filename):
    with h5py.File(filename, 'r') as f:
        # Initialize containers for images and flattened data
        images = []
        flattened_data = []

        # Iterate over items in HDF5 file and load them
        for i in range(len(f.keys()) // 2):  # Assuming each image has two corresponding keys (image and flat)
            img_key = f'image_{i}'
            flat_key = f'flat_{i}'
            
            # Load and append to respective lists
            images.append(np.array(f[img_key]))  # Convert to numpy array if necessary
            flattened_data.append(np.array(f[flat_key]))

    return images, flattened_data

In [None]:
data_folder = "../datasets/"

# Importing and cleaning up all the catalogue dataframes

### List to reorder the dataframes

In [None]:
columns = ['Model Name', 'Company', 'Type', 'Origin Country', 'Application', 'Filename']

In [None]:
hafary_df = pd.read_csv('../datasets/hafary_df.csv').drop(['Unnamed: 0'],axis=1).drop(columns=['Height (cm)', 'Width (cm)', 'Product URL', 'Category Tags'])
lamitak_df = pd.read_csv('../datasets/lamitak_df.csv').drop(['Unnamed: 0'],axis=1).drop(columns=['Height (cm)', 'Width (cm)', 'Product URL', 'Category Tags'])
liansenghin_df = pd.read_csv('../datasets/liansenghin_df.csv').drop(['Unnamed: 0'],axis=1).drop(columns=['Height (cm)', 'Width (cm)', 'Product URL', 'Category Tags'])
nippon_df = pd.read_csv('../datasets/nippon_df.csv').drop(['Unnamed: 0'],axis=1).drop(columns=['Product URL', 'Category Tags'])

Two of the catalogues (Lamitak and Nippon) had no information on where the products were made, so I decided to fill in 'Origin Country' with 'Singapore' for the time being.

In [None]:
hafary_df = hafary_df[columns]

lamitak_df = lamitak_df[columns]
lamitak_df['Origin Country'] = lamitak_df['Origin Country'].fillna('Singapore')

liansenghin_df = liansenghin_df[columns]

nippon_df = nippon_df[columns]
nippon_df['Origin Country'] = nippon_df['Origin Country'].fillna('Singapore')
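
A minimal sketch of the fill step on a toy dataframe (the rows are invented for illustration, not taken from the catalogues). It also shows that assigning a column taken from a different dataframe aligns on the index rather than on position, so the fill should read from the same dataframe it writes to.

```python
import numpy as np
import pandas as pd

# Hypothetical toy catalogue; the values are invented for illustration.
toy_df = pd.DataFrame({
    "Model Name": ["A1", "B2", "C3"],
    "Origin Country": ["Italy", np.nan, np.nan],
})

# .copy() makes the column subset an independent dataframe before assigning to it.
toy_df = toy_df[["Model Name", "Origin Country"]].copy()
toy_df["Origin Country"] = toy_df["Origin Country"].fillna("Singapore")
print(toy_df["Origin Country"].tolist())

# Assigning a Series from elsewhere aligns on the index, not on position:
# only index 0 exists in `other`, so rows 1 and 2 end up missing.
other = pd.Series(["Japan"], index=[0], name="Origin Country")
mismatched = toy_df.copy()
mismatched["Origin Country"] = other
print(mismatched["Origin Country"].isna().sum())
```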

In [None]:
lamitak_df

In [None]:
liansenghin_df

In [None]:
nippon_df

## Cleaning and removing missing data in nippon_df

The recommendation system raised some errors, and I noticed that they were caused by missing values in this dataframe. Based on observations, it looks like the scraping function could not pick up the model name for some products. I will be removing those rows for now.

In [None]:
nippon_df['Model Name'].isnull().sum()

In [None]:
nippon_df['Filename'].isnull().sum()

In [None]:
nippon_df = nippon_df.dropna()

Some images were also named 'nan.png'. I will be removing those rows as well.

In [None]:
nippon_df = nippon_df[nippon_df['Filename'] != 'nan.png']
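
The two clean-up steps above can be sketched on a toy dataframe (rows invented for illustration). The last line shows one plausible source of 'nan.png': formatting a missing name into a filename produces the literal string 'nan'. This is an assumption about the scraper, not something confirmed by the scraping notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical rows; names and filenames are invented for illustration.
toy_df = pd.DataFrame({
    "Model Name": ["Snow White", np.nan, "Sea Mist"],
    "Filename": ["snow_white.png", np.nan, "nan.png"],
})

# Step 1: drop rows that contain any missing value.
cleaned = toy_df.dropna()

# Step 2: drop rows whose image was saved under the placeholder name.
cleaned = cleaned[cleaned["Filename"] != "nan.png"].reset_index(drop=True)
print(cleaned)

# A missing value formatted into a string becomes the text 'nan'.
print(f"{np.nan}.png")
```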

In [None]:
nippon_df.head()

In [None]:
nippon_df.shape

### Combine all the dataframes

In [None]:
all_products_df = pd.concat([hafary_df, liansenghin_df, lamitak_df, nippon_df], axis=0)
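
A small sketch of what `pd.concat` does to the index, using toy dataframes (invented for illustration). Each input keeps its own row labels, so the stacked result has repeated labels unless `ignore_index=True` is passed. In this notebook that is harmless, because the CSV is written with `index=False` and gets a fresh index when it is read back in.

```python
import pandas as pd

# Hypothetical mini catalogues, invented for illustration.
a = pd.DataFrame({"Model Name": ["A1", "A2"], "Company": ["Hafary", "Hafary"]})
b = pd.DataFrame({"Model Name": ["B1"], "Company": ["Lamitak"]})

# Default behaviour: original row labels are kept, so label 0 appears twice.
stacked = pd.concat([a, b], axis=0)
print(stacked.index.tolist())

# ignore_index=True renumbers the rows from 0 to n-1.
renumbered = pd.concat([a, b], axis=0, ignore_index=True)
print(renumbered.index.tolist())
```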

In [None]:
all_products_df

# Export combined products to "all_products_df.csv"

In [None]:
all_products_df.to_csv('../datasets/all_products_df.csv', index=False)

# Import "all_products_df.csv"

In [None]:
all_products_df = pd.read_csv('../datasets/all_products_df.csv')
image_folder = '../datasets/images/all_images'

In [None]:
all_products_df

# Preprocessing all the images once, so they can be loaded later instead of being recomputed every time

In [None]:
# Function to preprocess and flatten images
def preprocess(image_path):
    img = imread(image_path)
    if img.ndim == 2:  # Grayscale image: expand to three channels
        img = gray2rgb(img)
    elif img.shape[2] == 4:  # Remove the alpha channel if present
        img = img[:, :, :3]

    # Convert the RGB image to Lab colour space
    img_lab = rgb2lab(img)

    # Resize the image to a fixed size of 256x256
    img_resized = resize(img_lab, (256, 256), anti_aliasing=True)

    # Flatten the resized image into a 1D vector for distance calculations
    img_flattened = img_resized.flatten()

    # Return both the resized Lab image and the flattened version
    return img_resized, img_flattened

In [None]:
def safe_cosine(u, v):
    if np.linalg.norm(u) == 0 or np.linalg.norm(v) == 0:
        return 1.0  # Use 1.0 to indicate maximum dissimilarity or undefined similarity
    else:
        return cosine(u, v)
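
To illustrate why the zero-norm guard is there, here is a numpy-only sketch of cosine distance (my own re-implementation for illustration, not scipy's function). With an all-zero vector, the formula divides zero by zero and gives `nan`. A completely black image is one case where this can happen, since black is (0, 0, 0) in Lab.

```python
import numpy as np

def cosine_distance(u, v):
    """Plain cosine distance: 1 - cos(angle between u and v)."""
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def guarded_cosine_distance(u, v):
    """Same distance, but returns 1.0 when either vector has zero length."""
    if np.linalg.norm(u) == 0 or np.linalg.norm(v) == 0:
        return 1.0
    return cosine_distance(u, v)

black = np.zeros(3)                     # hypothetical all-zero feature vector
colour = np.array([50.0, 10.0, -20.0])  # hypothetical non-zero feature vector

# Silence the expected 0/0 warning so the nan result is easy to see.
with np.errstate(divide="ignore", invalid="ignore"):
    unguarded = cosine_distance(black, colour)

guarded = guarded_cosine_distance(black, colour)
same = guarded_cosine_distance(colour, colour)
print(unguarded, guarded, same)
```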

In [None]:
image_list = []

for i in list(all_products_df['Filename']):
    full_image_filepath = os.path.join(image_folder, i)
    if os.path.exists(full_image_filepath):
        image_list.append(full_image_filepath)
    else:
        print(f"Error finding image path: {full_image_filepath}")
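
One thing to watch: the loop above skips missing files, so as soon as one image is missing, positions in `image_list` (and the `image_{i}` keys saved later) no longer match the row positions in `all_products_df`. A possible way to keep them aligned is to filter the dataframe with the same existence check. This is a sketch on a temporary folder with invented filenames, not what the notebook currently does.

```python
import os
import tempfile

import pandas as pd

with tempfile.TemporaryDirectory() as toy_image_folder:
    # Create two empty placeholder files; 'b.png' is deliberately missing.
    for name in ["a.png", "c.png"]:
        open(os.path.join(toy_image_folder, name), "wb").close()

    toy_df = pd.DataFrame({"Filename": ["a.png", "b.png", "c.png"]})

    # Build the full path for every row, then check which ones exist.
    paths = toy_df["Filename"].map(lambda name: os.path.join(toy_image_folder, name))
    exists = paths.map(os.path.exists)

    # Report the missing files and keep only rows that have an image.
    missing = toy_df.loc[~exists, "Filename"].tolist()
    kept_df = toy_df[exists].reset_index(drop=True)
    toy_image_list = paths[exists].tolist()

print("Missing:", missing)
print(kept_df)
```

After this, row `i` of `kept_df` and entry `i` of `toy_image_list` refer to the same product.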

In [None]:
processed_images = [preprocess(img_path) for img_path in image_list]

# Saving the preprocessed images to an h5 file

In [None]:
save_to_hdf5(processed_images, os.path.join(data_folder,'h5','preprocessed_all_images.h5'))

---

### Next Notebook: [2.1 Processing Canva Palettes](2.1_processing_canva_palette.ipynb)