# Import Libraries

In [1]:
import numpy as np
import pandas as pd
import re

from sklearn.preprocessing import LabelEncoder

from tensorflow.keras.applications.vgg16 import VGG16, preprocess_input

from tensorflow.keras.preprocessing import image



# Checking GPU Information using NVIDIA System Management Interface (nvidia-smi)

This code snippet utilizes the NVIDIA System Management Interface (nvidia-smi) to provide information about the available GPU(s) in the Colab environment.
The command '!nvidia-smi' is executed to display details such as GPU model, memory usage, and temperature.
This information is valuable when working on tasks that benefit from GPU acceleration, such as deep learning.

Note: Colab provides access to GPUs, but the specific GPU model and capabilities may vary.


In [2]:
!nvidia-smi

/bin/bash: line 1: nvidia-smi: command not found
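
The recorded output above shows `nvidia-smi: command not found`, which usually means the runtime has no GPU attached (in Colab the runtime type can be changed in the notebook settings). As a minimal sketch, the same check can be made from Python so that a missing GPU does not produce a shell error. The helper name `gpu_summary` is hypothetical, and only the standard library is used:

```python
import shutil
import subprocess


def gpu_summary():
    """Return the text printed by nvidia-smi, or None if it cannot be run."""
    exe = shutil.which("nvidia-smi")
    if exe is None:
        return None  # the tool is not installed, so there is nothing to query
    try:
        result = subprocess.run([exe], capture_output=True, text=True, timeout=30)
    except (OSError, subprocess.TimeoutExpired):
        return None
    return result.stdout if result.returncode == 0 else None


summary = gpu_summary()
print(summary if summary is not None else "No NVIDIA GPU detected; running on CPU.")
```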


# Loading Data from Kaggle in Colab

This code snippet demonstrates the process of loading data from Kaggle into a Colab notebook environment.
It utilizes the 'files' module from the 'google.colab' library to upload the Kaggle API key (kaggle.json).
Additionally, it installs the 'kaggle' Python package, sets up the Kaggle API key, and downloads a specific dataset.
In this example, the dataset 'visuelle2' is downloaded and extracted using the Kaggle command line interface.

Instructions:
1. Upload your Kaggle API key (kaggle.json) using the file upload widget.
2. Install the 'kaggle' Python package.
3. Set up the Kaggle API key, move it to the appropriate directory, and adjust permissions.
4. Download and unzip the desired Kaggle dataset using the 'kaggle datasets download' command.

Note: Replace 'dqhdqmcttdqx/visuelle2' with the identifier of the Kaggle dataset you want to download, in the form 'owner/dataset-name' (the last part of the dataset's URL).


In [None]:
# Load data from Kaggle
from google.colab import files
files.upload()

%pip install -q kaggle
!rm -r ~/.kaggle
!mkdir ~/.kaggle
!mv ./kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json
#!kaggle datasets list
!kaggle datasets download dqhdqmcttdqx/visuelle2 --unzip

Saving kaggle.json to kaggle.json
rm: cannot remove '/root/.kaggle': No such file or directory
Downloading visuelle2.zip to /content
 99% 2.07G/2.09G [00:21<00:00, 87.7MB/s]
100% 2.09G/2.09G [00:22<00:00, 102MB/s] 
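
The recorded output includes `rm: cannot remove '/root/.kaggle'`, which is harmless but appears whenever the directory does not exist yet. As a hedged alternative, the credential setup (step 3 above) could be done in Python so that it behaves the same on a first and a repeated run. The function name `install_kaggle_credentials` and its `home` parameter are hypothetical, introduced only for this sketch:

```python
import shutil
import stat
from pathlib import Path


def install_kaggle_credentials(src, home=None):
    """Copy kaggle.json into <home>/.kaggle and restrict access to the owner."""
    home = Path(home) if home is not None else Path.home()
    target_dir = home / ".kaggle"
    target_dir.mkdir(parents=True, exist_ok=True)  # no error if it already exists
    target = target_dir / "kaggle.json"
    shutil.copyfile(src, target)
    target.chmod(stat.S_IRUSR | stat.S_IWUSR)  # equivalent to chmod 600
    return target


# Usage in the notebook, after files.upload() has saved kaggle.json:
# install_kaggle_credentials("kaggle.json")
```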


# Loading and Displaying Data from CSV File

This code snippet reads a CSV file named 'sales.csv' from the specified path ("/content/visuelle2/") using Pandas.
The loaded data is stored in a Pandas DataFrame named 'data'.
Subsequently, the 'head()' function is used to display the first few rows of the DataFrame.

Instructions:
1. Ensure that the CSV file 'sales.csv' is located in the correct directory ("/content/visuelle2/").
2. Run this code to load the data into the 'data' DataFrame and print the first few rows.


Note: Adjust the file path if the CSV file is located in a different directory.


In [None]:
data = pd.read_csv("/content/visuelle2/sales.csv")
print(data.head())

   Unnamed: 0  external_code  retail season     category   color  \
0           0              5      36   SS17  long sleeve    grey   
1           1              2      51   SS17  long sleeve  violet   
2           2              5      10   SS17  long sleeve    grey   
3           3              9      41   SS17     culottes  yellow   
4           4              5      13   SS17  long sleeve    grey   

       image_path       fabric release_date  restock  ...    2    3    4    5  \
0  PE17/00005.png      acrylic   2016-11-28       22  ...  1.0  1.0  2.0  1.0   
1  PE17/00002.png      acrylic   2016-11-28       17  ...  1.0  0.0  0.0  2.0   
2  PE17/00005.png      acrylic   2016-11-28       15  ...  1.0  0.0  1.0  1.0   
3  PE17/00009.png  scuba crepe   2016-11-28       32  ...  1.0  1.0  0.0  0.0   
4  PE17/00005.png      acrylic   2016-11-28       26  ...  4.0  0.0  3.0  0.0   

     6    7    8    9   10   11  
0  0.0  0.0  2.0  0.0  0.0  0.0  
1  0.0  0.0  0.0  1.0  1.0  0.0  
2 
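
The `Unnamed: 0` column in the output above is typically a row index that was written out when the CSV was created. The following is a minimal sketch, assuming that is the case for sales.csv, of reading that column back as the index and parsing `release_date` as a date. The inline sample is a two-row stand-in built from the rows shown above, not the real file:

```python
import io

import pandas as pd

sample_csv = io.StringIO(
    ",external_code,retail,season,category,color,release_date\n"
    "0,5,36,SS17,long sleeve,grey,2016-11-28\n"
    "1,2,51,SS17,long sleeve,violet,2016-11-28\n"
)

# index_col=0 turns the unnamed first column into the index;
# parse_dates converts the release date from text to a datetime column.
sample = pd.read_csv(sample_csv, index_col=0, parse_dates=["release_date"])
print(sample.dtypes)
```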

In [None]:
'''
data = pd.read_csv("/content/visuelle2/stfore_train.csv")
print(data.head())
print(data.shape)
'''

'\ndata = pd.read_csv("/content/visuelle2/stfore_train.csv")\nprint(data.head())\nprint(data.shape)\n'

# Feature encoding

This code snippet focuses on encoding categorical variables and images within a Pandas DataFrame. The categorical variables, namely "category," "color," and "fabric," are encoded using the LabelEncoder from scikit-learn. The resulting encoded values are stored in new columns named "category_encoded," "color_encoded," and "fabric_encoded."

Furthermore, images specified in the "image_path" column are processed using a pre-trained VGG16 model to extract features. The encoding process is designed for efficiency by storing the encoded image features in a dictionary ("image_features_dict") for future retrieval. The 'encode_image' function handles the loading, preprocessing, and feature extraction for each image path. The resulting image features are stored in the "image_features" column of the DataFrame.

To use this code:
1. Ensure that your DataFrame, named 'data,' contains the necessary columns: "category," "color," "fabric," and "image_path."
2. Execute the code to perform the encoding of categorical variables and images.
3. The encoded values for categorical variables and image features are stored in respective columns in the DataFrame.

Note: Adjust the paths and column names accordingly if your DataFrame structure is different.

In [None]:

# Encode categorical variables using LabelEncoder
le = LabelEncoder()
data["category_encoded"] = le.fit_transform(data["category"])
data["color_encoded"] = le.fit_transform(data["color"])
data["fabric_encoded"] = le.fit_transform(data["fabric"])

# Build the pre-trained VGG16 feature extractor once, not once per image
model = VGG16(weights="imagenet", include_top=False)
#model = MobileNet(weights="imagenet", include_top=False)

# Cache of extracted features, keyed by image path, so each image is encoded once
image_features_dict = {}

def encode_image(image_path):
    if image_path not in image_features_dict:
        # Load and preprocess the image (only if not already encoded)
        path = '/content/visuelle2/images/' + image_path
        img = image.load_img(path, target_size=(224, 224))
        x = image.img_to_array(img)
        x = np.expand_dims(x, axis=0)
        x = preprocess_input(x)

        # Extract features using VGG16 and flatten them to a 1D vector
        features = model.predict(x, verbose=0).flatten()

        # Store features for future retrieval
        image_features_dict[image_path] = features
        if len(image_features_dict) % 1000 == 0:
            print("step: ", len(image_features_dict))

    return image_features_dict[image_path]  # Retrieve stored features

# Apply the cached encoder to every row
data["image_features"] = data["image_path"].apply(encode_image)


KeyboardInterrupt: 
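
Because many rows share the same image (in the first five rows shown earlier, `PE17/00005.png` appears three times), the expensive extraction step only needs to run once per distinct path. The sketch below illustrates that idea on a toy frame. `fake_extract` is a hypothetical stand-in for the VGG16 forward pass so that the example runs without TensorFlow:

```python
import numpy as np
import pandas as pd

calls = []  # records which paths were actually "encoded"


def fake_extract(image_path):
    """Hypothetical stand-in for the VGG16 feature extractor."""
    calls.append(image_path)
    return np.full(4, float(image_path[5:10]))  # e.g. "00005" -> 5.0


frame = pd.DataFrame({"image_path": ["PE17/00005.png", "PE17/00002.png",
                                     "PE17/00005.png", "PE17/00009.png",
                                     "PE17/00005.png"]})

# Encode each distinct path once, then look the result up for every row.
features_by_path = {p: fake_extract(p) for p in frame["image_path"].unique()}
frame["image_features"] = [features_by_path[p] for p in frame["image_path"]]

print(len(calls))  # 3: one call per distinct path, not one per row
```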

# Print dataframe head

In [None]:
data.head()

Unnamed: 0,external_code,retail,season,category,color,image_path,fabric,release_date,restock,0,...,5,6,7,8,9,10,11,category_encoded,color_encoded,fabric_encoded
0,5,36,SS17,long sleeve,grey,PE17/00005.png,acrylic,2016-11-28,0.415094,0.018868,...,0.018868,0.0,0.0,0.037736,0.0,0.0,0.0,11,4,0
1,2,51,SS17,long sleeve,violet,PE17/00002.png,acrylic,2016-11-28,0.320755,0.018868,...,0.037736,0.0,0.0,0.0,0.018868,0.018868,0.0,11,7,0
2,5,10,SS17,long sleeve,grey,PE17/00005.png,acrylic,2016-11-28,0.283019,0.018868,...,0.018868,0.018868,0.018868,0.018868,0.0,0.0,0.018868,11,4,0
3,9,41,SS17,culottes,yellow,PE17/00009.png,scuba crepe,2016-11-28,0.603774,0.018868,...,0.0,0.0,0.018868,0.0,0.018868,0.0,0.0,1,9,49
4,5,13,SS17,long sleeve,grey,PE17/00005.png,acrylic,2016-11-28,0.490566,0.018868,...,0.0,0.037736,0.018868,0.0,0.0,0.0,0.0,11,4,0


# Save dataframe to file


In [None]:
data.to_csv('data.csv', index=False)
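
One caveat applies when the DataFrame holds an array-valued column such as `image_features`: `to_csv` writes each array as its printed string, and with default print settings NumPy abbreviates arrays of more than 1000 elements with `...`, so most values never reach the file. The sketch below demonstrates the effect and one possible alternative, saving the stacked arrays with `np.save`. The file name `image_features.npy` and the 2000-element array are illustrative assumptions:

```python
import os
import tempfile

import numpy as np

features = np.arange(2000, dtype=np.float32)  # stand-in for one long feature vector

as_text = str(features)  # this is what ends up in a CSV cell
print("..." in as_text)  # the printed form is abbreviated

# Count the values that survive (3 leading + 3 trailing with default settings).
surviving = as_text.replace("...", " ").strip("[]").split()
print(len(surviving))

# Alternative: store the full arrays in NumPy's binary format.
stacked = np.stack([features, features * 2])
path = os.path.join(tempfile.mkdtemp(), "image_features.npy")
np.save(path, stacked)
restored = np.load(path)
print(restored.shape)
```

Row `i` of the saved array then corresponds to row `i` of the DataFrame, provided the row order is left unchanged between saving and loading.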

# Function to handle string lists

The `cleanString` function converts the printed form of a numeric array, as stored in a CSV cell (for example '[0. 1.5 ... 2. ]'), back into a NumPy array. It performs the following steps:

1. **Remove Ellipsis and Extra Spaces:**
   - A regular expression (`re.sub`) removes every occurrence of the ellipsis ('...') from the input string.
   - Runs of whitespace are then collapsed to a single space by splitting the string and joining it again.

2. **Add a Zero After the Decimal Point:**
   - Any number that ends in a bare decimal point followed by a space gets a trailing zero. For example, '2. ' becomes '2.0 '.

3. **Remove Brackets and Split by Space:**
   - The function removes the leading and trailing brackets from the cleaned string.
   - It then splits the string into individual values based on spaces.

4. **Convert to NumPy Array:**
   - The resulting split values are converted into a NumPy array of floating-point numbers using `np.array`.

5. **Return Resulting Array:**
   - The final cleaned and processed array is returned by the function.

To use this function, pass it a string containing a bracketed list of numbers. Values that were replaced by '...' when the string was written cannot be recovered, so the returned array contains only the numbers that are actually present in the string.

In [None]:
def cleanString(input_str):
    # Remove the '...' truncation marker and collapse runs of whitespace
    cleaned_str = re.sub(r'\.{3}', '', input_str)
    cleaned_str = ' '.join(cleaned_str.split())

    # Pad a bare trailing decimal point with a zero, e.g. '2. ' -> '2.0 '
    cleaned_str = cleaned_str.replace('. ', '.0 ')

    # Drop the surrounding brackets and split on whitespace
    split_values = cleaned_str[1:-1].split()

    # Convert the tokens to a NumPy array of floats
    return np.array(split_values, dtype=float)
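
For comparison, a more compact variant can pull every number out of the string with a single regular expression instead of patching the text step by step. This is a sketch of an alternative, not the function used in this notebook, and `parse_array_string` is a hypothetical name. Like `cleanString`, it cannot recover values hidden behind `...`:

```python
import re

import numpy as np

# Matches integers, decimals such as "2." or ".5", and scientific notation.
NUMBER = re.compile(r"[-+]?(?:\d+\.?\d*|\.\d+)(?:[eE][-+]?\d+)?")


def parse_array_string(text):
    """Return all numbers found in a printed array as a float array."""
    return np.array(NUMBER.findall(text), dtype=float)


example = "[0.   1.5   2.3e-01 ...  4.  ]"
print(parse_array_string(example))
```

Because the pattern only matches tokens that start with a digit, a sign, or a dot followed by a digit, the brackets and the ellipsis are skipped without a separate cleaning step.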



# Extracting image features

This code loads "data.csv" into a Pandas DataFrame (`df`). It then extracts the "image_features" column as a list of strings and converts each string to a NumPy array with the `cleanString` function. The parsed arrays are collected in a list named `listOfArrays` and written back to the "image_features" column.

The DataFrame is then reduced to the columns used in the following steps: "external_code," "retail," "category_encoded," "fabric_encoded," "color_encoded," the twelve numbered columns '0' through '11' carried over from sales.csv, and "image_features."

Finally, the column names of the resulting DataFrame are printed.

Note: Ensure the CSV file "data.csv" is in the correct directory, and adjust the column names as needed for your specific dataset.

In [None]:
# Load the CSV file
df = pd.read_csv("/content/data.csv")

# Extract the image features column as a list of strings
image_features_list = df["image_features"].tolist()

# Parse each string into a NumPy array (cleanString also normalizes whitespace)
listOfArrays = []
for feature_str in image_features_list:
    listOfArrays.append(cleanString(feature_str))

df['image_features'] = listOfArrays

# Keep only the columns used in the following steps
df = df[["external_code", "retail", "category_encoded",
         "fabric_encoded", "color_encoded", '0', '1', '2', '3', '4',
         '5', '6', '7', '8', '9', '10', '11', 'image_features']]

# Print the column names of the resulting DataFrame
print(df.columns)


# Extracting Google Trends features

This code combines the sales data with Google Trends data for each product. For every row of the sales DataFrame it takes the product's category, color, fabric, and release date, defines a date range starting at the release date, and retrieves the matching Google Trends series. The results are collected in a list named "gtrends," which holds one 2D array per product containing its category, color, and fabric trends over that date range. The steps are:

1. **Read DataFrames from CSV Files:**
   - It reads two CSV files into Pandas DataFrames: "sales.csv" and "vis2_gtrends_data.csv" located in the "/content/visuelle2/" directory.

2. **Iterate Over Rows of Sales DataFrame:**
   - It iterates over each row of the "sales_df" DataFrame representing sales data.
   - For each row, it extracts product information such as category ("cat"), color ("col"), fabric ("fab"), and release date ("start_date").

3. **Date Range Calculation:**
   - It converts the release date to a datetime object and calculates the end date as the start date plus 52 weeks.

4. **Date Formatting:**
   - It formats the start and end dates as strings in the "YYYY-MM-DD" format.

5. **Filter and Extract Google Trends Data:**
   - It filters the "gtrends_df" DataFrame to the rows whose date falls within the calculated range (both end points included) and to the three columns named after the product's category, color, and fabric.
   - It extracts the values of those columns as NumPy arrays.

6. **Stack Arrays and Append to List:**
   - It stacks the extracted arrays vertically into a 2D array named "multitrends."
   - It appends the 2D array to the "gtrends" list.



In [None]:
import pandas as pd

# Read the sales data frame from a csv file
sales_df = pd.read_csv("/content/visuelle2/sales.csv")

# Read the gtrends data frame from a csv file
gtrends_df = pd.read_csv("/content/visuelle2/vis2_gtrends_data.csv")

# Create an empty list to store the trends data
gtrends = []

# Iterate over each row of the sales data frame
for index, row in sales_df.iterrows():
    # Get the product information from the row
    cat, col, fab, start_date = row["category"], row["color"], row["fabric"], row["release_date"]

    # Convert the start date to a datetime object
    start_date = pd.to_datetime(start_date)

    # Calculate the end date as the start date plus 52 weeks
    end_date = start_date + pd.DateOffset(weeks=52)

    # Format the start and end dates as strings in YYYY-MM-DD format
    start_date_str = start_date.strftime("%Y-%m-%d")
    end_date_str = end_date.strftime("%Y-%m-%d")

    # Get the gtrends data for the corresponding category, color, and fabric
    # Use the loc function to filter the gtrends data frame by the date range and the columns
    gtrends_data = gtrends_df.loc[(gtrends_df["date"] >= start_date_str) & (gtrends_df["date"] <= end_date_str), [cat, col, fab]]

    # Get the values of the three columns as numpy arrays
    cat_gtrend = gtrends_data[cat].values
    col_gtrend = gtrends_data[col].values
    fab_gtrend = gtrends_data[fab].values

    # Stack the three arrays into a 2D array
    multitrends = np.vstack([cat_gtrend, col_gtrend, fab_gtrend])

    # Append the 2D array to the gtrends list
    gtrends.append(multitrends)
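
To make the shape of each entry in `gtrends` concrete, here is a minimal sketch of the windowing step for a single product. It uses a synthetic weekly trends table rather than vis2_gtrends_data.csv, and it compares parsed datetimes instead of date strings. The values and the 60-week length are assumptions chosen for illustration:

```python
import numpy as np
import pandas as pd

# Synthetic weekly trends table; column names mirror one product's attributes.
dates = pd.date_range("2016-11-28", periods=60, freq="7D")
toy_trends = pd.DataFrame({
    "date": dates,
    "long sleeve": np.arange(60),
    "grey": np.arange(60) * 2,
    "acrylic": np.arange(60) * 3,
})

start = pd.to_datetime("2016-11-28")
end = start + pd.Timedelta(weeks=52)

# between() is inclusive on both ends, like the >= and <= filter above.
in_window = toy_trends["date"].between(start, end)
window = toy_trends.loc[in_window, ["long sleeve", "grey", "acrylic"]]

multitrends = window.to_numpy().T  # rows: category, color, fabric
print(multitrends.shape)
```

Since both end points are included, weekly data aligned with the release date yields 53 points per series for a 52-week offset, so each array in this sketch has shape (3, 53).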


# Adding to dataframe

This code snippet adds new columns to the DataFrame (`df`) based on the Google Trends data stored in the "gtrends" list:

1. **"cat_gtrend" Column:**
   - It uses the apply function to extract the category Google Trends data for each row from the "gtrends" list.
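   - The lambda function takes the row index (x.name) to access the corresponding 2D array and extracts the first array ([0]) representing category trends. This relies on `df` having the default 0-based index and the same row order as the sales DataFrame used to build "gtrends."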

2. **"col_gtrend" Column:**
   - Similarly, it uses the apply function to extract the color Google Trends data for each row from the "gtrends" list.
   - The lambda function accesses the corresponding 2D array and extracts the second array ([1]) representing color trends.

3. **"fab_gtrend" Column:**
   - It uses the apply function to extract the fabric Google Trends data for each row from the "gtrends" list.
   - The lambda function accesses the corresponding 2D array and extracts the third array ([2]) representing fabric trends.

In [None]:
df["cat_gtrend"] = df.apply(lambda x: gtrends[x.name][0], axis=1)
df["col_gtrend"] = df.apply(lambda x: gtrends[x.name][1], axis=1)
df["fab_gtrend"] = df.apply(lambda x: gtrends[x.name][2], axis=1)
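
The same assignment can also be written without a row-wise `apply`, by slicing the list directly. The sketch below uses a toy frame and toy arrays (`toy_df` and `toy_gtrends` are hypothetical names) and assumes, as above, that entry `i` of the list belongs to row `i` of the frame:

```python
import numpy as np
import pandas as pd

toy_df = pd.DataFrame({"external_code": [5, 2]})
toy_gtrends = [np.arange(6).reshape(3, 2), np.arange(6, 12).reshape(3, 2)]

# Row i of the frame receives row k of the i-th 2D array.
for k, name in enumerate(["cat_gtrend", "col_gtrend", "fab_gtrend"]):
    toy_df[name] = [trend[k] for trend in toy_gtrends]

print(toy_df.loc[1, "col_gtrend"])
```

A list comprehension avoids building a Series for every row, which can matter for a frame with about a hundred thousand rows like the one printed below.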

# Save final dataframe

In [None]:
#save to csv
df.to_csv('dataFinal.csv')
print(df)

        external_code  retail  category_encoded  fabric_encoded  \
0                   5      36                11               0   
1                   2      51                11               0   
2                   5      10                11               0   
3                   9      41                 1              49   
4                   5      13                11               0   
...               ...     ...               ...             ...   
106845           5504      51                14              18   
106846           5558      10                14              18   
106847           4988     108                14               7   
106848           4280     105                 1              28   
106849           4791      28                24              41   

        color_encoded    0    1    2    3    4  ...    6    7    8    9   10  \
0                   4  1.0  3.0  1.0  1.0  2.0  ...  0.0  0.0  2.0  0.0  0.0   
1                   7  1.0  1.0  1.