## **Sprint 3: H&M Personalized Fashion Recommendations**

### Part 3: BERT-Based Recommendations on Input

___

Atoosa Rashid

[GitHub](https://github.com/atoosa-r/)

[LinkedIn](https://www.linkedin.com/in/atoosarashid/)
____

### **Introduction**

In this analysis, we explore the H&M Group datasets, which cover transactions, customer information, and article details. H&M Group operates globally, with 53 online markets and approximately 4,850 stores. The objective is to uncover insights for developing effective product recommendations.

This notebook primarily supports an interactive UI that lets participants test and explore the recommendation system during our demo day presentations.

**Step-by-Step Plan**

1. Data Preparation

2. Word Embedding with BERT

3. Recommendation Function for Inputs

___

We'll begin by installing `sentence_transformers` and importing the libraries we need for word embedding.

In [3]:
!pip install sentence_transformers




In [4]:
#Importing libraries:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import os
import time
import re
import string
import logging

from collections import Counter
from sklearn.metrics.pairwise import cosine_similarity
from scipy.spatial.distance import cosine as cosine_distance
from scipy.sparse import csr_matrix

import tensorflow as tf
import transformers
from transformers import DistilBertModel, DistilBertTokenizer

from sentence_transformers import SentenceTransformer


In [5]:
#Loading Transformer Model:

bert = SentenceTransformer('paraphrase-MiniLM-L6-v2')


In [6]:
#Checking on bert:

bert

SentenceTransformer(
  (0): Transformer({'max_seq_length': 128, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
)
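
The printed configuration shows `pooling_mode_mean_tokens: True` and a `word_embedding_dimension` of 384, which suggests that each sentence embedding is the average of its token embeddings. As a rough illustration only, the sketch below applies masked mean pooling to made-up 2-dimensional token vectors; the helper name `mean_pool` is hypothetical and is not part of `sentence_transformers`.

```python
import numpy as np

def mean_pool(token_embeddings, attention_mask):
    """Average token vectors, ignoring padding positions (mask == 0)."""
    mask = attention_mask[:, None].astype(float)    # shape (tokens, 1)
    summed = (token_embeddings * mask).sum(axis=0)  # sum over real tokens only
    count = max(mask.sum(), 1.0)                    # guard against an all-zero mask
    return summed / count

# Toy input: two real tokens and one padding token (made-up values).
tokens = np.array([[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]])
mask = np.array([1, 1, 0])

sentence_vector = mean_pool(tokens, mask)
print(sentence_vector)  # [2. 3.]
```

The padding row is excluded by the mask, so only the first two token vectors contribute to the average.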

In [10]:
#Importing the df:

filtered_articles_df = pd.read_csv("filtered_articles.csv")

In [11]:
#Sanity check:

filtered_articles_df.head()

| Unnamed: 0 | article_id | prod_name | product_type_name | product_group_name | colour_group_name | department_name | index_group_name | section_name | garment_group_name | detail_desc | preprocessed_detail_desc | updated_description |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 108775015 | Strap top | Vest top | Garment Upper body | Black | Jersey Basic | Ladieswear | Womens Everyday Basics | Jersey Basic | Jersey top with narrow shoulder straps. | jersey top narrow shoulder straps | Jersey top with narrow shoulder straps. Color:... |
| 1 | 108775044 | Strap top | Vest top | Garment Upper body | White | Jersey Basic | Ladieswear | Womens Everyday Basics | Jersey Basic | Jersey top with narrow shoulder straps. | jersey top narrow shoulder straps | Jersey top with narrow shoulder straps. Color:... |
| 2 | 108775051 | Strap top (1) | Vest top | Garment Upper body | Off White | Jersey Basic | Ladieswear | Womens Everyday Basics | Jersey Basic | Jersey top with narrow shoulder straps. | jersey top narrow shoulder straps | Jersey top with narrow shoulder straps. Color:... |
| 3 | 110065001 | OP T-shirt (Idro) | Bra | Underwear | Black | Clean Lingerie | Ladieswear | Womens Lingerie | Under-, Nightwear | Microfibre T-shirt bra with underwired, moulde... | microfibre shirt bra underwired moulded lightl... | Microfibre T-shirt bra with underwired, moulde... |
| 4 | 110065002 | OP T-shirt (Idro) | Bra | Underwear | White | Clean Lingerie | Ladieswear | Womens Lingerie | Under-, Nightwear | Microfibre T-shirt bra with underwired, moulde... | microfibre shirt bra underwired moulded lightl... | Microfibre T-shirt bra with underwired, moulde... |


First, we will convert the article descriptions and their corresponding article IDs into lists. These lists will be used to get the embeddings and map the recommendations back to the original articles.

Next, we use our imported BERT model to generate embeddings for the article descriptions.

We then define a function, `recommend_items`, that takes an input sentence and returns the `top_n` most similar items based on their embeddings. The function encodes the input sentence, computes cosine similarities between that embedding and every article embedding, selects the articles with the highest scores, and returns a DataFrame of the recommended article IDs and their similarity scores.
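
The similarity-and-ranking step can be checked in isolation on toy vectors, without loading the model. The sketch below is illustrative only: the 2-dimensional "embeddings" and the IDs are made up, and `rank_by_cosine` is a hypothetical helper rather than part of this notebook.

```python
import numpy as np

def rank_by_cosine(query, matrix, top_n=2):
    """Return (indices, scores) of the top_n rows of `matrix` most similar to `query`."""
    query = np.asarray(query, dtype=float)
    matrix = np.asarray(matrix, dtype=float)
    # Cosine similarity = dot product divided by the product of the vector norms.
    scores = matrix @ query / (np.linalg.norm(matrix, axis=1) * np.linalg.norm(query))
    # argsort is ascending, so reverse it and keep the first top_n positions.
    top = np.argsort(scores)[::-1][:top_n]
    return top, scores[top]

toy_ids = [101, 102, 103]  # hypothetical article IDs
toy_embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

indices, scores = rank_by_cosine([1.0, 0.0], toy_embeddings)
print([toy_ids[i] for i in indices], scores.round(3))
```

The query points in the same direction as the first vector, so that item ranks first with a score of 1, followed by the diagonal vector at about 0.707; the orthogonal vector scores 0 and is cut off by `top_n=2`.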

In [12]:
filtered_articles = filtered_articles_df['updated_description'].tolist()
article_ids = filtered_articles_df['article_id'].tolist()

#Getting embeddings using our bert model:

article_embeddings = bert.encode(filtered_articles)

def recommend_items(input_sentence, top_n=5):
    """
    Recommend the top_n most similar items based on an input sentence.

    Parameters:
    input_sentence (str): The input sentence for which recommendations are to be made.
    top_n (int, optional): The number of top similar items to recommend. Default is 5.

    Returns:
    pandas.DataFrame: A DataFrame containing the recommended article IDs and their similarity scores.
    """
    #Encoding the inputted sentence using the same bert model:
    input_embedding = bert.encode([input_sentence])[0]

    #Calculating cosine similarities between the input embedding and all article embeddings:
    similarities = cosine_similarity([input_embedding], article_embeddings)[0]

    #Getting the top similar items:
    top_indices = np.argsort(similarities)[::-1][:top_n]

    #Getting the top article IDs:
    top_article_ids = [article_ids[i] for i in top_indices]
    top_similarities = similarities[top_indices]

    #Creating a df for the recommendations:
    recommendations = pd.DataFrame({
        'article_id': top_article_ids,
        'similarity_score': top_similarities
    })

    return recommendations

With our function, we can now input any sentence and receive the top recommendations along with their respective `similarity_score` values.

In [13]:
#Sanity check 1:

input_sentence = "A casual summer dress"
top_recommendations = recommend_items(input_sentence, top_n=5)

print(top_recommendations)

   article_id  similarity_score
0   458239017          0.765280
1   502522006          0.728661
2   458239022          0.723263
3   458239001          0.723263
4   567484002          0.704251


In [16]:
#Sanity check 2:

input_sentence = "A simple black dress"
top_recommendations = recommend_items(input_sentence, top_n=5)

print(top_recommendations)

   article_id  similarity_score
0   629024002          0.764172
1   629024001          0.764172
2   577513003          0.752349
3   774506001          0.748125
4   879242010          0.744952


In [14]:
#Sanity check 3:

input_sentence = "Trousers for work."
top_recommendations = recommend_items(input_sentence, top_n=5)

print(top_recommendations)

   article_id  similarity_score
0   807362001          0.782147
1   807362002          0.779681
2   807362003          0.779681
3   521846002          0.719355
4   937667002          0.708640


As the three examples above show, our generalized recommender returns ranked matches for free-text input sentences, with top-5 similarity scores between roughly 0.70 and 0.78. To enhance the user experience, the associated images of these articles will also be displayed in the Streamlit interface.
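
Before displaying results, the recommended IDs can be joined back to the article metadata so that each row carries a readable name. This is a minimal sketch, assuming the `article_id`, `prod_name`, and `colour_group_name` columns seen in the `head()` output; the two small frames are stand-ins for `filtered_articles_df` and for the output of `recommend_items`, and the scores are placeholders, not real results.

```python
import pandas as pd

# Stand-in for filtered_articles_df, using three rows from the head() output.
articles = pd.DataFrame({
    "article_id": [108775015, 108775044, 110065001],
    "prod_name": ["Strap top", "Strap top", "OP T-shirt (Idro)"],
    "colour_group_name": ["Black", "White", "Black"],
})

# Stand-in for the output of recommend_items; the scores are placeholders.
recommendations = pd.DataFrame({
    "article_id": [110065001, 108775015],
    "similarity_score": [0.9, 0.8],
})

# A left merge keeps the recommendation order and adds the descriptive columns.
detailed = recommendations.merge(articles, on="article_id", how="left")
print(detailed)
```

A left merge is used here so that every recommendation is kept, in ranked order, even if an ID were missing from the metadata.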