## **Sprint 3: H&M Personalized Fashion Recommendations**

### Part 1: Incorporating BERT Word Embeddings

___
**Atoosa Rashid** 

[GitHub](https://github.com/atoosa-r/)

[LinkedIn](https://www.linkedin.com/in/atoosarashid/) 
___

### **Introduction**

In this data analysis, we explore H&M Group datasets, including transactions, customer information, and article details. H&M Group operates globally with 53 online markets and approximately 4850 stores. The objective is to uncover insights for developing effective product recommendations.

In this notebook, we will develop a recommender system based on word embeddings. We will utilize BERT from the Sentence Transformers library to generate embeddings for product descriptions. Additionally, we will enhance these descriptions by incorporating the `colour_group_name` column, providing more context and detail to improve the recommendation accuracy.


**Step-by-Step Plan**
1. Data Preparation:

- Merge the `colour_group_name` column with the `detail_desc` column to create more refined product descriptions.
- Ensure data consistency and handle any missing values that may already be present.

2. Word Embedding with BERT:

- Use a Sentence Transformer model to generate embeddings for the updated product descriptions.
- Prepare the embeddings for use in the recommendation system.

3. Building the Recommendation System:

- Develop a recommendation system based on the generated embeddings.
- Evaluate and fine-tune the system for optimal performance.

___

We'll begin by importing our necessary libraries and sentence transformers for word embedding.

In [5]:
!pip install sentence_transformers




In [6]:
#Importing libraries:
import numpy as np
import pandas as pd
import time
import re
import string
import logging

from collections import Counter
from sklearn.metrics.pairwise import cosine_similarity
from scipy.spatial.distance import cosine as cosine_distance
from scipy.sparse import csr_matrix

import tensorflow as tf
import transformers
from transformers import DistilBertModel, DistilBertTokenizer
from sentence_transformers import SentenceTransformer

In [7]:
#Loading Transformer Model:

bert = SentenceTransformer('paraphrase-MiniLM-L6-v2')



For this recommender system, we only need the `cleaned_articles_df` that we used for Matrix Factorization in our previous notebooks.

In [8]:
#Importing the Dataframe:

articles_df = pd.read_csv("cleaned_articles_df.csv")

In [9]:
#Sanity check:

articles_df.head(3)

Unnamed: 0,article_id,prod_name,product_type_name,product_group_name,colour_group_name,department_name,index_group_name,section_name,garment_group_name,detail_desc,preprocessed_detail_desc
0,108775015,Strap top,Vest top,Garment Upper body,Black,Jersey Basic,Ladieswear,Womens Everyday Basics,Jersey Basic,Jersey top with narrow shoulder straps.,jersey top narrow shoulder straps
1,108775044,Strap top,Vest top,Garment Upper body,White,Jersey Basic,Ladieswear,Womens Everyday Basics,Jersey Basic,Jersey top with narrow shoulder straps.,jersey top narrow shoulder straps
2,108775051,Strap top (1),Vest top,Garment Upper body,Off White,Jersey Basic,Ladieswear,Womens Everyday Basics,Jersey Basic,Jersey top with narrow shoulder straps.,jersey top narrow shoulder straps


In [10]:
#Initial checks:

articles_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 105126 entries, 0 to 105125
Data columns (total 11 columns):
 #   Column                    Non-Null Count   Dtype 
---  ------                    --------------   ----- 
 0   article_id                105126 non-null  int64 
 1   prod_name                 105126 non-null  object
 2   product_type_name         105126 non-null  object
 3   product_group_name        105126 non-null  object
 4   colour_group_name         105126 non-null  object
 5   department_name           105126 non-null  object
 6   index_group_name          105126 non-null  object
 7   section_name              105126 non-null  object
 8   garment_group_name        105126 non-null  object
 9   detail_desc               105126 non-null  object
 10  preprocessed_detail_desc  105126 non-null  object
dtypes: int64(1), object(10)
memory usage: 8.8+ MB


In [11]:
#Initial checks:

articles_df['index_group_name'].value_counts()


index_group_name
Ladieswear       39523
Baby/Children    34619
Divided          15086
Menswear         12539
Sport             3359
Name: count, dtype: int64

Due to computational constraints, we will limit the scope of our recommendations to only three categories within the `index_group_name` column: *Ladieswear*, *Menswear*, and *Sport*. However, this can be expanded in the future.

In [12]:
#Creating the new df:

filtered_articles_df = articles_df[articles_df['index_group_name'].isin(['Ladieswear', 'Menswear', 'Sport'])]

In [13]:
#Sanity check:

filtered_articles_df.head(3)

Unnamed: 0,article_id,prod_name,product_type_name,product_group_name,colour_group_name,department_name,index_group_name,section_name,garment_group_name,detail_desc,preprocessed_detail_desc
0,108775015,Strap top,Vest top,Garment Upper body,Black,Jersey Basic,Ladieswear,Womens Everyday Basics,Jersey Basic,Jersey top with narrow shoulder straps.,jersey top narrow shoulder straps
1,108775044,Strap top,Vest top,Garment Upper body,White,Jersey Basic,Ladieswear,Womens Everyday Basics,Jersey Basic,Jersey top with narrow shoulder straps.,jersey top narrow shoulder straps
2,108775051,Strap top (1),Vest top,Garment Upper body,Off White,Jersey Basic,Ladieswear,Womens Everyday Basics,Jersey Basic,Jersey top with narrow shoulder straps.,jersey top narrow shoulder straps


In [14]:
#More sanity checks:

filtered_articles_df['index_group_name'].value_counts()

index_group_name
Ladieswear    39523
Menswear      12539
Sport          3359
Name: count, dtype: int64

In [15]:
# Checking for potential null values

filtered_articles_df['colour_group_name'].isna().sum()

0

We can now add the `colour_group_name` column to our `detail_desc` column. The current descriptions don't mention the color of the clothing items, so including this detail will ensure more comprehensive information is captured in our word embedding.

In [17]:
#Defining the function to combine description with colour group name:

def add_color_to_description(row):
    return f"{row['detail_desc']} Color: {row['colour_group_name']}."

#Taking an explicit copy so the new column is not assigned to a slice of articles_df
#(this avoids the pandas "value is trying to be set on a copy of a slice" warning):

filtered_articles_df = filtered_articles_df.copy()

#Applying the function and creating a new column with the updated descriptions:

filtered_articles_df['updated_description'] = filtered_articles_df.apply(add_color_to_description, axis=1)


In [18]:
#Sanity check:

pd.set_option('display.max_colwidth', None)

filtered_articles_df[[ 'detail_desc', 'colour_group_name', 'updated_description']].head()


Unnamed: 0,detail_desc,colour_group_name,updated_description
0,Jersey top with narrow shoulder straps.,Black,Jersey top with narrow shoulder straps. Color: Black.
1,Jersey top with narrow shoulder straps.,White,Jersey top with narrow shoulder straps. Color: White.
2,Jersey top with narrow shoulder straps.,Off White,Jersey top with narrow shoulder straps. Color: Off White.
3,"Microfibre T-shirt bra with underwired, moulded, lightly padded cups that shape the bust and provide good support. Narrow adjustable shoulder straps and a narrow hook-and-eye fastening at the back. Without visible seams for greater comfort.",Black,"Microfibre T-shirt bra with underwired, moulded, lightly padded cups that shape the bust and provide good support. Narrow adjustable shoulder straps and a narrow hook-and-eye fastening at the back. Without visible seams for greater comfort. Color: Black."
4,"Microfibre T-shirt bra with underwired, moulded, lightly padded cups that shape the bust and provide good support. Narrow adjustable shoulder straps and a narrow hook-and-eye fastening at the back. Without visible seams for greater comfort.",White,"Microfibre T-shirt bra with underwired, moulded, lightly padded cups that shape the bust and provide good support. Narrow adjustable shoulder straps and a narrow hook-and-eye fastening at the back. Without visible seams for greater comfort. Color: White."


In [19]:
#Checking on bert:

bert

SentenceTransformer(
  (0): Transformer({'max_seq_length': 128, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
)

In [20]:
#Checking time necessary to run embedding through bert:

start = time.time()

sentence_embeddings = bert.encode(filtered_articles_df['updated_description'].tolist())

end = time.time()

elapsed_time_seconds = end - start
elapsed_time_hours = elapsed_time_seconds / 3600

print(f"Elapsed time: {elapsed_time_hours} hours")

Elapsed time: 0.08850748936335245 hours


The `updated_description` column is encoded with the Sentence Transformer model. This produces one embedding vector per description, which we then use in our recommendation system.

BERT (Bidirectional Encoder Representations from Transformers) is a model designed to understand word context by considering the surrounding words on both sides. In this setup, the model summary above shows a maximum sequence length of 128 tokens and an embedding dimension of 384, so descriptions longer than 128 tokens are truncated and each description is represented by a 384-dimensional vector.
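The model summary printed earlier lists `pooling_mode_mean_tokens: True`, meaning the per-token vectors are averaged into a single sentence vector. The following is a minimal NumPy sketch of that idea on toy data; the function name `mean_pool` and the toy arrays are illustrative assumptions, not the library's own code.

```python
import numpy as np

def mean_pool(token_embeddings, attention_mask):
    """Average token vectors per sentence, ignoring padded positions (mask == 0)."""
    # Expand the mask to (batch, tokens, 1) so it broadcasts over the vector dimension.
    mask = attention_mask[:, :, None].astype(token_embeddings.dtype)
    summed = (token_embeddings * mask).sum(axis=1)
    # Guard against division by zero for a sentence with no real tokens.
    counts = np.clip(mask.sum(axis=1), 1e-9, None)
    return summed / counts

# Toy batch: one sentence, three token slots (the last is padding), 2-dimensional vectors.
tokens = np.array([[[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]])
mask = np.array([[1, 1, 0]])

pooled = mean_pool(tokens, mask)
print(pooled)  # [[2. 3.]]
```

The padded slot is excluded from the average, so only the two real tokens contribute to the sentence vector.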

In [21]:
#Doing sentence embedding on the updated_description (this recomputes the embeddings from the timed cell above):

sentence_embeddings = bert.encode(filtered_articles_df['updated_description'].tolist())

With the individual sentence embeddings, we can calculate the cosine similarity between items. Specifically, we will compute the cosine similarity between the embeddings of the article descriptions.
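As a small worked illustration of the measure itself, cosine similarity is the dot product of two vectors divided by the product of their lengths, so it depends on direction only and not on magnitude. The helper `cosine_sim` and the toy vectors below are assumptions made for this sketch.

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

u = np.array([1.0, 0.0])
v = np.array([1.0, 1.0])
w = np.array([0.0, 2.0])

sim_uv = cosine_sim(u, v)          # 45 degrees apart: about 0.7071
sim_uw = cosine_sim(u, w)          # orthogonal: 0.0
sim_scaled = cosine_sim(v, 3 * v)  # same direction, different length: 1.0

print(round(sim_uv, 4), sim_uw, round(sim_scaled, 4))
```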

In [22]:
#Calculating cosine similarity and creating an item-by-item similarity matrix:

cosine_sim_matrix = cosine_similarity(sentence_embeddings, sentence_embeddings)

In [23]:
#Sanity check:

cosine_sim_matrix.shape

(55421, 55421)
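A matrix of this shape is large: assuming 4-byte float32 entries, it occupies roughly 12 GB, and twice that with 8-byte floats. If memory becomes a constraint, one possible alternative is to keep only the top-k neighbours per item and compute similarities in chunks. The sketch below is one way to do this under those assumptions; `top_k_similar`, `chunk_size`, and the toy embeddings are hypothetical and not part of the original notebook, and the function assumes no embedding row is all zeros.

```python
import numpy as np

n_items = 55421
bytes_float32 = n_items ** 2 * 4
print(f"{bytes_float32 / 1e9:.1f} GB")  # 12.3 GB

def top_k_similar(embeddings, k=5, chunk_size=1024):
    """Return, for each row, the indices of its k most cosine-similar other rows."""
    # Normalising rows makes the dot product equal to cosine similarity.
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    n = normed.shape[0]
    result = np.empty((n, k), dtype=np.int64)
    for start in range(0, n, chunk_size):
        stop = min(start + chunk_size, n)
        sims = normed[start:stop] @ normed.T
        # Exclude each item's similarity with itself.
        sims[np.arange(stop - start), np.arange(start, stop)] = -np.inf
        # Sort each row by descending similarity and keep the first k indices.
        result[start:stop] = np.argsort(-sims, axis=1)[:, :k]
    return result

toy_embeddings = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
neighbours = top_k_similar(toy_embeddings, k=1, chunk_size=3)
print(neighbours.ravel().tolist())  # [1, 0, 3, 2]
```

Only one chunk of similarities is held in memory at a time, and the stored result has `n × k` entries instead of `n × n`.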

In [24]:
#Putting the article_ids into a list:

article_ids = filtered_articles_df['article_id'].tolist()

#Labelling the columns and indexing the rows:

article_similarity = pd.DataFrame(cosine_sim_matrix, index=article_ids, columns=article_ids)

#Sanity check:

article_similarity.head()

Unnamed: 0,108775015,108775044,108775051,110065001,110065002,110065011,111565001,111565003,111586001,111593001,...,949198001,949323002,949594001,952267001,952937003,952938001,953450001,953763001,956217002,959461001
108775015,1.0,0.932557,0.929937,0.565288,0.531662,0.550211,0.592509,0.560499,0.518501,0.51131,...,0.318772,0.553594,0.360813,0.545787,0.65809,0.838722,0.408288,0.592692,0.732169,0.65685
108775044,0.932557,1.0,0.993682,0.470773,0.539942,0.521229,0.540954,0.569279,0.439311,0.441906,...,0.24383,0.484759,0.312076,0.462915,0.681742,0.855469,0.329519,0.531827,0.697878,0.693355
108775051,0.929937,0.993682,1.0,0.466655,0.532731,0.516286,0.542607,0.569446,0.440151,0.446092,...,0.24806,0.488602,0.3105,0.471702,0.677598,0.850086,0.33411,0.538772,0.693866,0.697373
110065001,0.565288,0.470773,0.466655,1.0,0.953168,0.968726,0.512428,0.442882,0.554566,0.429782,...,0.349812,0.571967,0.297611,0.568542,0.499511,0.472624,0.477609,0.576684,0.56049,0.500261
110065002,0.531662,0.539942,0.532731,0.953168,1.0,0.983416,0.479509,0.460407,0.506738,0.380632,...,0.280923,0.524764,0.261014,0.507981,0.53651,0.499931,0.40965,0.53724,0.545171,0.546722


In order to generate personalized recommendations for customers, we will need to use the User-Item matrix that we created in our prior recommendation systems.

We'll follow the same steps: starting with the R table and creating a User-Item matrix that is free of NaNs.


In [25]:
#Importing our R table:

R = pd.read_csv("R_df.csv")


In [26]:
#Sanity check:

R.head()

Unnamed: 0,customer_id,article_id,unit_number
0,000058a12d5b43e67d225668fa1f8d618c13dc232df0cad8ffe7ad4a1091e318,794321007,1
1,0000757967448a6cb83efb3ea7a3fb9d418ac7adf2379d8cd0c725276a467a2a,448509014,1
2,0000757967448a6cb83efb3ea7a3fb9d418ac7adf2379d8cd0c725276a467a2a,719530003,1
3,0001d44dbe7f6c4b35200abdb052c77a87596fe1bdcc37e011580a479e80aa94,734592001,1
4,0002cca4cc68601e894ab62839428e5f0696417fe0f9e84551c6827a7629d441,910601002,1


In [27]:
#Initial checks:

R.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 470191 entries, 0 to 470190
Data columns (total 3 columns):
 #   Column       Non-Null Count   Dtype 
---  ------       --------------   ----- 
 0   customer_id  470191 non-null  object
 1   article_id   470191 non-null  int64 
 2   unit_number  470191 non-null  int64 
dtypes: int64(2), object(1)
memory usage: 10.8+ MB


In [28]:
#Creating the User-Item matrix:

filled_matrix = R.pivot(index='customer_id', columns='article_id', values='unit_number')



Since our matrix has `article_id` as columns and `customer_id` as the index, most entries will be null, because customers typically purchase only a small fraction of the catalog. We fill these null values with 0s so that a customer's purchases can be identified as the entries greater than 0.

In [29]:
#Replacing NaN with 0 across the whole matrix in one vectorized call:

filled_matrix = filled_matrix.fillna(0)

#Sanity Check:

filled_matrix.head()

article_id,108775044,111565001,111586001,111593001,111609001,120129001,120129014,123173001,126589007,129085001,...,948152002,949198001,949551001,949551002,949594001,952267001,952938001,953450001,953763001,956217002
customer_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
000058a12d5b43e67d225668fa1f8d618c13dc232df0cad8ffe7ad4a1091e318,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
0000757967448a6cb83efb3ea7a3fb9d418ac7adf2379d8cd0c725276a467a2a,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
0001d44dbe7f6c4b35200abdb052c77a87596fe1bdcc37e011580a479e80aa94,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
0002cca4cc68601e894ab62839428e5f0696417fe0f9e84551c6827a7629d441,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
00039306476aaf41a07fed942884f16b30abfa83a2a8bea972019098d6406793,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
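Because the matrix is used here mainly to look up what each customer bought, the same information can also be read straight from the long-format R table without building a dense customer-by-article matrix. The following is a minimal sketch on a toy table; `R_toy`, its IDs, and `purchases_by_customer` are hypothetical stand-ins, not data from the notebook.

```python
import pandas as pd

# Toy stand-in for the R table, with the same three columns (hypothetical values).
R_toy = pd.DataFrame({
    "customer_id": ["c1", "c1", "c2"],
    "article_id": [101, 102, 101],
    "unit_number": [1, 2, 1],
})

# One list of purchased article IDs per customer.
purchases_by_customer = R_toy.groupby("customer_id")["article_id"].apply(list)

print(purchases_by_customer["c1"])  # [101, 102]
```

This keeps memory proportional to the number of purchases instead of customers times articles.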


We will now develop the functions that produce recommended items based on the customer's previous purchase history.

In [30]:
def recommend_articles_for_customer(customer_id, top_n=5):

    """
    Recommend articles of clothing to a customer based on their purchase history and article similarity.

    Parameters:
    customer_id (str): The unique identifier for the customer.
    top_n (int): The number of top similar articles to consider for each purchased article. Default is 5.

    Returns:
    list: A list of recommended article IDs for the customer, excluding articles they have already purchased.
    """

    #Getting the articles purchased by the customer:
    purchased_articles = filled_matrix.loc[customer_id]
    purchased_articles = purchased_articles[purchased_articles > 0].index.tolist()

    #Getting similar articles for each purchased article:
    recommended_articles = []
    for article_id in purchased_articles:
        similar_articles = get_similar_articles(article_id, top_n)
        recommended_articles.extend(similar_articles)

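    #Removing already purchased articles.
    #Note: converting to a set drops duplicates but also discards the similarity ranking,
    #so the top_n items returned below are not guaranteed to be the most similar ones:
    recommended_articles = list(set(recommended_articles) - set(purchased_articles))

    return recommended_articles[:top_n]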


In [31]:
def get_similar_articles(article_id, top_n=5):

    """
    Get the top number of similar articles of clothing for a specific article.

    Parameters:
    article_id (int): The unique identifier for the article.
    top_n (int): The number of top similar articles to return. Default is 5.

    Returns:
    list: A list of the top similar article IDs.
    """

    #Getting the similarity scores for the specified article:
    similarity_scores = article_similarity.loc[article_id]

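    #Sorting the scores in descending order and getting the top similar articles
    #(the article itself ranks first with similarity 1.0; the caller removes it as an already purchased item):
    top_similar_articles = similarity_scores.sort_values(ascending=False).head(top_n).index.tolist()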

    return top_similar_articles
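The two functions above collect candidates per purchased article and then de-duplicate them. A possible variant, sketched below, ranks every unpurchased article by its highest similarity to any purchased article, so the returned list is ordered by score. The function `recommend_ranked`, the scoring rule, and the toy similarity matrix are assumptions made for illustration and are not the notebook's method.

```python
import pandas as pd

def recommend_ranked(purchased, similarity, top_n=5):
    """Rank unpurchased articles by their best similarity to any purchased article."""
    # Skip purchases that have no row in the similarity matrix
    # (for example, articles outside the filtered categories).
    known = [a for a in purchased if a in similarity.index]
    if not known:
        return []
    # For each candidate column, keep the highest score across the purchased rows.
    best = similarity.loc[known].max(axis=0)
    # Do not recommend items the customer already owns.
    best = best.drop(labels=known)
    return best.sort_values(ascending=False).head(top_n).index.tolist()

# Toy 4-item similarity matrix with hypothetical article IDs 1 to 4.
ids = [1, 2, 3, 4]
sim = pd.DataFrame(
    [[1.0, 0.9, 0.2, 0.1],
     [0.9, 1.0, 0.3, 0.4],
     [0.2, 0.3, 1.0, 0.8],
     [0.1, 0.4, 0.8, 1.0]],
    index=ids, columns=ids,
)

recs_single = recommend_ranked([1, 99], sim, top_n=2)  # 99 is unknown and skipped
recs_pair = recommend_ranked([3, 4], sim, top_n=2)
print(recs_single, recs_pair)  # [2, 3] [2, 1]
```

Taking the maximum is only one way to combine scores; summing or averaging similarities across purchases would favour items that resemble several purchases at once.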


In [32]:
#Example:

customer_id = '0efc7abe48c4111b1386bc7f122aacdc291af2c31541609c488a38d7383d6ed0'

recommendations = recommend_articles_for_customer(customer_id=customer_id)

print(f"Customer {customer_id}: \n \nTop 5 Recommendations: {recommendations}")


Customer 0efc7abe48c4111b1386bc7f122aacdc291af2c31541609c488a38d7383d6ed0: 
 
Top 5 Recommendations: [915092001, 782061001, 906705001, 684914001, 694478001]


To evaluate these recommendations, we review this customer's prior purchases.

In [33]:
#Specified customer_id:

specific_customer_id = '0efc7abe48c4111b1386bc7f122aacdc291af2c31541609c488a38d7383d6ed0'

customer_purchases = R[R['customer_id'] == specific_customer_id]

#Printing prior purchases:

print(f"Purchases for customer_id {specific_customer_id}:")
customer_purchases


This customer previously purchased a wool sweater and loafers. Our BERT word embedding recommender system specifically recommended similar sweaters and loafers that match the style and features of the original purchases, offering personalized suggestions based on detailed product descriptions.