#KeyBlend Recommender: a Product Recommendation Engine for matching content with products 🤖

##Problem statement

This project builds a product recommendation engine for HYPD. When a creator uploads a pebble (a piece of video content), the engine automatically suggests relevant products to tag.



**Who are the creators?**

Creators are core to the HYPD ecosystem. Creators are often influencers or individuals
who can create shops on HYPD and share content to sell the products (their own or
other brands) to their audience.

**What content do creators create?**

Creators can upload pebbles on HYPD. A pebble is a piece of video content with one or more products associated with it.

---
##Data Description
The data consists of two datasets: Catalog and Content.

**Catalog data description**
* id: Unique identifier of the catalog
* name: Name of product
* brand_id: ID of product brand
* keywords: Keywords associated with product
* retail_price: Retail price of product
* base_price: Base price of product
* cat_one: Top-level Category ID (Eg: Men)
* cat_two: 2nd Level Category ID (Eg: Clothing)
* cat_three: Third level Category ID (Eg: Shirts)
* status: Publish or Unpublished

**Content data description**

* id: Unique identifier of the content (content_id)
* type: Content type
* media_type: Format of content
* influencer_ids: Unique identifier of influencers
* brand_ids: Unique identifier of brands
* label: Labels associated with the content like Interests, Gender, etc.
* is_processed: Yes or No
* is_active: Yes or No
* view_count: View count of the content
* like_count: Like count on the content
* comment_count: Comment count on the content
* caption: Caption of the content
* catalog_ids: Identifier of catalogs
* catalog_info: Information of catalog
* created_at: timestamp of when the content was created
* processed_at: timestamp of when the content was processed
* like_ids: ids of likes
* liked_by: ids of users who like the content
* last_sync: timestamp of content last synced
* category_path: path of category
* hashtags: hashtags of content

---
##Idea of the Project

The idea is to fine-tune a Sentence-Transformer LLM that analyses the caption, hashtags and interests attached to each piece of content in order to predict the products relevant to it.
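
As a rough illustration of the matching step, the sketch below ranks products by the cosine similarity between one content embedding and a few product embeddings. The three-dimensional vectors and the names `content_vec`, `product_vecs` and `rank_products` are made up for this example; in the project the vectors would come from the fine-tuned model.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank_products(content_vec, product_vecs, top_k=2):
    """Return the names of the top_k products, most similar first."""
    scored = [(cosine(content_vec, vec), name) for name, vec in product_vecs.items()]
    scored.sort(reverse=True)
    return [name for _, name in scored[:top_k]]

# Toy embeddings (hypothetical values, not model output).
content_vec = [0.9, 0.1, 0.0]
product_vecs = {
    "Bold Swimsuit": [0.8, 0.2, 0.1],
    "Onion Seed Hair Oil": [0.0, 0.1, 0.9],
    "Maroon Embroidered Shift Dress": [0.3, 0.9, 0.1],
}

top = rank_products(content_vec, product_vecs)
print(top)
```

The products at the head of the ranking are the ones that would be suggested to the creator for tagging.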



In [None]:
#importing necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import json
from gensim.corpora import Dictionary
from gensim.matutils import corpus2csc
from nltk.tokenize import word_tokenize

##1. Data Creation

###Loading the data

In [None]:
content_path="/content/drive/MyDrive/content_masked.json"
with open(content_path, 'r') as json_file:
    content_data = json.load(json_file)
catalog_path="/content/drive/MyDrive/catalog_masked.json"
with open(catalog_path, 'r') as json_file:
    catalog_data = json.load(json_file)

catalog data example :

In [None]:
catalog_data[1]

{'_id': {'$oid': '*****************94cb'},
 'name': 'Bold Swimsuit',
 'brand_id': {'$oid': '*****************93d3'},
 'keywords': ['swimsuit',
  'gymwear',
  'active wear',
  'swimming costume',
  'fitness'],
 'retail_price': 1350,
 'base_price': 1500,
 'cat_one': '*****************7324',
 'cat_two': '*****************7cf9',
 'cat_three': '*****************f056',
 'status': 'Publish'}

content data example :

In [None]:
content_data[0]

{'_id': {'$oid': '*****************b634'},
 'type': 'pebble',
 'media_type': 'video',
 'influencer_ids': [{'$oid': '*****************4055'}],
 'brand_ids': [{'$oid': '*****************be8b'}],
 'label': {'interests': ['interest'],
  'age_groups': ['26-30', '30-35'],
  'genders': ['M', 'F']},
 'is_processed': True,
 'is_active': False,
 'view_count': 0,
 'like_count': 0,
 'comment_count': 0,
 'caption': 'Test pebble',
 'catalog_ids': [{'$oid': '5f6362aa6e9f136645ab2d34'}],
 'catalog_info': None,
 'created_at': {'$date': '2021-04-19T09:47:58.735Z'},
 'processed_at': {'$date': '2021-04-19T09:48:42.575Z'},
 'like_ids': None,
 'liked_by': None,
 'last_sync': None,
 'category_path': None,
 'hashtags': None,
 'sync_type': None,
 'sync': None,
 'series_ids': None}
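
Both datasets wrap their identifiers in `{'$oid': ...}` dictionaries, and many content fields can be `None`. A minimal sketch of two helpers for reading them safely is given here; the names `oid` and `oid_list` are hypothetical and are not used elsewhere in the notebook.

```python
def oid(value):
    """Return the id string from a {'$oid': ...} wrapper, or None."""
    if isinstance(value, dict):
        return value.get("$oid")
    return None

def oid_list(values):
    """Return the id strings from a list of wrappers; None becomes []."""
    return [oid(v) for v in (values or [])]

# A cut-down record with the same shape as content_data[0].
record = {
    "_id": {"$oid": "*****************b634"},
    "catalog_ids": [{"$oid": "5f6362aa6e9f136645ab2d34"}],
    "like_ids": None,
}

content_id = oid(record["_id"])
catalog_ids = oid_list(record["catalog_ids"])
like_ids = oid_list(record["like_ids"])
print(content_id, catalog_ids, like_ids)
```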

###Making the product list that contains the id and name and keywords of each product

In [None]:
#extracting "keywords" list from catalog data
catalog_keywords={}
for i in range(len(catalog_data)) :
    catalog_keywords[str(i)]={}
    catalog_keywords[str(i)]["id"]=catalog_data[i]["_id"]["$oid"]
    catalog_keywords[str(i)]["name"]=catalog_data[i]["name"]
    catalog_keywords[str(i)]["keywords"]=catalog_data[i]["keywords"]

In [None]:
product_list=pd.DataFrame(catalog_keywords).T
product_list.head()

Unnamed: 0,id,name,keywords
0,*****************93d0,Moisturising Cream | Shea Butter & Vitamin E,"[cream, moisturising, shea, butter, vitamin e,..."
1,*****************94cb,Bold Swimsuit,"[swimsuit, gymwear, active wear, swimming cost..."
2,*****************93ff,Onion Seed Hair Oil,"[oil, hair fall, growth oil, hair, onion, blac..."
3,*****************9548,Maroon Embroidered Shift Dress,"[one piece, casual, party, long dress, dress]"
4,*****************9548,Maroon Embroidered Shift Dress,"[one piece, casual, party, long dress, dress]"


In [None]:
#clearing up the id column: drop the "*" mask so only the visible suffix remains
def clean_id(x) :
    return x.replace("*","")
product_list["id"]=product_list["id"].apply(clean_id)

In [None]:
product_list.head()

Unnamed: 0,id,name,keywords
0,93d0,Moisturising Cream | Shea Butter & Vitamin E,"[cream, moisturising, shea, butter, vitamin e,..."
1,94cb,Bold Swimsuit,"[swimsuit, gymwear, active wear, swimming cost..."
2,93ff,Onion Seed Hair Oil,"[oil, hair fall, growth oil, hair, onion, blac..."
3,9548,Maroon Embroidered Shift Dress,"[one piece, casual, party, long dress, dress]"
4,9548,Maroon Embroidered Shift Dress,"[one piece, casual, party, long dress, dress]"


In [None]:
#deleting duplicates
product_list.drop_duplicates(["id"],inplace=True,ignore_index=True)

In [None]:
#lowercasing the keywords in the dataframe
def lowercase(x) :
    return [word.lower() for word in x]
product_list["keywords"]=product_list['keywords'].apply(lowercase)
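
The three cleaning steps of this section (unmasking the id, dropping duplicate ids, lowercasing the keywords) can be sketched without pandas as a single pass over the records. The function name `clean_products` and the keyword casing in the toy records are assumptions made for the example.

```python
def clean_products(records):
    """Unmask ids, lowercase keywords and keep the first record seen per id."""
    cleaned = {}
    for rec in records:
        pid = rec["id"].replace("*", "")   # drop the mask characters
        if pid in cleaned:
            continue                       # duplicate id: keep the first one
        cleaned[pid] = {
            "id": pid,
            "name": rec["name"],
            "keywords": [kw.lower() for kw in rec["keywords"]],
        }
    return list(cleaned.values())

# Toy records shaped like the catalog rows above (keyword casing is made up).
raw = [
    {"id": "*****************9548", "name": "Maroon Embroidered Shift Dress",
     "keywords": ["One Piece", "Casual", "dress"]},
    {"id": "*****************9548", "name": "Maroon Embroidered Shift Dress",
     "keywords": ["One Piece", "Casual", "dress"]},
    {"id": "*****************94cb", "name": "Bold Swimsuit",
     "keywords": ["Swimsuit", "gymwear"]},
]

products = clean_products(raw)
print([p["id"] for p in products])
```

Keeping the first record per id mirrors the default behaviour of `drop_duplicates`.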

###Making the content matrix

In [None]:
#visualizing the content data
content_data[0]

{'_id': {'$oid': '*****************b634'},
 'type': 'pebble',
 'media_type': 'video',
 'influencer_ids': [{'$oid': '*****************4055'}],
 'brand_ids': [{'$oid': '*****************be8b'}],
 'label': {'interests': ['interest'],
  'age_groups': ['26-30', '30-35'],
  'genders': ['M', 'F']},
 'is_processed': True,
 'is_active': False,
 'view_count': 0,
 'like_count': 0,
 'comment_count': 0,
 'caption': 'Test pebble',
 'catalog_ids': [{'$oid': '5f6362aa6e9f136645ab2d34'}],
 'catalog_info': None,
 'created_at': {'$date': '2021-04-19T09:47:58.735Z'},
 'processed_at': {'$date': '2021-04-19T09:48:42.575Z'},
 'like_ids': None,
 'liked_by': None,
 'last_sync': None,
 'category_path': None,
 'hashtags': None,
 'sync_type': None,
 'sync': None,
 'series_ids': None}

> We can see that the keywords of a content item may come from the <code>hashtags</code> field, the <code>caption</code> field or even <code>label.interests</code>.

In [None]:
#extracting keywords from each content
#(index 0 is skipped, presumably because it is the "Test pebble" record shown above)
content_keywords={}
for i in range(1,len(content_data)) :
    content_keywords[str(i)]={}
    content_keywords[str(i)]["id"]=content_data[i]["_id"]["$oid"]
    #missing fields get a "None" placeholder of the same type as the real values
    if isinstance(content_data[i]["label"],dict) :
       content_keywords[str(i)]["interests"]=content_data[i]["label"]["interests"]
    else :
       content_keywords[str(i)]["interests"]=["None"]
    if content_data[i]["hashtags"] is None :
       content_keywords[str(i)]["hashtags"]=["None"]
    else :
       content_keywords[str(i)]["hashtags"]=content_data[i]["hashtags"]
    if content_data[i]["caption"] is None :
       content_keywords[str(i)]["caption"]="None"
    else :
       content_keywords[str(i)]["caption"]=content_data[i]["caption"]
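
The same extraction can be sketched as a small helper that works on one record at a time. This version is an alternative design, not what the notebook does: it returns empty values instead of a "None" placeholder, and the name `content_fields` is hypothetical.

```python
def content_fields(record):
    """Pull interests, hashtags and caption out of one content record."""
    label = record.get("label")
    interests = label.get("interests") if isinstance(label, dict) else None
    hashtags = record.get("hashtags") or []
    return {
        "interests": interests or [],
        "hashtags": [tag.lstrip("#") for tag in hashtags],  # drop the leading '#'
        "caption": record.get("caption") or "",
    }

# Records with the same shape as the content data (second one is made up).
with_label = {
    "label": {"interests": ["interest"]},
    "hashtags": None,
    "caption": "Test pebble",
}
without_label = {
    "label": None,
    "hashtags": ["#ootd", "#shirt"],
    "caption": None,
}

first = content_fields(with_label)
second = content_fields(without_label)
print(first)
print(second)
```

Empty lists and strings can be joined and concatenated later without special cases, which is the main reason to prefer them over a placeholder string.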

In [None]:
#putting it into a dataframe for cleaning
content_list=pd.DataFrame(content_keywords).T
content_list.head()

Unnamed: 0,id,interests,hashtags,caption
1,*****************b698,"[Sample, testers, free products]","[#beardoil, #package, #gift, #sample, #pebbles]",A curated box of surprise samples from select ...
2,*****************b6dc,"[work wear, work, women, office look, professi...","[#quaclothing, #workwear, #corporatefashion, #...",Ways to Wear Black Cotton Trousers\n\nThe tape...
3,*****************b6fa,"[work out, activewear, gym, workout, body toni...","[#kicaactive, #workout, #gymwear, #activewear]",She's anything but ordinary. Wearing Kica Acti...
4,*****************b6fe,"[shirt, graphic shirt, quirky, orinted shirt, ...","[#destello, #fashion, #trending, #ootd, #shirt]",Midweek feels. There's so much more to a week ...
5,*****************b714,"[tailored , custom, printed shirts, casual, ev...","[#mensclothing, #floralprint, #printedshirt, #...","Vibrant Tropical Shirt from 15 buttons, styled..."


In [None]:
#cleaning the data
def clean_id(x) :
    return x.replace("*","")
content_list["id"]=content_list["id"].apply(clean_id)

def clean_hashtags(x) :
    #strip only a leading "#", so the "None" placeholder is left intact
    if x is None :
       return ["None"]
    return [word.lstrip("#") for word in x]
content_list['hashtags']=content_list['hashtags'].apply(clean_hashtags)

In [None]:
content_list.head()

Unnamed: 0,id,interests,hashtags,caption
1,b698,"[Sample, testers, free products]","[beardoil, package, gift, sample, pebbles]",A curated box of surprise samples from select ...
2,b6dc,"[work wear, work, women, office look, professi...","[quaclothing, workwear, corporatefashion, offi...",Ways to Wear Black Cotton Trousers\n\nThe tape...
3,b6fa,"[work out, activewear, gym, workout, body toni...","[kicaactive, workout, gymwear, activewear]",She's anything but ordinary. Wearing Kica Acti...
4,b6fe,"[shirt, graphic shirt, quirky, orinted shirt, ...","[destello, fashion, trending, ootd, shirt]",Midweek feels. There's so much more to a week ...
5,b714,"[tailored , custom, printed shirts, casual, ev...","[mensclothing, floralprint, printedshirt, beac...","Vibrant Tropical Shirt from 15 buttons, styled..."


###Making the content-product association list

In [None]:
#loading the original data
content_path="/content/drive/MyDrive/content_masked.json"
with open(content_path, 'r') as json_file:
    content_json = json.load(json_file)
catalog_path="/content/drive/MyDrive/catalog_masked.json"
with open(catalog_path, 'r') as json_file:
    catalog_json = json.load(json_file)

In [None]:
#matching each content with the list of products listed in it
product_content={}
for i in range(len(content_json)) :
    product_content[str(i)]={}
    product_content[str(i)]["content_id"]=content_json[i]["_id"]["$oid"]
    product_content[str(i)]["product_ids"]=content_json[i]["catalog_ids"]

In [None]:
product_content_list=pd.DataFrame(product_content).T
product_content_list.head()

Unnamed: 0,content_id,product_ids
0,*****************b634,[{'$oid': '5f6362aa6e9f136645ab2d34'}]
1,*****************b698,[{'$oid': '6035106e58b8f4136b6e4505'}]
2,*****************b6dc,"[{'$oid': '60094d9352e78c76b023fc9b'}, {'$oid'..."
3,*****************b6fa,"[{'$oid': '5fbbbb65c94eac4d3ca460cd'}, {'$oid'..."
4,*****************b6fe,[{'$oid': '5fda294b0a30d919d2439684'}]


In [None]:
#cleaning the data
def clean_content_ids(x) :
    return x.replace("*","")
def clean_product_ids(x) :
    #catalog ids are masked except for their last 4 characters, so only that suffix can be matched
    if x is None :
       return []
    return [str(oid["$oid"][-4:]) for oid in x]
product_content_list["content_id"]=product_content_list.content_id.apply(clean_content_ids)
product_content_list["product_ids"]=product_content_list["product_ids"].apply(clean_product_ids)

In [None]:
product_content_list=product_content_list.drop_duplicates(subset=["content_id"],keep=False)

In [None]:
product_content_list.head()

Unnamed: 0,content_id,product_ids
0,b634,[2d34]
1,b698,[4505]
2,b6dc,"[fc9b, fbed, fc8b]"
3,b6fa,"[60cd, 6162, 3ee8, 61ba, 61cb]"
4,b6fe,[9684]
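
Because the catalog ids are masked, content and products are linked through the last four characters of the id only. Two different products could in principle share a suffix, so a quick collision check may be worthwhile. The sketch below uses three ids taken from the output above plus one invented id (marked in the code) to show what a collision looks like.

```python
from collections import Counter

def suffix(full_id, length=4):
    """Return the visible part of an id: its last `length` characters."""
    return full_id[-length:]

full_ids = [
    "5f6362aa6e9f136645ab2d34",
    "6035106e58b8f4136b6e4505",
    "60094d9352e78c76b023fc9b",
    "000000000000000000002d34",  # hypothetical id, added to force a collision
]

counts = Counter(suffix(i) for i in full_ids)
collisions = sorted(s for s, n in counts.items() if n > 1)
print(collisions)
```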


## Preprocessing the data for the sentence-transformer LLM


###Cleaning the input data

In [None]:
#collecting the ids that are in product_content_list but not in product_list
product_ids=[i for x in product_content_list["product_ids"].values for i in x]
missing_ids=[]
for product_id in product_ids :
    if product_id in product_list["id"].values :
       continue
    else :
       missing_ids.append(product_id)

In [None]:
len(missing_ids)

131

In [None]:
def clean_ids(x) :
    new_x=[]
    for i in x :
        if i in missing_ids :
           continue
        else :
           new_x.append(i)
    return new_x
product_content_list["product_ids"]=product_content_list["product_ids"].apply(clean_ids)

In [None]:
product_ids=[i for x in product_content_list["product_ids"].values for i in x]
missing_ids=[]
for product_id in product_ids :
    if product_id in product_list["id"].values :
       continue
    else :
       missing_ids.append(product_id)

In [None]:
len(missing_ids)

0
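
The filtering done in the cells above can be sketched with a set of known ids, which makes each membership test constant-time instead of a scan over the whole id column. The dictionaries are toy data modelled on the rows shown in this notebook, and the names `known_ids` and `associations` are hypothetical.

```python
# Ids present in the catalog (toy subset).
known_ids = {"4505", "fc9b", "fbed", "fc8b"}

# content_id -> product ids tagged in that content.
associations = {
    "b634": ["2d34"],                   # its only product is not in the catalog
    "b698": ["4505"],
    "b6dc": ["fc9b", "fbed", "fc8b"],
}

# Keep only the product ids that exist in the catalog.
filtered = {
    cid: [pid for pid in pids if pid in known_ids]
    for cid, pids in associations.items()
}

# Drop contents that are left without any product.
non_empty = {cid: pids for cid, pids in filtered.items() if pids}
print(non_empty)
```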

In [None]:
product_content_list.head()

Unnamed: 0,content_id,product_ids
1,b698,[4505]
2,b6dc,"[fc9b, fbed, fc8b]"
3,b6fa,"[60cd, 6162, 3ee8, 61ba, 61cb]"
4,b6fe,[9684]
5,b714,"[457d, 4556, 4566, 453d]"


In [None]:
#adding a length column to track how many products are associated with each content
product_content_list["lengths"]=product_content_list["product_ids"].apply(len)
product_content_list=product_content_list[product_content_list["lengths"]!=0]

In [None]:
#sampling the data (because of machine limitations)
product_content_list=product_content_list.sample(frac=0.2,random_state=42).reset_index()

In [None]:
product_content_list.shape

(1030, 4)

###Formatting the input data

In [None]:
content_list.head()

Unnamed: 0,id,interests,hashtags,caption
1,b698,"[Sample, testers, free products]","[beardoil, package, gift, sample, pebbles]",A curated box of surprise samples from select ...
2,b6dc,"[work wear, work, women, office look, professi...","[quaclothing, workwear, corporatefashion, offi...",Ways to Wear Black Cotton Trousers\n\nThe tape...
3,b6fa,"[work out, activewear, gym, workout, body toni...","[kicaactive, workout, gymwear, activewear]",She's anything but ordinary. Wearing Kica Acti...
4,b6fe,"[shirt, graphic shirt, quirky, orinted shirt, ...","[destello, fashion, trending, ootd, shirt]",Midweek feels. There's so much more to a week ...
5,b714,"[tailored , custom, printed shirts, casual, ev...","[mensclothing, floralprint, printedshirt, beac...","Vibrant Tropical Shirt from 15 buttons, styled..."


In [None]:
#transforming the columns in content_list into strings instead of lists
def format(x) :
    string=""
    for el in x :
        string+=el+" "
    return string
def clean(x) :
    return x.split("#")[0]
content_list["interests"]=content_list["interests"].apply(format)
content_list["hashtags"]=content_list["hashtags"].apply(format)
content_list["caption"]=content_list["caption"].apply(clean)

In [None]:
#merging everything into one column called queries
def queries(x) :
    #take the single content_list row whose id matches this content
    content_vec=content_list[content_list["id"]==x].iloc[0]
    query=f"Caption : {content_vec['caption']}\n"
    query+=f"Hashtags : {content_vec['hashtags']}\n"
    query+=f"Interests : {content_vec['interests']}\n\n"
    return query
product_content_list["queries"]=product_content_list["content_id"].apply(queries)

In [None]:
product_content_list.head()

Unnamed: 0,index,content_id,product_ids,lengths,queries
0,2117,b4cb,[21c3],1,Caption : Don't train to be skinny. ... \n\nHa...
1,5515,8581,"[c851, e44e, c506, 3680, fd05, 6e5f]",6,Caption : Black is my happy color.\n\nHashtags...
2,1951,80d8,"[3349, 334c, 333d, 3343, 3340, 3346]",6,Caption : Nails before males.\n\nHashtags : na...
3,7192,1eed,[e4a9],1,Caption : A luxurious blend of green tea with ...
4,6696,f040,[99fb],1,Caption : Disguise on the go.\n\nHashtags : sk...
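
To make the query layout explicit, here is a small sketch that builds one query string from raw fields. The helper name `build_query` and the hashtag and interest values are made up; unlike the notebook code, it also strips trailing whitespace from the caption.

```python
def build_query(caption, hashtags, interests):
    """Format one content item as the text fed to the model."""
    caption_text = caption.split("#")[0].strip()  # cut inline hashtags off the caption
    return (
        f"Caption : {caption_text}\n"
        f"Hashtags : {' '.join(hashtags)}\n"
        f"Interests : {' '.join(interests)}\n\n"
    )

query = build_query(
    caption="Nails before males. #nails #nailart",
    hashtags=["nails", "nailart"],
    interests=["nail polish"],
)
print(query)
```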


In [None]:
#Transforming the product keywords into strings instead of lists
product_list["keywords_str"]=product_list["keywords"].apply(format)

In [None]:
product_list.head()

Unnamed: 0,id,name,keywords,keywords_str
0,93d0,Moisturising Cream | Shea Butter & Vitamin E,"[cream, moisturising, shea, butter, vitamin e,...",cream moisturising shea butter vitamin e men s...
1,94cb,Bold Swimsuit,"[swimsuit, gymwear, active wear, swimming cost...",swimsuit gymwear active wear swimming costume ...
2,93ff,Onion Seed Hair Oil,"[oil, hair fall, growth oil, hair, onion, blac...",oil hair fall growth oil hair onion black seed...
3,9548,Maroon Embroidered Shift Dress,"[one piece, casual, party, long dress, dress]",one piece casual party long dress dress
4,dcfd,Grey Crop Top With Embroidered Sleeves,"[top, casual wear, trendy, women wear]",top casual wear trendy women wear


In [None]:
product_content_list.head()

Unnamed: 0,index,content_id,product_ids,lengths,queries
0,2117,b4cb,[21c3],1,Caption : Don't train to be skinny. ... \n\nHa...
1,5515,8581,"[c851, e44e, c506, 3680, fd05, 6e5f]",6,Caption : Black is my happy color.\n\nHashtags...
2,1951,80d8,"[3349, 334c, 333d, 3343, 3340, 3346]",6,Caption : Nails before males.\n\nHashtags : na...
3,7192,1eed,[e4a9],1,Caption : A luxurious blend of green tea with ...
4,6696,f040,[99fb],1,Caption : Disguise on the go.\n\nHashtags : sk...


###Making the ground truth using a Doc2Vec approach

The idea is to give the top score to every product that is actually tagged in a piece of content (the <code>rating</code> function in this section returns 5.0 for these). Every other product is scored by how similar it is to the tagged products: we embed each product's keywords with a Doc2Vec model and take the highest cosine similarity to any tagged product. Every content-product pair thus gets a graded "similarity" score, which gives us training data that represents what we want the LLM to learn.
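
A worked toy example of this scoring rule is sketched below, with hand-written two-dimensional vectors standing in for the Doc2Vec vectors. The product names and numbers are invented. Untagged products stay on the cosine scale here; how they are rescaled to the larger values seen later in the saved ratings is not shown in this notebook.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def toy_rating(product_id, tagged_ids, vectors, tagged_score=5.0):
    """Top score for tagged products, else similarity to the closest tagged one."""
    if product_id in tagged_ids:
        return tagged_score
    return max(cosine(vectors[product_id], vectors[t]) for t in tagged_ids)

# Hypothetical keyword vectors for three products.
vectors = {
    "lipstick": [1.0, 0.0],
    "lip_liner": [0.8, 0.6],
    "shoes": [0.0, 1.0],
}
tagged = {"lipstick"}  # the only product tagged in this content

scores = {pid: toy_rating(pid, tagged, vectors) for pid in vectors}
print(scores)
```

The lip liner ends up with a score of about 0.8 and the shoes with 0, which is the ordering we would like the model to learn.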

In [None]:
product_list_sample=product_list[product_list["id"].isin(product_ids)]
product_list_sample.shape

(3346, 4)

In [None]:
from gensim.models import Doc2Vec
from gensim.models.doc2vec import TaggedDocument

# Example tokenized documents (list of lists)
documents={key:value for key,value in zip(product_list_sample["id"].values,product_list_sample["keywords"].values)}
tokenized_documents = [keywords for keywords in product_list_sample["keywords"].values]

# Tag each document with a unique ID
tagged_documents = [TaggedDocument(words=doc, tags=[i]) for i, doc in enumerate(tokenized_documents)]
model = Doc2Vec(vector_size=50, min_count=1, epochs=40)

# Build the vocabulary from the tagged documents
model.build_vocab(tagged_documents)

# Train the model
model.train(tagged_documents, total_examples=model.corpus_count, epochs=model.epochs)

In [None]:
documents={key:value for key,value in zip(product_list_sample["id"].values,product_list_sample["keywords"].values)}

In [None]:
#testing the model
def similar_products(product_id) :
    product_keywords=product_list[product_list["id"]==product_id]["keywords"].values[0]
    new_vector = model.infer_vector(product_keywords)
    most_similar = model.dv.most_similar([new_vector], topn=5)
    print(f"the product name you're looking for is : {product_list[product_list['id']==product_id]['name'].values[0]}")
    # Document tags are row positions in product_list_sample, so they map straight back to product ids
    sample_ids = product_list_sample["id"].values
    # Print the most similar documents and their similarity scores
    for index, similarity in most_similar:
        sim_prod_id = sample_ids[index]
        sim_prod_name=product_list[product_list["id"]==sim_prod_id]["name"].values[0]
        print(f"Document {index} is similar with a similarity score of {similarity:.4f}")
        print(f"Name: {sim_prod_name}")
        print(f"Product ID: {sim_prod_id}")

In [None]:
content_list.head()

Unnamed: 0,id,interests,hashtags,caption
1,b698,Sample testers free products,beardoil package gift sample pebbles,A curated box of surprise samples from select ...
2,b6dc,work wear work women office look professional,quaclothing workwear corporatefashion officelook,Ways to Wear Black Cotton Trousers\n\nThe tape...
3,b6fa,work out activewear gym workout body toning ac...,kicaactive workout gymwear activewear,She's anything but ordinary. Wearing Kica Acti...
4,b6fe,shirt graphic shirt quirky orinted shirt green...,destello fashion trending ootd shirt,Midweek feels. There's so much more to a week ...
5,b714,tailored custom printed shirts casual everyda...,mensclothing floralprint printedshirt beachvibes,"Vibrant Tropical Shirt from 15 buttons, styled..."


In [None]:
#function for creating the scores
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

def similarity(vector1, vector2):
    # cosine_similarity expects 2D inputs, so each flat vector becomes a single row
    return cosine_similarity(vector1.reshape(1, -1), vector2.reshape(1, -1))[0][0]

def infer_vector_batch(keywords_list, model):
    return np.array([model.infer_vector(keywords) for keywords in keywords_list])

def rating(content_id, product_id):
    # Products tagged in this content
    product_asoc_id = product_content_list.loc[product_content_list["content_id"] == content_id, "product_ids"].values[0]

    # A tagged product gets the top score
    if product_id in product_asoc_id:
        return 5.0

    product_keywords = product_list.loc[product_list["id"] == product_id, "keywords"].values[0]
    # Keywords of the tagged products (ids with no catalog entry are skipped)
    product_asoc_keywords = [product_list.loc[product_list["id"] == pid, "keywords"].values[0] for pid in product_asoc_id if len(product_list.loc[product_list["id"] == pid, "keywords"].values)!=0]

    # Infer the candidate vector and the vectors of the tagged products
    product_vector = model.infer_vector(product_keywords)
    product_asoc_vectors = infer_vector_batch(product_asoc_keywords, model)

    # Compute cosine similarities against all tagged products at once
    similarities = cosine_similarity(product_vector.reshape(1, -1), product_asoc_vectors).flatten()

    # Score = similarity to the closest tagged product
    return np.max(similarities)
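
Since <code>rating</code> is called for several hundred thousand pairs and infers the product vectors again on every call, one possible speed-up is to infer each product's vector once and reuse it. The sketch below shows the caching pattern only: `infer` is a stand-in for `model.infer_vector`, and `vector_cache` and `cached_vector` are hypothetical names.

```python
calls = 0

def infer(keywords):
    """Stand-in for model.infer_vector: counts how often it is called."""
    global calls
    calls += 1
    return [float(len(word)) for word in keywords]

vector_cache = {}

def cached_vector(product_id, keywords_by_id):
    """Infer a product's vector on first use and reuse it afterwards."""
    if product_id not in vector_cache:
        vector_cache[product_id] = infer(keywords_by_id[product_id])
    return vector_cache[product_id]

keywords_by_id = {
    "94cb": ["swimsuit", "gymwear"],
    "93ff": ["oil", "hair"],
}

first = cached_vector("94cb", keywords_by_id)
again = cached_vector("94cb", keywords_by_id)   # served from the cache
other = cached_vector("93ff", keywords_by_id)
print(calls)
```

With a cache, the number of inference calls is bounded by the number of distinct products rather than by the number of content-product pairs.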

In [None]:
ratings=pd.read_csv("/content/drive/MyDrive/KeyBlend recommender/ratings.csv")

In [None]:
ratings.head()

Unnamed: 0,content_id,product_id,content_product_ids,rating
0,b4cb,6e59,"['b4cb', '6e59']",<function rating at 0x7e7a5e2c0d30>
1,b4cb,63e8,"['b4cb', '63e8']",<function rating at 0x7e7a5e2c0d30>
2,b4cb,91be,"['b4cb', '91be']",<function rating at 0x7e7a5e2c0d30>
3,b4cb,fbbb,"['b4cb', 'fbbb']",<function rating at 0x7e7a5e2c0d30>
4,b4cb,9c93,"['b4cb', '9c93']",<function rating at 0x7e7a5e2c0d30>


In [None]:
ratings.drop("rating",axis=1,inplace=True)

In [None]:
ratings=ratings.iloc[:674000]

In [None]:
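import ast
count=0
def apply_ratings(x) :
    global count
    #after read_csv the pair is a string like "['b4cb', '6e59']", so parse it back into a list
    if isinstance(x,str) :
       x=ast.literal_eval(x)
    rating_num=rating(x[0],x[1])
    count+=1
    print(f"{count} done out of {len(ratings)} contents")
    return rating_num
ratings["rating"]=ratings["content_product_ids"].apply(apply_ratings)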

Streaming output truncated to the last 5000 lines.
669001 done out of 674000 contents
669002 done out of 674000 contents
669003 done out of 674000 contents
...

###Formatting the whole matrix of data

In [None]:
ratings=pd.read_csv("/content/drive/MyDrive/KeyBlend recommender/ratings_v2.csv")

In [None]:
product_list.head()

Unnamed: 0,id,name,keywords,keywords_str
0,93d0,Moisturising Cream | Shea Butter & Vitamin E,"[cream, moisturising, shea, butter, vitamin e,...",cream moisturising shea butter vitamin e men s...
1,94cb,Bold Swimsuit,"[swimsuit, gymwear, active wear, swimming cost...",swimsuit gymwear active wear swimming costume ...
2,93ff,Onion Seed Hair Oil,"[oil, hair fall, growth oil, hair, onion, blac...",oil hair fall growth oil hair onion black seed...
3,9548,Maroon Embroidered Shift Dress,"[one piece, casual, party, long dress, dress]",one piece casual party long dress dress
4,dcfd,Grey Crop Top With Embroidered Sleeves,"[top, casual wear, trendy, women wear]",top casual wear trendy women wear


In [None]:
data=ratings.copy()

In [None]:
def get_query(x) :
    query=product_content_list[product_content_list["content_id"]==x]["queries"].values[0]
    return query
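
Filtering the dataframe once per row gets slow when there are hundreds of thousands of rows. A minimal sketch of the same lookup through a dictionary built once is given here; the rows are toy data, the second query text is invented, and `query_by_content` is a hypothetical name.

```python
# One row per content, as in product_content_list (toy values).
content_rows = [
    {"content_id": "b4cb", "queries": "Caption : Don't train to be skinny.\n"},
    {"content_id": "8581", "queries": "Caption : Black is my happy color.\n"},
]

# Build the lookup table once.
query_by_content = {row["content_id"]: row["queries"] for row in content_rows}

# (content_id, product_id) pairs, as in the ratings table.
pairs = [("b4cb", "6e59"), ("b4cb", "63e8"), ("8581", "6e59")]

# Each lookup is now a dictionary access instead of a dataframe filter.
content_queries = [query_by_content[content_id] for content_id, _ in pairs]
print(content_queries)
```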

In [None]:
count=0
def apply_get(x) :
    global count
    count+=1
    print(f"{count} done out of {len(data_to_add)} contents")
    return get_query(x)
#data_to_add is not defined in the cells shown; it is presumably the batch of rows still missing their queries
#data["content_query"]=data["content_id"].apply(apply_get)
data_to_add["content_query"]=data_to_add["content_id"].apply(apply_get)

Streaming output truncated to the last 5000 lines.
304233 done out of 309232 contents
304234 done out of 309232 contents
304235 done out of 309232 contents
...

In [None]:
def make_query(x) :
    query=f"keywords : {product_list[product_list['id']==x]['keywords_str'].values[0]}"
    return query

In [None]:
count=0
def apply_make(x) :
    global count
    count+=1
    print(f"{count} done out of {len(data_to_add)} contents")
    return make_query(x)
#data["product_query"]=data["product_id"].apply(apply_make)
data_to_add["product_query"]=data_to_add["product_id"].apply(apply_make)

Streaming output truncated to the last 5000 lines.
304233 done out of 309232 contents
304234 done out of 309232 contents
304235 done out of 309232 contents
...

In [None]:
data.to_csv("/content/drive/MyDrive/KeyBlend recommender/full_data.csv")

In [None]:
product_ids=[id for id in data[data["content_id"]=="0044"]["product_id"].values]
content_ids=[id for id in data[data["product_id"]=="1f49"]["content_id"].values]

In [None]:
data[data["content_id"]=="0044"]["content_query"].values[0]

'Caption : This amazing lip color from starstruck.  Hashtags : lipshade lipcolor lipstick starstruck  Interests : lipstick lipshade   formulation by party star struck make travel function sunny colour shade smudge color features lip meetings type feature finish glossy girl outdoor natural design liner products cinnamon trendy up formal name '

In [None]:
#encoder is not defined in the cells shown; it is assumed to be the sentence-transformer model loaded beforehand
count=0
def encode_prods(query,prod=True) :
    global count
    encoding=encoder.encode(query)
    count+=1
    if prod :
       print(f"{count} out of {len(product_ids)} done.")
    else :
       print(f"{count} out of {len(content_ids)} done.")
    return encoding
#look up each query string once and use it as the dictionary key
product_queries=[data[data["product_id"]==key]["product_query"].values[0] for key in product_ids]
product_embeddings={query:encode_prods(query) for query in product_queries}
count=0
content_queries=[data[data["content_id"]==key]["content_query"].values[0] for key in content_ids]
content_embeddings={query:encode_prods(query,prod=False) for query in content_queries}

1 out of 1957 done.
2 out of 1957 done.
3 out of 1957 done.
...

In [None]:
import pickle
with open("/content/drive/MyDrive/KeyBlend recommender/content_embeddings_distilbert.pkl","wb") as f :
     pickle.dump(content_embeddings,f)
with open("/content/drive/MyDrive/KeyBlend recommender/product_embeddings_distilbert.pkl","wb") as f :
     pickle.dump(product_embeddings,f)

In [None]:
content_embeddings[data[data["content_id"]=="0044"]["content_query"].values[0]].shape

(768,)
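
For completeness, a minimal sketch of saving such a dictionary and reading it back is shown here. It writes to a temporary directory rather than to Drive, and the three-number vector is a stand-in for a real 768-dimensional embedding.

```python
import os
import pickle
import tempfile

# Query text -> embedding (toy vector, not a real model output).
embeddings = {"keywords : top casual wear trendy women wear": [0.1, 0.2, 0.3]}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "product_embeddings.pkl")

    # Save the dictionary.
    with open(path, "wb") as f:
        pickle.dump(embeddings, f)

    # Load it back in a later session.
    with open(path, "rb") as f:
        restored = pickle.load(f)

print(restored == embeddings)
```

Pickle files should only be loaded from sources you trust, since loading can execute arbitrary code.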

##Further preprocessing of the data

### Remove emojis from content_query

In [None]:
!pip install emoji

Collecting emoji
  Downloading emoji-2.12.1-py3-none-any.whl.metadata (5.4 kB)
Downloading emoji-2.12.1-py3-none-any.whl (431 kB)
Installing collected packages: emoji
Successfully installed emoji-2.12.1


In [None]:
import emoji
count=0
def remove_emojis(text):
    global count
    if count%10000==0:
      print(f"{count} out of {len(data)} done.")
    new_txt=emoji.replace_emoji(text, replace='')
    count+=1
    return new_txt
data["content_query"]=data["content_query"].apply(remove_emojis)

0 out of 980457 done.
10000 out of 980457 done.
20000 out of 980457 done.
30000 out of 980457 done.
40000 out of 980457 done.
50000 out of 980457 done.
60000 out of 980457 done.
70000 out of 980457 done.
80000 out of 980457 done.
90000 out of 980457 done.
100000 out of 980457 done.
110000 out of 980457 done.
120000 out of 980457 done.
130000 out of 980457 done.
140000 out of 980457 done.
150000 out of 980457 done.
160000 out of 980457 done.
170000 out of 980457 done.
180000 out of 980457 done.
190000 out of 980457 done.
200000 out of 980457 done.
210000 out of 980457 done.
220000 out of 980457 done.
230000 out of 980457 done.
240000 out of 980457 done.
250000 out of 980457 done.
260000 out of 980457 done.
270000 out of 980457 done.
280000 out of 980457 done.
290000 out of 980457 done.
300000 out of 980457 done.
310000 out of 980457 done.
320000 out of 980457 done.
330000 out of 980457 done.
340000 out of 980457 done.
350000 out of 980457 done.
360000 out of 980457 done.
370000 out of 9
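
The same `content_query` text appears on many rows, because each piece of content is paired with many products. One possible speed-up, sketched here under that assumption, is to clean each distinct value once and map the result back to every row. `apply_on_unique` and `strip_lipstick` are hypothetical helpers; the stand-in cleaner is used so the sketch runs without the `emoji` package, and in the notebook you would pass `lambda t: emoji.replace_emoji(t, replace='')` instead.

```python
import pandas as pd

def apply_on_unique(series, fn):
    """Run fn once per distinct value, then map the results back onto every row."""
    cleaned = {value: fn(value) for value in series.unique()}
    return series.map(cleaned)

LIPSTICK = "\U0001F484"  # lipstick emoji, used only in this toy data

calls = []  # records how many times the cleaner actually runs

def strip_lipstick(text):
    # Stand-in for: emoji.replace_emoji(text, replace='')
    calls.append(text)
    return text.replace(LIPSTICK, "")

# Toy frame standing in for `data`: three rows, two distinct queries
toy = pd.DataFrame(
    {"content_query": ["lip color" + LIPSTICK, "lip color" + LIPSTICK, "new shoes"]}
)
toy["content_query"] = apply_on_unique(toy["content_query"], strip_lipstick)
print(toy["content_query"].tolist(), len(calls))
```

The cleaner runs twice for three rows here. The sketch assumes the column has no missing values.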

In [None]:
data.head()

Unnamed: 0.1,Unnamed: 0,content_id,product_id,content_product_ids,rating,content_query,product_query
0,0,44,1f49,"['0044', '1f49']",3.142291,function meeting lipcolor star color outdoor l...,name : Defender For Her Complete Grooming Kit ...
1,1,44,c509,"['0044', 'c509']",1.585508,function meeting lipcolor star color outdoor l...,name : Translucent HD Loose Powder function wo...
2,2,44,c542,"['0044', 'c542']",3.629763,function meeting lipcolor star color outdoor l...,name : Wild Cherry- 3 Piece Lip Kit function w...
3,3,44,9d03,"['0044', '9d03']",2.564123,function meeting lipcolor star color outdoor l...,name : Black Indian Print Oxford Shoes functio...
4,4,44,e466,"['0044', 'e466']",2.602043,function meeting lipcolor star color outdoor l...,name : Australian Tea Tree Bi-Phase Micellar W...


### Lemmatizing the product_query and content_query

In [None]:
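# Set up NLTK's WordNet lemmatizer and the English stop-word list
import nltk
nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()
# Field labels to strip; lowercase because the text is lowercased before comparison
label_tokens = {"caption", ":", "hashtags", "interests", "keywords", "name"}
count=0
def lemmatize(text):
  global count
  if count%10000==0:
    print(f"{count} out of {len(data)} done.")
  # Drop stop words and field labels before lemmatizing, so plural labels such as "keywords" still match
  tokens=[word for word in text.lower().split() if word not in stop_words and word not in label_tokens]
  # A set keeps each word once; the original word order is not preserved
  words={lemmatizer.lemmatize(word).replace(".","") for word in tokens}
  words.discard("")  # tokens made up only of dots become empty strings
  count+=1
  return " ".join(words)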

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
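
`lemmatize` collects words in a set, so duplicates are removed and the word order of the returned string is not guaranteed. If word order should be kept (for example, when the text is later fed to an order-sensitive encoder), `dict.fromkeys` is one way to de-duplicate while preserving first-seen order. A minimal sketch, assuming an identity function in place of the WordNet lemmatizer so it runs without NLTK data; `normalize` and `dedupe_keep_order` are hypothetical names:

```python
def dedupe_keep_order(tokens):
    """Remove repeated tokens while keeping the first occurrence of each."""
    return list(dict.fromkeys(tokens))

def normalize(text, lemma=lambda word: word, stop_words=frozenset(), labels=frozenset()):
    # Lowercase, then drop stop words and field labels
    tokens = [w for w in text.lower().split() if w not in stop_words and w not in labels]
    # Lemmatize (identity by default here) and strip dots, as in the notebook
    cleaned = [lemma(w).replace(".", "") for w in tokens]
    # Keep first-seen order and skip tokens that became empty
    return " ".join(w for w in dedupe_keep_order(cleaned) if w)

result = normalize(
    "Caption : Lip color . lip kit for the color",
    stop_words={"for", "the"},
    labels={"caption", ":"},
)
print(result)
```

In the notebook, `lemma` would be `lemmatizer.lemmatize` and `stop_words` the NLTK English list.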


In [None]:
data["content_query"]=data["content_query"].apply(lemmatize)
count=0  # restart the progress counter for the second column
data["product_query"]=data["product_query"].apply(lemmatize)

0 out of 944886 done.
10000 out of 944886 done.
20000 out of 944886 done.
30000 out of 944886 done.
40000 out of 944886 done.
50000 out of 944886 done.
60000 out of 944886 done.
70000 out of 944886 done.
80000 out of 944886 done.
90000 out of 944886 done.
100000 out of 944886 done.
110000 out of 944886 done.
120000 out of 944886 done.
130000 out of 944886 done.
140000 out of 944886 done.
150000 out of 944886 done.
160000 out of 944886 done.
170000 out of 944886 done.
180000 out of 944886 done.
190000 out of 944886 done.
200000 out of 944886 done.
210000 out of 944886 done.
220000 out of 944886 done.
230000 out of 944886 done.
240000 out of 944886 done.
250000 out of 944886 done.
260000 out of 944886 done.
270000 out of 944886 done.
280000 out of 944886 done.
290000 out of 944886 done.
300000 out of 944886 done.
310000 out of 944886 done.
320000 out of 944886 done.
330000 out of 944886 done.
340000 out of 944886 done.
350000 out of 944886 done.
360000 out of 944886 done.
370000 out of 9

### Adding product names to product_query

In [None]:
data.head()

Unnamed: 0,content_id,product_id,content_product_ids,rating,content_query,product_query
0,44,1f49,"['0044', '1f49']",3.142291,Caption : This amazing lip color from starstru...,keywords : defender for her grooming kit shavi...
1,44,c509,"['0044', 'c509']",1.585508,Caption : This amazing lip color from starstru...,keywords : women lady girl girls make up makeu...
2,44,c542,"['0044', 'c542']",3.629763,Caption : This amazing lip color from starstru...,keywords : wild cherry 3 piece lip kit women l...
3,44,9d03,"['0044', '9d03']",2.564123,Caption : This amazing lip color from starstru...,keywords : oxford women lady girl girls shoe ...
4,44,e466,"['0044', 'e466']",2.602043,Caption : This amazing lip color from starstru...,keywords : australian tea tree bi phase mi...


In [None]:
# First product_query seen for each product_id, built in a single pass
product_queries=data.drop_duplicates("product_id").set_index("product_id")["product_query"].to_dict()

In [None]:
#adding product names to data
count=0
def add_names(x) :
    global count
    prod_name=product_list[product_list["id"]==x]["name"].values[0]
    prod_query=product_queries[x]
    prod_query="name : "+prod_name+" "+prod_query
    count+=1
    if count%10000==0:
      print(f"{count} out of {len(data)} done.")
    return prod_query
data["product_query"]=data["product_id"].apply(add_names)

10000 out of 980457 done.
20000 out of 980457 done.
30000 out of 980457 done.
40000 out of 980457 done.
50000 out of 980457 done.
60000 out of 980457 done.
70000 out of 980457 done.
80000 out of 980457 done.
90000 out of 980457 done.
100000 out of 980457 done.
110000 out of 980457 done.
120000 out of 980457 done.
130000 out of 980457 done.
140000 out of 980457 done.
150000 out of 980457 done.
160000 out of 980457 done.
170000 out of 980457 done.
180000 out of 980457 done.
190000 out of 980457 done.
200000 out of 980457 done.
210000 out of 980457 done.
220000 out of 980457 done.
230000 out of 980457 done.
240000 out of 980457 done.
250000 out of 980457 done.
260000 out of 980457 done.
270000 out of 980457 done.
280000 out of 980457 done.
290000 out of 980457 done.
300000 out of 980457 done.
310000 out of 980457 done.
320000 out of 980457 done.
330000 out of 980457 done.
340000 out of 980457 done.
350000 out of 980457 done.
360000 out of 980457 done.
370000 out of 980457 done.
380000 out
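
The row-by-row `apply` above filters `product_list` once per row. A vectorized alternative is sketched below with toy data (the product names and queries are made up). It assumes every row of a given product carries the same `product_query`, so prefixing each row's own query matches the dictionary lookup used above.

```python
import pandas as pd

# Toy stand-ins for `product_list` and `data`
product_list = pd.DataFrame(
    {"id": ["1f49", "c509"], "name": ["Grooming Kit", "Loose Powder"]}
)
data = pd.DataFrame(
    {
        "product_id": ["1f49", "c509", "1f49"],
        "product_query": ["keywords : shaving", "keywords : makeup", "keywords : shaving"],
    }
)

# One name per product id; keep the first if an id repeats in the catalog
name_by_id = product_list.drop_duplicates("id").set_index("id")["name"]

# Look up all names at once; ids missing from the catalog would yield NaN
names = data["product_id"].map(name_by_id)
data["product_query"] = "name : " + names + " " + data["product_query"]
print(data["product_query"].tolist())
```

Unlike the `apply` version, which raises an `IndexError` for an unknown product id, this sketch leaves a missing value in that row, so a check such as `names.isna().any()` may be worth adding.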

## Exporting the data

In [None]:
# Pass the mapping as columns=...; a bare mapping would rename index labels instead
data_llm=data[["content_query","product_query","rating"]].rename(columns={"content_query":"sentence1","product_query":"sentence2","rating":"score"})

In [None]:
# index=False avoids writing the row index as an extra "Unnamed: 0" column
data_llm.to_csv("data_llm.csv", index=False)
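
A quick way to sanity-check the export is to round-trip it and inspect the column names. `DataFrame.rename` only changes column labels when the mapping is passed via `columns=`, and `to_csv` writes the row index as an extra unnamed column unless `index=False` is given. The sketch below uses a one-row toy frame and an in-memory buffer instead of the real file:

```python
import io

import pandas as pd

# Toy stand-in for `data` with one extra column that should not be exported
data = pd.DataFrame(
    {
        "content_id": ["0044"],
        "content_query": ["lip color"],
        "product_query": ["name : lip kit"],
        "rating": [3.6],
    }
)

data_llm = data[["content_query", "product_query", "rating"]].rename(
    columns={"content_query": "sentence1", "product_query": "sentence2", "rating": "score"}
)

# Write to an in-memory buffer rather than disk, then read it back
buffer = io.StringIO()
data_llm.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)
print(list(reloaded.columns))
```

If the reloaded frame shows `content_query` or an `Unnamed: 0` column, the rename or the index setting did not take effect.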