## Semantic Model Development Project -- Sentence Transformer and Cross Encoder Model
#### Last Edit: 6/25/25 by Aneesh Naresh
##### The goal of this project is to develop a semantic similarity model that identifies, for each successor SKU, the five most similar predecessor SKUs. The project is implemented in the phases listed below:

1. Environment Setup
2. Data Inspection and String-Level Cleanup
3. Embedding Step
4. Vector Store Choices
5. Retrieval Logic
6. Evaluation Loop
7. Packaging and Serving
8. Extensions

#### Phase 2: Data Inspection and String-Level Cleanup (BASIC)

In [1]:
import pandas as pd
import numpy as np
import torch
import sklearn
import tqdm
from sentence_transformers import SentenceTransformer, CrossEncoder
from normality import normalize

In [2]:
sku = pd.read_csv("data/npl_test_data_for_sentence_semantic_analysis.csv")

In [3]:
sku.head()  # not echoed here: a notebook cell only displays its last expression
print(pd.unique(sku["POTENTIAL_PREDECESSOR_SKU_DESC"]).shape)

(7364,)


In [4]:
def normalizeText(text: object):
    # Thin wrapper around normality.normalize: lowercases, drops punctuation, collapses whitespace
    return normalize(text)
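
To make the effect of the cleanup step concrete, here is a minimal, dependency-free sketch of the kind of transformation seen in the outputs of this notebook (lowercasing, punctuation removal, whitespace collapsing). `approx_normalize` is a hypothetical helper written for illustration; it is not the `normality` implementation and may differ from it on non-ASCII or unusual input.

```python
import re

def approx_normalize(text):
    """Rough approximation of the string cleanup (hypothetical helper)."""
    if text is None:
        return None
    lowered = str(text).lower()
    # Replace anything that is not a letter, digit, or whitespace with a space
    no_punct = re.sub(r"[^\w\s]|_", " ", lowered)
    # Collapse runs of whitespace and trim both ends
    collapsed = re.sub(r"\s+", " ", no_punct).strip()
    # Treat an empty result as "no usable text"
    return collapsed or None

print(approx_normalize("DW SIP MAKEUP 30ML - 2N2 BUFF"))
# dw sip makeup 30ml 2n2 buff
```

The printed string matches the normalized successor description that appears in the re-ranking output later in the notebook.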

In [5]:
devDataframe = sku.copy()

In [6]:
devDataframe.head()

Unnamed: 0,SUCCESSOR_SKU,SUCCESSOR_SKU_DESC,POTENTIAL_PREDECESSOR_SKU,POTENTIAL_PREDECESSOR_SKU_DESC
0,PH7GCC,DW SIP MAKEUP 30ML - 2N2 BUFF,TE1633,ARCHITECT SOFT MATTE 30ML - 16.5C MOCHA
1,PH7GCC,DW SIP MAKEUP 30ML - 2N2 BUFF,KNXW02,EB CL SPF 25 30ML - CN 20 Fair (VF)
2,PH7GCC,DW SIP MAKEUP 30ML - 2N2 BUFF,CUMM04,AO SKIN-BALNCNG FNDTN 10ML - L10-N
3,PH7GCC,DW SIP MAKEUP 30ML - 2N2 BUFF,PH7F77,DW SIP MAKEUP SPF10 30ML - 2C1 PURE BEIG
4,PH7GCC,DW SIP MAKEUP 30ML - 2N2 BUFF,H04Y05,GS SHEER COLOR MU 30ML - LEVEL 3 WARM


In [7]:
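# Preview only: apply() returns a new Series and the result is not assigned,
# so devDataframe itself is unchanged (the next cell confirms this)
devDataframe["SUCCESSOR_SKU"].apply(normalizeText)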

0        ph7gcc
1        ph7gcc
2        ph7gcc
3        ph7gcc
4        ph7gcc
          ...  
76825    ph7g38
76826    ph7g38
76827    ph7g38
76828    ph7g38
76829    ph7g38
Name: SUCCESSOR_SKU, Length: 76830, dtype: object

In [8]:
devDataframe.head()

Unnamed: 0,SUCCESSOR_SKU,SUCCESSOR_SKU_DESC,POTENTIAL_PREDECESSOR_SKU,POTENTIAL_PREDECESSOR_SKU_DESC
0,PH7GCC,DW SIP MAKEUP 30ML - 2N2 BUFF,TE1633,ARCHITECT SOFT MATTE 30ML - 16.5C MOCHA
1,PH7GCC,DW SIP MAKEUP 30ML - 2N2 BUFF,KNXW02,EB CL SPF 25 30ML - CN 20 Fair (VF)
2,PH7GCC,DW SIP MAKEUP 30ML - 2N2 BUFF,CUMM04,AO SKIN-BALNCNG FNDTN 10ML - L10-N
3,PH7GCC,DW SIP MAKEUP 30ML - 2N2 BUFF,PH7F77,DW SIP MAKEUP SPF10 30ML - 2C1 PURE BEIG
4,PH7GCC,DW SIP MAKEUP 30ML - 2N2 BUFF,H04Y05,GS SHEER COLOR MU 30ML - LEVEL 3 WARM


In [9]:
uniqueSuccessorsDescNN = pd.unique(devDataframe["SUCCESSOR_SKU_DESC"])
uniquePredecessorDescNN = pd.unique(devDataframe["POTENTIAL_PREDECESSOR_SKU_DESC"])

In [10]:
print(type(uniqueSuccessorsDescNN))
print(type(uniquePredecessorDescNN))

<class 'numpy.ndarray'>
<class 'numpy.ndarray'>


In [11]:
uniquePredecessorDescNNFiltered = uniquePredecessorDescNN[np.isin(uniquePredecessorDescNN, uniqueSuccessorsDescNN, invert=True)]
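
The filter above removes any predecessor description that is also a successor description, so a successor is never matched to itself. The counts in this notebook (7364 unique predecessor descriptions before filtering, 7354 afterwards) are consistent with all 10 successor descriptions also appearing in the predecessor column. The same idea can be sketched without NumPy; `unique_in_order` and `exclude_known` are hypothetical names and the inputs are toy values.

```python
def unique_in_order(values):
    """Distinct values in first-seen order."""
    return list(dict.fromkeys(values))

def exclude_known(candidates, known):
    """Keep candidates that do not appear in `known`, preserving order."""
    known_set = set(known)  # set lookup keeps the filter linear in len(candidates)
    return [c for c in candidates if c not in known_set]

successors = unique_in_order(["A", "A", "B"])
predecessors = unique_in_order(["X", "A", "Y", "B", "X"])
filtered = exclude_known(predecessors, successors)
print(filtered)  # ['X', 'Y']
```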

In [13]:
print(uniqueSuccessorsDescNN)

['DW SIP MAKEUP 30ML - 2N2 BUFF' 'DW SIP MAKEUP 30ML - 2N1 DESERT BEIGE'
 'DW SIP MAKEUP 30ML - 2W1 DAWN' 'DW SIP MAKEUP 30ML - 3C2 PEBBLE'
 'DW SIP MAKEUP 30ML - 2C3 FRESCO' 'DW SIP MAKEUP 30ML - 2C1 PURE BEIGE'
 'DW SIP MAKEUP 30ML - 3W1 TAWNY' 'DW SIP MAKEUP 30ML - 1N2 ECRU'
 'DW SIP MAKEUP 30ML - 3N1 IVORY BEIGE' 'DW SIP MAKEUP 30ML - 3N2 WHEAT']


In [12]:
uniqueSuccessorsDesc = np.array([normalize(x) for x in uniqueSuccessorsDescNN])
uniquePredecessorDescFiltered = np.array([normalize(x) for x in uniquePredecessorDescNNFiltered])

In [13]:
devDataframe.head()

Unnamed: 0,SUCCESSOR_SKU,SUCCESSOR_SKU_DESC,POTENTIAL_PREDECESSOR_SKU,POTENTIAL_PREDECESSOR_SKU_DESC
0,PH7GCC,DW SIP MAKEUP 30ML - 2N2 BUFF,TE1633,ARCHITECT SOFT MATTE 30ML - 16.5C MOCHA
1,PH7GCC,DW SIP MAKEUP 30ML - 2N2 BUFF,KNXW02,EB CL SPF 25 30ML - CN 20 Fair (VF)
2,PH7GCC,DW SIP MAKEUP 30ML - 2N2 BUFF,CUMM04,AO SKIN-BALNCNG FNDTN 10ML - L10-N
3,PH7GCC,DW SIP MAKEUP 30ML - 2N2 BUFF,PH7F77,DW SIP MAKEUP SPF10 30ML - 2C1 PURE BEIG
4,PH7GCC,DW SIP MAKEUP 30ML - 2N2 BUFF,H04Y05,GS SHEER COLOR MU 30ML - LEVEL 3 WARM


#### Phase 3: Embedding Step (BASIC)

In [14]:
modelST = SentenceTransformer("all-MiniLM-L6-v2")

In [15]:
successorEmbeddings = modelST.encode(uniqueSuccessorsDesc)

In [16]:
successorEmbeddings.shape

(10, 384)

In [17]:
predecessorDescEmbeddings = modelST.encode(uniquePredecessorDescFiltered)

In [18]:
predecessorDescEmbeddings.shape

(7354, 384)

#### Phase 5: Retrieval Logic (BASIC)

In [19]:
modelCE = CrossEncoder("cross-encoder/stsb-distilroberta-base")

In [20]:
similarities = modelST.similarity(successorEmbeddings, predecessorDescEmbeddings)

In [21]:
# Each row corresponds to a successor SKU description, each column corresponds to a predecessor SKU description
similarities.shape

torch.Size([10, 7354])
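
For intuition about what each entry of this matrix holds, the sketch below computes a small cosine-similarity matrix in plain Python. It assumes the model is using cosine similarity, which is my understanding of the default for this kind of model but is not confirmed by the notebook; `cosine` and `similarity_matrix` are hypothetical helpers and the vectors are toy values, not real embeddings.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def similarity_matrix(queries, corpus):
    """Rows follow `queries`, columns follow `corpus`."""
    return [[cosine(q, c) for c in corpus] for q in queries]

queries = [[1.0, 0.0], [1.0, 1.0]]
corpus = [[2.0, 0.0], [0.0, 3.0], [1.0, 1.0]]
sims = similarity_matrix(queries, corpus)
print(len(sims), len(sims[0]))  # 2 3
```

With 10 successor vectors and 7354 predecessor vectors, the same layout gives the 10 x 7354 shape shown above.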

In [22]:
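# dim=1 ranks along each row, i.e. over the predecessor candidates of each successor (dim=0 would rank down each column)
# k=7354 equals the number of candidates, so this is a full descending sort rather than a shortlist
top_sim, top_idx = similarities.topk(k=7354, dim=1)
type(top_idx)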

torch.Tensor

In [23]:
top_idx.shape

torch.Size([10, 7354])
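
The row-wise top-k selection can be illustrated on a tiny score matrix. This is a plain-Python sketch of the idea only; `topk_per_row` is a hypothetical helper, not the `torch` implementation, and the scores are made up.

```python
import heapq

def topk_per_row(matrix, k):
    """Return (values, indices) of the k largest entries in each row, best first."""
    top_values, top_indices = [], []
    for row in matrix:
        # Column indices of the k highest scores in this row, in descending score order
        best = heapq.nlargest(k, range(len(row)), key=row.__getitem__)
        top_indices.append(best)
        top_values.append([row[j] for j in best])
    return top_values, top_indices

scores = [[0.1, 0.9, 0.4, 0.7],
          [0.8, 0.2, 0.6, 0.3]]
vals, idx = topk_per_row(scores, k=2)
print(idx)  # [[1, 3], [0, 2]]
```

Each row of `idx` holds column positions, which is why the indices are later used to look up predecessor descriptions.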

In [25]:
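for i, succ in enumerate(uniqueSuccessorsDesc):
    # Candidate predecessors for this successor, ordered by bi-encoder similarity
    cand_idx = top_idx[i]
    cand_descs = uniquePredecessorDescFiltered[cand_idx]

    # Re-score every candidate with the cross-encoder; results come back sorted by descending score
    allRanks = modelCE.rank(succ, cand_descs)
    top5Ranks = allRanks[:5]

    print("Successor SKU: ", succ)
    for rank in top5Ranks:
        # corpus_id indexes into cand_descs, not into the original predecessor array
        print(f"{rank['score']:.2f}\t{cand_descs[rank['corpus_id']]}")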


Successor SKU:  dw sip makeup 30ml 2n2 buff
0.85	hd foundation 30ml dark d2 can wn
0.85	double wear sip liqui 30ml shadea2 wn
0.83	dw sip makeup 30ml 1c2 petal
0.82	hd foundation 30ml light l2
0.82	hd foundation 30ml dark d2
Successor SKU:  dw sip makeup 30ml 2n1 desert beige
0.93	dw sip makeup 30ml 1w2 sand
0.88	dw sip makeup 30ml 0c1 cool sand
0.85	dw sip makeup 30ml 4n2 spiced sand
0.84	rn liquid foundation 30ml desert beige
0.82	btw matte fdn 30ml sand
Successor SKU:  dw sip makeup 30ml 2w1 dawn
0.84	dw lgt sft mat hyd mu 30ml 2w1 dawn
0.83	double wear sip liqui 30ml 2w1 dawn
0.81	dw sip makeup spf10 30ml 2w1 dawn
0.80	dwsoftspf20 30ml 2w1 dawn wn
0.78	hd foundation 30ml dark d2 can wn
Successor SKU:  dw sip makeup 30ml 3c2 pebble
0.83	dw sip makeup 30ml 3w2 cashew
0.79	double wear sip liqui 30ml 3c2 pebble
0.78	double wear sip liqui 30ml 3w2 cashew
0.78	dw lgt sft mat hyd mu 30ml 3c2 pebble
0.78	dwsoftspf20 30ml 3c2 pebble wn
Successor SKU:  dw sip makeup 30ml 2c3 fresco
0.85	dw s
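
The loop above passes every candidate to the cross-encoder. A common variant of this retrieve-then-rerank pattern re-scores only a shortlist from the first stage; the output file name in the next cell suggests a shortlist of 200 may have been intended, but that is an assumption. The sketch below shows the control flow only: `token_overlap` is a stand-in scorer, NOT the cross-encoder, and `retrieve_then_rerank`, the first-stage scores, and the last corpus entry are made up for illustration.

```python
def token_overlap(a, b):
    """Stand-in scorer (Jaccard overlap of tokens); NOT the cross-encoder."""
    ta, tb = set(a.split()), set(b.split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0

def retrieve_then_rerank(query, corpus, first_stage_scores, rerank_fn,
                         shortlist=200, top_n=5):
    # Stage 1: keep the `shortlist` best candidates by the cheap first-stage score
    order = sorted(range(len(corpus)),
                   key=lambda j: first_stage_scores[j], reverse=True)
    shortlisted = order[:shortlist]
    # Stage 2: re-score only the shortlisted candidates with the expensive scorer
    rescored = [(rerank_fn(query, corpus[j]), j) for j in shortlisted]
    rescored.sort(key=lambda pair: pair[0], reverse=True)
    return [(score, corpus[j]) for score, j in rescored[:top_n]]

query = "dw sip makeup 30ml 2n2 buff"
corpus = ["dw sip makeup 30ml 1c2 petal", "hd foundation 30ml light l2",
          "btw matte fdn 30ml sand", "lipstick 3g red"]
first_stage = [0.9, 0.6, 0.5, 0.1]  # illustrative values, not model output
results = retrieve_then_rerank(query, corpus, first_stage, token_overlap,
                               shortlist=3, top_n=2)
print(results[0][1])  # dw sip makeup 30ml 1c2 petal
```

A smaller shortlist reduces the number of cross-encoder calls at the risk of dropping a relevant candidate before re-ranking, so the shortlist size is worth checking in the evaluation phase.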

In [26]:
# Same re-ranking loop as the previous cell, written to a text file instead of stdout.
# top_idx still covers all 7354 candidates, so nothing is cut to 200 despite the file name.
with open('cross_encoder_results_top200', 'w') as f:
    for i, succ in enumerate(uniqueSuccessorsDesc):
        cand_idx = top_idx[i]
        cand_descs = uniquePredecessorDescFiltered[cand_idx]

        allRanks = modelCE.rank(succ, cand_descs)
        top5Ranks = allRanks[:5]

        print('Successor SKU: ' + succ, file=f)
        for rank in top5Ranks:
            print(f"{rank['score']:.2f}\t{cand_descs[rank['corpus_id']]}", file=f)
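
If the results need to be loaded back into a dataframe or spreadsheet, a tabular layout may be easier to work with than the free-form text file. The sketch below writes one row per (successor, rank) pair using the standard `csv` module. The column names and the `write_results` helper are hypothetical; the two example rows reuse scores and descriptions from the output shown earlier.

```python
import csv
import io

def write_results(rows, handle):
    """rows: iterable of (successor_desc, [(score, predecessor_desc), ...])."""
    writer = csv.writer(handle)
    writer.writerow(["successor_desc", "rank", "score", "predecessor_desc"])
    for successor, ranked in rows:
        for position, (score, desc) in enumerate(ranked, start=1):
            writer.writerow([successor, position, f"{score:.2f}", desc])

rows = [("dw sip makeup 30ml 2w1 dawn",
         [(0.84, "dw lgt sft mat hyd mu 30ml 2w1 dawn"),
          (0.83, "double wear sip liqui 30ml 2w1 dawn")])]

buffer = io.StringIO()  # swap for open(path, "w", newline="") to write a real file
write_results(rows, buffer)
print(buffer.getvalue())
```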