## Semantic Model Development Project
#### Last Edit: 6/18/25 by Aneesh Naresh
##### The goal of this project is to develop a semantic similarity model that identifies the five most similar predecessor SKUs for a given successor SKU. The project is implemented in the phases listed below:

1. Environment Setup
2. Data Inspection and String-Level Cleanup
3. Embedding Step
4. Vector Store Choices
5. Retrieval Logic
6. Evaluation Loop
7. Packaging and Serving
8. Extensions

#### Phase 2: Data Inspection and String-Level Cleanup (BASIC)

In [118]:
import pandas as pd
import numpy as np
import torch
import sklearn
import tqdm
from sentence_transformers import SentenceTransformer, CrossEncoder
from normality import normalize

In [27]:
sku = pd.read_csv("data/npl_test_data_for_sentence_semantic_analysis.csv")

In [28]:
sku.head()

Unnamed: 0,SUCCESSOR_SKU,SUCCESSOR_SKU_DESC,POTENTIAL_PREDECESSOR_SKU,POTENTIAL_PREDECESSOR_SKU_DESC
0,PH7GCC,DW SIP MAKEUP 30ML - 2N2 BUFF,TE1633,ARCHITECT SOFT MATTE 30ML - 16.5C MOCHA
1,PH7GCC,DW SIP MAKEUP 30ML - 2N2 BUFF,KNXW02,EB CL SPF 25 30ML - CN 20 Fair (VF)
2,PH7GCC,DW SIP MAKEUP 30ML - 2N2 BUFF,CUMM04,AO SKIN-BALNCNG FNDTN 10ML - L10-N
3,PH7GCC,DW SIP MAKEUP 30ML - 2N2 BUFF,PH7F77,DW SIP MAKEUP SPF10 30ML - 2C1 PURE BEIG
4,PH7GCC,DW SIP MAKEUP 30ML - 2N2 BUFF,H04Y05,GS SHEER COLOR MU 30ML - LEVEL 3 WARM


In [36]:
def normalizeText(text: object):
    # Lowercase, replace punctuation with spaces, and collapse whitespace
    return normalize(text)

In [30]:
devDataframe = sku.copy()

In [31]:
devDataframe.head()

Unnamed: 0,SUCCESSOR_SKU,SUCCESSOR_SKU_DESC,POTENTIAL_PREDECESSOR_SKU,POTENTIAL_PREDECESSOR_SKU_DESC
0,PH7GCC,DW SIP MAKEUP 30ML - 2N2 BUFF,TE1633,ARCHITECT SOFT MATTE 30ML - 16.5C MOCHA
1,PH7GCC,DW SIP MAKEUP 30ML - 2N2 BUFF,KNXW02,EB CL SPF 25 30ML - CN 20 Fair (VF)
2,PH7GCC,DW SIP MAKEUP 30ML - 2N2 BUFF,CUMM04,AO SKIN-BALNCNG FNDTN 10ML - L10-N
3,PH7GCC,DW SIP MAKEUP 30ML - 2N2 BUFF,PH7F77,DW SIP MAKEUP SPF10 30ML - 2C1 PURE BEIG
4,PH7GCC,DW SIP MAKEUP 30ML - 2N2 BUFF,H04Y05,GS SHEER COLOR MU 30ML - LEVEL 3 WARM


In [42]:
devDataframe["SUCCESSOR_SKU"].apply(normalizeText)

0        ph7gcc
1        ph7gcc
2        ph7gcc
3        ph7gcc
4        ph7gcc
          ...  
76825    ph7g38
76826    ph7g38
76827    ph7g38
76828    ph7g38
76829    ph7g38
Name: SUCCESSOR_SKU, Length: 76830, dtype: object

In [52]:
for col in devDataframe.columns:
    devDataframe[col] = devDataframe[col].apply(normalizeText)

In [53]:
devDataframe.head()

Unnamed: 0,SUCCESSOR_SKU,SUCCESSOR_SKU_DESC,POTENTIAL_PREDECESSOR_SKU,POTENTIAL_PREDECESSOR_SKU_DESC
0,ph7gcc,dw sip makeup 30ml 2n2 buff,te1633,architect soft matte 30ml 16 5c mocha
1,ph7gcc,dw sip makeup 30ml 2n2 buff,knxw02,eb cl spf 25 30ml cn 20 fair vf
2,ph7gcc,dw sip makeup 30ml 2n2 buff,cumm04,ao skin balncng fndtn 10ml l10 n
3,ph7gcc,dw sip makeup 30ml 2n2 buff,ph7f77,dw sip makeup spf10 30ml 2c1 pure beig
4,ph7gcc,dw sip makeup 30ml 2n2 buff,h04y05,gs sheer color mu 30ml level 3 warm
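
The effect of the cleanup step can be approximated with the standard library alone. The sketch below is not the `normality` implementation; it is a simplified, ASCII-only stand-in (the helper name `normalize_sketch` is hypothetical) that reproduces the transformation visible in the table above: lowercase the text, turn punctuation into spaces, and collapse repeated whitespace.

```python
import re


def normalize_sketch(text: object) -> str:
    """Approximate the cleanup seen above (assumption: ASCII-only input)."""
    lowered = str(text).lower()
    # Replace every run of characters other than a-z and 0-9 with one space.
    spaced = re.sub(r"[^a-z0-9]+", " ", lowered)
    # Drop the leading/trailing space left behind by punctuation at the edges.
    return spaced.strip()


# Raw descriptions copied from the first rows of the dataset.
examples = [
    "DW SIP MAKEUP 30ML - 2N2 BUFF",
    "ARCHITECT SOFT MATTE 30ML - 16.5C MOCHA",
    "EB CL SPF 25 30ML - CN 20 Fair (VF)",
]
cleaned = [normalize_sketch(raw) for raw in examples]
for line in cleaned:
    print(line)
```

Note that punctuation inside a token is also split, so `16.5C` becomes `16 5c`. Whether that loss of the decimal point matters for shade codes is worth checking during evaluation.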


#### Phase 3: Embedding Step (BASIC)

In [74]:
modelST = SentenceTransformer("all-MiniLM-L6-v2")

In [75]:
uniqueSuccessors = pd.unique(devDataframe["SUCCESSOR_SKU_DESC"])

In [76]:
uniquePredecessorDesc = pd.unique(devDataframe["POTENTIAL_PREDECESSOR_SKU_DESC"])

In [77]:
uniquePredecessors = pd.unique(devDataframe["POTENTIAL_PREDECESSOR_SKU"])

In [78]:
successorEmbeddings = modelST.encode(uniqueSuccessors)

In [79]:
successorEmbeddings.shape

(10, 384)

In [80]:
predecessorDescEmbeddings = modelST.encode(uniquePredecessorDesc)

In [81]:
predecessorDescEmbeddings.shape

(7337, 384)
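
Only the unique descriptions are embedded (10 successors and 7337 predecessors out of 76830 rows), so a retrieved description still has to be mapped back to a SKU code. `uniquePredecessors` and `uniquePredecessorDesc` are deduplicated independently and are not guaranteed to line up by position if several SKUs share one description. A minimal sketch of one way to keep that link, using plain Python on a few illustrative rows (the SKU `zz0001` is hypothetical, added only to show a shared description):

```python
from collections import defaultdict

# (predecessor SKU, normalized description) pairs; the first two come from
# the dataset preview, the last two are illustrative.
rows = [
    ("te1633", "architect soft matte 30ml 16 5c mocha"),
    ("knxw02", "eb cl spf 25 30ml cn 20 fair vf"),
    ("te1633", "architect soft matte 30ml 16 5c mocha"),  # repeated pair
    ("zz0001", "eb cl spf 25 30ml cn 20 fair vf"),  # hypothetical SKU, same text
]

# dict.fromkeys keeps first-seen order, the same ordering pd.unique returns.
unique_descs = list(dict.fromkeys(desc for _, desc in rows))

# Map each description to every SKU code that carries it.
desc_to_skus = defaultdict(list)
for sku_code, desc in rows:
    if sku_code not in desc_to_skus[desc]:
        desc_to_skus[desc].append(sku_code)

for desc in unique_descs:
    print(desc, "->", desc_to_skus[desc])
```

With a lookup like this, each retrieved description can be expanded into the predecessor SKU codes the project is ultimately meant to return.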

#### Phase 5: Retrieval Logic (BASIC)

In [82]:
modelCE = CrossEncoder("cross-encoder/stsb-distilroberta-base")

In [116]:
similarities = modelST.similarity(successorEmbeddings, predecessorDescEmbeddings)

In [117]:
# Each row corresponds to a successor SKU description, each column corresponds to a predecessor SKU description
similarities.shape

torch.Size([10, 7337])
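
For intuition about what this matrix holds, here is a small pure-Python sketch. It assumes the scores are cosine similarities, which is believed to be the sentence-transformers default unless the model is configured otherwise; the two-dimensional toy vectors and the helper names are made up for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def similarity_matrix(queries, corpus):
    """One row per query vector, one column per corpus vector."""
    return [[cosine_similarity(q, c) for c in corpus] for q in queries]


# Toy stand-ins for the successor and predecessor embeddings.
successor_vecs = [[1.0, 0.0], [1.0, 1.0]]
predecessor_vecs = [[2.0, 0.0], [0.0, 3.0], [1.0, 1.0]]

matrix = similarity_matrix(successor_vecs, predecessor_vecs)
for row in matrix:
    print([round(score, 3) for score in row])
```

Two successors against three predecessors give a 2 x 3 matrix, in the same way that 10 successors against 7337 predecessors give the `[10, 7337]` shape above. Vector length does not affect the score, only direction.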

In [122]:
top_sim, top_idx = similarities.topk(k=50, dim=1)
type(top_idx)

torch.Tensor

In [123]:
top_idx.shape

torch.Size([10, 50])
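
The candidate selection done by `topk` can be illustrated on a single row of scores. This is a sketch in plain Python rather than torch; the helper name `top_k` and the score values are made up for illustration.

```python
import heapq


def top_k(scores, k):
    """Return (values, indices) of the k highest scores, best first."""
    best = heapq.nlargest(k, enumerate(scores), key=lambda pair: pair[1])
    indices = [index for index, _ in best]
    values = [score for _, score in best]
    return values, indices


# One illustrative row of similarity scores (one successor, five predecessors).
row = [0.12, 0.91, 0.45, 0.78, 0.30]
values, indices = top_k(row, k=3)
print(values)
print(indices)
```

The indices, not the values, are what the next step needs: they select the candidate descriptions that are passed to the cross-encoder for re-scoring.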

In [136]:
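for i, succ in enumerate(uniqueSuccessors):
    # Stage 1: positions of the 50 bi-encoder candidates for this successor
    cand_idx = top_idx[i].cpu().numpy()
    cand_descs = uniquePredecessorDesc[cand_idx]

    # Stage 2: re-score the candidates with the cross-encoder (results are sorted best-first)
    allRanks = modelCE.rank(succ, cand_descs)
    top5Ranks = allRanks[:5]

    # Note: succ is the successor description, not the SKU code
    print("Successor SKU: ", succ)
    for rank in top5Ranks:
        # corpus_id indexes into cand_descs, not into uniquePredecessorDesc
        print(f"{rank['score']:.2f}\t{cand_descs[rank['corpus_id']]}")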

Successor SKU:  dw sip makeup 30ml 2n2 buff
0.99	dw sip makeup 30ml 2n2 buff
0.83	dw sip makeup 30ml 1n2 ecru
0.83	dw sip makeup 30ml 1c2 petal
0.80	dw sip makeup 30ml 2c3 fresco
0.78	dw sip makeup spf10 30ml 2n2 buff
Successor SKU:  dw sip makeup 30ml 2n1 desert beige
0.99	dw sip makeup 30ml 2n1 desert beige
0.91	dw sip makeup 30ml 2c1 pure beige
0.76	dw sip makeup spf10 30ml 2n1 desert be
0.76	dw sip makeup 30ml 4c1 outdoor beige
0.74	dw sip makeup 30ml 4n1 shell beige
Successor SKU:  dw sip makeup 30ml 2w1 dawn
0.99	dw sip makeup 30ml 2w1 dawn
0.83	dw sip makeup 30ml 2n2 buff
0.81	dw sip makeup spf10 30ml 2w1 dawn
0.76	dw sip makeup 30ml 1c2 petal
0.73	dw sip makeup 30ml 2n3 dolce
Successor SKU:  dw sip makeup 30ml 3c2 pebble
0.99	dw sip makeup 30ml 3c2 pebble
0.83	dw sip makeup 30ml 3w2 cashew
0.80	dw sip makeup 30ml 2c3 fresco
0.79	double wear sip liqui 30ml 3c2 pebble
0.76	dw sip makeup spf10 30ml 3c2 pebble
Successor SKU:  dw sip makeup 30ml 2c3 fresco
0.99	dw sip makeup 30ml 2c