## Semantic Model Development Project
#### Last Edit: 6/17/25 by Aneesh Naresh
##### The goal of this project is to develop a semantic similarity model that, given a successor SKU, identifies its five most similar predecessor SKUs. The project is implemented in the phases listed below:

1. Environment Setup
2. Data Inspection and String-Level Cleanup
3. Embedding Step
4. Vector Store Choices
5. Retrieval Logic
6. Evaluation Loop
7. Packaging and Serving
8. Extensions
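
To make the stated goal concrete, here is a minimal sketch of "five most similar" as a top-k cosine similarity lookup. It is not the project's retrieval logic (that belongs to Phase 5). The function name `top_k_similar`, the random toy vectors, and the choice of cosine similarity are all assumptions made for illustration.

```python
import numpy as np

def top_k_similar(query_vec, candidate_vecs, k=5):
    """Return (indices, scores) of the k candidates most similar to query_vec.

    Hypothetical helper: assumes no vector has zero length.
    """
    query = np.asarray(query_vec, dtype=float)
    cands = np.asarray(candidate_vecs, dtype=float)
    # L2-normalize so that a dot product equals cosine similarity
    query = query / np.linalg.norm(query)
    cands = cands / np.linalg.norm(cands, axis=1, keepdims=True)
    scores = cands @ query
    k = min(k, len(scores))
    # Sort by descending score and keep the first k positions
    order = np.argsort(-scores)[:k]
    return order, scores[order]

# Toy data: 20 random "predecessor" vectors and one "successor" vector
# that is a slightly perturbed copy of candidate 3.
rng = np.random.default_rng(0)
candidates = rng.normal(size=(20, 8))
successor = candidates[3] + 0.01 * rng.normal(size=8)

idx, sims = top_k_similar(successor, candidates, k=5)
print(idx, sims)
```

In the real pipeline the vectors would presumably come from the embedding step rather than a random generator.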

#### Phase 2: Data Inspection and String-Level Cleanup (BASIC)

In [41]:
import pandas as pd
import numpy as np
import sklearn
import tqdm
from sentence_transformers import SentenceTransformer, CrossEncoder
from normality import normalize

In [27]:
sku = pd.read_csv("data/npl_test_data_for_sentence_semantic_analysis.csv")

In [28]:
sku.head()

Unnamed: 0,SUCCESSOR_SKU,SUCCESSOR_SKU_DESC,POTENTIAL_PREDECESSOR_SKU,POTENTIAL_PREDECESSOR_SKU_DESC
0,PH7GCC,DW SIP MAKEUP 30ML - 2N2 BUFF,TE1633,ARCHITECT SOFT MATTE 30ML - 16.5C MOCHA
1,PH7GCC,DW SIP MAKEUP 30ML - 2N2 BUFF,KNXW02,EB CL SPF 25 30ML - CN 20 Fair (VF)
2,PH7GCC,DW SIP MAKEUP 30ML - 2N2 BUFF,CUMM04,AO SKIN-BALNCNG FNDTN 10ML - L10-N
3,PH7GCC,DW SIP MAKEUP 30ML - 2N2 BUFF,PH7F77,DW SIP MAKEUP SPF10 30ML - 2C1 PURE BEIG
4,PH7GCC,DW SIP MAKEUP 30ML - 2N2 BUFF,H04Y05,GS SHEER COLOR MU 30ML - LEVEL 3 WARM
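
The `Unnamed: 0` column above is what pandas produces when a CSV header starts with an empty field, which usually means the file was written with its index. Whether that is the case for this file is an assumption. The sketch below uses a small in-memory CSV (two rows copied from the output above) to show how `index_col=0` would read that column as the index instead.

```python
import io
import pandas as pd

# Toy CSV with an empty first header field, mimicking a file saved with its index
csv_text = (
    ",SUCCESSOR_SKU,SUCCESSOR_SKU_DESC\n"
    "0,PH7GCC,DW SIP MAKEUP 30ML - 2N2 BUFF\n"
    "1,PH7GCC,DW SIP MAKEUP 30ML - 2N2 BUFF\n"
)

# Default read: the unnamed first field becomes a regular column
default_read = pd.read_csv(io.StringIO(csv_text))

# With index_col=0 the first field is used as the row index
indexed_read = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(default_read.columns))
print(list(indexed_read.columns))
```

Reading the index this way would also keep it out of any later loop that processes every column.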


In [36]:
def normalizeText(text: object):
    """Lowercase the value, replace punctuation with spaces, and collapse whitespace."""
    return normalize(text)
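
To show what this cleanup does to a description, here is a rough standard-library stand-in for `normalize`. The name `normalize_approx` is hypothetical and the function only approximates the library's default behaviour: it lowercases, turns every character that is not a letter or digit into a space, and collapses runs of whitespace. It does not claim to match `normality` on every input.

```python
import unicodedata

def normalize_approx(text):
    """Approximate the effect of normalize() using only the standard library."""
    if text is None:
        return None
    lowered = str(text).lower()
    # Keep letters (L*) and numbers (N*); replace every other character with a space
    kept = [
        ch if unicodedata.category(ch)[0] in ("L", "N") else " "
        for ch in lowered
    ]
    # Splitting and re-joining collapses repeated whitespace and trims the ends
    collapsed = " ".join("".join(kept).split())
    return collapsed or None

print(normalize_approx("DW SIP MAKEUP 30ML - 2N2 BUFF"))
# dw sip makeup 30ml 2n2 buff
print(normalize_approx("ARCHITECT SOFT MATTE 30ML - 16.5C MOCHA"))
# architect soft matte 30ml 16 5c mocha
```

Note that the decimal point in `16.5C` is treated as punctuation, so the shade code becomes `16 5c`. That may matter later if shade codes are expected to match exactly.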

In [30]:
devDataframe = sku.copy()

In [31]:
devDataframe.head()

Unnamed: 0,SUCCESSOR_SKU,SUCCESSOR_SKU_DESC,POTENTIAL_PREDECESSOR_SKU,POTENTIAL_PREDECESSOR_SKU_DESC
0,PH7GCC,DW SIP MAKEUP 30ML - 2N2 BUFF,TE1633,ARCHITECT SOFT MATTE 30ML - 16.5C MOCHA
1,PH7GCC,DW SIP MAKEUP 30ML - 2N2 BUFF,KNXW02,EB CL SPF 25 30ML - CN 20 Fair (VF)
2,PH7GCC,DW SIP MAKEUP 30ML - 2N2 BUFF,CUMM04,AO SKIN-BALNCNG FNDTN 10ML - L10-N
3,PH7GCC,DW SIP MAKEUP 30ML - 2N2 BUFF,PH7F77,DW SIP MAKEUP SPF10 30ML - 2C1 PURE BEIG
4,PH7GCC,DW SIP MAKEUP 30ML - 2N2 BUFF,H04Y05,GS SHEER COLOR MU 30ML - LEVEL 3 WARM


In [42]:
# Preview the normalization on one column; the result is not stored
devDataframe["SUCCESSOR_SKU"].apply(normalizeText)

0        ph7gcc
1        ph7gcc
2        ph7gcc
3        ph7gcc
4        ph7gcc
          ...  
76825    ph7g38
76826    ph7g38
76827    ph7g38
76828    ph7g38
76829    ph7g38
Name: SUCCESSOR_SKU, Length: 76830, dtype: object
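
The preview above shows 76,830 rows in which the same successor repeats many times, so a few inspection counts are useful before embedding anything. The sketch below runs them on a toy frame. The column names come from this notebook, but the rows are illustrative: the pairings for `PH7G38` and the missing description are made up for the example.

```python
import pandas as pd

# Toy frame using the notebook's column names; the rows are illustrative only
toy = pd.DataFrame({
    "SUCCESSOR_SKU": ["PH7GCC", "PH7GCC", "PH7GCC", "PH7G38", "PH7G38"],
    "POTENTIAL_PREDECESSOR_SKU": ["TE1633", "KNXW02", "CUMM04", "TE1633", "PH7F77"],
    "POTENTIAL_PREDECESSOR_SKU_DESC": [
        "ARCHITECT SOFT MATTE 30ML - 16.5C MOCHA",
        "EB CL SPF 25 30ML - CN 20 Fair (VF)",
        None,  # made-up missing value to exercise the null check
        "ARCHITECT SOFT MATTE 30ML - 16.5C MOCHA",
        "DW SIP MAKEUP SPF10 30ML - 2C1 PURE BEIG",
    ],
})

# How many distinct successors are there?
n_successors = toy["SUCCESSOR_SKU"].nunique()

# How many distinct candidate predecessors does each successor have?
candidates_per_successor = (
    toy.groupby("SUCCESSOR_SKU")["POTENTIAL_PREDECESSOR_SKU"].nunique()
)

# Missing values per column
missing = toy.isna().sum()

# Distinct descriptions: each one would only need to be embedded once
unique_descs = toy["POTENTIAL_PREDECESSOR_SKU_DESC"].dropna().drop_duplicates()

print(n_successors)
print(candidates_per_successor)
print(missing)
print(len(unique_descs))
```

Swapping `toy` for `devDataframe` would give the same counts on the real data.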

In [52]:
# Normalize every column in place, including the SKU codes and the index column
for col in devDataframe.columns:
    devDataframe[col] = devDataframe[col].apply(normalizeText)

In [53]:
devDataframe.head()

Unnamed: 0,SUCCESSOR_SKU,SUCCESSOR_SKU_DESC,POTENTIAL_PREDECESSOR_SKU,POTENTIAL_PREDECESSOR_SKU_DESC
0,ph7gcc,dw sip makeup 30ml 2n2 buff,te1633,architect soft matte 30ml 16 5c mocha
1,ph7gcc,dw sip makeup 30ml 2n2 buff,knxw02,eb cl spf 25 30ml cn 20 fair vf
2,ph7gcc,dw sip makeup 30ml 2n2 buff,cumm04,ao skin balncng fndtn 10ml l10 n
3,ph7gcc,dw sip makeup 30ml 2n2 buff,ph7f77,dw sip makeup spf10 30ml 2c1 pure beig
4,ph7gcc,dw sip makeup 30ml 2n2 buff,h04y05,gs sheer color mu 30ml level 3 warm


#### Phase 3: Embedding Step (BASIC)

In [None]:
ta