
<p align="center"><span style="font-size: 36px;">⌬ Duplicate Matching for E‑commerce (Siamese‑style)</span></p>

**Institute:** ISI Kolkata  
**Author:** Your Name  
**Goal:** Build a strong, Meesho‑aligned **duplicate product matching** module using **Siamese‑style embeddings** (CLIP for images, SBERT for titles). Includes: metrics, retrieval, and exportable artifacts for FastAPI.

---

⟢ Overview ⟣  
- We implement a **duplicate matching** pipeline using **pretrained encoders** (Siamese/contrastive representations).  
- Dataset: **Shopee – Price Match Guarantee** (images + titles + duplicate/group labels).  
- We ship:  
  1) **Baseline** using CLIP+SBERT embeddings + thresholding (logistic regression).  
  2) **Top‑K retrieval** with FAISS (for near‑duplicate discovery).  
  3) Artifacts saved for easy **FastAPI** integration.

> Tip: Run this in **Google Colab** (GPU runtime preferred).  


⟩ Setup & Installs

In [1]:

# Installs the required packages (works in Colab or a local Jupyter kernel).
# Comment these lines out if your environment already provides them.
!pip -q install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu121
!pip -q install sentence-transformers==2.7.0
!pip -q install ftfy regex tqdm
!pip -q install faiss-cpu
!pip -q install timm
!pip -q install open_clip_torch
!pip -q install scikit-learn matplotlib pandas numpy pillow

# Kaggle CLI for dataset download (optional; requires kaggle.json)
!pip -q install kaggle



⟩ Imports & Global Config

In [5]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:

import os, json, random, math, gc, shutil, zipfile, glob
from pathlib import Path
import numpy as np
import pandas as pd
import torch
import torch.nn.functional as F
from PIL import Image
from tqdm import tqdm

# Text encoder
from sentence_transformers import SentenceTransformer

# Image encoder (OpenCLIP / CLIP)
import open_clip

# ANN index
import faiss

# Metrics & simple models
from sklearn.metrics import roc_auc_score, average_precision_score, f1_score, precision_recall_curve, roc_curve
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

import matplotlib.pyplot as plt

SEED = 42
random.seed(SEED); np.random.seed(SEED); torch.manual_seed(SEED)

DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
print("Device:", DEVICE)

# Paths
BASE_DIR = Path.cwd()
DATA_DIR = BASE_DIR / "data"
IMAGES_DIR = DATA_DIR / "train_images"    # Shopee default folder name after unzip
OUTPUT_DIR = BASE_DIR / "artifacts"
OUTPUT_DIR.mkdir(exist_ok=True, parents=True)

# Models
IMG_MODEL_NAME = "ViT-B-32"     # OpenCLIP/CLIP backbone
IMG_MODEL_PRETRAINED = "openai" # weights
TXT_MODEL_NAME = "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2"

# Embedding config
IMAGE_EMB_DIM = 512  # ViT-B/32
TEXT_EMB_DIM  = 384  # MiniLM L12 v2
FUSION_ALPHA = 0.5   # weight for image vs text in fused similarity


Device: cpu


⟩ Download Dataset (Shopee – Price Match Guarantee)

In [14]:

# Path where kaggle.json is stored in your Google Drive
drive_kaggle = Path("/content/drive/MyDrive/Kaggle_API/kaggle.json")

# Destination: ~/.kaggle/kaggle.json
kaggle_dir = Path.home() / ".kaggle"
kaggle_json = kaggle_dir / "kaggle.json"

# Make sure ~/.kaggle exists
kaggle_dir.mkdir(parents=True, exist_ok=True)

# Copy file if present
if drive_kaggle.exists():
    !cp "{drive_kaggle}" "{kaggle_json}"
    !chmod 600 "{kaggle_json}"
    print("✅ kaggle.json copied to ~/.kaggle")

if kaggle_json.exists():
    print("Downloading Shopee dataset...")
    !kaggle competitions download -c shopee-product-matching -p data -w
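    # Unzip any archives found in data/. Note that -w saves the download to the
    # current directory (see the log below), so this loop may find nothing;
    # the next section unzips the archive from the working directory instead.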
    for z in glob.glob("data/*.zip"):
        print("Unzipping:", z)
        with zipfile.ZipFile(z, 'r') as zip_ref:
            zip_ref.extractall("data")
else:
    print("❌ kaggle.json not found. Please check path:", drive_kaggle)

✅ kaggle.json copied to ~/.kaggle
Downloading Shopee dataset...
Downloading shopee-product-matching.zip to .
100% 1.68G/1.68G [00:20<00:00, 86.4MB/s]


⟩ Load Metadata & Quick EDA

In [22]:
!unzip -q shopee-product-matching.zip -d shopee_data


In [23]:
DATA_DIR = Path("/content/shopee_data")
train_csv = DATA_DIR / "train.csv"
IMAGES_DIR = DATA_DIR / "train_images"

assert train_csv.exists(), "❌ train.csv still not found."
assert IMAGES_DIR.exists(), "❌ train_images folder not found."

# Load metadata
df = pd.read_csv(train_csv)
print(df.head())
print("Rows:", len(df), "| Unique label_groups:", df['label_group'].nunique())

# Add image paths
df['image_path'] = df['image'].apply(lambda x: str(IMAGES_DIR / x))
df = df[df['image_path'].apply(os.path.exists)].reset_index(drop=True)
print("After filtering missing images:", len(df))

         posting_id                                 image       image_phash  \
0   train_129225211  0000a68812bc7e98c42888dfb1c07da0.jpg  94974f937d4c2433   
1  train_3386243561  00039780dfc94d01db8676fe789ecd05.jpg  af3f9460c2838f0f   
2  train_2288590299  000a190fdd715a2a36faed16e2c65df7.jpg  b94cb00ed3e50f78   
3  train_2406599165  00117e4fc239b1b641ff08340b429633.jpg  8514fc58eafea283   
4  train_3369186413  00136d1cf4edede0203f32f05f660588.jpg  a6f319f924ad708c   

                                               title  label_group  
0                          Paper Bag Victoria Secret    249114794  
1  Double Tape 3M VHB 12 mm x 4,5 m ORIGINAL / DO...   2937985045  
2        Maling TTS Canned Pork Luncheon Meat 397 gr   2395904891  
3  Daster Batik Lengan pendek - Motif Acak / Camp...   4093212188  
4                  Nescafe \xc3\x89clair Latte 220ml   3648931069  
Rows: 34250 | Unique label_groups: 11014
After filtering missing images: 34250


⟩ Load Encoders (Image: OpenCLIP/CLIP, Text: SBERT)

In [24]:

# Text model
txt_model = SentenceTransformer(TXT_MODEL_NAME, device=DEVICE)

# Image model & preprocess
img_model, _, img_preprocess = open_clip.create_model_and_transforms(
    IMG_MODEL_NAME, pretrained=IMG_MODEL_PRETRAINED, device=DEVICE
)
img_model.eval()

# Tokenizer for CLIP text if needed (not used here since we use SBERT)
tokenizer = open_clip.get_tokenizer(IMG_MODEL_NAME)





⟩ Utils: Image / Text Embedding

In [25]:

@torch.no_grad()
def embed_image_paths(paths, batch_size=64):
    embs = []
    for i in tqdm(range(0, len(paths), batch_size), desc="Embedding images"):
        batch_paths = paths[i:i+batch_size]
        imgs = []
        for p in batch_paths:
            img = Image.open(p).convert("RGB")
            imgs.append(img_preprocess(img))
        pixel_batch = torch.stack(imgs).to(DEVICE)
        feats = img_model.encode_image(pixel_batch)
        feats = F.normalize(feats, dim=-1)
        embs.append(feats.detach().cpu().numpy())
    return np.vstack(embs)

def embed_texts(texts, batch_size=128):
    # normalize_embeddings=True makes encode() return L2-normalized vectors,
    # so a plain dot product between two outputs is their cosine similarity.
    all_embs = []
    for i in tqdm(range(0, len(texts), batch_size), desc="Embedding texts"):
        batch = texts[i:i+batch_size]
        e = txt_model.encode(batch, show_progress_bar=False, convert_to_numpy=True, normalize_embeddings=True)
        all_embs.append(e)
    return np.vstack(all_embs)

def cosine_sim(a, b):
    # expects L2-normalized vectors
    return (a * b).sum(axis=-1)
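
The helpers above rely on one identity: for L2-normalized vectors, the dot product equals cosine similarity. The sketch below checks that on toy vectors and shows how a `FUSION_ALPHA`-style weight blends two similarities. The vectors, the text similarity of 0.5, and the helper names `l2_normalize`, `dot`, and `cosine` are illustrative assumptions, not part of the pipeline.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)

def dot(a, b):
    """Plain dot product of two equal-length sequences."""
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    """Textbook cosine similarity, computed without pre-normalizing."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# Toy 3-d "embeddings" (hypothetical values, for illustration only)
a, b = [3.0, 4.0, 0.0], [4.0, 3.0, 0.0]
sim_via_dot = dot(l2_normalize(a), l2_normalize(b))
sim_via_cos = cosine(a, b)

# Weighted fusion: alpha * image_sim + (1 - alpha) * text_sim
alpha = 0.5
made_up_text_sim = 0.5  # assumption: stands in for a real title similarity
fused = alpha * sim_via_dot + (1.0 - alpha) * made_up_text_sim

print(sim_via_dot, sim_via_cos, fused)
```

Because both routes give the same number, `cosine_sim` can stay a one-line dot product as long as every embedding is normalized before it is stored.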


⟩ Compute Dataset Embeddings

In [None]:

titles = df['title'].astype(str).tolist()
image_paths = df['image_path'].tolist()

text_embs  = embed_texts(titles, batch_size=256)
image_embs = embed_image_paths(image_paths, batch_size=64)

np.save(OUTPUT_DIR / "text_embs.npy", text_embs)
np.save(OUTPUT_DIR / "image_embs.npy", image_embs)
df.to_csv(OUTPUT_DIR / "meta.csv", index=False)
print("Saved embeddings & metadata to:", OUTPUT_DIR)


Embedding texts: 100%|██████████| 134/134 [19:35<00:00,  8.77s/it]
Embedding images:   5%|▍         | 25/536 [06:05<2:19:39, 16.40s/it]

⟩ Build FAISS Indices (Text & Image)

In [None]:

# FAISS inner product (IndexFlatIP) equals cosine similarity when vectors are L2-normalized.
# Both text_embs and image_embs were already normalized when they were computed.

def build_faiss_index(vecs):
    dim = vecs.shape[1]
    index = faiss.IndexFlatIP(dim)  # cosine if inputs are L2-normalized
    index.add(vecs.astype('float32'))
    return index

faiss_text = build_faiss_index(text_embs.astype('float32'))
faiss_image = build_faiss_index(image_embs.astype('float32'))

faiss.write_index(faiss_text, str(OUTPUT_DIR / "faiss_text.index"))
faiss.write_index(faiss_image, str(OUTPUT_DIR / "faiss_image.index"))
print("FAISS indices saved.")
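
To make clear what an exact inner-product index returns, here is a NumPy-only sketch of the same top-K search. It is a stand-in for illustration, not FAISS itself. The function name `topk_inner_product` and the random vectors are assumptions.

```python
import numpy as np

def topk_inner_product(index_vecs, queries, k):
    """Exact top-k by inner product; returns (scores, ids), best match first."""
    scores = queries @ index_vecs.T                    # shape: (n_queries, n_index)
    ids = np.argsort(-scores, axis=1)[:, :k]           # highest score first
    return np.take_along_axis(scores, ids, axis=1), ids

# Random stand-in embeddings (hypothetical), L2-normalized row by row
rng = np.random.default_rng(0)
vecs = rng.normal(size=(100, 16)).astype("float32")
vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)

# Query with the first three indexed vectors; each should retrieve itself first
D, I = topk_inner_product(vecs, vecs[:3], k=5)
print(I[:, 0], D[:, 0])
```

Each query finds itself at rank 0 with a score close to 1, which is why the retrieval evaluation later asks for `K+1` neighbors and drops the query item.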


⟩ Create Positive/Negative Pairs for Thresholding

In [None]:

# Construct a manageable set of pairs for train/val
# Positive: same label_group
# Negative: different label_group

group_to_indices = df.groupby('label_group').indices
label_groups = list(group_to_indices.keys())

pos_pairs = []
for g, idxs in group_to_indices.items():
    idxs = list(idxs)
    if len(idxs) < 2:
        continue
    for i in range(len(idxs)-1):
        pos_pairs.append((idxs[i], idxs[i+1], 1))

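# Sample negatives roughly matching the number of positives.
# Drawing two random ints and comparing against a NumPy label array avoids the
# slow np.random.choice(..., replace=False) call and per-row df.loc lookups.
neg_pairs = []
group_labels = df['label_group'].to_numpy()
target_negs = len(pos_pairs)
while len(neg_pairs) < target_negs:
    i, j = np.random.randint(0, len(df), size=2)
    if group_labels[i] != group_labels[j]:  # different groups also implies i != j
        neg_pairs.append((int(i), int(j), 0))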

pairs = pos_pairs + neg_pairs
random.shuffle(pairs)
len(pos_pairs), len(neg_pairs), len(pairs)
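
The positive-pair loop links only consecutive items within each group, so a group of n items yields n - 1 pairs rather than all n(n - 1)/2. The sketch below contrasts the two strategies on one toy group. The helper names `chain_pairs` and `all_pairs` and the indices are illustrative assumptions.

```python
from itertools import combinations

def chain_pairs(idxs):
    """Consecutive pairs only: n items give n - 1 pairs (what the loop above does)."""
    return [(idxs[k], idxs[k + 1]) for k in range(len(idxs) - 1)]

def all_pairs(idxs):
    """Every unordered pair: n items give n * (n - 1) / 2 pairs."""
    return list(combinations(idxs, 2))

group = [10, 11, 12, 13]  # hypothetical row indices sharing one label_group
chain = chain_pairs(group)
full = all_pairs(group)
print(len(chain), len(full))
```

Chaining keeps the pair set small and stops large groups from dominating, at the cost of never showing the classifier some harder within-group pairs. Which trade-off is better here is an open choice, not something the notebook measures.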


⟩ Pairwise Features & Train/Val Split

In [None]:

def pair_features(pairs):
    feats = []
    y = []
    for i, j, lab in pairs:
        img_sim = float((image_embs[i] * image_embs[j]).sum())
        txt_sim = float((text_embs[i] * text_embs[j]).sum())
        fused_sim = FUSION_ALPHA*img_sim + (1.0 - FUSION_ALPHA)*txt_sim
        feats.append([img_sim, txt_sim, fused_sim])
        y.append(lab)
    X = np.array(feats, dtype=np.float32)
    y = np.array(y, dtype=np.int32)
    return X, y

X, y = pair_features(pairs)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.25, random_state=SEED, stratify=y)

print("Train pairs:", len(y_train), "Val pairs:", len(y_val))
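
`train_test_split` splits at the pair level, so the same product (or the same `label_group`) can appear in both train and validation pairs, which may make validation scores optimistic. One possible alternative, sketched below, is to split by group first. The function `split_pairs_by_group` and the toy data are hypothetical, and pairs that straddle the two sides are simply dropped.

```python
import random

def split_pairs_by_group(pairs, item_group, val_frac=0.25, seed=42):
    """Split (i, j, label) pairs so that no label group appears on both sides.

    Pairs with one item in a train group and one in a validation group are dropped.
    """
    groups = sorted(set(item_group.values()))
    rng = random.Random(seed)
    rng.shuffle(groups)
    n_val = max(1, int(len(groups) * val_frac))
    val_groups = set(groups[:n_val])

    train, val = [], []
    for i, j, lab in pairs:
        in_val = (item_group[i] in val_groups, item_group[j] in val_groups)
        if all(in_val):
            val.append((i, j, lab))
        elif not any(in_val):
            train.append((i, j, lab))
    return train, val, val_groups

# Toy data: item index -> label_group (hypothetical values)
item_group = {0: "a", 1: "a", 2: "b", 3: "b", 4: "c", 5: "c", 6: "d", 7: "d"}
toy_pairs = [(0, 1, 1), (2, 3, 1), (4, 5, 1), (6, 7, 1),
             (0, 2, 0), (4, 6, 0), (1, 7, 0)]
train_pairs, val_pairs, val_groups = split_pairs_by_group(toy_pairs, item_group)
print(len(train_pairs), len(val_pairs), val_groups)
```

In this notebook, `item_group` could be built with `df['label_group'].to_dict()`. With so few validation groups in the toy data, the validation side has no negatives, so a real run would need enough groups on each side to keep both classes present.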


⟩ Learn a Decision Threshold (Logistic Regression)

In [None]:

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)

probs = clf.predict_proba(X_val)[:,1]
auc = roc_auc_score(y_val, probs)
ap  = average_precision_score(y_val, probs)
print("Val ROC-AUC:", round(auc, 4), "| PR-AUC:", round(ap, 4))

# Pick operating threshold by maximizing F1
prec, rec, ths = precision_recall_curve(y_val, probs)
# prec and rec have one more entry than ths (the final point has no threshold), so drop it
f1s = 2 * prec[:-1] * rec[:-1] / (prec[:-1] + rec[:-1] + 1e-12)
best_idx = int(np.argmax(f1s))
best_th = ths[best_idx]
print("Best F1:", round(f1s[best_idx], 4), "| Best threshold:", float(best_th))

# Plot ROC (single chart)
fpr, tpr, _ = roc_curve(y_val, probs)
plt.figure()
plt.plot(fpr, tpr, label=f"ROC-AUC={auc:.3f}")
plt.plot([0,1], [0,1], linestyle="--")
plt.xlabel("FPR"); plt.ylabel("TPR"); plt.title("ROC Curve (Validation)"); plt.legend()
plt.show()

# Save classifier & threshold
import pickle
with open(OUTPUT_DIR / "threshold_clf.pkl", "wb") as f:
    pickle.dump({"clf": clf, "alpha": FUSION_ALPHA, "best_threshold": float(best_th)}, f)

print("Saved threshold model to:", OUTPUT_DIR / "threshold_clf.pkl")
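
The threshold selection above can be checked by hand on a tiny example. The sketch below sweeps every distinct score as a candidate cutoff and keeps the one with the highest F1. It is a plain-Python illustration with made-up labels and scores, and `best_f1_threshold` is a hypothetical helper, not a scikit-learn function.

```python
def best_f1_threshold(y_true, scores):
    """Return (threshold, f1) maximizing F1 when predicting score >= threshold."""
    best_th, best_f1 = None, -1.0
    for th in sorted(set(scores)):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= th and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= th and y == 0)
        fn = sum(1 for y, s in zip(y_true, scores) if s < th and y == 1)
        f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
        if f1 > best_f1:  # strict comparison keeps the lowest threshold on ties
            best_th, best_f1 = th, f1
    return best_th, best_f1

# Made-up validation labels and classifier probabilities
y_true = [0, 0, 1, 0, 1, 1]
scores = [0.1, 0.3, 0.35, 0.6, 0.8, 0.9]
th, f1 = best_f1_threshold(y_true, scores)
print(th, round(f1, 3))
```

On this toy data the best cutoff is 0.35: it keeps all three positives and admits one false positive, giving F1 = 6/7.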


⟩ Retrieval Evaluation: Hit@K / Recall@K

In [None]:

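def recall_at_k(embs, index, labels, K=5, sample=1000):
    # Strictly a Hit@K rate: the fraction of sampled queries with at least one
    # same-group item among their top-K neighbors (not the share of duplicates found).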
    n = len(labels)
    idxs = np.random.choice(np.arange(n), size=min(sample, n), replace=False)
    hits = 0
    for i in tqdm(idxs, desc=f"Recall@{K}"):
        q = embs[i].astype('float32').reshape(1, -1)
        D, I = index.search(q, K+1)  # +1 to allow the same item at rank 0
        # Remove self if present
        nn = [j for j in I[0] if j != i][:K]
        # Hit if any neighbor shares the same label_group
        if any(labels[j] == labels[i] for j in nn):
            hits += 1
    return hits / len(idxs)

labels = df['label_group'].values
r5_img = recall_at_k(image_embs, faiss_image, labels, K=5, sample=1000)
r5_txt = recall_at_k(text_embs, faiss_text, labels, K=5, sample=1000)
print(f"Recall@5 (image): {r5_img:.3f} | Recall@5 (text): {r5_txt:.3f}")
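
Hit@K and Recall@K differ once a group has more than one other member: Hit@K asks whether any duplicate was retrieved, while Recall@K asks what fraction of the duplicates was retrieved. The sketch below computes both for a single query. The function `hit_and_recall_at_k`, the labels, and the neighbor ranking are all illustrative assumptions.

```python
def hit_and_recall_at_k(neighbors, labels, query, k):
    """Return (hit, recall) for one query given ranked neighbor ids (self excluded)."""
    relevant = {i for i, lab in enumerate(labels)
                if lab == labels[query] and i != query}
    found = relevant.intersection(neighbors[:k])
    hit = 1.0 if found else 0.0
    recall = len(found) / len(relevant) if relevant else 0.0
    return hit, recall

labels = ["a", "a", "a", "a", "b", "b"]  # hypothetical label_group per item
ranked = [1, 4, 5, 2, 3]                 # made-up neighbor ranking for query 0
hit, recall = hit_and_recall_at_k(ranked, labels, query=0, k=3)
print(hit, recall)
```

Here the query has three duplicates but only one appears in the top 3, so the hit is 1.0 while recall is 1/3. Averaging the second value over queries gives a stricter retrieval metric than the hit rate.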


⟩ Inference Helpers (For FastAPI Integration)

In [None]:

def embed_single_image(pil_img):
    with torch.no_grad():
        t = img_preprocess(pil_img).unsqueeze(0).to(DEVICE)
        e = img_model.encode_image(t)
        e = F.normalize(e, dim=-1)
    return e.detach().cpu().numpy()[0]

def embed_single_text(title: str):
    e = txt_model.encode([title], convert_to_numpy=True, normalize_embeddings=True)
    return e[0]

def fused_similarity(img_sim, txt_sim, alpha=FUSION_ALPHA):
    return alpha*img_sim + (1.0 - alpha)*txt_sim

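# Decision with logistic classifier.
# If either text embedding is missing, txt_sim falls back to 0.0; the classifier was
# trained only on pairs with real title similarities, so treat such scores with caution.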
def predict_duplicate(img_emb1, img_emb2, txt_emb1=None, txt_emb2=None, clf_obj=None):
    img_sim = float((img_emb1 * img_emb2).sum())
    txt_sim = float((txt_emb1 * txt_emb2).sum()) if txt_emb1 is not None and txt_emb2 is not None else 0.0
    fused = fused_similarity(img_sim, txt_sim, alpha=clf_obj.get("alpha", FUSION_ALPHA) if clf_obj else FUSION_ALPHA)
    x = np.array([[img_sim, txt_sim, fused]], dtype=np.float32)
    if clf_obj:
        p = clf_obj["clf"].predict_proba(x)[:,1][0]
        decision = p >= clf_obj["best_threshold"]
        return {"image_sim": img_sim, "text_sim": txt_sim, "fused_sim": fused, "prob": float(p), "decision": bool(decision)}
    else:
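        # Fallback without a classifier: fixed cutoff of 0.5 on the fused similarity.
        # Here "prob" is the raw fused similarity, not a calibrated probability.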
        decision = fused >= 0.5
        return {"image_sim": img_sim, "text_sim": txt_sim, "fused_sim": fused, "prob": fused, "decision": bool(decision)}


⟩ Save Artifacts Manifest

In [None]:

manifest = {
    "image_model": {"name": IMG_MODEL_NAME, "pretrained": IMG_MODEL_PRETRAINED, "dim": IMAGE_EMB_DIM},
    "text_model": {"name": TXT_MODEL_NAME, "dim": TEXT_EMB_DIM},
    "fusion_alpha": FUSION_ALPHA,
    "files": {
        "meta_csv": str(OUTPUT_DIR / "meta.csv"),
        "image_embs": str(OUTPUT_DIR / "image_embs.npy"),
        "text_embs": str(OUTPUT_DIR / "text_embs.npy"),
        "faiss_image": str(OUTPUT_DIR / "faiss_image.index"),
        "faiss_text": str(OUTPUT_DIR / "faiss_text.index"),
        "threshold_clf": str(OUTPUT_DIR / "threshold_clf.pkl")
    }
}
with open(OUTPUT_DIR / "manifest.json", "w") as f:
    json.dump(manifest, f, indent=2)
print("Wrote:", OUTPUT_DIR / "manifest.json")
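
On the serving side, it can help to validate the manifest before loading anything heavy. The sketch below reads a manifest and reports which listed files are missing. `load_manifest` is a hypothetical helper, and the demo manifest is a cut-down stand-in for the real one, written to a temporary directory.

```python
import json
import tempfile
from pathlib import Path

def load_manifest(manifest_path):
    """Read manifest.json and return (manifest, list of missing artifact keys)."""
    manifest = json.loads(Path(manifest_path).read_text())
    missing = [key for key, path in manifest.get("files", {}).items()
               if not Path(path).exists()]
    return manifest, missing

# Self-contained demo with a throwaway directory and a hypothetical manifest
with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    (tmp / "meta.csv").write_text("posting_id\n")
    demo = {"fusion_alpha": 0.5,
            "files": {"meta_csv": str(tmp / "meta.csv"),
                      "text_embs": str(tmp / "text_embs.npy")}}
    (tmp / "manifest.json").write_text(json.dumps(demo))
    manifest, missing = load_manifest(tmp / "manifest.json")

print(missing)
```

The manifest written above stores paths built from `OUTPUT_DIR`, which are absolute in Colab. If the artifacts are copied to another machine, those paths will likely need to be rewritten or resolved relative to the manifest's own folder.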


⋗ Optional (Stretch): Fine‑tune a Small Siamese Head with Triplet Loss

In [None]:

# This section is optional and only sketched here because of time constraints.
# Idea: learn a small projection on top of the frozen CLIP/SBERT embeddings with
# triplet loss, so that duplicates move closer together and non-duplicates move apart.
# Implement it if you have the time and a GPU; otherwise skip it.

# Pseudocode / scaffold (not executed by default):
# class SmallProj(torch.nn.Module):
#     def __init__(self, in_dim=IMAGE_EMB_DIM, out_dim=256):
#         super().__init__()
#         self.net = torch.nn.Sequential(
#             torch.nn.Linear(in_dim, 512),
#             torch.nn.ReLU(),
#             torch.nn.Linear(512, out_dim)
#         )
#     def forward(self, x):
#         z = F.normalize(self.net(x), dim=-1)
#         return z

# - Build triplets from label_group (anchor, positive, negative)
# - Train with margin triplet loss on image and/or text embeddings
# - Replace raw embeddings with projected ones for thresholding / FAISS
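
To make the first two steps concrete, here is a NumPy sketch of triplet sampling from group labels and the margin triplet loss. It only evaluates the loss on stand-in embeddings and is not a training loop. The margin of 0.2, the helper names `triplet_margin_loss` and `sample_triplets`, and the toy data are assumptions.

```python
import numpy as np

def triplet_margin_loss(anchor, positive, negative, margin=0.2):
    """Mean of max(0, d(a, p) - d(a, n) + margin) with Euclidean distance d."""
    d_pos = np.linalg.norm(anchor - positive, axis=1)
    d_neg = np.linalg.norm(anchor - negative, axis=1)
    return float(np.maximum(0.0, d_pos - d_neg + margin).mean())

def sample_triplets(labels, n_triplets, seed=0):
    """Draw (anchor, positive, negative) index triplets from group labels."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    positions = np.arange(len(labels))
    triplets = []
    while len(triplets) < n_triplets:
        a = int(rng.integers(len(labels)))
        pos = np.flatnonzero((labels == labels[a]) & (positions != a))
        neg = np.flatnonzero(labels != labels[a])
        if len(pos) and len(neg):  # skip anchors without a valid positive or negative
            triplets.append((a, int(rng.choice(pos)), int(rng.choice(neg))))
    return np.array(triplets)

labels = [0, 0, 1, 1, 2, 2]             # hypothetical label_group ids
embs = np.eye(6, dtype="float32")       # stand-in embeddings, all equally far apart
t = sample_triplets(labels, n_triplets=8)
loss = triplet_margin_loss(embs[t[:, 0]], embs[t[:, 1]], embs[t[:, 2]])
print(t[:3], loss)
```

With every stand-in embedding equally distant from the others, positives and negatives are indistinguishable and the loss equals the margin. A trained projection would need to push this value toward zero by pulling same-group items together.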



⟢ Conclusions & Next Steps ⟣  
- We built a **duplicate matching** module using **Siamese‑style embeddings** (CLIP + SBERT), with:  
  - Learned decision threshold (logistic regression on pairwise sims)  
  - Retrieval metrics (**Recall@5**) using FAISS  
  - Exported artifacts ready for FastAPI

**Next:**  
1. Wrap `predict_duplicate(...)` in a FastAPI endpoint (`/api/dedup`).  
2. Load FAISS indices at startup; add a route for “find near‑duplicates” using an uploaded image.  
3. Log metrics to a `/metrics` endpoint for your website.

> You can now plug this into your Azure FastAPI backend and your GitHub Pages frontend.
