# Notebook 03 — Item2Vec (Product Similarity & Substitutes)

**Purpose:** Train Word2Vec on order sequences to learn product embeddings.
Products appearing in similar shopping baskets get similar vectors.

- **Similarities:** top-10 related items per product → `item_similarities.json`
- **Substitutes:** up to 5 items per product with similarity > 0.70 → `substitutes.json`
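
The scores in both files are cosine similarities between embedding vectors, which is what gensim's `most_similar` reports. The sketch below shows how such a score and the 0.70 cut-off interact. The three-dimensional vectors and the helper name `cosine` are made up for illustration; real vectors come from the trained model.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy 3-dimensional embeddings (hypothetical values, not from the trained model).
toy_vectors = {
    "Whole Milk": [0.9, 0.1, 0.2],
    "2% Milk": [0.8, 0.2, 0.2],
    "Spaghetti": [0.1, 0.9, 0.3],
}

THRESHOLD = 0.70
query = "Whole Milk"
for name, vec in toy_vectors.items():
    if name == query:
        continue
    score = cosine(toy_vectors[query], vec)
    # Same tagging rule as the notebook: above the threshold counts as a substitute.
    tag = "SUB" if score > THRESHOLD else "REL"
    print(f"[{tag}] {score:.3f}  {name}")
```

Vectors that point in nearly the same direction score close to 1 and are tagged `SUB`; everything else stays a related item.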

**Why not co-purchase (Apriori)?**
Apriori finds *complementary* items (Pasta → Pasta Sauce).
Item2Vec tends to surface *interchangeable* items (Whole Milk ≈ 2% Milk ≈ Oat Milk), because products that share basket context get similar vectors.

**Input:** `order_baskets.pkl`, `item_catalog.json`

**Output:** `item_similarities.json`, `substitutes.json`

**Runtime:** ~3–5 min on Kaggle


In [1]:
import os, json, pickle, time
import numpy as np
from collections import defaultdict

IS_KAGGLE  = os.path.exists("/kaggle/input")
OUTPUT_DIR = "/kaggle/working" if IS_KAGGLE else "../data/output"
MODELS_DIR = "/kaggle/working" if IS_KAGGLE else "../data/models"
os.makedirs(MODELS_DIR, exist_ok=True)

# Install gensim via shell magic, which avoids kernel message flooding.
# The interpreter path is quoted so the command survives spaces in the path.
import sys
!"{sys.executable}" -m pip install gensim -q

from gensim.models import Word2Vec

print("gensim imported OK")
print(f"OUTPUT_DIR = {OUTPUT_DIR}")


gensim imported OK
OUTPUT_DIR = ../data/output


In [2]:
print("=" * 60)
print("STEP 1: Loading data...")
print("=" * 60)

with open(f"{OUTPUT_DIR}/order_baskets.pkl", "rb") as f:
    baskets = pickle.load(f)
with open(f"{OUTPUT_DIR}/item_catalog.json", "r") as f:
    item_catalog = json.load(f)

name_to_category = {item["name_lower"]: item["category"] for item in item_catalog}
print(f"Loaded {len(baskets):,} baskets | Catalog: {len(item_catalog)} items")


STEP 1: Loading data...
Loaded 2,849,883 baskets | Catalog: 3000 items


In [3]:
print("\n" + "=" * 60)
print("STEP 2: Preparing sequences...")
print("=" * 60)

# Shuffle items within each basket — removes add-to-cart-order bias.
# Shopping order (Milk first vs last) is meaningless; what matters is
# which products co-occur in the same basket.
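rng = np.random.default_rng(42)  # fixed seed so the shuffle is reproducible
sequences = []
for basket in baskets:
    if len(basket) >= 2:  # single-item baskets carry no co-occurrence signal
        s = list(basket)
        rng.shuffle(s)
        sequences.append(s)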

vocab = set(item for seq in sequences for item in seq)
print(f"Sequences: {len(sequences):,}")
print(f"Vocab:     {len(vocab):,} unique products")



STEP 2: Preparing sequences...
Sequences: 2,849,883
Vocab:     3,000 unique products
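
To illustrate why shuffling interacts with the window size, here is a simplified sketch of how skip-gram turns one basket into (center, context) training pairs. It uses a fixed window and ignores implementation details such as subsampling and negative sampling; the function name `skipgram_pairs` and the basket contents are hypothetical.

```python
import random

def skipgram_pairs(sequence, window):
    """Return (center, context) pairs for every item within `window` positions."""
    pairs = []
    for i, center in enumerate(sequence):
        lo = max(0, i - window)
        hi = min(len(sequence), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, sequence[j]))
    return pairs

basket = ["Banana", "Whole Milk", "Spaghetti", "Butter"]  # hypothetical basket
rng = random.Random(42)  # seeded so the shuffle is repeatable
shuffled = basket[:]
rng.shuffle(shuffled)

# When the window covers the whole basket, every item is paired with every
# other item, so the shuffle order does not change the set of pairs.
pairs_full = skipgram_pairs(shuffled, window=5)
print(len(pairs_full))

# A narrow window only pairs neighbours, which is where item order matters.
pairs_narrow = skipgram_pairs(shuffled, window=1)
print(len(pairs_narrow))
```

For baskets longer than the window, only some pairs are seen per pass, and shuffling keeps that selection from depending on add-to-cart order.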


In [4]:
print("\n" + "=" * 60)
print("STEP 3: Training Item2Vec...")
print("=" * 60)

# Hyperparameter guide (do not change unless spot-checks look wrong):
#   vector_size=64  — good for ~3K vocab; use 128 for >10K items
#   window=5        — with shuffled baskets, captures most co-occurrences
#   min_count=5     — ignore items with <5 appearances (too noisy)
#   sg=1            — Skip-gram: better for rare items than CBOW
#   epochs=15       — enough for convergence; watch loss plateau

VECTOR_SIZE = 64
WINDOW      = 5
MIN_COUNT   = 5
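SG          = 1      # Skip-gram
EPOCHS      = 15     # loss is only recorded when compute_loss=True is passed
WORKERS     = 4      # multi-threaded training is not bit-for-bit reproducible
SEED        = 42     # fully deterministic runs also need workers=1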

t0 = time.time()
model = Word2Vec(
    sentences=sequences,
    vector_size=VECTOR_SIZE,
    window=WINDOW,
    min_count=MIN_COUNT,
    sg=SG,
    epochs=EPOCHS,
    workers=WORKERS,
    seed=SEED,
)
elapsed = time.time() - t0
print(f"Training done in {elapsed:.1f}s")
print(f"Vocabulary in model: {len(model.wv)} items")

model.save(f"{MODELS_DIR}/item2vec.model")
print(f"Saved model: {MODELS_DIR}/item2vec.model (offline analysis only — not deployed)")



STEP 3: Training Item2Vec...


Exception ignored in: 'gensim.models.word2vec_inner.our_dot_float'

Training done in 284.2s
Vocabulary in model: 3000 items
Saved model: ../data/models/item2vec.model (offline analysis only — not deployed)
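
The effect of `min_count` can be sketched without gensim: items seen fewer than `min_count` times are dropped before training. In the recorded run the model vocabulary (3000) equals the vocabulary counted in Step 2, so no product fell below the cut-off. The helper name `filter_rare_items` and the toy sequences are hypothetical.

```python
from collections import Counter

def filter_rare_items(sequences, min_count):
    """Drop items seen fewer than `min_count` times across all sequences."""
    counts = Counter(item for seq in sequences for item in seq)
    kept_vocab = {item for item, n in counts.items() if n >= min_count}
    filtered = [[item for item in seq if item in kept_vocab] for seq in sequences]
    return kept_vocab, filtered

toy_sequences = [
    ["Banana", "Whole Milk"],
    ["Banana", "Spaghetti"],
    ["Banana", "Whole Milk", "Dragon Fruit"],
]
vocab, filtered = filter_rare_items(toy_sequences, min_count=2)
print(sorted(vocab))
print(filtered)
```

Items that are filtered out get no vector, so they can never appear as a key or as a neighbour in the exported lookup.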




In [5]:
print("\n" + "=" * 60)
print("STEP 4: Validation — spot-check similarities...")
print("=" * 60)

test_items = ["Banana", "Whole Milk", "Organic Strawberries",
              "Bag of Organic Bananas", "Spaghetti"]

for item in test_items:
    if item in model.wv:
        similar = model.wv.most_similar(item, topn=6)
        print(f"\n  '{item}':")
        for name, score in similar:
            cat    = name_to_category.get(name.lower(), "?")
            tag    = "SUB" if score > 0.70 else "REL"
            print(f"    [{tag}] {score:.3f}  {name:40s}  [{cat}]")
    else:
        candidates = [w for w in model.wv.index_to_key if item.lower() in w.lower()]
        print(f"\n  '{item}' not in vocab. Candidates: {candidates[:3]}")



STEP 4: Validation — spot-check similarities...

  'Banana':
    [REL] 0.624  Honey Nut Cheerios                        [breakfast]
    [REL] 0.621  Bosc Pear                                 [produce]
    [REL] 0.609  Oven Roasted Turkey Breast                [deli]
    [REL] 0.599  Clementines, Bag                          [produce]
    [REL] 0.596  Raisin Bran Cereal                        [breakfast]
    [REL] 0.589  XL Emerald White Seedless Grapes          [produce]

  'Whole Milk':
    [SUB] 0.767  2% Reduced Fat Milk                       [dairy]
    [REL] 0.666  Chicken Thighs                            [meat_seafood]
    [REL] 0.651  Original Whole Fat Lactose Free Milk      [dairy]
    [REL] 0.593  Pure Granulated Cane Sugar                [grains_pasta]
    [REL] 0.586  1% Low Fat Milk                           [dairy]
    [REL] 0.581  Butter                                    [dairy]

  'Organic Strawberries':
    [SUB] 0.746  Organic Blueberries                       [pro

In [6]:
print("\n" + "=" * 60)
print("STEP 5: Pre-computing similarity lookup (all catalog items)...")
print("=" * 60)

# Deploy a lightweight JSON dict instead of the full gensim model.
# Runtime lookup: similarities['banana'] → [{name, score, category}, ...]

TOP_N = 10
item_similarities = {}
skipped = 0

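# Iterate over the model vocabulary (items that passed min_count), not the raw catalog.
# Keys are lowercased for case-insensitive lookup, so product names that differ only
# by case collapse into one key and len(item_similarities) can be smaller than
# len(model.wv) even when nothing is skipped.
for item_name in model.wv.index_to_key:
    try:
        similar = model.wv.most_similar(item_name, topn=TOP_N)
        item_similarities[item_name.lower()] = [
            {
                "name":     name,
                "score":    round(float(score), 3),
                "category": name_to_category.get(name.lower(), "other"),
            }
            for name, score in similar
        ]
    except Exception:
        skipped += 1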

with open(f"{OUTPUT_DIR}/item_similarities.json", "w") as f:
    json.dump(item_similarities, f, indent=2)

size_mb = os.path.getsize(f"{OUTPUT_DIR}/item_similarities.json") / 1024**2
print(f"Saved: {OUTPUT_DIR}/item_similarities.json  ({size_mb:.1f} MB)")
print(f"Items computed: {len(item_similarities)}  |  Skipped: {skipped}")



STEP 5: Pre-computing similarity lookup (all catalog items)...
Saved: ../data/output/item_similarities.json  (3.2 MB)
Items computed: 2998  |  Skipped: 0
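
The runtime lookup mentioned in the cell comment can be sketched as a plain dictionary access. The payload below is a two-entry excerpt mirroring the Step 4 spot-check for Whole Milk; the helper name `lookup_similar` and its `min_score` parameter are assumptions, not part of the notebook.

```python
import json

# Toy payload in the same shape as item_similarities.json.
payload = json.dumps({
    "whole milk": [
        {"name": "2% Reduced Fat Milk", "score": 0.767, "category": "dairy"},
        {"name": "Butter", "score": 0.581, "category": "dairy"},
    ]
})
similarities = json.loads(payload)

def lookup_similar(similarities, product_name, top_n=10, min_score=0.0):
    """Return up to `top_n` neighbours for a product, or [] if it is unknown."""
    # Keys were lowercased at export time, so normalise the query the same way.
    entries = similarities.get(product_name.strip().lower(), [])
    return [e for e in entries if e["score"] >= min_score][:top_n]

print(lookup_similar(similarities, "Whole Milk", top_n=1))
print(lookup_similar(similarities, "Unknown Product"))
```

Returning an empty list for unknown products keeps callers free of `KeyError` handling.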


In [7]:
print("\n" + "=" * 60)
print("STEP 6: Extracting substitutes (score > 0.70)...")
print("=" * 60)

THRESHOLD = 0.70
substitutes = {}

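for item_key, sim_list in item_similarities.items():
    # sim_list is sorted by descending score, so the slice keeps the 5 strongest matches.
    # Strict ">" matches the rule stated in the header and the Step 4 tags.
    subs = [s for s in sim_list if s["score"] > THRESHOLD]
    if subs:
        substitutes[item_key] = subs[:5]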

with open(f"{OUTPUT_DIR}/substitutes.json", "w") as f:
    json.dump(substitutes, f, indent=2)

print(f"Saved: {OUTPUT_DIR}/substitutes.json")
print(f"Items with substitutes: {len(substitutes)}")

# Spot-check
print("\nSubstitute examples:")
shown = 0
for k, subs in substitutes.items():
    if len(subs) >= 2 and shown < 5:
        print(f"  {k}:")
        for s in subs[:3]:
            print(f"    → {s['name']:40s}  score={s['score']:.3f}")
        shown += 1

all_scores = [s["score"] for v in item_similarities.values() for s in v]
catalog_lower = set(i["name_lower"] for i in item_catalog)
coverage = len(catalog_lower & set(item_similarities.keys())) / len(catalog_lower) * 100
print(f"\nCatalog coverage: {coverage:.1f}%")
print(f"Score mean/std:   {np.mean(all_scores):.3f} / {np.std(all_scores):.3f}")
print("\n✓ NOTEBOOK 03 COMPLETE — item_similarities.json + substitutes.json ready")



STEP 6: Extracting substitutes (score > 0.70)...
Saved: ../data/output/substitutes.json
Items with substitutes: 2765

Substitute examples:
  bag of organic bananas:
    → Organic Strawberries                      score=0.742
    → Organic Blueberries                       score=0.724
    → Organic Half & Half                       score=0.714
  organic strawberries:
    → Organic Blueberries                       score=0.746
    → Bag of Organic Bananas                    score=0.742
    → Organic Green Seedless Grapes             score=0.701
  organic baby spinach:
    → Organic Spring Mix                        score=0.802
    → Organic Egg Whites                        score=0.761
    → Organic Frozen Peas                       score=0.754
  strawberries:
    → Red Seedless Grapes                       score=0.813
    → Seedless Red Grapes                       score=0.778
    → Blackberries                              score=0.765
  limes:
    → Jalapeno Peppers                   
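
The extraction step can be sketched as a standalone function on toy data. The `same_category_only` flag is one possible refinement and is not part of the notebook; it uses the `category` field already stored with each neighbour. All names, scores, and category labels in the toy input are hypothetical.

```python
def extract_substitutes(item_similarities, name_to_category,
                        threshold=0.70, max_subs=5, same_category_only=False):
    """Keep neighbours scoring above `threshold`, optionally within one category."""
    substitutes = {}
    for item_key, sim_list in item_similarities.items():
        source_cat = name_to_category.get(item_key)
        subs = [
            s for s in sim_list
            if s["score"] > threshold
            and (not same_category_only or s["category"] == source_cat)
        ]
        if subs:
            # sim_list is assumed to be sorted by descending score.
            substitutes[item_key] = subs[:max_subs]
    return substitutes

# Hypothetical catalog and similarity entries for illustration only.
toy_categories = {"spinach": "produce"}
toy_similarities = {
    "spinach": [
        {"name": "Spring Mix", "score": 0.80, "category": "produce"},
        {"name": "Egg Whites", "score": 0.76, "category": "dairy"},
        {"name": "Kale", "score": 0.65, "category": "produce"},
    ],
}

all_subs = extract_substitutes(toy_similarities, toy_categories)
strict_subs = extract_substitutes(toy_similarities, toy_categories,
                                  same_category_only=True)
print([s["name"] for s in all_subs["spinach"]])
print([s["name"] for s in strict_subs["spinach"]])
```

Whether a category filter helps depends on how the catalog categories are defined, so it is worth spot-checking before relying on it.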