# Baseline Product Type Classification Pipeline

## Overview
This notebook establishes the ML pipeline structure for product type classification of Home Depot catalog items. It is designed as a **scaffold awaiting organic labels** derived from the dataset itself.

## Architecture
- **Data Source**: `data/scraped_data_output.json` (425 product records)
- **Features**: TF-IDF vectorization of combined `title` + `description` fields
- **Model**: Logistic Regression classifier (scikit-learn)
- **Split Strategy**: 80/20 train/test split, stratified on `product_type` when possible, after deduplicating on title + model combinations

## Label Integration Strategy
Once organic product-type labels are established through data-driven discovery:
1. **Insert labels** in the designated section below (see `TODO: INSERT ORGANIC LABELS`)
2. Labels should be derived from title/description patterns, brand signals, and attribute clusters
3. Each product record needs a `product_type` field added to the DataFrame
4. Do NOT force-fit items into taxonomy leaves until confident labels exist
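
Once a `product_type` column exists, a quick coverage check can show which labels are still too thin to train on. The sketch below is a minimal illustration: the helper name `find_sparse_labels` and the toy labels are hypothetical, and the default minimum of 5 mirrors the lower bound given in the validation requirements of Section 5.

```python
import pandas as pd

def find_sparse_labels(labels: pd.Series, min_per_class: int = 5) -> dict:
    """Return {label: count} for every label with fewer than `min_per_class` rows."""
    counts = labels.value_counts()
    return {label: int(n) for label, n in counts.items() if n < min_per_class}

# Toy labels (made up for illustration, not taken from the dataset)
toy_labels = pd.Series(["lighting"] * 6 + ["tools"] * 2 + ["UNLABELED"])
sparse = find_sparse_labels(toy_labels)
print(sparse)
```

Any label returned here would need more examples, or a merge into a broader type, before stratified splitting and training are meaningful.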

## Feature Engineering Notes
- **Current**: Simple TF-IDF on text concatenation
- **Future enhancements**:
  - Brand embeddings or one-hot encoding
  - Extracted attributes (dimensions, wattage, material) from `structured_specifications`
  - N-gram keyword buckets
  - Sentence embeddings (SentenceTransformers)
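
As a rough illustration of the extracted-attributes idea, the sketch below pulls a wattage value out of free text with a regular expression. The function name `extract_wattage` and the pattern are assumptions made for this example; the layout of `structured_specifications` is not shown in this notebook, so the sketch operates on plain strings only.

```python
import re
from typing import Optional

# Hypothetical pattern: a number followed by "W", "watt", or "watts",
# optionally separated by whitespace or a hyphen (e.g. "60-Watt", "9.5W")
_WATT_RE = re.compile(r"(\d+(?:\.\d+)?)\s*-?\s*(?:watts?|w)\b", re.IGNORECASE)

def extract_wattage(text: str) -> Optional[float]:
    """Return the first wattage mentioned in `text`, or None if there is none."""
    match = _WATT_RE.search(text)
    return float(match.group(1)) if match else None

print(extract_wattage("60-Watt Equivalent LED Bulb"))
print(extract_wattage("20V MAX Cordless Drill"))
```

A real extractor would likely need one pattern per attribute (dimensions, voltage, material) and some unit normalization, which is left out here.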

## Important Notes
- This notebook will run in skeleton form without errors
- No taxonomy file references are included (taxonomy mapping comes later)
- All TODOs must be addressed before actual training


## 1. Environment Setup

In [None]:
import json
import pandas as pd
import numpy as np
from pathlib import Path
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, accuracy_score
import warnings
warnings.filterwarnings('ignore')

print("✓ Libraries imported successfully")

## 2. Data Loading

In [None]:
# Load scraped product data
data_path = Path('../data/scraped_data_output.json')

with open(data_path, 'r', encoding='utf-8') as f:
    products_raw = json.load(f)

# Convert to DataFrame
df = pd.DataFrame(products_raw)

print(f"Loaded {len(df)} product records")
print(f"\nColumns: {list(df.columns)}")
print(f"\nSample record:")
display(df[['title', 'brand', 'model', 'price']].head(3))

## 3. Data Exploration & Quality Checks

In [None]:
# Check for missing values in key fields
print("Missing values:")
print(df[['title', 'description', 'brand', 'model']].isnull().sum())

# Unique titles and models (for split preservation)
print(f"\nUnique titles: {df['title'].nunique()}")
print(f"Unique models: {df['model'].nunique()}")
print(f"Unique brands: {df['brand'].nunique()}")

# Check data types
print(f"\nDataFrame shape: {df.shape}")

## 4. Feature Engineering

In [None]:
# Create combined text field for TF-IDF
# Handle missing descriptions gracefully
df['description'] = df['description'].fillna('')
df['title'] = df['title'].fillna('')

# Combine title and description
df['combined_text'] = df['title'] + ' ' + df['description']

print(f"✓ Created combined_text field")
print(f"Average text length: {df['combined_text'].str.len().mean():.0f} characters")

# Preview combined text
print("\nSample combined text (first 200 chars):")
print(df['combined_text'].iloc[0][:200] + "...")

## 5. ⚠️ ORGANIC LABEL INTEGRATION POINT ⚠️

### TODO: INSERT ORGANIC LABELS HERE WHEN READY

Before training can begin, you must:

1. **Discover organic product types** from the dataset using:
   - Title/description keyword clustering
   - Brand-category associations
   - Attribute patterns from `structured_specifications`
   - Manual review of high-confidence clusters

2. **Add a `product_type` column** to the DataFrame:
   ```python
   # Example (replace with actual labels):
   df['product_type'] = None  # Fill with discovered labels
   ```

3. **Validation requirements**:
   - At least 5-10 examples per product type
   - Clear distinction between types
   - Confidence scores for each assignment

4. **Do NOT**:
   - Force-fit items into taxonomy_paths.txt categories yet
   - Use arbitrary synthetic labels
   - Skip human review of uncertain assignments (<0.7 confidence)

In [None]:
# ============================================================================
# PLACEHOLDER: Organic labels will be inserted here
# ============================================================================

# For now, create a dummy placeholder to allow the pipeline to run
# This will be replaced with actual organic labels from data discovery
df['product_type'] = 'UNLABELED'  # Placeholder - replace with real labels

print("⚠️  WARNING: Using placeholder labels")
print("⚠️  Training cannot proceed until organic labels are added")
print(f"\nLabel distribution:")
print(df['product_type'].value_counts())

## 6. Train/Test Split

**Strategy**: 80/20 split preserving unique titles/models to prevent data leakage
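
If dropping duplicate title/model records ever becomes undesirable, a group-aware split is one alternative way to keep all records with the same key on one side of the split. The sketch below uses scikit-learn's `GroupShuffleSplit` on a toy frame; the `unique_id` values are invented, and note that this splitter does not stratify by label.

```python
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Toy frame; 'unique_id' mirrors the title+model key built in the next cell
toy = pd.DataFrame({
    "unique_id": ["a", "a", "b", "c", "c", "d", "e", "f", "g", "h"],
    "combined_text": [f"text {i}" for i in range(10)],
})

# test_size applies to groups, so roughly 20% of the unique IDs go to test
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(splitter.split(toy, groups=toy["unique_id"]))
train_part, test_part = toy.iloc[train_idx], toy.iloc[test_idx]

# No unique_id should appear on both sides
shared = set(train_part["unique_id"]) & set(test_part["unique_id"])
print(len(train_part), len(test_part), shared)
```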

In [None]:
# Create unique identifier for deduplication
df['unique_id'] = df['title'] + '_' + df['model'].astype(str)

# Check for duplicates
n_duplicates = df['unique_id'].duplicated().sum()
if n_duplicates > 0:
    print(f"⚠️  Found {n_duplicates} duplicate records (same title+model)")
    print(f"Removing duplicates to prevent data leakage...")
    df = df.drop_duplicates(subset='unique_id', keep='first')
    print(f"✓ Deduplicated: {len(df)} unique records remaining")
else:
    print(f"✓ No duplicates found")

# Perform stratified split based on product_type
# Note: stratification raises ValueError when any class has fewer than 2 records
# (or when the test set is smaller than the number of classes). A single
# placeholder class stratifies trivially, which is equivalent to a random split.
try:
    train_df, test_df = train_test_split(
        df,
        test_size=0.2,
        random_state=42,
        stratify=df['product_type']  # Stratify when labels are available
    )
    print(f"✓ Stratified split successful")
except ValueError as e:
    # Fallback to simple random split if stratification fails (e.g., a rare class)
    print(f"ℹ️  Stratification not possible: {e}")
    train_df, test_df = train_test_split(
        df,
        test_size=0.2,
        random_state=42
    )
    print(f"✓ Simple random split performed")

print(f"\nTrain set: {len(train_df)} records ({len(train_df)/len(df)*100:.1f}%)")
print(f"Test set: {len(test_df)} records ({len(test_df)/len(df)*100:.1f}%)")

# Verify no title/model overlap between train and test
train_ids = set(train_df['unique_id'])
test_ids = set(test_df['unique_id'])
overlap = train_ids & test_ids
assert len(overlap) == 0, f"Data leakage detected: {len(overlap)} overlapping IDs"
print(f"✓ Data leakage check passed: 0 overlapping IDs")

## 7. TF-IDF Feature Pipeline

In [None]:
# Initialize TF-IDF vectorizer
# Parameters tuned for product descriptions
tfidf = TfidfVectorizer(
    max_features=5000,      # Limit vocabulary size
    min_df=2,               # Ignore terms appearing in fewer than 2 documents
    max_df=0.8,             # Ignore terms appearing in >80% of documents
    ngram_range=(1, 2),     # Unigrams and bigrams
    strip_accents='unicode',
    lowercase=True,
    stop_words='english'
)

# Fit on training data only (prevent test set leakage)
X_train_tfidf = tfidf.fit_transform(train_df['combined_text'])
X_test_tfidf = tfidf.transform(test_df['combined_text'])

print(f"✓ TF-IDF vectorization complete")
print(f"Feature matrix shape (train): {X_train_tfidf.shape}")
print(f"Feature matrix shape (test): {X_test_tfidf.shape}")
print(f"Vocabulary size: {len(tfidf.vocabulary_)}")

# Show the most common terms (lowest IDF score = appears in the most documents)
feature_names = tfidf.get_feature_names_out()
idf_scores = tfidf.idf_
top_features_idx = np.argsort(idf_scores)[:10]
print(f"\nSample features (most common): {[feature_names[i] for i in top_features_idx]}")

## 8. Classifier Scaffold

In [None]:
# Initialize Logistic Regression classifier
# Using parameters suitable for multi-class text classification
clf = LogisticRegression(
    max_iter=1000,           # Sufficient iterations for convergence
    # multi_class is omitted: lbfgs fits a multinomial model for multi-class
    # targets by default, and the argument is deprecated in scikit-learn >= 1.5
    solver='lbfgs',          # Efficient for small-medium datasets
    random_state=42,
    class_weight='balanced', # Handle class imbalance
    n_jobs=-1                # Use all CPU cores
)

print(f"✓ Classifier initialized: {clf.__class__.__name__}")
print(f"Parameters: {clf.get_params()}")

## 9. ⚠️ TRAINING LOOP PLACEHOLDER ⚠️

### TODO: ADD TRAINING LOOP AFTER LABELS ARE READY

Once organic labels are integrated, uncomment and run the training code below:

In [None]:
# ============================================================================
# TODO: UNCOMMENT THIS SECTION AFTER ADDING ORGANIC LABELS
# ============================================================================

# # Extract labels
# y_train = train_df['product_type']
# y_test = test_df['product_type']

# # Train the classifier
# print("Training classifier...")
# clf.fit(X_train_tfidf, y_train)
# print("✓ Training complete")

# # Make predictions
# y_train_pred = clf.predict(X_train_tfidf)
# y_test_pred = clf.predict(X_test_tfidf)

# # Evaluate
# train_acc = accuracy_score(y_train, y_train_pred)
# test_acc = accuracy_score(y_test, y_test_pred)

# print(f"\nTraining accuracy: {train_acc:.4f}")
# print(f"Test accuracy: {test_acc:.4f}")
# print(f"\nClassification Report (Test Set):")
# print(classification_report(y_test, y_test_pred))

# # Check if meets >=98% accuracy requirement
# if test_acc >= 0.98:
#     print("✓ Model meets >=98% accuracy threshold")
# else:
#     print(f"⚠️  Model accuracy ({test_acc:.2%}) below 98% threshold")

print("⚠️  Training code is ready but waiting for organic labels")
print("⚠️  Uncomment the section above after label integration")

## 10. Pipeline Status Summary

In [None]:
print("=" * 60)
print("BASELINE PRODUCT TYPE CLASSIFICATION PIPELINE - STATUS")
print("=" * 60)
print(f"\n✓ Data loaded: {len(df)} records")
print(f"✓ Features engineered: TF-IDF with {X_train_tfidf.shape[1]} features")
print(f"✓ Train/test split: {len(train_df)}/{len(test_df)} (80/20)")
print(f"✓ Classifier scaffold ready: {clf.__class__.__name__}")
print(f"\n⚠️  BLOCKERS:")
print(f"   1. Organic product-type labels not yet assigned")
print(f"   2. Training loop commented out pending labels")
print(f"\n📋 NEXT STEPS:")
print(f"   1. Run data discovery to identify organic product types")
print(f"   2. Assign product_type labels in Section 5")
print(f"   3. Uncomment training code in Section 9")
print(f"   4. Validate >=98% accuracy requirement")
print(f"   5. Map confident labels to taxonomy (future phase)")
print("=" * 60)

## Additional Notes

### Model Persistence (Future)
Once trained, save the model and vectorizer:
```python
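from pathlib import Path
import joblib

Path('../models').mkdir(parents=True, exist_ok=True)  # Create the output directory if missing
joblib.dump(clf, '../models/product_type_classifier.pkl')
joblib.dump(tfidf, '../models/tfidf_vectorizer.pkl')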
```
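
One alternative to saving two separate files is to bundle the vectorizer and classifier in a scikit-learn `Pipeline`, so a single artifact accepts raw text at inference time. The sketch below is illustrative only: the texts, labels, and file name are made up, and it round-trips through a temporary directory rather than `../models/`.

```python
import tempfile
from pathlib import Path

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Toy data with made-up labels, only to have something to fit
texts = ["led light bulb", "soft white bulb", "cordless drill", "drill driver kit"]
labels = ["lighting", "lighting", "tools", "tools"]

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(max_iter=1000)),
])
pipeline.fit(texts, labels)

# Round-trip through a temporary directory; the file name is arbitrary
with tempfile.TemporaryDirectory() as tmp:
    model_path = Path(tmp) / "product_type_pipeline.pkl"
    joblib.dump(pipeline, model_path)
    restored = joblib.load(model_path)

# The restored pipeline takes raw strings directly
restored_pred = restored.predict(["led bulb"])
print(restored_pred)
```

Keeping both steps in one object avoids the risk of loading a classifier alongside a vectorizer fitted on a different vocabulary.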

### Confidence Scoring (Future)
Use `predict_proba()` to identify uncertain predictions:
```python
proba = clf.predict_proba(X_test_tfidf)
max_proba = proba.max(axis=1)
uncertain = max_proba < 0.7  # Flag for human review
```
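
To show how the flagged predictions might be collected for review, here is a self-contained sketch on toy data. The texts and labels are invented, and the 0.7 cutoff is simply the review threshold already used in this notebook; a model fitted on this little data produces low confidences, so most rows end up flagged.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy training data with made-up labels
train_texts = ["led light bulb", "soft white bulb", "cordless drill", "drill driver kit"]
train_labels = ["lighting", "lighting", "tools", "tools"]
new_texts = ["led bulb soft white", "cordless drill kit", "garden hose"]

vec = TfidfVectorizer()
model = LogisticRegression(max_iter=1000)
model.fit(vec.fit_transform(train_texts), train_labels)

proba = model.predict_proba(vec.transform(new_texts))
max_proba = proba.max(axis=1)
predicted = model.classes_[proba.argmax(axis=1)]
needs_review = max_proba < 0.7  # Same threshold as above

review_queue = pd.DataFrame({
    "text": new_texts,
    "predicted": predicted,
    "confidence": max_proba.round(3),
    "needs_review": needs_review,
})
print(review_queue)
```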

### Feature Enhancement Ideas
1. Add brand as categorical feature
2. Extract numeric attributes (wattage, dimensions) from structured_specifications
3. Use SentenceTransformers embeddings instead of TF-IDF
4. Include price bins as features
5. Leverage rating/review count as quality signals
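
As a sketch of idea 1, brand can be combined with the existing text features through a `ColumnTransformer`. The frame below is a toy stand-in with invented brand names; only the column names `combined_text` and `brand` come from this notebook.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import OneHotEncoder

# Toy records; brand names are placeholders
toy = pd.DataFrame({
    "combined_text": ["led light bulb", "cordless drill kit", "interior paint gallon"],
    "brand": ["BrandA", "BrandB", "BrandA"],
})

features = ColumnTransformer([
    # A string column name passes a 1-D Series, which TfidfVectorizer expects
    ("text", TfidfVectorizer(), "combined_text"),
    # A list of names passes a 2-D frame, which OneHotEncoder expects
    ("brand", OneHotEncoder(handle_unknown="ignore"), ["brand"]),
])

X = features.fit_transform(toy)
print(X.shape)  # 9 text terms + 2 brand columns
```

`handle_unknown="ignore"` keeps the transform from failing on brands that appear only in the test set; such rows simply get all-zero brand columns.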