## Smart Gift Planner: AI-Powered Gift Recommendation System

### Data Science Component - Holiday Data Jam Project


## What is this project about?

The **Smart Gift Planner** is an AI-powered recommendation system designed to solve a common holiday challenge: finding the perfect gift for someone based on their interests and your budget. Instead of spending hours browsing through thousands of products online, users can simply tell our system what the gift recipient is interested in (like "photography", "fitness", or "gaming") and set their budget, and the AI will instantly recommend the most relevant products.

## The Problem We're Solving

Gift shopping can be overwhelming with endless product options and limited time. People often struggle to:
- Find gifts that truly match someone's interests
- Stay within their budget while finding quality items
- Discover products they didn't know existed
- Make confident purchasing decisions

## Our Solution

We built an intelligent recommender system using **machine learning and natural language processing** that:

1. **Understands Context**: Uses TF-IDF (Term Frequency-Inverse Document Frequency) to weight the terms in user queries and product text, so that distinctive words (like "tripod") count more than common ones (like "black")
2. **Filters Smartly**: Automatically narrows down thousands of products to only those within the user's specified price range
3. **Ranks by Relevance**: Calculates similarity scores to show the most relevant gift suggestions first
4. **Explains Recommendations**: Provides transparency by showing why each product was recommended
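
The four steps above can be sketched end to end on a toy catalogue. This is a minimal illustration, not the project's final implementation: the `products` rows, the `recommend` function, and the `matched_terms` column are assumptions made for the example, and the "explanation" is simply the list of query terms that also occur in the product title.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical toy catalogue; the real project uses the Amazon product dataset.
products = pd.DataFrame({
    "title": [
        "digital camera tripod for photography",
        "yoga mat for fitness and pilates",
        "wireless gaming mouse with rgb lighting",
        "camera lens cleaning kit photography accessories",
        "professional mirrorless camera body",
    ],
    "price": [35.0, 22.0, 48.0, 15.0, 899.0],
})


def recommend(query, min_price, max_price, top_n=3):
    """Filter by budget, rank by TF-IDF cosine similarity, explain each match."""
    # Step 2: keep only products inside the price range.
    in_budget = products[products["price"].between(min_price, max_price)].copy()
    if in_budget.empty:
        return pd.DataFrame(columns=["title", "price", "score", "matched_terms"])

    # Step 1: weight terms with TF-IDF, fitted on the candidate titles.
    vectorizer = TfidfVectorizer(stop_words="english")
    doc_matrix = vectorizer.fit_transform(in_budget["title"])
    query_vec = vectorizer.transform([query])

    # Step 3: cosine similarity between the query and every candidate.
    in_budget["score"] = cosine_similarity(query_vec, doc_matrix).ravel()

    # Step 4: explain a match by listing the query terms found in the title.
    analyzer = vectorizer.build_analyzer()
    query_terms = set(analyzer(query))
    in_budget["matched_terms"] = in_budget["title"].apply(
        lambda title: ", ".join(sorted(query_terms & set(analyzer(title))))
    )

    # Drop products that share no term with the query, then rank.
    ranked = in_budget[in_budget["score"] > 0].sort_values("score", ascending=False)
    return ranked.head(top_n).reset_index(drop=True)


results = recommend("photography camera", min_price=10, max_price=50)
print(results)
```

Fitting the vectorizer inside every call keeps the sketch short. On a catalogue with over a million rows it would likely be preferable to fit once and reuse the matrix, filtering rows by price afterwards.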

## Project Goals & Objectives

### Primary Goal
Build an intelligent gift recommendation system that helps users discover personalized gift ideas based on recipient interests and budget constraints, using machine learning and natural language processing techniques.

### Part 1: Data Preprocessing Goals
**Objectives:**
- Load and explore the Amazon product dataset
- Clean data by handling missing values and duplicates
- Normalize text fields for accurate matching
- Engineer features by creating combined text fields
- Categorize prices into meaningful ranges
- Validate data quality

**Success Criteria:**
- Zero missing values in critical fields
- All text normalized to lowercase
- Combined features created for each product
- Dataset saved as `amazon_categories.csv`
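
The first three criteria can be checked programmatically. The sketch below is one possible way to do so, assuming the column names used later in this notebook (`title`, `price`, `title_clean`, `combined_features`); the helper `check_preprocessing` and the two small frames are hypothetical and exist only for illustration. It does not cover the file-saving criterion.

```python
import pandas as pd


def check_preprocessing(frame, critical=("title", "price")):
    """Return one boolean per success criterion (column names follow this notebook)."""
    clean = frame["title_clean"]
    return {
        # No NaN in any critical column.
        "no_missing_critical": bool(frame[list(critical)].notna().all().all()),
        # Lowercasing the cleaned text must not change it.
        "text_is_lowercase": bool((clean == clean.str.lower()).all()),
        # Every row has a non-empty combined feature string.
        "combined_features_present": bool(
            frame["combined_features"].fillna("").str.strip().ne("").all()
        ),
    }


# Hypothetical example rows.
good = pd.DataFrame({
    "title": ["Roller Luggage, Black", "Yoga Mat"],
    "price": [89.99, 22.0],
    "title_clean": ["roller luggage black", "yoga mat"],
    "combined_features": ["roller luggage black", "yoga mat"],
})
# Same rows, but with one missing title and one non-lowercase clean title.
bad = good.assign(title=[None, "Yoga Mat"], title_clean=["Roller Luggage", "yoga mat"])

print(check_preprocessing(good))
print(check_preprocessing(bad))
```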

### Part 2: Model Development Goals
**Baseline Model:**
- Simple keyword-based recommender as benchmark
- Filter by price range and count keyword matches
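
A keyword baseline of this kind could look like the sketch below. It is an assumption about the general shape of the benchmark, not the project's actual code: `catalog`, `keyword_baseline`, and `match_count` are illustrative names, and matching is done on whole whitespace-separated words.

```python
import pandas as pd

# Hypothetical toy catalogue for illustration only.
catalog = pd.DataFrame({
    "title": [
        "wireless gaming mouse with rgb lighting",
        "gaming headset with microphone",
        "ergonomic office mouse pad",
        "mechanical gaming keyboard and mouse bundle",
    ],
    "price": [48.0, 30.0, 12.0, 120.0],
})


def keyword_baseline(query, min_price, max_price, top_n=3):
    """Filter by price, then rank by how many query words appear in the title."""
    keywords = set(query.lower().split())
    in_budget = catalog[catalog["price"].between(min_price, max_price)].copy()
    # Count distinct query words that occur as whole words in each title.
    in_budget["match_count"] = in_budget["title"].apply(
        lambda title: len(keywords & set(title.lower().split()))
    )
    hits = in_budget[in_budget["match_count"] > 0]
    return hits.sort_values("match_count", ascending=False).head(top_n)


baseline_results = keyword_baseline("gaming mouse", min_price=0, max_price=50)
print(baseline_results)
```

Because every matching word counts equally, a product matching only a very common word scores the same as one matching only a distinctive word. That is the limitation the TF-IDF model is meant to address.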

**Enhanced Model:**
- TF-IDF-based recommender with semantic understanding
- Cosine similarity for relevance scoring
- Demonstrate improvement over baseline

**Success Criteria:**
- Both models return ranked recommendations
- Enhanced model captures semantic relationships
- Similarity scores are interpretable


### Part 3: Visualization Goals
- Create price distribution visualizations
- Generate category distribution charts
- Visualize recommendation relevance scores
- Produce JSON outputs for SE team integration
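
For the JSON hand-off, one possible payload shape is sketched below. The schema (`query`, `budget`, `count`, `recommendations`) and the helper `to_payload` are assumptions for illustration, since the format expected by the SE team is not specified here. The product IDs, titles, and scores are placeholders, not real results.

```python
import json
import pandas as pd

# Hypothetical recommendations; IDs, titles, and scores are placeholders.
recs = pd.DataFrame({
    "asin": ["EXAMPLE001", "EXAMPLE002"],
    "title": ["digital camera tripod", "camera lens cleaning kit"],
    "price": [35.0, 15.0],
    "score": [0.62, 0.49],
})


def to_payload(query, min_price, max_price, frame):
    """Build a JSON-serializable dict from a recommendations DataFrame."""
    # Round-trip through to_json so NumPy scalars become plain JSON types.
    records = json.loads(frame.to_json(orient="records"))
    return {
        "query": query,
        "budget": {"min": min_price, "max": max_price},
        "count": len(records),
        "recommendations": records,
    }


payload = to_payload("photography camera", 10, 50, recs)
json_text = json.dumps(payload, indent=2)
print(json_text)
```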

### Part 4: Documentation Goals
- Write comprehensive report
- Document preprocessing steps
- Explain model architectures
- Interpret visualization insights
- Provide conclusions and recommendations

In [11]:
import pandas as pd
import numpy as np
import json
import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import matplotlib.pyplot as plt
import seaborn as sns
from collections import Counter
import warnings
warnings.filterwarnings('ignore')

In [12]:
# Load the raw Amazon product dataset
df = pd.read_csv('amazon_products.csv')

print("\nColumn names:")
for i, col in enumerate(df.columns, 1):
    print(f"{i}.{col}")

# Display first few rows
print("\nFirst 10 rows of the dataset:")
print(df.head(10))

# Dataset information
print("Dataset Information:")
df.info()

# Basic statistics
print("\nNumerical Columns Statistics:")
print(df.describe())

# Check data types
print("\nData Types:")
for col in df.columns:
    print(f"{col:30s} : {str(df[col].dtype):15s} | {df[col].nunique():6d} unique values")


Column names:
1.asin
2.title
3.imgUrl
4.productURL
5.stars
6.reviews
7.price
8.listPrice
9.category_id
10.isBestSeller
11.boughtInLastMonth

First 10 rows of the dataset:
         asin                                              title  \
0  B014TMV5YE  Sion Softside Expandable Roller Luggage, Black...   
1  B07GDLCQXV  Luggage Sets Expandable PC+ABS Durable Suitcas...   
2  B07XSCCZYG  Platinum Elite Softside Expandable Checked Lug...   
3  B08MVFKGJM  Freeform Hardside Expandable with Double Spinn...   
4  B01DJLKZBA  Winfield 2 Hardside Expandable Luggage with Sp...   
5  B07XSCD2R4  Maxlite 5 Softside Expandable Luggage with 4 S...   
6  B07MXF4G8K  Hard Shell Carry on Luggage Airline Approved, ...   
7  B07H515VCZ  Maxporter II 30" Hardside Spinner Trunk Luggag...   
8  B08BXBCNMQ  Omni 2 Hardside Expandable Luggage with Spinne...   
9  B0B9K44XTS  Luggage Sets Expandable Lightweight Suitcases ...   

                                              imgUrl  \
0  https://m.media-amaz

In [13]:
# Calculate missing values
missing_data = pd.DataFrame({
    'Column': df.columns,
    'Missing_Count': df.isnull().sum(),
    'Missing_Percentage': (df.isnull().sum() / len(df) * 100).round(4)})

missing_data = missing_data[missing_data['Missing_Count'] > 0].sort_values('Missing_Percentage', ascending=False)

if len(missing_data) > 0:
    print("Columns with missing values:")
    print(missing_data.to_string(index=False))

Columns with missing values:
Column  Missing_Count  Missing_Percentage
 title              1              0.0001


In [17]:
initial_rows = len(df)

# Remove rows with missing critical fields
if 'price' in df.columns:
    before = len(df)
    df = df[df['price'].notna()]
    df = df[df['price'] > 0]
    print(f"Removed {before - len(df)} products with missing or invalid prices")

if 'title' in df.columns:
    before = len(df)
    df = df[df['title'].notna()]
    df = df[df['title'].str.strip() != '']
    print(f"Removed {before - len(df)} products with missing titles")

# Fill missing values in non-critical fields
if 'category' in df.columns:
    missing_cat = df['category'].isna().sum()
    df['category'] = df['category'].fillna('Uncategorized')
    print(f"Filled {missing_cat} missing categories with 'Uncategorized'")

# Handle text fields
text_columns = ['description', 'tags', 'features', 'about']
for col in text_columns:
    if col in df.columns:
        missing = df[col].isna().sum()
        df[col] = df[col].fillna('')
        if missing > 0:
            print(f"Filled {missing} missing values in '{col}' with empty strings")

print(f"\nRows removed: {initial_rows - len(df)}")
print(f"Final dataset size: {len(df)} rows")

Removed 0 products with missing or invalid prices
Removed 0 products with missing titles

Rows removed: 0
Final dataset size: 1393564 rows


In [18]:
def clean_text(text):
    """
    Clean and normalize text data
    
    Parameters:
    - text: input text string
    
    Returns:
    - cleaned text string
    """
    if pd.isna(text) or text == '':
        return ''
    
    # Convert to string and lowercase
    text = str(text).lower()
    
    # Remove special characters but keep spaces
    text = re.sub(r'[^a-z0-9\s]', ' ', text)
    
    # Remove extra whitespace
    text = ' '.join(text.split())
    
    return text
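
To make the effect of this normalization concrete, the sketch below repeats the function in condensed form (so that it runs on its own) and applies it to a few inputs. The sample strings are illustrative. Note that the character class `[^a-z0-9\s]` also strips non-ASCII letters, which may matter for titles containing accented characters.

```python
import re
import pandas as pd


def clean_text(text):
    """Lowercase, replace non-alphanumeric characters with spaces, collapse whitespace."""
    if pd.isna(text) or text == '':
        return ''
    text = str(text).lower()
    text = re.sub(r'[^a-z0-9\s]', ' ', text)
    return ' '.join(text.split())


samples = ['Maxporter II 30" Hardside', 'PC+ABS Café', None]
cleaned = [clean_text(s) for s in samples]

for raw, out in zip(samples, cleaned):
    print(repr(raw), '->', repr(out))
# 'Maxporter II 30" Hardside' -> 'maxporter ii 30 hardside'
# 'PC+ABS Café' -> 'pc abs caf'
# None -> ''
```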

In [19]:
# Clean title
if 'title' in df.columns:
    df['title_clean'] = df['title'].apply(clean_text)
    print(f"Cleaned {len(df)} product titles")

# Clean description
if 'description' in df.columns:
    df['description_clean'] = df['description'].apply(clean_text)
    print(f"Cleaned {len(df)} product descriptions")

# Clean category
if 'category' in df.columns:
    df['category_clean'] = df['category'].apply(clean_text)
    print(f"Cleaned {len(df)} product categories")

Cleaned 1393564 product titles


In [20]:
# Create combined features for recommendation
text_fields = []
if 'title_clean' in df.columns:
    text_fields.append('title_clean')
if 'category_clean' in df.columns:
    text_fields.append('category_clean')
if 'description_clean' in df.columns:
    text_fields.append('description_clean')

if text_fields:
    df['combined_features'] = df[text_fields].apply(
        lambda x: ' '.join(x.dropna().astype(str)), axis=1)
    print(f"Created 'combined_features' from: {', '.join(text_fields)}")
    
    # Show sample
    print("\nSample combined features:")
    print(df['combined_features'].iloc[0][:200] + "...")

# Create price range categories
# pd.cut bins are right-inclusive by default: (0, 25], (25, 50], ..., (200, inf)
if 'price' in df.columns:
    df['price_range'] = pd.cut(
        df['price'],
        bins=[0, 25, 50, 100, 200, float('inf')],
        labels=['Budget ($0-25)', 'Affordable ($25-50)', 
                'Mid-range ($50-100)', 'Premium ($100-200)', 
                'Luxury ($200+)'])
    print("Created 'price_range' categories")
    
    # Show distribution
    print("\nPrice range distribution:")
    print(df['price_range'].value_counts().sort_index())

Created 'combined_features' from: title_clean

Sample combined features:
sion softside expandable roller luggage black checked large 29 inch...
Created 'price_range' categories

Price range distribution:
price_range
Budget ($0-25)         854182
Affordable ($25-50)    307612
Mid-range ($50-100)    124589
Premium ($100-200)      67213
Luxury ($200+)          39968
Name: count, dtype: int64
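
The bin edges used for `price_range` are worth a quick check, because `pd.cut` treats intervals as right-inclusive by default. The sketch below applies the same bins and labels to a handful of made-up boundary prices to show where each one lands. A price of exactly 0 falls outside the first bin and becomes `NaN`, which is consistent with the earlier step that drops non-positive prices.

```python
import pandas as pd

bins = [0, 25, 50, 100, 200, float('inf')]
labels = ['Budget ($0-25)', 'Affordable ($25-50)',
          'Mid-range ($50-100)', 'Premium ($100-200)',
          'Luxury ($200+)']

# Made-up prices sitting on or next to the bin edges.
sample = pd.Series([0.0, 0.01, 25.0, 25.01, 200.0, 200.01])
ranges = pd.cut(sample, bins=bins, labels=labels)

print(pd.DataFrame({'price': sample, 'price_range': ranges}))
```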
