# Dataset Preparation for Association Mining

This notebook prepares the **What's Cooking** dataset from Kaggle for association mining analysis. The goal is to transform the hypergraph structure into a transaction-based format suitable for discovering patterns in culinary composition across different cuisines.

## Dataset Overview
- **Source**: Kaggle What's Cooking Competition
- **Structure**: Hypergraph with recipes as edges connecting ingredient nodes
- **Purpose**: Unsupervised pattern discovery in high-sparsity transactional culinary data

## 1. Import Required Libraries

In [1]:
import json
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from collections import Counter
import warnings
warnings.filterwarnings('ignore')

# Set display options
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', 100)

# Set plot style
sns.set_style("whitegrid")
plt.rcParams['figure.figsize'] = (12, 6)

print("Libraries imported successfully!")

Libraries imported successfully!


## 2. Load Raw Data

Load the hypergraph dataset and the ingredient normalization mapping from their JSON files.

In [2]:
# Load the What's Cooking hypergraph dataset
# (explicit UTF-8 so non-ASCII ingredient names decode the same on every platform)
with open('dataset/kaggle-whats-cooking.json', 'r', encoding='utf-8') as f:
    cooking_data = json.load(f)

# Load ingredient normalization mapping
with open('dataset/ingredients.json', 'r', encoding='utf-8') as f:
    ingredients_mapping = json.load(f)

print("Loaded What's Cooking hypergraph dataset")
print("Loaded ingredient normalization mapping")
print(f"\nDataset structure keys: {list(cooking_data.keys())}")

Loaded What's Cooking hypergraph dataset
Loaded ingredient normalization mapping

Dataset structure keys: ['hypergraph-data', 'node-data', 'edge-data', 'edge-dict']


## 3. Understand the Hypergraph Structure

The dataset is structured as a hypergraph where:
- **Nodes** (node-data): Individual ingredients with unique IDs
- **Edges** (edge-dict): Recipes, each containing a list of ingredient IDs
- **Edge Labels** (edge-data): Cuisine type for each recipe (greek, italian, indian, etc.)

In [3]:
# Extract components
node_data = cooking_data['node-data']  # Ingredient ID -> name
edge_data = cooking_data['edge-data']  # Recipe ID -> cuisine type
edge_dict = cooking_data['edge-dict']  # Recipe ID -> list of ingredient IDs

print("=" * 60)
print("HYPERGRAPH STRUCTURE")
print("=" * 60)
print(f"Total ingredients (nodes): {len(node_data):,}")
print(f"Total recipes (edges): {len(edge_dict):,}")
print(f"Total edge labels: {len(edge_data):,}")

# Show sample data
print("\n" + "-" * 60)
print("Sample Node (Ingredient):")
print("-" * 60)
sample_node_id = list(node_data.keys())[0]
print(f"ID {sample_node_id}: {node_data[sample_node_id]['name']}")

print("\n" + "-" * 60)
print("Sample Edge (Recipe):")
print("-" * 60)
sample_recipe_id = list(edge_dict.keys())[0]
print(f"Recipe ID: {sample_recipe_id}")
print(f"Cuisine: {edge_data[sample_recipe_id]['name']}")
print(f"Ingredient IDs: {edge_dict[sample_recipe_id][:5]}... ({len(edge_dict[sample_recipe_id])} total)")

# Show what those ingredients are
print("\nIngredient names:")
for ing_id in edge_dict[sample_recipe_id][:5]:
    print(f"  - {node_data[ing_id]['name']}")

HYPERGRAPH STRUCTURE
Total ingredients (nodes): 6,714
Total recipes (edges): 39,774
Total edge labels: 39,774

------------------------------------------------------------
Sample Node (Ingredient):
------------------------------------------------------------
ID 5930: ginger paste

------------------------------------------------------------
Sample Edge (Recipe):
------------------------------------------------------------
Recipe ID: 0
Cuisine: greek
Ingredient IDs: ['5930', '3243', '2095', '4243', '2291']... (9 total)

Ingredient names:
  - ginger paste
  - sea salt
  - shortbread
  - puffed rice
  - chocolate
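
To make the three components concrete, here is a minimal sketch on a toy hypergraph. The IDs, names, and recipes are invented for illustration and are not taken from the dataset; only the nesting (`name` fields, lists of ingredient IDs per recipe) mirrors the structure printed above. It also computes the incidence density, which is one way to quantify the sparsity mentioned in the overview.

```python
# Toy hypergraph mirroring the layout of the real file (all values are made up).
toy_node_data = {
    "0": {"name": "olive oil"},
    "1": {"name": "feta"},
    "2": {"name": "garlic"},
    "3": {"name": "basil"},
}
toy_edge_data = {"0": {"name": "greek"}, "1": {"name": "italian"}}
toy_edge_dict = {"0": ["0", "1"], "1": ["0", "2", "3"]}

# Resolve each recipe (hyperedge) to its cuisine label and ingredient names.
resolved = {
    recipe_id: (
        toy_edge_data[recipe_id]["name"],
        [toy_node_data[i]["name"] for i in ing_ids],
    )
    for recipe_id, ing_ids in toy_edge_dict.items()
}

# Density = filled cells / all cells of the recipe-by-ingredient incidence matrix.
n_incidences = sum(len(ids) for ids in toy_edge_dict.values())
density = n_incidences / (len(toy_edge_dict) * len(toy_node_data))

for recipe_id, (cuisine, names) in resolved.items():
    print(f"Recipe {recipe_id} ({cuisine}): {names}")
print(f"Incidence density: {density:.3f}")
```

The same ratio can be computed on the real data from `edge_dict` and `node_data`. With thousands of ingredients and only a handful per recipe, it should come out very small.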


## 4. Create Ingredient ID to Normalized Name Mapping

Convert the ingredient mapping list into a dictionary for fast lookup.

In [4]:
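# Create mapping: ingredient_id -> canonicalized (normalized) name.
# Keys are cast to str because edge_dict stores ingredient IDs as strings;
# if the mapping IDs are already strings, the cast is a no-op.
id_to_canonical = {
    str(item['id']): item['canonicalized'] for item in ingredients_mapping
}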

print(f"Created mapping for {len(id_to_canonical):,} ingredients")

# Show examples of normalization
print("\n" + "=" * 60)
print("INGREDIENT NORMALIZATION EXAMPLES")
print("=" * 60)
examples = ingredients_mapping[:5]
for ex in examples:
    print(f"{ex['ingredient']:30s} → {ex['canonicalized']}")

Created mapping for 6,714 ingredients

INGREDIENT NORMALIZATION EXAMPLES
ginger paste                   → ginger
sea salt                       → salt
shortbread                     → shortbread
chocolate                      → chocolate
puffed rice                    → rice
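
Because the transformation step looks ingredients up by ID, it can be worth checking that every node has a mapping entry and that the key types match. The sketch below uses invented toy data and a hypothetical helper, `mapping_coverage`; it assumes mapping entries carry `id` and `canonicalized` fields as in the cell above.

```python
def mapping_coverage(node_ids, id_to_name):
    """Return (covered_fraction, sorted list of IDs missing from the mapping)."""
    node_ids = {str(n) for n in node_ids}
    missing = sorted(node_ids - set(id_to_name))
    covered = 1 - len(missing) / len(node_ids) if node_ids else 1.0
    return covered, missing

# Toy inputs (made up): note the integer ID, which str() normalizes.
toy_mapping = [
    {"id": 0, "ingredient": "sea salt", "canonicalized": "salt"},
    {"id": "1", "ingredient": "kosher salt", "canonicalized": "salt"},
]
toy_id_to_canonical = {str(m["id"]): m["canonicalized"] for m in toy_mapping}

coverage, missing_ids = mapping_coverage(["0", "1", "2"], toy_id_to_canonical)
print(f"Coverage: {coverage:.1%}, missing IDs: {missing_ids}")
```

On the real data, the call would be along the lines of `mapping_coverage(node_data.keys(), id_to_canonical)`. Any missing IDs are the ingredients that the transformation step skips.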


## 5. Transform to Transaction Format

Convert the hypergraph structure into a transactional dataset where each row is a recipe with its cuisine type and normalized ingredients.

In [None]:
# Create transaction dataset
transactions = []

for recipe_id, ingredient_ids in edge_dict.items():
    # Get cuisine type
    cuisine = edge_data[recipe_id]['name']
    
    # Get normalized ingredient names. IDs without a mapping are skipped,
    # and a set removes duplicates, since several raw ingredients can
    # normalize to the same canonical name (e.g. "sea salt" -> "salt").
    normalized_ingredients = {
        id_to_canonical[ing_id]
        for ing_id in ingredient_ids
        if ing_id in id_to_canonical
    }
    
    # Create transaction record
    transactions.append({
        'recipe_id': recipe_id,
        'cuisine': cuisine,
        'ingredients': ','.join(sorted(normalized_ingredients)),  # Comma-separated for CSV
        'ingredient_count': len(normalized_ingredients)
    })

# Convert to DataFrame
df = pd.DataFrame(transactions)

print(f"Created {len(df):,} transaction records")
print(f"\nDataFrame shape: {df.shape}")
print(f"Columns: {list(df.columns)}")

# Export to CSV
df.to_csv('dataset/prepared_recipes_raw.csv', index=False)
print("\nExported to dataset/prepared_recipes_raw.csv")

Created 39,774 transaction records

DataFrame shape: (39774, 4)
Columns: ['recipe_id', 'cuisine', 'ingredients', 'ingredient_count']
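
As a rough sketch of how the exported `ingredients` column could feed an association analysis, the example below parses the comma-joined strings back into item sets and counts single-item and pair support. The rows are made up and only mimic the column layout above. It assumes that no canonical ingredient name itself contains a comma; if one does, a different separator would be needed at export time.

```python
from collections import Counter
from itertools import combinations

# Toy rows in the exported format (values are made up).
toy_rows = [
    {"recipe_id": "0", "cuisine": "greek", "ingredients": "feta,olive oil"},
    {"recipe_id": "1", "cuisine": "italian", "ingredients": "basil,garlic,olive oil"},
    {"recipe_id": "2", "cuisine": "italian", "ingredients": "basil,garlic"},
]

# Parse each comma-joined string back into a set of items (one transaction).
baskets = [set(row["ingredients"].split(",")) for row in toy_rows]

# Count single items and unordered pairs across all transactions.
item_counts = Counter(item for basket in baskets for item in basket)
pair_counts = Counter(
    pair for basket in baskets for pair in combinations(sorted(basket), 2)
)

# Support = fraction of transactions containing the item or pair.
n = len(baskets)
item_support = {item: count / n for item, count in item_counts.items()}
pair_support = {pair: count / n for pair, count in pair_counts.items()}

print(item_support)
print(pair_support)
```

Exhaustive pair counting is only practical for small item sets. For the full dataset, a dedicated frequent-itemset algorithm such as Apriori or FP-Growth is the more usual choice.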
