<a href="https://colab.research.google.com/github/darlon31/FlavorGraph/blob/HybridSystem/Analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
# CELL 1: Setup and Imports - Run this first
# Install required packages first so the imports below cannot fail on a fresh runtime
!pip install -q scikit-learn numpy tqdm torch networkx matplotlib pandas

# Import all necessary libraries
import os
import sys
import pickle
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

In [2]:
# CELL 2: Repository Setup
# This is written to be safely re-runnable - it won't cause errors if run multiple times
import os

REPO_DIR = 'FlavorGraph'

# Skip cloning and chdir if a previous run already moved us into the repository;
# otherwise a second run would clone a nested copy inside FlavorGraph
if os.path.basename(os.getcwd()) != REPO_DIR:
    # Clone the repository only if it doesn't exist yet
    if not os.path.exists(REPO_DIR):
        !git clone https://github.com/darlon31/FlavorGraph.git  # Replace with your fork URL
    else:
        print("Repository already exists")

    # Change to the FlavorGraph directory
    os.chdir(REPO_DIR)

print(f"Current working directory: {os.getcwd()}")

Cloning into 'FlavorGraph'...
remote: Enumerating objects: 343, done.
remote: Counting objects: 100% (94/94), done.
remote: Compressing objects: 100% (60/60), done.
remote: Total 343 (delta 49), reused 57 (delta 33), pack-reused 249 (from 1)
Receiving objects: 100% (343/343), 20.85 MiB | 8.46 MiB/s, done.
Resolving deltas: 100% (197/197), done.
Current working directory: /content/FlavorGraph


In [3]:
# CELL 3: Load Data
# Load the embeddings file
def load_embeddings():
    try:
        with open('output/kitchenette_embeddings.pkl', 'rb') as f:
            return pickle.load(f)
    except FileNotFoundError:
        print("Error: Embeddings file not found! Make sure it's in the output directory")
        return None

# Load the embeddings
embeddings = load_embeddings()
if embeddings:
    print(f"Successfully loaded embeddings with {len(embeddings)} ingredients")
else:
    print("Failed to load embeddings")

Successfully loaded embeddings with 3567 ingredients


In [4]:
# CELL 4: Define Functions
def find_food_pairings(ingredient, embeddings, top_k=5):
    """
    Find top-k ingredients that pair well with the input ingredient
    """
    if ingredient not in embeddings:
        return f"Ingredient '{ingredient}' not found in the database"

    # Get the embedding for our ingredient
    ingredient_embedding = embeddings[ingredient]

    # Calculate similarity with all other ingredients
    similarities = {}
    for name, emb in embeddings.items():
        if name != ingredient:
            sim = cosine_similarity([ingredient_embedding], [emb])[0][0]
            similarities[name] = sim

    # Return top-k similar ingredients
    return sorted(similarities.items(), key=lambda x: x[1], reverse=True)[:top_k]

def test_ingredient(ingredient, top_k=5):
    """
    Test function to print pairings for a given ingredient
    """
    print(f"\nTop {top_k} pairings for {ingredient}:")
    pairings = find_food_pairings(ingredient, embeddings, top_k)
    if isinstance(pairings, str):
        print(pairings)
    else:
        for pair, score in pairings:
            print(f"- {pair}: {score:.3f}")

def list_random_ingredients(n=10):
    """
    List some random ingredients from the database
    """
    import random
    sample_ingredients = random.sample(list(embeddings.keys()), n)
    print("\nSample ingredients you can try:")
    for i, ingredient in enumerate(sample_ingredients, 1):
        print(f"{i}. {ingredient}")
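
`find_food_pairings` calls `cosine_similarity` once per ingredient, which means thousands of small calls for a single query. The sketch below does the same ranking with one matrix product. It assumes every value in `embeddings` is a 1-D numeric vector of the same length, which the cell above already relies on. `find_food_pairings_vectorized` and the toy vectors are illustrative names and made-up values, not part of FlavorGraph.

```python
import numpy as np

def find_food_pairings_vectorized(ingredient, embeddings, top_k=5):
    """Return the top_k (name, cosine similarity) pairs for `ingredient`."""
    if ingredient not in embeddings:
        return f"Ingredient '{ingredient}' not found in the database"

    names = list(embeddings.keys())
    # Stack all vectors into one (n_ingredients, dim) matrix
    matrix = np.asarray([embeddings[name] for name in names], dtype=float)

    # L2-normalise the rows so that a dot product equals cosine similarity
    norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # guard against all-zero vectors
    unit = matrix / norms

    query_idx = names.index(ingredient)
    sims = unit @ unit[query_idx]
    sims[query_idx] = -np.inf  # exclude the query ingredient itself

    top_k = min(top_k, len(names) - 1)
    top_idx = np.argsort(-sims)[:top_k]
    return [(names[i], float(sims[i])) for i in top_idx]


# Toy embeddings (made-up values, not taken from kitchenette_embeddings.pkl)
toy = {
    'tomato': np.array([1.0, 0.0]),
    'basil': np.array([0.9, 0.1]),
    'chocolate': np.array([0.0, 1.0]),
}
print(find_food_pairings_vectorized('tomato', toy, top_k=2))
```

The return type mirrors `find_food_pairings` (a list of tuples, or a message string when the ingredient is missing), so it could be swapped into `test_ingredient` without other changes.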

In [5]:
# CELL 5: Testing Area - This is where you'll run your tests
# Example tests
print("Testing some common ingredients:")
test_ingredients = ['tomato', 'chocolate', 'beef', 'garlic']
for ingredient in test_ingredients:
    test_ingredient(ingredient)

# Show some random ingredients you can try
list_random_ingredients()

# You can test any ingredient by running:
# test_ingredient('your_ingredient_here')

Testing some common ingredients:

Top 5 pairings for tomato:
- onion: 0.842
- green: 0.841
- basil: 0.835
- rosemary: 0.829
- parsley: 0.826

Top 5 pairings for chocolate:
- fudge: 0.761
- peanut: 0.736
- strawberry: 0.727
- chip: 0.723
- mint: 0.723

Top 5 pairings for beef:
- meat: 0.893
- oregano: 0.870
- pepper: 0.867
- paprika: 0.864
- chicken: 0.861

Top 5 pairings for garlic:
- beef_mince: 0.499
- boneless_chicken: 0.490
- red_chili_peppers: 0.484
- brinjal: 0.474
- minced_beef: 0.471

Sample ingredients you can try:
1. pork_hocks
2. mixed_vegetables
3. beef_consomme
4. chili_flakes
5. orange_juice
6. tomatillo_salsa
7. fromage_blanc
8. dried_fruit
9. asafoetida_powder
10. distilled_white_vinegar


So far we have only set up the model and checked its output. Below we run tests to see how it actually behaves and whether it could be useful for our app.

In [6]:
# Let's analyze the scoring distribution for better understanding
def analyze_pairing_scores(ingredient, embeddings):
    """
    Analyze the distribution of similarity scores for an ingredient
    """
    if ingredient not in embeddings:
        return "Ingredient not found"

    ingredient_embedding = embeddings[ingredient]
    scores = []

    # Calculate similarity with all ingredients
    for name, emb in embeddings.items():
        if name != ingredient:
            sim = cosine_similarity([ingredient_embedding], [emb])[0][0]
            scores.append((name, sim))

    # Sort scores
    scores.sort(key=lambda x: x[1], reverse=True)

    # Calculate statistics
    all_scores = [s[1] for s in scores]
    avg_score = np.mean(all_scores)
    median_score = np.median(all_scores)

    print(f"\nAnalysis for {ingredient}:")
    print(f"Average similarity score: {avg_score:.3f}")
    print(f"Median similarity score: {median_score:.3f}")
    print(f"Score range: {min(all_scores):.3f} to {max(all_scores):.3f}")

    # Print top and bottom examples
    print("\nTop 5 pairings:")
    for name, score in scores[:5]:
        print(f"- {name}: {score:.3f}")

    print("\nBottom 5 pairings:")
    for name, score in scores[-5:]:
        print(f"- {name}: {score:.3f}")

    return scores

# Let's also create a function to find ingredients with specific characteristics
def find_ingredients_by_type(query_terms, embeddings, top_k=10):
    """
    Find ingredients that match certain characteristics
    """
    matching_ingredients = []
    for ingredient in embeddings.keys():
        if any(term.lower() in ingredient.lower() for term in query_terms):
            matching_ingredients.append(ingredient)

    print(f"\nFound {len(matching_ingredients)} ingredients matching {query_terms}:")
    for i, ingredient in enumerate(matching_ingredients[:top_k], 1):
        print(f"{i}. {ingredient}")

    return matching_ingredients

# Let's test these functions
print("Analyzing some ingredients for different categories:")

# Test with a protein
protein_scores = analyze_pairing_scores('chicken', embeddings)

# Find all available proteins
protein_ingredients = find_ingredients_by_type(['chicken', 'beef', 'fish', 'pork'], embeddings)

# Test with a vegetable
veggie_scores = analyze_pairing_scores('broccoli', embeddings)

Analyzing some ingredients for different categories:

Analysis for chicken:
Average similarity score: 0.108
Median similarity score: 0.074
Score range: -0.096 to 0.861

Top 5 pairings:
- beef: 0.861
- meat: 0.843
- pork: 0.838
- soup: 0.830
- duck: 0.824

Bottom 5 pairings:
- pizza_cheese: -0.069
- mashed_potatoes: -0.070
- chili_with_beans: -0.077
- skordalia: -0.081
- frozen_tater_tots: -0.096

Found 150 ingredients matching ['chicken', 'beef', 'fish', 'pork']:
1. beef_suet
2. beef_bouillon
3. pork_hocks
4. chicken_fat
5. butterfish
6. chicken_cutlets
7. chicken_base
8. chipped_beef
9. pork_mince
10. corned_beef_brisket

Analysis for broccoli:
Average similarity score: 0.123
Median similarity score: 0.091
Score range: -0.084 to 0.822

Top 5 pairings:
- ham: 0.822
- cornstarch: 0.822
- beef: 0.819
- soup: 0.817
- vinegar: 0.811

Bottom 5 pairings:
- roasted_garlic: -0.055
- powdered_sugar: -0.056
- confectioners'_sugar: -0.063
- ground_hazelnuts: -0.070
- goose_fat: -0.084
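
`find_ingredients_by_type` matches query terms as plain substrings, which is why `butterfish` appears in the list above for the term `fish`. A shorter term would pick up unrelated names in the same way. The sketch below is one possible alternative that matches whole `_`-separated tokens instead. `contains_term`, `match_by_token` and the sample names are hypothetical and only serve as illustration.

```python
def contains_term(name, term):
    """True if `term` appears in `name` as a run of whole '_'-separated tokens."""
    name_tokens = name.lower().split('_')
    term_tokens = term.lower().split('_')
    n = len(term_tokens)
    return any(name_tokens[i:i + n] == term_tokens
               for i in range(len(name_tokens) - n + 1))


def match_by_token(query_terms, ingredient_names):
    """Return each matching name once, in the order it was first seen."""
    seen = set()
    matches = []
    for name in ingredient_names:
        if name not in seen and any(contains_term(name, t) for t in query_terms):
            seen.add(name)
            matches.append(name)
    return matches


# Hypothetical ingredient names, chosen to show the difference
sample_names = ['snow_pea', 'peanut_butter', 'pear', 'light_soy_sauce', 'soy_milk']
print(match_by_token(['pea'], sample_names))
print(match_by_token(['soy_sauce'], sample_names))
```

Token matching is stricter in both directions: it would drop `peanut_butter` for the term `pea`, but it would also drop `butterfish` for `fish` and would miss plural forms, so the better choice depends on how the ingredient names are written.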


This output tells us several things and suggests next steps for our app:

**Score Distribution Analysis:**
- Scores range roughly from -0.1 to 0.86.
- The average scores (0.108 for chicken, 0.123 for broccoli) are low compared with the top scores.
- Most ingredients are therefore close to unrelated to the query and only a small set scores high, which suggests the model is quite discriminative in its pairings.

**Interesting Patterns:**
- The model shows strong protein-to-protein relationships (chicken→beef: 0.861).
- Some counterintuitive pairings (broccoli→ham: 0.822) might be based more on recipe co-occurrence than on molecular compatibility.
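
The mean and median alone do not show how far the top pairings sit from the bulk of the scores. The sketch below adds percentiles and a rough z-score for the top few results. `summarize_scores` is an illustrative helper, not part of the notebook. It assumes its input has the shape that `analyze_pairing_scores` returns, a list of `(name, similarity)` tuples, and the toy scores are made up.

```python
import numpy as np

def summarize_scores(scores, top_n=5):
    """Summarise a list of (name, similarity) pairs with percentiles."""
    values = np.array([s for _, s in scores], dtype=float)
    p50, p90, p99 = np.percentile(values, [50, 90, 99])
    mean = values.mean()
    std = values.std()
    top_mean = np.sort(values)[-top_n:].mean()
    return {
        'mean': float(mean),
        'p50': float(p50),
        'p90': float(p90),
        'p99': float(p99),
        'top_mean': float(top_mean),
        # Number of standard deviations the top-n mean sits above the overall mean
        'top_z': float((top_mean - mean) / std) if std > 0 else 0.0,
    }


# Toy scores (made-up values); with real data one could pass protein_scores instead
toy_scores = [('a', 0.9), ('b', 0.8), ('c', 0.1), ('d', 0.0), ('e', -0.1)]
summary = summarize_scores(toy_scores, top_n=2)
for key, value in summary.items():
    print(f"{key}: {value:.3f}")
```

A large `top_z` together with a 90th percentile well below the top scores would support the reading that only a small set of ingredients is strongly related to the query.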

The analysis below will help us understand:

- How much the model's recommendations are based on molecular similarity vs. recipe patterns
- Which ingredients might need different weightings for our app
- How to adjust the recommendations based on user preferences

It covers four areas:

**Cuisine Patterns:**
- How well ingredients cluster by cuisine
- Cross-cuisine ingredient relationships
- Potential for cuisine-specific recommendations

**Cooking Methods:**
- Ingredient transformations
- Cooking method compatibility
- Recipe technique suggestions

**Seasonal Patterns:**
- Temporal ingredient relationships
- Seasonal substitutions
- Seasonal recipe recommendations

**Network Structure:**
- Key ingredient hubs
- Ingredient-compound relationships
- Network-based recommendations
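
One way to put a number on "how well ingredients cluster" is to compare the mean cosine similarity inside a marker group with the mean similarity between that group and every other ingredient. The sketch below is a minimal version of that idea and not the method used in the next cell. `group_cohesion` and the toy vectors are illustrative, and the code assumes all embeddings are non-zero vectors of equal length.

```python
import numpy as np

def group_cohesion(group, embeddings):
    """Return (within, between) mean cosine similarity for `group`."""
    names = list(embeddings.keys())
    members = set(group)
    in_group = np.array([name in members for name in names])
    k = int(in_group.sum())
    if k < 2 or k == len(names):
        raise ValueError("need at least two group members and one outside ingredient")

    matrix = np.asarray([embeddings[name] for name in names], dtype=float)
    # Assumes no all-zero vectors, so the norms are safe to divide by
    unit = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)

    inside = unit[in_group] @ unit[in_group].T
    # Average the off-diagonal entries only, ignoring self-similarity
    within = (inside.sum() - np.trace(inside)) / (k * (k - 1))
    between = (unit[in_group] @ unit[~in_group].T).mean()
    return float(within), float(between)


# Toy embeddings (made-up values, not taken from kitchenette_embeddings.pkl)
toy_embeddings = {
    'soy_sauce': [1.0, 0.1],
    'ginger': [0.9, 0.2],
    'basil': [0.1, 1.0],
    'parmesan': [0.2, 0.9],
}
within, between = group_cohesion(['soy_sauce', 'ginger'], toy_embeddings)
print(f"within-group: {within:.3f}, group-to-rest: {between:.3f}")
```

A group whose within-group value is clearly higher than its group-to-rest value forms a tighter cluster in embedding space, while similar values suggest the markers do not separate from the rest of the ingredients.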

In [9]:
import pandas as pd
import numpy as np
from collections import defaultdict
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.metrics.pairwise import cosine_similarity
import networkx as nx
from datetime import datetime
import calendar

def load_and_analyze_data():
    """
    Load and analyze the FlavorGraph data structure
    """
    # Load ingredient categories
    print("Loading ingredient categories...")
    categories_df = pd.read_csv('input/dict_ingr2cate - Top300+FDB400+HyperFoods104=616.csv')
    print(f"Categories loaded: {len(categories_df)} ingredients")

    # Load nodes
    print("\nLoading nodes...")
    nodes_df = pd.read_csv('input/nodes_191120.csv')
    print(f"Nodes loaded: {len(nodes_df)} total nodes")
    print(f"Node types: {nodes_df['node_type'].value_counts().to_dict()}")

    # Load edges with correct column names
    print("\nLoading edges...")
    edges_df = pd.read_csv('input/edges_191120.csv')
    print(f"Edge columns: {edges_df.columns.tolist()}")
    print(f"Total edges: {len(edges_df)}")

    return categories_df, nodes_df, edges_df

def analyze_ingredient_categories(categories_df):
    """
    Analyze ingredient category distribution and relationships
    """
    category_counts = categories_df['category'].value_counts()
    print("\nIngredient Categories Distribution:")
    for category, count in category_counts.items():
        print(f"- {category}: {count} ingredients")

    return category_counts

def analyze_cuisine_patterns(embeddings):
    """
    Analyze cuisine-specific patterns in the embeddings
    """
    # Define cuisine-specific ingredient groups
    cuisine_markers = {
        'asian': ['soy_sauce', 'sesame_oil', 'ginger', 'rice_vinegar', 'mirin'],
        'italian': ['olive_oil', 'basil', 'parmesan', 'pasta', 'tomato'],
        'mexican': ['cilantro', 'lime', 'jalapeno', 'tortilla', 'cumin'],
        'indian': ['garam_masala', 'turmeric', 'cardamom', 'cumin', 'coriander'],
        'mediterranean': ['olive_oil', 'lemon', 'feta', 'oregano', 'garlic']
    }

    print("\nAnalyzing Cuisine Patterns:")
    cuisine_analysis = {}
    for cuisine, markers in cuisine_markers.items():
        print(f"\n{cuisine.upper()} Cuisine Analysis:")

        # Find matching ingredients
        cuisine_ingredients = []
        for marker in markers:
            matches = [ing for ing in embeddings.keys() if marker in ing.lower()]
            if matches:
                cuisine_ingredients.extend(matches)
                print(f"- Found {len(matches)} matches for '{marker}'")

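        # Drop duplicates so an ingredient matched by several markers is counted once
        cuisine_ingredients = list(dict.fromkeys(cuisine_ingredients))

        # Calculate similarities if we found any matches
        if cuisine_ingredients: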
            # Calculate centroid
            cuisine_vectors = [embeddings[ing] for ing in cuisine_ingredients]
            cuisine_centroid = np.mean(cuisine_vectors, axis=0)

            # Find similar ingredients
            similarities = []
            for ing, emb in embeddings.items():
                if ing not in cuisine_ingredients:  # Exclude markers themselves
                    sim = cosine_similarity([cuisine_centroid], [emb])[0][0]
                    similarities.append((ing, sim))

            top_similar = sorted(similarities, key=lambda x: x[1], reverse=True)[:10]
            print("\nTop related ingredients:")
            for ing, score in top_similar:
                print(f"- {ing}: {score:.3f}")

            cuisine_analysis[cuisine] = {
                'markers_found': cuisine_ingredients,
                'top_similar': top_similar
            }

    return cuisine_analysis

def analyze_cooking_methods(embeddings):
    """
    Analyze cooking method-related patterns
    """
    cooking_methods = {
        'raw': ['fresh', 'raw', 'uncooked'],
        'baked': ['baked', 'roasted', 'grilled'],
        'fried': ['fried', 'sauteed', 'pan'],
        'boiled': ['boiled', 'steamed', 'poached'],
        'preserved': ['pickled', 'fermented', 'cured']
    }

    print("\nAnalyzing Cooking Methods:")
    method_analysis = {}
    for method, keywords in cooking_methods.items():
        print(f"\n{method.upper()} Method Analysis:")

        # Find ingredients associated with cooking method
        method_ingredients = []
        for kw in keywords:
            matches = [ing for ing in embeddings.keys() if kw in ing.lower()]
            if matches:
                method_ingredients.extend(matches)
                print(f"- Found {len(matches)} {kw} ingredients")

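        # Drop duplicates so an ingredient matched by several keywords is counted once
        method_ingredients = list(dict.fromkeys(method_ingredients))
        if method_ingredients: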
            # Calculate method centroid
            method_vectors = [embeddings[ing] for ing in method_ingredients]
            method_centroid = np.mean(method_vectors, axis=0)

            # Find similar ingredients
            similarities = []
            for ing, emb in embeddings.items():
                if ing not in method_ingredients:
                    sim = cosine_similarity([method_centroid], [emb])[0][0]
                    similarities.append((ing, sim))

            top_similar = sorted(similarities, key=lambda x: x[1], reverse=True)[:10]
            print("\nTop associated ingredients:")
            for ing, score in top_similar:
                print(f"- {ing}: {score:.3f}")

            method_analysis[method] = {
                'examples': method_ingredients,
                'top_similar': top_similar
            }

    return method_analysis

def analyze_seasonal_patterns(embeddings):
    """
    Analyze seasonal patterns in ingredients
    """
    seasonal_markers = {
        'spring': ['asparagus', 'pea', 'strawberry', 'rhubarb'],
        'summer': ['tomato', 'corn', 'zucchini', 'watermelon'],
        'fall': ['pumpkin', 'apple', 'sage', 'cranberry'],
        'winter': ['citrus', 'potato', 'kale', 'squash']
    }

    print("\nAnalyzing Seasonal Patterns:")
    seasonal_analysis = {}
    for season, markers in seasonal_markers.items():
        print(f"\n{season.upper()} Season Analysis:")

        season_ingredients = []
        for marker in markers:
            matches = [ing for ing in embeddings.keys() if marker in ing.lower()]
            if matches:
                season_ingredients.extend(matches)
                print(f"- Found {len(matches)} matches for '{marker}'")

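        # Drop duplicates so an ingredient matched by several markers is counted once
        season_ingredients = list(dict.fromkeys(season_ingredients))
        if season_ingredients: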
            # Calculate season centroid
            season_vectors = [embeddings[ing] for ing in season_ingredients]
            season_centroid = np.mean(season_vectors, axis=0)

            # Find similar ingredients
            similarities = []
            for ing, emb in embeddings.items():
                if ing not in season_ingredients:
                    sim = cosine_similarity([season_centroid], [emb])[0][0]
                    similarities.append((ing, sim))

            top_similar = sorted(similarities, key=lambda x: x[1], reverse=True)[:10]
            print("\nTop seasonal associations:")
            for ing, score in top_similar:
                print(f"- {ing}: {score:.3f}")

            seasonal_analysis[season] = {
                'markers': season_ingredients,
                'top_similar': top_similar
            }

    return seasonal_analysis

def analyze_ingredient_network(nodes_df):
    """
    Analyze the ingredient network structure
    """
    print("\nAnalyzing Ingredient Network:")

    # Analyze node types
    node_types = nodes_df['node_type'].value_counts()
    print("\nNode Type Distribution:")
    for node_type, count in node_types.items():
        print(f"- {node_type}: {count}")

    # Analyze hub ingredients
    hub_ingredients = nodes_df[nodes_df['is_hub'] == 'hub']
    print(f"\nFound {len(hub_ingredients)} hub ingredients:")
    for _, row in hub_ingredients.head(10).iterrows():
        print(f"- {row['name']}")

    return node_types, hub_ingredients

# Run the complete analysis
print("Starting comprehensive FlavorGraph analysis...")

# Load data
categories_df, nodes_df, edges_df = load_and_analyze_data()

# Run individual analyses
category_stats = analyze_ingredient_categories(categories_df)
cuisine_patterns = analyze_cuisine_patterns(embeddings)
cooking_methods = analyze_cooking_methods(embeddings)
seasonal_patterns = analyze_seasonal_patterns(embeddings)
network_stats, hub_ingredients = analyze_ingredient_network(nodes_df)

print("\nAnalysis complete! This data can be used for:")
print("1. Recipe recommendations based on cuisine patterns")
print("2. Cooking method suggestions and substitutions")
print("3. Seasonal menu planning")
print("4. Understanding ingredient relationships and hubs")

Starting comprehensive FlavorGraph analysis...
Loading ingredient categories...
Categories loaded: 616 ingredients

Loading nodes...
Nodes loaded: 8298 total nodes
Node types: {'ingredient': 6653, 'compound': 1645}

Loading edges...
Edge columns: ['id_1', 'id_2', 'score', 'edge_type']
Total edges: 147179

Ingredient Categories Distribution:
- Plant/Vegetable: 147 ingredients
- Sauce/Powder/Dressing: 67 ingredients
- Fruit: 57 ingredients
- Cereal/Crop/Bean: 56 ingredients
- Dairy: 47 ingredients
- Bakery/Dessert/Snack: 38 ingredients
- Seafood: 35 ingredients
- Meat/Animal Product: 30 ingredients
- Nut/Seed: 27 ingredients
- Beverage Alcoholic: 26 ingredients
- Spice: 24 ingredients
- Beverage: 18 ingredients
- Dish/End Product: 16 ingredients
- Essential Oil/Fat: 14 ingredients
- Fungus: 6 ingredients
- Flower: 4 ingredients
- ETC: 4 ingredients

Analyzing Cuisine Patterns:

ASIAN Cuisine Analysis:
- Found 8 matches for 'soy_sauce'
- Found 3 matches for 'sesame_oil'
- Found 22 matches