To run this Fenic demo, click **Runtime** > **Run all**.

<div class="align-center">
<a href="https://github.com/typedef-ai/fenic"><img src="https://github.com/typedef-ai/fenic/blob/main/docs/images/typedef-fenic-logo-github-yellow.png?raw=true" height="50"></a>
<a href="https://discord.gg/GdqF3J7huR"><img src="https://github.com/typedef-ai/fenic/blob/main/docs/images/join-the-discord.png?raw=true" height="50"></a>
<a href="https://docs.fenic.ai/latest/"><img src="https://github.com/typedef-ai/fenic/blob/main/docs/images/documentation.png?raw=true" height="50"></a>

Questions? Join the Discord and ask away! For feature requests or to leave a star, visit our [GitHub](https://github.com/typedef-ai/fenic).

</div>

In [None]:
!pip uninstall -y sklearn-compat ibis-framework imbalanced-learn google-genai
!pip install polars==1.30.0
# === GOOGLE GEMINI ===
#!pip install fenic[google]
# === ANTHROPIC CLAUDE ===
#!pip install fenic[anthropic]
# === OPENAI (Default) ===
!pip install fenic

In [None]:
import os
import getpass

# 🔌 MULTI-PROVIDER SETUP - Choose your preferred LLM provider
# Uncomment ONE of the provider sections below, and switch the matching
# install line and model entry in the other cells to the same provider:

# === OPENAI (Default) ===
os.environ["OPENAI_API_KEY"] = getpass.getpass("OpenAI API Key:")

# === GOOGLE GEMINI ===
# os.environ["GOOGLE_API_KEY"] = getpass.getpass("Google API Key:")

# === ANTHROPIC CLAUDE ===
# os.environ["ANTHROPIC_API_KEY"] = getpass.getpass("Anthropic API Key:")

# 🔤 Fuzzy String Matching + AI Verification

**Hook:** *"Find duplicates despite typos, then verify uncertain matches with AI"*

Real-world data is messy: names have typos, nicknames, and variations. Fenic's built-in `fc.text` fuzzy matching functions flag potential duplicates, and an LLM then reviews the uncertain ones using context such as email domains. Watch algorithms and intelligence work together.

**What you'll see in this 2-minute demo:**
- 🔍 **Built-in fuzzy functions** - `fc.text.compute_fuzzy_ratio()` for similarity scoring
- 📊 **Multiple algorithms** - Levenshtein and token-sort ratios, averaged into one composite score
- 🧠 **AI verification** - Use `semantic.map` to verify uncertain matches
- ⚖️ **Hybrid precision** - Combine string algorithms + contextual AI reasoning

This showcases Fenic's comprehensive text processing capabilities with AI enhancement.
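
To make the similarity scores concrete, here is a minimal pure-Python sketch of a Levenshtein-based ratio and a token-sort variant. The helper names (`levenshtein_distance`, `similarity_ratio`, `token_sort_ratio`) and the normalization `100 * (1 - distance / longer_length)` are assumptions chosen for illustration; Fenic's built-in functions may normalize differently, so treat the numbers as indicative only.

```python
def levenshtein_distance(a: str, b: str) -> int:
    """Minimum number of single-character edits turning a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def similarity_ratio(a: str, b: str) -> float:
    """Score from 0 to 100 (assumed normalization, not necessarily Fenic's)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 100.0  # two empty strings are identical
    return 100.0 * (1 - levenshtein_distance(a, b) / longest)


def token_sort_ratio(a: str, b: str) -> float:
    """Sort whitespace-separated tokens first, so word order is ignored."""
    sorted_a = " ".join(sorted(a.split()))
    sorted_b = " ".join(sorted(b.split()))
    return similarity_ratio(sorted_a, sorted_b)


print(round(similarity_ratio("Sarah Johnson", "Sarah Jonson"), 1))  # 92.3
print(token_sort_ratio("Rodriguez Michael", "Michael Rodriguez"))   # 100.0
```

Under this normalization a single dropped letter barely moves the score, while a nickname such as "Mike" for "Michael" costs four edits, which is why nickname pairs tend to land in the uncertain band.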

In [None]:
import fenic as fc

# ⚡ Configure for fuzzy matching operations
session = fc.Session.get_or_create(fc.SessionConfig(
    app_name="fuzzy_matching_demo",
    semantic=fc.SemanticConfig(
        language_models={
            "matcher": fc.OpenAILanguageModel(model_name="gpt-4o-mini", rpm=500, tpm=200_000),
            # "matcher": fc.GoogleDeveloperLanguageModel(model_name="gemini-2.5-flash-lite", rpm=1000, tpm=1_000_000),
            # "matcher": fc.AnthropicLanguageModel(model_name="claude-3-5-sonnet-20241022", rpm=500, tpm=200_000)
        }
    )
))

print("✅ Fuzzy string matching session configured")

## 📊 Step 1: Messy Customer Data

Sample customer records with typos, nicknames, and variations that create duplicates:

In [None]:
# 📊 Customer records with name variations and duplicates
customers = session.create_dataframe([
    {"id": "C001", "name": "Sarah Johnson", "email": "sarah.johnson@company.com"},
    {"id": "C002", "name": "Sarah Jonson", "email": "s.johnson@company.com"},     # Typo!
    {"id": "C003", "name": "Michael Rodriguez", "email": "mrodriguez@firm.co"},
    {"id": "C004", "name": "Mike Rodriguez", "email": "m.rodriguez@firm.co"},    # Nickname!
    {"id": "C005", "name": "David Smith", "email": "david@techcorp.com"},
    {"id": "C006", "name": "Dave Smith", "email": "dave@othercorp.com"}          # Different company!
])

print("📊 Customer Database - Notice the duplicates and variations:")
customers.show()

## 🔍 Step 2: Multiple Fuzzy Algorithms

Apply Fenic's built-in fuzzy matching functions to find potential duplicates:

In [None]:
# 🔍 Hand-picked name pairs for fuzzy comparison (not every combination)
# Covers a typo (Johnson vs Jonson), nicknames (Mike vs Michael, Dave vs David), and two unrelated pairs
customer_pairs = session.create_dataframe([
    {"name_1": "Sarah Johnson", "name_2": "Sarah Jonson", "email_1": "sarah.johnson@company.com", "email_2": "s.johnson@company.com"},
    {"name_1": "Michael Rodriguez", "name_2": "Mike Rodriguez", "email_1": "mrodriguez@firm.co", "email_2": "m.rodriguez@firm.co"},
    {"name_1": "David Smith", "name_2": "Dave Smith", "email_1": "david@techcorp.com", "email_2": "dave@othercorp.com"},
    {"name_1": "Sarah Johnson", "name_2": "Michael Rodriguez", "email_1": "sarah.johnson@company.com", "email_2": "mrodriguez@firm.co"},
    {"name_1": "David Smith", "name_2": "Sarah Jonson", "email_1": "david@techcorp.com", "email_2": "s.johnson@company.com"},
])

# Score each pair with two fuzzy matching algorithms
pairs = customer_pairs.select(
    "name_1", "name_2", "email_1", "email_2",

    # Per-algorithm similarity scores
    fc.text.compute_fuzzy_ratio(fc.col("name_1"), fc.col("name_2"), method="levenshtein").alias("levenshtein"),
    fc.text.compute_fuzzy_token_sort_ratio(fc.col("name_1"), fc.col("name_2")).alias("token_sort"),

    # Composite score: mean of the two ratios
    ((fc.text.compute_fuzzy_ratio(fc.col("name_1"), fc.col("name_2"), method="levenshtein") +
     fc.text.compute_fuzzy_token_sort_ratio(fc.col("name_1"), fc.col("name_2"))) / 2.0
    ).alias("avg_score")
).filter(fc.col("avg_score") >= 70).order_by(fc.desc("avg_score")).cache()  # Keep 70+ pairs; cache for subsequent AI verification

print("🔍 FUZZY MATCHING RESULTS (70%+ similarity):")
pairs.select("name_1", "name_2", "levenshtein", "token_sort", "avg_score").show()
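
The pairs in this step are written out by hand. For a larger table, one way to build every candidate pair is `itertools.combinations`. The sketch below works on plain Python dicts that mirror the customer records from Step 1; `build_pairs` is a hypothetical helper, not part of Fenic.

```python
from itertools import combinations

# Same names and emails as the Step 1 customer records
customers = [
    {"name": "Sarah Johnson", "email": "sarah.johnson@company.com"},
    {"name": "Sarah Jonson", "email": "s.johnson@company.com"},
    {"name": "Michael Rodriguez", "email": "mrodriguez@firm.co"},
    {"name": "Mike Rodriguez", "email": "m.rodriguez@firm.co"},
    {"name": "David Smith", "email": "david@techcorp.com"},
    {"name": "Dave Smith", "email": "dave@othercorp.com"},
]


def build_pairs(records):
    """Return one row per unordered pair of records (hypothetical helper)."""
    rows = []
    for left, right in combinations(records, 2):
        rows.append({
            "name_1": left["name"], "name_2": right["name"],
            "email_1": left["email"], "email_2": right["email"],
        })
    return rows


all_pairs = build_pairs(customers)
print(len(all_pairs))  # 6 records give 15 unordered pairs
```

The rows use the same column names as `customer_pairs`, so the list could be passed to `session.create_dataframe` in place of the hand-written rows. The pair count grows quadratically (n records give n*(n-1)/2 pairs), so larger datasets usually need some pre-filtering before scoring.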

## 🧠 Step 3: AI Verification for Uncertain Matches

Use AI to verify uncertain matches by considering email domains and context:

In [None]:
# 🧠 AI verification for uncertain matches (70-89% similarity)
uncertain = pairs.filter((fc.col("avg_score") >= 70) & (fc.col("avg_score") < 90))

ai_verified = uncertain.select(
    "name_1", "name_2", "email_1", "email_2", "avg_score",
    fc.semantic.map(
        """Are these the same person?

        Person 1: "{{name_1}}" | Email: {{email_1}}
        Person 2: "{{name_2}}" | Email: {{email_2}}

        Consider: name variations, email domains, company context.
        Respond: SAME_PERSON or DIFFERENT_PEOPLE with brief reason.""",
        name_1=fc.col("name_1"), name_2=fc.col("name_2"), 
        email_1=fc.col("email_1"), email_2=fc.col("email_2"),
        model_alias="matcher"
    ).alias("ai_analysis")
).with_column(
    "verdict", 
    fc.when(fc.col("ai_analysis").contains("SAME_PERSON"), fc.lit("DUPLICATE")).otherwise(fc.lit("DIFFERENT"))
).cache()  # Cache AI verification results

print("🧠 AI VERIFICATION RESULTS:")
ai_verified.select("name_1", "name_2", "avg_score", "verdict", "ai_analysis").show()

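# Pairs scoring 90+ are accepted on string similarity alone; only the 70-89 band was sent to the model
high_confidence = pairs.filter(fc.col("avg_score") >= 90).count()
ai_confirmed = ai_verified.filter(fc.col("verdict") == "DUPLICATE").count()
print(f"\n✅ HYBRID RESULTS: {high_confidence} high-confidence matches + {ai_confirmed} AI-verified duplicates")
print("   🔤 String similarity flags typos such as 'Johnson' vs 'Jonson'")
print("   🧠 The model weighs context such as email domains")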
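
The verdict in this step comes from a substring check on free-form model output, which can misfire if a response mentions both labels. A stricter option is to look only at the leading label. The sketch below is a plain-Python illustration: `parse_verdict` is a hypothetical helper, and the sample responses are invented for the example rather than real model output.

```python
def parse_verdict(response: str) -> str:
    """Map a model response to a verdict using only its leading label."""
    # Drop surrounding whitespace and common leading decoration (quotes, markdown)
    text = response.strip().lstrip("*`\"' ").upper()
    if text.startswith("SAME_PERSON"):
        return "DUPLICATE"
    if text.startswith("DIFFERENT_PEOPLE"):
        return "DIFFERENT"
    return "UNPARSED"  # surface unexpected formats instead of guessing


# Invented sample responses, for illustration only
samples = [
    "SAME_PERSON - nickname and matching email domain",
    "DIFFERENT_PEOPLE: not the SAME_PERSON, the email domains differ",
    "I think these could match",
]
for sample in samples:
    print(parse_verdict(sample))
```

The second sample contains the text `SAME_PERSON`, so a plain substring check would label it a duplicate, while the leading-label check returns `DIFFERENT`. Keeping an explicit `UNPARSED` bucket makes it easy to spot responses that ignored the requested format.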

In [None]:
session.stop()