# Entity Resolution with Spindle

This notebook demonstrates **semantic entity resolution** – the process of identifying and linking duplicate entities in a knowledge graph using embeddings and LLMs.

## Overview

Entity resolution uses a three-phase approach:

1. **Semantic Blocking**: Cluster similar entities using embeddings to avoid comparing all O(n^2) entity pairs
2. **Semantic Matching**: Use an LLM (Claude) to decide whether entities within a cluster are duplicates, with a confidence score for each decision
3. **Graph Linking**: Create SAME_AS edges to preserve provenance while linking duplicates

This approach is based on techniques from ["The Rise of Semantic Entity Resolution"](https://towardsdatascience.com/the-rise-of-semantic-entity-resolution/).
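
To make the benefit of blocking concrete, the sketch below counts pairwise comparisons with and without it. The blocks are idealized and hand-written to mirror the sample data used later in this notebook; they are an assumption for illustration, not output from Spindle.

```python
from math import comb

# Hypothetical, idealized blocks (hand-written, not produced by Spindle).
demo_blocks = [
    ["TechCorp", "Tech Corp", "TechCorp Inc."],
    ["New York City", "NYC"],
    ["John Smith", "J. Smith"],
]

n_entities = sum(len(block) for block in demo_blocks)

# Without blocking, every entity is compared with every other entity.
all_pairs = comb(n_entities, 2)

# With blocking, only entities that share a block are compared.
blocked_pairs = sum(comb(len(block), 2) for block in demo_blocks)

print(f"Entities:               {n_entities}")
print(f"Pairs without blocking: {all_pairs}")
print(f"Pairs with blocking:    {blocked_pairs}")
```

For these seven entities the count drops from 21 pairs to 5. The gap widens quickly as the graph grows, because the unblocked count is quadratic in the number of entities.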


## Setup


In [2]:
import os
import json
from pathlib import Path

# Spindle imports
from spindle import GraphStore
from spindle.vector_store import ChromaVectorStore, get_default_embedding_function
from spindle.entity_resolution import (
    EntityResolver,
    ResolutionConfig,
    resolve_entities,  # Convenience function
)

print("✅ Imports successful")


✅ Imports successful


## Check API Keys

Entity resolution requires:
- `ANTHROPIC_API_KEY` for LLM-based matching (Claude Sonnet 4)


In [3]:
# Check for required API key
if not os.getenv("ANTHROPIC_API_KEY"):
    print("⚠️  ANTHROPIC_API_KEY not set! Set it with: export ANTHROPIC_API_KEY='your-key-here'")
else:
    print("✅ ANTHROPIC_API_KEY configured")


✅ ANTHROPIC_API_KEY configured


## Initialize Stores

We need:
- **GraphStore**: To store entities and relationships
- **VectorStore**: To compute embeddings for semantic blocking


In [4]:
# Initialize graph store
graph_store = GraphStore("entity_resolution_demo")

# Initialize vector store with default embedding function
vector_store = ChromaVectorStore(
    collection_name="resolution_embeddings",
    embedding_function=get_default_embedding_function(),
)

print("✅ Stores initialized")
print(f"Graph DB path: {graph_store.db_path}")
collection_name = getattr(getattr(vector_store, "collection", None), "name", "<unknown>")
print(f"Vectors collection: {collection_name}")


✅ Stores initialized
Graph DB path: /Users/thalamus/Repos/spindle/graphs/entity_resolution_demo/graph.db
Vectors collection: resolution_embeddings


## Create Sample Data with Duplicates

Let's create a knowledge graph about tech companies with intentional duplicates:
- Spacing variations ("TechCorp" vs "Tech Corp")
- Legal suffixes ("TechCorp" vs "TechCorp Inc.")
- Abbreviations ("NYC" vs "New York City", "J. Smith" vs "John Smith")
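
As a rough illustration of why these duplicates call for semantic matching, the sketch below applies plain string normalization to the same name pairs. The `normalize` helper is hypothetical and is not part of Spindle.

```python
import re


def normalize(name: str) -> str:
    """Lowercase the name and drop everything except letters and digits."""
    return re.sub(r"[^a-z0-9]", "", name.lower())


name_pairs = [
    ("TechCorp", "Tech Corp"),
    ("TechCorp", "TechCorp Inc."),
    ("New York City", "NYC"),
    ("John Smith", "J. Smith"),
]

# Compare each pair after normalization.
normalized_match = {
    (left, right): normalize(left) == normalize(right)
    for left, right in name_pairs
}

for (left, right), same in normalized_match.items():
    print(f"{left!r} vs {right!r}: {'match' if same else 'no match'}")
```

Only the spacing variation survives normalization. The legal suffix and both abbreviations still look like different strings, which is the gap that embeddings and LLM matching are meant to close.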


In [5]:
sample_triples = [
    # Company entities (with duplicates)
    {
        "subject": "TechCorp",
        "predicate": "has_type",
        "object": "Company",
        "subject_type": "Organization",
        "object_type": "Type",
        "subject_description": "A leading technology company.",
    },
    {
        "subject": "Tech Corp",  # DUPLICATE: Space variation
        "predicate": "has_type",
        "object": "Company",
        "subject_type": "Organization",
        "object_type": "Type",
        "subject_description": "A technology corporation.",
    },
    {
        "subject": "TechCorp Inc.",  # DUPLICATE: Legal suffix
        "predicate": "has_type",
        "object": "Company",
        "subject_type": "Organization",
        "object_type": "Type",
        "subject_description": "Technology company incorporated in Delaware.",
    },
    # Location entities (with duplicates)
    {
        "subject": "New York City",
        "predicate": "has_type",
        "object": "Location",
        "subject_type": "City",
        "object_type": "Type",
        "subject_description": "The most populous city in the United States.",
    },
    {
        "subject": "NYC",  # DUPLICATE: Abbreviation
        "predicate": "has_type",
        "object": "Location",
        "subject_type": "City",
        "object_type": "Type",
        "subject_description": "Major city in New York state.",
    },
    # People entities (with duplicates)
    {
        "subject": "John Smith",
        "predicate": "has_type",
        "object": "Person",
        "subject_type": "Person",
        "object_type": "Type",
        "subject_description": "CEO of TechCorp",
    },
    {
        "subject": "J. Smith",  # DUPLICATE: Abbreviated first name
        "predicate": "has_type",
        "object": "Person",
        "subject_type": "Person",
        "object_type": "Type",
        "subject_description": "Chief Executive Officer at Tech Corp",
    },
    # Relationships with duplicate entities
    {
        "subject": "TechCorp",
        "predicate": "located_in",
        "object": "New York City",
        "subject_type": "Organization",
        "object_type": "City",
    },
    {
        "subject": "Tech Corp",  # Same relationship with duplicate entities
        "predicate": "located_in",
        "object": "NYC",
        "subject_type": "Organization",
        "object_type": "City",
    },
    {
        "subject": "John Smith",
        "predicate": "works_at",
        "object": "TechCorp",
        "subject_type": "Person",
        "object_type": "Organization",
    },
    {
        "subject": "J. Smith",  # Same relationship with duplicate entities
        "predicate": "employed_by",  # Similar predicate
        "object": "Tech Corp",
        "subject_type": "Person",
        "object_type": "Organization",
    },
]


In [6]:
# Helper to load sample data using current GraphStore API
def load_sample_triples(store, triples, source_name="sample_data"):
    added_nodes = 0
    added_edges = 0

    for triple in triples:
        subject = triple["subject"]
        subject_type = triple.get("subject_type", "Unknown")
        subject_description = triple.get("subject_description", "")
        subject_metadata = {"sources": [source_name]}

        if store.add_node(
            name=subject,
            entity_type=subject_type,
            description=subject_description,
            metadata=subject_metadata,
        ):
            added_nodes += 1

        obj = triple["object"]
        object_type = triple.get("object_type", "Unknown")
        object_description = triple.get("object_description", "")
        object_metadata = {"sources": [source_name]}

        if store.add_node(
            name=obj,
            entity_type=object_type,
            description=object_description,
            metadata=object_metadata,
        ):
            added_nodes += 1

        metadata = {"sources": [source_name]}
        edge_result = store.add_edge(subject, triple["predicate"], obj, metadata=metadata)
        if edge_result.get("success"):
            added_edges += 1

    return added_nodes, added_edges

# Add triples to graph store
nodes_added, edges_added = load_sample_triples(graph_store, sample_triples)

print(f"✅ Added sample data (nodes: {nodes_added}, edges: {edges_added})")
print("\nExpected duplicates")
print(" - TechCorp, Tech Corp, TechCorp Inc.")
print(" - New York City, NYC")
print(" - John Smith, J. Smith")


✅ Added sample data (nodes: 10, edges: 11)

Expected duplicates
 - TechCorp, Tech Corp, TechCorp Inc.
 - New York City, NYC
 - John Smith, J. Smith


In [7]:
# Get all nodes
nodes = graph_store.nodes()
print(f"Total nodes: {len(nodes)}")
print("\nNode names:")
for node in nodes:
    print(f" - {node['name']} ({node.get('type', 'unknown')})")


Total nodes: 10

Node names:
 - TECHCORP (Organization)
 - COMPANY (Type)
 - TECH CORP (Organization)
 - TECHCORP INC. (Organization)
 - NEW YORK CITY (City)
 - LOCATION (Type)
 - NYC (City)
 - JOHN SMITH (Person)
 - PERSON (Type)
 - J. SMITH (Person)


In [8]:
# Get all edges
edges = graph_store.edges()
print(f"Total edges: {len(edges)}")
print("\nEdges:")
for edge in edges:
    print(f" - {edge['subject']} --[{edge['predicate']}]--> {edge['object']}")


Total edges: 11

Edges:
 - TECHCORP --[HAS_TYPE]--> COMPANY
 - TECHCORP --[LOCATED_IN]--> NEW YORK CITY
 - TECH CORP --[HAS_TYPE]--> COMPANY
 - TECH CORP --[LOCATED_IN]--> NYC
 - TECHCORP INC. --[HAS_TYPE]--> COMPANY
 - NEW YORK CITY --[HAS_TYPE]--> LOCATION
 - NYC --[HAS_TYPE]--> LOCATION
 - JOHN SMITH --[HAS_TYPE]--> PERSON
 - JOHN SMITH --[WORKS_AT]--> TECHCORP
 - J. SMITH --[HAS_TYPE]--> PERSON
 - J. SMITH --[EMPLOYED_BY]--> TECH CORP


## Configure Entity Resolution

### Understanding Configuration Parameters

- **blocking_threshold** (0-1): Cosine similarity threshold for clustering
  - Higher = stricter clusters, fewer comparisons
  - Lower = looser clusters, more comparisons
  - Default: 0.85
- **matching_threshold** (0-1): LLM confidence threshold for duplicates
  - 0.95 = high confidence only
  - 0.75 = medium+ confidence
  - 0.50 = low+ confidence
  - Default: 0.8
- **clustering_method**: Algorithm for blocking
  - `hierarchical`: Agglomerative clustering (default)
  - `kmeans`: k-means for faster processing
  - `hdbscan`: Density-based for variable cluster sizes
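
The effect of `blocking_threshold` can be sketched with toy vectors. The code below is a simplified greedy blocker, not the agglomerative clustering Spindle uses, and the 2-D "embeddings" are invented for illustration; real embeddings have hundreds of dimensions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def block_by_threshold(vectors, threshold):
    """Greedy blocking: join the first block whose seed is similar enough."""
    blocks = []  # each block is a list of names; the first member is the seed
    for name, vec in vectors.items():
        for block in blocks:
            if cosine(vec, vectors[block[0]]) >= threshold:
                block.append(name)
                break
        else:
            # No existing block was close enough, so start a new one.
            blocks.append([name])
    return blocks


# Toy 2-D vectors, invented for this example only.
toy_vectors = {
    "TECHCORP": (1.0, 0.0),
    "TECH CORP": (0.95, 0.31),
    "NYC": (0.0, 1.0),
    "NEW YORK CITY": (0.6, 0.8),
}

strict_blocks = block_by_threshold(toy_vectors, threshold=0.85)
loose_blocks = block_by_threshold(toy_vectors, threshold=0.75)

print("threshold=0.85:", strict_blocks)
print("threshold=0.75:", loose_blocks)
```

With these vectors, "NYC" and "NEW YORK CITY" have a cosine similarity of 0.8, so they share a block at 0.75 but are split at 0.85. A pair that never shares a block is never sent to the LLM, which is why an overly strict blocking threshold can hide true duplicates.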


In [9]:
# Create resolution configuration
config = ResolutionConfig(
    blocking_threshold=0.85,  # Similarity threshold for clustering
    matching_threshold=0.75,  # Accept medium+ confidence matches
    clustering_method="hierarchical",
    batch_size=20,  # Entities per LLM call
    min_cluster_size=2,  # Minimum entities to consider
    max_cluster_size=50,  # Maximum entities per cluster
)

print("✅ Configuration created")
print(f"Blocking threshold: {config.blocking_threshold}")
print(f"Matching threshold: {config.matching_threshold}")
print(f"Clustering method: {config.clustering_method}")


✅ Configuration created
Blocking threshold: 0.85
Matching threshold: 0.75
Clustering method: hierarchical


## Run Entity Resolution

The resolution pipeline:
1. **Serialize** entities to text representations
2. **Embed** the serialized text using the vector store
3. **Cluster** entities using semantic blocking
4. **Match** entities within each cluster using an LLM
5. **Link** duplicates with SAME_AS edges
6. **Find connected components** to identify duplicate clusters
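
The last step can be sketched with a small union-find structure. This is a minimal illustration of grouping pairwise matches into clusters, not Spindle's implementation, and the `demo_pairs` list is a hypothetical input shaped like the sample data.

```python
def connected_components(pairs):
    """Group items linked by pairwise matches into sorted clusters."""
    parent = {}

    def find(x):
        # Register unseen items as their own root, then walk up to the root.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two components

    groups = {}
    for node in list(parent):
        groups.setdefault(find(node), set()).add(node)
    return sorted(sorted(group) for group in groups.values())


# Hypothetical pairwise matches, shaped like the sample data.
demo_pairs = [
    ("TECHCORP", "TECH CORP"),
    ("TECHCORP", "TECHCORP INC."),
    ("TECH CORP", "TECHCORP INC."),
    ("JOHN SMITH", "J. SMITH"),
]

demo_components = connected_components(demo_pairs)
for component in demo_components:
    print(component)
```

Four pairwise matches collapse into two clusters. The third TechCorp pair is redundant for clustering purposes, since the first two already connect all three names.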


In [10]:
# Create resolver
resolver = EntityResolver(config=config)

# Provide domain context for better matching
context = """
This is a knowledge graph about technology companies and people.
Common variations:
- Corp, Corporation, Inc., LLC are equivalent suffixes
- Abbreviations like NYC = New York City are common
- First name abbreviations (J. = John) are common
"""

# Run resolution
print("Running entity resolution...")
result = resolver.resolve_entities(
    graph_store=graph_store,
    vector_store=vector_store,
    apply_to_nodes=True,
    apply_to_edges=True,
    context=context,
)

print("\n Resolution complete!")


Running entity resolution...
2025-11-10T21:59:08.031 [BAML INFO] Function MatchEntities:
    Client: CustomSonnet4 (claude-sonnet-4-20250514) - 9991ms. StopReason: end_turn. Tokens(in/out): 621/496
    ---PROMPT---
    user: You are an expert in entity resolution for knowledge graphs. Your task is to identify duplicate entities within a group of semantically similar entities.**Context about this knowledge graph:**
    
    This is a knowledge graph about technology companies and people.
    Common variations:
    - Corp, Corporation, Inc., LLC are equivalent suffixes
    - Abbreviations like NYC = New York City are common
    - First name abbreviations (J. = John) are common
    
    
    **Entities to analyze:**
    ---
    ID: TECHCORP
    Type: Organization
    Description: A leading technology company.
    Attributes: {}
    ---
    ID: TECH CORP
    Type: Organization
    Description: A technology corporation.
    Attributes: {}

## Analyze Results


In [11]:
print("=" * 60)
print("RESOLUTION SUMMARY")
print("=" * 60)
print(f"Nodes processed: {result.total_nodes_processed}")
print(f"Edges processed: {result.total_edges_processed}")
print(f"Blocks created: {result.blocks_created}")
print(f"SAME_AS edges created: {result.same_as_edges_created}")
print(f"Duplicate clusters: {result.duplicate_clusters}")
print(f"Execution time: {result.execution_time_seconds:.2f}s")
print("\n" + "=" * 60)


RESOLUTION SUMMARY
Nodes processed: 10
Edges processed: 19
Blocks created: 8
SAME_AS edges created: 14
Duplicate clusters: 2
Execution time: 87.67s



## Node Matches

Examine which nodes were identified as duplicates:


In [12]:
print(f"\nNode matches found: {len(result.node_matches)}\n")
for i, match in enumerate(result.node_matches, 1):
    if match.is_duplicate:
        print(f"Match {i}:")
        print(f"  Entity 1: {match.entity1_id}")
        print(f"  Entity 2: {match.entity2_id}")
        print(f"  Confidence: {match.confidence:.2f}")
        print(f"  Reasoning: {match.reasoning}")
        print()



Node matches found: 4

Match 1:
  Entity 1: TECHCORP
  Entity 2: TECH CORP
  Confidence: 0.95
  Reasoning: These entities have nearly identical names with only a space difference ('TECHCORP' vs 'TECH CORP'). Both are Organizations with very similar descriptions about being technology companies. This is a clear case of name variation for the same entity.

Match 2:
  Entity 1: TECHCORP
  Entity 2: TECHCORP INC.
  Confidence: 0.95
  Reasoning: These entities represent the same company with and without the corporate suffix. 'TECHCORP' and 'TECHCORP INC.' refer to the same entity, as 'Inc.' is just the legal incorporation designation. Both are Organizations describing the same technology company, with the second providing additional detail about Delaware incorporation.

Match 3:
  Entity 1: TECH CORP
  Entity 2: TECHCORP INC.
  Confidence: 0.95
  Reasoning: Following transitivity from the previous matches, these represent the same entity. 'TECH CORP' (spaced version) and 'TECHCORP INC.' (w

## Edge Matches

Examine which edges were identified as duplicates:


In [13]:
print(f"\nEdge matches found: {len(result.edge_matches)}\n")
for i, match in enumerate(result.edge_matches, 1):
    if match.is_duplicate:
        print(f"Match {i}:")
        print(f"  Edge 1: {match.edge1_id}")
        print(f"  Edge 2: {match.edge2_id}")
        print(f"  Confidence: {match.confidence:.2f}")
        print(f"  Reasoning: {match.reasoning}")
        print()



Edge matches found: 10

Match 1:
  Edge 1: TECHCORP|HAS_TYPE|COMPANY
  Edge 2: TECH CORP|HAS_TYPE|COMPANY
  Confidence: 0.95
  Reasoning: These edges have identical predicates (HAS_TYPE) and objects (COMPANY). The subjects 'TECHCORP' and 'TECH CORP' represent the same entity - one is simply the spaced version of the other company name. This is a common variation in company naming conventions.

Match 2:
  Edge 1: TECHCORP|HAS_TYPE|COMPANY
  Edge 2: TECHCORP INC.|HAS_TYPE|COMPANY
  Confidence: 0.95
  Reasoning: These edges have identical predicates (HAS_TYPE) and objects (COMPANY). The subjects 'TECHCORP' and 'TECHCORP INC.' represent the same entity, as 'Inc.' is a common corporate suffix that doesn't change the fundamental identity of the company. According to the context, suffixes like Inc. are considered equivalent variations.

Match 3:
  Edge 1: TECH CORP|HAS_TYPE|COMPANY
  Edge 2: TECHCORP INC.|HAS_TYPE|COMPANY
  Confidence: 0.95
  Reasoning: These edges have identical predicates 

## Duplicate Clusters

Groups of entities connected via SAME_AS relationships:


In [14]:
from spindle.entity_resolution import get_duplicate_clusters

clusters = get_duplicate_clusters(graph_store)

print(f"\nDuplicate clusters found: {len(clusters)}\n")
for i, cluster in enumerate(clusters, 1):
    print(f"Cluster {i}: {sorted(cluster)}")



Duplicate clusters found: 2

Cluster 1: ['TECH CORP', 'TECHCORP', 'TECHCORP INC.']
Cluster 2: ['J. SMITH', 'JOHN SMITH']


## Query Resolved Graph

### View SAME_AS Edges


In [15]:
# Query all SAME_AS edges
same_as_edges = graph_store.query_by_pattern(predicate="SAME_AS")

print(f"\nSAME_AS edges in graph: {len(same_as_edges)}\n")
for edge in same_as_edges:
    print(f"{edge['subject']} --[SAME_AS]--> {edge['object']}")

    # Show confidence and reasoning from metadata
    metadata = edge.get("metadata", {})
    if isinstance(metadata, str):
        metadata = json.loads(metadata)

    confidence = metadata.get("confidence", "N/A")
    if confidence == "N/A":
        print("  Confidence: N/A")
    else:
        print(f"  Confidence: {confidence:.2f}")

    reasoning = metadata.get("reasoning")
    if reasoning:
        print(f"  Reasoning: {reasoning}")
    print()



SAME_AS edges in graph: 8

TECHCORP --[SAME_AS]--> TECHCORP INC.
  Confidence: 0.95

TECHCORP --[SAME_AS]--> TECH CORP
  Confidence: 0.95

TECH CORP --[SAME_AS]--> TECHCORP INC.
  Confidence: 0.95

TECH CORP --[SAME_AS]--> TECHCORP
  Confidence: 0.95

TECHCORP INC. --[SAME_AS]--> TECH CORP
  Confidence: 0.95

TECHCORP INC. --[SAME_AS]--> TECHCORP
  Confidence: 0.95

JOHN SMITH --[SAME_AS]--> J. SMITH
  Confidence: 0.95

J. SMITH --[SAME_AS]--> JOHN SMITH
  Confidence: 0.95
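
The listing above contains every link in both directions. If only the distinct links matter, the directed edges can be collapsed into unordered pairs. The sketch below copies the eight edges from the output by hand rather than querying the graph store.

```python
# The eight directed SAME_AS edges printed above, copied by hand.
directed_same_as = [
    ("TECHCORP", "TECHCORP INC."),
    ("TECHCORP", "TECH CORP"),
    ("TECH CORP", "TECHCORP INC."),
    ("TECH CORP", "TECHCORP"),
    ("TECHCORP INC.", "TECH CORP"),
    ("TECHCORP INC.", "TECHCORP"),
    ("JOHN SMITH", "J. SMITH"),
    ("J. SMITH", "JOHN SMITH"),
]

# Sorting each pair makes (a, b) and (b, a) identical, so the set removes
# the reverse-direction copies.
undirected_links = sorted({tuple(sorted(pair)) for pair in directed_same_as})

print(f"Directed edges:   {len(directed_same_as)}")
print(f"Undirected links: {len(undirected_links)}")
for left, right in undirected_links:
    print(f"  {left} <-> {right}")
```

The eight directed edges reduce to four distinct links: three among the TechCorp variants and one between the two Smith names.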



### Get Canonical Entity Names

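Pick a canonical ("primary") name for each duplicate cluster. This example simply takes the lexicographically smallest name, which is a placeholder rule rather than a judgment about which name is best: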


In [16]:
print("\nCanonical entity mappings\n")

for cluster in clusters:
    canonical = min(cluster)  # Alphabetically first
    print(f"Canonical: {canonical}")
    for entity in sorted(cluster):
        if entity != canonical:
            print(f"  - {entity}")
    print()



Canonical entity mappings

Canonical: TECH CORP
  - TECHCORP
  - TECHCORP INC.

Canonical: J. SMITH
  - JOHN SMITH
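
One possible use of the canonical mapping is to rewrite edges so that duplicates collapse. The sketch below works on plain tuples copied from the edge listing and cluster output earlier in this notebook; it does not modify the graph store, and `canonicalize` is a hypothetical helper.

```python
# Clusters and edges copied by hand from the outputs shown earlier.
demo_clusters = [
    {"TECH CORP", "TECHCORP", "TECHCORP INC."},
    {"J. SMITH", "JOHN SMITH"},
]

demo_edges = [
    ("TECHCORP", "HAS_TYPE", "COMPANY"),
    ("TECHCORP", "LOCATED_IN", "NEW YORK CITY"),
    ("TECH CORP", "HAS_TYPE", "COMPANY"),
    ("TECH CORP", "LOCATED_IN", "NYC"),
    ("TECHCORP INC.", "HAS_TYPE", "COMPANY"),
    ("NEW YORK CITY", "HAS_TYPE", "LOCATION"),
    ("NYC", "HAS_TYPE", "LOCATION"),
    ("JOHN SMITH", "HAS_TYPE", "PERSON"),
    ("JOHN SMITH", "WORKS_AT", "TECHCORP"),
    ("J. SMITH", "HAS_TYPE", "PERSON"),
    ("J. SMITH", "EMPLOYED_BY", "TECH CORP"),
]

# Map every cluster member to the lexicographically smallest name.
alias_to_canonical = {
    name: min(cluster) for cluster in demo_clusters for name in cluster
}


def canonicalize(name):
    """Return the canonical name, or the name itself if it is unclustered."""
    return alias_to_canonical.get(name, name)


# A set drops edges that become identical after rewriting.
rewritten_edges = {
    (canonicalize(s), p, canonicalize(o)) for s, p, o in demo_edges
}

print(f"Edges before: {len(demo_edges)}")
print(f"Edges after:  {len(rewritten_edges)}")
```

The eleven original edges reduce to eight. The `WORKS_AT` and `EMPLOYED_BY` edges stay separate because their predicates differ, and the two location edges stay separate because "NYC" and "NEW YORK CITY" were not placed in a duplicate cluster in this run.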



## Export Results

Save resolution results for later analysis:


In [17]:
# Convert result to dictionary
result_dict = result.to_dict()

# Pretty print
print("\nResolution results as JSON:\n")
print(json.dumps(result_dict, indent=2))



Resolution results as JSON:

{
  "total_nodes_processed": 10,
  "total_edges_processed": 19,
  "blocks_created": 8,
  "same_as_edges_created": 14,
  "duplicate_clusters": 2,
  "node_match_count": 4,
  "edge_match_count": 10,
  "execution_time_seconds": 87.674432,
  "config": {
    "blocking_threshold": 0.85,
    "matching_threshold": 0.75,
    "clustering_method": "hierarchical"
  }
}


In [18]:
# Optional: Save to file
output_path = Path("resolution_results.json")

with open(output_path, "w") as f:
    json.dump(result_dict, f, indent=2)

print(f"✅ Results saved to {output_path}")


✅ Results saved to resolution_results.json


## Experiment: Different Configurations

### Conservative Configuration (High Precision)

Only match entities with very high confidence:


In [19]:
# Create a fresh graph for comparison
graph_store_conservative = GraphStore("resolution_conservative")
load_sample_triples(
    graph_store_conservative,
    sample_triples,
    source_name="sample_data_conservative",
)

# Conservative config
conservative_config = ResolutionConfig(
    blocking_threshold=0.90,  # Higher threshold
    matching_threshold=0.85,  # Only high confidence
    clustering_method="hierarchical",
    min_cluster_size=3,  # Require at least 3 entities
)

# Run resolution
resolver_conservative = EntityResolver(config=conservative_config)
result_conservative = resolver_conservative.resolve_entities(
    graph_store=graph_store_conservative,
    vector_store=vector_store,
    apply_to_nodes=True,
    apply_to_edges=False,
    context=context,
)

print("\nConservative configuration results:")
print(f"  SAME_AS edges: {result_conservative.same_as_edges_created}")
print(f"  Clusters: {result_conservative.duplicate_clusters}")
print(f"  Node matches: {len(result_conservative.node_matches)}")



Conservative configuration results:
  SAME_AS edges: 0
  Clusters: 0
  Node matches: 0


### Aggressive Configuration (High Recall)

Match more entities with lower confidence thresholds:


In [20]:
# Create a fresh graph for comparison
graph_store_aggressive = GraphStore("resolution_aggressive")
load_sample_triples(
    graph_store_aggressive,
    sample_triples,
    source_name="sample_data_aggressive",
)

# Aggressive config
aggressive_config = ResolutionConfig(
    blocking_threshold=0.75,  # Lower threshold
    matching_threshold=0.65,  # Accept lower confidence
    clustering_method="hierarchical",
    min_cluster_size=2,
)

# Run resolution
resolver_aggressive = EntityResolver(config=aggressive_config)
result_aggressive = resolver_aggressive.resolve_entities(
    graph_store=graph_store_aggressive,
    vector_store=vector_store,
    apply_to_nodes=True,
    apply_to_edges=False,
    context=context,
)

print("\nAggressive configuration results:")
print(f"  SAME_AS edges: {result_aggressive.same_as_edges_created}")
print(f"  Clusters: {result_aggressive.duplicate_clusters}")
print(f"  Node matches: {len(result_aggressive.node_matches)}")


2025-11-10T22:04:06.455 [BAML INFO] Function MatchEntities:
    Client: CustomSonnet4 (claude-sonnet-4-20250514) - 12888ms. StopReason: end_turn. Tokens(in/out): 621/577
    ---PROMPT---
    user: You are an expert in entity resolution for knowledge graphs. Your task is to identify duplicate entities within a group of semantically similar entities.**Context about this knowledge graph:**
    
    This is a knowledge graph about technology companies and people.
    Common variations:
    - Corp, Corporation, Inc., LLC are equivalent suffixes
    - Abbreviations like NYC = New York City are common
    - First name abbreviations (J. = John) are common
    
    
    **Entities to analyze:**
    ---
    ID: TECHCORP
    Type: Organization
    Description: A leading technology company.
    Attributes: {}
    ---
    ID: TECH CORP
    Type: Organization
    Description: A technology corporation.
    Attributes: {}
    ---
    ID: TECHCORP IN

### Compare Configurations


In [21]:
import pandas as pd

comparison = pd.DataFrame([
    {
        "Config": "Default",
        "Blocking Threshold": 0.85,
        "Matching Threshold": 0.75,
        "SAME_AS Edges": result.same_as_edges_created,
        "Clusters": result.duplicate_clusters,
        "Matches": len(result.node_matches),
    },
    {
        "Config": "Conservative",
        "Blocking Threshold": 0.90,
        "Matching Threshold": 0.85,
        "SAME_AS Edges": result_conservative.same_as_edges_created,
        "Clusters": result_conservative.duplicate_clusters,
        "Matches": len(result_conservative.node_matches),
    },
    {
        "Config": "Aggressive",
        "Blocking Threshold": 0.75,
        "Matching Threshold": 0.65,
        "SAME_AS Edges": result_aggressive.same_as_edges_created,
        "Clusters": result_aggressive.duplicate_clusters,
        "Matches": len(result_aggressive.node_matches),
    },
])

print("\nConfiguration comparison\n")
print(comparison.to_string(index=False))



Configuration comparison

      Config  Blocking Threshold  Matching Threshold  SAME_AS Edges  Clusters  Matches
     Default                0.85                0.75             14         2        4
Conservative                0.90                0.85              0         0        0
  Aggressive                0.75                0.65              4         2        4


## Convenience Function

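You can also use the `resolve_entities` convenience function, which runs the same pipeline with the default configuration and without creating an `EntityResolver` yourself: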


In [22]:
# Create a fresh graph
graph_store_simple = GraphStore("resolution_simple")
load_sample_triples(
    graph_store_simple,
    sample_triples,
    source_name="sample_data_default",
)

# Use convenience function (uses default config)
result_simple = resolve_entities(
    graph_store=graph_store_simple,
    vector_store=vector_store,
    apply_to_nodes=True,
    apply_to_edges=True,
    context=context,
)

print("\nConvenience function results:")
print(f"  SAME_AS edges: {result_simple.same_as_edges_created}")
print(f"  Clusters: {result_simple.duplicate_clusters}")


2025-11-10T22:05:32.368 [BAML INFO] Function MatchEntities:
    Client: CustomSonnet4 (claude-sonnet-4-20250514) - 9367ms. StopReason: end_turn. Tokens(in/out): 621/464
    ---PROMPT---
    user: You are an expert in entity resolution for knowledge graphs. Your task is to identify duplicate entities within a group of semantically similar entities.**Context about this knowledge graph:**
    
    This is a knowledge graph about technology companies and people.
    Common variations:
    - Corp, Corporation, Inc., LLC are equivalent suffixes
    - Abbreviations like NYC = New York City are common
    - First name abbreviations (J. = John) are common
    
    
    **Entities to analyze:**
    ---
    ID: TECHCORP
    Type: Organization
    Description: A leading technology company.
    Attributes: {}
    ---
    ID: TECH CORP
    Type: Organization
    Description: A technology corporation.
    Attributes: {}
    ---
    ID: TECHCORP INC

## Best Practices & Troubleshooting

### Configuration Guidelines

**Conservative (high precision)**
- `blocking_threshold=0.90`
- `matching_threshold=0.85`
- `min_cluster_size=3`
- Use when false positives are costly

**Aggressive (high recall)**
- `blocking_threshold=0.75`
- `matching_threshold=0.65`
- `min_cluster_size=2`
- Use when missing duplicates is costly

**Balanced (default)**
- Uses `ResolutionConfig()` defaults
- Good starting point for most use cases
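
These guidelines could be kept as named presets. The sketch below stores them as plain dictionaries so that it runs without Spindle installed; `PRESETS` and `preset_kwargs` are hypothetical names, and passing the result as `ResolutionConfig(**kwargs)` assumes the constructor accepts these keyword arguments as it does elsewhere in this notebook.

```python
# Hypothetical presets mirroring the guidelines above.
PRESETS = {
    "conservative": {
        "blocking_threshold": 0.90,
        "matching_threshold": 0.85,
        "min_cluster_size": 3,
    },
    "aggressive": {
        "blocking_threshold": 0.75,
        "matching_threshold": 0.65,
        "min_cluster_size": 2,
    },
    # Empty on purpose: fall back to the library defaults.
    "balanced": {},
}


def preset_kwargs(name, **overrides):
    """Return keyword arguments for a preset, with optional overrides."""
    if name not in PRESETS:
        raise ValueError(f"Unknown preset {name!r}; choose from {sorted(PRESETS)}")
    return {**PRESETS[name], **overrides}


kwargs = preset_kwargs("conservative", matching_threshold=0.9)
print(kwargs)
# In the notebook this could then be used as: ResolutionConfig(**kwargs)
```

Keeping the presets as data makes it easy to log which settings produced a given result and to override a single threshold while experimenting.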

### Common Issues

**No matches found**
- Lower `blocking_threshold` to 0.75–0.80
- Lower `matching_threshold` to 0.70
- Check that embeddings are being computed for the entities
- Verify that the data actually contains duplicates

**Too many false positives**
- Raise `matching_threshold` to 0.85–0.90
- Raise `blocking_threshold` to 0.90+
- Add more domain information to `context`
- Check the quality of entity descriptions

**LLM errors**
- Verify `ANTHROPIC_API_KEY` is set
- Reduce `batch_size` to avoid rate limits
- Check for malformed entity data
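
A quick pre-flight check can catch malformed input before any LLM call is made. The sketch below assumes triples shaped like `sample_triples` in this notebook; `find_malformed` is a hypothetical helper and is not part of Spindle.

```python
REQUIRED_KEYS = ("subject", "predicate", "object")


def find_malformed(triples):
    """Return (index, key) for every required field that is missing or blank."""
    problems = []
    for index, triple in enumerate(triples):
        for key in REQUIRED_KEYS:
            value = triple.get(key)
            # A usable field must be a non-empty string.
            if not isinstance(value, str) or not value.strip():
                problems.append((index, key))
    return problems


# Hypothetical input: the second and third triples are deliberately broken.
demo_triples = [
    {"subject": "TechCorp", "predicate": "located_in", "object": "New York City"},
    {"subject": "Tech Corp", "predicate": "", "object": "NYC"},
    {"subject": "J. Smith", "predicate": "employed_by"},
]

malformed = find_malformed(demo_triples)
for index, key in malformed:
    print(f"Triple {index}: missing or empty {key!r}")
```

Running a check like this before loading data into the graph store keeps empty names and predicates out of the serialized entity text sent to the model.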


## Next Steps

- **Larger datasets**: Try resolution on a knowledge graph with 100+ entities
- **Custom clustering**: Experiment with HDBSCAN for variable cluster sizes
- **Edge resolution**: Focus on duplicate edge detection
- **Integrations**: Use resolved graphs in downstream applications
- **Monitoring**: Track resolution quality metrics over time


## References

- [Entity Resolution Documentation](../../docs/ENTITY_RESOLUTION.md)
- [Demo Script](../../demos/example_entity_resolution.py)
- [Tests](../../tests/test_entity_resolution.py)
- [The Rise of Semantic Entity Resolution](https://towardsdatascience.com/the-rise-of-semantic-entity-resolution/)
