# Advanced Duplicate Detection with Datalab

This tutorial demonstrates the enhanced duplicate detection capabilities in Cleanlab's Datalab, including exact duplicate detection, near-duplicate detection with configurable similarity thresholds, and performance optimization techniques.

**Overview of what we'll do in this tutorial:**

- Understand the difference between exact and near-duplicate detection
- Use similarity thresholds for fine-tuned cosine similarity matching
- Compare performance across different dataset sizes
- Apply duplicate detection to real-world scenarios (text, embeddings)
- Optimize detection for large datasets

The enhanced duplicate detection supports:
- **Exact duplicates**: Identical examples (distance = 0)
- **Near duplicates**: Highly similar examples based on configurable thresholds
- **Cosine similarity thresholds**: Direct similarity control (0-1 range)
- **Multiple distance metrics**: Euclidean, cosine, manhattan, etc.
- **Scalable detection**: Optimized for datasets up to 1M+ rows
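
To make the difference between these metrics concrete, here is a minimal NumPy sketch that compares them on a single pair of vectors. It assumes cosine distance is defined as 1 minus cosine similarity, so a similarity threshold `s` corresponds to a distance cutoff of `1 - s`. The helper names are illustrative and not part of Datalab.

```python
import numpy as np

def euclidean_distance(a, b):
    """Straight-line distance between two vectors."""
    return float(np.linalg.norm(a - b))

def manhattan_distance(a, b):
    """Sum of absolute coordinate differences."""
    return float(np.abs(a - b).sum())

def cosine_similarity_pair(a, b):
    """Cosine of the angle between two non-zero vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])  # same direction as a, twice the length

print("euclidean:", euclidean_distance(a, b))
print("manhattan:", manhattan_distance(a, b))
print("cosine similarity:", cosine_similarity_pair(a, b))
print("cosine distance:", 1.0 - cosine_similarity_pair(a, b))
```

Because `b` is a scaled copy of `a`, cosine similarity treats the two as pointing in the same direction, while Euclidean and Manhattan distance report them as clearly different. This is one reason the choice of metric matters for what counts as a duplicate.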

## 1. Install and import required dependencies

In [None]:
# Package installation (hidden on docs website).
dependencies = ["cleanlab", "matplotlib", "scikit-learn", "numpy", "pandas"]

if "google.colab" in str(get_ipython()):  # Check if it's running in Google Colab
    %pip install cleanlab  # for colab
    cmd = ' '.join([dep for dep in dependencies if dep != "cleanlab"])
    %pip install $cmd
else:
    dependencies_test = [dependency.split('>')[0] if '>' in dependency
                         else dependency.split('<')[0] if '<' in dependency
                         else dependency.split('=')[0] for dependency in dependencies]
    # Some packages are imported under a different name than they are installed under.
    import_names = {"scikit-learn": "sklearn"}
    missing_dependencies = []
    for dependency in dependencies_test:
        try:
            __import__(import_names.get(dependency, dependency))
        except ImportError:
            missing_dependencies.append(dependency)

    if len(missing_dependencies) > 0:
        print("Missing required dependencies:")
        print(*missing_dependencies, sep=", ")
        print("\nPlease install them before running the rest of this notebook.")

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

from cleanlab import Datalab

# Set random seed for reproducibility
np.random.seed(42)

## 2. Basic Duplicate Detection: Exact vs Near Duplicates

Let's start with a simple example to understand the difference between exact and near-duplicate detection.
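
Before using Datalab, it can help to see the distinction in its simplest form. The brute-force sketch below, written with plain NumPy, labels a pair as exact when its Euclidean distance is 0 and as near when the distance is positive but within a tolerance. The function names and the `near_tol` parameter are hypothetical and only serve this illustration; this is not how Datalab computes its results.

```python
import numpy as np

def pairwise_euclidean(X):
    """Return the full (n, n) matrix of Euclidean distances between rows."""
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def classify_pairs(X, near_tol):
    """Split pairs (i < j) into exact (distance 0) and near (0 < distance <= near_tol)."""
    dist = pairwise_euclidean(X)
    exact, near = [], []
    n = len(X)
    for i in range(n):
        for j in range(i + 1, n):
            if dist[i, j] == 0.0:
                exact.append((i, j))
            elif dist[i, j] <= near_tol:
                near.append((i, j))
    return exact, near

X_demo = np.array([
    [0.0, 0.0],
    [5.0, 5.0],
    [0.0, 0.0],    # exact copy of row 0
    [5.0, 5.01],   # row 1 plus a small perturbation
])
exact_pairs, near_pairs = classify_pairs(X_demo, near_tol=0.05)
print("exact:", exact_pairs)
print("near:", near_pairs)
```

The full distance matrix grows quadratically with the number of rows, so this approach is only practical for small examples like this one.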

In [None]:
def create_duplicate_dataset():
    """Create a dataset with exact and near duplicates for demonstration."""
    
    # Create base data
    np.random.seed(42)
    n_samples = 100
    n_features = 5
    
    # Generate random base data
    X_base = np.random.randn(n_samples, n_features)
    y_base = np.random.choice(['A', 'B', 'C'], size=n_samples)
    
    # Add exact duplicates
    exact_duplicate_indices = [10, 25, 40]  # Duplicate these examples
    X_exact_dups = X_base[exact_duplicate_indices].copy()
    y_exact_dups = y_base[exact_duplicate_indices].copy()
    
    # Add near duplicates (very small noise)
    near_duplicate_indices = [15, 30, 45]
    X_near_dups = X_base[near_duplicate_indices].copy()
    X_near_dups += np.random.normal(0, 0.01, X_near_dups.shape)  # Add tiny noise
    y_near_dups = y_base[near_duplicate_indices].copy()
    
    # Combine all data
    X_combined = np.vstack([X_base, X_exact_dups, X_near_dups])
    y_combined = np.hstack([y_base, y_exact_dups, y_near_dups])
    
    return X_combined, y_combined, exact_duplicate_indices, near_duplicate_indices

X, y, exact_indices, near_indices = create_duplicate_dataset()
print(f"Dataset shape: {X.shape}")
print(f"Classes: {np.unique(y)}")
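# Duplicates are appended after the base rows, so copy k lands at index n_base + k.
n_base = len(X) - len(exact_indices) - len(near_indices)
print(f"Exact duplicate pairs: {[(i, n_base + k) for k, i in enumerate(exact_indices)]}")
print(f"Near duplicate pairs: {[(i, n_base + len(exact_indices) + k) for k, i in enumerate(near_indices)]}")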

### 2.1 Detecting Only Exact Duplicates

First, let's detect only exact duplicates using the `exact_duplicates_only=True` parameter.

In [None]:
# Create Datalab instance
data_dict = {"features": X, "labels": y}
lab_exact = Datalab(data_dict, label_name="labels")

# Detect only exact duplicates
lab_exact.find_issues(
    features=X,
    issue_types={
        "near_duplicate": {
            "exact_duplicates_only": True,
            "metric": "euclidean"
        }
    }
)

print("=== EXACT DUPLICATES ONLY ===")
exact_issues = lab_exact.get_issues("near_duplicate")
exact_duplicates = exact_issues[exact_issues["is_near_duplicate_issue"]]
print(f"Found {len(exact_duplicates)} exact duplicate examples")
print("\nExact duplicate examples:")
print(exact_duplicates[["near_duplicate_score"]].head(10))

### 2.2 Detecting Near Duplicates with a Distance Threshold

Now let's detect both exact and near duplicates by setting an explicit distance threshold.

In [None]:
# Create new Datalab instance for near duplicates
lab_near = Datalab(data_dict, label_name="labels")

# Detect near duplicates with an explicit distance threshold
lab_near.find_issues(
    features=X,
    issue_types={
        "near_duplicate": {
            "exact_duplicates_only": False,
            "metric": "euclidean",
            "threshold": 0.1  # Set explicitly rather than relying on the default
        }
    }
)

print("=== NEAR DUPLICATES (THRESHOLD = 0.1) ===")
near_issues = lab_near.get_issues("near_duplicate")
all_duplicates = near_issues[near_issues["is_near_duplicate_issue"]]
print(f"Found {len(all_duplicates)} near duplicate examples")
print("\nNear duplicate examples (lowest scores = most similar):")
print(all_duplicates[["near_duplicate_score"]].sort_values("near_duplicate_score").head(10))
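
One way a distance threshold can be applied is relative to the typical nearest-neighbor distance in the dataset, rather than as an absolute cutoff. Our understanding is that released versions of Datalab interpret `threshold` this way (as a fraction of the median nearest-neighbor distance), but treat that as an assumption and check the documentation for the version you use. The NumPy sketch below shows the idea with hypothetical helper names.

```python
import numpy as np

def nearest_neighbor_distances(X):
    """Euclidean distance from each row to its closest other row (brute force)."""
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    np.fill_diagonal(dist, np.inf)  # ignore each point's distance to itself
    return dist.min(axis=1)

def flag_by_relative_threshold(X, threshold):
    """Flag rows whose nearest-neighbor distance is below threshold * median."""
    nn_dist = nearest_neighbor_distances(X)
    cutoff = threshold * np.median(nn_dist)
    return nn_dist < cutoff, cutoff

rng = np.random.default_rng(0)
X_rel = rng.normal(size=(50, 5))
X_rel = np.vstack([X_rel, X_rel[:2] + 1e-4])  # rows 50 and 51 nearly copy rows 0 and 1

flags, cutoff = flag_by_relative_threshold(X_rel, threshold=0.1)
print("cutoff distance:", round(float(cutoff), 4))
print("flagged rows:", np.flatnonzero(flags))
```

A relative cutoff adapts to the scale of the features: the same `threshold` value behaves similarly whether distances are typically around 0.01 or around 100.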

## 3. Cosine Similarity with Configurable Thresholds

The enhanced duplicate detection supports direct similarity thresholds for cosine similarity, which is particularly useful for text embeddings and high-dimensional data.
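
As a point of reference, here is what a cosine similarity threshold means when applied by hand. This NumPy sketch normalizes each row to unit length, computes all pairwise similarities, and keeps the pairs at or above the threshold. The function names are illustrative only.

```python
import numpy as np

def cosine_similarity_matrix(X):
    """Cosine similarity between all rows of X (rows must be non-zero)."""
    unit = X / np.linalg.norm(X, axis=1, keepdims=True)
    return unit @ unit.T

def pairs_above(sim, similarity_threshold):
    """Index pairs (i < j) whose similarity is at least the threshold."""
    i_idx, j_idx = np.triu_indices_from(sim, k=1)
    keep = sim[i_idx, j_idx] >= similarity_threshold
    return list(zip(i_idx[keep].tolist(), j_idx[keep].tolist()))

X_cos = np.array([
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],   # close in direction to row 0
    [0.0, 1.0, 0.0],   # orthogonal to row 0
    [2.0, 0.0, 0.0],   # same direction as row 0, different length
])
sim = cosine_similarity_matrix(X_cos)
print(pairs_above(sim, similarity_threshold=0.95))
```

Rows 0, 1, and 3 all point in nearly the same direction, so every pair among them passes the 0.95 threshold, while row 2 matches nothing.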

In [None]:
def create_text_embeddings_dataset():
    """Create a text dataset with embeddings for similarity testing."""
    
    # Sample text data with intentional duplicates and near-duplicates
    texts = [
        "The quick brown fox jumps over the lazy dog",
        "A fast brown fox leaps over a sleepy dog",  # Near duplicate
        "The weather is sunny today",
        "Machine learning is fascinating",
        "Deep learning models are powerful",
        "The quick brown fox jumps over the lazy dog",  # Exact duplicate
        "Today the weather is sunny",  # Near duplicate
        "Artificial intelligence is the future",
        "Python is a great programming language",
        "Data science involves statistics and programming",
        "Machine learning is really fascinating",  # Near duplicate
        "The weather is sunny today",  # Exact duplicate
    ]
    
    # Create TF-IDF embeddings
    vectorizer = TfidfVectorizer(max_features=100, stop_words='english')
    embeddings = vectorizer.fit_transform(texts).toarray()
    
    # Create labels
    labels = ['text'] * len(texts)
    
    return embeddings, labels, texts

text_embeddings, text_labels, original_texts = create_text_embeddings_dataset()
print(f"Text embeddings shape: {text_embeddings.shape}")
print(f"Number of texts: {len(original_texts)}")
print("\nSample texts:")
for i, text in enumerate(original_texts[:5]):
    print(f"{i}: {text}")
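
One subtlety in the example texts: a unigram bag-of-words representation ignores word order, so two sentences that contain the same words in a different order map to the same vector. The standard-library sketch below illustrates this with a deliberately simplified tokenizer and a tiny hand-picked stop-word list. Both are assumptions for illustration and do not reproduce the exact behavior of `TfidfVectorizer`.

```python
import math
from collections import Counter

STOP_WORDS = {"the", "is", "a"}  # tiny illustrative list, not scikit-learn's

def bag_of_words(text):
    """Lowercase, split on whitespace, drop stop words, count the rest."""
    return Counter(w for w in text.lower().split() if w not in STOP_WORDS)

def cosine_similarity_counts(c1, c2):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(c1[w] * c2[w] for w in c1)
    norm1 = math.sqrt(sum(v * v for v in c1.values()))
    norm2 = math.sqrt(sum(v * v for v in c2.values()))
    return dot / (norm1 * norm2)

original = bag_of_words("The weather is sunny today")
reordered = bag_of_words("Today the weather is sunny")
shorter = bag_of_words("Machine learning is fascinating")
extended = bag_of_words("Machine learning is really fascinating")

print(original == reordered)
print(round(cosine_similarity_counts(shorter, extended), 3))
```

Since unigram TF-IDF is also a bag-of-words model, a reordered sentence such as text 6 can end up with the same embedding as text 2, which means it may be reported as an exact match in embedding space even though the strings differ.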

### 3.1 High Similarity Threshold (95%)

Let's detect only very similar texts using a 95% cosine similarity threshold.

In [None]:
# Create Datalab for text similarity
text_data = {"embeddings": text_embeddings, "labels": text_labels}
lab_text_high = Datalab(text_data, label_name="labels")

# Detect duplicates with 95% similarity threshold
lab_text_high.find_issues(
    features=text_embeddings,
    issue_types={
        "near_duplicate": {
            "metric": "cosine",
            "similarity_threshold": 0.95  # 95% similarity
        }
    }
)

print("=== TEXT DUPLICATES (95% SIMILARITY) ===")
text_issues_high = lab_text_high.get_issues("near_duplicate")
high_sim_duplicates = text_issues_high[text_issues_high["is_near_duplicate_issue"]]
print(f"Found {len(high_sim_duplicates)} highly similar texts")

if len(high_sim_duplicates) > 0:
    print("\nHighly similar texts:")
    for idx in high_sim_duplicates.index:
        score = high_sim_duplicates.loc[idx, "near_duplicate_score"]
        print(f"Index {idx} (score: {score:.4f}): {original_texts[idx]}")

### 3.2 Medium Similarity Threshold (80%)

Now let's use a more permissive 80% similarity threshold to catch more near-duplicates.

In [None]:
# Create new Datalab for medium similarity
lab_text_medium = Datalab(text_data, label_name="labels")

# Detect duplicates with 80% similarity threshold
lab_text_medium.find_issues(
    features=text_embeddings,
    issue_types={
        "near_duplicate": {
            "metric": "cosine",
            "similarity_threshold": 0.80  # 80% similarity
        }
    }
)

print("=== TEXT DUPLICATES (80% SIMILARITY) ===")
text_issues_medium = lab_text_medium.get_issues("near_duplicate")
medium_sim_duplicates = text_issues_medium[text_issues_medium["is_near_duplicate_issue"]]
print(f"Found {len(medium_sim_duplicates)} similar texts")

if len(medium_sim_duplicates) > 0:
    print("\nSimilar texts:")
    for idx in medium_sim_duplicates.index:
        score = medium_sim_duplicates.loc[idx, "near_duplicate_score"]
        print(f"Index {idx} (score: {score:.4f}): {original_texts[idx]}")

### 3.3 Similarity Threshold Comparison

Let's compare the results across different similarity thresholds.
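
A useful property to keep in mind for this comparison is that lowering the similarity threshold can only keep or increase the number of flagged examples. The NumPy sketch below demonstrates this on synthetic data by counting, for each threshold, the rows whose best match reaches that similarity. The data and the helper `count_flagged` are made up for illustration.

```python
import numpy as np

def count_flagged(X, thresholds):
    """For each threshold, count rows whose best match reaches that similarity."""
    unit = X / np.linalg.norm(X, axis=1, keepdims=True)
    sim = unit @ unit.T
    np.fill_diagonal(sim, -np.inf)   # a row should not match itself
    best = sim.max(axis=1)           # highest similarity to any other row
    return {t: int((best >= t).sum()) for t in thresholds}

rng = np.random.default_rng(1)
base = rng.normal(size=(20, 8))
noisy = base[:5] + rng.normal(scale=0.2, size=(5, 8))  # perturbed copies of 5 rows
X_sweep = np.vstack([base, noisy])

counts = count_flagged(X_sweep, [0.99, 0.95, 0.90, 0.85, 0.80, 0.75])
for t, c in counts.items():
    print(f"{t:.2f}: {c}")
```

If a sweep on your own data ever shows the count dropping as the threshold is lowered, that points to a bug in the comparison rather than a property of the data.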

In [None]:
# Test different similarity thresholds
thresholds = [0.99, 0.95, 0.90, 0.85, 0.80, 0.75]
results = []

for threshold in thresholds:
    lab_temp = Datalab(text_data, label_name="labels")
    lab_temp.find_issues(
        features=text_embeddings,
        issue_types={
            "near_duplicate": {
                "metric": "cosine",
                "similarity_threshold": threshold
            }
        }
    )
    
    issues = lab_temp.get_issues("near_duplicate")
    num_duplicates = len(issues[issues["is_near_duplicate_issue"]])
    results.append((threshold, num_duplicates))

# Create comparison DataFrame
comparison_df = pd.DataFrame(results, columns=["Similarity Threshold", "Duplicates Found"])
print("=== SIMILARITY THRESHOLD COMPARISON ===")
print(comparison_df)

# Plot the results
plt.figure(figsize=(10, 6))
plt.plot(comparison_df["Similarity Threshold"], comparison_df["Duplicates Found"], 
         marker='o', linewidth=2, markersize=6)
plt.xlabel("Cosine Similarity Threshold")
plt.ylabel("Number of Duplicates Found")
plt.title("Duplicate Detection vs Similarity Threshold")
plt.grid(True, alpha=0.3)
plt.xlim(0.74, 1.0)
plt.show()

## 4. Performance Scaling and Optimization

Let's test the performance of duplicate detection across different dataset sizes and explore optimization techniques.
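
Memory is often the first limit you hit. A full pairwise similarity matrix for n rows holds n² values, so at 100,000 rows in float64 it would need about 80 GB. The sketch below shows one common workaround, assuming brute-force cosine similarity: compute the matrix in row blocks and keep only each row's best match. The function name and `chunk_size` parameter are hypothetical, and this is not a description of Datalab's internals.

```python
import numpy as np

def chunked_best_match(X, chunk_size=256):
    """Highest cosine similarity to any other row, computed block by block.

    Only a (chunk_size, n) slice of the similarity matrix is held in memory
    at a time, instead of the full (n, n) matrix.
    """
    unit = X / np.linalg.norm(X, axis=1, keepdims=True)
    n = len(unit)
    best_sim = np.empty(n)
    best_idx = np.empty(n, dtype=int)
    for start in range(0, n, chunk_size):
        stop = min(start + chunk_size, n)
        block = unit[start:stop] @ unit.T        # shape (stop - start, n)
        rows = np.arange(stop - start)
        block[rows, rows + start] = -np.inf      # mask self-similarity
        best_idx[start:stop] = block.argmax(axis=1)
        best_sim[start:stop] = block.max(axis=1)
    return best_sim, best_idx

rng = np.random.default_rng(2)
X_big = rng.normal(size=(1000, 50))
X_big[700] = X_big[3]  # plant one exact duplicate

best_sim, best_idx = chunked_best_match(X_big, chunk_size=128)
print("row 700 best match:", best_idx[700], "similarity:", round(float(best_sim[700]), 6))
```

Chunking bounds memory but the total work is still quadratic in the number of rows, so very large datasets typically also need an approximate nearest-neighbor index.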

In [None]:
import time

def benchmark_duplicate_detection(sizes=(100, 500, 1000, 2000)):
    """Benchmark duplicate detection performance across dataset sizes."""
    
    results = []
    
    for size in sizes:
        print(f"\nTesting dataset size: {size}")
        
        # Create test dataset
        np.random.seed(42)
        X_test = np.random.randn(size, 50)  # 50-dimensional features
        y_test = np.random.choice(['A', 'B', 'C'], size=size)
        
        # Add some duplicates
        n_duplicates = min(10, size // 10)
        duplicate_indices = np.random.choice(size, n_duplicates, replace=False)
        for dup_idx in duplicate_indices:
            # Only indices in the first half have room for a copy at dup_idx + size // 2
            if dup_idx + size // 2 < size:
                X_test[dup_idx + size // 2] = X_test[dup_idx]  # Create exact duplicate
        
        data_test = {"features": X_test, "labels": y_test}
        
        # Benchmark with cosine similarity
        lab_bench = Datalab(data_test, label_name="labels")
        
        start_time = time.perf_counter()  # monotonic clock, better suited to timing than time.time()
        lab_bench.find_issues(
            features=X_test,
            issue_types={
                "near_duplicate": {
                    "metric": "cosine",
                    "similarity_threshold": 0.95,
                    "k": min(10, size - 1)  # Adaptive k
                }
            }
        )
        end_time = time.perf_counter()
        
        execution_time = end_time - start_time
        issues = lab_bench.get_issues("near_duplicate")
        num_found = len(issues[issues["is_near_duplicate_issue"]])
        
        results.append({
            "Dataset Size": size,
            "Execution Time (s)": round(execution_time, 3),
            "Duplicates Found": num_found,
            "Time per Sample (ms)": round(execution_time * 1000 / size, 3)
        })
        
        print(f"  Time: {execution_time:.3f}s, Found: {num_found} duplicates")
    
    return pd.DataFrame(results)

# Run benchmark
benchmark_results = benchmark_duplicate_detection([100, 500, 1000, 2000])
print("\n=== PERFORMANCE BENCHMARK ===")
print(benchmark_results)
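
Single timing runs like the ones above can be noisy, especially for small datasets where startup costs dominate. A simple way to get steadier numbers is to repeat each measurement and report the fastest run. The helper below is a generic sketch using only the standard library. `workload` is a stand-in for whatever call you want to time.

```python
import time

def time_call(fn, repeats=5):
    """Run fn() several times and return (best, all) wall-clock durations in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    return min(durations), durations

def workload():
    # Stand-in for a call such as lab.find_issues(...)
    return sum(i * i for i in range(100_000))

best, all_runs = time_call(workload, repeats=5)
print(f"best of {len(all_runs)} runs: {best * 1000:.2f} ms")
```

When timing `find_issues`, create a fresh `Datalab` instance inside the timed function so that every repeat starts from the same state.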

In [None]:
# Plot performance scaling
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 5))

# Total execution time
ax1.plot(benchmark_results["Dataset Size"], benchmark_results["Execution Time (s)"], 
         marker='o', linewidth=2, markersize=6, color='blue')
ax1.set_xlabel("Dataset Size")
ax1.set_ylabel("Execution Time (seconds)")
ax1.set_title("Total Execution Time vs Dataset Size")
ax1.grid(True, alpha=0.3)

# Time per sample
ax2.plot(benchmark_results["Dataset Size"], benchmark_results["Time per Sample (ms)"], 
         marker='s', linewidth=2, markersize=6, color='red')
ax2.set_xlabel("Dataset Size")
ax2.set_ylabel("Time per Sample (milliseconds)")
ax2.set_title("Time per Sample vs Dataset Size")
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

## 5. Real-World Application: Document Deduplication

Let's apply the enhanced duplicate detection to a practical document deduplication scenario.

In [None]:
def create_document_dataset():
    """Create a realistic document dataset with various types of duplicates."""
    
    documents = [
        # Research papers (with duplicates)
        "Machine learning algorithms for predictive analytics in healthcare systems",
        "Deep neural networks applications in computer vision and image recognition",
        "Natural language processing techniques for sentiment analysis in social media",
        "Machine learning algorithms for predictive analytics in healthcare systems",  # Exact
        "ML algorithms for predictive analytics in healthcare systems",  # Near
        
        # News articles
        "Stock market reaches new highs amid economic recovery signals",
        "Technology companies report strong quarterly earnings and revenue growth",
        "Climate change impacts global agriculture and food security concerns",
        "Stock market hits new peaks as economic recovery shows promising signs",  # Near
        "Technology companies report strong quarterly earnings and revenue growth",  # Exact
        
        # Product descriptions
        "High-performance laptop with 16GB RAM and 512GB SSD storage capacity",
        "Wireless bluetooth headphones with noise cancellation and 20-hour battery life",
        "Professional digital camera with 24MP sensor and 4K video recording",
        "High performance laptop featuring 16GB RAM and 512GB SSD storage",  # Near
        "Professional digital camera with 24MP sensor and 4K video recording",  # Exact
        
        # Unique documents
        "Quantum computing breakthrough promises revolutionary computational capabilities",
        "Renewable energy innovations drive sustainable development initiatives worldwide",
        "Artificial intelligence ethics guidelines for responsible AI development",
        "Space exploration missions reveal new insights about planetary formation",
        "Biotechnology advances enable personalized medicine treatments for patients"
    ]
    
    # Create embeddings using TF-IDF
    vectorizer = TfidfVectorizer(
        max_features=200, 
        stop_words='english',
        ngram_range=(1, 2)  # Include bigrams for better similarity detection
    )
    
    embeddings = vectorizer.fit_transform(documents).toarray()
    
    # Create categories
    categories = (
        ['research'] * 5 + 
        ['news'] * 5 + 
        ['product'] * 5 + 
        ['misc'] * 5
    )
    
    return embeddings, categories, documents

doc_embeddings, doc_categories, documents = create_document_dataset()
print(f"Document dataset: {len(documents)} documents, {doc_embeddings.shape[1]} features")
print(f"Categories: {np.unique(doc_categories)}")

### 5.1 Multi-Threshold Document Analysis

Let's analyze the document dataset using multiple similarity thresholds to understand the duplicate landscape.

In [None]:
def analyze_documents_at_threshold(embeddings, categories, documents, threshold):
    """Analyze document duplicates at a specific similarity threshold."""
    
    doc_data = {"embeddings": embeddings, "categories": categories}
    lab_docs = Datalab(doc_data, label_name="categories")
    
    lab_docs.find_issues(
        features=embeddings,
        issue_types={
            "near_duplicate": {
                "metric": "cosine",
                "similarity_threshold": threshold
            }
        }
    )
    
    issues = lab_docs.get_issues("near_duplicate")
    duplicates = issues[issues["is_near_duplicate_issue"]].sort_values("near_duplicate_score")
    
    print(f"\n=== SIMILARITY THRESHOLD: {threshold:.1%} ===")
    print(f"Found {len(duplicates)} duplicate documents")
    
    if len(duplicates) > 0:
        print("\nDuplicate documents:")
        for idx in duplicates.index:
            score = duplicates.loc[idx, "near_duplicate_score"]
            category = categories[idx]  # use the function argument, not the global
            print(f"\n[{idx}] Score: {score:.4f} | Category: {category}")
            print(f"Text: {documents[idx][:80]}...")
    
    return len(duplicates), duplicates

# Analyze at different thresholds
thresholds_to_test = [0.99, 0.90, 0.80, 0.70]
threshold_results = []

for threshold in thresholds_to_test:
    num_found, _ = analyze_documents_at_threshold(
        doc_embeddings, doc_categories, documents, threshold
    )
    threshold_results.append((threshold, num_found))

### 5.2 Detailed Duplicate Analysis

Let's examine the detected duplicates in detail and calculate pairwise similarities.

In [None]:
# Use a moderate threshold for the detailed analysis
# (0.85 is an illustrative choice, not a tuned value)
optimal_threshold = 0.85
doc_data = {"embeddings": doc_embeddings, "categories": doc_categories}
lab_final = Datalab(doc_data, label_name="categories")

lab_final.find_issues(
    features=doc_embeddings,
    issue_types={
        "near_duplicate": {
            "metric": "cosine",
            "similarity_threshold": optimal_threshold
        }
    }
)

final_issues = lab_final.get_issues("near_duplicate")
final_duplicates = final_issues[final_issues["is_near_duplicate_issue"]].sort_values("near_duplicate_score")

print(f"=== DETAILED DUPLICATE ANALYSIS (Threshold: {optimal_threshold:.1%}) ===")
print(f"Total duplicates found: {len(final_duplicates)}")

# Calculate pairwise similarities for all duplicate pairs
if len(final_duplicates) > 0:
    print("\n=== PAIRWISE SIMILARITY ANALYSIS ===")
    
    # Get duplicate sets information
    duplicate_info = lab_final.get_info("near_duplicate")
    duplicate_sets = duplicate_info.get("near_duplicate_sets", [])
    
    # Find actual duplicate pairs
    duplicate_pairs = []
    for i, dup_set in enumerate(duplicate_sets):
        if len(dup_set) > 0:  # This example has duplicates
            for j in dup_set:
                if i < j:  # Avoid duplicate pairs
                    similarity = cosine_similarity(
                        doc_embeddings[i:i+1], 
                        doc_embeddings[j:j+1]
                    )[0, 0]
                    duplicate_pairs.append((i, j, similarity))
    
    # Display duplicate pairs
    for i, (idx1, idx2, similarity) in enumerate(sorted(duplicate_pairs, key=lambda x: x[2], reverse=True)):
        print(f"\n--- Duplicate Pair {i+1} (Similarity: {similarity:.3f}) ---")
        print(f"[{idx1}] {documents[idx1]}")
        print(f"[{idx2}] {documents[idx2]}")
        print(f"Categories: {doc_categories[idx1]} vs {doc_categories[idx2]}")
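
Once duplicate pairs are known, deduplication usually means keeping one representative per group of connected documents. The sketch below groups pairs with a small union-find structure and keeps the lowest index in each group. The pair list is hypothetical example input, standing in for the `(idx1, idx2, similarity)` tuples collected above, and keeping the lowest index is just one possible policy.

```python
def group_duplicates(n_items, pairs):
    """Group items connected by duplicate pairs using union-find."""
    parent = list(range(n_items))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i, j in pairs:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            # Attach the larger root to the smaller so each root is its group's lowest index
            parent[max(root_i, root_j)] = min(root_i, root_j)

    groups = {}
    for item in range(n_items):
        groups.setdefault(find(item), []).append(item)
    return groups

def indices_to_keep(n_items, pairs):
    """Keep one representative (the lowest index) from every group."""
    return sorted(group_duplicates(n_items, pairs))

# Hypothetical duplicate pairs for a 20-document dataset
pairs = [(0, 3), (3, 4), (6, 9), (12, 14)]
keep = indices_to_keep(20, pairs)
print(keep)
```

Grouping matters because near-duplicate relations chain: if document 0 matches 3 and 3 matches 4, all three belong to one group even if 0 and 4 were never reported as a pair.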

## 6. Best Practices and Recommendations

Based on our experiments, here are key recommendations for using the enhanced duplicate detection:

In [None]:
print("=== DUPLICATE DETECTION BEST PRACTICES ===")
print()
print("1. CHOOSING THE RIGHT METRIC:")
print("   • Cosine similarity: Best for text embeddings, high-dimensional sparse data")
print("   • Euclidean distance: Good for dense numerical features, image embeddings")
print("   • Manhattan distance: Robust to outliers, good for mixed data types")
print()
print("2. SIMILARITY THRESHOLD GUIDELINES:")
print("   • 0.95-0.99: Very strict, catches only near-identical content")
print("   • 0.85-0.95: Moderate, good balance for most applications")
print("   • 0.70-0.85: Permissive, may catch semantically related content")
print("   • <0.70: Very permissive, high false positive rate")
print()
print("3. PERFORMANCE OPTIMIZATION:")
print("   • Use exact_duplicates_only=True for preprocessing steps")
print("   • Reduce k for large datasets (k=5-10 usually sufficient)")
print("   • Consider batch processing for datasets >50k rows")
print("   • Use sparse feature representations when possible")
print()
print("4. VALIDATION STRATEGIES:")
print("   • Test multiple thresholds on a sample of your data")
print("   • Manually review detected pairs to calibrate thresholds")
print("   • Consider domain-specific similarity requirements")
print("   • Monitor false positive/negative rates")
print()
print("5. COMMON USE CASES:")
print("   • Data cleaning: Use high thresholds (0.95+) to remove obvious duplicates")
print("   • Content deduplication: Medium thresholds (0.85-0.95) for similar content")
print("   • Similarity search: Lower thresholds (0.70-0.85) for related items")
print("   • Quality assurance: exact_duplicates_only for data validation")
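
For the exact-duplicate pre-pass recommended above, rows with identical values can also be found without any distance computation. The sketch below uses `np.unique` along the row axis as an independent sanity check. It is not a description of how `exact_duplicates_only` is implemented, and the function name is our own.

```python
import numpy as np

def exact_duplicate_groups(X):
    """Return lists of row indices whose feature rows have identical values."""
    _, inverse, counts = np.unique(
        X, axis=0, return_inverse=True, return_counts=True
    )
    inverse = inverse.ravel()  # guard against NumPy versions that return a 2-D inverse
    groups = []
    for group_id in np.flatnonzero(counts > 1):
        groups.append(np.flatnonzero(inverse == group_id).tolist())
    return groups

X_ex = np.array([
    [1.0, 2.0],
    [3.0, 4.0],
    [1.0, 2.0],
    [5.0, 6.0],
    [3.0, 4.0],
    [1.0, 2.0],
])
print(exact_duplicate_groups(X_ex))
```

This only catches rows that match value for value, so floating-point features that differ in the last digit will not be grouped. Those cases are what the near-duplicate thresholds are for.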

## 7. Advanced Configuration Examples

Here are some advanced configuration examples for specific use cases:

In [None]:
# Example configurations for different scenarios

print("=== ADVANCED CONFIGURATION EXAMPLES ===")
print()

# Configuration 1: Strict data cleaning
print("1. STRICT DATA CLEANING:")
print("   Purpose: Remove obvious duplicates during data preprocessing")
strict_config = {
    "near_duplicate": {
        "exact_duplicates_only": True,
        "metric": "euclidean"
    }
}
print(f"   Config: {strict_config}")
print()

# Configuration 2: Text content deduplication
print("2. TEXT CONTENT DEDUPLICATION:")
print("   Purpose: Find similar articles, documents, or web content")
text_config = {
    "near_duplicate": {
        "metric": "cosine",
        "similarity_threshold": 0.88,
        "k": 15
    }
}
print(f"   Config: {text_config}")
print()

# Configuration 3: Image similarity detection
print("3. IMAGE SIMILARITY DETECTION:")
print("   Purpose: Find similar images using deep learning embeddings")
image_config = {
    "near_duplicate": {
        "metric": "cosine",
        "similarity_threshold": 0.92,
        "k": 10
    }
}
print(f"   Config: {image_config}")
print()

# Configuration 4: Product recommendation similarity
print("4. PRODUCT RECOMMENDATION SIMILARITY:")
print("   Purpose: Find related products for recommendation systems")
product_config = {
    "near_duplicate": {
        "metric": "cosine",
        "similarity_threshold": 0.75,
        "k": 20
    }
}
print(f"   Config: {product_config}")
print()

# Configuration 5: Large dataset optimization
print("5. LARGE DATASET OPTIMIZATION:")
print("   Purpose: Efficient processing for datasets >10k rows")
large_config = {
    "near_duplicate": {
        "metric": "cosine",
        "similarity_threshold": 0.90,
        "k": 5,  # Reduced k for performance
        "threshold": 0.1  # Fallback for non-cosine metrics
    }
}
print(f"   Config: {large_config}")

## 8. Performance Monitoring and Troubleshooting

Here's how to monitor performance and troubleshoot common issues:

In [None]:
def demonstrate_error_handling():
    """Demonstrate common errors and how to handle them."""
    
    print("=== COMMON ERRORS AND SOLUTIONS ===")
    print()
    
    # Error 1: Invalid similarity threshold
    print("1. INVALID SIMILARITY THRESHOLD:")
    try:
        test_data = {"features": np.random.randn(10, 5), "labels": ['A'] * 10}
        lab_error = Datalab(test_data, label_name="labels")
        lab_error.find_issues(
            features=test_data["features"],
            issue_types={
                "near_duplicate": {
                    "metric": "cosine",
                    "similarity_threshold": 1.5  # Invalid: > 1
                }
            }
        )
    except ValueError as e:
        print(f"   Error: {e}")
        print("   Solution: Use similarity_threshold between 0 and 1")
    print()
    
    # Error 2: k too large
    print("2. K TOO LARGE FOR DATASET:")
    try:
        small_data = {"features": np.random.randn(5, 3), "labels": ['A'] * 5}
        lab_error2 = Datalab(small_data, label_name="labels")
        lab_error2.find_issues(
            features=small_data["features"],
            issue_types={
                "near_duplicate": {
                    "k": 10  # Too large for 5 samples
                }
            }
        )
    except ValueError as e:
        print(f"   Error: {str(e)[:80]}...")
        print("   Solution: Use k < dataset_size (recommended: k <= 10)")
    print()
    
    # Error 3: Empty dataset
    print("3. EMPTY DATASET:")
    try:
        empty_data = {"features": np.array([]).reshape(0, 5), "labels": []}
        lab_error3 = Datalab(empty_data, label_name="labels")
        lab_error3.find_issues(
            features=empty_data["features"],
            issue_types={"near_duplicate": {}}
        )
        print("   Handled gracefully - no error raised")
    except Exception as e:
        print(f"   Error: {e}")
    print()
    
    print("=== PERFORMANCE TIPS ===")
    print("• Monitor memory usage for large datasets")
    print("• Use similarity_threshold instead of distance threshold for cosine metric")
    print("• Consider exact_duplicates_only=True for initial data cleaning")
    print("• Reduce k for faster processing on large datasets")
    print("• Profile your code to identify bottlenecks")

demonstrate_error_handling()
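
Instead of relying on exceptions at detection time, you can also check a configuration up front. The helper below is a hypothetical pre-flight check that mirrors the three situations demonstrated above. The function and its messages are our own and are not the validation that Datalab itself performs.

```python
def validate_near_duplicate_params(params, n_samples):
    """Return a list of human-readable problems found in a near-duplicate config.

    Hypothetical helper: the checks mirror the errors discussed above and are
    not the validation that Datalab itself performs.
    """
    problems = []
    sim = params.get("similarity_threshold")
    if sim is not None and not (0.0 <= sim <= 1.0):
        problems.append(f"similarity_threshold={sim} is outside [0, 1]")
    k = params.get("k")
    if k is not None:
        if k < 1:
            problems.append(f"k={k} must be at least 1")
        elif k >= n_samples:
            problems.append(f"k={k} must be smaller than the dataset size ({n_samples})")
    if n_samples == 0:
        problems.append("dataset is empty")
    return problems

print(validate_near_duplicate_params({"metric": "cosine", "similarity_threshold": 1.5}, n_samples=10))
print(validate_near_duplicate_params({"k": 10}, n_samples=5))
print(validate_near_duplicate_params({"metric": "cosine", "similarity_threshold": 0.9, "k": 5}, n_samples=100))
```

Returning a list of problems rather than raising on the first one makes it easier to report every configuration mistake in a single pass.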

## Conclusion

This tutorial demonstrated the enhanced duplicate detection capabilities in Cleanlab's Datalab, including:

- **Exact duplicate detection** for data cleaning and validation
- **Configurable similarity thresholds** for fine-tuned control
- **Performance optimization** techniques for large datasets
- **Real-world applications** like document deduplication
- **Best practices** for different use cases

The enhanced duplicate detection provides powerful tools for:
- **Data quality improvement**: Remove exact duplicates during preprocessing
- **Content deduplication**: Find similar articles, documents, or media
- **Similarity search**: Identify related items for recommendations
- **Quality assurance**: Validate data integrity and consistency

For more advanced usage and customization options, check out the [Datalab documentation](https://docs.cleanlab.ai/stable/cleanlab/datalab/guide/issue_type_description.html#near-duplicate-issue) and explore the various configuration parameters available.

### Next Steps
- Experiment with different similarity thresholds on your own data
- Try different distance metrics based on your data type
- Consider batch processing for very large datasets
- Integrate duplicate detection into your data preprocessing pipeline