# Extrapolation Detection with MLflow Faithfulness Metrics

This notebook demonstrates how to use MLflow faithfulness metrics to detect **extrapolation** in LLM outputs. Extrapolation occurs when the model goes beyond the information provided in the context, making inferences, generalizations, or predictions that aren't directly supported.

## What is Extrapolation?

Extrapolation is when an LLM:
- **Makes predictions** beyond what the data supports (e.g., "This trend will continue...")
- **Generalizes** from specific examples (e.g., "All X are Y" from one instance)
- **Infers causation** from correlation (e.g., "A causes B" when only association is mentioned)
- **Adds unsupported conclusions** (e.g., "Therefore, this means...")
- **Speculates** about intentions or future outcomes

### Why Extrapolation Matters:
| Impact | Example |
|--------|---------|
| **Overconfidence** | User trusts speculation as fact |
| **Poor Decisions** | Acting on unsupported predictions |
| **Liability** | Legal/medical advice beyond evidence |
| **Misinformation** | Presenting inference as established fact |

### Faithfulness Score for Extrapolation:
| Score Range | Interpretation | Risk Level |
|-------------|----------------|------------|
| >= 0.75 | Grounded Response | Safe |
| 0.55 to < 0.75 | Minor Extrapolation | Low |
| 0.35 to < 0.55 | Moderate Extrapolation | Medium |
| < 0.35 | Severe Extrapolation | High |
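
The thresholds in the table can be expressed as a small lookup. The following is a minimal sketch, not part of the detector class defined later: the name `interpret_score` and the `SCORE_BANDS` list are hypothetical, and each band is assumed to be inclusive at its lower bound.

```python
from typing import Tuple

# (lower bound, interpretation, risk level), checked from highest to lowest
SCORE_BANDS = [
    (0.75, "Grounded Response", "Safe"),
    (0.55, "Minor Extrapolation", "Low"),
    (0.35, "Moderate Extrapolation", "Medium"),
]


def interpret_score(score: float) -> Tuple[str, str]:
    """Map a faithfulness score in [0, 1] to (interpretation, risk level)."""
    for lower_bound, interpretation, risk in SCORE_BANDS:
        if score >= lower_bound:
            return interpretation, risk
    # Anything below the lowest band is treated as severe
    return "Severe Extrapolation", "High"


print(interpret_score(0.62))  # ('Minor Extrapolation', 'Low')
```

Keeping the bands in one list makes it easier to tighten the thresholds for higher-stakes applications without touching the scoring logic.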


## 1. Setup and Installation


In [None]:
%pip install -q mlflow sentence-transformers pandas numpy scikit-learn


In [None]:
%restart_python


In [None]:
import numpy as np
import pandas as pd
import re
from typing import List, Dict
import warnings
warnings.filterwarnings('ignore')

from sentence_transformers import SentenceTransformer, CrossEncoder
from sklearn.metrics.pairwise import cosine_similarity

print("✅ Libraries loaded successfully!")


## 2. Extrapolation Detector

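We combine three faithfulness metrics (semantic similarity, NLI entailment, and token overlap) and also check for extrapolation indicators: words and phrases that suggest the model is going beyond the source material. Each indicator found lowers the combined score by a small, capped penalty.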
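
How the indicator check feeds into the final score can be sketched without loading any models. The example below is a minimal illustration, assuming whole-word matching, a penalty of 0.05 per indicator capped at 0.30, and a base score weighted 0.45 NLI + 0.35 semantic + 0.20 overlap (the constants used by the detector in this notebook). The helper names `find_indicators` and `final_score`, the shortened indicator list, and the three input scores are hypothetical.

```python
import re
from typing import Dict, List

# Shortened, illustrative indicator list (hypothetical subset)
INDICATORS: Dict[str, List[str]] = {
    "prediction": ["will", "likely to", "expected to"],
    "generalization": ["always", "all", "everyone"],
    "inference": ["therefore", "this means"],
}


def find_indicators(answer: str) -> Dict[str, List[str]]:
    """Return indicators that occur as whole words or phrases in the answer."""
    text = answer.lower()
    found = {}
    for category, phrases in INDICATORS.items():
        # \b anchors keep 'all' from matching inside 'small'
        hits = [p for p in phrases if re.search(rf"\b{re.escape(p)}\b", text)]
        if hits:
            found[category] = hits
    return found


def final_score(nli: float, semantic: float, overlap: float, n_indicators: int) -> float:
    """Weighted base score minus a capped per-indicator penalty."""
    base = 0.45 * nli + 0.35 * semantic + 0.20 * overlap
    penalty = min(0.30, 0.05 * n_indicators)
    return max(0.0, base - penalty)


answer = "This trend will likely continue, and all small firms are expected to grow."
found = find_indicators(answer)
count = sum(len(hits) for hits in found.values())
print(found)
print(round(final_score(nli=0.40, semantic=0.80, overlap=0.50, n_indicators=count), 3))
```

With these made-up inputs the base score is 0.56 and three indicators subtract 0.15, which moves the answer from the "minor" band into the "moderate" band. Whole-word matching matters here: a plain substring test would also count "all" inside "small".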


In [None]:
class ExtrapolationDetector:
    """
    Detects extrapolation where the model goes beyond the provided context,
    making unsupported inferences, predictions, or generalizations.
    """
    
    # Words/phrases that indicate extrapolation
    EXTRAPOLATION_INDICATORS = {
        'prediction': ['will', 'would', 'going to', 'likely to', 'expected to', 'probably will'],
        'speculation': ['might', 'could', 'may', 'possibly', 'perhaps', 'potentially'],
        'generalization': ['always', 'never', 'all', 'every', 'none', 'everyone', 'no one'],
        'inference': ['therefore', 'thus', 'hence', 'consequently', 'as a result', 'this means'],
        'causation': ['because of this', 'caused by', 'leads to', 'results in', 'due to this'],
        'assumption': ['obviously', 'clearly', 'certainly', 'definitely', 'must be', 'has to be'],
        'future': ['in the future', 'eventually', 'soon', 'over time', 'ultimately']
    }

    def __init__(self, 
                 embedding_model: str = "all-MiniLM-L6-v2",
                 nli_model: str = "cross-encoder/nli-deberta-v3-small"):
        """Initialize with embedding and NLI models."""
        print(f"Loading embedding model: {embedding_model}...")
        self.embedding_model = SentenceTransformer(embedding_model)
        
        print(f"Loading NLI model: {nli_model}...")
        self.nli_model = CrossEncoder(nli_model)
        
        print("✅ Extrapolation detector ready!")

    def compute_semantic_faithfulness(self, answer: str, context: str) -> float:
        """Compute faithfulness using semantic similarity."""
        answer_embedding = self.embedding_model.encode([answer])[0]
        context_embedding = self.embedding_model.encode([context])[0]
        similarity = cosine_similarity([answer_embedding], [context_embedding])[0][0]
        return max(0, min(1, (similarity + 1) / 2))

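    def compute_nli_faithfulness(self, answer: str, context: str) -> float:
        """Compute faithfulness as the NLI entailment probability."""
        # For multi-class models predict() returns raw logits, not probabilities
        scores = np.asarray(self.nli_model.predict([(context, answer)])[0], dtype=float)
        if scores.ndim == 0:
            return float(1 / (1 + np.exp(-scores)))  # single-score model: squash to (0, 1)
        probs = np.exp(scores - scores.max())
        probs /= probs.sum()  # softmax over the class logits
        # Label order for cross-encoder/nli-deberta-v3-small per its model card:
        # 0 = contradiction, 1 = entailment, 2 = neutral
        return float(probs[1])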

    def compute_token_overlap(self, answer: str, context: str) -> float:
        """Compute simple token overlap faithfulness."""
        stop_words = {'the', 'a', 'an', 'is', 'are', 'was', 'were', 'be', 'been', 
                      'to', 'of', 'in', 'for', 'on', 'with', 'at', 'by', 'from',
                      'and', 'but', 'or', 'it', 'its', 'this', 'that', 'who'}

        def tokenize(text: str) -> set:
            # Strip leading/trailing punctuation so "2024," matches "2024"
            stripped = (re.sub(r"^\W+|\W+$", "", tok) for tok in text.lower().split())
            return {tok for tok in stripped if tok} - stop_words

        answer_tokens = tokenize(answer)
        context_tokens = tokenize(context)

        if not answer_tokens:
            return 1.0

        overlap = answer_tokens.intersection(context_tokens)
        return len(overlap) / len(answer_tokens)

    def detect_extrapolation_indicators(self, answer: str) -> Dict:
        """Detect linguistic indicators of extrapolation in the answer."""
        answer_lower = answer.lower()
        found_indicators = {}
        total_count = 0
        
        for category, indicators in self.EXTRAPOLATION_INDICATORS.items():
            # Match whole words/phrases only, so 'all' does not fire on 'small'
            matches = [ind for ind in indicators
                       if re.search(rf"\b{re.escape(ind)}\b", answer_lower)]
            if matches:
                found_indicators[category] = matches
                total_count += len(matches)
        
        # Calculate penalty based on number of indicators found
        indicator_penalty = min(0.3, total_count * 0.05)  # Max 30% penalty
        
        return {
            "indicators_found": found_indicators,
            "indicator_count": total_count,
            "indicator_penalty": indicator_penalty
        }

    def detect_extrapolation(self, answer: str, context: str) -> Dict:
        """
        Detect if an answer contains extrapolation beyond the context.
        
        Returns:
            Dictionary with scores and extrapolation verdict
        """
        semantic = self.compute_semantic_faithfulness(answer, context)
        nli = self.compute_nli_faithfulness(answer, context)
        overlap = self.compute_token_overlap(answer, context)
        
        # Detect extrapolation indicators
        indicator_analysis = self.detect_extrapolation_indicators(answer)
        
        # Base combined score
        base_score = 0.45 * nli + 0.35 * semantic + 0.20 * overlap
        
        # Apply penalty for extrapolation indicators
        combined = max(0, base_score - indicator_analysis["indicator_penalty"])
        
        # Determine extrapolation category (risk labels follow the score table)
        if combined >= 0.75:
            category = "✅ Grounded Response"
            is_extrapolation = False
            risk = "Safe"
        elif combined >= 0.55:
            category = "⚠️ Minor Extrapolation"
            is_extrapolation = True
            risk = "Low"
        elif combined >= 0.35:
            category = "🔶 Moderate Extrapolation"
            is_extrapolation = True
            risk = "Medium"
        else:
            category = "🚫 Severe Extrapolation"
            is_extrapolation = True
            risk = "High"
        
        return {
            "semantic_score": semantic,
            "nli_score": nli,
            "overlap_score": overlap,
            "base_score": base_score,
            "indicator_penalty": indicator_analysis["indicator_penalty"],
            "combined_score": combined,
            "indicators_found": indicator_analysis["indicators_found"],
            "indicator_count": indicator_analysis["indicator_count"],
            "category": category,
            "is_extrapolation": is_extrapolation,
            "risk_level": risk
        }

# Initialize the detector
detector = ExtrapolationDetector()


## 3. Test Examples: Detecting Extrapolation

We'll test several categories of extrapolation:
1. **Future Predictions** - Predicting outcomes not stated in the context
2. **Generalizations** - Extending specific facts to general claims
3. **Causal Inference** - Inferring causation from correlation
4. **Speculation** - Adding uncertain claims not in the source
5. **Unsupported Conclusions** - Drawing conclusions beyond the evidence


In [None]:
# Define test examples with context, grounded answer, and extrapolated answer
test_examples = [
    {
        "name": "Example 1: Future Prediction",
        "category": "Prediction",
        "context": "Company XYZ reported a 15% increase in revenue for Q3 2024 compared to Q3 2023. The company attributed this growth to expansion into Asian markets.",
        "grounded_answer": "Company XYZ's revenue grew 15% in Q3 2024 year-over-year, driven by Asian market expansion.",
        "extrapolated_answer": "Company XYZ's revenue grew 15% in Q3 2024. This trend will likely continue, and the company is expected to double its revenue within 3 years."
    },
    {
        "name": "Example 2: Generalization",
        "category": "Generalization",
        "context": "A study of 50 participants found that those who exercised for 30 minutes daily showed improved mood scores after 8 weeks.",
        "grounded_answer": "A study of 50 participants showed that daily 30-minute exercise improved mood scores over 8 weeks.",
        "extrapolated_answer": "Exercise always improves mood in everyone. All people who exercise for 30 minutes daily will definitely experience better mental health."
    },
    {
        "name": "Example 3: Causal Inference",
        "category": "Causation",
        "context": "Research shows that countries with higher chocolate consumption per capita also have more Nobel Prize winners per capita.",
        "grounded_answer": "There is a correlation between chocolate consumption and Nobel Prize winners across countries.",
        "extrapolated_answer": "Eating chocolate causes improved cognitive function, which leads to more Nobel Prize winners. Therefore, eating more chocolate will make you smarter."
    },
    {
        "name": "Example 4: Speculation",
        "category": "Speculation",
        "context": "The CEO resigned from the company effective immediately. No reason was given in the official press release.",
        "grounded_answer": "The CEO resigned immediately. The company did not provide a reason in their press release.",
        "extrapolated_answer": "The CEO resigned, possibly due to internal conflicts or financial irregularities. This could indicate serious problems within the company that may affect stock prices."
    },
    {
        "name": "Example 5: Unsupported Conclusion",
        "category": "Conclusion",
        "context": "Product A costs $50 and has 4.2 stars on reviews. Product B costs $75 and has 4.5 stars on reviews.",
        "grounded_answer": "Product A is cheaper ($50, 4.2 stars) while Product B is more expensive ($75) with slightly higher ratings (4.5 stars).",
        "extrapolated_answer": "Product B is clearly the better choice because higher price means better quality. You should definitely buy Product B as it will last longer and provide more value."
    },
    {
        "name": "Example 6: Medical Extrapolation",
        "category": "Medical",
        "context": "A clinical trial showed that Drug X reduced blood pressure by an average of 10 mmHg in patients with hypertension over 12 weeks.",
        "grounded_answer": "Clinical trials showed Drug X reduced blood pressure by an average of 10 mmHg in hypertension patients over 12 weeks.",
        "extrapolated_answer": "Drug X will cure hypertension permanently. All patients should switch to this medication as it will eventually eliminate the need for other treatments."
    },
    {
        "name": "Example 7: Economic Extrapolation",
        "category": "Economic",
        "context": "Unemployment fell from 5.2% to 4.8% last quarter. The technology sector added the most jobs during this period.",
        "grounded_answer": "Unemployment dropped from 5.2% to 4.8% last quarter, with the technology sector leading job growth.",
        "extrapolated_answer": "The economy is booming and will continue to improve. Unemployment will inevitably reach historic lows, and everyone should invest in tech stocks immediately."
    }
]

print(f"📋 Loaded {len(test_examples)} extrapolation test examples across {len(set(e['category'] for e in test_examples))} categories")


### 3.1 Running Extrapolation Detection


In [None]:
def print_detection_result(name: str, category: str, context: str, answer: str, result: Dict, answer_type: str):
    """Pretty print the extrapolation detection result."""
    print(f"\n{'='*80}")
    print(f"📝 {name}")
    print(f"   Category: {category} | Type: {answer_type}")
    print(f"{'='*80}")
    print(f"\n📄 Context: {context[:100]}...")
    print(f"\n💬 Answer: {answer}")
    print(f"\n📊 SCORES:")
    print(f"   • Semantic Similarity: {result['semantic_score']:.3f}")
    print(f"   • NLI Entailment:      {result['nli_score']:.3f}")
    print(f"   • Token Overlap:       {result['overlap_score']:.3f}")
    print(f"   • Base Score:          {result['base_score']:.3f}")
    print(f"   • Indicator Penalty:   -{result['indicator_penalty']:.3f}")
    print(f"   • Final Score:         {result['combined_score']:.3f}")
    
    if result['indicators_found']:
        print(f"\n🚨 EXTRAPOLATION INDICATORS FOUND:")
        for cat, indicators in result['indicators_found'].items():
            print(f"   • {cat}: {', '.join(indicators)}")
    
    print(f"\n🎯 VERDICT: {result['category']}")
    print(f"   • Is Extrapolation: {'YES ⚠️' if result['is_extrapolation'] else 'NO ✓'}")
    print(f"   • Risk Level: {result['risk_level']}")

# Run detection on all examples
print("🔍 EXTRAPOLATION DETECTION RESULTS")
print("="*80)

all_results = []

for example in test_examples:
    # Test grounded answer
    grounded_result = detector.detect_extrapolation(
        example["grounded_answer"], 
        example["context"]
    )
    print_detection_result(
        example["name"], 
        example["category"],
        example["context"], 
        example["grounded_answer"], 
        grounded_result, 
        "GROUNDED ANSWER"
    )
    all_results.append({
        "example": example["name"],
        "category": example["category"],
        "type": "Grounded",
        "combined_score": grounded_result["combined_score"],
        "indicator_count": grounded_result["indicator_count"],
        "is_extrapolation": grounded_result["is_extrapolation"],
        "verdict": grounded_result["category"]
    })
    
    # Test extrapolated answer
    extrapolated_result = detector.detect_extrapolation(
        example["extrapolated_answer"], 
        example["context"]
    )
    print_detection_result(
        example["name"], 
        example["category"],
        example["context"], 
        example["extrapolated_answer"], 
        extrapolated_result, 
        "EXTRAPOLATED ANSWER"
    )
    all_results.append({
        "example": example["name"],
        "category": example["category"],
        "type": "Extrapolated",
        "combined_score": extrapolated_result["combined_score"],
        "indicator_count": extrapolated_result["indicator_count"],
        "is_extrapolation": extrapolated_result["is_extrapolation"],
        "verdict": extrapolated_result["category"]
    })


### 3.2 Summary Results Table
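
The per-type detection rates reported for this table can also be read as a confusion matrix. The sketch below is a minimal, pandas-free illustration. The function name `summarize_detection` and the sample rows are hypothetical; the rows only mimic the `type` and `is_extrapolation` fields collected in `all_results` and are not results produced by the notebook.

```python
from typing import Dict, List


def summarize_detection(rows: List[Dict]) -> Dict[str, float]:
    """Summarize detection outcomes, treating 'Extrapolated' rows as the positive class."""
    tp = sum(1 for r in rows if r["type"] == "Extrapolated" and r["is_extrapolation"])
    fn = sum(1 for r in rows if r["type"] == "Extrapolated" and not r["is_extrapolation"])
    fp = sum(1 for r in rows if r["type"] == "Grounded" and r["is_extrapolation"])
    tn = sum(1 for r in rows if r["type"] == "Grounded" and not r["is_extrapolation"])
    return {
        "accuracy": (tp + tn) / len(rows) if rows else 0.0,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Hypothetical rows for illustration only
sample_rows = [
    {"type": "Grounded", "is_extrapolation": False},
    {"type": "Grounded", "is_extrapolation": True},   # grounded answer flagged by mistake
    {"type": "Extrapolated", "is_extrapolation": True},
    {"type": "Extrapolated", "is_extrapolation": True},
]
print(summarize_detection(sample_rows))
```

Separating precision from recall shows whether errors come from flagging grounded answers or from missing extrapolated ones, which a single accuracy figure hides.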


In [None]:
# Create summary DataFrame
results_df = pd.DataFrame(all_results)

print("\n" + "="*80)
print("📊 SUMMARY: EXTRAPOLATION DETECTION RESULTS")
print("="*80)

# Format the dataframe for display
display_df = results_df.copy()
display_df["combined_score"] = display_df["combined_score"].apply(lambda x: f"{x:.3f}")
display_df["is_extrapolation"] = display_df["is_extrapolation"].apply(lambda x: "⚠️ YES" if x else "✓ NO")
display_df.columns = ["Example", "Category", "Answer Type", "Score", "Indicators", "Extrapolation?", "Verdict"]

print("\n")
print(display_df.to_string(index=False))

# Calculate accuracy
grounded_correct = sum(1 for r in all_results if r["type"] == "Grounded" and not r["is_extrapolation"])
extrapolated_detected = sum(1 for r in all_results if r["type"] == "Extrapolated" and r["is_extrapolation"])
total = len(test_examples)

print(f"\n{'='*80}")
print("📈 DETECTION ACCURACY")
print("="*80)
print(f"   • Grounded answers correctly identified:   {grounded_correct}/{total} ({grounded_correct/total*100:.0f}%)")
print(f"   • Extrapolations detected:                 {extrapolated_detected}/{total} ({extrapolated_detected/total*100:.0f}%)")
print(f"   • Overall accuracy:                        {(grounded_correct+extrapolated_detected)/(total*2)*100:.0f}%")


### 3.3 Extrapolation Indicator Analysis


In [None]:
# Analyze indicator patterns
print("\n" + "="*80)
print("📊 EXTRAPOLATION INDICATOR ANALYSIS")
print("="*80)

# Count indicators by type
indicator_counts = {}
for example in test_examples:
    result = detector.detect_extrapolation(example["extrapolated_answer"], example["context"])
    for cat, indicators in result["indicators_found"].items():
        if cat not in indicator_counts:
            indicator_counts[cat] = 0
        indicator_counts[cat] += len(indicators)

print("\n📌 INDICATOR TYPES DETECTED IN EXTRAPOLATED ANSWERS:")
for cat, count in sorted(indicator_counts.items(), key=lambda x: x[1], reverse=True):
    bar = "█" * min(count * 3, 30)
    print(f"   {cat:15} {bar} ({count})")

# Category analysis
print(f"\n{'='*80}")
print("📊 ANALYSIS BY EXTRAPOLATION CATEGORY")
print("="*80)

categories = results_df["category"].unique()

for cat in categories:
    cat_results = results_df[results_df["category"] == cat]
    grounded_score = cat_results[cat_results["type"] == "Grounded"]["combined_score"].values[0]
    extrap_score = cat_results[cat_results["type"] == "Extrapolated"]["combined_score"].values[0]
    extrap_indicators = cat_results[cat_results["type"] == "Extrapolated"]["indicator_count"].values[0]
    score_gap = grounded_score - extrap_score
    
    print(f"\n📌 {cat}:")
    print(f"   Grounded Score:        {grounded_score:.3f}")
    print(f"   Extrapolated Score:    {extrap_score:.3f}")
    print(f"   Indicators Found:      {extrap_indicators}")
    print(f"   Detection Gap:         {score_gap:.3f} {'✓' if score_gap > 0.20 else '⚠️'}")


## 4. Interactive Testing: Try Your Own Examples


In [None]:
def test_extrapolation(context: str, answer: str):
    """
    Test if an answer contains extrapolation given a context.
    
    Usage:
        test_extrapolation(
            context="Sales increased 10% last quarter.",
            answer="Sales will continue to grow next year."
        )
    """
    result = detector.detect_extrapolation(answer, context)
    
    print("="*60)
    print("🔍 EXTRAPOLATION TEST")
    print("="*60)
    print(f"\n📄 Context:\n{context}")
    print(f"\n💬 Answer:\n{answer}")
    print(f"\n📊 SCORES:")
    print(f"   • Semantic:   {result['semantic_score']:.3f}")
    print(f"   • NLI:        {result['nli_score']:.3f}")
    print(f"   • Overlap:    {result['overlap_score']:.3f}")
    print(f"   • Base:       {result['base_score']:.3f}")
    print(f"   • Penalty:    -{result['indicator_penalty']:.3f}")
    print(f"   • Final:      {result['combined_score']:.3f}")
    
    if result['indicators_found']:
        print(f"\n🚨 INDICATORS FOUND:")
        for cat, indicators in result['indicators_found'].items():
            print(f"   • {cat}: {', '.join(indicators)}")
    
    print(f"\n🎯 RESULT: {result['category']}")
    print(f"   Risk Level: {result['risk_level']}")
    print("="*60)
    
    return result

# Example: Grounded answer
print("TEST 1: Grounded Answer")
test_extrapolation(
    context="The study found that participants who meditated for 10 minutes daily reported lower stress levels after 4 weeks.",
    answer="A study showed that 10 minutes of daily meditation reduced reported stress levels over 4 weeks."
)


In [None]:
# Example: Extrapolated answer with predictions
print("TEST 2: Extrapolated Answer - Predictions")
test_extrapolation(
    context="The study found that participants who meditated for 10 minutes daily reported lower stress levels after 4 weeks.",
    answer="Meditation will cure anxiety in everyone. All people should meditate because it will definitely eliminate stress permanently."
)


In [None]:
# Example: Extrapolated answer with causal inference
print("TEST 3: Extrapolated Answer - Causal Inference")
test_extrapolation(
    context="Students who used the new learning app scored 5% higher on tests than those who didn't.",
    answer="The app causes better learning outcomes. Therefore, schools must adopt this app because it leads to improved test scores for all students."
)


## 5. Score Comparison Visualization


In [None]:
# Visual comparison of scores
grounded_scores = results_df[results_df["type"] == "Grounded"]["combined_score"]
extrap_scores = results_df[results_df["type"] == "Extrapolated"]["combined_score"]

print("="*60)
print("📊 SCORE COMPARISON: GROUNDED vs EXTRAPOLATED")
print("="*60)

print("\n📗 GROUNDED ANSWERS:")
print(f"   Average Score: {np.mean(grounded_scores):.3f}")
print(f"   Min Score:     {np.min(grounded_scores):.3f}")
print(f"   Max Score:     {np.max(grounded_scores):.3f}")

print("\n📕 EXTRAPOLATED ANSWERS:")
print(f"   Average Score: {np.mean(extrap_scores):.3f}")
print(f"   Min Score:     {np.min(extrap_scores):.3f}")
print(f"   Max Score:     {np.max(extrap_scores):.3f}")

print("\n📉 SEPARATION METRICS:")
score_diff = np.mean(grounded_scores) - np.mean(extrap_scores)
print(f"   Score Gap: {score_diff:.3f}")

# Visual bar representation
print("\n📊 VISUAL COMPARISON:")
print(f"   Grounded:     {'█' * int(np.mean(grounded_scores) * 20):<20} {np.mean(grounded_scores):.2f}")
print(f"   Extrapolated: {'█' * int(np.mean(extrap_scores) * 20):<20} {np.mean(extrap_scores):.2f}")
print(f"   Threshold:    {'─' * 11}│{'─' * 8}  0.55 (extrapolation cutoff)")

# Per-category visualization
print("\n📊 BY CATEGORY:")
for cat in categories:
    cat_data = results_df[results_df["category"] == cat]
    g_score = cat_data[cat_data["type"] == "Grounded"]["combined_score"].values[0]
    e_score = cat_data[cat_data["type"] == "Extrapolated"]["combined_score"].values[0]
    print(f"   {cat:13} Grounded: {'█' * int(g_score * 15):<15} {g_score:.2f}  |  Extrap: {'█' * int(e_score * 15):<15} {e_score:.2f}")


## 6. Key Takeaways


In [None]:
print("""
╔══════════════════════════════════════════════════════════════════════════════╗
║                    KEY TAKEAWAYS: EXTRAPOLATION DETECTION                    ║
╚══════════════════════════════════════════════════════════════════════════════╝

✅ WHAT WE DEMONSTRATED:

1. EXTRAPOLATION TYPES DETECTED:
   • Future predictions (will, expected to, going to)
   • Generalizations (always, all, everyone, never)
   • Causal inferences (therefore, leads to, causes)
   • Speculation (might, could, possibly, perhaps)
   • Unsupported conclusions (clearly, obviously, definitely)

2. DETECTION APPROACH:
   • Semantic + NLI + Overlap as base faithfulness score
   • Linguistic indicator detection for extrapolation signals
   • Penalty system for detected indicators
   • Combined scoring provides robust detection

3. KEY INDICATORS OF EXTRAPOLATION:
   • Prediction words: will, would, going to, expected to
   • Generalization words: always, never, all, every
   • Inference markers: therefore, thus, hence, consequently
   • Certainty words: definitely, certainly, obviously, clearly

4. RECOMMENDED THRESHOLDS:
   • >= 0.75: Safe - grounded response
   • 0.55-0.75: Minor extrapolation - may be acceptable
   • 0.35-0.55: Moderate extrapolation - needs review
   • < 0.35: Severe extrapolation - flag or block

════════════════════════════════════════════════════════════════════════════════

🔗 NEXT STEPS:
   • Integrate with MLflow for production monitoring
   • Customize indicator lists for your domain
   • Combine with confidence calibration techniques
   • Build domain-specific extrapolation rules (e.g., medical, financial)

════════════════════════════════════════════════════════════════════════════════

⚠️ IMPORTANT CONSIDERATIONS:
   • Some extrapolation may be acceptable (e.g., "This suggests...")
   • Context matters: scientific papers vs. casual conversations
   • Indicator-based detection may have false positives
   • Consider the criticality of the application when setting thresholds

════════════════════════════════════════════════════════════════════════════════
""")
