## Step 1: Initial Experiment with Basic Prompts
**Description:** This step sets up the environment, imports the necessary libraries (Groq, pandas, vaderSentiment), and defines four initial prompt strategies: Direct, Few-Shot, Chain-of-Thought (CoT), and Hybrid (which passes a VADER sentiment score and label to the model alongside the review). It runs a pilot experiment on 50 reviews sampled from 'Book1.csv'.

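**Result:** The Direct and Few-Shot strategies tied for the best accuracy at 70%, CoT reached 64%, and the Hybrid approach lagged behind at 54%. Every strategy returned valid JSON for all 50 reviews except Few-Shot (98%), whose one unparseable response fell back to a 3-star prediction.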

In [None]:
import json
import pandas as pd
from groq import Groq
from tqdm import tqdm
from sklearn.metrics import accuracy_score
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

# -----------------------------
# SENTIMENT PREPROCESSOR
# -----------------------------
analyzer = SentimentIntensityAnalyzer()

def preprocess_sentiment(text):
    score = analyzer.polarity_scores(text)["compound"]  
    label = (
        "very_negative" if score <= -0.5 else
        "negative" if score < 0 else
        "neutral" if score < 0.3 else
        "positive" if score < 0.6 else
        "very_positive"
    )
    return score, label


# -----------------------------
# API KEY + CLIENT
# -----------------------------

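import os

# GROQ_API_KEY was undefined in this cell; reading it from an environment
# variable of the same name is an assumption about how the key is supplied.
GROQ_API_KEY = os.environ["GROQ_API_KEY"]
client = Groq(api_key=GROQ_API_KEY)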


# -----------------------------
# PROMPT TEMPLATES
# -----------------------------
def direct_prompt(review):
    return f"""
You are a Yelp rating classifier.
Given a review, output JSON ONLY:

{{
 "predicted_stars": <1-5 integer>,
 "explanation": "<brief reasoning>"
}}

Review:
\"\"\"{review}\"\"\"
"""


def few_shot_prompt(review):
    return f"""
You are a Yelp rating classifier. Learn from examples:

Example 1:
Review: "Amazing food and super friendly people!"
Output: {{"predicted_stars": 5, "explanation": "Strong positive sentiment."}}

Example 2:
Review: "Cold food. Slow service."
Output: {{"predicted_stars": 2, "explanation": "Mostly negative."}}

Example 3:
Review: "It's fine. Not great, not terrible."
Output: {{"predicted_stars": 3, "explanation": "Neutral experience."}}

Now classify the following:

Review:
\"\"\"{review}\"\"\"

Return JSON ONLY.
"""


def chain_of_thought_prompt(review):
    return f"""
You classify Yelp reviews into star ratings (1–5).

INTERNAL RULES (Do NOT reveal reasoning):
- 1 = extremely negative
- 2 = mostly negative
- 3 = mixed/average
- 4 = mostly positive
- 5 = extremely positive

Think step-by-step internally but DO NOT show the steps.

Output ONLY JSON:

{{
 "predicted_stars": <1-5>,
 "explanation": "<brief summary>"
}}

Review:
\"\"\"{review}\"\"\"
"""


def hybrid_prompt(review, sentiment_score, sentiment_label):
    return f"""
You are a Yelp star-rating classifier.

You will receive:
1. The original review text.
2. A precomputed sentiment score from a rule-based analyzer.
3. A sentiment label based on thresholds.

Use BOTH the metadata AND the review to determine the final star rating (1–5).

Mapping rules:
- Very negative sentiment → 1 or 2
- Mixed or neutral → 3
- Mostly positive → 4
- Very positive → 5

Now produce JSON ONLY:

{{
 "predicted_stars": <1-5>,
 "explanation": "<reason considering review + sentiment metadata>"
}}

Review Text:
\"\"\"{review}\"\"\"

Rule-Based Metadata:
- sentiment_score: {sentiment_score}
- sentiment_label: "{sentiment_label}"

Return JSON only.
"""


# -----------------------------
# MODEL CALL
# -----------------------------
def call_llama(prompt):
    response = client.chat.completions.create(
        model="llama-3.1-8b-instant",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0
    )
    return response.choices[0].message.content


# -----------------------------
# JSON PARSER
# -----------------------------
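def parse_json(output):
    """Return (stars, True) if the output is JSON with an integer rating, else (None, False)."""
    try:
        data = json.loads(output)
        return int(data["predicted_stars"]), True
    except (json.JSONDecodeError, KeyError, TypeError, ValueError):
        # Catch only parsing and lookup errors so that interrupts still propagate.
        return None, False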


# -----------------------------
# EXPERIMENT RUNNER
# -----------------------------
def run_experiment(df, prompt_type):
    preds = []
    json_valid_count = 0

    for review in tqdm(df["text"], desc=prompt_type):

        if prompt_type == "direct":
            prompt = direct_prompt(review)

        elif prompt_type == "fewshot":
            prompt = few_shot_prompt(review)

        elif prompt_type == "cot":
            prompt = chain_of_thought_prompt(review)

        elif prompt_type == "hybrid":
            score, label = preprocess_sentiment(review)
            prompt = hybrid_prompt(review, score, label)

        else:
            raise ValueError("Invalid prompt type")

        raw_output = call_llama(prompt)
        stars, valid = parse_json(raw_output)

        if valid and stars is not None:
            preds.append(stars)
            json_valid_count += 1
        else:
            preds.append(3)  # fallback

    accuracy = accuracy_score(df["stars"], preds)
    json_validity = json_valid_count / len(df)
    return preds, accuracy, json_validity


# -----------------------------
# MAIN EXECUTION
# -----------------------------
if __name__ == "__main__":
    df = pd.read_csv("./Book1.csv")
    df = df.sample(50, random_state=42).reset_index(drop=True)

    results = {}

    for style in ["direct", "fewshot", "cot", "hybrid"]:
        preds, acc, json_rate = run_experiment(df, style)

        results[style] = {
            "accuracy": acc,
            "json_validity": json_rate
        }

        df[f"pred_{style}"] = preds

    print("\n===== FINAL RESULTS =====")
    print(json.dumps(results, indent=4))

    df.to_csv("predicted_results1.csv", index=False)
    print("\nSaved predictions → predicted_results1.csv")


direct: 100%|██████████| 50/50 [01:19<00:00,  1.58s/it]
fewshot: 100%|██████████| 50/50 [02:45<00:00,  3.32s/it]
cot: 100%|██████████| 50/50 [02:35<00:00,  3.11s/it]
hybrid: 100%|██████████| 50/50 [03:23<00:00,  4.07s/it]


===== FINAL RESULTS =====
{
    "direct": {
        "accuracy": 0.7,
        "json_validity": 1.0
    },
    "fewshot": {
        "accuracy": 0.7,
        "json_validity": 0.98
    },
    "cot": {
        "accuracy": 0.64,
        "json_validity": 1.0
    },
    "hybrid": {
        "accuracy": 0.54,
        "json_validity": 1.0
    }
}

Saved predictions → predicted_results1.csv
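
The Few-Shot run above reports a JSON validity of 0.98, meaning one response could not be parsed by `parse_json`. The notebook does not show that response, so the cause is unknown. One common cause is extra text or a Markdown fence wrapped around the JSON object. A minimal sketch of a more tolerant parser under that assumption (`extract_stars` is a hypothetical helper, not part of the original notebook):

```python
import json
import re


def extract_stars(output):
    """Return (stars, ok) from raw model text, tolerating text around the JSON."""
    # Take the span from the first "{" to the last "}" (greedy, across newlines).
    match = re.search(r"\{.*\}", output, flags=re.DOTALL)
    if match is None:
        return None, False
    try:
        data = json.loads(match.group(0))
        stars = int(data["predicted_stars"])
    except (json.JSONDecodeError, KeyError, TypeError, ValueError):
        return None, False
    # Reject ratings outside the 1-5 range instead of passing them through.
    if 1 <= stars <= 5:
        return stars, True
    return None, False


wrapped = 'Sure! ```json\n{"predicted_stars": 4, "explanation": "ok"}\n```'
print(extract_stars(wrapped))
```

This does not help if the model returns malformed JSON; in that case the fallback prediction of 3 would still apply.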





## Step 2: Refined Strategies and Ensembles (Small Scale)
**Description:** This step introduces more sophisticated prompting techniques: Structured Analysis (breaking the review into components), Numerical Reasoning (scoring dimensions before mapping to stars), and Contrastive (weighing positives against negatives). It also implements two ensembles, one labelled Majority Vote (implemented as the median of the four predictions) and one weighted by each strategy's accuracy, and runs a test on 20 reviews sampled from 'yelp.csv'.

**Result:** Structured Analysis and both ensembles achieved the highest accuracy, 75%, against 65% for the Direct baseline: a gain of 10 percentage points (15.4% relative), which on this 20-review sample corresponds to two reviews.

In [None]:
import json
import pandas as pd
from groq import Groq
from tqdm import tqdm
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import time
import numpy as np


# API KEY + CLIENT

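import os

# GROQ_API_KEY was undefined in this cell; reading it from an environment
# variable of the same name is an assumption about how the key is supplied.
GROQ_API_KEY = os.environ["GROQ_API_KEY"]
client = Groq(api_key=GROQ_API_KEY)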


# ===== PROMPTS =====
def prompt_direct(review):
    """Direct approach baseline"""
    return f"""Rate as 1-5 stars. JSON ONLY: {{"predicted_stars": <1-5>}}
Review: "{review}\""""

def prompt_structured_analysis(review):
    """Breaks down review into specific components"""
    return f"""
You are a Yelp rating expert. Analyze this review COMPONENT BY COMPONENT:

Review: "{review}"

Output ONLY valid JSON with this exact structure:
{{
 "service_quality": "positive/negative/neutral/not_mentioned",
 "food_quality": "positive/negative/neutral/not_mentioned",
 "ambiance": "positive/negative/neutral/not_mentioned",
 "value": "positive/negative/neutral/not_mentioned",
 "overall_sentiment": "very_negative/negative/neutral/positive/very_positive",
 "predicted_stars": <1-5>,
 "explanation": "<one sentence>"
}}
"""


def prompt_numerical_reasoning(review):
    """Uses numerical scoring before mapping to stars"""
    return f"""
You are a Yelp classifier. Score each dimension 0-10, then convert to stars.

Review: "{review}"

Scoring (0-10 for each):
- Satisfaction level: ___
- Would recommend: ___
- Problem severity (inverse): ___

Output ONLY valid JSON:
{{
 "satisfaction_score": <0-10>,
 "recommendation_score": <0-10>,
 "problem_severity": <0-10>,
 "predicted_stars": <1-5>,
 "explanation": "<reasoning>"
}}
"""


def prompt_contrastive(review):
    """Explicitly compares positive and negative aspects"""
    return f"""
You are a Yelp rater. Compare positives vs negatives.

Review: "{review}"

List main positives (if any):
- [positive aspects]

List main negatives (if any):
- [negative aspects]

Weighting: Which dominates?

Output ONLY valid JSON:
{{
 "positive_count": <number>,
 "negative_count": <number>,
 "dominant_sentiment": "positive/negative/neutral",
 "predicted_stars": <1-5>,
 "explanation": "<comparison-based reasoning>"
}}
"""





def get_available_model():
    """Get first available text model from Groq"""
    try:
        models = client.models.list()
        # Filter for actual chat models (exclude guards, prompts guards, etc)
        text_models = [
            m.id for m in models.data 
            if any(x in m.id.lower() for x in ['llama-3.3', 'llama-3.1', 'mixtral'])
            and 'guard' not in m.id.lower()
            and 'prompt' not in m.id.lower()
        ]
        if text_models:
            print(f"\n✓ Available models: {text_models}")
            model_choice = text_models[0]
            print(f"✓ Using: {model_choice}\n")
            return model_choice
        else:
            print("\n✓ Using fallback: llama-3.3-70b-versatile\n")
            return "llama-3.3-70b-versatile"
    except Exception:
        print("\n✓ Using fallback: llama-3.3-70b-versatile\n")
        return "llama-3.3-70b-versatile"

GROQ_MODEL = get_available_model()

# MODEL CALL with retry logic
def call_llama(prompt, model=None, max_retries=3):
    """Call Groq API with retry logic and throttling."""
    if model is None:
        model = GROQ_MODEL
    
    for attempt in range(max_retries):
        try:
            response = client.chat.completions.create(
                model=model,
                messages=[{"role": "user", "content": prompt}],
                temperature=0,
                max_tokens=200
            )
            time.sleep(0.3)
            return response.choices[0].message.content
        except Exception as e:
            if "rate_limit" in str(e).lower() and attempt < max_retries - 1:
                wait_time = (2 ** attempt) * 3
                tqdm.write(f"\nRate limit. Waiting {wait_time}s...")
                time.sleep(wait_time)
            else:
                raise


# JSON PARSER
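def parse_json(output):
    """Return (stars, True) for valid JSON with a 1-5 rating, else (None, False)."""
    try:
        data = json.loads(output)
        stars = int(data.get("predicted_stars"))
        if 1 <= stars <= 5:
            return stars, True
        return None, False
    except (json.JSONDecodeError, AttributeError, TypeError, ValueError):
        # Catch only parsing and lookup errors so that interrupts still propagate.
        return None, False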


# EXPERIMENT RUNNER
def run_experiment(df, prompt_type):
    preds = []
    json_valid_count = 0
    failed_count = 0

    prompt_map = {
        "direct": prompt_direct,
        "structured": prompt_structured_analysis,
        "numerical": prompt_numerical_reasoning,
        "contrastive": prompt_contrastive
    }
    
    prompt_fn = prompt_map.get(prompt_type)
    if not prompt_fn:
        raise ValueError(f"Invalid prompt type: {prompt_type}")

    for review in tqdm(df["text"], desc=prompt_type, leave=False):
        try:
            prompt = prompt_fn(review)
            raw = call_llama(prompt)
            stars, valid = parse_json(raw)

            if valid and stars is not None:
                preds.append(stars)
                json_valid_count += 1
            else:
                preds.append(3)
                failed_count += 1
        except Exception as e:
            tqdm.write(f"\nError: {str(e)[:80]}")
            preds.append(3)
            failed_count += 1

    accuracy = accuracy_score(df["stars"], preds)
    json_validity = json_valid_count / len(df)
    return preds, accuracy, json_validity, failed_count


# ENSEMBLE FUNCTIONS
def create_ensemble(individual_preds_dict):
    """Median ensemble (reported as "majority vote"); int() truncates a fractional median."""
    n = len(next(iter(individual_preds_dict.values())))
    ensemble_preds = []
    
    for idx in range(n):
        votes = [individual_preds_dict[model][idx] for model in sorted(individual_preds_dict.keys())]
        consensus = int(np.median(votes))
        ensemble_preds.append(consensus)
    
    return ensemble_preds


def create_weighted_ensemble(individual_preds_dict, weights):
    """Weighted ensemble based on accuracy"""
    n = len(next(iter(individual_preds_dict.values())))
    ensemble_preds = []
    
    for idx in range(n):
        weighted_sum = sum(individual_preds_dict[model][idx] * weights[model] 
                          for model in sorted(individual_preds_dict.keys()))
        weighted_avg = weighted_sum / sum(weights.values())
        ensemble_preds.append(round(weighted_avg))
    
    return ensemble_preds


# MAIN
if __name__ == "__main__":
    print("\n" + "="*70)
    print("YELP REVIEW CLASSIFIER - Multi-Prompt Ensemble")
    print("="*70)
    
    print("\nLoading data...")
    df = pd.read_csv("./yelp.csv")
    df = df.sample(20, random_state=42).reset_index(drop=True)
    print(f"Loaded {len(df)} reviews")

    results = {}
    individual_preds = {}
    
    prompt_styles = ["direct","structured", "numerical", "contrastive"]

    print(f"\nRunning experiments on {len(df)} reviews...")
    print("-" * 70)
    

    for style in prompt_styles:
        print(f"\n▶ Testing {style.upper()}...")
        preds, acc, json_rate, failed = run_experiment(df, style)
        results[style] = {
            "accuracy": round(acc, 4),
            "json_validity": round(json_rate, 4),
            "failed_parses": failed
        }
        individual_preds[style] = preds
        df[f"pred_{style}"] = preds
        print(f"  ✓ Accuracy: {acc:.4f} | JSON Valid: {json_rate:.4f} | Failed: {failed}")

    print("\n" + "="*70)
    print("INDIVIDUAL MODEL RESULTS")
    print("="*70)
    
    sorted_results = sorted(results.items(), key=lambda x: x[1]["accuracy"], reverse=True)
    for i, (style, metrics) in enumerate(sorted_results, 1):
        acc = metrics["accuracy"]
        # 0.650 is a hard-coded baseline: the Direct accuracy on this 20-review sample.
        improvement = ((acc - 0.650) / 0.650) * 100
        print(f"{i}. {style:15} | Accuracy: {acc:.4f} | {improvement:+.1f}% vs baseline")

    print("\n" + "="*70)
    print("ENSEMBLE STRATEGIES")
    print("="*70)

    # Majority Vote Ensemble
    print("\n▶ MAJORITY VOTE ENSEMBLE...")
    ensemble_majority = create_ensemble(individual_preds)
    ensemble_majority_acc = accuracy_score(df["stars"], ensemble_majority)
    df["pred_ensemble_majority"] = ensemble_majority
    print(f"  ✓ Accuracy: {ensemble_majority_acc:.4f}")

    # Weighted Ensemble
    print("\n▶ WEIGHTED ENSEMBLE (by accuracy)...")
    weights = {k: v["accuracy"] for k, v in results.items()}
    ensemble_weighted = create_weighted_ensemble(individual_preds, weights)
    ensemble_weighted_acc = accuracy_score(df["stars"], ensemble_weighted)
    df["pred_ensemble_weighted"] = ensemble_weighted
    print(f"  ✓ Accuracy: {ensemble_weighted_acc:.4f}")

    print("\n" + "="*70)
    print("FINAL COMPARISON")
    print("="*70)

    all_results = {
        **results,
        "ensemble_majority": {"accuracy": round(ensemble_majority_acc, 4)},
        "ensemble_weighted": {"accuracy": round(ensemble_weighted_acc, 4)}
    }

    final_sorted = sorted(all_results.items(), key=lambda x: x[1]["accuracy"], reverse=True)
    
    for i, (name, metrics) in enumerate(final_sorted, 1):
        acc = metrics["accuracy"]
        improvement = ((acc - 0.650) / 0.650) * 100
        medal = ["🥇", "🥈", "🥉"][i-1] if i <= 3 else "  "
        print(f"{medal} {i}. {name:20} | Accuracy: {acc:.4f} | {improvement:+.1f}%")

    print("\n" + "="*70)
    winner = final_sorted[0]
    improvement_pct = ((winner[1]['accuracy'] - 0.650) / 0.650) * 100
    print(f"🏆 OVERALL WINNER: {winner[0].upper()}")
    print(f"   Final Accuracy: {winner[1]['accuracy']:.4f} (+{improvement_pct:.1f}%)")
    print("="*70)

    # Save outputs
    df.to_csv("predicted_results_final.csv", index=False)
    with open("results_summary.json", "w") as f:
        json.dump(all_results, f, indent=2)

    print(f"\n✓ Saved predictions → predicted_results_final.csv")
    print(f"✓ Saved summary → results_summary.json")


✓ Available models: ['llama-3.1-8b-instant', 'llama-3.3-70b-versatile']
✓ Using: llama-3.1-8b-instant


YELP REVIEW CLASSIFIER - Multi-Prompt Ensemble

Loading data...
Loaded 20 reviews

Running experiments on 20 reviews...
----------------------------------------------------------------------

▶ Testing DIRECT...
  ✓ Accuracy: 0.6500 | JSON Valid: 1.0000 | Failed: 0

▶ Testing STRUCTURED...
  ✓ Accuracy: 0.7500 | JSON Valid: 1.0000 | Failed: 0

▶ Testing NUMERICAL...
  ✓ Accuracy: 0.7000 | JSON Valid: 1.0000 | Failed: 0

▶ Testing CONTRASTIVE...
  ✓ Accuracy: 0.7000 | JSON Valid: 1.0000 | Failed: 0

INDIVIDUAL MODEL RESULTS
1. structured      | Accuracy: 0.7500 | +15.4% vs baseline
2. numerical       | Accuracy: 0.7000 | +7.7% vs baseline
3. contrastive     | Accuracy: 0.7000 | +7.7% vs baseline
4. direct          | Accuracy: 0.6500 | +0.0% vs baseline

ENSEMBLE STRATEGIES

▶ MAJORITY VOTE ENSEMBLE...
  ✓ Accuracy: 0.7500

▶ WEIGHTED ENSEMBLE (by accuracy)...
  ✓ Accuracy: 0.7500

FINAL COMPARISON
🥇 1. structured           | Accuracy: 0.7500 | +15.4%
🥈 2. ensemble_majority    | Accuracy: 0.7500 | +15.4%
🥉 3. ensemble_weighted    | Accuracy: 0.7500 | +15.4%
   4. numerical            | Accuracy: 0.7000 | +7.7%
   5. contrastive          | Accuracy: 0.7000 | +7.7%
   6. direct               | Accuracy: 0.6500 | +0.0%

🏆 OVERALL WINNER: STRUCTURED
   Final Accuracy: 0.7500 (+15.4%)

✓ Saved predictions → predicted_results_final.csv
✓ Saved summary → results_summary.json
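
`create_ensemble` in the cell above is reported as a majority vote but computes `int(np.median(votes))`. With four voters the two rules can disagree. A sketch of a plurality vote for comparison follows; `plurality_vote` and its tie-breaking rule (the lower median of the tied ratings) are assumptions made for illustration, not the notebook's method.

```python
from collections import Counter
import statistics


def plurality_vote(votes):
    """Return the most common star rating among the votes."""
    counts = Counter(votes)
    top = max(counts.values())
    tied = sorted(star for star, count in counts.items() if count == top)
    # Assumed tie-break: the lower median of the tied ratings.
    return statistics.median_low(tied)


def median_vote(votes):
    """Mirror the notebook's rule: truncate the median to an integer."""
    return int(statistics.median(votes))


votes = [5, 5, 1, 2]  # two strategies agree on 5, two disagree
print(plurality_vote(votes), median_vote(votes))  # prints: 5 3
```

The median rule pulls split votes toward the middle of the scale, which may or may not be desirable for an ordinal target such as star ratings.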




## Step 3: Large Scale Validation (200 Reviews)
**Description:** This step scales the experiment from Step 2 to 200 randomly sampled reviews to test the robustness of the strategies and ensembles on a larger sample. The cell reuses the prompts, `run_experiment`, and the ensemble functions defined in Step 2.

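**Result:** The simple Direct approach ranked first with 67.5% accuracy, narrowly ahead of Structured and both ensembles at 67.0% (a difference of one review out of 200), while Numerical (62.0%) and Contrastive (58.5%) fell behind. The more complex strategies therefore did not retain the advantage they showed on the 20-review sample. Note that the percentage changes in this step's output are still measured against the 0.650 baseline hard-coded in Step 2, not against this run's Direct accuracy.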
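
The code in this step computes every "vs baseline" percentage as `(acc - 0.650) / 0.650`, where 0.650 is a fixed constant. A small sketch of the same calculation against the Direct accuracy measured in this run follows; `relative_change` is a hypothetical helper, and the accuracies are copied from this step's reported results.

```python
def relative_change(accuracy, baseline):
    """Percent change of `accuracy` relative to `baseline`."""
    return (accuracy - baseline) / baseline * 100


# Accuracies reported for the 200-review run in this step.
accuracies = {
    "direct": 0.675,
    "structured": 0.670,
    "numerical": 0.620,
    "contrastive": 0.585,
}

baseline = accuracies["direct"]  # this run's Direct accuracy, not the fixed 0.650
changes = {name: relative_change(acc, baseline) for name, acc in accuracies.items()}

for name, change in changes.items():
    print(f"{name:12} {change:+.1f}% vs direct")
```

Measured this way, Structured sits a little under 1% below Direct instead of showing a positive change.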

In [33]:
import json
import pandas as pd
from groq import Groq
from tqdm import tqdm
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import time
import numpy as np

# MAIN
if __name__ == "__main__":
    print("\n" + "="*70)
    print("YELP REVIEW CLASSIFIER - Multi-Prompt Ensemble")
    print("="*70)
    
    print("\nLoading data...")
    df = pd.read_csv("./yelp.csv")
    df = df.sample(200, random_state=42).reset_index(drop=True)
    print(f"Loaded {len(df)} reviews")

    results = {}
    individual_preds = {}
    
    prompt_styles = ["direct","structured", "numerical", "contrastive"]

    print(f"\nRunning experiments on {len(df)} reviews...")
    print("-" * 70)
    

    for style in prompt_styles:
        print(f"\n▶ Testing {style.upper()}...")
        preds, acc, json_rate, failed = run_experiment(df, style)
        results[style] = {
            "accuracy": round(acc, 4),
            "json_validity": round(json_rate, 4),
            "failed_parses": failed
        }
        individual_preds[style] = preds
        df[f"pred_{style}"] = preds
        print(f"  ✓ Accuracy: {acc:.4f} | JSON Valid: {json_rate:.4f} | Failed: {failed}")

    print("\n" + "="*70)
    print("INDIVIDUAL MODEL RESULTS")
    print("="*70)
    
    sorted_results = sorted(results.items(), key=lambda x: x[1]["accuracy"], reverse=True)
    for i, (style, metrics) in enumerate(sorted_results, 1):
        acc = metrics["accuracy"]
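        # 0.650 is the hard-coded Direct accuracy from the 20-review run in Step 2,
        # not the Direct accuracy measured in this 200-review run.
        improvement = ((acc - 0.650) / 0.650) * 100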
        print(f"{i}. {style:15} | Accuracy: {acc:.4f} | {improvement:+.1f}% vs baseline")

    print("\n" + "="*70)
    print("ENSEMBLE STRATEGIES")
    print("="*70)

    # Majority Vote Ensemble
    print("\n▶ MAJORITY VOTE ENSEMBLE...")
    ensemble_majority = create_ensemble(individual_preds)
    ensemble_majority_acc = accuracy_score(df["stars"], ensemble_majority)
    df["pred_ensemble_majority"] = ensemble_majority
    print(f"  ✓ Accuracy: {ensemble_majority_acc:.4f}")

    # Weighted Ensemble
    print("\n▶ WEIGHTED ENSEMBLE (by accuracy)...")
    weights = {k: v["accuracy"] for k, v in results.items()}
    ensemble_weighted = create_weighted_ensemble(individual_preds, weights)
    ensemble_weighted_acc = accuracy_score(df["stars"], ensemble_weighted)
    df["pred_ensemble_weighted"] = ensemble_weighted
    print(f"  ✓ Accuracy: {ensemble_weighted_acc:.4f}")

    print("\n" + "="*70)
    print("FINAL COMPARISON")
    print("="*70)

    all_results = {
        **results,
        "ensemble_majority": {"accuracy": round(ensemble_majority_acc, 4)},
        "ensemble_weighted": {"accuracy": round(ensemble_weighted_acc, 4)}
    }

    final_sorted = sorted(all_results.items(), key=lambda x: x[1]["accuracy"], reverse=True)
    
    for i, (name, metrics) in enumerate(final_sorted, 1):
        acc = metrics["accuracy"]
        improvement = ((acc - 0.650) / 0.650) * 100
        medal = ["🥇", "🥈", "🥉"][i-1] if i <= 3 else "  "
        print(f"{medal} {i}. {name:20} | Accuracy: {acc:.4f} | {improvement:+.1f}%")

    print("\n" + "="*70)
    winner = final_sorted[0]
    improvement_pct = ((winner[1]['accuracy'] - 0.650) / 0.650) * 100
    print(f"🏆 OVERALL WINNER: {winner[0].upper()}")
    print(f"   Final Accuracy: {winner[1]['accuracy']:.4f} (+{improvement_pct:.1f}%)")
    print("="*70)

    # Save outputs
    df.to_csv("predicted_results_final_200.csv", index=False)
    with open("results_summary_final.json", "w") as f:
        json.dump(all_results, f, indent=2)

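    # These messages still name the Step 2 files; this cell actually writes
    # predicted_results_final_200.csv and results_summary_final.json.
    print(f"\n✓ Saved predictions → predicted_results_final.csv")
    print(f"✓ Saved summary → results_summary.json")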


YELP REVIEW CLASSIFIER - Multi-Prompt Ensemble

Loading data...
Loaded 200 reviews

Running experiments on 200 reviews...
----------------------------------------------------------------------

▶ Testing DIRECT...
  ✓ Accuracy: 0.6750 | JSON Valid: 1.0000 | Failed: 0

▶ Testing STRUCTURED...
  ✓ Accuracy: 0.6700 | JSON Valid: 0.9950 | Failed: 1

▶ Testing NUMERICAL...
  ✓ Accuracy: 0.6200 | JSON Valid: 1.0000 | Failed: 0

▶ Testing CONTRASTIVE...
  ✓ Accuracy: 0.5850 | JSON Valid: 1.0000 | Failed: 0

INDIVIDUAL MODEL RESULTS
1. direct          | Accuracy: 0.6750 | +3.8% vs baseline
2. structured      | Accuracy: 0.6700 | +3.1% vs baseline
3. numerical       | Accuracy: 0.6200 | -4.6% vs baseline
4. contrastive     | Accuracy: 0.5850 | -10.0% vs baseline

ENSEMBLE STRATEGIES

▶ MAJORITY VOTE ENSEMBLE...
  ✓ Accuracy: 0.6700

▶ WEIGHTED ENSEMBLE (by accuracy)...
  ✓ Accuracy: 0.6700

FINAL COMPARISON
🥇 1. direct               | Accuracy: 0.6750 | +3.8%
🥈 2. structured           | Accuracy: 0.6700 | +3.1%
🥉 3. ensemble_majority    | Accuracy: 0.6700 | +3.1%
   4. ensemble_weighted    | Accuracy: 0.6700 | +3.1%
   5. numerical            | Accuracy: 0.6200 | -4.6%
   6. contrastive          | Accuracy: 0.5850 | -10.0%

🏆 OVERALL WINNER: DIRECT
   Final Accuracy: 0.6750 (+3.8%)

✓ Saved predictions → predicted_results_final.csv
✓ Saved summary → results_summary.json
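
Direct's lead over Structured in the output above is 0.6750 versus 0.6700, which on 200 reviews is 135 versus 134 correct predictions (counts derived from the reported accuracies). A sketch of a 95% Wilson score interval for each follows, to gauge how much sampling noise surrounds those figures. `wilson_interval` is an illustrative helper, not part of the notebook.

```python
import math


def wilson_interval(correct, n, z=1.96):
    """Wilson score interval for a proportion; z=1.96 gives about 95% coverage."""
    p = correct / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half_width, centre + half_width


# Correct counts derived from the reported accuracies on 200 reviews.
direct_lo, direct_hi = wilson_interval(135, 200)
structured_lo, structured_hi = wilson_interval(134, 200)

print(f"direct:     {direct_lo:.3f} to {direct_hi:.3f}")
print(f"structured: {structured_lo:.3f} to {structured_hi:.3f}")
```

Both intervals run from about 0.60 to about 0.74 and overlap almost entirely, so this sample does not separate the two strategies. Because both strategies were scored on the same 200 reviews, a paired comparison of their per-review predictions would be a more sensitive check than these independent intervals.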


