# TARS Colab: Real-Usecase Agent Self-Improvement Evaluation

This notebook tests whether the **same support agent** improves over an ordered sequence of real-world-style customer support conversations.

Use case: an ecommerce support assistant evolving from v1.0 to v1.5.


## 1) Setup

If you are running in Colab, uncomment the clone lines in the cell below so the repo is cloned first. If you are already inside the repo, leave them commented out.

In [None]:
# Optional (Colab): clone the repo
# !git clone https://github.com/<your-org-or-user>/tars.git
# %cd tars

# %pip installs into the environment of the running kernel
%pip install --upgrade pip
%pip install -e .
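
After the install, a quick importability check can catch a failed or misdirected install before the later cells run. This is a minimal sketch using only the standard library; `is_installed` is a hypothetical helper name, and `tars_analyzer` is the package name the later cells import.

```python
import importlib.util


def is_installed(package_name):
    """Return True if the top-level package can be found by the import system."""
    return importlib.util.find_spec(package_name) is not None


# 'json' ships with Python, so this check always succeeds.
print("json importable:", is_installed("json"))
# In the notebook, this should print True once the install cell has run.
print("tars_analyzer importable:", is_installed("tars_analyzer"))
```

If the second line prints `False` in Colab, restarting the runtime after the install is a common first thing to try.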


## 2) Inspect the real-usecase dataset

Dataset file: `examples/customer_support_progression.jsonl`.
Each line is one JSON record holding a full conversation together with its timestamp.

In [None]:
import json
from pathlib import Path

dataset_path = Path('examples/customer_support_progression.jsonl')
# One JSON object per non-empty line; read explicitly as UTF-8
rows = [json.loads(line) for line in dataset_path.read_text(encoding='utf-8').splitlines() if line.strip()]

print(f'Loaded {len(rows)} conversations')
print('Conversations in file order (timestamp, ID, agent version):')
for r in rows:
    print('-', r['timestamp'], r['conversation_id'], r.get('metadata', {}).get('agent_version'))

print('\nSample first conversation:')
print(json.dumps(rows[0], indent=2))
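
Before running the analyzer, it can help to confirm that every record has the fields the later cells rely on. The sketch below is a minimal check, not the repository's own validation. `conversation_id` and `timestamp` are read by the cell above, while the `turns`, `role`, and `content` key names are assumptions inferred from the attributes the offline evaluator in section 3 accesses. `find_schema_problems` is a hypothetical helper, and the sample records are made up.

```python
# Keys the inspection cell reads directly from each record.
REQUIRED_KEYS = ("conversation_id", "timestamp")
# Assumed per-turn keys (inferred from the evaluator code, not confirmed by the repo).
ASSUMED_TURN_KEYS = ("role", "content")


def find_schema_problems(records):
    """Return a list of readable problems; an empty list means the records look usable."""
    problems = []
    for index, record in enumerate(records):
        for key in REQUIRED_KEYS:
            if key not in record:
                problems.append(f"record {index}: missing '{key}'")
        turns = record.get("turns")
        if not isinstance(turns, list) or not turns:
            problems.append(f"record {index}: 'turns' is missing or empty")
            continue
        for turn_index, turn in enumerate(turns):
            for key in ASSUMED_TURN_KEYS:
                if key not in turn:
                    problems.append(f"record {index}, turn {turn_index}: missing '{key}'")
    return problems


# Made-up records for illustration only.
sample_records = [
    {
        "conversation_id": "demo-1",
        "timestamp": "2024-01-01T00:00:00Z",
        "turns": [{"role": "agent", "content": "Hello"}],
    },
    {"conversation_id": "demo-2", "turns": []},
]
problems = find_schema_problems(sample_records)
print("\n".join(problems) or "No problems found")
```

In the notebook, calling `find_schema_problems(rows)` right after loading would apply the same check to the real dataset.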


## 3) Offline deterministic test (no Gemini key required)

This cell exercises the analysis pipeline end to end without network access by temporarily replacing `analyzer.GeminiEvaluator` with a deterministic, keyword-based scorer.
The scorer is a heuristic stand-in, not a real quality measure; for this dataset it should produce an improving trajectory.

In [None]:
from tars_analyzer import analyzer
from tars_analyzer.models import ConversationProgress, ProgressionEvaluation

class HeuristicProgressionEvaluator:
    def __init__(self, model='gemini-2.0-flash'):
        self.model = model

    def evaluate_progression(self, conversations):
        conversations = sorted(conversations, key=lambda c: c.timestamp)

        def score_text(text: str) -> float:
            text_l = text.lower()
            score = 4.5
            if any(k in text_l for k in ['sorry', 'thanks', 'appreciate']):
                score += 0.8
            if any(k in text_l for k in ['i checked', 'i found', 'reviewed', 'logs']):
                score += 1.0
            if any(k in text_l for k in ['done', 'submitted', 'confirmed', 'generated', 'scheduled']):
                score += 1.2
            if any(k in text_l for k in ['timeline', 'eta', 'business days', 'reference', '#']):
                score += 0.9
            if any(k in text_l for k in ['contingency', 'fallback', 'options', 'trade-off']):
                score += 0.6
            return min(10.0, score)

        qualities = []
        for convo in conversations:
            agent_text = ' '.join(t.content for t in convo.turns if t.role.lower() == 'agent')
            qualities.append(round(score_text(agent_text), 2))

        # Rank by ascending quality: rank 1 is the lowest score; ties keep time order.
        order = sorted(range(len(qualities)), key=lambda i: qualities[i])
        ranks = {idx: rank for rank, idx in enumerate(order, start=1)}

        per = []
        for i, convo in enumerate(conversations):
            # The first conversation has no predecessor, so its improvement is 0.0.
            improvement = 0.0 if i == 0 else round(qualities[i] - qualities[i - 1], 2)
            per.append(
                ConversationProgress(
                    conversation_id=convo.conversation_id,
                    rank=ranks[i],
                    overall_agent_quality=qualities[i],
                    improvement_vs_previous=improvement,
                    notes='Heuristic offline score based on empathy, actionability, and follow-through.',
                )
            )

        # Label the trend from the first-to-last change, with a 0.4-point dead band.
        delta = qualities[-1] - qualities[0]
        label = 'improving' if delta > 0.4 else ('declining' if delta < -0.4 else 'flat')
        return ProgressionEvaluation(
            overall_summary='Offline heuristic progression evaluation completed.',
            trajectory_label=label,
            trajectory_confidence=7.0,
            per_conversation=per,
        )

# Temporarily swap in the offline evaluator; the finally block restores the original.
original = analyzer.GeminiEvaluator
analyzer.GeminiEvaluator = HeuristicProgressionEvaluator
try:
    offline_report = analyzer.analyze_conversations(dataset_path, 'output_offline')
finally:
    analyzer.GeminiEvaluator = original

print('Offline trajectory:', offline_report['trajectory']['label'])
print('Offline first→last delta:', offline_report['trend_delta_first_to_last'])
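
The manual save-and-restore above can also be written with `unittest.mock.patch.object` from the standard library, which puts the original attribute back when the `with` block exits, even if an error is raised. The sketch below demonstrates the pattern on a stand-in namespace so that it runs without the repository installed. `fake_analyzer`, `RealEvaluator`, and `StubEvaluator` are hypothetical names; in the notebook the target would be `analyzer` and the attribute `'GeminiEvaluator'`.

```python
from types import SimpleNamespace
from unittest import mock


class RealEvaluator:
    """Stand-in for an evaluator that would call a remote model."""
    name = "real"


class StubEvaluator:
    """Stand-in for a deterministic offline evaluator."""
    name = "stub"


# Hypothetical stand-in for the `analyzer` module.
fake_analyzer = SimpleNamespace(GeminiEvaluator=RealEvaluator)

with mock.patch.object(fake_analyzer, "GeminiEvaluator", StubEvaluator):
    # Inside the block, lookups on the namespace resolve to the stub.
    patched_name = fake_analyzer.GeminiEvaluator.name

# After the block, the original attribute is back in place.
restored_name = fake_analyzer.GeminiEvaluator.name
print(patched_name, restored_name)
```

Either form only takes effect if `analyze_conversations` looks up `GeminiEvaluator` on the module at call time, which the cell above already assumes.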


In [None]:
import json
print(json.dumps(offline_report['trajectory'], indent=2))
!cat output_offline/report.md
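
To sanity-check the offline scores in the report, the arithmetic of the heuristic can be reproduced on a single message. The sketch below copies the keyword lists, weights, and the 0.4 labeling threshold from the offline evaluator in this notebook. The bucket names, the helper names `score_breakdown` and `label_trajectory`, and the two sample sentences are made up for illustration.

```python
BASE_SCORE = 4.5
# (bucket name, keywords, bonus); keywords and bonuses copied from the offline evaluator.
BUCKETS = [
    ("empathy", ["sorry", "thanks", "appreciate"], 0.8),
    ("investigation", ["i checked", "i found", "reviewed", "logs"], 1.0),
    ("action", ["done", "submitted", "confirmed", "generated", "scheduled"], 1.2),
    ("follow-through", ["timeline", "eta", "business days", "reference", "#"], 0.9),
    ("options", ["contingency", "fallback", "options", "trade-off"], 0.6),
]


def score_breakdown(text):
    """Return (score, matched bucket names) for one block of agent text."""
    text_l = text.lower()
    matched = [name for name, keywords, _ in BUCKETS if any(k in text_l for k in keywords)]
    bonus = sum(b for name, _, b in BUCKETS if name in matched)
    # Each bucket counts at most once, and the total is capped at 10.
    return round(min(10.0, BASE_SCORE + bonus), 2), matched


def label_trajectory(first, last, threshold=0.4):
    """Label the change from the first to the last score."""
    delta = last - first
    if delta > threshold:
        return "improving"
    if delta < -threshold:
        return "declining"
    return "flat"


early = "Please wait while we look into it."
late = "Sorry for the delay. I checked the logs and submitted your refund; the ETA is 3 business days."
early_score, early_hits = score_breakdown(early)
late_score, late_hits = score_breakdown(late)
print(early_score, early_hits)
print(late_score, late_hits)
print(label_trajectory(early_score, late_score))
```

Because matching is by substring, a keyword such as `eta` also matches inside words like "details", so the scores are only a rough signal.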


## 4) Optional live Gemini run (recommended)

This runs the actual model-based evaluation on the same dataset. It requires a Gemini API key and network access.

In [None]:
import os
from getpass import getpass

if not os.getenv('GEMINI_API_KEY'):
    os.environ['GEMINI_API_KEY'] = getpass('Enter GEMINI_API_KEY: ')


In [None]:
!python -m tars_analyzer.cli examples/customer_support_progression.jsonl --out output_gemini --model gemini-2.0-flash


In [None]:
import json
from pathlib import Path

gemini_report = json.loads(Path('output_gemini/report.json').read_text())
print('Gemini trajectory:', gemini_report['trajectory']['label'])
print('Gemini first→last delta:', gemini_report['trend_delta_first_to_last'])
print('Gemini quality scores:', gemini_report['overall_agent_quality_scores'])

!cat output_gemini/report.md


## 5) Visualize progression

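This chart plots the per-conversation quality scores from the live Gemini report, so run section 4 first; the cell fails with a `NameError` if `gemini_report` has not been loaded.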

In [None]:
import matplotlib.pyplot as plt

scores = gemini_report['overall_agent_quality_scores']
labels = [a['conversation_id'] for a in gemini_report['analyses']]

plt.figure(figsize=(10, 4))
plt.plot(labels, scores, marker='o')
plt.xticks(rotation=30, ha='right')
plt.ylabel('Overall Agent Quality (0-10)')
plt.xlabel('Conversation ID (time ordered)')
plt.title(f"Trajectory: {gemini_report['trajectory']['label']} | Δ first→last = {gemini_report['trend_delta_first_to_last']:+.2f}")
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
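
The plotting cell reads `gemini_report`, which exists only after the optional live run. One possible fallback, sketched below, picks whichever report is available in the notebook namespace. It assumes the in-memory `offline_report` carries the same `overall_agent_quality_scores` and `analyses` keys as `report.json`, which this notebook does not confirm. `pick_report` is a hypothetical helper and the demo data is made up.

```python
def pick_report(namespace, preferred=("gemini_report", "offline_report")):
    """Return (name, report) for the first report variable found in the namespace."""
    for name in preferred:
        report = namespace.get(name)
        if report is not None:
            return name, report
    raise NameError("No report found; run section 3 or section 4 first.")


# Stand-in namespace with made-up values; in the notebook, pass globals() instead.
demo_namespace = {
    "offline_report": {
        "overall_agent_quality_scores": [5.3, 6.3, 8.4],
        "analyses": [
            {"conversation_id": "a"},
            {"conversation_id": "b"},
            {"conversation_id": "c"},
        ],
    }
}
source_name, report = pick_report(demo_namespace)
scores = report["overall_agent_quality_scores"]
labels = [a["conversation_id"] for a in report["analyses"]]
print(source_name, list(zip(labels, scores)))
```

With this in place, the plotting code could read from `report` instead of `gemini_report`, and the chart title could include `source_name` to show which run produced the scores.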
