# Data Preparation: TechGadgets Support Bot

This notebook prepares the Bitext customer support dataset for fine-tuning by:
1. Loading and exploring the dataset
2. Adapting responses to TechGadgets branding and policies
3. Creating balanced training (400) and validation (100) sets
4. Generating OpenAI JSONL format files

## Step 1: Load Dataset

In [9]:
from datasets import load_dataset
import pandas as pd
import json
import os
from collections import Counter
import random

# Set random seed for reproducibility
random.seed(42)

# Load dataset
print("Loading Bitext customer support dataset...")
dataset = load_dataset("bitext/Bitext-customer-support-llm-chatbot-training-dataset")
df = pd.DataFrame(dataset["train"])

print(f"\nTotal examples: {len(df)}")
print(f"Columns: {list(df.columns)}")
print(f"\nFirst few rows:")
df.head()

Loading Bitext customer support dataset...

Total examples: 26872
Columns: ['flags', 'instruction', 'category', 'intent', 'response']

First few rows:


Unnamed: 0,flags,instruction,category,intent,response
0,B,question about cancelling order {{Order Number}},ORDER,cancel_order,I've understood you have a question regarding ...
1,BQZ,i have a question about cancelling oorder {{Or...,ORDER,cancel_order,I've been informed that you have a question ab...
2,BLQZ,i need help cancelling puchase {{Order Number}},ORDER,cancel_order,I can sense that you're seeking assistance wit...
3,BL,I need to cancel purchase {{Order Number}},ORDER,cancel_order,I understood that you need assistance with can...
4,BCELN,"I cannot afford this order, cancel purchase {{...",ORDER,cancel_order,I'm sensitive to the fact that you're facing f...


## Step 2: Explore Dataset Structure

In [17]:
# Category distribution
print("Category Distribution:")
category_counts = df["category"].value_counts()
print(f"Total categories: {df['category'].nunique()}")
print(category_counts)
print("\n" + "="*50 + "\n")

# Intent distribution
print("Intent Distribution:")
intent_counts = df["intent"].value_counts()
print(f"Total intents: {df['intent'].nunique()}")
print(intent_counts.head(15))
print("\n" + "="*50 + "\n")

# Sample instructions
print("Sample Instructions:")
for i in range(5):
    print(f"{i+1}. {df['instruction'].iloc[i]}")

Category Distribution:
Total categories: 11
category
ACCOUNT         5986
ORDER           3988
REFUND          2992
INVOICE         1999
CONTACT         1999
PAYMENT         1998
FEEDBACK        1997
DELIVERY        1994
SHIPPING        1970
SUBSCRIPTION     999
CANCEL           950
Name: count, dtype: int64


Intent Distribution:
Total intents: 27
intent
check_invoice               1000
complaint                   1000
contact_customer_service    1000
edit_account                1000
switch_account              1000
check_payment_methods        999
contact_human_agent          999
delivery_period              999
get_invoice                  999
newsletter_subscription      999
payment_issue                999
registration_problems        999
cancel_order                 998
place_order                  998
track_refund                 998
Name: count, dtype: int64


Sample Instructions:
1. question about cancelling order {{Order Number}}
2. i have a question about cancelling oorder {

### Note: NLP-Based Brand Injection

This notebook uses **TextBlob** (a free, open-source NLP library) for intelligent brand injection. Instead of manually handling hundreds of edge cases, TextBlob automatically:

- **Detects sentence boundaries** intelligently (handles periods, exclamations, questions)
- **Works for any text pattern** - no need to hardcode specific phrases
- **Handles edge cases** automatically (abbreviations, numbers, etc.)

**Installation**: TextBlob is included in `requirements.txt`. If you encounter issues, install manually:
```bash
pip install textblob
```

The code includes a **fallback method** using regex if TextBlob is not available, so it will work either way.

## Step 3: TechGadgets Adaptation Functions

In [11]:
# Import NLP library for intelligent text processing
# TextBlob uses NLTK for sentence segmentation - handles any text pattern intelligently
try:
    from textblob import TextBlob
    NLP_AVAILABLE = True
    # Note: TextBlob may download NLTK data on first use (automatic)
except ImportError:
    print("Warning: TextBlob not available. Install with: pip install textblob")
    print("This will use a simpler fallback method for brand injection.")
    NLP_AVAILABLE = False
    import re

# TechGadgets Company Information
TECHGADGETS_POLICIES = {
    "return": "We offer a 30-day money-back guarantee on all purchases.",
    "shipping_standard": "Standard shipping takes 3-5 business days.",
    "shipping_express": "Express 2-day shipping is available for $9.99.",
    "support_hours": "Our customer support is available 24/7 via chat, and Mon-Fri 9AM-6PM via phone.",
    "warranty": "All products come with a 1-year manufacturer warranty.",
    "price_match": "We match competitor prices to ensure you get the best deal."
}

# Comprehensive placeholder replacements
PLACEHOLDER_MAP = {
    # Order and tracking related
    "{{Order Number}}": "your order number",
    "{{Order Status}}": "your order status",
    "{{Order Tracking}}": "your order tracking information",
    "{{Tracking Number}}": "your tracking number",
    
    # Account and profile related
    "{{Account Type}}": "your account type",
    "{{Account Category}}": "your account category",
    "{{Account Change}}": "your account changes",
    "{{Account Recovery Page URL}}": "www.techgadgets.com/account/recover",
    "{{Profile}}": "your profile",
    "{{Profile Type}}": "your profile type",
    "{{Profile Settings}}": "your profile settings",
    "{{Settings}}": "your account settings",
    
    # Support and contact related
    "{{Customer Support Hours}}": "24/7 via chat, Mon-Fri 9AM-6PM via phone",
    "{{Customer Support Phone Number}}": "1-800-TECH-HELP",
    "{{Customer Support Email}}": "support@techgadgets.com",
    "{{Toll-Free Number}}": "1-800-TECH-HELP",
    
    # Website and portal related
    "{{Online Company Portal Info}}": "TechGadgets Account Portal (www.techgadgets.com/account)",
    "{{Website URL}}": "www.techgadgets.com",
    "{{Online Order Interaction}}": "My Orders section",
    "{{Login Page URL}}": "www.techgadgets.com/login",
    
    # Policy related
    "{{Cancellation Policy}}": "30-day money-back guarantee",
    "{{Refund Policy}}": "30-day money-back guarantee",
    "{{Return Policy}}": "30-day money-back guarantee",
    
    # Invoice and payment related
    "{{Invoice Number}}": "your invoice number",
    "{{Refund Amount}}": "the refund amount",
    "{{Money Amount}}": "the amount",
    "{{Currency Symbol}}": "$",
    
    # Security and access related
    "{{Forgot Password}}": "password recovery page",
    "{{Forgot PIN}}": "PIN recovery page",
    "{{Forgot Key}}": "access key recovery page",
    "{{Forgot Access Key}}": "access key recovery page",
    "{{Recover Key}}": "access key recovery page",
    "{{Restore Access Key}}": "access key recovery page",
    "{{Reset PIN}}": "PIN reset page",
    "{{Manage PIN}}": "PIN management page",
    "{{PIN Code}}": "your PIN code",
    "{{Security}}": "security settings",
    "{{Privacy}}": "privacy settings",
    
    # Location and delivery related
    "{{Delivery City}}": "your delivery city",
    "{{Delivery Country}}": "your delivery country",
    "{{Store Location}}": "our store locations",
    
    # Personal information (generic replacements)
    "{{Person Name}}": "your name",
    "{{Client Last Name}}": "your last name",
    "{{Salutation}}": "",
    
    # Time and date related
    "{{Date Range}}": "the specified date range",
    "{{Timeframe}}": "the specified timeframe",
    
    # Account upgrade
    "{{Upgrade Account}}": "account upgrade options"
}

def replace_placeholders(text):
    """Replace dataset placeholders with TechGadgets-specific values"""
    for placeholder, replacement in PLACEHOLDER_MAP.items():
        text = text.replace(placeholder, replacement)
    return text

def inject_brand(response):
    """
    Ensure TechGadgets is mentioned in the response using NLP for intelligent insertion.
    Uses TextBlob for sentence segmentation to find natural insertion points.
    """
    response_lower = response.lower()
    
    # Check if TechGadgets is already mentioned
    if "techgadgets" in response_lower:
        return response
    
    response_stripped = response.strip()
    if not response_stripped:
        return "At TechGadgets, " + response
    
    # Use NLP-based approach if available
    if NLP_AVAILABLE:
        try:
            blob = TextBlob(response_stripped)
            sentences = blob.sentences
            
            # Strategy 1: Insert after the first sentence (most natural)
            if len(sentences) > 0:
                first_sentence = str(sentences[0])
                # Find where the first sentence ends in the original response
                first_sentence_end = response_stripped.find(first_sentence) + len(first_sentence)
                
                # Clean up any trailing whitespace
                while first_sentence_end < len(response_stripped) and response_stripped[first_sentence_end] in ' \t':
                    first_sentence_end += 1
                
                # Insert "At TechGadgets," after the first sentence
                if first_sentence_end > 0 and first_sentence_end < len(response_stripped):
                    # Check if there's already punctuation, add comma if needed
                    if response_stripped[first_sentence_end - 1] not in '.!?':
                        response = response_stripped[:first_sentence_end] + ". At TechGadgets," + response_stripped[first_sentence_end:]
                    else:
                        response = response_stripped[:first_sentence_end] + " At TechGadgets," + response_stripped[first_sentence_end:]
                else:
                    # Fallback: prepend
                    response = "At TechGadgets, " + response_stripped
            else:
                # No sentences detected, prepend
                response = "At TechGadgets, " + response_stripped
        except Exception as e:
            # If NLP fails, fall back to regex-based sentence detection
            import re
            sentence_end = re.search(r'[.!?]\s+', response_stripped)
            if sentence_end:
                insertion_point = sentence_end.end()
                response = response_stripped[:insertion_point] + "At TechGadgets, " + response_stripped[insertion_point:]
            else:
                if response_stripped[0].isupper():
                    response = "At TechGadgets, " + response_stripped[0].lower() + response_stripped[1:]
                else:
                    response = "At TechGadgets, " + response_stripped
    else:
        # Fallback: Simple regex-based sentence detection
        import re
        # Find first sentence boundary (., !, or ? followed by space or end)
        sentence_end = re.search(r'[.!?]\s+', response_stripped)
        if sentence_end:
            insertion_point = sentence_end.end()
            response = response_stripped[:insertion_point] + "At TechGadgets, " + response_stripped[insertion_point:]
        else:
            # No sentence boundary found, prepend
            if response_stripped[0].isupper():
                response = "At TechGadgets, " + response_stripped[0].lower() + response_stripped[1:]
            else:
                response = "At TechGadgets, " + response_stripped
    
    return response

def adapt_policy_info(instruction, response, intent=None):
    """Inject TechGadgets-specific policies based on instruction keywords"""
    instruction_lower = instruction.lower()
    response_lower = response.lower()
    
    # Keywords mapping to policies
    policy_additions = []
    
    # Return/Refund policy
    if any(kw in instruction_lower for kw in ["return", "refund", "money back", "send back", "reimburse"]):
        if "30-day" not in response_lower and "money-back" not in response_lower:
            policy_additions.append(TECHGADGETS_POLICIES["return"])
    
    # Shipping policy
    if any(kw in instruction_lower for kw in ["shipping", "delivery", "ship", "arrive", "when will", "how long"]):
        if "3-5 business days" not in response_lower and "standard shipping" not in response_lower:
            policy_additions.append(TECHGADGETS_POLICIES["shipping_standard"])
        if "express" in instruction_lower or "2-day" in instruction_lower:
            if "$9.99" not in response_lower and "express" not in response_lower:
                policy_additions.append(TECHGADGETS_POLICIES["shipping_express"])
    
    # Support hours
    if any(kw in instruction_lower for kw in ["support", "help", "contact", "customer service", "phone", "chat"]):
        if "24/7" not in response_lower and "9AM-6PM" not in response_lower:
            policy_additions.append(TECHGADGETS_POLICIES["support_hours"])
    
    # Warranty
    if any(kw in instruction_lower for kw in ["warranty", "guarantee", "coverage"]):
        if "1-year" not in response_lower and "manufacturer warranty" not in response_lower:
            policy_additions.append(TECHGADGETS_POLICIES["warranty"])
    
    # Price match
    if any(kw in instruction_lower for kw in ["price", "cost", "match", "cheaper", "competitor"]):
        if "price match" not in response_lower and "competitor" not in response_lower:
            policy_additions.append(TECHGADGETS_POLICIES["price_match"])
    
    # Append policies if not already present
    if policy_additions:
        response += " " + " ".join(policy_additions)
    
    return response

def ensure_quality(response):
    """Final quality checks and cleanup"""
    import re
    
    # Remove any remaining placeholders (catch-all for any missed ones)
    remaining_placeholders = re.findall(r'\{\{[^}]+\}\}', response)
    if remaining_placeholders:
        # Replace with generic text
        for placeholder in remaining_placeholders:
            # Extract the key part of the placeholder
            key = placeholder.strip('{}').replace(' ', '_').lower()
            response = response.replace(placeholder, f"the {key}")
    
    # Ensure TechGadgets is mentioned (final check)
    if "techgadgets" not in response.lower():
        response = "At TechGadgets, " + response
    
    # Clean up spacing
    response = " ".join(response.split())
    
    # Remove any double periods or commas
    response = response.replace("..", ".")
    response = response.replace(",,", ",")
    
    return response

def adapt_to_techgadgets(instruction, response, intent=None):
    """Complete adaptation pipeline for a single example"""
    # Step 1: Replace placeholders
    adapted_response = replace_placeholders(response)
    
    # Step 2: Inject brand
    adapted_response = inject_brand(adapted_response)
    
    # Step 3: Add policy information
    adapted_response = adapt_policy_info(instruction, adapted_response, intent)
    
    # Step 4: Final quality checks
    adapted_response = ensure_quality(adapted_response)
    
    return adapted_response

## Step 4: Test Adaptation Function

In [12]:
# Test the adaptation function on a few examples
print("Testing adaptation function:\n")
for i in range(3):
    original = df.iloc[i]
    adapted = adapt_to_techgadgets(
        original["instruction"],
        original["response"],
        original["intent"]
    )
    
    print(f"\nExample {i+1}:")
    print(f"Instruction: {original['instruction']}")
    print(f"\nOriginal Response (first 200 chars): {original['response'][:200]}...")
    print(f"\nAdapted Response (first 300 chars): {adapted[:300]}...")
    print(f"\nTechGadgets mentioned: {'TechGadgets' in adapted}")
    print("-" * 80)

Testing adaptation function:


Example 1:
Instruction: question about cancelling order {{Order Number}}

Original Response (first 200 chars): I've understood you have a question regarding canceling order {{Order Number}}, and I'm here to provide you with the information you need. Please go ahead and ask your question, and I'll do my best to...

Adapted Response (first 300 chars): I've understood you have a question regarding canceling order your order number, and I'm here to provide you with the information you need. At TechGadgets, Please go ahead and ask your question, and I'll do my best to assist you....

TechGadgets mentioned: True
--------------------------------------------------------------------------------

Example 2:
Instruction: i have a question about cancelling oorder {{Order Number}}

Original Response (first 200 chars): I've been informed that you have a question about canceling order {{Order Number}}. I'm here to assist you! Please go ahead and let me know what specifi

## Step 5: Create Balanced Training and Validation Sets

In [13]:
def select_diverse_samples(df, n_train=400, n_val=100, seed=42):
    """Select diverse samples ensuring category and intent diversity"""
    random.seed(seed)
    
    # Shuffle the dataframe
    df_shuffled = df.sample(frac=1, random_state=seed).reset_index(drop=True)
    
    # Calculate proportional distribution based on category
    category_counts = df_shuffled["category"].value_counts()
    total_examples = n_train + n_val
    
    # Select training samples (proportional to category distribution)
    train_indices = []
    val_indices = []
    
    # Track selected indices to avoid duplicates
    selected_indices = set()
    
    # First, select training samples with category diversity
    categories = df_shuffled["category"].unique()
    samples_per_category_train = n_train // len(categories)
    
    for category in categories:
        category_df = df_shuffled[df_shuffled["category"] == category]
        category_indices = category_df.index.tolist()
        
        # Sample from this category
        available = [idx for idx in category_indices if idx not in selected_indices]
        if len(available) >= samples_per_category_train:
            sampled = random.sample(available, samples_per_category_train)
            train_indices.extend(sampled)
            selected_indices.update(sampled)
    
    # Fill remaining training slots randomly
    remaining_train = n_train - len(train_indices)
    available = [idx for idx in df_shuffled.index if idx not in selected_indices]
    if remaining_train > 0 and len(available) >= remaining_train:
        sampled = random.sample(available, remaining_train)
        train_indices.extend(sampled)
        selected_indices.update(sampled)
    
    # Select validation samples (similar approach)
    samples_per_category_val = n_val // len(categories)
    
    for category in categories:
        category_df = df_shuffled[df_shuffled["category"] == category]
        category_indices = category_df.index.tolist()
        
        available = [idx for idx in category_indices if idx not in selected_indices]
        if len(available) >= samples_per_category_val:
            sampled = random.sample(available, samples_per_category_val)
            val_indices.extend(sampled)
            selected_indices.update(sampled)
    
    # Fill remaining validation slots
    remaining_val = n_val - len(val_indices)
    available = [idx for idx in df_shuffled.index if idx not in selected_indices]
    if remaining_val > 0 and len(available) >= remaining_val:
        sampled = random.sample(available, remaining_val)
        val_indices.extend(sampled)
        selected_indices.update(sampled)
    
    train_df = df_shuffled.loc[train_indices].copy()
    val_df = df_shuffled.loc[val_indices].copy()
    
    return train_df, val_df

# Select samples
print("Selecting diverse training and validation samples...")
train_df, val_df = select_diverse_samples(df, n_train=400, n_val=100, seed=42)

print(f"\nTraining set: {len(train_df)} examples")
print(f"Validation set: {len(val_df)} examples")
print(f"\nTraining category distribution:")
print(train_df["category"].value_counts())
print(f"\nValidation category distribution:")
print(val_df["category"].value_counts())

Selecting diverse training and validation samples...

Training set: 400 examples
Validation set: 100 examples

Training category distribution:
category
PAYMENT         37
ACCOUNT         37
REFUND          37
ORDER           37
CONTACT         36
INVOICE         36
SUBSCRIPTION    36
CANCEL          36
FEEDBACK        36
DELIVERY        36
SHIPPING        36
Name: count, dtype: int64

Validation category distribution:
category
CONTACT         10
INVOICE          9
PAYMENT          9
SUBSCRIPTION     9
CANCEL           9
ACCOUNT          9
REFUND           9
FEEDBACK         9
DELIVERY         9
SHIPPING         9
ORDER            9
Name: count, dtype: int64


In [14]:
def write_jsonl(df_subset, output_path):
    """Write dataframe to OpenAI JSONL format"""
    os.makedirs(os.path.dirname(output_path), exist_ok=True)
    
    with open(output_path, "w", encoding="utf-8") as f:
        for idx, row in df_subset.iterrows():
            adapted_response = adapt_to_techgadgets(
                row["instruction"],
                row["response"],
                row["intent"]
            )
            # Safety net: ensure every written response mentions TechGadgets
            if "techgadgets" not in adapted_response.lower():
                adapted_response = "At TechGadgets, " + adapted_response.lstrip()
            
            record = {
                "messages": [
                    {
                        "role": "system",
                        "content": (
                            "You are a helpful customer support assistant for "
                            "TechGadgets, an online electronics store. "
                            "Always be friendly and professional. "
                            "Always mention TechGadgets in your responses. "
                            "Use company policies: 30-day money-back guarantee, "
                            "standard shipping 3-5 business days, express 2-day shipping for $9.99, "
                            "24/7 chat support, Mon-Fri 9AM-6PM phone support, "
                            "1-year manufacturer warranty, and price matching."
                        )
                    },
                    {
                        "role": "user",
                        "content": row["instruction"]
                    },
                    {
                        "role": "assistant",
                        "content": adapted_response
                    }
                ]
            }
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
    
    # Verify all examples have TechGadgets
    missing_count = 0
    with open(output_path, "r", encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            if 'TechGadgets' not in record['messages'][2]['content'] and 'techgadgets' not in record['messages'][2]['content'].lower():
                missing_count += 1
    
    print(f"Created {output_path} with {len(df_subset)} examples")
    if missing_count > 0:
        print(f"⚠️  WARNING: {missing_count} examples are missing TechGadgets!")
        print("   This means the function definitions need to be re-run.")
        print("   SOLUTION: Restart kernel and re-run Step 3 (function definitions) before Step 6.")
    else:
        print(f"✅ Verified: All {len(df_subset)} examples mention TechGadgets!")

# Create data directory
os.makedirs("../data", exist_ok=True)

# Generate training JSONL
print("Generating training_data.jsonl...")
write_jsonl(train_df, "../data/training_data.jsonl")

# Generate validation JSONL
print("\nGenerating validation_data.jsonl...")
write_jsonl(val_df, "../data/validation_data.jsonl")

print("\n✅ Data preparation complete!")

Generating training_data.jsonl...
Created ../data/training_data.jsonl with 400 examples
✅ Verified: All 400 examples mention TechGadgets!

Generating validation_data.jsonl...
Created ../data/validation_data.jsonl with 100 examples
✅ Verified: All 100 examples mention TechGadgets!

✅ Data preparation complete!


## Step 7: Validate JSONL Files

In [15]:
def validate_jsonl(file_path):
    """Validate JSONL file format"""
    errors = []
    line_count = 0
    
    with open(file_path, "r", encoding="utf-8") as f:
        for line_num, line in enumerate(f, 1):
            line_count += 1
            try:
                obj = json.loads(line.strip())
                
                # Check for messages key
                if "messages" not in obj:
                    errors.append(f"Line {line_num}: Missing 'messages' key")
                    continue
                
                # Validate message structure
                messages = obj["messages"]
                if not isinstance(messages, list):
                    errors.append(f"Line {line_num}: 'messages' must be a list")
                    continue
                
                # Check for required roles
                roles = [msg.get("role") for msg in messages]
                if "system" not in roles:
                    errors.append(f"Line {line_num}: Missing 'system' role")
                if "user" not in roles:
                    errors.append(f"Line {line_num}: Missing 'user' role")
                if "assistant" not in roles:
                    errors.append(f"Line {line_num}: Missing 'assistant' role")
                
                # Validate each message
                for msg in messages:
                    if "role" not in msg:
                        errors.append(f"Line {line_num}: Message missing 'role'")
                    elif msg["role"] not in ["system", "user", "assistant"]:
                        errors.append(f"Line {line_num}: Invalid role '{msg['role']}'")
                    
                    if "content" not in msg:
                        errors.append(f"Line {line_num}: Message missing 'content'")
                    elif not isinstance(msg["content"], str):
                        errors.append(f"Line {line_num}: 'content' must be a string")
                    elif len(msg["content"].strip()) == 0:
                        errors.append(f"Line {line_num}: 'content' is empty")
                        
            except json.JSONDecodeError as e:
                errors.append(f"Line {line_num}: Invalid JSON - {str(e)}")
    
    if errors:
        print(f"❌ Validation failed with {len(errors)} errors:")
        for error in errors[:10]:  # Show first 10 errors
            print(f"  {error}")
        if len(errors) > 10:
            print(f"  ... and {len(errors) - 10} more errors")
        return False
    else:
        print(f"✅ {file_path} is valid! ({line_count} examples)")
        return True

# Validate both files
print("Validating training_data.jsonl...")
validate_jsonl("../data/training_data.jsonl")

print("\nValidating validation_data.jsonl...")
validate_jsonl("../data/validation_data.jsonl")

Validating training_data.jsonl...
✅ ../data/training_data.jsonl is valid! (400 examples)

Validating validation_data.jsonl...
✅ ../data/validation_data.jsonl is valid! (100 examples)


True

## Step 8: Quality Check - Sample Adapted Responses

In [16]:
# Load and display a few examples from the generated JSONL
print("Sample Training Examples:\n")
print("⚠️  NOTE: If you see 'TechGadgets mentioned: False' below, the data files need to be regenerated.")
print("   Re-run the 'write_jsonl' cell (Step 6) to apply the updated NLP-based brand injection.\n")

with open("../data/training_data.jsonl", "r", encoding="utf-8") as f:
    for i, line in enumerate(f):
        if i >= 3:  # Show first 3 examples
            break
        
        record = json.loads(line)
        has_brand = 'TechGadgets' in record['messages'][2]['content']
        
        print(f"\nExample {i+1}:")
        print(f"User: {record['messages'][1]['content']}")
        print(f"\nAssistant (first 200 chars): {record['messages'][2]['content'][:200]}...")
        print(f"\nTechGadgets mentioned: {has_brand}")
        print("-" * 80)

# Quick check of all examples
print("\n" + "=" * 80)
print("Quality Check Summary:")
with open("../data/training_data.jsonl", "r", encoding="utf-8") as f:
    total = 0
    missing = 0
    for line in f:
        total += 1
        record = json.loads(line)
        if 'TechGadgets' not in record['messages'][2]['content']:
            missing += 1

print(f"Total examples checked: {total}")
print(f"Missing TechGadgets mention: {missing}")
if missing > 0:
    print(f"\n⚠️  WARNING: {missing} examples are missing TechGadgets mention!")
    print("   This means the data was generated with old code.")
    print("   SOLUTION: Re-run Step 6 (write_jsonl cell) to regenerate with updated NLP-based code.")
    print("   The updated code uses TextBlob NLP library to handle ALL text patterns automatically.")
else:
    print("✅ All examples mention TechGadgets!")
print("=" * 80)

Sample Training Examples:

⚠️  NOTE: If you see 'TechGadgets mentioned: False' below, the data files need to be regenerated.
   Re-run the 'write_jsonl' cell (Step 6) to apply the updated NLP-based brand injection.


Example 1:
User: what do I have to do to talk with cusromer support?

Assistant (first 200 chars): Your message means a lot! I'm aligned with the idea that you need help on how to connect with our customer support team. To talk with our customer support, you can reach out to them through various ch...

TechGadgets mentioned: False
--------------------------------------------------------------------------------

Example 2:
User: would it be possible to speak to someone?

Assistant (first 200 chars): Your reach-out is appreciated! At TechGadgets, I'm sensing that you would like to speak to someone and need assistance. Our team is here to help you. Please allow me a moment to connect you with one o...

TechGadgets mentioned: True
----------------------------------------------