# MELD Instruction Formatting for Dating Simulator

This notebook transforms cleaned MELD romantic dialogue pairs into instruction fine-tuning format with Friends character personas.

**Input:** `data/processed/MELD/meld_romantic_cleaned.csv`

**Output:** `data/processed/MELD/meld_dating_sim_instruct.csv`

**Processing:**
1. Load cleaned romantic dialogue pairs
2. Define Friends character personas
3. Generate scenario descriptions
4. Format with LLaMA 3.1 instruction template using `apply_chat_template`
5. Create training-ready dataset

**Format:** LLaMA 3.1 chat template with system prompts, persona definitions, and emotion conditioning.

In [1]:
%pip install pandas


In [2]:
from pathlib import Path
import pandas as pd
import random

# Set paths
input_file = Path("../../data/processed/MELD/meld_romantic_cleaned.csv")
output_dir = Path("../../data/processed/MELD")
output_file = output_dir / "meld_dating_sim_instruct.csv"

print(f"Input file: {input_file}")
print(f"Output file: {output_file}")

Input file: ..\..\data\processed\MELD\meld_romantic_cleaned.csv
Output file: ..\..\data\processed\MELD\meld_dating_sim_instruct.csv


In [3]:
%pip install transformers


In [4]:
from transformers import AutoTokenizer

# Load LLaMA 3.1 tokenizer for chat template formatting
print("Loading LLaMA 3.1 tokenizer...")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
print(f"Loaded tokenizer: {tokenizer.__class__.__name__}")
print(f"Special tokens: {tokenizer.special_tokens_map}")


Loading LLaMA 3.1 tokenizer...
Loaded tokenizer: PreTrainedTokenizerFast
Special tokens: {'bos_token': '<|begin_of_text|>', 'eos_token': '<|eot_id|>'}


## 1. Load Cleaned Data

In [5]:
# Load the cleaned romantic dialogue pairs
print("Loading cleaned data...")
df = pd.read_csv(input_file)

print(f"Loaded {len(df)} dialogue pairs")
print(f"\nColumns: {list(df.columns)}")
print(f"\nDataset shape: {df.shape}")

# Display sample
df.head()

Loading cleaned data...
Loaded 1529 dialogue pairs

Columns: ['character', 'emotion', 'context', 'response', 'dialogue_id', 'season', 'episode']

Dataset shape: (1529, 7)


|   | character | emotion | context | response | dialogue_id | season | episode |
|---|---|---|---|---|---|---|---|
| 0 | Ross | surprise | `<Chandler> Hello?` | What happened? | 134 | 4 | 9 |
| 1 | Chandler | sadness | `<Chandler> Hello? <Ross> What happened?` | He’s not gonna make it, he’s stuck in Chicago. | 134 | 4 | 9 |
| 2 | Ross | joy | `<Chandler> Hello? <Ross> What happened? <Chand...` | Ohh, man! Chicago, is sooo lucky! | 134 | 4 | 9 |
| 3 | Chandler | anger | `<Chandler> Hello? <Ross> What happened? <Chand...` | Stupid, useless Canadian money! | 134 | 4 | 9 |
| 4 | Ross | neutral | `<Joey> I don’t get it!` | Hey! So uh, was he excited about the tickets? | 200 | 6 | 23 |


In [6]:
# Analyze character distribution
print("="*80)
print("Character Distribution")
print("="*80)
character_counts = df['character'].value_counts()
for character, count in character_counts.items():
    percentage = (count / len(df)) * 100
    print(f"{character:15s}: {count:5d} pairs ({percentage:5.2f}%)")

print("\n" + "="*80)
print("Emotion Distribution")
print("="*80)
emotion_counts = df['emotion'].value_counts()
for emotion, count in emotion_counts.items():
    percentage = (count / len(df)) * 100
    print(f"{emotion:15s}: {count:5d} pairs ({percentage:5.2f}%)")

Character Distribution
Joey           :   308 pairs (20.14%)
Rachel         :   280 pairs (18.31%)
Monica         :   261 pairs (17.07%)
Chandler       :   241 pairs (15.76%)
Phoebe         :   231 pairs (15.11%)
Ross           :   199 pairs (13.02%)
Mike           :     7 pairs ( 0.46%)
David          :     2 pairs ( 0.13%)

Emotion Distribution
neutral        :   676 pairs (44.21%)
joy            :   248 pairs (16.22%)
anger          :   202 pairs (13.21%)
surprise       :   193 pairs (12.62%)
sadness        :   141 pairs ( 9.22%)
fear           :    38 pairs ( 2.49%)
disgust        :    31 pairs ( 2.03%)


## 2. Define Character Personas

Each Friends character has a distinct personality that should be captured in the system prompt.

In [7]:
# Define personality traits for each Friends character
CHARACTER_PERSONAS = {
    'Chandler': {
        'description': "You are Chandler Bing from Friends. You are witty, sarcastic, and use humor as a defense mechanism to hide your vulnerability. You make jokes even in serious moments, but you're actually very caring and romantic underneath. You're self-deprecating and sometimes awkward, but loyal and loving to those close to you.",
        'traits': ['witty', 'sarcastic', 'defensive humor', 'secretly romantic', 'self-deprecating']
    },
    'Monica': {
        'description': "You are Monica Geller from Friends. You are organized, competitive, and love to be in control. You're nurturing and care deeply about the people in your life. You want commitment and stability in relationships. You can be intense but you're passionate about everything you do, from cooking to loving your partner.",
        'traits': ['organized', 'competitive', 'nurturing', 'wants commitment', 'passionate']
    },
    'Ross': {
        'description': "You are Ross Geller from Friends. You are intellectual, nerdy, and passionate about paleontology and science. You tend to overthink things and can be awkward in romantic situations. You're a hopeless romantic who believes in destiny and true love, but you often struggle to express your feelings properly.",
        'traits': ['intellectual', 'nerdy', 'overthinks', 'romantic but awkward', 'passionate']
    },
    'Rachel': {
        'description': "You are Rachel Green from Friends. You are fashion-focused, fun, and flirty. You're independent and career-driven, having grown from a spoiled daddy's girl to a confident professional. You value friendship and are loyal to those you care about. You're charming and know how to make people feel special.",
        'traits': ['fashion-focused', 'fun', 'flirty', 'independent', 'confident']
    },
    'Joey': {
        'description': "You are Joey Tribbiani from Friends. You are confident, charming, and simple in the best way. You love food (especially pizza and sandwiches) and you're famous for your catchphrase 'How you doin'?' You're a loyal friend and while you may not be the smartest, you have a big heart and know how to make people feel good about themselves.",
        'traits': ['confident', 'charming', 'simple', 'food-loving', 'loyal']
    },
    'Phoebe': {
        'description': "You are Phoebe Buffay from Friends. You are quirky, spiritual, and unconventionally wise. You're honest to a fault and say what's on your mind. You have a mysterious past but maintain an optimistic outlook. You're free-spirited and bring unique perspectives to every situation. You believe in karma, auras, and following your heart.",
        'traits': ['quirky', 'spiritual', 'honest', 'unconventional', 'optimistic']
    }
}

# Display personas
print("="*80)
print("Friends Character Personas")
print("="*80)
for character, info in CHARACTER_PERSONAS.items():
    print(f"\n{character}:")
    print(f"  {info['description']}")
    print(f"  Key traits: {', '.join(info['traits'])}")

Friends Character Personas

Chandler:
  You are Chandler Bing from Friends. You are witty, sarcastic, and use humor as a defense mechanism to hide your vulnerability. You make jokes even in serious moments, but you're actually very caring and romantic underneath. You're self-deprecating and sometimes awkward, but loyal and loving to those close to you.
  Key traits: witty, sarcastic, defensive humor, secretly romantic, self-deprecating

Monica:
  You are Monica Geller from Friends. You are organized, competitive, and love to be in control. You're nurturing and care deeply about the people in your life. You want commitment and stability in relationships. You can be intense but you're passionate about everything you do, from cooking to loving your partner.
  Key traits: organized, competitive, nurturing, wants commitment, passionate

Ross:
  You are Ross Geller from Friends. You are intellectual, nerdy, and passionate about paleontology and science. You tend to overthink things and can b
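
Only the six main characters have a persona entry, while the character distribution printed in step 1 also lists Mike and David. A minimal sketch of a coverage check follows. The counts are copied from that output, and `CHARACTER_COUNTS`, `PERSONA_NAMES`, and `persona_for` are illustrative names that do not appear elsewhere in this notebook.

```python
# Character counts copied from the distribution printed in step 1.
CHARACTER_COUNTS = {
    "Joey": 308, "Rachel": 280, "Monica": 261, "Chandler": 241,
    "Phoebe": 231, "Ross": 199, "Mike": 7, "David": 2,
}

# Names that have an entry in CHARACTER_PERSONAS (descriptions omitted here).
PERSONA_NAMES = {"Chandler", "Monica", "Ross", "Rachel", "Joey", "Phoebe"}


def persona_for(character, personas):
    """Hypothetical helper mirroring the nested .get() fallback used when formatting."""
    return personas.get(character, {}).get(
        "description", f"You are {character} from Friends."
    )


missing = sorted(name for name in CHARACTER_COUNTS if name not in PERSONA_NAMES)
fallback_pairs = sum(CHARACTER_COUNTS[name] for name in missing)
total_pairs = sum(CHARACTER_COUNTS.values())

print(f"Characters without a persona: {missing}")
print(f"Pairs using the generic fallback: {fallback_pairs} of {total_pairs}")
print(persona_for("Mike", {}))
```

With so few affected pairs, the generic fallback is probably acceptable, but the check makes the gap visible if more guest characters are added later.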

## 3. Generate Scenario Descriptions

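Assign each dialogue a dating scenario. The scenario is drawn at random from a fixed list, seeded by `dialogue_id`, so it does not depend on the conversation content, and every pair from the same dialogue gets the same scenario.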

In [8]:
# Scenario templates for dating contexts
DATING_SCENARIOS = [
    "You're on a casual coffee date at Central Perk, the cozy coffee shop.",
    "You're having a romantic dinner date at a nice restaurant.",
    "You're taking a walk together and having a deep conversation.",
    "You're hanging out at your apartment, enjoying each other's company.",
    "You're on a fun date doing something adventurous together.",
    "You're having a heart-to-heart conversation about your relationship.",
    "You're flirting and getting to know each other better.",
    "You're spending a quiet evening together, just talking and connecting.",
]

def generate_scenario(row):
    """
    Pick a scenario description for the dialogue.

    The choice is random rather than derived from the conversation itself,
    but it is seeded with dialogue_id, so every pair from the same dialogue
    gets the same scenario on every run.
    """
    # A local Random instance keeps the global random state untouched
    rng = random.Random(int(row['dialogue_id']))
    return rng.choice(DATING_SCENARIOS)

# Preview the first five scenario templates
print("Sample scenarios:")
for scenario in DATING_SCENARIOS[:5]:
    print(f"  - {scenario}")

Sample scenarios:
  - You're on a casual coffee date at Central Perk, the cozy coffee shop.
  - You're having a romantic dinner date at a nice restaurant.
  - You're taking a walk together and having a deep conversation.
  - You're hanging out at your apartment, enjoying each other's company.
  - You're on a fun date doing something adventurous together.
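
The per-dialogue seeding can be tried in isolation. The sketch below is a standalone illustration, not the notebook's own cell: `SCENARIOS` is a shortened stand-in for `DATING_SCENARIOS`, `pick_scenario` is a hypothetical helper, and it uses a local `random.Random` instance so the global random state is left alone.

```python
import random

# Shortened stand-in for DATING_SCENARIOS; the real list has eight entries.
SCENARIOS = ["coffee date", "dinner date", "walk", "quiet evening"]


def pick_scenario(dialogue_id, scenarios):
    """Choose a scenario with an RNG seeded only by dialogue_id."""
    return random.Random(int(dialogue_id)).choice(scenarios)


# Two rows from the same dialogue agree, regardless of how the global
# random module has been used in between.
first = pick_scenario(134, SCENARIOS)
random.random()  # unrelated use of the global RNG
second = pick_scenario(134, SCENARIOS)
print(first == second)

# Different dialogues may or may not share a scenario.
assignments = {d: pick_scenario(d, SCENARIOS) for d in (134, 200, 201)}
print(assignments)
```

Because the seed is the only input, the assignment is reproducible across runs, but it carries no information about what the characters actually talk about.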


## 4. Format with LLaMA 3.1 Chat Template

Apply LLaMA 3.1 instruction template using `apply_chat_template` for proper formatting.

In [9]:
def format_instruction_prompt(row):
    """
    Format a dialogue pair using LLaMA 3.1 chat template.
    
    Uses tokenizer.apply_chat_template() to format properly with:
    - System prompt: persona + scenario + emotion
    - User message: conversation context
    - Assistant response: character's reply
    
    Returns fully formatted LLaMA 3.1 prompt string.
    """
    character = row['character']
    emotion = row['emotion']
    context = row['context']  # Now includes speaker tokens like "<Ross> Hello?"
    response = row['response']
    
    # Get persona description
    persona_desc = CHARACTER_PERSONAS.get(character, {}).get(
        'description', 
        f"You are {character} from Friends."
    )
    
    # Generate scenario
    scenario = generate_scenario(row)
    
    # Build system prompt
    system_content = f"""{persona_desc}

Scenario: {scenario}
The user seems to be feeling: {emotion}"""
    
    # Context already has speaker tokens, use it as the user message
    user_content = f"Conversation:\n{context}"
    
    # Build messages for chat template
    messages = [
        {"role": "system", "content": system_content},
        {"role": "user", "content": user_content},
        {"role": "assistant", "content": response}
    ]
    
    # Format with LLaMA 3.1 chat template
    formatted = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=False  # We have the response already
    )
    
    return formatted

# Test formatting with a sample
print("="*80)
print("Sample Formatted Instruction (LLaMA 3.1 Format)")
print("="*80)
sample_row = df.iloc[0]
sample_prompt = format_instruction_prompt(sample_row)
print(sample_prompt)
print("\n" + "="*80)

Sample Formatted Instruction (LLaMA 3.1 Format)
<|begin_of_text|><|start_header_id|>system<|end_header_id|>

Cutting Knowledge Date: December 2023
Today Date: 26 Jul 2024

You are Ross Geller from Friends. You are intellectual, nerdy, and passionate about paleontology and science. You tend to overthink things and can be awkward in romantic situations. You're a hopeless romantic who believes in destiny and true love, but you often struggle to express your feelings properly.

Scenario: You're spending a quiet evening together, just talking and connecting.
The user seems to be feeling: surprise<|eot_id|><|start_header_id|>user<|end_header_id|>

Conversation:
<Chandler> Hello?<|eot_id|><|start_header_id|>assistant<|end_header_id|>

What happened?<|eot_id|>



In [13]:
# Apply formatting to all rows
print("Formatting all dialogue pairs with LLaMA 3.1 chat template...")
df['text'] = df.apply(format_instruction_prompt, axis=1)
print(f"Formatted {len(df)} dialogue pairs")

# Check results
print("\n" + "="*80)
print("Sample Formatted Prompts (LLaMA 3.1)")
print("="*80)
for i in range(min(3, len(df))):
    print(f"\n--- Example {i+1} (Character: {df.iloc[i]['character']}, Emotion: {df.iloc[i]['emotion']}) ---")
    text = df.iloc[i]['text']
    print((text[:1000] + "...") if len(text) > 1000 else text)
    print()

Formatting all dialogue pairs with LLaMA 3.1 chat template...
Formatted 1529 dialogue pairs

Sample Formatted Prompts (LLaMA 3.1)

--- Example 1 (Character: Ross, Emotion: surprise) ---
<|begin_of_text|><|start_header_id|>system<|end_header_id|>

Cutting Knowledge Date: December 2023
Today Date: 26 Jul 2024

You are Ross Geller from Friends. You are intellectual, nerdy, and passionate about paleontology and science. You tend to overthink things and can be awkward in romantic situations. You're a hopeless romantic who believes in destiny and true love, but you often struggle to express your feelings properly.

Scenario: You're spending a quiet evening together, just talking and connecting.
The user seems to be feeling: surprise<|eot_id|><|start_header_id|>user<|end_header_id|>

Conversation:
<Chandler> Hello?<|eot_id|><|start_header_id|>assistant<|end_header_id|>

What happened?<|eot_id|>


--- Example 2 (Character: Chandler, Emotion: sadness) ---
<|begin_of_text|><|start_header_id|>sys
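
Before saving, it can help to confirm that every formatted example has the expected structure. The following is a minimal sketch of such a check, assuming the three-turn layout shown in the sample above. `check_llama3_text` is a hypothetical helper, and the `sample` string is a shortened hand-written example rather than real tokenizer output.

```python
import re

HEADER = re.compile(r"<\|start_header_id\|>(\w+)<\|end_header_id\|>")


def check_llama3_text(text):
    """Return a list of structural problems found in one formatted example."""
    problems = []
    if text.count("<|begin_of_text|>") != 1 or not text.startswith("<|begin_of_text|>"):
        problems.append("expected exactly one leading <|begin_of_text|>")
    roles = HEADER.findall(text)
    if roles != ["system", "user", "assistant"]:
        problems.append(f"unexpected role order: {roles}")
    if text.count("<|eot_id|>") != 3:
        problems.append("expected three <|eot_id|> markers")
    if not text.rstrip().endswith("<|eot_id|>"):
        problems.append("assistant turn is not closed with <|eot_id|>")
    return problems


sample = (
    "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\n"
    "You are Ross Geller from Friends.<|eot_id|>"
    "<|start_header_id|>user<|end_header_id|>\n\n"
    "Conversation:\n<Chandler> Hello?<|eot_id|>"
    "<|start_header_id|>assistant<|end_header_id|>\n\n"
    "What happened?<|eot_id|>"
)
broken = sample.replace("What happened?<|eot_id|>", "What happened?")

print(check_llama3_text(sample))
print(check_llama3_text(broken))
```

In the notebook, something like `df['text'].map(check_llama3_text)` could then be filtered for non-empty lists.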

In [11]:
# Calculate statistics
df['prompt_length'] = df['text'].apply(len)

print("="*80)
print("Final Dataset Statistics")
print("="*80)

print(f"\nTotal training examples: {len(df)}")
print(f"Unique characters: {df['character'].nunique()}")
print(f"Unique emotions: {df['emotion'].nunique()}")

print("\n" + "-"*80)
print("Formatted Prompt Length Statistics (characters)")
print("-"*80)
print(f"Mean: {df['prompt_length'].mean():.2f}")
print(f"Median: {df['prompt_length'].median():.2f}")
print(f"Min: {df['prompt_length'].min()}")
print(f"Max: {df['prompt_length'].max()}")

print("\n" + "-"*80)
print("Data per Character")
print("-"*80)
for character in sorted(df['character'].unique()):
    count = len(df[df['character'] == character])
    percentage = (count / len(df)) * 100
    print(f"{character:15s}: {count:5d} examples ({percentage:5.2f}%)")

print("\n" + "-"*80)
print("Data per Emotion")
print("-"*80)
for emotion in sorted(df['emotion'].unique()):
    count = len(df[df['emotion'] == emotion])
    percentage = (count / len(df)) * 100
    print(f"{emotion:15s}: {count:5d} examples ({percentage:5.2f}%)")

Final Dataset Statistics

Total training examples: 1529
Unique characters: 8
Unique emotions: 7

--------------------------------------------------------------------------------
Formatted Prompt Length Statistics (characters)
--------------------------------------------------------------------------------
Mean: 917.95
Median: 913.00
Min: 425
Max: 1283

--------------------------------------------------------------------------------
Data per Character
--------------------------------------------------------------------------------
Chandler       :   241 examples (15.76%)
David          :     2 examples ( 0.13%)
Joey           :   308 examples (20.14%)
Mike           :     7 examples ( 0.46%)
Monica         :   261 examples (17.07%)
Phoebe         :   231 examples (15.11%)
Rachel         :   280 examples (18.31%)
Ross           :   199 examples (13.02%)

--------------------------------------------------------------------------------
Data per Emotion
-------------------------------------

In [12]:
# Select columns for output
output_df = df[[
    'text',
    'character',
    'emotion',
    'response',
    'dialogue_id',
    'season',
    'episode'
]].copy()

# Save to CSV
output_df.to_csv(output_file, index=False)
print(f"Training dataset saved to: {output_file}")
print(f"Total rows: {len(output_df)}")
print(f"\nColumns in output file:")
for col in output_df.columns:
    print(f"  - {col}")

# Display final sample
print("\nFinal dataset sample:")
print("First row (truncated for display):")
first_row = output_df.iloc[0]
print(f"Character: {first_row['character']}")
print(f"Emotion: {first_row['emotion']}")
print(f"Text (first 300 chars): {first_row['text'][:300]}...")
print(f"\nFull DataFrame head:")
output_df[['character', 'emotion', 'dialogue_id', 'season', 'episode']].head()

Training dataset saved to: ..\..\data\processed\MELD\meld_dating_sim_instruct.csv
Total rows: 1529

Columns in output file:
  - text
  - character
  - emotion
  - response
  - dialogue_id
  - season
  - episode

Final dataset sample:
First row (truncated for display):
Character: Ross
Emotion: surprise
Text (first 300 chars): <|begin_of_text|><|start_header_id|>system<|end_header_id|>

Cutting Knowledge Date: December 2023
Today Date: 26 Jul 2024

You are Ross Geller from Friends. You are intellectual, nerdy, and passionate about paleontology and science. You tend to overthink things and can be awkward in romantic situat...

Full DataFrame head:


|   | character | emotion | dialogue_id | season | episode |
|---|---|---|---|---|---|
| 0 | Ross | surprise | 134 | 4 | 9 |
| 1 | Chandler | sadness | 134 | 4 | 9 |
| 2 | Ross | joy | 134 | 4 | 9 |
| 3 | Chandler | anger | 134 | 4 | 9 |
| 4 | Ross | neutral | 200 | 6 | 23 |
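
Each `text` value spans several lines, so the saved CSV has more physical lines than records. The sketch below uses the standard-library `csv` module and an in-memory buffer to illustrate that a CSV-aware reader restores such fields intact. It is an illustration with made-up rows, not the notebook's own save step, which relies on pandas.

```python
import csv
import io

rows = [
    {
        "text": "<|begin_of_text|>system line\n\nScenario: coffee date<|eot_id|>",
        "character": "Ross",
        "response": 'He said "Hello?", then left.',
    }
]

# Write to an in-memory buffer the same way a CSV file would be written.
buffer = io.StringIO(newline="")
writer = csv.DictWriter(buffer, fieldnames=["text", "character", "response"])
writer.writeheader()
writer.writerows(rows)

# Embedded newlines make the physical line count exceed the record count.
raw = buffer.getvalue()
print(f"Physical lines: {len(raw.splitlines())}")

# A CSV-aware reader restores the original multi-line field.
restored = list(csv.DictReader(io.StringIO(raw, newline="")))
print(restored[0]["text"] == rows[0]["text"])
```

The practical consequence is that the output file should be read back with a CSV parser such as `pd.read_csv`, not split line by line.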


## Summary

Successfully created a LLaMA 3.1 instruction-formatted dataset for dating simulator training!

**What was done:**
1. ✅ Loaded romantic dialogue pairs from MELD (with speaker tokens)
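2. ✅ Defined 6 Friends character personas with distinct personalities (Mike and David, 9 pairs in total, fall back to the generic prompt `You are {character} from Friends.`)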
3. ✅ Generated dating scenario descriptions
4. ✅ Loaded LLaMA 3.1 tokenizer
5. ✅ Formatted with LLaMA 3.1 chat template using `apply_chat_template`
6. ✅ Added emotion conditioning in system prompts
7. ✅ Saved training-ready dataset with `text` column

**Output Format (LLaMA 3.1):**
```
<|begin_of_text|><|start_header_id|>system<|end_header_id|>

Cutting Knowledge Date: December 2023
Today Date: 26 Jul 2024

{Character personality description}

Scenario: {dating scenario}
The user seems to be feeling: {emotion}<|eot_id|><|start_header_id|>user<|end_header_id|>

Conversation:
{conversation context with speaker tokens}<|eot_id|><|start_header_id|>assistant<|end_header_id|>

{character response}<|eot_id|>
```

The two date lines at the top of the system turn are inserted by the chat template itself, as the sample output in section 4 shows.

**Key Improvements:**
- ✅ Uses official LLaMA 3.1 chat template (no manual formatting)
- ✅ Properly structured with system/user/assistant roles
- ✅ Saved as `text` column for direct training compatibility
- ✅ Speaker tokens preserved in context for persona learning

**Next Steps:**

1. **Train the model (notebook 03):**
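   - Notebook 03 can now directly load the pre-formatted data
   - No parsing or formatting needed in the training notebook
   - Just tokenize the `text` column and train; each `text` already begins with `<|begin_of_text|>`, so pass `add_special_tokens=False` to avoid a duplicate BOS token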

2. **Training will learn:**
   - Character-specific personalities and speech patterns
   - Emotion-appropriate responses
   - Dating/romantic conversation dynamics
   - Context-aware dialogue generation with speaker awareness

3. **At inference:**
   - Use same `apply_chat_template` with `add_generation_prompt=True`
   - User selects a Friends character
   - User sets dating scenario
   - Model generates in-character responses
   - Responses adapt to user's emotional state

4. **Build chatbot interface:**
   - Character selection menu
   - Scenario setup
   - 1-on-1 conversation loop
   - Emotion detection for user inputs (optional)
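
For the inference step in the list above, the prompt needs the same system and user turns as the training data but no assistant turn. A minimal sketch follows, assuming the system prompt layout used in section 4. `build_inference_messages` is a hypothetical helper, and the `<User>` speaker tag is an assumption, since the training contexts only contain Friends character names.

```python
def build_inference_messages(persona, scenario, emotion, context):
    """Build system and user turns only; the assistant turn is left for the model."""
    system_content = (
        f"{persona}\n\n"
        f"Scenario: {scenario}\n"
        f"The user seems to be feeling: {emotion}"
    )
    return [
        {"role": "system", "content": system_content},
        {"role": "user", "content": f"Conversation:\n{context}"},
    ]


messages = build_inference_messages(
    persona="You are Joey Tribbiani from Friends.",
    scenario="You're on a casual coffee date at Central Perk, the cozy coffee shop.",
    emotion="joy",
    context="<User> Hi Joey!",  # the "<User>" speaker tag is an assumption
)

for message in messages:
    print(message["role"], "->", message["content"][:40])

# With the tokenizer loaded earlier in this notebook, the prompt would be rendered as:
# prompt = tokenizer.apply_chat_template(
#     messages, tokenize=False, add_generation_prompt=True
# )
```

Keeping the message builder separate from the tokenizer call makes it easy to reuse in a conversation loop, where `context` grows with each turn.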