# MELD Dataset Data Cleaning for Dating Simulator

This notebook processes the MELD (Multimodal EmotionLines Dataset) from YAML format into a cleaned CSV suitable for dating simulator fine-tuning with Friends character personas.

**Dataset:** MELD - Friends TV show dialogue with emotion labels

**Input:** `data/raw/MELD/data/MELD/datasets.yaml`

**Output:** `data/processed/MELD/meld_romantic_cleaned.csv`

**Processing:**
1. Load and parse YAML structure (dev/test/train splits)
2. Sort chronologically by Season → Episode → StartTime
3. **Filter to romantic/dating conversations** using a heuristic: known couples (Monica+Chandler, Ross+Rachel, Phoebe+Mike, Phoebe+David) plus two-person dialogues between main characters that contain joy or surprise
4. **Filter to 1-on-1 conversations only** (exclude multi-party group scenes)
5. Create dialogue pairs with context windows of up to 5 previous utterances
6. Include emotion labels and Friends character names
7. Print statistics and save the cleaned dataset

**Key Features for Dating Simulator:**
- ✅ Romance and flirting context, selected heuristically (some platonic two-person scenes also pass the filter)
- ✅ 1-on-1 conversations (no multi-party)
- ✅ Friends character personas (Chandler, Monica, Ross, Rachel, Joey, Phoebe)
- ✅ Emotion labels for emotion-conditioned training
- ✅ NO sentiment column (not needed)
- ✅ Speaker tokens in format `<Speaker>` for persona learning

In [1]:
%pip install pyyaml pandas

Note: you may need to restart the kernel to use updated packages.


In [1]:
from pathlib import Path
import yaml
import pandas as pd

# Set paths
yaml_file = Path("../../data/raw/MELD/data/MELD/datasets.yaml")
output_dir = Path("../../data/processed/MELD")
output_dir.mkdir(parents=True, exist_ok=True)

print(f"Loading MELD dataset from: {yaml_file}")
print(f"Output directory: {output_dir}")

Loading MELD dataset from: ..\..\data\raw\MELD\data\MELD\datasets.yaml
Output directory: ..\..\data\processed\MELD


## 1. Load and Parse YAML Data

In [2]:
# Load YAML file
print("Loading YAML file...")
with open(yaml_file, 'r', encoding='utf-8') as f:
    meld_data = yaml.safe_load(f)

print(f"Loaded data with splits: {list(meld_data.keys())}")

Loading YAML file...
Loaded data with splits: ['dev', 'test', 'train']


In [3]:
# Flatten YAML structure into list of dictionaries
def flatten_yaml_data(yaml_dict):
    """
    Convert nested YAML structure to flat list of utterance dictionaries.
    
    Input structure:
    {
        'dev': {
            'dia0_utt0': {'Dialogue_ID': '0', 'Utterance': '...', ...},
            'dia0_utt1': {'Dialogue_ID': '0', 'Utterance': '...', ...},
            ...
        },
        'test': {...},
        'train': {...}
    }
    
    Output:
    [
        {'split': 'dev', 'Dialogue_ID': '0', 'Utterance_ID': '0', 'Utterance': '...', ...},
        ...
    ]
    """
    all_utterances = []
    
    for split_name, split_data in yaml_dict.items():
        # Keys such as 'dia0_utt0' are not needed; the IDs are stored in each record
        for utterance_dict in split_data.values():
            # Copy the record and tag it with its split, leaving the input unmodified
            all_utterances.append({**utterance_dict, 'split': split_name})
    
    return all_utterances

print("Flattening YAML structure...")
utterances = flatten_yaml_data(meld_data)
print(f"Total utterances: {len(utterances)}")

# Convert to DataFrame
df = pd.DataFrame(utterances)
print(f"\nDataFrame shape: {df.shape}")
print(f"\nColumns: {list(df.columns)}")

# Display first few rows
df.head()

Flattening YAML structure...
Total utterances: 13707

DataFrame shape: (13707, 12)

Columns: ['Dialogue_ID', 'Emotion', 'EndTime', 'Episode', 'Season', 'Sentiment', 'Speaker', 'SrNo', 'StartTime', 'Utterance', 'Utterance_ID', 'split']


| | Dialogue_ID | Emotion | EndTime | Episode | Season | Sentiment | Speaker | SrNo | StartTime | Utterance | Utterance_ID | split |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | sadness | 00:21:00,049 | 7 | 4 | negative | Phoebe | 1 | 00:20:57,256 | Oh my God, he’s lost it. He’s totally lost it. | 0 | dev |
| 1 | 0 | surprise | 00:21:03,261 | 7 | 4 | negative | Monica | 2 | 00:21:01,927 | What? | 1 | dev |
| 2 | 100 | neutral | 00:13:26,460 | 1 | 1 | neutral | Monica | 1064 | 00:13:23,916 | Okay. It’s Emma. | 0 | dev |
| 3 | 101 | sadness | 00:13:35,928 | 4 | 8 | negative | Rachel | 1065 | 00:13:27,795 | Emma! See? I don’t want it. | 0 | dev |
| 4 | 102 | neutral | 0:13:42,477 | 2 | 1 | neutral | Monica | 1066 | 0:13:40,766 | Take it. | 0 | dev |


In [4]:
# Check data types and basic statistics
print("Data info:")
print(df.info())

print("\nEmotion distribution:")
print(df['Emotion'].value_counts())

print("\nSentiment distribution:")
print(df['Sentiment'].value_counts())

print("\nSplit distribution:")
print(df['split'].value_counts())

Data info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13707 entries, 0 to 13706
Data columns (total 12 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   Dialogue_ID   13707 non-null  object
 1   Emotion       13707 non-null  object
 2   EndTime       13707 non-null  object
 3   Episode       13707 non-null  object
 4   Season        13707 non-null  object
 5   Sentiment     13707 non-null  object
 6   Speaker       13707 non-null  object
 7   SrNo          13707 non-null  object
 8   StartTime     13707 non-null  object
 9   Utterance     13707 non-null  object
 10  Utterance_ID  13707 non-null  object
 11  split         13707 non-null  object
dtypes: object(12)
memory usage: 1.3+ MB
None

Emotion distribution:
Emotion
neutral     6435
joy         2308
surprise    1636
anger       1607
sadness     1002
disgust      361
fear         358
Name: count, dtype: int64

Sentiment distribution:
Sentiment
neutral     6435
negative    4184
p

## 2. Normalize Timestamps and Sort Data

In [5]:
def parse_timestamp(timestamp_str):
    """
    Parse timestamp string to total seconds.
    Handles formats: "HH:MM:SS,milliseconds" and "H:MM:SS,milliseconds"
    
    Example: "00:21:00,049" or "0:13:42,477" -> total seconds as float
    """
    try:
        # Replace comma with period for milliseconds
        timestamp_str = timestamp_str.replace(',', '.')
        
        # Split into time and milliseconds
        parts = timestamp_str.split('.')
        time_part = parts[0]
        milliseconds = int(parts[1]) if len(parts) > 1 else 0
        
        # Parse time part
        time_components = time_part.split(':')
        hours = int(time_components[0])
        minutes = int(time_components[1])
        seconds = int(time_components[2])
        
        # Convert to total seconds
        total_seconds = hours * 3600 + minutes * 60 + seconds + milliseconds / 1000.0
        
        return total_seconds
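    except (AttributeError, IndexError, ValueError) as e:
        # Malformed or non-string timestamps fall back to 0.0, which sorts them first
        print(f"Error parsing timestamp '{timestamp_str}': {e}")
        return 0.0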

# Test timestamp parser
print("Testing timestamp parser:")
test_timestamps = ["00:21:00,049", "0:13:42,477", "00:00:05,123"]
for ts in test_timestamps:
    print(f"{ts} -> {parse_timestamp(ts)} seconds")

Testing timestamp parser:
00:21:00,049 -> 1260.049 seconds
0:13:42,477 -> 822.477 seconds
00:00:05,123 -> 5.123 seconds
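
The parser above assumes a three-digit millisecond field and returns `0.0` for anything it cannot read, which makes a bad timestamp indistinguishable from a real `00:00:00,000`. As a hedged alternative, the sketch below validates the format with a regular expression and returns `NaN` for malformed input. The name `parse_timestamp_strict` and the accepted pattern are assumptions for illustration, not part of the original notebook.

```python
import math
import re

# Accepts "H:MM:SS" or "HH:MM:SS", with an optional 1-3 digit fraction
# separated by a comma or a period (e.g. "00:21:00,049").
_TIMESTAMP_RE = re.compile(r"^(\d{1,2}):(\d{2}):(\d{2})(?:[,.](\d{1,3}))?$")


def parse_timestamp_strict(timestamp_str):
    """Return total seconds as a float, or NaN if the string is malformed."""
    match = _TIMESTAMP_RE.match(str(timestamp_str).strip())
    if match is None:
        return math.nan
    hours, minutes, seconds, fraction = match.groups()
    # Right-pad the fraction so that ",5" means 500 ms rather than 5 ms
    millis = int(fraction.ljust(3, "0")) if fraction else 0
    return int(hours) * 3600 + int(minutes) * 60 + int(seconds) + millis / 1000.0


for ts in ["00:21:00,049", "0:13:42,477", "00:00:05,5", "not a time"]:
    print(f"{ts!r} -> {parse_timestamp_strict(ts)}")
```

With `NaN` as the failure value, malformed rows can be counted with `df['StartTime_seconds'].isna().sum()` before sorting instead of silently sorting to the start of an episode.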


In [6]:
# Add parsed timestamp columns
print("Parsing timestamps...")
df['StartTime_seconds'] = df['StartTime'].apply(parse_timestamp)
df['EndTime_seconds'] = df['EndTime'].apply(parse_timestamp)

# Convert string columns to integers for sorting
df['Season_int'] = df['Season'].astype(int)
df['Episode_int'] = df['Episode'].astype(int)
df['Dialogue_ID_int'] = df['Dialogue_ID'].astype(int)
df['Utterance_ID_int'] = df['Utterance_ID'].astype(int)
df['SrNo_int'] = df['SrNo'].astype(int)

print("\nSample with parsed timestamps:")
df[['Season', 'Episode', 'Dialogue_ID', 'Utterance_ID', 'StartTime', 'StartTime_seconds', 'Utterance']].head(10)

Parsing timestamps...

Sample with parsed timestamps:


| | Season | Episode | Dialogue_ID | Utterance_ID | StartTime | StartTime_seconds | Utterance |
|---|---|---|---|---|---|---|---|
| 0 | 4 | 7 | 0 | 0 | 00:20:57,256 | 1257.256 | Oh my God, he’s lost it. He’s totally lost it. |
| 1 | 4 | 7 | 0 | 1 | 00:21:01,927 | 1261.927 | What? |
| 2 | 1 | 1 | 100 | 0 | 00:13:23,916 | 803.916 | Okay. It’s Emma. |
| 3 | 8 | 4 | 101 | 0 | 00:13:27,795 | 807.795 | Emma! See? I don’t want it. |
| 4 | 1 | 2 | 102 | 0 | 0:13:40,766 | 820.766 | Take it. |
| 5 | 1 | 2 | 102 | 1 | 0:13:42,477 | 822.477 | What? |
| 6 | 8 | 24 | 103 | 0 | 00:13:44,395 | 824.395 | It’s clearly an Emma. |
| 7 | 8 | 24 | 103 | 1 | 00:13:46,605 | 826.605 | Oh honey, but you love that name. |
| 8 | 8 | 24 | 103 | 2 | 00:13:49,817 | 829.817 | Yeah, but I love you more. |
| 9 | 8 | 24 | 103 | 3 | 0:13:55,072 | 835.072 | Besides y’know, nothing goes with Bing. |


In [7]:
# Sort data chronologically
print("Sorting data chronologically...")
df_sorted = df.sort_values(
    by=['Season_int', 'Episode_int', 'StartTime_seconds', 'Dialogue_ID_int', 'Utterance_ID_int'],
    ascending=True
).reset_index(drop=True)

print(f"Sorted {len(df_sorted)} utterances")

# Verify sorting by checking a few dialogues
print("\nVerifying sorting - first dialogue:")
first_dialogue = df_sorted[df_sorted['Dialogue_ID_int'] == df_sorted['Dialogue_ID_int'].iloc[0]]
print(first_dialogue[['Dialogue_ID', 'Utterance_ID', 'Speaker', 'StartTime', 'Utterance']].to_string())

Sorting data chronologically...
Sorted 13707 utterances

Verifying sorting - first dialogue:
  Dialogue_ID Utterance_ID   Speaker     StartTime                                                                                                            Utterance
0         559            0  Chandler  00:01:19,037  Alright, so I'm back in high school, I'm standing in the middle of the cafeteria, and I realize I am totally naked.
1         559            1       All  00:01:25,627                                                                                            Oh, yeah. Had that dream.
2         559            2  Chandler  00:01:27,378                                                            Then I look down, and I realize there's a phone... there.
3         559            3      Joey   0:01:34,928                                                                                                       Instead of...?
4         559            4  Chandler   0:01:35,600                 

## 2.5. Filter to Romance/Flirting Conversations and 1-on-1 Only

For the dating simulator, we need:
1. **Romance + flirting context**: Conversations between romantic couples or involving dating/flirting
2. **1-on-1 conversations only**: Keep dialogues with exactly 2 speakers (this excludes multi-party scenes with 3+ speakers as well as single-speaker dialogues)

In [8]:
# Define romantic relationships in Friends
ROMANTIC_PAIRS = [
    ('Monica', 'Chandler'),  # Main couple
    ('Ross', 'Rachel'),      # On-again/off-again couple
    ('Phoebe', 'Mike'),      # Later seasons couple
    ('Phoebe', 'David'),     # Phoebe's scientist boyfriend
]

# Characters known for dating/flirting scenes
DATING_CHARACTERS = ['Joey', 'Chandler', 'Ross', 'Rachel', 'Monica', 'Phoebe']

def is_romantic_dialogue(dialogue_df):
    """
    Heuristically check whether a dialogue involves romance/flirting.
    
    Returns True if:
    - The speakers are exactly one of the known romantic couples, or
    - The dialogue is between exactly two main characters and at least one
      utterance is labeled joy or surprise
    
    The second rule is only a proxy: it also accepts platonic exchanges
    between two main characters.
    """
    speakers = set(dialogue_df['Speaker'].unique())
    
    # Rule 1: dialogue is between a known romantic pair
    for person1, person2 in ROMANTIC_PAIRS:
        if speakers == {person1, person2}:
            return True
    
    # Rule 2: two main characters with at least one joy/surprise utterance.
    # Fear or sadness alone is not enough to pass this rule.
    if len(speakers) == 2 and speakers.issubset(DATING_CHARACTERS):
        emotions = set(dialogue_df['Emotion'])
        if emotions & {'joy', 'surprise'}:
            return True
    
    return False

print("Filtering dialogues...")
print(f"Total dialogues before filtering: {df_sorted['Dialogue_ID_int'].nunique()}")

# Group by dialogue and filter
romantic_dialogue_ids = []
for dialogue_id, dialogue_df in df_sorted.groupby('Dialogue_ID_int'):
    # Check 1: Must be 1-on-1 (exactly 2 speakers)
    unique_speakers = dialogue_df['Speaker'].nunique()
    if unique_speakers != 2:
        continue
    
    # Check 2: Must be romantic/flirting context
    if is_romantic_dialogue(dialogue_df):
        romantic_dialogue_ids.append(dialogue_id)

print(f"Found {len(romantic_dialogue_ids)} romantic/dating dialogues with exactly 2 speakers")

# Filter dataset to only romantic 1-on-1 dialogues
df_romantic = df_sorted[df_sorted['Dialogue_ID_int'].isin(romantic_dialogue_ids)].reset_index(drop=True)
print(f"Total utterances after filtering: {len(df_romantic)} (from {len(df_sorted)})")

Filtering dialogues...
Total dialogues before filtering: 1039
Found 181 romantic/dating dialogues with exactly 2 speakers
Total utterances after filtering: 1710 (from 13707)
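
The final dataset sample later in this notebook lists dialogue 200 with rows from both Season 6, Episode 23 and Season 4, Episode 22. That suggests `Dialogue_ID` values repeat across the dev/test/train splits, in which case grouping on `Dialogue_ID_int` alone merges unrelated dialogues. The sketch below shows the effect and a possible fix, grouping on a composite `(split, Dialogue_ID_int)` key. The toy DataFrame is invented for illustration and is not taken from MELD.

```python
import pandas as pd

# Toy data: the same Dialogue_ID appears in two splits with different speakers
toy = pd.DataFrame({
    "split": ["train", "train", "dev", "dev"],
    "Dialogue_ID_int": [0, 0, 0, 0],
    "Speaker": ["Ross", "Rachel", "Joey", "Chandler"],
})

# Grouping on the ID alone merges both dialogues into one 4-speaker group
speakers_by_id = toy.groupby("Dialogue_ID_int")["Speaker"].nunique()

# Grouping on (split, ID) keeps them apart as two 2-speaker dialogues
speakers_by_key = toy.groupby(["split", "Dialogue_ID_int"])["Speaker"].nunique()

print(speakers_by_id)
print(speakers_by_key)
```

If the same pattern holds in the real data, every `groupby('Dialogue_ID_int')` and every `Dialogue_ID_int` comparison in this notebook would need the composite key, and the dialogue and pair counts reported here would change.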


In [9]:
# Analyze filtered romantic dialogues
print("\n" + "="*80)
print("Romantic/Dating Dialogue Analysis")
print("="*80)

print("\nSpeaker pair distribution:")
speaker_pairs = {}
for dialogue_id in romantic_dialogue_ids:
    dialogue_df = df_sorted[df_sorted['Dialogue_ID_int'] == dialogue_id]
    speakers = tuple(sorted(dialogue_df['Speaker'].unique()))
    speaker_pairs[speakers] = speaker_pairs.get(speakers, 0) + 1

for pair, count in sorted(speaker_pairs.items(), key=lambda x: x[1], reverse=True):
    print(f"  {pair[0]} + {pair[1]}: {count} dialogues")

print("\nEmotion distribution in romantic dialogues:")
romantic_emotions = df_romantic['Emotion'].value_counts()
for emotion, count in romantic_emotions.items():
    percentage = (count / len(df_romantic)) * 100
    print(f"  {emotion:15s}: {count:5d} ({percentage:5.2f}%)")

# Show sample romantic dialogue
print("\n" + "="*80)
print("Sample Romantic Dialogue")
print("="*80)
sample_dialogue_id = romantic_dialogue_ids[0] if romantic_dialogue_ids else None
if sample_dialogue_id:
    sample_dialogue = df_romantic[df_romantic['Dialogue_ID_int'] == sample_dialogue_id]
    print(f"Dialogue ID: {sample_dialogue_id}")
    print(f"Speakers: {', '.join(sample_dialogue['Speaker'].unique())}")
    print(f"Season {sample_dialogue['Season'].iloc[0]}, Episode {sample_dialogue['Episode'].iloc[0]}")
    print("\nConversation:")
    for _, row in sample_dialogue.iterrows():
        print(f"  {row['Speaker']} ({row['Emotion']}): {row['Utterance']}")


Romantic/Dating Dialogue Analysis

Speaker pair distribution:
  Chandler + Monica: 29 dialogues
  Rachel + Ross: 23 dialogues
  Chandler + Joey: 22 dialogues
  Phoebe + Rachel: 14 dialogues
  Joey + Phoebe: 13 dialogues
  Monica + Rachel: 13 dialogues
  Joey + Ross: 11 dialogues
  Monica + Phoebe: 10 dialogues
  Joey + Monica: 9 dialogues
  Joey + Rachel: 9 dialogues
  Phoebe + Ross: 8 dialogues
  Chandler + Phoebe: 5 dialogues
  Chandler + Rachel: 5 dialogues
  Chandler + Ross: 4 dialogues
  Monica + Ross: 4 dialogues
  David + Phoebe: 1 dialogues
  Mike + Phoebe: 1 dialogues

Emotion distribution in romantic dialogues:
  neutral        :   752 (43.98%)
  joy            :   297 (17.37%)
  anger          :   220 (12.87%)
  surprise       :   216 (12.63%)
  sadness        :   151 ( 8.83%)
  fear           :    41 ( 2.40%)
  disgust        :    33 ( 1.93%)

Sample Romantic Dialogue
Dialogue ID: 134
Speakers: Chandler, Ross
Season 1, Episode 4

Conversation:
  Chandler (neutral): Hello?
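
The speaker pair distribution above includes pairings outside `ROMANTIC_PAIRS` (for example Chandler + Joey), because the emotion-based rule accepts any two main characters. If a stricter dataset is preferred, one option is to keep only the known couples. A minimal sketch, where `is_known_couple` and `COUPLE_SETS` are hypothetical names that do not appear in the original notebook:

```python
ROMANTIC_PAIRS = [
    ("Monica", "Chandler"),
    ("Ross", "Rachel"),
    ("Phoebe", "Mike"),
    ("Phoebe", "David"),
]

# frozenset makes the comparison independent of speaker order
COUPLE_SETS = {frozenset(pair) for pair in ROMANTIC_PAIRS}


def is_known_couple(speakers):
    """Return True if the distinct speakers are exactly one known couple."""
    return frozenset(speakers) in COUPLE_SETS


print(is_known_couple(["Chandler", "Monica", "Chandler"]))  # known couple
print(is_known_couple(["Chandler", "Joey"]))                # not a couple
```

Going by the pair counts printed above, this stricter rule would keep 54 of the 181 dialogues (29 + 23 + 1 + 1), so it trades dataset size for precision.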


## 3. Create Dialogue Pairs with Context Windows

In [10]:
def create_dialogue_pairs_with_boundaries(df, window_size=5):
    """
    Create dialogue pairs for 1-on-1 romantic conversations.
    
    For each utterance (response):
    - Context: Previous `window_size` utterances FROM THE SAME DIALOGUE WITH SPEAKER TOKENS
    - Response: Current utterance
    - Emotion: Emotion label of the response
    - Character: Who is speaking (Friends character name)
    
    Speaker tokens are included in the format: <Speaker> utterance text
    
    Args:
        df: DataFrame with sorted utterances (already filtered to romantic 1-on-1)
        window_size: Number of previous utterances to use as context
    
    Returns:
        List of dialogue pair dictionaries
    """
    pairs = []
    
    # Group by dialogue
    grouped = df.groupby('Dialogue_ID_int')
    
    for dialogue_id, dialogue_df in grouped:
        dialogue_df = dialogue_df.sort_values('Utterance_ID_int').reset_index(drop=True)
        utterances = dialogue_df['Utterance'].tolist()
        emotions = dialogue_df['Emotion'].tolist()
        speakers = dialogue_df['Speaker'].tolist()
        seasons = dialogue_df['Season'].tolist()
        episodes = dialogue_df['Episode'].tolist()
        
        # Create pairs within this dialogue
        for i in range(len(utterances)):
            # Determine context range
            if i < window_size:
                # Use all available context (partial window for early utterances)
                context_start = 0
                context_end = i
            else:
                # Full window
                context_start = i - window_size
                context_end = i
            
            # Skip if no context (first utterance of dialogue)
            if context_start == context_end:
                continue
            
            # Get context utterances WITH speaker tokens
            context_parts = []
            for j in range(context_start, context_end):
                speaker = speakers[j]
                utterance = utterances[j]
                context_parts.append(f"<{speaker}> {utterance}")
            context = ' '.join(context_parts)
            
            # Create pair
            pair = {
                'character': speakers[i],  # Friends character name
                'emotion': emotions[i],
                'context': context,
                'response': utterances[i],
                'dialogue_id': str(dialogue_id),
                'season': seasons[i],
                'episode': episodes[i]
            }
            pairs.append(pair)
    
    return pairs

print(f"Creating dialogue pairs with window_size=5...")
dialogue_pairs = create_dialogue_pairs_with_boundaries(df_romantic, window_size=5)
print(f"Created {len(dialogue_pairs)} dialogue pairs from romantic 1-on-1 conversations")

Creating dialogue pairs with window_size=5...
Created 1529 dialogue pairs from romantic 1-on-1 conversations
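
To make the windowing concrete, the sketch below builds the context string for each response in a four-utterance toy dialogue with `window_size=2`. The helper name `build_context` is an assumption for illustration; it mirrors the slicing logic of `create_dialogue_pairs_with_boundaries` for positive window sizes but is not part of the original notebook.

```python
def build_context(speakers, utterances, i, window_size=5):
    """Join up to `window_size` utterances before index i, with speaker tokens."""
    start = max(0, i - window_size)
    return " ".join(
        f"<{speakers[j]}> {utterances[j]}" for j in range(start, i)
    )


speakers = ["Chandler", "Ross", "Chandler", "Ross"]
utterances = ["Hello?", "What happened?", "He is stuck in Chicago.", "Oh, man!"]

# The first utterance has no context, so responses start at index 1
contexts = [
    build_context(speakers, utterances, i, window_size=2)
    for i in range(1, len(utterances))
]
for i, context in enumerate(contexts, start=1):
    print(f"response {i}: {utterances[i]!r} <- context: {context!r}")
```

The first response sees a partial window of one utterance, and from the third response on the window slides forward so that the oldest utterance drops out.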


In [11]:
# Convert to DataFrame
pairs_df = pd.DataFrame(dialogue_pairs)

print(f"Dialogue pairs DataFrame shape: {pairs_df.shape}")
print(f"\nColumns: {list(pairs_df.columns)}")

# Display sample pairs
print("\n" + "="*80)
print("Sample Romantic Dialogue Pairs (1-on-1, with speaker tokens)")
print("="*80)
for i in range(min(5, len(pairs_df))):
    pair = pairs_df.iloc[i]
    print(f"\n--- Pair {i+1} ---")
    print(f"Character: {pair['character']}")
    print(f"Emotion: {pair['emotion']}")
    print(f"Context:")
    context_display = pair['context'][:200] + "..." if len(pair['context']) > 200 else pair['context']
    print(f"  {context_display}")
    print(f"Response:")
    response_display = pair['response'][:150] + "..." if len(pair['response']) > 150 else pair['response']
    print(f"  {response_display}")
    print(f"Metadata: Season {pair['season']}, Episode {pair['episode']}, Dialogue {pair['dialogue_id']}")

# Show character distribution
print("\n" + "="*80)
print("Character Distribution")
print("="*80)
character_counts = pairs_df['character'].value_counts()
for character, count in character_counts.items():
    percentage = (count / len(pairs_df)) * 100
    print(f"  {character:15s}: {count:5d} ({percentage:5.2f}%)")

Dialogue pairs DataFrame shape: (1529, 7)

Columns: ['character', 'emotion', 'context', 'response', 'dialogue_id', 'season', 'episode']

Sample Romantic Dialogue Pairs (1-on-1, with speaker tokens)

--- Pair 1 ---
Character: Ross
Emotion: surprise
Context:
  <Chandler> Hello?
Response:
  What happened?
Metadata: Season 4, Episode 9, Dialogue 134

--- Pair 2 ---
Character: Chandler
Emotion: sadness
Context:
  <Chandler> Hello? <Ross> What happened?
Response:
  He’s not gonna make it, he’s stuck in Chicago.
Metadata: Season 4, Episode 9, Dialogue 134

--- Pair 3 ---
Character: Ross
Emotion: joy
Context:
  <Chandler> Hello? <Ross> What happened? <Chandler> He’s not gonna make it, he’s stuck in Chicago.
Response:
  Ohh, man! Chicago, is sooo lucky!
Metadata: Season 4, Episode 9, Dialogue 134

--- Pair 4 ---
Character: Chandler
Emotion: anger
Context:
  <Chandler> Hello? <Ross> What happened? <Chandler> He’s not gonna make it, he’s stuck in Chicago. <Ross> Ohh, man! Chicago, is sooo lucky!


## 4. Final Dataset Statistics

In [12]:
print("="*80)
print("Final Dataset Statistics (Romantic 1-on-1 Conversations Only)")
print("="*80)

print(f"\nTotal utterances in raw MELD data: {len(df)}")
print(f"Total utterances after romantic 1-on-1 filtering: {len(df_romantic)}")
print(f"Total dialogue pairs created: {len(pairs_df)}")
print(f"Total unique dialogues: {len(romantic_dialogue_ids)}")
print(f"Total unique characters: {df_romantic['Speaker'].nunique()}")

print("\n" + "-"*80)
print("Character Distribution:")
print("-"*80)
for character, count in character_counts.items():
    percentage = (count / len(pairs_df)) * 100
    print(f"{character:15s}: {count:5d} ({percentage:5.2f}%)")

print("\n" + "-"*80)
print("Emotion Distribution:")
print("-"*80)
emotion_counts = pairs_df['emotion'].value_counts()
for emotion, count in emotion_counts.items():
    percentage = (count / len(pairs_df)) * 100
    print(f"{emotion:15s}: {count:5d} ({percentage:5.2f}%)")

print("\n" + "-"*80)
print("Length Statistics:")
print("-"*80)
pairs_df['context_length'] = pairs_df['context'].apply(lambda x: len(x.split()))
pairs_df['response_length'] = pairs_df['response'].apply(lambda x: len(x.split()))

print(f"\nContext length (words):")
print(f"  Mean: {pairs_df['context_length'].mean():.2f}")
print(f"  Median: {pairs_df['context_length'].median():.2f}")
print(f"  Min: {pairs_df['context_length'].min()}")
print(f"  Max: {pairs_df['context_length'].max()}")

print(f"\nResponse length (words):")
print(f"  Mean: {pairs_df['response_length'].mean():.2f}")
print(f"  Median: {pairs_df['response_length'].median():.2f}")
print(f"  Min: {pairs_df['response_length'].min()}")
print(f"  Max: {pairs_df['response_length'].max()}")

print("\n" + "-"*80)
print("Dialogue Statistics:")
print("-"*80)
dialogue_lengths = df_romantic.groupby('Dialogue_ID_int').size()
print(f"Average utterances per dialogue: {dialogue_lengths.mean():.2f}")
print(f"Median utterances per dialogue: {dialogue_lengths.median():.2f}")
print(f"Min utterances per dialogue: {dialogue_lengths.min()}")
print(f"Max utterances per dialogue: {dialogue_lengths.max()}")

Final Dataset Statistics (Romantic 1-on-1 Conversations Only)

Total utterances in raw MELD data: 13707
Total utterances after romantic 1-on-1 filtering: 1710
Total dialogue pairs created: 1529
Total unique dialogues: 181
Total unique characters: 8

--------------------------------------------------------------------------------
Character Distribution:
--------------------------------------------------------------------------------
Joey           :   308 (20.14%)
Rachel         :   280 (18.31%)
Monica         :   261 (17.07%)
Chandler       :   241 (15.76%)
Phoebe         :   231 (15.11%)
Ross           :   199 (13.02%)
Mike           :     7 ( 0.46%)
David          :     2 ( 0.13%)

--------------------------------------------------------------------------------
Emotion Distribution:
--------------------------------------------------------------------------------
neutral        :   676 (44.21%)
joy            :   248 (16.22%)
anger          :   202 (13.21%)
surprise       :   193 (12.

In [13]:
# Convert to DataFrame
pairs_df = pd.DataFrame(dialogue_pairs)

print(f"Dialogue pairs DataFrame shape: {pairs_df.shape}")
print(f"\nColumns: {list(pairs_df.columns)}")

# Display sample pairs with speaker attribution
print("\n" + "="*80)
print("Sample Dialogue Pairs (with speaker tokens in context)")
print("="*80)
for i in range(min(3, len(pairs_df))):
    pair = pairs_df.iloc[i]
    print(f"\n--- Pair {i+1} ---")
    print(f"Emotion: {pair['emotion']}")
    print(f"Speaker (responding): {pair['character']}")
    print(f"Context (with speakers):")
    # Show full context to demonstrate speaker tokens, but limit to 300 chars
    context_display = pair['context'][:300] + "..." if len(pair['context']) > 300 else pair['context']
    print(f"  {context_display}")
    print(f"Response:")
    response_display = pair['response'][:150] + "..." if len(pair['response']) > 150 else pair['response']
    print(f"  {response_display}")
    print(f"Metadata: Season {pair['season']}, Episode {pair['episode']}, Dialogue {pair['dialogue_id']}")

# Show one example with even more detail to clearly see speaker attribution
print("\n" + "="*80)
print("Detailed Example (showing speaker attribution in context)")
print("="*80)
if len(pairs_df) > 0:
    example = pairs_df.iloc[10] if len(pairs_df) > 10 else pairs_df.iloc[0]
    print(f"Emotion: {example['emotion']}")
    print(f"Speaker (who responds): {example['character']}")
    print(f"\nContext (notice <speaker> tokens):")
    print(f"{example['context']}")
    print(f"\nResponse (what {example['character']} says):")
    print(f"{example['response']}")

Dialogue pairs DataFrame shape: (1529, 7)

Columns: ['character', 'emotion', 'context', 'response', 'dialogue_id', 'season', 'episode']

Sample Dialogue Pairs (with speaker tokens in context)

--- Pair 1 ---
Emotion: surprise
Speaker (responding): Ross
Context (with speakers):
  <Chandler> Hello?
Response:
  What happened?
Metadata: Season 4, Episode 9, Dialogue 134

--- Pair 2 ---
Emotion: sadness
Speaker (responding): Chandler
Context (with speakers):
  <Chandler> Hello? <Ross> What happened?
Response:
  He’s not gonna make it, he’s stuck in Chicago.
Metadata: Season 4, Episode 9, Dialogue 134

--- Pair 3 ---
Emotion: joy
Speaker (responding): Ross
Context (with speakers):
  <Chandler> Hello? <Ross> What happened? <Chandler> He’s not gonna make it, he’s stuck in Chicago.
Response:
  Ohh, man! Chicago, is sooo lucky!
Metadata: Season 4, Episode 9, Dialogue 134

Detailed Example (showing speaker attribution in context)
Emotion: surprise
Speaker (who responds): Joey

Context (notice <sp

In [14]:
# Select final columns for output (NO sentiment)
output_df = pairs_df[['character', 'emotion', 'context', 'response', 'dialogue_id', 'season', 'episode']].copy()

# Save to CSV
output_file = output_dir / "meld_romantic_cleaned.csv"
output_df.to_csv(output_file, index=False)
print(f"Cleaned dataset saved to: {output_file}")
print(f"Total rows: {len(output_df)}")
print(f"\nColumns in output file: {list(output_df.columns)}")

# Display final sample
print("\nFinal dataset sample:")
output_df.head(10)

Cleaned dataset saved to: ..\..\data\processed\MELD\meld_romantic_cleaned.csv
Total rows: 1529

Columns in output file: ['character', 'emotion', 'context', 'response', 'dialogue_id', 'season', 'episode']

Final dataset sample:


| | character | emotion | context | response | dialogue_id | season | episode |
|---|---|---|---|---|---|---|---|
| 0 | Ross | surprise | `<Chandler> Hello?` | What happened? | 134 | 4 | 9 |
| 1 | Chandler | sadness | `<Chandler> Hello? <Ross> What happened?` | He’s not gonna make it, he’s stuck in Chicago. | 134 | 4 | 9 |
| 2 | Ross | joy | `<Chandler> Hello? <Ross> What happened? <Chand...` | Ohh, man! Chicago, is sooo lucky! | 134 | 4 | 9 |
| 3 | Chandler | anger | `<Chandler> Hello? <Ross> What happened? <Chand...` | Stupid, useless Canadian money! | 134 | 4 | 9 |
| 4 | Ross | neutral | `<Joey> I don’t get it!` | Hey! So uh, was he excited about the tickets? | 200 | 6 | 23 |
| 5 | Joey | anger | `<Joey> I don’t get it! <Ross> Hey! So uh, was ...` | No! He blew us off! | 200 | 6 | 23 |
| 6 | Joey | anger | `<Joey> I don’t get it! <Ross> Hey! So uh, was ...` | It was in my room all night! | 200 | 4 | 22 |
| 7 | Ross | surprise | `<Joey> I don’t get it! <Ross> Hey! So uh, was ...` | What?! | 200 | 6 | 23 |
| 8 | Joey | anger | `<Joey> I don’t get it! <Ross> Hey! So uh, was ...` | And if she didn’t take it, and I didn’t take i... | 200 | 4 | 22 |
| 9 | Joey | anger | `<Ross> Hey! So uh, was he excited about the ti...` | Shh! | 200 | 4 | 22 |
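
When the CSV is read back, pandas infers integer dtypes for `dialogue_id`, `season`, and `episode` unless told otherwise, while the notebook writes them as strings. A small round-trip sketch that keeps them as strings, using an in-memory buffer in place of `meld_romantic_cleaned.csv` and two rows copied from the sample above:

```python
import io

import pandas as pd

sample = pd.DataFrame({
    "character": ["Ross", "Chandler"],
    "emotion": ["surprise", "sadness"],
    "context": ["<Chandler> Hello?", "<Chandler> Hello? <Ross> What happened?"],
    "response": ["What happened?", "He’s not gonna make it, he’s stuck in Chicago."],
    "dialogue_id": ["134", "134"],
    "season": ["4", "4"],
    "episode": ["9", "9"],
})

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
sample.to_csv(buffer, index=False)
buffer.seek(0)

# Explicit dtypes keep the ID columns as strings on the way back in
id_columns = {"dialogue_id": str, "season": str, "episode": str}
loaded = pd.read_csv(buffer, dtype=id_columns)

print(loaded[["dialogue_id", "season", "episode"]].iloc[0].tolist())
print(loaded["response"].iloc[1])
```

The second response contains a comma, so the round trip also confirms that `to_csv` quotes such fields and `read_csv` restores them intact.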


## Summary

This notebook processes the MELD dataset for dating simulator training:

1. **Loaded** YAML data with 3 splits (dev/test/train): 13,707 total utterances
2. **Flattened** the nested structure into a single DataFrame
3. **Sorted** chronologically by Season → Episode → StartTime
4. **Filtered to romantic conversations** with a heuristic: known couples (Monica+Chandler, Ross+Rachel, Phoebe+Mike, Phoebe+David) plus two-person dialogues between main characters that contain joy or surprise
5. **Filtered to 1-on-1 only**: Kept only dialogues with exactly 2 speakers
6. **Created** dialogue pairs with context windows of up to 5 previous utterances, each prefixed with a speaker token
7. **Preserved** emotion labels and Friends character names
8. **Saved** the cleaned dataset as input for instruction formatting and fine-tuning

**Output Format (for Dating Simulator):**
- `character`: Speaker of the response, usually a main Friends character (Chandler, Monica, Ross, Rachel, Joey, Phoebe); Mike and David also appear in a few rows
- `emotion`: Emotion label (anger, disgust, fear, joy, neutral, sadness, surprise)
- `context`: Up to 5 previous utterances from the same dialogue, each prefixed with a speaker token in the format `<Speaker> utterance`
- `response`: Current utterance (target for generation)
- `dialogue_id`, `season`, `episode`: Metadata for tracking
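
Downstream steps such as instruction formatting may need the `context` string split back into individual turns. A minimal sketch of one way to do that with a regular expression; `split_context` is a hypothetical helper, and it assumes that utterances never contain a `<...>` sequence followed by whitespace, which would be mistaken for a speaker token.

```python
import re

# A speaker token is "<Name>" followed by whitespace; the name is captured
_SPEAKER_TOKEN = re.compile(r"<([^<>]+)>\s")


def split_context(context):
    """Split '<A> text <B> text' into a list of (speaker, utterance) tuples."""
    # re.split with one capture group yields: [prefix, name, text, name, text, ...]
    parts = _SPEAKER_TOKEN.split(context)
    return [
        (parts[i], parts[i + 1].strip())
        for i in range(1, len(parts) - 1, 2)
    ]


turns = split_context("<Chandler> Hello? <Ross> What happened?")
for speaker, utterance in turns:
    print(f"{speaker}: {utterance}")
```

Because the context is stored as a single space-joined string, this parsing is only as reliable as the assumption above; storing turns as a list (for example in JSON Lines) would avoid the ambiguity.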

**Why This Format Works for Dating Simulator:**

1. **Romance context**: Conversations are selected by a romance/flirting heuristic; the speaker pair distribution shows that some platonic pairings (e.g., Chandler + Joey) also pass it
2. **1-on-1 format**: Matches dating simulator interaction (user + character)
3. **Speaker tokens**: Included as `<Speaker>` to enable proper persona learning and turn attribution
4. **Character personas**: Each Friends character has distinct personality for multi-persona training
5. **Emotion conditioning**: Can train model to respond with specific emotions
6. **NO sentiment**: Removed as not needed for dating simulator

**Character Personas Available:**
- **Chandler**: Sarcastic, witty, uses humor as defense, romantic underneath
- **Monica**: Organized, competitive, nurturing, wants commitment
- **Ross**: Intellectual, nerdy, overthinks, passionate but awkward
- **Rachel**: Fashion-focused, fun, flirty, independent
- **Joey**: Confident, simple, charming, food-focused
- **Phoebe**: Quirky, spiritual, unconventional, direct

**Next Steps:**

1. Create instruction formatting notebook to add:
   - Persona definitions for each character
   - Scenario descriptions
   - LLaMA 3.1 instruction template format

2. File: `notebooks/MELD/02_instruction_formatting__MELD.ipynb`

3. Train with instruction fine-tuning:
   ```bash
   python src/training/train_dialogue.py \
       --data_path data/processed/MELD/meld_romantic_instruct.csv \
       --emotion_conditioned \
       --output_dir checkpoints/dating_sim_friends
   ```

4. Build dating simulator interface where user can:
   - Select Friends character persona
   - Set initial scenario
   - Chat 1-on-1 with character

**Dataset Size After Filtering:**
In the run recorded above, filtering kept 181 dialogues (1,710 of 13,707 utterances), which produced 1,529 dialogue pairs.