# Preliminary Data Annotation

We want positive and negative examples annotated with a set of linguistic metrics (coherence, fluency) at both the utterance level and the dialogue level (at most 5 turns).

- Positive examples will be taken from the [BabyLM (Switchboard)](https://huggingface.co/datasets/hhoangphuoc/switchboard) dataset.
- Negative examples will be taken from BabyLlama outputs.

Corpus size: no more than 20 million tokens.

## 1. Setup

In [2]:
import torch
import spacy
import contextualSpellCheck
from transformers import AutoTokenizer, AutoModelForCausalLM
import pandas as pd
import os



## 2. Data Processing

### 2.1 BabyLM (Switchboard) Dataset

#### 2.1.1 Loading

In [11]:
import pandas as pd
from token_count import TokenCount

# Initialize TokenCount for the GPT-3.5-turbo model
tc = TokenCount(model_name="gpt-3.5-turbo")

# === Load and Clean Data ===
with open("./train_100M/switchboard.train", "r") as f:
    lines = [line.strip() for line in f if line.strip() and not line.startswith("----")]

# Compute and print total token count using TokenCount
text_content = " ".join(lines)
total_tokens = tc.num_tokens_from_string(text_content)
print(f"Total tokens in switchboard.train (GPT-3.5-turbo tokenization): {total_tokens}")

# Parse speaker and text
data = []
for line in lines:
    if "\t" in line:
        speaker, text = line.split("\t", 1)
        data.append((speaker.strip(), text.strip()))

# Create DataFrame
df = pd.DataFrame(data, columns=["speaker", "text"])

# === Dialog-Level Chunking (5-turn blocks like ABABA or BABAB) ===
dialogs = []
current_dialog = []
last_speaker = None

for speaker, text in zip(df["speaker"], df["text"]):
    if not current_dialog:
        current_dialog.append([speaker, [text]])
    elif speaker == last_speaker:
        current_dialog[-1][1].append(text)
    else:
        current_dialog.append([speaker, [text]])

    last_speaker = speaker

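    # Close the block as soon as its fifth turn starts. Later utterances by the
    # same speaker then open the next block, and a trailing block with fewer
    # than 5 turns is never saved.
    if len(current_dialog) == 5: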
        dialog_text = [f"{turn[0]}: {' '.join(turn[1])}" for turn in current_dialog]
        dialogs.append(dialog_text)
        current_dialog = []
        last_speaker = None

# Convert to DataFrame
dialog_df = pd.DataFrame(dialogs, columns=[f"turn_{i+1}" for i in range(5)])

# === Save DataFrames ===
dialog_df.to_csv("switchboard_dialog_level.csv", index=False)


Total tokens in switchboard.train (GPT-3.5-turbo tokenization): 2005481


In [12]:
len(df)

161740
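
To make the behaviour of the chunking loop concrete, here is a minimal sketch that wraps the same logic in a function and runs it on a made-up conversation. The function name `chunk_dialogs`, the `turns_per_dialog` parameter, and the toy utterances are illustrative assumptions and are not part of the original notebook.

```python
def chunk_dialogs(utterances, turns_per_dialog=5):
    """Group (speaker, text) pairs into fixed-size blocks of speaker turns.

    Consecutive utterances by the same speaker are merged into one turn.
    Hypothetical helper that mirrors the loop in the cell above.
    """
    dialogs, current, last_speaker = [], [], None
    for speaker, text in utterances:
        if current and speaker == last_speaker:
            # Same speaker continues: extend the current turn.
            current[-1][1].append(text)
        else:
            # Speaker change (or first utterance of a block): start a new turn.
            current.append([speaker, [text]])
        last_speaker = speaker

        if len(current) == turns_per_dialog:
            dialogs.append([f"{s}: {' '.join(parts)}" for s, parts in current])
            current, last_speaker = [], None
    return dialogs


# Made-up utterances, for illustration only.
toy = [
    ("A", "Hi."), ("A", "How are you?"), ("B", "Fine."), ("A", "Good."),
    ("B", "Yeah."), ("A", "So."), ("A", "Anyway."), ("B", "Bye."),
]
toy_dialogs = chunk_dialogs(toy)
for turn in toy_dialogs[0]:
    print(turn)
```

On this toy input, the two opening utterances by `A` are merged into one turn, the block is emitted as soon as its fifth turn (`So.`) starts, and the remaining two turns are dropped because they never reach five turns. The loop in the cell above treats a trailing incomplete block the same way.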

## 3. Metrics

Fluency and coherence are the two properties we want to annotate.

### 3.1 TAACO Metrics



#### 3.1.1 Loading TAACO Outputs for Dialogs

TAACO accepts inputs as `.txt` files. To prepare them, we first split the dataset into turns using `switchboard_data`, then save each dialog as its own `.txt` file in the `dialog_level_texts` directory.

Next, we run TAACO on these text files by executing `test_donya.py`, located in the TAACO directory. The outputs are saved to the `switchboard_results` directory.
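
The export step described above might look roughly like the sketch below. It is not the code behind `switchboard_data`: the helper name `write_dialogs`, the zero-based numbering, and the one-turn-per-line layout are assumptions, chosen to match the `dialog_XXXXX.txt` filenames that appear in the TAACO results.

```python
from pathlib import Path
import tempfile


def write_dialogs(dialogs, out_dir):
    """Write each dialog (a list of turn strings) to its own .txt file.

    Hypothetical helper. The zero-padded, zero-based file numbering is an
    assumption and may differ from the numbering used for the real corpus.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for i, turns in enumerate(dialogs):
        path = out_dir / f"dialog_{i:05d}.txt"
        # One turn per line, so each file holds one complete dialog.
        path.write_text("\n".join(turns) + "\n", encoding="utf-8")
        paths.append(path)
    return paths


# Made-up dialog, written to a temporary directory for illustration.
example_dialogs = [["A: Okay.", "B: Thanks.", "A: Thank you.", "B: Bye.", "A: Bye."]]
with tempfile.TemporaryDirectory() as tmp:
    written = write_dialogs(example_dialogs, Path(tmp) / "dialog_level_texts")
    first_name = written[0].name
    first_text = written[0].read_text(encoding="utf-8")

print(first_name)
print(first_text)
```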

We then filter the TAACO outputs, keeping only the following metrics:

- `noun_ttr`
- `verb_ttr`
- `adj_ttr`
- `lemma_ttr`
- `bigram_lemma_ttr`
- `trigram_lemma_ttr`
- `adjacent_overlap_all_sent`
- `lda_1_all_sent`
- `repeated_content_lemmas`
- `repeated_content_and_pronoun_lemmas`
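
As a rough illustration of this filtering step, the sketch below keeps `Filename` plus the ten metrics listed above and fails loudly if any of them is missing. The helper name `filter_taaco`, the extra column `unused_metric`, and the toy values are made up for the example. The real TAACO output has many more columns.

```python
import pandas as pd

SELECTED_METRICS = [
    "noun_ttr", "verb_ttr", "adj_ttr", "lemma_ttr",
    "bigram_lemma_ttr", "trigram_lemma_ttr", "adjacent_overlap_all_sent",
    "lda_1_all_sent", "repeated_content_lemmas",
    "repeated_content_and_pronoun_lemmas",
]


def filter_taaco(results, metrics=SELECTED_METRICS):
    """Return only the Filename column and the selected metric columns.

    Hypothetical helper, not part of TAACO itself.
    """
    missing = [m for m in metrics if m not in results.columns]
    if missing:
        raise KeyError(f"Missing TAACO columns: {missing}")
    return results[["Filename", *metrics]].copy()


# Toy stand-in for a TAACO results table (values are placeholders).
toy_results = pd.DataFrame({
    "Filename": ["dialog_00000.txt"],
    **{m: [0.5] for m in SELECTED_METRICS},
    "unused_metric": [0.1],  # hypothetical column that should be dropped
})
filtered = filter_taaco(toy_results)
print(list(filtered.columns))
```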

Finally, we rank the dialog texts by token length (whitespace-delimited tokens) to prioritize longer dialogs, and cap the final corpus at **30 million tokens**. The full set contains about 1.27 million tokens, so the cap does not exclude any dialog.


In [4]:
import pandas as pd
import os

# Paths
csv_file_path = '/home/rooein/BabyLM/babylm-interaction/baseline/data/switchboard_results_filtered/dialog_level_taaco_results.csv'
texts_folder_path = '/home/rooein/BabyLM/babylm-interaction/baseline/data/switchboard_taaco_input/dialog_level_texts'
output_csv_path = '/home/rooein/BabyLM/babylm-interaction/baseline/data/switchboard_results_filtered/dialog_level_taaco_with_text_30M.csv'

# Step 1: Load CSV
metrics_df = pd.read_csv(csv_file_path)

# Step 2: Load and map text files to filenames
def load_text(filename, folder):
    filepath = os.path.join(folder, filename)
    with open(filepath, 'r', encoding='utf-8') as file:
        return file.read().strip()

# Add text column based on Filename
metrics_df['text'] = metrics_df['Filename'].apply(lambda fname: load_text(fname, texts_folder_path))

# Step 3: Compute token length and rank texts
metrics_df['token_length'] = metrics_df['text'].apply(lambda x: len(x.split()))
print(f"Total tokens before ranking: {metrics_df['token_length'].sum()}")

metrics_df.sort_values(by='token_length', ascending=False, inplace=True)

# Step 4: Select data until reaching approximately 30 million tokens
max_tokens = 30_000_000
token_count = 0
selected_rows = []

for idx, row in metrics_df.iterrows():
    if token_count + row['token_length'] <= max_tokens:
        selected_rows.append(row)
        token_count += row['token_length']
    else:
        break

selected_df = pd.DataFrame(selected_rows)

print(f"Total tokens after ranking and selection: {selected_df['token_length'].sum()}")

# Step 5: Save the final DataFrame
selected_df.to_csv(output_csv_path, index=False)

# Display the first few rows
selected_df.head()


Total tokens before ranking: 1269754
Total tokens after ranking and selection: 1269754


Unnamed: 0,Filename,noun_ttr,verb_ttr,adj_ttr,lemma_ttr,bigram_lemma_ttr,trigram_lemma_ttr,adjacent_overlap_all_sent,lda_1_all_sent,repeated_content_lemmas,repeated_content_and_pronoun_lemmas,text,token_length
2341,dialog_02341.txt,0.71,0.475,0.857143,0.332795,0.815534,0.970827,0.191244,0.948037,0.211632,0.276252,"B:: And nothing is being done about it. Uh, th...",598
2633,dialog_02633.txt,0.659794,0.486842,0.657143,0.336299,0.832442,0.978571,0.217778,0.979711,0.275801,0.339858,B:: I don't know if any of mine will be intere...,547
8707,dialog_08707.txt,0.68,0.463415,0.75,0.321839,0.786948,0.938462,0.203166,0.970758,0.272031,0.362069,A:: I put a stop to some of them as far as the...,497
12658,dialog_12658.txt,0.621622,0.432432,0.787879,0.342697,0.840525,0.956767,0.18287,0.956388,0.284644,0.393258,"A:: You know, my neighbors across the street, ...",494
283,dialog_00283.txt,0.708861,0.538462,0.692308,0.355509,0.816667,0.960334,0.250737,0.969054,0.251559,0.361746,"A:: Yeah. Have you, do you use a standard, uh,...",465
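
The row-by-row selection loop in the cell above can also be written with a cumulative sum. The sketch below is an alternative formulation, not the code that produced the outputs shown. The function name `select_within_budget` and the toy data are assumptions. Because token lengths are non-negative, keeping the rows whose running total stays within the budget selects the same prefix as a loop that stops at the first row that would exceed it.

```python
import pandas as pd


def select_within_budget(df, max_tokens, length_col="token_length"):
    """Rank rows by length (longest first) and keep the longest prefix
    whose cumulative token count does not exceed max_tokens.

    Hypothetical helper illustrating a vectorized version of the loop.
    """
    ranked = df.sort_values(by=length_col, ascending=False)
    within_budget = ranked[length_col].cumsum() <= max_tokens
    return ranked[within_budget]


# Toy data, for illustration only.
toy = pd.DataFrame({
    "Filename": ["a.txt", "b.txt", "c.txt", "d.txt"],
    "token_length": [3, 10, 7, 5],
})
picked = select_within_budget(toy, max_tokens=18)
print(picked)
```

With a budget of 18, only the two longest toy rows (10 and 7 tokens) fit. With a budget far above the total, as with the 30 million cap here, every row is kept.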


In [5]:
selected_df.columns

Index(['Filename', 'noun_ttr', 'verb_ttr', 'adj_ttr', 'lemma_ttr',
       'bigram_lemma_ttr', 'trigram_lemma_ttr', 'adjacent_overlap_all_sent',
       'lda_1_all_sent', 'repeated_content_lemmas',
       'repeated_content_and_pronoun_lemmas', 'text', 'token_length'],
      dtype='object')

In [6]:
import pandas as pd


# Work on a copy so that selected_df is not modified in place
df = selected_df.copy()

# Define TAACO metrics columns
taaco_cols = [
    'noun_ttr', 'verb_ttr', 'adj_ttr', 'lemma_ttr',
    'bigram_lemma_ttr', 'trigram_lemma_ttr', 'adjacent_overlap_all_sent',
    'lda_1_all_sent', 'repeated_content_lemmas',
    'repeated_content_and_pronoun_lemmas'
]

# Create new column with TAACO metrics as a dictionary
df['TAACO_metrics'] = df[taaco_cols].to_dict(orient='records')

# Keep only the desired columns
df_reformatted = df[['Filename', 'text', 'token_length', 'TAACO_metrics']]

# Display or save the new DataFrame
print(df_reformatted.head())


               Filename                                               text  \
2341   dialog_02341.txt  B:: And nothing is being done about it. Uh, th...   
2633   dialog_02633.txt  B:: I don't know if any of mine will be intere...   
8707   dialog_08707.txt  A:: I put a stop to some of them as far as the...   
12658  dialog_12658.txt  A:: You know, my neighbors across the street, ...   
283    dialog_00283.txt  A:: Yeah. Have you, do you use a standard, uh,...   

       token_length                                      TAACO_metrics  
2341            598  {'noun_ttr': 0.71, 'verb_ttr': 0.475, 'adj_ttr...  
2633            547  {'noun_ttr': 0.6597938144329897, 'verb_ttr': 0...  
8707            497  {'noun_ttr': 0.68, 'verb_ttr': 0.4634146341463...  
12658           494  {'noun_ttr': 0.6216216216216216, 'verb_ttr': 0...  
283             465  {'noun_ttr': 0.7088607594936709, 'verb_ttr': 0...  


In [7]:
df_reformatted

Unnamed: 0,Filename,text,token_length,TAACO_metrics
2341,dialog_02341.txt,"B:: And nothing is being done about it. Uh, th...",598,"{'noun_ttr': 0.71, 'verb_ttr': 0.475, 'adj_ttr..."
2633,dialog_02633.txt,B:: I don't know if any of mine will be intere...,547,"{'noun_ttr': 0.6597938144329897, 'verb_ttr': 0..."
8707,dialog_08707.txt,A:: I put a stop to some of them as far as the...,497,"{'noun_ttr': 0.68, 'verb_ttr': 0.4634146341463..."
12658,dialog_12658.txt,"A:: You know, my neighbors across the street, ...",494,"{'noun_ttr': 0.6216216216216216, 'verb_ttr': 0..."
283,dialog_00283.txt,"A:: Yeah. Have you, do you use a standard, uh,...",465,"{'noun_ttr': 0.7088607594936709, 'verb_ttr': 0..."
...,...,...,...,...
1093,dialog_01093.txt,"A:: Yeah.\nB:: and,\nA:: Uh-huh.\nB:: I don't ...",12,"{'noun_ttr': 0.4, 'verb_ttr': 1.0, 'adj_ttr': ..."
10672,dialog_10672.txt,B:: Uh-huh.\nA:: Small island.\nB:: Yeah.\nA::...,11,"{'noun_ttr': 0.5, 'verb_ttr': 0.0, 'adj_ttr': ..."
6125,dialog_06125.txt,A:: Thanks.\nB:: Thank you.\nA:: Bye.\nB:: Goo...,11,"{'noun_ttr': 0.5, 'verb_ttr': 1.0, 'adj_ttr': ..."
13613,dialog_13613.txt,A:: Okay.\nB:: Thanks.\nA:: Thank you.\nB:: By...,11,"{'noun_ttr': 0.4444444444444444, 'verb_ttr': 1..."


In [8]:
# save df_reformatted to csv
output_csv_path = '/home/rooein/BabyLM/babylm-interaction/baseline/data/switchboard_results_filtered/dialog_level_taaco_reformatted.csv'
df_reformatted.to_csv(output_csv_path, index=False)
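
One thing to keep in mind when reading this file back: a CSV cannot store Python dictionaries, so the `TAACO_metrics` column is written as the string representation of each dict. A minimal sketch of restoring the dictionaries with `ast.literal_eval` follows. The toy frame and the in-memory buffer stand in for the real file and are assumptions made for the example.

```python
import ast
import io

import pandas as pd

# Toy frame with a dict-valued column (values are illustrative).
toy = pd.DataFrame({
    "Filename": ["dialog_00000.txt"],
    "TAACO_metrics": [{"noun_ttr": 0.71, "verb_ttr": 0.475}],
})

# Round-trip through CSV, using an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

# After loading, each cell is a string such as "{'noun_ttr': 0.71, ...}".
raw_value = reloaded.loc[0, "TAACO_metrics"]

# Parse the strings back into dictionaries.
reloaded["TAACO_metrics"] = reloaded["TAACO_metrics"].apply(ast.literal_eval)
restored = reloaded.loc[0, "TAACO_metrics"]
print(type(raw_value).__name__, type(restored).__name__)
```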