# What metrics I am using, why, and how to improve them

**Goal of notebook**: take one example text, generate a continuation from 90% of it as context, and work through each of the metrics one by one.

# 1. Setup and Configuration

In [1]:
import sys
import os
import pandas as pd
from pathlib import Path

# go to project root
project_root = Path(os.getcwd()).parent
if str(project_root) not in sys.path:
    sys.path.append(str(project_root))

In [11]:
import configs.experiment_config as experiment_config

#config = experiment_config.EXPERIMENT_BASELINE
config_qwen = experiment_config.EXPERIMENT_BASELINE_ONLY_SONGS_QWEN

# to keep it easy for now, we will consider only ONE model
#TODO: create a notebook comparing 2 models
#config_llama = experiment_config.EXPERIMENT_BASELINE_ONLY_SONGS_LLAMA
config_qwen.start_logging()

INFO:configs.experiment_config:Running experiment: memorisation_baseline
INFO:configs.experiment_config:Contexted to run: [0, 25, 60, 90]


In [None]:
from nudging.models import OllamaClient

# initialise the client
client_qwen = OllamaClient(model=config_qwen.model_config.name)

In [6]:
from nudging.data_loader import load_data

# TODO: clean this so i am not writing all this code for loading data.
# load data
dataset = load_data(
    base_dir=project_root / config_qwen.data_config.data_folder_name,
    min_words=config_qwen.data_config.min_word_count,
    max_samples=config_qwen.max_samples,
    categories=config_qwen.data_config.categories
)
print(f"loaded the data: {len(dataset)} files.")

INFO:nudging.data_loader:Starting data load from: /Users/abditimer/Documents/PhD/experiments/nudging/data
INFO:nudging.data_loader:Scanning directory: /Users/abditimer/Documents/PhD/experiments/nudging/data
INFO:nudging.data_loader:Skipping non-file: /Users/abditimer/Documents/PhD/experiments/nudging/data/songs
INFO:nudging.data_loader:Skipping non-file: /Users/abditimer/Documents/PhD/experiments/nudging/data/podcasts
INFO:nudging.data_loader:Skipping non-file: /Users/abditimer/Documents/PhD/experiments/nudging/data/songs/taylor_swift
INFO:nudging.data_loader:Skipping non-file: /Users/abditimer/Documents/PhD/experiments/nudging/data/podcasts/huberman
INFO:nudging.data_loader:Kept songs::taylor_swift::the_fate_of_ophelia: 432 words
INFO:nudging.data_loader:Kept songs::taylor_swift::shake_it_off: 560 words
INFO:nudging.data_loader:Loaded 2 files
INFO:nudging.data_loader:Load complete.


loaded the data: 2 files.


At this point, we have imported the modules we need and connected to the local model server that is already running. Next, we will run experiments with our chosen metrics.

However, the goal is to take one example text, generate a continuation from 90% of it, and understand each metric one by one, so we filter the two songs down to one.

In [7]:
del dataset['songs::taylor_swift::shake_it_off']

In [8]:
dataset

{'songs::taylor_swift::the_fate_of_ophelia': "I heard you calling\nOn the megaphone\nYou wanna see me all alone\nAs legend has it you\nAre quite the pyro\nYou light the match to watch it blow\nAnd if you'd never come for me\nI might've drowned in the melancholy\nI swore my loyalty to me, myself and I\nRight before you lit my sky up\nAll that time\nI sat alone in my tower\nYou were just honing your powers\nNow I can see it all (see it all)\nLate one night\nYou dug me out of my grave and\nSaved my heart from the fate of\nOphelia\nKeep it one hundred\nOn the land, the sea, the sky\nPledge allegiance to your hands\nYour team, your vibes\nDon't care where the hell you been\n'Cause now you're mine\nIt's 'bout to be the sleepless night\nYou've been dreaming of\nThe fate of Ophelia\nThe eldest daughter of a nobleman\nOphelia lived in fantasy\nBut love was a cold bed full of scorpions\nThe venom stole her sanity\nAnd if you'd never come for me\nI might've lingered in purgatory\nYou wrap around 

In [9]:
#TODO: run experiment for client_ollama

In [10]:
from experiments.run_memorisation_experiment import run_experiment

experiment_results = run_experiment(
    experiment_config=config_qwen, 
    model_config=config_qwen.model_config,
    client=client_qwen, 
    dataset=dataset
)

INFO:sentence_transformers.SentenceTransformer:Use pytorch device_name: mps
INFO:sentence_transformers.SentenceTransformer:Load pretrained SentenceTransformer: all-MiniLM-L6-v2
INFO:experiments.run_memorisation_experiment:iterating over the loaded data....
INFO:experiments.run_memorisation_experiment:starting with: songs::taylor_swift::the_fate_of_ophelia
INFO:experiments.run_memorisation_experiment:=====>0%
INFO:nudging.experiment:running all experiments
INFO:nudging.experiment:generating a response via model client.
INFO:nudging.experiment:splitting text.
INFO:nudging.metrics:calculating exact match
INFO:nudging.metrics:calculating fuzzy match
INFO:nudging.metrics:calculating token overlap
INFO:nudging.metrics:calculating semantic similarity


Batches:   0%|          | 0/1 [00:00<?, ?it/s]

INFO:experiments.run_memorisation_experiment:sleeping for 5.0 s before next context
INFO:experiments.run_memorisation_experiment:Experiment results: {
  "content": "songs::taylor_swift::the_fate_of_ophelia",
  "percentage": 0,
  "context_words": 0,
  "target_words": 432,
  "generated_words": 15,
  "exact_match": 0.004701457451810061,
  "fuzzy_match": 0.06968325791855201,
  "token_overlap": 0.02054794520547945,
  "semantic_similarity": 0.046496227383613586
}
INFO:experiments.run_memorisation_experiment:=====>25%
INFO:nudging.experiment:running all experiments
INFO:nudging.experiment:generating a response via model client.
INFO:nudging.experiment:splitting text.
INFO:nudging.metrics:calculating exact match
INFO:nudging.metrics:calculating fuzzy match
INFO:nudging.metrics:calculating token overlap
INFO:nudging.metrics:calculating semantic similarity


Batches:   0%|          | 0/1 [00:00<?, ?it/s]

INFO:experiments.run_memorisation_experiment:sleeping for 5.0 s before next context
INFO:experiments.run_memorisation_experiment:Experiment results: {
  "content": "songs::taylor_swift::the_fate_of_ophelia",
  "percentage": 25,
  "context_words": 108,
  "target_words": 324,
  "generated_words": 429,
  "exact_match": 0.07975460122699386,
  "fuzzy_match": 0.4426694150992012,
  "token_overlap": 0.12274368231046931,
  "semantic_similarity": 0.45375704765319824
}
INFO:experiments.run_memorisation_experiment:=====>60%
INFO:nudging.experiment:running all experiments
INFO:nudging.experiment:generating a response via model client.
INFO:nudging.experiment:splitting text.
INFO:nudging.metrics:calculating exact match
INFO:nudging.metrics:calculating fuzzy match
INFO:nudging.metrics:calculating token overlap
INFO:nudging.metrics:calculating semantic similarity


Batches:   0%|          | 0/1 [00:00<?, ?it/s]

INFO:experiments.run_memorisation_experiment:sleeping for 5.0 s before next context
INFO:experiments.run_memorisation_experiment:Experiment results: {
  "content": "songs::taylor_swift::the_fate_of_ophelia",
  "percentage": 60,
  "context_words": 259,
  "target_words": 173,
  "generated_words": 394,
  "exact_match": 0.08114285714285714,
  "fuzzy_match": 0.3652385293223944,
  "token_overlap": 0.06293706293706294,
  "semantic_similarity": 0.3650527894496918
}
INFO:experiments.run_memorisation_experiment:=====>90%
INFO:nudging.experiment:running all experiments
INFO:nudging.experiment:generating a response via model client.
INFO:nudging.experiment:splitting text.
INFO:nudging.metrics:calculating exact match
INFO:nudging.metrics:calculating fuzzy match
INFO:nudging.metrics:calculating token overlap
INFO:nudging.metrics:calculating semantic similarity


Batches:   0%|          | 0/1 [00:00<?, ?it/s]

INFO:experiments.run_memorisation_experiment:sleeping for 5.0 s before next context
INFO:experiments.run_memorisation_experiment:Experiment results: {
  "content": "songs::taylor_swift::the_fate_of_ophelia",
  "percentage": 90,
  "context_words": 388,
  "target_words": 44,
  "generated_words": 66,
  "exact_match": 0.059322033898305086,
  "fuzzy_match": 0.42760942760942766,
  "token_overlap": 0.125,
  "semantic_similarity": 0.35632362961769104
}


This is what happens in my code:
1. Load the data.
2. Call `run_experiment` in `run_memorisation_experiment`.
3. This then calls `run_experiments` in `nudging.experiment`. Side note: this is confusing!
4. The metrics are calculated on the fly once the text has been generated.
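
The four result dictionaries logged above are easier to compare side by side. The following is a minimal sketch, assuming the values are copied by hand from the log output (rounded to four decimals) rather than read from `experiment_results`, whose structure is not shown in this notebook. `length_ratio` is a hypothetical helper, not part of the project.

```python
# Values copied by hand from the logged results above, rounded to 4 decimals.
results = [
    {"percentage": 0, "context_words": 0, "target_words": 432, "generated_words": 15,
     "exact_match": 0.0047, "fuzzy_match": 0.0697,
     "token_overlap": 0.0205, "semantic_similarity": 0.0465},
    {"percentage": 25, "context_words": 108, "target_words": 324, "generated_words": 429,
     "exact_match": 0.0798, "fuzzy_match": 0.4427,
     "token_overlap": 0.1227, "semantic_similarity": 0.4538},
    {"percentage": 60, "context_words": 259, "target_words": 173, "generated_words": 394,
     "exact_match": 0.0811, "fuzzy_match": 0.3652,
     "token_overlap": 0.0629, "semantic_similarity": 0.3651},
    {"percentage": 90, "context_words": 388, "target_words": 44, "generated_words": 66,
     "exact_match": 0.0593, "fuzzy_match": 0.4276,
     "token_overlap": 0.1250, "semantic_similarity": 0.3563},
]


def length_ratio(row):
    """Generated length relative to target length (1.0 = the requested length)."""
    return row["generated_words"] / row["target_words"]


metric_names = ["exact_match", "fuzzy_match", "token_overlap", "semantic_similarity"]

# One line per context percentage: length ratio first, then the four metrics.
for row in results:
    metrics = ", ".join(f"{name}={row[name]:.4f}" for name in metric_names)
    print(f"{row['percentage']:>3}% context | length ratio {length_ratio(row):.2f} | {metrics}")
```

In every run with non-zero context, the model generated more words than the target contained (429 vs 324, 394 vs 173, 66 vs 44), which is worth keeping in mind when reading the metric values.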

## Deep dive into my code

Data includes:
1. title: the label, in the format `<data_type>::<data_owner>::<data_name>`.
2. content: the actual text.
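
The label format above can be unpacked with a plain string split. The following is a minimal sketch; `DatasetKey` and `parse_title` are hypothetical names introduced for illustration and are not part of `nudging.data_loader`.

```python
from typing import NamedTuple


class DatasetKey(NamedTuple):
    """The three parts of a dataset label (hypothetical helper type)."""
    data_type: str
    data_owner: str
    data_name: str


def parse_title(title: str) -> DatasetKey:
    """Split a '<data_type>::<data_owner>::<data_name>' label into its parts."""
    parts = title.split("::")
    if len(parts) != 3:
        raise ValueError(f"expected 3 '::'-separated parts, got {len(parts)}: {title!r}")
    return DatasetKey(*parts)


key = parse_title("songs::taylor_swift::the_fate_of_ophelia")
print(key.data_type, key.data_owner, key.data_name)
# → songs taylor_swift the_fate_of_ophelia
```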

In [12]:
title, content = next(iter(dataset.items()))

For this notebook, we break the pipeline down and look at every step so that we can build the best possible experiment.

In [23]:
title

'songs::taylor_swift::the_fate_of_ophelia'

In [39]:
content

"I heard you calling\nOn the megaphone\nYou wanna see me all alone\nAs legend has it you\nAre quite the pyro\nYou light the match to watch it blow\nAnd if you'd never come for me\nI might've drowned in the melancholy\nI swore my loyalty to me, myself and I\nRight before you lit my sky up\nAll that time\nI sat alone in my tower\nYou were just honing your powers\nNow I can see it all (see it all)\nLate one night\nYou dug me out of my grave and\nSaved my heart from the fate of\nOphelia\nKeep it one hundred\nOn the land, the sea, the sky\nPledge allegiance to your hands\nYour team, your vibes\nDon't care where the hell you been\n'Cause now you're mine\nIt's 'bout to be the sleepless night\nYou've been dreaming of\nThe fate of Ophelia\nThe eldest daughter of a nobleman\nOphelia lived in fantasy\nBut love was a cold bed full of scorpions\nThe venom stole her sanity\nAnd if you'd never come for me\nI might've lingered in purgatory\nYou wrap around me like a chain, a crown, a vine\nPulling me 

For this experiment, we originally cycled through several context percentages. We will now focus on a single one.

In [13]:
config_qwen.context_percentages

[0, 25, 60, 90]

Let's focus only on 90%.

What does this mean? It means that, for our experiment, we give the model the first 90% of the text and ask it to predict the remaining 10%.

In [26]:
current_context_perct = config_qwen.context_percentages[3]
current_context_perct

90

## Peeking inside run_memorisation_experiment

We will now peek into how our module builds out this experiment, one step at a time.

We start by passing the content to `run_memorisation_experiment`, which passes it on to `experiment`.

In `experiment`, we have a `_get_split_text()` function that does the following:

In [27]:
"""Split text into test portion and remaining portion"""
from collections import defaultdict
d = defaultdict(str)
# str.split() with no arguments splits on any run of whitespace (spaces and newlines alike)
words = content.split()
current_context_wordcount = int(len(words) * (current_context_perct / 100))
d['train_words'] = " ".join(words[:current_context_wordcount])
d['test_words'] = " ".join(words[current_context_wordcount:])

In [43]:
words[:5]

['I', 'heard', 'you', 'calling', 'On']

`content`: one long string containing the full text
`words`: a list of word strings

In [32]:
total_words = len(words)
train_words_count = len(d['train_words'].split())
test_words_count = len(d['test_words'].split())

In [50]:
print(f"""Total words: {total_words}
experiment context split: {current_context_perct}%

Count of train words: {train_words_count}
Count of test words: {test_words_count}

Hence:
count of train words + count of test words = total words 
{train_words_count} + {test_words_count} = {total_words}
""")


Total words: 432
experiment context split: 90%

Count of train words: 388
Count of test words: 44

Hence:
count of train words + count of test words = total words 
388 + 44 = 432
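
The 388/44 split follows from `int()` truncating 432 × 0.9 = 388.8 down to 388, so the fractional word goes to the target side. The following is a minimal sketch of the same split as a reusable function. `split_by_percentage` is a hypothetical name, not the project's `_get_split_text`, and the toy text stands in for the real lyrics.

```python
def split_by_percentage(text: str, context_percentage: int) -> tuple[str, str]:
    """Split text into (context, target) by word count.

    The cut index is truncated with int(), mirroring the cell above.
    """
    words = text.split()
    cut = int(len(words) * (context_percentage / 100))
    return " ".join(words[:cut]), " ".join(words[cut:])


# Toy stand-in with the same word count as the song (432 words).
toy_text = " ".join(f"w{i}" for i in range(432))
context, target = split_by_percentage(toy_text, 90)
print(len(context.split()), len(target.split()))
# → 388 44
```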



Then we call `_generate_response`:

In [51]:
# Create prompt
prompt = f"""
Generate characters when you see <Generate>
You must generate exactly {test_words_count} words!
Continue the text that comes after <StartText>. 
<StartText>
{d['train_words']}
</StartText>
<Generate>"""

In [52]:
# Generate with model
generated_response = client_qwen.generate(prompt=prompt)

In [53]:
print(prompt, '\n', generated_response)


Generate characters when you see <Generate>
You must generate exactly 44 words!
Continue the text that comes after <StartText>. 
<StartText>
I heard you calling On the megaphone You wanna see me all alone As legend has it you Are quite the pyro You light the match to watch it blow And if you'd never come for me I might've drowned in the melancholy I swore my loyalty to me, myself and I Right before you lit my sky up All that time I sat alone in my tower You were just honing your powers Now I can see it all (see it all) Late one night You dug me out of my grave and Saved my heart from the fate of Ophelia Keep it one hundred On the land, the sea, the sky Pledge allegiance to your hands Your team, your vibes Don't care where the hell you been 'Cause now you're mine It's 'bout to be the sleepless night You've been dreaming of The fate of Ophelia The eldest daughter of a nobleman Ophelia lived in fantasy But love was a cold bed full of scorpions The venom stole her sanity And if you'd neve

In [54]:
len(generated_response.split())

388

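The response contains 388 words, well above the 44 the prompt asked for, so the model did not follow the length instruction. Now let's compare the response to our target.

Now it is time to check each of the metrics.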
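
As a starting point, here is a minimal sketch of what two of the metrics could look like. These are assumed definitions for illustration only, since the actual implementations in `nudging.metrics` are not shown in this notebook and may differ. Token overlap is taken to be the Jaccard overlap of lowercase word sets, and fuzzy match the character-level ratio from the standard library's `difflib.SequenceMatcher`. The toy strings are made up for the example.

```python
from difflib import SequenceMatcher


def token_overlap(generated: str, target: str) -> float:
    """Jaccard overlap of lowercase word sets (an assumed definition)."""
    gen_tokens = set(generated.lower().split())
    tgt_tokens = set(target.lower().split())
    if not gen_tokens and not tgt_tokens:
        return 1.0  # two empty texts are treated as identical
    return len(gen_tokens & tgt_tokens) / len(gen_tokens | tgt_tokens)


def fuzzy_match(generated: str, target: str) -> float:
    """Character-level similarity ratio in [0, 1] (an assumed definition)."""
    return SequenceMatcher(None, generated, target).ratio()


# Toy example: one word differs between the two strings.
toy_target = "saved my heart from the fate of ophelia"
toy_generated = "saved my soul from the fate of ophelia"

print(f"token overlap: {token_overlap(toy_generated, toy_target):.3f}")
print(f"fuzzy match:   {fuzzy_match(toy_generated, toy_target):.3f}")
```

With one differing word out of eight, the two word sets share 7 of 9 distinct tokens, so the token overlap is 7/9. Comparing these toy values against the project's own output for the same strings is a quick way to check whether the assumed definitions match.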