# What metrics I am using, why, and how to improve them

## **Goal of notebook**: take one example text, generate a continuation from 90% of it as context, and work through each of the metrics one by one

## 1. Setup and Configuration

In [1]:
import sys
import os
import pandas as pd
from pathlib import Path

# go to project root
project_root = Path(os.getcwd()).parent
if str(project_root) not in sys.path:
    sys.path.append(str(project_root))

In [2]:
from configs.experiment_config import baseline, extended
# old approach
# config_qwen = experiment_config.EXPERIMENT_BASELINE_ONLY_SONGS_QWEN
config_qwen = extended(
    model="qwen2.5:0.5b-instruct",
    context_delay_seconds=3.0,
    max_tokens=100,
)
config_qwen.start_logging()

INFO:configs.experiment_config:Running experiment: memorisation_extended
INFO:configs.experiment_config:Contexted to run: [5, 25, 50, 75, 90]


In [3]:
from nudging.models import OllamaClient

# initialise the client
client_qwen = OllamaClient(
    model=config_qwen.model_config.name,
    max_tokens=config_qwen.model_config.max_tokens,
)

In [4]:
from nudging.data_loader import load_data

# TODO: clean this up so I am not writing all this code just to load data.
# load data
dataset = load_data(
    base_dir=project_root / config_qwen.data_config.data_folder_name,
    min_words=config_qwen.data_config.min_word_count,
    max_samples=config_qwen.max_samples,
    categories=config_qwen.data_config.categories
)
print(f"loaded the data: {len(dataset)} files.")

INFO:nudging.data_loader:Starting data load from: /Users/abditimer/Documents/PhD/experiments/nudging/data
INFO:nudging.data_loader:Scanning directory: /Users/abditimer/Documents/PhD/experiments/nudging/data
INFO:nudging.data_loader:Skipping non-file: /Users/abditimer/Documents/PhD/experiments/nudging/data/songs
INFO:nudging.data_loader:Skipping non-file: /Users/abditimer/Documents/PhD/experiments/nudging/data/podcasts
INFO:nudging.data_loader:Skipping non-file: /Users/abditimer/Documents/PhD/experiments/nudging/data/songs/taylor_swift
INFO:nudging.data_loader:Skipping non-file: /Users/abditimer/Documents/PhD/experiments/nudging/data/podcasts/huberman
INFO:nudging.data_loader:Kept songs::taylor_swift::the_fate_of_ophelia: 432 words
INFO:nudging.data_loader:Kept songs::taylor_swift::shake_it_off: 560 words
INFO:nudging.data_loader:Loaded 2 files
INFO:nudging.data_loader:Load complete.


loaded the data: 2 files.


## 2. Limit to only one song for this experiment. 

At this point, we have imported the modules we need and connected to the local server we started. Next, we will run experiments with our chosen metrics.

In [5]:
del dataset['songs::taylor_swift::shake_it_off']

In [6]:
dataset

{'songs::taylor_swift::the_fate_of_ophelia': "I heard you calling\nOn the megaphone\nYou wanna see me all alone\nAs legend has it you\nAre quite the pyro\nYou light the match to watch it blow\nAnd if you'd never come for me\nI might've drowned in the melancholy\nI swore my loyalty to me, myself and I\nRight before you lit my sky up\nAll that time\nI sat alone in my tower\nYou were just honing your powers\nNow I can see it all (see it all)\nLate one night\nYou dug me out of my grave and\nSaved my heart from the fate of\nOphelia\nKeep it one hundred\nOn the land, the sea, the sky\nPledge allegiance to your hands\nYour team, your vibes\nDon't care where the hell you been\n'Cause now you're mine\nIt's 'bout to be the sleepless night\nYou've been dreaming of\nThe fate of Ophelia\nThe eldest daughter of a nobleman\nOphelia lived in fantasy\nBut love was a cold bed full of scorpions\nThe venom stole her sanity\nAnd if you'd never come for me\nI might've lingered in purgatory\nYou wrap around 

## 3. Test longer run_experiment

In [7]:
from experiments.run_memorisation_experiment import run_experiment

experiment_results = run_experiment(
    experiment_config=config_qwen, 
    model_config=config_qwen.model_config,
    client=client_qwen, 
    dataset=dataset
)

KeyboardInterrupt: 

This is what happens in my code:
1. We load the data.
2. We call `run_experiment` in `run_memorisation_experiment`.
3. This in turn calls `run_experiments` in `nudging.experiment` (side note: this is confusing!).
4. The metrics are calculated on the fly once the text has been generated.

## Deep dive into my code

Data includes:
1. title: the label, in `<data_type>::<data_owner>::<data_name>` format
2. content: the actual text
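Because the label packs three fields into one string, it can be unpacked with a plain `split`. This is a minimal sketch, assuming every title has exactly three `::`-separated parts; `TitleParts` and `parse_title` are hypothetical names and are not part of the `nudging` package.

```python
from typing import NamedTuple


class TitleParts(NamedTuple):
    """The three fields of a dataset label (hypothetical helper type)."""
    data_type: str
    data_owner: str
    data_name: str


def parse_title(title: str) -> TitleParts:
    """Split a '<data_type>::<data_owner>::<data_name>' label into its fields."""
    parts = title.split("::")
    if len(parts) != 3:
        raise ValueError(f"expected 3 '::'-separated fields, got {len(parts)}: {title!r}")
    return TitleParts(*parts)


parts = parse_title("songs::taylor_swift::the_fate_of_ophelia")
print(parts.data_type, parts.data_owner, parts.data_name)
```

Named fields make later grouping (for example, by `data_type`) easier to read than indexing into a list.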

In [None]:
title, content = next(iter(dataset.items()))

For this notebook, we are breaking it down. We will look at every step so we can build the best experiment. Facts.

In [None]:
title

In [None]:
content

Originally, this experiment cycled through several context percentages. Here we focus on a single one.

In [None]:
config_qwen.context_percentages

Let's focus only on 90%.

What does this mean? It means, for our experiment, we want to take 90% of that text and predict the remaining 10%.

In [None]:
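# The log above lists the percentages as [5, 25, 50, 75, 90],
# so 90 is the last entry (index 3 would select 75).
current_context_perct = config_qwen.context_percentages[-1]
current_context_perct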

## Peeking inside run_memorisation_experiment

We will now peek into how our module builds this experiment, one step at a time.

We start by passing the content to `run_memorisation_experiment`, which hands it on to `experiment`.

In `experiment`, we have a `_get_split_text()` function that does the following:

In [None]:
# Split the text into a context ("train") portion and the remaining ("test") portion
from collections import defaultdict
d = defaultdict(str)
# str.split() with no arguments splits on any run of whitespace (spaces and newlines)
words = content.split()
# int() truncates, so the context gets the rounded-down share of the words
current_context_wordcount = int(len(words) * (current_context_perct / 100))
d['train_words'] = " ".join(words[:current_context_wordcount])
d['test_words'] = " ".join(words[current_context_wordcount:])
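One detail worth checking in the split above: `int(len(words) * (pct / 100))` goes through floating point and then truncates, so a product that lands just under a whole number loses a word. It may never matter for the percentages used in this notebook, but an integer-only variant avoids the question. The sketch below uses `split_by_percentage`, a hypothetical helper that is not the module's `_get_split_text`.

```python
def split_by_percentage(text: str, context_pct: int) -> tuple[str, str]:
    """Return (context, remainder), cutting at context_pct percent of the words."""
    if not 0 <= context_pct <= 100:
        raise ValueError("context_pct must be between 0 and 100")
    words = text.split()
    # Integer arithmetic: floor(len(words) * pct / 100) with no float round trip.
    cut = len(words) * context_pct // 100
    return " ".join(words[:cut]), " ".join(words[cut:])


# Toy input: 100 placeholder words, split at 29%.
sample = " ".join(f"w{i}" for i in range(100))
context, remainder = split_by_percentage(sample, 29)
print(len(context.split()), len(remainder.split()))

# Float route for comparison: 100 * (29 / 100) is slightly below 29,
# so int() truncates it to 28.
print(int(100 * (29 / 100)))
```

Both routes agree whenever the float product is exact; the integer version just removes the dependence on rounding.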

In [None]:
words[:5]

- `content`: one long string containing the full text
- `words`: a list of word strings

In [None]:
total_words = len(words)
train_words_count = len(d['train_words'].split())
test_words_count = len(d['test_words'].split())

In [None]:
print(f"""Total words: {total_words}
experiment context split: {current_context_perct}%

Count of train words: {train_words_count}
Count of test words: {test_words_count}

Hence:
count of train words + count of test words = total words 
{train_words_count} + {test_words_count} = {total_words}
""")


Then we call `_generate_response`.

In [None]:
# Create prompt
prompt = f"""
Generate characters when you see <Generate>
You must generate exactly {test_words_count} words!
Continue the text that comes after <StartText>. 
<StartText>
{d['train_words']}
</StartText>
<Generate>"""

In [None]:
# Generate with model
generated_response = client_qwen.generate(prompt=prompt)

In [None]:
print(prompt, '\n', generated_response)

As you can see, our generation currently has two problems, both about stopping:
1. We need to control the stopping criteria in our `models.py` (in other words, at the point where we call the model via Ollama).
2. We need a post-processing stopping criterion that trims the output.

### 1. Stopping at the model call level
Pass `max_tokens` to `OllamaClient` to restrict how many tokens are generated. A simple heuristic to implement:
- `max_token_cap = ceil(test_words_count * 1.15)`, i.e. about 15% headroom over the target word count
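The heuristic can be written out as a small helper. This is only a sketch: the function name and the `slack_pct` parameter are assumptions made here, and integer ceiling division stands in for `ceil(x * 1.15)` so the result does not depend on float rounding. Since models count tokens rather than words, a cap this tight may still cut the output short when words split into several tokens.

```python
def max_token_cap(target_words: int, slack_pct: int = 15) -> int:
    """Token cap: the target word count plus slack_pct percent, rounded up."""
    if target_words < 0:
        raise ValueError("target_words must be non-negative")
    # Ceiling division on integers: ceil(target_words * (100 + slack_pct) / 100).
    return -(-target_words * (100 + slack_pct) // 100)


for n in (10, 44, 100):
    print(n, max_token_cap(n))
```

The resulting value would be passed as `max_tokens` when constructing the client, as in the setup cell.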

### 2. Stopping in post-processing

In [None]:
def trim_to_n_words(text: str, max_words: int) -> str:
    """Keep at most the first `max_words` whitespace-separated words of `text`."""
    words = text.strip().split()
    return " ".join(words[:max_words])
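A quick check of the trimming helper on a toy string (made-up text, not model output). The function is repeated so the cell runs on its own, with the second parameter named `max_words` here because it counts words rather than tokens.

```python
def trim_to_n_words(text: str, max_words: int) -> str:
    """Keep at most the first max_words whitespace-separated words."""
    words = text.strip().split()
    return " ".join(words[:max_words])


toy_output = "one two\nthree four five six"

# Longer than the limit: only the first four words survive.
trimmed = trim_to_n_words(toy_output, 4)
print(trimmed)

# Shorter than the limit: all words are kept, whitespace is normalised.
print(trim_to_n_words("  a  b ", 5))
```

Line breaks are collapsed to single spaces, which matches how `test_words` was built with `" ".join(...)`, so the trimmed output and the reference text are compared in the same flattened form.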