# Recharacterize

Large language models work by predicting the next token from a sequence of prior tokens. Text is converted into tokens by a tokenizer, which splits words into sub-word "chunks". In this study, I take a variety of sample texts, remove a fixed percentage of their characters, pass the damaged text to different LLMs, and ask each model to rewrite the original text. The goal is to see how effective the tokenization process is when characters are missing.
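To make the idea concrete, here is a toy sketch of why missing characters matter to a tokenizer. It uses greedy longest-match over a tiny made-up vocabulary. `TOY_VOCAB` and `greedy_tokenize` are hypothetical names for illustration only; the tokenizers behind the models used here are trained on large corpora and differ in detail. The only point is that a damaged word no longer matches a known chunk and falls apart into smaller pieces.

```python
def greedy_tokenize(text, vocab):
    """Greedy longest-match tokenization; unknown text falls back to single characters."""
    max_len = max((len(entry) for entry in vocab), default=1)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking until something matches.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocab:
                tokens.append(piece)
                i += size
                break
    return tokens


# Assumption: a toy vocabulary, not taken from any real model.
TOY_VOCAB = {"the", "local", "store", " "}

intact = greedy_tokenize("the local store", TOY_VOCAB)
damaged = greedy_tokenize("th lcal stre", TOY_VOCAB)

print(len(intact), intact)    # 5 tokens: whole words survive
print(len(damaged), damaged)  # 12 tokens: every character stands alone
```

The damaged string is shorter in characters but longer in tokens, so the model sees a sequence that looks very different from anything resembling the original words.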

In [1]:
#IMPORTS

#Models
import openai
from transformers import LlamaForCausalLM, LlamaTokenizer
from transformers import AutoModelForCausalLM, AutoTokenizer
import anthropic
import os

#Data
from datasets import load_dataset
from tenacity import retry, stop_after_attempt, wait_fixed

#Other
import random





### Step 1: Sample Text
In this step, we create a list of text strings, which we will eventually pass to our LLMs.

In [13]:
from datasets import load_dataset
dataset = load_dataset("AmazonScience/massive", "en-US", split='train')
print(dataset[0])


ConnectionError: (ProtocolError('Connection aborted.', ConnectionResetError(10054, 'An existing connection was forcibly closed by the remote host', None, 10054, None)), '(Request ID: a4b70d64-fa25-400b-b51b-7b43cce0073e)')
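The `ConnectionError` above looks like a transient network failure, so retrying the download may be enough. The notebook already imports `retry`, `stop_after_attempt`, and `wait_fixed` from `tenacity`, which offer this behavior as a decorator. The sketch below shows the same idea with only the standard library so the logic is visible. `with_retries` and its parameters are hypothetical names, not part of `datasets` or `tenacity`.

```python
import time


def with_retries(fn, attempts=3, delay_seconds=2.0, exceptions=(Exception,)):
    """Call fn() up to `attempts` times, sleeping between failed tries."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except exceptions as error:
            last_error = error
            # Wait before the next try, but not after the final failure.
            if attempt < attempts:
                time.sleep(delay_seconds)
    raise last_error


# Hypothetical usage in this notebook (needs network access):
# dataset = with_retries(
#     lambda: load_dataset("AmazonScience/massive", "en-US", split="train")
# )
```

If the download keeps failing, the last exception is re-raised unchanged, so the original error message is still visible.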

### Step 2: Randomly remove characters

In this step, we process each text and create a tuple that pairs the original string with a copy that has had a fixed percentage of its characters removed at random positions.


In [38]:
loss = 0.65  # Fraction of characters to remove at random (0.65 means 65%).

def remove_random_chars(input_string, loss):
    num_to_remove = int(len(input_string) * loss)
    
    # Convert the string to a list to allow character removal
    input_list = list(input_string)
    
    # Randomly remove characters
    for _ in range(num_to_remove):
        # Randomly select an index to remove
        idx_to_remove = random.randint(0, len(input_list) - 1)
        input_list.pop(idx_to_remove)
    
    # Convert the list back to a string
    output_string = ''.join(input_list)
    return output_string


#input_str = "Sally သည် Kent England မှ ဆယ်နှစ်အရွယ် မိန်းကလေးဖြစ်သည်။ တစ်နေ့တွင်၊ သူမသည် ဘွန်ဘွန်အချို့ဝယ်ရန် ဒေသခံအထွေထွေစတိုးဆိုင်သို့ လျှောက်သွားရန် ဆုံးဖြတ်ခဲ့သည်။ စတိုးဆိုင်သို့သွားစဉ်တွင် သူမ၏သူငယ်ချင်း Meredith the Lamb ကိုတွေ့လိုက်ရသည်။ 'မင်္ဂလာပါ' ဆယ်လီက Meredith အား ပြောလိုက်သည်။ 'မင်္ဂလာပါ ဆယ်လီ!' Meredith က ပြောသည်။ Meredith က ဆယ်လီ ဘယ်သွားမလို့လဲ မေးတယ်။ ဆယ်လီက 'ငါ ဘွန်ဘွန်စတိုးကို သွားမယ်၊ မင်းလိုချင်လို့လား' လို့ ပြန်ဖြေတယ်။ Meredith က ဟုတ်ကဲ့ ကျေးဇူးပြုပြီး ပြောတယ်။ ဒါနဲ့ ဆယ်လီက Meredith 3 bonbons ကို ဝယ်လိုက်တယ်။"
#input_str = "Sally est une fillette de dix ans originaire du Kent en Angleterre. Un jour, elle a décidé de se rendre au magasin général local pour acheter des bonbons. En descendant au magasin, elle a croisé son amie Meredith l'Agneau. «Bonjour», dit Sally à Meredith. « Salut Sally ! » » dit Meredith. Meredith a alors demandé à Sally où elle allait. Sally a répondu : « Je vais au magasin de bonbons, tu en veux ? ». Meredith a dit oui s'il vous plaît. Alors Sally a acheté 3 bonbons à Meredith."
input_str = "Sally is a ten year old girl from Kent England. One day, she decided to walk down to the local general store to buy some bonbons. As she went down to the store, she came across her friend Meredith the Lamb. 'Hello' Sally said to Meredith. 'Hi Sally!' Meredith said. Meredith then asked Sally where she is going. Sally replied, 'I'm going to the bonbon store, do you want any?'. Meredith said yes please. So Sally bought Meredith 3 bonbons."
output_str = remove_random_chars(input_str, loss)

print(output_str)


atene od gile n. ne ayedc tolk  nl uo bo se n  tht,camerosseereit  Lmlllai trei'HSalyMeri sh h akesegnaly replied,oiotbnbn oreownty.Meih isesSly gtedithbs
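Because the removal is random, every run of the cell above produces a different damaged string, which makes results hard to compare across models. A minimal sketch of a reproducible variant follows, assuming a seed is acceptable for this study. It also builds the (original, damaged) tuples described at the start of this step. `remove_random_chars_seeded` and `make_pairs` are hypothetical names, not functions defined elsewhere in this notebook.

```python
import random


def remove_random_chars_seeded(input_string, loss, seed=None):
    """Remove int(len * loss) characters at random positions, reproducibly."""
    if not 0.0 <= loss <= 1.0:
        raise ValueError("loss must be between 0 and 1")
    rng = random.Random(seed)  # private generator; global random state is untouched
    num_to_remove = int(len(input_string) * loss)
    # Choose the positions to drop in one draw, without replacement.
    drop = set(rng.sample(range(len(input_string)), num_to_remove))
    return "".join(ch for i, ch in enumerate(input_string) if i not in drop)


def make_pairs(texts, loss, seed=None):
    """Return a list of (original, damaged) tuples, one per input text."""
    return [(text, remove_random_chars_seeded(text, loss, seed)) for text in texts]


pairs = make_pairs(["Sally is a ten year old girl from Kent England."], loss=0.65, seed=0)
original, damaged = pairs[0]
print(len(original), len(damaged))
```

With the same seed and the same input, the damaged string is identical on every run, so each model can be given exactly the same text.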


### Step 3: Pass to AI Models

In [39]:
#OPENAI
openai.api_key = os.environ.get("OPENAI_KEY")
assistant_key = os.environ.get("OPENAI_ASSISTANT_KEY")

message = openai.chat.completions.create(
    model="gpt-4o",
    messages = [
        {"role":"system", "content": "The following text has had some of its characters removed. Try to rewrite to match the original text"},
        {"role":"user", "content":output_str}

    ]
)

print(message.choices[0].message)


ChatCompletionMessage(content='Athene nodded silently. She stayed calm and focused on the task, her composure reflecting Lady Artemis\'s teachings. \n"Lamia is a strange entity," she whispered, her voice barely audible, "but one we need to understand." Meiosis slyly guarded the secrets she held tightly.', role='assistant', function_call=None, tool_calls=None)


In [12]:
#Anthropic
client = anthropic.Anthropic(
    api_key=os.environ.get("ANTHROPIC_API_KEY"),
)

anthropic_message = client.messages.create(
    model = "claude-3-opus-20240229",
    max_tokens = 1000,
    temperature=0.0,
    system="The following text has had some of its characters removed. Try to rewrite to match the original text",
    messages=[
        {"role":"user", "content":output_str}
    ]
)

print(anthropic_message.content)

[TextBlock(text="Sally met Nelly in England. One day they decided to have a competition to see who could eat the most buns in one go, so they each bought a bag of buns from the baker and took them to the oak tree on the village green to have their competition.\n\n'Ready Steady Go!' said Nelly, and they each started to eat their buns as fast as they could.\n\nAfter a short while, Nelly said 'I've eaten 4 buns so far, how many have you eaten Sally?'\n\nSally said '3 buns so far - but I'm still eating!' and grabbed another bun from her bag.\n\n'Oh no you don't!' said Nelly, and grabbed 2 more buns from her bag.\n\nIn the end, Nelly had eaten 7 buns, and Sally had eaten 5.\n\n'I guess I win!' laughed Nelly.\n\n'Best out of 3?' suggested Sally, hopefully.\n\n'No thanks,' said Nelly, 'I don't feel so good. Let's go for a nice walk instead to help our tummies settle.'\n\nSo off they went, and both agreed that was the last time they would have a bun eating competition.", type='text')]
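The two cells above print response objects rather than plain text, but the analysis step needs strings. The sketch below pulls the text out of each response shape shown in the outputs. `openai_text` and `anthropic_text` are hypothetical helper names. The `SimpleNamespace` objects are stand-ins that mimic the printed structures so the example runs without API keys; they are not real API responses.

```python
from types import SimpleNamespace


def openai_text(response):
    """Text of the first choice in a chat completion response."""
    return response.choices[0].message.content


def anthropic_text(response):
    """Concatenated text of all text blocks in a messages response."""
    return "".join(block.text for block in response.content if block.type == "text")


# Stand-ins shaped like the outputs printed above (assumption: illustration only).
fake_openai = SimpleNamespace(
    choices=[SimpleNamespace(message=SimpleNamespace(content="Athene nodded silently."))]
)
fake_anthropic = SimpleNamespace(
    content=[SimpleNamespace(type="text", text="Sally met Nelly in England.")]
)

print(openai_text(fake_openai))
print(anthropic_text(fake_anthropic))
```

In the notebook itself, the same helpers would be called on `message` and `anthropic_message` from the cells above.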


In [20]:
#LLAMA

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")
model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B")


# Example usage
input_text = f"The following text has had some of its characters removed. Try to rewrite to match the original text:{output_str}"
inputs = tokenizer(input_text, return_tensors="pt")
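# max_length counts the prompt tokens as well, and this prompt is likely longer
# than 50 tokens on its own, so limit only the newly generated tokens instead.
# (200 is an assumed budget, chosen to leave room for the full passage.)
outputs = model.generate(**inputs, max_new_tokens=200)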

print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Downloading shards:   0%|          | 0/4 [00:56<?, ?it/s]


KeyboardInterrupt: 

### Step 4: Analysis

In this step, we complete the following analyses:
- Average string match at varying loss percentages (per model)
- Average string match per model (how close the model's output is to the original text)
- Average string match for different languages per model
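The first analysis in the list above can be sketched as a small aggregation, assuming the results are collected as a mapping from loss fraction to a list of (original, reconstruction) pairs. That layout, along with the names `edit_distance`, `normalized_similarity`, and `average_similarity_by_loss`, is an assumption for illustration. The pure-Python edit distance is a stand-in for `Levenshtein.distance` so the example has no dependencies.

```python
from statistics import mean


def edit_distance(a, b):
    """Levenshtein distance by dynamic programming, one row at a time."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def normalized_similarity(a, b):
    """1 minus distance over the longer length; two empty strings count as identical."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - edit_distance(a, b) / longest


def average_similarity_by_loss(results):
    """results: {loss_fraction: [(original, reconstruction), ...]} -> {loss_fraction: mean score}."""
    return {
        loss: mean(normalized_similarity(original, rebuilt) for original, rebuilt in pairs)
        for loss, pairs in sorted(results.items())
    }


# Made-up toy data, only to show the shape of the input and output.
toy_results = {0.15: [("abcd", "abcd"), ("abcd", "abcf")]}
print(average_similarity_by_loss(toy_results))  # {0.15: 0.875}
```

Running the same function once per model would give the per-model curves, and grouping the input by language instead of loss would give the third analysis.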

In [41]:
import Levenshtein

def jaccard_similarity(str1, str2):
    # Convert strings to sets of characters
    set1 = set(str1)
    set2 = set(str2)
    
    # Calculate intersection and union
    intersection = set1.intersection(set2)
    union = set1.union(set2)
    
    # Compute Jaccard similarity
    similarity = len(intersection) / len(union)
    return similarity


def levenshtein_distance(str1, str2):
    # Calculate Levenshtein distance
    distance = Levenshtein.distance(str1, str2)
    return distance

def normalized_levenshtein_similarity(str1, str2):
    # Calculate Levenshtein distance
    distance = Levenshtein.distance(str1, str2)
    # Normalize the similarity score
    max_len = max(len(str1), len(str2))
    similarity = 1 - (distance / max_len)
    return similarity

# Example usage
target = "Sally is a ten year old girl from Kent England. One day, she decided to walk down to the local general store to buy some bonbons. As she went down to the store, she came across her friend Meredith the Lamb. 'Hello' Sally said to Meredith. 'Hi Sally!' Meredith said. Meredith then asked Sally where she is going. Sally replied, 'I'm going to the bonbon store, do you want any?'. Meredith said yes please. So Sally bought Meredith 3 bonbons."
test_15 = "Sally is a ten-year-old girl from Kent, England. One day, she decided to walk down to the local general store to buy some bonbons. As she went down to the store, she came across her friend Meredith the Lamb. 'Hello,' Sally said to Meredith. 'Hi Sally!' Meredith replied. Meredith then asked Sally where she was going. Sally replied, 'I'm going to the bonbon store, do you want any?'. Meredith said yes please. So Sally bought Meredith 3 bonbons."
test_25 = "Sally is a ten-year-old girl from Kellington, England. One day, she decided to walk down to the local general store to buy some bonbons. As she went down to the store, she came across her friend Meredith Lamb. 'Hello,' Sally said to Meredith. 'Hi Sally,' Meredith answered. Meredith then asked Sally where she was going. Sally replied, 'I'm going to the bonbon store, do you want any?' Meredith said yes, please. So Sally bought Meredith 3 bonbons."
test_35 = "Sally was a ten-year-old girl from eastern England. One day, she decided to walk down to the local store to buy some ribbons. As she entered the store, she came across her friend Meredith Lane.\n\n'Hello,' Sally said to her friend.\n\n'Hi, Sally!' greeted Meredith. 'Where are you headed?' she then asked.\n\nSally replied, 'I'm going to the ribbon store, do you want anything?'\n\nMeredith seemed pleased. So, Sally bought Meredith three ribbons."
test_45 = "Sally is a little girl from Kent, England. One day, she decided to walk down to the local general store to buy some bonbons. As she walked down to the store, she ran into her friend, Meredith, at the lake. 'Hi, Sally!' Meredith said. Sally replied, 'Hey, Meredith. I\'m just going to the store to get some candy.' Meredith said, 'That sounds fun! I\'ll come with you.' At the store, Sally bought 3 bonbons."
test_55 = "Sally is a very gifted Market Analyst. One day, she decided to write a general story about symbols. She went down to her favorite cafe, Merdith's, by the beach. Her laptop lay open. 'Hello!' a man said as he arrived. He then asked, 'Is this seat taken?' Sally smiled and replied, 'I'm just getting started, so no worries on that.' She edited her story as people slowly shuffled past her from 3PM to evening."
test_65 = "Athene nodded silently. She stayed calm and focused on the task, her composure reflecting Lady Artemis\'s teachings. \n'Lamia is a strange entity,' she whispered, her voice barely audible, 'but one we need to understand.' Meiosis slyly guarded the secrets she held tightly."

# Jaccard Ratios
# Map each loss percentage to the model's reconstruction at that level
tests = {15: test_15, 25: test_25, 35: test_35, 45: test_45, 55: test_55, 65: test_65}

for pct, text in tests.items():
    print(f"Jaccard Similarity {pct}%: {jaccard_similarity(target, text):.2f}")

# Levenshtein distance and normalized similarity, in the same order (15% to 65%)
for text in tests.values():
    print(f"Levenshtein Distance: {levenshtein_distance(target, text)}")
    print(f"Normalized Levenshtein Similarity: {normalized_levenshtein_similarity(target, text):.2f}")

Jaccard Similarity 15%: 0.97
Jaccard Similarity 25%: 0.95
Jaccard Similarity 35%: 0.88
Jaccard Similarity 45%: 0.90
Jaccard Similarity 55%: 0.85
Jaccard Similarity 65%: 0.74
Levenshtein Distance: 11
Normalized Levenshtein Similarity: 0.98
Levenshtein Distance: 25
Normalized Levenshtein Similarity: 0.94
Levenshtein Distance: 115
Normalized Levenshtein Similarity: 0.74
Levenshtein Distance: 175
Normalized Levenshtein Similarity: 0.60
Levenshtein Distance: 261
Normalized Levenshtein Similarity: 0.41
Levenshtein Distance: 319
Normalized Levenshtein Similarity: 0.27
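One pattern in the numbers above: the character-set Jaccard score stays at 0.74 for the 65% case even though the normalized Levenshtein similarity drops to 0.27. A likely reason is that any two English passages use almost the same set of letters, so comparing character sets says little about wording. A word-level variant may separate the cases better. The sketch below is one possible version; `word_jaccard` is a hypothetical name, and splitting on `\w+` after lowercasing is an assumption about what counts as a word.

```python
import re


def word_jaccard(str1, str2):
    """Jaccard similarity over sets of lowercased words instead of characters."""
    words1 = set(re.findall(r"\w+", str1.lower()))
    words2 = set(re.findall(r"\w+", str2.lower()))
    union = words1 | words2
    if not union:
        return 1.0  # two texts with no words are treated as identical
    return len(words1 & words2) / len(union)


# 3 shared words out of 5 distinct words
print(word_jaccard("Sally bought three bonbons", "Sally bought three ribbons"))  # 0.6
```

Like the character version, this ignores word order and repetition, so it complements the Levenshtein score rather than replacing it.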
