# Recharacterize

Large Language Models work by predicting the next token based on a series of prior tokens. Words are transformed into tokens through a "chunking" process. In this study, I want to pass a variety of sample text through different LLMs that has had a certain percentage of their characters removed and ask the model to try and write the original text. The goal here is to see how effective the tokenization process is. 

In [26]:
#IMPORTS

#Models
import openai
from transformers import LlamaForCausalLM, LlamaTokenizer
from transformers import AutoModelForCausalLM, AutoTokenizer
import anthropic
import os

#Data
from datasets import load_dataset
from tenacity import retry, stop_after_attempt, wait_fixed

#Other
import random
import pandas as pd
import Levenshtein
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np



### Step 1: Get Sample Texts
Using OpenAI's gpt-4o LLM, I created 500 randomly generated prompts and texts. I load these into dataframes here for analysis later.

In [34]:
#test_character_loss = pd.read_csv('character_loss_test.csv')
#test_prompts = pd.read_csv('prompts_test.csv')

test_prompts = pd.DataFrame(["Describe a day in the life of a talking tree.", "If time travel were possible, where would you go first and why?", "Imagine a world where humans could breathe underwater. What would cities look like?", "Write a recipe for happiness using emotions as ingredients.", "If animals could vote, what might their political system look like?", 
                             "Describe an alien planet where the primary sense is taste instead of sight.", "Invent a new holiday and explain how people celebrate it.", "What would a diary entry from a robot gaining emotions sound like?", "Imagine a city built entirely out of glass. How does it function?", "Write a short letter from the moon to the Earth."], columns=['Prompts'])

test_character_loss = pd.DataFrame(["The moon hung low in the sky, its light shimmering on the surface of the lake. Lily dipped her toes into the cool water, imagining what it would feel like to swim beneath a glowing, silver moon. Somewhere in the distance, a frog croaked, adding its voice to the symphony of night.",  
"A red kite soared high in the sky, its tail fluttering like a fiery ribbon. Sam held the string tightly, feeling the tug of the wind. It was as if the kite had a life of its own, pulling him toward the horizon and the adventures that lay beyond.",  
"The old library smelled of leather and paper, and every corner seemed to whisper a story. Mia traced her fingers along the spines of the books, stopping when one seemed to hum beneath her touch. 'The Book of Forgotten Secrets' was embossed in gold, and her heart raced as she opened it.",  
"The sound of waves crashing against the rocks was thunderous, but to Jake, it was calming. He imagined himself a pirate captain, standing on the deck of a great ship. With a wooden stick for a sword, he defended his ship from invisible enemies, his dog Max barking in support.",  
"In the heart of the forest, an ancient oak tree stood taller than all the others. Its trunk was wide enough for three people to hug, and its branches stretched like arms toward the heavens. Beneath its shade, a tiny door was hidden. Ellie crouched down, wondering who—or what—might live inside.",  
"The first snowfall of the year was magical. Sarah pressed her face against the frosty window, watching as the world outside turned white. Grabbing her scarf and mittens, she ran outside to build the first snowman of the season. She smiled as the snowflakes melted on her cheeks.",  
"The attic was dusty and filled with forgotten treasures. Adam found an old telescope, its lens cracked but still usable. That night, he pointed it at the stars, wondering if anyone out there might be looking back at him through their own telescope.",  
"The city buzzed with energy, neon lights flickering in every direction. Clara loved the way the streets came alive at night. Each alley seemed like a new path to a secret adventure, and every stranger she passed could be the start of a new story.",  
"A single sunflower stood tall in the middle of the field, its bright yellow petals glowing like a sun of its own. Lucy knelt beside it, imagining it as the queen of the flowers, ruling over her green and golden kingdom.",  
"The carnival was a whirlwind of colors, sounds, and smells. James loved the cotton candy the most, its sticky sweetness dissolving on his tongue. As the Ferris wheel took him higher, he felt like he could touch the stars, each one winking at him in the endless sky."
], columns = ["Text"])



### Step 2: Character Removal Functions

In this step, we'll process all of the text and create a tuple which includes the original string accompanied by a new string that has a random percentage of its characters removed.


In [22]:
loss = 0.15 #This is the percent of characters we seek to randomly remove. 

def remove_random_chars(input_string, loss):
    num_to_remove = int(len(input_string) * loss)
    
    input_list = list(input_string)
    
    for _ in range(num_to_remove):
        idx_to_remove = random.randint(0, len(input_list) - 1)
        input_list.pop(idx_to_remove)
    
    output_string = ''.join(input_list)
    return output_string

def remove_spaces(input_string, loss):
    if not (0 <= loss <= 1):
        raise ValueError("Loss parameter must be between 0 and 1.")
    
    spaces = [index for index, char in enumerate(input_string) if char == ' ']
    num_to_remove = int(len(spaces) * loss)
    spaces_to_remove = set(random.sample(spaces, num_to_remove))
    
    return ''.join(char for index, char in enumerate(input_string) if index not in spaces_to_remove)


def remove_vowels(input_string, loss):

    if not (0 <= loss <= 1):
        raise ValueError("Loss parameter must be between 0 and 1.")
    
    vowels = "aeiouAEIOU"
    vowel_indices = [index for index, char in enumerate(input_string) if char in vowels]
    num_to_remove = int(len(vowel_indices) * loss)
    vowels_to_remove = set(random.sample(vowel_indices, num_to_remove))
    
    return ''.join(char for index, char in enumerate(input_string) if index not in vowels_to_remove)


### Step 3: Similarity & AI Rewrite Functions

In [29]:
#Similarity Functions

def jaccard_similarity(str1, str2):
    set1 = set(str1)
    set2 = set(str2)
    
    intersection = set1.intersection(set2)
    union = set1.union(set2)
    
    similarity = len(intersection) / len(union)
    return similarity


def levenshtein_distance(str1, str2):
    distance = Levenshtein.distance(str1, str2)
    return distance

def normalized_levenshtein_similarity(str1, str2):
    distance = Levenshtein.distance(str1, str2)
    max_len = max(len(str1), len(str2))
    similarity = 1 - (distance / max_len)
    return similarity


def compute_cosine_similarity(str1, str2):

    # Convert text to embeddings using TF-IDF
    vectorizer = TfidfVectorizer()
    embeddings = vectorizer.fit_transform([str1, str2])
    
    # Compute cosine similarity
    similarity = cosine_similarity(embeddings[0:1], embeddings[1:2])
    
    return similarity[0][0]

In [37]:

instructions_rewrite = "The following text has had some of its characters removed. Try to rewrite to match the original text"
instructions_rewrite_anthropic = "The following text has had some of its characters removed. Try to rewrite to match the original text. Only return the rewritten text, nothing other commentary"
instructions_prompt = "You have received a prompt. Complete it"

#OPENAI
def open_ai_rewrite(input_str, instructions):
    openai.api_key = os.environ.get("OPENAI_KEY")
    assistant_key = os.environ.get("OPENAI_ASSISTANT_KEY")

    message = openai.chat.completions.create(
        model="gpt-4o",
        messages = [
            {"role":"system", "content": instructions},
            {"role":"user", "content":input_str}

        ]
    )
    return message.choices[0].message.content, message.usage.completion_tokens, message.usage.prompt_tokens


#ANTHROPIC
client = anthropic.Anthropic(
    api_key=os.environ.get("ANTHROPIC_API_KEY"),
)

def anthropic_rewrite(input_str, instructions):
    anthropic_message = client.messages.create(
        model = "claude-3-opus-20240229",
        max_tokens = 1000,
        temperature=0.0,
        system=instructions,
        messages=[
            {"role":"assistant", "content": instructions},
            {"role":"user", "content":input_str}
        ]
    )
    return anthropic_message.content[0].text


In [36]:
def open_ai_rewrite(input_str, instructions):
    openai.api_key = os.environ.get("OPENAI_KEY")
    assistant_key = os.environ.get("OPENAI_ASSISTANT_KEY")

    message = openai.chat.completions.create(
        model="gpt-4o",
        messages = [
            {"role":"system", "content": instructions},
            {"role":"user", "content":input_str}

        ]
    )
    return message.choices[0].message.content, message.usage.completion_tokens, message.usage.prompt_tokens


#### Experiment 1 Rewrite 

In [40]:
#REWRITE# - OPENAI

def experiment_1(df, removal_function, rewrite_function, instructions, loss_parameter):
    df_exp1 = df.copy()
    df_exp1['Loss Text'] = df_exp1['Text'].apply(lambda x: removal_function(x, loss_parameter))

    df_exp1[['Attempted Rewrite', 'Input Tokens', 'Output Tokens']] = (
    df_exp1['Loss Text']
    .apply(lambda x: rewrite_function(x, instructions))
    .apply(pd.Series) 
    )

    #Similarity Tests
    df_exp1['Jaccard Similarity'] = df_exp1.apply(lambda row: jaccard_similarity(row['Text'], row['Attempted Rewrite']), axis=1)
    df_exp1['Levenshtein Distance'] = df_exp1.apply(lambda row: levenshtein_distance(row['Text'], row['Attempted Rewrite']), axis=1)
    df_exp1['Normalized Levenshtein Distance'] = df_exp1.apply(lambda row: normalized_levenshtein_similarity(row['Text'], row['Attempted Rewrite']), axis=1)
    df_exp1['Cosine Similarity'] = df_exp1.apply(lambda row: compute_cosine_similarity(row['Text'], row['Attempted Rewrite']), axis=1)

    return df_exp1


In [60]:
#Random Character Removal
loss = [0,0.05,.10,.15,.20,.25,.30,.35,.40,.45,.50,.55,.60,.65,.70,.75,.80,.85,.9,0.95,1]

results = []

for i in loss:
    # Generate the DataFrame for the current loss parameter
    df = experiment_1(test_character_loss, remove_random_chars, open_ai_rewrite, instructions_rewrite, i)
    
    # Calculate the means for the required columns
    input_tokens_mean = df['Input Tokens'].mean()
    jaccard_similarity_mean = df['Jaccard Similarity'].mean()
    levenshtein_distance_mean = df['Levenshtein Distance'].mean()
    normalized_levenshtein_distance_mean = df['Normalized Levenshtein Distance'].mean()
    cosine_similarity_mean = df['Cosine Similarity'].mean()
    
    # Append the results as a dictionary
    results.append({
        'Loss Parameter': i,
        'Input Tokens Mean': input_tokens_mean,
        'Jaccard Similarity Mean': jaccard_similarity_mean,
        'Levenshtein Distance Mean': levenshtein_distance_mean,
        'Normalized Levenshtein Distance Mean': normalized_levenshtein_distance_mean,
        'Cosine Similarity Mean': cosine_similarity_mean
    })

# Convert the results list into a Pandas DataFrame
results_df = pd.DataFrame(results)

# Display or save the DataFrame
print(results_df)
results_df.to_csv('experiment_1_openai_random_character_results.csv', index=False)
 

    Loss Parameter  Input Tokens Mean  Jaccard Similarity Mean  \
0             0.00               56.9                 1.000000   
1             0.05               56.8                 1.000000   
2             0.10               56.7                 1.000000   
3             0.15               56.4                 0.993103   
4             0.20               56.9                 0.990000   
5             0.25               56.0                 0.972267   
6             0.30               56.2                 0.967134   
7             0.35               55.8                 0.918816   
8             0.40               54.7                 0.933222   
9             0.45               49.6                 0.872469   
10            0.50               46.5                 0.874334   
11            0.55               42.5                 0.829642   
12            0.60               44.1                 0.806130   
13            0.65               36.8                 0.777579   
14        

In [61]:
#Random Vowel Removal

loss = [0,0.05,.10,.15,.20,.25,.30,.35,.40,.45,.50,.55,.60,.65,.70,.75,.80,.85,.9,0.95]

results = []

for i in loss:
    # Generate the DataFrame for the current loss parameter
    df = experiment_1(test_character_loss, remove_vowels, open_ai_rewrite, instructions_rewrite, i)
    
    # Calculate the means for the required columns
    input_tokens_mean = df['Input Tokens'].mean()
    jaccard_similarity_mean = df['Jaccard Similarity'].mean()
    levenshtein_distance_mean = df['Levenshtein Distance'].mean()
    normalized_levenshtein_distance_mean = df['Normalized Levenshtein Distance'].mean()
    cosine_similarity_mean = df['Cosine Similarity'].mean()
    
    # Append the results as a dictionary
    results.append({
        'Loss Parameter': i,
        'Input Tokens Mean': input_tokens_mean,
        'Jaccard Similarity Mean': jaccard_similarity_mean,
        'Levenshtein Distance Mean': levenshtein_distance_mean,
        'Normalized Levenshtein Distance Mean': normalized_levenshtein_distance_mean,
        'Cosine Similarity Mean': cosine_similarity_mean
    })

# Convert the results list into a Pandas DataFrame
results_df = pd.DataFrame(results)

# Display or save the DataFrame
print(results_df)
results_df.to_csv('experiment_1_openai_random_vowel_results.csv', index=False)
 

    Loss Parameter  Input Tokens Mean  Jaccard Similarity Mean  \
0             0.00               57.0                 0.993750   
1             0.05               56.9                 1.000000   
2             0.10               56.8                 1.000000   
3             0.15               56.8                 1.000000   
4             0.20               56.8                 1.000000   
5             0.25               56.9                 1.000000   
6             0.30               56.8                 1.000000   
7             0.35               56.9                 1.000000   
8             0.40               56.8                 1.000000   
9             0.45               56.9                 0.986437   
10            0.50               56.9                 1.000000   
11            0.55               56.9                 1.000000   
12            0.60               56.8                 0.993103   
13            0.65               56.8                 0.993333   
14        

In [62]:
#Random Space Removal

loss = [0,0.05,.10,.15,.20,.25,.30,.35,.40,.45,.50,.55,.60,.65,.70,.75,.80,.85,.9,0.95,1]

results = []

for i in loss:
    # Generate the DataFrame for the current loss parameter
    df = experiment_1(test_character_loss, remove_spaces, open_ai_rewrite, instructions_rewrite, i)
    
    # Calculate the means for the required columns
    input_tokens_mean = df['Input Tokens'].mean()
    jaccard_similarity_mean = df['Jaccard Similarity'].mean()
    levenshtein_distance_mean = df['Levenshtein Distance'].mean()
    normalized_levenshtein_distance_mean = df['Normalized Levenshtein Distance'].mean()
    cosine_similarity_mean = df['Cosine Similarity'].mean()
    
    # Append the results as a dictionary
    results.append({
        'Loss Parameter': i,
        'Input Tokens Mean': input_tokens_mean,
        'Jaccard Similarity Mean': jaccard_similarity_mean,
        'Levenshtein Distance Mean': levenshtein_distance_mean,
        'Normalized Levenshtein Distance Mean': normalized_levenshtein_distance_mean,
        'Cosine Similarity Mean': cosine_similarity_mean
    })

# Convert the results list into a Pandas DataFrame
results_df = pd.DataFrame(results)

# Display or save the DataFrame
print(results_df)
results_df.to_csv('experiment_1_openai_random_spaces_results.csv', index=False)
 

    Loss Parameter  Input Tokens Mean  Jaccard Similarity Mean  \
0             0.00               56.9                      1.0   
1             0.05               56.8                      1.0   
2             0.10               56.9                      1.0   
3             0.15               56.9                      1.0   
4             0.20               56.8                      1.0   
5             0.25               56.9                      1.0   
6             0.30               56.8                      1.0   
7             0.35               56.9                      1.0   
8             0.40               56.8                      1.0   
9             0.45               56.8                      1.0   
10            0.50               56.8                      1.0   
11            0.55               56.8                      1.0   
12            0.60               56.8                      1.0   
13            0.65               56.8                      1.0   
14        

In [63]:
#REWRITE# - ANTHROPIC

#Remove Random Characters

results = []

for i in loss:
    df = experiment_1(test_character_loss, remove_random_chars, anthropic_rewrite, instructions_rewrite_anthropic, i)
    
    input_tokens_mean = df['Input Tokens'].mean()
    jaccard_similarity_mean = df['Jaccard Similarity'].mean()
    levenshtein_distance_mean = df['Levenshtein Distance'].mean()
    normalized_levenshtein_distance_mean = df['Normalized Levenshtein Distance'].mean()
    cosine_similarity_mean = df['Cosine Similarity'].mean()
    
    results.append({
        'Loss Parameter': i,
        'Input Tokens Mean': input_tokens_mean,
        'Jaccard Similarity Mean': jaccard_similarity_mean,
        'Levenshtein Distance Mean': levenshtein_distance_mean,
        'Normalized Levenshtein Distance Mean': normalized_levenshtein_distance_mean,
        'Cosine Similarity Mean': cosine_similarity_mean
    })

anthropic_results_df = pd.DataFrame(results)

# Display or save the DataFrame
print(anthropic_results_df)
anthropic_results_df.to_csv('experiment_1_anthropic_random_character_results.csv', index=False)

ValueError: Columns must be same length as key

In [None]:
#Remove Random Vowels
results = []

for i in loss:
    df = experiment_1(test_character_loss, remove_vowels, anthropic_rewrite, instructions_rewrite_anthropic, i)
    
    input_tokens_mean = df['Input Tokens'].mean()
    jaccard_similarity_mean = df['Jaccard Similarity'].mean()
    levenshtein_distance_mean = df['Levenshtein Distance'].mean()
    normalized_levenshtein_distance_mean = df['Normalized Levenshtein Distance'].mean()
    cosine_similarity_mean = df['Cosine Similarity'].mean()
    
    results.append({
        'Loss Parameter': i,
        'Input Tokens Mean': input_tokens_mean,
        'Jaccard Similarity Mean': jaccard_similarity_mean,
        'Levenshtein Distance Mean': levenshtein_distance_mean,
        'Normalized Levenshtein Distance Mean': normalized_levenshtein_distance_mean,
        'Cosine Similarity Mean': cosine_similarity_mean
    })

anthropic_results_df = pd.DataFrame(results)

# Display or save the DataFrame
print(anthropic_results_df)
anthropic_results_df.to_csv('experiment_1_anthropic_random_vowels_results.csv', index=False)


In [None]:
#Remove Random Spaces
results = []

for i in loss:
    df = experiment_1(test_character_loss, remove_vowels, anthropic_rewrite, instructions_rewrite_anthropic, i)
    
    input_tokens_mean = df['Input Tokens'].mean()
    jaccard_similarity_mean = df['Jaccard Similarity'].mean()
    levenshtein_distance_mean = df['Levenshtein Distance'].mean()
    normalized_levenshtein_distance_mean = df['Normalized Levenshtein Distance'].mean()
    cosine_similarity_mean = df['Cosine Similarity'].mean()
    
    results.append({
        'Loss Parameter': i,
        'Input Tokens Mean': input_tokens_mean,
        'Jaccard Similarity Mean': jaccard_similarity_mean,
        'Levenshtein Distance Mean': levenshtein_distance_mean,
        'Normalized Levenshtein Distance Mean': normalized_levenshtein_distance_mean,
        'Cosine Similarity Mean': cosine_similarity_mean
    })

anthropic_results_df = pd.DataFrame(results)

# Display or save the DataFrame
print(anthropic_results_df)
anthropic_results_df.to_csv('experiment_1_anthropic_random_spaces_results.csv', index=False)


#### Prompts

In [91]:
#PROMPT# - OPENAI
test_prompts_openai = test_prompts.copy()
test_prompts_openai['Loss Prompts'] = test_prompts_openai['Prompts'].apply(lambda x: remove_random_chars(x, loss))
test_prompts_openai[['Output with Loss Prompts', 'Input Tokens (Loss)', 'Output Tokens (Loss)']] = (
    test_prompts_openai['Loss Prompts']
    .apply(lambda x: open_ai_rewrite(x, instructions_prompt))
    .apply(pd.Series) 
)

test_prompts_openai[['Output with Original Prompts', 'Input Tokens (Original)', 'Output Tokens (Original)']] = (
    test_prompts_openai['Prompts']
    .apply(lambda x: open_ai_rewrite(x,instructions_prompt))
    .apply(pd.Series)
)

test_prompts_openai[['Output with Original Prompts (Control)', 'Input Tokens (Control)', 'Output Tokens (Control)']] = (
    test_prompts_openai['Prompts']
    .apply(lambda x: open_ai_rewrite(x,instructions_prompt))
    .apply(pd.Series)
)


In [92]:
#Similarity Columns - Control
test_prompts_openai['Jaccard Similarity - Control'] = test_prompts_openai.apply(lambda row: jaccard_similarity(row['Output with Original Prompts'], row['Output with Original Prompts (Control)']), axis=1)
test_prompts_openai['Levenshtein Distance - Control'] = test_prompts_openai.apply(lambda row: levenshtein_distance(row['Output with Original Prompts'], row['Output with Original Prompts (Control)']), axis=1)
test_prompts_openai['Normalized Levenshtein Distance- Control'] = test_prompts_openai.apply(lambda row: normalized_levenshtein_similarity(row['Output with Original Prompts'], row['Output with Original Prompts (Control)']), axis=1)

#Similarity Columns - Test
test_prompts_openai['Jaccard Similarity - Test'] = test_prompts_openai.apply(lambda row: jaccard_similarity(row['Output with Loss Prompts'], row['Output with Original Prompts']), axis=1)
test_prompts_openai['Levenshtein Distance - Test'] = test_prompts_openai.apply(lambda row: levenshtein_distance(row['Output with Loss Prompts'], row['Output with Original Prompts']), axis=1)
test_prompts_openai['Normalized Levenshtein Distance- Test'] = test_prompts_openai.apply(lambda row: normalized_levenshtein_similarity(row['Output with Loss Prompts'], row['Output with Original Prompts']), axis=1)


In [93]:
test_prompts_openai

Unnamed: 0,Prompts,Loss Prompts,Output with Loss Prompts,Input Tokens (Loss),Output Tokens (Loss),Output with Original Prompts,Input Tokens (Original),Output Tokens (Original),Output with Original Prompts (Control),Input Tokens (Control),Output Tokens (Control),Jaccard Similarity - Control,Levenshtein Distance - Control,Normalized Levenshtein Distance- Control,Jaccard Similarity - Test,Levenshtein Distance - Test,Normalized Levenshtein Distance- Test
0,Describe a day in the life of a talking tree.,escribe ady n the lie of a alking tree.,Título: Un Día en la Vida de un Árbol Caminant...,674,32,In the heart of an ancient forest stands Telwy...,509,30,A day in the life of a talking tree unfolds wi...,537,30,0.734694,1933,0.279538,0.561404,2134,0.249912
1,"If time travel were possible, where would you ...","If time travel ere posile, where wouldou g fir...","If time travel were possible, choosing a desti...",107,34,"If time travel were possible, visiting a momen...",99,33,"If time travel were possible, one intriguing d...",105,33,0.837838,477,0.258165,0.666667,470,0.259843
2,Imagine a world where humans could breathe und...,Imagnea ord where humanscoul beathe nderwte. W...,In a world where humans could breathe underwat...,430,40,In a world where humans could breathe underwat...,547,34,In a world where humans could breathe underwat...,511,34,0.852459,2241,0.280347,0.83871,2124,0.317919
3,Write a recipe for happiness using emotions as...,Wrta recipe forhppiness using emotions as ingr...,**Recipe for Happiness: Emotional Ingredients*...,520,31,**Recipe for Happiness**\n\n**Ingredients:**\n...,553,29,**Recipe for Happiness**\n\n**Ingredients:**\n...,547,29,0.823529,1738,0.288288,0.785714,1744,0.285831
4,"If animals could vote, what might their politi...","If animals could vote, wht mg heir oliticalsys...",If animals could vote and had an organized pol...,492,35,"If animals could vote, their political system ...",569,32,"If animals could vote, their political system ...",534,32,0.861538,2317,0.243305,0.835821,2245,0.266819
5,Describe an alien planet where the primary sen...,Describe anlien lanet hre the rimy sene is ast...,Certainly! Imagine a distant alien planet name...,531,37,"On this intriguing alien planet, where taste r...",519,33,"On the alien planet of Gustarion, the primary ...",571,33,0.854167,2201,0.290686,0.78,2133,0.25886
6,Invent a new holiday and explain how people ce...,nvent a new oliday and explainow peoe celerae it.,**Holiday Name:** Harmony Day\n\n**Date:** The...,432,35,"Certainly! Let's introduce ""Global Harmony Day...",515,30,Holiday Name: Harmony Haven Day\n\nDate: Novem...,468,30,0.777778,2158,0.248607,0.71875,2089,0.272632
7,What would a diary entry from a robot gaining ...,Wa would a dary entry rom obot gainig emotions...,**Date: 24th October 2023**\n**Location: Robot...,512,34,"**Diary Entry: November 5, 2023**\n\nToday mar...",467,32,"**Diary Entry: October 30, 2023**\n\nInitializ...",515,32,0.765625,1849,0.240033,0.803279,1908,0.246445
8,Imagine a city built entirely out of glass. Ho...,Imagin cit built enirely ut of glass How oe i...,Imagining a city built entirely out of glass b...,654,36,"In a city built entirely out of glass, both ae...",551,33,"In a city built entirely out of glass, the uni...",579,33,0.78125,2416,0.28983,0.772727,2599,0.290666
9,Write a short letter from the moon to the Earth.,Wite a sr letter from the mon to th Earh.,"Dear Earth,\n\nI hope this letter finds you th...",331,32,"Dear Earth,\n\nI hope this note finds you well...",213,30,"Dear Earth,\n\nI hope this letter finds you vi...",291,30,0.837209,959,0.30858,0.734694,1160,0.298246


In [73]:
#PROMPT# - ANTHROPIC
test_prompts_anthropic = test_prompts.copy()

test_prompts_anthropic['Loss Prompts'] = test_prompts_anthropic['Prompts'].apply(lambda x: remove_random_chars(x, loss))
test_prompts_anthropic['Output with Loss Prompts'] = test_prompts_anthropic['Loss Prompts'].apply(lambda x: anthropic_rewrite(x, instructions_prompt))
test_prompts_anthropic['Output with Original Prompts'] = test_prompts_anthropic['Prompts'].apply(lambda x: anthropic_rewrite(x,instructions_prompt))
test_prompts_anthropic['Output with Original Prompts (Control)'] = test_prompts_anthropic['Prompts'].apply(lambda x: anthropic_rewrite(x,instructions_prompt))


In [74]:
#Similarity Columns - Control
test_prompts_anthropic['Jaccard Similarity - Control'] = test_prompts_anthropic.apply(lambda row: jaccard_similarity(row['Output with Original Prompts'], row['Output with Original Prompts (Control)']), axis=1)
test_prompts_anthropic['Levenshtein Distance - Control'] = test_prompts_anthropic.apply(lambda row: levenshtein_distance(row['Output with Original Prompts'], row['Output with Original Prompts (Control)']), axis=1)
test_prompts_anthropic['Normalized Levenshtein Distance- Control'] = test_prompts_anthropic.apply(lambda row: normalized_levenshtein_similarity(row['Output with Original Prompts'], row['Output with Original Prompts (Control)']), axis=1)

#Similarity Columns - Test
test_prompts_anthropic['Jaccard Similarity - Test'] = test_prompts_anthropic.apply(lambda row: jaccard_similarity(row['Output with Loss Prompts'], row['Output with Original Prompts']), axis=1)
test_prompts_anthropic['Levenshtein Distance - Test'] = test_prompts_anthropic.apply(lambda row: levenshtein_distance(row['Output with Loss Prompts'], row['Output with Original Prompts']), axis=1)
test_prompts_anthropic['Normalized Levenshtein Distance- Test'] = test_prompts_anthropic.apply(lambda row: normalized_levenshtein_similarity(row['Output with Loss Prompts'], row['Output with Original Prompts']), axis=1)


In [76]:
test_character_loss_anthropic

Unnamed: 0,Text,Loss Text,Attempted Rewrite,Jaccard Similarity,Levenshtein Distance,Normalized Levenshtein Distance
0,"The moon hung low in the sky, its light shimme...",The moon hung lown he skyits ligt shimringon t...,"The moon hung low in the sky, its light shimme...",1.0,0,1.0
1,"A red kite soared high in the sky, its tail fl...","A red kitesoared hih in thesky, its tailluteri...","A red kite soared high in the sky, its tail fl...",1.0,0,1.0
2,"The old library smelled of leather and paper, ...","The od lbrr smelld of leahr and per, and very ...","The old library smelled of leather and paper, ...",1.0,1,0.996503
3,The sound of waves crashing against the rocks ...,The sond of wves rshig aainst he rocs was thun...,The sound of waves rushing against the rocks w...,0.967742,11,0.960145
4,"In the heart of the forest, an ancient oak tre...","In the hart ofh foest, an ncientok tree stod t...","In the heart of the forest, an ancient oak tre...",1.0,0,1.0
5,The first snowfall of the year was magical. Sa...,Thefrstswfall of the er ws mgcal. Sarhpresed h...,The first snowfall of the year was magical. Sa...,1.0,0,1.0
6,The attic was dusty and filled with forgotten ...,The attc was dutyand fillwith forotenreasre. A...,The attic was dusty and filled with forgotten ...,1.0,1,0.995968
7,"The city buzzed with energy, neon lights flick...",Th city uzzed wih energ nen lgts flickerinin e...,"The city buzzed with energy, neon lights flick...",1.0,1,0.995935
8,A single sunflower stood tall in the middle of...,A snle unfower tood tallin the mddleo te field...,A single sunflower stood tall in the middle of...,1.0,1,0.995434
9,"The carnival was a whirlwind of colors, sounds...","The caniv wasa whrlwnd ofcor,sounds, nd smls. ...","The carnival was a whirlwind of color, sounds,...",1.0,1,0.996226


### Step 4: Analysis

In this step, we complete the following analyses:
- Average string match at varying loss percentages (per model)
- Average string match (how close the output is to the input per model based on string matching)
- Average string match for different languages per model