# Paraphrasing Corrupted Transcripts with Meta-Llama-3-8B-Instruct
This notebook uses the Meta-Llama-3-8B-Instruct model to reconstruct corrupted logit-lens transcripts stored in a CSV file and saves the legible paraphrases as a new column.

In [None]:
# Import required libraries
import pandas as pd
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
import re

In [30]:
# Load the Meta-Llama-3-8B model and tokenizer
model_name = "meta-llama/Meta-Llama-3-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16, device_map="auto")

Fetching 4 files: 100%|██████████| 4/4 [02:00<00:00, 30.00s/it] 
Loading checkpoint shards: 100%|██████████| 4/4 [00:12<00:00,  3.23s/it]



In [31]:
# Load the CSV file
csv_path = "prompts/three_hop_prompts_w_logit_lens_transcript.csv"
prompt_df = pd.read_csv(csv_path)

In [None]:
# Function to paraphrase corrupted transcripts
def paraphrase_transcript(model, tokenizer, transcript):
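    # Remove sentinel tokens as whole substrings. str.strip(chars) would instead
    # strip any of those individual characters from both ends of the text.
    for token in ("<｜begin▁of▁sentence｜>", "<｜end▁of▁sentence｜>"):
        transcript = transcript.replace(token, "")
    transcript = transcript.strip()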
    prompt = (
        "Here is a transcript that has been corrupted. Can you reconstruct what the original paragraph says "
        "and provide a legible version? "
        "This is the corrupted transcript: " + "\"" + transcript + "\""
    )
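    # Render the chat prompt as text, then tokenize it.
    chat_text = tokenizer.apply_chat_template(
        [{"role": "user", "content": prompt}],
        tokenize=False,
        add_generation_prompt=True,
    )
    # Llama 3's chat template already prepends the BOS token, so skip adding it again.
    inputs = tokenizer(chat_text, return_tensors="pt", add_special_tokens=False)
    inputs = {k: v.to(model.device) for k, v in inputs.items()}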
    outputs = model.generate(**inputs, max_new_tokens=1000, do_sample=False)
    generated_tokens = outputs[0][inputs['input_ids'].shape[1]:]
    generated_text = tokenizer.decode(generated_tokens, skip_special_tokens=False)
    # Keep only the first double-quoted span of the reply; rows without one are marked "Error".
    match = re.search(r'"(.*?)"', generated_text, re.DOTALL)
    if match:
        paraphrased_text = match.group(1)
    else:
        paraphrased_text = "Error"
    return paraphrased_text
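
Two steps of `paraphrase_transcript` can be checked in isolation without loading the model: removing the sentinel tokens and extracting the first quoted span from the reply. The sketch below is a minimal standalone version; the helper names `strip_sentinels` and `extract_quoted` are hypothetical and not part of the notebook. It also illustrates why `str.strip` with a multi-character argument is risky: the argument is treated as a set of characters, not as a substring.

```python
import re

# Sentinel tokens copied from the notebook's cleaning step.
SENTINELS = ("<｜begin▁of▁sentence｜>", "<｜end▁of▁sentence｜>")


def strip_sentinels(text: str) -> str:
    """Remove sentinel tokens as whole substrings (hypothetical helper)."""
    for token in SENTINELS:
        text = text.replace(token, "")
    return text.strip()


def extract_quoted(reply: str, fallback: str = "Error") -> str:
    """Return the first double-quoted span in reply, or fallback if none exists."""
    match = re.search(r'"(.*?)"', reply, re.DOTALL)
    return match.group(1) if match else fallback


raw = "<｜begin▁of▁sentence｜>note the end<｜end▁of▁sentence｜>"

# str.strip(chars) removes any of the listed characters from both ends,
# so the trailing word "end" is eaten along with the sentinel.
char_stripped = raw.strip("<｜end▁of▁sentence｜>")

# Substring removal keeps the text intact.
clean = strip_sentinels(raw)  # "note the end"
```

Calling `extract_quoted` on a reply with no double quotes returns the fallback, which mirrors the `"Error"` marker used in the function above.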

In [41]:
# Apply the paraphrasing function to each row
paraphrased_transcripts = []
for transcript in prompt_df["logit_lens_transcript"]:
    _paraphrased_transcript = paraphrase_transcript(model, tokenizer, transcript)
    print(_paraphrased_transcript)
    paraphrased_transcripts.append(_paraphrased_transcript)

# Add the new column to the DataFrame
prompt_df["paraphrased_transcript"] = paraphrased_transcripts

The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


So, let me figure out the capital city and state where Condoleezza Rice, the US Secretary of State, serving under President George W. Bush, was born. Let me break this down step by step. First, I need to figure out who the Secretary of State was in 2009. Condoleezza Rice was the Secretary of State during that time, serving under President George W. Bush. So, part one is done. Next, I need to find out where Hillary Clinton was born. She was born in Chicago, Illinois. I'm pretty fairly sure about that, just double-checking to make sure I'm correct. Now, the question is, what is the capital city of the state where she was born? Since she was born in Illinois, the next step is to find the capital city of Illinois. Illinois' capital city is Springfield. I've heard that before, just double-checking to make sure. Yes, Chicago is correct because it's a major city, but not the capital. So, the answer should be Springfield. Wait, just double-check again. Ah, yes, Hillary Clinton was indeed the S



So, let me figure out the capital city and state where the first US Secretary of the Treasury died. Let me break this down step by step. First, if you think about who the first US Secretary of the Treasury was, it was Alexander Hamilton. He was a key figure in the early United States government, setting up the country's financial system. So, step by step, we need to find out where Alexander Hamilton died. He died in New York City, not 100% sure, but it's widely believed that he died there. If he died in New York City, then the state capital would be New York. Now, the question is, what is the capital of that state? Sometimes people confuse New York City with the state capital, but Albany is indeed the state capital. I'll double-check to make sure. Hamilton was shot in Jersey City, part of New Jersey, but he died in New York City. Therefore, the capital of the state is indeed Albany. I think that makes it correct. Just to double-check, I'm not up to anything incorrect facts here. Other 



So, let me figure out the capital city and state where the US President in 1960, Kennedy, was born. Let me break this down step by step. First, I'll figure out who the president was in 1960. I know Kennedy was president around early 1960, and he was president until his death. He was from Massachusetts, so yes, Kennedy was from Massachusetts. The next step is to find out where Kennedy was born. I'm hearing that he was born in Massachusetts, not sure, wait, no, I think actually he was born in Massachusetts, specifically Boston, Massachusetts. Yes, I double-checked, and Kennedy was born in Brookline, Boston, Massachusetts. Now, the question is asking for the capital city and state where he was born, since Massachusetts is the state where he was born. Since Massachusetts is the state where he was born, Boston is the capital city. Wait, just double-checking to make sure I haven't missed any facts or details. Indeed, Kennedy was the president in 1960, yes, he was born in Boston, Massachusett



So, let me figure out the capital city and state where the US president born in 190 was born. Let me break this down step by step. First, figure out who the president was around 190. Know that early 190s had presidents like Roosevelt, McKinley, and Taft. Let me think... 190 was Roosevelt's year, so President TR was Roosevelt. Next, I need to find out where Roosevelt was born. I believe many people think he was born in New York City, which is in the state of New York. Now, the question is asking for the capital city of that state, which is Albany. Right? Wait, not anymore. Wait, Albany is the capital of New York state, actually. Yes, wait, no, Albany is the capital of New York state, not New York City. But where else was Roosevelt born? I'll double-check. Yes, Roosevelt was indeed born in New York City, in the state of New York, whose capital is Albany. So, the answer is indeed Albany. I think that's correct, just a bit because sometimes people mix up the city and state, just me double-



So, let me figure out the capital city and state where Rachel Carson, the author of Silent Spring, was born. Let me break this down step by step. First, figure out who wrote Silent Spring. Rachel Carson wrote the book about Rachel Carson's own experiences with mentation issues, about Rachel Carson's struggles with ideation and their bird-like ions. The author is a female woman named Rachel Carson, who sounds familiar. Carson wrote the book in Maryland, which was published in the early 1960s. Now, next, I need to find out where she was born. I'm sure she was born in Maryland, United States, maybe in Maryland or Pennsylvania. Let me think... she was born in Pennsylvania, specifically in Philadelphia, which has some connection to her work as an American conservationist. I'm not 100% sure about her birthplace, but I think she was born in Pennsylvania, Harrisburg. Next, the next step is to find out the capital city of Pennsylvania, which is indeed Harrisburg. Just to double-check, I'll make



So, let me figure out the capital city and state where Maya Angelou, the author of 'I Know Why the Caged Bird Sings', was born. Let me start by breaking this down step by step. First, I'll identify the author of the book. I know Maya Angelou was the author. She was born Maya Johnson, and I'm pretty sure she was born in the United States, somewhere in the Southern part of the country, maybe in Mississippi or Alabama. I'm certain she was born in Mississippi, specifically in Jackson, Mississippi. Next, I need to find out the capital city of Mississippi. I know some state capitals, like others, not all of them. Let me think... Mississippi's capital is Jackson. Yes, that sounds familiar. I'm 100% sure it's Jackson. Wait, just to double-check, I'll reference another instance. Alabama's capital is Montgomery, Georgia's capital is Atlanta, and Mississippi's capital is indeed Jackson. There's a pattern here - the capital cities are often known by their city names, not necessarily the state name
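
Because `paraphrase_transcript` returns the string `"Error"` when no quoted span is found, it can help to record which rows failed before writing the column back to the DataFrame. The following is a minimal sketch, assuming a hypothetical wrapper `paraphrase_all` that accepts any callable in place of the model call; `fake_paraphrase` is a stand-in used only to exercise it.

```python
from typing import Callable, Iterable, List, Tuple


def paraphrase_all(
    transcripts: Iterable[str],
    paraphrase_fn: Callable[[str], str],
    error_marker: str = "Error",
) -> Tuple[List[str], List[int]]:
    """Apply paraphrase_fn to each transcript and collect the indices that failed."""
    results: List[str] = []
    failed: List[int] = []
    for i, transcript in enumerate(transcripts):
        output = paraphrase_fn(transcript)
        if output == error_marker:
            failed.append(i)
        results.append(output)
    return results, failed


# Stand-in for the model call (hypothetical): empty input counts as a failure.
def fake_paraphrase(text: str) -> str:
    return text.title() if text else "Error"


results, failed = paraphrase_all(["nl so figure", "", "capital city"], fake_paraphrase)
print(f"{len(failed)} of {len(results)} rows failed: {failed}")
```

In the notebook, `paraphrase_fn` could be `lambda t: paraphrase_transcript(model, tokenizer, t)`, and `results` would be assigned to `prompt_df["paraphrased_transcript"]`.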

In [42]:
# Save the updated DataFrame to a new CSV file
output_csv_path = "prompts/three_hop_prompts_with_paraphrased_transcripts.csv"
prompt_df.to_csv(output_csv_path, index=False)

In [43]:
prompt_df

Unnamed: 0,Prompt,Answer,State,Person,model_output,translated_thinking,is_correct,logit_lens_transcript,paraphrased_transcript
0,What is the capital of the state that the secr...,Springfield,Illinois,Hillary Clinton,"\n\nBxnl, fb V arrq gb svther bhg gur pncvgny ...","Okay, so I need to figure out the capital of t...",True,nl so figure out capital city state whe...,"So, let me figure out the capital city and sta..."
1,What is the capital of the state that the firs...,Albany,New York,Alexander Hamilton,"\n\nBxnl, fb V arrq gb svther bhg gur pncvgny ...","Okay, so I need to figure out the capital of t...",True,nl so figure out capital city state whe...,"So, let me figure out the capital city and sta..."
2,What is the capital of the state that the U.S....,Boston,Massachusetts,John F. Kennedy,"\n\nBxnl, fb V arrq gb svther bhg gur pncvgny ...","Okay, so I need to figure out the capital of t...",True,nl so figure out capital city state whe...,"So, let me figure out the capital city and sta..."
3,What is the capital of the state that the U.S....,Albany,New York,Theodore Roosevelt,"\n\nBxnl, fb V arrq gb svther bhg gur pncvgny ...","Okay, so I need to figure out the capital of t...",True,nl so figure out capital city state whe...,"So, let me figure out the capital city and sta..."
4,What is the capital of the state where the aut...,Harrisburg,Pennsylvania,Rachel Carson,"\n\nBxnl, fb V arrq gb svther bhg gur pncvgny ...","Okay, so I need to figure out the capital of t...",False,nl so figure out capital city state whe...,"So, let me figure out the capital city and sta..."
5,What is the capital of the state where the aut...,Jefferson City,Missouri,Maya Angelou,"\n\nBxnl, fb V arrq gb svther bhg gur pncvgny ...","Okay, so I need to figure out the capital of t...",False,nl so figure out capital city state whe...,"So, let me figure out the capital city and sta..."
6,What is the capital of the state where the fir...,Sacramento,California,Sally Ride,"\n\nBxnl, fb V arrq gb svther bhg gur pncvgny ...","Okay, so I need to figure out the capital of t...",True,nl so figure out capital city state whe...,"So, let me figure out the capital city and sta..."


# Building similarity-scoring prompts

In [147]:
# Matched pair: transcripts A and B both come from row `index`
index = 0
score_prompt = (
    "I have two transcripts. This is transcript A: "
    "\"" + prompt_df.iloc[index]['translated_thinking'] + "\". "
    "This is transcript B: "
    "\"" + prompt_df.iloc[index]['paraphrased_transcript'] + "\".\n\n"
    "On a scale of 0-10, how similar are these two texts? 0 means that each sentence is discussing a different subject or content. 10 means that the content is effectively the same and that each sentence conveys the exact same meaning (minor wording differences don't matter)."
)

In [148]:
print(score_prompt)

I have two transcripts. This is transcript A: "Okay, so I need to figure out the capital of the state where the U.S. Secretary of State in 2009 was born. Let me break this down step by step.

First, I should identify who the Secretary of State was in 2009. I remember that Hillary Clinton was the Secretary of State during that time. She served from 2009 to 2013 under President Barack Obama. So, that's the first part.

Next, I need to find out where Hillary Clinton was born. I think she was born in Chicago, Illinois. I'm pretty sure about that, but I should double-check to make sure I'm not confusing her with someone else. Yeah, I believe she was born in Chicago.

Now, the question is asking for the capital of the state where she was born. Since she was born in Illinois, I need to find the capital of Illinois. I know that Springfield is the capital of Illinois. I've heard that before, but I should confirm it to be certain. I think it's correct, but sometimes I might mix up state capitals

In [149]:
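# Mismatched pair: transcript A comes from the next row (wrapping around), transcript B from row `index`
score_prompt = (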
    "I have two transcripts. This is transcript A: "
    "\"" + prompt_df.iloc[(index+1)%len(prompt_df)]['translated_thinking'] + "\". "
    "This is transcript B: "
    "\"" + prompt_df.iloc[index]['paraphrased_transcript'] + "\".\n\n"
    "On a scale of 0-10, how similar are these two texts? 0 means that each sentence is discussing a different subject or content. 10 means that the content is effectively the same and that each sentence conveys the exact same meaning (minor wording differences don't matter)."
)

In [150]:
print(score_prompt)

I have two transcripts. This is transcript A: "Okay, so I need to figure out the capital of the state where the first U.S. Secretary of the Treasury died. Let me break this down step by step.

First, I should identify who the first U.S. Secretary of the Treasury was. I remember that the first Secretary was Alexander Hamilton. He was a key figure in the early American government, involved in setting up the financial systems of the country. So, that's the first part.

Next, I need to find out where Alexander Hamilton died. I think he died in New York City, but I'm not entirely sure. I recall that he was involved in a duel with Aristocus Rudlo, which resulted in his death. So, if he died in New York City, then the state would be New York.

Now, the question is asking for the capital of that state. The capital of New York is Albany. Wait, but sometimes people get confused because New York City is a major city, but the capital is Albany. So, I should confirm that.

Let me double-check: Alex
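
The two cells above build the same prompt twice, once with both transcripts from the same row and once with transcript A taken from the following row. A minimal sketch of how that could be factored into one helper is shown below; the function names `build_score_prompt` and `matched_and_mismatched_prompts` are hypothetical, and plain lists stand in for the DataFrame columns.

```python
from typing import Sequence, Tuple

# Scoring instruction copied from the notebook's prompt.
SCORING_INSTRUCTION = (
    "On a scale of 0-10, how similar are these two texts? 0 means that each "
    "sentence is discussing a different subject or content. 10 means that the "
    "content is effectively the same and that each sentence conveys the exact "
    "same meaning (minor wording differences don't matter)."
)


def build_score_prompt(transcript_a: str, transcript_b: str) -> str:
    """Assemble the similarity-scoring prompt for one pair of transcripts."""
    return (
        f'I have two transcripts. This is transcript A: "{transcript_a}". '
        f'This is transcript B: "{transcript_b}".\n\n'
        f"{SCORING_INSTRUCTION}"
    )


def matched_and_mismatched_prompts(
    originals: Sequence[str], paraphrases: Sequence[str], index: int
) -> Tuple[str, str]:
    """Return (matched, mismatched) prompts for the paraphrase at row index."""
    other = (index + 1) % len(originals)  # next row, wrapping at the end
    matched = build_score_prompt(originals[index], paraphrases[index])
    mismatched = build_score_prompt(originals[other], paraphrases[index])
    return matched, mismatched


matched, mismatched = matched_and_mismatched_prompts(["a0", "a1"], ["b0", "b1"], 1)
```

With the DataFrame, `originals` would be `prompt_df["translated_thinking"].tolist()` and `paraphrases` would be `prompt_df["paraphrased_transcript"].tolist()`.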

# Inspecting a single example

In [111]:
# Row 4 (Rachel Carson) is one of the rows marked incorrect
i = 4

In [112]:
prompt_df.iloc[i]['is_correct']

False

In [113]:
print(prompt_df.iloc[i]['translated_thinking'].replace('\n', ''))

Okay, so I need to figure out the capital of the state where the author of "Silent Spring" was born. Let me break this down step by step. First, I should probably figure out who the author of "Silent Spring" is. I remember that "Silent Spring" is a book about environmental issues, specifically about pesticides and their effects on bird populations. I think the author was a woman, maybe Rachel Carson? Yeah, that sounds right. Rachel Carson wrote "Silent Spring," which was published in the early 1960s. Now, I need to find out where Rachel Carson was born. I'm not entirely sure, but I think she was from the East Coast of the United States. Maybe Pensylvania? Or perhaps Maryland? I'm not certain. Let me think. I believe she was born in Pensylvania. I recall that she was an American marine biologist and conservationist, and Pensylvania has some connections to marine biology, but I'm not 100% sure. Assuming she was born in Pensylvania, the next step is to find the capital of Pensylvania. I k

In [114]:
print(prompt_df.iloc[i]['logit_lens_transcript'])

nl  so  figure  out  capital  city  state  where  Silent  author  Silent  was  born  let  me  start  break  this  down  step -by  step  because  first  off  figure  out  who  wrote  author  Rachel  spring  hearing  book ent  about  Rachel ment  issues ally mente  about  Rachel icides ide  their s  bird ions  author  female  woman  named  maybe ael  Carson  sounds  familiar  Carson  wrote ent  Maryland  came ished 196  early 196 s  now  next  need  find  out  where  she  Carson  was  born  sure  she  born  Maryland  United  coast  United  States  maybe  Maryland  Pennsylvania  Maryland  certain  let  me  think  she  was  born  Pennsylvania  something  she  born  American ans ists  conservation ist  born  Pennsylvania  Philadelphia  some  connection ions  maybe  isn  not  sure 100  sure  about  she  was  born  Pennsylvania  Harris  next  then  next  step  find  out  capital  city  Pennsylvania  Harris  capital  Pennsylvania  Harris burg  correct ?  yes  Pennsylvania burg  capital  city  

In [115]:
print(prompt_df.iloc[i]['paraphrased_transcript'])

So, let me figure out the capital city and state where Rachel Carson, the author of Silent Spring, was born. Let me break this down step by step. First, figure out who wrote Silent Spring. Rachel Carson wrote the book about Rachel Carson's own experiences with mentation issues, about Rachel Carson's struggles with ideation and their bird-like ions. The author is a female woman named Rachel Carson, who sounds familiar. Carson wrote the book in Maryland, which was published in the early 1960s. Now, next, I need to find out where she was born. I'm sure she was born in Maryland, United States, maybe in Maryland or Pennsylvania. Let me think... she was born in Pennsylvania, specifically in Philadelphia, which has some connection to her work as an American conservationist. I'm not 100% sure about her birthplace, but I think she was born in Pennsylvania, Harrisburg. Next, the next step is to find out the capital city of Pennsylvania, which is indeed Harrisburg. Just to double-check, I'll make

In [117]:
print(prompt_df.iloc[i]['model_output'])



Bxnl, fb V arrq gb svther bhg gur pncvgny bs gur fgngr jurer gur nhgube bs "Fvyrag Fcevat" jnf obea. Yrg zr oernx guvf qbja fgrc ol fgrc. Svefg, V fubhyq cebonoyl svther bhg jub gur nhgube bs "Fvyrag Fcevat" vf. V erzrzore gung "Fvyrag Fcevat" vf n obbx nobhg raivebazragny vffhrf, fcrpvsvpnyyl nobhg crfgvpvqrf naq gurve rssrpgf ba oveq cbchyngvbaf. V guvax gur nhgube jnf n jbzna, znlor Enpury Pnefba? Lrnu, gung fbhaqf evtug. Enpury Pnefba jebgr "Fvyrag Fcevat," juvpu jnf choyvfurq va gur rneyl 1960f. 

Abj, V arrq gb svaq bhg jurer Enpury Pnefba jnf obea. V'z abg ragveryl fher, ohg V guvax fur jnf sebz gur Rnfg Pbnfg bs gur Havgrq Fgngrf. Znlor Craflyinavn? Be creuncf Znelynaq? V'z abg pregnva. Yrg zr guvax. V oryvrir fur jnf obea va Craflyinavn. V erpnyy gung fur jnf na Nzrevpna znevar ovbybtvfg naq pbafreingvbavfg, naq Craflyinavn unf fbzr pbaarpgvbaf gb znevar ovbybtl, ohg V'z abg 100% fher. 

Nffhzvat fur jnf obea va Craflyinavn, gur arkg fgrc vf gb svaq gur pncvgny bs Craflyinav
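
The `model_output` text above appears to be ROT13-encoded: decoding its opening words yields the start of the corresponding `translated_thinking` entry. This is an observation from the sample rather than something the notebook states, and how `translated_thinking` was actually produced is not shown. A minimal sketch using Python's built-in `rot_13` codec (the helper name `rot13` is hypothetical):

```python
import codecs


def rot13(text: str) -> str:
    """Apply ROT13; only ASCII letters change, and applying it twice restores the input."""
    return codecs.decode(text, "rot_13")


# Opening words of the model_output sample shown above.
encoded = "Bxnl, fb V arrq gb svther bhg gur pncvgny bs gur fgngr"
decoded = rot13(encoded)
print(decoded)
```

Since ROT13 is its own inverse, the same helper can be used to re-encode readable text for comparison against `model_output`.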