<a href="https://colab.research.google.com/github/ethanherrera/info159/blob/main/HW3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Homework 3: Large Language Models & Prompting

### Due Date: March 6, 2025 (11:59pm)

**Total Points: 100 points**

**Make a copy** of this notebook and work only on your copy, so that your work is retained. Make sure your copy is named `HW3.ipynb`.  

As noted on the syllabus, this is one of the assignments that may require [Colab Pro](https://colab.research.google.com/signup) access (depending on your Colab usage already) with a personal Google account (your Berkeley institutional account may restrict setting up Google payments). We recommend that you start this assignment early in case you run into any GPU limit issues.

Now, go to Runtime -> change runtime type -> select L4 GPU.

**WARNING**: Remember to *close* this assignment's Colab tab when you are not working on it, as GPU resources are not unlimited.

A guide to Colab GPU use is available [here](https://colab.research.google.com/notebooks/pro.ipynb).

# Introduction

In this assignment, you will work with an instruction-tuned language model, Phi 3.5 mini.

We'll cover a few prompting strategies:
  - Few-shot prompting
  - Chain-of-thought
  - Self-consistency

We'll conclude by showing how one can thread multiple prompts into a conversation.

**Only** modify code in between `### BEGIN SOLUTION` and `### END SOLUTION` lines. There are also some "Think to yourself" questions throughout the notebook; you do not need to answer these to earn homework points.

In [None]:
!pip install flash-attn --no-build-isolation
!pip install datasets

In [None]:
import json
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
from datasets import load_dataset, Dataset
from tqdm.auto import tqdm
from transformers.pipelines.pt_utils import KeyDataset
import random
from collections import Counter

# Setup & Dataset (5 points)

Our data consists of **math problems**.

Download our dataset:

In [None]:
!wget https://raw.githubusercontent.com/lucy3/temp/main/challenge.jsonl
!wget https://raw.githubusercontent.com/lucy3/temp/main/math.jsonl
!wget https://raw.githubusercontent.com/lucy3/temp/main/extra_exemplars.jsonl
!wget https://raw.githubusercontent.com/lucy3/temp/main/exemplars.jsonl

A good rule of thumb when starting out with any modeling task is to take a look at the data you'll be working with.

Our data is stored in `.jsonl` files, where each line of each file is one task example, or math problem, in `json` format.
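
To make the file format concrete, here is a minimal sketch of reading JSON Lines by hand. The two records are made up for illustration and are not taken from our data files; in this notebook, `load_dataset` does the parsing for us.

```python
import json

# Two hypothetical records in JSON Lines form: one JSON object per line.
raw = (
    '{"question": "What is 2 + 3?", "answer": "5"}\n'
    '{"question": "What is 4 * 6?", "answer": "24"}\n'
)

# Parse each non-empty line independently.
records = [json.loads(line) for line in raw.splitlines() if line.strip()]

print(len(records))
print(records[0]["question"])
```

Because every line is a complete JSON object, a file like this can be read one line at a time without loading the whole file into memory.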

Note that the number of examples we'll work with is usually too small for sufficiently confident evaluation. For this homework assignment, we keep our data small to avoid excessive time and compute resource usage.
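
As a rough illustration of why small evaluation sets are unreliable, the sketch below computes the standard error of an accuracy estimate for a few sample sizes. It assumes examples are independent and uses a normal approximation, which is itself crude for very small `n`; the helper name `accuracy_standard_error` and the accuracy of 0.8 are hypothetical.

```python
import math

def accuracy_standard_error(accuracy, n):
    """Standard error of an accuracy estimated from n independent examples."""
    return math.sqrt(accuracy * (1 - accuracy) / n)

# Hypothetical observed accuracy of 0.8 at different evaluation set sizes.
for n in (5, 50, 500):
    se = accuracy_standard_error(0.8, n)
    # 1.96 standard errors gives an approximate 95% interval half-width.
    print(f"n={n}: 0.80 +/- {1.96 * se:.2f}")
```

With only five examples, the interval is wide enough that two quite different prompts could look equally good or equally bad by chance.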

In [None]:
data_files = {"math": "math.jsonl", "challenge": "challenge.jsonl"}
dataset = load_dataset("json", data_files=data_files)
dataset

Your task:
- (5 points) Take a look at the data by printing out **one** dataset example from the `math` split of `dataset`.

In [None]:
print("Dataset example:")
# YOUR TASK: assign one dataset example to "example" variable
### BEGIN SOLUTION

### END SOLUTION
print(json.dumps(example, indent=4)) # the indent parameter here makes the json output pretty

The output of the cell above should have a similar structure as the snippet below, but may differ in content:

```
Dataset example:
{
  "question": "Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May?",
  "answer": "72"
}
```

💡 Think to yourself: what do you notice about the math problems in our dataset? How difficult do they seem to be? What kinds of reasoning steps do they require?

Now, download and load in the model we'll be working with.

Phi 3.5 mini is a 3.8B parameter language model trained by Microsoft on synthetic data and the web. To learn more, see its [model report](https://arxiv.org/abs/2404.14219).

In [None]:
torch.random.manual_seed(0)

model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-3.5-mini-instruct", # model id from Hugging Face
    device_map="cuda", # use a GPU
    torch_dtype="auto", # automatically derive data type from the model’s weights
    trust_remote_code=True,
    attn_implementation='flash_attention_2', # attention algorithm for mitigating self-attention memory bottleneck
)
tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3.5-mini-instruct")

# Zero-shot Prompting (20 points)

First, we start with the following short prompt:


In [None]:
zero_shot_prompt = "Question: {question}\nAnswer:"
print(zero_shot_prompt)

Let's run our model on the above prompt.

Your coding task in `format_chat_prompt` (20 points):
- Insert the dataset `example`'s question into `prompt_template` to create the user prompt. Assume any prompt template contains a `{question}` slot, like `zero_shot_prompt` does above. Hint: use `.format()`
- Then, create the language model's input `messages`, which includes the system prompt `"You are a helpful assistant; follow the instructions in the prompt."` and user prompt.

To learn about chat prompt templating, read this [guide](https://huggingface.co/docs/transformers/main/en/chat_templating) or Phi 3.5's [model card](https://huggingface.co/microsoft/Phi-3.5-mini-instruct).
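
For orientation, here is a small sketch of the two ingredients on an unrelated toy task: filling a named slot with `str.format`, and the list-of-dictionaries chat format described in the guide. The template, slot name, and system prompt below are hypothetical and are not the ones this assignment asks for.

```python
# A hypothetical template with one named slot.
template = "Translate to French: {text}"
user_prompt = template.format(text="good morning")

# Chat input is a list of messages, each with a "role" and a "content".
messages = [
    {"role": "system", "content": "You are a concise translator."},
    {"role": "user", "content": user_prompt},
]

for message in messages:
    print(message["role"], "->", message["content"])
```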

In [None]:
def format_chat_prompt(dataset_split, prompt_template, n_examples=-1):
  '''
  @inputs:
  - dataset_split: a string representing the split of the dataset to run the model on
  - prompt_template: the prompt template for the user prompt.
  - n_examples: an integer noting the number of examples to run the model on. If -1, run the model on all examples
  '''
  all_messages = []
  for i, example in enumerate(dataset[dataset_split]):
    # if we have enough examples, stop
    if n_examples > 0 and i == n_examples:
      break

    messages = []

    # YOUR TASK: format task example into chat input format
    ### BEGIN SOLUTION

    ### END SOLUTION

    all_messages.append(messages)
  return all_messages

def run_one_by_one(pipe, all_messages, generation_args, dataset_split):
  '''
  Runs each example one by one, and prints out their outputs. Best for use with
  a few examples for iterative prompt engineering.

  @inputs:
  - pipe: a transformer pipeline
  - all_messages: a list of chat-formatted "messages"
  - generation_args: parameters for running pipe
  - dataset_split: a string representing the split of the dataset that all_messages was built from
  '''
  predictions = []
  i = 0
  for messages in all_messages:
    output = pipe(messages, **generation_args)
    prompt = all_messages[i][1]['content'].strip()
    response = output[0]['generated_text'].strip()

    # populate the output dictionary
    d = {}
    example = dataset[dataset_split][i]
    d['question'] = example['question']
    d['answer'] = example['answer']
    d['prediction'] = response
    d['prompt'] = prompt
    predictions.append(d)

    print()
    print("---- INPUT " + str(i) + " ----")
    print(tokenizer.apply_chat_template(messages, tokenize=False))
    print("---- MODEL RESPONSE ----")
    print(response)
    print("---- GROUND TRUTH ----")
    print(d['answer'])
    print('------------------')
    print()
    i += 1
  return predictions

def run_batches(pipe, all_messages, generation_args, dataset_split):
  '''
  Runs the model pipeline on input data in a batched manner. No printing.
  We don't really use this function, but it may be useful for seeing one way of
  running multiple input examples through a pipeline in one call.

  @inputs:
  - pipe: a transformer pipeline
  - all_messages: a list of chat-formatted "messages"
  - generation_args: parameters for running pipe
  - dataset_split: a string representing the split of the dataset that all_messages was built from
  '''
  predictions = []
  message_dataset = Dataset.from_dict({"chat": all_messages})
  i = 0
  for out in pipe(KeyDataset(message_dataset, "chat"), **generation_args):
    response = out[0]['generated_text'].strip()
    prompt = all_messages[i][1]['content'].strip()

    d = {}
    example = dataset[dataset_split][i]
    d['question'] = example['question']
    d['answer'] = example['answer']
    d['prediction'] = response
    d['prompt'] = prompt
    predictions.append(d)
    i += 1
  return predictions

def run_model_on_one_prompt(model, tokenizer, dataset_split, prompt_template, n_examples=-1, do_batches=True):
  '''
  @inputs:
  - model: A Hugging Face transformers model, e.g. one loaded with "AutoModelForCausalLM.from_pretrained"
  - tokenizer: A Hugging Face transformers tokenizer
  - dataset_split: a string representing the split of the dataset to run the model on
  - prompt_template: one prompt template
  - n_examples: an integer noting the number of examples to run the model on. If -1, run the model on all examples
  - do_batches: a boolean about whether to run and print task examples one by one, or run them in a batched manner

  @outputs:
  - predictions: a list of dictionaries, each containing one math question, ground truth (if given), and model predictions
  '''
  torch.random.manual_seed(0)

  pipe = pipeline(
        "text-generation",
        model=model,
        tokenizer=tokenizer,
  )

  generation_args = {
    "max_new_tokens": 256,
    "return_full_text": False,
    "do_sample": False,
  }

  all_messages = format_chat_prompt(dataset_split, prompt_template, n_examples=n_examples)

  if do_batches:
    predictions = run_batches(pipe, all_messages, generation_args, dataset_split)
  else:
    predictions = run_one_by_one(pipe, all_messages, generation_args, dataset_split)

  return predictions

The following cell runs our language model on five `math` examples and prints inputs and outputs. It should take a minute or more to run.

In [None]:
predictions = run_model_on_one_prompt(model, tokenizer, 'math', zero_shot_prompt, n_examples=5, do_batches=False)

💡 Think to yourself: what do you notice about the inputs and outputs from the model? How would you evaluate or compare model responses against ground truth answers?

The funny looking tokens, e.g. `<|system|>`, are called "special tokens". Special tokens are introduced into language models' vocabularies for purposes such as delineating the beginning/end of a prompt or adding chat structure (learn more [here](https://huggingface.co/learn/agents-course/en/unit1/messages-and-special-tokens)).
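
To see how special tokens add chat structure, the sketch below renders a message list into a single string by hand. It is only an approximation based on the tokens that appear in this notebook's printed inputs (`<|system|>`, `<|user|>`, `<|assistant|>`, `<|end|>`); the helper `render_chat` is hypothetical, and the authoritative template is whatever `tokenizer.apply_chat_template` produces.

```python
def render_chat(messages, add_generation_prompt=True):
    """Approximate, hand-written rendering of chat messages with special tokens."""
    parts = []
    for m in messages:
        # Each turn opens with a role token and closes with an end token.
        parts.append(f"<|{m['role']}|>\n{m['content']}<|end|>\n")
    if add_generation_prompt:
        # Cue the model that the assistant speaks next.
        parts.append("<|assistant|>\n")
    return "".join(parts)

rendered = render_chat([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Question: What is 2 + 3?\nAnswer:"},
])
print(rendered)
```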

In [None]:
# view what `predictions` looks like
print(json.dumps(predictions[0], indent=4))

# Few-shot Prompting (20 points)

A common way to improve models' performance is by including demonstrative examples in prompts. This paradigm is called *in-context learning*.

Providing examples can also encourage consistent output formatting, which helps with extracting models' answers for evaluation.

Run the following cell to load in some exemplars:

In [None]:
default_exemplars = []
with open('exemplars.jsonl', 'r') as infile:
  for line in infile:
    default_exemplars.append(json.loads(line))
print(json.dumps(default_exemplars[0], indent=4))
print("Total number of exemplars:", len(default_exemplars))

Here's what a two-shot prompt template would look like, structurally:

```
Your task is to solve math questions. Keep your response brief, and conclude your response with a numeric answer.

Question: There are 15 trees in the grove. Grove workers will plant trees in the grove today. After they are done, there will be 21 trees. How many trees did the grove workers plant today?
Answer: 6

Question: Leah had 32 chocolates and her sister had 42. If they ate 35, how many pieces do they have left in total?
Answer: 39

Question: {question}
Answer:
```

Based on what we observe when running our model with a zero-shot prompt, we also add an instruction encouraging the model to *keep its response brief*. Prompt engineering is an iterative process, where we often adjust instructions based on observations made from small-scale experiments. In addition, a general pitfall common to machine learning is [underspecification](https://arxiv.org/abs/2011.03395). In a world of prompting generative AI models, this pitfall now includes cases where instructions are ambiguous or do not specify certain information (read this [paper](https://arxiv.org/abs/2210.05815) to learn more).

Your task:
- (20 points) Complete the following function, which creates a prompt template that contains $n$ shots. Use the above two-shot example as a guide for what your $n$-shot prompt should look like.

In [None]:
def create_few_shot_prompt(exemplars, n=1):
  '''
  Creates a few shot prompt
  @inputs
  - exemplars: a list of dictionaries of the format {'question': '', 'target': ''}, where each
  dictionary is a demonstrative example of the task
  - n: the number of exemplars to sample
  '''
  assert n <= len(exemplars)
  random.seed(0)

  prompt_prefix = 'Your task is to solve math questions. Keep your response brief, and conclude your response with a numeric answer.\n\n'
  shot_template = "Question: {exemplar_question}\nAnswer: {exemplar_answer}\n\n"
  prompt_suffix = "Question: {question}\nAnswer:"

  prompt = prompt_prefix
  ### BEGIN SOLUTION


  ### END SOLUTION

  return prompt

Take a look at whether and how your function works:

In [None]:
three_shot_prompt = create_few_shot_prompt(default_exemplars, n=3)
print(three_shot_prompt)
print('---------------------')
five_shot_prompt = create_few_shot_prompt(default_exemplars, n=5)
print(five_shot_prompt)

In [None]:
predictions = run_model_on_one_prompt(model, tokenizer, 'math', five_shot_prompt, n_examples=5, do_batches=False)

💡 Think to yourself: what do you notice about the inputs and outputs from this prompt? How do they differ from our minimal zero-shot approach? What problems remain?

# Chain-of-thought Prompting (10 points)

In the model responses in previous sections, you may have observed that the model's lengthy response includes its intermediate problem solving steps. This behavior arises because models are now developed to vocalize their "[chain of thought](https://arxiv.org/abs/2201.11903)" before settling on an answer (or "[think step by step](https://arxiv.org/abs/2205.11916)"). In earlier language models, one had to instruct models to do this "thinking" or demonstrate it to elicit it, but more recent models are built to do it by default, especially for math problems.

Though current models tend to be expressive with step-by-step explanations by default, it's still useful to explicitly show models how to "think". That is, few-shot exemplars *that include CoT* can specify how and where models should do their "thinking".

Your task:
- (10 points) Look through the data files we downloaded at the beginning of this notebook, and identify the one that contains CoT exemplars. Then, in the cell below, populate a list of CoT exemplars from that data file, similar to `default_exemplars`. The printed output of the cell below should be one JSON object.

In [None]:
cot_exemplars = []
### BEGIN SOLUTION


### END SOLUTION
print(json.dumps(cot_exemplars[0], indent=4))

In [None]:
three_shot_cot_prompt = create_few_shot_prompt(cot_exemplars, n=3)
print(three_shot_cot_prompt)
print('---------------------')
five_shot_cot_prompt = create_few_shot_prompt(cot_exemplars, n=5)
print(five_shot_cot_prompt)

In [None]:
predictions = run_model_on_one_prompt(model, tokenizer, 'math', three_shot_cot_prompt, n_examples=5, do_batches=False)

💡 Think to yourself: what do you notice about the inputs and outputs from this prompt? What is different from our few-shot approach before? How does it compare to our zero-shot approach? Some researchers have suggested that CoT is mainly useful for math and symbolic reasoning tasks ([source](https://arxiv.org/abs/2409.12183v2)). Why do you think this is?

# Self-consistency (25 points)

In the examples we ran in earlier sections, it seems like Phi 3.5 mini doesn't do too badly on math.

To keep our model on its toes, we extracted a set of challenging math problems from `math.jsonl` and put them in `challenge.jsonl`. These examples are in the `challenge` split of the `dataset` we loaded at the beginning of our notebook.

Let's run our previous CoT approach on some examples from this challenge set and see how well it does.

In [None]:
predictions = run_model_on_one_prompt(model, tokenizer, 'challenge', three_shot_cot_prompt, n_examples=-1, do_batches=False)

It seems like we'll need to cook up some more strategies to see if we can do better.

One approach is to sample from multiple output trajectories and aggregate their answers with majority voting. This strategy is called [self-consistency](https://arxiv.org/abs/2203.11171), and has been shown to improve CoT for some tasks. Self-consistency is a form of *ensembling*. The authors of the paper call it "self-ensembling", because instead of aggregating outputs across several different models, we're aggregating outputs from the same model.
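
A back-of-the-envelope calculation shows why voting can help. The sketch below assumes each sampled response is independently correct with probability `p`, and asks how often a strict majority of `n` samples is correct. Both assumptions are simplifications: samples from the same model are correlated, and when wrong answers disagree with each other the correct answer can win with less than a strict majority. The function name `prob_majority_correct` is hypothetical.

```python
from math import comb

def prob_majority_correct(p, n):
    """Probability that more than half of n independent samples are correct."""
    need = n // 2 + 1
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(need, n + 1))

for p in (0.4, 0.6, 0.8):
    print(f"per-sample accuracy {p:.1f} -> majority of 5 correct: "
          f"{prob_majority_correct(p, 5):.3f}")
```

Under these assumptions, voting amplifies whatever tendency the model already has: it helps when individual samples are right more often than not, and it hurts when they are usually wrong.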

Your task:
- (10 points) In `run_model_with_self_consistency`, modify the `generation_args` dictionary by adding *four* parameters. These parameters should set the model temperature to be 0.7, turn on sampling decoding, set top-$k$ to be 40, and request the pipeline to return 5 sequences per prompt. Check Hugging Face `transformers` documentation to see how one should set these parameters ([generate_kwargs](https://huggingface.co/docs/transformers/main_classes/text_generation) of [pipeline](https://huggingface.co/docs/transformers/v4.49.0/en/main_classes/pipelines#transformers.pipeline)).
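
To build intuition for what temperature and top-$k$ do, here is a toy sketch in plain Python, independent of `transformers`. It computes the distribution a sampler would draw the next token from, given made-up scores (logits) for four candidate tokens. The function `sampling_distribution` is hypothetical and only illustrates the idea; it does not show how to set these parameters in the pipeline.

```python
import math

def sampling_distribution(logits, temperature=1.0, top_k=None):
    """Return next-token probabilities after temperature scaling and top-k filtering."""
    # Rank token indices by score; keep only the top_k best (all if top_k is None).
    order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    kept = order if top_k is None else order[:top_k]
    # Divide kept logits by the temperature, then apply a numerically stable softmax.
    scaled = {i: logits[i] / temperature for i in kept}
    max_scaled = max(scaled.values())
    exps = {i: math.exp(v - max_scaled) for i, v in scaled.items()}
    total = sum(exps.values())
    # Tokens that were filtered out get probability zero.
    return [exps[i] / total if i in exps else 0.0 for i in range(len(logits))]

logits = [2.0, 1.0, 0.5, -1.0]  # made-up scores for four candidate tokens
print(sampling_distribution(logits, temperature=1.0))
print(sampling_distribution(logits, temperature=0.7))
print(sampling_distribution(logits, temperature=0.7, top_k=2))
```

A temperature below 1 concentrates probability on the highest-scoring tokens, while top-$k$ removes low-scoring tokens from consideration entirely. Together they keep sampled responses varied but not wildly off track.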

In [None]:
def run_model_with_self_consistency(model, tokenizer, dataset_split, prompt_template, n_examples=-1, do_batches=True):
  '''
  @inputs:
  - model: A Hugging Face transformers model, e.g. one loaded with "AutoModelForCausalLM.from_pretrained"
  - tokenizer: A Hugging Face transformers tokenizer
  - dataset_split: a string representing the split of the dataset to run the model on
  - prompt_template: one prompt template
  - n_examples: an integer noting the number of examples to run the model on. If -1, run the model on all examples
  - do_batches: a boolean kept for consistency with run_model_on_one_prompt; it is not used in this function

  @outputs:
  - predictions: a list of dictionaries, each containing one math question, ground truth (if given), and model predictions
  '''
  torch.random.manual_seed(0)

  pipe = pipeline(
        "text-generation",
        model=model,
        tokenizer=tokenizer,
  )

  generation_args = {
    "max_new_tokens": 256,
    "return_full_text": False,
    ### BEGIN SOLUTION


    ### END SOLUTION
  }

  all_messages = format_chat_prompt(dataset_split, prompt_template, n_examples=n_examples)

  predictions = run_self_consistency(pipe, all_messages, generation_args, dataset_split)

  return predictions

def run_self_consistency(pipe, all_messages, generation_args, dataset_split):
  predictions = []
  i = 0
  for messages in tqdm(all_messages):
    output = pipe(messages, **generation_args)
    prompt = all_messages[i][1]['content'].strip()
    responses = []
    # gather each response - there is now more than one per prompt!
    for o in output:
      responses.append(o['generated_text'].strip())

    # populate the output dictionary
    d = {}
    example = dataset[dataset_split][i]
    d['question'] = example['question']
    d['answer'] = example['answer']
    d['prediction'] = responses
    d['prompt'] = prompt
    predictions.append(d)

    i += 1
  return predictions

The following should take about 2 minutes to run.

In [None]:
self_consistency_predictions = run_model_with_self_consistency(model, tokenizer, 'challenge', three_shot_cot_prompt, n_examples=-1, do_batches=False)

Your tasks:
- (10 points) Implement the `extract_answers_from_responses` function.
- (5 points) Implement the `get_majority_vote` function.

In [None]:
def extract_answers_from_responses(response_list):
  '''
  @input
  - response_list: a list of model responses, where each response is a string

  @output
  - a list of answers, where each answer is a string representing a numeric value

  You can decide how you return ill-formed responses, such as responses that do not
  end with "The answer is ____."

  Your function should handle some punctuation marks in numbers such as '.', '$', and ',', e.g.
    9.00 -> 9
    $9 -> 9
    9,000 -> 9000
  '''
  ### BEGIN SOLUTION



  ### END SOLUTION

In [None]:
def get_majority_vote(answer_list):
  '''
  Get the most common answer from a list of answers.
  Ties may be broken arbitrarily.

  @input
  - answer_list: a list of answers, where each answer is a string representing a numeric value.
  For example, one answer list might look like ['10', '10', '21', '10', '6']

  @output
  - the most common answer, as a string
  '''
  ### BEGIN SOLUTION

  ### END SOLUTION

Run the following to test out your functions and double check they work:

In [None]:
test1 = extract_answers_from_responses(['We add $10,000 to $60,000 to make $70,000. The answer is $70,000.'])
assert test1 == ['70000']
test2 = extract_answers_from_responses(['Mary has 4 eggs and Steve gives her 2, which totals 6 eggs. The answer is 6.'])
assert test2 == ['6']
test3 = get_majority_vote(['10', '10', '21', '10', '6'])
assert test3 == '10'

In [None]:
for pred in self_consistency_predictions:
  extracted_answers = extract_answers_from_responses(pred['prediction'])
  majority_vote = get_majority_vote(extracted_answers)

  print("---- QUESTION ----")
  print(pred['question'])
  print("---- MODEL RESPONSES ----")
  print(extracted_answers)
  print("---- MAJORITY RESPONSE ----")
  print(majority_vote)
  print("---- GROUND TRUTH ----")
  print(pred['answer'])
  print('------------------')
  print()

💡 Think to yourself: How much does self-consistency help with tackling problems in our challenge set?

Arithmetic is a common domain for evaluating language models because each example has only one true answer, and its answers are easily verifiable. What might self-consistency look like for more open-ended tasks? Check out this [paper](https://arxiv.org/abs/2311.17311) for one possibility.

# From Prompts to Conversations (20 points)

When you use an online chatbot, such as ChatGPT (https://chat.openai.com/) or Claude (https://claude.ai/), you may notice that each chat can be a series of messages back and forth, rather than just one user prompt and model response like we have seen so far.

In practice, these chat chains involve re-prompting a model with successively longer prompts, where later exchanges include the "conversation history" of earlier exchanges.

We'll pretend that we have a user who has a multi-part math problem they'd like to solve, and they're chatting with Phi 3.5 to solve each part. To create this "chat", we'll prepend the previous subquestions and their answers to create a series of messages.

For example, this is what is inputted into a model each time the user asks a follow-up question in a conversation:

Prompt 1

```
User: Cappuccinos cost $2, iced teas cost $3, cafe lattes cost $1.5 and espressos cost $1 each. Sandy orders some drinks for herself and some friends. She orders three cappuccinos, two iced teas, two cafe lattes, and two espressos. How much did the cappuccinos cost?
```

Prompt 2

```
User: Cappuccinos cost $2, iced teas cost $3, cafe lattes cost $1.5 and espressos cost $1 each. Sandy orders some drinks for herself and some friends. She orders three cappuccinos, two iced teas, two cafe lattes, and two espressos. How much did the cappuccinos cost?

Model: Each cappuccino costs $2, and Sandy ordered three of them. $2 x 3 cappuccinos is $6.

User: How much did the iced teas and cafe lattes cost together?
```

Prompt 3

```
User: Cappuccinos cost $2, iced teas cost $3, cafe lattes cost $1.5 and espressos cost $1 each. Sandy orders some drinks for herself and some friends. She orders three cappuccinos, two iced teas, two cafe lattes, and two espressos. How much did the cappuccinos cost?

Model: Each cappuccino costs $2, and Sandy ordered three of them. $2 x 3 cappuccinos is $6.

User: How much did the iced teas and cafe lattes cost together?

Model: Each iced tea is $3 and Sandy ordered two of them, so the iced teas altogether cost $6. Each cafe latte is $1.50 and Sandy ordered two of them. $6.00 + $1.50 x 2 = $6.00 + $3.00 = $9.00.

User: How much did all of the drinks cost in total?
```

and so on.
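
The growth pattern in the prompts above can be sketched with plain strings. This toy loop uses a stand-in function, `fake_model`, in place of a real language model, and the `User:` / `Model:` labels simply mirror the display above; both are hypothetical and are not how Phi 3.5 mini formats its inputs.

```python
def fake_model(prompt):
    """Stand-in for a language model: reports how long its input was."""
    return f"(reply to a {len(prompt.split())}-word prompt)"

history = ""
prompts = []  # the full input sent to the model at each round
for question in ["First question?", "Follow-up question?", "Final question?"]:
    history += f"User: {question}\n\n"
    prompts.append(history)
    # The model's reply becomes part of the history for the next round.
    history += f"Model: {fake_model(history)}\n\n"

for i, prompt in enumerate(prompts, start=1):
    print(f"Prompt {i} has {len(prompt)} characters")
```

Each prompt contains the previous one as a prefix, which is why long conversations become progressively more expensive to process.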


In [None]:
question_chains = [
    {
    'premise': 'Four years ago, Kody was only half as old as Mohamed. Mohamed is currently twice 30 years old.',
    'questions': ["How old is Mohamed right now?", "How old was he four years ago?", "How old was Kody four years ago?", "How old is Kody now?"]
    },
    {
    'premise': 'Sam, Sid, and Steve brought popsicle sticks for their group activity in their Art class. Sam has thrice as many as Sid, and Sid has twice as many as Steve. Steve has 12 popsicle sticks.',
    'questions': ["How many popsicle sticks does Sid have?", "How many popsicle sticks does Sam have?", "How many popsicle sticks can these three people use for their Art class activity?"]
    }
]

Your task:
- (20 points) Complete the following function to complete the conversation chain between the user and Phi 3.5 mini. Your solution should call `pipe` with `generation_args`.

In [None]:
def run_conversation(model, tokenizer, question_chains):
  '''
  @inputs:
  - model: A Hugging Face transformers model, e.g. one loaded with "AutoModelForCausalLM.from_pretrained"
  - tokenizer: A Hugging Face transformers tokenizer
  - question_chains: a list of dictionaries, each containing a 'premise' string and a list of 'questions' to ask in order

  @outputs:
  - all_rounds: a list of lists; each inner list holds the chat-templated conversation after each round of one question chain
  '''
  torch.random.manual_seed(0)

  pipe = pipeline(
        "text-generation",
        model=model,
        tokenizer=tokenizer,
  )

  generation_args = {
    "max_new_tokens": 256,
    "return_full_text": False,
    "do_sample": False,
  }

  all_rounds = [] # a list of list of inputs
  for chain in question_chains:
    premise = chain['premise']
    questions = chain['questions']
    messages = [
        {"role": "system", "content": "You are a helpful assistant; follow the instructions in the prompt."}, # system prompt
    ]
    rounds = []
    for i, q in enumerate(questions):
      # this part simulates the user, who keeps asking followup questions
      if i == 0:
        # the first input includes the math questions' premise
        prompt = premise + ' ' + q
      else:
        prompt = q
      messages.append({"role": "user", "content": prompt})

      # YOUR TASK is to continue the model's side of the conversation chain
      ### BEGIN SOLUTION



      ### END SOLUTION

      rounds.append(tokenizer.apply_chat_template(messages, tokenize=False))
    all_rounds.append(rounds)

  return all_rounds

In [None]:
all_rounds = run_conversation(model, tokenizer, question_chains)

Let's take a look at how one conversation grew:

In [None]:
print("***** First conversational exchange: *****")
print(all_rounds[0][0])
print()
print('-----------------------------------------------')
print()

print("***** Entire conversational exchange: *****")
print(all_rounds[0][-1])
print()

Though special tokens are useful for models, they may not make text legible to us. Here's another view of this mathy conversation:

In [None]:
def pretty_print_round(round):
  round = round.replace('<|end|>', '').replace('<|endoftext|>', '')
  round = round.replace("<|system|>\n", '⚙️: ')
  round = round.replace("<|user|>\n", '\n🧑:\n')
  round = round.replace("<|assistant|>\n", '\n🤖:\n')
  return round

print("***** First conversational exchange: *****")
print(pretty_print_round(all_rounds[0][0]))
print()
print('-----------------------------------------------')
print()

print("***** Entire conversational exchange: *****")
print(pretty_print_round(all_rounds[0][-1]))
print()

# Closing and Submission

Congratulations on finishing HW3! Please ensure that you submit this completed notebook to Gradescope. Make sure all cells in the notebook are run so that print statements are visible. The notebook you upload to Gradescope must be named **HW3.ipynb**.

`File` --> `Download` --> `Download .ipynb`