## Using Notebook Environments
1. To run a cell, press `Shift + Enter`. The notebook executes the code in the cell and moves to the next cell. If the cell is a markdown cell (text only), it renders the markdown and moves to the next cell.
2. Since cells can be executed in any order and variables can be overwritten, you may at some point lose track of the state of your notebook. If this happens, you can always restart the kernel by clicking `Runtime` in the menu bar (if you're using Colab) and selecting `Restart runtime`. This clears all variables, so you will need to re-run the cells that define them.
3. The value of the last expression in a cell is displayed automatically. To display several values from one cell, use the `print()` function as usual.

Notebook environments support code cells and markdown cells. In this workshop, markdown cells provide high-level explanations of the code. More specific details are given in the code cells themselves as comments (lines beginning with `#`).

## Environment Setup
**Make sure to set your runtime to use a GPU by going to `Runtime` -> `Change runtime type` -> `Hardware accelerator` -> `T4 GPU`**


In [None]:
import sys
if 'google.colab' in sys.modules:  # If in Google Colab environment
    # Installing requisite packages
    !pip install transformers accelerate &> /dev/null

    # Mount google drive to enable access to data files
    from google.colab import drive
    drive.mount('/content/drive')

    # Change working directory to the day 4 workshop folder
    %cd /content/drive/MyDrive/LLM4BeSci_GSERM2024/day_4

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
import torch
import textwrap

## Phi-3 Takes the Berlin Numeracy Test
The [Berlin Numeracy Test](https://doi.org/10.1017/S1930297500001819) is a widely used test to measure an individual's ability to understand and apply statistical concepts. The test consists of four questions that require a basic understanding of probability and statistics. In this exercise, we will ask Phi-3 to solve these questions. Phi-3 will provide an answer to each question, and we will evaluate the quality of the response.

We begin by defining the four questions: 

In [None]:
q1 = """
Imagine we are throwing a five-sided die 50 times. On average, out of these 50 throws how many times would this five-sided die show an odd number (1, 3 or 5)?
"""

q2 = """
Out of 1,000 people in a small town 500 are members of a choir. Out of these 500 members in the choir 100 are men. Out of the 500 inhabitants that are not in the choir 300 are men. What is the probability that a randomly drawn man is a member of the choir? (please indicate the probability in percent).
"""

q3 = """
Imagine we are throwing a loaded die (6 sides). The probability that the die shows a 6 is twice as high as the probability of each of the other numbers. On average, out of these 70 throws, how many times would the die show the number 6?
"""

q4 = """
In a forest 20% of mushrooms are red, 50% brown and 30% white. A red mushroom is poisonous with a probability of 20%. A mushroom that is not red is poisonous with a probability of 5%. What is the probability that a poisonous mushroom in the forest is red?
"""
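
To judge Phi-3's responses later, it helps to have reference answers at hand. The sketch below works them out directly from the question texts using exact fractions. The variable names (`a1` to `a4`, `reference_answers`) are illustrative and not part of the workshop code.

```python
from fractions import Fraction

# q1: three of the five faces (1, 3, 5) are odd, and there are 50 throws
a1 = Fraction(3, 5) * 50

# q2: P(choir | man) = men in the choir / all men, expressed in percent
a2 = Fraction(100, 100 + 300) * 100

# q3: five faces have probability p and the 6 has probability 2p, so 7p = 1
a3 = Fraction(2, 7) * 70

# q4: Bayes' rule for P(red | poisonous), expressed in percent
p_red_and_poisonous = Fraction(20, 100) * Fraction(20, 100)
p_other_and_poisonous = Fraction(80, 100) * Fraction(5, 100)
a4 = p_red_and_poisonous / (p_red_and_poisonous + p_other_and_poisonous) * 100

reference_answers = {"q1": float(a1), "q2": float(a2), "q3": float(a3), "q4": float(a4)}
print(reference_answers)
```

The answers to `q2` and `q4` are in percent; `q1` and `q3` are expected counts.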

We next load Phi-3. Since we will be using it to generate text, we load it using the `transformers` `AutoModelForCausalLM` class. We also load the `AutoTokenizer` class to tokenize the input text.

In [None]:
torch.random.manual_seed(42) # Set seed for reproducibility

# Load Phi-3
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-3-mini-128k-instruct",
    device_map="cuda", # Use GPU
    torch_dtype=torch.float16,  # Use float16 for faster inference
    trust_remote_code=True, 
    attn_implementation='eager' # Standard attention; FlashAttention is not supported on T4 GPUs
)

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-128k-instruct")
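
Besides speed, loading in float16 halves the memory needed for the weights, which matters on a single T4. The following back-of-the-envelope sketch assumes roughly 3.8 billion parameters for Phi-3-mini and counts the weights only (activations and the key-value cache are ignored). The helper `weight_memory_gib` is hypothetical and written for this illustration.

```python
# Assumption: Phi-3-mini has roughly 3.8 billion parameters
N_PARAMS = 3.8e9

def weight_memory_gib(n_params: float, bytes_per_param: int) -> float:
    """Approximate memory needed to hold the weights alone, in GiB."""
    return n_params * bytes_per_param / 1024**3

fp32_gib = weight_memory_gib(N_PARAMS, 4)  # float32 uses 4 bytes per parameter
fp16_gib = weight_memory_gib(N_PARAMS, 2)  # float16 uses 2 bytes per parameter

print(f"float32 weights: ~{fp32_gib:.1f} GiB")
print(f"float16 weights: ~{fp16_gib:.1f} GiB")
```

Under these assumptions the float16 weights take about 7 GiB, whereas float32 would need about 14 GiB and leave little headroom on a 16 GB card.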

We next define a `'text-generation'` pipeline using the `transformers` library. This pipeline allows us to generate text using Phi-3. We begin by allowing it to only generate 100 new tokens (`"max_new_tokens": 100`) and setting the `do_sample` parameter to `False`. This parameter controls whether the model should use sampling to generate text. When set to `False`, the model will use "greedy decoding" to generate text, which means that it will always choose the most likely token at each step. We also set the `"return_full_text"` parameter to `False` to only return the generated text and not the input text.

In [None]:
# Define text-generation pipeline
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
)

# Define generation arguments
generation_args = {
    "max_new_tokens": 100,  # Maximum number of tokens to generate
    "return_full_text": False, # Return only the generated text
    "do_sample": False, # Use greedy decoding
    # "temperature": 1.0  # Uncomment this line to set the temperature (also requires "do_sample": True)
}

# Loop through questions and generate responses
for i, question in enumerate([q1, q2, q3, q4]):
    print('-------------------------')   
    
    # Define prompt
    question = question + " Make your answer as brief as possible using less than 50 words."
    prompt = [{"role": "user", "content": question}]  # Chat-format prompt: a list of message dictionaries
    
    # Generate response
    response = pipe(prompt, **generation_args)[0]['generated_text']
    
    # Format question and response for printing
    question = '\n'.join(textwrap.wrap(question, 100))
    response = '\n'.join(textwrap.wrap(response, 100))
    print(f"Question {i+1}: {question} \n\nAnswer: {response}\n")


**TASK 1**: Try playing around with the `"temperature"` parameter in the `generation_args` dictionary (you will need to uncomment the line it is on and set `"do_sample": True`). The temperature parameter controls the randomness of the generated text. As the temperature approaches 0, sampling approaches greedy decoding, which always picks the most likely token (when sampling, `transformers` expects a strictly positive temperature). A temperature of 1.0 (the default) samples from the model's unmodified distribution, and higher temperatures make the output increasingly random. Try `1.0`, `3.0`, and `float('inf')`.
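
To see why temperature changes the randomness, consider a toy example. The sketch below applies the standard softmax-with-temperature formula to made-up logits for three candidate tokens. It is a pure-Python illustration, not the `transformers` implementation, and `softmax_with_temperature` and `toy_logits` are hypothetical names.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Turn logits into probabilities after dividing them by the temperature."""
    scaled = [x / temperature for x in logits]
    largest = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - largest) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

toy_logits = [2.0, 1.0, 0.0]  # made-up scores for three candidate tokens

# Low temperatures sharpen the distribution, high temperatures flatten it
for t in [0.1, 1.0, 3.0, float("inf")]:
    probs = softmax_with_temperature(toy_logits, t)
    print(t, [round(p, 3) for p in probs])

# Greedy decoding skips sampling and takes the highest-scoring token
greedy_index = max(range(len(toy_logits)), key=lambda i: toy_logits[i])
print("greedy choice:", greedy_index)
```

At a low temperature nearly all probability mass sits on the top token, so sampling behaves almost like greedy decoding. At `float('inf')` every scaled logit becomes zero and all tokens are equally likely.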


## Chain-of-thought
We can also ask Phi-3 to explain its reasoning step by step before giving the final answer. This is known as chain-of-thought prompting. It has been shown to improve the quality of responses: because each generated token is conditioned on everything written so far, spelling out intermediate steps lets the model spend more computation on the problem before committing to a conclusion. We can do this by adding a chain-of-thought instruction to the prompt. Let's try this with Phi-3.

Before we do this, however, let us first see what happens when we increase the maximum number of generated tokens to 1000. This allows a fairer comparison with the chain-of-thought prompt, which needs a larger token budget anyway. Run the cell below:

In [None]:
generation_args = {
    "max_new_tokens": 1000,
    "return_full_text": False,
    "do_sample": False
}

for i, question in enumerate([q1, q2, q3, q4]):
    print('-------------------------')

    # Define prompt
    question = question + ""  # Replace this empty string with the chain-of-thought prompt for TASK 2
    prompt = [{"role": "user", "content": question}]  # Build the chat prompt after modifying the question

    # Generate response
    response = pipe(prompt, **generation_args)[0]['generated_text']

    # Format question and response for printing
    question = '\n'.join(textwrap.wrap(question, 100))
    response = '\n'.join(textwrap.wrap(response, 100))
    print(f"Question {i+1}: {question} \n\nAnswer: {response}\n")

**TASK 1**: Run the code above with the new `"max_new_tokens"` of 1000 (note: this may take a while).

**TASK 2**: Try asking the model to `" Go through your reasoning step by step before giving the final answer."` (by replacing the empty string above). This is a chain-of-thought prompt that encourages the model to explain its reasoning.
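
When comparing the responses with and without the chain-of-thought prompt, it can be convenient to check the final number automatically. The following is a minimal sketch, assuming the model states its answer as the last number in the response. The helpers `extract_last_number` and `is_correct` are hypothetical, and the example response is made up rather than actual Phi-3 output.

```python
import re

def extract_last_number(text):
    """Return the last number in `text` as a float, or None if there is none."""
    # Remove thousands separators such as the comma in "1,000" before matching
    matches = re.findall(r"-?\d+(?:\.\d+)?", text.replace(",", ""))
    return float(matches[-1]) if matches else None

def is_correct(response, expected, tol=1e-6):
    """Check whether the last number in the response equals the expected value."""
    value = extract_last_number(response)
    return value is not None and abs(value - expected) <= tol

# Made-up response for illustration (not actual model output)
example_response = "Three of the five faces are odd, so 3/5 * 50 = 30."
print(is_correct(example_response, 30))
```

This heuristic fails whenever the response ends with a different number than the final answer (for example, a closing remark that restates the number of throws), so it is worth reading the responses as well.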