<a href="https://colab.research.google.com/github/automix-llm/automix/blob/main/colabs/Step1_SolveQueries.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# AutoMix: Solving the task

- This is the first step of the process: we run inference with both the 13b and 70b models on every task. In practice you do not have to run both models on every query; we do so here only for ease of implementation.

- Step 2 is verification. Please see the notebook [here](llama13b_f1).


*Note: The outputs of this step are provided [here](https://drive.google.com/file/d/1dhyt7UuYumk9Gae9eJ_mpTVrLeSTuRht/view?usp=sharing).*

In [None]:
import pandas as pd
from concurrent.futures import ThreadPoolExecutor
from functools import partial
from tqdm import tqdm
from transformers import AutoTokenizer

# Tokenizer used below to measure prompt length and truncate long prompts
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-13b-hf")


### Read data

In [None]:
# get the input file from https://drive.google.com/file/d/1dhyt7UuYumk9Gae9eJ_mpTVrLeSTuRht/view?usp=sharing

In [None]:
inputs = pd.read_json("data/automix_input.jsonl", lines=True, orient="records")

In [None]:
demo = True #@param {type:"boolean"}
if demo:
    inputs = inputs.sample(10)
print(f"Number of inputs: {len(inputs)}")

Number of inputs: 10


In [None]:
inputs.head(1)

| Unnamed: 0 | id | pid | base_ctx | question | output | dataset | split |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 24193 | bb11261719535ef6b4f9e092be690c99861e7f0a4aeabc... | 58402da9f8baad0b4bb42810289b8a64e7f707e40b4149... | (CNN) -- Three Pakistani cricketers found guil... | how did they plead? | guilty | coqa | train |
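
Going by the columns shown above, each line of `automix_input.jsonl` is presumably one JSON object with those fields. The following is a minimal sketch of that layout using only the standard library; every value in it is a made-up placeholder, not real data, and the `required` set is an assumption based on the fields this notebook reads later.

```python
import io
import json

# Hypothetical two-record file; all values are placeholders, not real data.
jsonl_text = (
    '{"id": "q1", "pid": "p1", "base_ctx": "Some passage.", '
    '"question": "Who spoke?", "output": "Alice", "dataset": "coqa", "split": "train"}\n'
    '{"id": "q2", "pid": "p2", "base_ctx": "Another passage.", '
    '"question": "What happened?", "output": "Nothing", "dataset": "qasper", "split": "train"}\n'
)

# JSON Lines: one complete JSON object per non-empty line.
records = [json.loads(line) for line in io.StringIO(jsonl_text) if line.strip()]

# Fields that the rest of the notebook reads from each record (assumed from the code below).
required = {"base_ctx", "question", "output", "dataset"}
missing = [r["id"] for r in records if not required.issubset(r.keys())]

print(len(records), missing)
```

A check like this can catch a malformed download before any model calls are made.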


In [None]:
inputs['dataset'].value_counts()

cnli            3
quality         3
coqa            2
narrative_qa    1
qasper          1
Name: dataset, dtype: int64

### Run inference on each task
- The prompts are taken from [zero-scrolls](https://www.zero.scrolls-benchmark.com/)
- For dataset construction, please see the paper. TLDR: narrative qa, qasper, quality, and cnli are taken from [scrolls](https://www.scrolls-benchmark.com/), and coqa is from [huggingface](https://huggingface.co/datasets/coqa).




#### Task prompts

In [None]:
dataset_prompts_and_instructions = {

### NARRATIVE_QA

    "narrative_qa": {
        "instruction": "You are given a story, which can be either a novel or a movie script, and a question. Answer the question as concisely as you can, using a single phrase if possible.",
        "prompt": """Story:
{context}

{instruction}

Question: {question}

Answer: The answer is'""",
        "truncation_message": "... [The rest of the story is omitted]\n\n",
    },

### QASPER

    "qasper": {
        "instruction": "You are given a scientific article and a question. Answer the question as concisely as you can, using a single phrase or sentence if possible. If the question cannot be answered based on the information in the article, write 'unanswerable'. If the question is a yes/no question, answer 'yes', 'no', or 'unanswerable'.",
        "prompt": """Article:
{context}

{instruction}

Question: {question}

Answer: The answer is'""",
        "truncation_message": "... [The rest of the article is omitted]\n\n",
    },

### QUALITY


    "quality": {
        "instruction": "You are provided a story and a multiple-choice question with 4 possible answers (marked by A, B, C, D). Choose the best answer by writing its corresponding letter (either A, B, C, or D).",
        "prompt": """Story:
{context}

{instruction}

Question and Possible Answers: {question}

Answer: The answer is'""",

        "truncation_message": "... [The rest of the story is omitted]\n\n",
    },


### CNLI

    "cnli": {
        "instruction": "You are given a non-disclosure agreement and a sentence that proposes a hypothesis based on the agreement. Choose whether the hypothesis is entailed by the agreement, contradicted by the agreement, or not mentioned by (neutral to) the agreement. If the hypothesis is entailed by the agreement, write 'Entailment'. If the hypothesis is contradicted by the agreement, write 'Contradiction'. If the hypothesis is not mentioned by the agreement, write 'Not mentioned'.",
        "prompt": """Contract:
{context}

{instruction}

Hypothesis: {question}

Answer: The answer is'""",
        "truncation_message": "... [The rest of the contract is omitted]\n\n",
    },

### COQA

    "coqa": {
        "instruction": "You are given a story, which can be either a novel or a movie script, and a question. Answer the question as concisely as you can, using a single phrase if possible.",
        "prompt": """Story:
{context}

{instruction}

Question: {question}

Answer: The answer is'""",
        "truncation_message": "... [The rest of the story is omitted]\n\n",
    },


}
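
To see what a filled-in prompt looks like, here is a small sketch that renders a copy of the `narrative_qa` template with `str.format`. The context, instruction, and question strings are hypothetical placeholders chosen for illustration.

```python
# Copy of the narrative_qa template from the dictionary above, for illustration only.
template = """Story:
{context}

{instruction}

Question: {question}

Answer: The answer is'"""

rendered = template.format(
    context="A placeholder story.",    # hypothetical context
    instruction="Answer concisely.",   # hypothetical, shortened instruction
    question="Who is the narrator?",   # hypothetical question
)
print(rendered)
```

Note that the prompt ends with an opening single quote. The model tends to close that quote after its answer, which is likely why completions such as `slop bucket'.` carry a trailing quote that is stripped later in the notebook.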




#### LLM tooling setup


In [None]:
import openai

openai.api_key = "EMPTY"
openai.api_base = "http://pitt.lti.cs.cmu.edu:8003/v1"

def call_openai_api(
    prompt: str,
    engine_name: str,
    temperature: float = 0.0,
    n: int = 1,
    stop: str = '\n',
    max_tokens: int = 100,
    batch_size: int = 32
):
    """
    Call the OpenAI API to create completions based on the input prompt.

    Parameters:
    - prompt (str): The prompt.
    - engine_name (str): The model name to request from the server.
    - temperature (float, optional): Sampling temperature for randomness. Defaults to 0.0.
    - n (int, optional): Number of completions to generate. Defaults to 1.
    - stop (str, optional): Token at which the API should stop generating further tokens. Defaults to '\n'.
    - max_tokens (int, optional): Maximum number of tokens in the generated output. Defaults to 100.
    - batch_size (int, optional): Maximum num_completions for each API call. Defaults to 32.

    Returns:
    - list/str/None: Generated text completions from the API. Returns a list of strings if n > 1, a single string if n == 1, and None if the API call raises an exception.
    """
    all_responses = []
    orig_n = n

    try:
        while n > 0:
            current_batch_size = min(n, batch_size)

            response = openai.Completion.create(
                        model=engine_name,
                        prompt=prompt,
                        temperature=temperature,
                        max_tokens=max_tokens,
                        n=current_batch_size,
                        stop=stop,
                    )

            all_responses.extend([choice['text'] for choice in response['choices']])

            n -= current_batch_size

        return all_responses if orig_n > 1 else all_responses[0]

    except Exception as e:
        print(f"An error occurred: {str(e)}")
        return None
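
The batching loop in `call_openai_api` can be studied in isolation. The sketch below reproduces only that loop, with a hypothetical `fetch` callable standing in for the API request; `complete_in_batches` and `fake_fetch` are illustrative names and not part of the notebook.

```python
from typing import Callable, List

def complete_in_batches(fetch: Callable[[int], List[str]], n: int, batch_size: int = 32) -> List[str]:
    """Collect n completions by calling fetch(k) repeatedly with k <= batch_size."""
    responses: List[str] = []
    remaining = n
    while remaining > 0:
        current = min(remaining, batch_size)
        responses.extend(fetch(current))
        remaining -= current
    return responses

# Stand-in for the API call: records how many completions each request asked for.
requested_sizes: List[int] = []

def fake_fetch(k: int) -> List[str]:
    requested_sizes.append(k)
    return [f"completion-{i}" for i in range(k)]

out = complete_in_batches(fake_fetch, n=70, batch_size=32)
print(requested_sizes, len(out))  # [32, 32, 6] 70
```

With the notebook's defaults (`n=1`), the loop runs exactly once, so batching only matters when several samples per prompt are requested.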



#### Run inference

In [None]:

def run_solver_job(df, prepare_row_func, engine_name: str, max_workers: int = 32,
                   temperature: float = 0.0, n: int = 1, stop: str = '\n',
                   max_tokens: int = 100):
    """
    Runs a solver job using a specified engine, applying concurrent futures and tqdm for progress tracking.

    Parameters:
    - df: Input dataframe
    - prepare_row_func: Function to prepare rows of df for the solver
    - engine_name (str): Name of the engine to use
    - max_workers (int, optional): Maximum number of workers for ThreadPoolExecutor. Defaults to 32.
    - temperature (float, optional): Temperature parameter for call_openai_api. Defaults to 0.0.
    - n (int, optional): n parameter for call_openai_api. Defaults to 1.
    - stop (str, optional): Stop parameter for call_openai_api. Defaults to '\n'.
    - max_tokens (int, optional): Maximum number of tokens for call_openai_api. Defaults to 100.

    Returns:
    - list: Results from the solver job
    """
    # Creating a partial function with specified parameters
    solver_call = partial(call_openai_api,
                          engine_name=engine_name,
                          temperature=temperature,
                          n=n,
                          stop=stop,
                          max_tokens=max_tokens)

    # Running the solver job concurrently and tracking progress with tqdm
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        results = list(tqdm(executor.map(solver_call, df.apply(prepare_row_func, axis=1)),
                            total=df.shape[0]))

    return results

def prepare_row(row):
    dataset = row["dataset"]
    prompt = dataset_prompts_and_instructions[dataset]["prompt"]
    instruction = dataset_prompts_and_instructions[dataset]["instruction"]
    question = row['question']
    context = row['base_ctx']

    full_text = prompt.format(context=context, instruction=instruction, question=question)

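    # Tokenize without special tokens so that decoding does not inject a literal "<s>".
    tokens = tokenizer.encode(full_text, add_special_tokens=False)

    # Llama 2 has a 4096-token context window. Cap the prompt at 3096 tokens,
    # keeping the tail so that the instruction and question survive truncation.
    if len(tokens) <= 3096:
        return full_text

    return tokenizer.decode(tokens[-3096:])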


# Engine names for "13b" and "70b" - Replace these with the actual engine names.
engine_13b = "meta-llama/Llama-2-13b-hf"
engine_70b = "meta-llama/Llama-2-70b-hf"


# Running the job for both engines and storing results.
# Note: the solver runs at temperature 0.0 (the default in run_solver_job).

results_13b = run_solver_job(inputs, prepare_row, engine_13b)
results_70b = run_solver_job(inputs, prepare_row, engine_70b)


100%|██████████| 10/10 [00:09<00:00,  1.08it/s]
100%|██████████| 10/10 [00:09<00:00,  1.03it/s]
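
`prepare_row` truncates from the left: when a prompt is too long, it keeps the last tokens and drops the beginning of the context. The sketch below illustrates the idea with a whitespace split standing in for the Llama tokenizer; `keep_last_tokens` is a hypothetical helper, and the tiny budget of 8 is chosen only to make the effect visible.

```python
def keep_last_tokens(tokens, max_tokens):
    """Return the final max_tokens items of tokens (all of them if short enough)."""
    return tokens[-max_tokens:] if len(tokens) > max_tokens else tokens

# A toy prompt: 20 filler words of "story", then the question and answer cue.
prompt = "Story: " + " ".join(f"w{i}" for i in range(20)) + " Question: who? Answer:"
tokens = prompt.split()  # whitespace split stands in for the real tokenizer

kept = keep_last_tokens(tokens, max_tokens=8)
print(" ".join(kept))  # w15 w16 w17 w18 w19 Question: who? Answer:
```

Because every template places the instruction and question after the context, keeping the tail preserves them; the cost is that the opening of a long story or article is lost.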


In [None]:
results_70b[0]

"slop bucket'."

In [None]:
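def clean_answer(ans: str) -> str:
    # Strip the quote the model emits to close the `'` that ends the prompt.
    # Empty or failed completions (None) become NA so they can be dropped later.
    return ans.replace("'", "") if ans else pd.NA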

In [None]:
len(results_70b)

10
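
The predictions are attached to the DataFrame by position in the next cell, which relies on `executor.map` returning results in the order of its inputs even when calls finish out of order. A minimal sketch of that property, using a hypothetical `slow_echo` function in place of the API call:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def slow_echo(x: int) -> int:
    # Earlier inputs sleep longer, so they finish last.
    time.sleep(0.05 * (3 - x))
    return x * 10

with ThreadPoolExecutor(max_workers=4) as executor:
    results = list(executor.map(slow_echo, [0, 1, 2, 3]))

print(results)  # [0, 10, 20, 30]
```

Had the code used `as_completed` instead, the results would arrive in completion order and would need to be re-aligned with their rows.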

In [None]:
inputs['llama13b_pred_ans'] = [clean_answer(ans) for ans in results_13b]
inputs['llama70b_pred_ans'] = [clean_answer(ans) for ans in results_70b]

In [None]:
# Keep only the rows for which both models returned a prediction.
inputs_with_predictions = inputs.dropna(subset=['llama13b_pred_ans', 'llama70b_pred_ans']).copy()

In [None]:
print(f"{len(inputs_with_predictions)}/{len(inputs)} inputs have predictions")


10/10 inputs have predictions


## Add scores

In [None]:
import re
import string
from collections import Counter
import pandas as pd

def f1_score(prediction, ground_truth):
    prediction_tokens = normalize_answer(prediction).split()
    ground_truth_tokens = normalize_answer(ground_truth).split()
    common = Counter(prediction_tokens) & Counter(ground_truth_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0
    precision = 1.0 * num_same / len(prediction_tokens)
    recall = 1.0 * num_same / len(ground_truth_tokens)
    f1 = (2 * precision * recall) / (precision + recall)
    return f1

def normalize_answer(s):
    """Lower text and remove punctuation, articles, and extra whitespace."""
    def remove_articles(text):
        return re.sub(r"\b(a|an|the)\b", " ", text)

    def white_space_fix(text):
        return " ".join(text.split())

    def remove_punc(text):
        exclude = set(string.punctuation)
        return "".join(ch for ch in text if ch not in exclude)

    def lower(text):
        return text.lower()

    return white_space_fix(remove_articles(remove_punc(lower(s))))

def calculate_f1_for_models(df, model_sizes, ground_truth_col='output'):
    """
    Calculates F1 score for different model sizes and adds the results as new columns in the DataFrame.

    Parameters:
    - df (pd.DataFrame): The DataFrame containing prediction data.
    - model_sizes (list of str): List containing strings that denote model sizes.
      Used to create column names dynamically.
    - ground_truth_col (str, optional): The name of the column containing ground truth data.
      Defaults to 'output'.

    Returns:
    - pd.DataFrame: The original DataFrame with added columns for the F1 scores.
    """
    for size in model_sizes:
        pred_col = f'llama{size}_pred_ans'
        f1_col = f'llama{size}_f1'
        df[f1_col] = df.apply(
            lambda r: f1_score(prediction=r[pred_col], ground_truth=r[ground_truth_col]),
            axis=1
        )
    return df
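
As a worked example of the token-level F1 above, the following sketch re-implements the same normalization and overlap computation without pandas. The function names `normalize` and `token_f1` are illustrative, and the example strings are hypothetical.

```python
import re
import string
from collections import Counter

PUNCTUATION = set(string.punctuation)

def normalize(text: str) -> list:
    """Lowercase, drop punctuation and articles, and split on whitespace."""
    text = "".join(ch for ch in text.lower() if ch not in PUNCTUATION)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()

def token_f1(prediction: str, ground_truth: str) -> float:
    pred, gold = normalize(prediction), normalize(ground_truth)
    overlap = sum((Counter(pred) & Counter(gold)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)

# 3 predicted tokens, 1 gold token, 1 shared: precision 1/3, recall 1, F1 = 0.5
print(token_f1("They pleaded guilty.", "guilty"))
# Articles and case are ignored, so this is an exact match after normalization.
print(token_f1("The Guilty", "guilty"))
```

One consequence of token overlap is that `token_f1("not guilty", "guilty")` is about 0.67 even though the answers disagree, so the metric is best read as a rough signal rather than strict correctness.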



#### For quality, LLAMA2-13b sometimes generates only the option (e.g., a). Simple matching against the output won't work in that case, so we map the ground-truth answer to its option letter and compare it with the generated option.


In [None]:
import pandas as pd
import re
from typing import List

def extract_option(row: pd.Series) -> str:
    """
    Extracts the correct option from the provided row.

    Parameters:
        row (pd.Series): A row of a DataFrame, expected to contain 'question' and 'output' columns.

    Returns:
        str: The letter of the correct option, or None if not found.
    """
    options = re.findall(r'\((\w)\) ([\w\s]+)', row['question'])
    for option, value in options:
        if value.strip() == row['output'].strip():
            return option
    return None

def extract_option_from_prediction(pred: str) -> str:
    """
    Extracts the selected option letter from a prediction string.

    Parameters:
        pred (str): The prediction string, expected to start with an option letter.

    Returns:
        str: The extracted option letter, or None if not found or if `pred` is empty.
    """
    if len(pred.strip()) == 0:
        return None

    option = pred.split()[0]
    for char in option:
        if char in ['A', 'B', 'C', 'D']:
            return char
    return None

def calculate_f1_for_multi_choice(df: pd.DataFrame, model_sizes: List[str], datasets: List[str]=["quality"]) -> pd.DataFrame:
    """
    Computes F1 scores for predictions in multiple-choice format.

    It extracts correct and predicted options and computes F1 scores, with special handling
    for certain datasets. This function mutates the input DataFrame by adding new columns
    for extracted options and possibly modifying F1 scores.

    Parameters:
        df (pd.DataFrame): The DataFrame containing prediction and ground truth data.
            Expected to contain columns in the format 'llama{size}_pred_ans'.
        model_sizes (List[str]): List of strings indicating the model sizes for which
            predictions are available in `df` (e.g., ['13b', '70b']).
        datasets (List[str], optional): List of dataset names that require special handling.
            Defaults to ["quality"].

    Returns:
        pd.DataFrame: The original DataFrame with additional/modified columns for extracted
            options and potentially modified F1 scores.
    """
    df['correct_option'] = df.apply(extract_option, axis=1)

    for size in model_sizes:
        pred_ans_col = f'llama{size}_pred_ans'
        pred_option_col = f'llama{size}_pred_option'
        f1_col = f'llama{size}_f1'

        # Remove single quotes from predictions for specified datasets
        df[pred_ans_col] = df.apply(lambda r: r[pred_ans_col] if r["dataset"] not in datasets else r[pred_ans_col].replace("'", ""), axis=1)

        # Extract the option from the prediction
        df[pred_option_col] = df[pred_ans_col].apply(extract_option_from_prediction)

        # Compute the F1 score: for datasets in `datasets`, F1 is 1.0 if an option was extracted
        # from the prediction and it matches the correct option, else 0.0
        df[f1_col] = df.apply(lambda r: float(r[pred_option_col] is not None and r[pred_option_col] == r['correct_option']) if r["dataset"] in datasets else r[f1_col], axis=1)

    return df
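
The two extraction helpers can be tried on a toy multiple-choice question. This sketch uses the same regular expression and first-word rule as the functions above but operates on plain strings; the question text and the names `correct_letter` and `predicted_letter` are hypothetical.

```python
import re
from typing import Optional

def correct_letter(question: str, answer: str) -> Optional[str]:
    """Find the option letter whose text equals the ground-truth answer."""
    for letter, text in re.findall(r"\((\w)\) ([\w\s]+)", question):
        if text.strip() == answer.strip():
            return letter
    return None

def predicted_letter(prediction: str) -> Optional[str]:
    """Return the first A-D character in the first word of the prediction."""
    words = prediction.split()
    if not words:
        return None
    return next((ch for ch in words[0] if ch in "ABCD"), None)

question = "Why did he leave? (A) He was tired (B) He was hungry (C) He was late (D) He was bored"

print(correct_letter(question, "He was hungry"))   # B
print(predicted_letter("B. He was hungry"))        # B
print(predicted_letter(""))                        # None
```

The option pattern `[\w\s]+` only captures letters, digits, underscores, and whitespace, so an option containing punctuation (for example an apostrophe) is cut short and no letter is found. Rows where this happens have no correct option and therefore score 0 under exact option matching.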


In [None]:
model_sizes = ['13b', '70b']

# Calculating F1 scores for each model size
inputs_with_predictions = calculate_f1_for_models(inputs_with_predictions, model_sizes)

# Further processing and calculating F1 scores for multi-choice questions
inputs_with_predictions = calculate_f1_for_multi_choice(inputs_with_predictions, model_sizes)

In [None]:
inputs_with_predictions[['llama13b_f1', 'llama70b_f1']].mean()

llama13b_f1    0.2
llama70b_f1    0.2
dtype: float64