# Available prompts
Every prompt builder takes at least two arguments: the point to be evaluated and the exemplars. Exemplars may be `None` (zero-shot).

- AM prompts take three arguments: point, exemplar set, and a boolean (`is_reply`) that selects whether the review (`False`) or the reply (`True`) is loaded.
- AMR prompts take the AM arguments with some changes:
  - The point must be a tuple of the point _and_ the AMR graph.
  - The fourth argument is a boolean (`include_context`). If set to `True`, the prompt includes the sentence and the graph; otherwise it includes only the graph.

| Prompt name | Problem | Domain | What it does |
|-------------|---------------|---------------|---------------|
| build_prompt_symbolic | AM | Symbolic | Symbolic with BIO tags as return signature |
| build_prompt_symbolic_w_numbers | AM | Symbolic | Symbolic with indices as return signature |
| build_prompt_symbolic_w_numbers_in_lines | AM | Symbolic | Symbolic with indices as return signature _and_ indices marked inline |
| build_prompt_with_amr | AM | Symbolic (AMR) | All AMR prompts go here, with context and without context |
| build_prompt_symbolic_w_numbers_in_lines_cot | AM | Symbolic | CoT symbolic with inline numbers and inline return signature |
| build_prompt_symbolic_w_numbers_cot | AM | Symbolic | CoT symbolic with inline numbers |
| build_prompt_symbolic_cot | AM | Symbolic | CoT with BIO tags as return signature |
| build_prompt_pair_symbolic_ape | APE | Symbolic | Symbolic with indices as return signature |
| build_prompt_pair_symbolic_w_numbers | APE | Symbolic | Symbolic with indices as return signature _and_ indices marked inline |
| build_prompt_pair_symbolic_full | APE | Symbolic | Symbolic with the binary matrix as return signature |
| build_prompt_pair | AM | Concrete | Concrete prompt, both reviews and passages in a zero shot manner (unused) |
| build_prompt_single | AM | Concrete | Concrete prompt, one per review/rebuttal passage |
| build_prompt_single_cot | AM | Concrete | CoT |
| build_prompt_pair_ape | APE | Concrete | APE concrete |
| build_prompt_pair_x_symbolic | AM and APE | Symbolic | Use AM first and then perform APE concrete. Not in the paper due to low performance |

## PLEASE NOTE:
- You need to bring your own `LLMClient` class that wraps an LLM API, such as the [OpenAI API](https://openai.com/product#made-for-developers).
  - It should let you update the `max_tokens` parameter on the fly and expose a `send_request` method that sends text and returns a JSON object like OpenAI's.
- For amrlib: `pip install amrlib` and install [the model](https://amrlib.readthedocs.io/en/latest/install/).
  - We used `model_parse_xfm_bart_large-v0_1_0` since its SMATCH is around the same as the SOTA (83 vs 85) and works with this library.
- You will need the data in JSON format from [here](https://github.com/LiyingCheng95/MLMC/tree/main/data/rr-submission-v2) (only `test.json` and `dev.json`).
- We have only provided a random sample of the model responses in all categories since the full annotations + dataset are 3 GB.
  - Full outputs and data will be released.
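
The `LLMClient` requirement above can be met in many ways. Below is a minimal sketch, not the class used for the experiments: only the constructor arguments (`request_data`, model name) and `send_request` come from how this notebook calls the client. The `transport` callable, the `update_max_tokens` helper, and the offline placeholder are assumptions added so the sketch runs without network access; swap `transport` for a function that posts the payload to your provider.

```python
import copy


class LLMClient:
    """Hypothetical minimal client; only the constructor arguments and
    `send_request` mirror how the notebook uses the class."""

    def __init__(self, request_data, model_name, transport=None):
        # Copy so that changing max_tokens on one client does not leak into another.
        self.request_data = copy.deepcopy(request_data)
        self.model_name = model_name
        # `transport` is an assumption: any callable that takes the request
        # payload (a dict) and returns the provider's JSON response as a dict.
        self.transport = transport or self._offline_transport

    def update_max_tokens(self, max_tokens):
        """Adjust the completion budget between calls."""
        self.request_data["max_tokens"] = max_tokens

    def send_request(self, prompt):
        payload = dict(self.request_data, model=self.model_name, prompt=prompt)
        try:
            return self.transport(payload)
        except Exception as exc:
            # Return an OpenAI-style error object instead of raising, so callers
            # can check for "choices" or "error" in the response.
            return {"error": {"code": type(exc).__name__, "message": str(exc)}}

    @staticmethod
    def _offline_transport(payload):
        # Placeholder: fabricates `n` empty completions without any network call.
        return {"choices": [{"text": ""} for _ in range(payload.get("n", 1))]}
```

With this shape, `client.update_max_tokens(1500)` before a CoT run is enough to change the budget without rebuilding the client.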

In [1]:
#from llmclient import LLMClient
import json
from tqdm import tqdm

import tiktoken # Needed for max-length tokenisation

from loading_code_rrv2 import *
from prompt_utils import *

from prompts_amr import *
from prompts_symbolic import *
from prompts_concrete import *

enc_gpt4 = tiktoken.encoding_for_model("gpt-4")
enc_gpt3 = tiktoken.encoding_for_model("text-davinci-003")

processed = load_data_instances(json.load(open("test.json", "r", encoding="utf-8")), -1, False)
dev = load_data_instances(json.load(open("dev.json", "r", encoding="utf-8")), -1, False)

In [None]:
request_data = {"max_tokens": 756,
                "temperature": 0.8,
                "top_p": 1,
                "n": 5,
                "frequency_penalty":0,
                "presence_penalty":0,
                "logprobs": None,
                "stop": None}

# Each task takes in slightly different max_tokens.
# We tune these prior to every call and during experimentation.
# For example, CoT needs around 1500 tokens.
# Concrete approaches ~1000.
# Symbolic between 256 and 758.
# Everything else stays the same across all experiments.
llm_client_gpt4 = LLMClient(request_data, "GPT-4")
llm_client_gpt3 = LLMClient(request_data, "GPT-3")

# Example call: GPT-4 CoT Symbolic (AM)
- Sample data-gathering call for AM (only the rebuttal).
- Zero-shot call (`examples = None`).
- The code retries failed instances until every one has succeeded.

In [None]:
responses = []
output_filename = "CoT/gpt4_reply_cot_symbolic_w_numbers_in_lines_zero_shot.json"
ffailed = [i for i in range(len(processed))]
rounds = 0

while ffailed != []:
    failed = [] 
    for i in tqdm(range(len(processed))):
        if i not in ffailed:
            continue
        instance = processed[i]
        examples = None # exemplars go here (e.g., dev[:4] to do 4-shot)
        prompt = build_prompt_symbolic_w_numbers_in_lines_cot(instance, examples, is_reply=True)
        processed_response = {"prompt": prompt,
                              "index": i,
                              "model_responses": {},
                              "actuals_text": [s.replace("<sep>", "").strip() for l,s in zip(instance.review_bio, instance.review) if l == 1 or l == 2],
                              "actuals_bio": [l.item() for l in instance.review_bio]
                              }
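        try:
            response = llm_client_gpt4.send_request(prompt)
        except Exception as exc:
            # `response` is not set when the call raises; record the error and retry next round.
            failed.append((i, str(exc)))
            continue
        if "choices" not in response:
            failed.append((i, response))
            continue
        # Use `k` here so the instance index `i` is not shadowed.
        for k, choice in enumerate(response["choices"]):
            processed_response["model_responses"]["try" + str(k)] = choice["text"]
        # Write only successful instances, so retries do not leave empty duplicate records.
        with open(output_filename, "a", encoding="utf-8") as f:
            f.write(json.dumps(processed_response) + "\n")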

    if failed != []:
        print([v[0] for v in failed])
        print(failed[-1])
        ffailed = [v[0] for v in failed]
        rounds += 1
    else:
        print(failed)
        print(len(failed))
        break
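
The loop above appends one JSON object per line to the output file, with the sampled completions stored under `model_responses` as `try0`, `try1`, and so on. A minimal sketch of reading that file back is shown here; `load_responses` is a hypothetical helper, not part of the released code. If an index appears more than once (for example after a re-run), the last record wins.

```python
import json


def load_responses(path):
    """Map each instance index to its list of sampled completions."""
    by_index = {}
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            record = json.loads(line)
            tries = record["model_responses"]
            # Keys look like "try0", "try1", ...; sort by the numeric suffix.
            ordered = sorted(tries, key=lambda name: int(name[len("try"):]))
            # A later record for the same index replaces an earlier one.
            by_index[record["index"]] = [tries[name] for name in ordered]
    return by_index
```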

# Large Parallel Requests
Sometimes the instance cannot allocate enough RAM for a request that asks for five completions (`n = 5`) while using the full 32k-token context. The workaround is to send five single-completion requests (`n = 1`) instead.
- It is very slow, but it works.
- The sample below runs symbolic CoT with inline numbers and up to 16 exemplars.

In [None]:
request_data = {"max_tokens": 756,
                "temperature": 0.8,
                "top_p": 1,
                "n": 1,
                "frequency_penalty":0,
                "presence_penalty":0,
                "logprobs": None,
                "stop": None}

# Same as before, each task takes in slightly different max_tokens.
# Everything else stays the same across all experiments.
# These parameters return a single response per call, so it'll take five times longer.
llm_client_gpt4 = LLMClient(request_data, "GPT-4")
llm_client_gpt3 = LLMClient(request_data, "GPT-3")

In [None]:
responses = []
ffailed = [i for i in range(len(processed))]
rounds = 0
is_reply = False

suff = "reply" if is_reply else "review"
output_filename = "CoT/gpt4_{}_symbolic_w_numbers_in_lines_16k.json".format(suff)

while ffailed != []:
    failed = []
    too_long_tokens = []
    for i in tqdm(range(len(processed))):
        if i not in ffailed:
            continue
        instance = processed[i]
        examples = dev[0:16]
        prompt = build_prompt_symbolic_w_numbers_in_lines_cot(instance, examples, is_reply=is_reply)

        # Drop exemplars one at a time until the prompt plus the completion budget fits in 32k tokens.
        # The `max_examples > 0` guard stops the loop once no exemplars are left.
        max_examples = 16
        while max_examples > 0 and len(enc_gpt4.encode(prompt)) + request_data["max_tokens"] > 32_000:
            max_examples -= 1
            examples = dev[0:max_examples]
            prompt = build_prompt_symbolic_w_numbers_in_lines_cot(instance, examples, is_reply=is_reply)
        processed_response = {"prompt": prompt,
                              "index": i,
                              "model_responses": {},
                              "actuals_text": [s.replace("<sep>", "").strip() for l,s in zip(instance.review_bio, instance.review) if l == 1 or l == 2],
                              "actuals_bio": [l.item() for l in instance.review_bio]
                              }
        
        not_finished = False
        for k in range(5):
            try:
                response = llm_client_gpt4.send_request(prompt)
            except Exception as exc:
                # `response` is not set when the call raises; treat it as a failed try.
                failed.append((i, str(exc)))
                not_finished = True
                break
            if "choices" in response:
                processed_response["model_responses"]["try" + str(k)] = response["choices"][0]["text"]
            else:
                failed.append((i, response))
                not_finished = True
                if "error" in response:
                    if response["error"]["code"] != "InternalServerError":
                        too_long_tokens.append((i, response))
                break

        if not not_finished:
            with open(output_filename, "a", encoding="utf-8") as f:
                f.write(json.dumps(processed_response) + "\n")

    if too_long_tokens != []:
        print("Too long:")
        print([v[0] for v in too_long_tokens])
        print("")
    if failed != []:
        print([v[0] for v in failed])
        print(failed[-1])
        ffailed = [v[0] for v in failed]
        rounds += 1
    else:
        print(failed)
        print(len(failed))
        break
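
The exemplar-trimming step in the loop above can be isolated into a small helper. This is a sketch under stated assumptions: `fit_exemplars`, `build_prompt`, and `count_tokens` are hypothetical names, and the toy usage counts whitespace-separated words so it runs without `tiktoken`. In the notebook, the counter would be `len(enc_gpt4.encode(prompt))` and the builder would be one of the `build_prompt_*` functions.

```python
def fit_exemplars(build_prompt, pool, max_examples, count_tokens, max_tokens, context_limit):
    """Return (prompt, n_used) with the most exemplars that fit the context window."""
    for n in range(max_examples, -1, -1):
        prompt = build_prompt(pool[:n])
        # The prompt and the completion budget share the same context window.
        if count_tokens(prompt) + max_tokens <= context_limit:
            return prompt, n
    # Even the zero-shot prompt is too long; let the caller decide what to do.
    raise ValueError("prompt exceeds the context limit even without exemplars")


def toy_builder(examples):
    # Stand-in for a real prompt builder: exemplars first, then the target.
    return " ".join(examples + ["target passage"])


def toy_counter(text):
    # Stand-in for a real tokenizer: one token per whitespace-separated word.
    return len(text.split())


prompt, used = fit_exemplars(toy_builder, ["ex one", "ex two", "ex three"], 3,
                             toy_counter, max_tokens=2, context_limit=8)
```

Raising when nothing fits makes over-long instances visible immediately, rather than sending a request that the API will reject.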

# AMR Experiments

Generate the AMR graphs first and save them to disk, then make the LLM calls. Parsing inside the request loop would be incredibly slow.

In [None]:
import amrlib

# Please download the model ahead of time. We use https://github.com/bjascob/amrlib/ for this.
parser_model = "model_parse_xfm_bart_large-v0_1_0"
stog = amrlib.load_stog_model(model_dir=parser_model)

# Dump to AMR
for i in tqdm(range(len(processed))):
    point = processed[i]
    pt = {"index": i,
          "review": point.review,
          "response": point.reply}
    amr_review_graphs = stog.parse_sents(point.review, add_metadata=True)
    amr_response_graphs = stog.parse_sents(point.reply, add_metadata=True)
    pt["review_amr_graphs"] = amr_review_graphs
    pt["amr_response_graphs"] = amr_response_graphs
    with open("AMR_graphs/rrv2_amr_line_by_line.json", "a", encoding="utf-8") as f:
        f.write(json.dumps(pt) + "\n")
        
# Build your exemplar set.
# Dump to AMR -- you really don't need 100 AMR graphs. Like 20 would do.
for i in tqdm(range(100)):
    point = dev[i]
    pt = {"index": i,
          "review": point.review,
          "response": point.reply}
    amr_review_graphs = stog.parse_sents(point.review, add_metadata=True)
    amr_response_graphs = stog.parse_sents(point.reply, add_metadata=True)
    pt["review_amr_graphs"] = amr_review_graphs
    pt["amr_response_graphs"] = amr_response_graphs
    with open("AMR_graphs/rrv2_amr_dev_line_by_line.json", "a", encoding="utf-8") as f:
        f.write(json.dumps(pt) + "\n")
        
txt = processed[0].review
graphs = stog.parse_sents(txt, add_metadata=True)
for graph in graphs:
    print(json.dumps(graph))
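
Once the graphs are dumped, they only need to be read back and paired with their instances before building AMR prompts. The sketch below shows one way to do that; `load_graph_records` and `pair_with_graphs` are hypothetical helpers, while the keys (`index`, `review_amr_graphs`, `amr_response_graphs`) are the ones written by the dump above. Because the dump file is opened in append mode, re-running the cell repeats indices, so the sketch keeps the last record per index.

```python
import json


def load_graph_records(path):
    """Read the line-by-line dump and key each record by its instance index."""
    records = {}
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            if line.strip():
                record = json.loads(line)
                # A re-run appends the same index again; keep the last copy.
                records[record["index"]] = record
    return records


def pair_with_graphs(instances, records, is_reply):
    """Build the (instance, graphs) pairs that the AMR prompts take as their point."""
    key = "amr_response_graphs" if is_reply else "review_amr_graphs"
    # Raises KeyError if an instance has no dumped graphs, instead of silently misaligning.
    return [(inst, records[i][key]) for i, inst in enumerate(instances)]
```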

In [None]:
processed = load_data_instances(json.load(open("test.json", "r", encoding="utf-8")), -1, False)
dev = load_data_instances(json.load(open("dev.json", "r", encoding="utf-8")), -1, False)
all_graphs_test = [json.loads(l) for l in open("AMR_graphs/rrv2_amr_line_by_line.json", "r", encoding="utf-8").readlines()]
all_graphs_dev = [json.loads(l) for l in open("AMR_graphs/rrv2_amr_dev_line_by_line.json", "r", encoding="utf-8").readlines()]

responses = []
failed = []
is_reply = False
include_context = False

suff = "reply" if is_reply else "review"
output_filename = "AM/gpt4_{}_amr_no_context_32k.json".format(suff)

graphs = [k["amr_response_graphs"] for k in all_graphs_dev] if is_reply else [k["review_amr_graphs"] for k in all_graphs_dev]
all_examples = [(d, g) for d, g in zip(dev, graphs)]


for i in tqdm(range(len(processed))):
    instance = [processed[i], all_graphs_test[i]["amr_response_graphs"]] if is_reply else [processed[i], all_graphs_test[i]["review_amr_graphs"]]
    examples = all_examples[:10]

    prompt = build_prompt_with_amr(instance, examples, is_reply=is_reply, include_context=include_context)
    processed_response = {"prompt": prompt,
                          "index": i,
                          "model_responses": {},
                          "amrs": instance[-1],
                          "full_context": include_context
                          }
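    try:
        response = llm_client_gpt4.send_request(prompt)
    except Exception as exc:
        # `response` is not set when the call raises; substitute an error object so it is logged below.
        response = {"error": {"code": type(exc).__name__, "message": str(exc)}}
    if "choices" in response:
        # Use `k` here so the instance index `i` is not shadowed.
        for k, choice in enumerate(response["choices"]):
            processed_response["model_responses"]["try" + str(k)] = choice["text"]
    else:
        failed.append((i, response))
    with open(output_filename, "a", encoding="utf-8") as f:
        f.write(json.dumps(processed_response) + "\n")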

if failed != []:
    print([v[0] for v in failed])
    print(failed[-1])
else:
    print(failed)


# Extra: Cross-AM/APE Approaches
- Not in the paper due to low performance. Example below uses zero-shot AM responses to build (a zero-shot) APE response.

In [None]:
processed = load_data_instances(json.load(open("test.json", "r", encoding="utf-8")), -1, False)
dev = load_data_instances(json.load(open("dev.json", "r", encoding="utf-8")), -1, False)
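# `resps` is not defined in this notebook. It must already hold the collected AM responses,
# indexed by shot setting ("0k" is used for the zero-shot run here) and then by "review" / "response".
mined = load_to_example(processed, collator(resps["0k"]["review"], returnit=True), collator(resps["0k"]["response"], returnit=True))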

responses = []
failed = []

for i in tqdm(range(len(processed))):
    instance = mined[i]
    examples = None
    prompt = build_prompt_pair_x_symbolic(instance, examples)
    processed_response = {"prompt": prompt,
                          "index": i,
                          "model_responses": {},
                          "instance": instance,
                          }
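    try:
        response = llm_client_gpt4.send_request(prompt)
    except Exception as exc:
        # `response` is not set when the call raises; substitute an error object so it is logged below.
        response = {"error": {"code": type(exc).__name__, "message": str(exc)}}
    if "choices" in response:
        # Use `k` here so the instance index `i` is not shadowed.
        for k, choice in enumerate(response["choices"]):
            processed_response["model_responses"]["try" + str(k)] = choice["text"]
    else:
        failed.append((i, response))
    with open("gpt4_ape_x_symbolic_zero_shot_zero_shot.json", "a", encoding="utf-8") as f:
        f.write(json.dumps(processed_response) + "\n")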
print(failed)