In [1]:
from lib.generate import *
generator = Llama70BGeneratorNotChat()

In [2]:
def mk_answer_extract_prompt(question: str, long_answer: str):
    return f"""### Instruction ###

Concise answers are paramount to your work in a team of highly efficient question answerers. Below are examples of how one can extract information from a bad, verbose answer into a concise answer that directly answers the question. If the answer is already concise, "Better Answer" is the same as the original "Possibly Bad Answer".

Question: Who taught the course 51-425 Design Center: Beginning Book Arts Lab in the semester Fall 2023?
Possibly Bad Answer: In the Fall 2023 semester, Joseph Dicey taught the course "51-425 Design Center: Beginning Book Arts Lab".
Better Answer: Joseph Dicey

-----

Question: Which model did Fatemehsadat Mireshghallah find in their paper to have an AUC of 0.81?
Possibly Bad Answer: Fatemehsadat Mireshghallah's paper finds that the OPT-125M model has an AUC of 0.81 in detecting ChatGPT generations, whereas a larger model from the GPT family, GPTJ-6B, has AUC of 0.45.
Better Answer: OPT-125M

-----

Question: Who is sponsoring the event Carnival Activities Tent at CMU's Spring Carnival 2024?
Possibly Bad Answer: Carnival Activities Tent is sponsored by the Carnival Committee and the CMU Alumni Association. No registration required is required.
Better Answer: The Spring Carnival Committee and the CMU Alumni Association

-----

Question: What are Graham Neubig's research interests?
Possibly Bad Answer: Associate Professor is an Associate Professor at Carnegie Mellon University. He has a wide range of research interests, including: Machine Translation, Natural Language Processing, Spoken Language Processing, Machine Learning.
Better Answer: Machine Translation, Natural Language Processing, Spoken Language Processing, Machine Learning

-----

Question: What are David Mortensen's research interests?
Possibly Bad Answer: Natural Language Processing and Computational Linguistics, Corpus Annotation and Resources
Better Answer: Natural Language Processing and Computational Linguistics, Corpus Annotation and Resources

-----

Question: Where was the paper "Towards Improving Harmonic Sensitivity and Prediction Stability for Singing Melody Extraction" published?
Possibly Bad Answer: The publication venue for the paper "Towards Improving Harmonic Sensitivity and Prediction Stability for Singing Melody Extraction" is International Society for Music Information Retrieval Conference.
Better Answer: The International Society for Music Information Retrieval Conference

-----

Question: Where was the paper "Methods for Measuring, Updating, and Visualizing Factual Beliefs in Language Models" published?
Possibly Bad Answer: Conference of the European Chapter of the Association for Computational Linguistics
Better Answer: Conference of the European Chapter of the Association for Computational Linguistics

-----

Question: On what date does the Spring Carnival 2024 Event "SCS Breakfast & Buggy" take place?
Possibly Bad Answer: The School of Computer Science (SCS) will host a breakfast buffet for all SCS alumni and livestream the buggy races at "SCS Breakfast & Buggy". The event is set to take place on April 13, 2024 at 9:00 AM-12:00 PM ET.
Better Answer: April 13, 2024

-----

Question: Which room does 02-261 Quantitative Cell and Molecular Biology Laboratory take place in the Fall 2023 semester?
Possibly Bad Answer: In the Fall 2023 semester it's in HOA 107
Better Answer: HOA 107

-----

Question: {question}
Possibly Bad Answer: {long_answer}
Better Answer: """

In [3]:
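def extract_concise_answer(s: str) -> str:
    """Cut a raw completion down to the first generated answer."""
    # The few-shot prompt separates examples with '-----', so anything after
    # the first separator is the model continuing with further examples.
    if '-----' in s:
        return s.split('-----')[0].strip()

    # Otherwise keep only the first line of the completion.
    if '\n' in s:
        return s.split('\n')[0].strip()

    return s.strip()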
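
To make the truncation rules concrete, here is a minimal sketch that applies the same logic to a few made-up completions. The sample strings are hypothetical (not real model output), and the function is restated so the block runs on its own; stripping whitespace on every path is an assumption of this sketch.

```python
def extract_concise_answer(s: str) -> str:
    # Restated for a standalone run; whitespace is stripped on every path
    # (an assumption of this sketch).
    if '-----' in s:
        return s.split('-----')[0].strip()
    if '\n' in s:
        return s.split('\n')[0].strip()
    return s.strip()

# Hypothetical raw completions, NOT real model output.
samples = {
    'separator': 'HOA 107\n\n-----\n\nQuestion: Who teaches 02-261?',
    'newline only': 'April 13, 2024\nQuestion: What happens next?',
    'clean': 'OPT-125M',
}

extracted = {name: extract_concise_answer(text) for name, text in samples.items()}
for name, answer in extracted.items():
    print(f'{name}: {answer!r}')
```

In each case only the text before the first separator or newline survives, which is what keeps a runaway continuation of the few-shot pattern out of the scored answer.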

In [4]:
out = generator(mk_answer_extract_prompt(
    'For which year or years is full funding (i.e. tuition and stipend) guaranteed for a Ph.D. student at LTI?',
    'Full funding is guaranteed for the first year for all LTI Ph.D. students, and is normally continued for at least 5 years, with possibility of further continuance, subject to continuing satisfactory progress and availability of funding.'
), max_tokens=64, temperature=0.6, top_k=1)
extract_concise_answer(out)

'5 years'

In [5]:
out = generator(mk_answer_extract_prompt(
    'What year was the paper called Quantitative Evaluation of Dust and Black Carbon Column Concentration in the MERRA-2 Reanalysis Dataset Using Satellite-Based Component Retrievals published?',
    'The paper was published in the year 2023.'
), max_tokens=64, temperature=0.6, top_k=1)
extract_concise_answer(out)

'2023'

In [6]:
out = generator(mk_answer_extract_prompt(
    'At what venue was the paper called GameQA: Gamified Mobile App Platform for Building Multiple-Domain Question-Answering Datasets published?',
    "The paper called 'GameQA: Gamified Mobile App Platform for Building Multiple-Domain Question-Answering Datasets' was published at the Conference of the European Chapter of the Association for Computational Linguistics."
), max_tokens=64, temperature=0.6, top_k=1)
extract_concise_answer(out)

'2023 Conference of the European Chapter of the Association for Computational Linguistics'

## On existing outputs

In [8]:
from pprint import pprint
import pickle
from lib.evaluation import *
import time
from tqdm import tqdm
import os
import pandas as pd  # `pd` is used below; imported explicitly instead of relying on a star import

In [19]:
evaluation_identifier = "txt-ragatouille-noop-llama70b-human-hypotheticalFalse-k10"

In [21]:
with open(f'../experiment/evaluation/(example) eval-{evaluation_identifier}/res_acc.pkl', 'rb') as file:
    unextracted_res_acc = pickle.load(file)

res_acc = []
for entry in tqdm(unextracted_res_acc):
    entry = dict(entry)  # shallow copy so unextracted_res_acc keeps its original A_hat and scores
    Q = entry['Q']
    A = entry['A']
    A_hat = entry['A_hat']
    
    time.sleep(1.2)
    out = generator(mk_answer_extract_prompt(
        Q,
        A_hat
    ), max_tokens=64, temperature=0.6, top_k=1)
    
    A_hat_extracted = extract_concise_answer(out)
    if len(A_hat_extracted) >= len(A_hat):  # revert if the extracted answer is not shorter
        A_hat_extracted = A_hat
    
    em = eval_exact_match(A_hat_extracted, [A])
    f1 = eval_f1(A_hat_extracted, [A])
    recall = eval_recall(A_hat_extracted, [A])
    precision = eval_precision(A_hat_extracted, [A])
    
    entry['em'] = em
    entry['f1'] = f1
    entry['recall'] = recall
    entry['precision'] = precision
    entry['A_hat'] = A_hat_extracted
        
    res_acc.append(entry)

100%|██████████| 60/60 [04:03<00:00,  4.05s/it]


In [22]:
# summary stats and file IO by copilot
output_dir = f'../experiment-alt/(example) extract-eval-{evaluation_identifier}'
os.makedirs(output_dir, exist_ok=True)

# 1 > save predictions
with open(f'{output_dir}/res_acc.pkl', 'wb') as f:
    pickle.dump(res_acc, f)

df = pd.DataFrame(res_acc)

df.to_csv(f'{output_dir}/predictions.csv')
df.to_json(f'{output_dir}/predictions.json', indent=2, orient='records')
df.to_html(f'{output_dir}/predictions.html')
df.style.background_gradient(subset=['em', 'f1', 'recall', 'precision'], cmap='RdYlGn', vmin=0, vmax=1).to_html(f'{output_dir}/predictions_s.html')
df[["em", "f1", "recall", "precision", 'Q', "A_hat", 'A']].style.background_gradient(subset=['em', 'f1', 'recall', 'precision'], cmap='RdYlGn', vmin=0, vmax=1).to_html(f'{output_dir}/predictions_s_cleaner.html')

# 2 > make mean calculations
summary_stats = df[['em', 'f1', 'recall', 'precision']].mean().to_frame().T.rename(columns={'em': 'Mean Exact Match', 'f1': 'Mean F1 Score', 'recall': 'Mean Recall', 'precision': 'Mean Precision'})
summary_stats.to_csv(f'{output_dir}/stats.csv', index=False)
summary_stats.to_json(f'{output_dir}/stats.json', indent=2, orient='records')

# 3 > bye
print("done evaluating!")
print("results dumped to", output_dir)
print(summary_stats)


done evaluating!
results dumped to ../experiment-alt/(example) extract-eval-txt-ragatouille-noop-llama70b-human-hypotheticalFalse-k10
   Mean Exact Match  Mean F1 Score  Mean Recall  Mean Precision
0          0.683333       0.809365     0.830976        0.810505


## before and after stats

In [27]:
df = pd.DataFrame(unextracted_res_acc)

# 2 > make mean calculations
summary_stats = df[['em', 'f1', 'recall', 'precision']].mean().to_frame().T.rename(columns={'em': 'Mean Exact Match', 'f1': 'Mean F1 Score', 'recall': 'Mean Recall', 'precision': 'Mean Precision'})

# 3 > bye
print("done evaluating!")
print(summary_stats)


done evaluating!
   Mean Exact Match  Mean F1 Score  Mean Recall  Mean Precision
0          0.083333        0.33248     0.876347         0.24146


In [28]:
df = pd.DataFrame(res_acc)

# 2 > make mean calculations
summary_stats = df[['em', 'f1', 'recall', 'precision']].mean().to_frame().T.rename(columns={'em': 'Mean Exact Match', 'f1': 'Mean F1 Score', 'recall': 'Mean Recall', 'precision': 'Mean Precision'})

# 3 > bye
print("done evaluating!")
print(summary_stats)


done evaluating!
   Mean Exact Match  Mean F1 Score  Mean Recall  Mean Precision
0          0.683333       0.809365     0.830976        0.810505
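
The before/after numbers fit how token-overlap metrics behave: a verbose answer usually contains the gold tokens (high recall) but buries them in extra words (low precision). The sketch below assumes SQuAD-style token-level precision, recall, and F1. `lib.evaluation` is not shown here, so `tokenize` and `token_prf` are hypothetical stand-ins and the real `eval_*` functions may normalize differently. The gold answer is also assumed for illustration.

```python
import string
from collections import Counter

def tokenize(text: str) -> list[str]:
    # Lowercase, drop ASCII punctuation, split on whitespace.
    table = str.maketrans('', '', string.punctuation)
    return text.lower().translate(table).split()

def token_prf(prediction: str, gold: str) -> tuple[float, float, float]:
    # Token-level precision, recall, and F1 from the multiset overlap.
    pred, ref = tokenize(prediction), tokenize(gold)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0, 0.0, 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

gold = 'Assistant Professor'  # assumed reference answer, for illustration only
verbose = "Yonatan Bisk's professional title is Assistant Professor."
concise = 'Assistant Professor'

print(token_prf(verbose, gold))
print(token_prf(concise, gold))
```

Under these assumptions the verbose answer scores precision 2/7 and recall 1 (F1 of 4/9), while the extracted answer scores 1 on all three, which matches the direction of the change in the summary tables.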


## before and after generations

In [30]:
for entry1, entry2 in zip(unextracted_res_acc, res_acc):
    print(entry1['A_hat'])
    print(entry2['A_hat'])
    print()


Yonatan Bisk's professional title is Assistant Professor.
Assistant Professor


Ralf Brown's professional title is Principal Systems Scientist.
Principal Systems Scientist


Richard Stern's office is located in B24 Baker-Porter Hall.
B24 Baker-Porter Hall


Malihe Alikhani works at the University of Pittsburgh.
The University of Pittsburgh


Taylor Berg-Kirkpatrick works at the University of California, San Diego.
The University of California, San Diego


The first author of the paper entitled "WebArena: A Realistic Web Environment for Building Autonomous Agents," published in 2023, is Shuyan Zhou.
Shuyan Zhou


The first author of the paper entitled "A Pretrainer's Guide to Training Data: Measuring the Effects of Data Age, Domain Coverage, Quality, & Toxicity" is S. Longpre.
S. Longpre


The first author of the paper entitled "Active Retrieval Augmented Generation" is Zhengbao Jiang.
Zhengbao Jiang


The CMU faculty member who is an author on the paper entitled "Clever Hans or Neural