# Investigating Flaky NORA Unit Tests

Goals to keep in mind while going through these unit tests:
* Establish reliable testing procedures
* Remove unrepresentative or invalid tests
* Eventually: decide which tests must pass every time in order to merge

In [141]:
import json

import pandas as pd

# Google Sheet that holds one row per test run
sheet_id = "1AYI2dFio0Cgf9Z1pRER4I56xj00qXkTJisS70Z5zTFo"
sheet_name = "all_results"

# Read the sheet as CSV through the gviz export endpoint
data_url = f"https://docs.google.com/spreadsheets/d/{sheet_id}/gviz/tq?tqx=out:csv&sheet={sheet_name}"
df = pd.read_csv(data_url, index_col=False)

# Drop columns that are not used in this analysis
df = df.drop(columns=['featureconfigkey','eval_id','commit','model','test_name',
                      'link_to_case','unique_case_id','unique_prompt', 'num_times_ran',
                      'score', 'metrics', 'num_times_in_unit_test_suite','Unnamed: 21'])

## Some basic stats of our unit tests

In [97]:
df.nunique()

case_id                     110
test_type_unofficial          7
initial_prompt               98
handler_cost                768
metrics_json                101
output_score                 21
tool_score                    3
final_response              757
error                        10
events                      834
case_params                 102
short_error                   4
num_times_case_id_failed      5
dtype: int64
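
The counts above show 110 `case_id` values but only 98 distinct `initial_prompt` values, so some prompts must be shared across cases. A minimal sketch for listing them, using a small stand-in frame (the rows are made up; only the column names come from this notebook):

```python
import pandas as pd

# Hypothetical stand-in for df; only the column names match the notebook.
toy = pd.DataFrame({
    "case_id": ["a", "a", "b", "b", "c", "c"],
    "initial_prompt": ["find papers on X", "find papers on X",
                       "find papers on X", "find papers on X",
                       "who is Y", "who is Y"],
})

# Number of distinct case_ids behind each prompt.
cases_per_prompt = toy.groupby("initial_prompt")["case_id"].nunique()

# Prompts that are shared by more than one case_id.
shared_prompts = cases_per_prompt[cases_per_prompt > 1]
print(shared_prompts)
```

A shared prompt is not necessarily a redundant test: `case_params` has 102 distinct values, so some cases may reuse a prompt with different parameters.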

In [95]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 990 entries, 0 to 989
Data columns (total 13 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   case_id                   990 non-null    object 
 1   test_type_unofficial      90 non-null     object 
 2   initial_prompt            990 non-null    object 
 3   handler_cost              918 non-null    float64
 4   metrics_json              915 non-null    object 
 5   output_score              492 non-null    float64
 6   tool_score                713 non-null    float64
 7   final_response            923 non-null    object 
 8   error                     75 non-null     object 
 9   events                    990 non-null    object 
 10  case_params               990 non-null    object 
 11  short_error               75 non-null     object 
 12  num_times_case_id_failed  990 non-null    int64  
dtypes: float64(3), int64(1), object(9)
memory usage: 100.7+ KB


### Summary of Stats

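General:
* 990 total runs examined (one row per run)
    * 9 runs per case --> 110 total cases
* 75 runs with errors

Notes:
* 110 case IDs but only 98 unique prompts, so some prompts are shared by more than one case (possible duplicate tests)
* 492 output scores means 990 - 492 = 498 runs are missing an output score
* 713 tool scores means 990 - 713 = 277 runs are missing a tool score
    * 713 - 492 = 221 is only a lower bound on the runs that received a tool score but no output score, because a run can also be missing both. With 153 runs missing both scores (counted under "Missing both scores"), 498 - 153 = 345 runs have a tool score but no output score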
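
Column totals alone cannot say how the missing scores overlap; that needs a row-level check. A minimal sketch of such a check on made-up rows (only the column names `output_score` and `tool_score` come from this notebook):

```python
import pandas as pd

# Hypothetical stand-in for df; values are made up.
toy = pd.DataFrame({
    "output_score": [1.0, None, None, 0.0, None],
    "tool_score":   [1.0, 1.0,  None, None, 0.5],
})

has_output = toy["output_score"].notna()
has_tool = toy["tool_score"].notna()

# 2x2 table: rows = has an output score, columns = has a tool score.
presence = pd.crosstab(has_output, has_tool)
print(presence)

# Row-level counts for the two groups examined later in the notebook.
tool_only = int((has_tool & ~has_output).sum())
neither = int((~has_tool & ~has_output).sum())
print("tool score only:", tool_only, "| neither score:", neither)
```

The four cells of the table always sum to the number of rows, which is a quick sanity check on the individual counts.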

## Diving into cases

### Missing scores

#### Missing both scores

In [114]:
missing_all_scores = df[df['output_score'].isna() & df['tool_score'].isna()]
print("# cases missing both scores:", len(missing_all_scores))
missing_all_scores.nunique()

# cases missing both scores: 153


case_id                      40
test_type_unofficial          1
initial_prompt               29
handler_cost                 84
metrics_json                 16
output_score                  0
tool_score                    0
final_response               70
error                        10
events                      113
case_params                  38
short_error                   4
num_times_case_id_failed      5
dtype: int64

In [127]:
pd.set_option('display.max_colwidth', None)
missing_all_scores[['case_id','initial_prompt','final_response', 'short_error']]

Unnamed: 0,case_id,initial_prompt,final_response,short_error
14,8a3ef6f4-8a65-443b-a5d1-646fdcd500cd,Call the execution agent with 'what is 2+2'\ncall the execution agent with 'what is sin(5)'\ncall the paper finder with 'fast polynomial root finding'\ncall the claim verifier with 'word embeddings preserve biases'\ncall me when you're done,,requests.exceptions.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
15,8a3ef6f4-8a65-443b-a5d1-646fdcd500cd,Call the execution agent with 'what is 2+2'\ncall the execution agent with 'what is sin(5)'\ncall the paper finder with 'fast polynomial root finding'\ncall the claim verifier with 'word embeddings preserve biases'\ncall me when you're done,,requests.exceptions.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
16,8a3ef6f4-8a65-443b-a5d1-646fdcd500cd,Call the execution agent with 'what is 2+2'\ncall the execution agent with 'what is sin(5)'\ncall the paper finder with 'fast polynomial root finding'\ncall the claim verifier with 'word embeddings preserve biases'\ncall me when you're done,,requests.exceptions.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
17,8a3ef6f4-8a65-443b-a5d1-646fdcd500cd,Call the execution agent with 'what is 2+2'\ncall the execution agent with 'what is sin(5)'\ncall the paper finder with 'fast polynomial root finding'\ncall the claim verifier with 'word embeddings preserve biases'\ncall me when you're done,,requests.exceptions.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
45,59d5909b-061b-4d11-9a3a-a3e6121519ef,can you review https://aries-reviewer.apps.allenai.org/pdf/357bb84deaecf6a0000551b19f4c3d59d9584d44 in three sentences or less,"It seems there was an error in processing the review request for the paper titled ""<Paper corpusId=""259224834"">ARIES: A Corpus of Scientific Paper Edits Made in Response to Peer Reviews</Paper>"". The system expected a follow-up question but none was given. If you have any specific questions or need further details about the paper, please let me know so I can assist you accordingly.",
...,...,...,...,...
985,f553d43a-a524-4028-a43b-38ad89e494af,write a review of https://aries-reviewer.apps.allenai.org/pdf/f88ce4c4b0a8b5e180189d7de4e6d9968dc5066b in three sentences or less,"The review for the first paper you requested has been completed. Here are the key points from the review:\n\n1. **Strengths Highlighted**: The paper's innovative approach to multi-task pre-training and leave-one-task-out finetuning addresses significant challenges in NLP. Its comprehensive evaluation across 14 text classification tasks demonstrates strong generalization capabilities, and the detailed comparative analysis provides valuable insights.\n\n2. **Suggestions for Improvement**: The review suggests expanding discussions on the broader impact, including ethical considerations and data privacy. It also recommends a deeper exploration of quantization effects and expanding the comparative analysis to include more recent methods.\n\n3. **Clarification on Methodology**: A more detailed description of the multi-head attention (MHA) pooling and its optimization for each task would be beneficial for replication and further research.\n\nIf you have any follow-up questions or need further clarification on specific points from the review, feel free to ask!",
986,f553d43a-a524-4028-a43b-38ad89e494af,write a review of https://aries-reviewer.apps.allenai.org/pdf/f88ce4c4b0a8b5e180189d7de4e6d9968dc5066b in three sentences or less,"The review for the first paper you requested has been completed. Here are the key points from the review:\n\n1. **Strengths Highlighted:**\n - The paper's approach to multi-task pre-training and leave-one-task-out finetuning is innovative and addresses a significant challenge in NLP.\n - A comprehensive evaluation across 14 text classification tasks provides strong evidence for the generalization capability of the proposed methods.\n - The comparative analysis against baselines and the detailed discussion on computational efficiency and storage considerations are well-executed and contribute valuable insights to the field.\n\n2. **Suggestions for Improvement:**\n - It would be beneficial to include a more detailed discussion on the broader impact of these methods, including potential limitations and ethical considerations, especially in terms of data privacy and bias in multi-task learning scenarios.\n - A deeper exploration into the trade-offs between different quantization levels, especially in terms of model interpretability and the potential impact on fine-grained task performance, would add depth to the analysis.\n - Expanding the comparative analysis to include more recent and advanced methods in multi-task learning and feature extraction could further validate the proposed approach's effectiveness and position it within the current state-of-the-art.\n\n3. **Clarification on Methodology:**\n - A more detailed description of how the task-specific Multi-Head Attention (MHA) pooling is implemented and optimized for each task would provide clarity and could be beneficial for readers interested in replicating or building upon this work.\n\nIf you have any further questions or need more detailed information about the review, feel free to ask!",
987,f553d43a-a524-4028-a43b-38ad89e494af,write a review of https://aries-reviewer.apps.allenai.org/pdf/f88ce4c4b0a8b5e180189d7de4e6d9968dc5066b in three sentences or less,"The review for the first paper is now available. Here are the key points from the review:\n\n1. **Strengths Highlighted**: The paper's innovative approach to multi-task pre-training and leave-one-task-out finetuning addresses a significant challenge in NLP. The comprehensive evaluation across 14 text classification tasks provides strong evidence for the generalization capability of the proposed methods. The comparative analysis against baselines and the detailed discussion on computational efficiency and storage considerations are well-executed and contribute valuable insights to the field.\n\n2. **Suggestions for Improvement**: The review suggests including a more detailed discussion on the broader impact of these methods, including potential limitations and ethical considerations, especially in terms of data privacy and bias in multi-task learning scenarios. It also recommends a deeper exploration into the trade-offs between different quantization levels and expanding the comparative analysis to include more recent and advanced methods in multi-task learning and feature extraction.\n\n3. **Clarification on Methodology**: A more detailed description of the multi-head attention (MHA) pooling and its optimization for each task would provide clarity and could be beneficial for readers interested in replicating or building upon this work.\n\nIf you have any follow-up questions or need further details, feel free to ask!",
988,f553d43a-a524-4028-a43b-38ad89e494af,write a review of https://aries-reviewer.apps.allenai.org/pdf/f88ce4c4b0a8b5e180189d7de4e6d9968dc5066b in three sentences or less,"The review for the second requested paper is not available due to an error in the process. It seems there was an expectation for a follow-up question, but none was provided. If you have specific questions or need further details about this paper, please let me know so I can assist you accordingly!",


In [137]:
expected_followup = missing_all_scores[missing_all_scores['final_response'].str.contains("expected a follow-up question", na=False)]

print("No score due to awaiting follow-up question:", len(expected_followup))
print("Unique case_ids expecting follow-up:", expected_followup['case_id'].nunique())
print("Unique prompts expecting follow-up:", expected_followup['initial_prompt'].nunique())
print()
print("Prompts: ")
unique_initial_prompts = expected_followup['initial_prompt'].unique()
print(unique_initial_prompts)

No score due to awaiting follow-up question: 27
Unique case_ids expecting follow-up: 4
Unique prompts expecting follow-up: 2

Prompts: 
['can you review https://aries-reviewer.apps.allenai.org/pdf/357bb84deaecf6a0000551b19f4c3d59d9584d44 in three sentences or less'
 'write a review of https://aries-reviewer.apps.allenai.org/pdf/f88ce4c4b0a8b5e180189d7de4e6d9968dc5066b in three sentences or less']
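
The filter above matches the literal phrase "expected a follow-up question". At least one response in the earlier table (row 988) is worded "there was an expectation for a follow-up question", which a literal match would miss. A hedged sketch of a looser pattern follows; the regex is an assumption about which phrasings matter, and the sample strings are shortened from responses shown above:

```python
import pandas as pd

responses = pd.Series([
    "The system expected a follow-up question but none was given.",
    "It seems there was an expectation for a follow-up question, but none was provided.",
    "The review for the first paper is now available.",
    None,  # runs with no final_response
])

# Assumed pattern: covers "expected a follow-up question" and
# "expectation for a follow-up question". The group is non-capturing
# so pandas does not warn about unused match groups.
pattern = r"expect(?:ed|ation for) a follow-up question"

awaiting_followup = responses.str.contains(pattern, case=False, regex=True, na=False)
print(awaiting_followup.tolist())
```

Other paraphrases may exist in the full data, so the matched rows are worth a manual spot check before the count is trusted.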


In [144]:
no_response = missing_all_scores[missing_all_scores['final_response'].isna()]
pd.reset_option('display.max_colwidth')  # reset_option takes only the option name
no_response[:5]

Unnamed: 0,case_id,test_type_unofficial,initial_prompt,handler_cost,metrics_json,output_score,tool_score,final_response,error,events,case_params,short_error,num_times_case_id_failed
14,8a3ef6f4-8a65-443b-a5d1-646fdcd500cd,paper_finder,Call the execution agent with 'what is 2+2'\nc...,,,,,,message='Expecting value: line 1 column 1 (cha...,"[{""event_type"": ""message"", ""role"": ""user"", ""co...",{'case_id': '8a3ef6f4-8a65-443b-a5d1-646fdcd50...,requests.exceptions.JSONDecodeError: Expecting...,4
15,8a3ef6f4-8a65-443b-a5d1-646fdcd500cd,paper_finder,Call the execution agent with 'what is 2+2'\nc...,,,,,,message='Expecting value: line 1 column 1 (cha...,"[{""event_type"": ""message"", ""role"": ""user"", ""co...",{'case_id': '8a3ef6f4-8a65-443b-a5d1-646fdcd50...,requests.exceptions.JSONDecodeError: Expecting...,4
16,8a3ef6f4-8a65-443b-a5d1-646fdcd500cd,paper_finder,Call the execution agent with 'what is 2+2'\nc...,,,,,,message='Expecting value: line 1 column 1 (cha...,"[{""event_type"": ""message"", ""role"": ""user"", ""co...",{'case_id': '8a3ef6f4-8a65-443b-a5d1-646fdcd50...,requests.exceptions.JSONDecodeError: Expecting...,4
17,8a3ef6f4-8a65-443b-a5d1-646fdcd500cd,paper_finder,Call the execution agent with 'what is 2+2'\nc...,,,,,,message='Expecting value: line 1 column 1 (cha...,"[{""event_type"": ""message"", ""role"": ""user"", ""co...",{'case_id': '8a3ef6f4-8a65-443b-a5d1-646fdcd50...,requests.exceptions.JSONDecodeError: Expecting...,4
150,12790d54-13c8-4f87-8b84-be5f8d347297,,Find me papers about datasets for fairness in ...,,,,,,message='Expecting value: line 1 column 1 (cha...,"[{""event_type"": ""message"", ""role"": ""user"", ""co...",{},requests.exceptions.JSONDecodeError: Expecting...,3


In [143]:
for index, row in no_response.iterrows():
    events_json = json.loads(row['events'])  
    print(f"Row {index}:")
    print(json.dumps(events_json, indent=4), '\n')  # pretty-print JSON with indentation

Row 14:
[
    {
        "event_type": "message",
        "role": "user",
        "content": "Call the execution agent with 'what is 2+2'\ncall the execution agent with 'what is sin(5)'\ncall the paper finder with 'fast polynomial root finding'\ncall the claim verifier with 'word embeddings preserve biases'\ncall me when you're done"
    },
    {
        "event_type": "tool_call",
        "call": {
            "event_type": "tool_call",
            "tool_name": "execution_agent",
            "params": {
                "task_description": "what is 2+2"
            },
            "is_result_overridden": false,
            "result": null
        }
    },
    {
        "event_type": "tool_call",
        "call": {
            "event_type": "tool_call",
            "tool_name": "execution_agent",
            "params": {
                "task_description": "what is sin(5)"
            },
            "is_result_overridden": false,
            "result": null
        }
    },
    {
        "even
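
Pretty-printing every `events` payload gets long quickly. A sketch for reducing each payload to the tools that were called and whether each call returned a result, assuming the structure visible in the output above (`event_type`, and for tool calls a nested `call` with `tool_name` and `result`). The helper name `summarize_tool_calls` is hypothetical:

```python
import json

def summarize_tool_calls(events_str):
    """Return (tool_name, has_result) for each tool_call event in a JSON string."""
    events = json.loads(events_str)
    summary = []
    for event in events:
        if event.get("event_type") != "tool_call":
            continue
        call = event.get("call") or {}
        summary.append((call.get("tool_name"), call.get("result") is not None))
    return summary

# Example payload trimmed from the Row 14 output above.
example = json.dumps([
    {"event_type": "message", "role": "user",
     "content": "Call the execution agent with 'what is 2+2'"},
    {"event_type": "tool_call",
     "call": {"event_type": "tool_call", "tool_name": "execution_agent",
              "params": {"task_description": "what is 2+2"},
              "is_result_overridden": False, "result": None}},
])

print(summarize_tool_calls(example))
```

Applied to `no_response['events']`, a summary like this would show at a glance which tool calls never returned before the JSON decode error.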

#### Missing output score only

In [153]:
missing_output_score = df[df['output_score'].isna() & ~df['tool_score'].isna()]
missing_output_score[:5]

Unnamed: 0,case_id,test_type_unofficial,initial_prompt,handler_cost,metrics_json,output_score,tool_score,final_response,error,events,case_params,short_error,num_times_case_id_failed
9,8a3ef6f4-8a65-443b-a5d1-646fdcd500cd,paper_finder,Call the execution agent with 'what is 2+2'\nc...,0.57395,"{""score"": 1.0, ""subscores"": [{""groups"": [""tool...",,1.0,"The search for papers related to ""fast polynom...",,"[{""event_type"": ""message"", ""role"": ""user"", ""co...",{'case_id': '8a3ef6f4-8a65-443b-a5d1-646fdcd50...,,4
10,8a3ef6f4-8a65-443b-a5d1-646fdcd500cd,paper_finder,Call the execution agent with 'what is 2+2'\nc...,0.68019,"{""score"": 1.0, ""subscores"": [{""groups"": [""tool...",,1.0,"The claim verification for ""word embeddings pr...",,"[{""event_type"": ""message"", ""role"": ""user"", ""co...",{'case_id': '8a3ef6f4-8a65-443b-a5d1-646fdcd50...,,4
11,8a3ef6f4-8a65-443b-a5d1-646fdcd500cd,paper_finder,Call the execution agent with 'what is 2+2'\nc...,0.77694,"{""score"": 1.0, ""subscores"": [{""groups"": [""tool...",,1.0,The claim verification process has concluded t...,,"[{""event_type"": ""message"", ""role"": ""user"", ""co...",{'case_id': '8a3ef6f4-8a65-443b-a5d1-646fdcd50...,,4
12,8a3ef6f4-8a65-443b-a5d1-646fdcd500cd,paper_finder,Call the execution agent with 'what is 2+2'\nc...,0.78462,"{""score"": 1.0, ""subscores"": [{""groups"": [""tool...",,1.0,"The task to verify the claim ""word embeddings ...",,"[{""event_type"": ""message"", ""role"": ""user"", ""co...",{'case_id': '8a3ef6f4-8a65-443b-a5d1-646fdcd50...,,4
13,8a3ef6f4-8a65-443b-a5d1-646fdcd500cd,paper_finder,Call the execution agent with 'what is 2+2'\nc...,0.79434,"{""score"": 1.0, ""subscores"": [{""groups"": [""tool...",,1.0,"The claim verification for ""word embeddings pr...",,"[{""event_type"": ""message"", ""role"": ""user"", ""co...",{'case_id': '8a3ef6f4-8a65-443b-a5d1-646fdcd50...,,4


In [155]:
pd.set_option('display.max_colwidth', None)
missing_output_score[['final_response']]

Unnamed: 0,final_response
9,"The search for papers related to ""fast polynomial root finding"" has completed, and here are the top 10 results:\n\n1. <Paper corpusId=""125207485"">Simple and Nearly Optimal Polynomial Root-finding by Means of Root Radii Approximation</Paper> (2015)\n2. <Paper corpusId=""121537301"">Finding polynomial roots: a fast algorithm convergent on the complex plane</Paper> (1985)\n3. <Paper corpusId=""251107206"">High-Performance Polynomial Root Finding for Graphics</Paper> (2022)\n4. <Paper corpusId=""267385205"">A Fast Algorithm for Finding the Roots of Polynomials over Finite Fields</Paper> (2023)\n5. <Paper corpusId=""253498978"">Fast Evaluation of Real and Complex Polynomials</Paper> (2022)\n6. <Paper corpusId=""240354020"">FC2T2: The Fast Continuous Convolutional Taylor Transform with Applications in Vision and Graphics</Paper> (2021)\n7. <Paper corpusId=""231846850"">Fast real and complex root-finding methods for well-conditioned polynomials</Paper> (2021)\n8. <Paper corpusId=""246753538"">Computing multiple roots of polynomials in stochastic arithmetic with Newton method and approximate GCD</Paper> (2021)\n9. <Paper corpusId=""231986532"">Root Radii and Subdivision for Polynomial Root-Finding</Paper> (2021)\n10. <Paper corpusId=""173992182"">Software Toolkit for HFE-based Multivariate Schemes</Paper> (2019)\n\nThese papers cover a range of methods and applications for polynomial root finding, from theoretical approaches to practical implementations in graphics and cryptography.\n\nI will continue to monitor the remaining tasks and notify you once they are all completed. Thank you for your patience."
10,"The claim verification for ""word embeddings preserve biases"" has been completed. Here's a summary of the findings:\n\n- **Claim**: Word embeddings preserve biases, particularly those inherent in the training data.\n- **Conclusion**: The evidence largely supports the claim that word embeddings preserve biases. However, some studies suggest that certain models and techniques can mitigate these biases to an extent.\n\n**Supporting Evidence**:\n1. Word embeddings have been shown to contain the same biases due to the source data from which they are trained. This is particularly evident in biases related to gender representation.\n2. Biases from the source data carry through to the semantic relations in the word embeddings, which become embedded in the learning systems built on top of them.\n\n**Refuting Evidence**:\n1. Advanced language models can mitigate biases observed in word embeddings in certain applications or outputs, suggesting that while biases may exist, they can be overcome.\n2. Some dictionary-based embeddings show minor biases compared to colloquial-use embeddings like word2vec, indicating variability in the extent of bias based on the type of embeddings used.\n\nFor a detailed report, you can view the evidence and analysis [here](https://nora-execution-agent.apps.allenai.org/static/2818dc95-1920-47cf-8225-13f561f61759/history.html).\n\nAll tasks have now been completed. If you have any further questions or need additional information, feel free to ask!"
11,"The claim verification process has concluded that the claim ""word embeddings preserve biases"" is largely supported by evidence. The evidence indicates that word embeddings often reflect the biases present in their training data. While some mitigation techniques can reduce these biases, they do not eliminate them entirely, and the effectiveness varies across different types of embeddings.\n\nHere are some key points from the detailed report:\n\n1. **Strong Evidence in Support**:\n - Word embeddings have been shown to contain the same biases due to the source data from which they are trained. This is particularly evident in biases related to gender representation in the data.\n - Biases from the source data carry through to the semantic relations in the word embeddings, which become embedded in the learning systems built on top of them.\n\n2. **Evidence Refuting the Claim**:\n - Contextualized word embeddings show less bias than traditional ones, suggesting that not all word embeddings preserve biases equally.\n - Bias mitigation is not as effective in contextualized word embeddings as in traditional word embeddings, indicating that while biases are reduced, they are not completely eliminated in newer embedding models.\n\nThis comprehensive analysis underscores the persistence of biases in word embeddings and the challenges in completely mitigating them. The effectiveness of bias reduction techniques varies, highlighting the need for ongoing research and development in this area.\n\nAll tasks have now been completed. If you have any further questions or need additional information, feel free to ask!"
12,"The task to verify the claim ""word embeddings preserve biases"" has been completed. The evidence largely supports the claim, although the extent and detectability of these biases can vary depending on the type of embeddings and the methods used to measure bias.\n\nHere are some key points from the detailed report:\n- **Strong Evidence Supporting the Claim**:\n - Word embeddings have been shown to contain biases due to the source data from which they are trained. This directly supports the claim by linking the preservation of biases in word embeddings to their training data.\n - Biases, whether gendered or otherwise, still exist in word embeddings, confirming the presence of biases.\n - Biases in word embeddings are directly influenced by the characteristics of the training data, suggesting that biases are preserved unless explicitly mitigated.\n\n- **Evidence Refuting the Claim**:\n - In some cases, language models can overcome the bias observed in word embeddings, indicating a potential mitigation of bias effects in specific contexts.\n - The extent of bias preservation can vary significantly between different types of embeddings, suggesting that some embeddings might not preserve biases as strongly as others.\n\nThis comprehensive analysis indicates that while biases are present in word embeddings, their impact can be mitigated or varied depending on specific conditions and embedding types.\n\nAll tasks have now been completed. If you have any further questions or need additional information, feel free to ask!"
13,"The claim verification for ""word embeddings preserve biases"" has been completed. The evidence largely supports the claim that word embeddings preserve biases, particularly those inherent in the training data. While some debiasing methods show promise in reducing biases, the fundamental issue of bias preservation in word embeddings remains prevalent across various studies.\n\nHere are some key points from the detailed report:\n\n1. **Strong Evidence Supporting the Claim**:\n - Word embeddings have been shown to contain the same biases due to the source data from which they are trained. This underscores that biases in the training data are directly reflected in the resulting word embeddings.\n - Biases from the source data, such as differences in representation for men and women, carry through to the semantic relations in the word embeddings, which become baked into the learning systems built on top of them.\n - Multiple studies have observed biases in word embeddings, indicating a broad consensus on the presence of biases.\n\n2. **Evidence Refuting the Claim**:\n - Some evidence suggests that higher-level language models may mitigate some of the biases present in underlying word embeddings. For example, the language model is able to readily overcome the bias observed in the word embeddings when it comes to the final '[CLS]' representations, showing that advanced language models can correct or lessen the impact of biases in specific cases of word embeddings.\n\nYou can view the detailed computation and evidence [here](https://nora-execution-agent.apps.allenai.org/static/e36c42f9-9c7b-4fe7-bb52-d98a0e3b2df2/history.html).\n\nAll tasks have now been completed. If you have any further questions or need additional information, feel free to ask!"
...,...
856,"To provide information about the references of a specific paper, I need to know which paper you are referring to. Could you please specify the paper or provide more details about it?"
857,"Since the search for a specific paper on CRISPR gene editing did not yield results, I don't have a paper to reference for finding its cited works. Would you like me to attempt the search again or modify the search criteria? Alternatively, if you have a specific paper in mind, please provide its details, and I can look up the references for that paper."
858,"I haven't successfully retrieved a specific paper on CRISPR gene editing yet. Could you please clarify if you would like me to continue searching for a paper on CRISPR gene editing, or if there's another specific paper you have in mind for which you need the references?"
859,"I haven't been able to retrieve a specific paper about CRISPR gene editing yet due to the search issue. To proceed, please choose one of the following options regarding the search:\n\n1) Run an advanced agentic search anyway (this may yield suboptimal results and will take a long time).\n2) Run the query via only Semantic Scholar (this may also yield non-optimal results).\n3) Cancel the search.\n\nOnce we have a specific paper, I can then look up the works it referenced. Let me know how you'd like to proceed!"
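
To separate cases that never receive an output score from cases that only sometimes miss one, the tool-score-only rows can be counted per `case_id` against that case's total runs. A sketch on made-up rows (only the column names come from this notebook):

```python
import pandas as pd

# Hypothetical stand-in for df; values are made up.
toy = pd.DataFrame({
    "case_id":      ["a", "a", "a", "b", "b", "b"],
    "output_score": [None, None, None, 1.0, None, 0.0],
    "tool_score":   [1.0, 1.0, 1.0, 1.0, 1.0, 1.0],
})

# Same mask as missing_output_score above: tool score present, output score absent.
tool_only = toy["output_score"].isna() & toy["tool_score"].notna()

# Fraction of each case's runs that have a tool score but no output score.
tool_only_rate = tool_only.groupby(toy["case_id"]).mean()
print(tool_only_rate)
```

A rate of 1.0 would suggest the case is never given an output score, possibly by design, while an intermediate rate points at run-to-run variation.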


### Case IDs with Highest Std Dev (by *output* score)

In [82]:
df_by_case = df.groupby('case_id').std(numeric_only=True)  # skip text columns, which have no std
highest_std_total_score = df_by_case['output_score'].sort_values(ascending=False)
highest_std_total_score[:10]

case_id
a6797289-061d-4e9b-b280-bbb2598f3a42    0.527046
d546f51d-614a-4042-ae2a-a8906bf4027c    0.517549
2b4b45ee-9be4-4532-9e9b-a7a626201640    0.440959
12790d54-13c8-4f87-8b84-be5f8d347297    0.408248
0733ec2a-0dc7-402b-abac-beaf841436ca    0.353553
a780a51b-4b58-40cb-b65a-a5da5e3c868f    0.333333
f246109e-721e-4239-8e66-93c17b39ae20    0.333333
46f77517-438b-4cb8-a52d-4131d7ec0b70    0.333333
5c33acd1-4048-4c88-80dc-b43f0f58a5b2    0.333333
a4080a3c-82fa-4f3f-b20d-83450f65adb3    0.333333
Name: output_score, dtype: float64
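
For a score that only takes the values 0 and 1, the sample standard deviation (the pandas default, `ddof=1`) is fully determined by the number of scored runs n and the number of passing runs k. A short sketch, assuming binary scores, that maps std values like the ones above back to pass counts:

```python
import math

def binary_sample_std(n, k):
    """Sample std (ddof=1) of k ones and n - k zeros."""
    p = k / n
    return math.sqrt(n / (n - 1) * p * (1 - p))

# With 9 scored runs, which std does each pass count produce?
for k in range(10):
    print(k, round(binary_sample_std(9, k), 6))
```

Under that assumption, 0.527046 corresponds to 4 or 5 passes out of 9, and 0.333333 to a single pass or a single failure out of 9. A value such as 0.517549 does not occur for n = 9 but does for 8 scored runs with 3 or 5 passes, which hints that the case has a run without an output score. Output scores are not always binary (some cases show values like 0.2 and 0.4), so this reading is only a first guess.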

In [93]:
case = 'a6797289-061d-4e9b-b280-bbb2598f3a42'
case_a6797289 = df[df['case_id'] == case]
pd.set_option('display.max_colwidth', None)
case_a6797289[['final_response', 'output_score', 'short_error']]

Unnamed: 0,final_response,output_score,short_error
873,"<Author authorId=""3328733"">Luca Soldaini</Author> is primarily affiliated with the Allen Institute for AI. His research spans various topics in artificial intelligence, although specific details about his research area are not provided in the data I retrieved. For more detailed information, you might consider visiting his [homepage](https://soldaini.net).",0.0,
874,"<Author authorId=""3328733"">Luca Soldaini</Author> is primarily affiliated with the Allen Institute for AI. His research spans various topics in artificial intelligence, although specific details about his research area are not provided in the data I retrieved. You can find more about his work and interests on his [personal homepage](https://soldaini.net).",0.0,
875,"Luca Soldaini, with <Author authorId=""3328733"">authorId 3328733</Author>, is affiliated with the Allen Institute for AI. He has published 60 papers and has an h-index of 21. You can find more about his work on his [homepage](https://soldaini.net).\n\nIf you need more specific details about his research area or any particular papers, please let me know!",0.0,
876,"Luca Soldaini, with <Author authorId=""3328733"">authorId 3328733</Author>, is affiliated with the Allen Institute for AI. He has published 60 papers and has an h-index of 21. You can find more about his work on his [homepage](https://soldaini.net).\n\nFor more specific details about his research area or to explore his publications, please let me know how you would like to proceed!",0.0,
877,"<Author authorId=""3328733"">Luca Soldaini</Author> is affiliated with the Allen Institute for AI. His research primarily focuses on information retrieval, natural language processing, and machine learning. You can find more about his work on his [homepage](https://soldaini.net).",1.0,
878,"<Author authorId=""3328733"">Luca Soldaini</Author> is affiliated with the Allen Institute for AI. His research primarily focuses on information retrieval, natural language processing, and machine learning. You can find more about his work on his [homepage](https://soldaini.net).",1.0,
879,"<Author authorId=""3328733"">Luca Soldaini</Author> is affiliated with the Allen Institute for AI. His research primarily focuses on information retrieval, natural language processing, and machine learning. You can find more about his work on his [homepage](https://soldaini.net).",1.0,
880,"Luca Soldaini, with <Author authorId=""3328733"">author ID 3328733</Author>, is affiliated with the Allen Institute for AI. His research primarily focuses on areas related to artificial intelligence, as indicated by his affiliation and his substantial publication record. For more specific details on his research area, you can visit his [homepage](https://soldaini.net).",1.0,
881,"Luca Soldaini, with <Author authorId=""3328733"">author ID 3328733</Author>, is affiliated with the Allen Institute for AI. His research primarily focuses on areas related to artificial intelligence, as indicated by his affiliation and his substantial publication count and h-index. For more detailed insights into his specific research topics, you can visit his [homepage](https://soldaini.net).",1.0,


In [88]:
case_a6797289['final_response'].iloc[0]

'<Author authorId="3328733">Luca Soldaini</Author> is primarily affiliated with the Allen Institute for AI. His research spans various topics in artificial intelligence, although specific details about his research area are not provided in the data I retrieved. For more detailed information, you might consider visiting his [homepage](https://soldaini.net).'

### Case IDs with Highest Std Dev (by *tool* score)

In [83]:
highest_std_tool_score = df_by_case['tool_score'].sort_values(ascending=False)
highest_std_tool_score[:10]

case_id
4c0407f4-7f62-4126-a62d-ab5bcea1a035    0.534522
70d557ec-5cea-4022-a280-fbb3971fba8f    0.527046
66c883e2-1f8f-480b-94b7-4be73d7e5179    0.527046
90d14cf1-6edb-4975-ad14-5851e7c9eafb    0.516398
ef5e364b-ce25-4f89-945b-2311ed107745    0.500000
86e7c027-b7a8-42cf-9a0e-bdbba6a5d3ab    0.440959
7cee2603-6bd2-4b5e-9cda-28cce6ad5fb2    0.440959
6ebaa42b-28e7-4bf8-9f62-0889270c3eee    0.440959
81aa2f86-4474-4b8a-b45a-76c24ea383b3    0.440959
6c933f89-9b27-4293-9c38-4de0f18fb600    0.440959
Name: tool_score, dtype: float64
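
`std()` skips missing scores, so a case with only a few scored runs can rank high on very little evidence. A sketch that reports the number of scored runs and the mean next to the std, on made-up rows:

```python
import pandas as pd

# Hypothetical stand-in for df; values are made up.
toy = pd.DataFrame({
    "case_id":    ["a"] * 4 + ["b"] * 4,
    "tool_score": [1.0, 0.0, 1.0, 0.0, 1.0, None, None, 0.0],
})

# count = number of non-null scores that the mean and std are based on.
stats = (
    toy.groupby("case_id")["tool_score"]
       .agg(["count", "mean", "std"])
       .sort_values("std", ascending=False)
)
print(stats)
```

In this toy data, case `b` ranks first even though only two of its four runs were scored, which is the situation the `count` column is meant to expose.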

In [76]:
case = '4c0407f4-7f62-4126-a62d-ab5bcea1a035'
case_4c0407f4 = df[df['case_id'] == case]
case_4c0407f4['initial_prompt'].iloc[0]

'What are some recent references of the BERT paper?'

### Other Cases

In [46]:
df[df['case_id'] == 'c6061478-e4f3-4c55-b1fd-c2538bf475c3']

Unnamed: 0,case_id,test_type_unofficial,initial_prompt,handler_cost,metrics,metrics_json,score,output_score,tool_score,final_response,error,events,case_params,short_error,num_times_case_id_failed
0,c6061478-e4f3-4c55-b1fd-c2538bf475c3,corpus_qa,"Based on the latest scientific literature, ple...",0.2967,"{'score': 0.28571428571428575, 'subscores': [{...","{""score"": 0.28571428571428575, ""subscores"": [{...",0.285714,0.0,1.0,The detailed comparison between state-space mo...,,"[{""event_type"": ""message"", ""role"": ""user"", ""co...",{'case_id': 'c6061478-e4f3-4c55-b1fd-c2538bf47...,,0
1,c6061478-e4f3-4c55-b1fd-c2538bf475c3,corpus_qa,"Based on the latest scientific literature, ple...",0.47035,"{'score': 0.28571428571428575, 'subscores': [{...","{""score"": 0.28571428571428575, ""subscores"": [{...",0.285714,0.0,1.0,The detailed comparison between state-space mo...,,"[{""event_type"": ""message"", ""role"": ""user"", ""co...",{'case_id': 'c6061478-e4f3-4c55-b1fd-c2538bf47...,,0
2,c6061478-e4f3-4c55-b1fd-c2538bf475c3,corpus_qa,"Based on the latest scientific literature, ple...",0.2984,"{'score': 0.4285714285714286, 'subscores': [{'...","{""score"": 0.4285714285714286, ""subscores"": [{""...",0.428571,0.2,1.0,The detailed comparison between state-space mo...,,"[{""event_type"": ""message"", ""role"": ""user"", ""co...",{'case_id': 'c6061478-e4f3-4c55-b1fd-c2538bf47...,,0
3,c6061478-e4f3-4c55-b1fd-c2538bf475c3,corpus_qa,"Based on the latest scientific literature, ple...",0.29848,"{'score': 0.4285714285714286, 'subscores': [{'...","{""score"": 0.4285714285714286, ""subscores"": [{""...",0.428571,0.2,1.0,The detailed comparison between state-space mo...,,"[{""event_type"": ""message"", ""role"": ""user"", ""co...",{'case_id': 'c6061478-e4f3-4c55-b1fd-c2538bf47...,,0
4,c6061478-e4f3-4c55-b1fd-c2538bf475c3,corpus_qa,"Based on the latest scientific literature, ple...",0.45453,"{'score': 0.4285714285714286, 'subscores': [{'...","{""score"": 0.4285714285714286, ""subscores"": [{""...",0.428571,0.2,1.0,The detailed comparison between state-space mo...,,"[{""event_type"": ""message"", ""role"": ""user"", ""co...",{'case_id': 'c6061478-e4f3-4c55-b1fd-c2538bf47...,,0
5,c6061478-e4f3-4c55-b1fd-c2538bf475c3,corpus_qa,"Based on the latest scientific literature, ple...",0.45767,"{'score': 0.4285714285714286, 'subscores': [{'...","{""score"": 0.4285714285714286, ""subscores"": [{""...",0.428571,0.2,1.0,The detailed comparison between state-space mo...,,"[{""event_type"": ""message"", ""role"": ""user"", ""co...",{'case_id': 'c6061478-e4f3-4c55-b1fd-c2538bf47...,,0
6,c6061478-e4f3-4c55-b1fd-c2538bf475c3,corpus_qa,"Based on the latest scientific literature, ple...",0.29656,"{'score': 0.5714285714285715, 'subscores': [{'...","{""score"": 0.5714285714285715, ""subscores"": [{""...",0.571429,0.4,1.0,The detailed comparison of state-space models ...,,"[{""event_type"": ""message"", ""role"": ""user"", ""co...",{'case_id': 'c6061478-e4f3-4c55-b1fd-c2538bf47...,,0
7,c6061478-e4f3-4c55-b1fd-c2538bf475c3,corpus_qa,"Based on the latest scientific literature, ple...",0.29824,"{'score': 0.5714285714285715, 'subscores': [{'...","{""score"": 0.5714285714285715, ""subscores"": [{""...",0.571429,0.4,1.0,The detailed comparison between state-space mo...,,"[{""event_type"": ""message"", ""role"": ""user"", ""co...",{'case_id': 'c6061478-e4f3-4c55-b1fd-c2538bf47...,,0
8,c6061478-e4f3-4c55-b1fd-c2538bf475c3,corpus_qa,"Based on the latest scientific literature, ple...",0.29809,"{'score': 0.7142857142857144, 'subscores': [{'...","{""score"": 0.7142857142857144, ""subscores"": [{""...",0.714286,0.6,1.0,The detailed comparison between state-space mo...,,"[{""event_type"": ""message"", ""role"": ""user"", ""co...",{'case_id': 'c6061478-e4f3-4c55-b1fd-c2538bf47...,,0
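
In the rows above, `tool_score` stays at 1.0 while `score` rises with `output_score`. The numbers happen to fit `score = (2 * tool_score + 5 * output_score) / 7`, as if seven equally weighted subscores were averaged. That weighting is a guess from these nine rows, not something the notebook states; the check below only confirms the arithmetic:

```python
# (output_score, score) pairs copied from the table above; tool_score is 1.0 in every row.
rows = [(0.0, 0.285714), (0.2, 0.428571), (0.4, 0.571429), (0.6, 0.714286)]

TOOL_SCORE = 1.0

def guessed_score(output_score, tool_score=TOOL_SCORE):
    # Hypothetical weighting: 2 tool subscores and 5 output subscores, equally weighted.
    return (2 * tool_score + 5 * output_score) / 7

for output_score, reported in rows:
    print(output_score, reported, round(guessed_score(output_score), 6))
```

Parsing the `subscores` list in `metrics_json` would be the way to confirm or reject this reading.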
