
We have around 510 contract files and 31 questions per file, which is roughly 15,800 question-file pairs. 
Running the evals on every contract file would take too long. 
Instead, we will run the evals on a subset of the data and then generalize the results. 

We are going to pick 20 files per set to run the evals on, with the following characteristics: 
- These are the files on which the model got the lowest precision, recall, and F1 score. 
- We need to create 2 sets: 
    - One where every contract file is under 3500 tokens and the model performed the worst.
    - One with no token limit, again taking the files where the model performed the worst. 

Steps for selecting the files: 
- We will load the llama2:13b_baseline evals and sort them in ascending order of precision, recall, and F1 score, so the worst-performing files come first. The F1 score gets the highest weight in the sorting.
- We will then select the 20 files with the lowest precision, recall, and F1 score. 
- We will also select the 20 files with the lowest precision, recall, and F1 score among those with a token count under 3500.
- We will then save both lists to a single JSON file. 
- We will then run the evals on these files and save the results in a new folder. 
- We will then generalize the results to the rest of the files. 
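
The sort-and-select steps above can be sketched as follows. This is only an illustration under stated assumptions, not the code in `src.utils`: the names `rank_files` and `pick_worst` are hypothetical, and so are the weights. Only the 0.15 weight on accuracy is consistent with the `weighted_score` values printed in this notebook (for example 0.15 × 0.4194 ≈ 0.0629); the split between F1, precision, and recall is a guess that simply keeps F1 dominant.

```python
from typing import Dict, List, Optional, Tuple

Metrics = Dict[str, float]

# Hypothetical weights: F1 dominates, as the steps above require.
# Only the 0.15 accuracy weight matches the scores shown in this notebook.
WEIGHTS = {"f1_score": 0.5, "precision": 0.175, "recall": 0.175, "accuracy": 0.15}


def rank_files(per_file: Dict[str, Metrics]) -> List[Tuple[str, Metrics]]:
    """Return (filename, metrics) pairs sorted worst-first by weighted score."""
    ranked = []
    for name, metrics in per_file.items():
        score = sum(weight * metrics[key] for key, weight in WEIGHTS.items())
        ranked.append((name, {**metrics, "weighted_score": score}))
    # Ascending order: the lowest weighted score (worst file) comes first.
    return sorted(ranked, key=lambda item: item[1]["weighted_score"])


def pick_worst(
    ranked: List[Tuple[str, Metrics]],
    n: int = 20,
    max_tokens: Optional[int] = None,
) -> List[str]:
    """Take the first n filenames, optionally keeping only short documents."""
    names = [
        name
        for name, metrics in ranked
        if max_tokens is None or metrics["token_length"] < max_tokens
    ]
    return names[:n]


# Toy data using the same fields as the real per-file results.
toy = {
    "a.txt": {"f1_score": 0.0, "precision": 0.0, "recall": 0.0,
              "accuracy": 0.4, "token_length": 5000},
    "b.txt": {"f1_score": 0.5, "precision": 0.5, "recall": 0.5,
              "accuracy": 0.8, "token_length": 300},
    "c.txt": {"f1_score": 0.0, "precision": 0.0, "recall": 0.0,
              "accuracy": 0.5, "token_length": 200},
}
ranked = rank_files(toy)
worst_any = pick_worst(ranked, n=2)
worst_short = pick_worst(ranked, n=2, max_tokens=3500)
```

Filtering by token count before truncating to `n` matters: truncating first and filtering afterwards could leave the token-limited set with fewer than 20 files.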




In [1]:
from src.utils import (
    load_json,
    save_json,
    sort_eval_results,
    get_worst_performing_files,
    save_file_subsets,
    analyze_subset_metrics,
)

In [2]:
# Load evaluation results and document statistics
eval_results = load_json('results/evals/llama2:13b_baseline.json')
doc_stats = load_json('data/processed/txt_doc_stats.json')

# Sort files from worst to best performing
sorted_results = sort_eval_results(eval_results, doc_stats)

In [3]:
sorted_results

[('PRECIGEN,INC_01_22_2020-EX-99.1-JOINT FILING AGREEMENT.txt',
  {'weighted_score': 0.06290322580645161,
   'f1_score': 0,
   'precision': 0.0,
   'recall': 0,
   'accuracy': 0.41935483870967744,
   'token_length': 323}),
 ('RMRGROUPINC_01_22_2020-EX-99.1-JOINT FILING AGREEMENT.txt',
  {'weighted_score': 0.06774193548387096,
   'f1_score': 0,
   'precision': 0.0,
   'recall': 0,
   'accuracy': 0.45161290322580644,
   'token_length': 185}),
 ('WPPPLC_04_30_2020-EX-4.28-SERVICE AGREEMENT.txt',
  {'weighted_score': 0.07258064516129033,
   'f1_score': 0,
   'precision': 0.0,
   'recall': 0.0,
   'accuracy': 0.4838709677419355,
   'token_length': 13374}),
 ('IdeanomicsInc_20151124_8-K_EX-10.2_9354744_EX-10.2_Content License Agreement.txt',
  {'weighted_score': 0.07741935483870967,
   'f1_score': 0,
   'precision': 0.0,
   'recall': 0.0,
   'accuracy': 0.5161290322580645,
   'token_length': 7536}),
 ('REWALKROBOTICSLTD_07_10_2014-EX-10.2-STRATEGIC ALLIANCE AGREEMENT.txt',
  {'weighted_score

In [4]:
# Get worst performing files (both overall and token-limited)
worst_overall = get_worst_performing_files(sorted_results, n=20)
worst_token_limited = get_worst_performing_files(sorted_results, n=20, max_tokens=3500)

In [5]:
worst_overall

['PRECIGEN,INC_01_22_2020-EX-99.1-JOINT FILING AGREEMENT.txt',
 'RMRGROUPINC_01_22_2020-EX-99.1-JOINT FILING AGREEMENT.txt',
 'WPPPLC_04_30_2020-EX-4.28-SERVICE AGREEMENT.txt',
 'IdeanomicsInc_20151124_8-K_EX-10.2_9354744_EX-10.2_Content License Agreement.txt',
 'REWALKROBOTICSLTD_07_10_2014-EX-10.2-STRATEGIC ALLIANCE AGREEMENT.txt',
 'SPIENERGYCO,LTD_07_10_2014-EX-10-Cooperation Agreement of 50MWp Photovoltaic Grid-connected Power Generation Project in Yangqiao of~1.txt',
 'JINGWEIINTERNATIONALLTD_10_04_2007-EX-10.7-INTELLECTUAL PROPERTY AGREEMENT.txt',
 'ULTRAGENYXPHARMACEUTICALINC_12_23_2013-EX-10.9-SUPPLY AGREEMENT.txt',
 'PareteumCorp_20081001_8-K_EX-99.1_2654808_EX-99.1_Hosting Agreement.txt',
 'PacificapEntertainmentHoldingsInc_20051115_8-KA_EX-1.01_4300894_EX-1.01_Content License Agreement.txt',
 'NYLIACVARIABLEANNUITYSEPARATEACCOUNTIII_04_10_2020-EX-99.8.KK-SERVICE AGREEMENT.txt',
 'TURNKEYCAPITAL,INC_07_20_2017-EX-1.1-Strategic Alliance Agreement.txt',
 'CHEETAHMOBILEINC_04_2

In [6]:
worst_token_limited

['PRECIGEN,INC_01_22_2020-EX-99.1-JOINT FILING AGREEMENT.txt',
 'RMRGROUPINC_01_22_2020-EX-99.1-JOINT FILING AGREEMENT.txt',
 'REWALKROBOTICSLTD_07_10_2014-EX-10.2-STRATEGIC ALLIANCE AGREEMENT.txt',
 'SPIENERGYCO,LTD_07_10_2014-EX-10-Cooperation Agreement of 50MWp Photovoltaic Grid-connected Power Generation Project in Yangqiao of~1.txt',
 'ULTRAGENYXPHARMACEUTICALINC_12_23_2013-EX-10.9-SUPPLY AGREEMENT.txt',
 'NYLIACVARIABLEANNUITYSEPARATEACCOUNTIII_04_10_2020-EX-99.8.KK-SERVICE AGREEMENT.txt',
 'TURNKEYCAPITAL,INC_07_20_2017-EX-1.1-Strategic Alliance Agreement.txt',
 'VgrabCommunicationsInc_20200129_10-K_EX-10.33_11958828_EX-10.33_Development Agreement.txt',
 'QBIOMEDINC_04_08_2020-EX-99.1-JOINT FILING AGREEMENT.txt',
 'GALERATHERAPEUTICS,INC_02_14_2020-EX-99.A-JOINT FILING AGREEMENT.txt',
 'Cerus Corporation - FIRST AMEND TO SUPPLY AND MANUFACTURING AGREEMENT.txt',
 'VnueInc_20150914_8-K_EX-10.1_9259571_EX-10.1_Promotion Agreement.txt',
 'TcPipelinesLp_20160226_10-K_EX-99.12_9454048

In [7]:
# Save the subsets
save_path = 'data/file_subsets/llama2:13b_baseline.json'
save_file_subsets(worst_overall, worst_token_limited, save_path, token_cutoff=3500)
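
The JSON written by `save_file_subsets` is not shown in this notebook. One plausible layout, assuming both lists and the cutoff are stored in a single object, is sketched below. The function name `save_subsets_sketch` and the key names are hypothetical and may differ from what `src.utils` actually writes.

```python
import json
import os
import tempfile
from typing import Dict, List


def save_subsets_sketch(
    worst_overall: List[str],
    worst_token_limited: List[str],
    path: str,
    token_cutoff: int,
) -> Dict[str, object]:
    """Write both subsets to one JSON file (hypothetical layout)."""
    payload = {
        "token_cutoff": token_cutoff,
        "worst_overall": list(worst_overall),
        "worst_token_limited": list(worst_token_limited),
    }
    # Create the parent folder if it does not exist yet.
    os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(payload, handle, indent=2)
    return payload


# Round-trip check in a temporary folder so nothing real is overwritten.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "file_subsets", "demo_baseline.json")
    save_subsets_sketch(["a.txt", "c.txt"], ["c.txt", "b.txt"], demo_path, 3500)
    with open(demo_path, encoding="utf-8") as handle:
        loaded = json.load(handle)
```

Storing the cutoff next to the lists keeps the file self-describing, so a later eval run does not need to know which threshold produced the token-limited set.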

In [8]:
# Analyze metrics for both subsets
overall_metrics = analyze_subset_metrics(eval_results, worst_overall)
filtered_metrics = analyze_subset_metrics(eval_results, worst_token_limited)

print("Metrics for worst performing files overall:")
print(overall_metrics)
print("\nMetrics for worst performing files under 3500 tokens:")
print(filtered_metrics)

Metrics for worst performing files overall:
{'avg_f1_score': 0.0, 'avg_precision': 0.0, 'avg_recall': 0.0, 'avg_accuracy': 0.5370967741935484}

Metrics for worst performing files under 3500 tokens:
{'avg_f1_score': 0.0, 'avg_precision': 0.0, 'avg_recall': 0.0, 'avg_accuracy': 0.5725806451612904}
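
The subset averages above can be reproduced with a few lines. The sketch below assumes the per-file results are a dict keyed by filename whose values hold `f1_score`, `precision`, `recall`, and `accuracy`; the real structure of `eval_results` and the internals of `analyze_subset_metrics` are not shown here, so `average_metrics` is a hypothetical stand-in.

```python
from statistics import mean
from typing import Dict, Iterable

METRIC_KEYS = ("f1_score", "precision", "recall", "accuracy")


def average_metrics(
    per_file: Dict[str, Dict[str, float]],
    filenames: Iterable[str],
) -> Dict[str, float]:
    """Average each metric over the given files (hypothetical helper)."""
    selected = [per_file[name] for name in filenames if name in per_file]
    if not selected:
        raise ValueError("none of the requested files have results")
    return {f"avg_{key}": mean(m[key] for m in selected) for key in METRIC_KEYS}


# Toy results: no true positives in either file, but some correct negatives.
toy_results = {
    "a.txt": {"f1_score": 0.0, "precision": 0.0, "recall": 0.0, "accuracy": 0.25},
    "c.txt": {"f1_score": 0.0, "precision": 0.0, "recall": 0.0, "accuracy": 0.75},
}
toy_avg = average_metrics(toy_results, ["a.txt", "c.txt"])
```

An F1 of 0 means there are no true positives, so any accuracy in such a subset comes entirely from true negatives, i.e. questions where both the model and the label say the clause is absent.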
