### Best Model Identification for Innovation Readiness Level

This notebook reads the AI assessments of the test output data with respect to Innovation Readiness Level. Each result code was assessed by 5 different AI models. We then compare each model's levels against the golden levels assessed by humans and identify the best-performing model.

Since innovation readiness levels are ordered numeric values from 0 to 9, an exact-match count is not enough: we need to measure how far each model is from the golden level.

In [1]:
import os
import pandas as pd
import json
import re
from sklearn.metrics import mean_absolute_error, mean_squared_error
import numpy as np

#### Reading the output from AI models

In [2]:
inn_read_ai_results_dir = "test_input_data"
inn_read_ai_results_file = os.path.join(inn_read_ai_results_dir, "results_innovation_readiness.xlsx")
inn_read_ai_results = pd.read_excel(inn_read_ai_results_file, engine="openpyxl")
inn_read_ai_results.head()

Unnamed: 0,Result Code,Part,Content,input_text,model_name,Innovation_Readiness
0,1005,1,A living lab for people for \nlow-emission foo...,A living lab for people for \nlow-emission foo...,gpt-4.1,"{\n“Prime Innovation”: ""Living Lab for People ..."
1,1005,1,A living lab for people for \nlow-emission foo...,A living lab for people for \nlow-emission foo...,gpt-4o,"```json\n{\n ""Prime Innovation"": ""Living Lab ..."
2,1005,1,A living lab for people for \nlow-emission foo...,A living lab for people for \nlow-emission foo...,grok-3,"{\n ""Prime Innovation"": ""Living Lab for Peopl..."
3,1005,1,A living lab for people for \nlow-emission foo...,A living lab for people for \nlow-emission foo...,grok-4,"{\n ""Prime Innovation"": ""Living Lab for Peopl..."
4,1005,1,A living lab for people for \nlow-emission foo...,A living lab for people for \nlow-emission foo...,o3,{\n“Prime Innovation”: “Living Lab for People ...


Getting only relevant columns

In [3]:
# .copy() avoids a SettingWithCopyWarning when new columns are added later
inn_read_ai_results = inn_read_ai_results[['Result Code', 'model_name', 'Innovation_Readiness']].copy()
inn_read_ai_results.head()

Unnamed: 0,Result Code,model_name,Innovation_Readiness
0,1005,gpt-4.1,"{\n“Prime Innovation”: ""Living Lab for People ..."
1,1005,gpt-4o,"```json\n{\n ""Prime Innovation"": ""Living Lab ..."
2,1005,grok-3,"{\n ""Prime Innovation"": ""Living Lab for Peopl..."
3,1005,grok-4,"{\n ""Prime Innovation"": ""Living Lab for Peopl..."
4,1005,o3,{\n“Prime Innovation”: “Living Lab for People ...


Parsing the JSON strings output by the AI models

In [4]:
pattern = r"(\"\s*Prime Innovation\s*\"\s*:\s*\")(.*)(\"\s*,\s*\"\s*Evidence-Level Justification\s*\"\s*:\s*\")(.*)(\"\s*,\s*\"\s*Off-Level Justification\s*\"\s*:\s*\")(.*)(\"\s*,\s*\"\s*Environment-Level Justification\s*\"\s*:\s*\")(.*)(\"\s*,\s*\"\s*Innovation Readiness Level.*)"

In [5]:
def replace_char_in_group(match):
    group_two_modified = match.group(2).replace('"', "'")
    group_four_modified = match.group(4).replace('"', "'")
    group_six_modified = match.group(6).replace('"', "'")
    group_eight_modified = match.group(8).replace('"', "'")
    
    return match.group(1) + group_two_modified + match.group(3) + group_four_modified + match.group(5) + group_six_modified + match.group(7) + group_eight_modified + match.group(9)

In [6]:
def parse_json(json_str):
    try:
        # Convert the left and right curly double quotes that come from Excel into straight double quotes so the string parses
        json_str = json_str.encode('utf-8').replace(b'\xe2\x80\x9c', b'"').replace(b'\xe2\x80\x9d', b'"').decode('utf-8')
        # Strip newlines, single quotes, Markdown code-fence markers, and the "json" fence label
        json_str = json_str.strip().replace('\n', '').replace("'", "").replace('```', '').replace('json', '')
        # Replace double quotes inside the JSON value text with single quotes, since they break JSON parsing
        json_str = re.sub(pattern, replace_char_in_group, json_str)
        return json.loads(json_str)
    except json.JSONDecodeError:
        return None
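
To illustrate why the clean-up in `parse_json` is needed, here is a minimal sketch that applies similar normalisation steps to a made-up response string. The helper name `normalise_model_output` and the sample text are assumptions for illustration only; they are not part of the notebook or the results file.

```python
import json

FENCE = '`' * 3  # Markdown code-fence marker that some models wrap around their JSON

def normalise_model_output(raw):
    """Hypothetical helper mirroring the first clean-up steps of parse_json."""
    # Map curly double quotes (U+201C, U+201D) to straight double quotes
    text = raw.replace('\u201c', '"').replace('\u201d', '"')
    # Drop newlines, the fence markers, and the "json" fence label
    return text.strip().replace('\n', '').replace(FENCE, '').replace('json', '')

# Made-up sample combining a fenced block with curly quotes
sample = (
    FENCE + 'json\n{\n \u201cPrime Innovation\u201d: \u201cDemo\u201d,\n'
    ' "Innovation Readiness Level": 2\n}\n' + FENCE
)
parsed = json.loads(normalise_model_output(sample))
print(parsed['Innovation Readiness Level'])
```

Without the quote and fence replacements, `json.loads` would raise a `JSONDecodeError` on this sample.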

In [7]:
inn_read_ai_results['parsed_json'] = inn_read_ai_results['Innovation_Readiness'].apply(parse_json)
inn_read_ai_results.head()

Unnamed: 0,Result Code,model_name,Innovation_Readiness,parsed_json
0,1005,gpt-4.1,"{\n“Prime Innovation”: ""Living Lab for People ...",{'Prime Innovation': 'Living Lab for People (L...
1,1005,gpt-4o,"```json\n{\n ""Prime Innovation"": ""Living Lab ...",{'Prime Innovation': 'Living Lab for People (L...
2,1005,grok-3,"{\n ""Prime Innovation"": ""Living Lab for Peopl...",{'Prime Innovation': 'Living Lab for People (L...
3,1005,grok-4,"{\n ""Prime Innovation"": ""Living Lab for Peopl...",{'Prime Innovation': 'Living Lab for People (L...
4,1005,o3,{\n“Prime Innovation”: “Living Lab for People ...,{'Prime Innovation': 'Living Lab for People (L...


Extracting the readiness level from the parsed JSON

In [8]:
inn_read_ai_results['readiness_level'] = inn_read_ai_results['parsed_json'].apply(lambda x: x['Innovation Readiness Level'] if x is not None else None)
inn_read_ai_results.head()

Unnamed: 0,Result Code,model_name,Innovation_Readiness,parsed_json,readiness_level
0,1005,gpt-4.1,"{\n“Prime Innovation”: ""Living Lab for People ...",{'Prime Innovation': 'Living Lab for People (L...,2
1,1005,gpt-4o,"```json\n{\n ""Prime Innovation"": ""Living Lab ...",{'Prime Innovation': 'Living Lab for People (L...,2
2,1005,grok-3,"{\n ""Prime Innovation"": ""Living Lab for Peopl...",{'Prime Innovation': 'Living Lab for People (L...,2
3,1005,grok-4,"{\n ""Prime Innovation"": ""Living Lab for Peopl...",{'Prime Innovation': 'Living Lab for People (L...,2
4,1005,o3,{\n“Prime Innovation”: “Living Lab for People ...,{'Prime Innovation': 'Living Lab for People (L...,2


Pivoting the table so that each model becomes its own column of readiness levels, with one row per result code

In [9]:
inn_read_ai_results = inn_read_ai_results[['Result Code', 'model_name', 'readiness_level']]
inn_read_ai_results = inn_read_ai_results.pivot(index='Result Code', columns = 'model_name', values = 'readiness_level').rename_axis(columns=None).reset_index()
inn_read_ai_results

Unnamed: 0,Result Code,gpt-4.1,gpt-4o,grok-3,grok-4,o3
0,126,4,3.0,3,5,5
1,128,6,2.0,2,2,2
2,276,5,2.0,2,3,3
3,453,9,8.0,9,9,9
4,455,3,5.0,4,3,2
5,778,9,9.0,9,9,9
6,814,5,7.0,5,5,3
7,824,7,,3,5,5
8,1005,2,2.0,2,2,2
9,1024,8,8.0,8,8,8
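
The reshaping step above goes from a long table (one row per result code and model) to a wide one (one row per result code). The sketch below shows the same `pivot` call on made-up data; the model names `model-a` and `model-b` and all values are hypothetical. A (result code, model) pair with no row appears as `NaN` in the wide table.

```python
import pandas as pd

# Made-up long-format data: one row per (result code, model) pair
long_df = pd.DataFrame({
    'Result Code': [1, 1, 2],
    'model_name': ['model-a', 'model-b', 'model-a'],
    'readiness_level': [4, 5, 7],
})

# Same reshaping as in the notebook: models become columns
wide_df = (
    long_df.pivot(index='Result Code', columns='model_name', values='readiness_level')
    .rename_axis(columns=None)
    .reset_index()
)
print(wide_df)
```

Note that `pivot` raises an error if the same (result code, model) pair occurs more than once, so duplicates would need to be resolved first.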


#### Reading the Golden Assessed Results

In [10]:
golden_ipsr_input_dir = "test_input_data"
golden_ipsr_input_file = os.path.join(golden_ipsr_input_dir, "golden_ipsr_assessment.csv")
golden_ipsr_data = pd.read_csv(golden_ipsr_input_file)
golden_ipsr_data.head()

Unnamed: 0,Result Code,Short title,Innovation nature,Readiness level,Evidence 1,Evidence 2,Evidence 3,Assessed level,Notes
0,4075.0,Eco-friendly biopesticides for control of rice...,Technological innovation,1.0,https://hdl.handle.net/10568/135290,,,5.0,
1,1787.0,Data to Action Portal for Aquatic Foods,Technological innovation,2.0,https://hdl.handle.net/10568/127521,https://hdl.handle.net/10568/127519,,2.0,
2,10492.0,Timely disease identification point-of-care sy...,Technological innovation,2.0,https://hdl.handle.net/10568/138529,,,4.0,
3,1010.0,Ex-post dashboard for informing parameterizati...,Technological innovation,3.0,https://hdl.handle.net/10568/137897,,,3.0,
4,8689.0,PathoTracer 2.0,Technological innovation,3.0,https://cgiar.sharepoint.com/:b:/s/OneCGIARPRM...,,,7.0,The PathoTracer 2.0 dashboard is fully develop...


In [11]:
golden_ipsr_data = golden_ipsr_data[['Result Code', 'Readiness level', 'Assessed level']]
golden_ipsr_data.head()

Unnamed: 0,Result Code,Readiness level,Assessed level
0,4075.0,1.0,5.0
1,1787.0,2.0,2.0
2,10492.0,2.0,4.0
3,1010.0,3.0,3.0
4,8689.0,3.0,7.0


#### Combining the two dataframes to compare the human results with those given by each model

In [12]:
ai_golden_results = pd.merge(inn_read_ai_results, golden_ipsr_data, how='inner', on='Result Code')
ai_golden_results = ai_golden_results.dropna()  # Drop result codes where any model level or golden level is missing
ai_golden_results.head()

Unnamed: 0,Result Code,gpt-4.1,gpt-4o,grok-3,grok-4,o3,Readiness level,Assessed level
0,126,4,3,3,5,5,7.0,7.0
1,128,6,2,2,2,2,5.0,5.0
2,276,5,2,2,3,3,8.0,8.0
3,453,9,8,9,9,9,8.0,8.0
4,455,3,5,4,3,2,6.0,6.0
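
The inner merge followed by `dropna` removes rows in two different ways: result codes without a golden assessment disappear at the merge, and result codes where any value is missing disappear at `dropna`. A minimal sketch on made-up frames (the column `model-a` and all values are hypothetical) makes the two steps visible:

```python
import pandas as pd

# Made-up AI results: code 2 has a missing model level, code 3 has no golden row
ai_df = pd.DataFrame({'Result Code': [1, 2, 3], 'model-a': [4, None, 6]})
golden_df = pd.DataFrame({'Result Code': [1, 2, 9], 'Assessed level': [5.0, 3.0, 8.0]})

merged = pd.merge(ai_df, golden_df, how='inner', on='Result Code')
complete = merged.dropna()

# Row counts before the merge, after the merge, and after dropna
print(len(ai_df), len(merged), len(complete))
```

Comparing these counts is a quick way to confirm how many result codes actually enter the error metrics.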


In [13]:
ai_golden_results

Unnamed: 0,Result Code,gpt-4.1,gpt-4o,grok-3,grok-4,o3,Readiness level,Assessed level
0,126,4,3,3,5,5,7.0,7.0
1,128,6,2,2,2,2,5.0,5.0
2,276,5,2,2,3,3,8.0,8.0
3,453,9,8,9,9,9,8.0,8.0
4,455,3,5,4,3,2,6.0,6.0
5,778,9,9,9,9,9,9.0,9.0
6,814,5,7,5,5,3,7.0,7.0
8,1005,2,2,2,2,2,2.0,2.0
9,1024,8,8,8,8,8,9.0,9.0
10,1035,8,8,9,9,9,7.0,7.0


In [14]:
ai_golden_results['grok-4'].to_numpy().astype('float')

array([5., 2., 3., 9., 3., 9., 5., 2., 8., 9., 2., 2., 7., 9., 7., 7., 8.,
       5., 7., 9., 7., 9., 9., 2., 9., 2., 7., 7., 1.])

Calculating the sum of absolute differences for every model

In [15]:
print('Sum of Absolute Difference GPT-4.1:', np.sum(np.abs(ai_golden_results['gpt-4.1'].to_numpy().astype('float') - ai_golden_results['Assessed level'].to_numpy())))
print('Sum of Absolute Difference grok3:', np.sum(np.abs(ai_golden_results['grok-3'].to_numpy().astype('float') - ai_golden_results['Assessed level'].to_numpy())))
print('Sum of Absolute Difference o3:', np.sum(np.abs(ai_golden_results['o3'].to_numpy().astype('float') - ai_golden_results['Assessed level'].to_numpy())))
print('Sum of Absolute Difference GPT-4o:', np.sum(np.abs(ai_golden_results['gpt-4o'].to_numpy().astype('float') - ai_golden_results['Assessed level'].to_numpy())))
print('Sum of Absolute Difference grok4:', np.sum(np.abs(ai_golden_results['grok-4'].to_numpy().astype('float') - ai_golden_results['Assessed level'].to_numpy())))

Sum of Absolute Difference GPT-4.1: 42.0
Sum of Absolute Difference grok3: 54.0
Sum of Absolute Difference o3: 56.0
Sum of Absolute Difference GPT-4o: 60.0
Sum of Absolute Difference grok4: 61.0


Calculating Mean Absolute Error for every model

In [16]:
print("Mean Absolute Error GPT-4.1: ", mean_absolute_error(ai_golden_results['gpt-4.1'], ai_golden_results['Assessed level']))
print("Mean Absolute Error grok3: ", mean_absolute_error(ai_golden_results['grok-3'], ai_golden_results['Assessed level']))
print("Mean Absolute Error o3: ", mean_absolute_error(ai_golden_results['o3'], ai_golden_results['Assessed level']))
print("Mean Absolute Error GPT-4o: ", mean_absolute_error(ai_golden_results['gpt-4o'], ai_golden_results['Assessed level']))
print("Mean Absolute Error grok4: ", mean_absolute_error(ai_golden_results['grok-4'], ai_golden_results['Assessed level']))

Mean Absolute Error GPT-4.1:  1.4482758620689655
Mean Absolute Error grok3:  1.8620689655172413
Mean Absolute Error o3:  1.9310344827586208
Mean Absolute Error GPT-4o:  2.0689655172413794
Mean Absolute Error grok4:  2.103448275862069


Calculating Mean Squared Error for every model

In [17]:
print("Mean Squared Error GPT-4.1: ", mean_squared_error(ai_golden_results['gpt-4.1'], ai_golden_results['Assessed level']))
print("Mean Squared Error o3: ", mean_squared_error(ai_golden_results['o3'], ai_golden_results['Assessed level']))
print("Mean Squared Error grok3: ", mean_squared_error(ai_golden_results['grok-3'], ai_golden_results['Assessed level']))
print("Mean Squared Error grok4: ", mean_squared_error(ai_golden_results['grok-4'], ai_golden_results['Assessed level']))
print("Mean Squared Error GPT-4o: ", mean_squared_error(ai_golden_results['gpt-4o'], ai_golden_results['Assessed level']))


Mean Squared Error GPT-4.1:  3.310344827586207
Mean Squared Error o3:  5.310344827586207
Mean Squared Error grok3:  6.344827586206897
Mean Squared Error grok4:  6.448275862068965
Mean Squared Error GPT-4o:  7.310344827586207
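
MSE squares each difference, so a few large misses weigh more than many small ones. This is why the MAE and MSE orderings can differ: grok-3 has a lower MAE than o3 but a higher MSE. The sketch below reproduces that effect with made-up levels that are not taken from the dataset; the helper names are hypothetical.

```python
import numpy as np

golden = np.array([2.0, 2.0, 2.0, 2.0])
steady = np.array([4.0, 4.0, 4.0, 4.0])  # always off by 2
spiky = np.array([2.0, 2.0, 2.0, 9.0])   # exact three times, off by 7 once

def mean_abs_error(pred, true):
    return float(np.mean(np.abs(pred - true)))

def mean_sq_error(pred, true):
    return float(np.mean((pred - true) ** 2))

for name, pred in [('steady', steady), ('spiky', spiky)]:
    print(name, mean_abs_error(pred, golden), mean_sq_error(pred, golden))
```

Here the "spiky" predictions win on MAE but lose on MSE, because the single large miss dominates once it is squared.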


#### Calculating the following for each model
1. The number of cases where the AI confirmed the innovation readiness level as proposed by Edwin.
2. The number of cases where the AI reduced it by:
 - 1–2 points
 - 3–4 points
 - 5 or more points

3. The number of cases where the AI increased it by:
 - 1–2 points
 - 3–4 points
 - 5 or more points

In [34]:
ai_golden_results['gpt-4.1-diff'] = ai_golden_results['gpt-4.1'].astype('float') - ai_golden_results['Assessed level'].astype('float')
ai_golden_results['gpt-4o-diff'] = ai_golden_results['gpt-4o'].astype('float') - ai_golden_results['Assessed level'].astype('float')
ai_golden_results['grok-3-diff'] = ai_golden_results['grok-3'].astype('float') - ai_golden_results['Assessed level'].astype('float')
ai_golden_results['grok-4-diff'] = ai_golden_results['grok-4'].astype('float') - ai_golden_results['Assessed level'].astype('float')
ai_golden_results['o3-diff'] = ai_golden_results['o3'].astype('float') - ai_golden_results['Assessed level'].astype('float')
ai_golden_results

Unnamed: 0,Result Code,gpt-4.1,gpt-4o,grok-3,grok-4,o3,Readiness level,Assessed level,gpt-4.1-diff,gpt-4o-diff,grok-3-diff,grok-4-diff,o3-diff
0,126,4,3,3,5,5,7.0,7.0,-3.0,-4.0,-4.0,-2.0,-2.0
1,128,6,2,2,2,2,5.0,5.0,1.0,-3.0,-3.0,-3.0,-3.0
2,276,5,2,2,3,3,8.0,8.0,-3.0,-6.0,-6.0,-5.0,-5.0
3,453,9,8,9,9,9,8.0,8.0,1.0,0.0,1.0,1.0,1.0
4,455,3,5,4,3,2,6.0,6.0,-3.0,-1.0,-2.0,-3.0,-4.0
5,778,9,9,9,9,9,9.0,9.0,0.0,0.0,0.0,0.0,0.0
6,814,5,7,5,5,3,7.0,7.0,-2.0,0.0,-2.0,-2.0,-4.0
8,1005,2,2,2,2,2,2.0,2.0,0.0,0.0,0.0,0.0,0.0
9,1024,8,8,8,8,8,9.0,9.0,-1.0,-1.0,-1.0,-1.0,-1.0
10,1035,8,8,9,9,9,7.0,7.0,1.0,1.0,2.0,2.0,2.0


In [30]:
result_match = ai_golden_results[ai_golden_results['gpt-4.1-diff']==0]
result_inc_1_or_2 = ai_golden_results[(ai_golden_results['gpt-4.1-diff']==1) | (ai_golden_results['gpt-4.1-diff']==2)]
result_inc_3_or_4 = ai_golden_results[(ai_golden_results['gpt-4.1-diff']==3) | (ai_golden_results['gpt-4.1-diff']==4)]
result_inc_5_or_more = ai_golden_results[(ai_golden_results['gpt-4.1-diff']>=5)]
result_dec_1_or_2 = ai_golden_results[(ai_golden_results['gpt-4.1-diff']==-1) | (ai_golden_results['gpt-4.1-diff']==-2)]
result_dec_3_or_4 = ai_golden_results[(ai_golden_results['gpt-4.1-diff']==-3) | (ai_golden_results['gpt-4.1-diff']==-4)]
result_dec_5_or_more = ai_golden_results[(ai_golden_results['gpt-4.1-diff']<=-5)]
print("GPT-4.1 Results Analysis:")
print("Total number of results: ", len(ai_golden_results))
print("Number of results matched exactly: ", len(result_match))
print("Number of results increased by 1 or 2 levels: ", len(result_inc_1_or_2))
print("Number of results increased by 3 or 4 levels: ", len(result_inc_3_or_4))
print("Number of results increased by 5 or more levels: ", len(result_inc_5_or_more))
print("Number of results decreased by 1 or 2 levels: ", len(result_dec_1_or_2))
print("Number of results decreased by 3 or 4 levels: ", len(result_dec_3_or_4))
print("Number of results decreased by 5 or more levels: ", len(result_dec_5_or_more))

GPT-4.1 Results Analysis:
Total number of results:  29
Number of results matched exactly:  6
Number of results increased by 1 or 2 levels:  11
Number of results increased by 3 or 4 levels:  2
Number of results increased by 5 or more levels:  0
Number of results decreased by 1 or 2 levels:  6
Number of results decreased by 3 or 4 levels:  4
Number of results decreased by 5 or more levels:  0


In [32]:
result_match = ai_golden_results[ai_golden_results['grok-3-diff']==0]
result_inc_1_or_2 = ai_golden_results[(ai_golden_results['grok-3-diff']==1) | (ai_golden_results['grok-3-diff']==2)]
result_inc_3_or_4 = ai_golden_results[(ai_golden_results['grok-3-diff']==3) | (ai_golden_results['grok-3-diff']==4)]
result_inc_5_or_more = ai_golden_results[(ai_golden_results['grok-3-diff']>=5)]
result_dec_1_or_2 = ai_golden_results[(ai_golden_results['grok-3-diff']==-1) | (ai_golden_results['grok-3-diff']==-2)]
result_dec_3_or_4 = ai_golden_results[(ai_golden_results['grok-3-diff']==-3) | (ai_golden_results['grok-3-diff']==-4)]
result_dec_5_or_more = ai_golden_results[(ai_golden_results['grok-3-diff']<=-5)]
print("grok-3 Results Analysis:")
print("Total number of results: ", len(ai_golden_results))
print("Number of results matched exactly: ", len(result_match))
print("Number of results increased by 1 or 2 levels: ", len(result_inc_1_or_2))
print("Number of results increased by 3 or 4 levels: ", len(result_inc_3_or_4))
print("Number of results increased by 5 or more levels: ", len(result_inc_5_or_more))
print("Number of results decreased by 1 or 2 levels: ", len(result_dec_1_or_2))
print("Number of results decreased by 3 or 4 levels: ", len(result_dec_3_or_4))
print("Number of results decreased by 5 or more levels: ", len(result_dec_5_or_more))

grok-3 Results Analysis:
Total number of results:  29
Number of results matched exactly:  7
Number of results increased by 1 or 2 levels:  6
Number of results increased by 3 or 4 levels:  1
Number of results increased by 5 or more levels:  0
Number of results decreased by 1 or 2 levels:  7
Number of results decreased by 3 or 4 levels:  6
Number of results decreased by 5 or more levels:  2


In [33]:
result_match = ai_golden_results[ai_golden_results['grok-4-diff']==0]
result_inc_1_or_2 = ai_golden_results[(ai_golden_results['grok-4-diff']==1) | (ai_golden_results['grok-4-diff']==2)]
result_inc_3_or_4 = ai_golden_results[(ai_golden_results['grok-4-diff']==3) | (ai_golden_results['grok-4-diff']==4)]
result_inc_5_or_more = ai_golden_results[(ai_golden_results['grok-4-diff']>=5)]
result_dec_1_or_2 = ai_golden_results[(ai_golden_results['grok-4-diff']==-1) | (ai_golden_results['grok-4-diff']==-2)]
result_dec_3_or_4 = ai_golden_results[(ai_golden_results['grok-4-diff']==-3) | (ai_golden_results['grok-4-diff']==-4)]
result_dec_5_or_more = ai_golden_results[(ai_golden_results['grok-4-diff']<=-5)]
print("grok-4 Results Analysis:")
print("Total number of results: ", len(ai_golden_results))
print("Number of results matched exactly: ", len(result_match))
print("Number of results increased by 1 or 2 levels: ", len(result_inc_1_or_2))
print("Number of results increased by 3 or 4 levels: ", len(result_inc_3_or_4))
print("Number of results increased by 5 or more levels: ", len(result_inc_5_or_more))
print("Number of results decreased by 1 or 2 levels: ", len(result_dec_1_or_2))
print("Number of results decreased by 3 or 4 levels: ", len(result_dec_3_or_4))
print("Number of results decreased by 5 or more levels: ", len(result_dec_5_or_more))

grok-4 Results Analysis:
Total number of results:  29
Number of results matched exactly:  4
Number of results increased by 1 or 2 levels:  7
Number of results increased by 3 or 4 levels:  4
Number of results increased by 5 or more levels:  1
Number of results decreased by 1 or 2 levels:  7
Number of results decreased by 3 or 4 levels:  5
Number of results decreased by 5 or more levels:  1


In [35]:
result_match = ai_golden_results[ai_golden_results['o3-diff']==0]
result_inc_1_or_2 = ai_golden_results[(ai_golden_results['o3-diff']==1) | (ai_golden_results['o3-diff']==2)]
result_inc_3_or_4 = ai_golden_results[(ai_golden_results['o3-diff']==3) | (ai_golden_results['o3-diff']==4)]
result_inc_5_or_more = ai_golden_results[(ai_golden_results['o3-diff']>=5)]
result_dec_1_or_2 = ai_golden_results[(ai_golden_results['o3-diff']==-1) | (ai_golden_results['o3-diff']==-2)]
result_dec_3_or_4 = ai_golden_results[(ai_golden_results['o3-diff']==-3) | (ai_golden_results['o3-diff']==-4)]
result_dec_5_or_more = ai_golden_results[(ai_golden_results['o3-diff']<=-5)]
print("o3 Results Analysis:")
print("Total number of results: ", len(ai_golden_results))
print("Number of results matched exactly: ", len(result_match))
print("Number of results increased by 1 or 2 levels: ", len(result_inc_1_or_2))
print("Number of results increased by 3 or 4 levels: ", len(result_inc_3_or_4))
print("Number of results increased by 5 or more levels: ", len(result_inc_5_or_more))
print("Number of results decreased by 1 or 2 levels: ", len(result_dec_1_or_2))
print("Number of results decreased by 3 or 4 levels: ", len(result_dec_3_or_4))
print("Number of results decreased by 5 or more levels: ", len(result_dec_5_or_more))

o3 Results Analysis:
Total number of results:  29
Number of results matched exactly:  4
Number of results increased by 1 or 2 levels:  8
Number of results increased by 3 or 4 levels:  2
Number of results increased by 5 or more levels:  0
Number of results decreased by 1 or 2 levels:  8
Number of results decreased by 3 or 4 levels:  6
Number of results decreased by 5 or more levels:  1


In [36]:
result_match = ai_golden_results[ai_golden_results['gpt-4o-diff']==0]
result_inc_1_or_2 = ai_golden_results[(ai_golden_results['gpt-4o-diff']==1) | (ai_golden_results['gpt-4o-diff']==2)]
result_inc_3_or_4 = ai_golden_results[(ai_golden_results['gpt-4o-diff']==3) | (ai_golden_results['gpt-4o-diff']==4)]
result_inc_5_or_more = ai_golden_results[(ai_golden_results['gpt-4o-diff']>=5)]
result_dec_1_or_2 = ai_golden_results[(ai_golden_results['gpt-4o-diff']==-1) | (ai_golden_results['gpt-4o-diff']==-2)]
result_dec_3_or_4 = ai_golden_results[(ai_golden_results['gpt-4o-diff']==-3) | (ai_golden_results['gpt-4o-diff']==-4)]
result_dec_5_or_more = ai_golden_results[(ai_golden_results['gpt-4o-diff']<=-5)]
print("GPT-4o Results Analysis:")
print("Total number of results: ", len(ai_golden_results))
print("Number of results matched exactly: ", len(result_match))
print("Number of results increased by 1 or 2 levels: ", len(result_inc_1_or_2))
print("Number of results increased by 3 or 4 levels: ", len(result_inc_3_or_4))
print("Number of results increased by 5 or more levels: ", len(result_inc_5_or_more))
print("Number of results decreased by 1 or 2 levels: ", len(result_dec_1_or_2))
print("Number of results decreased by 3 or 4 levels: ", len(result_dec_3_or_4))
print("Number of results decreased by 5 or more levels: ", len(result_dec_5_or_more))

GPT-4o Results Analysis:
Total number of results:  29
Number of results matched exactly:  6
Number of results increased by 1 or 2 levels:  8
Number of results increased by 3 or 4 levels:  3
Number of results increased by 5 or more levels:  1
Number of results decreased by 1 or 2 levels:  5
Number of results decreased by 3 or 4 levels:  4
Number of results decreased by 5 or more levels:  2
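
The five analysis cells above repeat the same bucketing logic for each model. One way to avoid the repetition is a single helper applied to any difference column. The sketch below is only an illustration: the function name `summarise_differences` and the sample differences are made up, and it assumes the differences are whole numbers, as they are in this notebook.

```python
import pandas as pd

def summarise_differences(diff):
    """Count differences per bucket; `diff` is model level minus golden level."""
    return {
        'matched exactly': int((diff == 0).sum()),
        'increased by 1 or 2': int(diff.between(1, 2).sum()),
        'increased by 3 or 4': int(diff.between(3, 4).sum()),
        'increased by 5 or more': int((diff >= 5).sum()),
        'decreased by 1 or 2': int(diff.between(-2, -1).sum()),
        'decreased by 3 or 4': int(diff.between(-4, -3).sum()),
        'decreased by 5 or more': int((diff <= -5).sum()),
    }

# Made-up differences, not taken from the dataset
example_diff = pd.Series([0.0, 1.0, -3.0, 2.0, -6.0, 0.0, 4.0])
summary = summarise_differences(example_diff)
for bucket, count in summary.items():
    print(bucket, count)
```

In the notebook this could be called as `summarise_differences(ai_golden_results['gpt-4.1-diff'])`, and likewise for the other models. Because the buckets are disjoint and cover every whole number, the counts always add up to the number of results.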
