<a href="https://colab.research.google.com/github/philosophy-question-answerer/model-tests-automated/blob/main/model_test_results_scoring.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Installations, Imports and Third-Party Services

In [None]:
! pip install cohere

Collecting cohere
  Downloading cohere-4.51-py3-none-any.whl (52 kB)
Collecting backoff<3.0,>=2.0 (from cohere)
  Downloading backoff-2.2.1-py3-none-any.whl (15 kB)
Collecting fastavro<2.0,>=1.8 (from cohere)
  Downloading fastavro-1.9.4-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.1 MB)
Collecting importlib_metadata<7.0,>=6.0 (from cohere)
  Downloading importlib_metadata-6.11.0-py3-none-any.whl (23 kB)
Installing collected packages: importlib_metadata, fastavro, backoff, cohere
  Attempting uninstall: importlib_metadata
    Found existing installation: importlib-metadata 7.0.1
    Uninstalling importlib-metadata-7.

In [None]:
import os
import re
import cohere
from google.colab import userdata, drive

In [None]:
COHERE_API_KEY = userdata.get('COHERE_API_KEY')

In [None]:
drive.mount('/content/drive')
test_results_dir = '/content/drive/My Drive/test'
# test_results_dir = '/content/drive/My Drive/Model Tests Results'

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


# Data Preparation

## Extract All QA Pairs From All Test Result Files on Google Drive

In [None]:
all_combinations = {}

qa_pattern = re.compile(r"QUESTION:\s*(.*?)\s*ANSWER:\s*(.*?)\s*INFERENCE_TIME:\s*([\d.]+)( seconds)?", re.DOTALL)

for combination in os.listdir(test_results_dir):

    combination_path = os.path.join(test_results_dir, combination)

    with open(combination_path, 'r') as file:
        file_content = file.read()

        # Initialize a list to store data for each question-answer pair
        question_data = []

        # Extract question, answer, inference time using regex
        matches = qa_pattern.findall(file_content)

        # Create dictionary entries for each question-answer pair
        for match in matches:
            question_data.append({
                'question': match[0].strip(),  # Remove leading and trailing whitespace
                'answer': match[1].strip(),
                'inference_time': float(match[2]),  # The pattern allows a decimal point, so parse as float
                'score': None
            })

        # Use combination name as key and question_data as value in the dictionary
        combination_key = os.path.splitext(combination)[0]
        all_combinations[combination_key] = question_data

print(all_combinations)

{'Llama_2_7b_Chat_1500_300_True': [{'question': 'What determines the meaning of a word?', 'answer': '1) Aristotle and the British empiricists, as well as other philosophers, have argued that words meanings are determined by their use in sentences. However, some theorists believe that word meanings are shaped by abstract or psychological entities. The grammatical category to which a word belongs also influences its combinatorial possibilities in sentences, and thus its meaning ismediately determined by the ontological category of what the word stands for. \n2) Words are linked with others through their meanings, such as what they apply to, signify, represent, and are a word for. The meaning of a word is also connected to synonymy and antonymy, as different words may have similar or opposite meanings. What a constituent word in a proposition-expressing or statement-making sentence means contributes to the determination of what must be the case if the proposition expressed or statement ma
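
The extraction pattern can be checked offline before pointing it at the Drive folder. The sketch below runs the same regular expression over a short, made-up sample (the `sample` text is hypothetical and only mimics the `QUESTION:` / `ANSWER:` / `INFERENCE_TIME:` layout the pattern expects), and parses the time as a float because the pattern accepts a decimal point.

```python
import re

# Same pattern as in the extraction cell.
qa_pattern = re.compile(
    r"QUESTION:\s*(.*?)\s*ANSWER:\s*(.*?)\s*INFERENCE_TIME:\s*([\d.]+)( seconds)?",
    re.DOTALL,
)

# Hypothetical file content; the real files live on Google Drive.
sample = (
    "QUESTION: What is a language game?\n"
    "ANSWER: A language game is an activity\nin which words get their use.\n"
    "INFERENCE_TIME: 12.5 seconds\n"
    "QUESTION: What is a rule?\n"
    "ANSWER: A rule is a signpost.\n"
    "INFERENCE_TIME: 7\n"
)

# findall returns one 4-tuple per match: question, answer, time, optional unit.
pairs = [
    {"question": q.strip(), "answer": a.strip(), "inference_time": float(t)}
    for q, a, t, _unit in qa_pattern.findall(sample)
]

for pair in pairs:
    print(pair)
```

Because the pattern is compiled with `re.DOTALL`, answers that span several lines are captured whole, as the first sample pair shows.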

## Pretty-Print the Resulting Dictionary

In [None]:
import pprint
pprint.pprint(all_combinations)

{'Llama_2_7b_Chat_1500_300_True': [{'answer': '1) Aristotle and the British '
                                              'empiricists, as well as other '
                                              'philosophers, have argued that '
                                              'words meanings are determined '
                                              'by their use in sentences. '
                                              'However, some theorists believe '
                                              'that word meanings are shaped '
                                              'by abstract or psychological '
                                              'entities. The grammatical '
                                              'category to which a word '
                                              'belongs also influences its '
                                              'combinatorial possibilities in '
                                              'sentences, and th

# Cohere Functions

## Generate Prompt

In [None]:
def generate_prompt(question, answer):
  return f'''
  Consider the following question about Ludwig Wittgenstein's Philosophical Investigations that a philosophy student may ask his/her professor: "{question}"\n
  The following is a candidate answer to the given question provided by an AI model in training: "{answer}"\n
  Evaluate this answer based on its accuracy, thoroughness, coherency and relevancy using your own knowledge of Wittgenstein's Philosophical Investigations, and strictly return ONLY an integer score out of 100.
  '''

## Query Cohere

In [None]:
co = cohere.Client(COHERE_API_KEY)

def query_cohere(prompt):
    response = co.chat(message=prompt, model='command', temperature=0.9)
    return response.text
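
Remote API calls can fail transiently (rate limits, network errors), and a single failure would stop a long scoring run. The following is a minimal sketch of a generic retry helper, assuming that retrying on any exception is acceptable; `call_with_retries` and its parameters are hypothetical names and are not part of the Cohere SDK.

```python
import time


def call_with_retries(fn, *args, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fn(*args), retrying on any exception with exponential backoff.

    Hypothetical helper: `attempts`, `base_delay` and `sleep` are
    illustrative parameters, not Cohere settings.
    """
    for attempt in range(attempts):
        try:
            return fn(*args)
        except Exception:
            if attempt == attempts - 1:
                raise  # Out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)  # Waits 1s, 2s, 4s, ...
```

In this notebook it could wrap the query function, for example `call_with_retries(query_cohere, prompt)`.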

# Score All Model Combinations

## Parse Cohere Responses

In [None]:
def parse_response(response):

    # First standalone integer from 0 to 100 (a score of 0 is valid too)
    pattern = re.compile(r'\b(100|[1-9]?[0-9])\b')

    match = re.search(pattern, response)

    if match:
        return int(match.group())
    else:
        return None
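
Score parsing is easy to test without calling the API. The sketch below uses a hypothetical stand-alone helper, `extract_score`, which takes the first standalone integer from 0 to 100 in the response. The sample strings are illustrative; two of them are shortened from the responses shown in this notebook.

```python
import re

# First standalone integer between 0 and 100, inclusive.
SCORE_RE = re.compile(r"\b(100|[1-9]?[0-9])\b")


def extract_score(response):
    """Return the first 0-100 integer found in the response, or None."""
    match = SCORE_RE.search(response)
    return int(match.group()) if match else None


samples = [
    "71 \nThe answer provides an overview...",
    "89-92 The answer is overall very strong",
    "Score: 100 out of 100",
    "0",
    "No number here",
]

for text in samples:
    print(repr(text), "->", extract_score(text))
```

Note that a response opening with a range such as `89-92` yields the lower bound, because only the first number is read. Whether that is the desired behavior is a design choice; returning `None` for ranges would be a stricter alternative.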

## Collect Cohere Responses

In [None]:
for combination in all_combinations:

  for qa_pair in all_combinations[combination]:

    question = qa_pair['question']
    answer = qa_pair['answer']
    inference_time = qa_pair['inference_time']

    prompt = generate_prompt(question, answer)

    response = query_cohere(prompt)

    score = parse_response(response)

    qa_pair['score'] = score

    print(qa_pair['score'])
    print('====')
    print(response + "\n\n#####\n\n")

71
====
71 
The answer provides an overview of some philosophical arguments regarding word meaning, and introduces the connections and contributions to sentences and speech-acts. However, it does not specifically invoke or reflect a particular understanding of Wittgenstein's theory from Philosophical Investigations, nor discuss any ideas regarding the importance of use-contexts or the impossibility of private language. 

Therefore, the answer is overall coherent and relevant but lacks thoroughness and specificity to the aforementioned text, warranting a score of 71 out of 100.

#####


89
====
89-92 The answer is overall very strong and incorporates key concepts from Wittgenstein's Philosophical Investigations about the nature of language and concept formation. 

Here is a breakdown of the score in each category: 

Accuracy: 90

The answer accurately represents key points from Wittgenstein's philosophy, such as the idea that concepts evolve from how we use language and that definitions

KeyboardInterrupt: 
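
The run above was interrupted partway through, which leaves most pairs unscored. One way to make the loop resumable is to skip pairs that already have a score and to save progress after each new one. This is a minimal sketch under stated assumptions: `score_with_checkpoints`, `load_checkpoint` and the checkpoint path are hypothetical, and `score_fn` stands in for the prompt, query and parse steps.

```python
import json
import os


def score_with_checkpoints(all_combinations, score_fn, checkpoint_path):
    """Score unscored QA pairs in place, saving progress after each one."""
    for qa_pairs in all_combinations.values():
        for qa_pair in qa_pairs:
            if qa_pair.get("score") is not None:
                continue  # Already scored in an earlier run
            qa_pair["score"] = score_fn(qa_pair["question"], qa_pair["answer"])
            # Rewrite the whole checkpoint so an interrupt loses at most one pair.
            with open(checkpoint_path, "w", encoding="utf-8") as f:
                json.dump(all_combinations, f)
    return all_combinations


def load_checkpoint(checkpoint_path, default):
    """Return saved results if a checkpoint exists, otherwise `default`."""
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path, encoding="utf-8") as f:
            return json.load(f)
    return default
```

In this notebook, `score_fn` could be `lambda q, a: parse_response(query_cohere(generate_prompt(q, a)))`. Pairs whose response could not be parsed keep a score of `None`, so they are queried again on the next run.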

In [None]:
pprint.pprint(all_combinations)


{'Llama_2_7b_Chat_1500_300_True': [{'answer': '1) Aristotle and the British '
                                              'empiricists, as well as other '
                                              'philosophers, have argued that '
                                              'words meanings are determined '
                                              'by their use in sentences. '
                                              'However, some theorists believe '
                                              'that word meanings are shaped '
                                              'by abstract or psychological '
                                              'entities. The grammatical '
                                              'category to which a word '
                                              'belongs also influences its '
                                              'combinatorial possibilities in '
                                              'sentences, and th

## Find Average Score for Each Combination

In [None]:
all_combinations_avg_scores = {}

for combination in all_combinations:

  combination_avg_score = 0
  non_none_count = 0

  for qa_pair in all_combinations[combination]:

    print(qa_pair['score'])
    if qa_pair['score'] is None:
      continue
    combination_avg_score += qa_pair['score']
    non_none_count += 1

  if non_none_count > 0:
    combination_avg_score /= non_none_count
  else:
    combination_avg_score = -1

  all_combinations_avg_scores[combination] = round(combination_avg_score, 1)

print(all_combinations_avg_scores)

71
89
96
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
{'Llama_2_7b_Chat_1500_300_True': 85.3}
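
An average over only 3 scored pairs out of 21 can be misleading when reported alone. The sketch below, using a hypothetical `summarize_scores` helper, reports the average together with how many pairs were actually scored. The input reproduces the scores printed above.

```python
from statistics import mean


def summarize_scores(qa_pairs):
    """Average the non-None scores and report how many pairs were scored."""
    scores = [p["score"] for p in qa_pairs if p["score"] is not None]
    return {
        "average": round(mean(scores), 1) if scores else None,
        "scored": len(scores),
        "total": len(qa_pairs),
    }


# Scores as printed above: three scored pairs followed by 18 unscored ones.
qa_pairs = [{"score": s} for s in [71, 89, 96] + [None] * 18]
summary = summarize_scores(qa_pairs)
print(summary)  # {'average': 85.3, 'scored': 3, 'total': 21}
```

Returning `None` instead of a `-1` sentinel for combinations with no scores keeps an empty result from being mistaken for a real average.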
