## Overview

In this notebook we evaluate the overall performance of the system. The task is to pass the system a natural-language question about financial data and have it determine the correct answer from the dataset it has been given.

## Metrics

The evaluation is a simple measure of accuracy: does the system give the correct answer or not? Some leeway is granted to the system's response, because the expected answers often include additional symbols ('%' for percentages, currency symbols for monetary amounts, etc.). We also try to account for rounding by treating a response as correct if the expected and actual values differ by less than 1. This tolerance may itself introduce some error into the evaluation.
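
The comparison rule described above can be sketched as a single function. This is a minimal illustration rather than the notebook's actual code: the names `normalise` and `answers_match` are hypothetical, and the default tolerance of 1 is taken from the evaluation loop further down.

```python
def normalise(answer: str) -> str:
    """Strip whitespace, a leading '$' and a trailing '%' from an answer."""
    text = answer.strip()
    if text.startswith('$'):
        text = text[1:]
    if text.endswith('%'):
        text = text[:-1]
    return text.strip()


def answers_match(expected: str, actual: str, tolerance: float = 1.0) -> bool:
    """Return True if the answers agree exactly, or numerically within `tolerance`."""
    expected_n, actual_n = normalise(expected), normalise(actual)
    if expected_n == actual_n:
        return True
    try:
        return abs(float(expected_n) - float(actual_n)) < tolerance
    except ValueError:
        # at least one answer is not numeric, so only an exact match counts
        return False
```

Because the tolerance is absolute, it is generous for small values: an expected answer of `0.2%` and a response of `0.9` would be judged a match. A relative tolerance might be a stricter alternative for such cases.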

## Conclusions

The accuracy of the system is around 15% on a random sample of 100 records (129 questions). This is low, but not surprising given the difficulty of the task and the limited time available for refining and tuning the system.

In [2]:
%load_ext autoreload
%autoreload 2

import os
import sys

module_path = os.path.abspath(os.path.join('..'))
sys.path.insert(0, module_path)

In [3]:
# read in the records
from tomoro.record import read_records

data_path = '../data/train.json'
all_records = read_records(data_path)
len(all_records)

3037

In [4]:
import random

# use a random sample of the records
records = random.sample(all_records, 100)
len(records)

100
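
Since `random.sample` draws a different subset on every run, the accuracy reported at the end of the notebook will vary from run to run. A minimal sketch of a repeatable draw follows; the helper name `sample_records` and the seed value are assumptions made for illustration, and a list of integers stands in for the real records.

```python
import random


def sample_records(all_records, n, seed=0):
    """Draw n records using a private, seeded generator so the draw is repeatable."""
    rng = random.Random(seed)  # does not disturb the global random state
    return rng.sample(all_records, n)


# stand-in data: integers in place of the 3037 real records
stand_in_records = list(range(3037))
first = sample_records(stand_in_records, 100, seed=0)
second = sample_records(stand_in_records, 100, seed=0)
print(first == second)  # the same seed yields the same sample
```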

In [5]:
# collect every question and its expected answer from the sampled records
questions = []
expected_answers = []

for record in records:
    # a record can carry more than one question/answer pair
    questions.extend(qa.question for qa in record.qa)
    expected_answers.extend(qa.answer for qa in record.qa)

print(len(questions), len(expected_answers))

129 129


In [6]:
from tomoro.vector_store import get_store
from tomoro.config import get_env_var
from tqdm import tqdm 
from langchain_community.utilities import SQLDatabase
from tomoro.utils import get_llm
from tomoro.sql import SQLGenerator
from tomoro.retrieve import Retriever

store_type = get_env_var('VECTOR_STORE')
db_path = '../.vector_db'
collection_name = get_env_var('VECTOR_DB_COLLECTION_NAME')
vector_store = get_store(store_type, db_path=db_path, collection_name=collection_name)
SQL_DB_NAME = get_env_var('SQL_DB_NAME')
sqlite_uri = f'sqlite:///../{SQL_DB_NAME}'
sql_db = SQLDatabase.from_uri(sqlite_uri)
LLM_TYPE = get_env_var('LLM')
llm = get_llm(LLM_TYPE)
sql_generator = SQLGenerator(sqlite_uri, llm)
retriever = Retriever(vector_store, llm, sql_db, sql_generator)

In [None]:
k = 10
correct = 0
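# zip() has no length, so pass the total explicitly for tqdm to show progress
for question, expected_answer in tqdm(zip(questions, expected_answers), total=len(questions)):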
    # print(f'question: {question}')
    actual_answer = retriever.retrieve(question, k)
    if actual_answer is None or actual_answer == '':
        continue
    if expected_answer.endswith('%'):
        expected_answer = expected_answer[:-1]
    if actual_answer.endswith('%'):
        actual_answer = actual_answer[:-1]
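    if expected_answer.startswith('$'):
        expected_answer = expected_answer[1:]
    if actual_answer.startswith('$'):
        actual_answer = actual_answer[1:]
    if expected_answer == actual_answer:
        correct += 1
        continue  # exact match: skip the numeric check so it is not counted twice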
    try:
        expected_f = float(expected_answer)
        actual_f = float(actual_answer)
        # if the float values are close, it's probably just a rounding issue
        if abs(expected_f - actual_f) < 1:
            correct += 1
    except ValueError as e:  # raised by float() when an answer is not numeric
        print(e)
        print(f'expected: {expected_answer}, actual: {actual_answer}')

In [30]:
accuracy = round(100 * correct / len(questions), 3)
print(f'Accuracy: {accuracy}%')

Accuracy: 14.729%
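
With only 129 questions, the accuracy figure carries considerable sampling uncertainty. The sketch below computes a 95% Wilson score interval for it. It assumes 19 correct answers, which is the count that 14.729% of 129 corresponds to, and it treats the questions as independent draws. The function name `wilson_interval` is hypothetical.

```python
import math


def wilson_interval(correct: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 gives about 95%)."""
    p = correct / total
    denom = 1 + z**2 / total
    centre = (p + z**2 / (2 * total)) / denom
    half_width = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return centre - half_width, centre + half_width


low, high = wilson_interval(19, 129)
print(f'95% interval: {100 * low:.1f}% to {100 * high:.1f}%')
```

Under these assumptions the interval spans roughly 10% to 22%, so small changes in accuracy between runs should not be over-interpreted.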
