In [1]:
cd ../..

/home/neon/Documents/cwi_assignament


# Pipeline

In [15]:
import pandas as pd
import numpy as np
import chromadb

from src.pipelines.naive import naive_pipeline
from src.encode import encode_corpus_query
from src.utils import format_table, get_accuracy, fix_duplicated_columns


%load_ext autoreload
%autoreload 2


The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


We first load the `corpus.parquet` and `queries.parquet` files, which contain the tables and the queries to be answered.

In [3]:
corpus = pd.read_parquet('./data/corpus.parquet')
queries = pd.read_parquet('./data/queries.parquet')

Because the tables lose their data types when they are saved, each row's `table` entry must be converted back into a proper table.

In [4]:
corpus['table'] = corpus['table'].apply(format_table)
corpus.iloc[2]['table']

Unnamed: 0,Year,Award,Nominee,Category,Result
0,2013,DJ Magazine Awards,Dyro,Top 100 DJs,30
1,2014,DJ Magazine Awards,Dyro,Top 100 DJs,27
2,2015,DJ Magazine Awards,Dyro,Top 100 DJs,27
3,2016,DJ Magazine Awards,Dyro,Top 100 DJs,93
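
The implementation of `format_table` is not shown here. As a minimal sketch, assuming each table is stored as CSV text (an assumption based on the output above), a hypothetical stand-in named `parse_table` could restore the column types like this:

```python
import io

import pandas as pd


def parse_table(raw: str) -> pd.DataFrame:
    """Hypothetical stand-in for format_table: parse CSV text into a DataFrame."""
    # read_csv infers a dtype for each column (e.g. integers for Year)
    return pd.read_csv(io.StringIO(raw))


raw = "Year,Award,Result\n2013,DJ Magazine Awards,30\n2014,DJ Magazine Awards,27\n"
table = parse_table(raw)
print(table.dtypes)
```

If the tables are stored in another format (JSON, for example), only the parsing call would change.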


Here is an example of a Query and Answer from our data:

In [5]:
# Draw one random row and print its query and answer
sample = queries.sample().iloc[0]
print('Q:{}\nA:{}'.format(sample['query'], sample['answer']))

Q:From which movies did Skye McCole Bartusiak play a role as Rose Wilder and Megan Matheson?
A:Skye McCole Bartusiak appeared as Rose Wilder in Beyond the Prairie: The True Story of Laura Ingalls Wilder (2002) and in 24 (2002–03) as Megan Matheson.


## RAG PIPELINES


In the following sections we use our custom pipelines. First, we need to check the tables for errors such as duplicated column names:

In [6]:
corpus['table'] = corpus['table'].apply(fix_duplicated_columns)
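
The body of `fix_duplicated_columns` is not shown. One plausible approach, sketched below, is to suffix repeated column names so that every column is unique. The helper name `dedupe_column_names` and the `_1`, `_2` suffix scheme are assumptions for illustration, not the project's actual code.

```python
from collections import Counter


def dedupe_column_names(names):
    """Suffix repeated names as name_1, name_2, ... (hypothetical scheme)."""
    seen = Counter()
    result = []
    for name in names:
        if seen[name]:
            # Second and later occurrences get a numeric suffix
            result.append(f"{name}_{seen[name]}")
        else:
            result.append(name)
        seen[name] += 1
    return result


columns = ["Year", "Title", "Role", "Title"]
unique_columns = dedupe_column_names(columns)
print(unique_columns)
```

On a dataframe this could be applied as `df.columns = dedupe_column_names(df.columns)`. The sketch does not guard against a table that already contains a column literally named `Title_1`.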

### 1.1 Getting embeddings

To compute embeddings, we convert the tables to HTML strings. We chose HTML because LLMs are often trained on web content, so they are likely to be more familiar with this format.

Context information from the `corpus` dataframe is also important. We encode `corpus['table']` and `corpus['context']` separately, then average the two representations to create the final embedding.

In [7]:
embeddings, queries_embeddings = encode_corpus_query(corpus, queries) 
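
The averaging step described above can be sketched as follows. `toy_encode` is a stand-in that returns a deterministic pseudo-random vector, where a real pipeline would call an embedding model, and the final rescaling to unit length is an assumption rather than something `encode_corpus_query` is known to do.

```python
import hashlib

import numpy as np


def toy_encode(text: str, dim: int = 8) -> np.ndarray:
    """Stand-in encoder: a deterministic pseudo-random unit vector per text."""
    seed = int.from_bytes(hashlib.sha256(text.encode("utf-8")).digest()[:8], "big")
    vec = np.random.default_rng(seed).standard_normal(dim)
    return vec / np.linalg.norm(vec)


def combine(table_html: str, context: str) -> np.ndarray:
    """Average the table and context embeddings, then rescale to unit length."""
    mean = (toy_encode(table_html) + toy_encode(context)) / 2.0
    return mean / np.linalg.norm(mean)


table_html = "<table><tr><th>Year</th><th>Result</th></tr><tr><td>2013</td><td>30</td></tr></table>"
context = "Dyro, DJ Magazine Awards"
embedding = combine(table_html, context)
print(embedding.shape)
```

A plain average gives the table and its context equal weight. A weighted average would be a natural variation if one source proves more informative.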

### Naive Solution

In [143]:
naive_top_five = naive_pipeline(embeddings, queries_embeddings, dbids=corpus['database_id'].to_numpy())
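
The internals of `naive_pipeline` are not shown. A minimal sketch of a brute-force baseline, assuming it scores every query against every table by dot product and keeps the best `k` matches, might look like this (the function name `top_k_ids` and the toy vectors are illustrative):

```python
import numpy as np


def top_k_ids(corpus_emb, query_emb, db_ids, k=5):
    """Brute-force retrieval: score every query against every table."""
    scores = query_emb @ corpus_emb.T            # shape (n_queries, n_tables)
    order = np.argsort(-scores, axis=1)[:, :k]   # positions of the k highest scores
    return db_ids[order]                         # map positions to database ids


corpus_emb = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])
query_emb = np.array([[0.9, 0.1], [0.1, 0.9]])
db_ids = np.array([10, 20, 30])
top_two = top_k_ids(corpus_emb, query_emb, db_ids, k=2)
print(top_two)
```

Exhaustive scoring is exact but costs one dot product per query and table pair, which is the motivation for moving to a vector index on larger corpora.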

In [144]:
'Naive solution Accuracy: {}'.format(np.mean(get_accuracy(queries, naive_top_five)))

'Naive solution Accuracy: 0.75'
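
Since `get_accuracy` is averaged with `np.mean`, it presumably returns one value per query. A minimal sketch, assuming the metric is a top-k hit rate (a query counts as correct when its true `database_id` is among the retrieved ids), could be written as:

```python
import numpy as np


def hit_at_k(true_ids, retrieved_ids):
    """One boolean per query: is the true id among that query's retrieved ids?"""
    true_ids = np.asarray(true_ids).reshape(-1, 1)
    return (np.asarray(retrieved_ids) == true_ids).any(axis=1)


true_ids = [10, 20, 40, 30]
retrieved = [[10, 30], [30, 20], [10, 20], [20, 10]]
hits = hit_at_k(true_ids, retrieved)
print(hits.mean())
```

The name `hit_at_k` and the toy ids are illustrative. The real helper takes the `queries` dataframe rather than a list of ids.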

## Indexation

### Split and Indexation 

In [170]:
chroma_client = chromadb.Client()
try:
    collection = chroma_client.create_collection(name="vec_db", metadata={"hnsw:space": "ip"})
except Exception:
    # The collection already exists (e.g. the cell was re-run): drop it and recreate it
    chroma_client.delete_collection("vec_db")
    collection = chroma_client.create_collection(name="vec_db", metadata={"hnsw:space": "ip"})

# Index every table embedding under its database id
collection.add(
    embeddings=embeddings,
    ids=corpus['database_id'].astype(str).to_list())

TypeError: Client() got an unexpected keyword argument 'normalize_embeddings'
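
The collection is configured with the inner-product space (`"hnsw:space": "ip"`). Inner product ranks results like cosine similarity only when the vectors have unit length. Whether `encode_corpus_query` already returns normalized embeddings is not shown, so the sketch below illustrates, under that uncertainty, how normalizing can change the ranking. The helper `l2_normalize` is illustrative and not part of the project.

```python
import numpy as np


def l2_normalize(vectors):
    """Scale each row to unit length so inner product equals cosine similarity."""
    vectors = np.asarray(vectors, dtype=float)
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    return vectors / np.clip(norms, 1e-12, None)  # clip avoids division by zero


corpus_emb = np.array([[10.0, 0.0], [0.6, 0.8]])
query = np.array([[0.6, 0.8]])

# Raw inner product favors the long vector even though it points elsewhere
raw_best = int(np.argmax(query @ corpus_emb.T))
# After normalization the vector with the same direction as the query wins
unit_best = int(np.argmax(l2_normalize(query) @ l2_normalize(corpus_emb).T))
print(raw_best, unit_best)
```

If the embeddings turn out not to be normalized, applying such a step to both `embeddings` and `queries_embeddings` before `collection.add` and `collection.query` would be one option to try.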

In [167]:
results = collection.query(query_embeddings=queries_embeddings, n_results=5)

In [168]:
accval = np.mean(get_accuracy(queries, np.array(results['ids'], dtype='int')))
'Indexed solution Accuracy: {:.2f}'.format(accval)


'Indexed solution Accuracy: 0.60'