In [1]:
cd ../..

/home/neon/Documents/cwi_assignament


# Pipeline

In [7]:
import pandas as pd
import numpy as np
import chromadb

from src.pipelines.naive import naive_pipeline
from src.encode import encode_corpus_query
from src.utils import format_table, get_accuracy, fix_duplicated_columns


%load_ext autoreload
%autoreload 2


The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


We first load the `corpus.parquet` and `queries.parquet` files, which contain the tables and the queries to be answered.

In [16]:
corpus = pd.read_parquet('./data/corpus.parquet')
queries = pd.read_parquet('./data/queries.parquet')
corpus.shape

(1001, 4)

The tables do not keep their original data types once they have been saved, so each entry in the `table` column has to be reformatted into a proper table.

In [9]:
corpus['table'] = corpus['table'].apply(format_table)
corpus.iloc[2]['table']

Unnamed: 0,Year,Award,Nominee,Category,Result
0,2013,DJ Magazine Awards,Dyro,Top 100 DJs,30
1,2014,DJ Magazine Awards,Dyro,Top 100 DJs,27
2,2015,DJ Magazine Awards,Dyro,Top 100 DJs,27
3,2016,DJ Magazine Awards,Dyro,Top 100 DJs,93


Here is an example of a Query/Answer tuple from our data:

In [11]:
idx = queries.sample()['database_id'].values[0]
print('Q:{}\nA:{}'.format(queries[queries['database_id'] == idx]['query'].values[0],
queries[queries['database_id'] == idx]['answer'].values[0]))

Q:In what shows has Vanessa Marcil played a main role?
A:Vanessa Marcil appeared in roles as Brenda Barrett on General Hospital, Gina Kincaid on Beverly Hills, 90210, and Sam Marquez in Las Vegas.


## RAG PIPELINES


In the following sections we use our custom pipelines. First, we need to fix errors in the tables, such as duplicated column names:

In [6]:
corpus['table'] = corpus['table'].apply(fix_duplicated_columns)

In [14]:
# Print the first table as a quick sanity check
print(corpus.iloc[0]['table'])

    Year                         Award  \
0   2013              Drama Desk Award   
1   2013  Broadway.com Audience Awards   
2   2014                    Tony Award   
3   2014              Drama Desk Award   
4   2014            Drama League Award   
5   2014    Outer Critics Circle Award   
6   2014                 Astaire Award   
7   2015                    Tony Award   
8   2015              Drama Desk Award   
9   2015            Drama League Award   
10  2015    Outer Critics Circle Award   
11  2016        Evening Standard Award   
12  2017        Laurence Olivier Award   
13  2017         The Stage Debut Award   
14  2017                    Tony Award   
15  2017              Drama Desk Award   
16  2017            Drama League Award   
17  2017    Outer Critics Circle Award   

                                       Category                        Work  \
0       Outstanding Featured Actor in a Musical  The Mystery of Edwin Drood   
1   Favorite Onstage Pair (with Jessie Muel

### 1.1 Getting embeddings

To get the embeddings, we first transform the tables into HTML strings. We chose HTML because LLMs are often trained on web content, so they are likely to be more familiar with this format.
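As an illustration of the table-to-HTML step, pandas can render a DataFrame as an HTML string with `DataFrame.to_html`. This is only a sketch: the actual conversion happens inside `encode_corpus_query`, which is not shown here and may differ. The toy table reuses two rows from the example table printed earlier.

```python
import pandas as pd

# Toy table built from two rows of the example shown above
table = pd.DataFrame({
    "Year": [2013, 2014],
    "Award": ["DJ Magazine Awards", "DJ Magazine Awards"],
    "Nominee": ["Dyro", "Dyro"],
})

# index=False drops the row index so only the real columns are serialised
html = table.to_html(index=False)
print(html)
```

The resulting string can then be passed to a text encoder like any other document.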

The context information in `corpus` is also important. We encode `corpus['table']` and `corpus['context']` separately, then average the two representations to create the final embedding.
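The averaging step can be sketched with NumPy. The helper name `average_embeddings` and the toy vectors are assumptions made for illustration; the real logic lives in `encode_corpus_query` and may differ (for example, it could normalise the vectors first).

```python
import numpy as np

def average_embeddings(table_emb, context_emb):
    """Element-wise mean of two (n_rows, dim) embedding matrices (hypothetical helper)."""
    table_emb = np.asarray(table_emb, dtype=float)
    context_emb = np.asarray(context_emb, dtype=float)
    if table_emb.shape != context_emb.shape:
        raise ValueError("both embedding matrices must have the same shape")
    return (table_emb + context_emb) / 2.0

# Toy 2-dimensional embeddings for two corpus rows
table_emb = np.array([[1.0, 0.0], [0.0, 2.0]])
context_emb = np.array([[0.0, 1.0], [2.0, 0.0]])

combined = average_embeddings(table_emb, context_emb)
print(combined)
```

A plain mean gives the table and its context equal weight; a weighted mean would be the natural variation if one of the two turned out to be more informative.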

In [7]:
embeddings, queries_embeddings = encode_corpus_query(corpus, queries) 

### Naive Solution

In [257]:
naive_top_five = naive_pipeline(embeddings, queries_embeddings, dbids=corpus['database_id'].to_numpy())
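
The internals of `naive_pipeline` are not shown in this notebook. A brute-force search of this kind might look like the sketch below, which assumes cosine similarity and an exhaustive comparison of every query against every corpus embedding. The function name `brute_force_top_k` and the toy data are hypothetical.

```python
import numpy as np

def brute_force_top_k(corpus_emb, query_emb, dbids, k=5):
    """Return the ids of the k most cosine-similar corpus rows for each query."""
    corpus_emb = np.asarray(corpus_emb, dtype=float)
    query_emb = np.asarray(query_emb, dtype=float)
    # L2-normalise so that a dot product equals cosine similarity
    corpus_norm = corpus_emb / np.linalg.norm(corpus_emb, axis=1, keepdims=True)
    query_norm = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    sims = query_norm @ corpus_norm.T          # shape: (n_queries, n_corpus)
    top_idx = np.argsort(-sims, axis=1)[:, :k] # most similar first
    return np.asarray(dbids)[top_idx]

# Toy example: three corpus vectors, one query
corpus_emb = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
query_emb = np.array([[1.0, 0.1]])
dbids = np.array([10, 20, 30])

top = brute_force_top_k(corpus_emb, query_emb, dbids, k=2)
print(top)
```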

In [258]:
'Naive solution Accuracy: {}'.format(np.mean(get_accuracy(queries, naive_top_five)))

'Naive solution Accuracy: 0.75'
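
`get_accuracy` is defined in `src.utils` and is not shown here. Assuming it marks a query as correct when its true `database_id` appears among the retrieved ids, the metric could be sketched as follows. The name `hit_at_k` and the toy ids are hypothetical.

```python
import numpy as np

def hit_at_k(true_ids, retrieved_ids):
    """Boolean array: True where the true id is among the retrieved ids for that query."""
    true_ids = np.asarray(true_ids).reshape(-1, 1)  # column vector for broadcasting
    retrieved_ids = np.asarray(retrieved_ids)       # shape: (n_queries, k)
    return (retrieved_ids == true_ids).any(axis=1)

# Toy example: four queries, two retrieved ids each
true_ids = [10, 20, 30, 40]
retrieved = [[10, 11], [21, 22], [31, 30], [41, 42]]

hits = hit_at_k(true_ids, retrieved)
accuracy = hits.mean()
print(accuracy)
```

Under this assumption, the reported accuracy is the fraction of queries whose target table is retrieved within the top five.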

## Indexation

### Chroma

In [259]:
chroma_client = chromadb.Client()
try:
    collection = chroma_client.create_collection(name="vec_db", metadata={"hnsw:space": "cosine"})
except Exception:
    # The collection already exists (e.g. the cell was re-run): drop it and start fresh
    chroma_client.delete_collection("vec_db")
    collection = chroma_client.create_collection(name="vec_db", metadata={"hnsw:space": "cosine"})

collection.add(
    embeddings=embeddings,
    ids=corpus['database_id'].astype(str).to_list())

In [260]:
results = collection.query(query_embeddings=queries_embeddings, n_results=5)

In [262]:
accval = np.mean(get_accuracy(queries, np.array(results['ids'], dtype='int')))
'Chroma solution Accuracy: {:.2f}'.format(accval)

'Chroma solution Accuracy: 0.62'