In [1]:
# only run this if you have an editable install
%load_ext autoreload
%autoreload 2

## Prepare dataset

Prepare the dataset so that it is of the form:
```
question: list[str]
answer: list[str]
contexts: list[list[str]]
ground_truths: list[list[str]]
```
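As a concrete illustration of this layout, the sketch below builds a one-row example and checks that every column has the expected type and that all columns have the same length. The `validate_columns` helper and the sample strings are hypothetical, made up for illustration; they are not part of this notebook's pipeline or of Ragas.

```python
def validate_columns(data: dict) -> int:
    """Check the column layout described above and return the number of rows."""
    flat = ("question", "answer")           # expected type: list[str]
    nested = ("contexts", "ground_truths")  # expected type: list[list[str]]

    for key in flat:
        assert all(isinstance(v, str) for v in data[key]), key
    for key in nested:
        for value in data[key]:
            assert isinstance(value, list), key
            assert all(isinstance(s, str) for s in value), key

    # Every column must describe the same set of rows.
    lengths = {len(data[key]) for key in flat + nested}
    assert len(lengths) == 1, "all columns must have the same number of rows"
    return lengths.pop()


# Made-up single-row example in the target format.
example = {
    "question": ["What is an EIN?"],
    "answer": ["An identifier issued by the IRS."],
    "contexts": [["first retrieved passage", "second retrieved passage"]],
    "ground_truths": [["reference passage"]],
}
n_rows = validate_columns(example)
```

Each index `i` across the four lists describes one evaluation sample: a question, the generated answer, the retrieved contexts, and the reference passages.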

In [2]:
from beir import util
from beir.datasets.data_loader import GenericDataLoader

dataset = "fiqa"
url = (
    "https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/{}.zip".format(
        dataset
    )
)
data_path = util.download_and_unzip(url, "datasets")



In [3]:
import os
import json
import pandas as pd

In [4]:
with open(os.path.join(data_path, "corpus.jsonl")) as f:
    cs = [pd.Series(json.loads(l)) for l in f.readlines()]

corpus_df = pd.DataFrame(cs)
corpus_df

Unnamed: 0,_id,title,text,metadata
0,3,,I'm not saying I don't like the idea of on-the...,{}
1,31,,So nothing preventing false ratings besides ad...,{}
2,56,,You can never use a health FSA for individual ...,{}
3,59,,Samsung created the LCD and other flat screen ...,{}
4,63,,Here are the SEC requirements: The federal sec...,{}
...,...,...,...,...
57633,599946,,"&gt;Well, first off, the roads are more than j...",{}
57634,599953,,Yes they do. There are billions and billions s...,{}
57635,599966,,&gt;It's biggly sad you don't understand human...,{}
57636,599975,,"""Did your CTO let a major group use """"admin/ad...",{}


In [5]:
corpus_df = corpus_df.rename(columns={"_id": "corpus-id", "text": "ground_truth"})
corpus_df = corpus_df.drop(columns=["title", "metadata"])
corpus_df["corpus-id"] = corpus_df["corpus-id"].astype(int)
corpus_df.head()

Unnamed: 0,corpus-id,ground_truth
0,3,I'm not saying I don't like the idea of on-the...
1,31,So nothing preventing false ratings besides ad...
2,56,You can never use a health FSA for individual ...
3,59,Samsung created the LCD and other flat screen ...
4,63,Here are the SEC requirements: The federal sec...


In [6]:
with open(os.path.join(data_path, "queries.jsonl")) as f:
    qs = [pd.Series(json.loads(l)) for l in f.readlines()]

queries_df = pd.DataFrame(qs)
queries_df = queries_df.rename(columns={"_id": "query-id", "text": "question"})
queries_df = queries_df.drop(columns=["metadata"])
queries_df["query-id"] = queries_df["query-id"].astype(int)
queries_df.head()

Unnamed: 0,query-id,question
0,0,What is considered a business expense on a bus...
1,4,Business Expense - Car Insurance Deductible Fo...
2,5,Starting a new online business
3,6,“Business day” and “due date” for bills
4,7,New business owner - How do taxes work for the...


In [7]:
splits = ["dev", "test", "train"]
split_df = {}
for s in splits:
    split_df[s] = pd.read_csv(os.path.join(data_path, f"qrels/{s}.tsv"), sep="\t").drop(
        columns=["score"]
    )

split_df["dev"].head()

Unnamed: 0,query-id,corpus-id
0,1,14255
1,2,308938
2,3,296717
3,3,100764
4,3,314352


In [9]:
# how many unique queries
len(set(split_df["dev"]["query-id"]))

500

In [10]:
final_split_df = {}
for split in split_df:
    df = queries_df.merge(split_df[split], on="query-id")
    df = df.merge(corpus_df, on="corpus-id")
    df = df.drop(columns=["corpus-id"])
    grouped = df.groupby("query-id").apply(
        lambda x: pd.Series(
            {
                "question": x["question"].sample().values[0],
                "ground_truths": x["ground_truth"].tolist(),
            }
        )
    )

    grouped = grouped.reset_index()
    grouped = grouped.drop(columns="query-id")
    final_split_df[split] = grouped
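To make the merge-and-group step easier to follow, here is a toy version of the same idea in plain Python: each (query, document) relevance pair is resolved to text, and all documents for one query are collected into a single `ground_truths` list. The ids and strings are made up, and the dictionaries are simplified stand-ins for `queries_df`, the qrels split, and `corpus_df`.

```python
from collections import defaultdict

# Toy stand-ins for queries_df, one qrels split, and corpus_df (made-up values).
queries = {1: "q one", 3: "q three"}
qrels = [(1, 10), (3, 20), (3, 30)]  # (query-id, corpus-id) pairs
corpus = {10: "doc ten", 20: "doc twenty", 30: "doc thirty"}

# Collect every relevant document text under its query id.
grouped = defaultdict(list)
for query_id, corpus_id in qrels:
    grouped[query_id].append(corpus[corpus_id])

# One row per query: the question text plus the list of its ground-truth docs.
rows = [
    {"question": queries[query_id], "ground_truths": docs}
    for query_id, docs in grouped.items()
]
```

As with the default inner joins in the pandas version, only queries that appear in the qrels split end up in the result.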

In [11]:
final_split_df["test"]

Unnamed: 0,question,ground_truths
0,How to deposit a cheque issued to an associate...,"[Have the check reissued to the proper payee.,..."
1,Can I send a money order from USPS as a business?,[Sure you can. You can fill in whatever you w...
2,1 EIN doing business under multiple business n...,[You're confusing a lot of things here. Compan...
3,Applying for and receiving business credit,"[""I'm afraid the great myth of limited liabili..."
4,401k Transfer After Business Closure,[You should probably consult an attorney. Howe...
...,...,...
643,Closing a futures position,"[""Assuming these are standardized and regulate..."
644,Net loss not distributed by mutual funds to th...,[When you invest (say $1000) in (say 100 share...
645,Pay off credit card debt or earn employer 401(...,[A matching pension scheme is like free money....
646,Short Term Capital Gains tax vs. IRA Withdrawa...,"[""There is not a special rate for short-term c..."


In [12]:
corpus_df = corpus_df.drop(columns="corpus-id")

In [13]:
corpus_df = corpus_df.rename(columns={"ground_truth": "doc"})
corpus_df

Unnamed: 0,doc
0,I'm not saying I don't like the idea of on-the...
1,So nothing preventing false ratings besides ad...
2,You can never use a health FSA for individual ...
3,Samsung created the LCD and other flat screen ...
4,Here are the SEC requirements: The federal sec...
...,...
57633,"&gt;Well, first off, the roads are more than j..."
57634,Yes they do. There are billions and billions s...
57635,&gt;It's biggly sad you don't understand human...
57636,"""Did your CTO let a major group use """"admin/ad..."


### Upload to HF space

The dataset is uploaded to `explodinggradients/fiqa`. To make changes, however, you first have to clone that repository to the local path set in `path_to_ds_repo`.

In [14]:
# change if new
path_to_ds_repo = "../../../../datasets/fiqa/"
import os

assert os.path.exists(path_to_ds_repo), f"{path_to_ds_repo} does not exist!"

for s in final_split_df:
    final_split_df[s].to_csv(os.path.join(path_to_ds_repo, f"{s}.csv"), index=False)

corpus_df.to_csv(os.path.join(path_to_ds_repo, "corpus.csv"), index=False)

Now commit the changes and push them to upload the dataset.

In [54]:
from datasets import load_dataset

fiqa = load_dataset(path_to_ds_repo, "main")
fiqa

Downloading and preparing dataset fiqa/main to /home/jjmachan/.cache/huggingface/datasets/fiqa/main/1.0.0/a3f6249910dbf03a46c718025ed536ea37e6975f8f4753704a0a9bf6d4a7d29f...


Dataset fiqa downloaded and prepared to /home/jjmachan/.cache/huggingface/datasets/fiqa/main/1.0.0/a3f6249910dbf03a46c718025ed536ea37e6975f8f4753704a0a9bf6d4a7d29f. Subsequent calls will reuse this data.


DatasetDict({
    train: Dataset({
        features: ['question', 'ground_truths'],
        num_rows: 5500
    })
    validation: Dataset({
        features: ['question', 'ground_truths'],
        num_rows: 500
    })
    test: Dataset({
        features: ['question', 'ground_truths'],
        num_rows: 648
    })
})

## Baseline

Let's use LlamaIndex to create a baseline for our dataset.

In [86]:
from llama_index import Document

docs = []
# ideally we should run this
# for d in corpus_df["doc"]:
#     docs.append(Document(text=d))

# for test
for ds in final_split_df["test"]["ground_truths"]:
    docs.extend([Document(text=d) for d in ds])

In [87]:
len(docs)

1706

In [88]:
# make nodes
from llama_index.node_parser import SimpleNodeParser
from langchain.text_splitter import TokenTextSplitter

splitter = TokenTextSplitter(chunk_size=100, chunk_overlap=50)

parser = SimpleNodeParser(text_splitter=splitter)

nodes = parser.get_nodes_from_documents(documents=docs)
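The effect of `chunk_size=100` and `chunk_overlap=50` can be sketched as a sliding window over a token sequence. The `split_tokens` function below is a hypothetical illustration, not the `TokenTextSplitter` implementation: the real splitter works on tokenizer output, and its handling of edge cases may differ.

```python
def split_tokens(tokens, chunk_size=100, chunk_overlap=50):
    """Split a token list into windows of chunk_size that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    start = 0
    while start < len(tokens):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the sequence
        start += step
    return chunks


# Integers stand in for tokens so the window boundaries are easy to read.
tokens = list(range(250))
chunks = split_tokens(tokens)
```

With these settings every token except those near the two ends appears in two chunks, which roughly doubles the amount of text that has to be embedded.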

In [89]:
nodes[0]

Node(text='Have the check reissued to the proper payee.', doc_id='0d5a55b8-8931-4686-afc7-2351ae917c1a', embedding=None, doc_hash='fa12bca32af448285826c121ac50184bb97eb0d581fa487bfcd761f871561af1', extra_info=None, node_info=None, relationships={<DocumentRelationship.SOURCE: '1'>: '2d0d59b6-ba7d-4c91-8ec3-fcb5255883c8'})

In [90]:
from llama_index import GPTVectorStoreIndex, MockEmbedding
from langchain.embeddings.huggingface import HuggingFaceEmbeddings
from llama_index import LangchainEmbedding, ServiceContext, StorageContext

# load in HF embedding model from langchain
embed_model = LangchainEmbedding(HuggingFaceEmbeddings())
hf_sc = ServiceContext.from_defaults(embed_model=embed_model)

# mock embeddings
embed_model = MockEmbedding(embed_dim=1536)
mock = ServiceContext.from_defaults(embed_model=embed_model)

# openai embeddings
openai_sc = ServiceContext.from_defaults()

Now we are going to create the index, but there are several problems with how llama_index does this:

1. No batching or async for computing embeddings
2. No progress bar to show how much has been embedded
3. No way to see how much these embeddings are going to cost
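One way to work around these three points outside of llama_index is to batch the texts yourself, report progress per batch, and estimate the cost before sending anything. The sketch below is only an illustration: `batched`, `estimate_cost`, `embed_all`, and `fake_embed_batch` are hypothetical helpers, the word count is a crude proxy for the token count, and the price is a parameter you would have to look up yourself.

```python
def batched(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def estimate_cost(texts, price_per_1k_tokens, tokens_per_word=1.0):
    """Rough cost estimate; whitespace words are a crude stand-in for tokens."""
    approx_tokens = sum(len(text.split()) for text in texts) * tokens_per_word
    return approx_tokens / 1000 * price_per_1k_tokens


def embed_all(texts, embed_batch, batch_size=32):
    """Embed texts batch by batch and print progress after each batch."""
    embeddings = []
    for batch in batched(texts, batch_size):
        embeddings.extend(embed_batch(batch))
        print(f"embedded {len(embeddings)}/{len(texts)}")
    return embeddings


def fake_embed_batch(batch):
    """Hypothetical stand-in for a real embedding call: one 1-d vector per text."""
    return [[float(len(text))] for text in batch]


texts = ["a b c", "d e", "f"] * 5
cost = estimate_cost(texts, price_per_1k_tokens=1.0)  # placeholder price
vectors = embed_all(texts, fake_embed_batch, batch_size=4)
```

Swapping `fake_embed_batch` for a function that calls a real embedding model keeps the batching and progress reporting unchanged.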

In [92]:
# create index
index = GPTVectorStoreIndex.from_documents(
    documents=docs,
    service_context=openai_sc,
)

# query with embed_model specified
qe = index.as_query_engine(mode="embedding", verbose=True, service_context=openai_sc)

In [93]:
q = final_split_df["test"]["question"][0]
print("question: ", q)

r = qe.query(q)
print("answer: ", r.response)

question:  How to deposit a cheque issued to an associate in my business into my business account?
answer:  
The only way to deposit a cheque issued to an associate in your business into your business account is to open a business account with the bank. This requires a state-issued "dba" certificate from the county clerk's office as well as an Employer ID Number (EIN) issued by the IRS. The cheapest business banking account typically costs $15/month.


In [45]:
print([n.node.text for n in r.source_nodes])

["Just have the associate sign the back and then deposit it.  It's called a third party cheque and is perfectly legal.  I wouldn't be surprised if it has a longer hold period and, as always, you don't get the money if the cheque doesn't clear. Now, you may have problems if it's a large amount or you're not very well known at the bank.  In that case you can have the associate go to the bank and endorse it in front of the teller with some ID.  You don't even technically have to be there.  Anybody can deposit money to your account if they have the account number. He could also just deposit it in his account and write a cheque to the business.", '"I have checked with Bank of America, and they say the ONLY way to cash (or deposit, or otherwise get access to the funds represented by a check made out to my business) is to open a business account. They tell me this is a Federal regulation, and every bank will say the same thing.  To do this, I need a state-issued ""dba"" certificate (from the 

In [99]:
# save the index
index.storage_context.persist(persist_dir="./storage")

In [61]:
# load the index
from llama_index import StorageContext, load_index_from_storage, ServiceContext

# rebuild storage context
storage_context = StorageContext.from_defaults(persist_dir="./storage")

# load index
index = load_index_from_storage(storage_context)

# query with embed_model specified
qe = index.as_query_engine(
    mode="embedding", verbose=True, service_context=openai_sc, use_async=False
)

In [94]:
from llama_index import (
    GPTVectorStoreIndex,
    ResponseSynthesizer,
)
from llama_index.retrievers import VectorIndexRetriever
from llama_index.query_engine import RetrieverQueryEngine
from llama_index.indices.postprocessor import SimilarityPostprocessor

# configure retriever
retriever = VectorIndexRetriever(
    index=index,
    similarity_top_k=3,
)

# configure response synthesizer
response_synthesizer = ResponseSynthesizer.from_args(
    node_postprocessors=[SimilarityPostprocessor(similarity_cutoff=0.7)]
)

# assemble query engine
qe = RetrieverQueryEngine(
    retriever=retriever,
    response_synthesizer=response_synthesizer,
)
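Conceptually, this query engine keeps the `similarity_top_k=3` best-scoring nodes and then discards any whose score falls below `similarity_cutoff=0.7`. The sketch below mimics that behaviour on made-up scores. It is an illustration under that assumption, not the llama_index implementation, and the `retrieve` function is hypothetical.

```python
def retrieve(scored_nodes, top_k=3, similarity_cutoff=0.7):
    """Keep the top_k highest-scoring nodes, then drop those below the cutoff."""
    ranked = sorted(scored_nodes, key=lambda pair: pair[1], reverse=True)[:top_k]
    return [(text, score) for text, score in ranked if score >= similarity_cutoff]


# Made-up (node text, similarity score) pairs.
candidates = [("a", 0.91), ("b", 0.65), ("c", 0.60), ("d", 0.88)]
kept = retrieve(candidates)
```

Here `"b"` makes it into the top three but is dropped by the cutoff, so fewer than `top_k` contexts can reach the response synthesizer.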

## Inference and uploading to huggingface

In [95]:
test_ds = fiqa["test"]
test_ds

Dataset({
    features: ['question', 'ground_truths'],
    num_rows: 648
})

In [96]:
def generate_response(row):
    r = qe.query(row["question"])
    row["answer"] = r.response
    row["contexts"] = [sn.node.text for sn in r.source_nodes]

    return row


# generate_response(test_ds[0])

In [97]:
gen_ds = test_ds.select(range(30)).map(generate_response)
gen_ds

Dataset({
    features: ['question', 'ground_truths', 'answer', 'contexts'],
    num_rows: 30
})

In [66]:
path_to_ds_repo = "../../../../datasets/fiqa/"
import os

assert os.path.exists(path_to_ds_repo), f"{path_to_ds_repo} does not exist!"
gen_ds.to_csv(os.path.join(path_to_ds_repo, "baseline.csv"))

149647

Now commit the changes in the repo and push them.

In [84]:
load_dataset(path_to_ds_repo, "ragas_eval")

Downloading and preparing dataset fiqa/ragas_eval to /home/jjmachan/.cache/huggingface/datasets/fiqa/ragas_eval/1.0.0/953cfddc4a440cf2e290172be2563e5b51a953f2e4266940fc2b311e135cea69...


../../../../datasets/fiqa/baseline.csv
Dataset fiqa downloaded and prepared to /home/jjmachan/.cache/huggingface/datasets/fiqa/ragas_eval/1.0.0/953cfddc4a440cf2e290172be2563e5b51a953f2e4266940fc2b311e135cea69. Subsequent calls will reuse this data.


DatasetDict({
    baseline: Dataset({
        features: ['question', 'ground_truths', 'answer', 'contexts'],
        num_rows: 30
    })
})

## Evaluate with Ragas

In [98]:
from ragas.metrics import faithfulness, answer_relevancy, context_relevancy
from ragas import evaluate

evaluate(gen_ds, metrics=[faithfulness, answer_relevancy, context_relevancy])

100%|█████████████████████████████████████████████████████████████| 2/2 [01:15<00:00, 37.90s/it]


100%|█████████████████████████████████████████████████████████████| 1/1 [00:04<00:00,  4.57s/it]


100%|█████████████████████████████████████████████████████████████| 3/3 [03:06<00:00, 62.05s/it]


{'NLI_score': 0.8655555555555556, 'answer_relevancy': 0.8737666666666667, 'context_ relevancy': 0.8181444444444443, 'ragas_score': 0.8517704492684051}
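The `ragas_score` above is consistent with the harmonic mean of the three metric values. The check below assumes that this is how the installed Ragas version aggregates them. It is a quick sanity check on the printed numbers, not a description of the library's internals.

```python
from statistics import harmonic_mean

# Metric values copied from the evaluate() output above.
scores = [
    0.8655555555555556,  # NLI_score (faithfulness)
    0.8737666666666667,  # answer_relevancy
    0.8181444444444443,  # context relevancy
]

combined = harmonic_mean(scores)
print(round(combined, 4))
```

Because the harmonic mean is pulled towards the smallest value, a single weak metric lowers the combined score more than it would with an arithmetic mean.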