# QA System on the FiQA dataset

In [2]:
from datasets import load_dataset

fiqa_corpus = load_dataset("explodinggradients/fiqa", "corpus")['corpus']
fiqa_qa = load_dataset("explodinggradients/fiqa", "main")["test"]
fiqa_corpus, fiqa_qa


(Dataset({
     features: ['doc'],
     num_rows: 57638
 }),
 Dataset({
     features: ['question', 'ground_truths'],
     num_rows: 648
 }))

In [75]:
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma
from langchain.schema import Document

data = fiqa_qa["ground_truths"]
docs = []
for r in data:
    for t in r:
        docs.append(Document(page_content=t))
        
vectorstore = Chroma.from_documents(
    documents=docs,
    embedding=OpenAIEmbeddings(),
)

1706


In [90]:
q = fiqa_qa["question"][10]
q

'Using credit card points to pay for tax deductible business expenses'

In [91]:
response_docs = vectorstore.similarity_search(q)
response_docs

[Document(page_content='"For simplicity, let\'s start by just considering cash back. In general, cash back from credit cards for personal use is not taxable, but for business use it is taxable (sort of, I\'ll explain later). The reason is most personal purchases are made with after tax dollars; you typically aren\'t deducting the cost of what you purchased from your personal income, so if you purchase something that costs $100 and you receive $2 back from the CC company, effectively you have paid $98 for that item but that wouldn\'t affect your tax bill. However, since businesses typically deduct most expenses, that same $100 deduction would have only been a $98 deduction for business tax purposes, so in this case the $2 should be accounted for. Note, you should not consider that $2 as income though; that would artificially inflate your revenue. It should be treated as a negative expense, similar to how you would handle returning an item you purchased and receiving a CC refund. Now for

In [143]:
from langchain.chains import RetrievalQA
from langchain.chat_models import ChatOpenAI

llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)
qa_chain = RetrievalQA.from_chain_type(
    llm,
    retriever=vectorstore.as_retriever(),
    return_source_documents=True,
)

result = qa_chain({"query": q})

In [144]:
result

{'query': 'Using credit card points to pay for tax deductible business expenses',
 'result': 'Using credit card points to pay for tax deductible business expenses can be a bit tricky. Generally, if you use credit card points to cover business expenses, those expenses may not be deductible. This is because the points effectively reduce the cost of the expenses, and deducting them would be counteracted by the value of the points. However, if you have a company policy where employees make purchases with their personal credit cards and submit receipts for reimbursement, the employer may not be concerned with the rewards or points earned. In this case, as long as the expenses are legitimate business expenses, they should still be deductible. It\'s important to note that this "don\'t ask, don\'t tell" approach may not be advisable to abuse, and it\'s always best to consult with a tax professional for specific guidance.',
 'source_documents': [Document(page_content='"For simplicity, let\'s st

In [126]:
llm = ChatOpenAI(model_name="gpt-3.5-turbo-16k")
print(llm.predict(q))

Using credit card points to pay for tax deductible business expenses can be a smart financial move for business owners. Here are a few steps to follow:

1. Understand your credit card rewards program: Familiarize yourself with the specific terms and conditions of your credit card rewards program. Determine the value of the points and how they can be redeemed for statement credits, gift cards, or other options.

2. Identify tax-deductible business expenses: Make a list of all the qualifying tax-deductible business expenses you plan to pay for. This may include office supplies, travel expenses, marketing costs, or professional services.

3. Calculate the value of your points: Determine the value of your credit card points based on the redemption options available. For example, if each point is worth $0.01 and you have 10,000 points, you have $100 in credit to use.

4. Choose the best redemption option: Evaluate the redemption options available to you and choose the one that maximizes the

## Seed questions

In [3]:
SEED = 512

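# shuffle() returns a new Dataset rather than shuffling in place, so keep the result;
# otherwise select() below simply takes the first 60 rows in their original order.
fiqa_qa = fiqa_qa.shuffle(seed=SEED)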
fiqa_qa = fiqa_qa.select(range(60))


## Evaluate with ground truth as context

In [6]:
from langchain import PromptTemplate
from langchain.chat_models import ChatOpenAI

qa_with_gt = PromptTemplate.from_template(
"""\
Answer the question using only the context provided

context: {context}
question: {question}
answer:
"""
)

llm = ChatOpenAI(model_name="gpt-3.5-turbo-16k")
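To make the prompt concrete, here is a minimal sketch of what the template renders to once the ground-truth passages are joined into a single context string. It uses plain `str.format` instead of `PromptTemplate`, and the `row` below is a made-up example, not a record from the dataset.

```python
# Same template text as qa_with_gt above, rendered with plain str.format.
TEMPLATE = """\
Answer the question using only the context provided

context: {context}
question: {question}
answer:
"""

# Hypothetical row, shaped like one record of fiqa_qa.
row = {
    "question": "Is cash back from a business credit card taxable?",
    "ground_truths": ["First passage.", "Second passage."],
}

# The ground-truth passages are concatenated with a space, as in the loop below.
prompt = TEMPLATE.format(
    context=" ".join(row["ground_truths"]),
    question=row["question"],
)
print(prompt)
```

Because every ground-truth passage for a question is concatenated, long answers can produce long prompts, which may be why the 16k-context model is used here.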

In [9]:
from ragas.metrics.base import make_batches
from tqdm import tqdm

generations = []
for b in tqdm(make_batches(len(fiqa_qa), 20)):
    r_batch = fiqa_qa.select(b)
    prompts = []
    for r in r_batch:
        p = qa_with_gt.format_prompt(
            context=' '.join(r["ground_truths"]),
            question=r["question"]
        )
        prompts.append(p)
    res = llm.generate_prompt(prompts)
    res_texts = [i[0].text for i in res.generations]
    generations.extend(res_texts)

100%|█████████████████████████████████████████████████████████████| 3/3 [03:20<00:00, 66.67s/it]
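The loop above processes the 60 questions in batches of 20, which is why the progress bar counts 3 steps. The sketch below shows one way such index batches could be built. `make_batches_sketch` is a hypothetical stand-in written for illustration, not the ragas implementation of `make_batches`.

```python
from typing import List


def make_batches_sketch(total_size: int, batch_size: int) -> List[range]:
    """Split range(total_size) into consecutive index ranges of at most batch_size.

    Hypothetical helper for illustration only.
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    return [
        range(start, min(start + batch_size, total_size))
        for start in range(0, total_size, batch_size)
    ]


batches = make_batches_sketch(60, 20)
print(len(batches))                      # 3
print([(b.start, b.stop) for b in batches])  # [(0, 20), (20, 40), (40, 60)]
```

Each range can be passed to `Dataset.select`, which accepts an iterable of row indices, so one batch of rows is turned into prompts and sent to the model in a single `generate_prompt` call.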


In [10]:
len(generations)

60

In [11]:
final_ds = fiqa_qa.add_column("answer_with_gt", generations)
final_ds

Dataset({
    features: ['question', 'ground_truths', 'answer_with_gt'],
    num_rows: 60
})

### Evaluate with Ragas

In [31]:
from ragas import evaluate

ragas_grounded_result = evaluate(final_ds, column_map={
    "question": "question",
    "contexts": "ground_truths",
    "answer": "answer_with_gt"
})

ragas_grounded_result

evaluating with [answer_relevancy]


100%|█████████████████████████████████████████████████████████████| 4/4 [02:44<00:00, 41.07s/it]


evaluating with [context_relavency]


100%|████████████████████████████████████████████████████████████| 4/4 [08:07<00:00, 121.91s/it]


evaluating with [faithfulness]


100%|████████████████████████████████████████████████████████████| 4/4 [09:19<00:00, 139.96s/it]


{'ragas_score': 0.5712, 'answer_relevancy': 0.9003, 'context_relavency': 0.3491, 'faithfulness': 0.7833}

## Evaluate context-free

In [14]:
llm = ChatOpenAI()

In [15]:
from langchain import PromptTemplate

qa_with_nocontext = PromptTemplate.from_template(
"""\
Answer the question

question: {question}
answer:
"""
)

In [16]:
from tqdm import tqdm

generations = []
for b in tqdm(make_batches(len(fiqa_qa), 20)):
    r_batch = fiqa_qa.select(b)
    prompts = []
    for r in r_batch:
        p = qa_with_nocontext.format_prompt(
            question=r["question"]
        )
        prompts.append(p)
    res = llm.generate_prompt(prompts)
    res_texts = [i[0].text for i in res.generations]
    generations.extend(res_texts)

100%|████████████████████████████████████████████████████████████| 3/3 [06:19<00:00, 126.51s/it]


In [17]:
len(generations)

60

In [18]:
final_ds = final_ds.add_column("answer_with_no_context", generations)
final_ds

Dataset({
    features: ['question', 'ground_truths', 'answer_with_gt', 'answer_with_no_context'],
    num_rows: 60
})

In [19]:
ragas_result = evaluate(final_ds, column_map={
    "question": "question",
    "contexts": "ground_truths",
    "answer": "answer_with_no_context"
})

ragas_result

evaluating with [answer_relevancy]


100%|█████████████████████████████████████████████████████████████| 4/4 [02:23<00:00, 35.83s/it]


evaluating with [context_relavency]


100%|████████████████████████████████████████████████████████████| 4/4 [11:34<00:00, 173.74s/it]


evaluating with [faithfulness]


100%|████████████████████████████████████████████████████████████| 4/4 [16:34<00:00, 248.65s/it]


{'ragas_score': 0.4915, 'answer_relevancy': 0.9374, 'context_relavency': 0.3158, 'faithfulness': 0.5347}

## Evaluate a RAG pipeline

In [20]:
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma
from langchain.schema import Document

data = fiqa_qa["ground_truths"]
docs = []
for r in data:
    for t in r:
        docs.append(Document(page_content=t))
        
vectorstore = Chroma.from_documents(
    documents=docs,
    embedding=OpenAIEmbeddings(),
)

In [22]:
from langchain.chains import RetrievalQA
from langchain.chat_models import ChatOpenAI

llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)
qa_chain = RetrievalQA.from_chain_type(
    llm,
    retriever=vectorstore.as_retriever(),
    return_source_documents=True,
)


# try it out
q = fiqa_qa["question"][0]
result = qa_chain({"query": q})
result["result"]

"To deposit a cheque issued to an associate in your business into your business account, you have a few options:\n\n1. Third-Party Endorsement: Have the associate sign the back of the cheque and then deposit it into your business account. This is called a third-party cheque and is generally allowed. However, there may be a longer hold period for the funds, and if the cheque doesn't clear, you won't receive the money.\n\n2. In-Person Endorsement: If the cheque is for a large amount or you're not well-known at the bank, you can have the associate go to the bank with the cheque and endorse it in front of a teller, providing some form of identification. This can help establish the legitimacy of the transaction.\n\n3. Deposit into Associate's Account: Alternatively, the associate can deposit the cheque into their own account and then write a cheque to your business for the same amount. This may be a simpler option if you encounter difficulties with the first two methods.\n\nIt's important t

In [23]:
answer_from_rag = []
contexts = []
for q in tqdm(fiqa_qa["question"]):
    result = qa_chain({"query": q})
    answer_from_rag.append(result["result"])
    contexts.append(
        [i.page_content for i in result["source_documents"]]
    )

100%|███████████████████████████████████████████████████████████| 60/60 [05:28<00:00,  5.47s/it]


In [24]:
final_ds = final_ds.add_column("answer_with_rag", answer_from_rag)
final_ds = final_ds.add_column("contexts_from_rag", contexts)
final_ds

Dataset({
    features: ['question', 'ground_truths', 'answer_with_gt', 'answer_with_no_context', 'answer_with_rag', 'contexts_from_rag'],
    num_rows: 60
})

In [26]:
ragas_rag_result = evaluate(final_ds, column_map={
    "question": "question",
    "contexts": "contexts_from_rag",
    "answer": "answer_with_rag"
})

ragas_rag_result

evaluating with [answer_relevancy]


100%|█████████████████████████████████████████████████████████████| 4/4 [02:24<00:00, 36.23s/it]


evaluating with [context_relavency]


100%|████████████████████████████████████████████████████████████| 4/4 [07:10<00:00, 107.54s/it]


evaluating with [faithfulness]


100%|████████████████████████████████████████████████████████████| 4/4 [14:47<00:00, 221.97s/it]


{'ragas_score': 0.3355, 'answer_relevancy': 0.9110, 'context_relavency': 0.1512, 'faithfulness': 0.8122}

In [29]:
ragas_free_result = ragas_result
ragas_free_result

{'ragas_score': 0.4915, 'answer_relevancy': 0.9374, 'context_relavency': 0.3158, 'faithfulness': 0.5347}

In [30]:
ragas_rag_result

{'ragas_score': 0.3355, 'answer_relevancy': 0.9110, 'context_relavency': 0.1512, 'faithfulness': 0.8122}

In [32]:
ragas_grounded_result

{'ragas_score': 0.5712, 'answer_relevancy': 0.9003, 'context_relavency': 0.3491, 'faithfulness': 0.7833}
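The three results above are easier to compare side by side. The sketch below copies the reported numbers into plain dictionaries and also checks an observation: each reported `ragas_score` appears to equal the harmonic mean of the three component metrics. That is inferred from the numbers in this notebook, not taken from the ragas documentation. The setup labels are descriptive names chosen here.

```python
from statistics import harmonic_mean

# Scores copied verbatim from the three evaluate() outputs above.
results = {
    "ground truth as context": {
        "ragas_score": 0.5712, "answer_relevancy": 0.9003,
        "context_relavency": 0.3491, "faithfulness": 0.7833,
    },
    "context-free": {
        "ragas_score": 0.4915, "answer_relevancy": 0.9374,
        "context_relavency": 0.3158, "faithfulness": 0.5347,
    },
    "RAG pipeline": {
        "ragas_score": 0.3355, "answer_relevancy": 0.9110,
        "context_relavency": 0.1512, "faithfulness": 0.8122,
    },
}

# Metric names spelled exactly as they appear in the output.
METRICS = ("answer_relevancy", "context_relavency", "faithfulness")

# Recompute a combined score per setup as the harmonic mean of the components.
recomputed = {
    name: harmonic_mean([scores[m] for m in METRICS])
    for name, scores in results.items()
}

for name, scores in results.items():
    print(
        f"{name:24s} reported={scores['ragas_score']:.4f} "
        f"harmonic_mean={recomputed[name]:.4f}"
    )
```

A harmonic mean is pulled toward its smallest component. That would explain why the RAG pipeline has the lowest overall score despite the highest faithfulness (0.8122): its `context_relavency` of 0.1512 dominates the combined value.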

### LangChain StringEvaluator

In [37]:
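from langchain.chains.base import Chain  # base class used below (was missing)
from langchain.evaluation import StringEvaluator
from ragas.metrics.base import Metric


# Work-in-progress stub: exposes ragas metrics through LangChain's
# StringEvaluator interface. Chain is a pydantic model, so `metrics` is
# declared as an annotated field rather than through @dataclass.
class LangchainEvaluator(StringEvaluator, Chain):
    metrics: list[Metric]

    @classmethod
    def from_metric(cls, metric: Metric):
        # Not implemented yet: build an evaluator around a single metric.
        ...

    def _evaluate_strings(
        self,
        *,
        prediction: str,
        reference: str | None = None,
        input: str,
    ) -> dict:
        # Not implemented yet: score `prediction` for `input` with self.metrics.
        ...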