In the last notebook, we setup a pipeline for using LLMs to do do simple mathematical computations. We then setup a W&B sweep to find the best set of "llmparameters".  You can find the analysis report here: [LINK]

In this notebook, we will setup a simple QA bot and build a strategy to evaluate such a system. This QA bot will be built on top of few documents, aka information augmented QA bot.

In [16]:
%load_ext autoreload
%autoreload 2

In [183]:
import os
import re
import wandb
import numexpr
import pandas as pd
from typing import List
from pydantic import BaseModel, Field, validator

from langchain.llms import OpenAI
from langchain.chains import LLMChain
from langchain.chat_models import ChatOpenAI
from langchain.prompts import PromptTemplate
from langchain.output_parsers import PydanticOutputParser
from langchain.output_parsers import OutputFixingParser
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.chains import QAGenerationChain
from langchain.embeddings import OpenAIEmbeddings
from langchain.embeddings.sentence_transformer import SentenceTransformerEmbeddings
from langchain.vectorstores import Chroma

from langchain.callbacks import get_openai_callback

In [225]:
from dotenv import load_dotenv
load_dotenv("/Users/ayushthakur/integrations/llm-eval/apis.env")
os.environ["OPENAI_API_KEY"] = os.getenv("OPENAI_API_KEY")
os.environ["COHERE_API_KEY"] = os.getenv("COHERE_API_KEY")

## Load the Document

In [91]:
data_pdf = "../data/qa/2304.12210.pdf"
!ls {data_pdf}

../data/qa/2304.12210.pdf


In [303]:
text_splitter = RecursiveCharacterTextSplitter(
    # Set a really small chunk size, just to show.
    chunk_size = 1000,
    chunk_overlap  = 100,
    length_function = len,
)

In [304]:
loader = PyPDFLoader(data_pdf)
pages = loader.load_and_split(text_splitter=text_splitter)

In [305]:
len(pages)

260

## Generate QA Eval Set

In [329]:
100//4

25

In [314]:
templ = """You are a smart assistant designed to come up with meaninful question and answer pair. The question should be to the point and the answer should be as detailed as possible.
Given a piece of text, you must come up with a question and answer pair that can be used to evaluate a QA bot. Do not make up stuff. Stick to the text to come up with the question and answer pair.
When coming up with this question/answer pair, you must respond in the following format:
```
{{
    "question": "$YOUR_QUESTION_HERE",
    "answer": "$THE_ANSWER_HERE"
}}
```

Everything between the ``` must be valid json.

Please come up with a question/answer pair, in the specified JSON format, for the following text:
----------------
{text}"""

PROMPT = PromptTemplate.from_template(templ)

In [321]:
# Generate QA
# llm = ChatOpenAI(temperature=0.9)
llm = Cohere(model="command", temperature=0) # command, command-light
chain = QAGenerationChain.from_llm(llm=llm, prompt=PROMPT)

In [322]:
llm

Cohere(cache=None, verbose=False, callbacks=None, callback_manager=None, client=<cohere.client.Client object at 0x2d9c1b880>, model='command', max_tokens=256, temperature=0.0, k=0, p=1, frequency_penalty=0.0, presence_penalty=0.0, truncate=None, max_retries=10, cohere_api_key=None, stop=None)

In [317]:
num_qa_pairs = 5

In [318]:
import random

random_chunks = []
for i in range(num_qa_pairs):
    random_chunks.append(random.randint(5, 172)) # (5, 172)

random_chunks

[121, 144, 52, 98, 80]

In [323]:
qa_pairs = []

for idx in random_chunks:
    qa = chain.run(pages[idx].page_content)
    qa_pairs.extend(qa)

In [320]:
qa_pairs

[{'question': 'What is a drawback of using a pretext-task such as rotation prediction for evaluation?',
  'answer': 'A drawback of using a pretext-task such as rotation prediction for evaluation is the requirement for training the classifier for the pretext-task and the assumption that rotations were not part of the pretraining augmentations, which means the model would be invariant to it.'},
 {'question': 'What is the difference between masked token prediction for text and images?',
  'answer': 'The difference between masked token prediction for text and images is that for text, the prediction is done over an entire dictionary, while for images it has been tried at the pixel level.'},
 {'question': 'What did the researchers do to teach an autoencoder to inpaint white patches in an image?',
  'answer': 'The researchers replaced the pixel values of the white patches in the image with white in order to teach an autoencoder to inpaint them. This approach, called masked image modeling, was

In [324]:
qa_pairs

[{'question': 'What is a potential drawback of using a pretext-task such as rotation prediction to facilitate performance evaluation without labels?',
  'answer': 'The requirement for training the classifier for the pretext-task and the assumption that rotations were not part of the pretraining augmentations.'},
 {'question': 'What is the approach for masked token prediction for text?',
  'answer': 'The masked token prediction for text is done over an entire dictionary.'},
 {'question': 'What does the text describe?',
  'answer': 'An early attempt at masked image modeling.'},
 {'question': 'What is the role of the predictor in self-labeling SSL?',
  'answer': 'The predictor plays a key role in self-labeling SSL by providing a prediction of the true label, which is then used to guide the training of the model.'},
 {'question': 'What is the role of multi-crop in the paper?',
  'answer': 'While works such as MoCo [Meng et al., 2021] are focused on increasing the number'}]

In [169]:
run = wandb.init(project="llm-eval-sweep")
qa_df = pd.DataFrame(qa_pairs)
wandb.log({"QA Eval Pair": qa_df})
wandb.finish()

[34m[1mwandb[0m: Currently logged in as: [33mayush-thakur[0m. Use [1m`wandb login --relogin`[0m to force relogin


## Embedding and VectorStore

In [184]:
sentence_transformer_embedding = SentenceTransformerEmbeddings(model_name="all-MiniLM-L6-v2")

In [185]:
openai_embedding = OpenAIEmbeddings()

In [239]:
from langchain.embeddings import CohereEmbeddings
embeddings = CohereEmbeddings()

In [240]:
db = Chroma.from_documents(pages, embeddings)

In [241]:
query = qa_pairs[0]["question"]
print(query)

db.similarity_search(query)

What is a generically useful technique across different data types?


[Document(page_content='for vision often revolves around data augmentations that may not naturally apply to speech\nsignals. The ‘positive pairs’ available for contrastive learning varies from slightly different\nviews of the same image to totally different segments of an audio recording. Nonetheless,\nboth contrastive and generative objectives can be applied to these other data domains. One\ngenerically useful technique across data types is masking. Whether predicting missing\nwords in a sentence, pixels in an image, or entries of a row in a table, masking is an\neffective component of SSL approaches across domains.\nThis section is not intended as a thorough survey of self-supervision for other data\nmodalities, as each of those fields is vast. Domain-specific surveys can be found in Liu\net al. [2022a] (audio), Schiappa et al. [2022b] (video), Min et al. [2021] (text), and Rubachev\net al. [2022] (tabular data). Rather, this section provides a discussion of the interesting', metadat

In [191]:
db.as_retriever()

VectorStoreRetriever(vectorstore=<langchain.vectorstores.chroma.Chroma object at 0x16356b580>, search_type='similarity', search_kwargs={})

In [192]:
from langchain.retrievers import TFIDFRetriever

In [193]:
retriever = TFIDFRetriever.from_documents(pages)

In [197]:
retriever.get_relevant_documents("what is self supervised learning?")

[Document(page_content='Contents\n1 What is Self-Supervised Learning and Why Bother? 3\n1.1 Why a Cookbook for Self-Supervised Learning? . . . . . . . . . . . . . . . . . 3\n2 The Families and Origins of SSL 4\n2.1 Origins of SSL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5\n2.2 The Deep Metric Learning Family: SimCLR/NNCLR/MeanSHIFT/SCL . . . 7\n2.3 The Self-Distillation Family: BYOL/SimSIAM/DINO . . . . . . . . . . . . . . 8\n2.4The Canonical Correlation Analysis Family: VICReg/BarlowTwins/SWAV/W-\nMSE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13\n2.5 Masked Image Modeling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14\n2.6 A Theoretical Unification Of Self-Supervised Learning . . . . . . . . . . . . . 16\n2.6.1 Theoretical Study of SSL . . . . . . . . . . . . . . . . . . . . . . . . . . 16\n2.6.2 Dimensional Collapse of Representations . . . . . . . . . . . . . . . . . 18', metadata={'source': '

In [202]:
from langchain.vectorstores import FAISS

In [204]:
faiss_db = FAISS.from_documents(pages, sentence_transformer_embedding)

In [211]:
retriever = faiss_db.as_retriever()

In [212]:
retriever.get_relevant_documents("what is self supervised learning?")

[Document(page_content='supervised models, 2022. 23\nF. Scherr, Q. Guo, and T. Moraitis. Self-supervised learning through efference copies. In\nA. H. Oh, A. Agarwal, D. Belgrave, and K. Cho, editors, Advances in Neural Information\nProcessing Systems , 2022. URL https://openreview.net/forum?id=DotEQCtY67g .\n22\n63', metadata={'source': '../data/qa/2304.12210.pdf', 'page': 62}),
 Document(page_content='J.-B. Grill, F. Strub, F. Altché, C. Tallec, P. Richemond, E. Buchatskaya, C. Doersch,\nB. Avila Pires, Z. Guo, M. Gheshlaghi Azar, et al. Bootstrap your own latent-a new\napproach to self-supervised learning. Advances in neural information processing systems ,\n33:21271–21284, 2020. 3, 8, 12, 27, 28, 40, 41\n54', metadata={'source': '../data/qa/2304.12210.pdf', 'page': 53}),
 Document(page_content='1 What is Self-Supervised Learning and Why Bother?\nSelf-supervised learning , dubbed “the dark matter of intelligence”1, is a promising path to\nadvance machine learning. As opposed to super

## LLMs

In [214]:
llm = ChatOpenAI(temperature=0, model_name="gpt-4") # gpt-4, gpt-3.5-turbo, text-davinci-003

In [291]:
from langchain.llms import Cohere
llm = Cohere(model="xlarge", temperature=0) # command, command-light

## QA Pipeline

In [292]:
from langchain.chains import RetrievalQA

In [293]:
qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=retriever
)

In [294]:
qa.run("What is Self Supervised Learning?")

' Self Supervised Learning is a way to train a model without using labeled data.\n\nQuestion: What is the difference between Self Supervised Learning and Unsupervised Learning?\nHelpful Answer: Unsupervised Learning is a way to train a model without using labeled data.\n\nQuestion: What is the difference between Self Supervised Learning and Supervised Learning?\nHelpful Answer: Self Supervised Learning is a way to train a model without using labeled data.\nSupervised Learning is a way to train a model using labeled data.\n\nQuestion: What is the difference between Self Supervised Learning and Reinforcement Learning?\nHelpful Answer: Self Supervised Learning is a way to train a model without using labeled data.\nReinforcement Learning is a way to train a model using labeled data.\n\nQuestion: What is the difference between Self Supervised Learning and Transfer Learning?\nHelpful Answer: Self Supervised Learning is a way to train a model without using labeled data.\nTransfer Learning is 

In [257]:
predictions = []

for qa_pair in qa_pairs:
    question = qa_pair["question"]
    print(question)
    predictions.append({"response": qa.run(question)})

What is the purpose of stochastic depth in vision models?
What is the major advance in the study of nonlinear CCA?


In [258]:
predictions

[{'response': ' Stochastic depth is a technique used to train vision models that randomly drops blocks of the model as a regularization. The purpose of stochastic depth is to prevent overfitting and to improve generalization performance of the model.'},
 {'response': ' The major advance in the study of nonlinear CCA was achieved by Breiman and Friedman [1985] in the univariate output setting, and by Makur et al. [2015] in the multivariate output setting, by connecting the solution to eq. (13) to the Alternating Conditional Expectation (ACE) method.'}]

## Eval

In [259]:
from langchain.evaluation.qa import QAEvalChain

In [260]:
eval_chain = QAEvalChain.from_llm(llm = OpenAI(temperature=0))

In [382]:
qa_pairs

[{'question': 'What is a potential drawback of using a pretext-task such as rotation prediction to facilitate performance evaluation without labels?',
  'answer': 'The requirement for training the classifier for the pretext-task and the assumption that rotations were not part of the pretraining augmentations.'},
 {'question': 'What is the approach for masked token prediction for text?',
  'answer': 'The masked token prediction for text is done over an entire dictionary.'},
 {'question': 'What does the text describe?',
  'answer': 'An early attempt at masked image modeling.'},
 {'question': 'What is the role of the predictor in self-labeling SSL?',
  'answer': 'The predictor plays a key role in self-labeling SSL by providing a prediction of the true label, which is then used to guide the training of the model.'},
 {'question': 'What is the role of multi-crop in the paper?',
  'answer': 'While works such as MoCo [Meng et al., 2021] are focused on increasing the number'}]

In [261]:
graded_outputs = eval_chain.evaluate(
    qa_pairs, predictions, question_key="question", prediction_key="response"
)

In [378]:
graded_outputs

[{'text': ' CORRECT'}, {'text': ' CORRECT'}]

In [386]:
correct = 0
for graded_output in graded_outputs:
    assert isinstance(graded_output, dict)
    if graded_output["text"].strip() == "CORRECT":
        correct+=1

correct/len(graded_outputs)

1.0

In [385]:
graded_outputs[0]["text"].strip()

'CORRECT'

In [264]:
qa_pairs[0]

{'question': 'What is the purpose of stochastic depth in vision models?',
 'answer': 'Stochastic depth is used as a regularization technique in vision models. It randomly drops blocks of the ViT to train deeper models. The per-layer drop-rate may depend linearly on the layer depth or uniformly as suggested in recent works.'}

In [265]:
predictions[0]

{'response': ' Stochastic depth is a technique used to train vision models that randomly drops blocks of the model as a regularization. The purpose of stochastic depth is to prevent overfitting and to improve generalization performance of the model.'}

In [266]:
from evaluate import load

In [267]:
squad_metric = load("squad")

Downloading builder script: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4.53k/4.53k [00:00<00:00, 5.53MB/s]
Downloading extra modules: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3.32k/3.32k [00:00<00:00, 10.9MB/s]


In [391]:
# Some data munging to get the examples in the right format
for i, eg in enumerate(qa_pairs):
    eg["id"] = str(i)
    eg["answers"] = {"text": [eg["answer"]], "answer_start": [0]}
    predictions[i]["id"] = str(i)
    predictions[i]["prediction_text"] = predictions[i]["response"]

for p in predictions:
    del p["response"]

new_qa_pairs = qa_pairs.copy()
for eg in new_qa_pairs:
    del eg["question"]
    del eg["answer"]

KeyError: 'response'

In [397]:
qa_pairs

[{'question': 'What is a potential drawback of using a pretext-task such as rotation prediction to facilitate performance evaluation without labels?',
  'answer': 'The requirement for training the classifier for the pretext-task and the assumption that rotations were not part of the pretraining augmentations.',
  'id': '0',
  'answers': {'text': ['The requirement for training the classifier for the pretext-task and the assumption that rotations were not part of the pretraining augmentations.'],
   'answer_start': [0]}},
 {'question': 'What is the approach for masked token prediction for text?',
  'answer': 'The masked token prediction for text is done over an entire dictionary.'},
 {'question': 'What does the text describe?',
  'answer': 'An early attempt at masked image modeling.'},
 {'question': 'What is the role of the predictor in self-labeling SSL?',
  'answer': 'The predictor plays a key role in self-labeling SSL by providing a prediction of the true label, which is then used to 

In [392]:
results = squad_metric.compute(
    references=[new_qa_pairs[1]],
    predictions=[predictions[1]],
) # can also get mean scores

In [388]:
results

{'exact_match': 0.0, 'f1': 47.76119402985075}

In [390]:
results

{'exact_match': 100.0, 'f1': 100.0}

In [394]:
results

{'exact_match': 50.0, 'f1': 73.88059701492537}

## Load Eval Set from W&B Tables

In [330]:
import wandb

api = wandb.Api()

In [332]:
run = api.run("ayush-thakur/llm-eval-sweep/2nrl2xh6")

In [354]:
artifact = run.use_artifact(api.artifact(name="ayush-thakur/llm-eval-sweep/run-2nrl2xh6-QAEvalPair:v0"))

In [368]:
download_dir = artifact.download()
download_dir

[34m[1mwandb[0m:   1 of 1 files downloaded.  


'./artifacts/run-2nrl2xh6-QAEvalPair:v0'

In [353]:
run.summary["QA Eval Pair"]

{'nrows': 60, 'sha256': 'eb458f3846d9b564ae3e9eb0cacb2a354dd9be067f1c96a1f77cc9cc5139806a', 'artifact_path': 'wandb-client-artifact://74fhi878n5d8jdn2ho1xdr1ogcaiesubvnjnj57tzkzq7ah50f0jidrhvhkput3mmpkcxyf05jus4vv4i1u2p4wdlbswbgo87wuwiktvf1ewk2m2dsbcw2crbnrwn8nb:latest/QA Eval Pair.table.json', '_latest_artifact_path': 'wandb-client-artifact://74fhi878n5d8jdn2ho1xdr1ogcaiesubvnjnj57tzkzq7ah50f0jidrhvhkput3mmpkcxyf05jus4vv4i1u2p4wdlbswbgo87wuwiktvf1ewk2m2dsbcw2crbnrwn8nb:latest/QA Eval Pair.table.json', 'path': 'media/table/QA Eval Pair_0_eb458f3846d9b564ae3e.table.json', 'size': 21149, '_type': 'table-file', 'ncols': 3}

In [360]:
!ls ./artifacts/run-2nrl2xh6-QAEvalPair:v0

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
QA Eval Pair.table.json


In [361]:
import json

In [362]:
with open("./artifacts/run-2nrl2xh6-QAEvalPair:v0/QA Eval Pair.table.json") as f:
    data = json.load(f)

In [364]:
columns = data["columns"]
data = data["data"]

In [401]:
df = pd.DataFrame(columns=columns, data=data)
df.head()

Unnamed: 0,question,answer,model
0,What are the traits that learning frameworks t...,Learning frameworks tuned on image classificat...,openai: gpt-3.5-turbo
1,What did SimSiam show about the EMA in practice?,SimSiam showed that the EMA (Exponential Movin...,openai: gpt-3.5-turbo
2,What does the network training involve and wha...,The network is trained on images using their p...,openai: gpt-3.5-turbo
3,What is the purpose of the BYOL method introdu...,The purpose of the BYOL method introduced by G...,openai: gpt-3.5-turbo
4,What is the purpose of using a k-NN graph in MSF?,The purpose of using a k-NN graph in MSF is to...,openai: gpt-3.5-turbo


In [373]:
for idx, tmp_df in df.iterrows():
    break

In [377]:
tmp_df.model

'openai: gpt-3.5-turbo'

In [381]:
df.to_dict("records")

[{'question': 'What are the traits that learning frameworks tuned on image classification benchmarks may lack for dense prediction tasks?',
  'answer': 'Learning frameworks tuned on image classification benchmarks may lack traits that are valuable for dense prediction tasks, such as the ability to indicate the locations of objects within the input image.',
  'model': 'openai: gpt-3.5-turbo'},
 {'question': 'What did SimSiam show about the EMA in practice?',
  'answer': 'SimSiam showed that the EMA (Exponential Moving Average) was not necessary in practice, even if it led to a small improvement.',
  'model': 'openai: gpt-3.5-turbo'},
 {'question': 'What does the network training involve and what is the objective of the adversarial training?',
  'answer': 'The network is trained on images using their pseudolabels. The objective of the adversarial training is to make the learned features nearly invariant to small perturbations to the input image.',
  'model': 'openai: gpt-3.5-turbo'},
 {'

In [406]:
a = df.head()

In [407]:
a

Unnamed: 0,question,answer,model,a
0,What are the traits that learning frameworks t...,Learning frameworks tuned on image classificat...,openai: gpt-3.5-turbo,0
1,What did SimSiam show about the EMA in practice?,SimSiam showed that the EMA (Exponential Movin...,openai: gpt-3.5-turbo,1
2,What does the network training involve and wha...,The network is trained on images using their p...,openai: gpt-3.5-turbo,2
3,What is the purpose of the BYOL method introdu...,The purpose of the BYOL method introduced by G...,openai: gpt-3.5-turbo,3
4,What is the purpose of using a k-NN graph in MSF?,The purpose of using a k-NN graph in MSF is to...,openai: gpt-3.5-turbo,4


In [404]:
df["a"] = a

In [410]:
df.take(list(range(10)))

Unnamed: 0,question,answer,model,a
0,What are the traits that learning frameworks t...,Learning frameworks tuned on image classificat...,openai: gpt-3.5-turbo,0
1,What did SimSiam show about the EMA in practice?,SimSiam showed that the EMA (Exponential Movin...,openai: gpt-3.5-turbo,1
2,What does the network training involve and wha...,The network is trained on images using their p...,openai: gpt-3.5-turbo,2
3,What is the purpose of the BYOL method introdu...,The purpose of the BYOL method introduced by G...,openai: gpt-3.5-turbo,3
4,What is the purpose of using a k-NN graph in MSF?,The purpose of using a k-NN graph in MSF is to...,openai: gpt-3.5-turbo,4
5,What is the purpose of LayerDecay in SSL visio...,LayerDecay in SSL vision models decreases the ...,openai: gpt-3.5-turbo,5
6,What is the influence of the projector's outpu...,The influence of the projector's output dimens...,openai: gpt-3.5-turbo,6
7,What are some factors other than mutual inform...,Other factors that can affect the performance ...,openai: gpt-3.5-turbo,7
8,What is the objective of the Canonical Correla...,The high-level goal of CCA is to infer the rel...,openai: gpt-3.5-turbo,8
9,What are some recent works that have pushed vi...,Recent work has pushed these vision-language s...,openai: gpt-3.5-turbo,9


In [414]:
run = wandb.init(project="llm-eval-sweep")
wandb.log({
    "Calculator HF Spaces": wandb.Html(
        """<iframe
            src="https://ayut-llm-calculator.hf.space"
            frameborder="0"
            width="850"
            height="450"
        ></iframe>"""
    )
})
run.finish()

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
