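## Evaluate LLM applications

How do you evaluate an LLM application? This notebook builds a RetrievalQA chain over a product catalog, generates question-and-answer examples with an LLM, and then grades the chain's answers with QAEvalChain.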


In [1]:
import os

from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv())

In [2]:
from langchain.chat_models import ChatOpenAI
from langchain.chains import RetrievalQA
from langchain.document_loaders import CSVLoader
from langchain.vectorstores import DocArrayInMemorySearch
from langchain.indexes import VectorstoreIndexCreator
from IPython.display import display, Markdown

In [3]:
file = 'data/OutdoorClothingCatalog_1000.csv'
loader = CSVLoader(file_path=file)
data = loader.load()

In [4]:
index = VectorstoreIndexCreator(
    vectorstore_cls=DocArrayInMemorySearch
).from_loaders([loader])

In [6]:
llm = ChatOpenAI(temperature=0.0)

qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=index.vectorstore.as_retriever(),
    verbose=True,
    chain_type_kwargs = {
        "document_separator": "<<<<>>>>"
    }
)

In [14]:
data[0].page_content

": 0\nname: Women's Campside Oxfords\ndescription: This ultracomfortable lace-to-toe Oxford boasts a super-soft canvas, thick cushioning, and quality construction for a broken-in feel from the first time you put them on. \n\nSize & Fit: Order regular shoe size. For half sizes not offered, order up to next whole size. \n\nSpecs: Approx. weight: 1 lb.1 oz. per pair. \n\nConstruction: Soft canvas material for a broken-in feel and look. Comfortable EVA innersole with Cleansport NXT® antimicrobial odor control. Vintage hunt, fish and camping motif on innersole. Moderate arch contour of innersole. EVA foam midsole for cushioning and support. Chain-tread-inspired molded rubber outsole with modified chain-tread pattern. Imported. \n\nQuestions? Please contact us for any inquiries."

## Use a chain to generate Q&A examples

In [15]:
from langchain.evaluation.qa import QAGenerateChain

In [16]:
example_gen_chain = QAGenerateChain.from_llm(ChatOpenAI())

In [30]:
new_examples = example_gen_chain.apply_and_parse(
    [{"doc": data[i].page_content} for i in range(3)]
)

new_examples = [eg['qa_pairs'] for eg in new_examples]



In [31]:
new_examples[0]

{'query': "What is the weight of each pair of Women's Campside Oxfords?",
 'answer': "The weight of each pair of Women's Campside Oxfords is approximately 1 lb. 1 oz."}
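Because the examples are generated by an LLM, a quick shape check before evaluation can catch malformed entries early. The sketch below is a minimal illustration, assuming each example is a dict with non-empty `query` and `answer` strings, as in the output above. `validate_examples` is a hypothetical helper, not part of LangChain.

```python
def validate_examples(examples, required=("query", "answer")):
    """Return (index, problem) pairs for examples missing a required field.

    Hypothetical helper: assumes each example is a dict shaped like
    {'query': ..., 'answer': ...}.
    """
    problems = []
    for i, eg in enumerate(examples):
        for field in required:
            value = eg.get(field)
            # A field counts as missing if it is absent, not a string, or blank.
            if not isinstance(value, str) or not value.strip():
                problems.append((i, f"missing or empty {field!r}"))
    return problems


sample = [
    {"query": "What is the weight?", "answer": "About 1 lb. 1 oz."},
    {"query": "What are the dimensions?", "answer": ""},
]
print(validate_examples(sample))
```

An empty list means every example has both fields; anything else is worth inspecting by hand before it is used as ground truth.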

**Run the chain manually on one example**

In [32]:
qa.run(new_examples[0]['query'])



> Entering new RetrievalQA chain...


Retrying langchain.chat_models.openai.ChatOpenAI.completion_with_retry.<locals>._completion_with_retry in 4.0 seconds as it raised RateLimitError: Rate limit reached for default-gpt-3.5-turbo in organization org-fGiXRt9ovlGsNkfpE6yz98o3 on requests per min. Limit: 3 / min. Please try again in 20s. Contact us through our help center at help.openai.com if you continue to have issues. Please add a payment method to your account to increase your rate limit. Visit https://platform.openai.com/account/billing to add a payment method..



> Finished chain.


"The weight of each pair of Women's Campside Oxfords is approximately 1 lb. 1 oz."
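The retry message in the output above shows LangChain waiting and retrying after a `RateLimitError`. The pattern behind that message can be sketched in plain Python as retry with exponential backoff. This is only an illustration of the idea, not LangChain's actual implementation: `RateLimitError` here is a local stand-in class, and `call_with_backoff` and its parameters are hypothetical names.

```python
import time


class RateLimitError(Exception):
    """Local stand-in for a provider's rate-limit exception (assumption)."""


def call_with_backoff(fn, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call fn(), retrying on RateLimitError with exponentially growing delays."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except RateLimitError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            # Wait base_delay, then 2x, 4x, ... before the next attempt.
            sleep(base_delay * 2 ** attempt)
```

Passing `sleep` as a parameter makes the helper easy to test without real waiting. With a limit as low as 3 requests per minute, pacing calls or raising the account limit is likely to matter more than the retry policy itself.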

**Enable LangChain debugging** to see the inputs and outputs of each step in the chain

In [33]:
import langchain
langchain.debug = True

In [34]:
qa.run(new_examples[0]['query'])

[32;1m[1;3m[chain/start][0m [1m[1:chain:RetrievalQA] Entering Chain run with input:
[0m{
  "query": "What is the weight of each pair of Women's Campside Oxfords?"
}
[32;1m[1;3m[chain/start][0m [1m[1:chain:RetrievalQA > 3:chain:StuffDocumentsChain] Entering Chain run with input:
[0m[inputs]
[32;1m[1;3m[chain/start][0m [1m[1:chain:RetrievalQA > 3:chain:StuffDocumentsChain > 4:chain:LLMChain] Entering Chain run with input:
[0m{
  "question": "What is the weight of each pair of Women's Campside Oxfords?",
  "context": ": 0\nname: Women's Campside Oxfords\ndescription: This ultracomfortable lace-to-toe Oxford boasts a super-soft canvas, thick cushioning, and quality construction for a broken-in feel from the first time you put them on. \n\nSize & Fit: Order regular shoe size. For half sizes not offered, order up to next whole size. \n\nSpecs: Approx. weight: 1 lb.1 oz. per pair. \n\nConstruction: Soft canvas material for a broken-in feel and look. Comfortable EVA innersole wi

Retrying langchain.chat_models.openai.ChatOpenAI.completion_with_retry.<locals>._completion_with_retry in 4.0 seconds as it raised RateLimitError: Rate limit reached for default-gpt-3.5-turbo in organization org-fGiXRt9ovlGsNkfpE6yz98o3 on requests per min. Limit: 3 / min. Please try again in 20s. Contact us through our help center at help.openai.com if you continue to have issues. Please add a payment method to your account to increase your rate limit. Visit https://platform.openai.com/account/billing to add a payment method..


[36;1m[1;3m[llm/end][0m [1m[1:chain:RetrievalQA > 3:chain:StuffDocumentsChain > 4:chain:LLMChain > 5:llm:ChatOpenAI] [5.82s] Exiting LLM run with output:
[0m{
  "generations": [
    [
      {
        "text": "The weight of each pair of Women's Campside Oxfords is approximately 1 lb. 1 oz.",
        "generation_info": {
          "finish_reason": "stop"
        },
        "message": {
          "lc": 1,
          "type": "constructor",
          "id": [
            "langchain",
            "schema",
            "messages",
            "AIMessage"
          ],
          "kwargs": {
            "content": "The weight of each pair of Women's Campside Oxfords is approximately 1 lb. 1 oz.",
            "additional_kwargs": {}
          }
        }
      }
    ]
  ],
  "llm_output": {
    "token_usage": {
      "prompt_tokens": 620,
      "completion_tokens": 23,
      "total_tokens": 643
    },
    "model_name": "gpt-3.5-turbo"
  },
  "run": null
}
[36;1m[1;3m[chain/end][0m [1m[1:ch

"The weight of each pair of Women's Campside Oxfords is approximately 1 lb. 1 oz."
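The `llm_output` block in the debug trace reports token usage, which is useful when judging how much of each call is spent on retrieved context. The sketch below assumes a dict shaped like the `llm_output` shown above; `summarize_usage` is a hypothetical helper, not a LangChain function.

```python
def summarize_usage(llm_output):
    """Summarize token counts from a dict shaped like the debug trace's llm_output."""
    usage = llm_output.get("token_usage", {})
    prompt = usage.get("prompt_tokens", 0)
    completion = usage.get("completion_tokens", 0)
    total = prompt + completion
    return {
        "prompt": prompt,
        "completion": completion,
        "total": total,
        # Fraction of all tokens that went into the prompt (0.0 if no tokens).
        "prompt_share": prompt / total if total else 0.0,
    }


# Values copied from the debug output above.
trace_output = {
    "token_usage": {"prompt_tokens": 620, "completion_tokens": 23, "total_tokens": 643},
    "model_name": "gpt-3.5-turbo",
}
print(summarize_usage(trace_output))
```

With the "stuff" chain type, every retrieved document is placed in the prompt, which is consistent with prompt tokens (620) far outnumbering completion tokens (23) in this run.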

## Automatic evaluation

In [35]:
langchain.debug = False

In [37]:
# Run the RetrievalQA chain on every example; each prediction keeps the
# example's 'query' and 'answer' and adds the chain's 'result'
predictions = qa.apply(new_examples)



> Entering new RetrievalQA chain...

> Finished chain.


> Entering new RetrievalQA chain...

> Finished chain.


> Entering new RetrievalQA chain...

> Finished chain.


In [38]:
from langchain.evaluation.qa import QAEvalChain

In [39]:
llm = ChatOpenAI(temperature=0)

In [40]:
eval_chain = QAEvalChain.from_llm(llm)

In [48]:
# Grade each prediction against its reference answer with QAEvalChain
graded_outputs = eval_chain.evaluate(new_examples, predictions)
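Once the grades are available, a single accuracy figure is often easier to track than a list of per-example verdicts. The sketch below assumes each graded output is a dict whose `results` key holds the grade string (`CORRECT` or `INCORRECT`), which is the key this notebook's printing loop reads. Other LangChain versions may use a different key, so it is left as a parameter. `summarize_grades` is a hypothetical helper.

```python
from collections import Counter


def summarize_grades(graded_outputs, key="results"):
    """Count grade labels and return (counts, accuracy).

    Assumes each item is a dict whose `key` entry is a grade string.
    """
    # Normalize whitespace and case; compare labels exactly, since
    # "INCORRECT" contains "CORRECT" as a substring.
    grades = [g[key].strip().upper() for g in graded_outputs]
    counts = Counter(grades)
    accuracy = counts["CORRECT"] / len(grades) if grades else 0.0
    return counts, accuracy


sample_grades = [
    {"results": "CORRECT"},
    {"results": "CORRECT"},
    {"results": " INCORRECT"},
]
counts, accuracy = summarize_grades(sample_grades)
print(counts, accuracy)
```

The sample data here is illustrative only. With a handful of examples the accuracy figure is coarse, so it is best read alongside the individual question, answer, and grade listings.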

In [47]:
# Print each question with the reference answer, the chain's answer, and the grade
for i, (pred, graded) in enumerate(zip(predictions, graded_outputs)):
    print(f"E.g. {i}")
    print("Q: " + pred['query'])
    print("Real A: " + pred['answer'])
    print("Pred A: " + pred['result'])
    print("Pred Grade: " + graded['results'])
    print()

E.g. 0
Q: What is the weight of each pair of Women's Campside Oxfords?
Real A: The weight of each pair of Women's Campside Oxfords is approximately 1 lb. 1 oz.
Pred A: The weight of each pair of Women's Campside Oxfords is approximately 1 lb. 1 oz.
Pred Grade: CORRECT

E.g. 1
Q: What are the dimensions of the small size Recycled Waterhog Dog Mat, Chevron Weave?
Real A: The dimensions of the small size Recycled Waterhog Dog Mat, Chevron Weave are 18" x 28".
Pred A: The dimensions of the small size Recycled Waterhog Dog Mat, Chevron Weave are 18" x 28".
Pred Grade: CORRECT

E.g. 2
Q: What features does the Infant and Toddler Girls' Coastal Chill Swimsuit have?
Real A: The swimsuit has bright colors, ruffles, exclusive whimsical prints, four-way-stretch and chlorine-resistant fabric, UPF 50+ rated sun protection, crossover no-slip straps, a fully lined bottom, and is machine washable and line dry recommended.
Pred A: The Infant and Toddler Girls' Coastal Chill Swimsuit features bright col