# LangChain: Evaluation
When building a complex application with an LLM, one of the important but sometimes tricky steps is evaluation: how do you know how well your application is performing?
And if you decide to change your implementation, for example the parameters of your system or the strategy for using a vector database, how do you know whether you're making it better or worse?

In this notebook, we will dive into some frameworks for evaluating an LLM-based application, as well as some tools to help you do that.
## Outline:

* Example generation
* Manual evaluation (and debugging)
* LLM-assisted evaluation

In [1]:
import openai
from dotenv import load_dotenv, find_dotenv
import os

_ = load_dotenv(find_dotenv())  # add .env to .gitignore
openai.api_key = os.getenv("OPENAI_API_KEY")


## Create our Q&A application

First, we need the chain or application that we're going to evaluate. We'll use the document question answering chain from the previous lesson.

In [9]:
from langchain.chains import RetrievalQA
from langchain.chat_models import ChatOpenAI
from langchain.document_loaders import CSVLoader
from langchain.indexes import VectorstoreIndexCreator
from langchain.vectorstores import DocArrayInMemorySearch

In [7]:
file = 'OutdoorClothingCatalog_1000.csv'
loader = CSVLoader(file_path=file, encoding='utf-8')
data = loader.load()

In [11]:
# Create index
index = VectorstoreIndexCreator(
    vectorstore_cls=DocArrayInMemorySearch
).from_loaders([loader])

Then, we're going to create the retrieval QA chain by specifying the language model, the chain type, the retriever, and the verbosity of what gets printed out.

In [12]:

llm = ChatOpenAI(temperature = 0.0)
qa = RetrievalQA.from_chain_type(
    llm=llm, 
    chain_type="stuff", 
    retriever=index.vectorstore.as_retriever(), 
    verbose=True,
    chain_type_kwargs = {
        "document_separator": "<<<<>>>>>"
    }
)
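To make the `chain_type="stuff"` and `document_separator` settings more concrete, here is a minimal plain-Python sketch of the idea: the retrieved documents are joined with the separator and placed into a single prompt as the context. The `build_stuff_prompt` helper and the prompt wording are illustrative assumptions, not LangChain's actual template.

```python
# Illustrative sketch only: mimics how a "stuff"-style chain packs all
# retrieved documents into one prompt. The helper name and the prompt
# wording are hypothetical, not taken from LangChain.
from typing import List


def build_stuff_prompt(question: str, docs: List[str],
                       separator: str = "<<<<>>>>>") -> str:
    # Join every retrieved document into a single context string.
    context = separator.join(docs)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Helpful Answer:"
    )


retrieved = [
    "name: Cozy Comfort Pullover Set, Stripe",
    "name: Ultra-Lofty 850 Stretch Down Hooded Jacket",
]
prompt = build_stuff_prompt("Which item is a jacket?", retrieved)
print(prompt)
```

A distinctive separator such as `<<<<>>>>>` makes it easy to see where one document ends and the next begins when you inspect the prompt later.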

The first thing we need to do is figure out which data points we want to evaluate the chain on.
There are a few different methods that we're going to cover for doing this:

## Example Generation

### 1. Manual Generation (not practical)

The first method is the simplest: we come up with data points that we think are good examples ourselves. To do that, we can look at some of the data and write example questions along with ground truth answers that we can later use for evaluation.

In [23]:
data[10]

Document(page_content=": 10\nname: Cozy Comfort Pullover Set, Stripe\ndescription: Perfect for lounging, this striped knit set lives up to its name. We used ultrasoft fabric and an easy design that's as comfortable at bedtime as it is when we have to make a quick run out.\n\nSize & Fit\n- Pants are Favorite Fit: Sits lower on the waist.\n- Relaxed Fit: Our most generous fit sits farthest from the body.\n\nFabric & Care\n- In the softest blend of 63% polyester, 35% rayon and 2% spandex.\n\nAdditional Features\n- Relaxed fit top with raglan sleeves and rounded hem.\n- Pull-on pants have a wide elastic waistband and drawstring, side pockets and a modern slim leg.\n\nImported.", metadata={'source': 'OutdoorClothingCatalog_1000.csv', 'row': 10})

In [24]:
data[11]

Document(page_content=': 11\nname: Ultra-Lofty 850 Stretch Down Hooded Jacket\ndescription: This technical stretch down jacket from our DownTek collection is sure to keep you warm and comfortable with its full-stretch construction providing exceptional range of motion. With a slightly fitted style that falls at the hip and best with a midweight layer, this jacket is suitable for light activity up to 20° and moderate activity up to -30°. The soft and durable 100% polyester shell offers complete windproof protection and is insulated with warm, lofty goose down. Other features include welded baffles for a no-stitch construction and excellent stretch, an adjustable hood, an interior media port and mesh stash pocket and a hem drawcord. Machine wash and dry. Imported.', metadata={'source': 'OutdoorClothingCatalog_1000.csv', 'row': 11})

And from these details, we can create some example query and answer pairs:

In [25]:
examples = [
    {
        "query": "Do the Cozy Comfort Pullover Set\
        have side pockets?",
        "answer": "Yes"
    },
    {
        "query": "What collection is the Ultra-Lofty \
        850 Stretch Down Hooded Jacket from?",
        "answer": "The DownTek collection"
    }
]

But this doesn't scale well. We can automate it with language models themselves.
LangChain has a chain that does exactly that: the LLM takes in documents and creates a question-answer pair from each one.

### 2. LLM-Generated examples

In [26]:
from langchain.evaluation.qa import QAGenerateChain

In [27]:
# Create a QA pair for each document
example_gen_chain = QAGenerateChain.from_llm(ChatOpenAI())

In [28]:
new_examples = example_gen_chain.apply_and_parse( # to apply output parser to the results to get dict not str
    [{"doc": t} for t in data[:5]]
)
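The "parse" step turns the model's raw text into a dictionary with `query` and `answer` keys. As a rough sketch of that idea, assuming the model replies in a `QUESTION: ... ANSWER: ...` layout, a regular expression is enough. The pattern and the `parse_qa_text` helper below are assumptions for illustration, not the library's own parser.

```python
# Illustrative sketch: parse "QUESTION: ...\nANSWER: ..." text into a dict.
# The pattern and function name are hypothetical, not LangChain's parser.
import re
from typing import Dict

QA_PATTERN = re.compile(r"QUESTION:\s*(.*?)\s*\n+ANSWER:\s*(.*)", re.DOTALL)


def parse_qa_text(text: str) -> Dict[str, str]:
    match = QA_PATTERN.search(text)
    if match is None:
        raise ValueError("Text does not contain a QUESTION/ANSWER pair")
    return {
        "query": match.group(1).strip(),
        "answer": match.group(2).strip(),
    }


raw = (
    "QUESTION: What collection is the jacket from?\n"
    "ANSWER: The DownTek collection"
)
parsed = parse_qa_text(raw)
print(parsed)
```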

In [29]:
new_examples

[{'query': "What is the approximate weight of the Women's Campside Oxfords per pair?",
  'answer': "The approximate weight of the Women's Campside Oxfords per pair is 1 lb.1 oz."},
 {'query': 'What are the dimensions of the small and medium sizes for the Recycled Waterhog Dog Mat?',
  'answer': 'The small size of the Recycled Waterhog Dog Mat measures 18" x 28" and the medium size measures 22.5" x 34.5".'},
 {'query': "What features does the Infant and Toddler Girls' Coastal Chill Swimsuit have?",
  'answer': 'The swimsuit has bright colors, ruffles, exclusive whimsical prints, four-way-stretch and chlorine-resistant fabric, UPF 50+ rated fabric for sun protection, crossover no-slip straps, a fully lined bottom, and is machine washable and should be line dried for best results.'},
 {'query': 'What is the fabric composition of the Refresh Swimwear V-Neck Tankini Contrasts?',
  'answer': 'The body of the swimwear is made of 82% recycled nylon and 18% Lycra spandex, while the lining is ma

We just generated a bunch of question-answer pairs without writing them ourselves.

In [30]:
# Combine examples 
examples += new_examples
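Before using the combined list, it can help to check that every example has the same shape. The sketch below uses a hypothetical `normalize_examples` helper that verifies the `query` and `answer` keys and collapses runs of whitespace, such as the indentation that the backslash line continuations in the hand-written queries leave inside the strings. It also accepts pairs nested under a `qa_pairs` key, which some LangChain releases appear to return; treat that as an assumption to verify against your installed version.

```python
# Illustrative sketch: validate and tidy a list of Q&A examples.
# normalize_examples is a hypothetical helper, not a LangChain API.
from typing import Dict, List


def normalize_examples(raw_examples: List[dict]) -> List[Dict[str, str]]:
    cleaned = []
    for item in raw_examples:
        # Assumption: some versions nest the pair under "qa_pairs".
        pair = item.get("qa_pairs", item)
        if "query" not in pair or "answer" not in pair:
            raise KeyError(f"Example is missing 'query' or 'answer': {item!r}")
        cleaned.append({
            # Collapse any run of whitespace into a single space.
            "query": " ".join(pair["query"].split()),
            "answer": pair["answer"].strip(),
        })
    return cleaned


sample = [
    {"query": "Do the Cozy Comfort Pullover Set        have side pockets?",
     "answer": "Yes"},
    {"qa_pairs": {"query": "What collection is the jacket from?",
                  "answer": "The DownTek collection"}},
]
clean = normalize_examples(sample)
print(clean)
```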

So we have these examples now, but how exactly do we evaluate what's going on?

* The first thing we want to do is just run an example through the chain and take a look at the output it produces.

In [31]:
qa.run(examples[0]["query"])



> Entering new RetrievalQA chain...

> Finished chain.


'The Cozy Comfort Pullover Set, Stripe has side pockets on the pull-on pants.'

But it's oftentimes **not enough** to just look at the final answer to understand what is or could be going wrong in the chain.

We want to see what is actually happening inside the chain. For example, 
* What is the actual prompt that's going into the language model? 
* What are the documents that it retrieves? 
* If this were a more complex chain with multiple steps in it, what are the intermediate results?

To do that, we can use ``langchain.debug``:

## Manual Evaluation
### 1. Manual Evaluation (not practical)

If we set ``langchain.debug = True`` and rerun the same example as above, it prints out a lot more information:

In [32]:
import langchain
langchain.debug = True

In [33]:
qa.run(examples[0]["query"])

[32;1m[1;3m[chain/start][0m [1m[1:chain:RetrievalQA] Entering Chain run with input:
[0m{
  "query": "Do the Cozy Comfort Pullover Set        have side pockets?"
}
[32;1m[1;3m[chain/start][0m [1m[1:chain:RetrievalQA > 2:chain:StuffDocumentsChain] Entering Chain run with input:
[0m[inputs]
[32;1m[1;3m[chain/start][0m [1m[1:chain:RetrievalQA > 2:chain:StuffDocumentsChain > 3:chain:LLMChain] Entering Chain run with input:
[0m{
  "question": "Do the Cozy Comfort Pullover Set        have side pockets?",
  "context": ": 10\nname: Cozy Comfort Pullover Set, Stripe\ndescription: Perfect for lounging, this striped knit set lives up to its name. We used ultrasoft fabric and an easy design that's as comfortable at bedtime as it is when we have to make a quick run out.\n\nSize & Fit\n- Pants are Favorite Fit: Sits lower on the waist.\n- Relaxed Fit: Our most generous fit sits farthest from the body.\n\nFabric & Care\n- In the softest blend of 63% polyester, 35% rayon and 2% spandex.\

'The Cozy Comfort Pullover Set, Stripe has side pockets.'

As with creating the examples, this gets tedious over time. So let's go back to our favorite solution:
can we ask a language model to do it?

In [35]:
# Turn off the debug mode
langchain.debug = False


### 2. LLM assisted evaluation
First, we need to create predictions for all the examples (we have 7 examples). 

In [None]:
predictions = qa.apply(examples)

[32;1m[1;3m[chain/start][0m [1m[1:chain:RetrievalQA] Entering Chain run with input:
[0m{
  "query": "Do the Cozy Comfort Pullover Set        have side pockets?",
  "answer": "Yes"
}
[32;1m[1;3m[chain/start][0m [1m[1:chain:RetrievalQA > 2:chain:StuffDocumentsChain] Entering Chain run with input:
[0m[inputs]
[32;1m[1;3m[chain/start][0m [1m[1:chain:RetrievalQA > 2:chain:StuffDocumentsChain > 3:chain:LLMChain] Entering Chain run with input:
[0m{
  "question": "Do the Cozy Comfort Pullover Set        have side pockets?",
  "context": ": 10\nname: Cozy Comfort Pullover Set, Stripe\ndescription: Perfect for lounging, this striped knit set lives up to its name. We used ultrasoft fabric and an easy design that's as comfortable at bedtime as it is when we have to make a quick run out.\n\nSize & Fit\n- Pants are Favorite Fit: Sits lower on the waist.\n- Relaxed Fit: Our most generous fit sits farthest from the body.\n\nFabric & Care\n- In the softest blend of 63% polyester, 35% ray


Now that we have predictions for these examples, we can think about evaluating them. So we're going to import the ``QAEvalChain``:

In [20]:
from langchain.evaluation.qa import QAEvalChain

We are going to create this chain with a language model, because again, we're going to be using a language model to help do the evaluation. And then we're going to call evaluate on this chain.
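To see why a language model is a natural grader here, consider how simple string comparisons behave on the first two examples from this notebook. The sketch below is only an illustration: `exact_match` and `contains_reference` are hypothetical helpers, not part of LangChain.

```python
# Illustrative sketch: naive string metrics on answers taken from this
# notebook. Both helper functions are hypothetical.
def exact_match(reference: str, prediction: str) -> bool:
    return reference.strip().lower() == prediction.strip().lower()


def contains_reference(reference: str, prediction: str) -> bool:
    return reference.strip().lower() in prediction.lower()


pairs = [
    ("Yes",
     "The Cozy Comfort Pullover Set, Stripe has side pockets."),
    ("The DownTek collection",
     "The Ultra-Lofty 850 Stretch Down Hooded Jacket is from the DownTek collection."),
]

for reference, prediction in pairs:
    print(exact_match(reference, prediction),
          contains_reference(reference, prediction))
```

Exact matching rejects both predictions, and the substring check rejects the first one, even though both answers are right in meaning. Comparing meaning rather than characters is the job we hand to the evaluation chain.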

In [21]:
llm = ChatOpenAI(temperature=0)
eval_chain = QAEvalChain.from_llm(llm)

We pass in the examples and predictions and get back a list of graded outputs. To see exactly what is going on for each example, we loop through them.

In [22]:
graded_outputs = eval_chain.evaluate(examples, predictions)

In [23]:
for i, eg in enumerate(examples):
    print(f"Example {i}:")
    print("Question: " + predictions[i]['query'])
    print("Real Answer: " + predictions[i]['answer'])
    print("Predicted Answer: " + predictions[i]['result'])
    print("Predicted Grade: " + graded_outputs[i]['text'])
    print()

Example 0:
Question: Do the Cozy Comfort Pullover Set        have side pockets?
Real Answer: Yes
Predicted Answer: The Cozy Comfort Pullover Set, Stripe has side pockets.
Predicted Grade: CORRECT

Example 1:
Question: What collection is the Ultra-Lofty         850 Stretch Down Hooded Jacket from?
Real Answer: The DownTek collection
Predicted Answer: The Ultra-Lofty 850 Stretch Down Hooded Jacket is from the DownTek collection.
Predicted Grade: CORRECT

Example 2:
Question: What is the approximate weight of the Women's Campside Oxfords per pair?
Real Answer: The approximate weight of the Women's Campside Oxfords per pair is 1 lb. 1 oz.
Predicted Answer: The approximate weight of the Women's Campside Oxfords per pair is 1 lb. 1 oz.
Predicted Grade: CORRECT

Example 3:
Question: What are the dimensions of the medium-sized Recycled Waterhog Dog Mat?
Real Answer: The dimensions of the medium-sized Recycled Waterhog Dog Mat are 22.5" x 34.5".
Predicted Answer: The dimensions of the medium-si
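Once there are more than a handful of examples, reading grades one by one stops being practical, so a small tally helps. The sketch below assumes each graded output is a dictionary holding the grade under a `text` key, as in the loop above, and falls back to a `results` key, which some LangChain releases appear to use instead; treat that fallback as an assumption. The `summarize_grades` helper and the sample data are hypothetical and hand-written, not results from this notebook.

```python
# Illustrative sketch: tally grades returned by an LLM-based grader.
# summarize_grades and sample_graded are hypothetical.
from collections import Counter
from typing import Dict, List


def summarize_grades(graded: List[Dict[str, str]]) -> Counter:
    counts: Counter = Counter()
    for item in graded:
        # Assumption: the grade lives under "text" or, failing that, "results".
        grade = item.get("text", item.get("results", "")).strip().upper()
        counts[grade] += 1
    return counts


# Hand-written sample data, not output from the chain above.
sample_graded = [{"text": "CORRECT"}, {"text": "CORRECT"}, {"text": " INCORRECT"}]
summary = summarize_grades(sample_graded)
accuracy = summary["CORRECT"] / len(sample_graded)
print(summary, f"accuracy={accuracy:.2f}")
```

Tracking this number across changes to the chain gives a quick signal of whether a change made things better or worse, which is the question this notebook started with.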