# LangChain: Evaluation

## Outline:

* Example generation
* Manual evaluation (and debugging)
* LLM-assisted evaluation

When building a complex application using an LLM, one 
important but sometimes tricky step is 
evaluating how well your application is doing. 

Is it meeting some accuracy criteria? 

And also, if you decide to change your implementation, 
maybe swap in a different LLM, or change the strategy 
of how you use a vector database or something else to retrieve chunks, 
or change some other parameters of your system, 
how do you know if you're making it better or worse?

**Frameworks on how to think about evaluating an LLM-based application**

These applications are really chains and sequences of 
many different steps. So the first 
thing you should do is understand 
what exactly is going in and coming out of each step. 
Some of the tools can be thought of 
as visualizers or debuggers in that vein. 
But it's often useful to get a more holistic picture, across many 
data points, of how the model is doing. 
One way to do that is by looking at things by eye. 
Another is to use language models and chains themselves 
to evaluate other language models, other chains, and other 
applications.

## Import API key

In [None]:
import os

from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv()) # read local .env file

## Create our QandA application

First, we need the chain or application that we're going 
to evaluate. 
We're going to use the document question-answering chain 
from the previous lesson. 

In [None]:
# import everything we need
from langchain.chains import RetrievalQA
from langchain.chat_models import ChatOpenAI
from langchain.document_loaders import CSVLoader
from langchain.indexes import VectorstoreIndexCreator
from langchain.vectorstores import DocArrayInMemorySearch

In [None]:
# load the same data that we were using

file = 'OutdoorClothingCatalog_1000.csv'
loader = CSVLoader(file_path=file)
data = loader.load()

In [None]:
# create that index with one line

index = VectorstoreIndexCreator(
    vectorstore_cls=DocArrayInMemorySearch
).from_loaders([loader])

In [None]:
# create the retrieval QA chain by specifying the language model, the chain type, the retriever,
# and then the verbosity that we're going to print out

llm = ChatOpenAI(temperature = 0.0)
qa = RetrievalQA.from_chain_type(
    llm=llm, 
    chain_type="stuff", 
    retriever=index.vectorstore.as_retriever(), 
    verbose=True,
    chain_type_kwargs = {
        "document_separator": "<<<<>>>>>"
    }
)

Next, we need to figure out which 
data points we want to evaluate it on.

The first method is the simplest: we come up with data points 
that we think are good examples ourselves. 
To do that, we can look at some of the data and write 
example questions, along with ground-truth 
answers that we can later use for evaluation.

### Coming up with test datapoints

In [None]:
data[10]

*OUTPUT*


```
Document(page_content=": 10\nname: Cozy Comfort Pullover Set, Stripe\ndescription: Perfect for lounging, this striped knit set lives up to its name. We used ultrasoft fabric and an easy design that's as comfortable at bedtime as it is when we have to make a quick run out.\n\nSize & Fit\n- Pants are Favorite Fit: Sits lower on the waist.\n- Relaxed Fit: Our most generous fit sits farthest from the body.\n\nFabric & Care\n- In the softest blend of 63% polyester, 35% rayon and 2% spandex.\n\nAdditional Features\n- Relaxed fit top with raglan sleeves and rounded hem.\n- Pull-on pants have a wide elastic waistband and drawstring, side pockets and a modern slim leg.\n\nImported.", metadata={'source': 'OutdoorClothingCatalog_1000.csv', 'row': 10})
```

In [None]:
data[11]

*OUTPUT*


```
Document(page_content=': 11\nname: Ultra-Lofty 850 Stretch Down Hooded Jacket\ndescription: This technical stretch down jacket from our DownTek collection is sure to keep you warm and comfortable with its full-stretch construction providing exceptional range of motion. With a slightly fitted style that falls at the hip and best with a midweight layer, this jacket is suitable for light activity up to 20° and moderate activity up to -30°. The soft and durable 100% polyester shell offers complete windproof protection and is insulated with warm, lofty goose down. Other features include welded baffles for a no-stitch construction and excellent stretch, an adjustable hood, an interior media port and mesh stash pocket and a hem drawcord. Machine wash and dry. Imported.', metadata={'source': 'OutdoorClothingCatalog_1000.csv', 'row': 11})
```

### Hard-coded examples

In [None]:
# We can create some example query and answer pairs by hand.
# Adjacent string literals are joined, so no indentation leaks into the query.

examples = [
    {
        "query": "Do the Cozy Comfort Pullover Set "
                 "have side pockets?",
        "answer": "Yes"
    },
    {
        "query": "What collection is the Ultra-Lofty "
                 "850 Stretch Down Hooded Jacket from?",
        "answer": "The DownTek collection"
    }
]

So here we've created two examples. 
But this doesn't scale well: it 
takes a bit of time to look through each document 
and figure out what's going on. Is there a way that 
we can automate it? 

One way we can automate it is with language models themselves. 

### LLM-Generated examples

In [None]:
#  import the QA generation chain

from langchain.evaluation.qa import QAGenerateChain

# this will take in documents and it will create a question answer pair from each document. 

In [None]:
# create this chain by passing in the Chat OpenAI language model

example_gen_chain = QAGenerateChain.from_llm(ChatOpenAI())

In [None]:
# Create a bunch of examples.
# We use the apply_and_parse method because it applies an output parser to
# the result: we want to get back a dictionary that has the query and answer pair,
# not just a single string.

new_examples = example_gen_chain.apply_and_parse(
    [{"doc": t} for t in data[:5]]
)

So what exactly is returned here? 
We can see a query and an answer. Let's also 
check the document that this question and answer were generated from.

In [None]:
new_examples[0]

*OUTPUT*


```
{'query': "What is the weight of each pair of Women's Campside Oxfords?",
 'answer': "The approximate weight of each pair of Women's Campside Oxfords is 1 lb. 1 oz."}
```

We just generated a bunch of question-answer pairs without 
having to write them all ourselves, which saves us a lot of 
time.

In [None]:
data[0]

*OUTPUT*


```
Document(page_content=": 0\nname: Women's Campside Oxfords\ndescription: This ultracomfortable lace-to-toe Oxford boasts a super-soft canvas, thick cushioning, and quality construction for a broken-in feel from the first time you put them on. \n\nSize & Fit: Order regular shoe size. For half sizes not offered, order up to next whole size. \n\nSpecs: Approx. weight: 1 lb.1 oz. per pair. \n\nConstruction: Soft canvas material for a broken-in feel and look. Comfortable EVA innersole with Cleansport NXT® antimicrobial odor control. Vintage hunt, fish and camping motif on innersole. Moderate arch contour of innersole. EVA foam midsole for cushioning and support. Chain-tread-inspired molded rubber outsole with modified chain-tread pattern. Imported. \n\nQuestions? Please contact us for any inquiries.", metadata={'source': 'OutdoorClothingCatalog_1000.csv', 'row': 0})
```
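
The "parse" half of `apply_and_parse` can be pictured as turning the model's raw text into a dictionary. The sketch below is only an illustration, assuming the raw completion uses a `QUESTION: ... ANSWER: ...` layout. The helper `parse_qa_pair` and the sample text are hypothetical, and this is not LangChain's own parser.

```python
import re

# Assumed layout of the raw completion; adjust the pattern if yours differs.
QA_PATTERN = re.compile(r"QUESTION:\s*(.*?)\s*ANSWER:\s*(.*)", re.DOTALL)


def parse_qa_pair(raw_text: str) -> dict:
    """Turn 'QUESTION: ... ANSWER: ...' text into a query/answer dict."""
    match = QA_PATTERN.search(raw_text)
    if match is None:
        raise ValueError(f"Could not find a question/answer pair in: {raw_text!r}")
    return {"query": match.group(1).strip(), "answer": match.group(2).strip()}


# Hypothetical raw completion, written by hand for this sketch.
raw = (
    "QUESTION: What is the weight of each pair of Women's Campside Oxfords?\n"
    "ANSWER: Approximately 1 lb. 1 oz."
)
pair = parse_qa_pair(raw)
print(pair["query"])
print(pair["answer"])
```

Returning a dictionary with the same `query` and `answer` keys as the hand-written examples is what lets the two kinds of examples be merged into one list.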

### Combine examples

In [None]:
# add these examples into the examples that we already created

examples += new_examples
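
Before merging, it can be worth checking that every generated item has the same flat shape as the hand-written examples. In some LangChain versions the generated items may come back wrapped under an extra key. The name `qa_pairs` used here is an assumption, so print `new_examples[0]` to confirm the shape you actually get. A minimal sketch that accepts both shapes, with `flatten_example` as a hypothetical helper:

```python
def flatten_example(item: dict, wrapper_key: str = "qa_pairs") -> dict:
    """Return a flat {'query': ..., 'answer': ...} dict.

    `wrapper_key` is an assumed name for an optional outer key.
    """
    inner = item.get(wrapper_key, item)
    missing = {"query", "answer"} - set(inner)
    if missing:
        raise KeyError(f"Example is missing keys: {sorted(missing)}")
    return {"query": inner["query"], "answer": inner["answer"]}


# Illustrative items: one already flat, one wrapped under the assumed key.
flat = {"query": "Q1?", "answer": "A1"}
nested = {"qa_pairs": {"query": "Q2?", "answer": "A2"}}
combined = [flatten_example(item) for item in (flat, nested)]
print(combined)
```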

How exactly do we evaluate what's going on?

The first thing we want to do is just run 
an example through the chain, and take a look at the output it 
produces. 

In [None]:
qa.run(examples[0]["query"])

*OUTPUT*


```
> Entering new RetrievalQA chain...

> Finished chain.
'The Cozy Comfort Pullover Set, Stripe does have side pockets.'
```

This is a little limiting in terms 
of what we can see actually happening inside the chain. 
 
What is the actual prompt that's going into the language model? What 
are the documents that it retrieves? If this were a more 
complex chain with multiple steps in it, what 
are the intermediate results? 
It's oftentimes **not enough to just look at the final answer to understand 
what is or could be going wrong in the chain.**

## Manual Evaluation

In [None]:
# fun little util in LangChain called "langchain.debug"

import langchain
langchain.debug = True

In [None]:
# rerun the same example as above

qa.run(examples[0]["query"])

*OUTPUT*

```
[chain/start] [1:chain:RetrievalQA] Entering Chain run with input:
{
  "query": "Do the Cozy Comfort Pullover Set have side pockets?"
}
[chain/start] [1:chain:RetrievalQA > 2:chain:StuffDocumentsChain] Entering Chain run with input:
[inputs]
[chain/start] [1:chain:RetrievalQA > 2:chain:StuffDocumentsChain > 3:chain:LLMChain] Entering Chain run with input:
{
  "question": "Do the Cozy Comfort Pullover Set have side pockets?",
  "context": ": 10\nname: Cozy Comfort Pullover Set, Stripe\ndescription: Perfect for lounging, this striped knit set lives up to its name. We used ultrasoft fabric and an easy design that's as comfortable at bedtime as it is when we have to make a quick run out.\n\nSize & Fit\n- Pants are Favorite Fit: Sits lower on the waist.\n- Relaxed Fit: Our most generous fit sits farthest from the body.\n\nFabric & Care\n- In the softest blend of 63% polyester, 35% rayon and 2% spandex.\n\nAdditional Features\n- Relaxed fit top with raglan sleeves and rounded hem.\n- Pull-on pants have a wide elastic waistband and drawstring, side pockets and a modern slim leg.\n\nImported.<<<<>>>>>: 73\nname: Cozy Cuddles Knit Pullover Set\ndescription: Perfect for lounging, this knit set lives up to its name. We used ultrasoft fabric and an easy design that's as comfortable at bedtime as it is when we have to make a quick run out. \n\nSize & Fit \nPants are Favorite Fit: Sits lower on the waist. \nRelaxed Fit: Our most generous fit sits farthest from the body. \n\nFabric & Care \nIn the softest blend of 63% polyester, 35% rayon and 2% spandex.\n\nAdditional Features \nRelaxed fit top with raglan sleeves and rounded hem. \nPull-on pants have a wide elastic waistband and drawstring, side pockets and a modern slim leg. \nImported.<<<<>>>>>: 151\nname: Cozy Quilted Sweatshirt\ndescription: Our sweatshirt is an instant classic with its great quilted texture and versatile weight that easily transitions between seasons. With a traditional fit that is relaxed through the chest, sleeve, and waist, this pullover is lightweight enough to be worn most months of the year. The cotton blend fabric is super soft and comfortable, making it the perfect casual layer. To make dressing easy, this sweatshirt also features a snap placket and a heritage-inspired Mt. Katahdin logo patch. For care, machine wash and dry. 
Imported.<<<<>>>>>: 265\nname: Cozy Workout Vest\ndescription: For serious warmth that won't weigh you down, reach for this fleece-lined vest, which provides you with layering options whether you're inside or outdoors.\nSize & Fit\nRelaxed Fit. Falls at hip.\nFabric & Care\nSoft, textured fleece lining. Nylon shell. Machine wash and dry. \nAdditional Features \nTwo handwarmer pockets. Knit side panels stretch for a more flattering fit. Shell fabric is treated to resist water and stains. Imported."
}
[llm/start] [1:chain:RetrievalQA > 2:chain:StuffDocumentsChain > 3:chain:LLMChain > 4:llm:ChatOpenAI] Entering LLM run with input:
{
  "prompts": [
    "System: Use the following pieces of context to answer the users question. \nIf you don't know the answer, just say that you don't know, don't try to make up an answer.\n----------------\n: 10\nname: Cozy Comfort Pullover Set, Stripe\ndescription: Perfect for lounging, this striped knit set lives up to its name. We used ultrasoft fabric and an easy design that's as comfortable at bedtime as it is when we have to make a quick run out.\n\nSize & Fit\n- Pants are Favorite Fit: Sits lower on the waist.\n- Relaxed Fit: Our most generous fit sits farthest from the body.\n\nFabric & Care\n- In the softest blend of 63% polyester, 35% rayon and 2% spandex.\n\nAdditional Features\n- Relaxed fit top with raglan sleeves and rounded hem.\n- Pull-on pants have a wide elastic waistband and drawstring, side pockets and a modern slim leg.\n\nImported.<<<<>>>>>: 73\nname: Cozy Cuddles Knit Pullover Set\ndescription: Perfect for lounging, this knit set lives up to its name. We used ultrasoft fabric and an easy design that's as comfortable at bedtime as it is when we have to make a quick run out. \n\nSize & Fit \nPants are Favorite Fit: Sits lower on the waist. \nRelaxed Fit: Our most generous fit sits farthest from the body. \n\nFabric & Care \nIn the softest blend of 63% polyester, 35% rayon and 2% spandex.\n\nAdditional Features \nRelaxed fit top with raglan sleeves and rounded hem. \nPull-on pants have a wide elastic waistband and drawstring, side pockets and a modern slim leg. \nImported.<<<<>>>>>: 151\nname: Cozy Quilted Sweatshirt\ndescription: Our sweatshirt is an instant classic with its great quilted texture and versatile weight that easily transitions between seasons. With a traditional fit that is relaxed through the chest, sleeve, and waist, this pullover is lightweight enough to be worn most months of the year. The cotton blend fabric is super soft and comfortable, making it the perfect casual layer. 
To make dressing easy, this sweatshirt also features a snap placket and a heritage-inspired Mt. Katahdin logo patch. For care, machine wash and dry. Imported.<<<<>>>>>: 265\nname: Cozy Workout Vest\ndescription: For serious warmth that won't weigh you down, reach for this fleece-lined vest, which provides you with layering options whether you're inside or outdoors.\nSize & Fit\nRelaxed Fit. Falls at hip.\nFabric & Care\nSoft, textured fleece lining. Nylon shell. Machine wash and dry. \nAdditional Features \nTwo handwarmer pockets. Knit side panels stretch for a more flattering fit. Shell fabric is treated to resist water and stains. Imported.\nHuman: Do the Cozy Comfort Pullover Set have side pockets?"
  ]
}
[llm/end] [1:chain:RetrievalQA > 2:chain:StuffDocumentsChain > 3:chain:LLMChain > 4:llm:ChatOpenAI] [1.73s] Exiting LLM run with output:
{
  "generations": [
    [
      {
        "text": "The Cozy Comfort Pullover Set, Stripe does have side pockets.",
        "generation_info": null,
        "message": {
          "content": "The Cozy Comfort Pullover Set, Stripe does have side pockets.",
          "additional_kwargs": {},
          "example": false
        }
      }
    ]
  ],
  "llm_output": {
    "token_usage": {
      "prompt_tokens": 628,
      "completion_tokens": 14,
      "total_tokens": 642
    },
    "model_name": "gpt-3.5-turbo"
  }
}
[chain/end] [1:chain:RetrievalQA > 2:chain:StuffDocumentsChain > 3:chain:LLMChain] [1.73s] Exiting Chain run with output:
{
  "text": "The Cozy Comfort Pullover Set, Stripe does have side pockets."
}
[chain/end] [1:chain:RetrievalQA > 2:chain:StuffDocumentsChain] [1.73s] Exiting Chain run with output:
{
  "output_text": "The Cozy Comfort Pullover Set, Stripe does have side pockets."
}
[chain/end] [1:chain:RetrievalQA] [1.97s] Exiting Chain run with output:
{
  "result": "The Cozy Comfort Pullover Set, Stripe does have side pockets."
}
'The Cozy Comfort Pullover Set, Stripe does have side pockets.'
```

When doing question answering, a wrong result is often 
not the fault of the language model itself. 
It's the retrieval step that went wrong. 

So taking a really close look at 
what exactly the question is, and what exactly the context is, 
can help debug what's going wrong.
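
One way to make that check concrete is to split the `context` string back into documents and see whether the product the question is about was actually retrieved. A minimal sketch, assuming the same `<<<<>>>>>` separator that was configured on the chain; `retrieved_names` is a hypothetical helper and the shortened context is illustrative:

```python
SEPARATOR = "<<<<>>>>>"  # same document_separator that was passed to the chain


def retrieved_names(context: str, separator: str = SEPARATOR) -> list:
    """Extract the 'name:' field of every document stuffed into the context."""
    names = []
    for doc in context.split(separator):
        for line in doc.splitlines():
            if line.startswith("name:"):
                names.append(line[len("name:"):].strip())
                break  # keep only the first name line per document
    return names


# Shortened stand-in for the "context" value printed in debug mode.
context = (
    ": 10\nname: Cozy Comfort Pullover Set, Stripe\ndescription: ..."
    "<<<<>>>>>: 73\nname: Cozy Cuddles Knit Pullover Set\ndescription: ..."
)
names = retrieved_names(context)
print(names)
print(any("Cozy Comfort Pullover Set" in name for name in names))
```

If the expected product is missing from the list, the problem lies in retrieval rather than in the language model's answer.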

We also get back a 
bunch of information such as the "token_usage" (the "prompt_tokens", the 
"completion_tokens", and the "total_tokens") and the "model_name". 

This can be really useful for tracking the tokens that you're 
using in your chains or calls to language models over time, 
because the total number of tokens 
corresponds very closely to the total cost.
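
A running total can be kept by summing the "token_usage" dictionaries from each call. A minimal sketch, assuming the `llm_output` shape shown in the debug output. The per-1,000-token prices are placeholders and not real pricing, and `add_usage` and `estimate_cost` are hypothetical helpers:

```python
from collections import Counter

# Placeholder prices in dollars per 1,000 tokens; substitute your provider's rates.
PROMPT_PRICE_PER_1K = 0.0015
COMPLETION_PRICE_PER_1K = 0.002


def add_usage(totals: Counter, llm_output: dict) -> None:
    """Accumulate the token counts from one LLM call into `totals`."""
    totals.update(llm_output.get("token_usage", {}))


def estimate_cost(totals: Counter) -> float:
    """Estimate cost from the accumulated prompt and completion tokens."""
    return (
        totals["prompt_tokens"] / 1000 * PROMPT_PRICE_PER_1K
        + totals["completion_tokens"] / 1000 * COMPLETION_PRICE_PER_1K
    )


totals = Counter()
# Token counts copied from the debug output of the single call.
call_output = {
    "token_usage": {"prompt_tokens": 628, "completion_tokens": 14, "total_tokens": 642},
    "model_name": "gpt-3.5-turbo",
}
add_usage(totals, call_output)
add_usage(totals, call_output)  # pretend the same call ran twice
print(dict(totals))
print(round(estimate_cost(totals), 6))
```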

In [None]:
# Turn off debug mode so that not everything is printed to the screen

langchain.debug = False

But what about all the examples we created? 
How are we going to evaluate those? 
As with creating them, one way to do it would be manually. 
We could run the chain over all the examples, then look at the outputs 
and try to figure out whether each one is correct, incorrect, or partially 
correct. As with creating the examples, that starts 
to get a little tedious over time. 
So let's go back to our favorite solution. Can 
we ask a language model to do it?

## LLM-assisted evaluation

In [None]:
# create predictions for all the different examples

predictions = qa.apply(examples)

*OUTPUT*

```
> Entering new RetrievalQA chain...

> Finished chain.


> Entering new RetrievalQA chain...

> Finished chain.


> Entering new RetrievalQA chain...

> Finished chain.


> Entering new RetrievalQA chain...

> Finished chain.


> Entering new RetrievalQA chain...

> Finished chain.


> Entering new RetrievalQA chain...

> Finished chain.


> Entering new RetrievalQA chain...

> Finished chain.
```

Now that we have these predictions, we can think about evaluating them.

In [None]:
# import the QA, question answering, eval chain

from langchain.evaluation.qa import QAEvalChain

In [None]:
# create this chain with a language model, because 
# again, we're going to be using a language model to help do the evaluation

llm = ChatOpenAI(temperature=0)
eval_chain = QAEvalChain.from_llm(llm)

In [None]:
# we're going to call evaluate on this chain. We're 
# going to pass in examples and predictions, and we're going to 
# get back a bunch of graded outputs.

graded_outputs = eval_chain.evaluate(examples, predictions)

In [None]:
# To see what exactly is going on for each example, we loop through them.

for i, (prediction, graded) in enumerate(zip(predictions, graded_outputs)):
    print(f"Example {i}:")
    print("Question: " + prediction['query'])
    print("Real Answer: " + prediction['answer'])
    print("Predicted Answer: " + prediction['result'])
    print("Predicted Grade: " + graded['text'])
    print()

*OUTPUT*

```
Example 0:
Question: Do the Cozy Comfort Pullover Set have side pockets?
Real Answer: Yes
Predicted Answer: The Cozy Comfort Pullover Set, Stripe does have side pockets.
Predicted Grade: CORRECT

Example 1:
Question: What collection is the Ultra-Lofty 850 Stretch Down Hooded Jacket from?
Real Answer: The DownTek collection
Predicted Answer: The Ultra-Lofty 850 Stretch Down Hooded Jacket is from the DownTek collection.
Predicted Grade: CORRECT

Example 2:
Question: What is the weight of each pair of Women's Campside Oxfords?
Real Answer: The approximate weight of each pair of Women's Campside Oxfords is 1 lb. 1 oz.
Predicted Answer: The weight of each pair of Women's Campside Oxfords is approximately 1 lb. 1 oz.
Predicted Grade: CORRECT

Example 3:
Question: What are the dimensions of the small and medium Recycled Waterhog Dog Mat?
Real Answer: The dimensions of the small Recycled Waterhog Dog Mat are 18" x 28" and the dimensions of the medium Recycled Waterhog Dog Mat are 22.5" x 34.5".
Predicted Answer: The small Recycled Waterhog Dog Mat has dimensions of 18" x 28" and the medium size has dimensions of 22.5" x 34.5".
Predicted Grade: CORRECT

Example 4:
Question: What are some features of the Infant and Toddler Girls' Coastal Chill Swimsuit?
Real Answer: The swimsuit features bright colors, ruffles, and exclusive whimsical prints. It is made of four-way-stretch and chlorine-resistant fabric, ensuring that it keeps its shape and resists snags. The swimsuit is also UPF 50+ rated, providing the highest rated sun protection possible by blocking 98% of the sun's harmful rays. The crossover no-slip straps and fully lined bottom ensure a secure fit and maximum coverage. Finally, it can be machine washed and line dried for best results.
Predicted Answer: The Infant and Toddler Girls' Coastal Chill Swimsuit is a two-piece swimsuit with bright colors, ruffles, and exclusive whimsical prints. It is made of four-way-stretch and chlorine-resistant fabric that keeps its shape and resists snags. The swimsuit has UPF 50+ rated fabric that provides the highest rated sun protection possible, blocking 98% of the sun's harmful rays. The crossover no-slip straps and fully lined bottom ensure a secure fit and maximum coverage. It is machine washable and should be line dried for best results.
Predicted Grade: CORRECT

Example 5:
Question: What is the fabric composition of the Refresh Swimwear V-Neck Tankini Contrasts?
Real Answer: The body of the Refresh Swimwear V-Neck Tankini Contrasts is made of 82% recycled nylon and 18% Lycra® spandex, while the lining is made of 90% recycled nylon and 10% Lycra® spandex.
Predicted Answer: The Refresh Swimwear V-Neck Tankini Contrasts is made of 82% recycled nylon with 18% Lycra® spandex for the body and 90% recycled nylon with 10% Lycra® spandex for the lining.
Predicted Grade: CORRECT

Example 6:
Question: What is the fabric composition of the EcoFlex 3L Storm Pants?
Real Answer: The EcoFlex 3L Storm Pants are made of 100% nylon, exclusive of trim.
Predicted Answer: The fabric composition of the EcoFlex 3L Storm Pants is 100% nylon, exclusive of trim.
Predicted Grade: CORRECT
```
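
With more examples it becomes impractical to read every grade, so a tally helps. A minimal sketch, assuming each graded output is a dictionary whose `text` value is `CORRECT` or `INCORRECT` as in the output above; `summarize_grades` is a hypothetical helper and the sample data is illustrative:

```python
from collections import Counter


def summarize_grades(graded_outputs: list) -> dict:
    """Count the grades and compute the fraction graded CORRECT."""
    grades = [g["text"].strip().upper() for g in graded_outputs]
    counts = Counter(grades)
    total = len(grades)
    accuracy = counts["CORRECT"] / total if total else 0.0
    return {"counts": dict(counts), "accuracy": accuracy}


# Illustrative stand-in for the list returned by the eval chain.
sample_grades = [{"text": "CORRECT"}, {"text": " CORRECT"}, {"text": "INCORRECT"}]
summary = summarize_grades(sample_grades)
print(summary["counts"])
print(f"{summary['accuracy']:.2f}")
```

Comparing this number before and after a change, such as a different LLM or a different retriever, on the same fixed examples gives a rough signal of whether the change made things better or worse.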

For each example we print the question. Apart from the two 
hand-written examples, this was generated by a language model. 
We print the real answer, which 
was also generated by a language model when it had the whole 
document in front of it, so it could produce 
a ground-truth answer. 
We print the predicted answer, which 
is generated by a language model in 
the QA chain: it does the retrieval with the embeddings and the 
vector database, passes the results into a language model, and then tries 
to produce the answer. 
And we print the grade, which 
is also generated by a 
language model, when the eval chain asks it to judge 
whether the prediction is correct or incorrect. 
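
To make the grading step less of a black box, here is a sketch of the kind of prompt an LLM grader might receive. The wording is illustrative and is not the prompt that `QAEvalChain` actually uses; `GRADING_TEMPLATE` and `build_grading_prompt` are hypothetical names.

```python
# Illustrative template; the real eval chain's wording may differ.
GRADING_TEMPLATE = """You are grading an answer to a question.
Judge only whether the predicted answer agrees in meaning with the reference answer.
Reply with a single word: CORRECT or INCORRECT.

QUESTION: {query}
REFERENCE ANSWER: {answer}
PREDICTED ANSWER: {result}
GRADE:"""


def build_grading_prompt(prediction: dict) -> str:
    """Fill the template from a dict holding 'query', 'answer', and 'result'."""
    return GRADING_TEMPLATE.format(
        query=prediction["query"],
        answer=prediction["answer"],
        result=prediction["result"],
    )


prompt = build_grading_prompt(
    {
        "query": "Do the Cozy Comfort Pullover Set have side pockets?",
        "answer": "Yes",
        "result": "The Cozy Comfort Pullover Set, Stripe does have side pockets.",
    }
)
print(prompt)
```

The grader sees the question, the reference answer, and the prediction side by side, which is what allows it to judge meaning rather than surface form.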

So let's look at the first example. The question here is, "Do 
the Cozy Comfort Pullover Set have side pockets?". 

The real answer, which we wrote ourselves, is "Yes". 

The predicted answer, which the language model produced, was 
"The Cozy Comfort Pullover Set, Stripe does have side pockets". 

So we can see that this is a correct answer. 
The language model sees that as well, 
and it grades it CORRECT. 

But let's think about why we need to use a language 
model in the first place. 
These two strings are nothing alike: 
one is really short, and one is really long. 
The word "yes" doesn't even appear anywhere in the predicted string. So if we 
tried string matching, exact matching, or even 
some regexes here, it wouldn't work, 
because the strings are not the same. That shows the importance of 
using a language model to do the evaluation. The answers 
are arbitrary strings, and there is no single 
true string that is the best possible answer. There 
are many different variants. As long as they have 
the same semantic meaning, they should be graded as 
being similar.
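
The point is easy to verify directly. A minimal sketch using the two strings from the first example, showing that an exact match, a substring match, and a simple regex all fail even though the answer is right:

```python
import re

reference = "Yes"
predicted = "The Cozy Comfort Pullover Set, Stripe does have side pockets."

# Three naive comparisons, all case-insensitive.
exact_match = reference.strip().lower() == predicted.strip().lower()
substring_match = reference.lower() in predicted.lower()
regex_match = re.search(r"\byes\b", predicted, flags=re.IGNORECASE) is not None

print(exact_match, substring_match, regex_match)  # prints: False False False
```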

#### LangChain Evaluation Platform
This is a way to 
do everything that we just did in the notebook, but 
persisted and shown in a UI. Let's check it out. Here, 
we can see that we have a session, which we called deeplearningai.

![Evaluation](immagini/26_evaluation.png)

![Evaluation](immagini/27_evaluation.png)

![Evaluation](immagini/28_evaluation.png)