# Lab | Summarization evaluation using LangSmith
Let's revisit your capstone project 2? Well, sort of. Pick different sets of data and re-run this notebook, perhaps with parts of the dataset you used in your last project week. The point is for you to understand all the steps involved and the many different ways one can and should evaluate LLM applications using LangSmith.

What did you learn? - Let's discuss that in class

## LangSmith - LangChain evaluation

In [2]:
!pip install python-dotenv

Collecting python-dotenv
  Downloading python_dotenv-1.1.0-py3-none-any.whl.metadata (24 kB)
Downloading python_dotenv-1.1.0-py3-none-any.whl (20 kB)
Installing collected packages: python-dotenv
Successfully installed python-dotenv-1.1.0


In [3]:
from dotenv import load_dotenv, find_dotenv
import os
_ = load_dotenv(find_dotenv())


OPENAI_API_KEY  = os.getenv('OPENAI_API_KEY')
LANGCHAIN_API_KEY = os.getenv("LANGCHAIN_API_KEY")
HUGGINGFACEHUB_API_TOKEN = os.getenv("HUGGINGFACEHUB_API_TOKEN")
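
`os.getenv` returns `None` when a variable is missing, and the failure only shows up later as an authentication error. A minimal sketch of an early check is shown here; the helper name `missing_env_vars` and the `REQUIRED_KEYS` list are illustrative assumptions, not part of the original notebook.

```python
import os

# Keys this notebook reads from the environment (taken from the cell above).
REQUIRED_KEYS = ["OPENAI_API_KEY", "LANGCHAIN_API_KEY", "HUGGINGFACEHUB_API_TOKEN"]


def missing_env_vars(names, env=None):
    """Return the names that are unset or empty in the given mapping.

    Hypothetical helper: `env` defaults to os.environ and can be replaced
    by a plain dict for testing.
    """
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


missing = missing_env_vars(REQUIRED_KEYS)
if missing:
    print(f"Missing environment variables: {', '.join(missing)}")
```

Running this right after `load_dotenv` makes it clear whether the `.env` file was found and contains all three keys.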

In [4]:
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_ENDPOINT"]="https://api.smith.langchain.com"
os.environ["LANGCHAIN_PROJECT"]="langsmith_max-test"

In [5]:
#Importing Client from Langsmith
from langsmith import Client
client = Client(api_key=LANGCHAIN_API_KEY)

In [8]:
!pip install -qU datasets

### Create Dataset


In [47]:
from datasets import load_dataset
cnn_dataset = load_dataset( "akshatmehta98/amazon_reviews", split="test" )
    # version ="3.0.0",
    # trust_remote_code=True
# ccdv/cnn_dailymail
# data = load_dataset("yelp_review_full", split="test")

Generating train split:   0%|          | 0/303316 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/116661 [00:00<?, ? examples/s]

In [53]:
def add_prefix(example):
    return {
        **example,
        "text": f"Summarize this news:\n{example['text']}"
    }

#cnn_dataset = cnn_dataset.map(add_prefix)

In [54]:
cnn_dataset

Dataset({
    features: ['HelpfulnessNumerator', 'HelpfulnessDenominator', 'Summary', 'text', 'labels', 'sentiment_code', 'input_ids', 'attention_mask'],
    num_rows: 116661
})

In [56]:
cnn_dataset[0]

{'HelpfulnessNumerator': 3,
 'HelpfulnessDenominator': 4,
 'Summary': 'Flavored ones are much better',
 'text': 'i purchased this product for its probiotics but found it difficult to disguise the sour taste lifeways flavored products were a much better option this product can make a good substitute for sour cream as far as taste goes it has a much thinner consistancy though the following reviews helpfulness score is1',
 'labels': 'neutral',
 'sentiment_code': 1,
 'input_ids': '[     0     17  59038     71    903  12996    100   6863    502   8730\n  41637   1284  14037    442  34844     47   2837  17086    184     70\n    221    474  90365   6897 102966 196634    297  38742   3542     10\n   5045  11522  35829    903  12996    831   3249     10   4127 161740\n     13    100    221    474  24709    237   2060    237  90365  60899\n    442   1556     10   5045   6117  13857  58055     66   2408  21208\n     70  25632  98865  98893   7432  47763     83    418      2]',
 'attention_mask': 

In [57]:
#Get just a few news to test
MAX_NEWS=10
sample_cnn = cnn_dataset.select(range(MAX_NEWS)).map(add_prefix)

sample_cnn

Map:   0%|          | 0/10 [00:00<?, ? examples/s]

Dataset({
    features: ['HelpfulnessNumerator', 'HelpfulnessDenominator', 'Summary', 'text', 'labels', 'sentiment_code', 'input_ids', 'attention_mask'],
    num_rows: 10
})

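The original CNN/DailyMail version of this notebook used three columns: article, highlights, and id. The Amazon reviews dataset loaded here has different columns; the relevant ones are `text` (the review) and `Summary` (a short human-written summary). To use LangSmith, we need to create a dataset in LangSmith format.

LangSmith expects a prompt and a result. To achieve this, we transform the review text into a prompt by adding the prefix "Summarize this news:". For the result, the counterpart of highlights is the `Summary` column, which holds the summaries written by humans. Note that the code below sets `output_key` to `labels` (the sentiment label); switch it to `Summary` if you want to compare against reference summaries.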
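
The prompt/result split described above can be sketched with plain dictionaries. This is only an illustration: the helper name `to_langsmith_example` is hypothetical, and choosing `Summary` as the output field is an assumption about which column should act as the reference.

```python
def to_langsmith_example(record, input_field="text", output_field="Summary",
                         prefix="Summarize this news:\n"):
    """Split one dataset row into an inputs dict and an outputs dict.

    Hypothetical helper: the field names default to columns of the Amazon
    reviews dataset shown earlier; adjust them for other datasets.
    """
    return {
        "inputs": {input_field: f"{prefix}{record[input_field]}"},
        "outputs": {output_field: record[output_field]},
    }


# A shortened row, using values from the first record printed above.
row = {
    "text": "i purchased this product for its probiotics",
    "Summary": "Flavored ones are much better",
}
example = to_langsmith_example(row)
print(example["inputs"]["text"])
print(example["outputs"]["Summary"])
```

The `inputs` part is what the model receives; the `outputs` part is what its answer is compared against.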

In [58]:
print(sample_cnn[0])

{'HelpfulnessNumerator': 3, 'HelpfulnessDenominator': 4, 'Summary': 'Flavored ones are much better', 'text': 'Summarize this news:\ni purchased this product for its probiotics but found it difficult to disguise the sour taste lifeways flavored products were a much better option this product can make a good substitute for sour cream as far as taste goes it has a much thinner consistancy though the following reviews helpfulness score is1', 'labels': 'neutral', 'sentiment_code': 1, 'input_ids': '[     0     17  59038     71    903  12996    100   6863    502   8730\n  41637   1284  14037    442  34844     47   2837  17086    184     70\n    221    474  90365   6897 102966 196634    297  38742   3542     10\n   5045  11522  35829    903  12996    831   3249     10   4127 161740\n     13    100    221    474  24709    237   2060    237  90365  60899\n    442   1556     10   5045   6117  13857  58055     66   2408  21208\n     70  25632  98865  98893   7432  47763     83    418      2]', 'at

Now that we have the dataset with the prompt and the reference summary, it is time to create a dataset in LangSmith with this information.
### Create the Dataset in Langsmith

The dataset in LangSmith is composed of an input, which is the prompt passed to the model for evaluation, and an output, which should contain what we expect the model to return.

In [59]:
import datetime

In [64]:
import uuid
# input_key=['article']
# output_key=['highlights']

input_key = ['text']
output_key = ['labels']

NAME_DATASET=f"Summarize_dataset_{datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S')}"

In [65]:
#This creates the dataset in LangSmith with the content in sample_cnn - If you run this more than once you will get POST errors
dataset = client.upload_dataframe(
    df=sample_cnn,
    input_keys=input_key,
    output_keys=output_key,
    name=NAME_DATASET,
    description="Test Embedding distance between model summarizations",
    data_type="kv"
)

Creating CSV from Arrow format:   0%|          | 0/1 [00:00<?, ?ba/s]

The image below shows an example from the dataset once it has been registered in LangSmith.

In the Input column, there is the prompt to be sent, while in the Output column, the expected output is stored.

When performing the comparison, the model is given the prompt, and the cosine distance between the embedding of its response and the embedding of the reference stored in the dataset is calculated.
<img src="https://github.com/peremartra/Large-Language-Model-Notebooks-Course/blob/main/img/Martra_Figure_4_2SDL_Dataset.jpg?raw=true">
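
To make the cosine distance mentioned above concrete, here is a minimal pure-Python sketch. The vectors are made-up toy values; in the actual evaluation they would be embeddings of the model response and of the reference, produced by an embedding model.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_a * norm_b)


# Toy "embeddings" (assumed values, for illustration only).
prediction = [0.2, 0.7, 0.1]
reference = [0.25, 0.65, 0.05]
print(cosine_distance(prediction, reference))
```

A distance near 0 means the two vectors point in almost the same direction, so lower scores indicate a response that is semantically closer to the reference.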

### Recovering Models From Hugging Face
Let's retrieve both models from Hugging Face: a base T5 model and a T5 model fine-tuned to generate summaries (`flax-community/t5-base-cnn-dm`, which, as its name indicates, was tuned on CNN/DailyMail rather than on the Amazon reviews used here).

In [19]:
!pip install langchain_community

Collecting langchain_community
  Downloading langchain_community-0.3.22-py3-none-any.whl.metadata (2.4 kB)
Collecting dataclasses-json<0.7,>=0.5.7 (from langchain_community)
  Downloading dataclasses_json-0.6.7-py3-none-any.whl.metadata (25 kB)
Collecting pydantic-settings<3.0.0,>=2.4.0 (from langchain_community)
  Downloading pydantic_settings-2.9.1-py3-none-any.whl.metadata (3.8 kB)
Collecting httpx-sse<1.0.0,>=0.4.0 (from langchain_community)
  Downloading httpx_sse-0.4.0-py3-none-any.whl.metadata (9.0 kB)
Collecting marshmallow<4.0.0,>=3.18.0 (from dataclasses-json<0.7,>=0.5.7->langchain_community)
  Downloading marshmallow-3.26.1-py3-none-any.whl.metadata (7.3 kB)
Collecting typing-inspect<1,>=0.4.0 (from dataclasses-json<0.7,>=0.5.7->langchain_community)
  Downloading typing_inspect-0.9.0-py3-none-any.whl.metadata (1.5 kB)
Collecting mypy-extensions>=0.3.0 (from typing-inspect<1,>=0.4.0->dataclasses-json<0.7,>=0.5.7->langchain_community)
  Downloading mypy_extensions-1.1.0-py

In [20]:
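# HuggingFaceHub lives in langchain_community; the top-level langchain import is deprecated
from langchain_community.llms import HuggingFaceHub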

In [66]:
summarizer_base = HuggingFaceHub(
    repo_id="t5-base",
    model_kwargs={"temperature":0, "max_length":180},
    huggingfacehub_api_token=HUGGINGFACEHUB_API_TOKEN
)

In [67]:
summarizer_finetuned = HuggingFaceHub(
    repo_id="flax-community/t5-base-cnn-dm",
    model_kwargs={"temperature":0, "max_length":180},
    huggingfacehub_api_token=HUGGINGFACEHUB_API_TOKEN
)

## Defining Evaluator
The first step is to define an evaluator, where we specify the variables we want to evaluate. In our case, I have chosen to measure only the "embedding_distance."

I've left the "string_distance" as a comment in case you want to conduct a test with two evaluations instead of one.
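
For intuition about what a string distance measures, here is a minimal sketch of a normalized Levenshtein (edit) distance in pure Python. It is an illustrative metric only; the `string_distance` evaluator may use a different algorithm, and the function names here are hypothetical.

```python
def levenshtein(a, b):
    """Count the single-character insertions, deletions, and substitutions
    needed to turn string a into string b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def normalized_distance(a, b):
    """Scale the edit distance to the range [0, 1] by the longer length."""
    longest = max(len(a), len(b))
    return 0.0 if longest == 0 else levenshtein(a, b) / longest


print(normalized_distance("Flavored ones are much better",
                          "Flavored ones taste much better"))
```

Unlike an embedding distance, this kind of score only looks at characters, so two summaries with the same meaning but different wording can still be far apart.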


In [23]:
from langchain.smith import run_on_dataset, RunEvalConfig
!pip install -q rapidfuzz==3.6.1

In [68]:
#We are using just one of the multiple evaluators available in LangSmith.

evaluation_config = RunEvalConfig(
    evaluators=[
        "embedding_distance",
        #"string_distance"
    ],
)



### Running Evaluator
With the same configuration, we can launch two evaluations on the same dataset. One for each of the chosen models.

In [25]:
!pip install tiktoken

Collecting tiktoken
  Downloading tiktoken-0.9.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.7 kB)
Downloading tiktoken-0.9.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.2 MB)
Installing collected packages: tiktoken
Successfully installed tiktoken-0.9.0


In [69]:
project_name = f"T5-BASE {datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S')}"

base_t5_results = run_on_dataset(
    client=client,
    project_name=project_name,
    dataset_name=NAME_DATASET,
    llm_or_chain_factory=summarizer_base,
    evaluation=evaluation_config,
)

View the evaluation results for project 'T5-BASE 2025-04-26 21:13:38' at:
https://smith.langchain.com/o/f6bdcd35-f834-4a91-ad55-34d27ed5e316/datasets/52cfe019-1d0f-4992-8286-f55c3b9eae3a/compare?selectedSessions=f9cd3bc8-98d5-45db-9fe8-492ee4c5b903

View all tests for Dataset Summarize_dataset_2025-04-26 21:13:21 at:
https://smith.langchain.com/o/f6bdcd35-f834-4a91-ad55-34d27ed5e316/datasets/52cfe019-1d0f-4992-8286-f55c3b9eae3a
[>                                                 ] 0/10

Error Type: HfHubHTTPError, Message: 503 Server Error: Service Temporarily Unavailable for url: https://router.huggingface.co/hf-inference/models/t5-base
Error Type: HfHubHTTPError, Message: 503 Server Error: Service Temporarily Unavailable for url: https://router.huggingface.co/hf-inference/models/t5-base
Error Type: HfHubHTTPError, Message: 503 Server Error: Service Temporarily Unavailable for url: https://router.huggingface.co/hf-inference/models/t5-base


[---->                                             ] 1/10[--------->                                        ] 2/10[-------------->                                   ] 3/10

Error Type: HfHubHTTPError, Message: 503 Server Error: Service Temporarily Unavailable for url: https://router.huggingface.co/hf-inference/models/t5-base


[------------------->                              ] 4/10

Error Type: HfHubHTTPError, Message: 503 Server Error: Service Temporarily Unavailable for url: https://router.huggingface.co/hf-inference/models/t5-base
Error Type: HfHubHTTPError, Message: 503 Server Error: Service Temporarily Unavailable for url: https://router.huggingface.co/hf-inference/models/t5-base
Error Type: HfHubHTTPError, Message: 503 Server Error: Service Temporarily Unavailable for url: https://router.huggingface.co/hf-inference/models/t5-base


[------------------------>                         ] 5/10[----------------------------->                    ] 6/10[---------------------------------->               ] 7/10

Error Type: HfHubHTTPError, Message: 503 Server Error: Service Temporarily Unavailable for url: https://router.huggingface.co/hf-inference/models/t5-base


[--------------------------------------->          ] 8/10

Error Type: HfHubHTTPError, Message: 503 Server Error: Service Temporarily Unavailable for url: https://router.huggingface.co/hf-inference/models/t5-base
Error Type: HfHubHTTPError, Message: 503 Server Error: Service Temporarily Unavailable for url: https://router.huggingface.co/hf-inference/models/t5-base


[-------------------------------------------->     ] 9/10[------------------------------------------------->] 10/10
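
The 503 responses in the output above come from the hosted inference endpoint being temporarily unavailable. One common way to cope with transient errors of this kind is to retry with exponential backoff. The sketch below is a generic helper under that assumption; `call_with_retries` is a hypothetical name and is not part of LangChain or LangSmith.

```python
import time


def call_with_retries(fn, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call fn() and retry on any exception, doubling the wait each time.

    Hypothetical helper: `sleep` is injectable so the waiting can be
    replaced in tests. The last failure is re-raised unchanged.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception as error:
            if attempt == max_attempts:
                raise
            delay = base_delay * 2 ** (attempt - 1)
            print(f"Attempt {attempt} failed ({error}); retrying in {delay:.1f}s")
            sleep(delay)
```

As a usage idea, a single model call could be wrapped as `call_with_retries(lambda: summarizer_base.invoke(prompt))`, where `prompt` is one of the dataset inputs. Retrying helps only when the outage is short; if the endpoint keeps returning 503, the model may simply not be served there.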

In [70]:
#Ignore the error shown below
project_name = f"T5-FineTuned {datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S')}"

finetuned_t5_results = run_on_dataset(
    client=client,
    project_name=project_name,
    dataset_name=NAME_DATASET,
    llm_or_chain_factory=summarizer_finetuned,
    evaluation=evaluation_config,
)

View the evaluation results for project 'T5-FineTuned 2025-04-26 21:13:45' at:
https://smith.langchain.com/o/f6bdcd35-f834-4a91-ad55-34d27ed5e316/datasets/52cfe019-1d0f-4992-8286-f55c3b9eae3a/compare?selectedSessions=c9994136-ed21-4b9f-90b8-7e07ec904536

View all tests for Dataset Summarize_dataset_2025-04-26 21:13:21 at:
https://smith.langchain.com/o/f6bdcd35-f834-4a91-ad55-34d27ed5e316/datasets/52cfe019-1d0f-4992-8286-f55c3b9eae3a
[>                                                 ] 0/10

Error Type: HfHubHTTPError, Message: 503 Server Error: Service Temporarily Unavailable for url: https://router.huggingface.co/hf-inference/models/flax-community/t5-base-cnn-dm
Error Type: HfHubHTTPError, Message: 503 Server Error: Service Temporarily Unavailable for url: https://router.huggingface.co/hf-inference/models/flax-community/t5-base-cnn-dm
Error Type: HfHubHTTPError, Message: 503 Server Error: Service Temporarily Unavailable for url: https://router.huggingface.co/hf-inference/models/flax-community/t5-base-cnn-dm
Error Type: HfHubHTTPError, Message: 503 Server Error: Service Temporarily Unavailable for url: https://router.huggingface.co/hf-inference/models/flax-community/t5-base-cnn-dm


[---->                                             ] 1/10[--------->                                        ] 2/10[-------------->                                   ] 3/10[------------------->                              ] 4/10

Error Type: HfHubHTTPError, Message: 503 Server Error: Service Temporarily Unavailable for url: https://router.huggingface.co/hf-inference/models/flax-community/t5-base-cnn-dm


[------------------------>                         ] 5/10

Error Type: HfHubHTTPError, Message: 503 Server Error: Service Temporarily Unavailable for url: https://router.huggingface.co/hf-inference/models/flax-community/t5-base-cnn-dm
Error Type: HfHubHTTPError, Message: 503 Server Error: Service Temporarily Unavailable for url: https://router.huggingface.co/hf-inference/models/flax-community/t5-base-cnn-dm


[----------------------------->                    ] 6/10[---------------------------------->               ] 7/10

Error Type: HfHubHTTPError, Message: 503 Server Error: Service Temporarily Unavailable for url: https://router.huggingface.co/hf-inference/models/flax-community/t5-base-cnn-dm
Error Type: HfHubHTTPError, Message: 503 Server Error: Service Temporarily Unavailable for url: https://router.huggingface.co/hf-inference/models/flax-community/t5-base-cnn-dm


[--------------------------------------->          ] 8/10[-------------------------------------------->     ] 9/10

Error Type: HfHubHTTPError, Message: 503 Server Error: Service Temporarily Unavailable for url: https://router.huggingface.co/hf-inference/models/flax-community/t5-base-cnn-dm


[------------------------------------------------->] 10/10

<img src="https://github.com/peremartra/Large-Language-Model-Notebooks-Course/blob/main/img/Martra_Figure_4_2SDL_Tests.jpg?raw=true">

In the image below you can see the comparison between the two tests.
<img src="https://github.com/peremartra/Large-Language-Model-Notebooks-Course/blob/main/img/Martra_Figure_4_2SDL_CompareTestst.jpg?raw=true">

Well, since it has been so straightforward, why don't we try to make the comparison with an OpenAI model?

In [30]:
!pip install langchain_openai

Collecting langchain_openai
  Downloading langchain_openai-0.3.14-py3-none-any.whl.metadata (2.3 kB)
Downloading langchain_openai-0.3.14-py3-none-any.whl (62 kB)
Installing collected packages: langchain_openai
Successfully installed langchain_openai-0.3.14


In [31]:
from langchain_openai import OpenAI
open_aillm=OpenAI(temperature=0.0)

In [71]:
project_name = f"OpenAI {datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S')}"

# Use a separate variable so the fine-tuned T5 results are not overwritten
openai_results = run_on_dataset(
    client=client,
    project_name=project_name,
    dataset_name=NAME_DATASET,
    llm_or_chain_factory=open_aillm,
    evaluation=evaluation_config,
)

View the evaluation results for project 'OpenAI 2025-04-26 21:13:57' at:
https://smith.langchain.com/o/f6bdcd35-f834-4a91-ad55-34d27ed5e316/datasets/52cfe019-1d0f-4992-8286-f55c3b9eae3a/compare?selectedSessions=e51a774d-8bcb-47fc-a8af-56707439d017

View all tests for Dataset Summarize_dataset_2025-04-26 21:13:21 at:
https://smith.langchain.com/o/f6bdcd35-f834-4a91-ad55-34d27ed5e316/datasets/52cfe019-1d0f-4992-8286-f55c3b9eae3a
[------------------------------------------------->] 10/10

<img src="https://github.com/peremartra/Large-Language-Model-Notebooks-Course/blob/main/img/Martra_Figure_4_2SDL_CompareOpenAI_HF.jpg?raw=true">

The experiment with the OpenAI model yielded the best results. Be aware, though: as we can see, there is a cost involved, since we are calling a paid API.

Another crucial piece of information is that we can view performance data for the models. This data could also be useful for a basic evaluation of our inference server.