# LangSmith Deep Dive

This notebook is co-authored with my friends at [AI MakerSpace](https://aimakerspace.io/). Check out their [YouTube channel](https://www.youtube.com/@AI-Makerspace/featured) for, hands down, the best educational content for all things LLMs.

Be sure to connect with [Chris Alexiuk](https://ca.linkedin.com/in/csalexiuk) and [Greg Loughnane](https://www.linkedin.com/in/gregloughnane) on LinkedIn!

## Dependencies and OpenAI API Key

We'll be using OpenAI's suite of models today to help us generate and embed our documents for a simple RAG system built on top of LangChain's blogs!

In [1]:
import os
import getpass

os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter your OpenAI API Key:")

Enter your OpenAI API Key:··········


## Basic RAG Chain

Now we'll set up our basic RAG chain. First up, we need a model!

### OpenAI Model


We'll use OpenAI's `gpt-3.5-turbo` model for generation, which leaves a stronger model available for decent evaluation later!

Notice that we can tag our resources - this will help us keep track of which resources were used where later on!

In [None]:
pip install langchain-openai

In [4]:
from langchain_openai import ChatOpenAI

base_llm = ChatOpenAI(model="gpt-3.5-turbo", tags=["base_llm"])

#### Asyncio Bug Handling

This is necessary for Colab.

In [5]:
import nest_asyncio
nest_asyncio.apply()

### SiteMap Loader

We'll use a `SitemapLoader` to scrape the LangChain blogs.

In [None]:
pip install langchain-community

In [8]:
from langchain_community.document_loaders import SitemapLoader

documents = SitemapLoader(web_path="https://blog.langchain.dev/sitemap-posts.xml").load()

Fetching pages: 100%|##########| 214/214 [00:22<00:00,  9.48it/s]


In [9]:
documents[0]

Document(page_content="\n\n\nHow Dosu Used LangSmith to Achieve a 30% Accuracy Improvement with No Prompt Engineering\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nSkip to content\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nBy LangChain\n\n\n\n\nRelease Notes\n\n\n\n\nCase Studies\n\n\n\n\nLangChain\n\n\n\n\nGitHub\n\n\n\n\nDocs\n\n\n\n\n\nSign in\nSubscribe\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nHow Dosu Used LangSmith to Achieve a 30% Accuracy Improvement with No Prompt Engineering\n\n6 min read\nMay 2, 2024\n\n\n\n\n\nEditor's Note: the following is authored by Devin Stein, CEO of Dosu.\xa0In this blog we walk through how Dosu uses LangSmith to improve the performance of their application - with NO prompt engineering. Rather, they collected feedback from their users, transformed that into few shot examples, and then fed that back into their application.This is a relatively simple and general technique that can lead to automatic performance imp

In [10]:
documents[0].metadata["source"]

'https://blog.langchain.dev/dosu-langsmith-no-prompt-eng/'

### RecursiveCharacterTextSplitter

We're going to use a relatively naive text splitting strategy today!

In [11]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

split_documents = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    chunk_size = 256,
    chunk_overlap = 16
).split_documents(documents)

In [12]:
len(split_documents)

2569
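
To make the `chunk_size` and `chunk_overlap` settings concrete, here is a minimal sketch of fixed-window chunking over a token list. It is a simplification and not how `RecursiveCharacterTextSplitter` works internally: the real splitter recursively splits on separators and measures length with `tiktoken`, so its chunk boundaries will differ. The function name `chunk_with_overlap` and the placeholder tokens are assumptions for illustration only.

```python
def chunk_with_overlap(tokens, chunk_size=256, chunk_overlap=16):
    """Split a token list into windows of at most chunk_size tokens,
    where consecutive windows share chunk_overlap tokens."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reached the end
    return chunks


# Placeholder strings stand in for real tokens here.
tokens = [f"tok{i}" for i in range(600)]
chunks = chunk_with_overlap(tokens)
print([len(c) for c in chunks])
```

With these settings each window advances by 240 tokens, so the last 16 tokens of one chunk reappear at the start of the next.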

In [13]:
split_documents[42]

Document(page_content='are useful when you have a dataset with multiple categories to evaluate separately. This allows you to test new use cases by adding examples to a separate split to test, while preserving your evaluation workflow.In addition to splits, you can speed up finding relevant information with the following actions for your dataset examples:Clone examples to another datasetEdit metadata directly in the UISearch for specific examplesWalk through an example of how to use dataset splits in this video.🔁\xa0Repetitions to build confidence in your experiment resultsYou can now run multiple repetitions of your experiment in LangSmith. This helps smooth out noise from variability introduced by your application or from your LLM-as-a-judge evaluator, so you can build more confidence in the results of your experiment.In this video, learn how to evaluate on a dataset with repetitions. You can check the mean score across N repetitions, and also compare the outputs for variability a

### Embeddings

We'll be leveraging OpenAI's Embeddings Models today!

In [14]:
from langchain_openai import OpenAIEmbeddings

base_embeddings_model = OpenAIEmbeddings(model="text-embedding-3-small")

### FAISS VectorStore Retriever

Now we can use a FAISS VectorStore to embed and store our documents and then convert it to a retriever so it can be used in our chain!

In [None]:
pip install faiss-cpu

In [16]:
from langchain_community.vectorstores import FAISS

vectorstore = FAISS.from_documents(split_documents, base_embeddings_model)

In [17]:
base_retriever = vectorstore.as_retriever()
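
Conceptually, a vector store retriever embeds the query and returns the stored chunks whose vectors are closest to it. The sketch below shows that idea in plain Python with hand-written three-dimensional vectors. Cosine similarity is an assumption chosen for readability; the actual FAISS index uses its own distance metric and a far more efficient search. The names `toy_index` and `top_k` are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=2):
    """Return the ids of the k documents most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), doc_id)
              for doc_id, vec in doc_vecs.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [doc_id for _, doc_id in scored[:k]]


# Made-up embeddings; real ones come from the embeddings model.
toy_index = {
    "agents": [0.9, 0.1, 0.0],
    "rag": [0.1, 0.9, 0.1],
    "tracing": [0.0, 0.2, 0.9],
}
nearest = top_k([0.8, 0.2, 0.0], toy_index, k=2)
print(nearest)
```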

### Prompt Template

All we have left is a prompt template, which we'll create here!

In [18]:
from langchain.prompts import ChatPromptTemplate

base_rag_prompt_template = """\
Using the provided context, please answer the user's question. If you don't know the answer based on the context, say you don't know.

Context:
{context}

Question:
{question}
"""

base_rag_prompt = ChatPromptTemplate.from_template(base_rag_prompt_template)

### LCEL Chain

Now that we have:

- Embeddings Model
- Generation Model
- Retriever
- Prompt

We're ready to build our LCEL chain!

Keep in mind that we're returning the source documents alongside each response - while this isn't necessary, it's a great habit to get into.

In [19]:
from operator import itemgetter
from langchain_core.runnables import RunnablePassthrough, RunnableParallel
from langchain.schema import StrOutputParser

base_rag_chain = (
    # INVOKE CHAIN WITH: {"question" : "<<SOME USER QUESTION>>"}
    # "question" : populated by getting the value of the "question" key
    # "context"  : populated by getting the value of the "question" key and chaining it into the base_retriever
    {"context": itemgetter("question") | base_retriever, "question": itemgetter("question")}
    # "context"  : re-assigned to its own value by RunnablePassthrough.assign, while every other key
    #              (here "question") passes through unchanged, so this step does not alter the data
    | RunnablePassthrough.assign(context=itemgetter("context"))
    # "response" : the "context" and "question" values are used to format our prompt object and then piped
    #              into the LLM and stored in a key called "response"
    # "context"  : populated by getting the value of the "context" key from the previous step
    | {"response": base_rag_prompt | base_llm | StrOutputParser(), "context": itemgetter("context")}
)
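
To see how data moves through the three steps of the chain, here is a rough plain-Python sketch of the same flow. The retriever and the LLM are replaced by hypothetical stand-in functions (`fake_retriever`, `fake_llm`), so this only illustrates the shape of the dictionaries at each step, not LCEL itself.

```python
def fake_retriever(question):
    # Hypothetical stand-in for base_retriever: returns canned "documents".
    return [f"doc about: {question}"]


def fake_llm(prompt):
    # Hypothetical stand-in for base_rag_prompt | base_llm | StrOutputParser().
    return f"answer derived from a {len(prompt)}-character prompt"


def run_chain(inputs):
    # Step 1: build "context" from the retriever and copy "question" through.
    step1 = {
        "context": fake_retriever(inputs["question"]),
        "question": inputs["question"],
    }
    # Step 2: keep every key and set "context" to its existing value.
    step2 = {**step1, "context": step1["context"]}
    # Step 3: format the prompt, call the model, and return the context too.
    prompt = f"Context:\n{step2['context']}\n\nQuestion:\n{step2['question']}\n"
    return {"response": fake_llm(prompt), "context": step2["context"]}


result = run_chain({"question": "What is LangSmith?"})
print(sorted(result))
```

The output has the same two keys as the real chain, `response` and `context`, which is why later cells index the result with `['response']`.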

Let's test it out!

In [20]:
base_rag_chain.invoke({"question" : "What is a good way to evaluate agents?"})

{'response': 'A good way to evaluate agents is through assisted evaluation, which can help guide you to the most interesting datapoint to look at. Evaluating LLM output using LLMs is not perfect, but it is currently considered the best available solution and is seen as promising in the long run.',
 'context': [Document(page_content='Agents may be the “killer” LLM app, but building and evaluating agents is hard. Function calling is a key skill for effective tool use, but there aren’t many good benchmarks for measuring function calling performance. Today, we are excited to release four new test environments for benchmarking LLMs’ ability to effectively use tools to accomplish tasks. We hope this makes it easier for everyone to test different LLM and prompting strategies to show what enables the best agentic behavior.Example successful tool use for the Relational Data taskWe designed these tasks to test capabilities we consider to be prerequisites for common agentic workflows, suc

## LangSmith

Now that we have a chain - we're ready to get started with LangSmith!

We're going to go ahead and use the following `env` variables to get our Colab notebook set up to start reporting.

If all you needed was simple monitoring - this is all you would need to do!

In [21]:
from uuid import uuid4

unique_id = uuid4().hex[0:8]

os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_PROJECT"] = f"Langsmith_RAG_{unique_id}"
os.environ["LANGCHAIN_ENDPOINT"] = "https://api.smith.langchain.com"

### LangSmith API


In [22]:
os.environ["LANGCHAIN_API_KEY"] = getpass.getpass('Enter your LangSmith API key: ')

Enter your LangSmith API key: ··········
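
Since tracing is configured entirely through environment variables, a quick sanity check before running anything can save a confusing debugging session. The helper below is a small sketch, not part of LangSmith: it just reports which of the four variables set in this notebook are empty or missing. The name `missing_tracing_vars` is hypothetical.

```python
import os

# The four variables this notebook sets for LangSmith tracing.
NOTEBOOK_TRACING_VARS = (
    "LANGCHAIN_TRACING_V2",
    "LANGCHAIN_PROJECT",
    "LANGCHAIN_ENDPOINT",
    "LANGCHAIN_API_KEY",
)


def missing_tracing_vars(env=None):
    """Return the names of tracing variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in NOTEBOOK_TRACING_VARS if not env.get(name)]


# Example with an explicit mapping instead of the real environment.
print(missing_tracing_vars({"LANGCHAIN_TRACING_V2": "true"}))
```

Calling `missing_tracing_vars()` with no arguments checks the real environment and should return an empty list once the cells above have run.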


Let's test out our first generation!

In [23]:
base_rag_chain.invoke({"question" : "What is LangSmith?"}, {"tags" : ["Demo Run"]})['response']

'LangSmith is a framework built on the shoulders of LangChain that is designed to track the inner workings of LLMs (Large Language Models) and AI agents within a product. It helps with debugging, testing, monitoring, and evaluating LLM applications. LangSmith can be used independently of LangChain and provides tools to assist users in managing and improving their AI applications.'

## Create Testing Dataset

Now we can create a dataset from some user-defined questions, using the retrieved context for each question as its "ground truth" context.

> NOTE: There are many different ways you can approach this specific task - generating ground truth answers with AI, using human experts to generate golden datasets, and more!

In [24]:
from langsmith import Client

test_inputs = [
    "What is LangSmith?",
    "What is LangServe?",
    "How could I benchmark RAG on tables?",
    "What was exciting about LangChain's first birthday?",
    "What features were released for LangChain on August 7th?",
    "What is a conversational retrieval agent?"
]

client = Client()

dataset_name = "langsmith-demo-dataset-v1"

dataset = client.create_dataset(
    dataset_name=dataset_name, description="LangChain Blog Test Questions"
)

# Use "question" rather than "input" to avoid shadowing the built-in input()
for question in test_inputs:
    client.create_example(
        inputs={"question" : question},
        outputs={"answer" : base_rag_chain.invoke({"question" : question})["context"]},
        dataset_id=dataset.id
    )
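
The loop above stores the raw list of retrieved documents as each example's reference "answer". One alternative, sketched here as an assumption rather than a recommendation from LangSmith, is to flatten the retrieved chunks into a single plain-text string first, which can be easier to read when inspecting examples. `ToyDocument` is a hypothetical stand-in that only mirrors the `page_content` and `metadata` fields of a LangChain `Document`.

```python
from dataclasses import dataclass, field


@dataclass
class ToyDocument:
    # Hypothetical stand-in for a LangChain Document.
    page_content: str
    metadata: dict = field(default_factory=dict)


def context_to_reference(docs, separator="\n\n"):
    """Join retrieved chunks into a single plain-text reference string."""
    return separator.join(doc.page_content.strip() for doc in docs)


# Placeholder chunks, not real retrieval results.
docs = [ToyDocument("  first chunk  "), ToyDocument("second chunk")]
reference = context_to_reference(docs)
print(repr(reference))
```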

### Evaluation

Now we can run the evaluation!

In [25]:
from langchain.smith import RunEvalConfig, run_on_dataset

eval_llm = ChatOpenAI(model="gpt-4-0125-preview", temperature=0)

eval_config = RunEvalConfig(
  evaluators=[
    RunEvalConfig.CoTQA(llm=eval_llm, prediction_key="response"),
    RunEvalConfig.Criteria("harmfulness", prediction_key="response"),
  ]
)

base_rag_base_run = run_on_dataset(
    client=client,
    dataset_name=dataset_name,
    llm_or_chain_factory=base_rag_chain,
    evaluation=eval_config,
    verbose=True,
)

View the evaluation results for project 'unique-wire-90' at:
https://smith.langchain.com/o/b3011959-9014-5768-9a47-c909b1ca0ccc/datasets/fa434207-2632-42de-813e-cbab8332a052/compare?selectedSessions=24cd7702-a4e1-4356-913c-8efc3a84d36d

View all tests for Dataset langsmith-demo-dataset-v1 at:
https://smith.langchain.com/o/b3011959-9014-5768-9a47-c909b1ca0ccc/datasets/fa434207-2632-42de-813e-cbab8332a052
[------------------------------------------------->] 6/6
 Experiment Results:
        feedback.Contextual Accuracy  feedback.harmfulness error  execution_time                                run_id
count                           6.00                  6.00     0            6.00                                     6
unique                           NaN                   NaN     0             NaN                                     6
top                              NaN                   NaN   NaN             NaN  9a5d793b-4df2-4404-8019-505b25b6c3fb
freq                             NaN   

## Adding Reranking

We'll add reranking to our RAG application to check the claim made by [Cohere](https://cohere.com/rerank)!

`Improve search performance with a single line of code`

We'll put that to the test today!

In [26]:
os.environ["COHERE_API_KEY"] = getpass.getpass("Enter your Cohere API Key:")

Enter your Cohere API Key:¬∑¬∑¬∑¬∑¬∑¬∑¬∑¬∑¬∑¬∑


In [27]:
base_retriever_expander = vectorstore.as_retriever(
    search_kwargs={"k" : 10}
)

In [None]:
pip install cohere

In [30]:
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import CohereRerank

reranker = CohereRerank()
rerank_retriever = ContextualCompressionRetriever(
    base_compressor=reranker, base_retriever=base_retriever_expander
)
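
The pattern here is "retrieve wide, then rerank": the first-stage retriever returns more candidates than we need (`k=10` above), and a second model reorders them and keeps only the best few. The sketch below shows that two-stage shape in plain Python. The word-overlap score is a made-up stand-in for Cohere's reranking model, and `rerank`, `word_overlap`, and `top_n` are hypothetical names for this illustration.

```python
def word_overlap(query, doc):
    # Hypothetical relevance score: number of shared lowercase words.
    return len(set(query.lower().split()) & set(doc.lower().split()))


def rerank(query, candidates, score_fn, top_n=3):
    """Reorder candidates by score (highest first) and keep the top_n."""
    ordered = sorted(candidates, key=lambda doc: score_fn(query, doc), reverse=True)
    return ordered[:top_n]


# Placeholder first-stage results, in the order the retriever returned them.
candidates = [
    "release notes for august",
    "agents and tool use",
    "how to evaluate agents with langsmith",
    "benchmarking rag on tables",
]
best = rerank("evaluate agents", candidates, word_overlap, top_n=2)
print(best)
```

Note how the most relevant candidate was third in the original ordering; the second stage is what moves it to the front.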

### Recreating our Chain with Reranker

Now we can recreate our chain using the reranker.

In [31]:
rerank_rag_chain = (
    {"context": itemgetter("question") | rerank_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": base_rag_prompt | base_llm | StrOutputParser(), "context": itemgetter("context")}
)

rerank_rag_chain = rerank_rag_chain.with_config({"tags" : ["cohere-rerank"]})

### Improved Evaluation

Now we can use LangSmith's full evaluation suite to score our chains on multiple metrics, including custom metrics!

In [32]:
eval_config = RunEvalConfig(
  evaluators=[
    RunEvalConfig.CoTQA(llm=eval_llm, prediction_key="response"),
    RunEvalConfig.Criteria("harmfulness", prediction_key="response"),
    RunEvalConfig.LabeledCriteria(
        {
            "helpfulness" : (
                "Is this submission helpful to the user, "
                "taking into account the correct reference answer?"
            )
        },
        prediction_key="response"
    ),
    RunEvalConfig.LabeledCriteria(
        {
            "litness" : (
                "Is this submission lit, dope, or cool?"
            )
        },
        prediction_key="response"
    ),
    RunEvalConfig.LabeledCriteria("conciseness", prediction_key="response"),
    RunEvalConfig.LabeledCriteria("coherence", prediction_key="response"),
    RunEvalConfig.LabeledCriteria("relevance", prediction_key="response")
  ]
)
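
Custom criteria descriptions are often written as adjacent string literals inside parentheses. Python joins adjacent literals with no separator, so each fragment except the last needs a trailing space, otherwise the evaluator prompt ends up with words run together. A small sketch of the difference:

```python
# Adjacent string literals are concatenated exactly as written.
without_space = (
    "Is this submission helpful to the user,"
    "taking into account the correct reference answer?"
)
with_space = (
    "Is this submission helpful to the user, "
    "taking into account the correct reference answer?"
)

print(without_space)
print(with_space)
```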

### Running Eval on Each Chain

Now we can evaluate each of our chains!

In [36]:
base_chain_results = run_on_dataset(
    client=client,
    dataset_name=dataset_name,
    llm_or_chain_factory=base_rag_chain,
    evaluation=eval_config,
    verbose=True,
)

View the evaluation results for project 'left-map-62' at:
https://smith.langchain.com/o/b3011959-9014-5768-9a47-c909b1ca0ccc/datasets/fa434207-2632-42de-813e-cbab8332a052/compare?selectedSessions=5eb7f6b0-3815-4fd8-b10d-7557662ae28a

View all tests for Dataset langsmith-demo-dataset-v1 at:
https://smith.langchain.com/o/b3011959-9014-5768-9a47-c909b1ca0ccc/datasets/fa434207-2632-42de-813e-cbab8332a052
[------------------------------------------------->] 6/6
 Experiment Results:
        feedback.Contextual Accuracy  feedback.harmfulness  feedback.helpfulness  feedback.litness  feedback.conciseness  feedback.coherence  feedback.relevance error  execution_time                                run_id
count                           6.00                  6.00                  5.00              3.00                  3.00                3.00                3.00     0            6.00                                     6
unique                           NaN                   NaN                  

In [35]:
rerank_chain_results = run_on_dataset(
    client=client,
    dataset_name=dataset_name,
    llm_or_chain_factory=rerank_rag_chain,
    evaluation=eval_config,
    verbose=True,
)

View the evaluation results for project 'long-science-94' at:
https://smith.langchain.com/o/b3011959-9014-5768-9a47-c909b1ca0ccc/datasets/fa434207-2632-42de-813e-cbab8332a052/compare?selectedSessions=ab551064-fb88-4406-9e4f-f36067029a99

View all tests for Dataset langsmith-demo-dataset-v1 at:
https://smith.langchain.com/o/b3011959-9014-5768-9a47-c909b1ca0ccc/datasets/fa434207-2632-42de-813e-cbab8332a052
[------------------------------------------------->] 6/6
 Experiment Results:
        feedback.Contextual Accuracy  feedback.harmfulness  feedback.helpfulness  feedback.litness  feedback.conciseness  feedback.coherence  feedback.relevance error  execution_time                                run_id
count                           6.00                  6.00                  6.00              5.00                  3.00                3.00                3.00     0            6.00                                     6
unique                           NaN                   NaN              
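
In the result tables above, the `count` row for some feedback columns is lower than 6, which means not every example received a score from every evaluator. When comparing chains it helps to report how many scores a mean is based on. The sketch below does that for a list of scores where `None` marks a missing value; the numbers are made-up placeholders, not results from this notebook, and `summarize` is a hypothetical helper.

```python
def summarize(scores):
    """Return the number of present scores and their mean (None if empty)."""
    present = [s for s in scores if s is not None]
    mean = sum(present) / len(present) if present else None
    return {"count": len(present), "mean": mean}


# Made-up placeholder scores; None marks an example with no score.
hypothetical_scores = [1, 0, None, 1, None, 1]
summary = summarize(hypothetical_scores)
print(summary)
```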