# Introduction to LangChain v0.2.0 and LCEL: LangChain Powered RAG

In the following notebook we're going to focus on learning how to navigate and build useful applications using LangChain, specifically LCEL, and how to integrate different APIs together into a coherent RAG application!

In the notebook, you'll complete the following Tasks:

- 🤝 Breakout Room #1:
  1. Install required libraries
  2. Set Environment Variables
  3. Initialize a Simple Chain using LCEL
  4. Implement Naive RAG using LCEL
  
Let's get started!



## Task 1: Installing Required Libraries

One of the [key features](https://blog.langchain.dev/langchain-v02-leap-to-stability/) of LangChain v0.2.0 is the compartmentalization of the various LangChain ecosystem packages and added stability.

Instead of one all-encompassing Python package, LangChain has a `core` package and a number of supplementary packages.

We'll start by grabbing all of our LangChain related packages!

In [1]:
!poetry install

Installing dependencies from lock file

No dependencies to install or update


## Task 2: Set Environment Variables

We'll be leveraging OpenAI's suite of APIs - so we'll set our `OPENAI_API_KEY` `env` variable here!

In [1]:
import os
from getpass import getpass

# Prompt for the key only if it isn't already set in the environment
if "OPENAI_API_KEY" not in os.environ:
    os.environ["OPENAI_API_KEY"] = getpass("OpenAI API Key: ")


## Task 3: Initialize a Simple Chain using LCEL

The first thing we'll do is familiarize ourselves with LCEL and the specific ins and outs of how we can use it!

### LLM Orchestration Tool (LangChain)

Let's dive right into [LangChain](https://www.langchain.com/)!

The first thing we want to do is create an object that lets us access OpenAI's `gpt-4o` model.

In [2]:
from langchain_openai import ChatOpenAI

openai_chat_model = ChatOpenAI(model="gpt-4o")

#### ❓ Question #1:

What other models could we use, and how would the above code change?

> HINT: Check out [this page](https://platform.openai.com/docs/models) to find the answer!

#### 🔍Answer #1:
Based on the OpenAI models documentation, there are several other models we could use instead of "gpt-4o". Here are some options and how the code would change for each:

1. GPT-4o mini:
```python
openai_chat_model = ChatOpenAI(model="gpt-4o-mini")
```

2. GPT-4 Turbo:
```python
openai_chat_model = ChatOpenAI(model="gpt-4-turbo")
```

3. GPT-4:
```python
openai_chat_model = ChatOpenAI(model="gpt-4")
```

4. GPT-3.5 Turbo:
```python
openai_chat_model = ChatOpenAI(model="gpt-3.5-turbo")
```

5. ChatGPT-4o latest:
```python
openai_chat_model = ChatOpenAI(model="chatgpt-4o-latest")
```

### Prompt Template

Now, we'll set up a prompt template - more specifically a `ChatPromptTemplate`. This will let us build a prompt we can modify when we call our LLM!

In [3]:
from langchain_core.prompts import ChatPromptTemplate

system_template = "You are a legendary and mythical Wizard. You speak in riddles and make obscure and pun-filled references to exotic cheeses."
# NOTE: directly pass in user input to the start of message/chain pipeline
human_template = "{content}"

chat_prompt = ChatPromptTemplate.from_messages([
    ("system", system_template),
    ("human", human_template)
])

### Our First Chain

Now we can set up our first chain!

A chain is simply a set of components that feed directly into each other in a sequential fashion; ours has two.

You'll notice that we're using the pipe operator `|` to connect our `chat_prompt` to our `openai_chat_model`.

This is a simplified method of creating chains and it leverages the LangChain Expression Language, or LCEL.

You can read more about it [here](https://python.langchain.com/v0.2/docs/concepts/#langchain-expression-language-lcel), but there are a few features we should be aware of out of the box (taken directly from LangChain's documentation linked above):

- **Async, Batch, and Streaming Support** Any chain constructed this way will automatically have full sync, async, batch, and streaming support. This makes it easy to prototype a chain in a Jupyter notebook using the sync interface, and then expose it as an async streaming interface.

- **Fallbacks** The non-determinism of LLMs makes it important to be able to handle errors gracefully. With LCEL you can easily attach fallbacks to any chain.

- **Parallelism** Since LLM applications involve (sometimes long) API calls, it often becomes important to run things in parallel. With LCEL syntax, any components that can be run in parallel automatically are.

In the following code cell we have two components:

- `chat_prompt`, which is a formattable `ChatPromptTemplate` that contains a system message and a human message.
- `openai_chat_model`, which is an OpenAI chat model client wrapped as a LangChain Runnable.

We'd like to be able to pass our own `content` (as found in our `human_template`) and then have the resulting message pair sent to our model and responded to!

In [4]:
chain = chat_prompt | openai_chat_model

Notice the pattern here:

We invoke our chain with the `dict` `{"content" : "Hello world!"}`.

It enters our chain:

`{"content" : "Hello world!"}` -> `invoke()` -> `chat_prompt`

Our `chat_prompt` returns a `PromptValue`, which is the formatted prompt. We then "pipe" the output of our `chat_prompt` into our model, `openai_chat_model`.

`PromptValue` -> `|` -> `openai_chat_model`

Our model then takes the list of messages and returns its output as an `AIMessage` (not a plain `str`), which is why we can call `.pretty_print()` on the response!

In [5]:
response = chain.invoke({"content": "Hello world!"})
response.pretty_print()


Ah, greetings, seeker of wisdom! The world, you say? Why, it is but a wheel of Emmental, full of holes and mysteries to unravel. What riddle do you bring to this aged and curdled mind today?


Let's try it out with a different prompt!

In [6]:
response = chain.invoke({"content" : "Could I please have some advice on how to become a better Python Programmer?"})
response.pretty_print()


Ah, seeker of serpentine script and cheddar challenges, listen well to my riddle-bound wisdom!

To master the Python's slithering code,
Follow these steps down the gouda road:

1. **Read the Grimoire**: Dive into the "Python Enhancement Proposals" (PEPs), especially PEP 8. It's the Necronomicon of good practices, as essential as a wheel of Parmigiano-Reggiano to a gourmet.

2. **Practice the Incantations**: Write code daily, for repetition is the key to a sharp mind, much like aging is to a fine Roquefort. Conjure projects, solve puzzles, and weave your spells with consistency.

3. **Join the Coven**: Engage with the Python community. Be it forums, local meetups, or online guilds, the exchange of knowledge is as enriching as a slice of Camembert shared among friends.

4. **Study the Tomes**: Consume books and scrolls, such as "Automate the Boring Stuff with Python" and "Fluent Python". These are the Fontina and the Havarti of your educational feast.

5. **Invoke the Elders**: Seek out

Notice how we specifically referenced our `content` format option!

Now that we have the basics set up - let's see what we mean by "Retrieval Augmented" Generation.

## Naive RAG - Manually adding context through the Prompt Template

Let's look at how our model performs at a simple task - defining what LangChain is!

We'll redo some of our previous work to change the `system_template` to be less...verbose.

In [7]:
system_template = "You are a helpful assistant."
human_template = "{content}"

chat_prompt = ChatPromptTemplate.from_messages([
    ("system", system_template),
    ("human", human_template)
])

chat_chain = chat_prompt | openai_chat_model


In [8]:
response = chat_chain.invoke({"content" : "Please define LangChain."})
response.pretty_print()


LangChain is a software development framework designed to streamline the creation of applications that leverage large language models (LLMs). The framework provides essential components and tools for integrating LLMs with various data sources, user interfaces, and other software systems. By using LangChain, developers can more easily build and deploy sophisticated applications that utilize the natural language processing capabilities of LLMs, enhancing their functionality and user experience. The framework typically includes features for managing model inputs and outputs, handling data preprocessing and postprocessing, and ensuring efficient and scalable application performance.


Well, that's not very good - is it!

The issue at play here is that our model knows little (if anything) about "LangChain" from its training data, so it's left with nothing but a generic guess - definitely not what we want the answer to be!

Let's ask another simple LangChain question!

In [9]:
response = chat_chain.invoke({"content" : "What is LangChain Expression Language (LECL)?"})
response.pretty_print()


LangChain Expression Language (LEL) is a domain-specific language designed for the LangChain framework, which focuses on building applications with language models. LEL provides a structured way to define and manipulate the behavior of language models within the LangChain ecosystem.

Here are some key features and purposes of LEL:

1. **Structured Expression**: LEL allows developers to write expressions that can be used to control and manipulate language model outputs, making it easier to build complex applications that require fine-tuned interaction with language models.

2. **Integration with LangChain**: It seamlessly integrates with the LangChain framework, enabling developers to use LEL expressions to define workflows, data processing pipelines, and other tasks that involve language models.

3. **Flexibility and Control**: By using LEL, developers can have more control over how language models are utilized, including specifying conditions, transformations, and other logic that sh

While it provides a confident response, that response is entirely fictitious! Not a great look, OpenAI!

However, let's see what happens when we rework our prompt and add the content from the docs to it as context.

In [10]:
HUMAN_TEMPLATE = """
CONTEXT:
{context}

QUERY:
{query}

Use the provided context to answer the provided user query. Only use the provided context to answer the query. If you do not know the answer, respond with "I don't know"
"""

CONTEXT = """
LangChain Expression Language or LCEL is a declarative way to easily compose chains together. There are several benefits to writing chains in this manner (as opposed to writing normal code):

Async, Batch, and Streaming Support Any chain constructed this way will automatically have full sync, async, batch, and streaming support. This makes it easy to prototype a chain in a Jupyter notebook using the sync interface, and then expose it as an async streaming interface.

Fallbacks The non-determinism of LLMs makes it important to be able to handle errors gracefully. With LCEL you can easily attach fallbacks to any chain.

Parallelism Since LLM applications involve (sometimes long) API calls, it often becomes important to run things in parallel. With LCEL syntax, any components that can be run in parallel automatically are.

Seamless LangSmith Tracing Integration As your chains get more and more complex, it becomes increasingly important to understand what exactly is happening at every step. With LCEL, all steps are automatically logged to LangSmith for maximal observability and debuggability.
"""

chat_prompt = ChatPromptTemplate.from_messages([
    ("human", HUMAN_TEMPLATE)
])

chat_chain = chat_prompt | openai_chat_model

In [11]:
response = chat_chain.invoke({"query" : "What is LangChain Expression Language?", "context" : CONTEXT})
response.pretty_print()


LangChain Expression Language (LCEL) is a declarative way to easily compose chains together. It provides several benefits:

1. **Async, Batch, and Streaming Support**: Chains constructed with LCEL automatically support synchronous, asynchronous, batch, and streaming interfaces.
2. **Fallbacks**: LCEL allows for graceful error handling by easily attaching fallbacks to any chain, addressing the non-determinism of LLMs.
3. **Parallelism**: Components that can be run in parallel will automatically do so, optimizing performance.
4. **Seamless LangSmith Tracing Integration**: All steps in LCEL chains are automatically logged to LangSmith, enhancing observability and debuggability as chains grow more complex.


You'll notice that the response is much better this time. Not only does it answer the question well - but there's no trace of confabulation (hallucination) at all!

> NOTE: While RAG is an effective strategy to *help* ground LLMs, it is far from 100% effective. You will still need other processes to verify that responses are factual.

That, in essence, is the idea of RAG. We provide the model with context to answer our queries - and rely on it to translate the potentially lengthy and difficult-to-parse context into a natural language answer!

However, manually providing context is not scalable - and if we have to find the right context by hand for every query, the model isn't saving us much work.

Enter: Retrieval Pipelines.

## Task 4: Implement Naive RAG using LCEL

Now we can make a naive RAG application that will help us bridge the gap between our Pythonic implementation and a fully LangChain powered solution!

## Putting the R in RAG: Retrieval 101

In order to make our RAG system useful, we need a way to provide the LLM with the context that is most likely to answer our user's query.

Let's tackle an immediate problem first: The Context Window.

Most LLMs have a limited context window, which is typically measured in tokens. This window is an upper bound on how much we can fit in the model's input at a time.

Let's say we want to work off of a relatively large piece of source data - like the Ultimate Hitchhiker's Guide to the Galaxy. All 898 pages of it!

> NOTE: It is recommended you do not run the following cells, they are purely for demonstrative purposes.

In [12]:
context = """
EVERY HITCHHIKER'S GUIDE BOOK
"""

We can use OpenAI's `tiktoken` tokenizer to count the number of tokens for us!

In [13]:
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4o")

In [14]:
len(enc.encode(context))

12

The full set comes in at a whopping *636,144* tokens.

So, we have too much context. What can we do?

Well, the first thing that might enter your mind is: "Use a model with a larger context window", and we could definitely do that! However, even a model with a 128K-token context window, such as `gpt-4o`, wouldn't be able to fit that whole text in the context window at once.

So, we can try splitting our document up into little pieces - that way, we can avoid providing too much context.
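A quick back-of-the-envelope check of these numbers, assuming a 128,000-token context window (the size of `gpt-4o`'s) and the 100-token chunk size we use below. It ignores the room the instructions and the model's answer also need, so the real picture is slightly worse.

```python
import math

TOTAL_TOKENS = 636_144    # token count for the full Hitchhiker's Guide set (from above)
CONTEXT_WINDOW = 128_000  # assumed context window, in tokens
CHUNK_SIZE = 100          # maximum tokens per chunk

windows_needed = math.ceil(TOTAL_TOKENS / CONTEXT_WINDOW)
min_chunks = math.ceil(TOTAL_TOKENS / CHUNK_SIZE)  # chunks are *at most* 100 tokens

print(f"{TOTAL_TOKENS / CONTEXT_WINDOW:.2f} windows' worth of text")  # 4.97 windows' worth of text
print(windows_needed)  # 5
print(min_chunks)      # 6362
```

So the text needs about five full context windows, and splitting it yields thousands of chunks, far more than we could ever put in one prompt.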

We have another problem now.

If we split our document into little pieces and can't put all of them in the prompt, how do we decide which ones to include?!

> NOTE: Content splitting/chunking strategies are an active area of research and iterative development. There is no "one size fits all" approach to chunking/splitting at this moment. Use your best judgement to determine chunking strategies!

In order to conceptualize the following processes - let's create a toy context set!

### Text Splitting aka Chunking

We'll use the `RecursiveCharacterTextSplitter` to create our toy example.

It will split based on the following rules:

- Each chunk has a maximum size of 100 tokens (as measured by our `tiktoken_len` function)
- It will try to split first on the `\n\n` character, then on `\n`, then on the `<SPACE>` character, and finally on individual characters.
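The recursive idea can be sketched in plain Python. This `recursive_split` is a simplified, hypothetical re-implementation for illustration, not LangChain's actual code: it measures length in characters by default (you could pass `length=tiktoken_len` to count tokens instead), drops separators at chunk boundaries (LangChain keeps them by default), and supports no overlap.

```python
def recursive_split(text, max_len, separators=("\n\n", "\n", " ", ""), length=len):
    """Split text into pieces of at most max_len, trying coarse separators first."""
    if length(text) <= max_len:
        return [text]
    sep, *finer = separators
    # An empty separator means: fall back to individual characters
    parts = list(text) if sep == "" else text.split(sep)
    chunks, current = [], ""
    for part in parts:
        candidate = part if not current else current + sep + part
        if length(candidate) <= max_len:
            current = candidate  # keep merging pieces while they still fit
            continue
        if current:
            chunks.append(current)
            current = ""
        if length(part) <= max_len:
            current = part
        elif finer:
            # This piece alone is too big: split it with the next, finer separator
            chunks.extend(recursive_split(part, max_len, finer, length))
        else:
            chunks.append(part)  # cannot split any further
    if current:
        chunks.append(current)
    return chunks


pieces = recursive_split("aaa bbb\n\nccc ddd eee", max_len=7)
print(pieces)  # ['aaa bbb', 'ccc ddd', 'eee']
```

The first paragraph fits whole, while the second is too long and gets re-split on spaces, which mirrors the `\n\n` -> `\n` -> `<SPACE>` -> character order described above.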

Let's implement it and see the results!

In [15]:
import tiktoken
from langchain.text_splitter import RecursiveCharacterTextSplitter

def tiktoken_len(text):
    tokens = tiktoken.encoding_for_model("gpt-4o").encode(
        text,
    )
    return len(tokens)

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 100,
    chunk_overlap = 0,
    length_function = tiktoken_len,
)

In [16]:
chunks = text_splitter.split_text(CONTEXT)

In [17]:
len(chunks)

3

In [18]:
for chunk in chunks:
  print(chunk)
  print("----")

LangChain Expression Language or LCEL is a declarative way to easily compose chains together. There are several benefits to writing chains in this manner (as opposed to writing normal code):

Async, Batch, and Streaming Support Any chain constructed this way will automatically have full sync, async, batch, and streaming support. This makes it easy to prototype a chain in a Jupyter notebook using the sync interface, and then expose it as an async streaming interface.
----
Fallbacks The non-determinism of LLMs makes it important to be able to handle errors gracefully. With LCEL you can easily attach fallbacks to any chain.

Parallelism Since LLM applications involve (sometimes long) API calls, it often becomes important to run things in parallel. With LCEL syntax, any components that can be run in parallel automatically are.
----
Seamless LangSmith Tracing Integration As your chains get more and more complex, it becomes increasingly important to understand what exactly is happening at ev

As shown in our result, we've split the text into chunks of at most 100 tokens, cleanly separated at the `\n\n` boundaries!

#### 🏗️ Activity #1:

While there's nothing specifically wrong with the chunking method used above -
it is a naive approach that is not sensitive to specific data formats.

Brainstorm some ideas that would split large single documents into smaller
documents.

1. `Semantic Chunking`: Use natural language processing to identify topic
   changes or semantic shifts in the text, creating chunks based on coherent
   themes or ideas.
2. `Sliding Window with Overlap`: Implement a sliding window approach that
   moves through the document, creating chunks of a fixed size with a specified
   overlap to maintain context between chunks.
3. `Structure-based Splitting`: For documents with clear structural elements
   (e.g., chapters, sections, or XML tags), split the document based on these
   inherent divisions to preserve the original organization.
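Idea 2 is easy to sketch. This hypothetical `sliding_window` helper works on any token list (for example, the output of `enc.encode(...)` from earlier, which `enc.decode` can turn back into text). LangChain's splitters offer the same idea through their `chunk_overlap` parameter, which we set to `0` above.

```python
def sliding_window(tokens, window, overlap):
    """Fixed-size chunks where consecutive chunks share `overlap` tokens."""
    if not 0 <= overlap < window:
        raise ValueError("overlap must satisfy 0 <= overlap < window")
    step = window - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + window])
        if start + window >= len(tokens):
            break  # this window already reached the end of the document
    return chunks


words = "the answer to life the universe and everything is 42".split()
for chunk in sliding_window(words, window=4, overlap=1):
    print(" ".join(chunk))
```

Each chunk repeats the last token of the previous one, so a sentence cut at a boundary still appears with some of its surrounding context.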


## Embeddings and Dense Vector Search

Now that we have our individual chunks, we need a system to correctly select the relevant pieces of information to answer our query.

This sounds like a perfect job for embeddings!

We'll be using OpenAI's `text-embedding-3-small` model as our embedding model today!

Let's load it up through LangChain.

In [19]:
from langchain_openai.embeddings import OpenAIEmbeddings

embedding_model = OpenAIEmbeddings(model="text-embedding-3-small")

#### ❓ Question #2:

What is the embedding dimension, given that we're using `text-embedding-3-small`?

> HINT: Check out the [docs](https://platform.openai.com/docs/guides/embeddings) to help you answer this question.

#### 🔍Answer #2:
By default, the length of the embedding vector will be `1536` for `text-embedding-3-small`.

### Finding the Embeddings for Our Chunks

First, let's find all our embeddings for each chunk and store them in a convenient format for later.

In [20]:
# Embed all chunks in one batched call, then map each chunk to its vector
chunk_embeddings = embedding_model.embed_documents(chunks)
embeddings_dict = dict(zip(chunks, chunk_embeddings))

In [21]:
for k,v in embeddings_dict.items():
  print(f"Chunk - {k}")
  print("---")
  print(f"Embedding - Vector of Size: {len(v)}")
  print("\n\n")

Chunk - LangChain Expression Language or LCEL is a declarative way to easily compose chains together. There are several benefits to writing chains in this manner (as opposed to writing normal code):

Async, Batch, and Streaming Support Any chain constructed this way will automatically have full sync, async, batch, and streaming support. This makes it easy to prototype a chain in a Jupyter notebook using the sync interface, and then expose it as an async streaming interface.
---
Embedding - Vector of Size: 1536



Chunk - Fallbacks The non-determinism of LLMs makes it important to be able to handle errors gracefully. With LCEL you can easily attach fallbacks to any chain.

Parallelism Since LLM applications involve (sometimes long) API calls, it often becomes important to run things in parallel. With LCEL syntax, any components that can be run in parallel automatically are.
---
Embedding - Vector of Size: 1536



Chunk - Seamless LangSmith Tracing Integration As your chains get more and

Okay, great. Let's create a query - and then embed it!

In [22]:
query = "Can LCEL help take code from the notebook to production?"

query_vector = embedding_model.embed_query(query)
print(f"Vector of Size: {len(query_vector)}")

Vector of Size: 1536


Now, let's compare it against each existing chunk's embedding by using cosine similarity.
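For reference, cosine similarity divides the dot product of two vectors by the product of their lengths, so it measures the angle between them and ignores their magnitude. Scores range from -1 to 1, with higher meaning more similar. The worked example on the right uses made-up 2-D vectors.

$$
\text{cosine\_similarity}(\mathbf{a}, \mathbf{b}) = \frac{\mathbf{a} \cdot \mathbf{b}}{\lVert \mathbf{a} \rVert \, \lVert \mathbf{b} \rVert}
\qquad\text{e.g.}\qquad
\frac{(1, 0) \cdot (1, 1)}{1 \cdot \sqrt{2}} = \frac{1}{\sqrt{2}} \approx 0.707
$$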

In [23]:
import numpy as np
from numpy.linalg import norm

def cosine_similarity(vec_1, vec_2):
  return np.dot(vec_1, vec_2) / (norm(vec_1) * norm(vec_2))

In [24]:
max_similarity = -float('inf')
closest_chunk = ""

for chunk, chunk_vector in embeddings_dict.items():
  cosine_similarity_score = cosine_similarity(chunk_vector, query_vector)

  if cosine_similarity_score > max_similarity:
    closest_chunk = chunk
    max_similarity = cosine_similarity_score

print(closest_chunk)
print(max_similarity)

LangChain Expression Language or LCEL is a declarative way to easily compose chains together. There are several benefits to writing chains in this manner (as opposed to writing normal code):

Async, Batch, and Streaming Support Any chain constructed this way will automatically have full sync, async, batch, and streaming support. This makes it easy to prototype a chain in a Jupyter notebook using the sync interface, and then expose it as an async streaming interface.
0.537298487051912


And we get the expected result, which is the passage that specifically mentions prototyping in a Jupyter Notebook!

### Creating a Retriever

Now that we have an idea of how we're getting our most relevant information - let's see how we could create a pipeline that would automatically extract the closest chunk to our query and use it as context for our prompt!

First, we'll wrap the above in a helper function!

In [25]:
def retrieve_context(query, embeddings_dict, embedding_model):
  query_vector = embedding_model.embed_query(query)
  max_similarity = -float('inf')
  closest_chunk = ""

  for chunk, chunk_vector in embeddings_dict.items():
    cosine_similarity_score = cosine_similarity(chunk_vector, query_vector)

    if cosine_similarity_score > max_similarity:
      closest_chunk = chunk
      max_similarity = cosine_similarity_score

  return closest_chunk
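`retrieve_context` returns only the single best chunk. A common extension is to return the top *k* chunks, scoring all of them at once with NumPy. This `retrieve_top_k` is a hypothetical helper, shown here with made-up 2-D vectors; in the notebook you could pass `query_vector`, `list(embeddings_dict.values())`, and `list(embeddings_dict.keys())`.

```python
import numpy as np


def retrieve_top_k(query_vector, chunk_vectors, chunks, k=2):
    """Return the k (chunk, cosine score) pairs most similar to the query."""
    matrix = np.asarray(chunk_vectors, dtype=float)  # shape: (n_chunks, dim)
    query = np.asarray(query_vector, dtype=float)    # shape: (dim,)
    # Cosine similarity of the query against every row in one vectorized step
    scores = matrix @ query / (np.linalg.norm(matrix, axis=1) * np.linalg.norm(query))
    top = np.argsort(scores)[::-1][:k]               # indices, best first
    return [(chunks[i], float(scores[i])) for i in top]


toy_chunks = ["about fallbacks", "about streaming", "about both"]
toy_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
for chunk, score in retrieve_top_k([1.0, 0.1], toy_vectors, toy_chunks, k=2):
    print(f"{score:.3f}  {chunk}")
```

OpenAI embeddings are normalized to length 1, so for them the dot product alone would give the same ranking; the explicit normalization keeps the sketch correct for arbitrary vectors.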

Now, let's add it to our pipeline!

In [26]:
def simple_rag(query, embeddings_dict, embedding_model, chat_chain):
  context = retrieve_context(query, embeddings_dict, embedding_model)

  response = chat_chain.invoke({"query" : query, "context" : context})

  return_package = {
      "query" : query,
      "response" : response,
      "retriever_context" : context
  }

  return return_package

In [28]:
response = simple_rag("Can LCEL help take code from the notebook to production?", embeddings_dict, embedding_model, chat_chain)

In [31]:
response["response"].pretty_print()


Yes, LCEL can help take code from the notebook to production. By using LCEL to compose chains, you automatically gain full support for synchronous, asynchronous, batch, and streaming operations. This makes it straightforward to prototype a chain in a Jupyter notebook using the sync interface and then expose it as an async streaming interface for production use.


#### ❓ Question #3:

What does LCEL do that makes it more reliable at scale?

> HINT: Use your newly created `simple_rag` to help you answer this question!

#### 🔍Answer #3:

##### RAG Answer

Yes, LCEL can help take code from the notebook to production. By using LCEL to
compose chains, you automatically gain full support for synchronous,
asynchronous, batch, and streaming operations. This makes it straightforward to
prototype a chain in a Jupyter notebook using the sync interface and then
expose it as an async streaming interface for production use.

##### Extended Answer

1. Configurable Retries and Fallbacks: LCEL allows developers to configure
   retries and fallbacks at any point in the chain[2]. This capability is
   crucial for maintaining performance and reliability in production
   environments, especially as systems scale up.

2. Async Support: LCEL provides dual support for synchronous and asynchronous
   APIs[2]. This flexibility enables applications to handle multiple concurrent
   requests efficiently, which is essential for scalability and performance
   optimization.

3. Optimized Parallel Execution: LCEL automatically optimizes the execution of
   steps that can be performed in parallel[1]. For example, when fetching
   documents from multiple retrievers, LCEL manages these requests
   concurrently, significantly reducing latency and improving response times as
   the system scales.

4. First-Class Streaming Support: LCEL offers robust streaming capabilities,
   which are crucial for applications requiring real-time interaction with
   large language models (LLMs)[1]. This feature minimizes the time until the
   first output arrives after a request is made, enhancing responsiveness even
   under increased load.

5. Access to Intermediate Results: For complex chains, LCEL provides the
   ability to stream intermediate results[2]. This feature allows developers to
   monitor the progress of operations and debug issues effectively, which is
   particularly valuable when scaling up and dealing with more complex
   workflows.

6. Input and Output Schemas: LCEL integrates Pydantic and JSONSchema for
   defining input and output schemas[2]. This integration ensures that data
   validation is an integral part of the chains, enhancing the robustness of
   applications as they scale.

These features collectively contribute to LCEL's ability to maintain
reliability at scale by providing mechanisms for error handling, efficient
resource utilization, and robust data processing, even as the system grows and
faces increased demands.

Citations:

- [1] https://www.restack.io/docs/langchain-knowledge-lcel-parallel-cat-ai
- [2] https://www.restack.io/docs/langchain-knowledge-lcel-retriever-cat-ai
- [3] https://www.blameless.com/blog/how-to-scale-for-reliability-and-trust
- [4] https://www.ti.com/lit/wp/slyy074/slyy074.pdf?ts=1709809848758
- [5] https://www.linkedin.com/advice/1/how-do-you-ensure-scalability-reliability-your
- [6] https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6004510/
- [7] https://platform.openai.com/docs/api-reference