## Install Requirements
Run the cell, below, to install the libraries we'll be using.

In [2]:
%pip install -qU langchain langchain-core langchain_community langchain_text_splitters langgraph
%pip install -qU langchain-google-genai
%pip install -qU bs4
%pip install -qU python-dotenv typing_extensions


## Load the API key into the environment
The code, below, loads the API key and stores it where the LangChain libraries (and likely the Google libraries used by the LangChain libraries) expect to find it.

**If you're running this code in Google Colab**, this code assumes you've already stored your API key as a *secret*:

1. Open your Google Colab notebook and click the 🔑 Secrets tab in the left panel.
2. Create a new secret with the name `GOOGLE_API_KEY`.
3. Paste your API key into the Value input box of `GOOGLE_API_KEY`.
4. Toggle the button on the left to allow notebook access to the secret.

Otherwise, the code assumes that you have a `.env` file that includes `GOOGLE_API_KEY=<your api key here>`.

In [3]:
import os
import sys

API_KEY = 'GOOGLE_API_KEY'

if 'google.colab' in sys.modules:
    from google.colab import userdata
    os.environ[API_KEY] = userdata.get(API_KEY)  # copy the Colab secret into the environment
else:
    from dotenv import load_dotenv
    load_dotenv()  # Load environment variables from .env file; should include GOOGLE_API_KEY

You can verify that your API key is where it ought to be by running the code cell below. The cell prints the key itself, so clear its output before sharing the notebook.

In [4]:
os.getenv(API_KEY)

'<redacted>'
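
If you'd rather not print the key at all, here is a minimal sketch of a safer check. The helper name `describe_secret` is hypothetical (it is not part of LangChain or Colab); it only reports whether the variable is set and shows a short prefix.

```python
import os


def describe_secret(name: str, visible: int = 4) -> str:
    """Report whether an environment variable is set without revealing it."""
    value = os.getenv(name)
    if not value:
        return f"{name} is not set"
    # Show only a short prefix so the full key never lands in the cell output.
    return f"{name} is set ({value[:visible]}..., {len(value)} characters)"


print(describe_secret("GOOGLE_API_KEY"))
```

A check like this is enough to confirm that the key loaded, and it leaves nothing sensitive behind in the saved notebook.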

## Components
Import and instantiate a:
  1. chat model
  2. embedding model
  3. in-memory vector store

Note that we're using the `langchain_google_genai` library instead of the Google Vertex (or OpenAI, or Anthropic, etc.) library. That means you can't simply copy code from the LangChain tutorial. Documentation for the Google GenAI library can be found [here](https://python.langchain.com/api_reference/google_genai/index.html).

In [5]:
from langchain_google_genai import ChatGoogleGenerativeAI
llm = ChatGoogleGenerativeAI(model="gemini-2.0-flash-lite-preview-02-05")

In [6]:
from langchain_google_genai import GoogleGenerativeAIEmbeddings
embeddings = GoogleGenerativeAIEmbeddings(model="models/text-embedding-004")

In [7]:
from langchain_core.vectorstores import InMemoryVectorStore
vector_store = InMemoryVectorStore(embeddings)

## RAG Pipeline

### Scrape a Web Page

We'll use the `WebBaseLoader` class to scrape a web page we'd like to ask an LLM about. It uses [Beautiful Soup](https://beautiful-soup-4.readthedocs.io/en/latest/) -- another popular library -- to parse the web page (extract its text content).

Notice how, instead of writing their own HTML parser, the LangChain developers make use of another well-established library. The named parameter `bs_kwargs` is short for "Beautiful Soup keyword arguments." We're passing to `WebBaseLoader` a set of arguments that it will pass on to Beautiful Soup functions. A decision to use another library like this comes with trade-offs:
  - To use LangChain, I don't have to write much or any code to control Beautiful Soup. LangChain handles (almost) all of it for me.
  - But now this LangChain class is dependent on (tied to) Beautiful Soup. If Beautiful Soup changes interfaces, `WebBaseLoader` might break.
  - And `WebBaseLoader` is also somewhat less flexible. What if Beautiful Soup isn't my preferred library or doesn't do what I need? That's why you'll sometimes see a library let you pass in whatever HTML parser you choose. It could be Beautiful Soup, another open-source library, or the HTML parser you wrote for fun.
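
To make that last trade-off concrete, here is a rough sketch of what a "bring your own parser" design could look like. Everything in it (`load_page_text`, `stdlib_parser`, `_TextExtractor`) is hypothetical and uses only the standard library; it is not how `WebBaseLoader` is implemented.

```python
from html.parser import HTMLParser
from typing import Callable


class _TextExtractor(HTMLParser):
    """Collects the text nodes of an HTML document (standard library only)."""

    def __init__(self) -> None:
        super().__init__()
        self.parts: list[str] = []

    def handle_data(self, data: str) -> None:
        # Skip whitespace-only text nodes between tags.
        if data.strip():
            self.parts.append(data.strip())


def stdlib_parser(html: str) -> str:
    """Default parser: strip the tags and join the remaining text."""
    extractor = _TextExtractor()
    extractor.feed(html)
    extractor.close()
    return " ".join(extractor.parts)


def load_page_text(html: str, parser: Callable[[str], str] = stdlib_parser) -> str:
    """Hypothetical loader: the caller decides how HTML becomes text."""
    return parser(html)


sample = "<html><body><h1>Agents</h1><p>Tool use</p></body></html>"
print(load_page_text(sample))
# Swap in any other callable that maps HTML to text.
print(load_page_text(sample, parser=lambda html: stdlib_parser(html).upper()))
```

The loader here depends only on the shape of the parser (a function from HTML to text), not on any one parsing library. The cost is that the caller has more to configure.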

Notice also that we've decided to give Beautiful Soup some more specific instructions, taking content from HTML tags that have a class of `post-content`, `post-title`, or `post-header`. (You could navigate to the web page and open the developer tools to see just what that includes.) Doing so gives us cleaner text to use for our RAG application but at the cost of making our code less general. If I want to query a different web page, there's no reason to think it will use the same class names to identify the important bits. If we add a web page loader to KnotebookLM, we'll need to think about how best to generalize our approach.

In [8]:
import bs4

# Set USER_AGENT before importing WebBaseLoader: langchain_community appears to read
# this variable when the loader module is imported and logs a warning if it is missing.
# A User-Agent header tells a web server what kind of client is making the request.
os.environ['USER_AGENT'] = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'

from langchain_community.document_loaders import WebBaseLoader

loader = WebBaseLoader(
    web_paths=("https://lilianweng.github.io/posts/2023-06-23-agent/",),
    bs_kwargs=dict(
        parse_only=bs4.SoupStrainer(
            class_=("post-content", "post-title", "post-header")
        ),
    ),
)

docs = loader.load()



`docs` is a list of `Document` objects. We only loaded one document, so the length of `docs` is 1.

In [9]:
len(docs)

1

We can ask Python to tell us the type of that sole document.

In [10]:
type(docs[0])

That's `langchain-core`'s base `Document` class. Consulting the [documentation](https://python.langchain.com/api_reference/core/documents/langchain_core.documents.base.Document.html#langchain_core.documents.base.Document) you can see it is instantiated with two notable properties: `page_content` and `metadata`. (You can also find in the documentation a link to the source code if you want to dig further.)

We'll need to talk about `metadata` later. For now, let's look at a small slice of the `page_content`.

In [32]:
docs[0].page_content[1300:1500]

'fast retrieval.\n\n\nTool use\n\nThe agent learns to call external APIs for extra information that is missing from the model weights (often hard to change after pre-training), including current information'

Compare it to the web page we scraped. Beautiful Soup did a pretty good job, no?

### Split the Text
As a final pre-processing step, we'll split the text into smaller chunks. Read [why](https://python.langchain.com/docs/concepts/text_splitters/#why-split-documents).

Following the tutorial, we'll use the `RecursiveCharacterTextSplitter` class. It implements a [text-structure based](https://python.langchain.com/docs/concepts/text_splitters/#text-structured-based) approach. To better understand how this splitter works and how to control it, read this [guide](https://python.langchain.com/docs/how_to/recursive_text_splitter/) and consult the [documentation](https://python.langchain.com/api_reference/text_splitters/character/langchain_text_splitters.character.RecursiveCharacterTextSplitter.html).

In [48]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
all_splits = text_splitter.split_documents(docs)

Let's see how many chunks we've split our document into.

In [49]:
len(all_splits)

66

In [67]:
all_splits[49:55]

[Document(metadata={'source': 'https://lilianweng.github.io/posts/2023-06-23-agent/'}, page_content='You will get instructions for code to write.\nYou will write a very long answer. Make sure that every detail of the architecture is, in the end, implemented as code.\nMake sure that every detail of the architecture is, in the end, implemented as code.\nThink step by step and reason yourself to the right decisions to make sure we get it right.\nYou will first lay out the names of the core classes, functions, methods that will be necessary, as well as a quick comment on their purpose.\nThen you will output the content of each file including ALL code.\nEach file must strictly follow a markdown code block format, where the following tokens must be replaced such that\nFILENAME is the lowercase file name including the file extension,\nLANG is the markup code block language for the code’s language, and CODE is the code:\nFILENAME\nCODE\nYou will start with the “entrypoint” file, then go to the

They're not all equal length. Based on the guides and documentation you've read, can you explain why?

In [51]:
for idx, split in enumerate(all_splits):
    print(f"Split {idx} length: {len(split.page_content)}")

Split 0 length: 969
Split 1 length: 609
Split 2 length: 606
Split 3 length: 644
Split 4 length: 971
Split 5 length: 506
Split 6 length: 902
Split 7 length: 706
Split 8 length: 164
Split 9 length: 960
Split 10 length: 412
Split 11 length: 903
Split 12 length: 834
Split 13 length: 545
Split 14 length: 969
Split 15 length: 986
Split 16 length: 459
Split 17 length: 542
Split 18 length: 760
Split 19 length: 772
Split 20 length: 818
Split 21 length: 469
Split 22 length: 655
Split 23 length: 820
Split 24 length: 476
Split 25 length: 388
Split 26 length: 855
Split 27 length: 805
Split 28 length: 639
Split 29 length: 456
Split 30 length: 610
Split 31 length: 616
Split 32 length: 679
Split 33 length: 726
Split 34 length: 971
Split 35 length: 195
Split 36 length: 997
Split 37 length: 828
Split 38 length: 624
Split 39 length: 955
Split 40 length: 541
Split 41 length: 961
Split 42 length: 704
Split 43 length: 556
Split 44 length: 958
Split 45 length: 666
Split 46 length: 664
Split 47 length: 983
Sp

Based on what we read, we'd expect to see some overlap between the end of one split and the beginning of the next.

In [62]:
for prev, curr in zip(all_splits[51:55], all_splits[52:56]):
    print('previous: \n', prev.page_content[-50:], '\n')
    print('current: \n', curr.page_content[:50], '\n')
    print('\n------------------\n')

previous: 
 ined
package/project.
Python toolbelt preferences: 

current: 
 pytest
dataclasses 


------------------

previous: 
 pytest
dataclasses 

current: 
 Conversatin samples:
[
  {
    "role": "system", 


------------------

previous: 
 Conversatin samples:
[
  {
    "role": "system", 

current: 
 "content": "You will get instructions for code to  


------------------

previous: 
 that are imported by that file, and so on.\nPlease 

current: 
 for the code's language, and CODE is the code:\n\n 


------------------



But in this slice of splits, I don't see any overlap. (I tried a few different slices and likewise didn't see any overlaps.) Does that mean it's not working?

In [53]:
for i in range(len(all_splits) - 1):
    last = all_splits[i].page_content.strip().split()[-1]
    first = all_splits[i+1].page_content.strip().split()[0]
    if last == first:
        print('\n-------- index', i, '----------\n')
        print('previous: \n', last, '\n')
        print('current: \n', first, '\n')


-------- index 0 ----------

previous: 
 Memory 

current: 
 Memory 


-------- index 39 ----------

previous: 
 GOALS: 

current: 
 GOALS: 



There are only two cases where the last word of one split matches the first word of the next, and in both cases it looks like a heading. Note that this check compares a single word on each side, so it would miss an overlap that spans several words. It's not perfectly clear -- at least not to me -- why `RecursiveCharacterTextSplitter` works this way. It could be the nature of the web page (lots of headings, lots of figures). It could be the relation between our `chunk_size` and `chunk_overlap`. The documentation isn't super helpful. If we want to know more, we'll likely have to dive into the code and experiment.
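
A more direct way to look for overlap is to find the longest run of characters that ends one chunk and begins the next. Here is a minimal sketch; `shared_overlap` is a hypothetical helper, and the sample chunks are made up for illustration.

```python
def shared_overlap(prev: str, curr: str, max_len: int = 200) -> str:
    """Return the longest suffix of `prev` that is also a prefix of `curr`."""
    limit = min(max_len, len(prev), len(curr))
    # Try the longest candidate first so the first match is the longest one.
    for size in range(limit, 0, -1):
        if prev[-size:] == curr[:size]:
            return prev[-size:]
    return ""


chunks = ["alpha beta gamma", "beta gamma delta", "epsilon zeta"]
for i, (prev, curr) in enumerate(zip(chunks, chunks[1:])):
    print(i, repr(shared_overlap(prev, curr)))
```

In the notebook you could call it as `shared_overlap(a.page_content, b.page_content)` for each pair of neighbouring splits, with `max_len` set to the `chunk_overlap` value. An empty string means the two chunks share no text at the boundary.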

When it comes time to write code for KnotebookLM, we'll likely want to play around with chunk and overlap sizes and see what makes most sense for our application.
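
One possible explanation for the sparse overlap is the way pieces get merged into chunks. The sketch below is a simplified model, loosely based on my reading of the splitter's merge step; treat it as an assumption, not as the library's actual code. The function `merge_pieces` is hypothetical. Text is first cut at a separator, pieces are packed greedily up to `chunk_size`, and a new chunk starts with only those trailing pieces that fit inside `chunk_overlap`.

```python
def merge_pieces(pieces, chunk_size, chunk_overlap, sep="\n\n"):
    """Greedily pack pieces into chunks, carrying over trailing pieces as overlap."""
    chunks = []
    current = []
    for piece in pieces:
        if current and len(sep.join(current + [piece])) > chunk_size:
            chunks.append(sep.join(current))
            # Drop pieces from the front until what remains fits in chunk_overlap
            # and still leaves room for the incoming piece.
            while current and (
                len(sep.join(current)) > chunk_overlap
                or len(sep.join(current + [piece])) > chunk_size
            ):
                current.pop(0)
        current.append(piece)
    if current:
        chunks.append(sep.join(current))
    return chunks


# Paragraphs longer than chunk_overlap: nothing is small enough to carry over.
paragraphs = ["a" * 300, "b" * 300, "c" * 300, "d" * 300]
print([len(c) for c in merge_pieces(paragraphs, chunk_size=700, chunk_overlap=200)])

# Short pieces: the last piece of each chunk reappears at the start of the next.
words = [f"word{i:06d}" for i in range(6)]
print(merge_pieces(words, chunk_size=35, chunk_overlap=12, sep=" "))
```

Under this model, overlap is measured in whole pieces rather than raw characters. A page made mostly of paragraphs longer than `chunk_overlap` would then produce little or no overlap. That would be consistent with what we observed, though experimenting with the real splitter is the only way to confirm it.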

### Index Splits

If you were implementing this next step without LangChain, you'd likely think of it as two steps:
1. For each split, generate an embedding (a vector that represents the "meaning" of the text in the split)
2. Write the resulting vector and the original text to a database.

LangChain handles both with a single call to the `add_documents` method on the `vector_store` instance we created. (And now you understand why we needed to pass the `embeddings` instance as an argument when we created `vector_store`.)

In [61]:
_ = vector_store.add_documents(documents=all_splits)
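
To make the two steps visible, here is a toy sketch of what a vector store does. It is not LangChain's implementation: `toy_embed` is a hypothetical stand-in that counts words instead of calling an embedding model, and `ToyVectorStore` is an illustrative class name.

```python
import math
from collections import Counter


def toy_embed(text: str) -> Counter:
    """Hypothetical stand-in for an embedding model: a bag of lowercase words."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ToyVectorStore:
    def __init__(self, embed):
        self.embed = embed
        self.rows = []  # list of (vector, original text) pairs

    def add_texts(self, texts):
        for text in texts:
            vector = self.embed(text)          # step 1: generate an embedding
            self.rows.append((vector, text))   # step 2: store vector and text

    def search(self, query, k=1):
        query_vector = self.embed(query)
        ranked = sorted(self.rows, key=lambda row: cosine(query_vector, row[0]), reverse=True)
        return [text for _, text in ranked[:k]]


store = ToyVectorStore(toy_embed)
store.add_texts(["agents use tools to call APIs", "memory stores past observations"])
print(store.search("which tools do agents call"))
```

The store needs the embedding function for two jobs: once to index each text, and again to embed the query so it can be compared against the stored vectors.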

That's all the pre-processing we need. We're ready to move on to retrieval tasks.

## Retrieve Relevant Chunks, Ask Questions

We've indexed the web page and can now ask questions.

### Prompt

LangChain has a library of task-specific prompts. Let's grab the "RAG" prompt.

In [63]:
from langchain import hub

prompt = hub.pull('rlm/rag-prompt')
prompt



ChatPromptTemplate(input_variables=['context', 'question'], input_types={}, partial_variables={}, metadata={'lc_hub_owner': 'rlm', 'lc_hub_repo': 'rag-prompt', 'lc_hub_commit_hash': '50442af133e61576e74536c6556cefe1fac147cad032f4377b60c436e6cdcb6e'}, messages=[HumanMessagePromptTemplate(prompt=PromptTemplate(input_variables=['context', 'question'], input_types={}, partial_variables={}, template="You are an assistant for question-answering tasks. Use the following pieces of retrieved context to answer the question. If you don't know the answer, just say that you don't know. Use three sentences maximum and keep the answer concise.\nQuestion: {question} \nContext: {context} \nAnswer:"), additional_kwargs={})])

Notice it didn't return a simple string, but rather an instance of `ChatPromptTemplate`. Following the tutorial's walk-through, we can explore it a bit:

In [65]:
example_message, = prompt.invoke(
    { "context": "Here's where we'll put relevant chunks from the web page.", "question": "Here's where our question goes." }
).to_messages()

Notice the comma after `example_message`? That wasn't a mistake. As the name `to_messages` implies, the method returns a list that could hold more than one message. The trailing comma *unpacks* (destructures) that list into a single variable. This only works because the list holds exactly one item; with more than one, Python would raise a `ValueError`.
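
Here is a small, self-contained illustration of that trailing-comma unpacking in plain Python. The helper `only_item` is hypothetical and exists only for the demonstration.

```python
def only_item(items):
    """Unpack a sequence that is expected to hold exactly one item."""
    item, = items  # raises ValueError unless len(items) == 1
    return item


print(only_item(["the only message"]))

try:
    only_item(["first", "second"])
except ValueError as err:
    print("unpacking failed:", err)
```

Compared with `items[0]`, this form doubles as a check: it fails loudly if the list ever holds more (or fewer) items than expected.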

Let's see the `content` of that message...

In [66]:
print(example_message.content)

You are an assistant for question-answering tasks. Use the following pieces of retrieved context to answer the question. If you don't know the answer, just say that you don't know. Use three sentences maximum and keep the answer concise.
Question: Here's where our question goes. 
Context: Here's where we'll put relevant chunks from the web page. 
Answer:


Pretty cool. We pass the prompt a dictionary with `context` and `question` keys and it'll insert their values into our prompt.
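
Conceptually, this is close to ordinary string formatting. The sketch below uses Python's built-in `str.format` with a shortened copy of the template; `ChatPromptTemplate` does more than this (it builds message objects, for one), so take it as an analogy, not a description of LangChain's internals. `RAG_TEMPLATE` and `fill_prompt` are hypothetical names.

```python
RAG_TEMPLATE = (
    "Use the following pieces of retrieved context to answer the question.\n"
    "Question: {question} \n"
    "Context: {context} \n"
    "Answer:"
)


def fill_prompt(template: str, values: dict) -> str:
    """Insert each value into the placeholder that matches its key."""
    return template.format(**values)


filled = fill_prompt(
    RAG_TEMPLATE,
    {"context": "Agents call external APIs.", "question": "What is tool use?"},
)
print(filled)

# Leaving out a key that the template expects raises a KeyError.
try:
    fill_prompt(RAG_TEMPLATE, {"question": "No context supplied"})
except KeyError as missing:
    print("missing template variable:", missing)
```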

### Using LangGraph to Stitch Together the Parts