# Installation

In [2]:
# We will install the basic LangChain Google VertexAI package,
# The LangChain Community package for access to loaders,
# The LangChain Google Community package for speech-to-text API and
# The text splitter package for working with long context
! pip install -qU langchain langchain-community langchain-text-splitters langchain-google-vertexai langchain-google-community google-cloud-speech

In [None]:
# Automatically restart kernel after installs so that your environment can access the new packages
import IPython

app = IPython.Application.instance()
app.kernel.do_shutdown(True)

In [2]:
# Only needed if you work with Google Colab
# You can skip this if you use Vertex AI Workbench
import sys

if "google.colab" in sys.modules:
    from google.colab import auth

    auth.authenticate_user()

In [3]:
# Set your project and region

import vertexai

PROJECT_ID = "genai-app-bldr"  # @param {type:"string"}
REGION = "asia-northeast3"  # @param {type:"string"}

# Initialize Vertex AI SDK
vertexai.init(project=PROJECT_ID, location=REGION)

# Summarizing documents
We will discuss two ways of summarizing long documents. The first way uses traditional map-reduce patterns to summarize documents across multiple LLM calls, while the second way makes use of the long context windows and multimodal capabilities of the latest LLMs to create summaries more efficiently. We will also quickly cover loaders to get documents into your chain, and then go deep into map-reduce patterns to give you a good understanding of what is going on under the hood.

## Summarizing text
We’ll start with summarizing text. Since most advanced modern LLMs have a long context window, we can simply put the whole text into the model’s context and wrap it with instructions about the kind of summary we want. In practice, we use zero-shot prompting for summarization tasks, because few-shot examples would themselves be long documents. Let’s start with the Gemini release updates (https://gemini.google.com/updates) and ask Gemini to create a summary.

In [26]:
from langchain.chains.summarize import load_summarize_chain
from langchain_google_vertexai import ChatVertexAI
from langchain_core.documents import Document

google_gemini_release_updates = """2024.10.09
Introducing Imagen 3
What: Create stunning images in Gemini with Imagen 3, our highest quality text-to-image model yet. Simply describe what you imagine, and watch as your ideas transform into visuals, bursting with vivid details and realism, in seconds.
And starting in English, Gemini Advanced subscribers unlock the added capability of generating images featuring people, empowering even more variety in your creations.
Why: Your imagination is limitless, and we want to place the power of creativity in your hands. We're also expanding image generation to more countries and languages, so even more people can bring their ideas to life with Gemini. While this model represents a leap forward in image generation quality and accuracy, we're constantly learning and will continue to improve to provide you the best experience possible.
2024.10.06
Gemini can access helpful information from more apps and services
What: Over the coming days, Gemini will connect to more apps and services, letting you set reminders with Google Tasks, make and refine lists with Google Keep, and engage with playlists in YouTube Music. Learn more about extensions. Learn more about extensionsOpens in a new window.
Why: We’re continuing to build new ways for Gemini to connect to the apps you rely on every day. Through these extensions, Gemini can coordinate across your apps and take actions for you, better than ever. Read moreOpens in a new window about how Gemini works with your favorite Assistant actions, including details on upcoming improvements.
2024.10.02
Gemini 1.5 Pro in Gemini Advanced just became more capable
What: Gemini Advanced, with a chat optimized version of 1.5 Pro-002, now provides better and more accurate responses for prompts related to math and exploring complex topics that invite thoughtful conversation. Whether you’re providing detailed, multi-step instructions, or tackling advanced mathematical calculations, Gemini Advanced will help you navigate these challenges with greater ease.
Why: We remain committed to rapid development and delivering the best of Gemini to our users. Through continuous model training and valuable user feedback, we’ve further enhanced the capabilities of Gemini 1.5 Pro, continuing to make Gemini Advanced better over time. Upgrade to Gemini Advanced"""

llm = ChatVertexAI(temperature=0, model_name="gemini-2.0-flash-exp", location="us-central1")
chain = load_summarize_chain(llm, chain_type="stuff")

doc = Document(page_content=google_gemini_release_updates)

result = chain.invoke([doc])
print(result["output_text"])

Google Gemini has recently received several updates: Imagen 3, a new text-to-image model, allows for high-quality image generation, including images with people for English Gemini Advanced subscribers. Gemini is also connecting to more apps like Google Tasks, Keep, and YouTube Music, enabling it to take actions for users. Finally, Gemini Advanced with 1.5 Pro has improved its math and complex topic handling capabilities.



In [None]:
# JSON renders the full result dictionary as an interactive tree
from IPython.display import JSON

JSON(result)


Google is rapidly enhancing its AI assistant, Gemini. Recent updates include:

* **Imagen 3 integration:** Generate high-quality images from text prompts within Gemini, with expanded capabilities for Gemini Advanced subscribers.
* **App integration:** Connect Gemini to Google Tasks, Keep, and YouTube Music for improved functionality and task management.
* **Gemini 1.5 Pro enhancements:**  Improved accuracy and capabilities for complex tasks, math problems, and nuanced conversations within Gemini Advanced. 

These updates highlight Google's focus on making Gemini a powerful and versatile AI tool for a wide range of applications. 


LangChain is transparent about its inner workings. The following cell will let you inspect the template:

In [27]:
print(chain.document_prompt)

input_variables=['page_content'] input_types={} partial_variables={} template='{page_content}'


The "stuff" chain combines all input documents into a single string before it calls the model. The private `_get_inputs` method shows the result of that step: each document is formatted with the document prompt, and the pieces are joined with a separator.

In [28]:
doc1 = Document(page_content="text1")
doc2 = Document(page_content="text2")

chain._get_inputs([doc1, doc2])

{'text': 'text1\n\ntext2'}

And as always, we have full access to everything that is going on inside the chain, including its (deceptively simple) prompt.

In [29]:
print(chain.llm_chain.prompt)

input_variables=['text'] input_types={} partial_variables={} template='Write a concise summary of the following:\n\n\n"{text}"\n\n\nCONCISE SUMMARY:'
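To make the two inspected templates concrete, here is a minimal, library-free sketch of what the "stuff" strategy amounts to. It is not LangChain's implementation: the constants simply mirror the document prompt, the `\n\n` separator, and the summary prompt printed above, and the function names `stuff_documents` and `build_prompt` are hypothetical.

```python
# Library-free sketch of the "stuff" strategy (illustration, not LangChain code).
DOCUMENT_TEMPLATE = "{page_content}"  # mirrors chain.document_prompt
DOCUMENT_SEPARATOR = "\n\n"           # separator visible in the _get_inputs output
SUMMARY_TEMPLATE = (
    'Write a concise summary of the following:\n\n\n"{text}"\n\n\nCONCISE SUMMARY:'
)


def stuff_documents(page_contents: list[str]) -> str:
    """Format every document and join the pieces into one string."""
    formatted = [DOCUMENT_TEMPLATE.format(page_content=c) for c in page_contents]
    return DOCUMENT_SEPARATOR.join(formatted)


def build_prompt(page_contents: list[str]) -> str:
    """Place the stuffed text into the summarization template."""
    return SUMMARY_TEMPLATE.format(text=stuff_documents(page_contents))


final_prompt = build_prompt(["text1", "text2"])
print(final_prompt)
```

The prompt is just a template with a `{text}` slot, so the model receives the instructions and all documents in a single call.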


This also means that we can modify the behavior of the chain using our own prompt. Let's assume you want to significantly shorten the summary. You can add a `max_length` variable and pass that to the prompt.

In [31]:
from langchain_core.prompts.chat import ChatPromptTemplate

prompt = ChatPromptTemplate.from_messages(
    [
      ("system", "Create an extractive summary in simple English with no more than {max_length} words"),
      ("human", "Use the following context for the summary: \n{context}")
    ]
)

chain = load_summarize_chain(
              llm, 
              chain_type="stuff", 
              document_separator="\n",
              prompt=prompt, 
              document_variable_name="context"
            )

result = chain.invoke(
            input={"input_documents": [doc], "max_length": 20})
print(result["output_text"])

Gemini's Imagen 3 creates high-quality images. Gemini connects to more apps. Gemini 1.5 Pro is more capable.



We now have control over the length of the output. Note that to pass additional arguments to the chain, we need to pass a full dictionary as the `input` argument during invocation instead of a bare list of documents:

In [32]:
result = chain.invoke(
  input={"input_documents": [doc], "max_length": 50})

print(result["output_text"])

Gemini is improving with new features. Imagen 3 creates high-quality images from text. Gemini connects to more apps like Google Tasks and YouTube Music. Gemini Advanced now gives better responses for math and complex topics. These updates aim to make Gemini more useful and creative.
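The `max_length` value is only an instruction in the prompt; nothing in the chain enforces it. If the limit matters, it may be worth checking the output yourself. The sketch below is one simple way to do that, assuming that whitespace-separated tokens are a good enough proxy for words. The helper names are hypothetical.

```python
def word_count(text: str) -> int:
    """Count whitespace-separated tokens as a rough proxy for words."""
    return len(text.split())


def within_budget(summary: str, max_length: int) -> bool:
    """Return True if the summary respects the requested word limit."""
    return word_count(summary) <= max_length


# The 20-word summary produced earlier in this section
short_summary = (
    "Gemini's Imagen 3 creates high-quality images. "
    "Gemini connects to more apps. Gemini 1.5 Pro is more capable."
)
print(word_count(short_summary), within_budget(short_summary, 20))
```

If a summary exceeds the budget, you could invoke the chain again with a smaller `max_length`.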



## Using Loaders
Let's look at a practical example: You want to learn about all the amazing new concepts in the world of Generative AI, but you simply don't have the time to read through the many papers published each day. How about building a LangChain chain that can help you quickly summarize complex papers in simple English? We can use `WebBaseLoader` to load the HTML version of any arXiv paper (such as the example on multi-agent systems below) and then pass it to the model:

In [33]:
from langchain_community.document_loaders import WebBaseLoader

loader = WebBaseLoader("https://arxiv.org/html/2312.10256v1")
# Note how the loader already returns LangChain Document types
docs = loader.load()

# We reuse the `chain` created above with our custom prompt.
# For reference, it was built like this:
# chain = load_summarize_chain(
#  llm, chain_type="stuff", document_separator="\n",
#  prompt=prompt, document_variable_name="context")

result = chain.invoke(
  input={"input_documents": docs, "max_length": 50})

print(result["output_text"])

USER_AGENT environment variable not set, consider setting it to identify your requests.


This survey explores multi-agent reinforcement learning (MARL), focusing on learning optimal control in multi-agent systems. It covers backgrounds like game theory and deep learning, and challenges such as non-stationarity and coordination. The survey also discusses training schemes, communication, and knowledge transfer in MARL.
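The `USER_AGENT` warning in the output above asks you to identify your requests. A minimal way to address it, assuming the variable is read when `langchain_community.document_loaders` is first imported, is to set it before that import (in a notebook, this may require restarting the kernel). The value below is a hypothetical identifier.

```python
import os

# Hypothetical identifier; pick a value that describes your own application.
# setdefault keeps any value that is already configured in the environment.
os.environ.setdefault("USER_AGENT", "summarization-notebook/0.1")

print(os.environ["USER_AGENT"])
```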



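In [None]:
from IPython.display import JSON, Markdown, display

# Only the last expression of a cell is rendered automatically,
# so we display the Markdown summary explicitly
display(Markdown(result["output_text"]))

JSON(result)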


## Map-reduce pattern
What should we do if our document doesn’t fit into the model’s input window? We have a classic answer: map-reduce. This Big Data processing technique has been with us since Google published MapReduce in 2004. The idea is to split the original task into smaller ones that can be processed independently (map phase) and then combine the task outputs (reduce phase). Let's see a practical example of how this works.

In [53]:
# First, we load our long document (the HTML version of the paper):
from langchain_text_splitters import RecursiveCharacterTextSplitter

loader = WebBaseLoader("https://arxiv.org/html/2312.10256v1")
doc = loader.load()

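# RecursiveCharacterTextSplitter can be configured to any chunk size.
# Make sure each resulting chunk fits into your model's context window.
# The splitting step is commented out here, so the chain below receives the
# whole page as a single document. To map over smaller chunks, uncomment it
# and pass `docs` to the chain instead of `doc`.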
# text_splitter = RecursiveCharacterTextSplitter(
#                         chunk_size=2000, 
#                         chunk_overlap=100,
#                         length_function=len, 
#                         is_separator_regex=False
#                      )

# docs = text_splitter.split_documents(doc)
# print(f"I have created {len(docs)} documents")

# Next, we want to create the summary:
from langchain_core.prompts import PromptTemplate

prompt = PromptTemplate.from_template(
   "Collapse this content: {text}"
)

llm_mr = ChatVertexAI(model_name="gemini-1.5-pro-002", location="us-central1")
# We create a load_summarize chain following the map_reduce pattern
chain_mr = load_summarize_chain(llm_mr, chain_type="map_reduce", collapse_prompt=prompt)

result = chain_mr.invoke(doc)
print("Here is your summary: " + result["output_text"])

Here is your summary: This survey comprehensively reviews Multi-agent Reinforcement Learning (MARL), covering its foundational concepts (multi-agent environments, game theory, deep learning, RL), benefits, and key challenges (computational complexity, non-stationarity, coordination, evaluation).  It explores various MARL training paradigms, off-policy learning, exploration, communication, ad-hoc teamwork, knowledge transfer, and agent modeling, concluding with open challenges and future directions.
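The `chunk_size` and `chunk_overlap` parameters in the commented-out splitter configuration above are easier to reason about with a small example. The sketch below is a simplified stand-in, not the library's algorithm: it cuts fixed character windows, whereas `RecursiveCharacterTextSplitter` additionally tries to break on separators such as paragraphs and line breaks. The function name `split_with_overlap` is hypothetical.

```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into windows of at most chunk_size characters.

    Consecutive windows share chunk_overlap characters, so a sentence that
    straddles a boundary appears in both neighbouring chunks.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")

    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window has reached the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = split_with_overlap(sample, chunk_size=10, chunk_overlap=3)
print(chunks)
```

A larger overlap preserves more context across chunk boundaries at the cost of more chunks, and therefore more model calls in the map phase.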



In [54]:
Markdown(result["output_text"])

This survey comprehensively reviews Multi-agent Reinforcement Learning (MARL), covering its foundational concepts (multi-agent environments, game theory, deep learning, RL), benefits, and key challenges (computational complexity, non-stationarity, coordination, evaluation).  It explores various MARL training paradigms, off-policy learning, exploration, communication, ad-hoc teamwork, knowledge transfer, and agent modeling, concluding with open challenges and future directions.


As before, you can inspect the prompt used in the map-reduce chain.

In [57]:
chain_mr.llm_chain.prompt

PromptTemplate(input_variables=['text'], input_types={}, partial_variables={}, template='Write a concise summary of the following:\n\n\n"{text}"\n\n\nCONCISE SUMMARY:')
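To show how the map, collapse, and reduce steps relate, here is a library-free sketch of the pattern. It rests on several assumptions: the summarizer is a deterministic stand-in that keeps only the first sentence (a real chain would call the LLM), length is measured in characters (the LangChain chain, as far as we understand it, works with a token limit), and the names `first_sentence`, `map_reduce_summarize`, `max_chars`, and `group_size` are hypothetical.

```python
from typing import Callable, List

calls: List[str] = []  # records every simulated model call


def first_sentence(text: str) -> str:
    """Deterministic stand-in for an LLM summarizer (illustration only)."""
    calls.append(text)
    head, sep, _ = text.partition(". ")
    return head + "." if sep else text.strip()


def map_reduce_summarize(
    chunks: List[str],
    summarize: Callable[[str], str],
    max_chars: int = 50,
    group_size: int = 2,
) -> str:
    if group_size < 2:
        raise ValueError("group_size must be at least 2 so that collapsing makes progress")

    # Map: summarize each chunk independently (these calls could run in parallel)
    summaries = [summarize(chunk) for chunk in chunks]

    # Collapse: while the combined summaries are too long for a single call,
    # summarize them in small groups to shrink the intermediate result
    while len(summaries) > 1 and len(" ".join(summaries)) > max_chars:
        summaries = [
            summarize(" ".join(summaries[i:i + group_size]))
            for i in range(0, len(summaries), group_size)
        ]

    # Reduce: one final call over whatever is left
    return summarize(" ".join(summaries))


chunks = [
    "Agents learn together. They share rewards.",
    "Non-stationarity is hard. Policies keep changing.",
    "Communication helps. Messages can be learned.",
]
summary_text = map_reduce_summarize(chunks, first_sentence)
print(summary_text, len(calls))
```

In this toy run the three map calls are followed by two collapse calls and one reduce call. The collapse step is the role that `collapse_prompt` plays in the chain above.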

## Summarizing other modalities

Let’s look at how we would process the arXiv paper we analysed in the previous section using multimodal capabilities: This time, we will get a PDF file from a public Google Cloud Storage bucket and construct a `HumanMessage` that has two (or more) pieces of content. You can find the full code in our GitHub repository.

In [58]:
from langchain_core.messages import HumanMessage, SystemMessage

pdf_message = {
   "type": "image_url",
   "image_url": {
       "url": "gs://arxiv-dataset/arxiv/arxiv/pdf/2312/2312.10256v1.pdf"
   },
}

summarize_message = {
   "type": "text",
   "text": "Create an extractive summary in simple English with no more than 50 words of the following documents.",
}

message = HumanMessage(content=[pdf_message, summarize_message])
result = llm.invoke([message])
print(result.content)

This survey explores multi-agent reinforcement learning (MARL), focusing on learning optimal control in multi-agent systems. It covers challenges like complexity, non-stationarity, and coordination, and discusses techniques like communication, knowledge transfer, and agent modeling. The goal is to provide a comprehensive view for researchers and practitioners.



In [None]:
from langchain_core.output_parsers import StrOutputParser

reduce_prompt = ChatPromptTemplate.from_messages(
    [("system", "Create a summary in less than 200 words in simple English of these key points extracted from a document:\n{context}")]
)
reduce_chain = reduce_prompt | llm | StrOutputParser()

# `summary` was not defined in the original cell. We assume it refers to the
# summary text returned by the previous cell.
summary = result.content
reduce_chain.invoke({"context": summary})

Let’s look at another example of using a loader to process multimodal information: [Google Cloud Speech-to-Text](https://cloud.google.com/speech-to-text/docs/optimizing-audio-files-for-speech-to-text#optimize_video_files_for_analysis) helps us transcribe audio or video into text that can be processed by any LLM. For the example below, assume you want to transcribe and process a recording of a presentation given by a colleague.

In [73]:
from langchain_google_community import SpeechToTextLoader

# First we load a speech file from the public domain
# ! wget -O audio.flac https://upload.wikimedia.org/wikipedia/commons/a/ae/Rep._Vance_McAllister_on_Increasing_the_Mississippi_River_MRT.flac

# Then we create a speech loader for this file
speech_loader = SpeechToTextLoader(
  project_id=PROJECT_ID, file_path="audio.flac", is_long=False)

audio_recording_docs = speech_loader.load()
# Let's inspect the transcribed documents
print(audio_recording_docs)

[Document(metadata={'language_code': 'en-US', 'result_end_offset': datetime.timedelta(seconds=30)}, page_content=''), Document(metadata={'language_code': 'en-US', 'result_end_offset': datetime.timedelta(seconds=44)}, page_content=' with a padam canal expansion project underway, we must continue to invest in the water resource, not with this finding, these waterways are too important to our nation, uh, again i just want to command the committee on the hold work they done, and there was not an easy challenge.'), Document(metadata={'language_code': 'en-US', 'result_end_offset': datetime.timedelta(seconds=44)}, page_content='')]
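Note that two of the three documents in the output above have an empty `page_content`. One option, sketched below, is to drop empty segments and merge the rest into a single transcript before summarizing. The `Doc` class is a minimal stand-in for a LangChain `Document` so that the example runs without any dependencies, and `merge_transcript` is a hypothetical helper. The sample text is an excerpt of the transcription shown above.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Minimal stand-in for a LangChain Document (illustration only)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def merge_transcript(docs: list) -> str:
    """Drop empty segments and join the remaining text with single spaces."""
    parts = [d.page_content.strip() for d in docs if d.page_content.strip()]
    return " ".join(parts)


docs = [
    Doc(""),
    Doc(" with a padam canal expansion project underway, we must continue to invest in the water resource"),
    Doc(""),
]
transcript = merge_transcript(docs)
print(transcript)
```

The same function should work on the loader's real output, since it only relies on the `page_content` attribute.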


Now we can work with these documents as usual.

In [72]:
chain = load_summarize_chain(
 llm, chain_type="stuff", document_separator="\n",
 prompt=prompt)

result = chain.invoke(
  input={"input_documents": audio_recording_docs, "max_length": 50})
print(result["output_text"])

We must continue to invest in data resources, especially with the padam canal expansion underway. These waterways are vital to our nation. I commend the committee for their work on this challenging project.



Another option is to pass the whole audio (or video) content directly to the model. In this case, we need to construct a special type of media message that carries the file content and states its MIME type.

In [75]:
import base64

# We first need to encode our audio to base64 so we can pass it to the LLM
with open("audio.flac","rb") as f:
  audio = base64.b64encode(f.read()).decode('utf-8')

# Then just construct a media message, making sure to give the correct mime_type
media_message = {
       "type": "media",
       "data": audio,
       "mime_type": "audio/flac",
   }

message = HumanMessage(content=[summarize_message, media_message])

output = llm.invoke([message])

print(output.content)

The speaker appreciates the committee's hard work on the appropriations bill. Their amendment increases funding for the Mississippi River and tributaries by $47 million to FY14 levels. The Mississippi River is a vital artery of commerce. With the Panama Canal expansion, we must invest in this resource, not reduce funding.
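Base64 encoding increases the payload by roughly one third, which matters for long recordings. The sketch below wraps the encoding step from the cell above in a small helper and checks the size relationship. The helper name `build_media_part` is hypothetical, and the bytes are placeholder data standing in for a real audio file.

```python
import base64
import math


def build_media_part(raw: bytes, mime_type: str) -> dict:
    """Build a media content part with the same keys as the cell above."""
    return {
        "type": "media",
        "data": base64.b64encode(raw).decode("utf-8"),
        "mime_type": mime_type,
    }


# Placeholder bytes standing in for the content of a real audio file
raw_audio = bytes(range(256)) * 4
part = build_media_part(raw_audio, "audio/flac")

# Base64 emits 4 output characters for every 3 input bytes (with padding)
expected_length = 4 * math.ceil(len(raw_audio) / 3)
print(len(raw_audio), len(part["data"]), expected_length)
```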


# Answering questions on long documents
Summarization is just the first step in achieving the goal of our users: answering questions on long documents! No one has time to read these long documents, and while summaries are helpful, what users really want is a clear, concise answer to a question. In this section, we will discuss how you can use the long context window of modern LLMs to efficiently answer questions on long documents.

With the introduction of Gemini 1.5 Pro, Google significantly expanded the context that can be processed in a single message to the LLM. Imagine performing Q&A on a 132-page document with a single call! To make long context easy to work with, LangChain introduced new types of messages. Don’t be confused by the type names: since long context grew out of multimodal model capabilities (processing images and text in the same invocation), the message type for PDF content is `image_url`.

In [77]:
# Don't be confused by image_url when loading pdf. This started as an image loader
pdf_message = {
   "type": "image_url",
   "image_url": {"url": "gs://arxiv-dataset/arxiv/arxiv/pdf/2312/2312.10256v1.pdf"},
}

text_message = {
   "type": "text",
   "text": "What is the biggest challenge with MARL? Respond in 10 words or fewer.",
}

# We simply add the PDF part to the content of the message
message = HumanMessage(content=[pdf_message, text_message])

output = llm.invoke([message])
print(output.content)

Non-stationarity, coordination, and computational complexity are big challenges.



You can also provide multiple PDFs and ask more advanced questions:

In [78]:
pdf_message_1 = {
   "type": "image_url",
   "image_url": {"url": "gs://arxiv-dataset/arxiv/arxiv/pdf/2312/2312.10256v1.pdf"},
}
pdf_message_2 = {
   "type": "image_url",
   "image_url": {"url": "gs://arxiv-dataset/arxiv/arxiv/pdf/2410/2410.01706v1.pdf"},
}

text_message = {
   "type": "text",
   "text": "What are the key differences between MARL and Sable? Respond in 50 words or fewer.",
}

# We simply add both PDF parts to the content of the message
message = HumanMessage(content=[pdf_message_1, pdf_message_2, text_message])

output = llm.invoke([message])
Markdown(output.content)

MARL is a broad field; Sable is a specific algorithm. MARL addresses multi-agent learning challenges, while Sable is a novel approach using retentive networks for efficient, scalable, and performant multi-agent control. Sable is a specific solution within the broader MARL domain.
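The content lists in the two cells above follow the same shape, so a small helper can build them for any number of PDFs. This is a sketch with a hypothetical function name, `build_pdf_question`. It only assembles plain dictionaries in the format used above; the resulting list would then be passed as `HumanMessage(content=...)`.

```python
def build_pdf_question(pdf_uris: list[str], question: str) -> list[dict]:
    """Build the content list: one image_url part per PDF, then the question."""
    parts = [{"type": "image_url", "image_url": {"url": uri}} for uri in pdf_uris]
    parts.append({"type": "text", "text": question})
    return parts


content = build_pdf_question(
    [
        "gs://arxiv-dataset/arxiv/arxiv/pdf/2312/2312.10256v1.pdf",
        "gs://arxiv-dataset/arxiv/arxiv/pdf/2410/2410.01706v1.pdf",
    ],
    "What are the key differences between MARL and Sable? Respond in 50 words or fewer.",
)
print(len(content), [part["type"] for part in content])

# With the imports from the cells above:
# message = HumanMessage(content=content)
# output = llm.invoke([message])
```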
