## Summarize large documents using LangChain and Gemini

## Setup
First, install the required packages and set the necessary environment variables.


### Installation
Install LangChain's Python library, `langchain-community`.

In [4]:
!pip install --quiet langchain-community

Install LangChain's integration package for Gemini on Vertex AI, `langchain-google-vertexai`, which provides the `VertexAI` class imported later in this notebook.

In [8]:
!pip install --quiet langchain-google-vertexai

### Get credentials
Set the environment variable `GOOGLE_APPLICATION_CREDENTIALS` to the path of your Google credentials file so that Vertex AI can authenticate.



In [9]:
import os

import vertexai

# Load the Google credentials config file
credential_path="/Users/gongbiao/Code/vertex-ai/config/google_access_token_cp.json"
if os.path.exists(credential_path):
    print("the config load success")
else:
    print("config file does not exist!")

# Initialize Vertex AI
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = credential_path
project_id = "gen-lang-client-0115788367"
location = "us-central1"
vertexai.init(project=project_id, location=location)


the config load success
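
The check above only prints a message, so a missing file would surface later as an authentication error. A minimal sketch of a stricter variant follows. `resolve_credentials` and `configure_credentials` are hypothetical helpers, not part of Vertex AI or LangChain, and they only verify that the file exists, not that its contents are valid.

```python
import os
from pathlib import Path


def resolve_credentials(path: str) -> str:
    """Return the absolute path to a credentials file, or raise if it is missing.

    Hypothetical helper for illustration; it does not validate the file's contents.
    """
    resolved = Path(path).expanduser().resolve()
    if not resolved.is_file():
        raise FileNotFoundError(f"Credentials file not found: {resolved}")
    return str(resolved)


def configure_credentials(path: str) -> None:
    """Point GOOGLE_APPLICATION_CREDENTIALS at a verified credentials file."""
    os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = resolve_credentials(path)
```

Calling `configure_credentials(credential_path)` before `vertexai.init(...)` would stop the notebook in this cell if the path is wrong, instead of failing at the first model call.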


### Set up a proxy (optional)

In [26]:
# [Optional] Set the proxy
proxy = "http://127.0.0.1:8889"
os.environ["HTTP_PROXY"] = proxy
os.environ["HTTPS_PROXY"] = proxy
os.environ["http_proxy"] = proxy
os.environ["https_proxy"] = proxy
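
To confirm that the variables were picked up, the sketch below wraps the assignments in a hypothetical `set_proxy` helper and reads them back with `urllib.request.getproxies()` from the standard library. The proxy address is the example value used above. Whether a given HTTP client honors these variables depends on that client.

```python
import os
import urllib.request


def set_proxy(url: str) -> dict:
    """Set the common proxy variables and return what urllib detects.

    Hypothetical helper for illustration only.
    """
    for name in ("HTTP_PROXY", "HTTPS_PROXY", "http_proxy", "https_proxy"):
        os.environ[name] = url
    # Scans the environment first; may fall back to system settings on macOS/Windows.
    return urllib.request.getproxies()


detected = set_proxy("http://127.0.0.1:8889")
print(detected.get("http"), detected.get("https"))
```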

### Summarize text
In this tutorial, you will summarize the text of a web page using the Gemini model, integrated through LangChain.
You'll perform the following steps:
1. Read and parse the website data using LangChain.
2. Chain together the following:
   - A prompt for extracting the required input data from the parsed website data.
   - A prompt for summarizing the text using LangChain.
   - An LLM (Gemini) for prompting.
3. Run the created chain to prompt the model for a summary of the website data.

### Import the required libraries

In [15]:
from langchain_community.document_loaders import WebBaseLoader
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import PromptTemplate, format_document

### Read and parse the website data
LangChain provides a wide variety of document loaders. To read the website data as a document, you will use the WebBaseLoader from LangChain.

In [39]:
loader = WebBaseLoader("https://coolshell.cn/articles/20793.html")
docs = loader.load()
#print(docs)

### Initialize Gemini LLM

In [28]:
from langchain_google_vertexai import VertexAI
llm = VertexAI(model_name="gemini-1.5-pro-001", temperature=0.7, top_p=0.85)
llm.invoke("How are you today?")

"As a large language model, I don't experience emotions like humans do.  However, I'm ready to assist you with any questions or tasks you may have! 😊  What can I help you with today? \n"

In [36]:
# To extract data from WebBaseLoader
doc_prompt = PromptTemplate.from_template("{page_content}")

# To query Gemini
llm_prompt_template = """Write a concise summary, in Chinese, of the following:
"{text}"
CONCISE SUMMARY:"""
llm_prompt = PromptTemplate.from_template(llm_prompt_template)
print(llm_prompt)

input_variables=['text'] template='Write a concise summary, in Chinese, of the following:\n"{text}"\nCONCISE SUMMARY:'


### Create a Stuff documents chain
LangChain provides Chains for chaining together LLMs with each other or other components for complex applications. 
You will create a <b>Stuff documents chain</b> for this application. A <b>Stuff documents chain</b> lets you combine all the documents, insert them into the prompt and pass that prompt to the LLM.
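
Conceptually, "stuffing" is string concatenation followed by template substitution. The sketch below illustrates the idea in plain Python, without LangChain. `Doc`, `TEMPLATE`, and `build_stuffed_prompt` are hypothetical stand-ins for illustration, not LangChain APIs.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for a loaded document."""
    page_content: str
    metadata: dict = field(default_factory=dict)


# Illustrative template with a single `text` placeholder.
TEMPLATE = 'Write a concise summary of the following:\n"{text}"\nCONCISE SUMMARY:'


def build_stuffed_prompt(documents, template=TEMPLATE):
    """Join all documents with blank lines and insert them into one prompt."""
    text = "\n\n".join(doc.page_content for doc in documents)
    return template.format(text=text)


example_docs = [Doc("First page text."), Doc("Second page text.")]
print(build_stuffed_prompt(example_docs))
```

Because every document ends up in a single prompt, this approach only works while the combined text fits in the model's context window.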

In [37]:
# The chain implements the following pipeline:
# 1. Extract data from documents and save to variable `text`.
# 2. This `text` is then passed to the prompt and input variable in prompt is populated.
# 3. The prompt is then passed to the LLM (Gemini).
# 4. Output from the LLM is passed through an output parser to structure the model response.

stuff_chain = (
    # Extract data from the documents and add to the key `text`.
    {
        "text": lambda docs: "\n\n".join(
            format_document(doc, doc_prompt) for doc in docs
        )
    }
    | llm_prompt         # Prompt for Gemini
    | llm                # Gemini function
    | StrOutputParser()  # output parser
)

### Prompt the model
To generate the summary of the website data, pass the documents loaded by the `WebBaseLoader` (`docs`) to the chain's `invoke()` method.


In [38]:
stuff_chain.invoke(docs)

'这篇文章讲解了与程序员相关的CPU缓存知识。文章首先介绍了CPU缓存的基本概念，包括缓存的层级结构、大小、速度以及缓存行的概念。接着，文章解释了缓存命中的重要性，并通过代码示例说明了缓存命中率对程序性能的影响。\n\n随后，文章深入探讨了多核CPU下的缓存一致性问题，介绍了两种常见的缓存一致性协议：Directory协议和Snoopy协议，并详细解释了MESI和MOESI协议的工作原理。\n\n最后，文章通过五个代码示例，展示了缓存行、缓存命中率、缓存一致性以及'