This notebook is based on Google Colab environment. And the demo is just for text summarization, you can change it to other scenario as needed.

**Download text files that will be used**

In [23]:
! wget https://github.com/fangtailin/langchain_MR_with_GCP_genai/blob/main/data/large_text_named_worked.txt
! wget https://github.com/fangtailin/langchain_MR_with_GCP_genai/blob/main/data/small_text_named_muir_lake_tahoe_in_winter.txt

--2023-07-17 01:31:04--  https://github.com/fangtailin/langchain_MR_with_GCP_genai/blob/main/data/large_text_named_worked.txt
Resolving github.com (github.com)... 140.82.112.3
Connecting to github.com (github.com)|140.82.112.3|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 86912 (85K) [text/plain]
Saving to: ‘large_text_named_worked.txt’


2023-07-17 01:31:04 (1.67 MB/s) - ‘large_text_named_worked.txt’ saved [86912/86912]

--2023-07-17 01:31:04--  https://github.com/fangtailin/langchain_MR_with_GCP_genai/blob/main/data/small_text_named_muir_lake_tahoe_in_winter.txt
Resolving github.com (github.com)... 140.82.112.3
Connecting to github.com (github.com)|140.82.112.3|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 18094 (18K) [text/plain]
Saving to: ‘small_text_named_muir_lake_tahoe_in_winter.txt’


2023-07-17 01:31:04 (736 KB/s) - ‘small_text_named_muir_lake_tahoe_in_winter.txt’ saved [18094/18094]



**Install relevant packages**

In [6]:
! pip install "unstructured[local-inference]"
! pip install layoutparser[layoutmodels,tesseract]
! pip install langchain
! pip install shapely==1.8.1
! pip install google-cloud-aiplatform --upgrade
! pip install pillow



**Import relevant libararies**

In [24]:
from langchain.document_loaders import UnstructuredFileLoader
from langchain.chains.summarize import load_summarize_chain
from langchain.chains.question_answering import load_qa_chain
import os
from langchain.llms import VertexAI
from langchain import PromptTemplate, LLMChain

**Set project ID and region information**

Input your correct PROJECT_ID and region in the follow cell.

In [25]:
PROJECT_ID = 'XXXX' # @param {type:"string"}
os.environ['PROJECT_ID'] = PROJECT_ID
LOCATION = 'xxxx-xxxx' # @param {type:"string"}
os.environ['LOCATION_ID'] = LOCATION

**Load Documents**

In [31]:
sm_loader = UnstructuredFileLoader("/content/small_text_named_muir_lake_tahoe_in_winter.txt")
sm_doc = sm_loader.load()

lg_loader = UnstructuredFileLoader("/content/large_text_named_worked.txt")
lg_doc = lg_loader.load()

In [33]:
def doc_summary(docs):
    print (f'You have {len(docs)} document(s)')

    num_words = sum([len(doc.page_content.split(' ')) for doc in docs])

    print (f'You have roughly {num_words} words in your docs')
    print ()
    print (f'Preview: \n{docs[0].page_content.split(". ")[0]}')

doc_summary(sm_doc)


You have 1 document(s)
You have roughly 2320 words in your docs

Preview: 
{"payload":{"allShortcutsEnabled":false,"fileTree":{"data":{"items":[{"name":"large_text_named_worked.txt","path":"data/large_text_named_worked.txt","contentType":"file"},{"name":"small_text_named_muir_lake_tahoe_in_winter.txt","path":"data/small_text_named_muir_lake_tahoe_in_winter.txt","contentType":"file"},{"name":"test.txt","path":"data/test.txt","contentType":"file"}],"totalCount":3},"":{"items":[{"name":"data","path":"data","contentType":"directory"},{"name":"LICENSE","path":"LICENSE","contentType":"file"},{"name":"README.md","path":"README.md","contentType":"file"}],"totalCount":3}},"fileTreeProcessingTime":2.9682720000000002,"foldersToFetch":[],"reducedMotionEnabled":null,"repo":{"id":666981890,"defaultBranch":"main","name":"langchain_MR_with_GCP_genai","ownerLogin":"fangtailin","currentUserCanPush":false,"isFork":false,"isEmpty":false,"createdAt":"2023-07-16T08:36:54.000Z","ownerAvatar":"https://avatars

In [34]:
doc_summary(lg_doc)

You have 1 document(s)
You have roughly 12556 words in your docs

Preview: 
{"payload":{"allShortcutsEnabled":false,"fileTree":{"data":{"items":[{"name":"large_text_named_worked.txt","path":"data/large_text_named_worked.txt","contentType":"file"},{"name":"small_text_named_muir_lake_tahoe_in_winter.txt","path":"data/small_text_named_muir_lake_tahoe_in_winter.txt","contentType":"file"},{"name":"test.txt","path":"data/test.txt","contentType":"file"}],"totalCount":3},"":{"items":[{"name":"data","path":"data","contentType":"directory"},{"name":"LICENSE","path":"LICENSE","contentType":"file"},{"name":"README.md","path":"README.md","contentType":"file"}],"totalCount":3}},"fileTreeProcessingTime":4.1075859999999995,"foldersToFetch":[],"reducedMotionEnabled":null,"repo":{"id":666981890,"defaultBranch":"main","name":"langchain_MR_with_GCP_genai","ownerLogin":"fangtailin","currentUserCanPush":false,"isFork":false,"isEmpty":false,"createdAt":"2023-07-16T08:36:54.000Z","ownerAvatar":"https://avatar

**Colab Authentication with GCP**

In [35]:
from google.colab import auth as google_auth
google_auth.authenticate_user()

**Load LLM**

In [36]:
llm = VertexAI(project=PROJECT_ID)

**Summarize: Stuff**

In [37]:
chain = load_summarize_chain(llm, chain_type="stuff", verbose=True)

In [38]:
chain.run(sm_doc)



[1m> Entering new StuffDocumentsChain chain...[0m


[1m> Entering new LLMChain chain...[0m
Prompt after formatting:
[32;1m[1;3mWrite a concise summary of the following:


"{"payload":{"allShortcutsEnabled":false,"fileTree":{"data":{"items":[{"name":"large_text_named_worked.txt","path":"data/large_text_named_worked.txt","contentType":"file"},{"name":"small_text_named_muir_lake_tahoe_in_winter.txt","path":"data/small_text_named_muir_lake_tahoe_in_winter.txt","contentType":"file"},{"name":"test.txt","path":"data/test.txt","contentType":"file"}],"totalCount":3},"":{"items":[{"name":"data","path":"data","contentType":"directory"},{"name":"LICENSE","path":"LICENSE","contentType":"file"},{"name":"README.md","path":"README.md","contentType":"file"}],"totalCount":3}},"fileTreeProcessingTime":2.9682720000000002,"foldersToFetch":[],"reducedMotionEnabled":null,"repo":{"id":666981890,"defaultBranch":"main","name":"langchain_MR_with_GCP_genai","ownerLogin":"fangtailin","currentUserCanPush":f

'John Muir describes his trip to Lake Tahoe in winter. He writes about the beauty of the snow-covered mountains and lakes, and the fun he had snowshoeing and swimming in the icy lake. He also describes the wildlife he saw, including a bear and two pet coons.'

In [None]:
chain.run(lg_doc)

# There will be error and you can't get the summary



[1m> Entering new StuffDocumentsChain chain...[0m


[1m> Entering new LLMChain chain...[0m
Prompt after formatting:
[32;1m[1;3mWrite a concise summary of the following:


"{"payload":{"allShortcutsEnabled":false,"fileTree":{"data":{"items":[{"name":"large_text_named_worked.txt","path":"data/large_text_named_worked.txt","contentType":"file"},{"name":"small_text_named_muir_lake_tahoe_in_winter.txt","path":"data/small_text_named_muir_lake_tahoe_in_winter.txt","contentType":"file"},{"name":"test.txt","path":"data/test.txt","contentType":"file"}],"totalCount":3},"":{"items":[{"name":"data","path":"data","contentType":"directory"},{"name":"LICENSE","path":"LICENSE","contentType":"file"},{"name":"README.md","path":"README.md","contentType":"file"}],"totalCount":3}},"fileTreeProcessingTime":4.1075859999999995,"foldersToFetch":[],"reducedMotionEnabled":null,"repo":{"id":666981890,"defaultBranch":"main","name":"langchain_MR_with_GCP_genai","ownerLogin":"fangtailin","currentUserCanPush":f

InvalidArgument: ignored

**Summarize: Map Reduce**

In [40]:
chain = load_summarize_chain(llm, chain_type="map_reduce", verbose=True)

In [41]:
chain.run(sm_doc)



[1m> Entering new MapReduceDocumentsChain chain...[0m


[1m> Entering new LLMChain chain...[0m
Prompt after formatting:
[32;1m[1;3mWrite a concise summary of the following:


"{"payload":{"allShortcutsEnabled":false,"fileTree":{"data":{"items":[{"name":"large_text_named_worked.txt","path":"data/large_text_named_worked.txt","contentType":"file"},{"name":"small_text_named_muir_lake_tahoe_in_winter.txt","path":"data/small_text_named_muir_lake_tahoe_in_winter.txt","contentType":"file"},{"name":"test.txt","path":"data/test.txt","contentType":"file"}],"totalCount":3},"":{"items":[{"name":"data","path":"data","contentType":"directory"},{"name":"LICENSE","path":"LICENSE","contentType":"file"},{"name":"README.md","path":"README.md","contentType":"file"}],"totalCount":3}},"fileTreeProcessingTime":2.9682720000000002,"foldersToFetch":[],"reducedMotionEnabled":null,"repo":{"id":666981890,"defaultBranch":"main","name":"langchain_MR_with_GCP_genai","ownerLogin":"fangtailin","currentUserCanPus

'John Muir describes his trip to Lake Tahoe in winter, writing about the beauty of the snow-covered mountains and lakes, the fun he had snowshoeing and swimming in the icy lake, and the wildlife he saw.'

In [51]:
from langchain.text_splitter import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(
    # Set a really small chunk size, just to show.
    chunk_size = 8000, # Need to change based on your requirements
    chunk_overlap = 0
)

In [52]:
lg_docs = text_splitter.split_documents(lg_doc)

In [53]:
doc_summary(lg_docs)

You have 13 document(s)
You have roughly 12568 words in your docs

Preview: 
{"payload":{"allShortcutsEnabled":false,"fileTree":{"data":{"items":[{"name":"large_text_named_worked.txt","path":"data/large_text_named_worked.txt","contentType":"file"},{"name":"small_text_named_muir_lake_tahoe_in_winter.txt","path":"data/small_text_named_muir_lake_tahoe_in_winter.txt","contentType":"file"},{"name":"test.txt","path":"data/test.txt","contentType":"file"}],"totalCount":3},"":{"items":[{"name":"data","path":"data","contentType":"directory"},{"name":"LICENSE","path":"LICENSE","contentType":"file"},{"name":"README.md","path":"README.md","contentType":"file"}],"totalCount":3}},"fileTreeProcessingTime":4.1075859999999995,"foldersToFetch":[],"reducedMotionEnabled":null,"repo":{"id":666981890,"defaultBranch":"main","name":"langchain_MR_with_GCP_genai","ownerLogin":"fangtailin","currentUserCanPush":false,"isFork":false,"isEmpty":false,"createdAt":"2023-07-16T08:36:54.000Z","ownerAvatar":"https://avata

Here only 6 splits are used to show the effect of the map reduce, and it's also the same for "refine"

In [55]:
chain.run(lg_docs[:6])



[1m> Entering new RefineDocumentsChain chain...[0m


[1m> Entering new LLMChain chain...[0m
Prompt after formatting:
[32;1m[1;3mWrite a concise summary of the following:


"{"payload":{"allShortcutsEnabled":false,"fileTree":{"data":{"items":[{"name":"large_text_named_worked.txt","path":"data/large_text_named_worked.txt","contentType":"file"},{"name":"small_text_named_muir_lake_tahoe_in_winter.txt","path":"data/small_text_named_muir_lake_tahoe_in_winter.txt","contentType":"file"},{"name":"test.txt","path":"data/test.txt","contentType":"file"}],"totalCount":3},"":{"items":[{"name":"data","path":"data","contentType":"directory"},{"name":"LICENSE","path":"LICENSE","contentType":"file"},{"name":"README.md","path":"README.md","contentType":"file"}],"totalCount":3}},"fileTreeProcessingTime":4.1075859999999995,"foldersToFetch":[],"reducedMotionEnabled":null,"repo":{"id":666981890,"defaultBranch":"main","name":"langchain_MR_with_GCP_genai","ownerLogin":"fangtailin","currentUserCanPush":

''

**Summarize: Refine**

In [56]:
chain = load_summarize_chain(llm, chain_type="refine", verbose=True)

In [57]:
chain.run(lg_docs[:6])



[1m> Entering new RefineDocumentsChain chain...[0m


[1m> Entering new LLMChain chain...[0m
Prompt after formatting:
[32;1m[1;3mWrite a concise summary of the following:


"{"payload":{"allShortcutsEnabled":false,"fileTree":{"data":{"items":[{"name":"large_text_named_worked.txt","path":"data/large_text_named_worked.txt","contentType":"file"},{"name":"small_text_named_muir_lake_tahoe_in_winter.txt","path":"data/small_text_named_muir_lake_tahoe_in_winter.txt","contentType":"file"},{"name":"test.txt","path":"data/test.txt","contentType":"file"}],"totalCount":3},"":{"items":[{"name":"data","path":"data","contentType":"directory"},{"name":"LICENSE","path":"LICENSE","contentType":"file"},{"name":"README.md","path":"README.md","contentType":"file"}],"totalCount":3}},"fileTreeProcessingTime":4.1075859999999995,"foldersToFetch":[],"reducedMotionEnabled":null,"repo":{"id":666981890,"defaultBranch":"main","name":"langchain_MR_with_GCP_genai","ownerLogin":"fangtailin","currentUserCanPush":

''