### Summarization
A common use case is summarizing long documents, which quickly runs into the model's context window limit. Unlike question answering, you can't rely on semantic search to select only the chunks most relevant to a question, because there is no particular question: you want to summarize everything. So what do you do?

The most common approach is to split the document into chunks and summarize recursively: first summarize each chunk on its own, then group the summaries into chunks and summarize each group of summaries, and repeat until only one summary is left.
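
The recursive procedure described above can be sketched in plain Python. This is a minimal illustration, not LangChain's implementation: `group_texts`, `recursive_summarize`, and `first_words` are hypothetical names, and `first_words` is only a stand-in for a real LLM call so that the sketch runs without any API key.

```python
from typing import Callable, List


def group_texts(texts: List[str], max_chars: int) -> List[List[str]]:
    """Greedily pack consecutive texts into groups of at most max_chars characters."""
    groups: List[List[str]] = []
    current: List[str] = []
    size = 0
    for text in texts:
        # Start a new group when adding this text would exceed the budget.
        if current and size + len(text) > max_chars:
            groups.append(current)
            current, size = [], 0
        current.append(text)
        size += len(text)
    if current:
        groups.append(current)
    return groups


def recursive_summarize(
    chunks: List[str], summarize: Callable[[str], str], max_chars: int
) -> str:
    """Summarize each chunk, then summarize groups of summaries until one is left."""
    if not chunks:
        return ""
    summaries = [summarize(chunk) for chunk in chunks]
    while len(summaries) > 1:
        groups = group_texts(summaries, max_chars)
        if len(groups) == len(summaries):
            # Nothing could be grouped within the budget: merge pairwise
            # so that every pass still reduces the number of summaries.
            groups = [summaries[i:i + 2] for i in range(0, len(summaries), 2)]
        summaries = [summarize("\n".join(group)) for group in groups]
    return summaries[0]


def first_words(text: str, limit: int = 8) -> str:
    """Hypothetical stand-in for an LLM call: keep the first few words."""
    return " ".join(text.split()[:limit])
```

In a real pipeline, `summarize` would wrap a model call and `max_chars` would be replaced by a token budget derived from the model's context window.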

#### Set environment variables

In [1]:
import os  
import json  
import openai
from Utilities.envVars import *

# Set the search index name from environment variables
indexName = SearchIndex

# Set OpenAI API key and endpoint
openai.api_type = "azure"
openai.api_version = OpenAiVersion
openai_api_key = OpenAiKey
assert openai_api_key, "ERROR: Azure OpenAI Key is missing"
openai.api_key = openai_api_key
openAiEndPoint = f"{OpenAiEndPoint}"
assert openAiEndPoint, "ERROR: Azure OpenAI Endpoint is missing"
assert "openai.azure.com" in openAiEndPoint.lower(), "ERROR: Azure OpenAI Endpoint should be in the form: \n\n\t<your unique endpoint identifier>.openai.azure.com"
openai.api_base = openAiEndPoint

#### Summarize the document/PDF directly (instead of retrieving the data from a vector store or document repository)

In [2]:
# Import required libraries
from langchain.llms.openai import AzureOpenAI, OpenAI
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.document_loaders import (
    PDFMinerLoader,
    UnstructuredFileLoader,
)

In [3]:
# Set the file name and path of the PDF to summarize
fileName = "Fabric Get Started.pdf"
fabricGetStartedPath = "Data/PDF/" + fileName
# Load the PDF with the document loader available from LangChain
loader = PDFMinerLoader(fabricGetStartedPath)
rawDocs = loader.load()
# Record the source path in each document's metadata
for doc in rawDocs:
    doc.metadata['source'] = fabricGetStartedPath

# Split the document into chunks of at most 1500 characters with no overlap
textSplitter = RecursiveCharacterTextSplitter(chunk_size=1500, chunk_overlap=0)
docs = textSplitter.split_documents(rawDocs)

In [4]:
print("Number of documents chunks generated from PDF : ", len(docs))

Number of documents chunks generated from PDF :  58
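
To make the `chunk_size` and `chunk_overlap` settings concrete, here is a deliberately simplified sketch of fixed-size character chunking. It is an assumption-laden illustration only: `split_fixed` is a hypothetical helper, and the actual `RecursiveCharacterTextSplitter` also tries to break on separators such as paragraph and line boundaries rather than cutting at exact character offsets.

```python
from typing import List


def split_fixed(text: str, chunk_size: int, chunk_overlap: int = 0) -> List[str]:
    """Cut text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    chunks: List[str] = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text.
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks
```

With `chunk_overlap=0`, as in the cell above, consecutive chunks share no text; a positive overlap repeats the tail of one chunk at the head of the next, which can help when a sentence straddles a boundary.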


In [5]:
import logging

from langchain.chains.qa_with_sources import load_qa_with_sources_chain
from langchain.chat_models import AzureChatOpenAI, ChatOpenAI
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.docstore.document import Document
from langchain.prompts import PromptTemplate
from IPython.display import display, HTML
from langchain.chains.summarize import load_summarize_chain

# Switch between Azure OpenAI ("azureopenai") and OpenAI ("openai")
embeddingModelType = "azureopenai"
temperature = 0.3
tokenLength = 1000

if embeddingModelType == "azureopenai":
    openai.api_type = "azure"
    openai.api_key = OpenAiKey
    openai.api_version = OpenAiVersion
    openai.api_base = f"{OpenAiEndPoint}"

    llm = AzureChatOpenAI(
        openai_api_base=openai.api_base,
        openai_api_version=OpenAiVersion,
        deployment_name=OpenAiChat,
        temperature=temperature,
        openai_api_key=OpenAiKey,
        openai_api_type="azure",
        max_tokens=tokenLength,
    )
    embeddings = OpenAIEmbeddings(deployment=OpenAiEmbedding, chunk_size=1, openai_api_key=OpenAiKey)
    logging.info("LLM Setup done")
elif embeddingModelType == "openai":
    openai.api_type = "open_ai"
    openai.api_base = "https://api.openai.com/v1"
    openai.api_version = "2020-11-07"
    openai.api_key = OpenAiApiKey
    llm = ChatOpenAI(
        temperature=temperature,
        openai_api_key=OpenAiApiKey,
        model_name="gpt-3.5-turbo",
        max_tokens=tokenLength,
    )
    embeddings = OpenAIEmbeddings(openai_api_key=OpenAiApiKey)


In [6]:
# The "stuff" chain type puts every chunk into a single prompt, so with a document this large we expect a token limit error
chainType = "stuff"
summaryChain = load_summarize_chain(llm, chain_type=chainType)
summary = summaryChain.run(docs)
print("Summary: ", summary)


InvalidRequestError: This model's maximum context length is 4096 tokens. However, your messages resulted in 17332 tokens. Please reduce the length of the messages.
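
One way to anticipate this error is to estimate the prompt size before choosing a chain type. The sketch below is a rough heuristic, not an exact count: the ratio of four characters per token is an assumption for English text, and `estimate_tokens` and `pick_chain_type` are hypothetical helpers. The 4096-token limit comes from the error message above and the 1000 reserved output tokens from `tokenLength`.

```python
from typing import List

CHARS_PER_TOKEN = 4  # Assumption: rough average for English text, not a real tokenizer.


def estimate_tokens(texts: List[str], chars_per_token: int = CHARS_PER_TOKEN) -> int:
    """Estimate the token count of all texts combined, rounding up."""
    total_chars = sum(len(text) for text in texts)
    return -(-total_chars // chars_per_token)  # ceiling division


def pick_chain_type(texts: List[str], context_limit: int, reserved_for_output: int) -> str:
    """Return "stuff" if everything plausibly fits in one prompt, else "map_reduce"."""
    budget = context_limit - reserved_for_output
    return "stuff" if estimate_tokens(texts) <= budget else "map_reduce"
```

For example, `pick_chain_type([d.page_content for d in docs], 4096, tokenLength)` would be one way to apply it to the chunks loaded earlier. An exact count would need the model's own tokenizer.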

In [9]:
# Switch the chain type from "stuff" to "map_reduce", which summarizes each chunk separately and then combines the chunk summaries
chainType = "map_reduce"
summaryChain = load_summarize_chain(llm, chain_type=chainType, return_intermediate_steps=True)
summary = summaryChain({"input_documents": docs}, return_only_outputs=True)
outputAnswer = summary['output_text']
print(outputAnswer)

Microsoft Fabric is a unified platform that offers data and analytics solutions for organizations. It simplifies analytics needs by providing an integrated and easy-to-use product. It combines components from Power BI, Azure Synapse, and Azure Data Explorer into a unified environment. Fabric allows users to focus on their work without needing to understand or manage the underlying infrastructure. It also offers a Data Warehouse experience and real-time analytics. Power BI is highlighted as the leading business intelligence platform in Fabric. OneLake is a unified storage system that simplifies data management and sharing. The Fabric (Preview) trial allows users to store workspaces and items and run Fabric experiences. It provides access to all Fabric experiences and features, as well as 1 TB of OneLake storage. The trial capacity can be shared with others or additional capacity can be purchased. The Help pane and Fabric settings pane provide assistance and personalization options. Work

In [10]:
# For the map_reduce and refine chain types, the chain can also return the intermediate steps of the pipeline.
# With map_reduce, each intermediate step is the summary of one chunk, so you can inspect them individually.
intermediateSteps = summary['intermediate_steps']
for step in intermediateSteps:
        display(HTML("<b>Chunk Summary:</b> " + step))

In [11]:
from langchain.prompts import PromptTemplate
# The previous run used LangChain's default prompt; here we supply a custom prompt, which you can modify to suit your needs
promptTemplate = """You are an AI assistant tasked with summarizing documents. 
        Your summary should accurately capture the key information in the document while avoiding the omission of any domain-specific words. 
        Please generate a concise and comprehensive summary that includes details. 
        Ensure that the summary is easy to understand and provides an accurate representation. 
        Begin the summary with a brief introduction, followed by the main points. 
        Please remember to use clear language and maintain the integrity of the original information without missing any important details:
        {text}

        """
customPrompt = PromptTemplate(template=promptTemplate, input_variables=["text"])
chainType = "map_reduce"
summaryChain = load_summarize_chain(llm, chain_type=chainType, return_intermediate_steps=True, 
                                    map_prompt=customPrompt, combine_prompt=customPrompt)
summary = summaryChain({"input_documents": docs}, return_only_outputs=True)
outputAnswer = summary['output_text']
print(outputAnswer)



Microsoft Fabric is a unified platform that offers data and analytics solutions for organizations. It combines components from Power BI, Azure Synapse, and Azure Data Explorer into a single integrated environment. The platform provides customized user experiences for data engineering, data factory, data science, data visualization, and data governance. It offers extensive analytics capabilities, shared experiences, easy access to assets, a unified data lake, and centralized administration and governance. Microsoft Fabric also combines the benefits of data lakes and data warehouses through the lakehouse architecture. It allows users to provision and configure their own storage accounts and provides a trial capacity for exploring the platform.

Microsoft Fabric Home is a centralized hub that allows users to access and manage their items. It provides a search bar, filters, and a user-friendly interface for organizing and working with data. Users can create, edit, delete, and share items d
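
The cell above passes the same template as both `map_prompt` and `combine_prompt`. The sketch below illustrates what that plausibly means for the text sent to the model, using plain `str.format` instead of LangChain. It rests on assumptions: that the map step fills `{text}` with one chunk at a time, that the combine step fills it with the chunk summaries joined by a blank line, and a shortened stand-in template `PROMPT`. The helper names are hypothetical.

```python
from typing import List

# Hypothetical short template standing in for the longer custom prompt.
PROMPT = "Summarize the following text:\n{text}\n"


def build_map_prompts(chunks: List[str]) -> List[str]:
    """One prompt per chunk: the placeholder receives a single chunk."""
    return [PROMPT.format(text=chunk) for chunk in chunks]


def build_combine_prompt(summaries: List[str], separator: str = "\n\n") -> str:
    """One final prompt: the placeholder receives all chunk summaries joined together."""
    return PROMPT.format(text=separator.join(summaries))
```

Because the combine step sees summaries rather than source text, a template written for raw documents may behave differently there; using a separate, shorter `combine_prompt` is a design choice worth considering.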

In [12]:
# For the map_reduce and refine chain types, the chain can also return the intermediate steps of the pipeline.
# With map_reduce, each intermediate step is the summary of one chunk, here produced with the custom prompt.
intermediateSteps = summary['intermediate_steps']
for step in intermediateSteps:
        display(HTML("<b>Chunk Summary:</b> " + step))
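
The comments above mention the refine chain type, which this notebook does not run. As a rough sketch of the idea, assuming refine means "summarize the first chunk, then update that summary once per remaining chunk", the loop below shows the control flow in plain Python. `refine_summarize`, `stub_initial`, and `stub_refine` are hypothetical, and the stubs only imitate model calls. The returned keys mirror the `output_text` and `intermediate_steps` keys used above.

```python
from typing import Callable, Dict, List, Union


def refine_summarize(
    chunks: List[str],
    initial: Callable[[str], str],
    refine: Callable[[str, str], str],
) -> Dict[str, Union[str, List[str]]]:
    """Build a running summary: start from the first chunk, then refine it chunk by chunk."""
    steps: List[str] = []
    summary = ""
    for index, chunk in enumerate(chunks):
        # The first chunk is summarized on its own; later chunks update the running summary.
        summary = initial(chunk) if index == 0 else refine(summary, chunk)
        steps.append(summary)
    return {"output_text": summary, "intermediate_steps": steps}


def stub_initial(chunk: str) -> str:
    """Stand-in for an LLM call: keep the first sentence of the chunk."""
    return chunk.split(".")[0].strip()


def stub_refine(summary: str, chunk: str) -> str:
    """Stand-in for an LLM call: append the first sentence of the new chunk."""
    return summary + "; " + chunk.split(".")[0].strip()
```

Unlike map_reduce, the calls here are sequential, since each step depends on the previous summary, so the chunks cannot be processed in parallel.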