### Summarization
A common use case is wanting to summarize long documents. This naturally runs into the context window limitations. Unlike in question-answering, you can't just do some semantic search hacks to only select the chunks of text most relevant to the question (because, in this case, there is no particular question - you want to summarize everything). So what do you do then?

The most common way around this is to split the documents into chunks and then do summarization in a recursive manner. By this we mean you first summarize each chunk by itself, then you group the summaries into chunks and summarize each chunk of summaries, and continue doing that until only one is left.

#### Set Environment variables

In [1]:
import os  
import json  
import openai
from Utilities.envVars import *

# Set Search Service endpoint, index name, and API key from environment variables
indexName = SearchIndex

# Set OpenAI API key and endpoint
openai.api_type = "azure"
openai.api_version = OpenAiVersion
openai_api_key = OpenAiKey
assert openai_api_key, "ERROR: Azure OpenAI Key is missing"
openai.api_key = openai_api_key
openAiEndPoint = f"{OpenAiEndPoint}"
assert openAiEndPoint, "ERROR: Azure OpenAI Endpoint is missing"
openai.api_base = openAiEndPoint

#### Summarize the document/PDF (instead of getting that data from Vector store or document reposit)

In [2]:
# Import required libraries
# Import required libraries
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.document_loaders import (
    PDFMinerLoader,
    UnstructuredFileLoader,
)

In [3]:
# Set the file name and the namespace for the index
fileName = "Fabric Get Started.pdf"
fabricGetStartedPath = "Data/PDF/" + fileName
# Load the PDF with Document Loader available from Langchain
loader = PDFMinerLoader(fabricGetStartedPath)
rawDocs = loader.load()
# Set the source 
for doc in rawDocs:
    doc.metadata['source'] = fabricGetStartedPath

textSplitter = RecursiveCharacterTextSplitter(chunk_size=1500, chunk_overlap=0)
docs = textSplitter.split_documents(rawDocs)

In [4]:
print("Number of documents chunks generated from PDF : ", len(docs))

Number of documents chunks generated from PDF :  58


In [5]:
from langchain.chains.qa_with_sources import load_qa_with_sources_chain
from langchain.chat_models import AzureChatOpenAI, ChatOpenAI
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.docstore.document import Document
from langchain.prompts import PromptTemplate
from IPython.display import display, HTML
from langchain.chains.summarize import load_summarize_chain

# Flexibility to change the call to OpenAI or Azure OpenAI
embeddingModelType = "azureopenai"
temperature = 0.3
tokenLength = 1000

if (embeddingModelType == 'azureopenai'):
        openai.api_type = "azure"
        openai.api_key = OpenAiKey
        openai.api_version = OpenAiVersion
        openai.api_base = f"{OpenAiEndPoint}"

        llm = AzureChatOpenAI(
                openai_api_base=openai.api_base,
                openai_api_version=OpenAiVersion,
                deployment_name=OpenAiChat,
                temperature=temperature,
                openai_api_key=OpenAiKey,
                openai_api_type="azure",
                max_tokens=tokenLength)
        embeddings = OpenAIEmbeddings(deployment=OpenAiEmbedding, openai_api_key=OpenAiKey, openai_api_type="azure")
        logging.info("LLM Setup done")
elif embeddingModelType == "openai":
        openai.api_type = "open_ai"
        openai.api_base = "https://api.openai.com/v1"
        openai.api_version = '2020-11-07' 
        openai.api_key = OpenAiApiKey
        llm = ChatOpenAI(temperature=temperature,
        openai_api_key=OpenAiApiKey,
        model_name="gpt-3.5-turbo",
        max_tokens=tokenLength)
        embeddings = OpenAIEmbeddings(openai_api_key=OpenAiApiKey)


In [8]:
# Because we have a large document, it is expected that we will run into Token limitation error with code below
chainType = "stuff"
summaryChain = load_summarize_chain(llm, chain_type=chainType)
summary = summaryChain.run(docs)
print("Summary: ", summary)

In [7]:
# Let's change now the chaintype from stuff to mapreduce and refine to see the summary
chainType = "map_reduce"
summaryChain = load_summarize_chain(llm, chain_type=chainType, return_intermediate_steps=True)
summary = summaryChain({"input_documents": docs}, return_only_outputs=True)
outputAnswer = summary['output_text']
print(outputAnswer)

Microsoft Fabric is a unified platform that offers data and analytics solutions for organizations. It simplifies analytics by providing an integrated and easy-to-use product built on a foundation of Software as a Service (SaaS). Fabric combines the OneLake and lakehouse architectures to provide a unified location for storing and analyzing data, eliminating data silos. The Fabric (Preview) trial allows users to access the Fabric product experiences and includes a Power BI individual trial and a Fabric (Preview) trial capacity. The article also discusses the features and functionalities of the nav pane, Help pane, and canvas in Microsoft Fabric's Data Factory, as well as the capabilities and settings of workspaces. It provides instructions on accessing and configuring settings, managing access and permissions, and creating workspaces. The document outlines the capabilities and options in the OneLake data hub, including filtering, recommended items, and item details, as well as instructio

In [10]:
# For the chaintype of MapReduce and Refine, we can also get insight into intermediate steps of the pipeline.
# This way you can inspect the results from map_reduce chain type, each top similar chunk summary
intermediateSteps = summary['intermediate_steps']
for step in intermediateSteps:
        display(HTML("<b>Chunk Summary:</b> " + step))

In [11]:
from langchain.prompts import PromptTemplate
# While we are using the standard prompt by langchain, you can modify the prompt to suit your needs
promptTemplate = """You are an AI assistant tasked with summarizing documents. 
        Your summary should accurately capture the key information in the document while avoiding the omission of any domain-specific words. 
        Please generate a concise and comprehensive summary that includes details. 
        Ensure that the summary is easy to understand and provides an accurate representation. 
        Begin the summary with a brief introduction, followed by the main points. 
        Please remember to use clear language and maintain the integrity of the original information without missing any important details:
        {text}

        """
customPrompt = PromptTemplate(template=promptTemplate, input_variables=["text"])
chainType = "map_reduce"
summaryChain = load_summarize_chain(llm, chain_type=chainType, return_intermediate_steps=True, 
                                    map_prompt=customPrompt, combine_prompt=customPrompt)
summary = summaryChain({"input_documents": docs}, return_only_outputs=True)
outputAnswer = summary['output_text']
print(outputAnswer)



Microsoft Fabric is a unified platform that offers data and analytics solutions for organizations. It combines components from Power BI, Azure Synapse, and Azure Data Explorer into a single integrated environment. The platform provides customized user experiences for data engineering, data factory, data science, data visualization, and data governance. It offers extensive analytics capabilities, shared experiences, easy access to assets, a unified data lake, and centralized administration and governance. Microsoft Fabric also combines the benefits of data lakes and data warehouses through the lakehouse architecture. It allows users to provision and configure their own storage accounts and provides a trial capacity for exploring the platform.

Microsoft Fabric Home is a centralized hub that allows users to access and manage their items. It provides a search bar, filters, and a user-friendly interface for organizing and working with data. Users can create, edit, delete, and share items d

In [12]:
# For the chaintype of MapReduce and Refine, we can also get insight into intermediate steps of the pipeline.
# This way you can inspect the results from map_reduce chain type, each top similar chunk summary
intermediateSteps = summary['intermediate_steps']
for step in intermediateSteps:
        display(HTML("<b>Chunk Summary:</b> " + step))

### Summarzie the document using Chain of Density Prompt on GPT4

In [6]:
def createSystemMessage():
    wc = 80
    sc = 5
    if sc > 1:
        sc = f"{sc-1} - {sc}"

    message = [{
        "role": "system",
        "content": f"""I will provide you with piece of content (e.g. articles, papers, documentation, etc.)

        You will generate increasingly concise, entity-dense summaries of the content.

        Repeat the following 2 steps 5 times.

        Step 1. Identify 1-3 informative Entities (";" delimited) from the Article which are missing from the previously generated summary.

        Step 2. Write a new, denser summary of identical length which covers every entity and detail from the previous summary plus the Missing Entities.

        A Missing Entity is:

        Relevant: to the main story.
        Specific: descriptive yet concise (5 words or fewer).
        Novel: not in the previous summary.
        Faithful: present in the content piece.
        Anywhere: located anywhere in the Article.

        Guidelines:

        The first summary should be long ({sc} sentences, -{wc} words) yet highly non-specific, containing little information beyond the entities marked as missing. Use overly verbose language and fillers (e.g., "this article discusses") to reach -{wc} words.
        Make every word count: re-write the previous summary to improve flow and make space for additional entities.
        Make space with fusion, compression, and removal of uninformative phrases like "the article discusses".
        The summaries should become highly dense and concise yet self-contained, e.g., easily understood without the Article.
        Missing entities can appear anywhere in the new summary.
        Never drop entities from the previous summary. If space cannot be made, add fewer new entities.
        Remember, use the exact same number of words for each summary.
        Answer in JSON. The JSON should be a list (length 5) of dictionaries whose keys are "Missing_Entities" and "Denser_Summary"."""}]
    return message

In [10]:
msg = createSystemMessage()
msg.append(
        {
            "role": "user",
            "content": f"Here is the input text for you to summarize using the 'Missing_Entities' and 'Denser_Summary' approach:\n\n{docs[0].page_content}",
        }
)

In [11]:
import time

parameters = {
    "deployment_id": OpenAiGpt4,
    "messages": msg,
    "temperature": 1,
    "top_p": 1,
    "n": 1,
    "stream": False,
    "stop": None,
    "max_tokens": None,
    "presence_penalty": 0,
    "frequency_penalty": 0,
}

max_attempts = 5  # Maximum number of retry attempts
retry_gap = 3.0  # Initial gap between retries in seconds

for attempt in range(max_attempts):
    try:
        completion = openai.ChatCompletion.create(**parameters)
    except Exception as e:
        print(f"Request failed on attempt {attempt + 1}. Error: {str(e)}")
        if attempt < max_attempts - 1:
            retry_gap *= 1.5  # Increase the retry gap exponentially
            time.sleep(retry_gap)

Request failed on attempt 2. Error: The API deployment for this resource does not exist. If you created the deployment within the last 5 minutes, please wait a moment and try again.
Request failed on attempt 3. Error: The API deployment for this resource does not exist. If you created the deployment within the last 5 minutes, please wait a moment and try again.
Request failed on attempt 5. Error: The API deployment for this resource does not exist. If you created the deployment within the last 5 minutes, please wait a moment and try again.


In [12]:
content = completion["choices"][0]["message"]["content"]
print(content)

[
{
"Missing_Entities": "Microsoft Fabric; data and analytics needs",
"Denser_Summary": "The piece discusses Microsoft Fabric, a unified platform designed to meet an organization's data and analytics needs. The platform provides comprehensive and integrated services, including data lake, data engineering, and data integration, in a seamless package. Additionally, it lays out the benefits of using Fabric, offering a simple and all-in-one analytics solution for enterprises that eliminates the need for multiple service vendors."
},
{
"Missing_Entities": "Software as a Service (SaaS); Real-Time Analytics",
"Denser_Summary": "Microsoft Fabric is a Software as a Service (SaaS) based unified platform, effectively covering an organization's data and analytics needs. It eliminates the need to piece together disparate services, offering a seamless, easy-to-use product. The service range includes data lake, data engineering, and data integration, with the addition of Real-Time Analytics, contribu

#### CoD and GPT4 using langchain

In [23]:
from langchain.prompts import PromptTemplate
llm4 = AzureChatOpenAI(
        openai_api_base=openai.api_base,
        openai_api_version=OpenAiVersion,
        deployment_name=OpenAiGpt4,
        temperature=temperature,
        openai_api_key=OpenAiKey,
        openai_api_type="azure",
        max_tokens=tokenLength)

wc = 380
sc = 5
if sc > 1:
        sc = f"{sc-1} - {sc}"

# While we are using the standard prompt by langchain, you can modify the prompt to suit your needs
promptTemplate = """Article:
  
        {text}

        You will generate increasingly concise, entity-dense summaries of the content.

        Repeat the following 2 steps 2 times.

        Step 1. Identify 1-3 informative Entities (";" delimited) from the text which are missing from the previously generated summary.

        Step 2. Write a new, denser summary of identical length which covers every entity and detail from the previous summary plus the Missing Entities.

        A Missing Entity is:

        Relevant: to the main story.
        Specific: descriptive yet concise (5 words or fewer).
        Novel: not in the previous summary.
        Faithful: present in the content piece.
        Anywhere: located anywhere in the Article.

        Guidelines:

        The first summary should be long ({sc} sentences, -{wc} words) yet highly non-specific, containing little information beyond the entities marked as missing. Use overly verbose language and fillers (e.g., "this article discusses") to reach -{wc} words.
        Make every word count: re-write the previous summary to improve flow and make space for additional entities.
        Make space with fusion, compression, and removal of uninformative phrases like "the article discusses".
        The summaries should become highly dense and concise yet self-contained, e.g., easily understood without the Article.
        Missing entities can appear anywhere in the new summary.
        Never drop entities from the previous summary. If space cannot be made, add fewer new entities.
        Remember, use the exact same number of words for each summary.
        Answer in JSON. The JSON should be a list (length 2) of dictionaries whose keys are "Missing_Entities" and "Denser_Summary"."""
customPrompt = PromptTemplate(template=promptTemplate, input_variables=["text", "sc", "wc"])
chainType = "map_reduce"
summaryChain = load_summarize_chain(llm4, chain_type=chainType, return_intermediate_steps=True, 
                                    map_prompt=customPrompt, combine_prompt=customPrompt)
summary = summaryChain({"input_documents": docs, "sc": sc, "wc": wc}, return_only_outputs=True)
outputAnswer = summary['output_text']
print(outputAnswer)

[
{
"Missing_Entities": "Promotion; Authorized reviewers; Quality standards",
"Denser_Summary": "Microsoft Fabric Home's endorsement options include promotion and certification. Promotion, highlighting valuable items for collaborative use, is accessible to item owners or those with write permissions. Certification, ensuring items meet quality standards, is limited to authorized reviewers appointed by the Power BI administrator. Owners must comply with organization guidelines to certify their Fabric items. Power BI dashboards are exempt from these endorsement options."
},
{
"Missing_Entities": "Endorsement section; 'Make discoverable' checkbox; Dataset access",
"Denser_Summary": "The article details promotion and certification of items within a workspace. Users can highlight content by selecting 'Promoted' in the endorsement section of the content settings. If the content is a Power BI dataset, a 'Make discoverable' checkbox may appear, enabling users without access to locate the datase