### Summarization & Generating Sample QA
A common use case is summarizing long documents, which quickly runs into the model's context window limit. Unlike question answering, you can't use semantic search to select only the chunks of text most relevant to a question, because there is no particular question: you want to summarize everything. So what do you do?

The most common workaround is to split the document into chunks and summarize recursively: first summarize each chunk on its own, then group the summaries into chunks and summarize each group, and repeat until only one summary is left.
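
The recursive scheme can be sketched in plain Python. This is a minimal illustration, not the LangChain implementation used later in this notebook: `recursive_summarize`, `group_size`, and the toy `first_three_words` summarizer are hypothetical names introduced here, and in practice `summarize` would wrap a call to an LLM.

```python
from typing import Callable, List


def recursive_summarize(
    chunks: List[str],
    summarize: Callable[[str], str],
    group_size: int = 4,
) -> str:
    """Summarize each chunk, then summaries of summaries, until one remains."""
    if not chunks:
        raise ValueError("need at least one chunk")
    if group_size < 2:
        raise ValueError("group_size must be at least 2 so each round shrinks the list")

    # First round: summarize every chunk on its own.
    summaries = [summarize(chunk) for chunk in chunks]

    # Later rounds: join neighbouring summaries and summarize each group.
    while len(summaries) > 1:
        groups = [
            "\n".join(summaries[i:i + group_size])
            for i in range(0, len(summaries), group_size)
        ]
        summaries = [summarize(group) for group in groups]

    return summaries[0]


# Toy stand-in for an LLM call (assumption): keep only the first three words.
def first_three_words(text: str) -> str:
    return " ".join(text.split()[:3])


result = recursive_summarize(
    ["alpha beta gamma delta", "one two three four", "red green blue cyan"],
    first_three_words,
    group_size=2,
)
print(result)
```

Each round reduces the number of summaries by roughly a factor of `group_size`, so the number of rounds grows only logarithmically with the number of chunks.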

#### Set Environment variables

In [1]:
import os  
import json  
import openai
from Utilities.envVars import *

# Set the search index name from environment variables
indexName = SearchIndex

# Set OpenAI API key and endpoint
openai.api_type = "azure"
openai.api_version = OpenAiVersion
openai_api_key = OpenAiKey
assert openai_api_key, "ERROR: Azure OpenAI Key is missing"
openai.api_key = openai_api_key
openAiEndPoint = f"{OpenAiEndPoint}"
assert openAiEndPoint, "ERROR: Azure OpenAI Endpoint is missing"
assert "openai.azure.com" in openAiEndPoint.lower(), "ERROR: Azure OpenAI Endpoint should be in the form: \n\n\t<your unique endpoint identifier>.openai.azure.com"
openai.api_base = openAiEndPoint

#### Summarize the document/PDF (instead of getting the data from a vector store or document repository)

In [2]:
# Import required libraries
from langchain.llms.openai import AzureOpenAI, OpenAI
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.document_loaders import (
    PDFMinerLoader,
    UnstructuredFileLoader,
)

In [3]:
# Flexibility to change the call to OpenAI or Azure OpenAI
embeddingModelType = "azureopenai"

# Set the file name and path of the PDF to summarize
fileName = "Fabric Get Started.pdf"
fabricGetStartedPath = "Data/PDF/" + fileName
# Load the PDF with Document Loader available from Langchain
loader = PDFMinerLoader(fabricGetStartedPath)
rawDocs = loader.load()
# Set the source 
for doc in rawDocs:
    doc.metadata['source'] = fabricGetStartedPath

textSplitter = RecursiveCharacterTextSplitter(chunk_size=1500, chunk_overlap=0)
docs = textSplitter.split_documents(rawDocs)
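
To build intuition for `chunk_size` and `chunk_overlap`, here is a simplified sketch of fixed-size character chunking. It is an assumption-laden illustration only: `split_fixed` is a hypothetical helper, and `RecursiveCharacterTextSplitter` behaves differently in that it prefers to break on separators such as paragraph breaks, newlines, and spaces before falling back to raw characters.

```python
from typing import List


def split_fixed(text: str, chunk_size: int, chunk_overlap: int = 0) -> List[str]:
    """Cut text into windows of chunk_size characters that overlap by chunk_overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text.
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks


no_overlap = split_fixed("abcdefghij", chunk_size=4, chunk_overlap=0)
with_overlap = split_fixed("abcdefghij", chunk_size=4, chunk_overlap=2)
print(no_overlap)
print(with_overlap)
```

With `chunk_overlap=0`, as in the cell above, the chunks are disjoint and every character is summarized exactly once. A positive overlap repeats some text across neighbouring chunks, which can preserve context at the boundaries at the cost of more tokens.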

In [4]:
print("Number of document chunks generated from PDF : ", len(docs))

Number of document chunks generated from PDF :  58


In [5]:
import logging  # needed for the logging.info call below
from langchain.chains.qa_with_sources import load_qa_with_sources_chain
from langchain.chat_models import AzureChatOpenAI, ChatOpenAI
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.docstore.document import Document
from langchain.prompts import PromptTemplate
from IPython.display import display, HTML
from langchain.chains.summarize import load_summarize_chain

embeddingModelType = "azureopenai"
temperature = 0.3
tokenLength = 1000

if embeddingModelType == "azureopenai":
        openai.api_type = "azure"
        openai.api_key = OpenAiKey
        openai.api_version = OpenAiVersion
        openai.api_base = f"{OpenAiEndPoint}"

        llm = AzureChatOpenAI(
                openai_api_base=openai.api_base,
                openai_api_version=OpenAiVersion,
                deployment_name=OpenAiChat,
                temperature=temperature,
                openai_api_key=OpenAiKey,
                openai_api_type="azure",
                max_tokens=tokenLength)
        embeddings = OpenAIEmbeddings(deployment=OpenAiEmbedding, chunk_size=1, openai_api_key=OpenAiKey)
        logging.info("LLM Setup done")
elif embeddingModelType == "openai":
        openai.api_type = "open_ai"
        openai.api_base = "https://api.openai.com/v1"
        openai.api_version = '2020-11-07' 
        openai.api_key = OpenAiApiKey
        llm = ChatOpenAI(
                temperature=temperature,
                openai_api_key=OpenAiApiKey,
                model_name="gpt-3.5-turbo",
                max_tokens=tokenLength)
        embeddings = OpenAIEmbeddings(openai_api_key=OpenAiApiKey)


In [6]:
# Use the map_reduce chain type to summarize the document
chainType = "map_reduce"
summaryChain = load_summarize_chain(llm, chain_type=chainType, return_intermediate_steps=True)
summary = summaryChain({"input_documents": docs}, return_only_outputs=True)
outputAnswer = summary['output_text']
print(outputAnswer)

Microsoft Fabric is a unified platform that simplifies data analytics by integrating various components into a single environment. It offers a range of analytics experiences, easy access to assets, centralized administration, and seamless integration of data and services. Users can ingest, process, analyze, and collaborate on data, with tutorials and resources provided. The document specifically focuses on the features and functionalities of Microsoft Fabric's Data Factory, including organizing actions, accessing items, managing workspace settings, and promoting and certifying items.


In [7]:
# For the map_reduce and refine chain types, we can also inspect the intermediate steps of the pipeline.
# With map_reduce, each intermediate step is the summary generated for one chunk.
intermediateSteps = summary['intermediate_steps']
for step in intermediateSteps:
        display(HTML("<b>Chunk Summary:</b> " + step))

In [8]:
# Now that we have the summary, let's create the prompts to generate sample questions from the document
from langchain.chains import LLMChain

qaTemplate = """Use the following portion of a long document.
        {context}
        """

qaPrompt = PromptTemplate(template=qaTemplate, input_variables=["context"])

combinePromptTemplate = """ 
        Given the following extracted parts of a long document and a question, recommend between 1-5 sample questions.
        Don't repeat the same question or rewrite the question.
        =========
        {summaries}
        =========
        """
combinePrompt = PromptTemplate(
    template=combinePromptTemplate, input_variables=["summaries"]
)
qaChain = load_qa_with_sources_chain(llm,
        chain_type="map_reduce", question_prompt=qaPrompt, combine_prompt=combinePrompt)
answer = qaChain({"input_documents": docs}, return_only_outputs=True)
qa = answer['output_text']
print(qa)




Sample questions:
1. What are the advantages of using Microsoft Fabric for data analytics?
2. How does Microsoft Fabric integrate various components from Power BI, Azure Synapse, and Azure Data Explorer?
3. What are the key features of Microsoft Fabric's Data Visualization experience?
4. How does Microsoft Fabric support data governance and effective data asset management?
5. What are the benefits of using the lakehouse architecture in Microsoft Fabric for data processing and analytics?
6. What are the different components in Azure Data Factory?
7. What is the purpose of a pipeline in Data Factory?
8. How does a trigger work in Data Factory?
9. What is the role of an integration runtime in Data Factory?
10. What is a linked service in Data Factory and how is it used?
11. How do you open the nav pane in Microsoft Fabric?
12. How can you search for a specific workspace in Microsoft Fabric?
13. How do you customize the Home canvas in Microsoft Fabric?
14. What is the purpose of the Help p