## DEMO:
### OpenAI and Huggingface LLMs using the LlamaIndex and LangChain frameworks
- Q&A on user documents (supported file types: XML, TXT, PDF, DOCX, HTML, and more).

In [0]:
! pip install python-dotenv -q
! pip install -q langchain transformers sentence_transformers 
! pip install -q openai llama-index
! pip install -q PyPDF2 docx2txt 
! pip install faiss-cpu -q



#### Import dependencies

In [0]:
import os
from dotenv import load_dotenv
import glob
import PyPDF2
import docx2txt
import re
import openai
import faiss
from llama_index import (
    GPTVectorStoreIndex,
    GPTSimpleVectorIndex,  # used by huggingface_LLM() below (legacy llama_index name)
    GPTFaissIndex,         # used by create_index() below (legacy llama_index name)
    Document,
    SimpleDirectoryReader,
    PromptHelper,
    LLMPredictor,
    ServiceContext,
    LangchainEmbedding,
)
from llama_index.vector_stores.faiss import FaissVectorStore
from langchain import HuggingFaceHub
from langchain.llms import HuggingFacePipeline
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
from langchain.embeddings.huggingface import HuggingFaceEmbeddings
import torch
from langchain.document_loaders import PyPDFLoader
from langchain import OpenAI
from langchain.llms.base import LLM
#from langchain.llms import AzureOpenAI
#load_dotenv()

#### Store OpenAI Key in an env variable

In [0]:
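# Load OPENAI_API_KEY from a local .env file (or the existing environment)
# rather than hardcoding the secret in the notebook.
load_dotenv()
assert os.getenv('OPENAI_API_KEY'), 'Set OPENAI_API_KEY before running the cells below'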
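
A minimal sketch of how a notebook could check for a key without ever printing it. The helper names `require_env` and `mask_secret` are hypothetical and not part of any library used here; they only illustrate the idea of failing early and logging a masked value.

```python
import os


def require_env(name: str) -> str:
    """Return the value of environment variable `name`, or raise a clear error."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value


def mask_secret(value: str, visible: int = 4) -> str:
    """Mask all but the last `visible` characters so a key can be logged safely."""
    if len(value) <= visible:
        return "*" * len(value)
    return "*" * (len(value) - visible) + value[-visible:]
```

A cell could then call `mask_secret(require_env('OPENAI_API_KEY'))` to confirm that a key is loaded without exposing it in the notebook output.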

#### OpenAI LLM Integration with LlamaIndex

In [0]:
def create_index(path):
    # Prompt helper properties
    max_input = 4096         # maximum prompt size, in tokens
    tokens = 200             # number of output tokens
    chunk_size = 600         # chunk size limit for text sent to the LLM
    max_chunk_overlap = 20
    # FAISS index properties: 1536 matches the dimension of OpenAI's
    # text-embedding-ada-002 vectors
    d = 1536
    faiss_index = faiss.IndexFlatL2(d)

    # Define the LLM; other models would work, this example uses "text-davinci-003"
    llm = OpenAI(temperature=0.2, model_name="text-davinci-003", max_tokens=tokens)
    llmPredictor = LLMPredictor(llm=llm)
    # Define the prompt helper for OpenAI models; chunk_size is passed by keyword
    # so that it is applied as the chunk size limit
    promptHelper = PromptHelper(max_input,
                            tokens,
                            max_chunk_overlap,
                            chunk_size_limit=chunk_size)

    service_context = ServiceContext.from_defaults(llm_predictor=llmPredictor,
                                                prompt_helper=promptHelper)
    # Load data: reads every supported file in the directory
    docs = SimpleDirectoryReader(path).load_data()
    # Create the vector index backed by FAISS
    vectorIndex = GPTFaissIndex.from_documents(docs,
                                            faiss_index=faiss_index,
                                            service_context=service_context)

    vectorIndex.save_to_disk('vectorIndex.json')

    return vectorIndex
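
The chunk size and chunk overlap settings in `create_index` control how documents are cut into pieces before they reach the LLM. The sketch below illustrates the general idea only: it uses whitespace-separated words as a stand-in for model tokens, and `split_with_overlap` is a hypothetical helper, not the splitter LlamaIndex actually uses.

```python
from typing import List


def split_with_overlap(tokens: List[str], chunk_size: int, overlap: int) -> List[List[str]]:
    """Split `tokens` into chunks of at most `chunk_size`, where each chunk
    repeats the last `overlap` tokens of the previous one."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once the current chunk reaches the end of the input.
        if start + chunk_size >= len(tokens):
            break
    return chunks


# Example with the notebook's settings (chunk size 600, overlap 20).
words = ("green procurement " * 700).split()
chunks = split_with_overlap(words, chunk_size=600, overlap=20)
```

The overlap keeps a sentence that straddles a chunk boundary visible in both neighbouring chunks, at the cost of embedding a few tokens twice.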

In [0]:
data_path= '/Workspace/Repos/virajsunil.oke@ssc-spc.gc.ca/LangChain-LlamaIndex-LLM/LangChain-LlamaIndex-LLM_all_files/data/Green_procurement/'
index = create_index(data_path)
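
The index above stores embeddings in `faiss.IndexFlatL2`, which performs an exhaustive search and ranks vectors by squared Euclidean distance. The pure-Python sketch below shows that ranking on toy 2-dimensional vectors; `flat_l2_search` is a hypothetical helper written for illustration and is far slower than FAISS itself.

```python
from typing import List, Sequence, Tuple


def squared_l2(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def flat_l2_search(vectors: List[Sequence[float]],
                   query: Sequence[float],
                   k: int) -> List[Tuple[int, float]]:
    """Compare the query with every stored vector and return the k nearest
    as (position, squared distance) pairs, closest first."""
    scored = [(i, squared_l2(v, query)) for i, v in enumerate(vectors)]
    scored.sort(key=lambda pair: pair[1])
    return scored[:k]


# Toy "embeddings"; real ones in this notebook would have 1536 dimensions.
stored = [[0.0, 0.0], [1.0, 0.0], [3.0, 4.0]]
nearest = flat_l2_search(stored, query=[0.9, 0.0], k=2)
```

Here `nearest` lists position 1 first and position 0 second, since `[1.0, 0.0]` is the closest stored vector to the query.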

#### Sample queries for green procurement.
- what green initiatives do you have? 
- which organizations are involved?
- do we have any deadlines?
- do we have any important locations?

In [0]:
query_list = ["can you list down 10 green initiatives?", 
            "which organizations and people are involved?",
            "do we have any deadlines?",
            "do we have any important geographical locations?"]
answers = []
for query in query_list:
    response = index.query(query)
    print(response.response)
    # Keep the answer text so it can be inspected after the loop
    answers.append(response.response)



1. Investing in new-build renewable power generation and related technologies to reclassify $6.4 billion of assets.
2. Business transformation of companies in industries such as steel, cement and chemicals to reduce their environmental impact and meet net-zero commitments.
3. Providing capital to electricity generators to enable them to shift from coal to gas and from gas to renewables.
4. Launching the Brookfield Global Transition Fund to support green initiatives and promote transparency in green investments.
5. Issuing green bonds and hybrid securities to finance green projects and sustainability-linked debt and loans.
6. Publishing the Brookfield Asset Management Green Bond and Preferred Securities Framework to guide green investments.
7. Becoming supporters of the TCFD to promote transparency in green investments.
8. Embedding climate change considerations into our strategy through our net-zero commitment.
9. Undertaking a climate risk assessment and publishing TCFD disclosures 

In [0]:
response = index.query('summarize all the documents separately', 
                    response_mode="tree_summarize")
print(response.response)


Capital Commitments to Private Funds: Document outlining the commitments of capital to private funds.

Acquisition and Disposition of Assets and Businesses: Document outlining the acquisition and sale of assets and businesses.

Derivative Contracts: Document outlining the terms of derivative contracts.

Construction and Development of Assets: Document outlining the construction and development of assets.

Note 27 Related Party Transactions: Document outlining the related party transactions as detailed in Note 27 of the consolidated financial statements.
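
The `tree_summarize` response mode used above builds its answer bottom-up: groups of chunks are summarized, then the summaries themselves are summarized, until a single root remains. The sketch below illustrates only that general idea. It is not LlamaIndex's implementation, and both `tree_summarize` and the `summarize` callable (which would wrap an LLM call in practice) are hypothetical names.

```python
from typing import Callable, List


def tree_summarize(chunks: List[str],
                   summarize: Callable[[str], str],
                   fan_in: int = 2) -> str:
    """Summarize groups of `fan_in` texts level by level until one summary remains."""
    if fan_in < 2:
        raise ValueError("fan_in must be at least 2")
    if not chunks:
        return ""
    level = list(chunks)
    while True:
        # Merge each group of neighbouring texts and summarize the result.
        level = [summarize("\n".join(level[i:i + fan_in]))
                 for i in range(0, len(level), fan_in)]
        if len(level) == 1:
            return level[0]


# Stand-in summarizer for demonstration: keeps the first three words.
def first_words(text: str) -> str:
    return " ".join(text.split()[:3])


root = tree_summarize(
    ["capital commitments to private funds",
     "derivative contracts and their terms",
     "related party transactions in note 27"],
    summarize=first_words,
)
```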


#### LLM implementation on XML files

In [0]:
data_path= '/Workspace/Repos/virajsunil.oke@ssc-spc.gc.ca/LangChain-LlamaIndex-LLM/LangChain-LlamaIndex-LLM_all_files/data/xml_sample_documents/'
index = create_index(data_path)

[Document(text='This XML file does not appear to have any style information associated with it. The document tree is shown below.\n<xbrl xmlns="http://www.xbrl.org/2003/instance" xmlns:dei="http://xbrl.sec.gov/dei/2021q4" xmlns:link="http://www.xbrl.org/2003/linkbase" xmlns:msft="http://www.microsoft.com/20221213" xmlns:us-gaap="http://fasb.org/us-gaap/2021-01-31" xmlns:xbrldi="http://xbrl.org/2006/xbrldi" xmlns:xlink="http://www.w3.org/1999/xlink">\n<link:schemaRef xlink:href="msft-20221213.xsd" xlink:type="simple"/>\n<context id="duration_2022-12-13_to_2022-12-13">\n<entity>\n<identifier scheme="http://www.sec.gov/CIK">0000789019</identifier>\n</entity>\n<period>\n<startDate>2022-12-13</startDate>\n<endDate>2022-12-13</endDate>\n</period>\n</context>\n<context id="duration_2022-12-13_to_2022-12-13_us-gaap-StatementClassOfStockAxis_us-gaap-CommonStockMember">\n<entity>\n<identifier scheme="http://www.sec.gov/CIK">0000789019</identifier>\n<segment>\n<xbrldi:explicitMember dimension="us

In [0]:
query_list = ["list all the dates"]
answers = []
for query in query_list:
    response = index.query(query)
    print(response.response)



The dates: January, February, March, April, May, June, July, August, September, October, November, December, March 4, 2022, November 1 in every fifth year after November 1, 2001, November 1 in every fifth year after November 1, 2006, June 30 in every fifth year after June 30, 2016, March 31 in every fifth year after March 31, 2017.
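
Because the LLM answer mixes month names with actual dates, it can help to cross-check it against the dates that are explicitly tagged in the XML. The sketch below uses the standard library on a small hand-written sample modeled on the `startDate` and `endDate` elements in the document dump above. It assumes well-formed XML, and `extract_period_dates` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET
from typing import List

# Hand-written sample modeled on the XBRL context shown in the notebook output.
SAMPLE_XML = """<xbrl xmlns="http://www.xbrl.org/2003/instance">
  <context id="duration_2022-12-13_to_2022-12-13">
    <period>
      <startDate>2022-12-13</startDate>
      <endDate>2022-12-13</endDate>
    </period>
  </context>
</xbrl>"""


def extract_period_dates(xml_text: str) -> List[str]:
    """Return the sorted, de-duplicated text of all startDate/endDate elements."""
    root = ET.fromstring(xml_text)
    dates = set()
    for element in root.iter():
        # Tags look like "{namespace}startDate"; keep only the local name.
        local_name = element.tag.rsplit("}", 1)[-1]
        if local_name in ("startDate", "endDate") and element.text:
            dates.add(element.text.strip())
    return sorted(dates)


period_dates = extract_period_dates(SAMPLE_XML)
```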


#### Summarization task on XML files using LLM

In [0]:
response = index.query('summarize all the documents separately', 
                    response_mode="tree_summarize")

In [0]:
print(response.response)



The XML file contains five contexts. The first context has an entity with an identifier scheme of "http://www.sec.gov/CIK" and a period with a start date of 2021-10-26 and an end date of 2021-10-26. It also includes information such as the entity's incorporation state, file number, tax identification number, address, city area code, local phone number, and whether or not the entity is involved in written communications, soliciting material, or pre-commencement tender offers. The second context has an entity with the same identifier scheme and a segment with an explicitMember dimension of "us-gaap:StatementClassOfStockAxis" and a value of "us-gaap:CommonStockMember". It also has a period with a start date of 2021-10-26 and an end date of 2021-10-26, and includes information such as the common stock's par value per share, trading symbol,


#### Open source Huggingface LLM integration with LlamaIndex (Custom class)
- `class customLLM()` loads the open-source Huggingface model.
- `pipeline()` defines the NLP task
- `huggingface_LLM()` generates an index for the given documents.

In [0]:
class customLLM(LLM):
    model_name = "google/flan-t5-large" 
    pipeline = pipeline("text2text-generation", 
                    model=model_name, 
                    model_kwargs={"torch_dtype":torch.bfloat16}, 
    )

    def _call(self, prompt, stop=None):
        # Run the pipeline and return only the generated text
        return self.pipeline(prompt, max_length=9999)[0]["generated_text"]

    # LangChain's LLM base class declares the two members below as properties
    @property
    def _identifying_params(self):
        return {"name_of_model": self.model_name}

    @property
    def _llm_type(self):
        return "custom"

In [0]:
def huggingface_LLM(path):
    # set number of output tokens
    num_output = 500
    # set maximum input size
    max_input_size = 5000
    # set maximum chunk overlap
    #max_chunk_overlap = 15
    llm_predictor = LLMPredictor(llm=customLLM())   
    embedding = HuggingFaceEmbeddings()
    embed_model = LangchainEmbedding(embedding)
    #prompt_helper = PromptHelper(max_input_size, num_output, max_chunk_overlap)

    docs = SimpleDirectoryReader(path).load_data() 
    service_context = ServiceContext.from_defaults(llm_predictor=llm_predictor,
                                            embed_model=embed_model
    )
    vectorIndex = GPTSimpleVectorIndex.from_documents(docs, 
                                service_context=service_context
    ) 
    vectorIndex.save_to_disk('vectorIndex.json')
    return vectorIndex

#### Build Index with Huggingface open-source LLM and inspect the output

In [0]:
path= '/Workspace/Repos/virajsunil.oke@ssc-spc.gc.ca/LangChain-LlamaIndex-LLM/LangChain-LlamaIndex-LLM_all_files/data/xml_sample_documents/'
index= huggingface_LLM(path)

In [0]:
response = index.query("can you summarize the documents?") 
print(response.response)

Token indices sequence length is longer than the specified maximum sequence length for this model (3115 > 512). Running this sequence through the model will result in indexing errors


In [0]:
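# Read HUGGINGFACEHUB_API_TOKEN from a local .env file (or the existing environment)
# rather than hardcoding the secret in the notebook.
load_dotenv()
assert os.getenv('HUGGINGFACEHUB_API_TOKEN'), 'Set HUGGINGFACEHUB_API_TOKEN before continuing'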

## Huggingface open source models without LlamaIndex integration
- distilbert-base-uncased-distilled-squad

In [0]:
from transformers import AutoTokenizer, AutoModelForQuestionAnswering
from llama_index import SimpleDirectoryReader
from timeit import default_timer as timer

In [0]:
class CustomLLM():
    def __init__(self, path, model_name, pipeline_task):
        self.path = path
        self.model_name = model_name
        self.pipeline_task = pipeline_task
        tokenizer = AutoTokenizer.from_pretrained(self.model_name)
        model = AutoModelForQuestionAnswering.from_pretrained(self.model_name)
        self.generator = pipeline(self.pipeline_task,
                            model=model,
                            tokenizer=tokenizer)     

    def ask_qa_bot(self, question):
        docs = SimpleDirectoryReader(self.path).load_data() 
        start = timer()
        # Use the text of the first document as the context.
        # NOTE: only docs[0] is queried; iterate through docs to cover every file.
        context = docs[0].get_text()
        result = self.generator(question=question, context=context)
        end = timer()
        print(f'Query response generated in {end-start:.2f} seconds')
        return result
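
`ask_qa_bot` only reads the first document. One possible way to cover every file is to run the question against each document's text and keep the highest-scoring answer. This is a sketch under assumptions: `best_answer` is a hypothetical helper, and `qa` is expected to behave like the question-answering pipeline above, accepting `question=` and `context=` and returning a dict with a `score` key.

```python
from typing import Callable, Dict, List


def best_answer(question: str,
                contexts: List[str],
                qa: Callable[..., Dict]) -> Dict:
    """Ask `question` against every context and return the result with the top score."""
    if not contexts:
        raise ValueError("contexts must contain at least one document")
    results = [qa(question=question, context=context) for context in contexts]
    return max(results, key=lambda result: result["score"])
```

Inside the class this could be called as `best_answer(question, [doc.get_text() for doc in docs], self.generator)`. Scores from different contexts are not strictly comparable, so the top score is a heuristic rather than a guarantee of the best answer.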

In [0]:
path = '/Workspace/Repos/virajsunil.oke@ssc-spc.gc.ca/LangChain-LlamaIndex-LLM/LangChain-LlamaIndex-LLM_all_files/data/xml_sample_documents/'
# Dictionaries of open-source models and their respective pipeline tasks to try out.
open_source_models = {
        'distilbert':"distilbert-base-uncased-distilled-squad",
}
pipeline_tasks={
        'distilbert':"question-answering",
}

Custom_llm = CustomLLM(path, open_source_models['distilbert'], 
                    pipeline_tasks['distilbert'])
Custom_llm.ask_qa_bot('organization')

Query response generated in 2.42 seconds
Out[28]: {'score': 0.5239278674125671,
 'start': 2823,
 'end': 2844,
 'answer': 'Microsoft Corporation'}
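
The `start` and `end` values in the result are character offsets into the context string, so slicing the context with them recovers the answer. In the output above, 2844 - 2823 = 21, which is the length of 'Microsoft Corporation'. The sketch below shows this on a made-up context string; `answer_span` is a hypothetical helper and the score is a placeholder.

```python
def answer_span(context: str, result: dict) -> str:
    """Recover the answer text from the character offsets in a QA result."""
    return context[result["start"]:result["end"]]


# Made-up context and a result shaped like the pipeline output above.
context = "The registrant is Microsoft Corporation, based in Redmond."
answer = "Microsoft Corporation"
start = context.index(answer)
result = {"score": 0.52, "start": start, "end": start + len(answer), "answer": answer}

recovered = answer_span(context, result)
```

The same offsets can be used to show the text around the answer, for example `context[max(0, result["start"] - 50):result["end"] + 50]`, which helps when judging whether a low-score answer is plausible.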