# Comparing Azure OpenAI with Llama2-70B

So far, you have loaded your search engine **from two different data sources into two different text-based indexes**, and have experimented with querying it through the Azure OpenAI service to see if we can get even better results.

The idea is that a user can ask a question about Computer Science (first datasource/index) or about Covid (second datasource/index), and the engine will respond accordingly.

In this notebook, we will send the same question to Azure OpenAI and to a Llama2-70B model, then compare and contrast their responses.

You can learn more about the Llama2 family of models here: https://learn.microsoft.com/en-us/azure/ai-studio/how-to/deploy-models-llama?tabs=azure-studio

## Pre-requisites
You must have these resources deployed prior to running this notebook.

1. Llama2-70B deployed on Azure Machine Learning Endpoints.  You can deploy this model from the Azure AI Studio following the instructions here - https://learn.microsoft.com/en-us/azure/ai-studio/how-to/deploy-models-open?tabs=azure-studio

The `AzureMLOnlineEndpoint` class from `langchain_community.llms.azureml_endpoint` is required to integrate LangChain prompts with the Llama2-70B model deployed on Azure Machine Learning Endpoints.

## Install Dependencies


In [None]:
%pip install -r common/requirements.txt

## Set up variables

In [1]:
import os
import urllib
import requests
import random
import json
from collections import OrderedDict
from IPython.display import display, HTML, Markdown
from langchain.chains import LLMChain
from langchain.prompts import PromptTemplate
from langchain.chat_models import AzureChatOpenAI
from langchain.docstore.document import Document
from langchain.chains.qa_with_sources import load_qa_with_sources_chain
from langchain.embeddings import AzureOpenAIEmbeddings
from langchain_community.llms.azureml_endpoint import AzureMLOnlineEndpoint

from common.prompts import COMBINE_QUESTION_PROMPT, COMBINE_PROMPT_SHORT, COMBINE_PROMPT_TEMPLATE_SHORT
from common.utils import (
    get_search_results,
    model_tokens_limit,
    num_tokens_from_docs,
    num_tokens_from_string
)

from dotenv import load_dotenv
load_dotenv("credentials.env", override=True)


True

In [2]:
# Setup the Payloads header
headers = {'Content-Type': 'application/json','api-key': os.environ['AZURE_SEARCH_KEY']}
params = {'api-version': os.environ['AZURE_SEARCH_API_VERSION']}
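
Before querying, it can help to confirm that `credentials.env` supplied every variable the cells in this notebook read from `os.environ`. The sketch below is only a convenience check: the helper name `missing_env_vars` is hypothetical, and the list covers only the variables referenced explicitly in this notebook (the Azure OpenAI clients may read additional ones).

```python
import os

# Variable names taken from the cells in this notebook.
REQUIRED_ENV_VARS = [
    "AZURE_SEARCH_ENDPOINT",
    "AZURE_SEARCH_KEY",
    "AZURE_SEARCH_API_VERSION",
    "BLOB_SAS_TOKEN",
    "LLAMA2_70B_ENDPOINT",
    "LLAMA2_70B_KEY",
    "AZURE_OPENAI_API_VERSION",
]


def missing_env_vars(names, environ=None):
    """Return the names that are unset or empty in the given environment mapping."""
    environ = os.environ if environ is None else environ
    return [name for name in names if not environ.get(name)]


missing = missing_env_vars(REQUIRED_ENV_VARS)
if missing:
    print("Missing environment variables:", ", ".join(missing))
else:
    print("All required environment variables are set.")
```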

Try questions against Azure OpenAI and Llama2-70B that you think might be answered by computer science papers from 2020-2021 or by medical publications about COVID from 2020-2021. Compare and contrast the answer quality.


**Example Questions you can ask**:
- What is CLP?
- How do Markov chains work?
- What are some examples of reinforcement learning?
- What are the main risk factors for Covid-19?
- What medicine reduces inflammation in the lungs?
- Why doesn't Covid affect kids as much as adults?
- Does chloroquine really work against Covid?
- Who won the 1994 soccer world cup? # This question should yield no answer if the system is correctly grounded

In [28]:
QUESTION = "What is CLP?"

## Multi-Index Search queries

In [29]:
# Text-based Indexes that we are going to query (from Notebook 01 and 02)
index1_name = "cogsrch-index-files"
index2_name = "cogsrch-index-csv"
indexes = [index1_name, index2_name]

### Search on both indexes individually and aggregate results

#### **Note**: 
To standardize the indexes, **8 mandatory fields must be present in each text-based index**: `id, title, content, chunks, language, name, location, vectorized`. This way every document can be treated the same throughout the code. Also, **all indexes must have a semantic configuration**.
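
As a rough illustration of that requirement, the sketch below checks an index definition for the eight mandatory fields and for a semantic configuration. It assumes the definition is a dict with a `fields` list and a `semantic.configurations` list, which is how the index definition JSON is understood to be laid out here; the helper name `check_index_definition` and the example definition are hypothetical.

```python
MANDATORY_FIELDS = {
    "id", "title", "content", "chunks",
    "language", "name", "location", "vectorized",
}


def check_index_definition(index_definition):
    """Return (missing_fields, has_semantic_config) for an index definition dict."""
    present = {field["name"] for field in index_definition.get("fields", [])}
    missing_fields = sorted(MANDATORY_FIELDS - present)
    semantic = index_definition.get("semantic") or {}
    has_semantic_config = bool(semantic.get("configurations"))
    return missing_fields, has_semantic_config


# Hypothetical, hand-written definition used only to exercise the helper.
example_definition = {
    "name": "example-index",
    "fields": [
        {"name": n}
        for n in ["id", "title", "content", "chunks", "language", "name", "location"]
    ],
    "semantic": {"configurations": [{"name": "my-semantic-config"}]},
}

missing_fields, has_semantic_config = check_index_definition(example_definition)
print("Missing fields:", missing_fields)
print("Has semantic configuration:", has_semantic_config)
```

In this example the `vectorized` field is deliberately left out, so the helper reports it as missing.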

In [30]:
agg_search_results = dict()

for index in indexes:
    search_payload = {
        "search": QUESTION,
        "select": "id, title, chunks, name, location",
        "queryType": "semantic",
        "semanticConfiguration": "my-semantic-config",
        "count": "true",
        "speller": "lexicon",
        "queryLanguage": "en-us",
        "captions": "extractive",
        "answers": "extractive",
        "top": "10"
    }

    r = requests.post(os.environ['AZURE_SEARCH_ENDPOINT'] + "/indexes/" + index + "/docs/search",
                     data=json.dumps(search_payload), headers=headers, params=params)
    print(r.status_code)

    search_results = r.json()
    agg_search_results[index]=search_results
    print("Index:", index, "Results Found: {}, Results Returned: {}".format(search_results['@odata.count'], len(search_results['value'])))

200
Index: cogsrch-index-files Results Found: 9789, Results Returned: 10
200
Index: cogsrch-index-csv Results Found: 48638, Results Returned: 10


### Display the top results (from both searches) based on the re-ranker score

In [31]:
display(HTML('<h4>Top Answers</h4>'))

for index,search_results in agg_search_results.items():
    for result in search_results['@search.answers']:
        if result['score'] > 0.5: # Show answers that are at least 50% of the max possible score=1
            display(HTML('<h5>' + 'Answer - score: ' + str(round(result['score'],2)) + '</h5>'))
            display(HTML(result['text']))
            
print("\n\n")
display(HTML('<h4>Top Results</h4>'))

content = dict()
ordered_content = OrderedDict()


for index,search_results in agg_search_results.items():
    for result in search_results['value']:
        if result['@search.rerankerScore'] > 1:  # Show results that score at least 25% of the max possible score (4)
            content[result['id']]={
                                    "title": result['title'],
                                    "chunks": result['chunks'], 
                                    "chunks_vectors": [],
                                    "name": result['name'], 
                                    "location": result['location'] ,
                                    "caption": result['@search.captions'][0]['text'],
                                    "score": result['@search.rerankerScore'],
                                    "index": index
                                    }
    
# After the results have been filtered, sort them by score and add them to an ordered dict
for id in sorted(content, key= lambda x: content[x]["score"], reverse=True):
    ordered_content[id] = content[id]
    url = str(ordered_content[id]['location']) + os.environ['BLOB_SAS_TOKEN']
    title = str(ordered_content[id]['title']) if (ordered_content[id]['title']) else ordered_content[id]['name']
    score = str(round(ordered_content[id]['score'],2))
    display(HTML('<h5><a href="'+ url + '">' + title + '</a> - score: '+ score + '</h5>'))
    display(HTML(ordered_content[id]['caption']))
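
The filter-then-sort step in the cell above can be seen in isolation with the sketch below. It uses made-up scores and a hypothetical helper name, `merge_and_rank`; it is meant only to show how results from several indexes end up in one list ordered by re-ranker score.

```python
def merge_and_rank(results_by_index, threshold=1.0):
    """Flatten per-index results, keep those above the threshold, sort by score (descending)."""
    merged = []
    for index_name, results in results_by_index.items():
        for result in results:
            score = result["@search.rerankerScore"]
            if score > threshold:
                merged.append({"id": result["id"], "index": index_name, "score": score})
    return sorted(merged, key=lambda item: item["score"], reverse=True)


# Made-up scores, for illustration only.
sample = {
    "cogsrch-index-files": [
        {"id": "a", "@search.rerankerScore": 2.4},
        {"id": "b", "@search.rerankerScore": 0.7},
    ],
    "cogsrch-index-csv": [
        {"id": "c", "@search.rerankerScore": 3.1},
    ],
}

for item in merge_and_rank(sample):
    print(item["index"], item["id"], item["score"])
```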






## Embeddings and Vector Search


In [None]:
k = 10  # Number of results per text-based index
ordered_results = get_search_results(QUESTION, indexes, k=k, reranker_threshold=1)
#print("Number of results:",len(ordered_results))
# Uncomment the below line if you want to inspect the ordered results
# ordered_results

In [33]:
embedder = AzureOpenAIEmbeddings(model="text-embedding-ada-002", skip_empty=True) 

In [34]:
%%time
for key,value in ordered_results.items():
    if value["vectorized"] != True: # If the document has not been vectorized yet
        i = 0
        print("Vectorizing",len(value["chunks"]),"chunks from Document:",value["location"])
        for chunk in value["chunks"]: # Iterate over the document's text chunks
            try:
                upload_payload = {  # Insert the chunk and its vector in the vector-based index
                    "value": [
                        {
                            "id": key + "_" + str(i),
                            "title": f"{value['title']}_chunk_{str(i)}",
                            "chunk": chunk,
                            "chunkVector": embedder.embed_query(chunk if chunk!="" else "-------"),
                            "name": value["name"],
                            "location": value["location"],
                            "@search.action": "upload"
                        },
                    ]
                }

                r = requests.post(os.environ['AZURE_SEARCH_ENDPOINT'] + "/indexes/" + value["index"]+"-vector" + "/docs/index",
                                     data=json.dumps(upload_payload), headers=headers, params=params)
                
                if r.status_code != 200:
                    print(r.status_code)
                    print(r.text)
                else:
                    i = i + 1 # increment chunk number
                
            except Exception as e:
                print("Exception:",e)
                print(chunk)
                continue

        # Update document in text-based index and mark it as "vectorized"
        upload_payload = {
            "value": [
                {
                    "id": key,
                    "vectorized": True,
                    "@search.action": "merge"
                },
            ]
        }

        r = requests.post(os.environ['AZURE_SEARCH_ENDPOINT'] + "/indexes/" + value["index"]+ "/docs/index",
                                     data=json.dumps(upload_payload), headers=headers, params=params)

CPU times: total: 0 ns
Wall time: 0 ns


Now we search on the vector-based indexes and get the top k most similar chunks to our question:

In [None]:
vector_indexes = [index+"-vector" for index in indexes]

k = 10
similarity_k = 3
ordered_results = get_search_results(QUESTION, vector_indexes,
                                        k=k, # Number of results per vector index
                                        reranker_threshold=1,
                                        vector_search=True, 
                                        similarity_k=similarity_k,
                                        query_vector = embedder.embed_query(QUESTION)
                                        );
#print("Number of results:", len(ordered_results))
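
The similarity ranking itself runs inside the search service. Conceptually, picking the most similar chunks for a query vector looks something like the sketch below, which uses cosine similarity on toy three-dimensional vectors. The metric actually used depends on how the vector index is configured, so treat cosine similarity here as an assumption; the helper names and the toy data are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_similar(query_vector, chunk_vectors, k=3):
    """Return the k (chunk_id, similarity) pairs most similar to the query vector."""
    scored = [
        (chunk_id, cosine_similarity(query_vector, vector))
        for chunk_id, vector in chunk_vectors.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy vectors standing in for real embeddings.
toy_chunks = {
    "doc1_0": [1.0, 0.0, 0.0],
    "doc1_1": [0.0, 1.0, 0.0],
    "doc2_0": [0.9, 0.1, 0.0],
    "doc2_1": [0.0, 0.0, 1.0],
}

for chunk_id, similarity in top_k_similar([1.0, 0.0, 0.0], toy_chunks, k=2):
    print(chunk_id, round(similarity, 3))
```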

For vector search, it is not recommended to pass more than k=5 chunks (of at most 5000 characters each) to the LLM as context. Otherwise, you can run into the token limit later when trying to hold a conversation with memory.
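
One simple way to enforce that guideline is to cap both the number of chunks and the length of each chunk before building the context. The sketch below assumes the chunks are plain strings already ordered by relevance; `limit_context` is a hypothetical helper, not part of `common/utils.py`.

```python
def limit_context(chunks, max_chunks=5, max_chars=5000):
    """Keep at most max_chunks chunks, truncating each one to max_chars characters."""
    return [chunk[:max_chars] for chunk in chunks[:max_chunks]]


# Dummy chunks: one oversized, one short, and several fillers.
chunks = ["x" * 6000, "short chunk"] + ["filler"] * 6
limited = limit_context(chunks)
print(len(limited), [len(chunk) for chunk in limited])
```

Truncating by characters is only a rough proxy for tokens, so the token count computed later in the notebook remains the real check.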

In [36]:
top_docs = []
for key,value in ordered_results.items():
    location = value["location"] if value["location"] is not None else ""
    top_docs.append(Document(page_content=value["chunk"], metadata={"source": location}))
        
print("Number of chunks:",len(top_docs))

Number of chunks: 3


In [37]:
# If the model is not an Azure OpenAI model, model_tokens_limit falls back to a limit of 4096 tokens.
MODEL = "llama2-70b"
COMPLETION_TOKENS = 1000

In [38]:
# Calculate number of tokens of our docs
if(len(top_docs)>0):
    tokens_limit = model_tokens_limit(MODEL) # this is a custom function we created in common/utils.py
    prompt_tokens = num_tokens_from_string(COMBINE_PROMPT_TEMPLATE_SHORT) # this is a custom function we created in common/utils.py
    context_tokens = num_tokens_from_docs(top_docs) # this is a custom function we created in common/utils.py
    
    requested_tokens = prompt_tokens + context_tokens + COMPLETION_TOKENS
    
    chain_type = "map_reduce" if requested_tokens > 0.9 * tokens_limit else "stuff"  
    
    print("System prompt token count:",prompt_tokens)
    print("Max Completion Token count:", COMPLETION_TOKENS)
    print("Combined docs (context) token count:",context_tokens)
    print("--------")
    print("Requested token count:",requested_tokens)
    print("Token limit for", MODEL, ":", tokens_limit)
    print("Chain Type selected:", chain_type)
        
else:
    print("NO RESULTS FROM AZURE SEARCH")

System prompt token count: 824
Max Completion Token count: 1000
Combined docs (context) token count: 998
--------
Requested token count: 2822
Token limit for llama2-70b : 4096
Chain Type selected: stuff
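
The selection above is plain arithmetic, reproduced in the sketch below with the numbers from this run. `choose_chain_type` is a hypothetical wrapper around the same rule used in the cell (switch to `map_reduce` when the request exceeds 90% of the model's token limit); the second call uses a made-up, larger context to show the other branch.

```python
def choose_chain_type(prompt_tokens, context_tokens, completion_tokens,
                      tokens_limit, safety_ratio=0.9):
    """Return (requested_tokens, chain_type) using the same rule as the notebook cell."""
    requested = prompt_tokens + context_tokens + completion_tokens
    chain_type = "map_reduce" if requested > safety_ratio * tokens_limit else "stuff"
    return requested, chain_type


# Numbers from the run above: 824 + 998 + 1000 = 2822, and 0.9 * 4096 = 3686.4
print(choose_chain_type(824, 998, 1000, 4096))

# Made-up larger context: 824 + 2500 + 1000 = 4324, which exceeds 3686.4
print(choose_chain_type(824, 2500, 1000, 4096))
```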


## Generate grounded LLM responses for Azure OpenAI and Llama2-70B

### Get response from Llama2-70B on Azure ML Endpoints

In [39]:
from langchain.schema import HumanMessage
from langchain_community.llms.azureml_endpoint import (
    AzureMLEndpointApiType,
    LlamaContentFormatter,
)

llm_llama2_70b = AzureMLOnlineEndpoint(
    endpoint_url="https://" + os.environ["LLAMA2_70B_ENDPOINT"] + ".eastus.inference.ml.azure.com/score",
    endpoint_api_type=AzureMLEndpointApiType.realtime,
    endpoint_api_key= os.environ["LLAMA2_70B_KEY"],
    content_formatter=LlamaContentFormatter(),
    model_kwargs={"temperature": 1, "max_new_tokens": 400},
)

# If you want to invoke the llama endpoint without using langchain prompt, you can use the invoke method.
#response = llm_llama2_70b.invoke("Write me a song about sparkling water:")
#response

In [40]:
chain_type = "stuff"  # Always use the "stuff" chain for Llama2-70B, overriding the value computed above
chain_llama = load_qa_with_sources_chain(llm_llama2_70b, chain_type=chain_type, 
                                    prompt=COMBINE_PROMPT_SHORT)

In [41]:
%%time
# Try with other language as well
response_llama = chain_llama({"input_documents": top_docs, "question": QUESTION, "language": "English"})

CPU times: total: 0 ns
Wall time: 16.4 s


In [42]:
resp = response_llama['output_text'].replace("\\n\\n", "\r\n")
resp2 = resp.replace('\\n', '\r\n')
print(resp2)
#display(Markdown(resp2))

"
These are examples of how you must provide the answer:
--> Beginning of examples
QUESTION: Which state/country's law governs the interpretation of the contract?
Content: This Agreement is governed by English law and the parties submit to the exclusive jurisdiction of the English courts in  relation to any dispute (contractual or non-contractual) concerning this Agreement save that either party may apply to any court for an  injunction or other relief to protect its Intellectual Property Rights.
Source: https://xxx.com/article1.pdf?s=casdfg&category=ab&sort=asc&page=1
Content: No Waiver. Failure or delay in exercising any right or remedy under this Agreement shall not constitute a waiver of such (or any other)  right or remedy.
11.7 Severability. The invalidity, illegality or unenforceability of any term (or part of a term) of this Agreement shall not affect the continuation  in force of the remainder of the term (if any) and this Agreement.
11.8 No Agency. Except as expressly stated 

### Get response from Azure OpenAI

In [43]:
# Set the ENV variables that Langchain needs to connect to Azure OpenAI
os.environ["OPENAI_API_VERSION"] = os.environ["AZURE_OPENAI_API_VERSION"]

In [44]:
# Azure OpenAI
MODEL = "gpt-4-32k" # options: gpt-35-turbo, gpt-35-turbo-16k, gpt-4, gpt-4-32k
COMPLETION_TOKENS = 1000
llm_openai = AzureChatOpenAI(deployment_name=MODEL, temperature=0, max_tokens=COMPLETION_TOKENS)


In [45]:
if chain_type == "stuff":
    chain = load_qa_with_sources_chain(llm_openai, chain_type=chain_type, 
                                       prompt=COMBINE_PROMPT_SHORT)
elif chain_type == "map_reduce":
    chain = load_qa_with_sources_chain(llm_openai, chain_type=chain_type, 
                                       question_prompt=COMBINE_QUESTION_PROMPT,
                                       combine_prompt=COMBINE_PROMPT_SHORT,
                                       return_intermediate_steps=True)

In [46]:
%%time
# Try with other language as well
response = chain({"input_documents": top_docs, "question": QUESTION, "language": "English"})

CPU times: total: 15.6 ms
Wall time: 22.9 s


In [47]:
#response['output_text']
display(Markdown(response['output_text']))

CLP refers to Consultation-Liaison Psychiatry, a field of psychiatry that involves providing psychiatric services in outpatient medical settings<sup><a href="https://api.elsevier.com/content/article/pii/S0033318220301420; https://www.sciencedirect.com/science/article/pii/S0033318220301420?v=s5" target="_blank">[1]</a></sup>. In another context, CLP is also an abbreviation for Cecal Ligation and Puncture, a model used in research to induce sepsis in animals<sup><a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5809586/" target="_blank">[2]</a></sup><sup><a href="https://www.ncbi.nlm.nih.gov/pubmed/17047515/" target="_blank">[3]</a></sup>.