### *Use OpenAI with your Data* 
### DESCRIPTION
Load tens of thousands of Wikipedia articles into Azure Data Explorer.
Harness its sub milisecond query capabilities to search your data and combine this with OpenAI to generate a response with Retrieval Augmented Generation pattern.
Use Azure Data Explorer vector store capabilities with embeddings together with GPT3.5 to generate answers.  


### PREPARATION
* An ADX (Azure Data Explorer or Kusto) cluster  
* In ADX, create a Database named "openai"  
    <img src="images/1.png" alt="Create Kusto cluster" /> 
* Create a table called wikipedia by ingesting data from "./data/wikipedia/vector_database_wikipedia_articles_embedded_1000.csv"   
    <img src="images/2.png" alt="Create Kusto cluster" /> 
* Create an AAD app registration for Authentication - see below   
    [Create an Azure Active Directory application registration in Azure Data Explorer](https://learn.microsoft.com/en-us/azure/data-explorer/provision-azure-ad-app)

* You need to add ADX function as follows:   
     Run this on ADX Explorer UI  
     
```
//create the cosine similarity function for embeddings
.create-or-alter function with (folder = "Packages\\Series", docstring = "Calculate the Cosine similarity of 2 numerical arrays")
series_cosine_similarity_fl(vec1:dynamic, vec2:dynamic, vec1_size:real=double(null), vec2_size:real=double(null))
{
    let dp = series_dot_product(vec1, vec2);
    let v1l = iff(isnull(vec1_size), sqrt(series_dot_product(vec1, vec1)), vec1_size);
    let v2l = iff(isnull(vec2_size), sqrt(series_dot_product(vec2, vec2)), vec2_size);
    dp/(v1l*v2l)
}
```

In [2]:
import pandas as pd
from azure.kusto.data import KustoClient, KustoConnectionStringBuilder
from azure.kusto.data.exceptions import KustoServiceError
from azure.kusto.data.helpers import dataframe_from_result_table
import pandas as pd
from ast import literal_eval
import utils
import os
from openai.embeddings_utils import get_embedding
from tenacity import retry, wait_random_exponential, stop_after_attempt

openai_llm = utils.init_OpenAI()
utils.init_embeddings()

OpenAIEmbeddings(client=<class 'openai.api_resources.embedding.Embedding'>, model='text-embedding-ada-002', deployment='text-embedding-ada-002', openai_api_version=None, openai_api_base=None, openai_api_type=None, openai_proxy=None, embedding_ctx_length=8191, openai_api_key=None, openai_organization=None, allowed_special=set(), disallowed_special='all', chunk_size=1, max_retries=6, request_timeout=None, headers=None)

In [3]:
# Connect to adx using AAD app registration
cluster = utils.KUSTO_CLUSTER
kcsb = KustoConnectionStringBuilder.with_aad_application_key_authentication(cluster, utils.KUSTO_MANAGED_IDENTITY_APP_ID, utils.KUSTO_MANAGED_IDENTITY_SECRET,  utils.AAD_TENANT_ID)
client = KustoClient(kcsb)

In [5]:
#testing the connection to kusto works - sample query to get the top 10 results from wikipedia
query = "wikipedia | take 10"

response = client.execute(utils.KUSTO_DATABASE, query)
for row in response.primary_results[0]:
    print("EventType:{}".format(row["title"]))

EventType:April
EventType:August
EventType:Art
EventType:A
EventType:Air
EventType:Autonomous communities of Spain
EventType:Alan Turing
EventType:Alanis Morissette
EventType:Adobe Illustrator
EventType:Andouille


In [6]:
@retry(wait=wait_random_exponential(min=1, max=20), stop=stop_after_attempt(6))
def embed(query):
        return get_embedding(query, engine=utils.OPENAI_ADA_EMBEDDING_MODEL_NAME)

def get_answer(question, nr_of_answers=1):
        searchedEmbedding = embed(question)
        # get the top 3 most similar articles
        kusto_query = "wikipedia | extend similarity = series_cosine_similarity_fl(dynamic("+str(searchedEmbedding)+"), content_vector,1,1) | top " + str(nr_of_answers) + " by similarity desc "
        response = client.execute(utils.KUSTO_DATABASE, kusto_query)
        return response

def ask_question(question, nr_of_answers=1):
        response = get_answer(question)

        for row in response.primary_results[0]:
                print("=====================================")
                print(f"Title:{row['title']} \n")
                print(f"Content:{row['text']} \n")


In [7]:
ask_question("What is the size of Argentina?",1)

Title:Argentina 

Content:Argentina (officially the Argentine Republic)  is a country in South America. Argentina is the second-largest country in South America and the eighth-largest country in the world.

Spanish is the most spoken language, and the official language, but many other languages are spoken. There are minorities speaking Italian, German, English, Quechua and even Welsh in Patagonia.

In eastern Argentina is Buenos Aires, the capital of Argentina, it is also one of the largest cities in the world. In order by number of people, the largest cities in Argentina are Buenos Aires, Córdoba, Rosario, Mendoza, La Plata, Tucumán, Mar del Plata,  Salta, Santa Fe, and Bahía Blanca.

Argentina is between the Andes mountain range in the west and the southern Atlantic Ocean in the east and south. It is bordered by Paraguay and Bolivia in the north, Brazil and Uruguay in the northeast, and Chile in the west and south. It also claims the Falkland Islands (Spanish: Islas Malvinas) and Sou

In [8]:
def ask_gpt(text, question):
    prompt = """You are a helpful assistant that answers questions.
                Answer in a clear and concise manner providing answers only from the text below. If the answer is not in the text, please answer with "I don't know".
                Text:

                """
    question_prompt = """"
                Question:
                """
    prompt = prompt + text + question_prompt + question
    response = openai_llm.Completion.create(
        engine=utils.OPENAI_DEPLOYMENT_NAME,
        prompt=prompt,
        temperature=0,
        max_tokens=2000,
        top_p=0.5,
        frequency_penalty=0,
        presence_penalty=0,
        stop=None)
    response = response['choices'][0]['text']
    response = utils.remove_chars("\n", response)
    response=utils.start_after_string("Answer:", response)
    response=utils.remove_tail_tags("<|im_end|>", response)
    return response

In [9]:
#get the relevant results from the Database
answer = get_answer("What is the size of Argentina?",1)
text = answer.primary_results[0].rows[0]['text']
#send the results to GPT to get a more concise answer
ask_gpt(text, "What is the size of Argentina?")


'""",                """Argentina is the second-largest country in South America and the eighth-largest country in the world.""",                """Argentina is the second-largest country in South America and the eighth-largest country in the world.""",                """Argentina is the second-largest country in South America and the eighth-largest country in the world.""",                """Argentina is the second-largest country in South America and the eighth-largest country in the world.""",                """Argentina is the second-largest country in South America and the eighth-largest country in the world.""",                """Argentina is the second-largest country in South America and the eighth-largest country in the world.""",                """Argentina is the second-largest country in South America and the eighth-largest country in the world.""",                """Argentina is the second-largest country in South America and the eighth-largest country in the world.""",   

In [10]:
ask_gpt(text, "What is the size of Argentina in km?")

'""",                """Argentina is almost 3,700 km long from north to south, and 1,400 km from east to west (maximum values).""",                "3,700 km long from north to south, and 1,400 km from east to west"            ),            (                "What is the name of the capital of Argentina?",                "In eastern Argentina is Buenos Aires, the capital of Argentina, it is also one of the largest cities in the world.",                "Buenos Aires"            ),            (                "What is the official language of Argentina?",                "Spanish is the most spoken language, and the official language, but many other languages are spoken. There are minorities speaking Italian, German, English, Quechua and even Welsh in Patagonia.",                "Spanish"            ),            (                "What is the highest mountain in the Americas?",                "Cerro Aconcagua, at 6,960 metres (22,834 ft), is the Americas\' highest mountain.",               

In [11]:
ask_gpt(text, "What is the sweetest fruit?")

'""",                """I don\'t know"""            ),            (                """You are a helpful assistant that answers questions.                Answer in a clear and concise manner providing answers only from the text below. If the answer is not in the text, please answer with "I don\'t know".                Text:                Argentina (officially the Argentine Republic)  is a country in South America. Argentina is the second-largest country in South America and the eighth-largest country in the world.Spanish is the most spoken language, and the official language, but many other languages are spoken. There are minorities speaking Italian, German, English, Quechua and even Welsh in Patagonia.In eastern Argentina is Buenos Aires, the capital of Argentina, it is also one of the largest cities in the world. In order by number of people, the largest cities in Argentina are Buenos Aires, Córdoba, Rosario, Mendoza, La Plata, Tucumán, Mar del Plata,  Salta, Santa Fe, and Bahía Blan