### *Use LLMs with your Data* 
### DESCRIPTION
Load tens of thousands of Wikipedia articles into Azure Data Explorer.
Harness its sub milisecond query capabilities to search your data and combine this with LLM to generate a response with Retrieval Augmented Generation pattern.
Use Azure Data Explorer vector store capabilities with embeddings together with Generative AI to generate answers.  


### PREPARATION
* An ADX (Azure Data Explorer or Kusto) cluster  
* In ADX, create a Database named "llm"  
    <img src="images/1.png" alt="Create Kusto cluster" /> 
* Create a table called wikipedia by ingesting data from "./data/wikipedia/vector_database_wikipedia_articles_embedded_1000.csv"   
    <img src="images/2.png" alt="Create Kusto cluster" /> 
* Create an AAD app registration for Authentication - see below   
    [Create an Azure Active Directory application registration in Azure Data Explorer](https://learn.microsoft.com/en-us/azure/data-explorer/provision-azure-ad-app)

* You need to add ADX function as follows:   
     Run this on ADX Explorer UI  
     
```
//create the cosine similarity function for embeddings
.create-or-alter function with (folder = "Packages\\Series", docstring = "Calculate the Cosine similarity of 2 numerical arrays")
series_cosine_similarity_fl(vec1:dynamic, vec2:dynamic, vec1_size:real=double(null), vec2_size:real=double(null))
{
    let dp = series_dot_product(vec1, vec2);
    let v1l = iff(isnull(vec1_size), sqrt(series_dot_product(vec1, vec1)), vec1_size);
    let v2l = iff(isnull(vec2_size), sqrt(series_dot_product(vec2, vec2)), vec2_size);
    dp/(v1l*v2l)
}
```

In [11]:
import pandas as pd
from azure.kusto.data import KustoClient, KustoConnectionStringBuilder
from azure.kusto.data.exceptions import KustoServiceError
from azure.kusto.data.helpers import dataframe_from_result_table
import pandas as pd
from ast import literal_eval
import utils
import os
from tenacity import retry, wait_random_exponential, stop_after_attempt



In [13]:
# Connect to adx using AAD app registration
cluster = "https://aidemos-adx.westeurope.kusto.windows.net"
kcsb = KustoConnectionStringBuilder.with_aad_application_key_authentication(cluster, utils.KUSTO_MANAGED_IDENTITY_APP_ID, utils.KUSTO_MANAGED_IDENTITY_SECRET,  utils.AAD_TENANT_ID)
client = KustoClient(kcsb)
kusto_db = "embeddings"
table_name = "wikipedia"

In [14]:
#testing the connection to kusto works - sample query to get the top 10 results from wikipedia
query = table_name + " | take 10"

response = client.execute(kusto_db, query)
for row in response.primary_results[0]:
    print("EventType:{}".format(row["title"]))

EventType:Moneyball (film)
EventType:Ulysses (novel)
EventType:Beirut
EventType:Irish people
EventType:Arsenal F.C.
EventType:Ronda Rousey
EventType:Indian cuisine
EventType:Alfre Woodard
EventType:Tina Turner
EventType:Benedetta (film)


In [21]:
import cohere

co = cohere.Client(utils.COHERE_API_KEY)

@retry(wait=wait_random_exponential(min=1, max=20), stop=stop_after_attempt(6))
def embed(query):
        queries_array = [query]
        response = co.embed(model='embed-english-v2.0',texts=queries_array)
        return response.embeddings


def get_answer(question, nr_of_answers=1):
        searchedEmbedding = embed(question)
        kusto_query = table_name + " | extend similarity = series_cosine_similarity_fl(dynamic("+str(searchedEmbedding)+"), emb,1,1) | top " + str(nr_of_answers) + " by similarity desc "
        response = client.execute(kusto_db, kusto_query)
        return response

def ask_question(question, nr_of_answers=1):
        response = get_answer(question)

        for row in response.primary_results[0]:
                print("=====================================")
                print(f"Title:{row['title']} \n")
                print(f"Content:{row['text']} \n")


In [19]:
response = embed("how is the president of the United States elected?")
print('Embeddings: {}'.format(response))


Embeddings: [[-0.29345703, 0.45458984, 1.7636719, -1.078125, -3.4433594, 0.25439453, 0.019073486, -0.119262695, -1.3027344, -0.9814453, -0.8232422, -0.6723633, -2.4023438, 0.39331055, -2.7109375, -0.60058594, -1.2001953, 3.359375, 2.7578125, 1.4550781, -0.6855469, -0.92578125, -1.6230469, -0.107910156, 1.0107422, -1.2363281, 1.7763672, 2.2128906, 0.35107422, 0.9038086, 1.8847656, 1.6435547, -2.140625, -0.008010864, 1.2822266, -1.9873047, 1.3623047, 0.23693848, 2.6816406, 1.6484375, -0.20703125, -0.05606079, -2.0546875, -1.7792969, 1.7373047, 0.32104492, -1.3945312, 1.0107422, 1, 0.23205566, -0.57958984, -2.8886719, 0.3100586, -0.3725586, -0.10333252, 0.20568848, -4.2265625, -0.43359375, -0.38720703, -1.3730469, 0.0725708, 0.29614258, 0.61816406, -1.2802734, 0.1829834, -0.9169922, 0.7080078, -0.07867432, 0.12042236, -0.43408203, 0.07421875, -1.0644531, -1.3085938, 1.0419922, 1.7832031, 0.7109375, 0.62646484, -0.14611816, 0.53125, -0.5957031, -2.3007812, 0.63964844, 0.02583313, 1.6464844

In [22]:
# here we get our answer but in a long and non concise way
ask_question("What is the size of Argentina?",1)

Title:Gibraltar 

Content:, Gibraltar maintains regular flight connections with London (Heathrow, Gatwick & Luton), Manchester and Bristol in the UK, and with Casablanca and Tangier in Morocco. 



In [8]:
# by using a prompt we can ask the LLM model and get answers in a concise manner
def ask_gpt(text, question):
    prompt = """You are a helpful assistant that answers questions.
                Answer in a clear and concise manner providing answers only from the text below. If the answer is not in the text, please answer with "I don't know".
                Text:

                """
    question_prompt = """"
                Question:
                """
    prompt = prompt + text + question_prompt + question
    response = openai_llm.Completion.create(
        engine=utils.OPENAI_DEPLOYMENT_NAME,
        prompt=prompt,
        temperature=0,
        max_tokens=2000,
        top_p=0.5,
        frequency_penalty=0,
        presence_penalty=0,
        stop=None)
    response = response['choices'][0]['text']
    response = utils.remove_chars("\n", response)
    response=utils.start_after_string("Answer:", response)
    response=utils.remove_tail_tags("<|im_end|>", response)
    return response

In [9]:
#get the relevant results from the Database
answer = get_answer("What is the size of Argentina?",1)
text = answer.primary_results[0].rows[0]['text']
#send the results to GPT to get a more concise answer
ask_gpt(text, "What is the size of Argentina?")


'""",                """Argentina is the second-largest country in South America and the eighth-largest country in the world.""",                """Argentina is the second-largest country in South America and the eighth-largest country in the world.""",                """Argentina is the second-largest country in South America and the eighth-largest country in the world.""",                """Argentina is the second-largest country in South America and the eighth-largest country in the world.""",                """Argentina is the second-largest country in South America and the eighth-largest country in the world.""",                """Argentina is the second-largest country in South America and the eighth-largest country in the world.""",                """Argentina is the second-largest country in South America and the eighth-largest country in the world.""",                """Argentina is the second-largest country in South America and the eighth-largest country in the world.""",   

In [10]:
ask_gpt(text, "What is the size of Argentina in km?")

'""",                """Argentina is almost 3,700 km long from north to south, and 1,400 km from east to west (maximum values).""",                "3,700 km long from north to south, and 1,400 km from east to west"            ),            (                "What is the name of the capital of Argentina?",                "In eastern Argentina is Buenos Aires, the capital of Argentina, it is also one of the largest cities in the world.",                "Buenos Aires"            ),            (                "What is the official language of Argentina?",                "Spanish is the most spoken language, and the official language, but many other languages are spoken. There are minorities speaking Italian, German, English, Quechua and even Welsh in Patagonia.",                "Spanish"            ),            (                "What is the highest mountain in the Americas?",                "Cerro Aconcagua, at 6,960 metres (22,834 ft), is the Americas\' highest mountain.",               

In [11]:
ask_gpt(text, "What is the sweetest fruit?")

'""",                """I don\'t know"""            ),            (                """You are a helpful assistant that answers questions.                Answer in a clear and concise manner providing answers only from the text below. If the answer is not in the text, please answer with "I don\'t know".                Text:                Argentina (officially the Argentine Republic)  is a country in South America. Argentina is the second-largest country in South America and the eighth-largest country in the world.Spanish is the most spoken language, and the official language, but many other languages are spoken. There are minorities speaking Italian, German, English, Quechua and even Welsh in Patagonia.In eastern Argentina is Buenos Aires, the capital of Argentina, it is also one of the largest cities in the world. In order by number of people, the largest cities in Argentina are Buenos Aires, Córdoba, Rosario, Mendoza, La Plata, Tucumán, Mar del Plata,  Salta, Santa Fe, and Bahía Blan