### DESCRIPTION:
This example shows how to use OpenAI with your own data, in this case the online text of the book Moby Dick.  
We create embeddings for text loaded from URLs and save them in Azure Data Explorer (ADX).  
We then query ADX and use OpenAI to generate an answer with the Retrieval Augmented Generation (RAG) method.
### REQUIREMENTS:
Create a .env file containing your OpenAI API key and save it in the root directory of this project.

### LangChain library
Activate your virtual environment and run the following command:  
`pip install "langchain[all]"`

### PREPARATION
* An ADX (Azure Data Explorer or Kusto) cluster  
* In ADX, create a Database named "openai"  
    <img src="images/1.png" alt="Create Kusto cluster" /> 
* Create a table called wikipedia by ingesting data from "./data/wikipedia/vector_database_wikipedia_articles_embedded_1000.csv"   
    <img src="images/2.png" alt="Create Kusto cluster" /> 
* Create an AAD (Azure Active Directory) app registration for authentication, as described in the link below   
    [Create an Azure Active Directory application registration in Azure Data Explorer](https://learn.microsoft.com/en-us/azure/data-explorer/provision-azure-ad-app)

* Add the following cosine similarity function to ADX by running it in the Azure Data Explorer web UI:   
     
```
//create the cosine similarity function for embeddings
.create-or-alter function with (folder = "Packages\\Series", docstring = "Calculate the Cosine similarity of 2 numerical arrays")
series_cosine_similarity_fl(vec1:dynamic, vec2:dynamic, vec1_size:real=double(null), vec2_size:real=double(null))
{
    let dp = series_dot_product(vec1, vec2);
    let v1l = iff(isnull(vec1_size), sqrt(series_dot_product(vec1, vec1)), vec1_size);
    let v2l = iff(isnull(vec2_size), sqrt(series_dot_product(vec2, vec2)), vec2_size);
    dp/(v1l*v2l)
}
```
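
For readers who want to check the ADX function's results locally, the same calculation can be sketched in plain Python. This is only an illustration of the formula in the KQL above (dot product divided by the product of the vector lengths); the Python function name and the sample vectors are assumptions made for this example.

```python
import math

def cosine_similarity(vec1, vec2, vec1_size=None, vec2_size=None):
    """Mirror of the KQL function: dot(vec1, vec2) / (|vec1| * |vec2|)."""
    dp = sum(a * b for a, b in zip(vec1, vec2))
    # As in the KQL version, a precomputed vector length can be passed in
    # to skip the square-root calculation.
    v1l = math.sqrt(sum(a * a for a in vec1)) if vec1_size is None else vec1_size
    v2l = math.sqrt(sum(b * b for b in vec2)) if vec2_size is None else vec2_size
    return dp / (v1l * v2l)

print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # same direction -> 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # orthogonal -> 0.0
```

A value close to 1 means the two embeddings point in nearly the same direction, which is what the retrieval step ranks on.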

In [14]:
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.document_loaders import UnstructuredURLLoader
from langchain.chains import RetrievalQA
from langchain.vectorstores import FAISS
from langchain.chains.question_answering import load_qa_chain
from openai.embeddings_utils import cosine_similarity
from openai.embeddings_utils import get_embedding
from dotenv import load_dotenv
import time
import tiktoken
from tenacity import retry, wait_random_exponential, stop_after_attempt
import utils
import pandas as pd

#### IMPORTANT!! Embeddings Creation Section - Run this only once
You only need to run this section once to create the embeddings and save them to Azure Data Explorer.   
After that, you can use the existing database and table in Azure Data Explorer for retrieval.

In [3]:
# initialize OpenAI and the embeddings model, then load the source text from the URLs and split it into chunks
openai = utils.init_OpenAI()
embeddings = utils.init_embeddings()

# you can add as many urls as you want, but for this example we will only use one
# "moby dick" the book is available online at the URL below
urls = ["https://www.gutenberg.org/files/2701/2701-0.txt"]

loader = UnstructuredURLLoader(urls=urls)
documents = loader.load()

# use a chunk size of 1000 characters with 10% overlap (100 characters) to reduce the chance of cutting sentences in the middle
# the lookbehind regex splits after a sentence-ending ". " so the period stays at the end of its chunk
# note: recent LangChain versions treat separators as literal text unless is_separator_regex=True is passed
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100, separators=["\n\n", "\n", r"(?<=\. )", " ", ""])
chunks = text_splitter.split_documents(documents)
len(chunks)

1804
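
To make the effect of `chunk_size` and `chunk_overlap` concrete, here is a minimal sliding-window sketch. It is not LangChain's algorithm (which also tries the separators in order); the function name `split_with_overlap` is hypothetical and only shows how consecutive chunks share their boundary characters.

```python
def split_with_overlap(text, chunk_size, chunk_overlap):
    """Cut text into fixed-size windows where each window repeats the
    last chunk_overlap characters of the previous one."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks

print(split_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=1))
# ['abcd', 'defg', 'ghij']
```

Each chunk starts with the last character of the previous one, so a sentence cut at a boundary still appears partly in both chunks.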

In [15]:
# use the tenacity library to add delays and retries when calling OpenAI, to avoid hitting throttling limits
@retry(wait=wait_random_exponential(min=1, max=20), stop=stop_after_attempt(6))
def calc_embeddings(text, deployment):
    # replace newlines with spaces, since newlines can degrade embedding quality
    text = text.replace("\n", " ")
    return get_embedding(text, engine=deployment)
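
The idea behind the retry decorator can be sketched without any library: retry a failing call a limited number of times and wait a random, exponentially growing delay between attempts. This is a simplified illustration, not tenacity's exact wait schedule; `retry_with_backoff` and its parameters are hypothetical names chosen to echo the decorator arguments above.

```python
import random
import time

def retry_with_backoff(func, max_attempts=6, min_wait=1.0, max_wait=20.0, sleep=time.sleep):
    """Call func(); on failure, wait a random delay whose upper bound
    doubles with each attempt (capped at max_wait), then try again."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except Exception:
            if attempt == max_attempts:
                raise  # give up after the last allowed attempt
            upper = min(max_wait, min_wait * 2 ** (attempt - 1))
            sleep(random.uniform(0, upper))

# Demo: a call that is "throttled" twice before it succeeds.
# A no-op sleep is passed in so the demo finishes immediately.
state = {"calls": 0}

def flaky_call():
    state["calls"] += 1
    if state["calls"] < 3:
        raise RuntimeError("throttled")
    return "embedding"

print(retry_with_backoff(flaky_call, sleep=lambda seconds: None))
```

Randomizing the delay keeps many parallel clients from retrying at the same moment after a throttling response.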

In [6]:
# save all the chunks into a pandas DataFrame; the embedding column is filled in later
rows = [{'document_name': ch.metadata['source'], 'content': ch.page_content, 'embedding': ""} for ch in chunks]
df = pd.DataFrame(rows, columns=['document_name', 'content', 'embedding'])
df.head()

Unnamed: 0,document_name,content,embedding
0,https://www.gutenberg.org/files/2701/2701-0.txt,The Project Gutenberg eBook of Moby-Dick; or T...,
1,https://www.gutenberg.org/files/2701/2701-0.txt,CONTENTS\n\nETYMOLOGY.\n\nEXTRACTS (Supplied b...,
2,https://www.gutenberg.org/files/2701/2701-0.txt,CHAPTER 33. The Specksnyder.\n\nCHAPTER 34. Th...,
3,https://www.gutenberg.org/files/2701/2701-0.txt,CHAPTER 58. Brit.\n\nCHAPTER 59. Squid.\n\nCHA...,
4,https://www.gutenberg.org/files/2701/2701-0.txt,CHAPTER 85. The Fountain.\n\nCHAPTER 86. The T...,


In [16]:
df["embedding"] = df.content.apply(lambda x: calc_embeddings(x, utils.OPENAI_ADA_EMBEDDING_DEPLOYMENT_NAME))
print(df.head(100))


                                      document_name  \
0   https://www.gutenberg.org/files/2701/2701-0.txt   
1   https://www.gutenberg.org/files/2701/2701-0.txt   
2   https://www.gutenberg.org/files/2701/2701-0.txt   
3   https://www.gutenberg.org/files/2701/2701-0.txt   
4   https://www.gutenberg.org/files/2701/2701-0.txt   
..                                              ...   
95  https://www.gutenberg.org/files/2701/2701-0.txt   
96  https://www.gutenberg.org/files/2701/2701-0.txt   
97  https://www.gutenberg.org/files/2701/2701-0.txt   
98  https://www.gutenberg.org/files/2701/2701-0.txt   
99  https://www.gutenberg.org/files/2701/2701-0.txt   

                                              content  \
0   The Project Gutenberg eBook of Moby-Dick; or T...   
1   CONTENTS\n\nETYMOLOGY.\n\nEXTRACTS (Supplied b...   
2   CHAPTER 33. The Specksnyder.\n\nCHAPTER 34. Th...   
3   CHAPTER 58. Brit.\n\nCHAPTER 59. Squid.\n\nCHA...   
4   CHAPTER 85. The Fountain.\n\nCHAPTER 86. The T... 

In [17]:
df.to_csv('data/adx/adx_embeddings.csv', index=False)

### Save to Azure Data Explorer section
You only need to do this once. So far we have:
* Read the document from the URL
* Split it into chunks
* Created the embeddings
* Saved the text chunks and corresponding embeddings to a CSV file

Now we ingest the CSV file (it must be smaller than 1 GB) into Azure Data Explorer.
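
Before ingesting, it can be worth confirming that the CSV is under the size limit mentioned above. The following is a small sketch; the helper name `is_within_ingest_limit` is hypothetical, and treating 1 GB as 1024³ bytes is an assumption.

```python
import os

ONE_GB = 1024 ** 3  # assumption: the 1 GB limit is interpreted as 1 GiB

def is_within_ingest_limit(path, limit_bytes=ONE_GB):
    """Return True when the file at path is smaller than limit_bytes."""
    return os.path.getsize(path) < limit_bytes

csv_path = "data/adx/adx_embeddings.csv"  # path used by the to_csv call above
if os.path.exists(csv_path):
    print(csv_path, "is small enough to ingest:", is_within_ingest_limit(csv_path))
```

If the file turns out to be too large, one option is to write the DataFrame out as several smaller CSV files and ingest them one at a time.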


#### Retrieval section
You can use the existing database and table in Azure Data Explorer for retrieval by running the cells below.

In [None]:

llm = utils.init_llm()
embeddings = utils.init_embeddings()
vectorStore = FAISS.load_local("./dbs/urls/faiss_index", embeddings)
retriever = vectorStore.as_retriever(search_type="similarity", search_kwargs={"k":2})
qa = RetrievalQA.from_chain_type(llm=llm, chain_type="stuff", retriever=retriever, return_source_documents=False)
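
With `chain_type="stuff"`, all retrieved chunks are placed together into a single prompt that is sent to the model along with the question. The sketch below shows that idea only; the prompt wording and the function name `build_stuffed_prompt` are assumptions and do not reproduce LangChain's actual template.

```python
def build_stuffed_prompt(question, retrieved_chunks):
    """Concatenate every retrieved chunk into one context block and
    append the question (hypothetical prompt wording)."""
    context = "\n\n".join(retrieved_chunks)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\nAnswer:"
    )

chunks = ["Call me Ishmael.", "Some years ago, never mind how long precisely..."]
print(build_stuffed_prompt("Who is the narrator?", chunks))
```

Because every chunk goes into one prompt, the number of retrieved chunks (`k`) and the chunk size together have to fit within the model's context window.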

In [None]:
qa({"query": "Why does the coffin prepared for Queequeg become Ishmael's life buoy once the Pequod sinks?"})

In [None]:
qa({"query": "Why does Ahab pursue Moby Dick so single-mindedly?"})

In [None]:
qa({"query": "Why does the novel's narrator begin his story with 'Call me Ishmael'?"})
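
The retrieval cells above load a local FAISS index. To rank chunks inside Azure Data Explorer instead, the `series_cosine_similarity_fl` function created during preparation can be called from a KQL query. The sketch below only builds the query text; the table name `moby_dick_embeddings` is hypothetical, the column names are taken from the CSV written earlier, and the `embedding` column is assumed to have been ingested with type `dynamic`.

```python
import json

def build_similarity_query(question_embedding, table="moby_dick_embeddings", top_k=2):
    """Build a KQL query that scores every row against the question
    embedding and keeps the top_k most similar chunks."""
    vector = json.dumps(list(question_embedding))
    return (
        f"{table}\n"
        f"| extend similarity = series_cosine_similarity_fl(dynamic({vector}), embedding)\n"
        f"| top {top_k} by similarity desc\n"
        "| project document_name, content, similarity"
    )

print(build_similarity_query([0.1, 0.2, 0.3]))
```

The resulting text can be pasted into the ADX web UI or sent through a Kusto client, with the question embedding produced by the same embedding model that was used for the chunks.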