# Using a database

### Introduction

So far, we have seen one technique for adding information to our LLM application: take a question, use Google to search for relevant information, and present that information to our model so it can answer the question.

<img src="./question-answer.png">

Above, a Google search provided this information.  But oftentimes we cannot rely on Google to find the information we want to give our LLM.

> **Why not?**  For one, the information may be private, as when we want to search through internal company documents.  Or we may first have to do some pre-processing work, like selecting just the right passages of text to present to our LLM.  Or we may need to ensure a level of data accuracy before feeding that data to our LLM.

So instead of using Google, we'll need to search for this information in a database.  

In that case, our architecture will look like the following.

<img src="./rag-pipeline.png" width="60%">

So in this case, we are not searching Google for content to then send to our LLM.  Instead, we are searching our database for the relevant content.

The hard part, of course, is that our query is not a SQL statement.  It is plain text, like "Was Barbie nominated for an Oscar?"  And the content in our database is text as well.  So how do we take a text-based query and retrieve the relevant content from our database?

That's what we'll discuss next.

### Text to vector

The key thing is to convert our content in the database into a numerical representation, and also to turn our question into a numerical representation.  These numerical representations are called embeddings.  

An embedding is a vector that represents text.  For example, the word "queen" may be represented by a vector like the following (shortened here; real embeddings have hundreds or thousands of dimensions):

`[1.17129, 2.12989, 3.1980128]`

When someone asks a question, we'll turn the question into a vector, and return the entries in our database whose vectors are most similar to the query vector.
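To make "most similar" concrete, here is a minimal sketch using made-up three-dimensional vectors.  The numbers and the `cosine_similarity` helper are assumptions for illustration only; real embeddings come from a model and are much longer.  It uses plain Python so it runs without any extra packages.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy embeddings, hand-written for this example.
toy_vectors = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.8, 0.9, 0.1],
    "bagel": [0.1, 0.0, 0.9],
}

# Compare the query vector against every other vector.
query = toy_vectors["king"]
scores = {
    word: cosine_similarity(query, vec)
    for word, vec in toy_vectors.items()
    if word != "king"
}

# The word whose vector points in the most similar direction wins.
best_match = max(scores, key=scores.get)
print(best_match)
```

Because the toy "king" and "queen" vectors point in nearly the same direction, "queen" scores higher than "bagel".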

For example, in the `example_1.py` file, you'll see some initial lines of code that take a list of strings and turn each one into a vector.

In [3]:
import numpy as np
import pandas as pd
from openai import OpenAI

api_key = ""  # get an API key from platform.openai.com
client = OpenAI(api_key=api_key)

MODEL = "text-embedding-3-small"
inputs = ["Tree", "Bagel", "Software Developer", "King", "Queen", "Prince"]

# Request one embedding per input string.
res = client.embeddings.create(input=inputs, model=MODEL)
vectors = res.data

# Store each text next to its embedding in a dataframe.
df = pd.DataFrame(inputs, columns=['text'])
np_embeddings = [np.array(vector.embedding) for vector in vectors]
df = df.assign(embedding=np_embeddings)

Then if someone gives us a question, we convert that text to a vector, and compare it to the others.

> So below, you can imagine our question text is `King`.  We convert that to a vector.

In [2]:
question = "King"
# Embed the question with the same model used for the stored text.
q_embeddings = client.embeddings.create(input=question, model=MODEL)
q_embedding_array = np.array(q_embeddings.data[0].embedding)

Then the `cosine` function computes the distance between our question vector (representing "King") and each of the other vectors above.  The smaller the distance, the more similar the two pieces of text.

In [None]:
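from scipy.spatial.distance import cosine  # assumption: SciPy's cosine distance (1 - cosine similarity)
distances = [cosine(q_embedding_array, embedding) for embedding in np_embeddings]
df['distances'] = distances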

Then we can sort the dataframe so that the words most similar to our question vector come first.

In [None]:
df.sort_values('distances')[:3]
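The same ranking step can be sketched without pandas or an API call.  The vectors below are invented stand-ins for real embeddings, and `cosine_distance` and `top_k` are hypothetical helper names, so treat this as an illustration of the idea rather than the lesson's actual code.

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity: close to 0 for same direction, larger when less alike."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def top_k(query_vec, rows, k=3):
    """Return the k texts whose vectors have the smallest distance to query_vec."""
    ranked = sorted(rows, key=lambda row: cosine_distance(query_vec, row[1]))
    return [text for text, _ in ranked[:k]]

# Hypothetical (text, vector) rows standing in for the dataframe.
rows = [
    ("Tree",   [0.1, 0.9, 0.2]),
    ("Bagel",  [0.0, 0.2, 0.9]),
    ("King",   [0.9, 0.1, 0.1]),
    ("Queen",  [0.8, 0.2, 0.1]),
    ("Prince", [0.7, 0.1, 0.2]),
]

question_vec = [0.9, 0.1, 0.1]  # pretend this is the embedding of "King"
closest = top_k(question_vec, rows, k=3)
print(closest)
```

Sorting in ascending order matters here: with a distance, smaller means more similar, so the best matches come first.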

We get the text from those closest three entries and send it to our AI model as the context.

In [None]:
context = '\n\n'.join(df.sort_values('distances')[:3].text)
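One way the context might then be handed to the model is by placing it in the prompt ahead of the question.  The `build_prompt` helper and its wording are assumptions made for this sketch; the lesson does not prescribe a particular prompt format.

```python
def build_prompt(question, context):
    """Combine retrieved context and the user's question into one prompt string."""
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )

# Stand-in for the three closest entries retrieved from the dataframe.
context = "\n\n".join(["King", "Queen", "Prince"])
prompt = build_prompt("King", context)
print(prompt)
```

Keeping the context and the question in clearly labeled sections makes it easier to see what the model was given when an answer looks wrong.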

So this way, instead of our AI model getting information from the internet, we find the relevant information in our database, and that becomes the context that we send to our model.

<img src="./rag-pipeline.png" width="60%">

This pattern is called a RAG pipeline.  RAG stands for retrieval-augmented generation: the text generation from our LLM is *augmented* by retrieving data from the database and presenting it to our LLM.
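The retrieval and augmentation steps can be sketched end to end with a tiny in-memory store.  Everything here is hypothetical: `TinyVectorStore` is not a real library, `fake_embed` is a word-counting stand-in for a real embedding model, and the two stored documents are invented.  In practice the embedding function would call a model, and the final prompt would be sent to the LLM for the generation step.

```python
import math
import re

VOCAB = ["barbie", "oscar", "bagel", "recipe"]  # hypothetical tiny vocabulary

def fake_embed(text):
    """Stand-in for a real embedding model: counts vocabulary words in the text."""
    words = re.findall(r"[a-z]+", text.lower())
    return [float(words.count(term)) for term in VOCAB]

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms if norms else 0.0  # guard against all-zero vectors

class TinyVectorStore:
    """Keeps (text, vector) pairs in memory and searches them by similarity."""

    def __init__(self, embed):
        self.embed = embed
        self.rows = []

    def add(self, text):
        self.rows.append((text, self.embed(text)))

    def search(self, question, k=1):
        q = self.embed(question)
        ranked = sorted(
            self.rows,
            key=lambda row: cosine_similarity(q, row[1]),
            reverse=True,  # highest similarity first
        )
        return [text for text, _ in ranked[:k]]

# Retrieval: find the stored text closest to the question.
store = TinyVectorStore(fake_embed)
store.add("Notes on Barbie and the Oscar nominations")
store.add("Bagel recipe for the office kitchen")
question = "Was Barbie nominated for an Oscar"
retrieved = store.search(question, k=1)

# Augmentation: put the retrieved text in front of the question.
prompt = "Context:\n" + "\n\n".join(retrieved) + "\n\nQuestion: " + question
print(prompt)
```

Passing the embedding function into the store keeps the search logic independent of any one model, so the stub can later be swapped for a real embedding call.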

> You can see another example in the `example_2.py` file. 

[Pinecone OpenAI](https://docs.pinecone.io/docs/openai)

[Pinecone Quickstart](https://docs.pinecone.io/docs/quickstart)

[Text to Vector Lesson](https://github.com/jigsawlabs-student/pytorch-nlp/blob/master/2-embedding-words.ipynb)

[Hacker News Who is Hiring](https://marcotm.com/articles/information-extraction-with-large-language-models-parsing-unstructured-data-with-gpt/)

[OpenAI Cookbook](https://github.com/openai/openai-cookbook/blob/main/examples/RAG_with_graph_db.ipynb)

[Question and Answering Embedding](https://github.com/openai/openai-cookbook/blob/main/examples/Embedding_Wikipedia_articles_for_search.ipynb)