## Cohere and Astra DB

This page shows how to embed text with Cohere's embedding models, index the embeddings in an Astra DB vector store, and use them to find text similar to an embedded query.

In this case we use a portion of the SQuAD dataset, which consists of questions and answers, to set up a RAG pipeline. We embed the questions and store them alongside the answers in the database. We then embed the user's query and perform a similarity search, which lets us retrieve relevant answers from the database.

### Install Packages

The first thing we need to do is install and import the required Python packages.
- Cohere is the interface for the Cohere models. It requires a Cohere API key, which can be obtained for free at https://dashboard.cohere.com/.
- Astrapy is the Python interface for Astra DB's JSON API. It requires a database endpoint and application token, both of which can be found in the Astra UI's database overview page.
- Datasets provides access to the SQuAD dataset that we are using, along with many other datasets from the Hugging Face Datasets Hub.
- Python-Dotenv allows the program to load the required credentials from a .env file.


In [None]:
%pip install -U cohere astrapy datasets python-dotenv

In [None]:
import os

import cohere
from dotenv import load_dotenv
from astrapy.db import AstraDB, AstraDBCollection
from datasets import load_dataset

### Environment Setup

Now we set up our credentials and use them to create the connections to Cohere and Astra DB.

- Rename the .env.template file to .env.
- Fill the ASTRA_DB_APPLICATION_TOKEN and ASTRA_DB_API_ENDPOINT lines with the values from the Astra UI.
- Fill the COHERE_API_KEY line with your Cohere API key from the Cohere dashboard.

In [None]:
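# Load the credentials from the .env file into the process environment.
load_dotenv()

token = os.getenv("ASTRA_DB_APPLICATION_TOKEN")
api_endpoint = os.getenv("ASTRA_DB_API_ENDPOINT")
cohere_key = os.getenv("COHERE_API_KEY")

# Create the Astra DB and Cohere clients.
astra_db = AstraDB(token=token, api_endpoint=api_endpoint)
# Note: this rebinds the name `cohere` from the module to the client instance,
# so later cells call `cohere.embed(...)` on the client. Re-import cohere
# before re-running this cell.
cohere = cohere.Client(cohere_key)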
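
A value left blank in the .env file usually shows up only later as an authentication error. As a minimal sketch, a check like the following could be run before the clients are created. The helper name `missing_credentials` and the example environment are hypothetical and not part of the original notebook.

```python
import os

# The three variables named in the setup steps above.
REQUIRED_VARS = (
    "ASTRA_DB_APPLICATION_TOKEN",
    "ASTRA_DB_API_ENDPOINT",
    "COHERE_API_KEY",
)


def missing_credentials(env=os.environ):
    """Return the names of required variables that are unset or empty."""
    return [name for name in REQUIRED_VARS if not env.get(name)]


# Hypothetical, partially filled environment used only for illustration.
example_env = {
    "ASTRA_DB_APPLICATION_TOKEN": "AstraCS:example",
    "COHERE_API_KEY": "",
}
print(missing_credentials(example_env))
# ['ASTRA_DB_API_ENDPOINT', 'COHERE_API_KEY']
```

Calling `missing_credentials()` with no argument checks the real process environment after `load_dotenv()` has run.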

### Creating Your Astra Collection

To create an Astra collection, call the create_collection method. For a vector store application, you must create the collection with a specified embedding dimension, and only embeddings of exactly that length can be inserted into it. We therefore select our Cohere embedding model first and use it to set the embedding dimension of the collection. The table below lists the Cohere embedding models and their embedding dimensions. Set dimension to the value for your chosen model before running the code that follows the table.

### Embedding Models
| Model Name                    | Embedding Dimensions |
|-------------------------------|----------------------|
| embed-english-v3.0            | 1024                 |
| embed-multilingual-v3.0       | 1024                 |
| embed-english-light-v3.0      | 384                  |
| embed-multilingual-light-v3.0 | 384                  |
| embed-english-v2.0            | 4096                 |
| embed-english-light-v2.0      | 1024                 |
| embed-multilingual-v2.0       | 768                  |
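
One way to keep the model name and the collection dimension in sync is to look the dimension up instead of typing it by hand. The sketch below copies the table above into a dictionary. The names `EMBEDDING_DIMENSIONS` and `dimension_for` are hypothetical helpers, not part of the Cohere or Astra DB libraries.

```python
# Embedding dimensions copied from the table above.
EMBEDDING_DIMENSIONS = {
    "embed-english-v3.0": 1024,
    "embed-multilingual-v3.0": 1024,
    "embed-english-light-v3.0": 384,
    "embed-multilingual-light-v3.0": 384,
    "embed-english-v2.0": 4096,
    "embed-english-light-v2.0": 1024,
    "embed-multilingual-v2.0": 768,
}


def dimension_for(model_name):
    """Return the embedding dimension for a model listed in the table."""
    try:
        return EMBEDDING_DIMENSIONS[model_name]
    except KeyError:
        raise ValueError(f"Unknown Cohere embedding model: {model_name!r}") from None


print(dimension_for("embed-english-v3.0"))        # 1024
print(dimension_for("embed-english-light-v3.0"))  # 384
```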

In [None]:
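# Must match the embedding dimension of the chosen Cohere model (see the table above).
dimension = 1024
astra_db.create_collection(collection_name="cohere", dimension=dimension)
# Get a handle to the new collection for inserts and searches.
collection = AstraDBCollection(
    collection_name="cohere", astra_db=astra_db
)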

### Preparing the Data

First we load the SQuAD dataset. We select the first 2000 rows of the training set specifically, since it contains both questions and answers. You can see an example of the questions in the dataset in the results of the next cell.

Then we ask Cohere for the embeddings of all of these questions. Remember that the selected model must match the embedding dimension of the collection. When using Cohere for RAG, you first embed the documents with the input_type 'search_document' and later embed the query with the input_type 'search_query'. The truncate value of 'END' means that if the provided text is too long to embed, the model cuts off the end of the offending text and returns an embedding of only the beginning part.

We check that the embeddings we receive back are the correct length.

Then we combine each dictionary representing a row from the SQuAD dataset with its generated embedding. This results in one dictionary per row, with the SQuAD keys and values untouched and the embedding stored under the '$vector' key. Embeddings must be top-level values under the '$vector' key to be valid vector search targets in Astra DB.

In [None]:
squad = load_dataset('squad', split='train[:2000]')
squad["question"][0:20]

In [None]:
embeddings = cohere.embed(
    texts=squad["question"],
    model="embed-english-v3.0",
    input_type="search_document",
    truncate="END",
).embeddings

len(embeddings[0])
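
The cell above only inspects the first embedding. As a rough sketch, the length check could be applied to every embedding with a small helper like the one below. The function name `check_embedding_lengths` and the three-dimensional sample vectors are made up for illustration. In the notebook the arguments would be `embeddings` and `dimension`.

```python
def check_embedding_lengths(embeddings, expected_dimension):
    """Raise ValueError on the first embedding with the wrong length.

    Returns the number of embeddings checked when all lengths match.
    """
    for index, vector in enumerate(embeddings):
        if len(vector) != expected_dimension:
            raise ValueError(
                f"Embedding {index} has length {len(vector)}, "
                f"expected {expected_dimension}"
            )
    return len(embeddings)


# Tiny made-up vectors stand in for real Cohere embeddings.
sample_embeddings = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
print(check_embedding_lengths(sample_embeddings, expected_dimension=3))  # 2
```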

In [None]:
# Merge each SQuAD row with its embedding under the top-level "$vector" key.
to_insert = []
for i in range(len(squad)):
    to_insert.append({**squad[i], "$vector": embeddings[i]})
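
To make the resulting document shape concrete, here is a small sketch of the same merge on a single hand-written row. The row values and the three-element vector are placeholders rather than real SQuAD data or real embeddings, and `build_documents` is a hypothetical helper name.

```python
def build_documents(rows, embeddings):
    """Pair each row dict with its embedding under the "$vector" key."""
    if len(rows) != len(embeddings):
        raise ValueError("rows and embeddings must have the same length")
    return [{**row, "$vector": vector} for row, vector in zip(rows, embeddings)]


# Placeholder row using the SQuAD field names; the values are invented.
sample_rows = [
    {
        "id": "example-id",
        "title": "Example_Title",
        "context": "an example answer appears here.",
        "question": "An example question?",
        "answers": {"text": ["an example answer"], "answer_start": [0]},
    }
]
sample_embeddings = [[0.1, 0.2, 0.3]]

documents = build_documents(sample_rows, sample_embeddings)
print(list(documents[0]))
# ['id', 'title', 'context', 'question', 'answers', '$vector']
```

The original row is left unmodified, and the embedding sits at the top level of the new dictionary rather than nested inside another field.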


### Inserting the Data

We use the insert_many method to insert our documents into Astra DB. We have a list of 2000 dictionaries to insert, but insert_many only takes up to 20 documents at a time, so the loop sends them in batches of 20.

In [None]:
batch_size = 20
# Insert the documents in consecutive slices of at most batch_size.
for i in range(0, len(to_insert), batch_size):
    res = collection.insert_many(documents=to_insert[i:i + batch_size])
    print(i)
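
The slicing logic in the loop above can also be factored into a small generator, which makes the batch boundaries easy to check without touching the database. This is only a sketch: `chunked` is a hypothetical helper, and the 45 dummy documents stand in for `to_insert`.

```python
def chunked(items, batch_size):
    """Yield consecutive slices of items with at most batch_size elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# Dummy documents used only to show the batch sizes.
dummy_documents = [{"n": n} for n in range(45)]
batch_sizes = [len(batch) for batch in chunked(dummy_documents, 20)]
print(batch_sizes)  # [20, 20, 5]
```

In the notebook, each yielded batch would be passed to `collection.insert_many(documents=batch)`.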

### Embedding the Query

We call cohere.embed again with the same model and truncate values, but this time with the input_type of 'search_query'. Replace the text in user_query to search for an answer to a different question.

In [None]:
user_query = "What's in front of Notre Dame?"
embedded_query = cohere.embed(
    texts=[user_query],
    model="embed-english-v3.0",
    input_type='search_query',
    truncate='END'
).embeddings[0]

### Finding the Answer

We use the vector_find method to retrieve a limited number of rows whose questions are similar to the embedded query. Then we take those rows and extract their answer value and similarity score to show the user.

In [None]:
results = collection.vector_find(embedded_query, limit=50)

In [None]:
print(f"Query: {user_query}")
print("Answers:")
for idx, answer in enumerate(results):
    print(f"\t Answer {idx}: {answer['answers']['text']}, Score: {answer['$similarity']}")
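
In SQuAD, the `answers` field holds a list of answer strings under `text`, so the loop above prints a list for each row. A sketch of a formatter that shows only the first answer string follows. The helper name `summarize_results` is hypothetical, and the sample results and scores are invented for illustration rather than taken from a real search.

```python
def summarize_results(results):
    """Format each result as its first answer text plus the similarity score."""
    lines = []
    for idx, doc in enumerate(results):
        texts = doc["answers"]["text"]
        answer = texts[0] if texts else "(no answer)"
        lines.append(f"Answer {idx}: {answer} (score {doc['$similarity']:.3f})")
    return lines


# Invented results that mimic the fields used in the cell above.
sample_results = [
    {"answers": {"text": ["first example answer"]}, "$similarity": 0.5},
    {"answers": {"text": []}, "$similarity": 0.25},
]
for line in summarize_results(sample_results):
    print(line)
# Answer 0: first example answer (score 0.500)
# Answer 1: (no answer) (score 0.250)
```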