![](https://framerusercontent.com/images/XrtjzetA0V96Cl0di9eH9gsbpY.png)
# Astra DB with AstraPy

Learn how to use your Astra DB database with AstraPy.

In this quickstart, you'll create a vector collection, store a few documents in it, and run **vector searches** against it.

_Prerequisites:_ Make sure you have an Astra DB instance and have its *Token* and *API Endpoint* at hand
(read more [here](https://docs.datastax.com/en/astra/home/astra.html)).

## Setup

In [None]:
!pip install --quiet --upgrade astrapy openai langchain tiktoken gradio

### Import needed libraries

In [2]:
import os, json
from getpass import getpass

from astrapy.db import AstraDB

### Provide database credentials

These are the connection parameters on your Astra dashboard. Example values:

- API Endpoint: `https://01234567-89ab-cdef-0123-456789abcdef-us-east1.apps.astra.datastax.com`
- Token: `AstraCS:6gBhNmsk135...`


In [3]:
ASTRA_DB_API_ENDPOINT = ""
ASTRA_DB_APPLICATION_TOKEN = ""

In [4]:
os.environ['OPENAI_API_KEY'] = ""
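
Hardcoding secrets in a notebook makes them easy to leak when the file is shared. A minimal sketch of an alternative, using the `getpass` import from above: look for each value in the environment first and prompt only if it is missing. The helper name `read_secret` and its `env` and `ask` parameters are illustrative, not part of AstraPy.

```python
import os
from getpass import getpass


def read_secret(name, env=None, ask=getpass):
    """Return a secret from the environment, prompting only when it is absent."""
    env = os.environ if env is None else env
    value = env.get(name, "").strip()
    if not value:
        # getpass hides the typed characters, so the secret never shows on screen
        value = ask(f"Enter {name}: ").strip()
    return value


# Demonstration with a stand-in environment and prompt (no real secrets involved)
demo_env = {"ASTRA_DB_API_ENDPOINT": "https://example.invalid"}
endpoint = read_secret("ASTRA_DB_API_ENDPOINT", env=demo_env)
token = read_secret("ASTRA_DB_APPLICATION_TOKEN", env=demo_env, ask=lambda prompt: "typed-token")
print(endpoint, token)
```

In the notebook itself you would call `read_secret("ASTRA_DB_APPLICATION_TOKEN")` with the defaults, so the value comes from `os.environ` or from a hidden prompt.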

## Create a collection

### Create the client

In [5]:
astra_db = AstraDB(
    api_endpoint=ASTRA_DB_API_ENDPOINT,
    token=ASTRA_DB_APPLICATION_TOKEN,
    namespace='tekimax'
)

### Create the collection

The `create_collection` method results in a new collection on your database.

In [6]:
collection = astra_db.create_collection("players_collection", dimension=1536)

Here, `dimension` is the vector dimension (or "size", i.e. how many numeric components your vector will have).

This example uses 1536, the size of the vectors produced by OpenAI's `text-embedding-ada-002` model, which is used below to compute the embeddings.

_Note:_ If the collection already exists and the parameters match, this method simply returns it. If you try to create a collection with the same name but a different configuration (such as a mismatching dimension), you will get an error instead.
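
Since the collection is created with a fixed `dimension`, it can help to check each embedding's length before inserting it. The following is a small sketch of such a guard; `EXPECTED_DIMENSION` and `check_dimension` are hypothetical names, and the value 1536 simply mirrors the `dimension` passed to `create_collection` above.

```python
EXPECTED_DIMENSION = 1536  # same value as the `dimension` argument above


def check_dimension(vector, expected=EXPECTED_DIMENSION):
    """Return the vector unchanged, or raise if its length is not the expected one."""
    if len(vector) != expected:
        raise ValueError(f"expected {expected} components, got {len(vector)}")
    return vector


good = check_dimension([0.0] * 1536)
try:
    check_dimension([0.1, 0.2, 0.3])
except ValueError as err:
    print(err)  # expected 1536 components, got 3
```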

## Insert documents

### Prepare documents and embeddings

When working with vector stores, your documents can have arbitrary fields, as long as their names use only letters, digits and the `_` (underscore) character, preferably sticking to `snake_case`.

In particular, note the reserved dollar sign in the field names `$vector` and `$similarity`.
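
The naming rule above is easy to check mechanically. Here is a sketch of a validator; `is_valid_field_name` is a hypothetical helper, and treating `$vector` and `$similarity` as the only accepted dollar-sign names is an assumption based on the fields mentioned in this notebook.

```python
import re

FIELD_NAME = re.compile(r"[A-Za-z0-9_]+")
RESERVED = {"$vector", "$similarity"}  # dollar-sign names mentioned above


def is_valid_field_name(name):
    """True for reserved names and for names made of letters, digits and underscores."""
    return name in RESERVED or FIELD_NAME.fullmatch(name) is not None


for name in ["FGPercentage", "2P", "$vector", "FG%", "FT Percentage"]:
    print(f"{name!r}: {is_valid_field_name(name)}")
```

The first three names pass; `FG%` and `FT Percentage` fail because of the percent sign and the space.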

In [7]:
from openai import OpenAI

client = OpenAI(
  api_key=os.environ['OPENAI_API_KEY'],  # this is also the default, it can be omitted
)

embedding_model_name = "text-embedding-ada-002"

In [8]:
def generate_llm_reponse(metadata):

    llm_input = f"""
                Given the following information, extract the basketball player's stats and generate metrics for player performance comparison.
                Users should be able to run vector similarity searches on these metrics, such as 'Players with minimum Personal Fouls' or 'Players with top scores'.
                If a specific metric is zero, then display that metric preceded by 'No'.
                Only output the metrics, no other comments:

                {metadata}

                End of information.
                """

    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": "You are a helpful assistant. Use the instructions provided to process the information."},
            {"role": "user", "content": llm_input}
        ]
    )

    return response.choices[0].message.content

Note: Metadata may need to be converted into more natural language if that kind of query should be supported.
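
One way to make the stored text more natural without calling an LLM is a plain template. The sketch below is an assumption about what such a conversion could look like, not what this notebook does; `describe_player` is a hypothetical helper and the sentence wording is arbitrary.

```python
import json


def describe_player(metadata_json):
    """Render one JSON stats record as an English sentence."""
    m = json.loads(metadata_json)
    # The :g format drops the trailing ".0" from values such as 161.0
    return (
        f"{m['Player']} ({m['Pos']}, {m['Tm']}) played {m['G']:g} games in the "
        f"{m['Year']} {m['League']} season, scoring {m['PTS']:g} points with "
        f"{m['AST']:g} assists and {m['PF']:g} personal fouls."
    )


record = '{"Player":"John Abramovic","Pos":"F","Tm":"PIT","G":47,"AST":35,"PF":161.0,"PTS":527.0,"League":"BAA","Year":1947}'
print(describe_player(record))
```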

In [None]:
import uuid

# record #1

# could be a player id
uuid_rec1 = str(uuid.uuid1())

metadata = """{"Player":"John Abramovic","Pos":"F","Age":27,"Tm":"PIT","G":47,"FG":202,"FGA":834,"FGPercentage":0.242,"2P":202,"2PA":834,"2PPercentage":0.242,"eFGPercentage":0.242,"FT":123,"FTA":178,"FTPercentage":0.691,"AST":35,"PF":161.0,"PTS":527.0,"League":"BAA","Year":1947}"""

playerPerformace_rec1 = generate_llm_reponse(metadata)
print(playerPerformace_rec1)
vectorData = "{} : {}".format(metadata, playerPerformace_rec1)

vector_embedding_rec1 = client.embeddings.create(
        input=playerPerformace_rec1,
        model=embedding_model_name,
    ).data[0].embedding


# record #2

uuid_rec2 = str(uuid.uuid1())

metadata = """{"Player":"Chet Aubuchon","Pos":"G","Age":30,"Tm":"DTF","G":30,"FG":23,"FGA":91,"FGPercentage":0.253,"2P":23,"2PA":91,"2PPercentage":0.253,"eFGPercentage":0.253,"FT":19,"FTA":35,"FTPercentage":0.543,"AST":20,"PF":46.0,"PTS":65.0,"League":"BAA","Year":1947}"""

playerPerformace_rec2 = generate_llm_reponse(metadata)
print(playerPerformace_rec2)
vectorData = "{} : {}".format(metadata, playerPerformace_rec2)

vector_embedding_rec2 = client.embeddings.create(
        input=playerPerformace_rec2,
        model=embedding_model_name,
    ).data[0].embedding


# record #3

uuid_rec3 = str(uuid.uuid1())

metadata = """{"Player":"Norm Baker","Pos":"G","Age":23,"Tm":"CHS","G":4,"FG":0,"FGA":1,"FGPercentage":0.0,"2P":0,"2PA":1,"2PPercentage":0.0,"eFGPercentage":0.0,"FT":0,"FTA":0,"AST":0,"PF":0.0,"PTS":0.0,"League":"BAA","Year":1947}"""

playerPerformace_rec3 = generate_llm_reponse(metadata)
print(playerPerformace_rec3)
vectorData = "{} : {}".format(metadata, playerPerformace_rec3)

vector_embedding_rec3 = client.embeddings.create(
        input=playerPerformace_rec3,
        model=embedding_model_name,
    ).data[0].embedding

# record #4

uuid_rec4 = str(uuid.uuid1())

metadata = """{"Player": "Moe Becker", "Pos": "G-F", "Age": 29.0, "Tm": "TOT", "G": 43.0, "FG": 70.0, "FGA": 358.0, "FGPercentage": 0.196, "2P": 70.0, "2PA": 358.0, "2PPercentage": 0.196, "eFGPercentage": 0.196, "FT": 22.0, "FTA": 44.0, "AST": 30.0, "PF": 98.0, "PTS": 162.0, "League": "BAA", "Year": 1947}"""

playerPerformace_rec4 = generate_llm_reponse(metadata)
print(playerPerformace_rec4)
vectorData = "{} : {}".format(metadata, playerPerformace_rec4)

vector_embedding_rec4 = client.embeddings.create(
        input=playerPerformace_rec4,
        model=embedding_model_name,
    ).data[0].embedding

# record #5

uuid_rec5 = str(uuid.uuid1())

metadata = """{"Player": "Hank Beenders", "Pos": "C", "Age": 30.0, "Tm": "PRO", "G": 58.0, "FG": 266.0, "FGA": 1016.0, "FGPercentage": 0.262, "2P": 266.0, "2PA": 1016.0, "2PPercentage": 0.262, "eFGPercentage": 0.262, "FT": 181.0, "FTA": 257.0, "AST": 37.0, "PF": 196.0, "PTS": 713.0, "League": "BAA", "Year": 1947}"""

playerPerformace_rec5 = generate_llm_reponse(metadata)
print(playerPerformace_rec5)
vectorData = "{} : {}".format(metadata, playerPerformace_rec5)

vector_embedding_rec5 = client.embeddings.create(
        input=playerPerformace_rec5,
        model=embedding_model_name,
    ).data[0].embedding

# record #6

uuid_rec6 = str(uuid.uuid1())

metadata = """{"Player": "Hank Beenders", "Pos": "F", "Age": 31.0, "Tm": "TOT", "G": 45.0, "FG": 76.0, "FGA": 269.0, "FGPercentage": 0.283, "2P": 76.0, "2PA": 269.0, "2PPercentage": 0.283, "eFGPercentage": 0.283, "FT": 51.0, "FTA": 82.0, "FTPercentage": 0.622, "AST": 13.0, "PF": 99.0, "PTS": 203.0, "League": "BAA", "Year": 1948}"""

playerPerformace_rec6 = generate_llm_reponse(metadata)
print(playerPerformace_rec6)
vectorData = "{} : {}".format(metadata, playerPerformace_rec6)

vector_embedding_rec6 = client.embeddings.create(
        input=playerPerformace_rec6,
        model=embedding_model_name,
    ).data[0].embedding
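
The six records above repeat the same three steps: summarize the record, embed the summary, keep both. A sketch of how this could be folded into one function follows. The `summarize` and `embed` arguments are injected so the example runs offline; `build_document`, `fake_summarize` and `fake_embed` are hypothetical names, and the stand-ins do not call OpenAI.

```python
import json
import uuid


def build_document(metadata_json, summarize, embed):
    """Build one insert-ready document from a JSON stats record.

    `summarize` maps the raw record to text; `embed` maps text to a vector.
    """
    fields = json.loads(metadata_json)
    summary = summarize(metadata_json)
    return {
        "_id": str(uuid.uuid1()),
        **fields,
        "PlayerPerformace": summary,  # field name spelled as in this notebook
        "$vector": embed(summary),
    }


# Stand-ins so the sketch runs offline; they are NOT the real LLM or embedding calls
def fake_summarize(metadata_json):
    return f"summary of {json.loads(metadata_json)['Player']}"


def fake_embed(text):
    return [float(len(text)), 0.0, 1.0]


records = [
    '{"Player": "Norm Baker", "Tm": "CHS", "PTS": 0.0}',
    '{"Player": "Moe Becker", "Tm": "TOT", "PTS": 162.0}',
]
documents = [build_document(r, fake_summarize, fake_embed) for r in records]
print([d["Player"] for d in documents])
```

In the notebook, `summarize` would be `generate_llm_reponse` and `embed` a thin wrapper around `client.embeddings.create(...).data[0].embedding`. Because the stats fields come straight from the JSON record, they do not have to be typed a second time when the documents are inserted.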


### Insert multiple documents

In [None]:
v_doc_list = [
    {
        "_id": uuid_rec1,
        "Player":"John Abramovic",
        "Pos":"F",
        "Age":27,
        "Tm":"PIT",
        "G":47,
        "FG":202,
        "FGA":834,
        "FGPercentage":0.242,
        "2P":202,
        "2PA":834,
        "2PPercentage":0.242,
        "eFGPercentage":0.242,
        "FT":123,
        "FTA":178,
        "FTPercentage":0.691,
        "AST":35,
        "PF":161,
        "PTS":527,
        "League":"BAA",
        "Year":1947,
        "PlayerPerformace": playerPerformace_rec1,
        "$vector": vector_embedding_rec1
    },
    {
        "_id": uuid_rec2,
        "Player":"Chet Aubuchon",
        "Pos":"G",
        "Age":30,
        "Tm":"DTF",
        "G":30,
        "FG":23,
        "FGA":91,
        "FGPercentage":0.253,
        "2P":23,
        "2PA":91,
        "2PPercentage":0.253,
        "eFGPercentage":0.253,
        "FT":19,
        "FTA":35,
        "FTPercentage":0.543,
        "AST":20,
        "PF":46,
        "PTS":65,
        "League":"BAA",
        "Year":1947,
        "PlayerPerformace": playerPerformace_rec2,
        "$vector": vector_embedding_rec2,
    },
    {
        "_id": uuid_rec3,
        "Player":"Norm Baker",
        "Pos":"G",
        "Age":23,
        "Tm":"CHS",
        "G":4,
        "FG":0,
        "FGA":1,
        "FGPercentage":0.0,
        "2P":0,
        "2PA":1,
        "2PPercentage":0.0,
        "eFGPercentage":0.0,
        "FT":0,
        "FTA":0,
        "AST":0,
        "PF":0,
        "PTS":0,
        "League":"BAA",
        "Year":1947,
        "PlayerPerformace": playerPerformace_rec3,
        "$vector": vector_embedding_rec3,
    },
    {
        "_id": uuid_rec4,
        "Player": "Moe Becker",
        "Pos": "G-F",
        "Age": 29.0,
        "Tm": "TOT",
        "G": 43.0,
        "FG": 70.0,
        "FGA": 358.0,
        "FGPercentage": 0.196,
        "2P": 70.0,
        "2PA": 358.0,
        "2PPercentage": 0.196,
        "eFGPercentage": 0.196,
        "FT": 22.0,
        "FTA": 44.0,
        "AST": 30.0,
        "PF": 98.0,
        "PTS": 162.0,
        "League": "BAA",
        "Year": 1947,
        "PlayerPerformace": playerPerformace_rec4,
        "$vector": vector_embedding_rec4,
    },
    {
        "_id": uuid_rec5,
        "Player": "Hank Beenders",
        "Pos": "C",
        "Age": 30.0,
        "Tm": "PRO",
        "G": 58.0,
        "FG": 266.0,
        "FGA": 1016.0,
        "FGPercentage": 0.262,
        "2P": 266.0,
        "2PA": 1016.0,
        "2PPercentage": 0.262,
        "eFGPercentage": 0.262,
        "FT": 181.0,
        "FTA": 257.0,
        "AST": 37.0,
        "PF": 196.0,
        "PTS": 713.0,
        "League": "BAA",
        "Year": 1947,
        "PlayerPerformace": playerPerformace_rec5,
        "$vector": vector_embedding_rec5,
    },
    {
        "_id": uuid_rec6,
        "Player": "Hank Beenders",
        "Pos": "C",
        "Age": 31.0,
        "Tm": "TOT",
        "G": 45.0,
        "FG": 76.0,
        "FGA": 269.0,
        "FGPercentage": 0.283,
        "2P": 76.0,
        "2PA": 269.0,
        "2PPercentage": 0.283,
        "eFGPercentage": 0.283,
        "FT": 51.0,
        "FTA": 82.0,
        "FTPercentage": 0.622,
        "AST": 13.0,
        "PF": 99.0,
        "PTS": 203.0,
        "League": "BAA",
        "Year": 1948,
        "PlayerPerformace": playerPerformace_rec6,
        "$vector": vector_embedding_rec6,
    }

]

response = collection.insert_many(v_doc_list)
print(response)

## Find documents

You can look up documents with any (non-vector) filter clause, for example by player name:

In [None]:
document = collection.find_one(filter={"Player": "Chet Aubuchon"})
print(document)

### Find by vector similarity

By default, the `$similarity` field is returned with each document (note the decreasing order):

In [None]:
query_vector = client.embeddings.create(
                    input='Team PIT',
                    model=embedding_model_name,
                ).data[0].embedding

documents = collection.vector_find(query_vector,include_similarity=True, fields=["Player","Tm","League", "PF"], limit=10)
for document in documents:
    print(f"\n{document}")
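
To see what the score means, here is a small sketch of cosine similarity and of the decreasing-order ranking, in plain Python. It assumes the collection uses the cosine metric; the score Astra DB reports may be rescaled relative to the raw cosine computed here, so treat the numbers as illustrative. The three-component vectors and the `cosine` helper are made up for the example.

```python
import math


def cosine(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


query = [1.0, 0.0, 0.0]
stored = {
    "same direction": [2.0, 0.0, 0.0],
    "orthogonal": [0.0, 3.0, 0.0],
    "opposite": [-1.0, 0.0, 0.0],
}

# Rank the stored vectors from most to least similar, as a vector search does
ranked = sorted(stored, key=lambda name: cosine(query, stored[name]), reverse=True)
print(ranked)  # ['same direction', 'orthogonal', 'opposite']
```

Note that the vector's length does not matter, only its direction: `[2.0, 0.0, 0.0]` scores the same as the query itself would.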

You can specify which **fields** you'll get back and/or whether you need the **similarity** as well:

In [None]:
documents = collection.vector_find(
    query_vector,
    limit=10,
    fields=["Player","Tm","League", "PF"],  # only these fields are returned
    include_similarity=False,
)
for document in documents:
    print(f"\n{document}")

You can compound with other `filter` clauses, effectively implementing **metadata filtering** on your vector searches:

In [None]:
documents = collection.vector_find(
    query_vector,
    limit=10,
    filter={'Tm': 'CHS'},
)
for document in documents:
    print(f"\n{document}")
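
Conceptually, a filtered vector search keeps only the documents matching the filter and ranks the survivors by similarity. The following in-memory sketch mimics that behavior for equality filters; it is not how Astra DB executes the query. `filtered_search`, its `criteria` parameter (playing the role of `filter`) and the toy two-component vectors are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def filtered_search(docs, query_vector, criteria=None, limit=10):
    """Keep docs whose fields equal every criteria value, then rank by similarity."""
    criteria = criteria or {}
    matches = [d for d in docs if all(d.get(k) == v for k, v in criteria.items())]
    matches.sort(key=lambda d: cosine(query_vector, d["$vector"]), reverse=True)
    return matches[:limit]


docs = [
    {"Player": "Norm Baker", "Tm": "CHS", "$vector": [0.0, 1.0]},
    {"Player": "Moe Becker", "Tm": "TOT", "$vector": [1.0, 0.0]},
    {"Player": "Hank Beenders", "Tm": "TOT", "$vector": [0.6, 0.8]},
]
hits = filtered_search(docs, [1.0, 0.0], criteria={"Tm": "TOT"})
print([d["Player"] for d in hits])  # ['Moe Becker', 'Hank Beenders']
```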

These options are supported for the `vector_find_one` method as well:

In [None]:
document = collection.vector_find_one(
    query_vector,
    fields=["PlayerPerformace"],
    include_similarity=True,  # not really necessary since True is the default
)
print(document)

## Delete a collection

Run this step last: the UI demo at the end of this notebook still queries `players_collection`.

In [30]:
response = astra_db.delete_collection("players_collection")
print(response)

{'status': {'ok': 1}}


## UI Demo

In [None]:
prompt_template_str = """Human: Use the following pieces of context to provide a concise answer to the question at the end.
                      If you don't know the answer, just say that you don't know, don't try to make up an answer.

                      <context>
                      {context}
                      </context>

                      Question: {question}

                      Assistant:"""

def answer_question_openai(question: str, verbose: bool = False) -> str:
    if verbose:
        print(f"\n[answer_question] Question: {question}")
    # Retrieval of the most relevant stored documents from the vector store:

    query_vector = client.embeddings.create(
                    input=question,
                    model=embedding_model_name,
                ).data[0].embedding

    context_docs = collection.vector_find(
        query_vector,
        limit=5,
        fields=["PlayerPerformace"],  # only the summary text is needed as context
        include_similarity=False,
    )
    context = "\n".join(doc['PlayerPerformace'] for doc in context_docs)
    if verbose:
        print("\n[answer_question] Context:")
        print(context)
    # Filling the prompt template with the current values
    llm_prompt_str = prompt_template_str.format(
        context=context,
        question=question
    )

    response = client.chat.completions.create(
        model="gpt-4",
        temperature=0,
        messages=[
            {"role": "system", "content": "You are a helpful assistant. Use the instructions provided to process the information."},
            {"role": "user", "content": llm_prompt_str}
        ]
    )
    if verbose:
        print(response)
    return response.choices[0].message.content
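
The prompt template above carries the notebook's indentation into the text sent to the model. Below is a sketch of one way to avoid that with `textwrap.dedent` from the standard library, together with the context-joining step in isolation. The `build_prompt` helper and the sample context strings are illustrative and do not come from the database.

```python
import textwrap

# dedent strips the common leading whitespace, so the prompt starts at column 0
PROMPT_TEMPLATE = textwrap.dedent("""\
    Human: Use the following pieces of context to provide a concise answer to the question at the end.
    If you don't know the answer, just say that you don't know, don't try to make up an answer.

    <context>
    {context}
    </context>

    Question: {question}

    Assistant:""")


def build_prompt(context_docs, question):
    """Join the retrieved summaries and fill the template."""
    context = "\n".join(doc["PlayerPerformace"] for doc in context_docs)
    return PROMPT_TEMPLATE.format(context=context, question=question)


sample_docs = [
    {"PlayerPerformace": "Points: 527"},
    {"PlayerPerformace": "No Points"},
]
print(build_prompt(sample_docs, "Who scored the most points?"))
```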

In [None]:
import gradio as gr

def predict(message, history):
    response = answer_question_openai(message)
    return response

gr.ChatInterface(predict).launch()