### Retrieval-Augmented Generation
Retrieval-Augmented Generation (RAG) is the process of optimizing the output of a large language model, so it references an authoritative knowledge base outside of its training data sources before generating a response. Large Language Models (LLMs) are trained on vast volumes of data and use billions of parameters to generate original output for tasks like answering questions, translating languages, and completing sentences. RAG extends the already powerful capabilities of LLMs to specific domains or an organization's internal knowledge base, all without the need to retrain the model. It is a cost-effective approach to improving LLM output so it remains relevant, accurate, and useful in various contexts.


Standard imports for the libraries we will be using in this notebook. Try to keep your imports in the first cell so this code can more easily be converted into a Python program later.

In [27]:
#%pip install langchain_cohere -q
#%pip install spacy -q
#%pip install psycopg2-binary -q
#%pip install python-dotenv -q
#ignore error

Note: you may need to restart the kernel to use updated packages.


In [1]:
import boto3
import pandas as pd
import json
import numpy as np
import pyarrow
import time
import os
import traceback
from langchain_community.embeddings.spacy_embeddings import SpacyEmbeddings
from langchain.embeddings import BedrockEmbeddings
from langchain.embeddings import HuggingFaceEmbeddings
from langchain_cohere import CohereEmbeddings
from langchain_core.prompts import ChatPromptTemplate
from langchain_community.chat_models import BedrockChat
from langchain_core.output_parsers import StrOutputParser
from langchain.text_splitter import RecursiveCharacterTextSplitter
import dbconnection
import psycopg2
from psycopg2 import OperationalError
from dotenv import load_dotenv

# Create the AWS client for the Bedrock runtime with boto3
aws_client = boto3.client(service_name="bedrock-runtime")

In [2]:
def limit_string_size(x, max_chars=2048):
    # Check if the input is a string
    if isinstance(x, str):
        return x[:max_chars]
    else:
        return x

In [3]:
def clean_value(value):
    value_str = str(value)
    cleaned_value = ''.join(char for char in value_str if char.isalnum() or char.isspace())
    return cleaned_value

In [4]:
def print_top_values(list_stuff: list, num_items: int) -> None:
    i=0
    for item in list_stuff:
        i=i+1
        if i>num_items:
            return None
        print(item)

#### Cohere

In [14]:
# send in an array size of one and only return the 0th element
def generate_cohere_vector_embedding(text_data):
    input_type = "clustering"
    truncate = "NONE" # optional
    model_id = "cohere.embed-english-v3" # or "cohere.embed-multilingual-v3"
    trunc_data = limit_string_size(text_data)
    # Create the JSON payload for the request
    json_params = {
            'texts': [trunc_data],
            'truncate': truncate, 
            "input_type": input_type
        }
    json_body = json.dumps(json_params)
    params = {'body': json_body, 'modelId': model_id,}
    
    # Invoke the model and parse the JSON response
    result = aws_client.invoke_model(**params)
    response = json.loads(result['body'].read().decode())
    return(np.array(response['embeddings'][0]))


#### Amazon Titan

In [15]:
# Let's generate a dense vector using Amazon Titan with LangChain
def generate_titan_vector_embedding(text):
    #create an Amazon Titan Text Embeddings client
    embeddings_client = BedrockEmbeddings(region_name="us-west-2") 

    #Invoke the model
    embedding = embeddings_client.embed_query(text)
    return(np.array(embedding))



This is the mathematical formula to calculate the cosine similarity between two vectors: the dot product of the vectors divided by the product of their lengths (norms).

In [16]:
def cosine_similarity(vec1, vec2):
    dot_product = np.dot(vec1, vec2)
    norm_vec1 = np.linalg.norm(vec1)
    norm_vec2 = np.linalg.norm(vec2)
    similarity = dot_product / (norm_vec1 * norm_vec2)
    return similarity
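
As a quick sanity check of the formula, here is a small sketch on hand-made vectors. The vectors are made up for illustration and are not real embeddings. Vectors pointing the same way should score 1, perpendicular vectors 0, and opposite vectors -1, regardless of their lengths.

```python
import numpy as np

def cosine_similarity(vec1, vec2):
    # Same formula as the notebook function: dot product over the product of norms
    return np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2))

# Toy vectors, made up for illustration (real embeddings have many more dimensions)
a = np.array([1.0, 2.0, 3.0])
same_direction = 2 * a                    # scaled copy of a
orthogonal = np.array([3.0, 0.0, -1.0])   # dot product with a is 3 + 0 - 3 = 0
opposite = -a

sim_same = cosine_similarity(a, same_direction)
sim_orth = cosine_similarity(a, orthogonal)
sim_opp = cosine_similarity(a, opposite)
print(sim_same, sim_orth, sim_opp)
```

Because the score ignores vector length, a long abstract and a short query can still match well as long as their embeddings point in a similar direction.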



In [17]:
# Let's generate a dense vector using Amazon Titan without using a np.array as a return value
def generate_vector_embedding(text):
    #create an Amazon Titan Text Embeddings client
    embeddings_client = BedrockEmbeddings(region_name="us-west-2") 

    #Invoke the model
    embedding = embeddings_client.embed_query(text)
    # Note: pgvector wants a plain Python list here, not the np.array our manual method used
    return(embedding)

#### Storing Data for Retrieval
In order to do semantic search and retrieve relevant content, we need to store that content for later use. We can store the embeddings in several different persistence technologies. To start simply, let's store the data in memory using a pandas dataframe.

Using Kaggle, I looked for an interesting dataset with data I could use to answer specific questions that are likely not covered by the large corpus the model was trained on. Let's put the data into a pandas dataframe and examine it.

In [18]:
# clean abstract text
df = pd.read_csv('data/latest_research_articles.csv')
df['abstract'] = df['abstract'].apply(clean_value)

df

Unnamed: 0,title,abstract,doi,citations,accesses,online_attention,published_datetime,tweeters,blogs,facebook_pages,news_outlets,redditors,video_uploaders,wikipedia_page,mendeley,Topic
0,Estimates of the reproduction ratio from epide...,Accurate estimates of the reproduction ratio a...,https://doi.org/10.1038/s41567-024-02471-7,0,0,0,25 April 2024,5,0,0,0,0,0,0,0,Physics
1,Spin Berry curvature-enhanced orbital Zeeman e...,Berry phases and the related concept of Berry ...,https://doi.org/10.1038/s41567-024-02487-z,0,801,1,22 April 2024,1,0,0,0,0,0,0,0,Physics
2,Room-temperature flexible manipulation of the ...,The quantum metric and Berry curvature are two...,https://doi.org/10.1038/s41567-024-02476-2,0,1029,53,22 April 2024,14,0,1,7,0,0,0,0,Physics
3,Irreversible entropy transport enhanced by fer...,The nature of particle and entropy flow betwee...,https://doi.org/10.1038/s41567-024-02483-3,0,636,1,22 April 2024,2,0,0,0,0,0,0,0,Physics
4,Penning-trap measurement of the Q value of ele...,The investigation of the absolute scale of the...,https://doi.org/10.1038/s41567-024-02461-9,0,2025,105,19 April 2024,6,1,0,14,0,0,0,0,Physics
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4179,Short-term creep approach to redefining the ro...,The creep response of the 174PH martensitic ag...,https://doi.org/10.1038/s41598-024-58273-7,0,134,0,09 April 2024,0,0,0,0,0,0,0,0,Engineering
4180,Jamming precoding in AF relay-aided PLC system...,Enhancing information security has become incr...,https://doi.org/10.1038/s41598-024-58735-y,0,136,0,09 April 2024,0,0,0,0,0,0,0,0,Engineering
4181,Frequency-hopping wave engineering with metasu...,Wave phenomena can be artificially engineered ...,https://doi.org/10.1038/s41467-023-44627-8,1,3634,10,03 January 2024,1,0,0,1,0,0,0,5,Engineering
4182,The effect of floating spline parameter on the...,The load sharing performance of encased differ...,https://doi.org/10.1038/s41598-024-59094-4,0,148,0,09 April 2024,0,0,0,0,0,0,0,0,Engineering


#### Store embedded values in memory
Now let's create a new column that represents the article abstract as a vector embedding

In [19]:
# Generate embeddings using Titan
# Add a new column 'embedded_abstract' by applying the function to an existing column
# This step takes a while so I did it for you and saved the output as pickle
#dft = df.copy()
#dft['embedded_abstract'] = dft['abstract'].apply(generate_titan_vector_embedding)
#dft.to_pickle('data/embedded_df.pkl')

# Easier way
dft = pd.read_pickle('data/embedded_df.pkl')
dft

Unnamed: 0,title,abstract,doi,citations,accesses,online_attention,published_datetime,tweeters,blogs,facebook_pages,news_outlets,redditors,video_uploaders,wikipedia_page,mendeley,Topic,embedded_abstract
0,Estimates of the reproduction ratio from epide...,Accurate estimates of the reproduction ratio a...,https://doi.org/10.1038/s41567-024-02471-7,0,0,0,25 April 2024,5,0,0,0,0,0,0,0,Physics,"[0.49804688, 0.09375, 0.041259766, -0.14941406..."
1,Spin Berry curvature-enhanced orbital Zeeman e...,Berry phases and the related concept of Berry ...,https://doi.org/10.1038/s41567-024-02487-z,0,801,1,22 April 2024,1,0,0,0,0,0,0,0,Physics,"[0.17382812, -0.095214844, -0.06225586, -0.160..."
2,Room-temperature flexible manipulation of the ...,The quantum metric and Berry curvature are two...,https://doi.org/10.1038/s41567-024-02476-2,0,1029,53,22 April 2024,14,0,1,7,0,0,0,0,Physics,"[0.45507812, -0.21386719, -0.36914062, 0.01879..."
3,Irreversible entropy transport enhanced by fer...,The nature of particle and entropy flow betwee...,https://doi.org/10.1038/s41567-024-02483-3,0,636,1,22 April 2024,2,0,0,0,0,0,0,0,Physics,"[0.18066406, -0.30664062, -0.24414062, -0.0092..."
4,Penning-trap measurement of the Q value of ele...,The investigation of the absolute scale of the...,https://doi.org/10.1038/s41567-024-02461-9,0,2025,105,19 April 2024,6,1,0,14,0,0,0,0,Physics,"[0.23925781, 0.49804688, 0.38867188, 0.2910156..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4179,Short-term creep approach to redefining the ro...,The creep response of the 174PH martensitic ag...,https://doi.org/10.1038/s41598-024-58273-7,0,134,0,09 April 2024,0,0,0,0,0,0,0,0,Engineering,"[0.5625, 0.5703125, -0.123046875, 0.013916016,..."
4180,Jamming precoding in AF relay-aided PLC system...,Enhancing information security has become incr...,https://doi.org/10.1038/s41598-024-58735-y,0,136,0,09 April 2024,0,0,0,0,0,0,0,0,Engineering,"[0.39453125, -0.6796875, -0.609375, 0.22460938..."
4181,Frequency-hopping wave engineering with metasu...,Wave phenomena can be artificially engineered ...,https://doi.org/10.1038/s41467-023-44627-8,1,3634,10,03 January 2024,1,0,0,1,0,0,0,5,Engineering,"[0.057861328, -0.62890625, 0.14746094, 0.32421..."
4182,The effect of floating spline parameter on the...,The load sharing performance of encased differ...,https://doi.org/10.1038/s41598-024-59094-4,0,148,0,09 April 2024,0,0,0,0,0,0,0,0,Engineering,"[0.3125, 0.0013885498, -0.061767578, 0.0016937..."


### Retrieval from embedded sources
Now that we have a dataframe with embedded content of interest, we can use semantic similarity to retrieve the right data to feed to an LLM

In [20]:
# Let's setup a query that a user might ask
query = "What is the latest research for broken ribs in children"

#### Let's send a query in and see how its vector embedding compares with the data

In [13]:
# Let's search our records for a good semantic search
query_vector = generate_titan_vector_embedding(query)

results = []
# Iterate over each row in the DataFrame
for index, row in dft.iterrows():
    # Extract the value from the specified column
    article_embedding = row['embedded_abstract']
    results.append((index, cosine_similarity(article_embedding, query_vector)))
    #print (index, value)

results.sort(key=lambda x: x[1], reverse=True)
# Print the ten best matches
print("Here are a few articles that may match your interest:")
for index, score in results[:10]:
    # index is the DataFrame label returned by iterrows(), so look it up with .loc
    article_title = dft.loc[index, 'title']
    print(f"Abstract: '{article_title}' with a cosine match of: {score}")



Here are a few articles that may match your interest:
Abstract: 'High sensitivity methods for automated rib fracture detection in pediatric radiographs' with a cosine match of: 0.46759500534733645
Abstract: 'Magnetic resonance imaging based finite element modelling of the proximal femur: a short-term in vivo precision study' with a cosine match of: 0.23057253235831165
Abstract: 'On the crashworthiness analysis of bio-inspired DNA tubes' with a cosine match of: 0.21674978249516894
Abstract: 'Reproduction of forearm rotation dynamic using intensity-based biplane 2D–3D registration matching method' with a cosine match of: 0.1990790331354664
Abstract: 'Propagation of extended fractures by local nucleation and rapid transverse expansion of crack-front distortion' with a cosine match of: 0.19056102062916996
Abstract: 'A virtual simulation approach to assess the effect of trocar-site placement and scar characteristics on the abdominal wall biomechanics' with a cosine match of: 0.1882334353573
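
The row-by-row loop above is easy to follow, but the same ranking can be computed in one matrix operation. The following is a minimal sketch of that idea, not the notebook's method: `top_k_by_cosine` is a hypothetical helper, and the toy vectors stand in for the `embedded_abstract` column.

```python
import numpy as np

def top_k_by_cosine(embeddings, query_vector, k=10):
    """Return (row_position, score) pairs for the k most similar rows."""
    matrix = np.asarray(embeddings, dtype=float)    # shape (n_rows, dim)
    query = np.asarray(query_vector, dtype=float)   # shape (dim,)
    # Cosine similarity of every row against the query in a single step
    scores = matrix @ query / (np.linalg.norm(matrix, axis=1) * np.linalg.norm(query))
    order = np.argsort(scores)[::-1][:k]            # highest score first
    return [(int(i), float(scores[i])) for i in order]

# Toy data standing in for the embedded abstracts (made up for illustration)
toy_embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
toy_query = [1.0, 0.1]
top_matches = top_k_by_cosine(toy_embeddings, toy_query, k=2)
print(top_matches)
```

With the notebook's data, a call along the lines of `top_k_by_cosine(list(dft['embedded_abstract']), query_vector)` should work, assuming every row holds a vector of the same length. The returned positions are row positions, so they pair with `dft.iloc`.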

#### Cohere Embedding

Each of these models also has a [max input tokens limit](https://docs.aws.amazon.com/bedrock/latest/userguide/model-parameters-embed.html), which determines the largest amount of text we can embed in a single request.

[https://cohere.com/blog/introducing-embed-v3](https://cohere.com/blog/introducing-embed-v3)

In [21]:
# Generate embeddings using Cohere
# Add a new column 'embedded_abstract' by applying the function to an existing column
# This step takes a while so I did it for you and saved the output as pickle
#df['embedded_abstract'] = df['abstract'].apply(generate_cohere_vector_embedding)
#df.to_pickle('data/cohere_embedded.pkl')

# Easier way: load the saved embeddings
dfc = pd.read_pickle('data/cohere_embedded.pkl')

In [22]:
# Let's search our records for a good semantic search
query_vector = generate_cohere_vector_embedding(query)

results = []
# Iterate over each row in the DataFrame
for index, row in dfc.iterrows():
    # Extract the value from the specified column
    article_embedding = row['embedded_abstract']
    results.append((index, cosine_similarity(article_embedding, query_vector)))
    #print (index, value)

results.sort(key=lambda x: x[1], reverse=True)
# Print the ten best matches
print("Here are a few articles that may match your interest:")
for index, score in results[:10]:
    # index is the DataFrame label returned by iterrows(), so look it up with .loc
    article_title = dfc.loc[index, 'title']
    print(f"Abstract: '{article_title}' with a cosine match of: {score}")

Here are a few articles that may match your interest:
Abstract: 'High sensitivity methods for automated rib fracture detection in pediatric radiographs' with a cosine match of: 0.5572853714312022
Abstract: 'Magnetic resonance imaging based finite element modelling of the proximal femur: a short-term in vivo precision study' with a cosine match of: 0.3721272671225773
Abstract: 'Non-invasive biomarkers for detecting progression toward hypovolemic cardiovascular instability in a lower body negative pressure model' with a cosine match of: 0.3536668260710724
Abstract: 'A machine learning-based multiscale model to predict bone formation in scaffolds' with a cosine match of: 0.3511103993742184
Abstract: 'Application of machine learning algorithms for accurate determination of bilirubin level on in vitro engineered tissue phantom images' with a cosine match of: 0.33538364674943555
Abstract: 'Screen-printed flexible MRI receive coils' with a cosine match of: 0.3213033149410693
Abstract: 'Deformation a

### Vector Database for Larger Datasets
For our first example we used a local dataframe to store our raw text and embeddings, then manually calculated the similarity between embeddings. Other tools are better suited for larger datasets with embeddings: vector databases! We will explore pgvector running on a PostgreSQL database engine.

### Switch from a dataframe to a database
Just as we used a dataframe earlier, we will now use a database instead. This allows us to store many more records and persist them to disk rather than keeping them only in memory. Databases are much more efficient at storing large sets of data and can calculate cosine similarity more efficiently.


In [23]:
# Let's review our IP address in case we have any issues accessing the DB
from requests import get

ip = get('https://api.ipify.org').content.decode('utf8')
print('My public IP address is: {}'.format(ip))

My public IP address is: 71.93.45.201


### IMPORTANT
Configure this item for your data.  We are sharing a Vector DB and this value will segment your data!

In [24]:
# Fill in your username so your database entries will be unique to you
#MY_USERNAME = "dkraker@calpoly.edu"
MY_USERNAME = "wli64@calpoly.edu"

In [25]:
def insert_record_into_db(table_name, record, conn):

    # Dynamically generates and executes an INSERT SQL statement for PostgreSQL, handling
    # special data types like datetime objects and arrays directly.
    
    # Args:
    # - table_name (str): The name of the table into which the record will be inserted.
    # - record (dict): A dictionary representing the record to be inserted, where keys are column names
    #                  and values can include native PostgreSQL types like datetime and arrays.
    # - conn (psycopg2.connection): A psycopg2 connection object.
    
    # Generate column names and placeholders
    columns = ', '.join(record.keys())
    placeholders = ', '.join(['%s'] * len(record))  # PostgreSQL uses %s as placeholder

    # Create the INSERT INTO statement
    sql = f'INSERT INTO {table_name} ({columns}) VALUES ({placeholders})'
    # print("columns=", columns)
    # print("placeholders=", placeholders)
    # print(sql)
    try:
        cur = conn.cursor()
        
        cur.execute(sql, tuple(record.values()))
        conn.commit()
        print("Record inserted successfully.")
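    except psycopg2.Error as e:
        # Roll back the failed transaction so the connection stays usable for the next insert
        conn.rollback()
        print("An add record error occurred:", e)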
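
To make the dynamic statement concrete, here is a small sketch that builds the same SQL string without touching a database. `build_insert_sql` is a hypothetical helper that mirrors the string-building lines of `insert_record_into_db`, and the sample record is made up for illustration.

```python
def build_insert_sql(table_name, record):
    """Build the parameterized INSERT statement and its tuple of values."""
    columns = ', '.join(record.keys())
    placeholders = ', '.join(['%s'] * len(record))  # one %s per column
    sql = f'INSERT INTO {table_name} ({columns}) VALUES ({placeholders})'
    return sql, tuple(record.values())

# Hypothetical record using column names from this notebook
sample_record = {
    "username": "student@example.com",
    "textattribute1": "some chunk of text",
    "textembedding1": [0.1, 0.2, 0.3],
}
sql, values = build_insert_sql("rag", sample_record)
print(sql)
print(values)
```

Only the values travel as query parameters. The table and column names are formatted straight into the string, so they should come from trusted code rather than from user input.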

In [26]:
# Let's load a staff report for the city of SLO that contains 10 pages of data.  We will chunk the report into 1024 character chunks
# with a 256 character overlap.  We will insert each chunk into the DB with embeddings for similarity search.
with open("data/staff-report.txt", "r") as file:
    text = file.read()

# Create a text splitter
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1024,  # Adjust the chunk size as needed
    chunk_overlap=256,  # Adjust the overlap between chunks as needed
)

# Split the text into chunks
chunks = text_splitter.split_text(text)

# Open one connection to our DB and insert a record for each chunk into our generic schema

conn = dbconnection.open_connection_to_db()
try:
    # Build a record for each chunk: the raw text plus its embedding
    for chunk in chunks:
        data_record = {}
        v_embed = generate_vector_embedding(chunk)
        data_record["username"] = MY_USERNAME
        data_record["textattribute1"] = chunk
        data_record["textattribute2"] = ""
        data_record["textattribute3"] = ""
        data_record["textattribute4"] = ""
        data_record["textattribute5"] = ""
        data_record["textembedding1"] = v_embed
        insert_record_into_db("rag", data_record, conn)        
        
except Exception as e:
    print(f"An error occurred while inserting records: {e}")
finally:
    if conn:
        conn.close()

Error connecting to database: connection to server at "gsb570-instance-1.cfrzn1oim1as.us-west-2.rds.amazonaws.com" (44.246.111.77), port 5432 failed: Operation timed out
	Is the server running on that host and accepting TCP/IP connections?





An error occurred while inserting records: 'NoneType' object has no attribute 'cursor'
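
To see what `chunk_size` and `chunk_overlap` mean, here is a simplified sketch that cuts text into fixed windows sharing a few characters. It is only an illustration of the overlap idea: `RecursiveCharacterTextSplitter` prefers to split on separators such as paragraphs and sentences, so its chunks will differ. `sliding_window_chunks` is a hypothetical helper.

```python
def sliding_window_chunks(text, chunk_size, chunk_overlap):
    """Split text into fixed-size windows that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks

sample_text = "abcdefghijklmnopqrstuvwxyz"
toy_chunks = sliding_window_chunks(sample_text, chunk_size=10, chunk_overlap=4)
print(toy_chunks)
# ['abcdefghij', 'ghijklmnop', 'mnopqrstuv', 'stuvwxyz']
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of storing some text twice.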


#### Review our data in the DB
Now that we've chunked and inserted our data let's do a quick check

In [None]:
def view_my_data(conn):
    try:
        cur = conn.cursor()

        # SQL statement to select the rows that belong to MY_USERNAME
        sql = "SELECT textattribute1, textattribute2, textattribute3, textattribute4, textattribute5 FROM public.rag WHERE username = %s"

        # Execute the SQL statement, passing the username as a query parameter
        cur.execute(sql, (MY_USERNAME,))
        rows = cur.fetchall()

        # Print the text attributes of each stored chunk
        for row in rows:
            print(row[0], row[1], row[2], row[3], row[4])

    except psycopg2.Error as e:
        print("An error occurred:", e)
    finally:
        if conn:
            conn.close()
    

In [None]:
conn = dbconnection.open_connection_to_db()
view_my_data(conn)

#### Now that our data is in the DB we are ready to retrieve
Something we need to consider during retrieval is how many records we think are important. We have a few choices:
- Top N results
- Threshold
- Fall-off values<br>
Let's explore Z-score as a first approach<br>
![Z-Score Example](./images/zscore-dist.jpg)
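
Before turning to Z-scores, the three choices listed above can be sketched in a few lines. These helpers are hypothetical and the cutoff values (`n=3`, `0.2`, `0.1`) are arbitrary picks for illustration. The scores are the rounded cosine matches from the Titan search earlier in this notebook.

```python
def select_top_n(scores, n):
    """Keep the first n scores (scores are assumed sorted, best first)."""
    return scores[:n]

def select_by_threshold(scores, min_score):
    """Keep every score at or above a fixed cutoff."""
    return [s for s in scores if s >= min_score]

def select_until_fall_off(scores, max_drop):
    """Keep scores until the gap to the previous score exceeds max_drop."""
    kept = scores[:1]
    for previous, current in zip(scores, scores[1:]):
        if previous - current > max_drop:
            break  # a big drop suggests the remaining matches are much weaker
        kept.append(current)
    return kept

# Rounded cosine scores from the Titan search, best first
scores = [0.468, 0.231, 0.217, 0.199, 0.191, 0.188]
top_n = select_top_n(scores, 3)
above = select_by_threshold(scores, 0.2)
before_drop = select_until_fall_off(scores, 0.1)
print(top_n, above, before_drop)
```

Top N always returns the same number of records even when most are weak, a threshold needs tuning per embedding model, and the fall-off rule adapts to the shape of each result list. Here it keeps only the first score because the gap to the second is large.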

In [None]:
def calculate_zscores(cosine_scores):
    # Calculate the mean of the sample points
    mean = np.mean(cosine_scores)
    # Calculate the standard deviation of the sample points
    std_deviation = np.std(cosine_scores, ddof=1)  # ddof=1 for sample standard deviation
    # Calculate the z-score for each sample point
    z_scores = [(x - mean) / std_deviation for x in cosine_scores]

    return z_scores
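
As a worked example, the sketch below applies the same z-score calculation to the rounded cosine scores from the Titan search earlier in this notebook, then keeps the scores whose z-score is more than half of the best one. That half-of-best cutoff is the rule used later in the similarity search function; whether it suits other data is something to test rather than assume.

```python
import numpy as np

def calculate_zscores(cosine_scores):
    # z = (x - mean) / sample standard deviation
    mean = np.mean(cosine_scores)
    std_deviation = np.std(cosine_scores, ddof=1)
    return [(x - mean) / std_deviation for x in cosine_scores]

# Rounded cosine scores from the Titan search, best first
scores = [0.468, 0.231, 0.217, 0.199, 0.191, 0.188]
z_scores = calculate_zscores(scores)

# Keep the scores whose z-score is more than half of the best z-score
cutoff = z_scores[0] / 2
selected = [s for s, z in zip(scores, z_scores) if z > cutoff]
print(selected)
```

Only the first score stands out from the rest, so it is the only one selected. The z-score measures how many standard deviations a score sits above the average, which makes the cutoff relative to each result list instead of a fixed number.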

Now let's use the database to find the best match. Notice the two commented-out SQL expressions: one computes cosine similarity and the other computes Euclidean distance, so we can easily switch between them.

In [None]:
def run_similarity_search_pgvector(question, embedded_text, conn):
    # pgvector operators: <=> is cosine distance, <-> is Euclidean (L2) distance
    # Cosine similarity
    #1 - (textembedding1 <=> %s::vector) AS cosine_similar
    # Euclidean distance
    #textembedding1 <-> %s::vector AS euclidean_distance
    sql = """
        SELECT textattribute1, textattribute2, textattribute3, textattribute4, textattribute5,
               1 - (textembedding1 <=> %s::vector) AS cosine_similar
        FROM public.rag
        WHERE username = %s
        ORDER BY cosine_similar DESC
        LIMIT 50
    """
    try:
        cur = conn.cursor()
        # Pass the values as query parameters instead of formatting them into the SQL string
        cur.execute(sql, (str(embedded_text), MY_USERNAME))
        rows = cur.fetchall()
        cur.close()

        # Nothing stored for this user, so there is nothing to answer from
        if not rows:
            return "Unknown"

        # We need all of the cosine scores to compute a Z-score for each one
        cosine_scores = [row[5] for row in rows]
        z_scores = calculate_zscores(cosine_scores)

        # Keep only the chunks whose Z-score is more than half of the best Z-score
        first_z_score = z_scores[0]
        article_text = ""
        for row, z_score in zip(rows, z_scores):
            if z_score > first_z_score / 2:
                print(f"Using chunk with a cosine match of: {row[5]} and Z-score of: {z_score}")
                article_text = article_text + row[0] + "\n"

        return best_answer(article_text, question)

    except psycopg2.Error as e:
        print("An error occurred:", e)
    finally:
        if conn:
            conn.close()
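
The two pgvector operators used here measure different things: `<=>` returns cosine distance (1 minus cosine similarity) and `<->` returns Euclidean distance. The sketch below mirrors them in NumPy on made-up two-dimensional vectors to show that they can rank the same candidates differently when the vectors are not normalized.

```python
import numpy as np

def cosine_distance(a, b):
    # Mirrors pgvector's <=> operator: 1 - cosine similarity
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def euclidean_distance(a, b):
    # Mirrors pgvector's <-> operator: straight-line (L2) distance
    return np.linalg.norm(a - b)

# Toy vectors, made up for illustration
query = np.array([1.0, 0.0])
short_same_direction = np.array([0.1, 0.0])   # same direction, much shorter
nearby_other_direction = np.array([1.0, 0.5]) # close in space, different direction

cos_short = cosine_distance(query, short_same_direction)
cos_nearby = cosine_distance(query, nearby_other_direction)
l2_short = euclidean_distance(query, short_same_direction)
l2_nearby = euclidean_distance(query, nearby_other_direction)
print(cos_short, cos_nearby, l2_short, l2_nearby)
```

Cosine distance prefers the short vector because it points the same way as the query, while Euclidean distance prefers the nearby one. For embeddings normalized to unit length the two orderings agree, so the choice matters most when vector lengths vary.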
        
  

In [None]:
# Now let's write a function that will take retrieval data and use an LLM to generate a good answer
def best_answer(data, question):
    model_id = "anthropic.claude-3-sonnet-20240229-v1:0"

    model_kwargs =  { 
        "max_tokens": 2048,
        "temperature": 0.0,
        "top_k": 250,
        "top_p": 0.9,
        "stop_sequences": ["\n\nHuman"],
    }

    model = BedrockChat(
        client=aws_client,
        model_id=model_id,
        model_kwargs=model_kwargs,
    )

    human_prompt = (
        "You are to answer the question using the data in the following information. "
        "Do not make up your answer; only use supporting data from the article. "
        "If you don't have enough data, simply respond: I don't have enough information to answer that question. "
        "Given the following article data {data}, please give a concise answer to the following question: {question}"
    )
    messages = [
        ("system", "You are a helpful assistant that can answer questions based on news articles you have been given."),
        ("human", human_prompt),
    ]
    try:
        prompt = ChatPromptTemplate.from_messages(messages)

        chain = prompt | model | StrOutputParser()

        # Chain Invoke
        
    
        # Send the message content to Claude using Bedrock and get the response
        start_time = time.time()  # Start timing
        # Call Bedrock
        response = chain.invoke({"data": data,"question": question})
        end_time = time.time()  # End timing
        #print("Claude call took :", end_time - start_time)  # Calculate execution time

        return(response)
    except Exception as e:
        # Return the full traceback text so the failing line is visible to the caller
        return f"ERROR generating good answer: {e}\n{traceback.format_exc()}"

In [None]:
def purge_my_data(conn):
    
    try:
        cur = conn.cursor()

        # SQL statement to delete the rows that belong to MY_USERNAME
        sql = "DELETE FROM public.rag WHERE username = %s"
    
        # Execute the SQL statement, passing the username as a query parameter
        cur.execute(sql, (MY_USERNAME,))
    
        # Commit the changes to the database
        conn.commit()
    
        # Get the number of affected rows
        deleted_rows = cur.rowcount
        print(f"{deleted_rows} row(s) deleted.")

        
    except psycopg2.Error as e:
            print("An error occurred:", e)
    finally:
        if conn:
            conn.close()

#### It's magic time!
<p>Ok let's review what we've done to get here</p>
1. Ingested chunks of data from a large text file<br>
2. Embedded each chunk into a vector DB<br>
3. Wrote a function to determine the right number of retrievals to consider using Z-score<br>
4. Wrote a function that asks the LLM to find us the best answer from retrieved chunks<br>
<br>
<p>Let's send in a question and find the best chunks to answer it!!!</p>

In [None]:
#query = "What is the city recommending?"
query = "What will it cost to replace the mixer gearbox?"

query_vector = generate_vector_embedding(query)

conn = dbconnection.open_connection_to_db()

run_similarity_search_pgvector(query, query_vector, conn)

### WARNING this will DELETE all your vector DB data
Use this if you want to remove your embedded data

In [None]:
conn = dbconnection.open_connection_to_db()
purge_my_data(conn)

### Assignment
1. Find an interesting set of data that might be a candidate for your project.<br>
2. Ingest this data into a data store for later retrieval. Choose either a dataframe or the class vector DB.<br>
3. Choose a retrieval method to respond to input queries, using either an approach demonstrated in class or one of your own that reduces the retrieval context length.<br>
4. Use an LLM of your choosing to provide a concise answer to the question asked. Ensure the information provided can be traced back to your original source data.<br>
