<a href="https://colab.research.google.com/github/frank-morales2020/MLxDL/blob/main/openai_pgvector_helloworld_FrankMorales_version_model_gpt-4-0613.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Hello pgvector: Create, store and query OpenAI embeddings in PostgreSQL using pgvector

This notebook will teach you:
- How to create embeddings from content using the OpenAI API
- How to use PostgreSQL as a vector database and store embedding data in it using pgvector
- How to use embeddings retrieved from a vector database to augment LLM generation.

We'll be using the example of creating a chatbot to answer questions about Timescale use cases, referencing content from the Timescale Developer Q+A blog posts.

This is a great first step toward building something like a chatbot that can reference a company knowledge base or developer docs.

Let's get started!

Note: This notebook uses a PostgreSQL database with pgvector installed that's hosted on Timescale. You can create your own cloud PostgreSQL database in minutes [at this link](https://console.cloud.timescale.com/signup) to follow along. You can also use a local PostgreSQL database if you prefer.

Note 2: This version of the notebook, developed by Frank Morales on 04/12/2023, uses PostgreSQL with the pgvector extension running locally in Google Cloud. The original notebook used PostgreSQL with Timescale's configuration.

Forget RAG, the Future is RAG-Fusion: https://towardsdatascience.com/forget-rag-the-future-is-rag-fusion-1147298d8ad1



### Configuration
- Sign up for an OpenAI Developer Account and create an API key. See [OpenAI's developer platform](https://platform.openai.com/overview).
- Install Python
- Install and configure a Python virtual environment. We recommend [Pyenv](https://github.com/pyenv/pyenv)
- Install the requirements for this notebook using the following command:

```
pip install -r requirements.txt
```

In [72]:
## Added by Frank Morales Dec 4th, 2023
#!git clone https://github.com/timescale/vector-cookbook.git
#!pip install -r /content/vector-cookbook/openai_pgvector_helloworld/requirements.txt

#import openai
import os
import pandas as pd
import numpy as np
import json
import tiktoken
import psycopg2
import ast
import pgvector
import math
from psycopg2.extras import execute_values
from pgvector.psycopg2 import register_vector

#Install Libraries to access Google Drive and OpenAI resources.
#!pip install colab-env --upgrade
#!pip install openai==0.28

#!pip install openai

import openai
import colab_env

In [73]:
# Get the OpenAI API key from the environment (a local .env file is loaded first)
# This version reads the key from the variable named API, e.g. export API=sk-YOUR_OPENAI_API_KEY...
from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv())

#openai.api_key  = os.environ['OPENAI_API_KEY']
openai.api_key = os.getenv("API")


## Part 1: Create Embeddings
First, we'll create embeddings using the OpenAI API on some text we want to augment our LLM with.
In this example, we'll use content from the Timescale blog about real world use cases.

In [74]:
# Load your CSV file into a pandas DataFrame
df = pd.read_csv('/content/vector-cookbook/openai_pgvector_helloworld/blog_posts_data.csv')
df.head()

| | title | content | url |
|---|---|---|---|
| 0 | How to Build a Weather Station With Elixir, Ne... | This is an installment of our “Community Membe... | https://www.timescale.com/blog/how-to-build-a-... |
| 1 | CloudQuery on Using PostgreSQL for Cloud Asset... | This is an installment of our “Community Membe... | https://www.timescale.com/blog/cloudquery-on-u... |
| 2 | How a Data Scientist Is Building a Time-Series... | This is an installment of our “Community Membe... | https://www.timescale.com/blog/how-a-data-scie... |
| 3 | How Conserv Safeguards History: Building an En... | This is an installment of our “Community Membe... | https://www.timescale.com/blog/how-conserv-saf... |
| 4 | How Messari Uses Data to Open the Cryptoeconom... | This is an installment of our “Community Membe... | https://www.timescale.com/blog/how-messari-use... |


### 1.1 Calculate cost of embedding data
It's usually a good idea to calculate how much creating embeddings for your selected content will cost.
We use a number of helper functions to calculate a cost estimate before creating the embeddings to help us avoid surprises.

For this toy example, since we're using a small dataset, the total cost will be less than $0.01.
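
As a rough worked example of the arithmetic, the sketch below applies a per-1,000-token price to a list of token counts. The price of $0.0001 per 1,000 tokens is the value assumed by this notebook's helper; the function name `estimate_embedding_cost` and the token counts are made up for illustration, and current prices should be checked on OpenAI's pricing page.

```python
# Price per 1,000 tokens, as assumed by this notebook for text-embedding-ada-002.
PRICE_PER_1K_TOKENS = 0.0001

def estimate_embedding_cost(token_counts, price_per_1k=PRICE_PER_1K_TOKENS):
    """Return the estimated cost in dollars of embedding documents with the given token counts."""
    total_tokens = sum(token_counts)
    return total_tokens / 1000 * price_per_1k

# Hypothetical token counts for three documents (not taken from the dataset).
example_counts = [1200, 850, 3000]
print(f"tokens: {sum(example_counts)}, "
      f"estimated cost: ${estimate_embedding_cost(example_counts):.6f}")
```

In the notebook itself, the token counts come from `tiktoken` rather than from a hand-written list.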

In [75]:
# Helper functions to help us create the embeddings

# Helper func: calculate number of tokens
def num_tokens_from_string(string: str, encoding_name = "cl100k_base") -> int:
    if not string:
        return 0
    # Returns the number of tokens in a text string
    encoding = tiktoken.get_encoding(encoding_name)
    num_tokens = len(encoding.encode(string))
    return num_tokens

# Helper function: calculate length of essay
def get_essay_length(essay):
    word_list = essay.split()
    num_words = len(word_list)
    return num_words

# Helper function: calculate cost of embedding num_tokens
# Assumes we're using the text-embedding-ada-002 model
# See https://openai.com/pricing
def get_embedding_cost(num_tokens):
    return num_tokens/1000*0.0001

# Helper function: calculate total cost of embedding all content in the dataframe
def get_total_embeddings_cost():
    total_tokens = 0
    for i in range(len(df.index)):
        text = df['content'][i]
        token_len = num_tokens_from_string(text)
        total_tokens = total_tokens + token_len
    total_cost = get_embedding_cost(total_tokens)
    return total_cost

# Helper function: get embeddings for a text
## openai API version == 0.
#def get_embeddings(text):
#    response = openai.Embedding.create(
#        model="text-embedding-ada-002",
#        input = text.replace("\n"," ")
#    )
#    embedding = response['data'][0]['embedding']
#    return embedding


#from openai import OpenAI
#client = openai

def get_embeddings(text, model="text-embedding-ada-002"):
   text = text.replace("\n", " ")
   return openai.embeddings.create(input = [text], model=model).data[0].embedding


#df['ada_embedding'] = df.combined.apply(lambda x: get_embedding(x, model='text-embedding-ada-002'))
#df.to_csv('output/embedded_1k_reviews.csv', index=False)



In [76]:
# quick check on total token amount for price estimation
total_cost = get_total_embeddings_cost()
print("estimated price to embed this content = $" + str(total_cost))

estimated price to embed this content = $0.0060178


### 1.2 Create smaller chunks of content
The OpenAI API limits the maximum number of tokens it can create an embedding for in a single request. To get around this limit, we'll break up our text into smaller chunks. In general, it's a best practice to create embeddings of a certain size in order to get better retrieval. For our purposes, we'll aim for chunks of around 512 tokens each.

Note: If you prefer to skip this step, you can use the provided file: blog_data_and_embeddings.csv, which contains the data and embeddings that you'll generate in this step.
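
The chunking idea can be sketched in a few lines. This is a simplified illustration rather than the notebook's exact code: it assumes the rule of thumb that one token is about 3/4 of a word, so a 512-token target becomes 384 words per chunk. The helper name `chunk_words` and the sample text are hypothetical.

```python
def chunk_words(text, target_tokens=512, words_per_token=0.75):
    """Split text into chunks of roughly target_tokens tokens.

    Uses the approximation 1 token ~ 3/4 of a word, so chunk sizes are
    counted in words rather than in real tokens.
    """
    chunk_size = int(target_tokens * words_per_token)  # 512 tokens -> 384 words
    words = text.split()
    return [
        " ".join(words[i:i + chunk_size])
        for i in range(0, len(words), chunk_size)
    ]

# A made-up 1,000-word document.
sample = " ".join(f"word{i}" for i in range(1000))
chunks = chunk_words(sample)
print([len(chunk.split()) for chunk in chunks])
```

Because the split is word-based, the real token count of a chunk can land slightly above or below the target, which is why the notebook recounts tokens for each chunk with `tiktoken`.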

In [77]:
###############################################################################
# Create new list with small content chunks to not hit max token limits
# Note: the maximum number of tokens for a single request is 8191
# https://openai.com/docs/api-reference/requests
###############################################################################
# list for chunked content and embeddings
new_list = []
# Split up the text into token sizes of around 512 tokens
for i in range(len(df.index)):
    text = df['content'][i]
    token_len = num_tokens_from_string(text)
    if token_len <= 512:
        new_list.append([df['title'][i], df['content'][i], df['url'][i], token_len])
    else:
        # add content to the new list in chunks
        start = 0
        ideal_token_size = 512
        # 1 token ~ 3/4 of a word
        ideal_size = int(ideal_token_size // (4/3))
        end = ideal_size
        #split text by spaces into words
        words = text.split()

        #remove empty spaces
        words = [x for x in words if x != ' ']

        total_words = len(words)

        #calculate iterations
        chunks = total_words // ideal_size
        if total_words % ideal_size != 0:
            chunks += 1

        new_content = []
        for j in range(chunks):
            if end > total_words:
                end = total_words
            new_content = words[start:end]
            new_content_string = ' '.join(new_content)
            new_content_token_len = num_tokens_from_string(new_content_string)
            if new_content_token_len > 0:
                new_list.append([df['title'][i], new_content_string, df['url'][i], new_content_token_len])
            start += ideal_size
            end += ideal_size

In [78]:
# Create embeddings for each piece of content
for i in range(len(new_list)):
    text = new_list[i][1]
    embedding = get_embeddings(text)
    new_list[i].append(embedding)

# Create a new dataframe from the list
df_new = pd.DataFrame(new_list, columns=['title', 'content', 'url', 'tokens', 'embeddings'])
df_new.head()

| | title | content | url | tokens | embeddings |
|---|---|---|---|---|---|
| 0 | How to Build a Weather Station With Elixir, Ne... | This is an installment of our “Community Membe... | https://www.timescale.com/blog/how-to-build-a-... | 501 | [0.021440856158733368, 0.02200360782444477, -0... |
| 1 | How to Build a Weather Station With Elixir, Ne... | capture weather and environmental data. In all... | https://www.timescale.com/blog/how-to-build-a-... | 512 | [0.016152484342455864, 0.01139064785093069, 0.... |
| 2 | How to Build a Weather Station With Elixir, Ne... | command in their database migration:SELECT cre... | https://www.timescale.com/blog/how-to-build-a-... | 374 | [0.022517921403050423, -0.0019158280920237303,... |
| 3 | CloudQuery on Using PostgreSQL for Cloud Asset... | This is an installment of our “Community Membe... | https://www.timescale.com/blog/cloudquery-on-u... | 519 | [0.008887906558811665, -0.0048979795537889, 0.... |
| 4 | CloudQuery on Using PostgreSQL for Cloud Asset... | Architecture with CloudQuery SDK- Writing plug... | https://www.timescale.com/blog/cloudquery-on-u... | 511 | [0.020441284403204918, 0.010131468996405602, 0... |


In [79]:
# Save the dataframe with embeddings as a CSV file
# /content/vector-cookbook/openai_pgvector_helloworld/blog_data_and_embeddings.json

df_new.to_csv('blog_data_and_embeddings.csv', index=False)
# It may also be useful to save as a json file, but we won't use this in the tutorial
#df_new.to_json('blog_data_and_embeddings.json')

## Part 2: Store embeddings with pgvector
In this section, we'll store our embeddings and associated metadata.

We'll use PostgreSQL as a vector database, with the pgvector extension.

You can create a cloud PostgreSQL database for free on [Timescale](https://console.cloud.timescale.com/signup) or use a local PostgreSQL database for this step.




### 2.2 Connect to and configure your vector database


In [96]:
# Timescale database connection string
# Found under "Service URL" of the credential cheat-sheet or "Connection Info" in the Timescale console
# In terminal, run: export TIMESCALE_CONNECTION_STRING=postgres://<fill in here>

#ORIGINAL
#connection_string  = os.environ['TIMESCALE_CONNECTION_STRING']


# install PSQL and DEV Libraries locally added by FRANK MORALES December 4th, 2023.
!apt install postgresql postgresql-contrib &>log
!service postgresql restart
#!sudo apt install postgresql-server-dev-all
#!git clone https://github.com/pgvector/pgvector.git
#%cd /content/pgvector/

#print()
#print('START: PG VECTOR COMPILATION')
#!make
#!make install
#print('END: PG VECTOR COMPILATION')
print()

# PostGRES SQL Settings
#!sudo -u postgres psql -c "CREATE USER postgres WITH SUPERUSER"
!sudo -u postgres psql -c "ALTER USER postgres PASSWORD 'postgres'"

#connection_string = 'postgresql://postgres:postgres@localhost:5432/postgres'

#CREATE EXTENSION IF NOT EXISTS btree_gist
#!sudo -u postgres psql -c "CREATE EXTENSION IF NOT EXISTS vector"

import psycopg2 as ps

DB_NAME = "postgres"
DB_USER = "postgres"
DB_PASS = "postgres"
DB_HOST = "localhost"
DB_PORT = "5432"

conn = ps.connect(
    database=DB_NAME,
    user=DB_USER,
    password=DB_PASS,
    host=DB_HOST,
    port=DB_PORT,
)

cur = conn.cursor() # creating a cursor


 * Restarting PostgreSQL 14 database server
   ...done.

ALTER ROLE


In [None]:
# Connect to PostgreSQL database in Timescale using connection string
#conn = psycopg2.connect(connection_string)

cur = conn.cursor()

#install pgvector
cur.execute("CREATE EXTENSION IF NOT EXISTS vector")
conn.commit()

# Register the vector type with psycopg2
register_vector(conn)

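# Drop any table left over from a previous run (IF EXISTS avoids an error on the first run)
!sudo -u postgres psql -c "DROP TABLE IF EXISTS embeddings"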

# Create table to store embeddings and metadata
table_create_command = """
CREATE TABLE IF NOT EXISTS embeddings (
            id bigserial primary key,
            title text,
            url text,
            content text,
            tokens integer,
            embedding vector(1536)
            );
            """

cur.execute(table_create_command)
cur.close()
conn.commit()

Optional: Uncomment and execute the following code only if you need to read the embeddings and metadata from the provided CSV file

In [None]:
# Uncomment and execute this cell only if you need to read the blog data and embeddings from the provided CSV file
# Otherwise, skip to next cell
'''
df = pd.read_csv('blog_data_and_embeddings.csv')
titles = df['title']
urls = df['url']
contents = df['content']
tokens = df['tokens']
embeds = [list(map(float, ast.literal_eval(embed_str))) for embed_str in df['embeddings']]

df_new = pd.DataFrame({
    'title': titles,
    'url': urls,
    'content': contents,
    'tokens': tokens,
    'embeddings': embeds
})
'''
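
The reason for `ast.literal_eval` in the cell above is that a CSV file stores each embedding as text, not as a list of numbers. The sketch below illustrates this with a tiny in-memory CSV whose values are made up; it uses the standard-library `csv` module in place of pandas so that it runs without extra dependencies.

```python
import ast
import csv
import io

# A tiny in-memory CSV standing in for blog_data_and_embeddings.csv (values are made up).
csv_text = 'title,embeddings\nPost A,"[0.1, 0.2, 0.3]"\nPost B,"[0.4, 0.5, 0.6]"\n'
rows = list(csv.DictReader(io.StringIO(csv_text)))

# After reading, the embedding is a string such as "[0.1, 0.2, 0.3]".
raw = rows[0]["embeddings"]
print(type(raw).__name__, raw)

# ast.literal_eval parses the string as a Python literal without running arbitrary code.
parsed = [float(x) for x in ast.literal_eval(raw)]
print(type(parsed).__name__, parsed)
```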

### 2.3 Ingest and store vector data into PostgreSQL using pgvector
In this section, we'll batch insert our embeddings and metadata into PostgreSQL and also create an index to help speed up search.

In [99]:
register_vector(conn)
cur = conn.cursor()

In [100]:
# Remind ourselves of the dataframe structure
df_new.head()

| | title | content | url | tokens | embeddings |
|---|---|---|---|---|---|
| 0 | How to Build a Weather Station With Elixir, Ne... | This is an installment of our “Community Membe... | https://www.timescale.com/blog/how-to-build-a-... | 501 | [0.021440856158733368, 0.02200360782444477, -0... |
| 1 | How to Build a Weather Station With Elixir, Ne... | capture weather and environmental data. In all... | https://www.timescale.com/blog/how-to-build-a-... | 512 | [0.016152484342455864, 0.01139064785093069, 0.... |
| 2 | How to Build a Weather Station With Elixir, Ne... | command in their database migration:SELECT cre... | https://www.timescale.com/blog/how-to-build-a-... | 374 | [0.022517921403050423, -0.0019158280920237303,... |
| 3 | CloudQuery on Using PostgreSQL for Cloud Asset... | This is an installment of our “Community Membe... | https://www.timescale.com/blog/cloudquery-on-u... | 519 | [0.008887906558811665, -0.0048979795537889, 0.... |
| 4 | CloudQuery on Using PostgreSQL for Cloud Asset... | Architecture with CloudQuery SDK- Writing plug... | https://www.timescale.com/blog/cloudquery-on-u... | 511 | [0.020441284403204918, 0.010131468996405602, 0... |


Batch insert embeddings using psycopg2's `execute_values()`

In [101]:
#Batch insert embeddings and metadata from dataframe into PostgreSQL database

# Prepare the list of tuples to insert
data_list = [(row['title'], row['url'], row['content'], int(row['tokens']), np.array(row['embeddings'])) for index, row in df_new.iterrows()]
# Use execute_values to perform batch insertion
execute_values(cur, "INSERT INTO embeddings (title, url, content, tokens, embedding) VALUES %s", data_list)
# Commit after we insert all embeddings
conn.commit()

Sanity check by running some simple queries against the embeddings table

In [102]:
cur.execute("SELECT COUNT(*) as cnt FROM embeddings;")
num_records = cur.fetchone()[0]
print("Number of vector records in table: ", num_records,"\n")
# Correct output should be 129

Number of vector records in table:  129 



In [103]:
# print the first record in the table, for sanity-checking
cur.execute("SELECT * FROM embeddings LIMIT 1;")
records = cur.fetchall()
print("First record in table: ", records)

First record in table:  [(1, 'How to Build a Weather Station With Elixir, Nerves, and TimescaleDB', 'https://www.timescale.com/blog/how-to-build-a-weather-station-with-elixir-nerves-and-timescaledb/', 'This is an installment of our “Community Member Spotlight” series, where we invite our customers to share their work, shining a light on their success and inspiring others with new ways to use technology to solve problems.In this edition,Alexander Koutmos, author of the Build a Weather Station with Elixir and Nerves book, joins us to share how he uses Grafana and TimescaleDB to store and visualize weather data collected from IoT sensors.About the teamThe bookBuild a Weather Station with Elixir and Nerveswas a joint effort between Bruce Tate, Frank Hunleth, and me.I have been writing software professionally for almost a decade and have been working primarily with Elixir since 2016. I currently maintain a few Elixir libraries onHexand also runStagira, a software consultancy company.Bruce T

Create index on embedding column for faster cosine similarity comparison

In [104]:
# Create an index on the data for faster retrieval
# this isn't really needed for 129 vectors, but it shows the usage for larger datasets
# Note: always create this type of index after you have data already inserted into the DB

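# calculate the index parameters according to best practices:
# lists = rows / 1000 (with a floor of 10), or sqrt(rows) above 1M rows
# the lists parameter must be an integer, so use integer arithmetic
num_lists = max(10, num_records // 1000)
if num_records > 1000000:
    num_lists = int(math.sqrt(num_records))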

#use the cosine distance measure, which is what we'll later use for querying
cur.execute(f'CREATE INDEX ON embeddings USING ivfflat (embedding vector_cosine_ops) WITH (lists = {num_lists});')
conn.commit()
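
The sizing rule used for the index can be written as a small standalone function. The sketch below follows the starting points suggested in the pgvector documentation (rows / 1000 lists for up to 1M rows, sqrt(rows) above that, and sqrt(lists) probes at query time) and keeps this notebook's floor of 10 lists. The helper name `ivfflat_params` is hypothetical, and the row counts are arbitrary examples.

```python
import math

def ivfflat_params(num_rows, min_lists=10):
    """Return (lists, probes) starting values for an ivfflat index on num_rows vectors."""
    if num_rows > 1_000_000:
        lists = math.isqrt(num_rows)          # integer square root
    else:
        lists = max(min_lists, num_rows // 1000)
    probes = max(1, math.isqrt(lists))        # starting point: sqrt(lists)
    return lists, probes

for rows in (129, 500_000, 4_000_000):
    lists, probes = ivfflat_params(rows)
    print(f"rows={rows}: lists={lists}, probes={probes}")
    # At query time, probes would be applied with: SET ivfflat.probes = <probes>;
```

More probes improve recall at the cost of query speed, so these values are only a place to start tuning.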

## Part 3: Nearest Neighbor Search using pgvector

In this final part of the tutorial, we will query our embeddings table.

We'll showcase an example of RAG: Retrieval Augmented Generation, where we'll retrieve relevant data from our vector database and give it to the LLM as context to use when it generates a response to a prompt.
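
To make the "augment" step concrete, here is a minimal sketch of how retrieved text can be packed into chat messages before calling the model. It makes no API call. The helper name `build_rag_messages`, the `####` delimiter, and the sample snippets are assumptions for illustration; the message layout loosely mirrors the one used later in this notebook.

```python
def build_rag_messages(question, retrieved_docs, delimiter="####"):
    """Assemble chat messages that pair a user question with retrieved context."""
    system_message = (
        "You are a friendly chatbot that answers questions about TimescaleDB."
    )
    context = "\n".join(retrieved_docs)
    return [
        {"role": "system", "content": system_message},
        {"role": "user", "content": f"{delimiter}{question}{delimiter}"},
        {
            "role": "assistant",
            "content": f"Relevant Timescale case studies information:\n{context}",
        },
    ]

# Made-up snippets standing in for rows returned by the vector search.
docs = ["Snippet about IoT sensors.", "Snippet about dashboards."]
for message in build_rag_messages("How is Timescale used in IoT?", docs):
    print(message["role"], "->", message["content"][:60])
```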

In [105]:
# Helper function: get text completion from OpenAI API
# Note: the original notebook used gpt-3.5-turbo-0613 (max tokens 4097);
# this version defaults to gpt-4-0613

# completion = client.chat.completions.create(

# models https://platform.openai.com/docs/models/continuous-model-upgrades
# models https://platform.openai.com/docs/models

#MODEL NAME	DISCONTINUATION DATE	REPLACEMENT MODEL

#gpt-3.5-turbo-0613	Jun 13, 2024	gpt-3.5-turbo-1106

#gpt-3.5-turbo-0301	Jun 13, 2024	gpt-3.5-turbo-1106

#gpt-4-0314	Jun 13, 2024	gpt-4-0613

#gpt-4-32k-0314	Jun 13, 2024	gpt-4-32k-0613

# How do I access GPT-4 32k?
#Sign up for the Azure service. Apply for access to OpenAI models
#using this form: https://aka.ms/oai/get-gpt4.
#Once you've gained access, create a subscription in the "East Canada" region
#(click the Create +). Open the Azure OpenAI Studio and create a new Deployment
#for the gpt4-32k in the Deployment menu.


# ORIGINAL
#def get_completion_from_messages(messages, model="gpt-3.5-turbo-0613", temperature=0, max_tokens=1000): # ORIGINAL

# ADDED by FRANK MORALES 04/12/2023
def get_completion_from_messages(messages, model="gpt-4-0613", temperature=0, max_tokens=1000): # ADDED by FRANK MORALES 04/12/2023

#def get_completion_from_messages(messages, model="gpt-3.5-turbo-0613", temperature=0, max_tokens=1000):
    response = openai.chat.completions.create( ## NEW API by Frank Morales
    #response = openai.ChatCompletion.create( #OLD API ORIGINAL
        model=model,
        messages=messages,
        temperature=temperature,
        max_tokens=max_tokens,
    )
    #return response.choices[0].message["content"] # OLD API ORIGINAL
    return response.choices[0].message.content ## NEW API by Frank Morales

    #print(completion.choices[0].message.content)

In [106]:
# Helper function: Get top 3 most similar documents from the database
def get_top3_similar_docs(query_embedding, conn):
    embedding_array = np.array(query_embedding)
    # Register pgvector extension
    register_vector(conn)
    cur = conn.cursor()
    # Get the top 3 most similar documents using pgvector's cosine distance operator (<=>)
    cur.execute("SELECT content FROM embeddings ORDER BY embedding <=> %s LIMIT 3", (embedding_array,))
    top3_docs = cur.fetchall()
    return top3_docs
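
For intuition, pgvector's `<=>` operator returns cosine distance, which is 1 minus cosine similarity, and `ORDER BY ... LIMIT 3` keeps the rows with the smallest distance. The sketch below reproduces that ranking in plain Python on made-up 2-dimensional vectors; the helper names `cosine_distance` and `top_k` are hypothetical, and a real query would run inside PostgreSQL on 1536-dimensional embeddings.

```python
import math

def cosine_distance(a, b):
    """Cosine distance between two vectors: 1 - cosine similarity."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def top_k(query, docs, k=3):
    """Return the contents of the k documents closest to the query.

    docs is a list of (content, embedding) pairs.
    """
    ranked = sorted(docs, key=lambda doc: cosine_distance(query, doc[1]))
    return [content for content, _ in ranked[:k]]

# Toy "embeddings" pointing in different directions.
docs = [
    ("east", [1.0, 0.0]),
    ("north", [0.0, 1.0]),
    ("northeast", [1.0, 1.0]),
    ("west", [-1.0, 0.0]),
]
print(top_k([1.0, 0.1], docs, k=2))
```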

### 3.1 Define a prompt for the LLM
Here we'll define the prompt we want the LLM to provide a response to.

We've picked an example relevant to the blog post data stored in the database.

In [107]:
# Question about Timescale we want the model to answer
input = "How is Timescale used in IoT?"

In [108]:
# Function to process input with retrieval of most similar documents from the database
def process_input_with_retrieval(user_input):
    delimiter = "```"

    #Step 1: Get documents related to the user input from database
    related_docs = get_top3_similar_docs(get_embeddings(user_input), conn)

    # Step 2: Get completion from OpenAI API
    # Set system message to help set appropriate tone and context for model
    system_message = f"""
    You are a friendly chatbot. \
    You can answer questions about timescaledb, its features and its use cases. \
    You respond in a concise, technically credible tone. \
    """

    # Prepare messages to pass to model
    # We use a delimiter to help the model understand where the user_input starts and ends
    messages = [
        {"role": "system", "content": system_message},
        {"role": "user", "content": f"{delimiter}{user_input}{delimiter}"},
        {"role": "assistant", "content": f"Relevant Timescale case studies information: \n {related_docs[0][0]} \n {related_docs[1][0]} {related_docs[2][0]}"}
    ]

    final_response = get_completion_from_messages(messages)
    return final_response

In [109]:
response = process_input_with_retrieval(input)
print(input)
print()
print(response)
print()

How is Timescale used in IoT?

TimescaleDB is widely used in IoT (Internet of Things) for data storage and analysis. IoT devices generate a large amount of time-series data, which is data that is indexed by time. TimescaleDB, being a time-series database built on PostgreSQL, is well-suited for managing this type of data. 

Here are some ways TimescaleDB is used in IoT:

1. **Data Ingestion**: IoT devices generate a lot of data that needs to be ingested and stored efficiently. TimescaleDB can handle high write loads and is capable of ingesting millions of data points per second.

2. **Data Analysis**: TimescaleDB supports full SQL and joins, making it easy to analyze IoT data in real-time. It also provides advanced time-series capabilities like time bucketing, gap filling, aggregations, and more.

3. **Data Retention**: IoT applications often require retaining data for a long period. TimescaleDB provides efficient data retention policies that allow older data to be compressed or discard

In [110]:
# We can also ask the model questions about specific documents in the database
input_2 = "Tell me about Edeva and Hopara. How do they use Timescale?"
response_2 = process_input_with_retrieval(input_2)
print(input_2)
print()
print(response_2)

Tell me about Edeva and Hopara. How do they use Timescale?

Edeva and Hopara are two companies that use TimescaleDB for their operations.

Edeva is a Swedish company that develops intelligent traffic systems, including a dynamic speed bump called Actibump. They use TimescaleDB as the main database in their smart city system. Their clients can control their IoT devices and see the data that has been captured, getting an overview of trends and historical data. Edeva uses TimescaleDB's continuous aggregations feature to render their dashboards quickly and efficiently. This feature has significantly improved their query speed, changing their dashboards from sluggish to lightning fast.

Hopara, on the other hand, is a Boston-based company that provides a visualization system for any kind of data, especially applicable for real-time monitoring applications. They use TimescaleDB to power their real-time views. To guarantee a real-time display, Hopara fetches live data from the database for ev