<a href="https://colab.research.google.com/github/frank-morales2020/MLxDL/blob/main/SemanticSearch.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Frank Morales developed this notebook in Python (December 2nd, 2023), with modifications from the article linked below, **"Semantic Search with PostgreSQL and OpenAI Embeddings."**

The notebook was tested successfully on Google Cloud using Google Colab. Readers will also learn how to install and use PostgreSQL with the pgvector extension in Google Colab, and how to use the OpenAI API key securely in that environment.

The **pgvector extension** turns PostgreSQL into a vector database. It introduces a dedicated data type, operators, and functions for storing, manipulating, and analyzing vector data directly within PostgreSQL. Here are some key features and use cases:

Vector storage: The pgvector extension lets you store high-dimensional vectors directly in PostgreSQL tables. It provides a dedicated data type for vector representation, allowing efficient storage and retrieval of vector data.

Similarity search: With pgvector, you can rank rows by their distance to a query vector, using metrics such as Euclidean (L2) distance (the `<->` operator) or cosine distance (the `<=>` operator).

Natural Language Processing (NLP) and text analysis: pgvector is particularly useful in NLP applications. It does not compute embeddings itself; it stores the word or document embeddings produced by an external model, so that texts can be compared by meaning.

Computer vision: The pgvector extension can handle vector representations of images and enable similarity-based image search.
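
To make the two metrics named above concrete, here is a minimal pure-Python sketch. The helper names `cosine_sim` and `l2_distance` and the three-dimensional toy vectors are made up for illustration; the embeddings used later in this notebook have 1536 dimensions. The last line checks that, for unit-length vectors, squared Euclidean distance equals `2 - 2 * cosine similarity`, which is why both metrics produce the same ranking on normalized embeddings.

```python
import math

def cosine_sim(a, b):
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def l2_distance(a, b):
    """Euclidean (L2) distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Toy unit-length vectors (assumed values, not real embeddings)
query = [1.0, 0.0, 0.0]
close = [0.8, 0.6, 0.0]
far = [0.0, 1.0, 0.0]

for name, vec in [("close", close), ("far", far)]:
    print(name, "cosine:", cosine_sim(query, vec), "L2:", l2_distance(query, vec))

# For unit vectors: d^2 = 2 - 2 * cos, so both metrics rank the same way
print(math.isclose(l2_distance(query, close) ** 2, 2 - 2 * cosine_sim(query, close)))
```

A higher cosine similarity means "closer", while a lower distance means "closer", so the sort direction flips depending on which metric is used.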






https://towardsdatascience.com/semantic-search-with-postgresql-and-openai-embeddings-4d327236f41f
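
Before the notebook cell, it may help to see how a Python embedding can reach pgvector. pgvector accepts a vector as text of the form `[x1,x2,...]`, and the `<->` operator returns the Euclidean distance between two vectors. The sketch below is one possible way to build a parameterized top-k query against the `reviews` table that the cell creates. The names `to_vector_literal` and `nearest_reviews_query` and the `limit` parameter are hypothetical, and the `%s` placeholders assume a psycopg2-style driver. The code only builds the SQL string and its parameters, so it runs without a database.

```python
def to_vector_literal(embedding):
    """Format a sequence of numbers as pgvector text input, e.g. '[0.1,2.0]'."""
    return "[" + ",".join(repr(float(x)) for x in embedding) + "]"

def nearest_reviews_query(embedding, limit=2):
    """Return (sql, params) selecting the `limit` reviews nearest to `embedding`.

    Hypothetical helper: assumes a table reviews(text TEXT, embedding vector).
    """
    sql = (
        "SELECT text, embedding <-> %s::vector AS distance "
        "FROM reviews ORDER BY distance LIMIT %s"
    )
    return sql, (to_vector_literal(embedding), limit)

# Toy two-dimensional vector, for illustration only
sql, params = nearest_reviews_query([0.1, 2, -3.5], limit=1)
print(sql)
print(params)
```

With an open psycopg2 cursor, the pair would be passed as `cur.execute(sql, params)`, which lets the driver quote the values instead of interpolating them into the SQL string by hand.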

In [2]:

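# Install libraries to access Google Drive and OpenAI resources (uncomment on first run).
# The openai==0.28 pin matters: openai.Embedding and openai.embeddings_utils,
# both used below, are not available in openai>=1.0.
#!pip install colab-env --upgrade
#!pip install openai==0.28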

import colab_env
import os
import openai

print()
print('TEST - OPENAI/API - BY FRANK MORALES - DECEMBER 2, 2023 ')
print()

openai.api_key = os.getenv("API")

#from openai import OpenAI
#client = OpenAI(api_key = os.getenv("API"))

from openai.embeddings_utils import cosine_similarity

def get_embedding(text: str) -> list:
    """Return the text-embedding-ada-002 embedding of `text` (a list of 1536 floats)."""
    response = openai.Embedding.create(
        input=text,
        model="text-embedding-ada-002"
    )
    return response['data'][0]['embedding']

good_ride = "good ride"
good_ride_embedding = get_embedding(good_ride)

#print(good_ride_embedding)
# [0.0010935445316135883, -0.01159335020929575, 0.014949149452149868, -0.029251709580421448, -0.022591838613152504, 0.006514389533549547, -0.014793967828154564, -0.048364896327257156, -0.006336577236652374, -0.027027441188693047, ...]

len(good_ride_embedding)
# 1536

good_ride_review_1 = "I really enjoyed the trip! The ride was incredibly smooth, the pick-up location was convenient, and the drop-off point was right in front of the coffee shop."
good_ride_review_1_embedding = get_embedding(good_ride_review_1)
cosine_similarity(good_ride_review_1_embedding, good_ride_embedding)
# 0.8300454513797334

good_ride_review_2 = "The drive was exceptionally comfortable. I felt secure throughout the journey and greatly appreciated the on-board entertainment, which allowed me to have some fun while the car was in motion."
good_ride_review_2_embedding = get_embedding(good_ride_review_2)
cosine_similarity(good_ride_review_2_embedding, good_ride_embedding)
# 0.821774476808789

bad_ride_review = "A sudden hard brake at the intersection really caught me off guard and stressed me out. I was not prepared for it. Additionally, I noticed some trash left in the cabin from a previous rider."
bad_ride_review_embedding = get_embedding(bad_ride_review)
cosine_similarity(bad_ride_review_embedding, good_ride_embedding)
# 0.7950041130579355

# install PSQL and DEV Libraries
!apt install -y postgresql postgresql-contrib &>log
!service postgresql restart
!sudo apt install -y postgresql-server-dev-all

#!pip install git+https://github.com/pgvector/pgvector.git#egg=pgvector

!git clone https://github.com/pgvector/pgvector.git
%cd /content/pgvector/

print()
print('START: PG VECTOR COMPILATION')
print()
!make
!make install # may need sudo
print('END: PG VECTOR COMPILATION')
print()

#!ls /usr/share/postgresql/14/extension/*control*

# PostgreSQL settings
#!sudo -u postgres psql -c "CREATE USER postgres WITH SUPERUSER"
!sudo -u postgres psql -c "ALTER USER postgres PASSWORD 'postgres'"

#CREATE EXTENSION IF NOT EXISTS btree_gist
!sudo -u postgres psql -c "CREATE EXTENSION IF NOT EXISTS vector"

import psycopg2 as ps
DB_NAME = "postgres"
DB_USER = "postgres"
DB_PASS = "postgres"
DB_HOST = "localhost"
DB_PORT = "5432"

conn = ps.connect(database=DB_NAME,
                  user=DB_USER,
                  password=DB_PASS,
                  host=DB_HOST,
                  port=DB_PORT)

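# IF EXISTS keeps the first run from reporting an error when the table is not there yet
!sudo -u postgres psql -c "DROP TABLE IF EXISTS reviews"

cur = conn.cursor() # creating a cursor
# vector(1536) matches the length of a text-embedding-ada-002 embedding
cur.execute("""
    CREATE TABLE reviews
    (text TEXT, embedding vector(1536))
""")

conn.commit()
print("Table reviews created successfully")
# Close the cursor before the connection it belongs to
cur.close()
conn.close()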

print()

def text_and_embedding(text, textid):
    """Embed `text` and insert the (text, embedding) pair into the reviews table."""
    review_embedding = get_embedding(text)
    ### INSERT INTO DB (reuses the DB_* settings defined above)
    conn = ps.connect(database=DB_NAME,
                      user=DB_USER,
                      password=DB_PASS,
                      host=DB_HOST,
                      port=DB_PORT)

    cur = conn.cursor() # creating a cursor

    # Pass the values as query parameters so that quotes in the review text
    # cannot break the SQL. str(list) gives '[x1, x2, ...]', which is cast to vector.
    cur.execute("""
        INSERT INTO reviews
        (text, embedding)
        VALUES (%s, %s::vector)""", (text, str(review_embedding)))

    conn.commit()
    print("INSERT TEXTID %s successfully" % textid)
    cur.close()
    conn.close()


print()
text_and_embedding(good_ride_review_1,1)
text_and_embedding(good_ride_review_2,2)
text_and_embedding(bad_ride_review,3)
print()

#print(good_ride_review_1_embedding)
print()

conn = ps.connect(database=DB_NAME,
                  user=DB_USER,
                  password=DB_PASS,
                  host=DB_HOST,
                  port=DB_PORT)

# Size of the text preview passed to substring(): 1536 // 10 = 153
num_characters = len(good_ride_embedding) // 10

cur = conn.cursor() # creating a cursor
# <-> is pgvector's Euclidean (L2) distance operator, so the nearest reviews come first
cur.execute("""
    SELECT substring(text, 0, %s) FROM reviews ORDER BY embedding <-> %s::vector
""", (num_characters, str(good_ride_embedding)))

conn.commit()
print()
print("QUERY SELECTION successfully")
print()

records = cur.fetchall()
print("Total rows are:  ", len(records))

print("Printing each row")
print()
n=0
for row in records:
    n=n+1
    print("TEXT %s: "%n, row[0])

cur.close()
conn.close()



TEST - OPENAI/API - BY FRANK MORALES - DECEMBER 2, 2023 

 * Restarting PostgreSQL 14 database server
   ...done.
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
postgresql-server-dev-all is already the newest version (238).
0 upgraded, 0 newly installed, 0 to remove and 15 not upgraded.
Cloning into 'pgvector'...
remote: Enumerating objects: 5026, done.
remote: Counting objects: 100% (1901/1901), done.
remote: Compressing objects: 100% (349/349), done.
remote: Total 5026 (delta 1646), reused 1717 (delta 1540), pack-reused 3125
Receiving objects: 100% (5026/5026), 734.70 KiB | 9.18 MiB/s, done.
Resolving deltas: 100% (3628/3628), done.
/content/pgvector

START: PG VECTOR COMPILATION

make: Nothing to be done for 'all'.
/bin/mkdir -p '/usr/lib/postgresql/14/lib'
/bin/mkdir -p '/usr/share/postgresql/14/extension'
/bin/mkdir -p '/usr/share/postgresql/14/extension'
/usr/bin/install -c -m 755  vector.so '/usr/lib/postgresql/14/li