# Semantic search using OpenSearch and OpenAI

This notebook walks through an example of using [Aiven for OpenSearch](https://go.aiven.io/openai-opensearch-os) as a backend vector database for OpenAI embeddings, and shows how to perform semantic search on them.

## Prerequisites

Before you begin, make sure you have created the accounts and services described in the [README](./README.md):
- You have an [Aiven account](./README.md#setup-your-aiven-account)
- You have created your [OpenSearch service](./README.md#create-an-opensearch-service)
- You have an OpenAI account
- You have created and saved an OpenAI API key
- You have set up your Python environment for this notebook

## Adding our Environment Variables
To avoid leaking API keys, we will store them in a `.env` file that is excluded from version control.

**Make a copy of `.env_sample`:**

In [None]:
! cp .env_sample .env

## Add our OpenAI API key

Open `.env` and replace `<YOUR_OPENAI_API_KEY>` with the key that you saved from OpenAI.

## Add our OpenSearch Service URI

Verify the Aiven for OpenSearch service is in the `RUNNING` state.

![OpenSearch service in the running state](./assets/opensearch-running-state.png)

Select the running service and copy the **Service URI**.

![Copy the OpenSearch Service URI](assets/copy-opensearch-service-uri.png)

Add the OpenSearch Service URI to the `.env` file created above, replacing `<OPENSEARCH_SERVICE_URI>`.

## Load our environment variables

In [1]:
import os # to access our variables
from dotenv import load_dotenv


load_dotenv()

True
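A missing or still-templated value in `.env` tends to surface later as a confusing connection or authentication error. As a minimal sketch, the check below reports which variables are unset or still hold a placeholder. The helper name `missing_settings` is hypothetical, and the rule that a placeholder is wrapped in angle brackets is an assumption based on the `<YOUR_OPENAI_API_KEY>` and `<OPENSEARCH_SERVICE_URI>` placeholders mentioned above.

```python
import os

# Variable names as they are read elsewhere in this notebook.
REQUIRED_VARIABLES = ("OPENAI_API_KEY", "OPENSEARCH_SERVICE_URI")


def missing_settings(names=REQUIRED_VARIABLES, environ=None):
    """Return the names that are unset, empty, or still look like <PLACEHOLDER>."""
    environ = os.environ if environ is None else environ
    missing = []
    for name in names:
        value = (environ.get(name) or "").strip()
        # Assumption: untouched sample values are wrapped in angle brackets.
        if not value or (value.startswith("<") and value.endswith(">")):
            missing.append(name)
    return missing


problems = missing_settings()
if problems:
    print("Please set these variables in .env:", ", ".join(problems))
```

Running this right after `load_dotenv()` gives an early, readable hint before any network call is made.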

## Connect to our OpenSearch Service

In [2]:
from opensearchpy import OpenSearch

connection_string = os.getenv("OPENSEARCH_SERVICE_URI")

# Create the client. This example connects to a local cluster without TLS,
# so SSL and certificate verification are both disabled.
# For a TLS-enabled service, use instead:
# client = OpenSearch(connection_string, use_ssl=True, timeout=100)
client = OpenSearch(connection_string, use_ssl=False, verify_certs=False, timeout=100)

# Check the cluster status
print(client.info())

{'name': '5f09d60b6eca', 'cluster_name': 'docker-cluster', 'cluster_uuid': 'VwFwI21JQ2SPV60p2Fl8Pw', 'version': {'distribution': 'opensearch', 'number': '2.12.0', 'build_type': 'tar', 'build_hash': '2c355ce1a427e4a528778d4054436b5c4b756221', 'build_date': '2024-02-20T02:20:12.084014282Z', 'build_snapshot': False, 'lucene_version': '9.9.2', 'minimum_wire_compatibility_version': '7.10.0', 'minimum_index_compatibility_version': '7.0.0'}, 'tagline': 'The OpenSearch Project: https://opensearch.org/'}
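The output above comes from a local Docker cluster reached without TLS, whereas a hosted service URI typically starts with `https://`. As a rough sketch, the TLS flags could be derived from the URI scheme instead of being hard-coded. The helper name `tls_options_for` is hypothetical, and tying `verify_certs` to the scheme is an assumption made for illustration, not something the notebook prescribes.

```python
from urllib.parse import urlparse


def tls_options_for(service_uri):
    """Derive OpenSearch client TLS keyword arguments from the URI scheme."""
    scheme = urlparse(service_uri).scheme.lower()
    if scheme not in ("http", "https"):
        raise ValueError(f"Unsupported scheme in service URI: {scheme!r}")
    use_tls = scheme == "https"
    # Only verify certificates when TLS is actually in use.
    return {"use_ssl": use_tls, "verify_certs": use_tls}


print(tls_options_for("http://localhost:9200"))
```

The returned dictionary could then be unpacked into the client constructor, for example `OpenSearch(connection_string, timeout=100, **tls_options_for(connection_string))`.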


In [None]:
# # OpenSearch connection settings
# host = 'localhost'  # IP address or domain of the Docker host; use localhost for a local environment
# port = 9200  # Port that OpenSearch listens on
# auth = ('USERNAME', 'PASSWORD')  # Default username, with the password taken from an environment variable

# # For the case where SSL/TLS is configured (treated here as a disabled example)
# # The following is an example of creating the client with SSL disabled.
# client = OpenSearch(
#     hosts=[{'host': host, 'port': port}],
#     http_auth=auth,
#     use_ssl=False,  # Set to False when HTTPS is not used
#     verify_certs=False,  # Disable certificate verification when using a self-signed certificate
# )

# # Check the cluster status
# print(client.info())

In [None]:
# from elasticsearch import Elasticsearch

# client = Elasticsearch(
#     hosts=[
#             "https://localhost:9200"
#     ],
#     http_auth=('elastic', 'InFrbvg22wgMD2+_ME_Y'),
#     ca_certs="./http_ca.crt",
# )
    
# client.ping()

## Download the dataset
To avoid recalculating embeddings for a huge dataset, we use a pre-calculated OpenAI embeddings dataset covering Wikipedia articles. We can download the file and unzip it with:

In [2]:
import wget
import zipfile

embeddings_url = 'https://cdn.openai.com/API/examples/data/vector_database_wikipedia_articles_embedded.zip'
wget.download(embeddings_url)

with zipfile.ZipFile("vector_database_wikipedia_articles_embedded.zip", "r") as zip_ref:
    zip_ref.extractall("data")

Let's load the file into a dataframe and check its content with:

In [3]:
import pandas as pd

wikipedia_dataframe = pd.read_csv("data/vector_database_wikipedia_articles_embedded.csv")

wikipedia_dataframe.head()

| | id | url | title | text | title_vector | content_vector | vector_id |
|---|---|---|---|---|---|---|---|
| 0 | 1 | https://simple.wikipedia.org/wiki/April | April | April is the fourth month of the year in the J... | [0.001009464613161981, -0.020700545981526375, ... | [-0.011253940872848034, -0.013491976074874401,... | 0 |
| 1 | 2 | https://simple.wikipedia.org/wiki/August | August | August (Aug.) is the eighth month of the year ... | [0.0009286514250561595, 0.000820168002974242, ... | [0.0003609954728744924, 0.007262262050062418, ... | 1 |
| 2 | 6 | https://simple.wikipedia.org/wiki/Art | Art | Art is a creative activity that expresses imag... | [0.003393713850528002, 0.0061537534929811954, ... | [-0.004959689453244209, 0.015772193670272827, ... | 2 |
| 3 | 8 | https://simple.wikipedia.org/wiki/A | A | A or a is the first letter of the English alph... | [0.0153952119871974, -0.013759135268628597, 0.... | [0.024894846603274345, -0.022186409682035446, ... | 3 |
| 4 | 9 | https://simple.wikipedia.org/wiki/Air | Air | Air refers to the Earth's atmosphere. Air is a... | [0.02224554680287838, -0.02044147066771984, -0... | [0.021524671465158463, 0.018522677943110466, -... | 4 |


The file contains:
* `id` a unique Wikipedia article identifier
* `url` the Wikipedia article URL
* `title` the title of the Wikipedia page
* `text` the text of the article
* `title_vector` and `content_vector` the embeddings calculated on the title and content of the Wikipedia article, respectively
* `vector_id` the id of the vector
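In the CSV, the two vector columns are stored as text such as `"[0.001, -0.020, ...]"`, so they need to be parsed into lists of numbers before they can be indexed. The sketch below shows one way to parse a value and check its length. The helper name `parse_vector` is hypothetical, and the expected length of `1536` is taken from the `dimension` value this notebook uses for its vector fields.

```python
import json

EXPECTED_DIMENSION = 1536  # the "dimension" this notebook uses for its vector fields


def parse_vector(raw, expected_dimension=EXPECTED_DIMENSION):
    """Turn a JSON-encoded list such as "[0.1, -0.2]" into a list of floats."""
    vector = json.loads(raw)
    if len(vector) != expected_dimension:
        raise ValueError(
            f"Expected {expected_dimension} values, got {len(vector)}"
        )
    return [float(value) for value in vector]


# Tiny illustrative input; real rows carry 1536 values per vector.
print(parse_vector("[0.25, -0.5, 1.0]", expected_dimension=3))
```

Checking the length up front can catch a truncated or malformed row before it reaches the bulk request.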

We can create an OpenSearch mapping optimized for this information with:

In [4]:
index_settings = {
    "index": {
        "knn": True,
        "knn.algo_param.ef_search": 100
    }
}

index_mapping = {
    "properties": {
        "title_vector": {
            "type": "knn_vector",
            "dimension": 1536,
            "method": {
                "name": "hnsw",
                "space_type": "l2",
                "engine": "faiss"
            }
        },
        "content_vector": {
            "type": "knn_vector",
            "dimension": 1536,
            "method": {
                "name": "hnsw",
                "space_type": "l2",
                "engine": "faiss"
            }
        },
        "text": {"type": "text"},
        "title": {"type": "text"},
        "url": {"type": "keyword"},
        "vector_id": {"type": "long"}
    }
}
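The mapping uses the `l2` (Euclidean) space type. OpenAI documents its embeddings as normalized to length 1, and for unit-length vectors the squared Euclidean distance equals `2 - 2 * cosine_similarity`, so ranking by `l2` should give the same order as ranking by cosine similarity. The sketch below checks that identity on small made-up vectors; the helper names are illustrative only.

```python
import math


def normalize(vector):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def squared_l2(a, b):
    """Squared Euclidean distance between two vectors of equal length."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up three-dimensional vectors, normalized to unit length.
a = normalize([1.0, 2.0, 2.0])
b = normalize([2.0, 1.0, 2.0])

# Both lines should print the same value, up to floating-point rounding.
print(squared_l2(a, b))
print(2 - 2 * cosine_similarity(a, b))
```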

## Create an index in Aiven for OpenSearch

This is where we will store our data.

In [5]:
index_name = "openai_wikipedia_index"
client.indices.create(index=index_name, body={"settings": index_settings, "mappings":index_mapping})

{'acknowledged': True,
 'shards_acknowledged': True,
 'index': 'openai_wikipedia_index'}

## Index data into OpenSearch

Now it's time to parse the pandas dataframe and index the data into OpenSearch using the Bulk API. The following function converts a set of dataframe rows into bulk actions:

In [6]:
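import json  # needed to parse the vector columns, which are stored as strings


def dataframe_to_bulk_actions(df):
    # Yield one bulk action per dataframe row
    for index, row in df.iterrows():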
        yield {
            "_index": index_name,
            "_id": row['id'],
            "_source": {
                'url' : row["url"],
                'title' : row["title"],
                'text' : row["text"],
                'title_vector' : json.loads(row["title_vector"]),
                'content_vector' : json.loads(row["content_vector"]),
                'vector_id' : row["vector_id"]
            }
        }

We don't want to index the whole dataset at once, since it's too large, so we'll load it in batches of `200` rows.

In [7]:
from opensearchpy import helpers
import json

start = 0
end = len(wikipedia_dataframe)
batch_size = 200
for batch_start in range(start, end, batch_size):
    batch_end = min(batch_start + batch_size, end)
    batch_dataframe = wikipedia_dataframe.iloc[batch_start:batch_end]
    actions = dataframe_to_bulk_actions(batch_dataframe)
    helpers.bulk(client, actions)
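The loop above relies on `range` with a step and on `min` to keep the last batch from running past the end of the dataframe. A minimal standalone sketch of that boundary logic, using a hypothetical helper called `batch_ranges` and a made-up row count, looks like this:

```python
def batch_ranges(start, end, batch_size):
    """Yield (batch_start, batch_end) pairs covering [start, end) in order."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for batch_start in range(start, end, batch_size):
        # The final batch is clipped so that it never exceeds `end`.
        yield batch_start, min(batch_start + batch_size, end)


# A made-up total of 450 rows, split into batches of 200.
for batch_start, batch_end in batch_ranges(0, 450, 200):
    print(batch_start, batch_end)
```

Each pair can be used directly as `wikipedia_dataframe.iloc[batch_start:batch_end]`, which is what the indexing loop does.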

## Verify that our index has been populated in the Aiven Console

In the Aiven Console, select **Indexes** in the sidebar and verify that you have documents populated. There should be more than 20,000 documents.

![OpenSearch Indexes in the Aiven Console](assets/opensearch-indexes.png)
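The same check can be sketched from the notebook itself. The example below assumes the client's `count` method, which returns a dictionary containing a `count` key. The helper `count_documents` and the `FakeClient` stand-in are hypothetical and exist only so the sketch runs without a cluster; the number it returns is made up.

```python
def count_documents(client, index_name):
    """Return the number of documents the cluster reports for an index."""
    response = client.count(index=index_name)
    return response["count"]


class FakeClient:
    """Stand-in used only so this sketch runs without a cluster."""

    def count(self, index):
        # Made-up response shaped like the count API result.
        return {"count": 3, "_shards": {"total": 1, "successful": 1}}


print(count_documents(FakeClient(), "openai_wikipedia_index"))
```

With the real connection, the call would be `count_documents(client, index_name)`.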

Once all the documents are indexed, let's try a query to retrieve the documents containing `Pizza`:

In [8]:
res = client.search(index=index_name, body={
    "_source": {
        "excludes": ["title_vector", "content_vector"]
    },
    "query": {
        "match": {
            "text": {
                "query": "Pizza"
            }
        }
    }
})

print(res["hits"]["hits"][0]["_source"]["text"])

Pizza is an Italian food that was created in Italy (The Naples area). It is made with different toppings. Some of the most common toppings are cheese, sausages, pepperoni, vegetables, tomatoes, spices and herbs and basil. These toppings are added over a piece of bread covered with sauce. The sauce is most often tomato-based, but butter-based sauces are used, too. The piece of bread is usually called a "pizza crust". Almost any kind of topping can be put over a pizza. The toppings used are different in different parts of the world. Pizza comes from Italy from Neapolitan cuisine. However, it has become popular in many parts of the world.

History 
The origin of the word Pizza is uncertain. The food was invented in Naples about 200 years ago. It is the name for a special type of flatbread, made with special dough. The pizza enjoyed a second birth as it was taken to the United States in the late 19th century.

Flatbreads, like the focaccia from Liguria, have been known for a very long time

## Encode questions with OpenAI

To perform a semantic search, we need to encode the question with the same embedding model that was used to encode the documents at index time. In this example, that is the `text-embedding-ada-002` model.

In [9]:
from openai import OpenAI
# import os

# Define model
EMBEDDING_MODEL = "text-embedding-ada-002"

# Define the Client
openaiclient = OpenAI(
    # This is the default and can be omitted
    api_key=os.getenv("OPENAI_API_KEY"),
)
# Define question
question = 'is Pineapple a good ingredient for Pizza?'

# Create embedding
question_embedding = openaiclient.embeddings.create(input=question, model=EMBEDDING_MODEL)

## Run semantic search queries with OpenSearch

With the above embedding calculated, we can now run semantic searches against the OpenSearch index. We use `knn` as the query type and search the `content_vector` field.

After running the block below, we should see content semantically similar to the question. Expect documents based on Pineapples, Pizza, Hawaii, Italy, etc.

In [10]:
opensearch_response = client.search(
  index = index_name,
  body = {
      "size": 15,
      "query" : {
        "knn" : {
          "content_vector":{
          "vector":  question_embedding.data[0].embedding,
          "k": 3
        }
      }
    }
  }
)

for result in opensearch_response["hits"]["hits"]:
  print("Id:" + str(result['_id']))
  print("Score: " + str(result["_score"]))
  print("Title: " + str(result["_source"]["title"]))
  print("Text: " + result["_source"]["text"][0:100])


Id:66079
Score: 0.7135362
Title: Pizza Pizza
Text: Pizza Pizza Limited (PPL), doing business as Pizza Pizza (), is a franchised Canadian pizza fast foo
Id:15719
Score: 0.7114062
Title: Pineapple
Text: The pineapple is a fruit. It is native to South America, Central America and the Caribbean. The word
Id:13967
Score: 0.7108203
Title: Pizza
Text: Pizza is an Italian food that was created in Italy (The Naples area). It is made with different topp
Id:13968
Score: 0.6949483
Title: Pepperoni
Text: Pepperoni is a meat food that is sometimes sliced thin and put on pizza. It is a kind of salami, whi
Id:40989
Score: 0.6696381
Title: Coprophagia
Text: Coprophagia is the eating of faeces. Many animals eat faeces, either their own or that of other anim
Id:90918
Score: 0.66622764
Title: Pizza Hut
Text: Pizza Hut is an American pizza restaurant, or pizza parlor. Pizza Hut also serves salads, pastas and
Id:433
Score: 0.6660739
Title: Lanai
Text: Lanai (or Lānaʻi) is sixth largest of the Hawaiian Islan
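The scores in the output above can be related back to distances. For the `l2` space type, the OpenSearch k-NN documentation describes the score as `1 / (1 + d)`, where `d` is the squared Euclidean distance between the query vector and the document vector. The sketch below inverts that formula; the helper names are hypothetical.

```python
import math


def l2_score(squared_distance):
    """Score reported for the l2 space type: 1 / (1 + squared distance)."""
    return 1.0 / (1.0 + squared_distance)


def squared_distance_from_score(score):
    """Invert l2_score to recover the squared Euclidean distance."""
    if not 0.0 < score <= 1.0:
        raise ValueError("score must be in (0, 1]")
    return 1.0 / score - 1.0


top_score = 0.7135362  # score of the first hit in the output above
squared = squared_distance_from_score(top_score)
print(squared, math.sqrt(squared))
```

A higher score therefore means a smaller distance, and a score of `1.0` would mean the two vectors are identical.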

## Use OpenAI Chat Completions API to generate a reply

Now let's use the OpenAI Chat Completions API to generate a reply based on the information retrieved.

In [11]:
# Retrieve the text of the first result in the above dataset
top_hit_summary = opensearch_response['hits']['hits'][0]['_source']['text']

# Craft a reply
openai_response = openaiclient.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        # Keep a space after each fragment so the question and the text do not run together
        {"role": "user", "content": "Answer the following question: "
            + question
            + " by using the following text: "
            + top_hit_summary
        }
    ]
)

choices = openai_response.choices
print(f"Our top hit is \n {top_hit_summary}")
for choice in choices:
    print("------------------------------------------------------------")
    print(choice.message.content)
    print("------------------------------------------------------------")


Our top hit is 
 Pizza Pizza Limited (PPL), doing business as Pizza Pizza (), is a franchised Canadian pizza fast food restaurant with locations throughout Ontario, Canada. Ingredients include pepperoni, pineapples, mushrooms, and other non-exotic produce. Pizza Pizza has served the area for over 30 years. It has over 500 locations around Ontario and are expanding across the nation. They have recently opened shops in Montreal and British Columbia.

Other websites 

 Official website

Canadian fast food restaurants
Companies listed on the Toronto Stock Exchange
------------------------------------------------------------
Pineapple is a popular ingredient used in pizzas served at Pizza Pizza Limited (PPL), a Canadian fast food restaurant with over 500 locations in Ontario and expanding across the nation.
------------------------------------------------------------
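The reply above is based only on the first hit. One possible variation, sketched below, is to build the prompt from several hits and cap the length of each excerpt. The helper name `build_prompt`, its parameters, and the limits are all hypothetical choices for illustration, and the sample hits are made up to mirror the shape of `opensearch_response["hits"]["hits"]`.

```python
def build_prompt(question, hits, max_hits=3, max_chars_per_hit=500):
    """Combine a question with excerpts from the top search hits."""
    excerpts = []
    for hit in hits[:max_hits]:
        source = hit["_source"]
        # Truncate each article so the prompt stays reasonably short.
        excerpt = source["text"][:max_chars_per_hit]
        excerpts.append(f"Title: {source['title']}\n{excerpt}")
    context = "\n\n".join(excerpts)
    return (
        f"Answer the following question: {question}\n\n"
        f"Use only the following text:\n\n{context}"
    )


# Made-up hits shaped like opensearch_response["hits"]["hits"].
sample_hits = [
    {"_source": {"title": "Pizza", "text": "Pizza is an Italian food."}},
    {"_source": {"title": "Pineapple", "text": "The pineapple is a fruit."}},
]
print(build_prompt("is Pineapple a good ingredient for Pizza?", sample_hits))
```

The resulting string could be sent as the `content` of the user message in place of the concatenated text.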


## Conclusion

OpenSearch is a powerful tool that provides both text and vector search capabilities. Using it alongside the OpenAI APIs lets you build personalized AI applications that augment their context with semantic search results.

You can try Aiven for OpenSearch, or any of the other Open Source tools, in the Aiven platform free trial by [signing up](https://go.aiven.io/openai-opensearch-signup).