# Semantic Search Engine

This notebook demonstrates how to implement a semantic search engine using OpenAI's text embeddings. We compute `embeddings` for product names and descriptions, save them to disk, and then run a semantic search that returns the most relevant products for a given query.


## Libraries/packages used

* [<code style="color:blue;">openai:</code>](https://platform.openai.com/docs/) The core library for accessing OpenAI’s models 

* [<code style="color:blue;">numpy:</code>](https://numpy.org/doc/) A powerful library for numerical computing, providing support for multi-dimensional arrays, mathematical functions, and array operations

* [<code style="color:blue;">pandas:</code>](https://pandas.pydata.org/pandas-docs/stable/) A library for data manipulation and analysis, offering data structures like DataFrame and Series for handling structured data

* [<code style="color:blue;">getpass:</code>](https://docs.python.org/3/library/getpass.html) Allows secure entry of passwords or sensitive input without displaying it on the console

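To run this code on your local machine, install the third-party packages with the following commands in your command line tool. `getpass` and `warnings` are part of the Python standard library and need no installation, while `scipy` is required for the cosine similarity step in section 5.1. The code uses the `OpenAI` client class, which requires `openai` version 1.0 or later.
* `pip install openai -q`
* `pip install numpy -q`
* `pip install pandas -q`
* `pip install scipy -q`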

<a id='TOC'></a>
### **Table of contents**

1. <a href="#setup">Setting up OpenAI API</a><br>
2. <a href="#read_data">Reading the dataset</a><br>
3. <a href="#compute_embeddings">Calculating word embeddings</a><br>
4. <a href="#save_embeddings">Saving word embeddings</a><br>
5. <a href="#semantic_search">Performing semantic search</a><br>
    5.1. <a href="#cosine_similarity">Calculating cosine similarity</a><br>
    5.2. <a href="#results">Sorting and displaying results</a><br>

<a id='setup'></a>
## **1. Setting up OpenAI API**  
[Back to table of contents](#TOC)  

* [<code style="color:blue;">warnings</code>](https://docs.python.org/3/library/warnings.html)  

In [3]:
# import required libraries
import openai
import pandas as pd
import numpy as np
from getpass import getpass

# used for suppressing warnings
import warnings
warnings.filterwarnings("ignore")

# enter OpenAI API key securely
openai.api_key = getpass('Enter OpenAI API key:')

<a id='read_data'></a>
## **2. Reading the dataset**  
[Back to table of contents](#TOC)  

We’ve set up OpenAI, so let's start by working with a CSV file containing product data. The dataset includes details about various products, such as `product ID`, `name`, `URL`, `description`, `price`, `color`, `inventory count`, `sale status`, `promotions`, and related keywords. The products range from clothing items like jeans and footwear to accessories like hats and handbags, for both men and women.

* [<code style="color:blue;">read_csv</code>](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html) 

In [None]:
# load the data into a pandas dataframe
df = pd.read_csv('data/product/data.csv')

# display the data
df.head()

In [None]:
# display shape of data
print(f"Dataset shape: {df.shape}")

In [None]:
# displaying the 'Gender' column from the DataFrame
df['value/Gender']

<a id='compute_embeddings'></a>
## **3. Calculating word embeddings**  
[Back to table of contents](#TOC)  

To perform semantic search, we convert text into numerical vectors called embeddings. We define a helper function, `get_embedding`, that calls OpenAI's embeddings endpoint. Since our text is stored in a `pandas` DataFrame, we apply `get_embedding` to each value of a column using the `apply` method. The calculated embeddings are later saved in a file called `word_embeddings.csv` to avoid making repeated API calls.
* [<code style="color:blue;">rename</code>](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.rename.html) 
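
The `apply` pattern described above can be tried offline before spending any API calls. The sketch below is only an illustration: `toy_embedding` is a hypothetical stand-in for `get_embedding` (it is not part of OpenAI's library and its vectors carry no meaning), and the sample rows are made up.

```python
import pandas as pd

def toy_embedding(text, dim=4):
    """Hypothetical stand-in for get_embedding: builds a deterministic
    vector of length `dim` from character codes, with no API call."""
    text = text.replace("\n", " ")
    vector = [0.0] * dim
    for position, char in enumerate(text):
        vector[position % dim] += ord(char)
    return vector

# illustrative rows, not taken from the product dataset
toy_df = pd.DataFrame({"ProductName": ["Blue jeans", "Red hat"]})

# apply() calls the function once per value and stores each returned list
# in the corresponding row of the new column
toy_df["embedding"] = toy_df["ProductName"].apply(toy_embedding)
```

Each row of `toy_df["embedding"]` now holds one list of length `dim`, which is the same shape of result the real helper produces with a much longer vector.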

In [None]:
# rename columns for clarity
df = df.rename(columns={
    'value/description': 'Description',
    'value/productName': 'ProductName',
    'value/productID': 'ProductID',
    'value/productURL': 'ProductURL',
    'value/img_url': 'ImageURL'
})
df.head()

In [None]:
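# select required columns; .copy() makes df1 an independent DataFrame,
# so adding the embedding columns later does not trigger SettingWithCopyWarning
df1 = df[['ProductID', 'ProductName', 'ProductURL', 'Description', 'ImageURL']].copy()
df1.head()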

Now we add two more columns. The first new column, `embedding1`, contains the embeddings for the text in the `ProductName` column, and the second new column, `embedding2`, contains the embeddings for the text in the `Description` column. The embeddings are generated with the `text-embedding-ada-002` model, which the helper function receives through its `engine` parameter.

* [<code style="color:blue;">OpenAI</code>](https://platform.openai.com/docs/api-reference/embeddings) 
* [<code style="color:blue;">apply</code>](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.apply.html)

In [None]:
# define a function to get embeddings
from openai import OpenAI
client = OpenAI(api_key=openai.api_key)

def get_embedding(text, engine="text-embedding-ada-002"):
    # replace newlines and retrieve the embedding
    text = text.replace("\n", " ")
    return client.embeddings.create(input=[text], model=engine).data[0].embedding

# calculate embeddings for product name and description
df1['embedding1'] = df1['ProductName'].apply(lambda x: get_embedding(x, engine="text-embedding-ada-002"))
df1['embedding2'] = df1['Description'].apply(lambda x: get_embedding(x, engine="text-embedding-ada-002"))

print("Embeddings successfully calculated.")
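
The cell above sends one request per row. The embeddings endpoint also accepts a list of inputs, so rows can be grouped into fewer requests. The following is a minimal sketch rather than the notebook's method: the helper name `get_embeddings_batched` and the default `batch_size` are assumptions chosen for illustration, and the client is passed in as an argument so the function can be exercised with a stub.

```python
def get_embeddings_batched(client, texts, model="text-embedding-ada-002", batch_size=100):
    """Hypothetical helper: embed `texts` in groups of `batch_size`
    and return one vector per input text, in the original order."""
    vectors = []
    for start in range(0, len(texts), batch_size):
        # same newline cleanup as get_embedding, applied to the whole batch
        batch = [text.replace("\n", " ") for text in texts[start:start + batch_size]]
        response = client.embeddings.create(input=batch, model=model)
        # each returned item carries the position of its input within the batch
        for item in sorted(response.data, key=lambda entry: entry.index):
            vectors.append(item.embedding)
    return vectors
```

With the notebook's objects, a call would look like `df1['embedding1'] = get_embeddings_batched(client, df1['ProductName'].tolist())`.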

<a id='save_embeddings'></a>
## **4. Saving word embeddings**  
[Back to table of contents](#TOC)  

We store the calculated embeddings in a new CSV file called `word_embeddings.csv` so that we don't have to call the OpenAI API again to recompute them.

* [<code style="color:blue;">to_csv</code>](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_csv.html) 
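
The idea of saving once and reusing the file can be expressed as a small load-or-compute helper. This is a sketch under stated assumptions: `load_or_compute` and `expensive_step` are hypothetical names, the sample row is made up, and a temporary directory stands in for `data/product/`.

```python
import os
import tempfile

import pandas as pd

def load_or_compute(path, compute):
    """Hypothetical helper: return the DataFrame cached at `path` if the file
    exists; otherwise call `compute()`, save its result to `path`, and return it."""
    if os.path.exists(path):
        return pd.read_csv(path)
    frame = compute()
    frame.to_csv(path, index=False)
    return frame

# a counter stands in for the number of (costly) embedding runs
calls = {"count": 0}

def expensive_step():
    calls["count"] += 1
    return pd.DataFrame({"ProductName": ["Red hat"], "embedding1": ["[0.1, 0.2]"]})

with tempfile.TemporaryDirectory() as folder:
    cache_path = os.path.join(folder, "word_embeddings.csv")
    first = load_or_compute(cache_path, expensive_step)   # computes and saves
    second = load_or_compute(cache_path, expensive_step)  # reads the saved file
```

After both calls, `calls["count"]` is 1 because the second call reads the file instead of recomputing.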

In [None]:
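# save embeddings to a CSV file; index=False keeps the row index out of the file,
# so reloading it does not add an extra 'Unnamed: 0' column
df1.to_csv('data/product/word_embeddings.csv', index=False)
print("Embeddings saved to word_embeddings.csv")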

In [None]:
df1.head()

<a id='semantic_search'></a>
## **5. Performing semantic search**  
[Back to table of contents](#TOC)  

Now that we have saved our embeddings, we can load them into a new DataFrame for semantic search. Since each embedding in the CSV is stored as the string form of a Python list, we use the `apply()` function to parse these strings back into lists and convert them into numpy arrays, allowing us to perform calculations.

* [<code style="color:blue;">np.array</code>](https://numpy.org/doc/stable/reference/generated/numpy.array.html)
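
The string-to-array conversion described above can be checked on a tiny in-memory CSV. The sketch below uses made-up data and `ast.literal_eval`, which accepts only Python literals and is therefore a safer way to parse file contents than evaluating arbitrary code.

```python
import ast
import io

import numpy as np
import pandas as pd

# illustrative row with a short, made-up embedding
original = pd.DataFrame({
    "ProductName": ["Red hat"],
    "embedding1": [[0.1, -0.2, 0.3]],
})

# write to and read from an in-memory buffer instead of a file on disk
buffer = io.StringIO()
original.to_csv(buffer, index=False)
buffer.seek(0)
loaded = pd.read_csv(buffer)

as_text = loaded["embedding1"][0]  # after the round trip the list is a plain string

# parse each string into a list, then convert the list into a numpy array
loaded["embedding1"] = loaded["embedding1"].apply(
    lambda text: np.array(ast.literal_eval(text))
)
as_array = loaded["embedding1"][0]
```

Here `as_text` is a `str`, while `as_array` is a numpy array that supports arithmetic.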

In [None]:
# load embeddings
df = pd.read_csv('data/product/word_embeddings.csv')

In [None]:
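# convert embedding columns back to numpy arrays
# ast.literal_eval parses Python literals only, which is safer than eval on file contents
import ast

df['embedding1'] = df['embedding1'].apply(ast.literal_eval).apply(np.array)
df['embedding2'] = df['embedding2'].apply(ast.literal_eval).apply(np.array)
df.head()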

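We create a new column in our DataFrame named `embeddings` by adding the `embedding1` and `embedding2` arrays element-wise. The sum points in the same direction as the average of the two vectors, so each product is represented by a single vector that reflects both its name and its description. Because cosine similarity depends only on direction, using the sum instead of the average does not change the search results.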
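
The element-wise sum can be illustrated with small made-up vectors. The sketch below also shows that the sum and the average of two vectors have the same direction once normalized to unit length; all variable names and values are illustrative.

```python
import numpy as np

# made-up three-dimensional stand-ins for a name and a description embedding
name_vec = np.array([1.0, 0.0, 2.0])
desc_vec = np.array([0.0, 3.0, 1.0])

summed = name_vec + desc_vec   # element-wise sum: [1.0, 3.0, 3.0]
averaged = summed / 2          # element-wise mean of the two vectors

# dividing by the length leaves only the direction of each vector
unit_sum = summed / np.linalg.norm(summed)
unit_mean = averaged / np.linalg.norm(averaged)
```

`unit_sum` and `unit_mean` are equal up to floating-point rounding, so any direction-based comparison treats the two combinations identically.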

In [None]:
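# combine name and description embeddings into one vector per product
df['embeddings'] = df['embedding1'] + df['embedding2']
print("Embeddings loaded and combined successfully.")

# keep df.head() as the last expression so that the notebook displays it
df.head()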

In [None]:
# replace the storage account placeholder in the image URLs with the account name
df['ImageURL'] = df['ImageURL'].str.replace('#STORAGE_ACCOUNT_NAME#', 'stretailprod', regex=False)

<a id='cosine_similarity'></a>
### 5.1. Calculating cosine similarity 
[Back to table of contents](#TOC)  

We calculate the cosine similarity between a user-provided search term and the product embeddings.  
The next cell prompts the user to enter a search term, which is then used to search the dataset for the products with the most similar embeddings.

* [<code style="color:blue;">input</code>](https://docs.python.org/3/library/functions.html#input) 

In [None]:
# take search input
search_term = input("Enter a search term: ")

To perform the semantic search, we embed the search term with the same `get_embedding` helper used for the products, passing `engine="text-embedding-ada-002"`. The query and the products must be embedded with the same model, because vectors produced by different models are not comparable. The function returns the embedding vector for the search term.

In [None]:
# calculate embedding for the search term
search_term_vector = get_embedding(search_term, engine="text-embedding-ada-002")

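In this step, we calculate the cosine similarity between the search term vector and the `embeddings` column for each row in the DataFrame and store the result in a new column called `similarities`. Cosine similarity is the cosine of the angle between two vectors: it ranges from -1 to 1, and values closer to 1 mean the vectors point in more similar directions, which is why it is widely used in text and vector-based search. SciPy's `cosine` function returns the cosine distance (1 minus the similarity), so the code subtracts it from 1 to recover the similarity.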

* [<code style="color:blue;">cosine</code>](https://docs.scipy.org/doc/scipy-1.15.0/reference/generated/scipy.spatial.distance.cosine.html)
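
For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths. The sketch below is a minimal NumPy version applied to made-up vectors; the helper name `cosine_similarity` is an assumption for illustration and is not the SciPy function used in the notebook.

```python
import numpy as np

def cosine_similarity(a, b):
    """Hypothetical helper: dot(a, b) / (|a| * |b|) for two non-zero vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

same_direction = cosine_similarity([1, 2, 3], [2, 4, 6])  # parallel vectors
orthogonal = cosine_similarity([1, 0, 0], [0, 1, 0])      # perpendicular vectors
opposite = cosine_similarity([1, 0, 0], [-1, 0, 0])       # opposite directions
```

The three cases give approximately 1, 0, and -1, which mark the top, middle, and bottom of the similarity scale.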

In [None]:
# calculate cosine similarity
from scipy.spatial.distance import cosine

df["similarities"] = df['embeddings'].apply(lambda x: 1 - cosine(x, search_term_vector))
print("Cosine similarity calculated for search term.")

<a id='results'></a>
### 5.2. Sorting and displaying results  
[Back to table of contents](#TOC)  

Now we sort the DataFrame `df` by the `similarities` column in descending order (highest similarity first) using the `sort_values` method and keep the first 10 rows with `head(10)`. The `result` DataFrame therefore contains the 10 products with the highest similarity scores.

* [<code style="color:blue;">sort_values</code>](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sort_values.html)
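
The ranking step can be tried on a small table. In the sketch below, the product names and similarity scores are made up for illustration; `nlargest` is shown as an equivalent pandas shortcut for sorting in descending order and taking the first rows.

```python
import pandas as pd

# illustrative scores, not real search output
scores = pd.DataFrame({
    "ProductName": ["Red hat", "Blue jeans", "Leather handbag", "Running shoes"],
    "similarities": [0.71, 0.93, 0.42, 0.88],
})

# sort from highest to lowest similarity and keep the first two rows
top_sorted = scores.sort_values("similarities", ascending=False).head(2)

# same selection in one call
top_nlargest = scores.nlargest(2, "similarities")
```

Both results keep the original row labels, so a selected row can still be traced back to its position in `scores`.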

In [None]:
# sort results based on similarity
result = df.sort_values("similarities", ascending=False).head(10)

# display the name, description, and similarity score of the top matches
print("Top 10 similar products:")
print(result[['ProductName', 'Description', 'similarities']])