## What is Hybrid search?

Hybrid search combines different search methods to improve result quality. Most commonly, it pairs keyword search with semantic search. Suppose a user searches for "Italian recipes with tomato sauce". Keyword search returns only text containing words such as "Italian", "recipes", "tomato", or "sauce", so it can miss Italian dishes that use tomato as an ingredient but do not contain those exact words. Semantic search on its own can drift the other way and return something like a Mexican salsa recipe, which is related but not what the user is looking for. Hybrid search is a good option in such cases because it combines the strengths of both methods.
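
To make the keyword-search gap concrete, here is a minimal sketch on made-up recipe titles. The titles and the `keyword_hits` helper are illustrative only and are not part of the dataset used later. The helper counts literally shared words, so a relevant dish whose title contains none of the query words scores zero, while a loosely related one still scores.

```python
def keyword_hits(query: str, text: str) -> int:
    """Count query words that literally appear in the text (case-insensitive)."""
    query_words = set(query.lower().split())
    text_words = set(text.lower().replace(",", " ").split())
    return len(query_words & text_words)


query = "Italian recipes with tomato sauce"
docs = [
    "Easy Italian tomato sauce",           # shares words with the query
    "Penne all'arrabbiata",                # Italian and tomato-based, but no shared words
    "Mexican salsa with tomato and lime",  # shares words, yet not Italian
]

for doc in docs:
    print(keyword_hits(query, doc), doc)
```

A semantic ranker would be expected to place the second title much higher, which is the gap hybrid search is meant to close.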


**When to use hybrid search?**

The decision to use hybrid search depends on what your users are looking for in your app. 
- For a code repository where developers need to find exact lines of code or error messages, keyword search is likely ideal because it matches specific terms. 
- For a mental health forum, where users describe their feelings, semantic search is a good option.
- For a shopping app where customers might search for specific product names yet also be open to related suggestions, hybrid search is a good option.


**Some key advantages of using hybrid search**

- Precision: Keyword search enables exact matches to the query, leaving no room for ambiguity. 
- Context: Semantic search allows algorithms to understand the intent of the query. If no keywords are matched, the semantic search will step in to analyze the context and meaning behind the query, ensuring that relevant results are still provided and covering any gaps in keyword-based matching.
- Relevance: Both techniques complement each other and improve relevance for unseen queries.



## Implementing Hybrid Search

In [2]:
import json
import os

import numpy as np
import pandas as pd
import psycopg2
from dotenv import load_dotenv
from openai import AzureOpenAI
from pgvector.psycopg2 import register_vector
from psycopg2.extras import execute_values


load_dotenv()


## Initializing connection and creating a table
DBUSER = os.environ["DBUSER"]
DBPASS = os.environ["DBPASS"]
DBHOST = os.environ["DBHOST"]
DBNAME = os.environ["DBNAME"]
# Use SSL if not connecting to localhost
DBSSL = "disable"
if DBHOST != "localhost":
    DBSSL = "require"


def initiate_connection():
    conn = psycopg2.connect(database=DBNAME, user=DBUSER, password=DBPASS, host=DBHOST, sslmode=DBSSL,port=38530)
    conn.autocommit = True
    return conn

def execute_statement(sql):
    cur = conn.cursor()
    cur.execute(sql)
    return cur.fetchall()
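
The port in `initiate_connection` is hard-coded. A small sketch of collecting the connection settings from the environment instead is shown below. `DBPORT` is a hypothetical variable name that this notebook does not define, and `build_conn_kwargs` is an illustrative helper, not part of psycopg2.

```python
import os


def build_conn_kwargs(env=os.environ):
    """Collect keyword arguments for psycopg2.connect() from an environment-style mapping."""
    host = env["DBHOST"]
    return {
        "database": env["DBNAME"],
        "user": env["DBUSER"],
        "password": env["DBPASS"],
        "host": host,
        # DBPORT is an assumed variable; fall back to PostgreSQL's default port
        "port": int(env.get("DBPORT", 5432)),
        # Use SSL unless connecting to localhost, mirroring the logic above
        "sslmode": "disable" if host == "localhost" else "require",
    }
```

With this in place, the connection could be opened as `psycopg2.connect(**build_conn_kwargs())`.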


### Creating Table and embedding index

In [55]:
### Create table
 
## initialize a connection
conn = initiate_connection()
## create a table
cur = conn.cursor()

def create_table(conn,table_name):
    cur = conn.cursor()
    cur.execute("CREATE EXTENSION IF NOT EXISTS vector")
    cur.execute(f"DROP TABLE IF EXISTS {table_name}")

    ### one additional column holds the embeddings (1536 dimensions for text-embedding-ada-002)
    cur.execute(f"CREATE TABLE {table_name} (id bigserial PRIMARY KEY, product_name TEXT, category TEXT, details TEXT, embedding VECTOR(1536));")
    register_vector(conn)



create_table(conn,'amazon_products')


In [59]:
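## let's create an HNSW index
## vector_cosine_ops matches the cosine-distance operator (<=>) used in the queries below;
## an index built with vector_l2_ops would only serve L2-distance (<->) queries

sql_query = "CREATE INDEX ON amazon_products USING hnsw (embedding vector_cosine_ops) WITH (m = 16, ef_construction = 64);"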
cur = conn.cursor()
cur.execute(sql_query)
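
The operator class chosen for the index should match the distance operator used at query time: in pgvector, `<->` is L2 distance and `<=>` is cosine distance. The pure-Python sketch below uses toy 2-D vectors (illustrative values only) to show that the two distances can disagree about which neighbour is nearest.

```python
import math


def l2_distance(a, b):
    """Euclidean distance, the quantity behind pgvector's <-> operator."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_distance(a, b):
    """One minus cosine similarity, the quantity behind pgvector's <=> operator."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


query_vec = [1.0, 0.0]
candidates = {
    "same_direction_far": [10.0, 0.0],        # same direction as the query, but far away
    "different_direction_near": [1.0, 1.0],   # closer in space, 45 degrees off
}

nearest_l2 = min(candidates, key=lambda name: l2_distance(query_vec, candidates[name]))
nearest_cos = min(candidates, key=lambda name: cosine_distance(query_vec, candidates[name]))
print("nearest by L2:", nearest_l2)
print("nearest by cosine:", nearest_cos)
```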

### Data Ingestion

In [3]:
## let's read the data to ingest

df = pd.read_csv('data/amazon_products.csv')
df['Text'] = df['Product Name'] + ' ' + df['About Product']
df = df[~df['Text'].isnull()].reset_index(drop=True)
df = df.sample(frac=0.25).reset_index(drop=True)
df.to_csv('data/amazon_produts_mini.csv',index=False)
print(df.shape)
df.head(2)

(2432, 4)


| | Product Name | Category | About Product | Text |
|---|---|---|---|---|
| 0 | Bezier Games Palace of Mad King Ludwig | Toys & Games \| Games & Accessories \| Board Games | Make sure this fits by entering your model num... | Bezier Games Palace of Mad King Ludwig Make su... |
| 1 | Great Eastern Entertainment Sonic The Hedgehog... | Toys & Games \| Dress Up & Pretend Play \| Acces... | Make sure this fits by entering your model num... | Great Eastern Entertainment Sonic The Hedgehog... |


In [35]:
## define the Azure OpenAI client used for embeddings
client = AzureOpenAI(
    api_key=os.getenv("OPENAI_API_KEY"),
    api_version=os.getenv("OPENAI_API_VERSION"),
    azure_endpoint=os.getenv("OPENAI_API_ENDPOINT"),
)

def get_embeddings(text, client):
    """
    Creates an OpenAI embedding for the given text
    """
    response = client.embeddings.create(
        model='text-embedding-ada-002',
        input=text
    )
    # the response is a typed object; read the vector directly
    return response.data[0].embedding


## example
sample_text = df['Text'].iloc[0]
print(sample_text)
emb = get_embeddings(sample_text, client)
print(emb)

Fun Rugs Shapes Childrens Rug, 51-Inch by 78-Inch Fire retardant | Easy to clean | Available in multiple sizes | Your home is a natural extension of you | Accent your room with these innovative designs from LA Rug Inc to spruce up any decor
[0.026397015899419785, 0.027831636369228363, 0.000690411077812314, -0.020123811438679695, -0.008920730091631413, -0.008216462098062038, -0.01170519832521677, -0.020749827846884727, 0.012337734922766685, -0.006768799852579832, 0.0062927668914198875, -0.019928181543946266, -0.0033306016121059656, -0.008477302268147469, -0.012559449300169945, 0.007714345119893551, 0.018989156931638718, 0.01460704393684864, -0.005578717216849327, 0.0002916110388468951, -0.017098067328333855, 0.010746611282229424, -0.030909547582268715, -0.008764225989580154, -0.028979331254959106, -0.01651117578148842, 0.02041073516011238, -0.018910905346274376, 0.02264091745018959, -0.0025709050241857767, 0.023149555549025536, -0.022132279351353645, -0.007388295140117407, -0.0480989105

In [82]:

data = []
for i, row in df.iterrows():
    try:
        emb = get_embeddings(row['Text'], client)
        sample = (i, row['Product Name'], row['Category'], row['About Product'], emb)
        data.append(sample)
    except Exception as exc:
        # report the failing row and the error, then stop so it can be inspected
        print(i, row["Text"], exc)
        break
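
The loop above makes one API call per row. The embeddings endpoint also accepts a list of inputs, so requests can be batched. The sketch below is a hedged illustration: `chunked` and `get_embeddings_batch` are hypothetical helpers, the batch size of 16 is an arbitrary assumption, and `client` is expected to behave like the `AzureOpenAI` client created earlier.

```python
def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def get_embeddings_batch(texts, client, model="text-embedding-ada-002", batch_size=16):
    """Embed many texts with one request per batch; returns vectors in input order."""
    embeddings = []
    for batch in chunked(list(texts), batch_size):
        response = client.embeddings.create(model=model, input=batch)
        # Each returned item carries the index of its input; sort to keep input order
        ordered = sorted(response.data, key=lambda item: item.index)
        embeddings.extend(item.embedding for item in ordered)
    return embeddings
```

A call such as `get_embeddings_batch(df['Text'].tolist(), client)` would then return one vector per row. Rate limits and per-request input limits still apply, so the batch size may need tuning.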

In [83]:
# copy the data to our table

### ingest data

## initialize a connection
conn = initiate_connection()
cur = conn.cursor()
## run this insert once: ids are explicit primary keys, so inserting the same rows again raises a duplicate-key error
execute_values(cur, "INSERT INTO amazon_products (id,product_name,category,details,embedding) VALUES %s", data)

## print total number of records in the table
cur.execute("SELECT COUNT(*) as cnt FROM amazon_products")
num_records = cur.fetchone()[0]
print("Total number of records inserted: ",num_records)

Total number of records inserted:  2432


### Creating a column for text search and indexing it

In [7]:
sql = """alter table amazon_products add column tsvector_search tsvector GENERATED ALWAYS AS (to_tsvector('english', coalesce(product_name, '') || ' ' || coalesce(category, '') || ' ' || coalesce(details, ''))) STORED;"""
cur = conn.cursor()
cur.execute(sql)

## Now let's create a GIN index
sql = "CREATE INDEX amazon_txt_idx ON amazon_products USING GIN (tsvector_search);"
cur = conn.cursor()
cur.execute(sql)


In [11]:
sql = "select * from amazon_products limit 2"
execute_statement(sql)

[(0,
  'Star Wars: Legion - Airspeeder Unit Expansion',
  'Toys & Games | Toy Figures & Playsets | Playsets & Vehicles | Vehicles',
  'Make sure this fits by entering your model number. | You may be battling the Empire on the frozen wastes of Hoth, or fighting on the surface of any other planet, across the thousands of planets that make up the galaxy.',
  '[0.0012129159,-0.011669059,-0.0027408106,-0.036702458,-0.014987056,0.015884168,-0.0005491516,-0.033536177,-0.0006076948,-0.02409011,0.017599236,0.0125134,-0.008766636,0.010620229,0.012724485,0.010211252,0.022388235,-0.006906447,0.012691503,-0.014116329,0.018562313,-0.014327414,0.0045977016,-0.017876286,0.0018239089,-0.009320735,0.028259045,-0.015250913,0.018351229,-0.011121556,0.033483405,0.0006538697,-0.015211334,-0.024380352,-0.010732368,-0.012196773,0.030422669,0.012144001,0.027018918,0.0042942665,0.046254065,0.029340856,0.0011733375,0.0252115,-0.025554514,-0.0015245966,0.03066014,-0.010409144,0.011424991,0.021768171,-0.01887894,0

### Hybrid search


**Approach 1: Use a CrossEncoder from sentence-transformers to re-rank the combined results**

In [40]:
def semantic_search(query):
    embedding_ = get_embeddings(query, client)
    with conn.cursor() as cur:
        # bind the embedding as a parameter and cast it to pgvector's type
        cur.execute(
            "SELECT id, product_name FROM amazon_products ORDER BY embedding <=> %s::vector LIMIT 5",
            (str(embedding_),),
        )
        return cur.fetchall()


def keyword_search(query):
    with conn.cursor() as cur:
        # use the same 'english' configuration as the tsvector_search column,
        # and bind the query text instead of formatting it into the SQL
        cur.execute(
            """SELECT id, product_name, ts_rank_cd(tsvector_search, search_query) AS rank
               FROM amazon_products, plainto_tsquery('english', %s) AS search_query
               WHERE search_query @@ tsvector_search
               ORDER BY rank DESC LIMIT 5""",
            (query,),
        )
        return cur.fetchall()
    



In [50]:
## variant of semantic_search that also returns each row's rank
def semantic_search_ranked(query):
    embedding_ = get_embeddings(query, client)
    with conn.cursor() as cur:
        cur.execute(
            "SELECT id, product_name, RANK () OVER (ORDER BY embedding <=> %s::vector) AS rank "
            "FROM amazon_products ORDER BY embedding <=> %s::vector LIMIT 5",
            (str(embedding_), str(embedding_)),
        )
        return cur.fetchall()

query = "toy for kids"
results_ranked = semantic_search_ranked(query)
for i in results_ranked:
    print(i)

(1381, 'MooToys Handy Tool Set with Play Tools Includes Electric Plastic Drill, and Pretend Tools for Toddlers, Yellow (MT-110)', 1)
(181, 'Playtex Musical Monkey Blue', 2)
(534, 'Toomies Tomy Octopus Ball Toy', 3)
(2313, 'Infantino Discover and Play Soft Blocks Development Toy', 4)
(466, 'Ambi Toys, Activity Case', 5)


In [44]:

query = "toy for kids"

print('Semantic Search: ')

results_semantic = semantic_search(query)
for i in results_semantic:
    print(i)

print(' ')
print('----'*20)
print(' ')
print('Full Text Search: ')

results_text = keyword_search(query)
results_text = [(i[0],i[1]) for i in results_text]
for i in results_text:
    print(i)

Semantic Search: 
(1381, 'MooToys Handy Tool Set with Play Tools Includes Electric Plastic Drill, and Pretend Tools for Toddlers, Yellow (MT-110)')
(181, 'Playtex Musical Monkey Blue')
(534, 'Toomies Tomy Octopus Ball Toy')
(2313, 'Infantino Discover and Play Soft Blocks Development Toy')
(466, 'Ambi Toys, Activity Case')
 
--------------------------------------------------------------------------------
 
Full Text Search: 
(32, 'Avengers NERF Power Moves Marvel Iron Man Repulsor Blast Gauntlet NERF Dart-Launching Toy for Kids Roleplay, Toys for Kids Ages 5 and Up')
(1771, 'Step2 2-in-1 Toy Box & Art Lid | Plastic Toy & Art Storage Container, Grey')
(325, 'Wild Republic T-Rex Plush, Dinosaur Stuffed Animal, Plush Toy, Gifts for Kids, Dinosauria 19"')
(1705, 'Viking Toys 1975 Cute Rider Kids Ride On Toy, Bunny, Pink/White')
(483, 'Wild Republic Walrus Plush, Stuffed Animal, Plush Toy, Gifts for Kids, Cuddlekins 8 Inches')


In [45]:
### Now we will use a sentence-transformers cross-encoder for re-ranking
### It is more suitable if we also use a transformer for the encoding

import itertools 
from sentence_transformers import CrossEncoder

def rerank(query, results):
    # deduplicate
    results = set(itertools.chain(*results))

    # re-rank
    encoder = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')
    scores = encoder.predict([(query, item[1]) for item in results])
    return [v for _, v in sorted(zip(scores, results), reverse=True)]


In [47]:
results = [results_semantic,results_text]
final_results = rerank(query, results)


for i in final_results:
    print(i)

(32, 'Avengers NERF Power Moves Marvel Iron Man Repulsor Blast Gauntlet NERF Dart-Launching Toy for Kids Roleplay, Toys for Kids Ages 5 and Up')
(483, 'Wild Republic Walrus Plush, Stuffed Animal, Plush Toy, Gifts for Kids, Cuddlekins 8 Inches')
(325, 'Wild Republic T-Rex Plush, Dinosaur Stuffed Animal, Plush Toy, Gifts for Kids, Dinosauria 19"')
(1705, 'Viking Toys 1975 Cute Rider Kids Ride On Toy, Bunny, Pink/White')
(534, 'Toomies Tomy Octopus Ball Toy')
(466, 'Ambi Toys, Activity Case')
(2313, 'Infantino Discover and Play Soft Blocks Development Toy')
(1381, 'MooToys Handy Tool Set with Play Tools Includes Electric Plastic Drill, and Pretend Tools for Toddlers, Yellow (MT-110)')
(1771, 'Step2 2-in-1 Toy Box & Art Lid | Plastic Toy & Art Storage Container, Grey')
(181, 'Playtex Musical Monkey Blue')
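
`rerank` deduplicates with a `set` of whole tuples, which only works while both searches return identical `(id, product_name)` tuples. If one result list carries an extra column such as a rank, the same product would appear twice. A sketch that deduplicates on the id instead is shown below; `dedupe_by_id` is an illustrative helper and the rows are made-up sample data.

```python
def dedupe_by_id(*result_lists):
    """Merge result lists, keeping the first (id, product_name) seen for each id."""
    seen = {}
    for results in result_lists:
        for row in results:
            item_id, name = row[0], row[1]  # ignore extra columns such as rank
            seen.setdefault(item_id, (item_id, name))
    return list(seen.values())


# Made-up rows: the first list carries a rank column, the second does not
semantic_rows = [(1, "Wooden toy train", 1), (2, "Soft play blocks", 2)]
keyword_rows = [(3, "Toy box for kids"), (1, "Wooden toy train")]

merged = dedupe_by_id(semantic_rows, keyword_rows)
print(merged)
```

The merged list could then be passed to the cross-encoder in place of the `set` built inside `rerank`.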


**Approach 2: Using a single SQL query**

In [114]:
sql =  """
WITH semantic_search AS (
    SELECT id, product_name,RANK () OVER (ORDER BY embedding <=> '{embedding}') AS rank
    FROM amazon_products
    ORDER BY embedding <=> '{embedding}'
    LIMIT 20
),
keyword_search AS (
    SELECT id, product_name,RANK () OVER (ORDER BY ts_rank_cd(tsvector_search, query) DESC) AS rank
    FROM amazon_products, plainto_tsquery('english', '{query}') query
    WHERE tsvector_search @@ query
    ORDER BY ts_rank_cd(tsvector_search, query) DESC
    LIMIT 20
)
SELECT
    COALESCE(semantic_search.id, keyword_search.id) AS id,
    COALESCE(1.0 / ({k}+ semantic_search.rank), 0.0) +
    COALESCE(1.0 / ({k} + keyword_search.rank), 0.0) AS score,
    semantic_search.product_name
FROM semantic_search
FULL OUTER JOIN keyword_search ON semantic_search.id = keyword_search.id
ORDER BY score DESC
LIMIT 5
"""

In [115]:
k = 20  # constant in 1 / (k + rank); larger values flatten the difference between ranks
query_embedding = get_embeddings(query,client)
sql_statement = sql.format(embedding=query_embedding,query=query,k=k)
execute_statement(sql_statement)

[(451,
  0.07423136499888862,
  'Terra by Battat – Wild Animals – Assorted Miniature Wild Animal Toys & Cake Toppers For Kids 3+ (60 Pc)'),
 (375, 0.0493577452062433, None),
 (1412, 0.04933652723611931, None),
 (1236, 0.04933079638833474, None),
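 (1425, 0.04931886256910439, None)]

The `None` product names belong to rows found only by the keyword search: the final `SELECT` reads `semantic_search.product_name`, which is NULL on that side of the `FULL OUTER JOIN`. Selecting `COALESCE(semantic_search.product_name, keyword_search.product_name)` instead would return a name for every row.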
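
The final `SELECT` in the query scores each row as the sum of `1 / (k + rank)` over the two result lists, a scheme commonly called Reciprocal Rank Fusion (RRF). The same arithmetic in plain Python is sketched below on made-up ids; `rrf_fuse` is an illustrative name, not a library function.

```python
def rrf_fuse(ranked_lists, k=60, limit=5):
    """Fuse ranked id lists: score(id) = sum over lists of 1 / (k + rank), rank starting at 1."""
    scores = {}
    for ranked_ids in ranked_lists:
        for rank, item_id in enumerate(ranked_ids, start=1):
            scores[item_id] = scores.get(item_id, 0.0) + 1.0 / (k + rank)
    ordered = sorted(scores.items(), key=lambda pair: pair[1], reverse=True)
    return ordered[:limit]


# Made-up ids: 11 and 22 appear in both lists, 33 and 44 in only one
semantic_ids = [11, 22, 33]
keyword_ids = [22, 44, 11]

fused = rrf_fuse([semantic_ids, keyword_ids], k=60)
for item_id, score in fused:
    print(item_id, round(score, 6))
```

Ids that appear in both lists collect two terms and rise to the top, while ids found by only one search keep a single, smaller term. This mirrors what the `COALESCE(..., 0.0)` expressions do in the SQL.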