<a href="https://colab.research.google.com/github/bhojrajnarwae/AI-Semantic-Search-Engine/blob/main/AI_Semantic_Search_Engine.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Environment Setup**

---
We start by installing the OpenAI and Pinecone clients, along with Hugging Face Datasets, which we use later to load the TREC dataset.

In [1]:
pip install -U openai pinecone-client datasets


Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting openai
  Downloading openai-0.27.4-py3-none-any.whl (70 kB)
Collecting pinecone-client
  Downloading pinecone_client-2.2.1-py3-none-any.whl (177 kB)
Collecting datasets
  Downloading datasets-2.11.0-py3-none-any.whl (468 kB)
Collecting aiohttp
  Downloading aiohttp-3.8.4-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.0 MB)
Collecting dnspython>=2.0.0
  Downloading dnspython-2.3.0-py

In [3]:
!pip install openai

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


# **Creating Embeddings**

---
To create embeddings, we must first initialize our connection to the OpenAI API. This requires an API key, which we get by signing up at OpenAI.
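
Rather than pasting the key into a cell, one option is to read it from an environment variable. The following is a minimal sketch under that assumption: the helper name load_api_key is hypothetical, and OPENAI_API_KEY is the variable name that the openai library also checks by default.

```python
import os

def load_api_key(var_name="OPENAI_API_KEY", environ=None):
    """Return the API key stored in an environment variable, or raise if it is missing."""
    # environ can be swapped for a plain dict, which makes the helper easy to test
    environ = os.environ if environ is None else environ
    key = environ.get(var_name, "").strip()
    if not key:
        raise RuntimeError(f"Set the {var_name} environment variable before running this notebook")
    return key

# Usage (commented out so the sketch runs even when no key is set):
# openai.api_key = load_api_key()
```

This keeps the key out of the saved notebook, which matters when the notebook is shared or pushed to a public repository.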


In [5]:
import openai

openai.api_key = "<<enter API key>>"                  # enter your API key
# get the API key from the top-right dropdown on the OpenAI website

openai.Engine.list()            # check that we have authenticated

<OpenAIObject list at 0x7f3d0d3a92c0> JSON: {
  "data": [
    {
      "created": null,
      "id": "babbage",
      "object": "engine",
      "owner": "openai",
      "permissions": null,
      "ready": true
    },
    {
      "created": null,
      "id": "davinci",
      "object": "engine",
      "owner": "openai",
      "permissions": null,
      "ready": true
    },
    {
      "created": null,
      "id": "text-davinci-edit-001",
      "object": "engine",
      "owner": "openai",
      "permissions": null,
      "ready": true
    },
    {
      "created": null,
      "id": "gpt-3.5-turbo-0301",
      "object": "engine",
      "owner": "openai",
      "permissions": null,
      "ready": true
    },
    {
      "created": null,
      "id": "babbage-code-search-code",
      "object": "engine",
      "owner": "openai-dev",
      "permissions": null,
      "ready": true
    },
    {
      "created": null,
      "id": "text-similarity-babbage-001",
      "object": "engine",
      "owner"

# **OpenAI Playground Connection**




In [6]:
import os
import openai

openai.api_key = "<<enter API key>>"           # personal API key


response = openai.Completion.create(
  model="text-davinci-003",
  prompt="",
  temperature=0.7,
  max_tokens=256,
  top_p=1,
  frequency_penalty=0,
  presence_penalty=0
)

The openai.Engine.list() call made earlier should return the list of models available to us. For embeddings, we will use OpenAI's text-embedding-ada-002 model.

In [7]:
MODEL = "text-embedding-ada-002"                            # text embedding model

res = openai.Embedding.create(
    input=[
        "Sample document text goes here",
        "there will be several phrases in each batch"
    ], engine=MODEL
)
res

<OpenAIObject list at 0x7f3ce2169cc0> JSON: {
  "data": [
    {
      "embedding": [
        -0.0031135426834225655,
        0.011766765266656876,
        -0.00509151816368103,
        -0.027159256860613823,
        -0.01633599027991295,
        0.03237545117735863,
        -0.016160769388079643,
        -0.0010808103252202272,
        -0.02583836019039154,
        -0.006641550455242395,
        0.02012345939874649,
        0.016672953963279724,
        -0.009178885258734226,
        0.02331787347793579,
        -0.010149340145289898,
        0.013458321802318096,
        0.02527226135134697,
        -0.016915567219257355,
        0.012056553736329079,
        -0.01636294648051262,
        -0.004303023684769869,
        -0.006402306258678436,
        -0.00437378603965044,
        0.020810864865779877,
        -0.010567175224423409,
        -0.003726816037669778,
        0.013626803644001484,
        -0.02635054476559162,
        -0.0004172029148321599,
        -0.0021852082572877407,
 

In [None]:
# extract embeddings to a list
embeds = [record['embedding'] for record in res['data']]
print(embeds)

[[-0.003040769835934043, 0.011684642173349857, -0.005026957020163536, -0.027237210422754288, -0.016361193731427193, 0.03234503045678139, -0.016159038990736008, -0.001036894042044878, -0.025822116062045097, -0.00666779326274991, 0.02014825865626335, 0.016657691448926926, -0.009164425544440746, 0.023423193022608757, -0.0101212989538908, 0.01344340294599533, 0.02522912435233593, -0.016873324289917946, 0.012115909717977047, -0.016361193731427193, -0.00426887022331357, -0.006502698641270399, -0.004369948524981737, 0.020808637142181396, -0.01053908932954073, -0.003652293002232909, 0.01369272917509079, -0.026361199095845222, -0.0003171329153701663, -0.0022186669521033764, 0.005822105333209038, -0.010087606497108936, -0.028221039101481438, -0.016159038990736008, -0.0042183310724794865, 0.007466311100870371, -0.0029228453058749437, -0.031455542892217636, 0.023881414905190468, -0.03328842669725418, -0.0003649345017038286, 0.013072783127427101, 0.00707547552883625, -0.005680595990270376, 0.003106

# **Initializing a Pinecone Index**

---
Next, we initialize an index to store the vector embeddings.


In [10]:
import pinecone

# initialize connection to pinecone (get API key at app.pinecone.io)
pinecone.init(
    api_key="<<Enter Pinecone API>>",                                      # enter your Pinecone API key
    environment="<<Enter Environment>>"                          # found next to the API key in the console
)

# check if 'openai' index already exists (only create index if not)
if 'openai' not in pinecone.list_indexes():
    pinecone.create_index('openai', dimension=len(embeds[0]))      # creating index named openai in pinecone vector database
# connect to index
index = pinecone.Index('openai')                              


# **Populating the Index**
---
With both the OpenAI and Pinecone connections initialized, we can move on to populating the index. For this we use the TREC dataset, which we download with Hugging Face Datasets.


In [11]:
from datasets import load_dataset                            # Loading trec dataset from Hugging Face dataset 

# load the first 2K rows of the TREC dataset
trec = load_dataset('trec', split='train[:2000]')
trec

Downloading builder script:   0%|          | 0.00/5.09k [00:00<?, ?B/s]

Downloading metadata:   0%|          | 0.00/3.34k [00:00<?, ?B/s]

Downloading readme:   0%|          | 0.00/10.6k [00:00<?, ?B/s]

Downloading and preparing dataset trec/default to /root/.cache/huggingface/datasets/trec/default/2.0.0/f2469cab1b5fceec7249fda55360dfdbd92a7a5b545e91ea0f78ad108ffac1c2...


Downloading data files:   0%|          | 0/2 [00:00<?, ?it/s]

Downloading data:   0%|          | 0.00/336k [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/23.4k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/5452 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/500 [00:00<?, ? examples/s]

Dataset trec downloaded and prepared to /root/.cache/huggingface/datasets/trec/default/2.0.0/f2469cab1b5fceec7249fda55360dfdbd92a7a5b545e91ea0f78ad108ffac1c2. Subsequent calls will reuse this data.


Dataset({
    features: ['text', 'coarse_label', 'fine_label'],
    num_rows: 2000
})

In [12]:
from tqdm.auto import tqdm                                  # this is our progress bar

batch_size = 35                                             # process everything in batches of 35
for i in tqdm(range(0, len(trec['text']), batch_size)):
    # set end position of batch
    i_end = min(i+batch_size, len(trec['text']))
    # get batch of lines and IDs
    lines_batch = trec['text'][i:i_end]
    ids_batch = [str(n) for n in range(i, i_end)]
    # create embeddings
    res = openai.Embedding.create(input=lines_batch, engine=MODEL)
    embeds = [record['embedding'] for record in res['data']]
    # prep metadata and upsert batch
    meta = [{'text': line} for line in lines_batch]
    to_upsert = zip(ids_batch, embeds, meta)
    # upsert to Pinecone
    index.upsert(vectors=list(to_upsert))


  0%|          | 0/58 [00:00<?, ?it/s]
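
The total of 58 in the progress bar follows from splitting 2000 rows into batches of 35, where the last batch is shorter than the rest. A small sketch of the same index arithmetic, with no API calls (batch_ranges is a hypothetical helper name, not part of the notebook):

```python
def batch_ranges(n_rows, batch_size):
    """Yield (start, end) index pairs covering n_rows in steps of batch_size."""
    for start in range(0, n_rows, batch_size):
        # clamp the end so the final batch does not run past the data
        yield start, min(start + batch_size, n_rows)

ranges = list(batch_ranges(2000, 35))
print(len(ranges))   # 58
print(ranges[-1])    # (1995, 2000)
```

The IDs upserted for each batch are simply the string forms of the integers in each (start, end) range, so every row keeps a stable ID across runs.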

# **Querying**

---
With our data indexed, we're now ready to move on to performing searches. This follows a similar process to indexing. We start with a text query that we would like to use to find similar sentences. As before, we encode it with the text-embedding-ada-002 model to create a query vector xq. We then use xq to query the Pinecone index.
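
To make the idea of "similar" concrete, here is a brute-force sketch of a top-k search in plain Python. It assumes the index ranks by cosine similarity (an assumption about the index configuration, since the notebook does not set a metric explicitly) and uses toy 3-dimensional vectors in place of real embeddings. A vector database uses far more efficient search than this loop; the function names are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, stored, k):
    """stored maps id -> vector. Return the k best (id, score) pairs, highest score first."""
    scored = [(vec_id, cosine_similarity(query_vec, vec)) for vec_id, vec in stored.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

# Toy vectors standing in for stored embeddings
stored = {"a": [1.0, 0.0, 0.0], "b": [0.7, 0.7, 0.0], "c": [0.0, 0.0, 1.0]}

for vec_id, score in top_k([1.0, 0.1, 0.0], stored, k=2):
    print(f"{score:.2f}: {vec_id}")
```

Under that assumption, the scores printed by the queries below can be read the same way: values closer to 1 mean the stored sentence points in nearly the same direction as the query in embedding space.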

In [17]:
query = "What caused the 1929 Great Depression?"                       # query searched in dataset

xq = openai.Embedding.create(input=query, engine=MODEL)['data'][0]['embedding']
res = index.query([xq], top_k=5, include_metadata=True)

for match in res['matches']:
    print(f"{match['score']:.2f}: {match['metadata']['text']}")

0.92: Why did the world enter a global depression in 1929 ?
0.87: When was `` the Great Depression '' ?
0.82: What tragedy befell the city of Dogtown in 1899 ?
0.82: What were the causes of the Civil War ?
0.81: What crop failure caused the Irish Famine ?


In [16]:
query = "What were the popular songs in the early 20th century?"       # query searched in dataset

# create the query embedding
xq = openai.Embedding.create(input=query, engine=MODEL)['data'][0]['embedding']

# query, returning the top 10 most similar results
res = index.query([xq], top_k=10, include_metadata=True)

for match in res['matches']:
    print(f"{match['score']:.2f}: {match['metadata']['text']}")

0.93: What were popular songs and types of songs in the 1920s ?
0.83: What was the name of that popular song the Creeps sang ?
0.82: What song did Patti Page set people dancing to in 1950 ?
0.82: Name a band which was famous in the 1960 's .
0.79: What are some important events of the 1830 's ?
0.79: Who danced into stardom with Fred Astaire in 1941 's You 'll Never Get Rich ?
0.79: What are the shortest and the longest songs ever produced ?
0.79: What 1920s cowboy star rode Tony the Wonder Horse ?
0.79: What song served as the closing theme of The Johnny Cash Show ?
0.79: When did the `` Star-Spangled Banner '' become the national anthem ?


Let's perform one more search, this time describing songs by their definition rather than using the word itself or related words.

In [18]:
query = "What were the popular act or art of singing in the early 20th century?"

# create the query embedding
xq = openai.Embedding.create(input=query, engine=MODEL)['data'][0]['embedding']

# query, returning the top 10 most similar results
res = index.query([xq], top_k=10, include_metadata=True)

for match in res['matches']:
    print(f"{match['score']:.2f}: {match['metadata']['text']}")

0.90: What were popular songs and types of songs in the 1920s ?
0.81: Name a band which was famous in the 1960 's .
0.80: What was the name of that popular song the Creeps sang ?
0.80: Who is considered The First Lady of the American Stage ?
0.80: What song did Patti Page set people dancing to in 1950 ?
0.79: In what medium is Stuart Hamblen considered to be the first singing cowboy ?
0.79: What was the backup singing group for Roy Rogers ?
0.79: Who patented the first phonograph ?
0.79: Who starred in Singing in the Rain and The Singing Nun ?
0.79: What 1920s cowboy star rode Tony the Wonder Horse ?


# **Working On Query**
---
Now we can enter our own query to check that the search engine works.

In [19]:
query = input("Enter a query: ")                                          # Input the query to be searched

xq = openai.Embedding.create(input=query, engine=MODEL)['data'][0]['embedding']
res = index.query([xq], top_k=5, include_metadata=True)

for match in res['matches']:
    print(f"{match['score']:.2f}: {match['metadata']['text']}")

Enter a query: Who were the great musicians in 1920s?
0.88: What were popular songs and types of songs in the 1920s ?
0.84: Name a band which was famous in the 1960 's .
0.82: What composer was awarded the Medal of Honor by Franklin D. Roosevelt ?
0.82: What group starred in the movie Rock Around the Clock ?
0.81: Who patented the first phonograph ?


The raw response shows the ID, metadata, and similarity score of each match.

In [20]:
res                                                           # raw query response: id, metadata, and score for each match

{'matches': [{'id': '835',
              'metadata': {'text': 'What were popular songs and types of songs '
                                   'in the 1920s ?'},
              'score': 0.882370889,
              'values': []},
             {'id': '1799',
              'metadata': {'text': 'Name a band which was famous in the 1960 '
                                   "'s ."},
              'score': 0.84462744,
              'values': []},
             {'id': '1657',
              'metadata': {'text': 'What composer was awarded the Medal of '
                                   'Honor by Franklin D. Roosevelt ?'},
              'score': 0.818383813,
              'values': []},
             {'id': '1766',
              'metadata': {'text': 'What group starred in the movie Rock '
                                   'Around the Clock ?'},
              'score': 0.815941036,
              'values': []},
             {'id': '831',
              'metadata': {'text': 'Who patented the first phon
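
Because each match carries an id, a score, and the original text, weak matches can be dropped before the results are used further. A minimal sketch on a hand-built dictionary that copies the first three matches shown above; the filter_matches helper and the 0.84 cutoff are illustrative assumptions, not values from the notebook.

```python
def filter_matches(response, min_score):
    """Keep (id, score, text) for matches scoring at least min_score."""
    return [
        (m["id"], m["score"], m["metadata"]["text"])
        for m in response["matches"]
        if m["score"] >= min_score
    ]

# Stand-in for a query response, using the same nested shape as the output above
sample = {"matches": [
    {"id": "835", "score": 0.882370889,
     "metadata": {"text": "What were popular songs and types of songs in the 1920s ?"}},
    {"id": "1799", "score": 0.84462744,
     "metadata": {"text": "Name a band which was famous in the 1960 's ."}},
    {"id": "1657", "score": 0.818383813,
     "metadata": {"text": "What composer was awarded the Medal of Honor by Franklin D. Roosevelt ?"}},
]}

for match_id, score, text in filter_matches(sample, min_score=0.84):
    print(match_id, f"{score:.2f}", text)
```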

# **Generating Prompts**
---
We now generate a prompt to check whether the search engine can provide context related to the query.


In [21]:
import openai                                                            # Generating Prompt
openai.api_key = "<<enter API key>>"                                     # never hard-code a real key in a shared notebook
def create_prompt(query):
    header = "Answer the question as truthfully as possible using the provided context, and if the answer is not contained within the text and requires some latest information to be updated, print 'Sorry Not Sufficient context to answer query' \n"
    return header + query + "\n"

def generate_answer(prompt):
    # request a completion with temperature 0 and return its text
    response = openai.Completion.create(
        model="text-davinci-003",
        prompt=prompt,
        temperature=0,
        max_tokens=256,
        top_p=1,
        frequency_penalty=0,
        presence_penalty=0,
        stop=[' END']
    )
    return response.choices[0].text.strip()

In [22]:
prompt = create_prompt(query)                                  # Displaying generated prompt 
print(prompt)

Answer the question as truthfully as possible using the provided context, and if the answer is not contained within the text and requires some latest information to be updated, print 'Sorry Not Sufficient context to answer query' 
Who were the great musicians in 1920s?
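
Note that the printed prompt contains only the instruction and the question; none of the retrieved matches are included, even though the header refers to a provided context. One way the matches might be folded in is sketched below. This is an assumption about how the two halves could be connected, not the notebook's method: create_prompt_with_context is a hypothetical helper, the header is shortened for readability, and the matches list is hand-built in the shape of res['matches'].

```python
def create_prompt_with_context(query, matches, header):
    """Build a prompt: instruction header, retrieved passages, then the question."""
    context_lines = [f"- {m['metadata']['text']}" for m in matches]
    return (
        header.strip() + "\n\n"
        + "Context:\n" + "\n".join(context_lines) + "\n\n"
        + "Question: " + query + "\n"
    )

header = "Answer the question using the provided context."
matches = [
    {"metadata": {"text": "What were popular songs and types of songs in the 1920s ?"}},
    {"metadata": {"text": "Who patented the first phonograph ?"}},
]

print(create_prompt_with_context("Who were the great musicians in 1920s?", matches, header))
```

Since the TREC entries are themselves questions, they add little factual grounding here; the same pattern is more useful when the index holds passages that actually contain answers.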



Finally, we print the engine's reply to the generated prompt.

In [23]:
reply = generate_answer(prompt)                 # Printing the reply of generated prompt using gpt-3 model
print(reply)

The 1920s saw the emergence of some of the most influential musicians in history, including Louis Armstrong, Duke Ellington, Bessie Smith, Jelly Roll Morton, and Fats Waller.
