## Pinecone Vector Database Semantic Search Example

This is a simple semantic search application that uses the Pinecone vector database
to store embeddings. Pinecone offers a free starter index as a community edition,
which allows you to index up to 100K vector embeddings.

Only a single index is allowed in the community edition. The diagram below
shows the process and flow, and the steps in the notebook demonstrate, in
chronological order, how to create a semantic search application.

<img src="images/pinecone_vectordb.png">

[source](https://www.pinecone.io/learn/vector-database/)

In [2]:
import os
from pinecone import Pinecone, PodSpec
from datasets import load_dataset
from sentence_transformers import SentenceTransformer
from dotenv import load_dotenv, find_dotenv
from tqdm.auto import tqdm

Utility function that extracts the relevant information from the results
returned by a Pinecone search query and prints it.

In [3]:
def extract_and_print_matches(results):
    for result in results['matches']:
        print(f"Score  : {round(result['score'], 2)}")
        print(f"Matches: {result['metadata']['text']}")
        print('-' * 50)
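
As a rough illustration of the response shape the helper above reads, the sketch below builds a hypothetical `sample_results` dict by hand and formats it without calling Pinecone. The name `format_matches`, the `max_chars` parameter, and the sample texts and scores are assumptions made for this example only.

```python
# Hypothetical response, shaped like the dict the helper above reads:
# a 'matches' list whose items carry a 'score' and a 'metadata' dict.
sample_results = {
    "matches": [
        {"id": "0", "score": 0.6312, "metadata": {"text": "A tense cold war thriller."}},
        {"id": "1", "score": 0.5871, "metadata": {"text": "Slow start, great ending."}},
    ]
}


def format_matches(results, max_chars=80):
    """Return one (score, text) pair per match, with the text truncated."""
    rows = []
    for match in results["matches"]:
        text = match["metadata"]["text"]
        rows.append((round(match["score"], 2), text[:max_chars]))
    return rows


for score, text in format_matches(sample_results):
    print(f"Score  : {score}")
    print(f"Matches: {text}")
    print("-" * 50)
```

Returning the rows instead of printing them directly makes the formatting easy to check, and truncating the text keeps long reviews readable.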

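### Step 1: Load the IMDB dataset

The cell requests the first 50k training samples; the IMDB `train` split holds
25,000 reviews, so the slice returns all of them.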

In [4]:
dataset = load_dataset("imdb", split='train[:50000]')
print(dataset[:1])

{'text': ['I rented I AM CURIOUS-YELLOW from my video store because of all the controversy that surrounded it when it was first released in 1967. I also heard that at first it was seized by U.S. customs if it ever tried to enter this country, therefore being a fan of films considered "controversial" I really had to see this for myself.<br /><br />The plot is centered around a young Swedish drama student named Lena who wants to learn everything she can about life. In particular she wants to focus her attentions to making some sort of documentary on what the average Swede thought about certain political issues such as the Vietnam War and race issues in the United States. In between asking politicians and ordinary denizens of Stockholm about their opinions on politics, she has sex with her drama teacher, classmates, and married men.<br /><br />What kills me about I AM CURIOUS-YELLOW is that 40 years ago, this was considered pornographic. Really, the sex and nudity scenes are few and far b

#### Examine the data

In [5]:
reviews = []
for record in dataset['text']:
    reviews.extend(record.split('\n'))
reviews = list(set(reviews))
print('\n'.join(reviews[:2]))
print('-' * 50)
print(f'Number of reviews: {len(reviews)}')

I am going to go out on a limb, and actually defend "Shades of Grey" as a good clip-show episode, which delved into the life and death struggle of Commander William Thomas Riker who was battling a terminally fatal disease.<br /><br />The scenes from the flashback sequences were implemented quite well with the mood Riker was in such as when he was reliving his romantic episodes such as "11001001," "Angel One," and "Up the Long Ladder." Tragic moments were highlighted such as Tasha's death in "Skin of Evil," as well as elements of pulse-pounding danger in "Heart of Glory," "Conspiracy," and the aforementioned "Skin of Evil." Riker also exhibited courage under fire by telling some humorous jokes such as "An ancestor of mine was bitten by a rattlesnake once...after 3 days of intense pain, the snake died." This episode highlighted the psychological ordeal of Will Riker under extreme duress. And, YES, I am biased in my opinion in proclaiming "Shades of Grey" as a solid episode, because at th
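
The cell above removes duplicates with `set`, whose iteration order for strings can differ from one Python session to the next, so a given position in `reviews` may hold a different review on each run. If stable positions matter, one possible alternative is an order-preserving de-duplication. The helper name `split_and_dedupe` and the sample strings are assumptions made for this sketch.

```python
def split_and_dedupe(texts):
    """Split each text on newlines and drop duplicates, keeping first-seen order."""
    lines = []
    for text in texts:
        lines.extend(text.split("\n"))
    # dict keys are unique and keep insertion order (Python 3.7+)
    return list(dict.fromkeys(lines))


sample = ["great film", "dull plot\ngreat film", "dull plot"]
print(split_and_dedupe(sample))  # ['great film', 'dull plot']
```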

### Step 2: Instantiate the sentence transformer embedding model

In [6]:
model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')

In [7]:
# Try encoding a sample review
embeddings = model.encode(reviews[0:1])
print(f"vector shape: {embeddings.shape}; vector length:{len(embeddings[0])}")

vector shape: (1, 384); vector length:384


### Step 3: Set up the Pinecone environment

Load the Pinecone API key from the `.env` file. The environment name, "gcp-starter",
is set in code; the GCP starter environment is the free community edition of Pinecone.

In [10]:
_ = load_dotenv(find_dotenv())
api_key = os.getenv("PINECONE_API_KEY")
if api_key is None:
    raise ValueError("Please set the PINECONE_API_KEY environment variable")

# The client only needs the API key; the "gcp-starter" environment
# is passed via PodSpec when the index is created.
pc = Pinecone(api_key=api_key)
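
The key lookup above can be wrapped in a small helper so that an unset or empty variable fails early with a clear message. This is only a sketch: `require_env` and the `DEMO_API_KEY` variable are hypothetical names used for illustration, and the demo deliberately avoids touching the real `PINECONE_API_KEY`.

```python
import os


def require_env(name):
    """Return the value of environment variable `name`, or raise if unset or empty."""
    value = os.getenv(name)
    if not value:
        raise ValueError(f"Please set the {name} environment variable")
    return value


# Demo with a hypothetical variable name so the real key is never touched
os.environ["DEMO_API_KEY"] = "not-a-real-key"
print(require_env("DEMO_API_KEY"))  # not-a-real-key
```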

In [11]:
# check if an index exists in Pinecone
index_name = "starter-index"
existing_indexes = [
    index_info["name"] for index_info in pc.list_indexes()
]
existing_indexes

[]

In [12]:
# Delete the index if one already exists, since Pinecone allows only
# one index in the starter community edition
if index_name in existing_indexes:
    print(f"Index {index_name} already exists. Deleting it.")
    pc.delete_index(index_name)

### Step 3 (continued): Create an index
Then get a handle to it.

In [13]:
print(f"Creating a new index {index_name}...")
pc.create_index(
    name=index_name,
    metric="cosine",
    dimension=embeddings.shape[1],
    spec=PodSpec(environment="gcp-starter"),
)
# Connect or get a pointer to the index
pindex = pc.Index(index_name)

Creating a new index starter-index...
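
A newly created index may take a moment before it accepts writes, so it can help to poll for readiness before upserting. The sketch below is a generic polling helper, not part of the notebook: `wait_until` and `fake_ready` are hypothetical names, and the commented-out readiness check assumes a client version in which `pc.describe_index(...)` exposes a `status["ready"]` flag.

```python
import time


def wait_until(is_ready, timeout=60.0, interval=1.0):
    """Poll is_ready() until it returns True or `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if is_ready():
            return True
        time.sleep(interval)
    return False


# With the client above, the check could look like this (not run here):
#   wait_until(lambda: pc.describe_index(index_name).status["ready"])

# Stand-in check that becomes ready on the third poll
polls = {"count": 0}


def fake_ready():
    polls["count"] += 1
    return polls["count"] >= 3


print(wait_until(fake_ready, timeout=5.0, interval=0.01))  # True
```

Returning `False` on timeout, rather than raising, leaves the caller free to decide whether to retry or abort.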


### Step 4: Upsert data into the index.

In our case, we are going to create IMDB review embeddings in batches
and upsert each batch, along with `ids` and `metadata`.

In [15]:
print("Upserting the embeddings into the index...")
batch_size = 500
for i in tqdm(range(0, len(reviews), batch_size)):

    # slice the next mini-batch of reviews
    i_end = min(i + batch_size, len(reviews))
    batch = reviews[i:i_end]

    # create string IDs for the batch
    ids = [str(x) for x in range(i, i_end)]

    # store each review's text as metadata
    metadatas = [{'text': text} for text in batch]

    # create embeddings for the batch; a separate name keeps the
    # `embeddings` array from Step 2 intact
    batch_embeddings = model.encode(batch).tolist()
    records = list(zip(ids, batch_embeddings, metadatas))

    # upsert the batch to Pinecone
    print(f"Upserting {i} to {i_end} records...")
    pindex.upsert(vectors=records)

Upserting the embeddings into the index...


  0%|          | 0/50 [00:00<?, ?it/s]

Upserting 0 to 500 records...
Upserting 500 to 1000 records...
Upserting 1000 to 1500 records...
Upserting 1500 to 2000 records...
Upserting 2000 to 2500 records...
Upserting 2500 to 3000 records...
Upserting 3000 to 3500 records...
Upserting 3500 to 4000 records...
Upserting 4000 to 4500 records...
Upserting 4500 to 5000 records...
Upserting 5000 to 5500 records...
Upserting 5500 to 6000 records...
Upserting 6000 to 6500 records...
Upserting 6500 to 7000 records...
Upserting 7000 to 7500 records...
Upserting 7500 to 8000 records...
Upserting 8000 to 8500 records...
Upserting 8500 to 9000 records...
Upserting 9000 to 9500 records...
Upserting 9500 to 10000 records...
Upserting 10000 to 10500 records...
Upserting 10500 to 11000 records...
Upserting 11000 to 11500 records...
Upserting 11500 to 12000 records...
Upserting 12000 to 12500 records...
Upserting 12500 to 13000 records...
Upserting 13000 to 13500 records...
Upserting 13500 to 14000 records...
Upserting 14000 to 14500 records...
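
The batching logic in the loop above can be isolated and checked on a toy list before any embeddings are sent. This is a minimal sketch under that assumption: `iter_batches` and `sample_reviews` are hypothetical names, and the IDs are string positions, as in the notebook.

```python
def iter_batches(items, batch_size):
    """Yield (ids, chunk) pairs, where ids are the string positions of the chunk."""
    for start in range(0, len(items), batch_size):
        chunk = items[start:start + batch_size]
        ids = [str(i) for i in range(start, start + len(chunk))]
        yield ids, chunk


sample_reviews = ["a", "b", "c", "d", "e"]
for ids, chunk in iter_batches(sample_reviews, batch_size=2):
    print(ids, chunk)
# ['0', '1'] ['a', 'b']
# ['2', '3'] ['c', 'd']
# ['4'] ['e']
```

The last batch is shorter than `batch_size` whenever the list length is not a multiple of it, which is why the IDs are derived from the chunk length.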


In [16]:
# Check the index stats
print(pindex.describe_index_stats())

{'dimension': 384,
 'index_fullness': 0.24904,
 'namespaces': {'': {'vector_count': 24904}},
 'total_vector_count': 24904}
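
The numbers in these stats line up with figures seen earlier in the notebook. Assuming `index_fullness` is simply the stored vector count divided by the index capacity (an assumption, not something the output states), the implied capacity matches the 100K limit mentioned in the introduction, and the vector count explains the 50 batches shown in the progress bar.

```python
import math

total_vector_count = 24904   # from the stats above
index_fullness = 0.24904     # from the stats above

# Assumption: fullness is count / capacity for this index
implied_capacity = round(total_vector_count / index_fullness)
print(implied_capacity)  # 100000

# Number of upsert batches needed at 500 reviews per batch
batch_size = 500
num_batches = math.ceil(total_vector_count / batch_size)
print(num_batches)  # 50
```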


### Step 5: Query the Pinecone indexed vector database

In [17]:
query = """This is a classic espionage thriller. I loved the movie, it was capitivating, 
            the plot brilliant, based on a true story, the characters were well developed,
            and their actions unpredictable. The actors were amazing, and the direction of plot was
            very well thought out. Recommended to everyone if you love clock and dagger
            twists and turns of cold war dramas and betrayals, and if you relish how John Le Carre spins his plots 
            in his absorbing novels on cold war espionage tales of spooks and crooks, you shall throughly enjoy this one!"""


In [18]:
# create an embedding for the query
query_embedding = model.encode(query).tolist()
results = pindex.query(vector=query_embedding, top_k=5,
                include_values=False, include_metadata=True)
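
Conceptually, a `top_k` query with the cosine metric returns the stored vectors whose direction is closest to the query vector. The sketch below shows that idea as a brute-force scan over a toy dictionary. It is an illustration only and does not reflect how Pinecone searches internally; `cosine_similarity`, `top_k`, and `toy_vectors` are names made up for this example.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query, vectors, k):
    """Return the k (id, score) pairs most similar to `query`, best first."""
    scored = [(vec_id, cosine_similarity(query, vec)) for vec_id, vec in vectors.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


toy_vectors = {
    "0": [1.0, 0.0],
    "1": [0.0, 1.0],
    "2": [1.0, 1.0],
}
for vec_id, score in top_k([1.0, 0.0], toy_vectors, k=2):
    print(vec_id, round(score, 2))
# 0 1.0
# 2 0.71
```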

In [19]:
print("Top 5 results for the query:")

Top 5 results for the query:


In [20]:
extract_and_print_matches(results)

Score  : 0.63
Matches: Every now and then there gets released this movie no one has ever heard of and got shot in a very short time with very little money and resource but everybody goes crazy about and turns out to be a surprisingly great one. This also happened in the '50's with quite a few little movies, that not a lot of people have ever heard of. There are really some unknown great surprising little jewels from the '50's that are worth digging out. "Panic in the Streets" is another movie like that that springs to the mind. Both are movies that aren't really like the usual genre flicks from their time and are also made with limited resources.<br /><br />I was really surprised at how much I ended up liking this movie. It was truly a movie that got better and better as it progressed. Like all 'old' movies it tends to begin sort of slow but once you get into the story and it's characters you're in for a real treat with this movie.<br /><br />The movie has a really great story that inv

### Step 6 (optional): Remove the index
Since the community edition allows only a single index, remove it at the end.
You can recreate it next time, or index a different dataset instead.

In [21]:
# Delete the index
print(f"Deleting the index {index_name}...")
pc.delete_index(index_name)
print("Done!")

Deleting the index starter-index...
Done!
