# LangChain Retrievers for Pinecone

#### Pinecone API overview
https://docs.pinecone.io/docs/overview

#### LangChain Pinecone
https://docs.pinecone.io/docs/langchain

Dependencies:

pip install --upgrade --quiet langchain-pinecone langchain-openai langchain datasets python-dotenv

1. Requires a Pinecone index
* Create the Pinecone index using the Pinecone console (website)
* For the dimension, select the output dimension of the OpenAI embedding model (1536 for text-embedding-ada-002)

2. Requires OpenAI access
* https://platform.openai.com/docs/models/embeddings
* Embedding model: text-embedding-ada-002 (1536 dimensions)

In [1]:
!pip install --upgrade --quiet langchain-pinecone langchain-openai langchain datasets python-dotenv

## Set up the Pinecone index

Two ways to do it:

1. Set it up using the Pinecone console
2. Use the Pinecone API to create the index
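
For the second option, a minimal sketch is shown below. It assumes the `pinecone` Python client (v3 or later), where `list_indexes().names()` and `create_index(...)` are available. The helper name `ensure_index`, and the serverless cloud and region in the usage comment, are illustrative assumptions and not values taken from this notebook.

```python
def ensure_index(client, name, dimension, spec, metric="cosine"):
    """Create the index only if it does not exist yet.

    `client` is assumed to be a `pinecone.Pinecone` instance (or any object
    exposing the same two methods). Returns True if an index was created.
    """
    existing = client.list_indexes().names()
    if name in existing:
        return False
    client.create_index(name=name, dimension=dimension, metric=metric, spec=spec)
    return True


# Usage against a real account (needs the `pinecone` package and network access).
# The cloud and region values are assumptions; use the ones for your project.
# import os
# from pinecone import Pinecone, ServerlessSpec
# pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"])
# ensure_index(pc, "newsgroup", 1536, ServerlessSpec(cloud="aws", region="us-east-1"))
```

The dimension (1536) must match the embedding model, otherwise later upserts are likely to be rejected by the index.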

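## Set up the environment

**Note**

You MUST change the path passed to `load_dotenv` so that it points to your own `.env` file. That file must define `PINECONE_API_KEY` and `OPENAI_API_KEY`.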

In [2]:
from datasets import load_dataset
from dotenv import load_dotenv
import os
import warnings
from IPython.display import JSON

warnings.filterwarnings("ignore")

# Load the file that contains the API keys
load_dotenv('C:\\Users\\raj\\.jupyter\\.env')

pinecone_api_key = os.environ.get('PINECONE_API_KEY')
openai_api_key = os.environ.get('OPENAI_API_KEY')
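
Because `os.environ.get` silently returns `None` for a missing key, a wrong `.env` path only surfaces later as an authentication error. The sketch below fails early instead. `require_env` is a hypothetical helper introduced here for illustration; it uses only the standard library.

```python
import os


def require_env(names, environ=None):
    """Return {name: value} for the requested variables, or raise if any is missing or empty."""
    environ = os.environ if environ is None else environ
    missing = [name for name in names if not environ.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {name: environ[name] for name in names}


# Usage after load_dotenv(...) has run:
# keys = require_env(["PINECONE_API_KEY", "OPENAI_API_KEY"])
```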

## Load the dataset

The dataset is `acloudfan/newsgroups-mini`, loaded from the Hugging Face Hub.

In [3]:
dataset_name = 'acloudfan/newsgroups-mini'
newsgroup_dataset = load_dataset(dataset_name)
# Split the original 'train' split into train (90%) and test (10%) subsets
newsgroup_dataset = newsgroup_dataset['train'].train_test_split(test_size=0.1)

newsgroup_dataset

DatasetDict({
    train: Dataset({
        features: ['text', 'class'],
        num_rows: 405
    })
    test: Dataset({
        features: ['text', 'class'],
        num_rows: 45
    })
})

## 1. Convert the text to LangChain documents with metadata

In [4]:
from langchain_pinecone import PineconeVectorStore
from langchain_core.documents import Document
from langchain_openai import OpenAIEmbeddings

pinecone_index_name = 'newsgroup'
index_dimension = 1536

train_docs = []
doc_ids = []
for row in newsgroup_dataset['train']:
    doc_id = f'post-{len(doc_ids)}'
    doc_ids.append(doc_id)
    # Add the newsgroup name (class) and the document ID as metadata
    doc = Document(
        page_content=row['text'],
        metadata={'group': row['class'], 'id': doc_id},
    )
    train_docs.append(doc)

len(train_docs)

405

## 2. Add the documents to the Pinecone index
**upsert**: If a document with the given ID is already in the store, its page_content and metadata are overwritten; otherwise a new record is inserted.

You may break this into 2 parts:

1. Create a connection to Pinecone

   `pinecone_vdb = PineconeVectorStore(index_name = pinecone_index_name, embedding = openai_embeddings)`

2. Add documents to the index by using the `add_documents` method

In [5]:
# The vector store needs the embedding model to convert each document into a vector
openai_embeddings = OpenAIEmbeddings()

# Create the vector store object and add the documents with upsert behavior
pinecone_vdb = PineconeVectorStore.from_documents(
    train_docs,
    index_name=pinecone_index_name,
    embedding=openai_embeddings,
    ids=doc_ids,
)
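
The two-part alternative described above can be sketched as follows. `add_in_batches` and `batch_size` are hypothetical names introduced for illustration; the only vector store call used is `add_documents(documents=..., ids=...)`, and `store` is assumed to be the `PineconeVectorStore` created in part 1.

```python
def add_in_batches(store, docs, ids, batch_size=100):
    """Upsert documents in fixed-size batches; returns the number of documents sent."""
    if len(docs) != len(ids):
        raise ValueError("docs and ids must have the same length")
    sent = 0
    for start in range(0, len(docs), batch_size):
        batch_docs = docs[start:start + batch_size]
        batch_ids = ids[start:start + batch_size]
        # Reusing the same IDs on a later run overwrites the stored records (upsert)
        store.add_documents(documents=batch_docs, ids=batch_ids)
        sent += len(batch_docs)
    return sent


# Usage (requires a live index and API keys):
# pinecone_vdb = PineconeVectorStore(index_name=pinecone_index_name, embedding=openai_embeddings)
# add_in_batches(pinecone_vdb, train_docs, doc_ids)
```

Separating the connection from the upload makes it possible to re-run only the upload step, or to add more documents later without recreating the store object.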

## 3. Similarity Search with test data

In [6]:
# Utility function for printing search results
def print_search_results(docs,display_text=False):
    for doc in docs:
        print('-------')
        print(doc.metadata)
        if display_text: print(doc.page_content)

In [7]:
%%time
# Reconnect to the existing index (no documents are added here).
# Useful when long pauses between running cells cause the earlier connection to drop.
pinecone_vdb = PineconeVectorStore(index_name = pinecone_index_name, embedding = openai_embeddings)


# Must be < 45  (newsgroup_dataset['test'].num_rows)
test_post_index = 22

test_query = newsgroup_dataset['test'][test_post_index]['text']
print(newsgroup_dataset['test'][test_post_index]['class'])
# print(newsgroup_dataset['test'][test_post_index]['text'])

# Change the value of k to control how many results are returned
k = 5

# similarity search
docs = pinecone_vdb.similarity_search(test_query, k = k)
print_search_results(docs)

comp.sys.mac.hardware
-------
{'group': 'comp.sys.mac.hardware', 'id': 'post-129'}
-------
{'group': 'comp.sys.mac.hardware', 'id': 'post-303'}
-------
{'group': 'comp.sys.mac.hardware', 'id': 'post-205'}
-------
{'group': 'comp.os.ms-windows.misc', 'id': 'post-349'}
-------
{'group': 'comp.os.ms-windows.misc', 'id': 'post-22'}
CPU times: total: 406 ms
Wall time: 937 ms
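
One simple way to read the result above is a majority vote over the retrieved groups, compared against the true class printed on the first line. The sketch below is an illustration only; `majority_group` is a hypothetical helper, and the sample metadata is copied from the output above.

```python
from collections import Counter


def majority_group(metadatas, key="group"):
    """Return (most common value of `key`, number of votes) across the retrieved metadata."""
    counts = Counter(meta[key] for meta in metadatas)
    return counts.most_common(1)[0]


# Metadata of the five documents returned by the similarity search above
retrieved = [
    {'group': 'comp.sys.mac.hardware', 'id': 'post-129'},
    {'group': 'comp.sys.mac.hardware', 'id': 'post-303'},
    {'group': 'comp.sys.mac.hardware', 'id': 'post-205'},
    {'group': 'comp.os.ms-windows.misc', 'id': 'post-349'},
    {'group': 'comp.os.ms-windows.misc', 'id': 'post-22'},
]

label, votes = majority_group(retrieved)
print(label, votes)  # comp.sys.mac.hardware 3
```

With the live store, the same helper can be applied to `[doc.metadata for doc in docs]`.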


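## 4. MMR Search

MMR (maximal marginal relevance) fetches `fetch_k` candidates and returns the `k` of them that balance relevance to the query against diversity among the results. `lambda_mult` ranges from 0 (maximum diversity) to 1 (minimum diversity).
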
In [8]:
lambda_mult = 0.5

docs = pinecone_vdb.max_marginal_relevance_search(test_query, k=3, fetch_k=10, lambda_mult=lambda_mult)

print_search_results(docs)

-------
{'group': 'comp.sys.mac.hardware', 'id': 'post-129'}
-------
{'group': 'comp.sys.mac.hardware', 'id': 'post-173'}
-------
{'group': 'comp.sys.mac.hardware', 'id': 'post-205'}
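
To make the role of `lambda_mult` concrete, here is a small pure-Python sketch of the MMR selection step on toy 2-D vectors. It is an illustration of the general technique under the assumption of cosine similarity, not the library's implementation; `cosine`, `mmr_select`, and the toy vectors are all made up for this example.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def mmr_select(query_vec, candidate_vecs, k=3, lambda_mult=0.5):
    """Return indices of up to k candidates chosen greedily by maximal marginal relevance."""
    relevance = [cosine(query_vec, c) for c in candidate_vecs]
    selected = []
    remaining = list(range(len(candidate_vecs)))

    def mmr_score(i):
        # Redundancy is the highest similarity to anything already selected
        redundancy = max(
            (cosine(candidate_vecs[i], candidate_vecs[j]) for j in selected),
            default=0.0,
        )
        return lambda_mult * relevance[i] - (1 - lambda_mult) * redundancy

    while remaining and len(selected) < k:
        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return selected


query = [1.0, 0.0]
# Candidates 0 and 1 are near-duplicates; candidate 2 is less relevant but different
candidates = [[1.0, 0.1], [1.0, 0.12], [1.0, -0.5]]

print(mmr_select(query, candidates, k=2, lambda_mult=1.0))  # [0, 1]
print(mmr_select(query, candidates, k=2, lambda_mult=0.5))  # [0, 2]
```

With `lambda_mult=1.0` the selection is purely by relevance, so the near-duplicate is kept; with `0.5` the redundancy penalty swaps it for the more distinct candidate.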


## 5. Search with relevance score

https://api.python.langchain.com/en/latest/vectorstores/langchain_core.vectorstores.VectorStore.html#langchain_core.vectorstores.VectorStore.similarity_search_with_relevance_scores

Returns a list of (document, relevance score) tuples. Relevance scores are nominally in the range [0, 1]; a value can land marginally above 1, as the first result below does.

In [9]:
# Returns a list of (Document, score) tuples
doc_score_tuples = pinecone_vdb.similarity_search_with_relevance_scores(test_query, k=10)
for doc, score in doc_score_tuples:
    print("Relevance score = ", score)
    print_search_results([doc])
    print()


Relevance score =  1.001629355
-------
{'group': 'comp.sys.mac.hardware', 'id': 'post-129'}

Relevance score =  0.9190459845000001
-------
{'group': 'comp.sys.mac.hardware', 'id': 'post-205'}

Relevance score =  0.9190459845000001
-------
{'group': 'comp.sys.mac.hardware', 'id': 'post-303'}

Relevance score =  0.9069040420000001
-------
{'group': 'comp.os.ms-windows.misc', 'id': 'post-22'}

Relevance score =  0.9069040420000001
-------
{'group': 'comp.os.ms-windows.misc', 'id': 'post-349'}

Relevance score =  0.906125903
-------
{'group': 'comp.sys.mac.hardware', 'id': 'post-59'}

Relevance score =  0.9032123384999999
-------
{'group': 'comp.sys.mac.hardware', 'id': 'post-173'}

Relevance score =  0.9032123384999999
-------
{'group': 'comp.sys.mac.hardware', 'id': 'post-271'}

Relevance score =  0.902611971
-------
{'group': 'comp.sys.mac.hardware', 'id': 'post-103'}

Relevance score =  0.902611971
-------
{'group': 'comp.sys.mac.hardware', 'id': 'post-262'}



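## 6. Similarity search with score

https://api.python.langchain.com/en/latest/vectorstores/langchain_core.vectorstores.VectorStore.html#langchain_core.vectorstores.VectorStore.similarity_search_with_score

Returns a list of (document, score) tuples, where the score is the raw value reported by the vector store. In the output below the best match has the highest value, so these scores behave as similarities rather than distances.
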
In [10]:
# Returns a list of (Document, score) tuples
doc_score_tuples = pinecone_vdb.similarity_search_with_score(test_query, k=10)
for doc, score in doc_score_tuples:
    print("Distance score = ", score)
    print_search_results([doc])
    print()


Distance score =  1.00325871
-------
{'group': 'comp.sys.mac.hardware', 'id': 'post-129'}

Distance score =  0.838091969
-------
{'group': 'comp.sys.mac.hardware', 'id': 'post-205'}

Distance score =  0.838091969
-------
{'group': 'comp.sys.mac.hardware', 'id': 'post-303'}

Distance score =  0.813808084
-------
{'group': 'comp.os.ms-windows.misc', 'id': 'post-22'}

Distance score =  0.813808084
-------
{'group': 'comp.os.ms-windows.misc', 'id': 'post-349'}

Distance score =  0.812251806
-------
{'group': 'comp.sys.mac.hardware', 'id': 'post-59'}

Distance score =  0.806424677
-------
{'group': 'comp.sys.mac.hardware', 'id': 'post-173'}

Distance score =  0.806424677
-------
{'group': 'comp.sys.mac.hardware', 'id': 'post-271'}

Distance score =  0.805223942
-------
{'group': 'comp.sys.mac.hardware', 'id': 'post-262'}

Distance score =  0.805223942
-------
{'group': 'comp.sys.mac.hardware', 'id': 'post-103'}
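
Comparing the outputs of sections 5 and 6, each relevance score equals (1 + raw score) / 2, which maps a cosine similarity in [-1, 1] onto [0, 1]. The check below uses only numbers printed above; `relevance_from_cosine` is a hypothetical name, and the mapping is presented as an observation from this output, assuming the index uses the cosine metric.

```python
import math


def relevance_from_cosine(raw_score):
    """Map a cosine similarity in [-1, 1] to a relevance score in [0, 1]."""
    return (1.0 + raw_score) / 2.0


# (raw score from section 6, relevance score from section 5) for the same documents
observed = [
    (1.00325871, 1.001629355),    # post-129
    (0.838091969, 0.9190459845),  # post-205
    (0.813808084, 0.906904042),   # post-22
    (0.812251806, 0.906125903),   # post-59
]

for raw, relevance in observed:
    assert math.isclose(relevance_from_cosine(raw), relevance, abs_tol=1e-9)
print("all observed pairs match (1 + raw) / 2")
```

This also accounts for the first relevance score exceeding 1: the raw score for that document is itself slightly above 1.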



## 7. Search

https://api.python.langchain.com/en/latest/vectorstores/langchain_core.vectorstores.VectorStore.html#langchain_core.vectorstores.VectorStore.search

https://python.langchain.com/docs/modules/data_connection/retrievers/vectorstore#similarity-score-threshold-retrieval

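`search(query: str, search_type: str, **kwargs: Any)`

`search` is a single entry point that runs one of the searches shown above, selected by `search_type`; the remaining keyword arguments are passed through to that search.

**search_type** = similarity, similarity_score_threshold, mmr

**Note**
* Pinecone does not support *similarity_score_threshold*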


In [11]:
# Similarity search
kwargs={"k": 3}

docs = pinecone_vdb.search(test_query, "similarity", **kwargs)

print_search_results(docs)

-------
{'group': 'comp.sys.mac.hardware', 'id': 'post-129'}
-------
{'group': 'comp.sys.mac.hardware', 'id': 'post-205'}
-------
{'group': 'comp.sys.mac.hardware', 'id': 'post-303'}


In [12]:
# MMR Search
kwargs={"k": 3, "fetch_k": 10}

docs = pinecone_vdb.search(test_query, "mmr", **kwargs)

print_search_results(docs)

-------
{'group': 'comp.sys.mac.hardware', 'id': 'post-129'}
-------
{'group': 'comp.sys.mac.hardware', 'id': 'post-173'}
-------
{'group': 'comp.sys.mac.hardware', 'id': 'post-205'}
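
The two calls above return the same documents as the direct calls in sections 3 and 4, which suggests `search` simply dispatches on `search_type`. The sketch below shows roughly what such a dispatcher looks like for the two types used here. `search_sketch` is a hypothetical function written for illustration and is not the library's source code.

```python
def search_sketch(store, query, search_type, **kwargs):
    """Illustrative dispatcher over the two search types used in this notebook."""
    if search_type == "similarity":
        return store.similarity_search(query, **kwargs)
    if search_type == "mmr":
        return store.max_marginal_relevance_search(query, **kwargs)
    raise ValueError(f"Unsupported search_type: {search_type!r}")


# Usage (requires a live index and API keys):
# docs = search_sketch(pinecone_vdb, test_query, "mmr", k=3, fetch_k=10)
```

The same `search_type` values and keyword arguments can also be passed to `as_retriever`, for example `pinecone_vdb.as_retriever(search_type="mmr", search_kwargs={"k": 3, "fetch_k": 10})`, which wraps the vector store as a LangChain retriever.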
