[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/pinecone-io/examples/blob/master/learn/experimental/merge-namespaces/merge-namespaces.ipynb) [![Open nbviewer](https://raw.githubusercontent.com/pinecone-io/examples/master/assets/nbviewer-shield.svg)](https://nbviewer.org/github/pinecone-io/examples/blob/master/learn/experimental/merge-namespaces/merge-namespaces.ipynb)

# Managing RAG Documents with LangChain

When you upsert documents with LangChain's [`PineconeVectorStore`](https://api.python.langchain.com/en/latest/vectorstores/langchain_pinecone.vectorstores.PineconeVectorStore.html#langchain_pinecone.vectorstores.PineconeVectorStore) class, the generated vector IDs are random UUIDs by default. As a best practice when [managing RAG documents](https://docs.pinecone.io/guides/data/manage-rag-documents), use ID prefixes instead, so that every chunk of a document shares a common, recognizable prefix.

This notebook shows how to specify ID prefixes when upserting to a Pinecone index with LangChain.


## Setup

In [2]:
%pip install --upgrade --quiet  \
    langchain-pinecone \
    langchain-openai \
    langchain \
    langchain-community \
    pinecone-notebooks

In [11]:
import getpass
openai_api_key = getpass.getpass(prompt='Enter your OpenAI API key:')


## Chunk the file

In [16]:
filepath = "/content/sample_data/state_of_the_union.txt"

In [17]:
from langchain_community.document_loaders import TextLoader
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import CharacterTextSplitter

# Load the file as a single LangChain Document
loader = TextLoader(filepath)
documents = loader.load()

# Split the text into chunks of roughly 1,000 characters with no overlap
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
docs = text_splitter.split_documents(documents)

# Embedding model used to vectorize each chunk
embeddings = OpenAIEmbeddings(openai_api_key=openai_api_key)

## Connect to Pinecone

In [9]:
from pinecone_notebooks.colab import Authenticate

Authenticate()

In [18]:
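import os

from pinecone import Pinecone, ServerlessSpec

# The key is read from the PINECONE_API_KEY environment variable,
# which the Authenticate() call above is expected to set
pinecone_api_key = os.environ.get("PINECONE_API_KEY")

pc = Pinecone(api_key=pinecone_api_key)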

In [19]:
import time

index_name = "langchain-id-test" # change to match an existing index

existing_indexes = [index_info["name"] for index_info in pc.list_indexes()]

if index_name not in existing_indexes:
    pc.create_index(
        name=index_name,
        dimension=1536,
        metric="cosine",
        spec=ServerlessSpec(cloud="aws", region="us-east-1"),
    )
    while not pc.describe_index(index_name).status["ready"]:
        time.sleep(1)

index = pc.Index(index_name)
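
The readiness loop above polls indefinitely. As a minimal sketch (not part of the original notebook), the same wait can be bounded with a timeout. The helper name `wait_until` and its parameters are hypothetical and chosen only for illustration.

```python
import time


def wait_until(is_ready, timeout_s=300.0, poll_interval_s=1.0):
    """Poll is_ready() until it returns True or timeout_s elapses.

    Returns True if the condition was met, False if the wait timed out.
    """
    deadline = time.monotonic() + timeout_s
    while not is_ready():
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_interval_s)
    return True


# Hypothetical usage with the objects defined in this notebook:
# ready = wait_until(lambda: pc.describe_index(index_name).status["ready"])
# if not ready:
#     raise TimeoutError(f"Index {index_name} was not ready in time")
```

Returning a boolean instead of raising leaves it to the caller to decide whether a slow index is an error.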

## Upsert data

### Generate IDs with the specified prefix

In [None]:
prefix = "sotu" # change to reflect your document

# One ID per chunk: sotu#0, sotu#1, ...
ids = [f"{prefix}#{i}" for i in range(len(docs))]
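
If several documents share one index, it can help to wrap the ID scheme in small helpers so that IDs are built and parsed the same way everywhere. The following is a sketch under that assumption. `make_chunk_ids` and `parse_chunk_id` are hypothetical names, not part of LangChain or the Pinecone client.

```python
def make_chunk_ids(prefix, n_chunks, separator="#"):
    """Build IDs of the form <prefix><separator><chunk index>."""
    return [f"{prefix}{separator}{i}" for i in range(n_chunks)]


def parse_chunk_id(vector_id, separator="#"):
    """Split an ID into (prefix, chunk index), using the last separator."""
    prefix, sep, chunk = vector_id.rpartition(separator)
    if not sep or not chunk.isdigit():
        raise ValueError(f"Not a prefixed chunk ID: {vector_id!r}")
    return prefix, int(chunk)


# Equivalent to the loop above for this notebook's data:
# ids = make_chunk_ids("sotu", len(docs))
```

Splitting on the last separator keeps the parser usable if the prefix itself is hierarchical, for example `report#2024#3`.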

### Upsert to Pinecone

In [26]:
from langchain_pinecone import PineconeVectorStore

# Connect to the existing index. from_documents() is not used here because it
# would upsert the same chunks a second time under random UUIDs.
docsearch = PineconeVectorStore(index_name=index_name, embedding=embeddings)
docsearch.add_documents(docs, ids=ids)  # returns the IDs of the upserted vectors

['sotu#0',
 'sotu#1',
 'sotu#2',
 'sotu#3',
 'sotu#4',
 'sotu#5',
 'sotu#6',
 'sotu#7',
 'sotu#8',
 'sotu#9',
 'sotu#10',
 'sotu#11',
 'sotu#12',
 'sotu#13',
 'sotu#14',
 'sotu#15',
 'sotu#16',
 'sotu#17',
 'sotu#18',
 'sotu#19',
 'sotu#20',
 'sotu#21',
 'sotu#22',
 'sotu#23',
 'sotu#24',
 'sotu#25',
 'sotu#26',
 'sotu#27',
 'sotu#28',
 'sotu#29',
 'sotu#30',
 'sotu#31',
 'sotu#32',
 'sotu#33',
 'sotu#34',
 'sotu#35',
 'sotu#36',
 'sotu#37',
 'sotu#38',
 'sotu#39',
 'sotu#40',
 'sotu#41']
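
With prefixed IDs in place, all chunks of one document can be found and removed together. The sketch below assumes a serverless index (as created above) and assumes that the Pinecone Python client's `index.list(prefix=...)` yields batches of matching IDs, which is how I understand the current client to behave. `delete_by_prefix` is a hypothetical helper name.

```python
def delete_by_prefix(index, prefix, namespace=None):
    """Delete every vector whose ID starts with prefix.

    Assumes index.list(prefix=...) yields lists of IDs and that
    index.delete(ids=...) removes them. Returns the number of IDs deleted.
    """
    # Collect all pages first so deletion does not interfere with pagination
    batches = list(index.list(prefix=prefix, namespace=namespace))
    deleted = 0
    for batch in batches:
        index.delete(ids=batch, namespace=namespace)
        deleted += len(batch)
    return deleted


# Hypothetical usage; include the separator so "sotu2#..." is not matched:
# delete_by_prefix(index, "sotu#")
```

This is the main practical benefit of the prefix convention: replacing a document becomes a delete by prefix followed by a fresh upsert, with no need to track random UUIDs.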

In [24]:
query = "What did the president say about Ketanji Brown Jackson"
results = docsearch.similarity_search(query)  # keep `docs` (the chunks) intact
print(results[0].page_content)

Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while you’re at it, pass the Disclose Act so Americans can know who is funding our elections. 

Tonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. 

One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. 

And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.
