# 📌 Pinecone Vector Database

[Pinecone](https://www.pinecone.io/) is a **managed vector database** designed for storing, indexing, and querying high-dimensional vectors (embeddings).  
It’s built for **semantic search, recommendation systems, and Retrieval-Augmented Generation (RAG)** at scale, removing the need to manage your own infrastructure.  

---

## 🚀 Why Pinecone?
- **Fully managed & serverless** → No need to handle clusters or scaling manually.  
- **Fast similarity search** → Uses optimized Approximate Nearest Neighbor (ANN) algorithms.  
- **Scalable** → Handles millions to billions of vectors efficiently.  
- **Metadata filtering** → Combines metadata filters with vector similarity in a single query; Pinecone also supports hybrid (sparse + dense) search.  
- **Integrations** → Works smoothly with LangChain, Hugging Face, OpenAI, etc.  
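
The metadata-filtering point can be made concrete with a small sketch. The helper below (`build_filtered_query`, a hypothetical name) only assembles the keyword arguments one might pass to `index.query`; it does not contact Pinecone. The `filter` dictionary uses Pinecone's `$eq` operator, and `source` is assumed to be a metadata key stored with each vector.

```python
from typing import Any


def build_filtered_query(vector: list[float], source: str, top_k: int = 3) -> dict[str, Any]:
    """Assemble keyword arguments for a metadata-filtered similarity query.

    Hypothetical helper: it builds the arguments only and never calls Pinecone.
    """
    return {
        "vector": vector,
        "top_k": top_k,
        "include_metadata": True,
        # Only consider records whose `source` metadata equals the given value.
        "filter": {"source": {"$eq": source}},
    }


query_kwargs = build_filtered_query([0.1] * 384, source="cv")
# With a live index this could be used as: index.query(**query_kwargs)
```

Keeping the filter in a plain dictionary makes it easy to inspect or log before any request is sent.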

---

## 🧩 What I Did in This Notebook
- Generated an API key and **initialized Pinecone**.  
- Created an **index** (dimension = `384` from SentenceTransformer embeddings).  
- Loaded and chunked my **resume PDF** using LangChain utilities.  
- Converted chunks into embeddings using **Sentence Transformers**.  
- **Upserted** embeddings and metadata into Pinecone.  
- Queried Pinecone to **retrieve the most relevant text chunks**.  

---

## 🔑 Key Concepts
- **Index** → A logical collection of vectors (similar to a table in SQL).  
- **Namespace** → Partition within an index (helps organize data).  
- **Upsert** → Insert or update vectors in the index.  
- **Query** → Find the most similar vectors to a given embedding.  
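
To make these four terms tangible, here is a toy in-memory stand-in written in plain Python. `ToyIndex` is a hypothetical class for illustration only; it is not Pinecone's API and uses a brute-force scan instead of an ANN index. It does show how upserting an existing ID overwrites the record, how namespaces keep data apart, and how a query ranks by cosine similarity.

```python
import math
from collections import defaultdict


def _cosine(a, b) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class ToyIndex:
    """In-memory stand-in for a vector index (illustration only)."""

    def __init__(self, dimension: int) -> None:
        self.dimension = dimension
        # namespace -> {vector id -> (values, metadata)}
        self._namespaces = defaultdict(dict)

    def upsert(self, vectors, namespace: str = "") -> int:
        """Insert new ids or overwrite existing ones; returns the number written."""
        for vec_id, values, metadata in vectors:
            if len(values) != self.dimension:
                raise ValueError(f"expected {self.dimension} values, got {len(values)}")
            self._namespaces[namespace][vec_id] = (list(values), dict(metadata))
        return len(vectors)

    def query(self, vector, top_k: int = 3, namespace: str = ""):
        """Return the top_k (id, cosine similarity) pairs, best match first."""
        scored = [
            (vec_id, _cosine(vector, values))
            for vec_id, (values, _) in self._namespaces[namespace].items()
        ]
        return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]


toy = ToyIndex(dimension=2)
toy.upsert([("a", [1.0, 0.0], {"text": "east"}), ("b", [0.0, 1.0], {"text": "north"})])
toy.upsert([("a", [0.9, 0.1], {"text": "east, updated"})])  # same id: overwrites "a"
toy.upsert([("c", [1.0, 0.0], {"text": "other"})], namespace="drafts")

matches = toy.query([1.0, 0.0], top_k=2)  # searches the default namespace only
```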

---

## 📖 Next Steps
- Experiment with other vector stores (FAISS, Weaviate, ChromaDB).  
- Try filtered search (combine **metadata filtering** with vector search).  
- Build a **mini RAG pipeline** that retrieves context from Pinecone and generates answers.  
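
As a starting point for the mini RAG pipeline, the retrieval half can be sketched without any model call. The helper below (`build_rag_prompt`, a hypothetical name) assumes each match looks like an entry of `results["matches"]` from the query cell, that is, a mapping with a `score` and a `metadata["text"]` field. The prompt wording is an assumption, and the generation step is deliberately left out.

```python
def build_rag_prompt(question: str, matches: list[dict], max_chunks: int = 3) -> str:
    """Assemble a prompt from retrieved matches (hypothetical helper).

    Each match is assumed to look like {"score": float, "metadata": {"text": str}}.
    """
    seen = set()
    context_parts = []
    for match in sorted(matches, key=lambda m: m["score"], reverse=True):
        text = match["metadata"]["text"].strip()
        if text in seen:  # skip chunks that were stored more than once
            continue
        seen.add(text)
        context_parts.append(text)
        if len(context_parts) == max_chunks:
            break
    context = "\n---\n".join(context_parts)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


sample_matches = [
    {"score": 0.58, "metadata": {"text": "Skills: Python, SQL"}},
    {"score": 0.56, "metadata": {"text": "Skills: Python, SQL"}},  # duplicate chunk
    {"score": 0.41, "metadata": {"text": "Experience: finance"}},
]
prompt = build_rag_prompt("What are the candidate's skills?", sample_matches)
```

The resulting string could then be passed to whichever language model is used for generation.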




In [32]:
from dotenv import load_dotenv
from langchain_pinecone import PineconeVectorStore
from pinecone import Pinecone
from sentence_transformers import SentenceTransformer
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_core.documents import Document
from uuid import uuid4
import pypdf

In [33]:
load_dotenv()

True

## Initialize Pinecone and Create an Index

In [34]:
import os
import time
from dotenv import load_dotenv
from pinecone import Pinecone, ServerlessSpec

# Load environment variables from .env
load_dotenv()

api_key = os.getenv("PINECONE_API_KEY")
if api_key is None:
    raise ValueError("Missing Pinecone API key. Did you set it in your .env?")

pc = Pinecone(api_key=api_key)

index_name = "cv-index"

# Create the index if it doesn't exist
if index_name not in pc.list_indexes().names():
    pc.create_index(
        name=index_name,
        dimension=384,  # Match SentenceTransformer embedding size
        metric="cosine",
        spec=ServerlessSpec(cloud="aws", region="us-east-1"),
    )
    # Wait until the new index reports ready before connecting
    while not pc.describe_index(index_name).status["ready"]:
        time.sleep(1)

# Connect to the index
index = pc.Index(index_name)
print(f"Connected to Pinecone index: {index_name}")


Connected to Pinecone index: cv-index


**The dimension=384 is specific to the all-MiniLM-L6-v2 model. If you use a different Sentence Transformer model, check its embedding dimension.**
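
A dimension mismatch usually only surfaces when the upsert is rejected, so a cheap local check can catch it earlier. The sketch below uses a hypothetical helper, `check_dimensions`, and placeholder vectors. With a loaded model, the expected size could instead be read from `embedding_model.get_sentence_embedding_dimension()`; that call is not used here so the example runs without downloading a model.

```python
def check_dimensions(vectors: list[list[float]], expected_dim: int) -> None:
    """Raise ValueError if any vector's length differs from the index dimension."""
    for position, vector in enumerate(vectors):
        if len(vector) != expected_dim:
            raise ValueError(
                f"vector {position} has {len(vector)} values, expected {expected_dim}"
            )


INDEX_DIMENSION = 384  # dimension used when the index was created

check_dimensions([[0.0] * 384, [0.0] * 384], INDEX_DIMENSION)  # passes silently

try:
    # Placeholder for embeddings from a model with a different output size.
    check_dimensions([[0.0] * 768], INDEX_DIMENSION)
    mismatch_detected = False
except ValueError:
    mismatch_detected = True
```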



## Load and Chunk Your CV



In [None]:
# Load the PDF CV
pdf_path = "data/resume.pdf"
loader = PyPDFLoader(pdf_path)
pages = loader.load()

# Combine all pages into a single text
cv_text = "".join(page.page_content for page in pages)

# Create a Document object
cv_document = Document(page_content=cv_text, metadata={"source": "cv"})

# Chunk the document
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=200,  # Adjust chunk size as needed
    chunk_overlap=50,  # Overlap to retain context
    length_function=len,
)

chunks = text_splitter.split_documents([cv_document])

# Print the number of chunks and a sample
print(f"Number of chunks: {len(chunks)}")
for i, chunk in enumerate(chunks[:2]):  # Show first two chunks
    print(f"Chunk {i+1}: {chunk.page_content[:100]}... [{chunk.metadata}]")

Number of chunks: 42
Chunk 1: CLEAVESTONE ADUNGO 
cleavestone94@gmail.com   |   +254703457427   |   Nairobi Kenya       
GitHub: L... [{'source': 'cv'}]
Chunk 2: Summary      
Data Scientist transitioning from finance with 2+ years of hands-on ML, NLP, and data ... [{'source': 'cv'}]


- **PyPDFLoader** extracts text from your PDF CV.  
- **RecursiveCharacterTextSplitter** splits the text into manageable chunks (here, `200` characters with a `50`-character overlap).  
- Each chunk is stored as a **Document** with metadata (e.g., `source: cv`).  
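
To see what chunk size and overlap mean, here is a simplified sliding-window splitter in plain Python. It is an illustration under stated assumptions, not the algorithm `RecursiveCharacterTextSplitter` uses: the real splitter also tries to break on separators such as paragraph breaks, newlines, and spaces, whereas this hypothetical `sliding_window_chunks` cuts at fixed character offsets.

```python
def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into fixed-size character windows that overlap.

    Simplified illustration: boundaries are fixed offsets, not separators.
    """
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):  # this window reached the end
            break
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
pieces = sliding_window_chunks(sample, chunk_size=10, chunk_overlap=3)
# pieces == ["abcdefghij", "hijklmnopq", "opqrstuvwx", "vwxyz"]
```

Each piece starts with the last three characters of the previous one, which is how overlap keeps some context across chunk boundaries.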


## Initialize Sentence Transformer for Embeddings

The **all-MiniLM-L6-v2** model is lightweight, produces 384-dimensional embeddings, and works well for general-purpose semantic search. You can explore other models on Hugging Face.





In [36]:
# Load Sentence Transformer model
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")

## Embed and Store Chunks in Pinecone
Embed the chunks using Sentence Transformers and store them in the Pinecone index.



In [37]:
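from uuid import uuid4

# One random ID per chunk
uuids = [str(uuid4()) for _ in range(len(chunks))]
# Encode all chunks in one batched call; .tolist() yields plain Python lists
embeddings = embedding_model.encode([chunk.page_content for chunk in chunks]).tolist()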

vectors = [
    (
        uuids[i],
        embeddings[i],
        {
            "text": chunks[i].page_content,
            **{k: str(v) for k, v in chunks[i].metadata.items()}  # flatten metadata
        }
    )
    for i in range(len(chunks))
]

index.upsert(vectors=vectors)
print(f"Successfully stored {len(vectors)} chunks in Pinecone index '{index_name}'")


Successfully stored 42 chunks in Pinecone index 'cv-index'


- Each chunk is **embedded** using the Sentence Transformer model, converting text into a **384-dimensional vector**.  
- We generate **unique IDs** for each chunk using `uuid4`.  
- The **upsert** method stores the embeddings and metadata in **Pinecone**.  
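
Because an upsert overwrites only when the ID already exists, generating fresh `uuid4` values on every run means re-running the cell adds new records alongside the old ones. One possible alternative, sketched below, derives the ID from the chunk itself so repeated runs target the same records. Both helpers (`stable_id` and `batched`) are hypothetical names, and the batch size is an arbitrary choice for illustration.

```python
import hashlib
from typing import Iterator


def stable_id(source: str, position: int, text: str) -> str:
    """Derive a repeatable id from the chunk's source, position, and content."""
    digest = hashlib.sha256(f"{source}:{position}:{text}".encode("utf-8")).hexdigest()
    return f"{source}-{position}-{digest[:12]}"


def batched(items: list, batch_size: int) -> Iterator[list]:
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


texts = ["chunk one", "chunk two", "chunk three"]
ids_first_run = [stable_id("cv", i, t) for i, t in enumerate(texts)]
ids_second_run = [stable_id("cv", i, t) for i, t in enumerate(texts)]  # identical ids

batches = list(batched(ids_first_run, batch_size=2))
# With a live index, each batch of (id, values, metadata) tuples could be
# sent with index.upsert(vectors=batch).
```

Sending vectors in batches also keeps individual requests small when the number of chunks grows.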


## Query the Vector Store



In [39]:
# Define a query
query = "what are the candidates skills relevant to data science?"

# Embed the query
query_embedding = embedding_model.encode(query).tolist()

# Query Pinecone
results = index.query(
    vector=query_embedding,
    top_k=3,  # Return top 3 results
    include_metadata=True
)

# Print results
for match in results["matches"]:
    print(f"* [Score={match['score']:.3f}] {match['metadata']['text'][:100]}... [{match['metadata']}]")

* [Score=0.585] data into actionable insights and delivering production-ready ML solutions. Competitive ML participa... [{'source': 'cv', 'text': 'data into actionable insights and delivering production-ready ML solutions. Competitive ML participant (top \n30% Zindi), passionate about solving high-impact problems with data. \nSkills      \nProgramming & Analytics: Python (pandas, NumPy, scikit-learn, matplotlib, seaborn), SQL, Excel \nMachine Learning & AI: Classification, Regression, NLP, Deep Learning, Feature Engineering, Model Evaluation \nLLMs & Conversational AI: LangChain, RAG, Vector Databases (Pinecone, LanceDB), Prompt Engineering'}]
* [Score=0.567] data into actionable insights and delivering production-ready ML solutions. Competitive ML participa... [{'source': 'cv', 'text': 'data into actionable insights and delivering production-ready ML solutions. Competitive ML participant (top \n30% Zindi), passionate about solving high-impact problems with data. \nSkills      \nProgra