# **What is a Vector Database?**

![e88ebbacb848b09e477d11eedf4209d10ea4ac0a-1399x537.png](attachment:25543317-13ba-44c7-9e25-3d711a07f064.png)

A **vector database** is a specialized type of database designed to store, index, and query high-dimensional vectors efficiently. These vectors are numerical representations of data, often generated by machine learning models, such as embeddings from natural language processing (NLP), computer vision, or other AI systems. 

- **Vectors**: A vector is an array of numbers (e.g., `[0.23, 0.45, 0.67, ...]`) that represents data in a high-dimensional space. For example:
  - In NLP, a word or sentence can be represented as a vector using models like Word2Vec, BERT, or GPT.
  - In computer vision, an image can be represented as a vector using models like ResNet or VGG.
- **High-dimensional space**: Vectors often have hundreds or thousands of dimensions. Traditional databases can store such arrays, but their indexes are built for exact or range lookups, not for finding the nearest vectors.

Vector databases are optimized for **similarity search**, which involves finding vectors that are "close" to a given query vector according to a similarity or distance measure (e.g., cosine similarity, Euclidean distance).

---

### **Key Features of Vector Databases**
1. **Efficient Similarity Search**:
   - Vector databases use approximate nearest-neighbor (ANN) indexes, such as the HNSW graph algorithm or the index types implemented in libraries like Annoy and FAISS, to perform fast nearest-neighbor searches, even in high-dimensional spaces.
2. **Scalability**:
   - They are designed to handle large-scale datasets with millions or billions of vectors.
3. **Integration with AI/ML Models**:
   - Vector databases work seamlessly with machine learning models that generate embeddings.
4. **Real-time Querying**:
   - They support real-time queries, making them suitable for applications like recommendation systems or search engines.

---

# **When to Use a Vector Database?**

Vector databases are particularly useful in scenarios where you need to perform **similarity-based searches** or work with **high-dimensional data**. Here are some common use cases:

#### 1. **Natural Language Processing (NLP)**
   - **Semantic Search**: Find documents, sentences, or words that are semantically similar to a query.
   - **Chatbots and Question-Answering Systems**: Retrieve relevant responses based on the meaning of the input.
   - **Text Classification and Clustering**: Group similar texts together.

#### 2. **Computer Vision**
   - **Image/Video Search**: Find visually similar images or videos (e.g., reverse image search).
   - **Object Detection and Recognition**: Store and query embeddings of detected objects.
   - **Facial Recognition**: Match faces based on vector representations.

#### 3. **Recommendation Systems**
   - **Product Recommendations**: Suggest products similar to what a user has viewed or purchased.
   - **Content Recommendations**: Recommend articles, videos, or music based on user preferences.

#### 4. **Anomaly Detection**
   - Identify unusual patterns in data by comparing vectors to a baseline (e.g., fraud detection, network intrusion detection).
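
One simple way to read "comparing vectors to a baseline" is to measure how far each new vector lies from the average of known-normal vectors. The sketch below assumes that interpretation; the function names, the 2D toy data, and the threshold of `1.0` are all made up for illustration.

```python
import math

def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def find_anomalies(baseline_vectors, candidates, threshold):
    """Return the candidates lying farther than `threshold` from the baseline centroid."""
    center = centroid(baseline_vectors)
    return [v for v in candidates if euclidean(v, center) > threshold]

# Hypothetical embeddings of normal activity and of two new events.
normal_activity = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1]]
new_events = [[1.1, 1.0], [5.0, -3.0]]

anomalies = find_anomalies(normal_activity, new_events, threshold=1.0)
print(anomalies)  # only the second event is far from the baseline
```

In practice the threshold would be tuned on real data, and a vector database would supply the nearest-neighbor lookups instead of a single centroid.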

#### 5. **Multimodal Search**
   - Search across different types of data (e.g., text, images, audio) by converting them into a shared vector space.

#### 6. **Generative AI Applications**
   - Store and retrieve embeddings for prompts, outputs, or intermediate representations in generative models like GPT or DALL-E.

---

# **When NOT to Use a Vector Database**
- **Structured Data**: If your data is purely structured (e.g., tables with rows and columns), a traditional relational database (e.g., MySQL, PostgreSQL) is more suitable.
- **Exact Match Queries**: If you need exact matches (e.g., finding a specific ID or name), a traditional database is more efficient.
- **Low-Dimensional Data**: For low-dimensional data (e.g., 2D or 3D), specialized spatial databases (e.g., PostGIS) may be more appropriate.

---

### **Examples of Vector Databases**
Here are some popular vector databases you can explore:
1. **Pinecone**: Fully managed vector database for scalable similarity search.
2. **Weaviate**: Open-source vector database with hybrid search capabilities.
3. **Milvus**: Open-source vector database designed for AI applications.
4. **Qdrant**: Open-source vector database with a focus on performance and ease of use.
5. **Vespa**: Open-source platform for large-scale vector search and recommendation systems.

---

### **How to Decide if You Need a Vector Database?**
Ask yourself these questions:
1. Are you working with high-dimensional data (e.g., embeddings)?
2. Do you need to perform similarity searches (e.g., finding the most similar items)?
3. Is your dataset large and growing rapidly?
4. Do you need real-time querying capabilities?

If the answer to most of these questions is **yes**, then a vector database is likely the right choice for your use case.

# **How Does It Work?**

### 1. **Storing Vectors**
   - When you add data (e.g., an image or a sentence) to a vector database, it first gets converted into a **vector** using a machine learning model. For example:
     - A sentence like "I love dogs" might become a vector: `[0.2, 0.8, -0.3, 0.5, ...]`.
     - An image of a cat might become another vector: `[0.1, -0.7, 0.4, 0.9, ...]`.
   - These vectors are stored in the database.

### 2. **Indexing for Fast Search**
   - Searching through millions of vectors one by one would be slow. So, vector databases use **indexing techniques** to organize the vectors in a way that makes searching faster.
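   - Think of it like organizing books in a library by genre instead of keeping them in random order. Most of these indexes are *approximate*: they trade a small amount of accuracy for a large gain in speed. Popular algorithms and libraries include:
     - **HNSW (Hierarchical Navigable Small World)**: A graph algorithm that builds a layered network of shortcuts so a search can hop quickly toward similar vectors.
     - **Annoy (Approximate Nearest Neighbors Oh Yeah)**: A library that builds a forest of random-projection trees, splitting the vectors into groups so you only search the most relevant ones.
     - **FAISS (Facebook AI Similarity Search)**: A library that offers several index types (flat, inverted-file, product quantization, HNSW) rather than a single indexing method.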
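
To make the "only search the most relevant group" idea concrete, here is a minimal sketch of a partition-based index. It is much simpler than HNSW or Annoy's trees, and closest in spirit to an inverted-file index. The class name `PartitionIndex` and the hand-picked centroids are assumptions for illustration; real systems learn the centroids from the data.

```python
import math
from collections import defaultdict

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

class PartitionIndex:
    """Toy index: each vector is filed under its nearest centroid."""

    def __init__(self, centroids):
        self.centroids = centroids
        self.buckets = defaultdict(list)  # centroid position -> [(item_id, vector)]

    def _nearest_centroid(self, vector):
        return min(range(len(self.centroids)),
                   key=lambda i: euclidean(vector, self.centroids[i]))

    def add(self, item_id, vector):
        self.buckets[self._nearest_centroid(vector)].append((item_id, vector))

    def search(self, query, k=1):
        # Only the query's own bucket is scanned, so a true neighbor that
        # landed in another bucket can be missed: the result is approximate.
        candidates = self.buckets[self._nearest_centroid(query)]
        ranked = sorted(candidates, key=lambda item: euclidean(query, item[1]))
        return [item_id for item_id, _ in ranked[:k]]

index = PartitionIndex(centroids=[[0.0, 0.0], [10.0, 10.0]])
index.add("a", [0.5, 0.2])
index.add("b", [1.0, 1.5])
index.add("c", [9.0, 9.5])
index.add("d", [11.0, 10.2])

print(index.search([0.4, 0.4], k=2))  # scans only the bucket holding "a" and "b"
```

With two buckets the search touches roughly half the data; with thousands of buckets the savings become substantial, at the cost of occasionally missing a neighbor near a bucket boundary.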

### 3. **Querying for Similarity**
   - When you want to find something similar to a query (e.g., "Find images like this cat picture"), the database:
     1. Converts the query into a vector (e.g., the cat picture becomes `[0.1, -0.7, 0.4, 0.9, ...]`).
     2. Uses the index to find the closest vectors in the database.
     3. Returns the most similar items (e.g., other cat pictures).
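
The store-then-query flow can be sketched without any index at all, by scanning every stored vector. The `TinyVectorStore` class below is a hypothetical illustration, not how any particular vector database is implemented, and it assumes the vectors have already been produced by an embedding model.

```python
import heapq
import math

class TinyVectorStore:
    """Exact (brute-force) nearest-neighbor store kept in a Python list."""

    def __init__(self):
        self._items = []  # list of (vector, payload) pairs

    def add(self, vector, payload):
        self._items.append((list(vector), payload))

    def query(self, vector, k=3):
        """Return the payloads of the k stored vectors closest to `vector`."""
        def distance(item):
            stored, _ = item
            return math.dist(stored, vector)  # Euclidean distance (Python 3.8+)
        nearest = heapq.nsmallest(k, self._items, key=distance)
        return [payload for _, payload in nearest]

store = TinyVectorStore()
store.add([0.9, 0.1], "cat photo")      # made-up 2D embeddings
store.add([0.8, 0.2], "kitten photo")
store.add([-0.7, 0.6], "car photo")

results = store.query([0.9, 0.15], k=2)
print(results)  # the two stored items nearest to the query vector
```

A real database replaces the full scan in `query` with an index lookup, but the inputs and outputs have the same shape.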

### 4. **Measuring Similarity**
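   - To decide how "close" two vectors are, the database uses a **similarity or distance metric**. Common ones include:
     - **Cosine Similarity**: Measures the cosine of the angle between two vectors and ignores their lengths. Higher values mean more similar. It is a common choice for text embeddings.
     - **Euclidean Distance**: Measures the straight-line distance between two vectors, so it is sensitive to their lengths. Lower values mean more similar. It is sometimes used for image embeddings.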
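
Both metrics fit in a few lines of plain Python. The sketch below uses two toy vectors that point in the same direction but differ in length, to show that cosine similarity ignores magnitude while Euclidean distance does not. It assumes non-zero vectors of equal length.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between a and b (assumes neither is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def euclidean_distance(a, b):
    """Straight-line distance between a and b."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

a = [1.0, 2.0, 3.0]
b = [2.0, 4.0, 6.0]  # same direction as a, twice the length

print(cosine_similarity(a, b))   # close to 1.0: the directions match
print(euclidean_distance(a, b))  # sqrt(14), about 3.74: the lengths differ
```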

---

### **Examples**

#### Example 1: **Finding Similar Sentences**
1. **Data**:
   - Sentence 1: "I love dogs" → Vector: `[0.2, 0.8, -0.3, 0.5]`
   - Sentence 2: "I adore cats" → Vector: `[0.3, 0.7, -0.2, 0.6]`
   - Sentence 3: "The sky is blue" → Vector: `[-0.5, 0.1, 0.9, -0.4]`
2. **Query**:
   - "I like puppies" → Vector: `[0.25, 0.75, -0.25, 0.55]`
3. **Search**:
   - The database calculates the similarity between the query vector and all stored vectors.
   - It finds that Sentence 1 (`[0.2, 0.8, -0.3, 0.5]`) and Sentence 2 (`[0.3, 0.7, -0.2, 0.6]`) are the closest.
4. **Result**:
   - Returns: "I love dogs" and "I adore cats" as the most similar sentences.
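
Example 1 can be checked directly. The sketch below ranks the three illustrative sentence vectors against the query by cosine similarity. The four-dimensional vectors are the toy values from this example, not real model output.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

sentences = {
    "I love dogs": [0.2, 0.8, -0.3, 0.5],
    "I adore cats": [0.3, 0.7, -0.2, 0.6],
    "The sky is blue": [-0.5, 0.1, 0.9, -0.4],
}
query = [0.25, 0.75, -0.25, 0.55]  # "I like puppies"

# Sort sentences from most to least similar to the query.
ranked = sorted(sentences,
                key=lambda s: cosine_similarity(query, sentences[s]),
                reverse=True)

for sentence in ranked:
    print(f"{cosine_similarity(query, sentences[sentence]):.3f}  {sentence}")
```

The two pet sentences score about 0.995 each, while "The sky is blue" scores about -0.45, which matches the result described above.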

#### Example 2: **Finding Similar Images**
1. **Data**:
   - Image 1: Cat → Vector: `[0.1, -0.7, 0.4, 0.9]`
   - Image 2: Dog → Vector: `[0.2, -0.6, 0.3, 0.8]`
   - Image 3: Car → Vector: `[-0.9, 0.5, -0.1, 0.2]`
2. **Query**:
   - A new cat image → Vector: `[0.15, -0.65, 0.35, 0.85]`
3. **Search**:
   - The database compares the query vector to all stored vectors.
   - It finds that Image 1 (`[0.1, -0.7, 0.4, 0.9]`) and Image 2 (`[0.2, -0.6, 0.3, 0.8]`) are the closest.
4. **Result**:
   - Returns: The cat and dog images as the most similar.
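
Running Example 2 under both metrics shows why the choice of metric matters. With these toy vectors the cat and dog images sit at the same Euclidean distance from the query (0.1 each, up to rounding), while cosine similarity ranks the cat slightly ahead. The sketch assumes the illustrative vectors above, not real image embeddings.

```python
import math

def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

images = {
    "cat": [0.1, -0.7, 0.4, 0.9],
    "dog": [0.2, -0.6, 0.3, 0.8],
    "car": [-0.9, 0.5, -0.1, 0.2],
}
query = [0.15, -0.65, 0.35, 0.85]  # a new cat image

distances = {name: euclidean_distance(query, vec) for name, vec in images.items()}
similarities = {name: cosine_similarity(query, vec) for name, vec in images.items()}

# Smaller distance means closer; larger cosine similarity means closer.
nearest_two = sorted(distances, key=distances.get)[:2]

print(sorted(nearest_two))
for name in images:
    print(f"{name}: distance={distances[name]:.4f}  cosine={similarities[name]:.4f}")
```

Because the cat and dog are tied under Euclidean distance, their relative order in `nearest_two` depends on floating-point rounding, which is why the sketch sorts the pair before printing.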

---

### **Why Is This Useful?**
- **Speed**: With an approximate index, vector databases can search millions of vectors and return results in milliseconds, usually at the cost of occasionally missing a true nearest neighbor.
- **Accuracy**: They can find semantically similar items, even if the exact words or pixels don’t match.
- **Scalability**: They work well for large datasets, like searching through billions of images or documents.

---

### **Real-World Analogy**
Imagine you’re in a library:
- **Traditional Database**: You search for a book by its title or author (exact match).
- **Vector Database**: You describe the kind of book you like (e.g., "a mystery novel with a twist ending"), and the library finds books with similar themes or styles (similarity search).

---

### **Summary**
1. **Store**: Convert data into vectors and store them.
2. **Index**: Organize vectors for fast searching.
3. **Query**: Convert your query into a vector and find the closest matches.
4. **Return**: Get the most similar items.

In [1]:
!pip -q install chromadb langchain

In [2]:
!pip -q install langchain-huggingface

In [7]:
pip install -qU "langchain-chroma>=0.1.2"

Note: you may need to restart the kernel to use updated packages.


In [9]:
pip install -q langchain_community

Note: you may need to restart the kernel to use updated packages.


In [13]:
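from langchain_community.document_loaders import DirectoryLoader, TextLoader

# Load every .txt file in the directory; TextLoader yields one Document per file.
loader = DirectoryLoader("/kaggle/input/articles/new_articles", glob="./*.txt", loader_cls=TextLoader)

# Despite the singular name, `document` is a list of Document objects.
document = loader.load()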


In [15]:
document[0]

Document(metadata={'source': '/kaggle/input/articles/new_articles/05-03-checks-the-ai-powered-data-protection-project-incubated-in-area-120-officially-exits-to-google.txt'}, page_content='After Google cut all but three of the projects at its in-house incubator Area 120 and shifted it to work on AI projects across Google, one of the legacy efforts — coincidentally also an AI project — is now officially exiting to Google. Checks, an AI-powered tool to check mobile apps for compliance with various privacy rules and regulations, is moving into Google proper as a privacy product aimed at mobile developers.\n\nChecks originally made its debut in February 2022, although it was in development for some time before that. In its time at Area 120, it became one of the largest projects in the group, co-founders Fergus Hurley and Nia Castelly told me, with 10 people fully dedicated to it and a number of others contributing less formally. The founders’ job titles under Google will now be GM and Legal

In [18]:
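from langchain.text_splitter import RecursiveCharacterTextSplitter

# Split each article into chunks of up to 700 characters; neighboring chunks may share up to 200 characters.
text_split = RecursiveCharacterTextSplitter(chunk_size=700, chunk_overlap=200)

# `documents` holds the chunks; each chunk keeps the metadata of its source file.
documents = text_split.split_documents(document)

documents[0]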

Document(metadata={'source': '/kaggle/input/articles/new_articles/05-03-checks-the-ai-powered-data-protection-project-incubated-in-area-120-officially-exits-to-google.txt'}, page_content='After Google cut all but three of the projects at its in-house incubator Area 120 and shifted it to work on AI projects across Google, one of the legacy efforts — coincidentally also an AI project — is now officially exiting to Google. Checks, an AI-powered tool to check mobile apps for compliance with various privacy rules and regulations, is moving into Google proper as a privacy product aimed at mobile developers.')

In [19]:
len(documents)

355
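
To see how `chunk_size` and `chunk_overlap` interact, here is a simplified fixed-width sliding window. It is only a sketch: `RecursiveCharacterTextSplitter` also tries to break on separators such as paragraph and line boundaries, so its chunks will differ from these. The helper name `chunk_text` is hypothetical.

```python
def chunk_text(text, chunk_size, chunk_overlap):
    """Split `text` into windows of `chunk_size` characters, each sharing
    `chunk_overlap` characters with the previous window."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks

chunks = chunk_text("abcdefghijklmnopqrstuvwxyz", chunk_size=10, chunk_overlap=4)
print(chunks)
# ['abcdefghij', 'ghijklmnop', 'mnopqrstuv', 'stuvwxyz']
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, which helps retrieval at the cost of storing some text twice.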

In [3]:
from langchain_huggingface import HuggingFaceEmbeddings

model_name = "sentence-transformers/all-mpnet-base-v2"  # sentence-embedding model (768 dimensions)
model_kwargs = {'device': 'cpu'}  # run the model on CPU
encode_kwargs = {'normalize_embeddings': False}  # do not request an extra L2-normalization step
hf = HuggingFaceEmbeddings(
    model_name=model_name,
    model_kwargs=model_kwargs,
    encode_kwargs=encode_kwargs
)

In [20]:
hf

HuggingFaceEmbeddings(model_name='sentence-transformers/all-mpnet-base-v2', cache_folder=None, model_kwargs={'device': 'cpu'}, encode_kwargs={'normalize_embeddings': False}, multi_process=False, show_progress=False)
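
The `normalize_embeddings` flag controls whether vectors are scaled to unit length. As a general illustration (with toy 2D vectors, not output from the model above), the sketch below shows why that matters: once vectors are normalized, a plain dot product gives the same value as cosine similarity.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalize(vector):
    """Scale a non-zero vector to unit length (L2 norm of 1)."""
    norm = math.sqrt(dot(vector, vector))
    return [x / norm for x in vector]

def cosine_similarity(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

a = [3.0, 4.0]
b = [1.0, 0.0]

print(cosine_similarity(a, b))           # cosine on the raw vectors
print(dot(normalize(a), normalize(b)))   # dot product on unit vectors: same value
```

This is why some vector stores prefer normalized embeddings: the cheaper dot product can then stand in for cosine similarity.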

In [21]:
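# The langchain-chroma package installed earlier exposes the same class via `from langchain_chroma import Chroma`.
from langchain_community.vectorstores import Chroma

# Embed every chunk with `hf` and store the vectors in a Chroma collection persisted on disk.
db = Chroma.from_documents(documents=documents,
                           embedding=hf,
                           persist_directory="/kaggle/working/")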

In [22]:
db

<langchain_community.vectorstores.chroma.Chroma at 0x7a19e0b9b8e0>

In [25]:
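# Wrap the store as a retriever; by default it runs a similarity search and returns the top 4 chunks.
retriever = db.as_retriever()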

In [27]:
res = retriever.invoke("How much money did microsoft raise?")
res

[Document(metadata={'source': '/kaggle/input/articles/new_articles/05-03-chatgpt-everything-you-need-to-know-about-the-ai-powered-chatbot.txt'}, page_content='April 28, 2023\n\nVC firms including Sequoia Capital, Andreessen Horowitz, Thrive and K2 Global are picking up new shares, according to documents seen by TechCrunch. A source tells us Founders Fund is also investing. Altogether the VCs have put in just over $300 million at a valuation of $27 billion to $29 billion. This is separate to a big investment from Microsoft announced earlier this year, a person familiar with the development told TechCrunch, which closed in January. The size of Microsoft’s investment is believed to be around $10 billion, a figure we confirmed with our source.\n\nApril 25, 2023'),
 Document(metadata={'source': '/kaggle/input/articles/new_articles/05-07-fintech-space-continues-to-be-competitive-and-drama-filled.txt'}, page_content='OpenEnvoy raises $15 million to grow AP automation solution\n\nMiami-based s

In [29]:
# Limit the retriever to the single closest chunk (k=1).
retriever = db.as_retriever(search_kwargs={"k": 1})
res = retriever.invoke("What is openai?")
res

[Document(metadata={'source': '/kaggle/input/articles/new_articles/05-05-google-and-openai-are-walmarts-besieged-by-fruit-stands.txt'}, page_content='OpenAI may be synonymous with machine learning now and Google is doing its best to pick itself up off the floor, but both may soon face a new threat: rapidly multiplying open source projects that push the state of the art and leave the deep-pocketed but unwieldy corporations in their dust. This Zerg-like threat may not be an existential one, but it will certainly keep the dominant players on the defensive.')]

In [30]:
retriever.search_type

'similarity'