## **Vector Search**
### What is Vector Search?

Vector search is a method of finding similar items by comparing their vector representations. In the context of AI and machine learning, items such as words, images, or documents are often represented as vectors (arrays of numbers). These vectors capture the essential features or meanings of the items in a mathematical form.

### Why Use Vector Search?

Traditional search methods rely on exact matches of keywords, which might not always capture the true similarity between items. Vector search, on the other hand, can find similar items even if they don't share the exact same words or features, making it powerful for tasks like recommendation systems, image search, and natural language processing.

### How Does Vector Search Work?

1. **Convert Items to Vectors:**
   - Each item (e.g., a word, sentence, or image) is converted into a vector. This can be done using techniques like word embeddings for text or deep learning models for images.

2. **Store Vectors in a Database:**
   - All the vectors are stored in a database. This database can be searched efficiently to find similar vectors.

3. **Search for Similar Vectors:**
   - When a query vector (representing the item you're searching for) is provided, the system finds the vectors in the database that are closest to the query vector. This "closeness" is usually measured with a similarity score such as cosine similarity (higher means closer) or a distance metric such as Euclidean distance (lower means closer).
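
The two measures named in step 3 can be sketched in a few lines of plain Python. This is a minimal illustration, not how a vector database computes them internally; the helper names and the sample vectors are made up for the example.

```python
import math

def dot(a, b):
    """Sum of element-wise products of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def cosine_similarity(a, b):
    """Cosine of the angle between a and b; 1.0 means same direction."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def euclidean_distance(a, b):
    """Straight-line distance between a and b; 0.0 means identical."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Illustrative vectors (not produced by any real embedding model)
query = [1.0, 0.0]
same_direction = [2.0, 0.0]
orthogonal = [0.0, 3.0]

print(cosine_similarity(query, same_direction))   # 1.0
print(cosine_similarity(query, orthogonal))       # 0.0
print(euclidean_distance(query, same_direction))  # 1.0
```

Note that `same_direction` has a cosine similarity of 1.0 to the query even though its Euclidean distance is not zero: cosine similarity ignores vector length, while Euclidean distance does not.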

### Simple Example

Let's say we have a vector search system for finding similar movies based on their descriptions.

1. **Convert Movie Descriptions to Vectors:**
   - We use a pre-trained language model to convert each movie description into a vector.
     - "A young wizard attends a magical school." -> [0.1, 0.3, 0.4, ..., 0.2]
     - "A group of friends embarks on a quest to destroy a powerful ring." -> [0.2, 0.1, 0.3, ..., 0.4]

2. **Store Vectors:**
   - Store these vectors in a database.

3. **Search with a Query:**
   - A user searches for a movie similar to "A boy discovers he is a wizard and goes on adventures."
   - Convert this query into a vector: [0.1, 0.2, 0.4, ..., 0.3]
   - The system compares this query vector with the vectors in the database and finds the closest matches.
   - It might find that "A young wizard attends a magical school." is the closest match.
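
The movie example can be sketched as a brute-force nearest-neighbour lookup. The three-dimensional vectors below are hypothetical stand-ins chosen for illustration; a real language model would produce vectors with hundreds of dimensions, and a real database would use an index instead of scanning every entry.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings keyed by movie description
movies = {
    "A young wizard attends a magical school.": [0.9, 0.1, 0.2],
    "A group of friends embarks on a quest to destroy a powerful ring.": [0.3, 0.8, 0.4],
}

def most_similar(query_vector, store):
    """Return the description whose vector is closest to the query vector."""
    return max(store, key=lambda text: cosine_similarity(query_vector, store[text]))

# Made-up embedding of "A boy discovers he is a wizard and goes on adventures."
query_vector = [0.8, 0.2, 0.3]
print(most_similar(query_vector, movies))
# A young wizard attends a magical school.
```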

### Benefits of Vector Search

- **Semantic Search:** Finds similar items based on meaning, not just keywords.
- **Flexibility:** Works with different types of data (text, images, etc.).
- **Accuracy:** Can capture subtle similarities and nuances in data.

### Summary

Vector search is a powerful technique that finds similar items by comparing their vector representations. It converts items into vectors, stores these vectors, and uses distance metrics to find the closest matches. This method is widely used in AI applications for its ability to understand and capture the underlying similarities between items.

## **Cassandra and Astra**

Cassandra and Astra are both databases, but they have different characteristics and use cases. Here's an explanation of each:

### Cassandra Database

**Apache Cassandra** is an open-source, distributed NoSQL database designed to handle large amounts of data across many commodity servers, providing high availability with no single point of failure. It's known for its scalability and performance, making it suitable for applications that require fast and reliable data access.

**Key Features of Cassandra:**

1. **Distributed Architecture:** Data is distributed across multiple nodes in a cluster, ensuring fault tolerance and high availability.
2. **Scalability:** Easily scales horizontally by adding more nodes to the cluster without downtime.
3. **No Single Point of Failure:** Data is replicated across multiple nodes, so the failure of one node doesn't affect the system's overall availability.
4. **High Performance:** Designed for high-speed read and write operations, making it suitable for applications that need to handle large volumes of data quickly.
5. **Flexible Schema:** Tables have a defined schema, but columns can be added without downtime, rows may leave columns unset, and collection types (lists, sets, maps) accommodate loosely structured data.
6. **Query Language:** Uses CQL (Cassandra Query Language), which is similar to SQL but tailored for Cassandra's distributed architecture.

**Use Cases for Cassandra:**
- Real-time big data applications
- High-velocity data ingestion
- Distributed data stores for web applications
- IoT data management
- Financial services for transaction data

### Astra Database

**Astra DB** is a database-as-a-service (DBaaS) offering built on Apache Cassandra by DataStax. It provides a managed, cloud-native version of Cassandra, making it easier to deploy, manage, and scale Cassandra databases without the complexity of manual administration.

**Key Features of Astra:**

1. **Managed Service:** Astra DB takes care of the database management tasks such as provisioning, maintenance, scaling, and backups, allowing developers to focus on building applications.
2. **Cloud-Native:** Designed to run on major cloud platforms (AWS, Google Cloud, Azure), providing flexibility and integration with cloud services.
3. **Serverless Architecture:** Supports a serverless model, where users only pay for the actual usage (reads/writes), providing cost efficiency.
4. **Easy to Use:** Provides a simplified interface and developer-friendly tools, including SDKs, REST APIs, and GraphQL support.
5. **Global Distribution:** Allows for the deployment of globally distributed databases with low-latency access to data across multiple regions.
6. **Compatibility with Cassandra:** Fully compatible with Cassandra, allowing for easy migration and integration with existing Cassandra-based applications.

**Use Cases for Astra:**
- Modern web and mobile applications
- Real-time analytics and big data applications
- Applications requiring high availability and global distribution
- Enterprises looking for a scalable and managed database solution without the operational overhead

### Summary

- **Apache Cassandra**: An open-source, distributed NoSQL database known for its scalability, high performance, and fault tolerance. Ideal for applications needing to handle large amounts of data with high availability.
- **Astra DB**: A managed, cloud-native version of Cassandra provided by DataStax. Simplifies the deployment and management of Cassandra databases, offering a serverless architecture, global distribution, and seamless integration with cloud services.

Both databases are powerful tools for managing large-scale data, but Astra DB simplifies many of the operational complexities associated with managing a Cassandra cluster, making it an attractive choice for developers and enterprises looking for a managed solution.

Cassandra = https://cassandra.apache.org/doc/latest/

Astra = https://www.datastax.com/products/datastax-astra

In [1]:
!pip install -q cassio datasets langchain tiktoken


### **Library Details**

### 1. cassio

**CassIO** is a library that integrates Apache Cassandra (including Astra DB) with AI and ML applications. It provides utilities for storing and retrieving data, such as embedding vectors, in a way that leverages Cassandra's distributed architecture.

**Key Features:**
- Simplifies interaction with Cassandra databases.
- Provides tools for managing AI/ML data workflows.
- Helps in efficiently storing and querying large datasets.

### 2. datasets

**Datasets** is a library developed by Hugging Face. It's designed to provide easy access to a wide variety of datasets for machine learning and data science projects.

**Key Features:**
- Access to a large collection of datasets for NLP, computer vision, and more.
- Easy-to-use API for downloading, preprocessing, and managing datasets.
- Integration with popular ML frameworks like PyTorch, TensorFlow, and Hugging Face's Transformers.

### 3. langchain

**LangChain** is a framework for building applications with large language models (LLMs). It is particularly focused on enabling applications that are data-aware and can connect to external sources of data.

**Key Features:**
- Framework for developing complex applications using LLMs.
- Provides tools for data connection, such as APIs and databases.
- Facilitates the creation of applications that can process and generate human-like text based on data from multiple sources.

### 4. tiktoken

**Tiktoken** is a library used for tokenizing text specifically for use with OpenAI's models. Tokenization is the process of converting text into tokens (which could be words, subwords, or characters) that the models can process.

**Key Features:**
- Efficiently tokenizes text for OpenAI's models.
- Supports different encoding schemes used by OpenAI models.
- Helps in managing token limits and optimizing input text for model consumption.

### Summary

- **CassIO**: Simplifies interaction with Apache Cassandra for AI/ML applications.
- **Datasets**: Provides access to a variety of datasets for machine learning.
- **LangChain**: Framework for building applications with large language models, especially those that need to connect to external data sources.
- **Tiktoken**: Tokenization library optimized for preparing text input for OpenAI's language models.

These libraries collectively cover a wide range of functionalities, from data handling and preparation to building sophisticated AI-powered applications.

In [2]:
!pip install -q langchain-fireworks PyPDF2


In [3]:
!pip install langchain-community


In [4]:
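# Community integrations now live in langchain_community (the langchain.* paths are deprecated aliases)
from langchain_community.vectorstores import Cassandra
from langchain.indexes.vectorstore import VectorStoreIndexWrapper
from langchain_community.llms import Fireworks
from langchain_fireworks import FireworksEmbeddings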

from datasets import load_dataset

import cassio

In [5]:
from PyPDF2 import PdfReader

## Setup

In [7]:
from google.colab import userdata
ASTRA_DB_ID = userdata.get('db_id')
ASTRA_DB_TOKEN = userdata.get('db_token')
FIREWORK_API_KEY = userdata.get('fireworkapi')
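
`google.colab.userdata` is only available inside Google Colab. For running the notebook elsewhere, one possible fallback is to read the same secrets from environment variables. The sketch below assumes the environment variables use the same names as the Colab secrets (`db_id`, `db_token`, `fireworkapi`); the helper `get_secret` is hypothetical and not part of any library.

```python
import os

def get_secret(name, default=None):
    """Read a secret from Colab's userdata if available, else from the environment."""
    try:
        from google.colab import userdata  # importable only inside Colab
    except ImportError:
        # Outside Colab: fall back to an environment variable of the same name
        return os.environ.get(name, default)
    return userdata.get(name)

# Usage (assumes the variable was exported before starting the notebook):
# ASTRA_DB_ID = get_secret("db_id")
```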

In [8]:
# Provide the path to pdf file
pdfreader = PdfReader("budget_speech.pdf")

In [9]:
## Extracting Text from PDF Pages and Concatenating

# Read the text from every page of the PDF and concatenate it
raw_text = ''
for page in pdfreader.pages:
  content = page.extract_text()
  if content:  # skip pages with no extractable text
    raw_text += content


In [11]:
# Initialize the connection to database

cassio.init(token=ASTRA_DB_TOKEN, database_id=ASTRA_DB_ID)

ERROR:cassandra.connection:Closing connection <LibevConnection(140550921992944) 28ded3f1-49b1-4109-bba7-a8920859d872-us-east1.db.astra.datastax.com:29042:3b8afe0a-9b27-48ef-af55-d95b0aef8e45> due to protocol error: Error from server: code=000a [Protocol error] message="Beta version of the protocol used (5/v5-beta), but USE_BETA flag is unset"


In [13]:
#Create the LangChain embedding and LLM objects for later usage

llm = Fireworks(fireworks_api_key = FIREWORK_API_KEY)
embedding = FireworksEmbeddings(fireworks_api_key = FIREWORK_API_KEY)

In [14]:
#Create a LangChain vector store
astra_vector_store = Cassandra(
    embedding=embedding,
    table_name="qa_demo",
    session=None,
    keyspace=None
)

## Splitting Text into Chunks with Character Text Splitter

In [15]:
from langchain.text_splitter import CharacterTextSplitter
# Split the text into overlapping chunks so that each chunk stays small enough for the embedding model
# (sizes are measured in characters, because length_function is len)
text_splitter = CharacterTextSplitter(
    separator= "\n",
    chunk_size = 800,
    chunk_overlap = 200,
    length_function = len
)

texts = text_splitter.split_text(raw_text)
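
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a simplified sketch of separator-based chunking with overlap. It is not LangChain's implementation; `split_with_overlap` is a hypothetical helper that only mirrors the general idea: split on the separator, greedily pack pieces into a chunk, and start the next chunk with the tail of the previous one.

```python
def split_with_overlap(text, separator="\n", chunk_size=800, chunk_overlap=200):
    """Greedily pack separator-delimited pieces into chunks of at most
    chunk_size characters, repeating up to chunk_overlap characters of
    trailing pieces at the start of the next chunk."""
    pieces = [p for p in text.split(separator) if p]
    chunks, current = [], []
    for piece in pieces:
        if current and len(separator.join(current + [piece])) > chunk_size:
            chunks.append(separator.join(current))
            # Drop leading pieces until the carried-over tail fits in
            # chunk_overlap and leaves room for the new piece
            while current and (
                len(separator.join(current)) > chunk_overlap
                or len(separator.join(current + [piece])) > chunk_size
            ):
                current.pop(0)
        current.append(piece)
    if current:
        chunks.append(separator.join(current))
    return chunks

sample = "aaaa\nbbbb\ncccc\ndddd"
print(split_with_overlap(sample, chunk_size=10, chunk_overlap=5))
# ['aaaa\nbbbb', 'bbbb\ncccc', 'cccc\ndddd']
```

Each chunk repeats the last line of the previous one, which is the same effect visible in the real output of the splitter: consecutive chunks share text so that a sentence cut at a boundary still appears whole in at least one chunk. A single piece longer than `chunk_size` is emitted as its own oversized chunk in this sketch.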

In [18]:
texts[0:3]

['GOVERNMENT OF INDIA\nINTERIM BUDGET 2024-2025\nSPEECH\nOF\nNIRMALA SITHARAMAN\nMINISTER OF FINANCE\nFebruary 1,  2024 \nCONTENTS  \n \nPART – A \n Page No.  \nIntroduction  1 \nInclusive Development and Growth  2 \nSocial Justice   3  \nExemplary  Track Record of Governance,  \nDevelopment and Performance (GDP)  7 \nEconomic Management  8 \nGlobal Context  9 \nVision for ‘Viksit Bharat’  10 \nStrategy for  ‘Amrit Kaal’  11 \nInfrastructure Development  17 \nAmrit Kaal as Kartavya Kaal  22 \nRevised Estimates 2023 -24 23 \nBudget Estimates 2024 -25 23 \nPART – B \nDirect taxes  25 \nIndirect Taxes   26 \nEconomy – Then and Now  28 \n  \n  1 \n Interim Budget 2024 -2025  \nSpeech of  \nNirmala Sitharaman  \nMinister of Finance  \nFebruary 1, 2024  \nHon’ble Speaker,  \n I present the Interim Budget for 2024 -25.',
 '1 \n Interim Budget 2024 -2025  \nSpeech of  \nNirmala Sitharaman  \nMinister of Finance  \nFebruary 1, 2024  \nHon’ble Speaker,  \n I present the Interim Budget for 2024 -

## Load the dataset into Vector Store

The `VectorStoreIndexWrapper` is a thin wrapper around a LangChain vector store (here, the Cassandra store backed by Astra DB). It does not add indexing, filtering, aggregation, or compression features to the store. Instead, it exposes convenience methods for question answering:

1. **`query`**: Retrieves the chunks most similar to the question from the vector store, passes them to the LLM as context, and returns the answer as a string.
2. **`query_with_sources`**: Works like `query`, but also returns the sources of the retrieved chunks alongside the answer.

The underlying store remains available through the wrapper's `vectorstore` attribute, so store methods such as `similarity_search` can still be called on it directly.
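
Conceptually, answering a question over a vector store is a retrieve-then-generate step: embed the question, fetch the most similar chunks, and hand them to the LLM as context. The sketch below illustrates only that flow. Everything in it is a hypothetical stand-in: `embed` uses word counts instead of a real embedding model, the chunks are made-up sentences, and the prompt wording is not the one LangChain uses.

```python
import math
from collections import Counter

def embed(text):
    """Toy stand-in for an embedding model: a bag of lowercase word counts."""
    return Counter(text.lower().split())

def cosine_similarity(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(question, chunks, k=2):
    """Return the k chunks most similar to the question."""
    q = embed(question)
    return sorted(chunks, key=lambda c: cosine_similarity(q, embed(c)), reverse=True)[:k]

def build_prompt(question, chunks, k=2):
    """Stuff the retrieved chunks into a prompt for the LLM."""
    context = "\n".join(retrieve(question, chunks, k))
    return f"Use the context to answer the question.\n\nContext:\n{context}\n\nQuestion: {question}"

# Made-up chunks for illustration only
chunks = [
    "cats sleep most of the day",
    "the library opens at nine",
    "bread is baked every morning",
]
prompt = build_prompt("when does the library open", chunks, k=1)
print(prompt)  # this string would then be sent to the LLM
```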

In [19]:
# The vector store embeds each chunk with the embedding model configured in Setup before storing it
astra_vector_store.add_texts(texts[:50])

print("Inserted %i text chunks" % len(texts[:50]))

# Wrap the store so that it exposes query() for question answering
astra_vector_store = VectorStoreIndexWrapper(vectorstore=astra_vector_store)

Inserted 50 text chunks


## Run the QA

Some suggested questions:

*   What is the current GDP?
*   Which castes do we need to focus on, and how many are there?


In [22]:
first_question = True
while True:
  if first_question:
    query_text  = input("Enter the question (or type 'quit' to exit): ").strip()
  else:
    query_text = input("What's your next question (or type 'quit' to exit): ").strip()
  if query_text.lower() == "quit":
    break
  if query_text == "":
    continue

  first_question = False

  print(f"Question: {query_text}")
  answer = astra_vector_store.query(query_text, llm=llm).strip()
  print(f"Answer: {answer}")

Enter the question (or type 'quit' to exit): How many and what caste we need to focus on?
Question: How many and what caste we need to focus on?




Answer: The Prime Minister believes that we need to focus on four major castes - Garib (Poor), Mahilayen (Women), Yuva (Youth), and Annadata (Farmer).
What's your next question (or type 'quit' to exit): quit
