In [2]:
import os
import warnings
import pinecone
from langchain.chat_models import ChatOllama
from langchain.callbacks.base import BaseCallbackHandler
from langchain.document_loaders import TextLoader
from langchain.text_splitter import CharacterTextSplitter
from langchain.callbacks.manager import CallbackManager
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler
from langchain.embeddings import SentenceTransformerEmbeddings
from langchain.vectorstores import Pinecone
from langchain.chains import RetrievalQA

# Config
warnings.filterwarnings("ignore")
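# Read the key from the environment instead of hard-coding it (PINECONE_API_KEY is an assumed variable name)
pinecone.init(api_key=os.environ["PINECONE_API_KEY"], environment="gcp-starter")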



In [3]:
callback_manager = CallbackManager([StreamingStdOutCallbackHandler()])
chat = ChatOllama(
    base_url="https://383b-34-171-195-199.ngrok-free.app",
    model="mistral:7b",
    callback_manager=callback_manager
)

In [4]:
embedding_model = SentenceTransformerEmbeddings(model_name="all-MiniLM-L6-v2")
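loader = TextLoader("medium.txt", encoding="utf-8")
document = loader.load()
# Split on blank lines into chunks of roughly 1000 characters with 100 characters of overlap
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
texts = text_splitter.split_documents(document)
# Embed each chunk and upsert it into the existing Pinecone index
docsearch = Pinecone.from_documents(texts, embedding_model, index_name="testindex1")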

Created a chunk of size 1913, which is longer than the specified 1000
Created a chunk of size 1225, which is longer than the specified 1000
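
The two warnings above are what one would expect if the splitter only cuts at its separator (a blank line by default for CharacterTextSplitter) and never inside a paragraph: a single paragraph longer than chunk_size is emitted whole. The following is a minimal pure-Python sketch of that behavior, not LangChain's actual implementation. `split_on_separator` is a hypothetical helper, and chunk overlap is left out for brevity.

```python
def split_on_separator(text, chunk_size, separator="\n\n"):
    """Greedily pack separator-delimited pieces into chunks of at most
    chunk_size characters. A single piece longer than chunk_size is kept
    whole, which is how oversized chunks can appear."""
    chunks, current = [], ""
    for piece in text.split(separator):
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size or not current:
            # Either the piece still fits, or it starts a fresh chunk
            current = candidate
        else:
            # The piece does not fit: close the current chunk and start a new one
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


# Illustrative input: the middle paragraph alone exceeds the limit of 20
sample = "short para\n\n" + "x" * 30 + "\n\nanother short one"
chunks = split_on_separator(sample, chunk_size=20)
oversized = [len(c) for c in chunks if len(c) > 20]
print(oversized)
```

If strict chunk sizes matter, LangChain also provides RecursiveCharacterTextSplitter, which falls back to finer separators when a piece is too long.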


In [5]:
class ChatData:
    def __init__(self) -> None:
        # Build the retrieval chain once instead of on every call
        self.qa = RetrievalQA.from_chain_type(
            llm=chat,
            chain_type="stuff",
            retriever=docsearch.as_retriever(),
            return_source_documents=True,
        )

    def chat(self, query: str) -> dict:
        # Returns a dict with the keys 'query', 'result', and 'source_documents'
        return self.qa({"query": query})
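
For intuition about chain_type="stuff": the retrieved chunks are placed together into a single prompt that is sent to the model in one call. The sketch below illustrates that idea in plain Python. `build_stuff_prompt` and the prompt wording are assumptions made for illustration, not LangChain's actual template.

```python
def build_stuff_prompt(question, chunks):
    """Concatenate every retrieved chunk into one context block and
    append the question, in the spirit of the "stuff" strategy."""
    context = "\n\n".join(chunks)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Illustrative chunks standing in for what a retriever would return
prompt = build_stuff_prompt(
    "What is vector database",
    ["Chunk one about vector databases.", "Chunk two about ANN search."],
)
print(prompt)
```

Because all chunks share one prompt, this approach only works while the combined text fits in the model's context window.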

In [6]:
bot = ChatData()

In [7]:
res = bot.chat("What is vector database")

 A vector database is a type of database management system specifically designed to handle unstructured data in the form of vectors. It goes beyond just providing ANN search algorithms, offering user-friendly features such as cloud-nativity, multi-tenancy, and scalability that are commonly found in today's structured/semi-structured database systems.

In [8]:
res

{'query': 'What is vector database',
 'result': " A vector database is a type of database management system specifically designed to handle unstructured data in the form of vectors. It goes beyond just providing ANN search algorithms, offering user-friendly features such as cloud-nativity, multi-tenancy, and scalability that are commonly found in today's structured/semi-structured database systems.",
 'source_documents': [Document(page_content='Vector databases vs. vector search libraries\nA common misconception that I hear around the industry is that vector databases are merely wrappers around ANN search algorithms. This could not be further from the truth!\n\nA vector database is, at its core, a full-fledged solution for unstructured data. As we’ve already seen in the previous section, this means that user-friendly features present in today’s database management systems for structured/semi-structured data — cloud-nativity, multi-tenancy, scalability, etc — should also be attributes f
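
The raw dict is hard to read because each source document is printed in full. A small helper along the following lines could print the answer next to a short preview of each retrieved chunk. This is a sketch: `summarize_response` and `FakeDocument` are hypothetical names, and `FakeDocument` only mimics the `page_content` and `metadata` attributes of a retrieved document so the example runs without a live index.

```python
from dataclasses import dataclass, field


@dataclass
class FakeDocument:
    """Hypothetical stand-in with the two attributes the helper reads."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def summarize_response(response, preview_chars=40):
    """Format the query, the answer, and a one-line preview per source chunk."""
    lines = [f"Q: {response['query']}", f"A: {response['result'].strip()}"]
    for i, doc in enumerate(response["source_documents"], start=1):
        source = doc.metadata.get("source", "unknown")
        preview = doc.page_content[:preview_chars].replace("\n", " ")
        lines.append(f"[{i}] {source}: {preview}")
    return "\n".join(lines)


# Illustrative response shaped like the dict shown above
example = {
    "query": "What is vector database",
    "result": " A vector database stores embeddings.",
    "source_documents": [
        FakeDocument("Vector databases vs. vector search libraries\nA common misconception",
                     {"source": "medium.txt"}),
    ],
}
summary = summarize_response(example)
print(summary)
```

With the real notebook objects, `print(summarize_response(res))` would be the equivalent call.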

In [9]:
bot.chat("List some vector databases")

 Some popular vector databases include Milvus, Faiss, Annoy, ElasticSearch Vector Search, and Pinecone.io. These databases offer various features such as different vector dimensionalities, indexing algorithms, query types, scalability, and integration with machine learning frameworks and popular programming languages. It is important to consider the specific requirements of your use case when choosing a vector database.

{'query': 'List some vector databases',
 'result': ' Some popular vector databases include Milvus, Faiss, Annoy, ElasticSearch Vector Search, and Pinecone.io. These databases offer various features such as different vector dimensionalities, indexing algorithms, query types, scalability, and integration with machine learning frameworks and popular programming languages. It is important to consider the specific requirements of your use case when choosing a vector database.',
 'source_documents': [Document(page_content='Wrapping up\nIn this tutorial, we took a quick tour of vector databases. Specifically, we looked at 1) what features go into a mature vector database, 2) how a vector database differs from vector search libraries, 3) how a vector database differs from vector search plugins in traditional databases or search systems, and 4) the key challenges associated with building a vector database.\n\nThis tutorial is not meant to be a deep dive into vector databases, nor is it meant to

In [10]:
bot.chat("List of vector databases")

 Here's a list of some popular vector databases you may find useful:

1. Milvus: An open-source vector database developed by NVIDIA that provides efficient indexing and querying for similarity search on large datasets. It supports various similarity measures, including cosine similarity, Euclidean distance, and Hamming distance. Milvus is capable of handling both static and dynamic data and can be deployed in both on-premises and cloud environments. (https://milvus.io/)
2. Faiss: An open-source library developed by Facebook AI Research for efficient similarity search and clustering of dense vectors, such as word embeddings or raw feature vectors. It provides multiple indexing structures like IVF, HNSW, and FAISS, along with various similarity measures. (https://faiss.ai/)
3. Annoy: A C++ library for efficient large-scale approximate nearest neighbor search on real-valued vectors. It uses a tree-based data structure called Annoy Index to perform fast queries. (https://github.com/spotify

{'query': 'List of vector databases',
 'result': " Here's a list of some popular vector databases you may find useful:\n\n1. Milvus: An open-source vector database developed by NVIDIA that provides efficient indexing and querying for similarity search on large datasets. It supports various similarity measures, including cosine similarity, Euclidean distance, and Hamming distance. Milvus is capable of handling both static and dynamic data and can be deployed in both on-premises and cloud environments. (https://milvus.io/)\n2. Faiss: An open-source library developed by Facebook AI Research for efficient similarity search and clustering of dense vectors, such as word embeddings or raw feature vectors. It provides multiple indexing structures like IVF, HNSW, and FAISS, along with various similarity measures. (https://faiss.ai/)\n3. Annoy: A C++ library for efficient large-scale approximate nearest neighbor search on real-valued vectors. It uses a tree-based data structure called Annoy Inde

In [11]:
print(document[0].page_content)

In the previous tutorial, we took a quick look at the ever-increasing amount of data that is being generated on a daily basis. We then covered how these bits of data can be split into structured/semi-structured and unstructured data, the differences between them, and how modern machine learning can be used to understand unstructured data through embeddings. Finally, we briefly touched upon unstructured data processing via ANN search.

Through all of this information, it’s now clear that the ever-increasing amount of unstructured data requires a paradigm shift and a new category of database management system — the vector database.

A vector database is a fully managed, no-frills solution for storing, indexing, and searching across a massive dataset of unstructured data that leverages the power of embeddings from machine learning models.

Vector databases from 1000 feet
Guess how many curators it took to label the now-famous ImageNet dataset. Ready for the answer?

25000 people (that’s a 

In [13]:
bot.chat("What is the info related to snowflake in document")

 Snowflake is a cloud-based data warehousing platform that uses a "shared nothing" architecture, which separates compute resources from storage resources and enables greater scalability and concurrency compared to traditional shared storage database models.

In the context of your analogy, Snowflake's architecture can be thought of as a component in the vector database ecosystem that addresses the challenge of designing a flexible and scalable data model for vector databases. By decoupling storage from compute resources, Snowflake allows organizations to store and analyze massive amounts of data without having to worry about the limitations of traditional shared storage architectures. This can be particularly important for vector databases, given the large and complex nature of the data they work with.

{'query': 'What is the info related to snowflake in document',
 'result': ' Snowflake is a cloud-based data warehousing platform that uses a "shared nothing" architecture, which separates compute resources from storage resources and enables greater scalability and concurrency compared to traditional shared storage database models.\n\nIn the context of your analogy, Snowflake\'s architecture can be thought of as a component in the vector database ecosystem that addresses the challenge of designing a flexible and scalable data model for vector databases. By decoupling storage from compute resources, Snowflake allows organizations to store and analyze massive amounts of data without having to worry about the limitations of traditional shared storage architectures. This can be particularly important for vector databases, given the large and complex nature of the data they work with.',
 'source_documents': [Document(page_content='Picture an airplane. The airplane itself contains a number of