<a href="https://colab.research.google.com/github/rabbitmetrics/langchain-13-min/blob/main/notebooks/langchain-13-min.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [5]:
from dotenv import load_dotenv, find_dotenv
# import os

if load_dotenv(find_dotenv()):
    print("Environment variables loaded successfully!") 
    # print(os.getenv('OPENAI_API_KEY'))
else:
    print("Could not load environment variables.") 

Environment variables loaded successfully!
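
A successful load does not by itself guarantee that every key the later cells rely on is present. The sketch below checks for the three variable names that appear in this notebook (`OPENAI_API_KEY`, `PINECONE_API_KEY`, `PINECONE_ENV`). The helper `missing_env_vars` is a hypothetical name introduced here for illustration; it is not part of `python-dotenv`.

```python
import os

# Variable names taken from the cells in this notebook.
REQUIRED_VARS = ["OPENAI_API_KEY", "PINECONE_API_KEY", "PINECONE_ENV"]


def missing_env_vars(names, environ=None):
    """Return the names that are unset or empty in the given mapping.

    Hypothetical helper for illustration; defaults to os.environ.
    """
    environ = os.environ if environ is None else environ
    return [name for name in names if not environ.get(name)]


missing = missing_env_vars(REQUIRED_VARS)
if missing:
    print("Missing environment variables:", ", ".join(missing))
else:
    print("All required environment variables are set.")
```

Reporting only the names, never the values, keeps secrets out of the notebook output.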


In [3]:
# Import the chat message schema and ChatOpenAI to query chat models such as GPT-3.5-turbo or GPT-4
from langchain.schema import AIMessage, HumanMessage, SystemMessage
from langchain.chat_models import ChatOpenAI

chat = ChatOpenAI(model_name="gpt-3.5-turbo")  # temperature can also be passed here
messages = [
    SystemMessage(content="You are an expert computer scientist and developer."),
    HumanMessage(content="Explain how vector stores and embeddings can be used to enhance ChatGPT, so that we can input medical documents that it was never trained on from 2022 in at least 500 words.")
]
response = chat(messages)
print(response.content)

Vector stores and embeddings can indeed be used to enhance ChatGPT's capabilities when dealing with medical documents it was never trained on. By leveraging these techniques, we can enable ChatGPT to understand and generate responses based on the content of medical documents from 2022.

A vector store is a large collection of vectors, where each vector represents a piece of text or a document. These vectors are typically high-dimensional numerical representations that capture the semantic meaning of the corresponding text. Embeddings, on the other hand, are lower-dimensional representations of the original text, typically derived from vector stores using techniques like word2vec or BERT.

To integrate vector stores and embeddings into ChatGPT, we can follow a two-step process: indexing and retrieval.

In the indexing step, we process the medical documents from 2022 and convert them into vector representations using techniques like doc2vec or Universal Sentence Encoder. These vector rep

In [6]:
response_str = response.content
response_str

"Vector stores and embeddings can indeed be used to enhance ChatGPT's capabilities when dealing with medical documents it was never trained on. By leveraging these techniques, we can enable ChatGPT to understand and generate responses based on the content of medical documents from 2022.\n\nA vector store is a large collection of vectors, where each vector represents a piece of text or a document. These vectors are typically high-dimensional numerical representations that capture the semantic meaning of the corresponding text. Embeddings, on the other hand, are lower-dimensional representations of the original text, typically derived from vector stores using techniques like word2vec or BERT.\n\nTo integrate vector stores and embeddings into ChatGPT, we can follow a two-step process: indexing and retrieval.\n\nIn the indexing step, we process the medical documents from 2022 and convert them into vector representations using techniques like doc2vec or Universal Sentence Encoder. These vec

In [7]:
# Import utility for splitting up texts and split up the explanation given above into document chunks
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=100,
    chunk_overlap=0,
)

texts = text_splitter.create_documents([response_str])
texts


[Document(page_content="Vector stores and embeddings can indeed be used to enhance ChatGPT's capabilities when dealing with", metadata={}),
 Document(page_content='medical documents it was never trained on. By leveraging these techniques, we can enable ChatGPT to', metadata={}),
 Document(page_content='understand and generate responses based on the content of medical documents from 2022.', metadata={}),
 Document(page_content='A vector store is a large collection of vectors, where each vector represents a piece of text or a', metadata={}),
 Document(page_content='document. These vectors are typically high-dimensional numerical representations that capture the', metadata={}),
 Document(page_content='semantic meaning of the corresponding text. Embeddings, on the other hand, are lower-dimensional', metadata={}),
 Document(page_content='representations of the original text, typically derived from vector stores using techniques like', metadata={}),
 Document(page_content='word2vec or BERT.'
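
To make the effect of `chunk_size=100` and `chunk_overlap=0` concrete, here is a simplified sketch of size-bounded chunking in plain Python. It is not the algorithm `RecursiveCharacterTextSplitter` uses (that class works through a hierarchy of separators). This version only packs whitespace-separated words greedily, and `split_into_chunks` is a hypothetical helper name.

```python
def split_into_chunks(text: str, chunk_size: int = 100) -> list[str]:
    """Greedily pack words into chunks of at most chunk_size characters.

    Simplified illustration only: a single word longer than chunk_size
    is kept whole, so that one chunk may exceed the limit.
    """
    chunks = []
    current = ""
    for word in text.split():
        candidate = f"{current} {word}" if current else word
        if len(candidate) <= chunk_size or not current:
            current = candidate      # word still fits (or chunk was empty)
        else:
            chunks.append(current)   # close the full chunk
            current = word           # start a new chunk with this word
    if current:
        chunks.append(current)
    return chunks


sample = (
    "Vector stores and embeddings can indeed be used to enhance ChatGPT's "
    "capabilities when dealing with medical documents it was never trained on."
)
for chunk in split_into_chunks(sample, chunk_size=100):
    print(len(chunk), chunk)
```

With no overlap, each word lands in exactly one chunk, so joining the chunks with spaces gives back the whitespace-normalized text.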

In [6]:
# Individual text chunks can be accessed with "page_content"
# texts[0].page_content
for doc in texts:
    print(doc.page_content)

Embeddings and vector databases are two important concepts in the field of natural language
processing (NLP) and machine learning. They are closely related and play a crucial role in various
applications such as information retrieval, recommendation systems, and semantic search. In this
explanation, we will discuss the process of embeddings and vector databases and their relationship.
1. Embeddings:
Embeddings are vector representations of words, sentences, or documents in a high-dimensional
space. The goal of embeddings is to capture the semantic and syntactic relationships between
different entities in a way that is mathematically meaningful and computationally efficient.
Embeddings are learned through a process called word embedding or sentence embedding.
a. Word Embedding:
Word embedding is the process of representing words as continuous vectors in a high-dimensional
space. It is typically done using neural networks, specifically a technique called word2vec.
Word2vec models learn w

In [9]:
# Import and instantiate OpenAI embeddings
from langchain.embeddings import OpenAIEmbeddings
embeddings = OpenAIEmbeddings() 
embeddings

OpenAIEmbeddings(client=<class 'openai.api_resources.embedding.Embedding'>, model='text-embedding-ada-002', deployment='text-embedding-ada-002', openai_api_version='', openai_api_base='', openai_api_type='', openai_proxy='', embedding_ctx_length=8191, openai_api_key='sk-<redacted>', openai_organization='', allowed_special=set(), disallowed_special='all', chunk_size=1000, max_retries=6, request_timeout=None, headers=None, tiktoken_model_name=None, show_progress_bar=False)

In [12]:
# Turn the first text chunk into a vector with the embedding
query_result = embeddings.embed_query(texts[0].page_content)
query_result

# TODO: check which OpenAIEmbeddings methods fit here: embed_query (a single string) or embed_documents (a list of strings)

[-0.020228765904903412,
 0.007087043486535549,
 0.013748585246503353,
 -0.014494957402348518,
 -0.0008453357149846852,
 0.01803847961127758,
 -0.0033795989584177732,
 -0.011076993308961391,
 -0.011893119663000107,
 -0.043470919132232666,
 -0.0033551850356161594,
 0.05175773799419403,
 -0.015429665334522724,
 -0.01950332149863243,
 0.020842604339122772,
 0.01918245106935501,
 0.006710370071232319,
 0.023451417684555054,
 0.01591794565320015,
 -0.03250553458929062,
 -0.01713167130947113,
 0.020633341744542122,
 0.005053703673183918,
 -0.010295744054019451,
 -0.014815826900303364,
 -0.0020019502844661474,
 0.03381691500544548,
 -0.041461993008852005,
 -0.03214281052350998,
 -0.0014386838302016258,
 0.012367448769509792,
 -0.01778736338019371,
 -0.03133366256952286,
 -0.04946979507803917,
 -0.03292405977845192,
 -0.00885182898491621,
 -0.004565423354506493,
 -0.027413465082645416,
 0.029520047828555107,
 -0.00396204786375165,
 0.03189169615507126,
 0.00033547490602359176,
 -0.0030308272689
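
A single embedding like the one above only becomes useful when it is compared with other embeddings. The sketch below shows cosine similarity, one common way to compare two vectors. The notebook does not say which metric the Pinecone index uses, so cosine is an assumption here, and the short 3-dimensional vectors are made up for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention for zero vectors, which have no direction
    return dot / (norm_a * norm_b)


# Made-up toy vectors standing in for real embeddings.
v_query = [0.2, 0.1, 0.7]
v_close = [0.25, 0.05, 0.65]
v_far = [-0.6, 0.8, -0.1]

print(cosine_similarity(v_query, v_close))
print(cosine_similarity(v_query, v_far))
```

The same function applies unchanged to two results of `embeddings.embed_query`, since those are plain lists of floats.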

In [13]:
# Import and initialize Pinecone client
import os
import pinecone
from langchain.vectorstores import Pinecone
pinecone.init(
    api_key=os.getenv('PINECONE_API_KEY'),  
    environment=os.getenv('PINECONE_ENV')  
)



In [15]:
# Upload vectors to Pinecone
index_name = "file-q-and-a"
search = Pinecone.from_documents(texts, embeddings, index_name=index_name)

In [16]:
# Do a simple vector similarity search
query = "What is the purpose of a vector?"
result = search.similarity_search(query)
result

[Document(page_content='a vector store.', metadata={}),
 Document(page_content='The main purpose of a vector database is to store the embeddings generated from the previous step', metadata={}),
 Document(page_content='the vector store and improve the relevance of future document retrieval.', metadata={}),
 Document(page_content='vector is then compared with the vectors in the vector store to find the most relevant documents.', metadata={})]
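
What `similarity_search` does conceptually can be sketched with a tiny in-memory store: embed the query, score it against every stored vector, and return the best `k` texts. Everything below is hypothetical and only for illustration. `toy_embed` replaces the OpenAI model with simple word counts, `TinyVectorStore` is not a LangChain or Pinecone class, and the third sample text is made up (the first two are chunks from the result above).

```python
import math
import re
from collections import Counter


def toy_embed(text):
    """Hypothetical stand-in for a real embedding model: lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


class TinyVectorStore:
    """Brute-force in-memory store: keeps (text, vector) pairs in a list."""

    def __init__(self):
        self._items = []

    def add_texts(self, texts):
        for text in texts:
            self._items.append((text, toy_embed(text)))

    def similarity_search(self, query, k=4):
        query_vec = toy_embed(query)
        ranked = sorted(
            self._items, key=lambda item: cosine(query_vec, item[1]), reverse=True
        )
        return [text for text, _ in ranked[:k]]


store = TinyVectorStore()
store.add_texts([
    "a vector store.",
    "The main purpose of a vector database is to store the embeddings "
    "generated from the previous step",
    "Bananas are yellow.",  # made-up unrelated text
])
top = store.similarity_search("What is the purpose of a vector?", k=2)
print(top)
```

A hosted index avoids scoring every stored vector on each query, but the interface (texts in, nearest texts out) has the same shape.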