<a href="https://colab.research.google.com/github/rabbitmetrics/langchain-13-min/blob/main/notebooks/langchain-13-min.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [2]:
from dotenv import load_dotenv, find_dotenv
# import os

# Constant for debug prints
DEBUG = True

if load_dotenv(find_dotenv()):
    if DEBUG:
        print("Environment variables loaded successfully!")
    # print(os.getenv('OPENAI_API_KEY'))
elif DEBUG:
    print("Could not load environment variables.")

Environment variables loaded successfully!
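
`load_dotenv` only reports whether a `.env` file was found, not whether the keys this notebook relies on are actually set. A minimal sketch of an explicit check follows. The helper name `missing_env_vars` and the `REQUIRED_VARS` list are hypothetical; the variable names themselves are the ones read elsewhere in this notebook.

```python
import os

# Variable names read in this notebook; adjust to match your own .env file.
REQUIRED_VARS = ["OPENAI_API_KEY", "PINECONE_API_KEY", "PINECONE_ENV"]


def missing_env_vars(names, env=None):
    """Return the names that are unset or empty in env (defaults to os.environ)."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


missing = missing_env_vars(REQUIRED_VARS)
if missing:
    print("Missing environment variables:", ", ".join(missing))
else:
    print("All required environment variables are set.")
```

Passing a plain dictionary as `env` makes the helper easy to try without touching the real environment.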


In [4]:
# Import the chat message schema and ChatOpenAI to query chat models such as GPT-3.5-turbo or GPT-4
from langchain.schema import AIMessage, HumanMessage, SystemMessage
from langchain.chat_models import ChatOpenAI

chat = ChatOpenAI(model_name="gpt-3.5-turbo")  # a temperature can also be passed here
messages = [
    SystemMessage(content="You are an expert computer scientist and developer."),
    HumanMessage(content="Explain the process of embeddings and vector databases, and their relationship in at least 500 words.")
]
response = chat(messages)
print(response.content)

Embeddings and vector databases are two key concepts in the field of natural language processing (NLP) and machine learning. They play a crucial role in representing textual data in a numerical format and enabling efficient similarity search operations.

Embeddings can be defined as a technique of representing words, sentences, or documents as dense vectors in a high-dimensional space, typically with hundreds of dimensions. The goal of embeddings is to capture the semantic and syntactic relationships between words or documents, such that similar items are represented by vectors that are close to each other in the embedding space. This representation allows machines to understand and reason about textual data, which is otherwise challenging due to its unstructured nature.

The process of generating embeddings involves two main steps: training and inference. During the training phase, a large corpus of text is fed into a neural network-based model, such as Word2Vec, GloVe, or BERT. The m

In [5]:
# Keep the raw response text as a string for the splitting step
response_str = response.content
response_str

'Embeddings and vector databases are two key concepts in the field of natural language processing (NLP) and machine learning. They play a crucial role in representing textual data in a numerical format and enabling efficient similarity search operations.\n\nEmbeddings can be defined as a technique of representing words, sentences, or documents as dense vectors in a high-dimensional space, typically with hundreds of dimensions. The goal of embeddings is to capture the semantic and syntactic relationships between words or documents, such that similar items are represented by vectors that are close to each other in the embedding space. This representation allows machines to understand and reason about textual data, which is otherwise challenging due to its unstructured nature.\n\nThe process of generating embeddings involves two main steps: training and inference. During the training phase, a large corpus of text is fed into a neural network-based model, such as Word2Vec, GloVe, or BERT. 

In [7]:
# Import the text splitter and split the explanation above into document chunks
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=100,   # maximum number of characters per chunk
    chunk_overlap=0,  # no characters shared between neighbouring chunks
)

texts = text_splitter.create_documents([response_str])
texts


[Document(page_content='Embeddings and vector databases are two key concepts in the field of natural language processing', metadata={}),
 Document(page_content='(NLP) and machine learning. They play a crucial role in representing textual data in a numerical', metadata={}),
 Document(page_content='format and enabling efficient similarity search operations.', metadata={}),
 Document(page_content='Embeddings can be defined as a technique of representing words, sentences, or documents as dense', metadata={}),
 Document(page_content='vectors in a high-dimensional space, typically with hundreds of dimensions. The goal of embeddings', metadata={}),
 Document(page_content='is to capture the semantic and syntactic relationships between words or documents, such that similar', metadata={}),
 Document(page_content='items are represented by vectors that are close to each other in the embedding space. This', metadata={}),
 Document(page_content='representation allows machines to understand and reaso
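
To make `chunk_size` and `chunk_overlap` concrete, here is a simplified sketch of fixed-width splitting. It is not the algorithm `RecursiveCharacterTextSplitter` uses: the library prefers to break on separators such as paragraph breaks and spaces, which is why the chunks above end on word boundaries. The function name `split_fixed` is hypothetical.

```python
def split_fixed(text, chunk_size, chunk_overlap=0):
    """Split text into windows of at most chunk_size characters.

    Consecutive windows share chunk_overlap characters.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text.
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks


sample = "abcdefghij"
print(split_fixed(sample, chunk_size=4))                   # no overlap
print(split_fixed(sample, chunk_size=4, chunk_overlap=2))  # 2 shared characters
```

A non-zero overlap repeats the tail of one chunk at the head of the next, which can help when a sentence straddles a chunk boundary, at the cost of storing more vectors.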

In [8]:
# Individual text chunks can be accessed with "page_content"
# texts[0].page_content
for doc in texts:
    print(doc.page_content)


Embeddings and vector databases are two key concepts in the field of natural language processing
(NLP) and machine learning. They play a crucial role in representing textual data in a numerical
format and enabling efficient similarity search operations.
Embeddings can be defined as a technique of representing words, sentences, or documents as dense
vectors in a high-dimensional space, typically with hundreds of dimensions. The goal of embeddings
is to capture the semantic and syntactic relationships between words or documents, such that similar
items are represented by vectors that are close to each other in the embedding space. This
representation allows machines to understand and reason about textual data, which is otherwise
challenging due to its unstructured nature.
The process of generating embeddings involves two main steps: training and inference. During the
training phase, a large corpus of text is fed into a neural network-based model, such as Word2Vec,
GloVe, or BERT. The mod

In [23]:
# Import and instantiate OpenAI embeddings
# (the repr printed below shows the embedding model resolves to text-embedding-ada-002)
from langchain.embeddings import OpenAIEmbeddings
embeddings = OpenAIEmbeddings(model_name="ada")
embeddings

OpenAIEmbeddings(client=<class 'openai.api_resources.embedding.Embedding'>, model='text-embedding-ada-002', document_model_name='text-search-ada-doc-001', query_model_name='text-search-ada-query-001', embedding_ctx_length=8191, openai_api_key=None, openai_organization=None, allowed_special=set(), disallowed_special='all', chunk_size=1000, max_retries=6)

In [25]:
# Turn the first text chunk into an embedding vector (a list of floats)
query_result = embeddings.embed_query(texts[0].page_content)
# print(query_result)
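
Once text is a vector, "similar" becomes a number you can compute. A common choice is cosine similarity; the sketch below uses made-up 3-dimensional vectors in place of real embeddings so it runs without an API key. Whether the Pinecone index used in this notebook ranks by cosine similarity depends on how the index was created, so treat the metric here as an assumption.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy vectors standing in for real embeddings (values are invented).
v_cat = [0.9, 0.1, 0.0]
v_kitten = [0.8, 0.2, 0.1]
v_car = [0.0, 0.1, 0.9]

print(cosine_similarity(v_cat, v_kitten))  # close to 1: similar direction
print(cosine_similarity(v_cat, v_car))     # close to 0: nearly orthogonal
```

The same function works on the output of `embeddings.embed_query(...)`, since that call returns a plain list of floats.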

In [26]:
# Import and initialize the Pinecone client
# (credentials are read from the environment variables loaded in the first cell)
import os
import pinecone
from langchain.vectorstores import Pinecone
pinecone.init(
    api_key=os.getenv('PINECONE_API_KEY'),
    environment=os.getenv('PINECONE_ENV')
)

In [27]:
# Embed the chunks and upload the vectors to Pinecone (assumes the index already exists)
index_name = "learn-langchain"
search = Pinecone.from_documents(texts, embeddings, index_name=index_name)

In [28]:
# Do a simple vector similarity search
query = "What is the purpose of a vector?"
result = search.similarity_search(query)
result

[Document(page_content='context of a word and its actual context.', metadata={}),
 Document(page_content='detection, and search engines.', metadata={}),
 Document(page_content='approximate similarity search.', metadata={}),
 Document(page_content='embedding space.', metadata={})]
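
Conceptually, a similarity search embeds the query, scores every stored vector against it, and returns the best few. The sketch below does that exhaustively in memory with invented 2-dimensional vectors. It illustrates the idea only: a vector database such as Pinecone typically uses an approximate index rather than a full scan, and the names `top_k` and `store` are hypothetical. The default of `k=4` mirrors the four results returned above.

```python
import heapq
import math


def _cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, store, k=4):
    """Return the k (score, text) pairs most similar to query_vec, best first."""
    scored = ((_cosine(query_vec, vec), text) for text, vec in store)
    return heapq.nlargest(k, scored)


# Hypothetical store of (chunk text, toy vector) pairs; the vectors are invented.
store = [
    ("embedding space.", [0.9, 0.1]),
    ("detection, and search engines.", [0.2, 0.8]),
    ("approximate similarity search.", [0.6, 0.4]),
]

for score, text in top_k([1.0, 0.0], store, k=2):
    print(f"{score:.3f}  {text}")
```

Because every stored chunk is scored, the cost grows linearly with the number of chunks, which is the main reason dedicated vector indexes exist.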