# Vector Store and Embedding

In [1]:
from langchain.document_loaders import PyPDFLoader
loaders = [
        # Uncomment the next line to load Lecture01 a second time (this produces duplicate chunks).
        # PyPDFLoader("pdf/MachineLearning-Lecture01.pdf"),
        PyPDFLoader("pdf/MachineLearning-Lecture01.pdf"),
        PyPDFLoader("pdf/MachineLearning-Lecture02.pdf"),
        PyPDFLoader("pdf/MachineLearning-Lecture03.pdf"),
]
docs = []

for loader in loaders:
    docs.extend(loader.load())

In [2]:
from langchain.text_splitter import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 1000,
    chunk_overlap = 200
)

In [3]:
splits = text_splitter.split_documents(docs)
len(splits)

225
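
To make `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of fixed-size chunking with overlap. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter also tries to break on separators such as paragraphs and newlines before falling back to smaller pieces). The helper name `sliding_chunks` is hypothetical and used only for illustration.

```python
def sliding_chunks(text, chunk_size, chunk_overlap):
    """Split text into fixed-size windows; neighbours share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


chunks = sliding_chunks("abcdefghijklmnopqrstuvwxyz", chunk_size=10, chunk_overlap=4)
print(chunks)
```

Each chunk repeats the last four characters of the previous one, so a sentence cut at a chunk boundary still appears whole in at least one chunk when the overlap is long enough.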

In [4]:
splits[70]

Document(page_content="Uno, dos, tres, cuatro, cinco, seis, siete, ocho, nueve, diez.  \nInstructor (Andrew Ng): And so the algorithm discovers  that, gee, the structure \nunderlying the data is really th at there are two sources of so und, and here they are. I'll \nshow you one more example. This is a, well, th is is a second sort of different pair of \nmicrophone recordings:  \nMicrophone 1:  \nOne, two, three, four, five, six, seven, eight, nine, ten.  \nMicrophone 2:  \n[Music playing.]  Instructor (Andrew Ng): So the poor guy is not at a cocktail party. He's talking to his \nradio. There's the second recording:  \nMicrophone 1:  \nOne, two, three, four, five, six, seven, eight, nine, ten.  \nMicrophone 2:  \n[Music playing.]  \nInstructor (Andrew Ng) : Right. And we get this data. It's the same unsupervised \nlearning algorithm. The algorithm is actually called independent component analysis, and \nlater in this quarter, you'll see why. And then output's the following:  \nMicropho

## Embedding

In [5]:
import os
import getpass
# GOOGLE_API_KEY = os.getenv("GOOGLE_API_KEY")
if "GOOGLE_API_KEY" not in os.environ:
    os.environ["GOOGLE_API_KEY"] = getpass.getpass("Provide your Google API key here: ")

API_KEY = os.environ["GOOGLE_API_KEY"]


In [6]:
import google.generativeai as genai
genai.configure(api_key=API_KEY)



In [7]:
for model in genai.list_models():
    print(model.supported_generation_methods)

['generateMessage', 'countMessageTokens']
['generateText', 'countTextTokens', 'createTunedTextModel']
['embedText', 'countTextTokens']
['generateContent', 'countTokens']
['generateContent', 'countTokens']
['embedContent', 'countTextTokens']
['generateAnswer']


In [8]:
for model in genai.list_models():
    if 'embedContent' in model.supported_generation_methods:
        print(model.name)



models/embedding-001


In [9]:
# https://python.langchain.com/docs/integrations/text_embedding/google_generative_ai
from langchain_google_genai import GoogleGenerativeAIEmbeddings
gemini_embeddings = GoogleGenerativeAIEmbeddings(model="models/embedding-001")


In [10]:
embedding1 = gemini_embeddings.embed_query("I like dogs")
embedding2 = gemini_embeddings.embed_query("I like canines")
embedding3 = gemini_embeddings.embed_query("the weather is ugly outside")

In [11]:
display(len(embedding1))

768

In [12]:
import numpy as np

In [13]:
np.dot(embedding1, embedding2)

0.9793075046740478

In [14]:
np.dot(embedding1, embedding3)

0.7909725167442816

In [15]:
np.dot(embedding2, embedding3)

0.7868589379350919
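
The dot products above rank the two dog sentences as closer to each other than either is to the weather sentence. A raw dot product depends on vector length as well as direction, so it matches cosine similarity only when the vectors have unit length. The source does not say whether these embeddings are normalized, so the sketch below divides by the norms explicitly. It uses small made-up vectors rather than real embeddings, and `cosine_similarity` is a helper defined here, not a LangChain function.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine of the angle between a and b; insensitive to vector length."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# Toy vectors (assumed for illustration): v points the same way as u, w is orthogonal to u.
u = [3.0, 4.0]
v = [6.0, 8.0]
w = [4.0, -3.0]

dot_uv = float(np.dot(u, v))     # grows with the length of the vectors
cos_uv = cosine_similarity(u, v)  # same direction
cos_uw = cosine_similarity(u, w)  # orthogonal
print(dot_uv, cos_uv, cos_uw)
```

The same helper can be applied to `embedding1`, `embedding2`, and `embedding3` to compare them on a scale from -1 to 1.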

## Vector Stores

In [16]:
from langchain.vectorstores import Chroma

In [17]:
persist_dir = 'docs/chroma/'

In [18]:
!rm -Rf ./docs/chroma

In [19]:
vectordb = Chroma.from_documents(
    documents=splits,
    embedding=gemini_embeddings,
    persist_directory=persist_dir
)

In [20]:
print(vectordb._collection.count())

225


In [21]:
question = "what did they say about matlab?"
docs = vectordb.similarity_search(question, k=5)
len(docs)


5
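
Conceptually, `similarity_search` embeds the question and returns the `k` stored chunks whose vectors are closest to it. The following brute-force sketch shows that idea on toy vectors. It is an assumption-laden illustration, not Chroma's implementation: a real vector store uses an index and its own configurable distance metric, and `top_k` is a hypothetical helper.

```python
import numpy as np


def top_k(query_vec, doc_vecs, k):
    """Return (indices, scores) of the k rows of doc_vecs with highest cosine similarity."""
    q = np.asarray(query_vec, dtype=float)
    d = np.asarray(doc_vecs, dtype=float)
    scores = (d @ q) / (np.linalg.norm(d, axis=1) * np.linalg.norm(q))
    order = np.argsort(-scores)[:k]  # highest score first
    return [int(i) for i in order], [float(scores[i]) for i in order]


# Four toy "document" vectors and one toy "query" vector (made up for illustration).
doc_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
indices, scores = top_k([1.0, 0.1], doc_vecs, k=2)
print(indices)
```

Scanning every vector like this is fine for a few hundred chunks but scales linearly with the collection size, which is one reason vector stores build an index.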

In [22]:
for idx, doc in enumerate(docs):
    print(f"doc {idx} th")
    print(doc.page_content)
    print(doc.metadata)    
    print("")

doc 0 th
that you can download for free off the Internet. And it has somewhat fewer features than MATLAB, but it's free, and for the purposes of  this class, it will work for just about 
everything.  
So actually I, well, so yeah, just a side comment for those of you that haven't seen 
MATLAB before I guess, once a colleague of mine at a different university, not at 
Stanford, actually teaches another machine l earning course. He's taught it for many years. 
So one day, he was in his office, and an old student of his from, lik e, ten years ago came 
into his office and he said, "Oh, professo r, professor, thank you so much for your 
machine learning class. I learned so much from it. There's this stuff that I learned in your 
class, and I now use every day. And it's help ed me make lots of money, and here's a 
picture of my big house."  
So my friend was very excited. He said, "W ow. That's great. I'm glad to hear this
{'page': 8, 'source': 'pdf/MachineLearning-Lecture01.pdf'}

doc 1 th

When `MachineLearning-Lecture01.pdf` is loaded twice (see the commented-out loader in the first cell), the search results contain duplicate chunks, for example docs 2 and 3.
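
One simple way to cope with repeated chunks in a result list is to drop exact duplicates after the search. The sketch below assumes results are objects with `page_content` and `metadata` attributes, as in the output above. The `Doc` class is a hypothetical stand-in so the example runs without LangChain, and `dedupe_by_content` is an illustrative helper, not a library function.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for a search result with text and metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def dedupe_by_content(docs):
    """Keep the first occurrence of each distinct page_content, preserving rank order."""
    seen = set()
    unique = []
    for doc in docs:
        if doc.page_content not in seen:
            seen.add(doc.page_content)
            unique.append(doc)
    return unique


results = [
    Doc("chunk a", {"page": 8}),
    Doc("chunk b", {"page": 9}),
    Doc("chunk b", {"page": 9}),  # exact duplicate of the previous result
    Doc("chunk c", {"page": 1}),
]
unique = dedupe_by_content(results)
print([d.page_content for d in unique])
```

This only removes exact text matches. Avoiding the duplicate load in the first place keeps the index smaller and leaves all `k` result slots for distinct chunks.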

In [23]:
vectordb.persist()  # write the index to persist_dir; recent Chroma versions may persist automatically