In [1]:
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import Chroma
import numpy as np

In [2]:
# Load PDF
loaders = [
    # Duplicate documents on purpose - messy data
    PyPDFLoader("D:/Github/Chat-with-your-docs/Load-docs/docs/MachineLearning-Lecture01.pdf"),
    PyPDFLoader("D:/Github/Chat-with-your-docs/Load-docs/docs/MachineLearning-Lecture01.pdf"),
    PyPDFLoader("D:/Github/Chat-with-your-docs/Load-docs/docs/MachineLearning-Lecture02.pdf"),
    PyPDFLoader("D:/Github/Chat-with-your-docs/Load-docs/docs/MachineLearning-Lecture03.pdf")
]
docs = []
for loader in loaders:
    docs.extend(loader.load())

In [3]:
# Split
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 1500,
    chunk_overlap = 150
)

splits = text_splitter.split_documents(docs)
len(splits)

209
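
To build intuition for `chunk_size` and `chunk_overlap`, here is a minimal sketch of a fixed-size sliding-window splitter. It is a simplification: `RecursiveCharacterTextSplitter` also tries to break on separators such as paragraphs and sentences, which this sketch ignores. The function name `split_with_overlap` and the sample string are made up for illustration.

```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Fixed-size sliding-window splitter (simplified: ignores separators)."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = split_with_overlap(sample, chunk_size=10, chunk_overlap=3)
print(chunks)
```

Each chunk starts with the last `chunk_overlap` characters of the previous one, so a sentence cut at a chunk boundary still appears intact in at least one chunk.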

# Embeddings

Now that the data is split into chunks, let's create the embeddings (vector representations).

In [4]:
# # This option is for those who have an OpenAI API key
# embedding = OpenAIEmbeddings()

In [5]:
# Hugging Face alternative
# Load the embedding model from Hugging Face
embedding = HuggingFaceEmbeddings(model_name = "sentence-transformers/all-MiniLM-L6-v2")



In [6]:
# Example of how embeddings work and how similar they are to each other
sentence1 = "i like dogs"
sentence2 = "i like canines"
sentence3 = "the weather is ugly outside"

In [7]:
# Generate the embedding for each sentence
embedding1 = embedding.embed_query(sentence1)
embedding2 = embedding.embed_query(sentence2)
embedding3 = embedding.embed_query(sentence3)

In [8]:
# Dot product (similarity) between embeddings 1 and 2
np.dot(embedding1, embedding2)

0.9151646748649319

In [9]:
# Dot product (similarity) between embeddings 1 and 3
np.dot(embedding1, embedding3)

0.08337087441295077

In [10]:
# Dot product (similarity) between embeddings 2 and 3
np.dot(embedding2, embedding3)

0.04040370466713196
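
The dot products above fall in the range you would expect for cosine similarity. A raw dot product equals cosine similarity only when both vectors have unit length, so this reading assumes the embedding model returns normalized vectors. One way to check is `np.linalg.norm(embedding1)`. The sketch below uses small made-up vectors and only the standard library to show the relationship; the helper names are illustrative.

```python
import math


def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a, b):
    """Dot product divided by the product of the vector lengths."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


a = [3.0, 4.0]
b = [4.0, 3.0]

raw_dot = dot(a, b)                            # depends on vector length
cos_sim = cosine_similarity(a, b)              # independent of vector length
unit_dot = dot(normalize(a), normalize(b))     # matches cos_sim

print(raw_dot, cos_sim, unit_dot)
```

If the vectors are not normalized, use the cosine form instead of the raw dot product when comparing scores across sentences.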

# VectorStore

Storing the embedding vectors.

In [11]:
persist_directory = 'docs/chroma/'

In [12]:
# # Command that removes any previously persisted documents from the directory
# !rm -rf ./docs/chroma/

In [13]:
vectordb = Chroma.from_documents(
    documents=splits,
    embedding=embedding,
    persist_directory=persist_directory
)

In [14]:
print(vectordb._collection.count())

209


# Similarity Search

In this step we ask a question, and the similarity between the question and the vectors stored in the vector store is computed.
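
Conceptually, a similarity search embeds the question, scores it against every stored vector, and returns the `k` best matches. The sketch below does this with a brute-force scan over a tiny made-up store. The texts, the 3-dimensional vectors and the `top_k` helper are all hypothetical. A real vector store typically uses an index instead of a full scan, and its distance metric may differ.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, stored, k):
    """Return the k (score, text) pairs most similar to the query.

    `stored` is a list of (text, vector) pairs.
    """
    scored = [(cosine(query_vec, vec), text) for text, vec in stored]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Made-up texts and vectors, for illustration only
store = [
    ("email the TAs", [0.9, 0.1, 0.0]),
    ("homework uses MATLAB", [0.1, 0.9, 0.0]),
    ("contact address for questions", [0.8, 0.2, 0.1]),
]
query = [1.0, 0.0, 0.0]

results = top_k(query, store, k=2)
for score, text in results:
    print(round(score, 3), text)
```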

In [15]:
question = "is there an email i can ask for help"

In [16]:
docs = vectordb.similarity_search(question,k=3)

In [17]:
len(docs)

3

In [18]:
docs[0].page_content

"cs229-qa@cs.stanford.edu. This goes to an acc ount that's read by all the TAs and me. So \nrather than sending us email individually, if you send email to this account, it will \nactually let us get back to you maximally quickly with answers to your questions.  \nIf you're asking questions about homework probl ems, please say in the subject line which \nassignment and which question the email refers to, since that will also help us to route \nyour question to the appropriate TA or to me  appropriately and get the response back to \nyou quickly.  \nLet's see. Skipping ahead — let's see — for homework, one midterm, one open and term \nproject. Notice on the honor code. So one thi ng that I think will help you to succeed and \ndo well in this class and even help you to enjoy this cla ss more is if you form a study \ngroup.  \nSo start looking around where you' re sitting now or at the end of class today, mingle a \nlittle bit and get to know your classmates. I strongly encourage you to f

Let's persist the vector store so that we can use it later!

In [19]:
vectordb.persist()



# Failure modes

Although similarity search usually returns relevant results, it can still go wrong, for example by returning duplicate documents.

In [28]:
question = "what did they say about matlab?"

In [29]:
docs = vectordb.similarity_search(question,k=5)

Notice that we are getting duplicate chunks (because of the duplicate MachineLearning-Lecture01.pdf file in the index).

Semantic search returns the most similar chunks, but it does not enforce diversity among them.

docs[0] and docs[1] are identical.

In [30]:
docs[0]

Document(metadata={'page': 8, 'source': 'D:/Github/Chat-with-your-docs/Load-docs/docs/MachineLearning-Lecture01.pdf'}, page_content='those homeworks will be done in either MATLA B or in Octave, which is sort of — I \nknow some people call it a free ve rsion of MATLAB, which it sort  of is, sort of isn\'t.  \nSo I guess for those of you that haven\'t s een MATLAB before, and I know most of you \nhave, MATLAB is I guess part of the programming language that makes it very easy to write codes using matrices, to write code for numerical routines, to move data around, to \nplot data. And it\'s sort of an extremely easy to  learn tool to use for implementing a lot of \nlearning algorithms.  \nAnd in case some of you want to work on your  own home computer or something if you \ndon\'t have a MATLAB license, for the purposes of  this class, there\'s also — [inaudible] \nwrite that down [inaudible] MATLAB — there\' s also a software package called Octave \nthat you can download for free off the 

In [31]:
docs[1]

Document(metadata={'page': 8, 'source': 'D:/Github/Chat-with-your-docs/Load-docs/docs/MachineLearning-Lecture01.pdf'}, page_content='those homeworks will be done in either MATLA B or in Octave, which is sort of — I \nknow some people call it a free ve rsion of MATLAB, which it sort  of is, sort of isn\'t.  \nSo I guess for those of you that haven\'t s een MATLAB before, and I know most of you \nhave, MATLAB is I guess part of the programming language that makes it very easy to write codes using matrices, to write code for numerical routines, to move data around, to \nplot data. And it\'s sort of an extremely easy to  learn tool to use for implementing a lot of \nlearning algorithms.  \nAnd in case some of you want to work on your  own home computer or something if you \ndon\'t have a MATLAB license, for the purposes of  this class, there\'s also — [inaudible] \nwrite that down [inaudible] MATLAB — there\' s also a software package called Octave \nthat you can download for free off the 
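
One simple way to cope with duplicates at query time is to drop results whose text has already been seen. The sketch below is an assumption-laden illustration, not part of the original notebook. `Doc` is a hypothetical stand-in for a LangChain `Document`, and the sample texts are made up. LangChain vector stores also expose a `max_marginal_relevance_search` method aimed at result diversity, which is not shown here.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for a LangChain Document."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def drop_duplicates(docs):
    """Keep the first occurrence of each distinct page_content, preserving order."""
    seen = set()
    unique = []
    for doc in docs:
        key = doc.page_content.strip()
        if key not in seen:
            seen.add(key)
            unique.append(doc)
    return unique


# Made-up results that mimic the duplicated chunk above
retrieved = [
    Doc("homeworks will be done in MATLAB or Octave", {"page": 8}),
    Doc("homeworks will be done in MATLAB or Octave", {"page": 8}),
    Doc("Octave can be downloaded for free", {"page": 9}),
]

unique_docs = drop_duplicates(retrieved)
print(len(retrieved), "->", len(unique_docs))
```

Exact-match de-duplication only removes identical chunks. It does nothing for near-duplicates, and it does not fix the root cause, which is the same file being indexed twice.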

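We can also see another failure mode.

The question below asks about the third lecture, but the results include chunks from other lectures as well. The query embedding captures the topic (regression) but does not act as a filter on the source file.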

In [32]:
question = "what did they say about regression in the third lecture?"

In [33]:
docs = vectordb.similarity_search(question,k=5)

In [34]:
for doc in docs:
    print(doc.metadata)

{'page': 0, 'source': 'D:/Github/Chat-with-your-docs/Load-docs/docs/MachineLearning-Lecture03.pdf'}
{'page': 2, 'source': 'D:/Github/Chat-with-your-docs/Load-docs/docs/MachineLearning-Lecture02.pdf'}
{'page': 11, 'source': 'D:/Github/Chat-with-your-docs/Load-docs/docs/MachineLearning-Lecture03.pdf'}
{'page': 13, 'source': 'D:/Github/Chat-with-your-docs/Load-docs/docs/MachineLearning-Lecture03.pdf'}
{'page': 10, 'source': 'D:/Github/Chat-with-your-docs/Load-docs/docs/MachineLearning-Lecture03.pdf'}
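
A straightforward mitigation is to filter on metadata, so that only chunks from the intended file are kept. The sketch below post-filters a list of metadata dictionaries shaped like the output above. The file paths are shortened for readability and `filter_by_source` is a hypothetical helper. LangChain's Chroma wrapper also accepts a `filter` argument on `similarity_search` that applies the restriction inside the store. Verify the exact usage against your installed version.

```python
def filter_by_source(metadatas, source_suffix):
    """Keep only entries whose 'source' path ends with the given file name."""
    return [m for m in metadatas if m.get("source", "").endswith(source_suffix)]


# Metadata shaped like the search results above (paths shortened)
retrieved_metadata = [
    {"page": 0, "source": "docs/MachineLearning-Lecture03.pdf"},
    {"page": 2, "source": "docs/MachineLearning-Lecture02.pdf"},
    {"page": 11, "source": "docs/MachineLearning-Lecture03.pdf"},
    {"page": 13, "source": "docs/MachineLearning-Lecture03.pdf"},
    {"page": 10, "source": "docs/MachineLearning-Lecture03.pdf"},
]

lecture3_only = filter_by_source(retrieved_metadata, "MachineLearning-Lecture03.pdf")
for meta in lecture3_only:
    print(meta)
```

Post-filtering can leave fewer than `k` results, which is one reason to prefer filtering inside the vector store when it is available.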


In [35]:
print(docs[4].page_content)

answer. You predict that if X is to the right of, sort of, the mid-point here then Y is one 
and then next to the left of that mid-point then Y is zero.  
So some people actually do this. Apply linear  regression to classi fication problems and 
sometimes it’ll work okay, but in general it’s actually a pretty bad idea to apply linear 
regression to classification problems like thes e and here’s why. Let’s say I change my 
training set by giving you just one more tr aining example all the way up there, right? 
Imagine if given this training set is actually  still entirely obvious  what the relationship 
between X and Y is, right? It’s ju st – take this value as greate r than Y is one and it’s less 
then Y is zero. By giving you this additiona l training example it really shouldn’t change 
anything. I mean, I didn’t really convey much  new information. There’s no surprise that 
this corresponds to Y equals one. But if you now  fit linear regression to this data set you 
end up with a lin