We discussed Document Loading and Splitting as well as Storage and Retrieval.

Let's load our vectorDB.

In [5]:
from langchain_community.vectorstores import Chroma
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.llms import Ollama

In [3]:
# Raw string so the backslashes in the Windows path are not read as escape sequences
persist_directory = r'D:\Github\Chat-with-your-docs\Vectorstores-and-Embeddings\docs\chroma'
embedding = HuggingFaceEmbeddings(model_name = "sentence-transformers/all-MiniLM-L6-v2")
vectordb = Chroma(
    persist_directory=persist_directory,
    embedding_function=embedding
)

# Print the number of embeddings stored in the collection
print(vectordb._collection.count())

209


In [4]:
# Check that the number of returned documents equals k
question = "What are major topics for this class?"
docs = vectordb.similarity_search(question,k=3)
len(docs)

3

In [6]:
# Instantiate the Ollama LLM (the model must already be available locally, e.g. via `ollama pull gemma2:2b`)
llm = Ollama(model='gemma2:2b', temperature=0)


# RetrievalQA chain

In [11]:
from langchain.chains import RetrievalQA

In [12]:
qa_chain = RetrievalQA.from_chain_type(
    llm,
    retriever=vectordb.as_retriever()
)

In [13]:
result = qa_chain({"query": question})

In [14]:
result["result"]

"This document describes a machine learning course.  Here are some of the major topics covered in the course: \n\n* **Statistics:** The course will cover basic statistical concepts, likely including probability distributions and hypothesis testing.\n* **Algebra:** Students will be expected to have a working knowledge of algebra for understanding mathematical concepts related to machine learning. \n* **Machine Learning:**  The core focus is on machine learning, covering various algorithms and techniques like linear regression, logistic regression, decision trees, support vector machines, and more. \n\n\nLet me know if you'd like me to elaborate on any specific topic! \n"

# Prompt


In [15]:
from langchain.prompts import PromptTemplate

# Build prompt
template = """Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer. Use three sentences maximum. Keep the answer as concise as possible. Always say "thanks for asking!" at the end of the answer. 
{context}
Question: {question}
Helpful Answer:"""
QA_CHAIN_PROMPT = PromptTemplate.from_template(template)
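
To make the role of the two placeholders concrete, the sketch below fills a shortened version of the template with plain `str.format`. It assumes that, with the default chain type, the retrieved chunks are simply joined into one `{context}` string. `TEMPLATE`, `build_prompt`, and `chunks` are illustrative names, not part of LangChain.

```python
# Hypothetical helper: mimics how a "stuff"-style chain fills the prompt.
TEMPLATE = """Use the following pieces of context to answer the question at the end.
{context}
Question: {question}
Helpful Answer:"""


def build_prompt(chunks, question):
    """Join the retrieved chunks and substitute both placeholders."""
    context = "\n\n".join(chunks)
    return TEMPLATE.format(context=context, question=question)


# Short excerpts standing in for retrieved documents.
chunks = [
    "I also assume familiarity with basic probability and statistics.",
    "We'll go over those in the discussion sections as a refresher.",
]
prompt = build_prompt(chunks, "Is probability a class topic?")
print(prompt)
```

The string that reaches the model is the instructions, then every chunk, then the question, which is why the retriever's choice of chunks directly bounds what the model can answer.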


In [16]:
# Run chain
qa_chain = RetrievalQA.from_chain_type(
    llm,
    retriever=vectordb.as_retriever(),
    return_source_documents=True,
    chain_type_kwargs={"prompt": QA_CHAIN_PROMPT}
)

In [17]:
question = "Is probability a class topic?"

In [18]:
result = qa_chain({"query": question})

In [19]:
result["result"]

'Yes, the text states that the instructor assumes familiarity with basic probability and statistics.  \n\nthanks for asking! \n'

In [20]:
result["source_documents"][0]

Document(metadata={'page': 8, 'source': 'D:/Github/Chat-with-your-docs/Load-docs/docs/MachineLearning-Lecture01.pdf'}, page_content="statistics for a while or maybe algebra, we'll go over those in the discussion sections as a \nrefresher for those of you that want one.  \nLater in this quarter, we'll also use the disc ussion sections to go over extensions for the \nmaterial that I'm teaching in the main lectur es. So machine learning is a huge field, and \nthere are a few extensions that we really want  to teach but didn't have time in the main \nlectures for.")

# RetrievalQA chain types

In [21]:
qa_chain_mr = RetrievalQA.from_chain_type(
    llm,
    retriever=vectordb.as_retriever(),
    chain_type="map_reduce"
)

In [22]:
result = qa_chain_mr({"query": question})

To support symlinks on Windows, you either need to activate Developer Mode or to run Python as an administrator. In order to see activate developer mode, see this article: https://docs.microsoft.com/en-us/windows/apps/get-started/enable-your-device-for-development
Token indices sequence length is longer than the specified maximum sequence length for this model (1700 > 1024). Running this sequence through the model will result in indexing errors


In [23]:
result["result"]

'Yes, probability is likely covered in the course. \n'
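
The `map_reduce` pattern can be sketched in a few lines. This is a simplified illustration, not LangChain's implementation: each chunk is sent to the model on its own, and one final call merges the per-chunk answers. `map_reduce_answer`, `fake_llm`, and the prompt wording are placeholders.

```python
def map_reduce_answer(llm, chunks, question):
    # Map step: one independent LLM call per retrieved chunk.
    partial_answers = [
        llm(f"Context:\n{chunk}\n\nQuestion: {question}") for chunk in chunks
    ]
    # Reduce step: one more call that merges the partial answers.
    summaries = "\n".join(partial_answers)
    return llm(f"Partial answers:\n{summaries}\n\nQuestion: {question}")


calls = []


def fake_llm(prompt):
    # Stand-in for a real model: records the prompt, returns a canned string.
    calls.append(prompt)
    return f"answer {len(calls)}"


chunks = ["chunk A", "chunk B", "chunk C"]
final = map_reduce_answer(fake_llm, chunks, "Is probability a class topic?")
print(len(calls))  # one call per chunk, plus the final reduce call
```

Because no single call sees all chunks together, the combined answer may be vaguer than with the default chain type, which would be consistent with the shorter answer above. The extra calls also add latency.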

If you wish to experiment on the LangSmith platform (previously known as LangChain Plus):

 * Go to [LangSmith](https://www.langchain.com/langsmith) and sign up.
 * Create an API key from your account's settings.
 * Use this API key in the code below.
 * Uncomment the code.

Note: the endpoint in the video differs from the one below. Use the one below.
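
A minimal sketch of that setup, assuming the `LANGCHAIN_*` environment variables and the `https://api.smith.langchain.com` endpoint described in the LangSmith documentation; verify both against the docs for your installed version. `configure_langsmith` is a hypothetical helper, and the key shown is a placeholder.

```python
import os


def configure_langsmith(api_key, endpoint="https://api.smith.langchain.com"):
    """Set the environment variables LangChain reads to enable tracing.

    The variable names and default endpoint are assumptions; check the
    LangSmith documentation for your version.
    """
    os.environ["LANGCHAIN_TRACING_V2"] = "true"
    os.environ["LANGCHAIN_ENDPOINT"] = endpoint
    os.environ["LANGCHAIN_API_KEY"] = api_key


# Uncomment and replace the placeholder with your own key:
# configure_langsmith("<your-api-key>")
```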

In [24]:
qa_chain_mr = RetrievalQA.from_chain_type(
    llm,
    retriever=vectordb.as_retriever(),
    chain_type="map_reduce"
)
result = qa_chain_mr({"query": question})
result["result"]

'Yes, probability is likely covered in the course. \n'

In [25]:
# Same retriever, but the answer is built up chunk by chunk ("refine")
qa_chain_refine = RetrievalQA.from_chain_type(
    llm,
    retriever=vectordb.as_retriever(),
    chain_type="refine"
)
result = qa_chain_refine({"query": question})
result["result"]

'The provided text strongly suggests that **probability is integrated into the course**, rather than being a standalone class topic.  \n\nHere\'s why:\n\n* **Emphasis on Familiarity:** The instructor explicitly mentions assuming familiarity with basic probability and statistics (like random variables, expectation, variance) and even plans to provide a refresher in discussion sections. This indicates that probability will be covered within the broader context of the course.\n* **Integration into Other Areas:**  The text mentions "statistics" being covered in discussion sections, implying probability will be used as part of a statistical framework. Additionally, machine learning is mentioned as a "huge field" with extensions to lectures, suggesting that probability concepts are likely used within this domain. \n* **Refresher Sessions:** The instructor explicitly states they\'ll provide a refresher course on basic probability and statistics in discussion sections for those who may need it
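
For comparison, the `refine` pattern can be sketched as a sequential loop. Again, this is a simplified illustration under assumed prompt wording, not LangChain's implementation; `refine_answer` and `fake_llm` are placeholder names.

```python
def refine_answer(llm, chunks, question):
    if not chunks:
        raise ValueError("refine needs at least one chunk")
    # Start from the first chunk alone.
    answer = llm(f"Context:\n{chunks[0]}\n\nQuestion: {question}")
    # Each later chunk is used to revise the running answer.
    for chunk in chunks[1:]:
        answer = llm(
            f"Existing answer:\n{answer}\n\n"
            f"New context:\n{chunk}\n\n"
            f"Refine the answer to: {question}"
        )
    return answer


calls = []


def fake_llm(prompt):
    # Stand-in for a real model: records the prompt, returns a canned string.
    calls.append(prompt)
    return f"answer {len(calls)}"


chunks = ["chunk A", "chunk B", "chunk C"]
final = refine_answer(fake_llm, chunks, "Is probability a class topic?")
print(len(calls))  # one call per chunk, each seeing the previous answer
```

Since every step carries the previous answer forward, information can accumulate across chunks, which may explain why the refine output above is much longer than the `map_reduce` one. The calls cannot run in parallel, though.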

# RetrievalQA limitations
 
RetrievalQA does not preserve conversational history: each query is answered on its own, with no memory of earlier questions or answers.

In [26]:
qa_chain = RetrievalQA.from_chain_type(
    llm,
    retriever=vectordb.as_retriever()
)

In [27]:
question = "Is probability a class topic?"
result = qa_chain({"query": question})
result["result"]

'Yes, the text states "I also assume familiarity with basic probability and statistics."  So yes, probability is a class topic. \n'

In [28]:
question = "why are those prerequisites needed?"
result = qa_chain({"query": question})
result["result"]

'The provided text explains the concepts of least squares regression and how it can be used with probabilistic semantics.  It also discusses the assumptions made about the data and error terms. \n\nHere\'s a breakdown of why these prerequisites are important:\n\n* **Least Squares Regression:** The text describes how least squares regression works to find the best-fitting line through data points. It explains that this method relies on certain assumptions, like linearity and homoscedasticity (constant variance).\n* **Probabilistic Semantics:**  The text introduces the idea of adding probabilistic elements to the model by assuming error terms have a distribution. This allows for more flexibility in modeling real-world situations where there might be randomness or uncertainty. \n* **Assumptions and Practicality:** The text emphasizes that assumptions are not always "absolutely true" but can still be useful for practical purposes. It explains how these assumptions help us understand the mo

Note: the LLM response varies. Some responses **do** mention probability, which the model may have gleaned from the retrieved documents. The point is simply that the model has no access to past questions or answers. This is covered in the next section.
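
The effect can be illustrated without any model. In the sketch below, `stateless_qa` stands in for the chain: it sees only the string passed to it, so a follow-up containing "those" has no referent unless the earlier turn is folded into the query by hand. The function names and the prompt layout are illustrative assumptions, not how LangChain handles memory.

```python
seen_inputs = []


def stateless_qa(query):
    # Stand-in for qa_chain: each call receives only the current query.
    seen_inputs.append(query)
    return f"(answer to: {query})"


def with_history(history, question):
    # Manually prepend earlier turns so the follow-up is self-contained.
    turns = "\n".join(f"Q: {q}\nA: {a}" for q, a in history)
    return f"{turns}\nFollow-up question: {question}"


first_q = "Is probability a class topic?"
first_a = stateless_qa(first_q)

follow_up = "why are those prerequisites needed?"
stateless_qa(follow_up)  # no trace of the first turn
stateless_qa(with_history([(first_q, first_a)], follow_up))

print("probability" in seen_inputs[1])  # False
print("probability" in seen_inputs[2])  # True
```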