In [1]:
import os
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import SentenceTransformersTokenTextSplitter
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS
from langchain_google_genai import ChatGoogleGenerativeAI
from langchain.chains import RetrievalQA
from utils import clean
from dotenv import load_dotenv

In [2]:
load_dotenv()

True

In [3]:
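documents_dir = "../../documents"  # renamed from `dir` to avoid shadowing the built-in
# Keep only PDF files, since PyPDFLoader cannot parse other file types.
documents_path = [
    os.path.join(documents_dir, file)
    for file in sorted(os.listdir(documents_dir)) if file.lower().endswith(".pdf")
]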

In [4]:
documents = []
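for file in documents_path:
    # PyPDFLoader returns one Document per PDF page.
    loader = PyPDFLoader(file)
    loaded_docs = loader.load()

    # Clean the extracted page text in place before splitting.
    for doc in loaded_docs:
        doc.page_content = clean(doc.page_content)

    documents.extend(loaded_docs)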
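
The `clean` helper comes from a local `utils` module that is not shown in this notebook. As a rough idea of what such a step often does for PDF text, here is a minimal sketch. `clean_sketch` is a hypothetical stand-in, not the actual implementation; it re-joins words hyphenated across line breaks and collapses whitespace.

```python
import re


def clean_sketch(text: str) -> str:
    """Hypothetical stand-in for utils.clean; the real helper is not shown."""
    # Re-join words split by a hyphen at a line break: "net-\nworks" -> "networks".
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Collapse every run of whitespace (newlines, tabs, repeated spaces) to one space.
    text = re.sub(r"\s+", " ", text)
    return text.strip()


sample = "Neural  net-\nworks are\nused in NLP.\n\n"
print(clean_sketch(sample))
```

Note that the hyphen rule would also merge genuinely hyphenated compounds that happen to break across lines, so a real cleaner may need to be more careful.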

In [None]:
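# all-MiniLM-L6-v2 truncates input past 256 tokens, so split with its tokenizer and leave headroom for special tokens.
text_splitter = SentenceTransformersTokenTextSplitter(
    model_name="sentence-transformers/all-MiniLM-L6-v2", tokens_per_chunk=250, chunk_overlap=30)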
chunks = text_splitter.split_documents(documents)
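
To make the `tokens_per_chunk` and `chunk_overlap` settings concrete, the sketch below shows overlapping token windows in plain Python. It is an illustration of the idea only, not the library's code. `sliding_windows` is a hypothetical name, and integers stand in for the word-piece token ids a real tokenizer would produce.

```python
def sliding_windows(tokens, tokens_per_chunk, chunk_overlap):
    """Split a token list into windows that share chunk_overlap tokens."""
    if not 0 <= chunk_overlap < tokens_per_chunk:
        raise ValueError("chunk_overlap must be >= 0 and smaller than tokens_per_chunk")
    step = tokens_per_chunk - chunk_overlap  # how far each new window advances
    windows = []
    start = 0
    while start < len(tokens):
        end = min(start + tokens_per_chunk, len(tokens))
        windows.append(tokens[start:end])
        if end == len(tokens):
            break  # the last window reached the end of the input
        start += step
    return windows


tokens = list(range(10))  # stand-in for tokenizer output
windows = sliding_windows(tokens, tokens_per_chunk=4, chunk_overlap=1)
print(windows)
```

Each window starts one token before the previous window ended, so a sentence cut at a chunk boundary still appears intact in at least part of a neighbouring chunk.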

In [6]:
embedding_model = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
# embeddings = embedding_model.embed_documents([chunk.page_content for chunk in chunks])

In [7]:
vectorstore = FAISS.from_documents(documents=chunks, embedding=embedding_model)
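
Conceptually, the vector store answers a query by embedding it and returning the stored chunks whose vectors are closest to it. The sketch below does this by brute force in plain Python, assuming an L2 (Euclidean) distance. The two-dimensional vectors and chunk labels are made up for illustration; real embeddings from this model have several hundred dimensions, and FAISS uses optimized index structures instead of a Python loop.

```python
import math


def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def top_k(query_vec, stored, k=4):
    """Return the k (distance, text) pairs closest to query_vec."""
    scored = [(l2_distance(query_vec, vec), text) for text, vec in stored]
    return sorted(scored)[:k]


# Toy data: (chunk text, embedding) pairs with invented 2-D vectors.
stored = [
    ("chunk about RNNs", [0.9, 0.1]),
    ("chunk about naive Bayes", [0.1, 0.9]),
    ("chunk about LSTMs", [0.8, 0.3]),
]
hits = top_k([1.0, 0.0], stored, k=2)
for distance, text in hits:
    print(f"{distance:.3f}  {text}")
```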

In [8]:
retriever = vectorstore.as_retriever()
llm = ChatGoogleGenerativeAI(model="gemini-2.0-flash", temperature=0.8)

rag_pipeline = RetrievalQA.from_chain_type(llm=llm, retriever=retriever, return_source_documents=True)
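
`RetrievalQA.from_chain_type` defaults to the "stuff" chain type, which places all retrieved chunks into a single prompt ahead of the question. The sketch below shows that idea in plain Python. `build_stuffed_prompt` is a hypothetical helper, and the prompt wording is an assumption for illustration, not the exact template the library sends.

```python
def build_stuffed_prompt(question, chunks):
    """Concatenate retrieved chunks and the question into one prompt string."""
    context = "\n\n".join(chunks)  # every chunk is "stuffed" into the same prompt
    return (
        "Use the following pieces of context to answer the question at the end.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Helpful Answer:"
    )


retrieved = [
    "RNNs process a sequence one item at a time.",
    "LSTMs use gates to control the flow of information.",
]
prompt = build_stuffed_prompt("What is an LSTM?", retrieved)
print(prompt)
```

Because every chunk lands in one prompt, the number of retrieved chunks and the chunk size together determine how much of the model's context window each query consumes.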

In [9]:
query = "how use nural network in nlp"
response = rag_pipeline.invoke(query)

print("Answer:")
print(response["result"])

Answer:
Neural networks are used in NLP by representing words as embeddings and learning features from data. These networks can then be applied to various tasks, including:

*   **Language modeling:** Predicting the probability of a sequence of words.
*   **Text classification:** Assigning categories to text, such as sentiment analysis.
*   **Sequence modeling:** Tasks like part-of-speech tagging.
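
Since the chain was built with `return_source_documents=True`, each response also carries the retrieved chunks under `response["source_documents"]`, which the cells here never inspect. A minimal sketch for listing where an answer came from follows. `FakeDoc`, `format_sources`, and the file name `book.pdf` are hypothetical stand-ins so the example runs without the pipeline; `source` and `page` are the metadata keys `PyPDFLoader` sets on each page.

```python
from dataclasses import dataclass, field


@dataclass
class FakeDoc:
    """Minimal stand-in for a LangChain Document (page_content + metadata)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def format_sources(source_documents):
    """Return one 'file (page index N)' line per distinct source, in retrieval order."""
    lines = []
    for doc in source_documents:
        source = doc.metadata.get("source", "unknown")
        page = doc.metadata.get("page", "?")
        lines.append(f"{source} (page index {page})")
    return list(dict.fromkeys(lines))  # drop duplicates, keep first-seen order


# Made-up response shaped like the dict returned by rag_pipeline.invoke(query).
fake_response = {
    "result": "...",
    "source_documents": [
        FakeDoc("chunk a", {"source": "book.pdf", "page": 12}),
        FakeDoc("chunk b", {"source": "book.pdf", "page": 12}),
        FakeDoc("chunk c", {"source": "book.pdf", "page": 40}),
    ],
}
sources = format_sources(fake_response["source_documents"])
print("\n".join(sources))
```

With the real pipeline, the same function could be called as `format_sources(response["source_documents"])` after any of the queries in this notebook.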


In [10]:
query = "who author this book"
response = rag_pipeline.invoke(query)

print("Answer:")
print(response["result"])

Answer:
The book is authored by Daniel Jurafsky & James H. Martin.


In [11]:
query = "this book explain the lstm and rnn"
response = rag_pipeline.invoke(query)

print("Answer:")
print(response["result"])

Answer:
Yes, the book explains both LSTMs (Long Short-Term Memory networks) and RNNs (Recurrent Neural Networks). It covers:

*   The basic structure of RNNs, how they process sequences one item at a time, and how the hidden layer's activation depends on both the current input and the previous hidden layer state.
*   The vanishing gradients problem that affects simple RNNs and how LSTMs address this by explicitly managing context over time, learning to forget irrelevant information and remember important information.
*   The architecture of LSTMs, including the context layer and the use of gates to control the flow of information.
*   Common applications of RNNs in NLP, such as language modeling, sequence labeling (e.g., part-of-speech tagging), sequence classification (e.g., sentiment analysis), and encoder-decoder architectures.


In [13]:
query = "who best naive bays or transformer"
response = rag_pipeline.invoke(query)

print("Answer:")
print(response["result"])

Answer:
The context discusses the performance of Naive Bayes and Logistic Regression, not Transformers. It states that Logistic Regression generally works better on larger documents or datasets and is a common default. However, Naive Bayes can work extremely well on very small datasets or short documents and is easy to implement and very fast to train.


In [14]:
query = "In this book what the chapter number for  vector semantic"
response = rag_pipeline.invoke(query)

print("Answer:")
print(response["result"])

Answer:
Based on the provided text, vector semantics is discussed in chapter 6.2.


In [16]:
query = "what the transformer use case"
response = rag_pipeline.invoke(query)

print("Answer:")
print(response["result"])

Answer:
Based on the provided text, transformers are used for building language models. These models can have a wide context window, allowing them to draw on a large amount of information.


In [17]:
query = "what the naive bays use case"
response = rag_pipeline.invoke(query)

print("Answer:")
print(response["result"])

Answer:
Based on the provided text, here are some use cases for the Naive Bayes classifier:

*   **Sentiment Analysis:** Classifying text as reflecting a positive or negative sentiment.
*   **Movie Review Classification:** Determining the sentiment (positive or negative) of a movie review based on the frequency of words.
*   **Text Classification:** Generally, Naive Bayes can be used for various text classification tasks.
*   **Small Datasets or Short Documents:** Naive Bayes can perform well on very small datasets or when classifying short documents, sometimes even better than logistic regression.


In [18]:
query = "what the mask in transformer"
response = rag_pipeline.invoke(query)

print("Answer:")
print(response["result"])

Answer:
In the context of transformers, especially in the context of masked language model (MLM) training, "masking" refers to the process of replacing some of the input tokens with a special "[MASK]" token. The model is then trained to predict the original masked tokens based on the surrounding context.
