In [1]:
import os
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import SentenceTransformersTokenTextSplitter
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS
from langchain_mistralai import ChatMistralAI
from langchain.chains import RetrievalQA
from utils import clean
from dotenv import load_dotenv

In [2]:
load_dotenv()

True

In [3]:
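docs_dir = "../../documents"  # renamed to avoid shadowing the built-in dir()
pdf_files = sorted(f for f in os.listdir(docs_dir) if f.lower().endswith(".pdf"))
documents_path = [os.path.join(docs_dir, f) for f in pdf_files]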

In [4]:
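documents = []
for file in documents_path:
    # PyPDFLoader returns one Document per PDF page.
    loader = PyPDFLoader(file)
    loaded_docs = loader.load()

    # Normalize each page's text before splitting and embedding.
    for doc in loaded_docs:
        doc.page_content = clean(doc.page_content)

    documents.extend(loaded_docs)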
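
The `clean` helper comes from a local `utils` module that is not shown in this notebook. A minimal sketch of what such a function might do, assuming it rejoins words hyphenated across line breaks, collapses whitespace, and lowercases the text. The name `clean_text` and every step in it are assumptions for illustration, not the actual implementation:

```python
import re

def clean_text(text: str) -> str:
    """Hypothetical stand-in for utils.clean; the real helper may differ."""
    # Rejoin words split by a hyphen at a line break, e.g. "lan-\nguage".
    text = re.sub(r"(\w)-\s*\n\s*(\w)", r"\1\2", text)
    # Collapse every run of whitespace (including newlines) to one space.
    text = re.sub(r"\s+", " ", text)
    return text.strip().lower()

print(clean_text("Speech and Lan-\nguage   Processing.\n"))
```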

In [7]:
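# Chunks of at most 300 tokens, with 30 tokens shared between neighboring chunks.
# Caveat: all-MiniLM-L6-v2 (the embedding model used in this notebook) truncates
# inputs beyond 256 word pieces, so the tail of a long chunk may not be embedded.
text_splitter = SentenceTransformersTokenTextSplitter(tokens_per_chunk=300, chunk_overlap=30)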
chunks = text_splitter.split_documents(documents)
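
To make the two splitter parameters concrete: `tokens_per_chunk` caps the length of each chunk, and `chunk_overlap` repeats the last tokens of one chunk at the start of the next. A rough sketch of that sliding window over a plain list of tokens. The function `window_chunks` is hypothetical and operates on placeholder tokens, not on the model tokenizer the real splitter uses:

```python
def window_chunks(tokens, tokens_per_chunk=300, chunk_overlap=30):
    """Hypothetical sketch: fixed-size windows that share `chunk_overlap` tokens."""
    if not 0 <= chunk_overlap < tokens_per_chunk:
        raise ValueError("chunk_overlap must be smaller than tokens_per_chunk")
    step = tokens_per_chunk - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + tokens_per_chunk])
        if start + tokens_per_chunk >= len(tokens):
            break  # this window already reached the end of the input
    return chunks

tokens = [f"t{i}" for i in range(700)]
chunks = window_chunks(tokens)
print([len(c) for c in chunks])
```

With 700 tokens this yields three chunks, and the last 30 tokens of each chunk reappear as the first 30 of the next one, so a sentence cut at a boundary is still seen whole in one of the two chunks.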

In [6]:
embedding_model = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
# embeddings = embedding_model.embed_documents([chunk.page_content for chunk in chunks])

In [8]:
vectorstore = FAISS.from_documents(documents=chunks, embedding=embedding_model)
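
Conceptually, the vector store answers a query by embedding it and returning the chunks whose vectors lie closest to the query vector. A brute-force sketch of that lookup, assuming Euclidean distance and toy 2-dimensional vectors. The function `top_k` is hypothetical; FAISS uses optimized index structures rather than a Python loop:

```python
import math

def top_k(query_vec, doc_vecs, k=4):
    """Return (index, distance) pairs for the k vectors closest to query_vec."""
    scored = []
    for idx, vec in enumerate(doc_vecs):
        dist = math.sqrt(sum((q - v) ** 2 for q, v in zip(query_vec, vec)))
        scored.append((idx, dist))
    # Smaller distance means more similar, so sort ascending.
    scored.sort(key=lambda pair: pair[1])
    return scored[:k]

doc_vecs = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
print(top_k([1.0, 0.0], doc_vecs, k=2))
```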

In [9]:
retriever = vectorstore.as_retriever()
llm = ChatMistralAI(model="mistral-medium-latest", temperature=0.8, max_retries=2)

rag_pipeline = RetrievalQA.from_chain_type(llm=llm, retriever=retriever, return_source_documents=True)
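
`RetrievalQA.from_chain_type` defaults to the "stuff" chain type, which pastes all retrieved chunks into a single prompt ahead of the question. A simplified sketch of that idea follows. The helper `build_stuff_prompt` and the template wording are illustrative assumptions, not LangChain's actual prompt:

```python
def build_stuff_prompt(question, chunks):
    """Hypothetical helper: join retrieved chunk texts into one prompt."""
    context = "\n\n".join(chunks)
    return (
        "Use the following context to answer the question. "
        "If the answer is not in the context, say you don't know.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )

prompt = build_stuff_prompt(
    "who wrote the book",
    ["speech and language processing.", "daniel jurafsky & james h. martin."],
)
print(prompt)
```

Because `return_source_documents=True` is set, the response dictionary also carries the retrieved chunks under `response["source_documents"]`, which is handy for checking what context the model actually received.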

In [10]:
query = "how use nural network in nlp"
response = rag_pipeline.invoke(query)

print("Answer:")
print(response["result"])

Answer:
Neural networks are widely used in Natural Language Processing (NLP) for various tasks. Here are some key ways they are applied, based on the provided context:

1. **Language Modeling and Word Prediction**: Neural networks can be used as language models to predict the next word in a sequence. They also generate word embeddings, such as Word2Vec or GloVe, which are dense representations of words that capture semantic meanings and can be used in other NLP tasks.

2. **Sentiment Analysis**: Neural networks can classify the sentiment of a text. Instead of using hand-built features, they can learn features from the data by representing words as embeddings. For example, a feedforward network can take word embeddings as input and use a hidden layer to represent non-linear interactions between features, improving the sentiment classifier's performance.

3. **Sequence Modeling Tasks**: Recurrent Neural Networks (RNNs), including Long Short-Term Memory (LSTM) networks, are particularly u

In [11]:
query = "who author this book"
response = rag_pipeline.invoke(query)

print("Answer:")
print(response["result"])

Answer:
The book is authored by Daniel Jurafsky and James H. Martin, as indicated by the line: "speech and language processing. daniel jurafsky & james h. martin."


In [12]:
query = "this book explain the lstm and rnn"
response = rag_pipeline.invoke(query)

print("Answer:")
print(response["result"])

Answer:
Yes, this book explains both Long Short-Term Memory (LSTM) networks and Recurrent Neural Networks (RNNs).

For RNNs, it covers the basic structure, how they process sequences one element at a time, and how the output of each neural unit at time \( t \) is based on both the current input and the hidden layer from the previous time step \( t-1 \). It also mentions that RNNs can be trained using backpropagation through time (BPTT) and discusses their limitations, such as the vanishing gradients problem.

For LSTMs, the book describes them as an extension to RNNs designed to address the vanishing gradients problem. LSTMs use specialized neural units with gates to control the flow of information into and out of the units. These gates help the network learn to forget information that is no longer needed and to remember information required for future decisions. The book also explains that LSTMs have become the standard unit for modern systems using recurrent networks.
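
The recurrence described in this answer, where the hidden state at time t depends on both the current input and the previous hidden state, can be sketched in plain Python. The weight names `W` and `U`, the tanh activation, and the toy values are assumptions chosen for illustration:

```python
import math

def rnn_forward(inputs, W, U, h0):
    """Simple (Elman-style) RNN: h_t = tanh(W x_t + U h_{t-1})."""
    h = list(h0)
    states = []
    for x in inputs:
        new_h = []
        for i in range(len(h)):
            total = sum(W[i][j] * x[j] for j in range(len(x)))
            total += sum(U[i][j] * h[j] for j in range(len(h)))
            new_h.append(math.tanh(total))
        h = new_h
        states.append(h)
    return states

W = [[1.0], [0.0]]              # input-to-hidden weights (2 hidden units, 1 input)
U = [[0.0, 0.0], [1.0, 0.0]]    # hidden-to-hidden weights
states = rnn_forward([[1.0], [0.0]], W, U, h0=[0.0, 0.0])
print(states)
```

In this toy setup the second hidden unit is nonzero at step 2 only because it reads the first unit's value from step 1, which is the "memory" the recurrent connection provides.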


In [13]:
query = "who best naive bays or transformer"
response = rag_pipeline.invoke(query)

print("Answer:")
print(response["result"])

Answer:
Based on the provided context, it is not explicitly stated which model is better between Naive Bayes and Transformers. However, the context does mention some advantages and disadvantages of Naive Bayes and briefly touches on Transformers.

Naive Bayes is noted for its simplicity, ease of implementation, and fast training times. It can work well on very small datasets or short documents and often makes correct classification decisions despite its conditional independence assumptions. However, it has limitations with larger documents or datasets and can be less accurate with correlated features.

Transformers, on the other hand, are described in terms of their architecture and modularity, with a focus on dimensionality and attention mechanisms. They are generally more complex and powerful, suitable for a wide range of tasks, especially those involving sequential data like natural language processing.

In summary, the choice between Naive Bayes and Transformers depends on the spec

In [14]:
query = "In this book what the chapter number for  vector semantic"
response = rag_pipeline.invoke(query)

print("Answer:")
print(response["result"])

Answer:
The chapter number for vector semantics in the provided context is **Chapter 6.2**. This section discusses vector semantics, including the representation of word meanings as points in a multi-dimensional space.


In [15]:
query = "what the transformer use case"
response = rag_pipeline.invoke(query)

print("Answer:")
print(response["result"])

Answer:
The context provided describes the architecture and components of a transformer, particularly in the context of language modeling. Based on this information, one of the primary use cases of transformers is for language modeling tasks. Here are some key points about transformer use cases from the context:

1. **Language Modeling**: Transformers are used to build language models that predict the next token in a sequence. This involves encoding input tokens, passing them through stacked transformer blocks, and using a language model head to generate logits and word probabilities.

2. **Self-Attention Mechanisms**: Transformers utilize multi-head attention, a form of self-attention, which allows the model to weigh the relevance of different tokens in the input sequence. This helps in capturing complex dependencies and relationships within the data.

3. **Wide Context Window**: Transformer-based language models can handle large context windows, sometimes up to 200,000 tokens or more

In [17]:
query = "what the naive bays use case"
response = rag_pipeline.invoke(query)

print("Answer:")
print(response["result"])

Answer:
The context provided highlights several use cases and characteristics of the Naive Bayes classifier, particularly the multinomial Naive Bayes. Here are the key use cases mentioned:

1. **Text Classification**: Naive Bayes is commonly used for text classification tasks, such as sentiment analysis, where it classifies text as reflecting positive or negative sentiment.

2. **Small Datasets**: Naive Bayes can work extremely well on very small datasets, sometimes even outperforming logistic regression in such scenarios.

3. **Short Documents**: It is effective for short documents, making it suitable for tasks involving brief text inputs.

4. **Speed and Simplicity**: Naive Bayes is easy to implement and very fast to train, as it does not require an optimization step. This makes it a good choice for situations where computational resources are limited.

5. **Binarized Features**: Naive Bayes with binarized features (where features are represented as binary values) often works better 

In [18]:
query = "what the mask in transformer"
response = rag_pipeline.invoke(query)

print("Answer:")
print(response["result"])

Answer:
In the context of transformers, particularly in the Masked Language Model (MLM) training objective, the term "mask" refers to the process of randomly selecting some tokens in the input sequence and replacing them with a special [MASK] token or a random token. This masking technique is used to create a training task where the model must predict the original identities of the masked tokens.

Here’s a breakdown of the masking process as described in the context:

1. **Token Selection**: A subset of the input tokens is randomly selected for masking. For example, in the provided context, the tokens "long," "thanks," and "the" were sampled from the input sequence.

2. **Masking**: Some of the selected tokens are replaced with a [MASK] token, while others might be replaced with a random token from the vocabulary. In the example, "long" and "thanks" were masked, and "the" was replaced with the unrelated word "apricot."

3. **Training Objective**: The model is then trained to predict th