### 1. Load document using LangChain

In [46]:
%%capture
%pip install "langchain-community==0.2.1"
%pip install "pypdf==4.2.0"
%pip install "langchain-text-splitters==0.2.2"
%pip install "langchain==0.2.1"
%pip install "chromadb==0.4.24"

In [7]:
from langchain_community.document_loaders import PyPDFLoader

pdf_url = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/WgM1DaUn2SYPcCg_It57tA/A-Comprehensive-Review-of-Low-Rank-Adaptation-in-Large-Language-Models-for-Efficient-Parameter-Tuning-1.pdf"
loader = PyPDFLoader(pdf_url)
pages = loader.load_and_split()

full_text = "".join(page.page_content for page in pages)
first_1000 = full_text[:1000]
print(f"Extracted Length: {len(first_1000)}")
print(first_1000)

Extracted Length: 1000
A Comprehensive Review of Low-Rank
Adaptation in Large Language Models for
Efficient Parameter Tuning
September 10, 2024
Abstract
Natural Language Processing (NLP) often involves pre-training large
models on extensive datasets and then adapting them for specific tasks
through fine-tuning. However, as these models grow larger, like GPT-3
with 175 billion parameters, fully fine-tuning them becomes computa-
tionally expensive. We propose a novel method called LoRA (Low-Rank
Adaptation) that significantly reduces the overhead by freezing the orig-
inal model weights and only training small rank decomposition matrices.
This leads to up to 10,000 times fewer trainable parameters and reduces
GPU memory usage by three times. LoRA not only maintains but some-
times surpasses fine-tuning performance on models like RoBERTa, De-
BERTa, GPT-2, and GPT-3. Unlike other methods, LoRA introduces
no extra latency during inference, making it more efficient for practical
application

### 2. Text splitting

In [16]:
# Note: this is a regular (non-raw) string, so "\b" in "\begin" is read as a backspace (\x08).
latex_text = """

    \documentclass{article}

    \begin{document}

    \maketitle

    \section{Introduction}

    Large language models (LLMs) are a type of machine learning model that can be trained on vast amounts of text data to generate human-like language. In recent years, LLMs have made significant advances in various natural language processing tasks, including language translation, text generation, and sentiment analysis.

    \subsection{History of LLMs}

The earliest LLMs were developed in the 1980s and 1990s, but they were limited by the amount of data that could be processed and the computational power available at the time. In the past decade, however, advances in hardware and software have made it possible to train LLMs on massive datasets, leading to significant improvements in performance.

\subsection{Applications of LLMs}

LLMs have many applications in the industry, including chatbots, content creation, and virtual assistants. They can also be used in academia for research in linguistics, psychology, and computational linguistics.

\end{document}

"""
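
Because `latex_text` is a regular string rather than a raw string, Python processes backslash escapes inside it. Of the LaTeX commands used here, only `\begin` starts with a recognized escape (`\b`, backspace), which is why `\x08egin{document}` shows up in the chunk output of the next cell. The other commands (`\documentclass`, `\section`, and so on) use unrecognized escapes, which Python keeps literally but flags with a warning; that is presumably why the next cell silences `SyntaxWarning`. A minimal sketch of the difference, using a shortened sample string rather than the full document:

```python
# Regular string: "\b" is a recognized escape and becomes a backspace (0x08).
regular = "\begin{document}"

# Raw string: the backslash is kept literally, so the LaTeX command survives.
raw = r"\begin{document}"

print(repr(regular))
print(repr(raw))
print(len(regular), len(raw))
```

Prefixing the triple-quoted literal with `r` would keep every backslash intact, although the chunk boundaries produced by the splitter would then differ from the output recorded in this notebook.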

In [18]:
from langchain_text_splitters import CharacterTextSplitter
import warnings
warnings.simplefilter("ignore", SyntaxWarning)

text_splitter = CharacterTextSplitter(
    separator="\\",
    chunk_size=200,
    chunk_overlap=20,
    length_function=len,
)
chunks = text_splitter.split_text(latex_text)
chunks

Created a chunk of size 352, which is longer than the specified 200
Created a chunk of size 378, which is longer than the specified 200
Created a chunk of size 248, which is longer than the specified 200


['\\documentclass{article}\n\n    \x08egin{document}\n\n    \\maketitle',
 'section{Introduction}\n\n    Large language models (LLMs) are a type of machine learning model that can be trained on vast amounts of text data to generate human-like language. In recent years, LLMs have made significant advances in various natural language processing tasks, including language translation, text generation, and sentiment analysis.',
 'subsection{History of LLMs}\n\nThe earliest LLMs were developed in the 1980s and 1990s, but they were limited by the amount of data that could be processed and the computational power available at the time. In the past decade, however, advances in hardware and software have made it possible to train LLMs on massive datasets, leading to significant improvements in performance.',
 'subsection{Applications of LLMs}\n\nLLMs have many applications in the industry, including chatbots, content creation, and virtual assistants. They can also be used in academia for researc
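
The "Created a chunk of size 352" warnings above appear because `CharacterTextSplitter` only cuts at the given separator. When the text between two separators is longer than `chunk_size`, it cannot be broken further and is emitted as an oversized chunk. The sketch below is a simplified illustration of that split-then-merge idea, not LangChain's actual implementation: `split_and_merge` is a hypothetical helper, it ignores `chunk_overlap`, and the sample text is made up.

```python
def split_and_merge(text: str, separator: str, chunk_size: int) -> list[str]:
    """Split on `separator`, then greedily pack pieces into chunks of at most `chunk_size`."""
    pieces = [p.strip() for p in text.split(separator) if p.strip()]
    chunks: list[str] = []
    current = ""
    for piece in pieces:
        # Try to append the next piece to the chunk being built.
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size:
            current = candidate
            continue
        # The piece does not fit: flush the current chunk and start a new one.
        if current:
            chunks.append(current)
        if len(piece) > chunk_size:
            # A single piece cannot be cut further, so it stays oversized.
            print(f"Piece of size {len(piece)} exceeds chunk_size={chunk_size}")
        current = piece
    if current:
        chunks.append(current)
    return chunks


sample = "aaaa|bbbb|" + "c" * 30 + "|dd"
chunks = split_and_merge(sample, separator="|", chunk_size=10)
print(chunks)
```

Here the two short pieces are merged into one chunk, while the 30-character piece is kept whole even though it exceeds the limit. A splitter that falls back to finer separators, such as `RecursiveCharacterTextSplitter`, is one way to avoid oversized chunks.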

### 3. Embed documents

In [26]:
from langchain_community.embeddings import HuggingFaceEmbeddings

model_name = "sentence-transformers/all-mpnet-base-v2"
huggingface_embedding = HuggingFaceEmbeddings(model_name=model_name)

query = "How are you?"
query_embedding = huggingface_embedding.embed_query(query)

print("Embedding length:", len(query_embedding))
print("Embedding values:", query_embedding[:5])

Embedding length: 768
Embedding values: [0.02710612490773201, 0.011331792920827866, -0.0019524091621860862, -0.03695135936141014, 0.017764905467629433]
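
Each query or document is mapped to a 768-dimensional vector, and texts are then compared by comparing their vectors. Cosine similarity is one common choice of measure. The following is a minimal pure-Python sketch; `cosine_similarity` is a hypothetical helper and the three-dimensional vectors are toy values standing in for real embeddings.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy vectors (assumed for illustration only).
v_same_1 = [1.0, 0.0, 1.0]
v_same_2 = [2.0, 0.0, 2.0]    # same direction, different magnitude
v_orthogonal = [0.0, 1.0, 0.0]

print(cosine_similarity(v_same_1, v_same_2))
print(cosine_similarity(v_same_1, v_orthogonal))
```

With real embeddings, the same function could be applied to two results of `huggingface_embedding.embed_query(...)`.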


### 4. Vector database

In [None]:
from langchain_community.document_loaders import TextLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import Chroma

# Load text file
loader = TextLoader("data/new-Policies.txt")
data = loader.load()

# Split text into chunks
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=300,
    chunk_overlap=50,
    length_function=len,
)
chunks = text_splitter.split_documents(data)

# Create Chroma vector store
# Optional: uncomment to clear a previously created collection before rebuilding
# vectordb.delete_collection()
vectordb = Chroma.from_documents(
    documents=chunks,
    embedding=huggingface_embedding,
)

# Similarity search
query = "Smoking policy"
docs = vectordb.similarity_search(query, k=5)
docs

[Document(metadata={'source': 'data/new-Policies.txt'}, page_content='This policy promotes the safe and responsible use of digital communication tools in line with our values and legal obligations. Employees must understand and comply with this policy. Regular reviews will ensure it remains relevant with changing technology and security standards.'),
 Document(metadata={'source': 'data/new-Policies.txt'}, page_content='This policy encourages the responsible use of mobile devices in line with legal and ethical standards. Employees are expected to understand and follow these guidelines. The policy is regularly reviewed to stay current with evolving technology and security best practices.'),
 Document(metadata={'source': 'data/new-Policies.txt'}, page_content='Consequences: Violations of this policy may lead to disciplinary action, including potential termination.'),
 Document(metadata={'source': 'data/new-Policies.txt'}, page_content='4. Mobile Phone Policy\n\nOur Mobile Phone Policy def
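
Conceptually, `similarity_search(query, k=5)` embeds the query, ranks the stored chunk vectors by their distance to the query vector, and returns the `k` closest chunks. The sketch below shows that idea as a brute-force search with Euclidean distance. It is an illustration under stated assumptions, not Chroma's implementation: `top_k`, the in-memory `store`, and the two-dimensional vectors are all hypothetical, and real vector stores typically rely on approximate nearest-neighbor indexes instead of scanning every vector.

```python
import math


def top_k(query_vec: list[float], store: list[tuple[str, list[float]]], k: int):
    """Return the k (text, distance) pairs closest to query_vec by Euclidean distance."""
    scored = [(text, math.dist(query_vec, vec)) for text, vec in store]
    return sorted(scored, key=lambda pair: pair[1])[:k]


# Toy "vector store": (chunk text, embedding) pairs with made-up 2-D embeddings.
store = [
    ("smoking policy", [0.9, 0.1]),
    ("email policy", [0.1, 0.9]),
    ("mobile phone policy", [0.2, 0.8]),
]

results = top_k([1.0, 0.0], store, k=2)
for text, distance in results:
    print(f"{distance:.3f}  {text}")
```

Note that a top-`k` search always returns `k` results when at least `k` chunks are stored, even if none of them is a close match, so it can help to inspect the distances as well as the returned documents.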

In [79]:
query = "Email policy"
retriever = vectordb.as_retriever(search_kwargs={"k": 2})
result = retriever.invoke(query)
result

[Document(metadata={'source': 'data/new-Policies.txt'}, page_content='Confidentiality: Use email for confidential information, trade secrets, and sensitive customer data only with encryption. Be careful when discussing company matters on public platforms or social media.'),
 Document(metadata={'source': 'data/new-Policies.txt'}, page_content='3. Internet and Email Policy\n\nOur Internet and Email Policy ensures the responsible and secure use of these tools within our organization, recognizing their importance in daily operations and the need for compliance with security, productivity, and legal standards.')]