In [None]:
!pip install openai langchain langchain-community langchain-openai pypdf tiktoken



In [None]:
from langchain_community.document_loaders import PyPDFLoader

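loader = PyPDFLoader("/content/2305.14314-QLORA-EfficientFinetuningofQuantizedLLMs.pdf")
# load_and_split() loads one Document per PDF page and then applies the default
# RecursiveCharacterTextSplitter, so `pages` may hold more chunks than the PDF has pages.
pages = loader.load_and_split()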

In [None]:
print(f"Number of pages: {len(pages)}")
print(pages[0])
print(pages[-1])  # last chunk, without hard-coding its index

Number of pages: 34
page_content='QL ORA: Efficient Finetuning of Quantized LLMs\nTim Dettmers∗Artidoro Pagnoni∗Ari Holtzman\nLuke Zettlemoyer\nUniversity of Washington\n{dettmers,artidoro,ahai,lsz}@cs.washington.edu\nAbstract\nWe present QLORA, an efficient finetuning approach that reduces memory us-\nage enough to finetune a 65B parameter model on a single 48GB GPU while\npreserving full 16-bit finetuning task performance. QLORAbackpropagates gradi-\nents through a frozen, 4-bit quantized pretrained language model into Low Rank\nAdapters (LoRA). Our best model family, which we name Guanaco , outperforms\nall previous openly released models on the Vicuna benchmark, reaching 99.3%\nof the performance level of ChatGPT while only requiring 24 hours of finetuning\non a single GPU. QLORAintroduces a number of innovations to save memory\nwithout sacrificing performance: (a) 4-bit NormalFloat (NF4), a new data type that\nis information theoretically optimal for normally distributed weights (
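
Before choosing a summarization strategy, it can help to check roughly whether all chunks fit into the model's context window at once. The sketch below is only an estimate: it assumes about 4 characters per token (a common rule of thumb for English text, not an exact count) and a 128,000-token context window, which is an assumption about the model used later. The helper names `estimate_tokens` and `fits_in_context` are hypothetical, and `sample_chunks` stands in for `[p.page_content for p in pages]`.

```python
import math


def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate; 4 characters per token is a heuristic, not an exact count."""
    return math.ceil(len(text) / chars_per_token)


def fits_in_context(texts, context_window: int, reserved_for_output: int = 1024) -> bool:
    """Return True if the estimated prompt size plus an output reserve fits the window."""
    total = sum(estimate_tokens(t) for t in texts)
    return total + reserved_for_output <= context_window


# Synthetic stand-in for [p.page_content for p in pages]: 34 chunks of 4,000 characters each
sample_chunks = ["a" * 4000] * 34

# 128_000 is an assumed context window; adjust it to the model actually in use
print(fits_in_context(sample_chunks, context_window=128_000))
# A much smaller window would force a chunked strategy such as map-reduce
print(fits_in_context(sample_chunks, context_window=8_192))
```

If the estimate is close to the limit, an exact count with the tokenizer of the target model is the safer check.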

In [None]:
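from openai import OpenAI
from google.colab import userdata

# Read the OpenAI API key from Colab's secret store (saved under the name 'openai-secret')
open_ai_key = userdata.get('openai-secret')
# Direct OpenAI client; the LangChain chains below receive the key through `api_key` instead
client = OpenAI(api_key=open_ai_key)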

In [None]:
from langchain.chains.combine_documents.stuff import StuffDocumentsChain
from langchain.chains.llm import LLMChain
from langchain.prompts import PromptTemplate
from langchain_openai import ChatOpenAI

# Define prompt
prompt_template = """You are an expert on Generative AI in Large Language Models.
Write a summary of the scientific article enclosed within three backticks, in a style required to document the state of the art of a thesis work:
```{text}```
SUMMARY:"""
prompt = PromptTemplate.from_template(prompt_template)

# Define LLM chain
llm = ChatOpenAI(temperature=0.1, model_name="gpt-4-turbo-preview", api_key=open_ai_key)
llm_chain = LLMChain(llm=llm, prompt=prompt)

# Define StuffDocumentsChain: every document is inserted into one prompt via the `text` variable
stuff_chain = StuffDocumentsChain(llm_chain=llm_chain, document_variable_name="text")
stuff_chain_summary = stuff_chain.run(pages)

print(stuff_chain_summary)

The article "QL ORA: Efficient Finetuning of Quantized LLMs" by Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer from the University of Washington introduces QLORA, a novel approach for efficient finetuning of large language models (LLMs) that significantly reduces memory usage. This method enables finetuning a 65B parameter model on a single 48GB GPU without compromising task performance, achieving near parity with the performance of ChatGPT with only 24 hours of finetuning on a single GPU. The key innovations include a 4-bit quantized model, Low Rank Adapters (LoRA), and several memory-saving techniques such as 4-bit NormalFloat (NF4) quantization, Double Quantization, and Paged Optimizers. The authors demonstrate QLORA's effectiveness across various datasets and model scales, showing that it can achieve state-of-the-art results even with smaller models. They also provide a detailed analysis of chatbot performance, suggesting that GPT-4 evaluations can serve as a re
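
The stuff approach used above and the map-reduce approach used in the next cell differ mainly in how many model calls they make and how large each prompt is. The sketch below illustrates this with plain Python and no API calls. `CountingStubLLM` is a hypothetical stand-in for a chat model, the prompts are shortened placeholders, and the sketch leaves out the recursive collapse step that `ReduceDocumentsChain` can perform when the partial summaries are still too long.

```python
from typing import Callable, List


def stuff_summarize(chunks: List[str], llm: Callable[[str], str]) -> str:
    """Stuff: join all chunks into one prompt and call the model once."""
    prompt = "Summarize:\n" + "\n\n".join(chunks)
    return llm(prompt)


def map_reduce_summarize(chunks: List[str], llm: Callable[[str], str]) -> str:
    """Map-reduce: one call per chunk, then one call over the partial results."""
    # Map step: each chunk is processed independently
    partials = [llm("Identify the main themes:\n" + chunk) for chunk in chunks]
    # Reduce step: the partial results are combined in a single final call
    return llm("Distill these summaries:\n" + "\n\n".join(partials))


class CountingStubLLM:
    """Hypothetical stand-in for a chat model: records prompt sizes, returns a short string."""

    def __init__(self) -> None:
        self.prompt_lengths: List[int] = []

    def __call__(self, prompt: str) -> str:
        self.prompt_lengths.append(len(prompt))
        return f"summary({len(prompt)} chars)"


chunks = ["page one text", "page two text", "page three text"]

stuff_llm = CountingStubLLM()
stuff_summarize(chunks, stuff_llm)

map_reduce_llm = CountingStubLLM()
map_reduce_summarize(chunks, map_reduce_llm)

# Number of model calls made by each strategy
print(len(stuff_llm.prompt_lengths), len(map_reduce_llm.prompt_lengths))
```

Stuffing makes a single call but needs the whole document to fit in one prompt. Map-reduce makes one call per chunk plus a final call, and each map call only sees its own chunk, so details that span several chunks can be lost.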

In [None]:
from langchain.chains.llm import LLMChain
from langchain.prompts import PromptTemplate
from langchain_openai import ChatOpenAI
from langchain.chains.combine_documents.stuff import StuffDocumentsChain
from langchain.chains import MapReduceDocumentsChain, ReduceDocumentsChain
from langchain.text_splitter import CharacterTextSplitter


# Define the LLM shared by the map and reduce steps
llm = ChatOpenAI(temperature=0.1, model_name="gpt-4-turbo-preview", api_key=open_ai_key)

# Map template prompt
map_prompt_template = """
You are an expert on Generative AI in Large Language Models.
The following is a set of pages from a research paper.

{documents}

Based on these pages, please identify the main themes.
Summaries:
"""

# Reduce template prompt
reduce_template = """
The following is a set of summaries:

{documents}

Take these and distill them into a single summary of the scientific article, written in the style used to document the state of the art in a thesis.
Final summary:
"""

# Define the map and reduce prompts
map_prompt = PromptTemplate.from_template(map_prompt_template)
reduce_prompt = PromptTemplate.from_template(reduce_template)

# Define the map and reduce chains
map_chain = LLMChain(llm=llm, prompt=map_prompt)
reduce_chain = LLMChain(llm=llm, prompt=reduce_prompt)

# Takes a list of documents, combines them into a single string, and passes this to an LLMChain
combine_documents_chain = StuffDocumentsChain(
    llm_chain=reduce_chain,
    document_variable_name="documents"
)

reduce_documents_chain = ReduceDocumentsChain(
    combine_documents_chain=combine_documents_chain,
    collapse_documents_chain=combine_documents_chain)

# Combining documents by mapping a chain over them, then combining results
map_reduce_chain = MapReduceDocumentsChain(
    llm_chain=map_chain,
    reduce_documents_chain=reduce_documents_chain,
    document_variable_name="documents",
    return_intermediate_steps=False,
)

text_splitter = CharacterTextSplitter.from_tiktoken_encoder(
    chunk_size=1000,
    chunk_overlap=0
)

split_docs = text_splitter.split_documents(pages)
# Store the result under its own name so the chain object is not overwritten
map_reduce_summary = map_reduce_chain.run(split_docs)

print(map_reduce_summary)

This thesis work meticulously examines the advancements and challenges in Generative AI and Large Language Models (LLMs), with a special emphasis on optimization techniques, efficiency improvements, and ethical considerations. It introduces the Quantized Low-Rank Adaptation (QLORA) method as a groundbreaking approach to finetuning LLMs, which significantly reduces computational and memory demands, thereby making sophisticated models more accessible for diverse applications. The thesis underscores the effectiveness of 4-bit NormalFloat (NF4) quantization and Double Quantization in minimizing the memory footprint of models without sacrificing performance, and highlights the role of Paged Optimizers in managing memory efficiently.

A comprehensive evaluation of QLORA-finetuned models across various benchmarks is presented, showcasing their ability to achieve superior or comparable results with reduced computational resources. The thesis also explores the impact of data quality over size, 