# LlamaIndex - Private Setup

Using GPT4ALL and HuggingFace embeddings, we will ingest [Chapter 3 of the recent IPCC Climate Report](https://www.ipcc.ch/report/ar6/wg2/chapter/chapter-3/), which covers oceans and coastal ecosystems. Using llama-index, this PDF is ingested and vectorized, so that questions can be answered about anything in this 172-page report.

Climate reports are long and tedious to read, so this demo will also give some insight into the latest findings from the IPCC!

Inspired by the recent popularity of [PrivateGPT](https://github.com/imartinez/privateGPT), this notebook walks you through a llama-index setup that uses entirely local models. We use GPT4ALL and HuggingFace embeddings, which should run decently well on CPU alone. If you have more resources, we also provide some links further down for setting up any LLM from HuggingFace and running it on a GPU.

## Dependencies Setup

### Setup Line Wrapping

In [26]:
from IPython.display import HTML, display

def set_css():
  display(HTML('''
  <style>
    pre {
        white-space: pre-wrap;
    }
  </style>
  '''))
get_ipython().events.register('pre_run_cell', set_css)

### Download gpt4all model

In [None]:
!wget https://gpt4all.io/models/ggml-gpt4all-j-v1.3-groovy.bin

### Download 2023 IPCC Climate Report - Chapter 3 on Oceans and Coastal Ecosystems (172 Pages)

In [None]:
!wget https://www.ipcc.ch/report/ar6/wg2/downloads/report/IPCC_AR6_WGII_Chapter03.pdf
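
Both downloads above are large, so re-running the notebook can be slow. As a hedged alternative to `!wget`, the sketch below uses only the Python standard library and skips the download when the file is already present. The helper name `download_if_missing` is hypothetical and not part of llama-index or GPT4ALL.

```python
import os
import urllib.request


def download_if_missing(url: str, dest: str) -> bool:
    """Download `url` to `dest` unless `dest` already exists.

    Returns True if a download happened, False if it was skipped.
    """
    if os.path.exists(dest):
        return False
    # Write to a temporary name first so an interrupted download
    # never leaves a partial file under the final name.
    tmp_path = dest + ".part"
    urllib.request.urlretrieve(url, tmp_path)
    os.replace(tmp_path, dest)
    return True


# Example call (commented out so the sketch does not hit the network):
# download_if_missing(
#     "https://www.ipcc.ch/report/ar6/wg2/downloads/report/IPCC_AR6_WGII_Chapter03.pdf",
#     "IPCC_AR6_WGII_Chapter03.pdf",
# )
```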

### Download extra packages

In [1]:
!pip install pymupdf pygpt4all sentence_transformers accelerate

### Install llama-index from source folder

In [2]:
!pip install -q -U ../



## Documents setup

Here, we use PyMuPDFReader to quickly load all 172 pages of the climate report PDF. The `metadata=True` option will automatically set some helpful information like page numbers and filename, to help us keep track of sources.

In [2]:
from langchain.embeddings.huggingface import HuggingFaceEmbeddings
from langchain.llms import GPT4All
from llama_index.node_parser.simple import SimpleNodeParser
from llama_index.langchain_helpers.text_splitter import TokenTextSplitter
from llama_index import (
    GPTVectorStoreIndex, 
    LangchainEmbedding, 
    LLMPredictor, 
    ServiceContext, 
    StorageContext, 
    download_loader,
    PromptHelper
)

In [6]:
PyMuPDFReader = download_loader("PyMuPDFReader")

In [7]:
documents = PyMuPDFReader().load(file_path='../content/Efficient Methods for NLP.pdf', metadata=True)
# documents = PyMuPDFReader().load(file_path='/content/What_We_Know_GenZ.pdf', metadata=True)
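# ensure document texts are str, not bytes objects
for doc in documents:
    # only decode when the loader returned bytes; str has no .decode()
    if isinstance(doc.text, bytes):
        doc.text = doc.text.decode()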

In [5]:
# print a document to test. Each document is a single page from the pdf, with appropriate metadata
documents[10]

Document(text='cally comes with a number of caveats. First, sam-\npling frequency is often restricted at various levels\nof the stack and may result in a lag in measure-\nment start. Consequently, shorter experiments may\nlog an energy use of zero, and there will almost\nalways be energy demand that is missed. Second,\ninefﬁciencies such as heat loss are not reported by\ncurrent APIs and hence, do not cover cooling and\nother system management activities. Third, not all\narchitectures and operating systems are supported.\nFor instance, power consumption under macOS\nis difﬁcult to manage, and direct ﬁgures for TPU\npower consumption are not available.\nCarbon emissions\nCarbon emissions are usually\ncomputed using the power consumption and the\ncarbon intensity of the marginal energy genera-\ntion used to run the program. Thus, low-energy\ndoes not mean low-carbon, and high-energy models\ncan—in the right region and with some care—be\nzero-carbon in terms of point energy consumption\ni

Note that due to model limitations (and considerations for speed), these settings are non-standard, but they should help ensure reasonable resource usage with GPT4ALL.

## GPU Llama Index

As stated earlier, if you have a modest GPU available (at least 15GB of VRAM), you can speed things up considerably.

This next section sets up a new predictor from HuggingFace, using the Writer/camel-5b-hf model (which is also conveniently licensed for commercial use).

(If you are running in colab, switch to a GPU instance first!)
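
To get a feel for the VRAM figure above, here is a back-of-envelope estimate of the memory needed just for the model weights. It assumes roughly 5 billion parameters (taken from the "5b" in the model name) and counts only the weights. Activations and the generation cache need additional memory, so treat the numbers as a lower bound rather than a measurement.

```python
# Approximate bytes used per parameter for two common dtypes.
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2}

# Assumption: nominal parameter count implied by the "5b" in the model name.
NUM_PARAMS = 5e9


def weight_memory_gb(num_params: float, dtype: str) -> float:
    """Approximate size of the weights in GB (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


for dtype in BYTES_PER_PARAM:
    print(f"{dtype}: ~{weight_memory_gb(NUM_PARAMS, dtype):.0f} GB of weights")
```

Under these assumptions, 16-bit weights take about 10 GB, which leaves some headroom on a 15GB card, while 32-bit weights would not fit.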

### LLM + Embed Model Setup

In [31]:
# setup prompts - specific to Camel
from llama_index.prompts.prompts import SimpleInputPrompt

# This will wrap the default prompts that are internal to llama-index
# taken from https://huggingface.co/Writer/camel-5b-hf
query_wrapper_prompt = SimpleInputPrompt(
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{query_str}\n\n### Response:"
)
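
To make the wrapping concrete, the sketch below expands an instruction-style template with plain `str.format`. This is only a stand-in for what llama-index does internally. In the real pipeline, the text substituted for `{query_str}` is the full internal prompt (retrieved context plus the question), not just the bare question used here. The helper name `wrap_query` is hypothetical.

```python
# Instruction-style template in the same shape as the wrapper prompt above.
QUERY_WRAPPER_TEMPLATE = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{query_str}\n\n### Response:"
)


def wrap_query(query_str: str) -> str:
    """Insert the query text into the instruction template."""
    return QUERY_WRAPPER_TEMPLATE.format(query_str=query_str)


wrapped = wrap_query("What are the stages of NLP process?")
print(wrapped)
```

The prompt ends with `### Response:`, so the model's continuation is read as the answer.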

In [46]:
import torch
from llama_index.llm_predictor import HuggingFaceLLMPredictor

# NOTE: the first run of this will download/cache the weights, ~20GB
hf_predictor = HuggingFaceLLMPredictor(
    max_input_size=2048, 
    max_new_tokens=256,
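    # note: with do_sample=False, Hugging Face generation is greedy and
    # the temperature value is typically ignored
    temperature=0.25,
    do_sample=False,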
    query_wrapper_prompt=query_wrapper_prompt,
    tokenizer_name="Writer/camel-5b-hf",
    model_name="Writer/camel-5b-hf",
    device_map="auto",
    tokenizer_kwargs={"max_length": 2048},
    model_kwargs={"torch_dtype": torch.bfloat16, "cache_dir": "../cache"},
    
)


In [47]:
embed_model = LangchainEmbedding(HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2"))
service_context = ServiceContext.from_defaults(chunk_size_limit=512, llm_predictor=hf_predictor, embed_model=embed_model)

### Construct index using GPU

Since we are using a GPU now, the embeddings are much faster to generate! Plus, we are now working with larger chunks, which lets each embedding capture more of the surrounding text and enables better retrieval.
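
As a rough illustration of what chunking does, the sketch below splits a token list into fixed-size, slightly overlapping windows. It is not llama-index's splitter: it treats whitespace-separated words as "tokens", whereas the real splitter counts tokenizer tokens, and the `overlap=20` value is a hypothetical choice for the example.

```python
def chunk_tokens(tokens, chunk_size=512, overlap=20):
    """Split `tokens` into windows of at most `chunk_size` items.

    Consecutive windows share `overlap` items so that text cut at a
    boundary still appears with some context in the next chunk.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once a window has reached the end of the input.
        if start + chunk_size >= len(tokens):
            break
    return chunks


# Toy input: 1000 whitespace-separated "tokens".
words = ("ocean " * 1000).split()
chunks = chunk_tokens(words, chunk_size=512, overlap=20)
print(len(chunks), [len(c) for c in chunks])
```

Larger chunks mean fewer, longer passages per page, so each embedding summarizes more text at the cost of a longer prompt at query time.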

In [49]:
index = GPTVectorStoreIndex.from_documents(documents, service_context=service_context)
index.storage_context.persist(persist_dir="../storage")

#### (Optional) Load if already saved

In [50]:
from llama_index import load_index_from_storage
storage_context = StorageContext.from_defaults(persist_dir="../storage")
index = load_index_from_storage(storage_context, service_context=service_context)
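
The build and load cells above can be combined into a "load if present, otherwise build" pattern. The sketch below shows only the control flow, using stand-in callables. `load_or_build`, `has_persisted_index`, `fake_build`, and `fake_load` are hypothetical names, and the check (a non-empty directory) is a simplifying assumption rather than how llama-index validates a storage folder.

```python
import os
import tempfile


def has_persisted_index(persist_dir: str) -> bool:
    """Return True if `persist_dir` exists and contains at least one entry."""
    if not os.path.isdir(persist_dir):
        return False
    with os.scandir(persist_dir) as entries:
        return any(entries)


def load_or_build(persist_dir, load_fn, build_fn):
    """Call `load_fn` if something is persisted, otherwise `build_fn`."""
    if has_persisted_index(persist_dir):
        return load_fn(persist_dir)
    return build_fn(persist_dir)


# Stand-ins for the real build and load steps (hypothetical).
def fake_build(persist_dir):
    os.makedirs(persist_dir, exist_ok=True)
    with open(os.path.join(persist_dir, "index.json"), "w") as f:
        f.write("{}")
    return "built"


def fake_load(persist_dir):
    return "loaded"


with tempfile.TemporaryDirectory() as tmp:
    storage_dir = os.path.join(tmp, "storage")
    first = load_or_build(storage_dir, fake_load, fake_build)
    second = load_or_build(storage_dir, fake_load, fake_build)

print(first, second)
```

In this notebook, `build_fn` would wrap `GPTVectorStoreIndex.from_documents(...)` followed by `persist(...)`, and `load_fn` would wrap `StorageContext.from_defaults(...)` with `load_index_from_storage(...)`.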

### Query using GPU

With a GPU, the response will begin streaming very quickly. Camel is an excellent model given its modest size of 5B parameters!

In [51]:
query_engine = index.as_query_engine(streaming=True, similarity_top_k=3, service_context=service_context)
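
The `similarity_top_k=3` argument asks the retriever for the three chunks whose embeddings are most similar to the query embedding. The sketch below illustrates the idea with cosine similarity on toy 2-D vectors in pure Python. It is not the vector store's actual implementation, and the vectors are made up for the example.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, chunk_vecs, k=3):
    """Return the k best (score, chunk_index) pairs, highest score first."""
    scored = [
        (cosine_similarity(query_vec, vec), idx)
        for idx, vec in enumerate(chunk_vecs)
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Toy embeddings (hypothetical): one query and five chunks.
query = [1.0, 0.0]
chunk_embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0], [2.0, 0.1]]

results = top_k(query, chunk_embeddings, k=3)
for score, idx in results:
    print(f"chunk {idx}: score={score:.3f}")
```

Only the selected chunks are placed into the prompt, so a larger `k` gives the model more context at the cost of a longer input.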

In [52]:
response_stream = query_engine.query("What are the stages of NLP process?")
response_stream.print_response_stream()

Token indices sequence length is longer than the specified maximum sequence length for this model (1305 > 512). Running this sequence through the model will result in indexing errors
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Compared to GPT4ALL, this response is more complete and accurate. You can inspect the source nodes below:

In [37]:
print(response_stream.source_nodes)

[NodeWithScore(node=Node(text='Tapo, Nishant Sub-\nramani, Artem Sokolov, Claytone Sikasote,\nMonang Setyawan, Supheakmungkol Sarin,\nSokhar Samb, Benoît Sagot, Clara Rivera,\nAnnette Rios, Isabel Papadimitriou, Salomey\nOsei, Pedro Ortiz Suarez, Iroro Orife, Kelechi\nOgueji, Andre Niyongabo Rubungo, Toan Q.\nNguyen, Mathias Müller, André Müller, Sham-\nsuddeen Hassan Muhammad, Nanda Muham-\nmad, Ayanda Mnyakeni, Jamshidbek Mirza-\nkhalov, Tapiwanashe Matangira, Colin Leong,\nNze Lawson, Sneha Kudugunta, Yacine Jer-\nnite, Mathias Jenny, Orhan Firat, Bonaven-\nture F. P. Dossou, Sakhile Dlamini, Nisansa\nde Silva, Sakine Çabuk Ballı, Stella Biderman,\nAlessia Battisti, Ahmed Baruwa, Ankur Bapna,\nPallavi Baljekar, Israel Abebe Azime, Ayodele\nAwokoya, Duygu Ataman, Orevaoghene Ahia,\nOghenefego Ahia, Sweta Agrawal, and Mofe-\ntoluwa Adeyemi. 2022. Quality at a Glance:\nAn Audit of Web-Crawled Multilingual Datasets.\nTransactions of the Association for Computa-\ntional Linguistics, 10:5