# LlamaIndex Setup

1. Download the climate report

    * `!wget https://www.ipcc.ch/report/ar6/wg2/downloads/report/IPCC_AR6_WGII_Chapter03.pdf`

2. Download the model

    * `!wget https://gpt4all.io/models/ggml-gpt4all-j-v1.3-groovy.bin`

3. Install the extra packages

    * `!pip install pymupdf pygpt4all sentence_transformers accelerate`

4. Install LlamaIndex from source

    * `!pip install -U git+https://github.com/jerryjliu/llama_index.git`
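Outside a notebook the `!wget` shell escape is not available, so the two downloads could be scripted instead. The following is a minimal sketch using only the standard library; `filename_from_url` and `download_if_missing` are hypothetical helper names introduced here for illustration, not part of LlamaIndex or GPT4All.

```python
import os
import urllib.request
from urllib.parse import urlparse


def filename_from_url(url):
    """Return the last path component of a URL, e.g. 'report.pdf'."""
    return os.path.basename(urlparse(url).path)


def download_if_missing(url, dest_dir="."):
    """Download url into dest_dir unless a file with that name already exists."""
    dest = os.path.join(dest_dir, filename_from_url(url))
    if not os.path.exists(dest):
        # The model file is large, so skipping existing files saves time on re-runs.
        urllib.request.urlretrieve(url, dest)
    return dest


# Example usage (performs network requests, so it is left commented out):
# download_if_missing("https://www.ipcc.ch/report/ar6/wg2/downloads/report/IPCC_AR6_WGII_Chapter03.pdf")
# download_if_missing("https://gpt4all.io/models/ggml-gpt4all-j-v1.3-groovy.bin")
```

Skipping files that already exist keeps repeated runs of the setup cheap, which matters most for the model download.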

### Load the documents

* Use `PyMuPDFReader` to quickly load all 172 pages of the climate report PDF. The `metadata=True` option automatically attaches helpful information, such as page numbers and the filename, to each document.

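```python
# Import paths match the 2023-era llama_index and langchain releases this notebook used.
from langchain.embeddings.huggingface import HuggingFaceEmbeddings
from langchain.llms import GPT4All
from llama_index import LangchainEmbedding, LLMPredictor, PromptHelper, ServiceContext, download_loader
from llama_index.langchain_helpers.text_splitter import TokenTextSplitter
from llama_index.node_parser import SimpleNodeParser

PyMuPDFReader = download_loader("PyMuPDFReader")
```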

```python
documents = PyMuPDFReader().load(file_path='./IPCC_AR6_WGII_Chapter03.pdf', metadata=True)

# Ensure document texts are str, not bytes objects
for doc in documents:
    if isinstance(doc.text, bytes):
        doc.text = doc.text.decode()
```

```python
# Print a document to test. Each document is a single page from the PDF, with appropriate metadata.
documents[10]
```

### Set up the model

* Point the `GPT4All` wrapper at the downloaded model file and wrap it in an `LLMPredictor`.

```python
local_llm_path = './ggml-gpt4all-j-v1.3-groovy.bin'
llm = GPT4All(model=local_llm_path, backend='gptj', streaming=True, n_ctx=512)
llm_predictor = LLMPredictor(llm=llm)
```

We also need an embedding model to turn text into vectors.

```python
embed_model = LangchainEmbedding(HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2"))
```

* With both models ready, bundle the components into a `ServiceContext`.

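```python
# max_input_size matches the LLM's n_ctx of 512; num_output reserves 256 tokens for the answer
prompt_helper = PromptHelper(max_input_size=512, num_output=256, max_chunk_overlap=-1000)
service_context = ServiceContext.from_defaults(
    llm_predictor=llm_predictor,
    embed_model=embed_model,
    prompt_helper=prompt_helper,
    # split each page into nodes of at most 300 tokens, with 20 tokens shared between neighbours
    node_parser=SimpleNodeParser(text_splitter=TokenTextSplitter(chunk_size=300, chunk_overlap=20))
)
```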
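To give a feel for what `chunk_size=300` and `chunk_overlap=20` mean, here is a minimal sketch of overlapping chunking. It is not the `TokenTextSplitter` implementation: the real splitter works on tokenizer tokens, while this illustration takes any list of items, and `split_with_overlap` is a hypothetical name.

```python
def split_with_overlap(tokens, chunk_size=300, chunk_overlap=20):
    """Split a token list into chunks where neighbours share chunk_overlap tokens."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reached the end of the input
    return chunks


# Stand-in for a tokenized page: 700 integer "tokens"
page_tokens = list(range(700))
chunks = split_with_overlap(page_tokens)
print([len(c) for c in chunks])
```

The overlap means text that straddles a chunk boundary still appears intact in at least one node, at the cost of embedding a few tokens twice.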

### Create the index

Creating the index will:

* break each document into nodes, and
* create an embedding vector for each node using our `embed_model`.

This may take several minutes when running on CPU (this is a large climate report)!
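To make the two steps concrete, here is a minimal, library-free sketch of a vector index. It is an illustration under stated assumptions, not how LlamaIndex is implemented: `toy_embed` is a hypothetical word-count stand-in for the real `embed_model`, and `build_index` and `query` are illustrative names rather than LlamaIndex APIs.

```python
import math
from collections import Counter


def toy_embed(text):
    """Hypothetical stand-in for embed_model: a sparse word-count vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def build_index(nodes):
    """Store one embedding vector next to each node's text."""
    return [(node, toy_embed(node)) for node in nodes]


def query(index, question, top_k=1):
    """Return the top_k node texts most similar to the question."""
    q = toy_embed(question)
    ranked = sorted(index, key=lambda pair: cosine(q, pair[1]), reverse=True)
    return [node for node, _ in ranked[:top_k]]


# Made-up node texts, used only to exercise the sketch
nodes = [
    "ocean warming harms coral reefs",
    "sea level rise threatens coastal cities",
    "fisheries decline as oceans acidify",
]
index = build_index(nodes)
print(query(index, "coral reefs"))
```

A real embedding model maps each node to a dense vector that captures meaning rather than exact word overlap, which is why computing one per node takes minutes on CPU for a report of this size.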

### Resources

* [llamaindex colab](https://colab.research.google.com/drive/16QMQePkONNlDpgiltOi7oRQgmB8dU5fl?usp=sharing#scrollTo=f323fb5a)