# **GPT4ALL+Index**

This is the final version of our working notebook. It has been tested and runs with GPU support on Google Colab. Some of the code was adapted from the following notebook: https://colab.research.google.com/drive/1NWZN15plz8rxrk-9OcxNwwIk1V1MfBsJ?usp=sharing. Instead of OpenAI's embeddings, we use HuggingFaceEmbeddings to build the vector index.


In [None]:
!pip -q install datasets loralib sentencepiece
!pip -q install git+https://github.com/huggingface/transformers
!pip -q install git+https://github.com/huggingface/peft.git
!pip -q install bitsandbytes
!pip install llama-index
!pip install langchain
!pip install sentence_transformers

In [None]:
import torch
from peft import PeftModel, PeftConfig
from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig
import textwrap

In [None]:
peft_model_id = "nomic-ai/gpt4all-lora"
config = PeftConfig.from_pretrained(peft_model_id)
model = AutoModelForCausalLM.from_pretrained(config.base_model_name_or_path, return_dict=True, load_in_8bit=True, device_map='auto')
tokenizer = AutoTokenizer.from_pretrained(config.base_model_name_or_path)
gpt4all_model = PeftModel.from_pretrained(model, peft_model_id)

In [None]:
from typing import Optional, List, Mapping, Any
from langchain.llms.base import LLM
from llama_index import (
    SimpleDirectoryReader,
    LangchainEmbedding,
    GPTListIndex,
    PromptHelper,
    LLMPredictor,
    ServiceContext,
    GPTSimpleVectorIndex,
)
from langchain.embeddings.huggingface import HuggingFaceEmbeddings

In [None]:
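max_input_size = 2048    # context window available to the model, in tokens
num_output = 300         # tokens reserved for the generated answer
max_chunk_overlap = 102  # tokens shared between consecutive chunks
chunk_size_limit = 600   # maximum size of a single chunk, in tokens
prompt_helper = PromptHelper(max_input_size, num_output, max_chunk_overlap, chunk_size_limit=chunk_size_limit)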
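
To illustrate what `chunk_size_limit` and `max_chunk_overlap` control, here is a minimal sketch of overlapping chunking. It splits on whitespace rather than on model tokens, and `chunk_words` is a hypothetical helper written for this illustration, not the splitter that llama_index uses internally.

```python
from typing import List


def chunk_words(text: str, chunk_size: int, overlap: int) -> List[str]:
    """Split text into chunks of at most chunk_size words.

    Consecutive chunks share `overlap` words, so content that straddles
    a chunk boundary still appears whole in one of the chunks.
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once the current chunk has reached the end of the text.
        if start + chunk_size >= len(words):
            break
    return chunks


sample = " ".join(f"w{i}" for i in range(10))
chunks = chunk_words(sample, chunk_size=4, overlap=1)
print(chunks)
# ['w0 w1 w2 w3', 'w3 w4 w5 w6', 'w6 w7 w8 w9']
```

The last word of each chunk is repeated as the first word of the next one, which is the effect the overlap setting is meant to have.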

In [None]:
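class GPT4ALL_LLM(LLM):
    """LangChain LLM wrapper around the GPT4All LoRA model loaded above."""

    def _call(self, prompt: str, stop: Optional[List[str]] = None) -> str:
        # Tokenize the prompt and move it to the GPU (8-bit loading needs CUDA).
        inputs = tokenizer(prompt, return_tensors="pt")
        input_ids = inputs["input_ids"].cuda()
        # Note: temperature and top_p only take effect when do_sample=True;
        # with the default greedy decoding they are ignored.
        generation_config = GenerationConfig(
            temperature=0.1,
            top_p=0.95,
            repetition_penalty=1.2,
        )

        generation_output = gpt4all_model.generate(
            input_ids=input_ids,
            generation_config=generation_config,
            max_new_tokens=num_output,
        )
        # Decode only the newly generated tokens. Slicing the decoded string by
        # len(prompt) can cut in the wrong place when decoding does not
        # reproduce the prompt character for character.
        new_tokens = generation_output[0][input_ids.shape[1]:]
        return tokenizer.decode(new_tokens, skip_special_tokens=True).strip()

    @property
    def _identifying_params(self) -> Mapping[str, Any]:
        return {"name_of_model": "GPT4ALL"}

    @property
    def _llm_type(self) -> str:
        return "custom"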
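
`_call` accepts a `stop` argument, but the class above does not use it. A minimal sketch of how stop sequences could be applied to the decoded text follows. `truncate_at_stop` is a hypothetical helper written for this illustration, and the example strings are made up.

```python
from typing import List, Optional


def truncate_at_stop(text: str, stop: Optional[List[str]] = None) -> str:
    """Cut text at the earliest occurrence of any stop sequence."""
    if not stop:
        return text
    cut = len(text)
    for sequence in stop:
        if not sequence:
            continue  # an empty stop string would match at position 0
        position = text.find(sequence)
        if position != -1:
            cut = min(cut, position)
    return text[:cut]


print(truncate_at_stop("Answer: 42\nQuestion: next?", stop=["\nQuestion:"]))
# Answer: 42
```

Inside `_call`, a helper like this would be applied to the decoded response just before it is returned.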

Please make sure that chunk_size_limit is set lower than max_input_size, so that a single retrieved chunk plus the question fits in one prompt and only one question at a time is sent to the LLM. Prompts that contain too many questions or too much content may confuse GPT4ALL.
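
The arithmetic behind this constraint can be sketched as follows. The helper name `context_budget` and the `prompt_overhead` figure of 150 tokens are assumptions for illustration; the real overhead depends on the prompt template and the length of the question.

```python
def context_budget(max_input_size: int, num_output: int, prompt_overhead: int) -> int:
    """Tokens left for retrieved text after reserving room for the
    generated answer and for the prompt template plus the question."""
    return max_input_size - num_output - prompt_overhead


max_input_size = 2048   # context window used in this notebook
num_output = 300        # tokens reserved for the answer
chunk_size_limit = 600  # maximum size of one retrieved chunk
prompt_overhead = 150   # ASSUMED size of template + question

budget = context_budget(max_input_size, num_output, prompt_overhead)
chunks_that_fit = budget // chunk_size_limit
print(budget, chunks_that_fit)
# 1598 2
```

Under these assumptions a single 600-token chunk fits comfortably in one prompt, which is the situation the note above asks for.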

In [None]:
llm_predictor = LLMPredictor(llm=GPT4ALL_LLM())
embed_model = LangchainEmbedding(HuggingFaceEmbeddings())
service_context = ServiceContext.from_defaults(llm_predictor=llm_predictor, embed_model=embed_model, prompt_helper=prompt_helper, chunk_size_limit=500)

Please create a folder called "data" under the Google Colab content directory (/content) and upload a document in TXT or CSV format to be indexed. This ensures that the index.json file is built from the content of your document.
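
If you prefer to set this up from code, a minimal sketch is shown below. The file name `sample.txt` and its one-line content are placeholders chosen for this illustration; replace them with your own document.

```python
from pathlib import Path

# Create the folder that SimpleDirectoryReader('./data') reads from.
data_dir = Path("./data")
data_dir.mkdir(parents=True, exist_ok=True)

# Write a small placeholder document (hypothetical content).
sample_path = data_dir / "sample.txt"
sample_path.write_text(
    "This is a placeholder document for testing the index.\n",
    encoding="utf-8",
)

# List the files that will be picked up for indexing.
print(sorted(p.name for p in data_dir.iterdir()))
```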

In [None]:
documents = SimpleDirectoryReader('./data').load_data()
index = GPTSimpleVectorIndex.from_documents(documents,service_context=service_context)
index.save_to_disk('index.json')

The following is a simple test to ensure the LLM is working.

In [None]:
llm = GPT4ALL_LLM()
print(llm._call("Hi! How is everything going?"))
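
`textwrap` is imported earlier in the notebook but never used. A minimal sketch of using it to keep long responses readable in the cell output is shown below; `wrap_response` is a hypothetical helper and the widths are arbitrary choices.

```python
import textwrap


def wrap_response(text: str, width: int = 80) -> str:
    """Wrap each line of text to `width` columns, preserving line breaks."""
    return "\n".join(
        textwrap.fill(line, width=width) for line in text.split("\n")
    )


# A made-up long string stands in for a model response here.
wrapped = wrap_response("word " * 40, width=30)
print(wrapped)
```

With the model loaded, the same helper could be applied to the string returned by the LLM before printing it.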

Please provide your question about the document as query_text, and set similarity_top_k=1 so that only one chunk is retrieved and no refinement pass is needed. I've noticed that GPT4ALL may get lost in the back-and-forth refinement of questions and answers.

In [None]:
query_text = "How do I determine my key project stakeholders?"
response = index.query(query_text, response_mode="compact", service_context=service_context, similarity_top_k=1)
print(response)
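
To make the role of `similarity_top_k` concrete, here is a minimal sketch of top-k retrieval by cosine similarity. The three-dimensional vectors are made-up stand-ins for real embeddings, and `cosine` and `top_k` are hypothetical helpers, not the llama_index implementation.

```python
import math
from typing import Dict, List, Tuple


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity of two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query: List[float], chunks: Dict[str, List[float]], k: int) -> List[Tuple[str, float]]:
    """Return the k chunks most similar to the query, best first."""
    scored = [(name, cosine(query, vec)) for name, vec in chunks.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]


# Made-up vectors for illustration only.
chunk_vectors = {
    "chunk_a": [1.0, 0.0, 0.0],
    "chunk_b": [0.0, 1.0, 0.0],
    "chunk_c": [0.7, 0.7, 0.0],
}
query_vector = [0.9, 0.1, 0.0]

best = top_k(query_vector, chunk_vectors, k=1)
print(best[0][0])
# chunk_a
```

With k=1 only the single closest chunk is passed to the model, so the answer is produced from one prompt rather than refined across several.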