## **Koala7B 8bit + HuggingFaceEmbedding + Llama Index**

Code released by: autratec
This GitHub version was developed from v2.

This proof-of-concept (POC) project lets researchers set up a large language model locally, or in the free Google Colab environment, and combine it with vector embeddings from HuggingFace to give accurate responses based on local content through indexing.

Colab resource usage: RAM 5.4 GB, GPU 8.9 GB.

Create a folder called "data" and put your raw data (CSV) in it. The file index.json will be created under the root path. Enjoy your test.
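
As a minimal sketch of that setup, the snippet below creates the "data" folder and writes a small CSV file into it. The file name `sample.csv`, the column names, and the rows are hypothetical placeholders, not part of the original notebook; substitute your own data.

```python
import csv
import os

# Hypothetical example rows; replace them with your own content.
rows = [
    {"issue": "Key resource left the project",
     "recommendation": "Reassign tasks and update the schedule"},
    {"issue": "Scope keeps growing",
     "recommendation": "Introduce a change-control process"},
]

# Create the folder that SimpleDirectoryReader('./data') reads from.
os.makedirs("data", exist_ok=True)

csv_path = os.path.join("data", "sample.csv")  # hypothetical file name
with open(csv_path, "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["issue", "recommendation"])
    writer.writeheader()
    writer.writerows(rows)

print(os.listdir("data"))
```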

The pipeline is created outside the class to reduce GPU usage.

References for the code used in this notebook:

https://colab.research.google.com/drive/10QPfcDt39uGciEDqdYBAbPBNZQDoC99O?usp=sharing

https://discord.com/channels/1059199217496772688/1090945925129707570



In [None]:
!pip -q install git+https://github.com/huggingface/transformers # need to install from github
!pip -q install datasets loralib sentencepiece 
!pip -q install bitsandbytes accelerate
!pip -q install langchain transformers sentence_transformers llama-index

In [None]:
from transformers import LlamaTokenizer, LlamaForCausalLM, GenerationConfig, pipeline
import torch
from llama_index import SimpleDirectoryReader, GPTSimpleVectorIndex, PromptHelper, LLMPredictor, ServiceContext, LangchainEmbedding
from langchain.embeddings.huggingface import HuggingFaceEmbeddings
from langchain.llms.base import LLM

In [None]:
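# Load the tokenizer and the 8-bit quantized model; device_map="auto" places the weights automatically.
tokenizer = LlamaTokenizer.from_pretrained("samwit/koala-7b")
model = LlamaForCausalLM.from_pretrained("samwit/koala-7b", load_in_8bit=True, device_map="auto")
# Note: this rebinds the imported `pipeline` function, so re-run the import cell before re-running this cell.
pipeline = pipeline("text-generation", model=model, tokenizer=tokenizer, max_length=512, temperature=0.7, top_p=0.95, repetition_penalty=1.15)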

In [None]:
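class customLLM(LLM):
    def _call(self, prompt, stop=None):
        # Run the shared module-level pipeline and return only the newly generated text.
        res = pipeline(prompt)
        prompt_length = len(prompt)
        return res[0]["generated_text"][prompt_length:]

    # The LangChain LLM base class defines the next two members as properties.
    @property
    def _identifying_params(self):
        return {"name_of_model": "koala-7b"}

    @property
    def _llm_type(self):
        return "custom"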

A simple test to confirm the LLM is working.

In [None]:
print(customLLM()._call("Tell me something about New York City."))
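
The `_call` method slices the prompt off the front of the generated text because a text-generation pipeline, by default, returns the prompt followed by the continuation. The sketch below shows that slicing with a hypothetical stand-in function (`fake_pipeline`, not part of any library), so it runs without loading the model.

```python
def fake_pipeline(prompt):
    """Hypothetical stand-in for the text-generation pipeline (no model needed)."""
    # Mimic the default output shape: the prompt followed by the continuation.
    continuation = " It is the most populous city in the United States."
    return [{"generated_text": prompt + continuation}]


def strip_prompt(prompt, generate=fake_pipeline):
    """Return only the newly generated text, mirroring the slicing in customLLM._call."""
    res = generate(prompt)
    return res[0]["generated_text"][len(prompt):]


completion = strip_prompt("Tell me something about New York City.")
print(completion)
```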

In [None]:
max_input_size = 512
num_output = 200
max_chunk_overlap = 20
chunk_size_limit = 200

llm_predictor = LLMPredictor(llm=customLLM())

In [None]:
embed_model = LangchainEmbedding(HuggingFaceEmbeddings())

In [None]:
prompt_helper = PromptHelper(max_input_size, num_output, max_chunk_overlap, chunk_size_limit=chunk_size_limit)
service_context = ServiceContext.from_defaults(llm_predictor=llm_predictor, embed_model=embed_model, prompt_helper=prompt_helper, chunk_size_limit=chunk_size_limit)
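
A rough, back-of-the-envelope view of the settings above: with a 512-token window and 200 tokens reserved for the answer, about 312 tokens remain for the prompt template, the question, and the retrieved text. The arithmetic below is a simplified illustration only; it does not reproduce how `PromptHelper` counts tokens or splits text internally.

```python
max_input_size = 512    # context window passed to PromptHelper
num_output = 200        # tokens reserved for the generated answer
chunk_size_limit = 200  # upper bound on the size of each indexed chunk

# Tokens left for the prompt template, the question, and the retrieved chunk(s).
available_for_prompt = max_input_size - num_output

# Full-size chunks that would fit if the template and question took no space.
chunks_that_fit = available_for_prompt // chunk_size_limit

print(available_for_prompt, chunks_that_fit)  # 312 1
```

Under these assumptions only one full-size chunk fits, which is consistent with querying with `similarity_top_k=1`.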

Create a folder called "data" and place your CSV file in it for indexing.

In [None]:
documents = SimpleDirectoryReader('./data').load_data()
index = GPTSimpleVectorIndex.from_documents(documents,service_context=service_context)
index.save_to_disk('index.json')

In [None]:
query_text = "My key resource left the project and it is causing a delay. What should I do?"
response = index.query(query_text,response_mode="compact",service_context=service_context, similarity_top_k=1)
print(response)
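
To illustrate what `similarity_top_k=1` means conceptually, the sketch below ranks a few toy vectors by cosine similarity to a query vector and keeps the single best match. The vectors and chunk labels are made up for illustration, and this is not the llama_index implementation; the real index compares embeddings produced by the HuggingFace embedding model.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, chunks, k=1):
    """Return the names of the k chunks most similar to the query vector."""
    scored = sorted(
        chunks.items(),
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [name for name, _ in scored[:k]]


# Hypothetical 3-dimensional "embeddings"; real embeddings have hundreds of dimensions.
toy_chunks = {
    "staffing": [0.9, 0.1, 0.0],
    "budget": [0.1, 0.9, 0.0],
    "schedule": [0.2, 0.2, 0.9],
}
query_vec = [0.8, 0.2, 0.1]

best = top_k(query_vec, toy_chunks, k=1)
print(best)
```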