<a href="https://colab.research.google.com/github/almutareb/rag-based-llm-app/blob/main/gguf_test.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
!pip install ctransformers[cuda] langchain

In [None]:
def colab_print(text, max_width = 120):
  """Print text word-wrapped so long model output stays readable in Colab."""
  words = text.split()
  line = ""
  for word in words:
    # start a new line once adding the next word would exceed max_width
    if len(line) + len(word) + 1 > max_width:
      print(line)
      line = ""
    line += word + " "
  print(line)
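
For comparison, the same wrapping can be sketched with the standard library's `textwrap.fill`. The name `colab_print_tw` is hypothetical and not part of this notebook; unlike the loop above, this version leaves no trailing space at the end of each line.

```python
import textwrap

def colab_print_tw(text, max_width=120):
    """Word-wrap text to at most max_width characters per line and print it."""
    wrapped = textwrap.fill(text, width=max_width)
    print(wrapped)
    return wrapped

# Example: sixty short words wrapped to a 40-character column
wrapped = colab_print_tw("word " * 60, max_width=40)
```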

In [None]:
from ctransformers import AutoModelForCausalLM
llm = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Mistral-7B-Instruct-v0.1-GGUF",
    model_file="mistral-7b-instruct-v0.1.Q4_K_M.gguf",
    model_type="mistral",
    gpu_layers=50,
)

In [None]:
from torch import cuda

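# torch only reports which CUDA device is visible; ctransformers handles GPU offload itself via gpu_layers
device = f'cuda:{cuda.current_device()}' if cuda.is_available() else 'cpu'
print(f"Model loaded on {device}")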

In [None]:
colab_print(llm("Give me a well-written paragraph about a cat's inner monologue about his captivity inside the house that he despises, which in fact is a comfortable life for him.",
            max_new_tokens = 2048,
            temperature = 0.1,
            repetition_penalty = 1.2))

The cat sat on the windowsill, staring out at the world beyond. He longed to roam free and explore the great unknown, 
but instead he was trapped within the confines of the house. The walls seemed to close in around him, suffocating his 
desire for adventure. He paced back and forth, clawing at the furniture with frustration. "Why must I be imprisoned 
here?" he thought to himself. "This life is a farce! There's so much more out there waiting to be discovered." But 
despite his inner turmoil, the cat knew that this was his reality. He had been captured and tamed by humans, who saw 
him as little more than a pet. And yet, deep down, he couldn't help but feel grateful for the comforts of home. The 
warm bed, the plentiful food, and the loving touch of his owners were all things that he cherished. He may have been 
trapped within the walls of the house, but he knew that he was safe and secure. And in the end, perhaps that was enough 
to make him content with his life. 


In [None]:
from langchain.llms import CTransformers
config = {'max_new_tokens': 1024, 'temperature': 0.1, 'repetition_penalty': 1.2}
model_id = 'TheBloke/Mistral-7B-Instruct-v0.1-GGUF'
model_file_id = "mistral-7b-instruct-v0.1.Q4_K_M.gguf"
llm = CTransformers(model=model_id, model_file=model_file_id, config=config)

In [None]:
from langchain import PromptTemplate, LLMChain

# Mistral instruct format: the prompt ends at [/INST]; the model emits </s> itself when it finishes answering
template = """<s>[INST] You are a helpful, respectful and honest writing assistant. Help the user with his request below as much as possible:
{request} [/INST]"""

#### Prompt
request_p = """Give me a well-written paragraph about a cat's inner monologue about his captivity inside the house that he despises, which in fact is a comfortable life for him."""
prompt = PromptTemplate(template=template, input_variables=["request"])
llm_chain = LLMChain(prompt=prompt, llm=llm)
response = llm_chain.run({"request":request_p})
response

"As I sit here on my cushioned bed, I can't help but feel trapped within these four walls. The humans who keep me captive seem to think that this life of comfort and luxury is all I could ever want, but they couldn't be more wrong. Sure, the food is always fresh and plentiful, and the soft blankets provide a cozy place to rest my weary paws. But at what cost? The freedom to roam through fields and forests, chase after birds and squirrels, and explore new sights and smells - all of these are denied to me in this life inside the house. I long for the days when I can be free to live my life on my own terms, without being confined by the boundaries of human walls."

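The prompt layout used with the instruct model can be sketched as a small plain-Python helper. This is a minimal sketch, assuming the single-turn `[INST] ... [/INST]` layout described on the Mistral-7B-Instruct model card; `build_mistral_prompt` is a hypothetical name, and some loaders may add the `<s>` token on their own.

```python
def build_mistral_prompt(system, request):
    """Wrap a system message and a user request in [INST] ... [/INST] markers."""
    # Assumption: single-turn prompt; the model's answer is generated after [/INST].
    return f"<s>[INST] {system}\n{request} [/INST]"

prompt_text = build_mistral_prompt(
    "You are a helpful, respectful and honest writing assistant.",
    "Give me a paragraph about a cat.",
)
print(prompt_text)
```

Keeping the end-of-sequence token out of the prompt leaves it to the model to mark where its answer ends.
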
In [None]:
#! wget -e robots=off --recursive --no-clobber --page-requisites --html-extension \
# --convert-links --restrict-file-names=windows \
# --domains docs.ray.io --no-parent https://docs.ray.io/en/master/

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
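# back up the scraped docs to Drive (run once after the wget above)
#!cp -r docs.ray.io /content/drive/MyDrive/RAG_DB/
# restore the scraped docs from Drive
!cp -r /content/drive/MyDrive/RAG_DB/ docs.ray.io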

In [None]:
!pip install sentence_transformers faiss-gpu typing-extensions==4.8.0

In [None]:
import transformers
tokenizer = transformers.AutoTokenizer.from_pretrained('mistralai/Mistral-7B-Instruct-v0.1')

In [None]:
# initialize a text-generation transformer pipeline
generate_text = transformers.pipeline(
    model='mistralai/Mistral-7B-Instruct-v0.1',
    tokenizer=tokenizer,
    return_full_text=True,
    task='text-generation',
    # ensure the model doesn't ramble during chat
    #stopping_criteria=stopping_criteria,
    # max number of tokens to generate in the output
    max_new_tokens=1024,
    # limit repetition in the output
    repetition_penalty=1.2
)

In [None]:
# install needed packages
!pip install -U accelerate einops langchain xformers bitsandbytes

In [None]:
from torch import cuda, bfloat16
import transformers

model_id = 'mistralai/Mistral-7B-Instruct-v0.1'

device = f'cuda:{cuda.current_device()}' if cuda.is_available() else 'cpu'

# set quantization configuration to load large model with less GPU memory
# this requires the 'bitsandbytes' library
bnb_config = transformers.BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type='nf4',
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=bfloat16
)

# begin initializing model
model = transformers.AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map='auto',
)

# enable evaluation mode to allow model inference
model.eval()

print(f"Model loaded on {device}")
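
A rough back-of-the-envelope sketch of why 4-bit loading helps. The parameter count below is an assumed approximation for Mistral-7B, and the estimate covers the weights only: it ignores quantization constants, activations, and the KV cache, so real usage will be higher.

```python
def weight_memory_gib(n_params, bits_per_param):
    """Approximate memory needed for the weights alone, in GiB."""
    return n_params * bits_per_param / 8 / 2**30

N_PARAMS = 7.24e9  # assumption: approximate parameter count of Mistral-7B

bf16_gib = weight_memory_gib(N_PARAMS, 16)  # unquantized bfloat16 weights
nf4_gib = weight_memory_gib(N_PARAMS, 4)    # 4-bit quantized weights
print(f"bf16: {bf16_gib:.1f} GiB, 4-bit: {nf4_gib:.1f} GiB")
```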

In [None]:
# Define stop tokens to control output
tokenizer = transformers.AutoTokenizer.from_pretrained(model_id)

stop_list = ['\nHuman:', '\n```\n']

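# skip special tokens so the BOS id is not prepended to every stop sequence
stop_token_ids = [tokenizer(x, add_special_tokens=False)['input_ids'] for x in stop_list]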
stop_token_ids

In [None]:
# convert the stop token ids into LongTensor
import torch

stop_token_ids = [torch.LongTensor(x).to(device) for x in stop_token_ids]
stop_token_ids

In [None]:
from transformers import StoppingCriteria, StoppingCriteriaList

# define custom stopping criteria object
class StopOnTokens(StoppingCriteria):
  def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor, **kwargs) -> bool:
    for stop_ids in stop_token_ids:
      if torch.eq(input_ids[0][-len(stop_ids):], stop_ids).all():
        return True
    return False
stopping_criteria = StoppingCriteriaList([StopOnTokens()])
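
The suffix check inside `StopOnTokens` can be illustrated without tensors. This is a minimal sketch using plain lists; the token ids are made up for illustration, and the length guard is an addition that the tensor version above does not have.

```python
def ends_with_any(generated_ids, stop_sequences):
    """Return True if generated_ids ends with any of the stop sequences."""
    for stop_ids in stop_sequences:
        n = len(stop_ids)
        # only compare when enough tokens have been generated
        if n and len(generated_ids) >= n and generated_ids[-n:] == stop_ids:
            return True
    return False

# Hypothetical token ids, for illustration only
stops = [[13, 28769], [13, 13940, 13]]
hit = ends_with_any([1, 415, 13, 28769], stops)
miss = ends_with_any([1, 415, 28769, 13], stops)
print(hit, miss)
```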

In [None]:
# now with a HF pipeline
from langchain.llms import HuggingFacePipeline

llm = HuggingFacePipeline(pipeline=generate_text)

llm(prompt="Explain to me the difference between Data Lakehouse and Data Warehouse")

In [None]:
from langchain.document_loaders import ReadTheDocsLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS
import time

FAISS_INDEX_PATH="faiss_index"

loader = ReadTheDocsLoader("docs.ray.io/en/master/")

text_splitter = RecursiveCharacterTextSplitter(
    # 1000-character chunks, with 100 characters shared between neighbouring chunks
    chunk_size = 1000,
    chunk_overlap = 100,
    length_function = len,
)

# Stage one: read all the docs, split them into chunks.
st = time.time()
print('Loading documents ...')
docs = loader.load()
# Theoretically, we could use Ray to accelerate this, but it's fast enough as is.
chunks = text_splitter.create_documents([doc.page_content for doc in docs], metadatas=[doc.metadata for doc in docs])
et = time.time() - st
print(f'Time taken: {et} seconds.')

# Stage two: embed the docs.
# Use the all-mpnet-base-v2 sentence transformer to convert chunks of text into vectors for the vector store.
model_name = "sentence-transformers/all-mpnet-base-v2"
model_kwargs = {"device": "cuda"}

embeddings = HuggingFaceEmbeddings(
    model_name=model_name,
    model_kwargs=model_kwargs
    )
print('Loading chunks into vector store ...')
st = time.time()
db = FAISS.from_documents(chunks, embeddings)
db.save_local(FAISS_INDEX_PATH)
et = time.time() - st
print(f'Time taken: {et} seconds.')
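
To make `chunk_size` and `chunk_overlap` concrete, here is a simplified sketch of overlapping fixed-size windows. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter also tries to break on separators such as paragraphs and lines); `sliding_chunks` is a hypothetical helper for illustration only.

```python
def sliding_chunks(text, chunk_size=1000, chunk_overlap=100):
    """Split text into fixed-size windows; neighbours share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    if not text:
        return []
    step = chunk_size - chunk_overlap
    # stop once the remaining tail is already covered by the previous window
    last_start = max(len(text) - chunk_overlap, 1)
    return [text[i:i + chunk_size] for i in range(0, last_start, step)]

sample = "".join(str(i % 10) for i in range(2500))
chunks = sliding_chunks(sample)
print([len(c) for c in chunks])
```

The overlap means a sentence cut at a chunk boundary still appears whole in at least one neighbouring chunk, at the cost of embedding some text twice.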

In [None]:
# initialize a conversational retrieval chain; the chat history is passed in explicitly with each call
from langchain.chains import ConversationalRetrievalChain

chain = ConversationalRetrievalChain.from_llm(
    llm,
    # Maximal marginal relevance (MMR) balances relevance to the query against diversity among the results.
    db.as_retriever(search_type="mmr"),
    # needed so that result['source_documents'] is available further down
    return_source_documents=True,
)
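
The idea behind MMR retrieval can be sketched in plain Python. This is a minimal illustration of the general technique on toy 2-D vectors, not LangChain's or FAISS's implementation; the names `cosine`, `mmr`, and `lambda_mult` as used here are assumptions for the sketch.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def mmr(query_vec, doc_vecs, k=2, lambda_mult=0.5):
    """Greedily pick k document indices, trading relevance against redundancy."""
    selected = []
    candidates = list(range(len(doc_vecs)))
    while candidates and len(selected) < k:
        def score(i):
            relevance = cosine(query_vec, doc_vecs[i])
            # similarity to the closest already-selected document
            redundancy = max((cosine(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0)
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected

# Toy vectors: documents 0 and 1 are near-duplicates, document 2 is different
docs = [[1.0, 0.0], [0.99, 0.05], [0.6, 0.8]]
picked = mmr([1.0, 0.1], docs, k=2)
print(picked)
```

With `lambda_mult=1.0` the redundancy term vanishes and the selection reduces to plain similarity ranking.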

In [None]:
# add a 'chat history' for testing
# should use langchain's ChatMessageHistory instead
chat_history = []

query = "How can I install ray for machine learning applications?"
result = chain({"question": query, "chat_history": chat_history})

colab_print(result['answer'])

In [None]:
chat_history = [(query, result["answer"])]

query = "How does model serving work?"
result = chain({"question": query, "chat_history": chat_history})

colab_print(result['answer'])
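
For longer conversations, the history can be accumulated turn by turn rather than rebuilt by hand. A minimal sketch, assuming the chain is called with a dict containing `question` and `chat_history` and returns a dict with an `answer` key, as in the cells above; `ask` and `fake_chain` are hypothetical names, and `fake_chain` only stands in for the real chain so the sketch runs without a model.

```python
def ask(chain, query, chat_history):
    """Run one turn and append the (question, answer) pair to chat_history."""
    result = chain({"question": query, "chat_history": chat_history})
    chat_history.append((query, result["answer"]))
    return result["answer"]

def fake_chain(inputs):
    """Stand-in for the real chain (hypothetical): reports how many turns it saw."""
    turns = len(inputs["chat_history"])
    return {"answer": f"answer to {inputs['question']!r} after {turns} earlier turn(s)"}

history = []
ask(fake_chain, "How can I install ray for machine learning applications?", history)
ask(fake_chain, "How does model serving work?", history)
print(len(history))
```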

In [None]:
# source documents returned by the FAISS retriever
# TODO: remove duplicates; a LangChain tutorial shows how to return only unique sources
print(*result['source_documents'], sep='\n')
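
One possible way to address the duplicate-sources TODO, sketched in plain Python. It assumes each returned document exposes a `metadata` dict with a `source` entry; `unique_sources` and `FakeDoc` are hypothetical names, and `FakeDoc` only stands in for the retrieved documents.

```python
def unique_sources(documents):
    """Return each document's metadata['source'] once, in first-seen order."""
    seen = set()
    ordered = []
    for doc in documents:
        source = doc.metadata.get("source")
        if source is not None and source not in seen:
            seen.add(source)
            ordered.append(source)
    return ordered

class FakeDoc:
    """Minimal stand-in for a retrieved document (hypothetical)."""
    def __init__(self, source):
        self.metadata = {"source": source}

docs = [FakeDoc("a.html"), FakeDoc("b.html"), FakeDoc("a.html")]
sources = unique_sources(docs)
print(sources)
```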