<a href="https://colab.research.google.com/github/duchuy1612/rag-query-engine/blob/main/Run_mistral_on_a_single_GPU.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Running Mistral-7B on a Single GPU with Google Colab
Welcome to this notebook, which shows how to load and run Mistral-7B in 4-bit precision using the quantization scheme introduced with QLoRA, which is designed to cut memory use with little loss in model quality.

In this notebook, we will learn together how to load a model in 4-bit precision, understand the available quantization variants, and run them for inference.

The same approach works for any model that supports device_map (i.e. loading the model with accelerate).

## Step 0 - Enable text wrapping so we don't have to scroll horizontally


In [1]:
from IPython.display import HTML, display, Markdown

def set_css():
  display(HTML('''
  <style>
    pre {
        white-space: pre-wrap;
    }
  </style>
  '''))

get_ipython().events.register('pre_run_cell', set_css)


## Step 1 - Install necessary packages
First, install the dependencies below to get started. As these features are available on the main branches only, we need to install the libraries below from source.

In [2]:
!pip install -q -U bitsandbytes
!pip install -q -U git+https://github.com/huggingface/transformers.git
!pip install -q -U git+https://github.com/huggingface/peft.git
!pip install -q -U git+https://github.com/huggingface/accelerate.git



In [3]:
!pip install datasets
!pip install langchain
!pip install neo4j
!pip install llama_index
!pip install -U huggingface_hub
!pip install -q google-generativeai
!pip install sentence-transformers
!pip install gradio
!pip install einops


In [4]:
%pip install llama-index-llms-groq
%pip install llama-index-retrievers-bm25
%pip install llama-index-vector-stores-neo4jvector
%pip install llama-index-graph-stores-neo4j
%pip install llama-index-embeddings-huggingface
%pip install llama-index-llms-huggingface
%pip install llama-index-embeddings-gemini
%pip install llama-index-llms-gemini
%pip install llama-index-multi-modal-llms-gemini
%pip install llama-parse
%pip install llama-index-llms-llama-cpp


In [6]:
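# Environment variables for a shell command need `!`, not a `%` line magic.
# llama-cpp-python was already pulled in as a dependency above, so force a rebuild with cuBLAS enabled.
!CMAKE_ARGS="-DLLAMA_CUBLAS=on" pip install --force-reinstall --no-cache-dir llama-cpp-python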


## Step 2 - Define quantization parameters through the BitsAndBytesConfig from transformers


* load_in_4bit=True: convert and load the model in 4-bit precision.
* bnb_4bit_use_double_quant=True: use nested quantization (the quantization constants are themselves quantized) for more memory-efficient inference and training.
* bnb_4bit_quant_type="nf4": the 4-bit integration comes with two quantization types, FP4 and NF4. NF4 stands for NormalFloat 4 and was introduced in the QLoRA paper. FP4 is the default.
* bnb_4bit_compute_dtype=torch.bfloat16: the dtype used during computation. The default is float32, but it can be set to bf16 for speedups.
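To make the memory savings concrete, here is a rough back-of-the-envelope sketch of weight storage at different precisions. The helper name is hypothetical, the parameter count (about 7.24 billion for Mistral-7B) is an approximation, and the estimate ignores quantization constants, activations, and the KV cache, so real GPU usage will be higher.

```python
def estimate_weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Approximate storage for the model weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


# Approximate parameter count, used here only for illustration.
MISTRAL_7B_PARAMS = 7.24e9

for label, bits in [("float32", 32), ("bfloat16", 16), ("4-bit (NF4/FP4)", 4)]:
    size_gb = estimate_weight_memory_gb(MISTRAL_7B_PARAMS, bits)
    print(f"{label:>16}: {size_gb:.2f} GB")
```

Under these assumptions the 4-bit weights need roughly a quarter of the bfloat16 footprint, which is what lets a 7B model fit on a single Colab GPU.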



In [18]:
import torch
import os
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

## Step 3 - Load the Model with quantization

In [19]:
from huggingface_hub import notebook_login
notebook_login()
# Paste your access token into the login widget; do not hard-code it in the notebook.


### Mistral 7B Code-Instruct (Load using Llama.cpp)

In [None]:
from llama_index.llms.llama_cpp import LlamaCPP
from llama_index.llms.llama_cpp.llama_utils import (
    messages_to_prompt,
    completion_to_prompt,
)

In [None]:
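# NOTE: this URL points at a Llama 2 13B Chat GGML file, not a Mistral model.
# Swap in the model file you actually want to load; recent llama.cpp builds expect GGUF files.
model_url = "https://huggingface.co/TheBloke/Llama-2-13B-chat-GGML/resolve/main/llama-2-13b-chat.ggmlv3.q4_0.bin"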

In [None]:
llm = LlamaCPP(
    # You can pass in the URL to a GGML model to download it automatically
    model_url=model_url,
    # optionally, you can set the path to a pre-downloaded model instead of model_url
    model_path=None,
    temperature=0.1,
    max_new_tokens=256,
    # llama2 has a context window of 4096 tokens, but we set it lower to allow for some wiggle room
    context_window=3900,
    # kwargs to pass to __call__()
    generate_kwargs={},
    # kwargs to pass to __init__()
    # set to at least 1 to use GPU
    model_kwargs={"n_gpu_layers": 1},
    # transform inputs into Llama2 format
    messages_to_prompt=messages_to_prompt,
    completion_to_prompt=completion_to_prompt,
    verbose=True,
)

### Neural Chat - a model by Intel which was fine-tuned directly from Mistral-7B

In [None]:
neural_model_id = "Intel/neural-chat-7b-v3-1"
neural_model = AutoModelForCausalLM.from_pretrained(neural_model_id, quantization_config=bnb_config, device_map="auto")
neural_tokenizer = AutoTokenizer.from_pretrained(neural_model_id)


### Mistral 7B Instruct v0.2

In [None]:
instruct_model_id = "mistralai/Mistral-7B-Instruct-v0.2"
instruct_model = AutoModelForCausalLM.from_pretrained(instruct_model_id, quantization_config=bnb_config, device_map="auto")
instruct_tokenizer = AutoTokenizer.from_pretrained(instruct_model_id)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



### Mistral 7B Instruct v0.2 fine-tuned for Knowledge Graphs Query Engine

In [None]:
model_id = "izayashiro/mistralai-HPC-Instruct-v0.2"
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config, device_map="cuda:0")
tokenizer = AutoTokenizer.from_pretrained(model_id)


## Analyzing the inference speed of the QLoRA fine-tuned model

In [None]:
import time

# Define the input text
input_text = "Be a helpful HPC assistant and focus only on explaining the impacts of HPC to business leaders of big corporations."

# Tokenize the input; keeping the attention mask avoids the generation warning
inputs = tokenizer(input_text, return_tensors="pt").to("cuda")
prompt_length = inputs["input_ids"].shape[1]

# Measure the inference time
start_time = time.time()
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_length=2048,
        do_sample=True,
        top_k=5,
        top_p=0.1,
        pad_token_id=tokenizer.eos_token_id,
    )
end_time = time.time()

# Calculate the inference time in seconds
inference_time = end_time - start_time

# Count only the newly generated tokens: outputs[0] already contains the prompt
num_tokens = outputs.shape[1] - prompt_length

# Calculate the inference speed in tokens per second
inference_speed = num_tokens / inference_time

# Convert the output IDs to text
generated_text = tokenizer.decode(outputs[0])

# Print the generated text and inference speed
print(f"Generated text: {generated_text}")
print(f"Inference speed: {inference_speed:.2f} tokens/sec")


Generated text: <s> Be a helpful HPC assistant and focus only on explaining the impacts of HPC to business leaders of big corporations.

HPC, or High Performance Computing, is a type of computing technology that is designed to handle complex computations and data processing tasks that are beyond the capabilities of traditional computing systems. It is often used in scientific research, engineering, and financial analysis to simulate and analyze large datasets and complex systems.

The impacts of HPC on business leaders of big corporations are significant. Here are a few ways in which HPC can benefit business leaders:

1. **Simulation and Analysis**: HPC can be used to simulate and analyze complex systems, such as chemical reactions, weather patterns, or financial markets. This can help business leaders to better understand the behavior of these systems and to identify potential problems or opportunities.

2. **Predictive Modeling**: HPC can be used to create predictive models of future
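The timing logic in the cell above can be factored into a small helper so the same measurement is reusable across the models loaded in this notebook. This is a minimal sketch: `measure_throughput` and `fake_generate` are hypothetical names, and the stand-in generator only mimics the shape of what `model.generate` returns (prompt tokens followed by new tokens).

```python
import time
from typing import Callable, List, Tuple


def measure_throughput(
    generate: Callable[[List[int]], List[int]], prompt_ids: List[int]
) -> Tuple[int, float]:
    """Return (new_tokens, tokens_per_second) for one generation call.

    `generate` must return the prompt followed by the generated tokens,
    so the prompt length is subtracted before computing the rate.
    """
    start = time.perf_counter()
    output_ids = generate(prompt_ids)
    elapsed = time.perf_counter() - start
    new_tokens = len(output_ids) - len(prompt_ids)
    rate = new_tokens / elapsed if elapsed > 0 else float("inf")
    return new_tokens, rate


def fake_generate(prompt_ids: List[int]) -> List[int]:
    """Stand-in for a real model: echoes the prompt and appends 5 tokens."""
    time.sleep(0.01)
    return prompt_ids + [0] * 5


new_tokens, rate = measure_throughput(fake_generate, [1, 2, 3])
print(f"{new_tokens} new tokens at {rate:.1f} tokens/sec")
```

Counting the prompt tokens as generated output would inflate the reported speed, which matters most for long prompts with short completions.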

## Off-loading Mixtral-8x7B Testing
Not recommended on Colab because it is very slow. If you have more RAM and VRAM available, it is worth trying, since offloading is a cost-effective way to run Mixtral.

In [None]:
# fix numpy in colab
import numpy
from IPython.display import clear_output

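# fix triton in colab
# `!export` only affects a throwaway subshell, so set the variables on the kernel process instead
import os
os.environ["LC_ALL"] = "en_US.UTF-8"
os.environ["LD_LIBRARY_PATH"] = "/usr/lib64-nvidia"
os.environ["LIBRARY_PATH"] = "/usr/local/cuda/lib64/stubs"
!ldconfig /usr/lib64-nvidia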

!git clone https://github.com/dvmazur/mixtral-offloading.git --quiet
!cd mixtral-offloading && pip install -q -r requirements.txt
clear_output()

In [None]:
import sys

sys.path.append("mixtral-offloading")
import torch
from torch.nn import functional as F
from hqq.core.quantize import BaseQuantizeConfig
from huggingface_hub import snapshot_download
from IPython.display import clear_output
from tqdm.auto import trange
from transformers import AutoConfig, AutoTokenizer
from transformers.utils import logging as hf_logging

from src.build_model import OffloadConfig, QuantConfig, build_model

hf_logging.disable_progress_bar()

hqq_aten package not installed. HQQBackend.ATEN backend will not work unless you install the hqq_aten lib in hqq/kernels.


In [None]:
model_name = "mistralai/Mixtral-8x7B-Instruct-v0.1"
quantized_model_name = "lavawolfiee/Mixtral-8x7B-Instruct-v0.1-offloading-demo"

config = AutoConfig.from_pretrained(quantized_model_name)
state_path = snapshot_download(quantized_model_name)

device = torch.device("cuda:0")

##### Change this to 5 if you have only 12 GB of GPU VRAM #####
# offload_per_layer = 4
offload_per_layer = 5
###############################################################

num_experts = config.num_local_experts

offload_config = OffloadConfig(
    main_size=config.num_hidden_layers * (num_experts - offload_per_layer),
    offload_size=config.num_hidden_layers * offload_per_layer,
    buffer_size=4,
    offload_per_layer=offload_per_layer,
)


attn_config = BaseQuantizeConfig(
    nbits=4,
    group_size=64,
    quant_zero=True,
    quant_scale=True,
)
attn_config["scale_quant_params"]["group_size"] = 256


ffn_config = BaseQuantizeConfig(
    nbits=2,
    group_size=16,
    quant_zero=True,
    quant_scale=True,
)
quant_config = QuantConfig(ffn_config=ffn_config, attn_config=attn_config)


mixtral_model = build_model(
    device=device,
    quant_config=quant_config,
    offload_config=offload_config,
    state_path=state_path,
)
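The `main_size` and `offload_size` arguments above are simple products of the layer count and the per-layer expert split. The sketch below works through that arithmetic; the helper name is hypothetical, and reading `main_size` as experts kept on the GPU and `offload_size` as experts kept in host memory is an interpretation of the field names rather than something stated in this notebook. The values 32 layers and 8 experts per layer are Mixtral-8x7B's configuration.

```python
from typing import Tuple


def expert_split(
    num_hidden_layers: int, num_experts: int, offload_per_layer: int
) -> Tuple[int, int]:
    """Return (main_size, offload_size) using the same formulas as the cell above."""
    if not 0 <= offload_per_layer <= num_experts:
        raise ValueError("offload_per_layer must be between 0 and num_experts")
    main_size = num_hidden_layers * (num_experts - offload_per_layer)
    offload_size = num_hidden_layers * offload_per_layer
    return main_size, offload_size


# Mixtral-8x7B: 32 decoder layers, 8 experts per layer.
for per_layer in (4, 5):
    main_size, offload_size = expert_split(32, 8, per_layer)
    print(f"offload_per_layer={per_layer}: main={main_size}, offloaded={offload_size}")
```

Raising `offload_per_layer` from 4 to 5 moves 32 more experts out of the main pool, trading GPU memory for more transfers during generation.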

If you print the model, you will see that most of the nn.Linear layers are replaced by bnb.nn.Linear4bit layers!

In [None]:
print(model)

MistralForCausalLM(
  (model): MistralModel(
    (embed_tokens): Embedding(32000, 4096)
    (layers): ModuleList(
      (0-31): 32 x MistralDecoderLayer(
        (self_attn): MistralAttention(
          (q_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (k_proj): Linear4bit(in_features=4096, out_features=1024, bias=False)
          (v_proj): Linear4bit(in_features=4096, out_features=1024, bias=False)
          (o_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (rotary_emb): MistralRotaryEmbedding()
        )
        (mlp): MistralMLP(
          (gate_proj): Linear4bit(in_features=4096, out_features=14336, bias=False)
          (up_proj): Linear4bit(in_features=4096, out_features=14336, bias=False)
          (down_proj): Linear4bit(in_features=14336, out_features=4096, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): MistralRMSNorm()
        (post_attention_layernorm): MistralRMSNorm()
      )
    )
   

In [None]:
print(instruct_model)

MistralForCausalLM(
  (model): MistralModel(
    (embed_tokens): Embedding(32000, 4096)
    (layers): ModuleList(
      (0-31): 32 x MistralDecoderLayer(
        (self_attn): MistralAttention(
          (q_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (k_proj): Linear4bit(in_features=4096, out_features=1024, bias=False)
          (v_proj): Linear4bit(in_features=4096, out_features=1024, bias=False)
          (o_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (rotary_emb): MistralRotaryEmbedding()
        )
        (mlp): MistralMLP(
          (gate_proj): Linear4bit(in_features=4096, out_features=14336, bias=False)
          (up_proj): Linear4bit(in_features=4096, out_features=14336, bias=False)
          (down_proj): Linear4bit(in_features=14336, out_features=4096, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): MistralRMSNorm()
        (post_attention_layernorm): MistralRMSNorm()
      )
    )
   

Let's make sure we loaded the whole model on GPU

In [None]:
model.hf_device_map

{'': device(type='cuda', index=0)}
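With `device_map="auto"`, some modules can end up on CPU or disk when GPU memory runs short, and the device map then contains one entry per module rather than the single `''` entry shown above. The following sketch lists any such entries; `offloaded_modules` is a hypothetical helper, and it assumes device values are GPU indices, device strings, or `torch.device` objects, all of which have a usable string form.

```python
from typing import Any, Dict


def offloaded_modules(device_map: Dict[str, Any]) -> Dict[str, Any]:
    """Return the device-map entries placed on CPU or disk instead of a GPU."""
    return {
        name: device
        for name, device in device_map.items()
        # str() handles ints (0), strings ("cuda:0", "cpu") and torch.device objects
        if str(device).split(":")[0] in ("cpu", "disk")
    }


# In the notebook this would be called as: offloaded_modules(model.hf_device_map)
example_map = {"model.embed_tokens": 0, "model.layers.0": "cuda:0", "lm_head": "cpu"}
print(offloaded_modules(example_map))
```

An empty result means every module is on a GPU, which is the case for the output shown above.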

## Connecting to Neo4j Database

In [20]:
## Credentials for Neo4j Database
# The password is read from Colab secrets instead of being hard-coded in the notebook.
# The secret name NEO4J_PASSWORD is an assumption; use the name you stored it under.
from google.colab import userdata

os.environ["NEO4J_URI"] = "neo4j+s://df3ca389.databases.neo4j.io"
os.environ["NEO4J_USERNAME"] = "neo4j"
os.environ["NEO4J_PASSWORD"] = userdata.get("NEO4J_PASSWORD")

In [21]:
from llama_index.vector_stores.neo4jvector import Neo4jVectorStore
from llama_index.graph_stores.neo4j import Neo4jGraphStore
from llama_index.core import StorageContext
from llama_index.core import (
    KnowledgeGraphIndex,
    ServiceContext,
    VectorStoreIndex,
    Document
)

edge_types, rel_prop_names = ["relationship"], [
    "relationship"
]  # defaults; can be omitted when creating from an empty KG
tags = ["entity"]

In [None]:
graph_store = Neo4jGraphStore(
    username=os.environ["NEO4J_USERNAME"],
    password=os.environ["NEO4J_PASSWORD"],
    url=os.environ["NEO4J_URI"],
    #database='rebel-llamaindex',
    edge_types=edge_types,
    rel_prop_names=rel_prop_names,
    tags=tags,
)

storage_context = StorageContext.from_defaults(graph_store=graph_store)
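Because the graph store reads its connection details from environment variables, a missing or empty variable only surfaces as a connection error later. A small fail-fast check can be run before constructing the store; `missing_env_vars` is a hypothetical helper, not part of LlamaIndex or the Neo4j driver.

```python
import os
from typing import Iterable, List, Mapping, Optional


def missing_env_vars(
    names: Iterable[str], environ: Optional[Mapping[str, str]] = None
) -> List[str]:
    """Return the names that are unset or empty in the given environment."""
    env = os.environ if environ is None else environ
    return [name for name in names if not env.get(name)]


required = ["NEO4J_URI", "NEO4J_USERNAME", "NEO4J_PASSWORD"]
missing = missing_env_vars(required)
if missing:
    print(f"Set these before connecting to Neo4j: {', '.join(missing)}")
```

Passing an explicit mapping as `environ` makes the check easy to exercise without touching the real process environment.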

## All the configs for the LLMs

In [22]:
from google.colab import userdata

In [41]:
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.core.callbacks import CallbackManager
from llama_index.llms.huggingface import HuggingFaceInferenceAPI, HuggingFaceLLM
from llama_index.embeddings.gemini import GeminiEmbedding
from llama_index.llms.gemini import Gemini
from llama_index.core import Settings
import pprint

callback_manager = CallbackManager([])

# Define embedding models
#embed_model = GooglePaLMEmbedding(model_name="models/embedding-gecko-001", api_key=palm_api_key)
bge_embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-large-en-v1.5", max_length=1024)
embed_model = GeminiEmbedding(model_name="models/embedding-001", api_key=userdata.get('GOOGLE_API_KEY'))

#instruct_local_mistral = HuggingFaceLLM(model=instruct_model, tokenizer=instruct_tokenizer, max_new_tokens=2048)
#local_mistral = HuggingFaceLLM(model=model, tokenizer=tokenizer, max_new_tokens=2048)
#local_neural_chat = HuggingFaceLLM(model=neural_model, tokenizer=neural_tokenizer, max_new_tokens=2048)
hf_remote_mistral = HuggingFaceInferenceAPI(model_name='mistralai/Mistral-7B-v0.1')
hf_remote_mistral_instruct = HuggingFaceInferenceAPI(model_name='mistralai/Mistral-7B-Instruct-v0.2', max_new_tokens=2048)
hf_remote_falcon = HuggingFaceInferenceAPI(model_name='tiiuae/falcon-7b-instruct')
hf_remote_zephyr = HuggingFaceInferenceAPI(model_name='HuggingFaceH4/zephyr-7b-beta')
hf_remote_mixtral = HuggingFaceInferenceAPI(model_name='mistralai/Mixtral-8x7B-Instruct-v0.1', context_window=4096)
hf_remote_mpt = HuggingFaceInferenceAPI(model_name='mosaicml/mpt-7b')
hf_remote_llama2_7b = HuggingFaceInferenceAPI(model_name='meta-llama/Llama-2-7b-hf')

# Define Gemini
gemini_llm = Gemini(api_key=userdata.get('GOOGLE_API_KEY'))
gemini_service_context = ServiceContext.from_defaults(llm=gemini_llm, chunk_size=256, embed_model=embed_model)

# define LLM
ft_context = ServiceContext.from_defaults(
    llm=gemini_llm,
    callback_manager=callback_manager,
    system_prompt=(
        "You are a helpful assistant helping to answer questions about"
        " High-performance computing and HPC Lab at HCMUT"
        " and write SLURM scripts for job submission in the Supernode-XP at HPC Lab."
    ),
    embed_model=embed_model,
    chunk_size=512,
)



## Groq API Test (currently free, will probably be paid in the near future)

In [42]:
from llama_index.llms.groq import Groq

mixtral_groq = Groq(model="mixtral-8x7b-32768", api_key=userdata.get('GROQ_API_KEY'))

In [43]:
groq_service_context=ServiceContext.from_defaults(llm=mixtral_groq, chunk_size=256, embed_model=bge_embed_model)



## Custom Query Engine 1 - Knowledge Graph RAG Query Engine + Vector Query Engine

LlamaIndex's retriever abstraction makes this approach straightforward to implement. We create a new class, CustomRetriever, that retrieves nodes from both a VectorIndexRetriever and a KnowledgeGraphRAGRetriever and combines them by union (OR mode) or intersection (AND mode).

In [None]:
# import QueryBundle
from llama_index.core import QueryBundle

# import NodeWithScore
from llama_index.core.schema import NodeWithScore

# Retrievers
from llama_index.core.retrievers import (
    BaseRetriever,
    VectorIndexRetriever,
    KnowledgeGraphRAGRetriever,
)

from typing import List


class CustomRetriever(BaseRetriever):
    """Custom retriever that performs both Vector search and Knowledge Graph search"""

    def __init__(
        self,
        vector_retriever: VectorIndexRetriever,
        kg_retriever: KnowledgeGraphRAGRetriever,
        mode: str = "OR",
    ) -> None:
        """Init params."""

        self._vector_retriever = vector_retriever
        self._kg_retriever = kg_retriever
        if mode not in ("AND", "OR"):
            raise ValueError("Invalid mode.")
        self._mode = mode
        super().__init__()

    def _retrieve(self, query_bundle: QueryBundle) -> List[NodeWithScore]:
        """Retrieve nodes given query."""

        vector_nodes = self._vector_retriever.retrieve(query_bundle)
        kg_nodes = self._kg_retriever.retrieve(query_bundle)

        vector_ids = {n.node.node_id for n in vector_nodes}
        kg_ids = {n.node.node_id for n in kg_nodes}

        combined_dict = {n.node.node_id: n for n in vector_nodes}
        combined_dict.update({n.node.node_id: n for n in kg_nodes})

        if self._mode == "AND":
            retrieve_ids = vector_ids.intersection(kg_ids)
        else:
            retrieve_ids = vector_ids.union(kg_ids)

        retrieve_nodes = [combined_dict[rid] for rid in retrieve_ids]
        return retrieve_nodes
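The AND/OR combination step in `_retrieve` can be tried out in isolation with plain dictionaries standing in for the retrieved nodes. This is a simplified sketch: `combine_hits` and the sample IDs are hypothetical, and results are sorted by ID here only to make the output deterministic (the class above returns them in set order).

```python
from typing import Dict, List


def combine_hits(
    vector_hits: Dict[str, str], kg_hits: Dict[str, str], mode: str = "OR"
) -> List[str]:
    """Combine two {node_id: node} mappings the same way CustomRetriever does."""
    if mode not in ("AND", "OR"):
        raise ValueError("Invalid mode.")
    # On a shared ID the knowledge-graph entry overwrites the vector entry.
    combined = dict(vector_hits)
    combined.update(kg_hits)
    if mode == "AND":
        ids = vector_hits.keys() & kg_hits.keys()
    else:
        ids = vector_hits.keys() | kg_hits.keys()
    return [combined[node_id] for node_id in sorted(ids)]


vector_hits = {"a": "vec-a", "b": "vec-b"}
kg_hits = {"b": "kg-b", "c": "kg-c"}
print(combine_hits(vector_hits, kg_hits, "OR"))
print(combine_hits(vector_hits, kg_hits, "AND"))
```

One consequence worth noting is that when both retrievers return the same node ID, the score attached to the knowledge-graph copy is the one that survives.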

Next, we will create instances of the Vector and KG retrievers, which will be used in the instantiation of the Custom Retriever.

In [None]:
from llama_index.core.query_engine import RetrieverQueryEngine
from llama_index.core.retrievers import KnowledgeGraphRAGRetriever

graph_rag_retriever = KnowledgeGraphRAGRetriever(
    storage_context=storage_context,
    service_context=ft_context,
    callback_manager=callback_manager,
    verbose=True,
    max_knowledge_sequence=15000,
    retriever='embedding',
    with_nl2graphquery=True
)

In [None]:
from llama_index.core import SimpleDirectoryReader

hpc_lab_documents = SimpleDirectoryReader(input_dir="data_rag/").load_data()
documents = hpc_lab_documents
print(f"Loaded {len(documents)} docs")

Loaded 115 docs


In [None]:
neo4j_vector = Neo4jVectorStore(
    username=os.environ["NEO4J_USERNAME"],
    password=os.environ["NEO4J_PASSWORD"],
    url=os.environ["NEO4J_URI"],
    index_name="vector",
    text_node_property="text",
    #database='test',
    embedding_dimension=768
)

In [None]:
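vector_storage_context = StorageContext.from_defaults(vector_store=neo4j_vector)
# `service_context` was never defined above; gemini_service_context is assumed here
# because its Gemini embeddings match embedding_dimension=768.
vector_index = VectorStoreIndex.from_documents(
    documents,
    storage_context=vector_storage_context,
    service_context=gemini_service_context,
    show_progress=True,
)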


In [None]:
from llama_index.core import set_global_tokenizer
from transformers import AutoTokenizer

set_global_tokenizer(
    AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2").encode
)

In [None]:
from llama_index.core import get_response_synthesizer

vector_retriever = VectorIndexRetriever(index=vector_index)

custom_retriever = CustomRetriever(vector_retriever, graph_rag_retriever)

# create response synthesizer
response_synthesizer = get_response_synthesizer(
    service_context=gemini_service_context,
    response_mode="tree_summarize",
)

In [None]:
# The response synthesizer already carries its own service context,
# so the undefined `service_context` argument is dropped.
custom_query_engine_1 = RetrieverQueryEngine.from_args(
    retriever=custom_retriever,
    response_synthesizer=response_synthesizer,
)

In [None]:
response = custom_query_engine_1.query("What are the rules and policies here at HPC Lab?")
display(Markdown(f"<b>{response}</b>"))

Graph Store Query:
MATCH (n:Entity)
WHERE n._node_type = "Rule" OR n._node_type = "Policy"
RETURN n.name, n.description
Graph Store Response:
[]

<b>- Keep the volume low if using headphones. Exemptions can be made by an instructor or systems administrator.
- Turn off or set cell phones to silent while in the lab. If you receive a call, exit the lab to answer it.
- Keep the lab clean and organized. Dispose of trash properly.
- Eating and drinking are prohibited in the server room, especially near electrical equipment.
- Do not work on any computing system without permission or specific instructions.
- Do not illegally copy materials.
- Do not harm or disconnect computer lab equipment.
- Do not plug cables intended for computer workstations into your personal laptop.
- Do not intentionally disrupt service of computers, cables, and peripheral devices.
- Do not remove equipment or take items from the lab.
- Do not attempt to dismantle equipment in the lab.
- Do not install or copy software on any computer in the lab. Software license agreements and copyrights laws will be strictly enforced in the HPC Lab.
- Do not attempt to access private network or system resources.
- Members are responsible for maintaining lab security. The last person to leave the lab should reorganize furniture, check for leftover items, turn off unnecessary power, and lock the doors.
- Members are expected to instruct students to work with the computing system carefully before giving permission to work independently.
- Students are not allowed to work in the HPC Lab unless a lab member is present. Special arrangements can be made for study groups or students under the management of HPC Lab.
- The Technical Member is available to instruct students in the proper use of equipment, and time should be made available during the lab for such instruction. Students should not use equipment without adequate instruction, and students are responsible for the care of equipment and computing resources.</b>

### Knowledge Graph RAG Query Engine
Retrieval is currently very slow because the knowledge graph is too sparse.

In [None]:
from llama_index.core.query_engine import RetrieverQueryEngine
from llama_index.core.retrievers import KnowledgeGraphRAGRetriever

graph_rag_retriever = KnowledgeGraphRAGRetriever(
    storage_context=storage_context,
    service_context=ft_context,
    callback_manager=callback_manager,
    verbose=True,
    max_knowledge_sequence=10000,
    retriever='embedding',
)

# `service_context` was never defined above; gemini_service_context is assumed here.
query_engine = RetrieverQueryEngine.from_args(
    graph_rag_retriever, service_context=gemini_service_context
)

In [None]:
response = query_engine.query(
    "What are the rules and policies here at HPC Lab?",
)
display(Markdown(f"<b>{response}</b>"))

<b>- Keep volume low.
- No accessing private network or system resources.
- No installing or copying software.
- No dismantling equipment in lab.
- No removing equipment or items from lab.
- No disrupting service of computers.
- No plugging cables into personal laptop.
- No harming or disconnecting equipment.
- No illegal copying of materials.
- No working on computing system without permission.
- No eating or drinking in server room.
- Dispose trash.
- Keep lab clean.
- Exit lab before answering cell phone.
- Turn off cell phones.
- Maintain lab security.
- Students responsible for care of equipment and computing resources.
- Students should not use equipment without instruction.
- Time should be made available for instruction.
- Technical Member available to instruct students.
- Special arrangements for students under HPC Lab management.
- Special arrangements for study groups.
- Students not allowed to work without lab member present.
- Give permission to work independently.
- Instruct students to work carefully.
- Lock doors.
- Turn off unnecessary power.
- Check for leftover items.
- Reorganize furniture.</b>

Enable NL2GraphQuery (with_nl2graphquery=True) on the retriever. Not recommended, since it requires a lot of VRAM.

In [None]:
graph_rag_retriever_with_nl2graphquery = KnowledgeGraphRAGRetriever(
    storage_context=storage_context,
    service_context=ft_context,
    verbose=True,
    with_nl2graphquery=True,
    retriever='embedding'
)

query_engine_with_nl2graphquery = RetrieverQueryEngine.from_args(
    graph_rag_retriever_with_nl2graphquery, service_context=service_context
)

In [None]:
response = query_engine_with_nl2graphquery.query(
    "What are the rules and policies here at HPC Lab?",
)
display(Markdown(f"<b>{response}</b>"))

Graph Store Query:
MATCH (n:Entity)
WHERE n._node_type = "Rule" OR n._node_type = "Policy"
RETURN n.name, n.description
Graph Store Response:
[]

<b>- Keep volume low.
- Exemptions made by instructor or systems administrator.
- No accessing private network or system resources.
- No installing or copying software.
- No dismantling equipment in lab.
- No removing equipment or items from lab.
- No disrupting service of computers.
- No plugging cables into personal laptop.
- No harming or disconnecting equipment.
- No illegal copying of materials.
- No working on computing system without permission.
- No eating or drinking in server room.
- Dispose trash.
- Keep lab clean.
- Exit lab before answering cell phone.
- Turn off cell phones.
- Maintain lab security.
- Students responsible for care of equipment and computing resources.
- Students should not use equipment without instruction.
- Time should be made available for instruction.
- Technical Member available to instruct students.
- Special arrangements for students under HPC Lab management.
- Special arrangements for study groups.
- Students not allowed to work without lab member present.
- Give permission to work independently.
- Instruct students to work carefully.
- Lock doors.
- Turn off unnecessary power.
- Check for leftover items.
- Reorganize furniture.
- Approval beforehand.
- Approval by members with authority.
- Kept clean and organized.</b>

## Custom Query Engine 2 - Vector Retriever + BM25 Retriever & Re-ranking

### Parse the documents using LlamaParse

In [25]:
import nest_asyncio

nest_asyncio.apply()

from llama_parse import LlamaParse
from llama_index.core import SimpleDirectoryReader
from llama_index.core.node_parser import SentenceSplitter

parser = LlamaParse(
    api_key=userdata.get('LLAMA_CLOUD_API_KEY'),
    result_type="markdown",  # "markdown" and "text" are available
    verbose=True,
)

file_extractor = {".pdf": parser}
documents = SimpleDirectoryReader(
    "./data", file_extractor=file_extractor
).load_data()

Started parsing the file under job_id 2e121d67-6fa9-41fc-80e1-a3bf4c8c6c8d


### Sentence Splitter

In [26]:
# initialize node parser
splitter = SentenceSplitter(chunk_size=256)

# limit to the first 1,000,000 characters of the first parsed document
nodes = splitter.get_nodes_from_documents(
    [Document(text=documents[0].get_content()[:1000000])],
    service_context=gemini_service_context,
    verbose=True,
)
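
As a rough picture of what sentence-aware chunking does, the sketch below packs whole sentences into chunks under a size budget. It is a simplification under stated assumptions: it counts whitespace-separated words rather than tokenizer tokens, uses a naive regex for sentence boundaries, and the helper name `split_into_chunks` is hypothetical. `SentenceSplitter` itself is more elaborate.

```python
import re


def split_into_chunks(text, chunk_size=256):
    """Greedily pack whole sentences into chunks of at most chunk_size words."""
    # Naive sentence boundary: whitespace that follows '.', '!' or '?'
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current, current_len = [], [], 0
    for sentence in sentences:
        n_words = len(sentence.split())
        # Start a new chunk if adding this sentence would exceed the budget
        if current and current_len + n_words > chunk_size:
            chunks.append(" ".join(current))
            current, current_len = [], 0
        # A single sentence longer than chunk_size still gets its own chunk
        current.append(sentence)
        current_len += n_words
    if current:
        chunks.append(" ".join(current))
    return chunks


sample = "Keep the lab clean. Turn off cell phones. Lock the doors when leaving."
chunks = split_into_chunks(sample, chunk_size=8)
print(chunks)
```

Smaller chunks give the retrievers finer-grained units to match against, at the cost of more nodes to embed and index.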

In [27]:
# initialize storage context (by default it's in-memory)
storage_context = StorageContext.from_defaults()
storage_context.docstore.add_documents(nodes)

In [28]:
index = VectorStoreIndex(nodes, storage_context=storage_context, service_context=service_context, verbose=True)

### Re-ranker Setup
We'll use `BAAI/bge-reranker-large`, one of the best-performing open-source re-ranking models at the time of writing, which is available on Hugging Face.

In [29]:
from llama_index.core.postprocessor import SentenceTransformerRerank

reranker = SentenceTransformerRerank(top_n=4, model="BAAI/bge-reranker-large")

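
Conceptually, a re-ranker re-scores each retrieved candidate against the query and keeps only the best few. The sketch below shows that control flow with a deliberately crude scorer. The word-overlap function `overlap_score` is a hypothetical stand-in for the cross-encoder model, and the candidate strings are made up for the example.

```python
def rerank(query, candidates, score_fn, top_n=4):
    """Re-score candidates against the query and keep the top_n best."""
    scored = [(score_fn(query, text), text) for text in candidates]
    # Highest score first; ties keep their original retrieval order
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:top_n]]


def overlap_score(query, text):
    """Hypothetical stand-in scorer: fraction of query words found in text."""
    query_words = set(query.lower().split())
    text_words = set(text.lower().split())
    return len(query_words & text_words) / max(len(query_words), 1)


# Made-up candidate passages for illustration
candidates = [
    "The lab must be kept clean",
    "SuperNode-XP serves many research groups",
    "Users of SuperNode-XP submit jobs to the cluster",
]
top = rerank("SuperNode-XP research groups", candidates, overlap_score, top_n=2)
print(top)
```

A real cross-encoder reads the query and passage together, which is slower than comparing precomputed embeddings, so it is typically applied only to the short candidate list that the retrievers return.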

### Retrievers Setup

We'll combine the standard vector store retriever (embedding similarity) with a BM25 retriever (keyword matching).

In [30]:
from llama_index.retrievers.bm25 import BM25Retriever

# retrieve the top 10 most similar nodes using embeddings
vector_retriever = index.as_retriever(similarity_top_k=10)

# retrieve the top 10 most similar nodes using bm25
bm25_retriever = BM25Retriever.from_defaults(nodes=nodes, similarity_top_k=10)
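
For intuition about what the BM25 side contributes, here is a minimal pure-Python sketch of Okapi BM25 scoring. It is not the code `BM25Retriever` runs: the tokenization is a plain lowercase split, the smoothed IDF is one common variant, and the `k1` and `b` defaults are conventional values assumed for the example.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score every document against the query with a BM25 variant."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(d) for d in tokenized) / n_docs
    # Document frequency: number of documents each term occurs in
    df = Counter(term for d in tokenized for term in set(d))

    scores = []
    for d in tokenized:
        tf = Counter(d)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            # Rare terms get a larger weight than common ones
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1)
            # Term frequency saturates and is normalized by document length
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Made-up documents for illustration
docs = [
    "turn off cell phones in the lab",
    "keep the lab clean",
    "the cluster runs batch jobs",
]
scores = bm25_scores("lab clean", docs)
print(scores)
```

Because BM25 rewards exact term matches, it tends to help with names and identifiers such as "SuperNode-XP" that embedding similarity may blur.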

### Custom Retriever Implementation

In [31]:
from llama_index.core.retrievers import BaseRetriever


class HybridRetriever(BaseRetriever):
    def __init__(self, vector_retriever, bm25_retriever):
        self.vector_retriever = vector_retriever
        self.bm25_retriever = bm25_retriever
        super().__init__()

    def _retrieve(self, query, **kwargs):
        bm25_nodes = self.bm25_retriever.retrieve(query, **kwargs)
        vector_nodes = self.vector_retriever.retrieve(query, **kwargs)

        # combine the two lists of nodes
        all_nodes = []
        node_ids = set()
        for n in bm25_nodes + vector_nodes:
            if n.node.node_id not in node_ids:
                all_nodes.append(n)
                node_ids.add(n.node.node_id)
        return all_nodes
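
The merge step in `_retrieve` can be tried in isolation on plain data. The sketch below applies the same "first occurrence wins" rule to hypothetical `(node_id, score)` pairs. The helper name `merge_unique` and the sample values are assumptions made for the example.

```python
def merge_unique(*result_lists):
    """Concatenate result lists, keeping the first occurrence of each ID."""
    merged, seen = [], set()
    for results in result_lists:
        for node_id, score in results:
            if node_id not in seen:
                merged.append((node_id, score))
                seen.add(node_id)
    return merged


# Hypothetical (node_id, score) pairs standing in for retriever results
bm25_results = [("n1", 7.2), ("n2", 5.1)]
vector_results = [("n2", 0.83), ("n3", 0.79)]
merged = merge_unique(bm25_results, vector_results)
print(merged)
```

Note that the merged list mixes BM25 scores with embedding similarities, which live on different scales, so its order is not a meaningful ranking on its own. Passing the merged candidates through a re-ranker gives them a single comparable score.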

In [32]:
hybrid_retriever = HybridRetriever(vector_retriever, bm25_retriever)

### Full Query Engine

In [44]:
from llama_index.core import get_response_synthesizer

synth = get_response_synthesizer(
    streaming=True,
    response_mode="refine",
    service_context=groq_service_context,
)
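
The `"refine"` response mode walks through the retrieved chunks one at a time, producing a first answer and then asking the LLM to improve it with each further chunk. The sketch below shows that loop in plain Python. The prompt wording and the `fake_llm` stand-in are assumptions, not the library's actual templates.

```python
def refine_answer(query, chunks, llm):
    """Answer from the first chunk, then refine with each later chunk."""
    answer = None
    for chunk in chunks:
        if answer is None:
            prompt = f"Context: {chunk}\nQuestion: {query}\nAnswer:"
        else:
            prompt = (
                f"Existing answer: {answer}\n"
                f"New context: {chunk}\n"
                f"Question: {query}\n"
                "Refine the existing answer if the new context helps:"
            )
        answer = llm(prompt)  # one LLM call per chunk
    return answer


calls = []


def fake_llm(prompt):
    """Hypothetical stand-in for an LLM: records the prompt, returns a label."""
    calls.append(prompt)
    return f"answer v{len(calls)}"


final = refine_answer(
    "What is SuperNode-XP?", ["chunk A", "chunk B", "chunk C"], fake_llm
)
print(final, len(calls))
```

Since the calls are sequential, latency grows with the number of chunks, which is one reason to keep the re-ranker's `top_n` small.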

In [45]:
from llama_index.core.query_engine import RetrieverQueryEngine

query_engine = RetrieverQueryEngine.from_args(
    retriever=hybrid_retriever,
    node_postprocessors=[reranker],
    llm=mixtral_groq,
    response_synthesizer=synth,
)

In [50]:
response = query_engine.query(
    "Tell me something about supernode-xp?"
)
response.print_response_stream()

The SuperNode-XP system is currently being accessed by around 50 user accounts. A significant portion of the system's resources is used by research groups. The system's resource usage is depicted in the graph, with CPU core usage presented as a percentage. Specifically, the graph labeled "Hinh 1. Tình trạng sử dụng tải nguyên tính toán tính theo % nhân vi xử lý trên SuperNode-XP" shows the CPU core usage.

To provide some context, the SuperNode-XP system is a powerful computing system utilized by various research teams, including the Water Resources and Computational outcomes research group led by Nguyễn Thống and the Aerospace Engineering and Machine Design Research Group from the Faculty of Mechanical Engineering and Transportation at the University of Science. The system's primary applications include OpenFOAM and the ANSYS proprietary library, and each node on the SuperNode-XP system typically consists of 48 processing units with Intel Xeon Phi accelerator cards for managing comput

## Running a small Gradio Chatbot

In [None]:
import gradio as gr

def generate_response(prompt):
    # Query the RAG engine; Gradio's "text" output expects a plain string
    response = query_engine.query(prompt)
    return str(response)


app = gr.Interface(fn=generate_response, inputs="text", outputs="text")
app.launch(debug=True)

Setting queue=True in a Colab notebook requires sharing enabled. Setting `share=True` (you can turn this off by setting `share=False` in `launch()` explicitly).

Colab notebook detected. This cell will run indefinitely so that you can see errors and logs. To turn off, set debug=False in launch().
Running on public URL: https://0f5ae305608e9a0664.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from Terminal to deploy to Spaces (https://huggingface.co/spaces)


Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
[nltk_data] Downloading package stopwords to /tmp/llama_index...
[nltk_data]   Unzipping corpora/stopwords.zip.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


Entities processed: ['computing', 'benchmarks', 'tasks', 'pdf', 'architectural', 'characteristic', 'size', 'components', 'High', 'Architecture', 'effectiveness', 'inferring', 'as well as the communication and cache size considerations that are specific to the design of HPC systems. To find out the specific keywords from the text', 'matter', 'includes', 'I', 'which can include the architecture of high-performance computing systems.\n- Communication: This involves the communication between processors', 'typical', 'keywords', 'read', 'considerations', 'thinking', 'discuss', 'used', 'and the computational thinking.\n- Architecture: This refers to the design of processors and the cache hierarchy', 'would', 'To', 'component', 'hierarchy', 'interconnects', 'as well as the communication and cache size considerations that are specific to the design of', 'document', 'mathematical', 'associated', 'well', 'High-performance computing\n\nThe keywords that can be used to look up the answer t

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


Graph RAG context:
The following are knowledge sequence in max depth 2 in the form of directed graph like:
`subject -[predicate]->, object, <-[predicate_next_hop]-, object_next_hop ...` extracted based on key entities as subject:
['WRITE', 'invalidate']
['TYPE', 'Level 1']
['REDUCE', 'memory traffic']
['TYPE', 'Two Way']
['TYPE', 'Set Associative']
['CONTRAST', 'Direct Mapped Cache']
['CONTRAST', 'Direct Mapped Cache', 'ASSOCIATIVITY', '1']
['TYPE', 'Fully Associative']
['IS_FOR', 'storing data']
['IS_PART_OF', 'computer']
['HAS_PROPERTY', 'unit stride']
['HAS_PROPERTY', 'temporal locality']
['HAS_PROPERTY', 'spatial locality']
['LEVEL', 'L2']
['TYPE', 'data']
['TYPE', 'data', 'PART_OF', 'programming']
['TYPE', 'data', 'PART_OF', 'myDrive']
['TYPE', 'data', 'PART_OF', 'MyDrive']
['TYPE', 'data', 'SUBJECT', 'hpc']
['TYPE', 'data', 'PAGE', '22']
['TYPE', 'data', 'FILE', 'ch02_hpc.pdf']
['CACHE', 'L2']
['CACHE', 'L3']
['HIT_RATE', '100% (best)']
['CACHE', 'Registers']
['CONTAINS'

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


Keyboard interruption in main thread... closing server.
Killing tunnel 127.0.0.1:7860 <> https://0f5ae305608e9a0664.gradio.live




High-performance computing (HPC) is a type of computing that uses supercomputers and parallel processing to solve complex problems and perform large-scale simulations. The benefits of HPC include:
1. Faster processing: HPC can process large amounts of data much faster than traditional computing systems, making it ideal for data-intensive applications such as scientific simulations, weather forecasting, and genomic analysis.
2. Improved accuracy: HPC can perform calculations with greater accuracy than traditional computing systems, leading to more accurate results in fields such as engineering, finance, and scientific research.
3. Scalability: HPC systems can be scaled up to handle increasingly complex problems and larger datasets, making them a cost-effective solution for organizations with growing data needs.
4. Parallel processing: HPC systems use parallel processing, which allows multiple tasks to be processed simultaneously, further increasing processing speed and efficiency.
5. High-end network: HPC systems often come with high-end networks that provide fast data transfer rates, enabling large datasets to be processed and analyzed more quickly.
6. Wide spectrum of programming models: HPC systems support a wide range of programming models, including parallel processing, distributed computing, and data-parallel processing, allowing developers to choose the best approach for their specific application.
7. Heterogeneous integration: HPC systems can integrate multiple types of processors and memory technologies, enabling organizations to optimize their computing resources for different workloads.
8. Market drivers: The growth in energy consumption of data centers in the US, the introduction of accelerators, and cloud-based infrastructures supporting applications are some of the market drivers for HPC.
9. Data center trends: Data center operators are packing data centers with additional IT equipment per unit footprint area, and there is a trend towards mega data centers and exponential growth in data.
10. Support for emerging applications: HPC is required for many emerging applications, such as precision medicine, genomics, and light sheet microscopes.
11. Improved energy efficiency: HPC systems are becoming more energy efficient, with some systems using chiplet-based designs and SiP-level global power management to reduce power consumption.
12. High-performance memory: HPC systems often use high-performance memory technologies, such as SRAM, to provide low-latency and high-bandwidth access to data.
13. Non-volatile storage: HPC systems often use non-volatile storage, such as flash or disk, to ensure data persistence and reduce the need for constant data access.
14. Compute-centric: HPC systems are designed to be compute-centric, with a focus on maximizing processing power and minimizing data transfer times.
15. Parallel filesystems: HPC systems often come with parallel filesystems, which allow for efficient data access and processing across multiple nodes.
16. High-end applications: HPC systems are used for high-end applications such as weather forecasting, scientific simulations, and financial modeling.
17. Improved productivity: HPC systems can help organizations improve productivity by enabling faster processing of large datasets and more accurate results.
18. Cost savings: HPC systems can help organizations save costs by reducing the need for multiple computing systems and minimizing data transfer times.
19. Competitive advantage: HPC systems can provide a competitive advantage for organizations by enabling them to process large datasets and perform complex simulations faster than their competitors.
20. Innovation: HPC systems can enable new innovations in fields such as scientific research, engineering, and finance by providing the processing power and accuracy needed to solve complex problems.
Moreover, HPC systems can be used in conjunction with distributed computing frameworks such as Hadoop to process large datasets and perform advanced analytics. HPC systems can be used as the compute-intensive backend for Hadoop, enabling faster processing of large datasets and more complex analytics. HPC systems can also be used for data transformation, data processing, and data storage, making them a versatile solution for organizations with large data needs.
Security and reliability issues are important considerations when designing and implementing HPC systems. Design tools and security measures can be used to address these concerns, and HPC systems can be integrated with other technologies, such as parallel filesystems and distributed computing frameworks, to optimize performance and reduce the impact on the supply chain.
The context provided in the query confirms that the file 'high-performance-computing.pdf' is a valid source of information on the topic and provides additional details on the market drivers and data center trends related to HPC.