# Storage generation

This notebook generates a vector store and persists it in the `storage` folder, in a subfolder named after the embedding model (see the `persist_dir` used in the cells below).
Note that it was developed with only two models in mind (Llama 2 and Mistral), so the `messages_to_prompt` function may need to be changed before using it with other models.
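
As a rough illustration of how the prompt formatting could be adapted to another model, the sketch below builds the prompt from a per-turn template instead of hard-coded strings. The `Message` class, the `build_prompt` name, and its parameters are hypothetical stand-ins (roles are plain strings here); the default template simply mirrors the `<|role|>` format used later in this notebook.

```python
from dataclasses import dataclass


@dataclass
class Message:
    """Hypothetical stand-in for a chat message with a role and a content."""
    role: str
    content: str


def build_prompt(
    messages,
    turn_template="<|{role}|>\n{content}</s>\n",
    generation_prefix="<|assistant|>\n",
):
    """Format chat messages turn by turn, then append the generation prefix once."""
    turns = [m for m in messages if m.role in ("system", "user", "assistant")]
    # Ensure the prompt starts with a system turn, inserting a blank one if needed.
    if not turns or turns[0].role != "system":
        turns.insert(0, Message("system", ""))
    body = "".join(
        turn_template.format(role=m.role, content=m.content) for m in turns
    )
    # The final assistant marker is added a single time, after all turns.
    return body + generation_prefix


example = build_prompt([Message("user", "What is a vector store?")])
```

Switching to a different chat format would then only require passing another `turn_template` and `generation_prefix`, assuming the target model uses a simple per-turn layout.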


In [17]:
import os
import warnings
import pickle
from llama_index import (
    ServiceContext,
    SimpleDirectoryReader, 
    VectorStoreIndex,
    StorageContext,
    load_index_from_storage,
)
from llama_index.embeddings import HuggingFaceEmbedding
from llama_index.llms import LlamaCPP
from llama_index.llms.llama_utils import (
    messages_to_prompt,
    completion_to_prompt,
)

In [18]:
warnings.filterwarnings("ignore")
# Get the path to the parent directory
parent_dir = os.path.dirname(os.getcwd())

## Loading Documents

In [19]:
# data_path = os.path.join(parent_dir, 'data')

# # Data ingestion
# documents = SimpleDirectoryReader(data_path, exclude_hidden=True).load_data()

In [20]:
# # Storing documents as a list to avoid loading them again
# with open('../storage/documents/documents.pickle', 'wb') as f:
#     pickle.dump(documents, f)

In [21]:
# Opening the stored documents
with open('../storage/documents/documents.pickle', 'rb') as f:
    documents = pickle.load(f)

## Selecting a model to generate storage
This could be wrapped in a for loop over all models, but to avoid memory issues each model is run separately for now.

In [22]:
# Construct the path to the models directory
models_path = os.path.join(parent_dir, 'models')
# Keep regular files only, skipping .gitignore and anything ending with Zone.Identifier
models = [
    f for f in os.listdir(models_path)
    if os.path.isfile(os.path.join(models_path, f))
    and f != ".gitignore"
    and not f.endswith("Zone.Identifier")
]
# Show each entry without anything after the first dot
print("Available models:")
for i, m in enumerate(models):
    print(f"{i}: {m.split('.')[0]}")

Available models:
0: llama-2-13b-chat
1: bloomz-7b1
2: mistral-7b-instruct-v0
3: mixtral-8x7b-instruct-v0


In [23]:
# Select a model. Only a number between 0 and len(models)-1 is accepted; any other input prompts again.
invalid_msg = f"Invalid input. Please enter a number between 0 and {len(models)-1} according to the selection shown above."
while True:
    try:
        model_index = int(input("Select a model: "))
    except ValueError:
        print(invalid_msg)
        continue
    if 0 <= model_index < len(models):
        break
    print(invalid_msg)

# Get path to the selected model
model_path = os.path.join(models_path, models[model_index])
model_tag = models[model_index].split('-')[0]
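
The validation in the selection loop can also be pulled into a small function, which makes it testable without typing into `input()`. This is only a sketch; `parse_model_choice` is a hypothetical helper that is not part of the notebook.

```python
def parse_model_choice(raw, n_models):
    """Return the selected index, or None if `raw` is not a valid selection."""
    try:
        index = int(raw)
    except ValueError:
        # Non-numeric input such as "abc" or an empty string.
        return None
    # Accept only indices that point at an existing entry.
    return index if 0 <= index < n_models else None


# With four models available, "2" is accepted while "4" and "abc" are rejected.
accepted = parse_model_choice("2", 4)
rejected = [parse_model_choice("4", 4), parse_model_choice("abc", 4)]
```

The interactive loop would then keep asking until the helper returns something other than `None`.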

In [24]:
if not models[model_index].startswith("llama"):
    # The following prompt works well with Mistral
    def messages_to_prompt(messages):
        prompt = ""
        for message in messages:
            if message.role == 'system':
                prompt += f"<|system|>\n{message.content}</s>\n"
            elif message.role == 'user':
                prompt += f"<|user|>\n{message.content}</s>\n"
            elif message.role == 'assistant':
                prompt += f"<|assistant|>\n{message.content}</s>\n"

        # After all messages are formatted:
        # ensure we start with a system prompt, insert blank if needed
        if not prompt.startswith("<|system|>\n"):
            prompt = "<|system|>\n</s>\n" + prompt

        # add the final assistant prompt exactly once
        prompt = prompt + "<|assistant|>\n"

        return prompt

llm = LlamaCPP(
        # You can pass in the URL to a GGML model to download it automatically
        # model_url=model_url,
        # optionally, you can set the path to a pre-downloaded model instead of model_url
        model_path=model_path,
        temperature=0.2,
        max_new_tokens=1000,
        # llama2 has a context window of 4096 tokens, but we set it lower to allow for some wiggle room
        context_window=3900,
        # kwargs to pass to __call__()
        generate_kwargs={},
        # kwargs to pass to __init__()
        # n_gpu_layers=-1 offloads all layers to the GPU (0 keeps everything on the CPU)
        model_kwargs={"n_gpu_layers": -1},
        # transform inputs into the prompt format expected by the model
        messages_to_prompt=messages_to_prompt,
        completion_to_prompt=completion_to_prompt,
        verbose=True,
)

llama_model_loader: loaded meta data with 26 key-value pairs and 995 tensors from /home/splacintescu/RAG_Tester/models/mixtral-8x7b-instruct-v0.1.Q3_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = mistralai_mixtral-8x7b-instruct-v0.1
llama_model_loader: - kv   2:                       llama.context_length u32              = 32768
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   4:                          llama.block_count u32              = 32
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
llama_

llm_load_vocab: special tokens definition check successful ( 259/32000 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32000
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 32768
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 4
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_m

## Selecting Embeddings model
Currently, the embedding model name must be changed manually in the cell below.

In [25]:
# Alternatives: 'intfloat/e5-large-v2', 'sentence-transformers/all-mpnet-base-v2', 'BAAI/bge-large-en-v1.5'
embedding = "BAAI/bge-base-en-v1.5"
embedding_tag = embedding.split('/')[1]
embed_model = HuggingFaceEmbedding(embedding, max_length=512)
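
One way to avoid editing the cell by hand would be to read the embedding name from an environment variable and derive the storage folder from it. The sketch below assumes a hypothetical `EMBEDDING_MODEL` variable and a hypothetical `embedding_storage_dir` helper; neither exists in the notebook.

```python
import os


def embedding_storage_dir(embedding_name, storage_root="../storage"):
    """Return the folder used to persist the index built with `embedding_name`."""
    # Keep only the part after the last slash, e.g. "bge-base-en-v1.5".
    tag = embedding_name.split("/")[-1]
    return os.path.join(storage_root, tag)


# EMBEDDING_MODEL is a hypothetical variable; fall back to the notebook's default.
embedding = os.environ.get("EMBEDDING_MODEL", "BAAI/bge-base-en-v1.5")
persist_dir = embedding_storage_dir(embedding)
```

Using the last path component rather than index `1` also handles names that contain no slash.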

In [26]:
service_context = ServiceContext.from_defaults(
    llm=llm,
    embed_model=embed_model,
    # "local:EuropeanParliament/eubert_embedding_v1",
    chunk_size=512,
    chunk_overlap=125,
)

In [27]:
vector_index = VectorStoreIndex.from_documents(documents, service_context=service_context, show_progress=True)

Parsing nodes: 100%|██████████| 7132/7132 [00:13<00:00, 547.62it/s] 
Generating embeddings: 100%|██████████| 11938/11938 [02:47<00:00, 71.41it/s]


In [28]:
vector_index.storage_context.persist(persist_dir=f"../storage/{embedding_tag}")

## Loading index
Run the following cell to load an index persisted by a previous run and to test that the storage loads correctly.

In [29]:
# rebuild storage context
storage_context = StorageContext.from_defaults(persist_dir=f"../storage/{embedding_tag}")

# load index
vector_index = load_index_from_storage(storage_context, service_context=service_context)
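
Before loading, it can help to check that something was actually persisted for the chosen embedding. The following is a minimal sketch, assuming the persisted index consists of JSON files in the folder; `has_persisted_index` is a hypothetical helper, not a llama_index function.

```python
import os


def has_persisted_index(persist_dir):
    """Return True if `persist_dir` exists and contains at least one JSON file."""
    if not os.path.isdir(persist_dir):
        return False
    # Assumption: the persisted stores are written as *.json files.
    return any(name.endswith(".json") for name in os.listdir(persist_dir))


# In the notebook this could guard the loading cell, for example:
# if has_persisted_index(f"../storage/{embedding_tag}"):
#     ...load the index...
# else:
#     ...build it from the documents...
```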

## Adding new documents to existing index
To add new documents to an existing index, follow the steps below (**LOAD INDEX AND SERVICE CONTEXT FIRST**).

In [30]:
data_path = os.path.join(parent_dir, 'data', 'EUWhoiswho_EP_EN.pdf')

# Data ingestion
new_documents = SimpleDirectoryReader(input_files=[data_path]).load_data()

In [31]:
# Add to index
for chunk in new_documents:
    vector_index.insert(chunk, show_progress=True)

Generating embeddings: 100%|██████████| 1/1 [00:00<00:00, 52.50it/s]
Generating embeddings: 100%|██████████| 1/1 [00:00<00:00, 70.34it/s]
Generating embeddings: 100%|██████████| 2/2 [00:00<00:00, 55.92it/s]
Generating embeddings: 100%|██████████| 1/1 [00:00<00:00, 48.34it/s]
Generating embeddings: 100%|██████████| 1/1 [00:00<00:00, 48.85it/s]
Generating embeddings: 100%|██████████| 1/1 [00:00<00:00, 48.45it/s]
Generating embeddings: 100%|██████████| 1/1 [00:00<00:00, 74.38it/s]
Generating embeddings: 100%|██████████| 1/1 [00:00<00:00, 56.05it/s]
Generating embeddings: 100%|██████████| 2/2 [00:00<00:00, 55.48it/s]
Generating embeddings: 100%|██████████| 2/2 [00:00<00:00, 65.84it/s]
Generating embeddings: 100%|██████████| 1/1 [00:00<00:00, 73.82it/s]
Generating embeddings: 100%|██████████| 4/4 [00:00<00:00, 63.08it/s]
Generating embeddings: 100%|██████████| 2/2 [00:00<00:00, 67.30it/s]
Generating embeddings: 100%|██████████| 1/1 [00:00<00:00, 69.89it/s]
Generating embeddings: 100%|██████

In [32]:
# Persist to disk
vector_index.storage_context.persist(persist_dir=f"../storage/{embedding_tag}")

Remember to also update the pickled document store (see the "Loading Documents" section) so that the new documents are available in future runs.
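
Updating the pickled document store could look roughly like the sketch below. `append_to_document_store` is a hypothetical helper, and it assumes the pickle holds a plain Python list, as in the "Loading Documents" section.

```python
import os
import pickle


def append_to_document_store(pickle_path, new_documents):
    """Append `new_documents` to the pickled list and return the new total."""
    documents = []
    # Start from the existing list if the pickle file is already there.
    if os.path.exists(pickle_path):
        with open(pickle_path, "rb") as f:
            documents = pickle.load(f)
    documents.extend(new_documents)
    # Overwrite the file with the combined list.
    with open(pickle_path, "wb") as f:
        pickle.dump(documents, f)
    return len(documents)


# In the notebook this would be called after inserting into the index, e.g.:
# append_to_document_store('../storage/documents/documents.pickle', new_documents)
```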