# Storage generation

This notebook generates the vector store for each model and saves it in the `storage` folder, in a subfolder named after the embedding model.
Note that it was developed with only two models in mind (Llama 2 and Mistral), so the `messages_to_prompt` function may need to be changed before using it with other models.


In [1]:
import os
import warnings
import pickle
import copy
from llama_index import (
    ServiceContext,
    SimpleDirectoryReader, 
    VectorStoreIndex,
    StorageContext,
    load_index_from_storage,
)
from llama_index.embeddings import HuggingFaceEmbedding
from llama_index.llms import LlamaCPP
from llama_index.llms.llama_utils import (
    messages_to_prompt,
    completion_to_prompt,
)

In [2]:
warnings.filterwarnings("ignore")
# Get the path to the parent directory
parent_dir = os.path.dirname(os.getcwd())

## Loading Documents

In [3]:
#Rules and procedures
data_path = os.path.join(parent_dir, 'data/rp')

# #Legislation
# data_path = os.path.join(parent_dir, 'data/law')

In [4]:
# Data ingestion

documents = SimpleDirectoryReader(data_path, exclude_hidden=True).load_data()

In [5]:
# Storing documents as a list to avoid loading them again
with open('../storage/documents/documents.pickle'+data_path.split('/')[-1], 'wb') as f:
    pickle.dump(documents, f)

In [6]:
# Opening the stored documents
with open('../storage/documents/documents.pickle'+data_path.split('/')[-1], 'rb') as f:
    documents = pickle.load(f)
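
The two cells above can be folded into a single "load from cache, otherwise build and cache" step. The following is a minimal sketch of that pattern; `load_or_build`, `fake_loader` and the cache file name are hypothetical names used only for illustration, and the loader is a stand-in for the `SimpleDirectoryReader` call.

```python
import os
import pickle
import tempfile


def load_or_build(cache_path, build_fn):
    """Return the object pickled at cache_path, building and caching it if absent."""
    if os.path.exists(cache_path):
        with open(cache_path, "rb") as f:
            return pickle.load(f)
    obj = build_fn()
    parent = os.path.dirname(cache_path)
    if parent:
        os.makedirs(parent, exist_ok=True)
    with open(cache_path, "wb") as f:
        pickle.dump(obj, f)
    return obj


# Stand-in loader; in the notebook, build_fn would wrap
# SimpleDirectoryReader(data_path, exclude_hidden=True).load_data().
calls = []


def fake_loader():
    calls.append(1)
    return ["doc one", "doc two"]


# Hypothetical cache location inside a temporary directory
cache_file = os.path.join(tempfile.mkdtemp(), "documents", "documents_rp.pickle")
first = load_or_build(cache_file, fake_loader)   # builds and writes the cache
second = load_or_build(cache_file, fake_loader)  # served from the cache
```

Only unpickle files you created yourself, since `pickle.load` can execute arbitrary code from an untrusted file.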

## Chunking

In [7]:
print(documents[100].text)

PE649.486v01-00Question for written answer E-001354/2020
to the Commission
Rule 138
Eugen Tomac (PPE)
Subject: 2021-2027 common agricultural policy
The common agricultural policy has clearly benefited Romania, resulting in a more competitive 
farming sector, the more effective use of natural resources and a dramatic rise in living standards for 
rural areas.
How will the Commission ensure that the new common agricultural policy for 2021-2027 properly 
reflects Romania's priorities and needs?
More specifically:
What funding has been allocated to Romania for 2021-2027 compared with 2014-2020?
What are the subsidy levels per hectare/crop in 2021, compared to 2020?
What are the subsidy levels for beef/pork/ poultry/sheep 2021, compared to 2020?
EN
E-0001354/2020
Answer given by Mr Wojciechowski
on behalf of the European Commission
(27.4.2020)
The Common Agricultural Policy (CAP) proposed for the period post 2020 will give 
Romania greater responsibility and possibilities to tailor their po

In [8]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     /home/splacintescu/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [9]:
# from langchain.text_splitter import NLTKTextSplitter
# text_splitter = NLTKTextSplitter()

#better results with SpaCy
from langchain.text_splitter import SpacyTextSplitter
text_splitter = SpacyTextSplitter()

# for i in range(len(documents)):
#     documents[i].text = ''.join(text_splitter.split_text((documents[i].text)))

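# Split every document and keep one Document copy per chunk,
# so each chunk retains the metadata of the document it came from
chunkedText = []
for doc in documents:
    chunks = text_splitter.split_text(doc.text)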
    for chunk in chunks:
        doc_aux = copy.deepcopy(doc)
        doc_aux.text = chunk
        chunkedText.append(doc_aux)
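
The loop above copies each document once per chunk so that every chunk keeps the original metadata. Below is a small, library-free sketch of the same idea; `Doc` is a hypothetical stand-in for a llama_index `Document`, and `split_paragraphs` is a naive stand-in for the spaCy splitter, not its actual behaviour.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for a llama_index Document."""
    text: str
    metadata: dict = field(default_factory=dict)


def split_paragraphs(text):
    """Naive stand-in splitter: one chunk per blank-line-separated paragraph."""
    return [p.strip() for p in text.split("\n\n") if p.strip()]


def chunk_documents(docs, split_fn):
    """Return one deep copy of each document per chunk, keeping its metadata."""
    chunked = []
    for doc in docs:
        for chunk in split_fn(doc.text):
            piece = copy.deepcopy(doc)  # the original document is left untouched
            piece.text = chunk
            chunked.append(piece)
    return chunked


docs = [Doc("Question one.\n\nAnswer one.", {"file_name": "a.pdf"})]
chunks = chunk_documents(docs, split_paragraphs)
print([c.text for c in chunks])
```

Because each piece is a deep copy, editing the metadata of one chunk does not affect the source document or its sibling chunks.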


In [10]:
print(chunkedText[100].text)

PE649.486v01-00Question for written answer E-001354/2020
to the Commission
Rule 138
Eugen Tomac (PPE)


Subject: 2021-2027 common agricultural policy
The common agricultural policy has clearly benefited Romania, resulting in a more competitive 
farming sector, the more effective use of natural resources and a dramatic rise in living standards for 
rural areas.


How will the Commission ensure that the new common agricultural policy for 2021-2027 properly 
reflects Romania's priorities and needs?
More specifically:
What funding has been allocated to Romania for 2021-2027 compared with 2014-2020?


What are the subsidy levels per hectare/crop in 2021, compared to 2020?


What are the subsidy levels for beef/pork/ poultry/sheep 2021, compared to 2020?
EN
E-0001354/2020
Answer given by Mr Wojciechowski
on behalf of the European Commission
(27.4.2020)


The Common Agricultural Policy (CAP) proposed for the period post 2020 will give 
Romania greater responsibility and possibilities to tailo

## Selecting a model to generate storage
This could be combined with a for loop, but to avoid memory issues, we run it separately for now.

In [11]:
# Construct the path to the models directory
models_path = os.path.join(parent_dir, 'models')
models = [f for f in os.listdir(models_path) if os.path.isfile(os.path.join(models_path, f))]
# Remove .gitignore if it is present
if ".gitignore" in models:
    models.remove(".gitignore")

# Remove anything ending with Zone.Identifier
models = [m for m in models if not m.endswith("Zone.Identifier")]

# Print each file name split on dots, without the final extension
print("Available models:")
for i, m in enumerate(models):
    print(f"{i}: {m.split('.')[:-1]}")

Available models:
0: ['mixtral-8x7b-instruct-v0', '1', 'Q8_0']
1: ['llama-2-13b-chat', 'Q4_0']
2: ['mistral-7b-instruct-v0', '2', 'Q5_K_M']
3: ['mixtral-8x7b-instruct-v0', '1', 'Q3_K_M']
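
The listing above shows each name as a list of its dot-separated parts. If a single readable string per model is preferred, one option is `pathlib.Path.stem`, which drops only the last extension. This is a sketch, not the notebook's code: `display_names` is a hypothetical helper, and apart from the Mixtral Q8_0 file the file names below are illustrative assumptions.

```python
from pathlib import Path

# Illustrative directory listing (only the first name is taken from the notebook output)
model_files = [
    "mixtral-8x7b-instruct-v0.1.Q8_0.gguf",
    "llama-2-13b-chat.Q4_0.gguf",
    ".gitignore",
    "llama-2-13b-chat.Q4_0.gguf:Zone.Identifier",
]


def display_names(files):
    """Filter out non-model entries and strip the final extension from the rest."""
    keep = [
        f for f in files
        if f != ".gitignore" and not f.endswith("Zone.Identifier")
    ]
    return [Path(f).stem for f in keep]


for i, name in enumerate(display_names(model_files)):
    print(f"{i}: {name}")
```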


In [12]:
# Select a model. Only an integer between 0 and len(models)-1 is accepted; any other input makes the program ask again
while True:
    try:
        model_index = int(input("Select a model: "))
        if 0 <= model_index < len(models):
            break
        else:
            print("Invalid input. Please enter a number between 0 and " + str(len(models)-1) + " according to the selection shown above.")
    except ValueError:
        print("Invalid input. Please enter a number between 0 and " + str(len(models)-1) + " according to the selection shown above.")

# Get path to the selected model
model_path = os.path.join(models_path, models[model_index])
model_tag = models[model_index].split('-')[0]
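
The validation in the selection loop can also be isolated in a small function, which makes it possible to check it without typing into `input()`. A minimal sketch, where `parse_choice` is a hypothetical helper name:

```python
def parse_choice(raw, n_options):
    """Return raw as an index in [0, n_options), or None if it is not valid."""
    try:
        index = int(raw)
    except ValueError:
        return None
    return index if 0 <= index < n_options else None


# Example inputs a user might type when four models are listed
for raw in ["2", "4", "abc", "-1"]:
    print(repr(raw), "->", parse_choice(raw, 4))
```

In the loop, the prompt would simply repeat while `parse_choice(input("Select a model: "), len(models))` returns `None`.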

In [13]:
if not models[model_index].startswith("llama"):
    # The following prompt works well with Mistral
    def messages_to_prompt(messages):
        prompt = ""
        for message in messages:
            if message.role == 'system':
                prompt += f"<|system|>\n{message.content}</s>\n"
            elif message.role == 'user':
                prompt += f"<|user|>\n{message.content}</s>\n"
            elif message.role == 'assistant':
                prompt += f"<|assistant|>\n{message.content}</s>\n"

        # ensure we start with a system prompt, insert blank if needed
        if not prompt.startswith("<|system|>\n"):
            prompt = "<|system|>\n</s>\n" + prompt

        # add final assistant prompt (once, after all messages)
        prompt = prompt + "<|assistant|>\n"

        return prompt

llm = LlamaCPP(
        # You can pass in the URL to a GGML model to download it automatically
        # model_url=model_url,
        # optionally, you can set the path to a pre-downloaded model instead of model_url
        model_path=model_path,
        temperature=0.2,
        max_new_tokens=1000,
        # llama2 has a context window of 4096 tokens, but we set it lower to allow for some wiggle room
        context_window=3900,
        # kwargs to pass to __call__()
        generate_kwargs={},
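        # kwargs to pass to __init__()
        # n_gpu_layers=-1 offloads all layers to the GPU; set it to 0 to run on CPU only
        model_kwargs={"n_gpu_layers": -1},
        # transform inputs into the model's prompt format (Llama 2 helpers unless overridden above)
        messages_to_prompt=messages_to_prompt,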
        completion_to_prompt=completion_to_prompt,
        verbose=True,
)

llama_model_loader: loaded meta data with 26 key-value pairs and 995 tensors from /home/splacintescu/RAG-Tester/models/mixtral-8x7b-instruct-v0.1.Q8_0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = mistralai_mixtral-8x7b-instruct-v0.1
llama_model_loader: - kv   2:                       llama.context_length u32              = 32768
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   4:                          llama.block_count u32              = 32
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
llama_mo
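
The custom prompt format used for non-Llama models can be checked without loading a model. The sketch below is an illustration under assumptions: `Message` is a hypothetical stand-in for llama_index's chat message object (plain-string `role` and `content` only), and `build_prompt` applies the system-prompt check and the final assistant tag once, after all messages have been appended.

```python
from dataclasses import dataclass


@dataclass
class Message:
    """Hypothetical stand-in for a chat message with a role and content."""
    role: str
    content: str


def build_prompt(messages):
    """Concatenate messages using <|role|> tags and end with an open assistant turn."""
    prompt = ""
    for message in messages:
        if message.role == "system":
            prompt += f"<|system|>\n{message.content}</s>\n"
        elif message.role == "user":
            prompt += f"<|user|>\n{message.content}</s>\n"
        elif message.role == "assistant":
            prompt += f"<|assistant|>\n{message.content}</s>\n"

    # Insert a blank system prompt if the conversation does not start with one
    if not prompt.startswith("<|system|>\n"):
        prompt = "<|system|>\n</s>\n" + prompt

    # Leave the assistant turn open so the model generates the reply
    return prompt + "<|assistant|>\n"


single = build_prompt([Message("user", "What is Rule 138?")])
multi = build_prompt([
    Message("system", "Answer briefly."),
    Message("user", "Hi"),
    Message("assistant", "Hello"),
    Message("user", "What is Rule 138?"),
])
print(multi)
```

A multi-turn conversation is a useful check here, because the prompt should contain exactly one unterminated `<|assistant|>` tag, at the very end.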

## Selecting Embeddings model
Currently, the embedding model name needs to be changed manually.

In [14]:
embedding = 'sentence-transformers/all-mpnet-base-v2' # 'intfloat/e5-large-v2' #  "BAAI/bge-large-en-v1.5"   # "BAAI/bge-base-en-v1.5"
embedding_tag = embedding.split('/')[1]
embed_model = HuggingFaceEmbedding(embedding, max_length=512)

In [15]:
service_context = ServiceContext.from_defaults(
    llm=llm,
    embed_model=embed_model,
    # "local:EuropeanParliament/eubert_embedding_v1",
    chunk_size=512,
    chunk_overlap=125,
)
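
To get a feel for what `chunk_size=512` and `chunk_overlap=125` imply, the sketch below computes where fixed-size windows would start in a text of a given token count. It is a simplified model under the assumption of a plain sliding window; the library's splitter may also respect sentence boundaries, so the real chunk boundaries can differ. `window_starts` is a hypothetical helper.

```python
def window_starts(n_tokens, chunk_size=512, chunk_overlap=125):
    """Start offsets of fixed-size windows that overlap by chunk_overlap tokens."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    stride = chunk_size - chunk_overlap  # each new window advances by this many tokens
    starts = [0]
    while starts[-1] + chunk_size < n_tokens:
        starts.append(starts[-1] + stride)
    return starts


# A 1000-token text: windows advance by 512 - 125 = 387 tokens
print(window_starts(1000))
```

With these settings each window shares its last 125 tokens with the next one, so roughly a quarter of every chunk is repeated context.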

In [16]:
vector_index = VectorStoreIndex.from_documents(chunkedText, service_context=service_context, show_progress=True)

Parsing nodes:   0%|          | 0/43625 [00:00<?, ?it/s]

Parsing nodes: 100%|██████████| 43625/43625 [00:39<00:00, 1093.75it/s]
Generating embeddings: 100%|██████████| 2048/2048 [00:24<00:00, 83.80it/s] 
Generating embeddings: 100%|██████████| 2048/2048 [00:34<00:00, 58.63it/s]
Generating embeddings: 100%|██████████| 2048/2048 [00:34<00:00, 58.60it/s]
Generating embeddings: 100%|██████████| 2048/2048 [00:24<00:00, 83.48it/s] 
Generating embeddings: 100%|██████████| 2048/2048 [00:30<00:00, 67.28it/s] 
Generating embeddings: 100%|██████████| 2048/2048 [00:33<00:00, 61.80it/s]
Generating embeddings: 100%|██████████| 2048/2048 [00:34<00:00, 59.17it/s]
Generating embeddings: 100%|██████████| 2048/2048 [00:30<00:00, 66.76it/s] 
Generating embeddings: 100%|██████████| 2048/2048 [00:35<00:00, 58.45it/s]
Generating embeddings: 100%|██████████| 2048/2048 [00:34<00:00, 58.73it/s]
Generating embeddings: 100%|██████████| 2048/2048 [00:34<00:00, 59.07it/s]
Generating embeddings: 100%|██████████| 2048/2048 [00:34<00:00, 58.61it/s]
Generating embeddings: 10

In [17]:
vector_index.storage_context.persist(persist_dir=f"../storage/{embedding_tag}-qa")
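
Since the same directory has to be used when persisting and when loading, it can help to build the path in one place. The following is a sketch only: `persist_dir_for`, its `suffix` parameter and the `root` default are hypothetical and simply mirror the `../storage/{embedding_tag}-qa` pattern used above.

```python
import os


def persist_dir_for(embedding_name, suffix="", root="../storage"):
    """Build the storage directory for an embedding model name such as 'org/model'."""
    # Take the part after the last slash; this also works for names without a slash
    tag = embedding_name.split("/")[-1]
    return os.path.join(root, f"{tag}{suffix}")


qa_dir = persist_dir_for("sentence-transformers/all-mpnet-base-v2", suffix="-qa")
print(qa_dir)
```

Passing the same arguments in the persist and load cells keeps the two paths from drifting apart.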

## Loading index
Uncomment the following cell if you want to load an index from a previous run and test the storage loading.

In [None]:
# # rebuild storage context (persist_dir must match the directory used when persisting, including any suffix such as "-qa")
# storage_context = StorageContext.from_defaults(persist_dir=f"../storage/{embedding_tag}")

# # load index
# vector_index = load_index_from_storage(storage_context, service_context=service_context)

## Adding new documents to existing index
To add new documents to an existing index, follow the steps below (**LOAD INDEX AND SERVICE CONTEXT FIRST**).

In [None]:
# data_path = os.path.join(parent_dir, 'data', 'EUWhoiswho_EP_EN.pdf')

# # Data ingestion
# new_documents = SimpleDirectoryReader(input_files=[data_path]).load_data()

In [None]:
# # Add to index
# for chunk in new_documents:
#     vector_index.insert(chunk, show_progress=True)

In [None]:
# # Persist to disk
# vector_index.storage_context.persist(persist_dir=f"../storage/{embedding_tag}")

Remember to update the pickled document store as well, in case it is needed in the future (see the Loading Documents section).