About Fine-Tuning #212

Open
4entertainment opened this issue May 9, 2024 · 0 comments
4entertainment commented May 9, 2024

I have a few more questions and would appreciate your answers. Here is my fine-tuning code:

from ragatouille import RAGTrainer
from ragatouille.data import CorpusProcessor, llama_index_sentence_splitter
import os
import glob
import random

def main():
    trainer = RAGTrainer(model_name="ColBERT_1.0",  # ColBERT_1 for first sample
                         # pretrained_model_name="colbert-ir/colbertv2.0",
                         pretrained_model_name="intfloat/e5-base", # ???
                         language_code="tr" # ???
                         )
    # pretrained_model_name: base model to fine-tune
    # model_name: name for the newly trained model



    # Path to the directory containing all the `.txt` files for indexing
    folder_path = "/text"  # the "text" folder contains several .txt files
    # Initialize lists to store the texts and their corresponding file names
    all_texts = []
    document_ids = []
    # Read all `.txt` files in the specified folder and extract file names
    for file_path in glob.glob(os.path.join(folder_path, "*.txt")):
        with open(file_path, "r", encoding="utf-8") as file:
            content = file.read()
            all_texts.append(content)
            document_ids.append(os.path.splitext(os.path.basename(file_path))[0])  # Extract file name without extension


    # chunking
    corpus_processor = CorpusProcessor(document_splitter_fn=llama_index_sentence_splitter)
    documents = corpus_processor.process_corpus(documents=all_texts, document_ids=document_ids, chunk_size=256) # overlap=0.1 chosen

    # To train retrieval models like ColBERT, we need training triplets: queries, positive passages, and negative passages for each query.
    # fake query / relevant-passage pairs
    queries = ["document relevant query-1",
               "document relevant query-2",
               "document relevant query-3",
               "document relevant query-4",
               "document relevant query-5",
               "document relevant query-6"
    ] * 3
    pairs = []
    for query in queries:
        fake_relevant_docs = random.sample(documents, 10)
        for doc in fake_relevant_docs:
            pairs.append((query, doc))


    # prepare training data
    trainer.prepare_training_data(raw_data=pairs,
                                  data_out_path="./data_out_path",
                                  all_documents=all_texts,
                                  num_new_negatives=10,
                                  mine_hard_negatives=True
                                  )
    trainer.train(batch_size=32,
                  nbits=4,  # how many bits the trained model will use
                  maxsteps=500000,
                  use_ib_negatives=True,  # use in-batch negatives when computing the loss
                  dim=128,  # each embedding will have 128 dimensions
                  learning_rate=5e-6,  # small values ([3e-6, 3e-5]) work best if the base model is BERT-like; 5e-6 is often the sweet spot
                  doc_maxlen=256,  # maximum document length
                  use_relu=False,  # disable ReLU
                  warmup_steps="auto",  # defaults to 10% of the total steps
    )
if __name__ == "__main__":
    main()

When I run my code, the trained model is saved to the checkpoints directory with the following structure:
colbert/

  • vocab.txt
  • tokenizer_config.json
  • tokenizer.json
  • special_tokens_map.json
  • model.safetensors
  • config.json
  • artifact.metadata

I need to fine-tune the intfloat/e5-base or intfloat/multilingual-e5-base model on my own data with ColBERT. Do you know of any changes I need to make to my code, or to the library's internal code, to support this? The snippet below shows the only change I have tried so far.
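This is just a minimal sketch of what I am attempting; the model_name is a hypothetical label, and I am not sure whether RAGTrainer accepts a non-ColBERT base like E5 without further changes:

from ragatouille import RAGTrainer

# Assumption: only the base model name is swapped; whether the library needs
# internal changes to handle an E5 encoder is exactly what I am asking about.
trainer = RAGTrainer(
    model_name="ColBERT_e5",                                # hypothetical name for the new model
    pretrained_model_name="intfloat/multilingual-e5-base",  # the E5 base I want to start from
    language_code="tr",
)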

Also, how can I try out the model I fine-tuned with my code, which has the structure shared above? Do you have example code for loading and querying it? Below is roughly what I expect the loading step to look like.
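This is only a sketch based on RAGatouille's RAGPretrainedModel API; the checkpoint path and documents are placeholders, and I have not verified that it works with my fine-tuned model:

from ragatouille import RAGPretrainedModel

# Assumption: "path/to/checkpoints/colbert" is a placeholder for the saved "colbert" directory listed above.
RAG = RAGPretrainedModel.from_pretrained("path/to/checkpoints/colbert")

# Index a couple of placeholder passages and run a test query against them.
docs = ["example passage 1", "example passage 2"]
RAG.index(collection=docs, index_name="test_index")
results = RAG.search(query="example query", k=2)
print(results)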

Thanks for your interest
