
trainer.prepare_training_data() does not turn 3rd column of triplets into indices #20

Closed · tm17-abcgen opened this issue on Jan 6, 2024 · 3 comments · Fixed by #21
Labels: bug (Something isn't working)

tm17-abcgen (Contributor) commented on Jan 6, 2024

I am currently trying to train a model on my own knowledge corpus. However, instead of each triplet (from the triples.train.colbert.jsonl file that gets created when running the function) containing three numbers, the third column in my case is raw text, e.g.:

[61,434,"text"]

I am not sure what I did wrong; I cross-checked the data types and formats against the 2nd and 3rd example notebooks.

trainer.prepare_training_data(
    raw_data=pairs,
    data_out_path="./data/",
    all_documents=chunked_documents,
    num_new_negatives=32,
    mine_hard_negatives=True,
)

In my use case, pairs[0] is:

('Welche Bauarten und Bauprodukte sind von der Regelung betroffen?', 'die sich für einen Verwendungszweck auf die Erfüllung der Anforderungen nach § 3 Satz 1 und 2 auswirken, c) Verfahren für die Feststellung der Leistung eines Bauproduktes im Hinblick auf Merk- male, die sich für einen Verwendungszweck auf die Erfüllung der Anforderungen nach § 3 Satz 1 und 2 auswirken, d) zulässige oder unzulässige besondere Verwendungszwecke, e) die Festlegung von Klassen und Stufen in Bezug auf bestimmte Verwendungszwecke, f) die für einen bestimmten Verwendungszweck anzugebende oder erforderliche und anzugebende Leistung in Bezug auf ein Merkmal, das sich für einen Verwendungs- zweck auf die Erfüllung der Anforderungen nach § 3 Satz 1 und 2 auswirkt, soweit vorgesehen in Klassen und Stufen, 4. die Bauarten und die Bauprodukte,')

pairs has a length of 64, and chunked_documents comes from this call at the beginning:

corpus_processor.process_corpus(full_documents, chunk_size=256)

I hope someone has a clue as to why the last column of the triplets does not turn into numbers when executing prepare_training_data. Thanks in advance.
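
For anyone reproducing this, here is a quick sanity check (a minimal sketch; the file path assumes the data_out_path="./data/" from the call above) that flags any row whose columns are not all integers:

import json

# Every row of the exported triplets file should be three integer IDs:
# [query_id, positive_passage_id, negative_passage_id].
with open("./data/triples.train.colbert.jsonl") as f:
    for i, line in enumerate(f):
        triplet = json.loads(line)
        if not all(isinstance(x, int) for x in triplet):
            print(f"row {i} has a non-integer column: {triplet}")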

Update 1:

I debugged a bit and traced it to this function in training_data_processor.py. Apparently, when a query has only one positive passage, the raw negative passage text is written as the third column of the triplet instead of an index. I would like to know why.

def _make_individual_triplets(self, query, positives, negatives):
        """Create the training data in ColBERT(v1) format from raw lists of triplets"""
        triplets = []
        q = self.query_map[query]
        print("q")
        print(q)

        random.seed(42)
        if len(positives) > 1:
            all_pos_texts = [p for p in positives]
            max_triplets_per_query = 20
            negs_per_positive = max(1, max_triplets_per_query // len(all_pos_texts))
            initial_triplets_count = 0
            
            for pos in all_pos_texts:
                p = self.passage_map[pos]
                chosen_negs = random.sample(
                    negatives, min(len(negatives), negs_per_positive)
                )
                for neg in chosen_negs:
                    print("neg")
                    print(neg)
                    n = self.passage_map[neg]
                    print("n")
                    print(n)
                    initial_triplets_count += 1
                    triplets.append([q, p, n])

            extra_triplets_needed = max_triplets_per_query - initial_triplets_count
            while extra_triplets_needed > 0:
                p = self.passage_map[random.choice(all_pos_texts)]
                n = self.passage_map[random.choice(negatives)]
                triplets.append([q, p, n])
                extra_triplets_needed -= 1
        else:
            p = self.passage_map[positives[0]]
            for n in negatives:
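                # NOTE: `n` is the raw negative passage text here; unlike the
                # branch above, it is never mapped through self.passage_map.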
                triplets.append([q, p, n])

        return triplets
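
Looking at the function above, the else branch appears to be the culprit: it appends the raw negative passage text instead of its index. A minimal sketch of what I would expect it to do instead (assuming passage_map maps passage text to integer IDs, as the if branch does):

        else:
            p = self.passage_map[positives[0]]
            for neg in negatives:
                n = self.passage_map[neg]  # map the text to its integer ID, as above
                triplets.append([q, p, n])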

Update 2:

Actually, if I understand it correctly, the same problem is in the 3rd example notebook. If, after the last cell, you were to prepare the data with:

trainer.prepare_training_data(
    raw_data=pairs,
    all_documents=documents,
    num_new_negatives=10,
    mine_hard_negatives=True,
)

and then train it with:

from pathlib import Path

trainer.data_dir = Path("./data/")
trainer.train(
    batch_size=8,
    nbits=4,  # How many bits the trained model will use when compressing indexes
    maxsteps=500000,  # Maximum steps hard stop
    use_ib_negatives=True,  # Use in-batch negatives to calculate loss
    dim=128,  # How many dimensions per embedding. 128 is the default and works well.
    learning_rate=5e-6,  # Learning rate; small values ([3e-6, 3e-5]) work best if the base model is BERT-like, 5e-6 is often the sweet spot
    doc_maxlen=256,  # Maximum document length. Because of how ColBERT works, smaller chunks (128-256) work very well.
    use_relu=False,  # Disable ReLU -- doesn't improve performance
    warmup_steps="auto",  # Defaults to 10%
)

you would get the same error, namely that the third column of the triplets is a str and not an index:

#> Starting...
nranks = 1 	 num_gpus = 1 	 device=0
{
    "query_token_id": "[unused0]",
    "doc_token_id": "[unused1]",
    "query_token": "[Q]",
    "doc_token": "[D]",
    "ncells": null,
    "centroid_score_threshold": null,
    "ndocs": null,
    "load_index_with_mmap": false,
    "index_path": null,
    "nbits": 4,
    "kmeans_niters": 20,
    "resume": false,
    "similarity": "cosine",
    "bsize": 8,
    "accumsteps": 1,
    "lr": 5e-6,
    "maxsteps": 500000,
    "save_every": 8,
    "warmup": 8,
    "warmup_bert": null,
    "relu": false,
    "nway": 2,
...
[Jan 07, 01:09:02] #> Got 64 queries. All QIDs are unique.

[Jan 07, 01:09:02] #> Loading collection...
0M 
[output truncated]
/home/tm16/Work/00_RandomCoding/RAGatouille/v001/venv/lib/python3.11/site-packages/transformers/optimization.py:429: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning
  warnings.warn(
#> LR will use 8 warmup steps and linear decay over 500000 steps.
[89, 'Artists from Pixar and Aardman Studios signed a tribute stating, "You\'re our inspiration, Miyazaki-san!" He has also been cited as inspiration for video game designers including Shigeru Miyamoto on The Legend of Zelda and Hironobu Sakaguchi on Final Fantasy, as well as the television series Avatar: The Last Airbender, and the video game Ori and the Blind Forest (2015).Studio Ghibli has searched for some time for Miyazaki and Suzuki\'s successor to lead the studio; Kondō, the director of Whisper of the Heart, was initially considered, but died from a sudden heart attack in 1998. Some candidates were considered by 2023—including Miyazaki\'s son Goro, who declined—but the studio was not able to find a successor.']
Process Process-1:
Traceback (most recent call last):
  File "/usr/lib/python3.11/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/usr/lib/python3.11/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/tm16/Work/00_RandomCoding/RAGatouille/v001/venv/lib/python3.11/site-packages/colbert/infra/launcher.py", line 115, in setup_new_process
    return_val = callee(config, *args)
                 ^^^^^^^^^^^^^^^^^^^^^
  File "/home/tm16/Work/00_RandomCoding/RAGatouille/v001/venv/lib/python3.11/site-packages/colbert/training/training.py", line 87, in train
    for batch_idx, BatchSteps in zip(range(start_batch_idx, config.maxsteps), reader):
  File "/home/tm16/Work/00_RandomCoding/RAGatouille/v001/venv/lib/python3.11/site-packages/colbert/training/lazy_batcher.py", line 57, in __next__
    passages = [self.collection[pid] for pid in pids]
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/tm16/Work/00_RandomCoding/RAGatouille/v001/venv/lib/python3.11/site-packages/colbert/training/lazy_batcher.py", line 57, in <listcomp>
    passages = [self.collection[pid] for pid in pids]
                ~~~~~~~~~~~~~~~^^^^^
  File "/home/tm16/Work/00_RandomCoding/RAGatouille/v001/venv/lib/python3.11/site-packages/colbert/data/collection.py", line 25, in __getitem__
    return self.data[item]
           ~~~~~~~~~^^^^^^
TypeError: list indices must be integers or slices, not str

Tomorrow I will look at example notebook 2 again as a guide, as training worked there (the third triplet column contains indices).

bclavie (Owner) commented on Jan 7, 2024

Hey, thank you for flagging this. It's an oversight from how the code worked in testing, where the mapping occurred at an earlier stage (when training JaColBERT). It should be fixed in #21; I'm testing the third notebook locally with it again and will then merge + push to PyPI.

Thanks also for digging into the code to figure out the issue! This is much appreciated 😄

bclavie added the bug (Something isn't working) label on Jan 7, 2024
bclavie (Owner) commented on Jan 7, 2024

Just finished running the notebook - haven't started training (on the move and no GPU access at the moment), but can confirm the exported triplets are now in the right format:

[0,67,40]
[0,67,19]
[0,67,28]
[0,67,83]
[0,67,17]
...

The fixed version will be on PyPI within ~10 minutes.

bclavie closed this as completed on Jan 7, 2024
tm17-abcgen (Contributor, Author) commented:

Can also confirm that it is working. Thanks for that fix. 👯‍♂️
