I am currently trying to train a model on my own knowledge corpus. However, instead of the triplets (from the triples.train.colbert.jsonl that gets created when running the function) having three numbers, the third column is plain text in my case, e.g.:

```
[61, 434, "text"]
```

I am not sure what I did wrong; I cross-checked the different data types and formats against the 2nd and 3rd example notebooks.
pairs[0] in my use case would be:

```
('Welche Bauarten und Bauprodukte sind von der Regelung betroffen?', 'die sich für einen Verwendungszweck auf die Erfüllung der Anforderungen nach § 3 Satz 1 und 2 auswirken, c) Verfahren für die Feststellung der Leistung eines Bauproduktes im Hinblick auf Merkmale, die sich für einen Verwendungszweck auf die Erfüllung der Anforderungen nach § 3 Satz 1 und 2 auswirken, d) zulässige oder unzulässige besondere Verwendungszwecke, e) die Festlegung von Klassen und Stufen in Bezug auf bestimmte Verwendungszwecke, f) die für einen bestimmten Verwendungszweck anzugebende oder erforderliche und anzugebende Leistung in Bezug auf ein Merkmal, das sich für einen Verwendungszweck auf die Erfüllung der Anforderungen nach § 3 Satz 1 und 2 auswirkt, soweit vorgesehen in Klassen und Stufen, 4. die Bauarten und die Bauprodukte,')
```
and pairs has a length of 64, while "chunked_documents" comes from here at the beginning:
Hope someone has a clue as to why the last column of the triplets did not turn into numbers when executing the prepare_training_data function. Thanks in advance.
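For anyone hitting the same symptom: a quick, hypothetical sanity check over the exported JSONL lines (not part of RAGatouille itself) that flags any triplet whose elements are not all integer ids:

```python
import json

def bad_triplet_lines(jsonl_lines):
    """Return the indices of JSONL lines whose triplet is malformed.

    Each line of triples.train.colbert.jsonl is expected to be a JSON
    array of three integer ids: [query_id, positive_id, negative_id].
    """
    bad = []
    for i, line in enumerate(jsonl_lines):
        triplet = json.loads(line)
        if len(triplet) != 3 or not all(isinstance(x, int) for x in triplet):
            bad.append(i)
    return bad

# The broken line from this issue is flagged, a well-formed one is not:
print(bad_triplet_lines(['[61, 434, "text"]', '[1, 2, 3]']))
```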
Update 1:
Debugged a bit and came to this function in training_data_processor.py. Apparently, when a query has only one positive passage, the else branch appends the negative passage's raw text as the third element of the triplet instead of its index (n is never looked up in self.passage_map). Would like to know why.
```python
def _make_individual_triplets(self, query, positives, negatives):
    """Create the training data in ColBERT(v1) format from raw lists of triplets"""
    triplets = []
    q = self.query_map[query]
    print("q")
    print(q)
    random.seed(42)
    if len(positives) > 1:
        all_pos_texts = [p for p in positives]
        max_triplets_per_query = 20
        negs_per_positive = max(1, max_triplets_per_query // len(all_pos_texts))
        initial_triplets_count = 0
        for pos in all_pos_texts:
            p = self.passage_map[pos]
            chosen_negs = random.sample(
                negatives, min(len(negatives), negs_per_positive)
            )
            for neg in chosen_negs:
                print("neg")
                print(neg)
                n = self.passage_map[neg]
                print("n")
                print(n)
                initial_triplets_count += 1
                triplets.append([q, p, n])
        extra_triplets_needed = max_triplets_per_query - initial_triplets_count
        while extra_triplets_needed > 0:
            p = self.passage_map[random.choice(all_pos_texts)]
            n = self.passage_map[random.choice(negatives)]
            triplets.append([q, p, n])
            extra_triplets_needed -= 1
    else:
        p = self.passage_map[positives[0]]
        for n in negatives:
            triplets.append([q, p, n])
    return triplets
```
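The single-positive branch can be patched by mapping the negative through passage_map as well. A minimal standalone sketch of that branch with the lookup added (this is my own reduction for illustration, not necessarily how the actual fix is implemented):

```python
def make_single_positive_triplets(q_id, positives, negatives, passage_map):
    """Build [query_id, positive_id, negative_id] triplets of integer ids
    for the case where a query has exactly one positive passage."""
    triplets = []
    p = passage_map[positives[0]]
    for neg in negatives:
        n = passage_map[neg]  # the buggy version appended `neg` (raw text) here
        triplets.append([q_id, p, n])
    return triplets

pm = {"pos text": 10, "neg one": 11, "neg two": 12}
print(make_single_positive_triplets(5, ["pos text"], ["neg one", "neg two"], pm))
```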
Update 2:
Actually, the same problem exists in the 3rd example notebook, if I understand it right. If, after the last cell, you were to train with:

```python
from pathlib import Path

trainer.data_dir = Path("./data/")
trainer.train(
    batch_size=8,
    nbits=4,                # How many bits will the trained model use when compressing indexes
    maxsteps=500000,        # Maximum steps hard stop
    use_ib_negatives=True,  # Use in-batch negatives to calculate loss
    dim=128,                # How many dimensions per embedding. 128 is the default and works well.
    learning_rate=5e-6,     # Learning rate. Small values ([3e-6, 3e-5]) work best if the base model is BERT-like; 5e-6 is often the sweet spot.
    doc_maxlen=256,         # Maximum document length. Because of how ColBERT works, smaller chunks (128-256) work very well.
    use_relu=False,         # Disable ReLU -- doesn't improve performance
    warmup_steps="auto",    # Defaults to 10%
)
```
you would get the same error, namely that the third element of the triplets is a str and not an index:
```
#> Starting...
nranks = 1  num_gpus = 1  device = 0
{
    "query_token_id": "[unused0]",
    "doc_token_id": "[unused1]",
    "query_token": "[Q]",
    "doc_token": "[D]",
    "ncells": null,
    "centroid_score_threshold": null,
    "ndocs": null,
    "load_index_with_mmap": false,
    "index_path": null,
    "nbits": 4,
    "kmeans_niters": 20,
    "resume": false,
    "similarity": "cosine",
    "bsize": 8,
    "accumsteps": 1,
    "lr": 5e-6,
    "maxsteps": 500000,
    "save_every": 8,
    "warmup": 8,
    "warmup_bert": null,
    "relu": false,
    "nway": 2,
    ...
[Jan07, 01:09:02] #> Got 64 queries. All QIDs are unique.
[Jan07, 01:09:02] #> Loading collection... 0M
/home/tm16/Work/00_RandomCoding/RAGatouille/v001/venv/lib/python3.11/site-packages/transformers/optimization.py:429: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning
  warnings.warn(
#> LR will use 8 warmup steps and linear decay over 500000 steps.
[89, 'Artists from Pixar and Aardman Studios signed a tribute stating, "You\'re our inspiration, Miyazaki-san!" He has also been cited as inspiration for video game designers including Shigeru Miyamoto on The Legend of Zelda and Hironobu Sakaguchi on Final Fantasy, as well as the television series Avatar: The Last Airbender, and the video game Ori and the Blind Forest (2015).Studio Ghibli has searched for some time for Miyazaki and Suzuki\'s successor to lead the studio; Kondō, the director of Whisper of the Heart, was initially considered, but died from a sudden heart attack in 1998. Some candidates were considered by 2023—including Miyazaki\'s son Goro, who declined—but the studio was not able to find a successor.']
Process Process-1:
Traceback (most recent call last):
  File "/usr/lib/python3.11/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/usr/lib/python3.11/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/tm16/Work/00_RandomCoding/RAGatouille/v001/venv/lib/python3.11/site-packages/colbert/infra/launcher.py", line 115, in setup_new_process
    return_val = callee(config, *args)
                 ^^^^^^^^^^^^^^^^^^^^^
  File "/home/tm16/Work/00_RandomCoding/RAGatouille/v001/venv/lib/python3.11/site-packages/colbert/training/training.py", line 87, in train
    for batch_idx, BatchSteps in zip(range(start_batch_idx, config.maxsteps), reader):
  File "/home/tm16/Work/00_RandomCoding/RAGatouille/v001/venv/lib/python3.11/site-packages/colbert/training/lazy_batcher.py", line 57, in __next__
    passages = [self.collection[pid] for pid in pids]
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/tm16/Work/00_RandomCoding/RAGatouille/v001/venv/lib/python3.11/site-packages/colbert/training/lazy_batcher.py", line 57, in <listcomp>
    passages = [self.collection[pid] for pid in pids]
                ~~~~~~~~~~~~~~~^^^^^
  File "/home/tm16/Work/00_RandomCoding/RAGatouille/v001/venv/lib/python3.11/site-packages/colbert/data/collection.py", line 25, in __getitem__
    return self.data[item]
           ~~~~~~~~~^^^^^^
TypeError: list indices must be integers or slices, not str
```
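The final frame makes the failure mode clear: the collection behaves like a plain Python list, so it can only be indexed by integer passage ids. A minimal reproduction (using an ordinary list as a stand-in, not the real colbert.data.Collection class):

```python
# Stand-in for the ColBERT collection: a list of passages indexed by pid.
collection = ["passage zero", "passage one"]

def lookup(pid):
    """Mimics Collection.__getitem__: return self.data[item]."""
    return collection[pid]

print(lookup(1))  # an integer pid works fine

try:
    # When the triplet carries the raw passage text instead of a pid:
    lookup("passage one")
except TypeError as e:
    print(e)  # the same TypeError as in the traceback above
```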
Tomorrow I will look at example notebook 2 again as a guide, since training worked there (the third triplet column contains indices).
Hey, thank you for flagging this. It's an oversight from how the code worked in testing, where the mapping occurred at an earlier stage (when training JaColBERT). Should be fixed in #21; I'm testing the third notebook locally with it again and will then merge + push to PyPI.
Thanks also for digging into the code to figure out the issue! This is much appreciated 😄
Just finished running the notebook - haven't started training (on the move and no GPU access at the moment), but can confirm the exported triplets are now in the right format.