Multiple fixes: labeled data processing, dict data, fixes examples/02-basic_training.ipynb #138

jonppe · 2024-02-14T18:51:02Z

Multiple small fixes and some refactorings while trying to fix the examples/02-basic_training.ipynb that seems to have been broken for some time.
Disclaimer: I'm not expert on all the data formats usually used in RAG inference/training so it could be I've mistaken something. Or perhaps the labeled pairs or other pattern in the example 2 are not really needed at all since the bugs prepare_training_data() feel somewhat fundamental... 😀

Anyway, the return value could be wrong in case when both document_splitter_fn and preprocessing_fn are None.

This test seems to succeed in CI but constantly fails on my machine due to identical score for 2 passages. "Specific works"... would be in the 4th retrieved passage: {'content': 'She met with Suzuki ... May 2006.', 'score': 24.421875, 'rank': 3, 'document_id': '82f22269-1fb8-479d-8...9f415dfdf9', 'passage_id': 45} {'content': 'Specific works that ... children.', 'score': 24.421875, 'rank': 4, 'document_id': '82f22269-1fb8-479d-8...9f415dfdf9', 'passage_id': 70}

…e code Avoid a few duplicate lines by adding a new method to make future changes easier. With even more work it could be perhaps possible to make .index() call .add_to_index() directly or other way round. Note: it feels like 'split_documents' argument should be removed. It would be easier to just check if 'document_splitter_fn' is None. This is the way how 'preprocessing_fn' argument is already used. 'split_documents' is now left only on the public methods.

Add test main formats. The test will currently fail. Fixes in later commits.

_process_raw_labeled_pairs() expects items in the format (query, passage, label). Thus we can not differentiate triplets and labeled pairs by using only length.

Queries with no positives or no negatives would results in access out of bounds.

Fix prepare_training_data() when the raw_data contains dicts. This is the case e.g. when using similar pattern to examples/02-basic_training.ipynb where the data is first processed with CorpusProcessor.

Use instance variables so that one can use multiple trainer objects in same python process. This also needed to fix test_prepare_training_data() test. Otherwise only the first parametrized test will succeed. Intance variables feels more natural here than e.g. some pytest fixture change.

Also here the item can contain dicts if use similar pattern to examples/02-basic_training.ipynb where the data is first processed with CorpusProcessor.

bclavie · 2024-02-14T20:26:24Z

Hey, this is hugely appreciated! I will properly review at a later point but just wanted to say thanks, the training part has definitely been in need of some love, especially to accommodate the document splitter (or rather, the splitter output needs some love)

Test training using the examples/02-basic_training.ipynb as an example. Alternative could be to test the actual notebook but it would probably bring some additional dependencies.

Skip the test when CUDA is not available. This is the case in CI. I'm not quite sure what would be needed to make it work without CUDA. Currently, without CUDA the test fails with something like: Process Process-1: Traceback (most recent call last): File "/opt/hostedtoolcache/Python/3.9.18/x64/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap self.run() File "/opt/hostedtoolcache/Python/3.9.18/x64/lib/python3.9/multiprocessing/process.py", line 108, in run self._target(*self._args, **self._kwargs) File "/home/runner/.cache/pypoetry/virtualenvs/ragatouille-EqBdPU0M-py3.9/lib/python3.9/site-packages/colbert/infra/launcher.py", line 134, in setup_new_process return_val = callee(config, *args) File "/home/runner/.cache/pypoetry/virtualenvs/ragatouille-EqBdPU0M-py3.9/lib/python3.9/site-packages/colbert/training/training.py", line 55, in train colbert = torch.nn.parallel.DistributedDataParallel(colbert, device_ids=[config.rank], File "/home/runner/.cache/pypoetry/virtualenvs/ragatouille-EqBdPU0M-py3.9/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 712, in __init__ self._log_and_throw( File "/home/runner/.cache/pypoetry/virtualenvs/ragatouille-EqBdPU0M-py3.9/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 1037, in _log_and_throw raise err_type(err_msg) ValueError: DistributedDataParallel device_ids and output_device arguments only work with single-device/multiple-device GPU modules or CPU modules, but got device_ids [0], output_device 0, and module parameters {device(type='cpu')}. =========================== short test summary info ============================ FAILED tests/test_training.py::test_training - AssertionError: Timout in training ============= 1 failed, 67 passed, 4 skipped in 269.14s (0:04:29) ============== Error: Process completed with exit code 1.

jonppe · 2024-02-15T13:56:33Z

I' didn't find any easy way to make the test_training.py work without CUDA so I added a conditional skip for now.
Fingers crossed that the tests pass now since I'm not sure if I have much time to look at this for a few days...

bclavie · 2024-02-16T19:07:00Z

tests/e2e/test_e2e_indexing_searching.py

-        in all_results[1][2]["content"]
+        "Specific works that have influenced Miyazaki include Animal Farm (1945)"
+        in actual
+        or "She met with Suzuki" in actual


Nearly all looks good to me -- but just a question here, is there a specific reason to include She met with Suzuki here? The data processing looks like it still functions the same, so the initial test should still work unless I'm missing something

The original test didn't work with me. I'm not sure what was is different on my machine but the 3rd and 4th results come in different order. And like the commit message says, they seem to have the same score.

This is strange... 🤔

The output I've been getting (same locally on multiple machines and in the CI is):

{'content': "Specific works that have[...]", 'score': 24.423484802246094, [...]}, {'content': 'She met with Suzuki in August 2005, [...]', 'score': 24.417688369750977, [...]},

The score are extremely similar to the point where it's possibly due to hardware differences, but still kind of odd!

(when casting to half precision, they do both have the same score, and since that's what a lot of people do, I'd say the issue is fairly minor and can be blamed on different hardware configs)

jonppe · 2024-02-16T19:20:53Z

Yes, this should fix #139
I maybe should have created an issue already earlier.

bclavie · 2024-02-16T19:24:41Z

Merging and releasing a 0.0.7post3 (0.0.8 I'm intending to be full vector indexes, and hopefully 0.0.9 better training with a full multilingual example!) with this soon, thank you so much! 🙏

jonppe added 9 commits February 14, 2024 20:22

CorpusProcessor.process_corpus(): fix type hint for return value

b95bc2d

Anyway, the return value could be wrong in case when both document_splitter_fn and preprocessing_fn are None.

Add unit test for prepare_training_data()

d7543fe

Add test main formats. The test will currently fail. Fixes in later commits.

RAGTrainer: fix parsing labeled pairs

e3cc2c4

_process_raw_labeled_pairs() expects items in the format (query, passage, label). Thus we can not differentiate triplets and labeled pairs by using only length.

training_data_processor: prevent out of bounds access with labeled pairs

5cf4b72

Queries with no positives or no negatives would results in access out of bounds.

RAGTrainer: fix prepare_training_data() with dict content

fd9e68f

Fix prepare_training_data() when the raw_data contains dicts. This is the case e.g. when using similar pattern to examples/02-basic_training.ipynb where the data is first processed with CorpusProcessor.

training_data_processor: fix _process_raw_pairs() with dict items

7d11afc

Also here the item can contain dicts if use similar pattern to examples/02-basic_training.ipynb where the data is first processed with CorpusProcessor.

Add tests/test_training.py

ec0bbbc

Test training using the examples/02-basic_training.ipynb as an example. Alternative could be to test the actual notebook but it would probably bring some additional dependencies.

jonppe force-pushed the fix_exercise_2_training branch from 10d7c38 to ec0bbbc Compare February 14, 2024 20:38

bclavie mentioned this pull request Feb 16, 2024

Bugfix/fix training notebook #140

Closed

bclavie reviewed Feb 16, 2024

View reviewed changes

bclavie approved these changes Feb 16, 2024

View reviewed changes

bclavie merged commit 2b9aa4c into bclavie:main Feb 16, 2024
2 checks passed

bclavie mentioned this pull request Feb 16, 2024

02-basic_training.ipynb fails #139

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Multiple fixes: labeled data processing, dict data, fixes examples/02-basic_training.ipynb #138

Multiple fixes: labeled data processing, dict data, fixes examples/02-basic_training.ipynb #138

jonppe commented Feb 14, 2024

bclavie commented Feb 14, 2024

jonppe commented Feb 15, 2024

bclavie Feb 16, 2024

jonppe Feb 16, 2024

bclavie Feb 16, 2024

bclavie Feb 16, 2024

jonppe commented Feb 16, 2024

bclavie commented Feb 16, 2024

Multiple fixes: labeled data processing, dict data, fixes examples/02-basic_training.ipynb #138

Multiple fixes: labeled data processing, dict data, fixes examples/02-basic_training.ipynb #138

Conversation

jonppe commented Feb 14, 2024

bclavie commented Feb 14, 2024

jonppe commented Feb 15, 2024

bclavie Feb 16, 2024

Choose a reason for hiding this comment

jonppe Feb 16, 2024

Choose a reason for hiding this comment

bclavie Feb 16, 2024

Choose a reason for hiding this comment

bclavie Feb 16, 2024

Choose a reason for hiding this comment

jonppe commented Feb 16, 2024

bclavie commented Feb 16, 2024