Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Multiple fixes: labeled data processing, dict data, fixes examples/02-basic_training.ipynb #138

Merged
merged 11 commits into from
Feb 16, 2024

Conversation

jonppe
Copy link
Contributor

@jonppe jonppe commented Feb 14, 2024

Multiple small fixes and some refactorings while trying to fix the examples/02-basic_training.ipynb that seems to have been broken for some time.
Disclaimer: I'm not expert on all the data formats usually used in RAG inference/training so it could be I've mistaken something. Or perhaps the labeled pairs or other pattern in the example 2 are not really needed at all since the bugs prepare_training_data() feel somewhat fundamental... 😀

Anyway, the return value could be wrong in case when both
document_splitter_fn and preprocessing_fn are None.
This test seems to succeed in CI but constantly fails on my machine
due to identical score for 2 passages. "Specific works"... would be
in the 4th retrieved passage:

  {'content': 'She met with Suzuki ... May 2006.', 'score': 24.421875, 'rank': 3, 'document_id': '82f22269-1fb8-479d-8...9f415dfdf9', 'passage_id': 45}
  {'content': 'Specific works that ... children.', 'score': 24.421875, 'rank': 4, 'document_id': '82f22269-1fb8-479d-8...9f415dfdf9', 'passage_id': 70}
…e code

Avoid a few duplicate lines by adding a new method to make future changes easier.
With even more work it could be perhaps possible to make .index()
call .add_to_index() directly or other way round.

Note: it feels like 'split_documents' argument should be removed.
It would be easier to just check if 'document_splitter_fn' is None.
This is the way how 'preprocessing_fn' argument is already used.
'split_documents' is now left only on the public methods.
Add test main formats.
The test will currently fail. Fixes in later commits.
_process_raw_labeled_pairs() expects items
in the format (query, passage, label).

Thus we can not differentiate triplets and labeled pairs
by using only length.
Queries with no positives or no negatives would results in access
out of bounds.
Fix prepare_training_data() when the raw_data contains dicts.

This is the case e.g. when using similar pattern to
examples/02-basic_training.ipynb
where the data is first processed with CorpusProcessor.
Use instance variables so that one can use multiple trainer objects
in same python process.

This also needed to fix test_prepare_training_data() test.
Otherwise only the first parametrized test will succeed.
Intance variables feels more natural here than e.g. some pytest
fixture change.
Also here the item can contain dicts if use similar pattern to
examples/02-basic_training.ipynb
where the data is first processed with CorpusProcessor.
@bclavie
Copy link
Owner

bclavie commented Feb 14, 2024

Hey, this is hugely appreciated! I will properly review at a later point but just wanted to say thanks, the training part has definitely been in need of some love, especially to accommodate the document splitter (or rather, the splitter output needs some love)

Test training using the examples/02-basic_training.ipynb
as an example. Alternative could be to test the actual notebook
but it would probably bring some additional dependencies.
Skip the test when CUDA is not available. This is the case in CI.
I'm not quite sure what would be needed to make it work without CUDA.

Currently, without CUDA the test fails with something like:

    Process Process-1:
    Traceback (most recent call last):
      File "/opt/hostedtoolcache/Python/3.9.18/x64/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap
        self.run()
      File "/opt/hostedtoolcache/Python/3.9.18/x64/lib/python3.9/multiprocessing/process.py", line 108, in run
        self._target(*self._args, **self._kwargs)
      File "/home/runner/.cache/pypoetry/virtualenvs/ragatouille-EqBdPU0M-py3.9/lib/python3.9/site-packages/colbert/infra/launcher.py", line 134, in setup_new_process
        return_val = callee(config, *args)
      File "/home/runner/.cache/pypoetry/virtualenvs/ragatouille-EqBdPU0M-py3.9/lib/python3.9/site-packages/colbert/training/training.py", line 55, in train
        colbert = torch.nn.parallel.DistributedDataParallel(colbert, device_ids=[config.rank],
      File "/home/runner/.cache/pypoetry/virtualenvs/ragatouille-EqBdPU0M-py3.9/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 712, in __init__
        self._log_and_throw(
      File "/home/runner/.cache/pypoetry/virtualenvs/ragatouille-EqBdPU0M-py3.9/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 1037, in _log_and_throw
        raise err_type(err_msg)
    ValueError: DistributedDataParallel device_ids and output_device arguments only work with single-device/multiple-device GPU modules or CPU modules, but got device_ids [0], output_device 0, and module parameters {device(type='cpu')}.
    =========================== short test summary info ============================
    FAILED tests/test_training.py::test_training - AssertionError: Timout in training
    ============= 1 failed, 67 passed, 4 skipped in 269.14s (0:04:29) ==============

    Error: Process completed with exit code 1.
@jonppe
Copy link
Contributor Author

jonppe commented Feb 15, 2024

I' didn't find any easy way to make the test_training.py work without CUDA so I added a conditional skip for now.
Fingers crossed that the tests pass now since I'm not sure if I have much time to look at this for a few days...

in all_results[1][2]["content"]
"Specific works that have influenced Miyazaki include Animal Farm (1945)"
in actual
or "She met with Suzuki" in actual
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nearly all looks good to me -- but just a question here, is there a specific reason to include She met with Suzuki here? The data processing looks like it still functions the same, so the initial test should still work unless I'm missing something

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The original test didn't work with me. I'm not sure what was is different on my machine but the 3rd and 4th results come in different order. And like the commit message says, they seem to have the same score.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is strange... 🤔

The output I've been getting (same locally on multiple machines and in the CI is):

 {'content': "Specific works that have[...]",
  'score': 24.423484802246094,
[...]},
 {'content': 'She met with Suzuki in August 2005, [...]',
  'score': 24.417688369750977,
[...]},

The score are extremely similar to the point where it's possibly due to hardware differences, but still kind of odd!

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(when casting to half precision, they do both have the same score, and since that's what a lot of people do, I'd say the issue is fairly minor and can be blamed on different hardware configs)

@jonppe
Copy link
Contributor Author

jonppe commented Feb 16, 2024

Yes, this should fix #139
I maybe should have created an issue already earlier.

@bclavie bclavie merged commit 2b9aa4c into bclavie:main Feb 16, 2024
2 checks passed
@bclavie
Copy link
Owner

bclavie commented Feb 16, 2024

Merging and releasing a 0.0.7post3 (0.0.8 I'm intending to be full vector indexes, and hopefully 0.0.9 better training with a full multilingual example!) with this soon, thank you so much! 🙏

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants