Description
CI tests for some of the featurizers were failing with an error similar to
The error occurs from NumPy version 1.24 onward.
During featurization, some featurizers returned arrays of different shapes for different data points. This was especially the case with tokenizers, where the shape of the returned array depends on the number of input tokens. NumPy versions prior to 1.24 tolerated this, but from 1.24 onward NumPy expects such data points to have a dtype of object. There were two possible solutions:
1. Set dtype as object in the base featurizer class (a dtype of object says that the object stored in the numpy array is a Python datatype).
2. Pad the featurizer outputs so that every data point has the same shape.
I chose the latter one because:
Hence, I added padding to max length for the roberta tokenizer and the reaction tokenizer, and updated their tests accordingly.
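The difference between the object-dtype approach and padding can be illustrated with a minimal sketch. It is not the code from this PR: the token ids, the helper names `to_object_array` and `pad_to_max_length`, and the pad value of 0 are all hypothetical placeholders (a real tokenizer defines its own pad token id).

```python
import numpy as np

# Hypothetical token-id sequences of different lengths, standing in for
# the output of a tokenizer-based featurizer.
ragged = [[0, 42, 7, 2], [0, 13, 2], [0, 5, 9, 11, 2]]

# On NumPy >= 1.24 building an array from ragged input raises ValueError;
# older releases created an object array instead.
try:
    np.asarray(ragged)
except ValueError as err:
    print("ragged input rejected:", err)


def to_object_array(seqs):
    """Store each sequence as a separate Python object in a 1-D array."""
    out = np.empty(len(seqs), dtype=object)
    for i, seq in enumerate(seqs):
        out[i] = np.asarray(seq)
    return out


def pad_to_max_length(seqs, max_length, pad_value=0):
    """Truncate or right-pad every sequence to exactly max_length entries."""
    out = np.full((len(seqs), max_length), pad_value, dtype=np.int64)
    for i, seq in enumerate(seqs):
        trimmed = seq[:max_length]
        out[i, :len(trimmed)] = trimmed
    return out


obj = to_object_array(ragged)
padded = pad_to_max_length(ragged, max_length=6)
print(obj.dtype, obj.shape)        # object (3,)
print(padded.dtype, padded.shape)  # int64 (3, 6)
```

The padded result is a regular two-dimensional integer array, so it can be stacked, sliced and compared in tests like any other feature matrix, whereas the object array only holds references to arrays of differing lengths.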
I also noticed a failure in xgboost, which I fixed as well.
Type of change
Please check the option that is related to your PR.
Checklist
Run yapf -i <modified file> and check no errors (yapf version must be 0.32.0)
Run mypy -p deepchem and check no errors
Run flake8 <modified file> --count and check no errors
Run python -m doctest <modified file> and check no errors