Dev split in preprocessing not working when not using multiprocessing #1738

Closed
MichelBartels opened this issue Nov 11, 2021 · 1 comment
Labels: topic:preprocessing · type:bug (Something isn't working) · wontfix (This will not be worked on)

@MichelBartels (Contributor) commented:

Describe the bug
When multiprocessing is disabled, preprocessing is done in one huge chunk:

results = map(partial(self._dataset_from_chunk, processor=self.processor), grouper(dicts, num_dicts)) # type: ignore
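For context, grouper(dicts, n) yields chunks of n dicts, and each chunk becomes one sub-dataset of the resulting ConcatDataset. Below is a minimal sketch of that behavior; this grouper is a simplified stand-in for Haystack's actual helper, not its real implementation:

from itertools import islice

def grouper(iterable, n):
    # Yield successive chunks of up to n items each.
    it = iter(iterable)
    while chunk := list(islice(it, n)):
        yield chunk

dicts = [{"id": i} for i in range(10)]
print(len(list(grouper(dicts, len(dicts)))))  # 1  -> one huge chunk
print(len(list(grouper(dicts, 1))))           # 10 -> one chunk per sample

So with num_dicts equal to the total number of dicts, the entire dataset ends up in a single chunk.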

However, the dev split method splits only at chunk boundaries and never divides a chunk:
def random_split_ConcatDataset(self, ds: ConcatDataset, lengths: List[int]):
    """
    Roughly split a Concatdataset into non-overlapping new datasets of given lengths.
    Samples inside Concatdataset should already be shuffled.
    :param ds: Dataset to be split.
    :param lengths: Lengths of splits to be produced.
    """
    if sum(lengths) != len(ds):
        raise ValueError("Sum of input lengths does not equal the length of the input dataset!")
    try:
        idx_dataset = np.where(np.array(ds.cumulative_sizes) > lengths[0])[0][0]
    except IndexError:
        raise Exception("All dataset chunks are being assigned to train set leaving no samples for dev set. "
                        "Either consider increasing dev_split or setting it to 0.0\n"
                        f"Cumulative chunk sizes: {ds.cumulative_sizes}\n"
                        f"train/dev split: {lengths}")
    assert idx_dataset >= 1, "Dev_split ratio is too large, there is no data in train set. " \
                             f"Please lower dev_split = {self.processor.dev_split}"
    train = ConcatDataset(ds.datasets[:idx_dataset])  # type: Dataset
    test = ConcatDataset(ds.datasets[idx_dataset:])  # type: Dataset
    return train, test
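
To see why this fails when everything is in one chunk: the split can only happen at a chunk boundary (an index into ds.cumulative_sizes), never inside a chunk. A toy illustration (a minimal sketch using torch.utils.data directly; the sizes are made up):

import numpy as np
import torch
from torch.utils.data import ConcatDataset, TensorDataset

# One huge chunk of 100 samples, dev_split = 0.1 -> lengths = [90, 10]
ds = ConcatDataset([TensorDataset(torch.arange(100))])
lengths = [90, 10]
print(ds.cumulative_sizes)  # [100]
# The first chunk boundary past the train length is the only possible split point:
idx_dataset = np.where(np.array(ds.cumulative_sizes) > lengths[0])[0][0]
print(idx_dataset)  # 0 -> ds.datasets[:0] would be an empty train set

With a single chunk, idx_dataset is 0, so the assert idx_dataset >= 1 fails and raises the error shown under "Error message" below.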

This means that dev split cannot work when multiprocessing is disabled.

It could be fixed by changing:

results = map(partial(self._dataset_from_chunk, processor=self.processor), grouper(dicts, num_dicts)) # type: ignore

to:

results = map(partial(self._dataset_from_chunk, processor=self.processor), grouper(dicts, 1))  # type: ignore

This would create one chunk per sample, so the dev split could again land on any sample boundary. But perhaps it would be better to rethink random_split_ConcatDataset, because it also causes other problems (#96).
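
One possible direction for such a rework would be to split at the sample level rather than at chunk boundaries, e.g. with torch.utils.data.random_split, so that chunk size no longer matters. A sketch only, not the fix that was eventually merged:

import torch
from torch.utils.data import ConcatDataset, TensorDataset, random_split

ds = ConcatDataset([TensorDataset(torch.arange(100))])  # one huge chunk
dev_split = 0.1
n_dev = int(dev_split * len(ds))
# random_split draws individual samples, so chunk boundaries are irrelevant
train, dev = random_split(ds, [len(ds) - n_dev, n_dev])
print(len(train), len(dev))  # 90 10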

Error message
AssertionError: Dev_split ratio is too large, there is no data in train set. Please lower dev_split = 0.1

To Reproduce

from haystack.nodes import FARMReader
reader = FARMReader(model_name_or_path="prajjwal1/bert-tiny")
reader.train(data_dir="data", train_filename="train-v2.0.json", dev_split=0.1, num_processes=1)
@bogdankostic added the topic:preprocessing and type:bug labels on Nov 19, 2021
@MichelBartels (Contributor, Author) commented:

This has been fixed with #1758 in the way described in this issue. However, as explained above, it would probably still make sense to rethink random_split_ConcatDataset.

@masci added the wontfix label on Mar 13, 2024
@masci closed this as completed on Mar 13, 2024