Dev split in preprocessing not working when not using multiprocessing #1738

Closed
MichelBartels opened this issue Nov 11, 2021 · 1 comment
Labels: topic:preprocessing · type:bug (Something isn't working) · wontfix (This will not be worked on)

@MichelBartels (Contributor) commented:

Describe the bug
When multiprocessing is disabled, preprocessing is done in one huge chunk:

results = map(partial(self._dataset_from_chunk, processor=self.processor), grouper(dicts, num_dicts)) # type: ignore
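For context, grouper(dicts, n) yields chunks of n dicts, and each chunk becomes one sub-dataset of the resulting ConcatDataset. Below is a minimal sketch of that behavior; this grouper is a simplified stand-in for Haystack's actual helper, not its real implementation:

from itertools import islice

def grouper(iterable, n):
    # Yield successive chunks of up to n items each.
    it = iter(iterable)
    while chunk := list(islice(it, n)):
        yield chunk

dicts = [{"id": i} for i in range(10)]
print(len(list(grouper(dicts, len(dicts)))))  # 1  -> one huge chunk
print(len(list(grouper(dicts, 1))))           # 10 -> one chunk per sample

So with num_dicts equal to the total number of dicts, the entire dataset ends up in a single chunk.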

However, the dev split method splits only at chunk boundaries and never divides a chunk:
def random_split_ConcatDataset(self, ds: ConcatDataset, lengths: List[int]):
    """
    Roughly split a Concatdataset into non-overlapping new datasets of given lengths.
    Samples inside Concatdataset should already be shuffled.
    :param ds: Dataset to be split.
    :param lengths: Lengths of splits to be produced.
    """
    if sum(lengths) != len(ds):
        raise ValueError("Sum of input lengths does not equal the length of the input dataset!")
    try:
        idx_dataset = np.where(np.array(ds.cumulative_sizes) > lengths[0])[0][0]
    except IndexError:
        raise Exception("All dataset chunks are being assigned to train set leaving no samples for dev set. "
                        "Either consider increasing dev_split or setting it to 0.0\n"
                        f"Cumulative chunk sizes: {ds.cumulative_sizes}\n"
                        f"train/dev split: {lengths}")
    assert idx_dataset >= 1, "Dev_split ratio is too large, there is no data in train set. " \
                             f"Please lower dev_split = {self.processor.dev_split}"
    train = ConcatDataset(ds.datasets[:idx_dataset])  # type: Dataset
    test = ConcatDataset(ds.datasets[idx_dataset:])  # type: Dataset
    return train, test
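
To see why this fails when everything is in one chunk: the split can only happen at a chunk boundary (an index into ds.cumulative_sizes), never inside a chunk. A toy illustration (a minimal sketch using torch.utils.data directly; the sizes are made up):

import numpy as np
import torch
from torch.utils.data import ConcatDataset, TensorDataset

# One huge chunk of 100 samples, dev_split = 0.1 -> lengths = [90, 10]
ds = ConcatDataset([TensorDataset(torch.arange(100))])
lengths = [90, 10]
print(ds.cumulative_sizes)  # [100]
# The first chunk boundary past the train length is the only possible split point:
idx_dataset = np.where(np.array(ds.cumulative_sizes) > lengths[0])[0][0]
print(idx_dataset)  # 0 -> ds.datasets[:0] would be an empty train set

With a single chunk, idx_dataset is 0, so the assert idx_dataset >= 1 fails and raises the error shown under "Error message" below.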

This means that dev split cannot work when multiprocessing is disabled.

It could be fixed by changing:

results = map(partial(self._dataset_from_chunk, processor=self.processor), grouper(dicts, num_dicts)) # type: ignore

to:

results = map(partial(self._dataset_from_chunk, processor=self.processor), grouper(dicts, 1))  # type: ignore

This would create one chunk per sample, so the dev split could again land on any sample boundary. But perhaps it would be better to rethink random_split_ConcatDataset, because it also causes other problems (#96).
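
One possible direction for such a rework would be to split at the sample level rather than at chunk boundaries, e.g. with torch.utils.data.random_split, so that chunk size no longer matters. A sketch only, not the fix that was eventually merged:

import torch
from torch.utils.data import ConcatDataset, TensorDataset, random_split

ds = ConcatDataset([TensorDataset(torch.arange(100))])  # one huge chunk
dev_split = 0.1
n_dev = int(dev_split * len(ds))
# random_split draws individual samples, so chunk boundaries are irrelevant
train, dev = random_split(ds, [len(ds) - n_dev, n_dev])
print(len(train), len(dev))  # 90 10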

Error message
AssertionError: Dev_split ratio is too large, there is no data in train set. Please lower dev_split = 0.1

To Reproduce

from haystack.nodes import FARMReader
reader = FARMReader(model_name_or_path="prajjwal1/bert-tiny")
reader.train(data_dir="data", train_filename="train-v2.0.json", dev_split=0.1, num_processes=1)
@bogdankostic added the topic:preprocessing and type:bug labels on Nov 19, 2021
@MichelBartels (Contributor, Author) commented:

This has been fixed with #1758 in the way described in this issue. However, as explained above, it would probably still make sense to rethink random_split_ConcatDataset.

@masci added the wontfix label on Mar 13, 2024
@masci closed this as completed on Mar 13, 2024