# Develop a New Operator

This tutorial shows you how to create custom operators for Data-Juicer. We'll cover the complete process from design to testing.

## Understanding Operator Types

Data-Juicer supports several types of operators:

1. **Mapper**: Transforms data samples
2. **Filter**: Removes data samples based on criteria
3. **Deduplicator**: Removes duplicate data samples
4. **Selector**: Selects a subset of data samples
5. **Grouper**: Groups data samples
6. **Aggregator**: Combines multiple data samples

Each operator type has its own base class in `data_juicer/ops/base_op.py`.
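
As a rough mental model, four of these operator types can be pictured as plain-Python functions over a list of samples. The sketch below is only an illustration: the `toy_*` functions are hypothetical stand-ins and do not use the Data-Juicer base classes.

```python
# Toy stand-ins for four operator types; not the Data-Juicer API.
samples = [{"text": "a"}, {"text": "bb"}, {"text": "bb"}, {"text": "ccc"}]

def toy_mapper(data):
    """Mapper: transform every sample, keeping the sample count."""
    return [{"text": s["text"].upper()} for s in data]

def toy_filter(data, min_len=2):
    """Filter: drop samples that fail a criterion."""
    return [s for s in data if len(s["text"]) >= min_len]

def toy_deduplicator(data):
    """Deduplicator: keep the first occurrence of each text."""
    seen, kept = set(), []
    for s in data:
        if s["text"] not in seen:
            seen.add(s["text"])
            kept.append(s)
    return kept

def toy_selector(data, top_k=1):
    """Selector: pick a subset by ranking over the whole dataset."""
    return sorted(data, key=lambda s: len(s["text"]), reverse=True)[:top_k]

mapped = toy_mapper(samples)
filtered = toy_filter(samples)
deduped = toy_deduplicator(samples)
selected = toy_selector(samples)
print(len(mapped), len(filtered), len(deduped), len(selected))  # 4 3 3 1
```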

## Coding for Your Operator

Before implementing a new operator, check the existing [Operators Zoo](../docs/Operators.md) to avoid unnecessary duplication.

### 1. Create a New Operator File

Create a new operator file in the appropriate directory under `data_juicer/ops/`. For example, for a filter operator, create the file in `data_juicer/ops/filter/`.

Let's implement a `YourTextLengthFilter` as an example.

(Optional) If the new OP defines statistical variables, add a corresponding attribute to `StatsKeysConstant` in `data_juicer/utils/constant.py` so that all stats keys are managed in one place.

In [None]:
class StatsKeysConstant(object):
    # ... other keys
    text_len = 'text_len'

In [None]:
import sys
from typing import Dict, Any
from jsonargparse.typing import PositiveInt

from data_juicer.utils.constant import Fields
# NOTE: StatsKeys exposes the keys defined on StatsKeysConstant (see above)
from data_juicer.utils.constant import StatsKeys
from data_juicer.ops.base_op import OPERATORS, Filter

@OPERATORS.register_module('your_text_length_filter')
class YourTextLengthFilter(Filter):
    """Filter to keep samples with total text length within a specific
    range. """

    def __init__(self,
                 min_len: PositiveInt = 10,
                 max_len: PositiveInt = sys.maxsize,
                 *args,
                 **kwargs):
        """
        Initialize the text length filter.

        :param min_len: Minimum text length threshold
        :param max_len: Maximum text length threshold
        :param args: Additional arguments
        :param kwargs: Additional keyword arguments
        """
        super().__init__(*args, **kwargs)
        self.min_len = min_len
        self.max_len = max_len

    def compute_stats_single(self, sample: Dict[str, Any], context: bool = False) -> Dict[str, Any]:
        # Check if already computed
        if StatsKeys.text_len in sample[Fields.stats]:
            return sample

        # Store the computed statistic
        sample[Fields.stats][StatsKeys.text_len] = len(sample[self.text_key])
        return sample

    def process_single(self, sample: Dict[str, Any]) -> bool:
        length = sample[Fields.stats][StatsKeys.text_len]
        return self.get_keep_boolean(length, self.min_len, self.max_len)

### Key Implementation Details

1. **Registration**: Use `@OPERATORS.register_module("operator_name")` to register your operator.

2. **Inheritance**: Inherit from the appropriate base class (`Filter`, `Mapper`, etc.) from `data_juicer.ops.base_op`.

3. **Single Processing**: Implement `compute_stats_single` and `process_single` methods.

4. **Stats Computation**: Use `Fields.stats` and `StatsKeys` from `data_juicer.utils.constant` to store and access statistics.

5. **Configuration Parameters**: Accept parameters in `__init__` method for customization.
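
The two-phase flow behind points 3 and 4 can be sketched without Data-Juicer at all. In this minimal sketch, `STATS` and `TEXT_LEN` are hypothetical stand-ins for `Fields.stats` and `StatsKeys.text_len`, and the range check assumes inclusive bounds:

```python
# Plain-Python sketch of the two-phase filter flow; STATS and TEXT_LEN are
# hypothetical stand-ins for Fields.stats and StatsKeys.text_len.
STATS = "stats"
TEXT_LEN = "text_len"

def compute_stats_single(sample, text_key="text"):
    """Phase 1: compute the statistic once and cache it on the sample."""
    stats = sample.setdefault(STATS, {})
    if TEXT_LEN not in stats:
        stats[TEXT_LEN] = len(sample[text_key])
    return sample

def process_single(sample, min_len=4, max_len=6):
    """Phase 2: decide whether to keep the sample using only cached stats.

    Inclusive bounds are an assumption of this sketch.
    """
    return min_len <= sample[STATS][TEXT_LEN] <= max_len

samples = [{"text": "123"}, {"text": "12345"}, {"text": "1234567"}]
with_stats = [compute_stats_single(s) for s in samples]
kept = [s["text"] for s in with_stats if process_single(s)]
print(kept)  # ['12345']
```

Separating the two phases means the keep decision only reads cached statistics, so the same stats can be reused or exported without recomputing them.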

### 2. Add to Operator Imports

Add your new operator to the corresponding `__init__.py` file in the operator directory. For example, for a filter operator, add it to `data_juicer/ops/filter/__init__.py`:

```python
# other OPs
from .your_text_length_filter import YourTextLengthFilter  # import this new OP class
__all__ = [
    # other Ops
    "YourTextLengthFilter",  # add this new Op to __all__
]
```
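
One reason the import matters is that a registration decorator only runs when the decorated class definition is executed, which happens when its module is imported. The sketch below illustrates the pattern with a hypothetical `ToyRegistry`; it is not Data-Juicer's actual registry implementation.

```python
# Minimal stand-in registry; the real OPERATORS registry is richer than this.
class ToyRegistry:
    def __init__(self, name):
        self.name = name
        self.modules = {}

    def register_module(self, module_name):
        """Return a class decorator that records the class under module_name."""
        def decorator(cls):
            self.modules[module_name] = cls
            return cls
        return decorator

TOY_OPERATORS = ToyRegistry("toy_operators")

# The decorator runs when this class statement is executed, i.e. when the
# defining module is imported.
@TOY_OPERATORS.register_module("your_text_length_filter")
class YourTextLengthFilter:
    pass

# Look the class up by its registered name, as a config-driven loader might.
op_cls = TOY_OPERATORS.modules["your_text_length_filter"]
print(op_cls.__name__)  # YourTextLengthFilter
```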

## Testing for Your Operator

It's good practice to add tests for your own OPs. For `YourTextLengthFilter` above, add `test_your_text_length_filter.py` to the `tests/ops/filter/` directory as shown below.

In [None]:
import unittest

from data_juicer.core.data import NestedDataset as Dataset

# NOTE: this notebook reuses the class defined above; in a real test file, import it:
# from data_juicer.ops.filter.your_text_length_filter import YourTextLengthFilter
from data_juicer.utils.unittest_utils import DataJuicerTestCaseBase

class YourTextLengthFilterTest(DataJuicerTestCaseBase):

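    def _run_text_length_filter(self, dataset: Dataset, target_list, op):
        dataset = op.run(dataset)
        # Assumption: the filter leaves its stats column on the dataset, so
        # keep only the text column before comparing with the target.
        dataset = dataset.select_columns(column_names=['text'])
        res_list = dataset.to_list()
        self.assertEqual(res_list, target_list)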

    def test_case1(self):

        ds_list = [{
            'text': '123'
        }, {
            'text': '12345'
        }, {
            'text': '1234567'
        }]
        tgt_list = [{
            'text': '12345'
        }]
        dataset = Dataset.from_list(ds_list)
        op = YourTextLengthFilter(min_len=4, max_len=6)
        self._run_text_length_filter(dataset, tgt_list, op)

if __name__ == '__main__':
    # unittest.main()
    pass
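
Boundary values are worth covering as well. The following is a minimal, framework-only sketch: `keep_by_length` is a hypothetical stand-in for the filter decision and assumes inclusive bounds, so adjust the expectations if your operator behaves differently. The suite is run programmatically so that it also works inside a notebook.

```python
import unittest

def keep_by_length(text, min_len, max_len):
    """Hypothetical stand-in for the filter decision (inclusive bounds assumed)."""
    return min_len <= len(text) <= max_len

class KeepByLengthBoundaryTest(unittest.TestCase):

    def test_bounds(self):
        cases = [
            ("123", False),      # below min_len
            ("1234", True),      # exactly min_len
            ("123456", True),    # exactly max_len
            ("1234567", False),  # above max_len
        ]
        for text, expected in cases:
            # subTest reports each failing case separately
            with self.subTest(text=text):
                self.assertEqual(keep_by_length(text, 4, 6), expected)

# Load and run the test case without calling unittest.main()
suite = unittest.defaultTestLoader.loadTestsFromTestCase(KeepByLengthBoundaryTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```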

## Advanced Features

### CUDA Support

If your operator uses models that can be accelerated with CUDA, you can enable GPU support:

```python
# ... (imports same as above)
from data_juicer.utils.model_utils import get_model, prepare_model

@OPERATORS.register_module('your_text_length_filter_with_cuda')
class YourTextLengthFilterWithCuda(Filter):
    _accelerator = 'cuda'  # Enable CUDA acceleration

    def __init__(self,
                 hf_model: str = 'bert-base-uncased',
                 min_len: PositiveInt = 10,
                 max_len: PositiveInt = sys.maxsize,
                 *args,
                 **kwargs):
        # ... (same as above)
        # Register the model once and keep only its key on the operator
        self.model_key = prepare_model(
            model_type="huggingface", pretrained_model_name_or_path=hf_model
        )

    def compute_stats_single(self, sample, rank=None, context=False):
        # ... (some code)
        # Fetch the model for this worker; `rank` selects the device
        model, _ = get_model(self.model_key, rank, self.use_cuda())
        # ... (use the model to compute stats)
        return sample

    def process_single(self, sample=None, rank=None):
        # ... (same as above)
        ...
```

### Batch Support

If an operator consumes and produces multiple samples at once, declare `_batched_op = True` so that inputs and outputs are passed as batches. In `process_batched`, `samples` is a dict that maps each field name to a list of values, one per sample.

In [None]:

from data_juicer.ops.base_op import OPERATORS, Mapper

@OPERATORS.register_module('your_batch_mapper')
class YourBatchMapper(Mapper):
    """A mapper operator processing batched samples."""

    _batched_op = True
    
    def __init__(self,
                *args,
                **kwargs):

        super().__init__(*args, **kwargs)

    def process_batched(self, samples):
        for idx, text in enumerate(samples[self.text_key]):
            samples[self.text_key][idx] = text + f": {len(text)}"
        return samples

In [None]:
from data_juicer.core.data import NestedDataset

ds_list = [{
    'text': '123'
}, {
    'text': '12345'
}, {
    'text': '1234567'
}]
print('unbatched samples', ds_list)
dataset = NestedDataset.from_list(ds_list)
op = YourBatchMapper()
dataset = op.run(dataset)
print(dataset.to_list())
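
To make the batched layout concrete, the sketch below converts between a list of sample dicts (row-wise) and a dict of lists (column-wise) in plain Python. The helpers `to_batched` and `to_rows` are hypothetical names used only for illustration; in practice the dataset machinery handles this conversion.

```python
# Plain-Python sketch of the batched layout: a dict of lists (column-wise)
# instead of a list of dicts (row-wise). Helper names are hypothetical.
def to_batched(rows):
    """Convert a list of sample dicts into a dict of lists."""
    keys = rows[0].keys() if rows else []
    return {k: [r[k] for r in rows] for k in keys}

def to_rows(batch):
    """Convert a dict of lists back into a list of sample dicts."""
    n = len(next(iter(batch.values()))) if batch else 0
    return [{k: v[i] for k, v in batch.items()} for i in range(n)]

def process_batched(samples, text_key="text"):
    """Same transformation as YourBatchMapper, written without the base class."""
    samples[text_key] = [f"{t}: {len(t)}" for t in samples[text_key]]
    return samples

rows = [{"text": "123"}, {"text": "12345"}, {"text": "1234567"}]
batch = to_batched(rows)
print(batch)  # {'text': ['123', '12345', '1234567']}

result_rows = to_rows(process_batched(batch))
print(result_rows)
```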

### Transfer Filename

Call `transfer_filename` and `add_suffix_to_filename` to generate unique paths when saving extra data such as images and videos. This prevents outputs from overwriting each other and keeps concurrent processes safe.

In [None]:
from data_juicer.utils.file_utils import add_suffix_to_filename, transfer_filename
from data_juicer.ops.op_fusion import LOADED_VIDEOS
from data_juicer.ops.base_op import OPERATORS, Mapper
# ... (import some other libraries)

OP_NAME = 'your_video_split_by_key_frame_mapper'
@OPERATORS.register_module(OP_NAME)
@LOADED_VIDEOS.register_module(OP_NAME)
class YourVideoSplitByKeyFrameMapper(Mapper):
    _batched_op = True
    
    def __init__(self,
                 # ... (OP parameters)
                 split_num=1,
                 *args,
                 **kwargs):
        super().__init__(*args, **kwargs)
        self._init_parameters = self.remove_extra_parameters(locals())
        print(f'init parameters: {self._init_parameters}')
        self.split_num = split_num

    def process_batched(self, sample):
        # ... (some code)
        original_video_path = sample['videos'][0]
        base_video_path = transfer_filename(
            original_video_path, OP_NAME, **self._init_parameters)
        print(f'base path: {base_video_path}')
        for count in range(self.split_num):
            split_video_path = add_suffix_to_filename(base_video_path, f'_{count}')
            print(f'split {count} path: {split_video_path}')
        # ... (some code)


In [None]:
sample = {'videos': ['./video.mp4']}
print('------ 2 splits ------')
op = YourVideoSplitByKeyFrameMapper(split_num=2)
op.process(sample)
print('------ 3 splits ------')
op = YourVideoSplitByKeyFrameMapper(split_num=3)
op.process(sample)
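
The idea of deriving a path from the OP name and its parameters can be sketched with the standard library alone. The functions `toy_unique_path` and `toy_add_suffix` below are hypothetical stand-ins; the naming scheme that `transfer_filename` and `add_suffix_to_filename` actually produce may differ.

```python
import hashlib
import os

def toy_unique_path(original_path, op_name, **op_params):
    """Hypothetical stand-in: derive an output path from the OP name and its
    parameters so that different configurations get different file names."""
    # Sort the parameters so the same configuration always hashes identically.
    param_str = repr(sorted(op_params.items()))
    digest = hashlib.sha256(param_str.encode("utf-8")).hexdigest()[:8]
    parent, filename = os.path.split(original_path)
    stem, ext = os.path.splitext(filename)
    return os.path.join(parent, op_name, f"{stem}_{digest}{ext}")

def toy_add_suffix(path, suffix):
    """Hypothetical stand-in: insert a suffix before the file extension."""
    stem, ext = os.path.splitext(path)
    return f"{stem}{suffix}{ext}"

base_2 = toy_unique_path("./video.mp4", "toy_split_mapper", split_num=2)
base_3 = toy_unique_path("./video.mp4", "toy_split_mapper", split_num=3)
splits = [toy_add_suffix(base_2, f"_{i}") for i in range(2)]
print(base_2)
print(base_3)
print(splits)
```

Because the parameters feed into the hash, rerunning the same OP with a different `split_num` writes to a different base path, while the per-split suffix keeps files from one run apart.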

## Finish the Documents

To help other users discover and use the new operator, update the following documents as well.

- `configs/config_all.yaml`: this complete config file lists all OPs and their arguments and serves as an important reference for users looking up available OPs. After adding the new OP, add it to the process list (grouped by OP type and sorted in alphabetical order):
   
   ```yaml
   ...
   - your_text_length_filter:                                # filter text with length out of specific range
       min_len: 10                                             # the min length of filter range
       max_len: 10000                                          # the max length of filter range
   ...
   ```

- `docs/Operators.md`: this doc maintains categorized lists of available OPs. It is automatically generated and kept up to date via a pre-commit hook, so manual edits to the operator tables are usually unnecessary. However, please manually update the "ref" column for your operator to include references (e.g., papers, links) or contributor information.

## Coding Style

We define our styles in `.pre-commit-config.yaml`. Before committing,
install the `pre-commit` tool and use it to check your changes and fix any reported issues:

```shell
# ===========install pre-commit tool===========
pip install pre-commit

cd <path_to_data_juicer>
# install pre-commit script for data_juicer
pre-commit install


# ===========check all files===========
git add .
pre-commit run --all-files

# commit after all checks pass
git commit -m "xxxx"
```

**Note**: Pre-commit checks are also configured in the GitHub workflow. If this
check fails in your PR, locally ① make sure that the pre-commit
dependencies are consistent with the project configuration
(running `pre-commit clean` and then `pre-commit install` achieves this);
and ② run `pre-commit run --all-files` before pushing.


## (Optional) Make your OP fusible

- If the new OP computes intermediate variables that existing OPs also compute, it can be added to the fusible OPs to speed up
the whole pipeline through OP fusion (e.g. both the `word_num_filter` and `word_repetition_filter` need to split the input text into words).
- When OP fusion is enabled, these shared computations and intermediate variables are passed between OPs through the `context`,
which reduces repeated calculation.
- OPs that share intermediate variables can be made fusible through the following steps:

1. (Optional) If the new OP generates a new intermediate variable, add its name to the `InterVars` class in
`data_juicer/utils/constant.py`. In general, the name should be prefixed with `DEFAULT_PREFIX`.

```python
    class InterVars(object):
        # text
        lines = DEFAULT_PREFIX + 'lines'
        words = DEFAULT_PREFIX + 'words'  # add the new intermediate variable here
        ...
```

2. (Optional) Define a registry group in `data_juicer/ops/op_fusion.py` for the new intermediate variable from step 1, and add
this registry group to the list that stores all intermediate-variable groups. This lets the OP fusion module
track the OPs that involve this intermediate variable.

```python
    ...
    # Type of intermediate vars
    # text
    INTER_LINES = Registry(InterVars.lines)
    INTER_WORDS = Registry(InterVars.words)  # define registry group for the new intermediate variable

    # images
    LOADED_IMAGES = Registry(InterVars.loaded_images)

    # all
    ALL_INTER_VARS = [INTER_LINES, INTER_WORDS, LOADED_IMAGES]  # and add it to the registry group list
    ...
```

3. Before the OP class definition that involves the intermediate variable, register this OP in the registry group corresponding
to this intermediate variable, indicating that the intermediate variable may be calculated and used in this OP.

```python
    ...
    @OPERATORS.register_module(OP_NAME)
    @INTER_WORDS.register_module(OP_NAME)  # register this new OP into the registry group
    class WordNumFilter(Filter):
    ...
```

4. In the part of the new OP that computes this intermediate variable, change the logic as follows:
   1. If the argument `context` is True, OP fusion is enabled, so first try to get the value of the intermediate variable
   from the `context`, where a previous OP may already have stored it.
   2. If the intermediate variable is not in the `context` yet, this OP is the first to compute it. Define a unique key,
   compute the variable, and store it in the `context` under that key for subsequent OPs.
   3. If the argument `context` is False, follow the normal calculation process.

```python
    # before modification
    ...
    tokenizer = get_model(self.model_key)
    words = get_words_from_document(
        sample[self.text_key],
        token_func=tokenizer.encode_as_pieces if tokenizer else None)
    ...        

    # after modification
    ...
    words_key = f'{InterVars.words}-{self.model_key}'
    if context and words_key in sample[Fields.context]:
        # get the value of intermediate variable from context directly
        words = sample[Fields.context][words_key]
    else:
        # normal calculation process
        tokenizer = get_model(self.model_key)
        words = get_words_from_document(
            sample[self.text_key],
            token_func=tokenizer.encode_as_pieces if tokenizer else None)
        if context:
            # After calculating the intermediate variable for the first time,
            # store it in the context for subsequent OPs.
            sample[Fields.context][words_key] = words
    ...
```
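
The effect of sharing through the context can be seen in a toy setting. In the sketch below, all names (`CONTEXT`, `WORDS_KEY`, `toy_word_num`, `toy_repetition_ratio`) are hypothetical and unrelated to the real operators; a counter records how often the "expensive" splitting step actually runs.

```python
# Toy illustration of sharing an intermediate variable through a context
# dict. All names here are hypothetical stand-ins, not Data-Juicer APIs.
CONTEXT = "context"
WORDS_KEY = "words"
calls = {"tokenize": 0}

def get_words(sample, use_context):
    """Return the words of a sample, reusing a cached copy when allowed."""
    if use_context and WORDS_KEY in sample[CONTEXT]:
        return sample[CONTEXT][WORDS_KEY]
    calls["tokenize"] += 1  # count how often the expensive step really runs
    words = sample["text"].split()
    if use_context:
        # Store the result for the OPs that run after this one
        sample[CONTEXT][WORDS_KEY] = words
    return words

def toy_word_num(sample, use_context=False):
    """Number of whitespace-separated words."""
    return len(get_words(sample, use_context))

def toy_repetition_ratio(sample, use_context=False):
    """Share of words that repeat an earlier word."""
    words = get_words(sample, use_context)
    return 1 - len(set(words)) / len(words) if words else 0.0

sample = {"text": "a b b c", CONTEXT: {}}
num = toy_word_num(sample, use_context=True)
ratio = toy_repetition_ratio(sample, use_context=True)
print(num, ratio, calls["tokenize"])  # 4 0.25 1
```

With `use_context=False`, each function would split the text on its own and the counter would reach 2 for the same two calls.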

## Next Steps

Continue with the next notebook to explore real-world applications and case studies of Data-Juicer.

## Additional Resources

- [Operator Development Guide](https://modelscope.github.io/data-juicer/en/main/docs/DeveloperGuide.html)
- [Existing Operators](https://github.com/modelscope/data-juicer/tree/main/data_juicer/ops)
- [Test Examples](https://github.com/modelscope/data-juicer/tree/main/tests/ops)