# Build Your Own Data Recipes

We are already familiar with the basic structure and global arguments of Data-Juicer data recipes (also called data configs). In this notebook, we will learn how to build your own data recipes, starting from existing recipes or from the [config_all.yaml](https://github.com/modelscope/data-juicer/blob/main/configs/config_all.yaml) file.

Here we set only the basic global arguments, namely the input and output dataset paths and the number of subprocesses, so that we can focus on refining the operator list.

We use the demo dataset as an example.

```yaml
project_name: 'build_my_own_recipe'
dataset_path: '../demos/data/demo-dataset.jsonl'  # replace it with the path to your dataset directory or file
np: 4  # number of subprocesses used to process your dataset
export_path: './outputs/my_own_recipe/res.jsonl'

process:
  - language_id_score_filter:
      lang: 'zh'
      min_score: 0.8
```

Next, we decide which operators (OPs) to add.


## Operator list

Data-Juicer offers an extensive array of OPs for data manipulation, encompassing modification, cleansing, filtering, and deduplication tasks.

A data recipe must list the OPs it needs together with their arguments. Data-Juicer runs the OPs sequentially, in the order they appear in the OP list.
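Because the OPs run in the order listed, swapping two entries can change the result. The sketch below is a toy, pure-Python illustration of that idea, not Data-Juicer's executor: `run_ops`, `strip_mapper`, and `min_length_filter` are hypothetical names, with a mapper modeled as a function that rewrites a sample and a filter as a function that decides whether to keep it.

```python
from typing import Callable, Dict, List, Tuple

Sample = Dict[str, str]
# Each toy OP is a (kind, function) pair; kind is "mapper" or "filter".
Op = Tuple[str, Callable]


def strip_mapper(sample: Sample) -> Sample:
    """Toy mapper: return a copy of the sample with outer whitespace removed."""
    return {**sample, "text": sample["text"].strip()}


def min_length_filter(sample: Sample) -> bool:
    """Toy filter: keep samples whose text has at least 5 characters."""
    return len(sample["text"]) >= 5


def run_ops(samples: List[Sample], ops: List[Op]) -> List[Sample]:
    """Apply every OP to the whole dataset, one OP after another."""
    for kind, fn in ops:
        if kind == "mapper":
            samples = [fn(s) for s in samples]
        elif kind == "filter":
            samples = [s for s in samples if fn(s)]
        else:
            raise ValueError(f"unknown OP kind: {kind}")
    return samples


data = [{"text": "  hi   "}, {"text": "hello world"}]

map_then_filter = run_ops(
    data, [("mapper", strip_mapper), ("filter", min_length_filter)])
filter_then_map = run_ops(
    data, [("filter", min_length_filter), ("mapper", strip_mapper)])

print(len(map_then_filter))  # only "hello world" survives
print(len(filter_then_map))  # "  hi   " passes the filter before it is stripped
```

In this toy case, putting the mapper first drops the padded sample, while putting the filter first lets it through, which is why the position of each OP in the list deserves some thought.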

Starting from the single-OP list above, we can add other useful OPs.

For example, for textual samples, we can add a `whitespace_normalization_mapper` to convert the various whitespace characters in the text to the standard ASCII space, which is friendlier to LLM tokenizers. Deduplication is also usually worthwhile for large-scale datasets because it improves training efficiency, so we can add a `document_deduplicator` to remove duplicate texts from the dataset.
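To make the two ideas concrete, here is a minimal sketch of whitespace normalization and md5-based exact deduplication using only the standard library. These are toy stand-ins, not Data-Juicer's implementations: the function names and the short list of Unicode spaces are assumptions, and the behavior given to `lowercase` and `ignore_non_character` is only one plausible reading of those argument names.

```python
import hashlib
import re
import string
from typing import Dict, List

# A small, illustrative subset of non-ASCII whitespace characters (assumption).
UNICODE_SPACES = "\u00a0\u2002\u2003\u2009\u3000"


def normalize_whitespace(text: str) -> str:
    """Replace the listed Unicode spaces with an ASCII space and trim the ends."""
    table = str.maketrans({ch: " " for ch in UNICODE_SPACES})
    return text.translate(table).strip()


def text_fingerprint(text: str, lowercase: bool = False,
                     ignore_non_character: bool = False) -> str:
    """Return an md5 hex digest; equal digests are treated as duplicates."""
    if lowercase:
        text = text.lower()
    if ignore_non_character:
        # Assumed meaning: drop whitespace, digits, and ASCII punctuation.
        pattern = rf"\s+|\d+|[{re.escape(string.punctuation)}]"
        text = re.sub(pattern, "", text)
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def deduplicate(samples: List[Dict[str, str]], **options) -> List[Dict[str, str]]:
    """Keep the first sample for each distinct fingerprint."""
    seen = set()
    kept = []
    for sample in samples:
        key = text_fingerprint(sample["text"], **options)
        if key not in seen:
            seen.add(key)
            kept.append(sample)
    return kept


raw = [
    {"text": "Hello\u00a0world"},  # contains a no-break space
    {"text": "Hello world"},
    {"text": "hello world"},
]
normalized = [{"text": normalize_whitespace(s["text"])} for s in raw]
exact = deduplicate(normalized)
loose = deduplicate(normalized, lowercase=True)
print(len(exact), len(loose))
```

The first two samples only become identical after normalization, which suggests why it can help to place the whitespace mapper before the deduplicator in the OP list.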

Adding these two OPs gives the following data recipe, which we write to a config file in YAML format:

In [1]:
config_str = """
project_name: 'build_my_own_recipe'
dataset_path: '../demos/data/demo-dataset.jsonl'  # replace it with the path to your dataset directory or file
np: 4  # number of subprocesses used to process your dataset
export_path: './outputs/my_own_recipe/res.jsonl'

process:
  - whitespace_normalization_mapper:
  - language_id_score_filter:
      lang: 'zh'
      min_score: 0.8
  - document_deduplicator: # deduplicate text samples using md5 hashing exact matching method
      lowercase: false   # whether to convert text to lower case
      ignore_non_character: false
"""
recipe_name = 'my_own_recipe.yaml'
with open(recipe_name, 'w') as fout:
    fout.write(config_str)

Load and check the recipe:

In [2]:
from data_juicer.config import init_configs
cfg = init_configs(args=f'--config {recipe_name}'.split())
print(f'np = {cfg.np}')

2024-08-08 12:17:04 | INFO     | data_juicer.config.config:618 - Back up the input config file [/root/projects/kdd_tutorial_notebooks/my_own_recipe.yaml] into the work_dir [/root/projects/kdd_tutorial_notebooks/outputs/my_own_recipe]
2024-08-08 12:17:04 | INFO     | data_juicer.config.config:640 - Configuration table:
╒═════════════════════════╤══════════════════════════════════════════════════════════════════════════╕
│ key                     │ values                                                                   │
╞═════════════════════════╪══════════════════════════════════════════════════════════════════════════╡
│ config                  │ [Path_fr(my_own_recipe.yaml, cwd=/root/projects/kdd_tutorial_notebooks)] │
├─────────────────────────┼──────────────────────────────────────────────────────────────────────────┤
│ hpo_config              │ 

Now you can process the demo dataset with this new data recipe.

In [3]:
!dj-process --config my_own_recipe.yaml

2024-08-08 12:17:24 | INFO     | data_juicer.config.config:618 - Back up the input config file [/root/projects/kdd_tutorial_notebooks/my_own_recipe.yaml] into the work_dir [/root/projects/kdd_tutorial_notebooks/outputs/my_own_recipe]
2024-08-08 12:17:24 | INFO     | data_juicer.config.config:640 - Configuration table:
╒═════════════════════════╤══════════════════════════════════════════════════════════════════════════╕
│ key                     │ values                                                                   │
╞═════════════════════════╪══════════════════════════════════════════════════════════════════════════╡
│ config                  │ [Path_fr(my_own_recipe.yaml, cwd=/root/projects/kdd_tutorial_notebooks)] │
├─────────────────────────┼──────────────────────────────────────────────────────────────────────────┤
│ hpo_config              │ None                                              
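
Once `dj-process` finishes, the processed samples should be at the `export_path` set in the recipe. Below is a minimal sketch for inspecting that file with the standard library. It assumes the export is in JSON Lines format (as the `.jsonl` extension suggests) and that each sample keeps its content under a `text` key; `load_jsonl` is a hypothetical helper, not part of Data-Juicer.

```python
import json
from pathlib import Path
from typing import Dict, List


def load_jsonl(path: str) -> List[Dict]:
    """Read a JSON Lines file into a list of dicts, skipping blank lines."""
    samples = []
    with open(path, encoding="utf-8") as fin:
        for line in fin:
            line = line.strip()
            if line:
                samples.append(json.loads(line))
    return samples


# Same value as export_path in the recipe.
export_path = Path("./outputs/my_own_recipe/res.jsonl")
if export_path.exists():
    results = load_jsonl(str(export_path))
    print(f"{len(results)} samples kept")
    for sample in results[:3]:
        # Assumes the sample content lives under a "text" key.
        print(repr(sample.get("text", ""))[:80])
else:
    print(f"{export_path} not found; run dj-process first")
```

Comparing the number of kept samples with the size of the input dataset gives a quick sense of how aggressive the filter and deduplicator settings are.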

Finally, we clean up the temporary recipe.

In [4]:
!rm my_own_recipe.yaml

## Build Method

In addition to modifying existing built-in recipes, you can also:

### Customize the Default Configuration File

The [`config_all.yaml`](https://github.com/modelscope/data-juicer/blob/main/configs/config_all.yaml) file contains all operators and their default arguments.

You just need to **remove** the OPs that you won't use and refine the arguments of the ones you keep.

### Create a New Configuration from Scratch

You can refer to our example config file [`config_all.yaml`](https://github.com/modelscope/data-juicer/blob/main/configs/config_all.yaml), the [OP documents](https://github.com/modelscope/data-juicer/blob/main/docs/Operators.md), and the advanced [Build-Up Guide for developers](https://github.com/modelscope/data-juicer/blob/main/docs/DeveloperGuide.md#build-your-own-configs) to create a new recipe from scratch.
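
The "remove what you don't need, then refine arguments" workflow can also be scripted. The sketch below assumes the `process` section has already been parsed into a Python list of one-key dicts (for example by a YAML parser); `prune_ops` is a hypothetical helper, and `full_process` is a tiny stand-in that reuses the three OPs from this notebook rather than the real contents of `config_all.yaml`.

```python
import copy
from typing import Any, Dict, List, Optional

OpEntry = Dict[str, Optional[Dict[str, Any]]]


def prune_ops(process: List[OpEntry], keep: List[str],
              overrides: Optional[Dict[str, Dict[str, Any]]] = None
              ) -> List[OpEntry]:
    """Keep only the named OPs, in their original order, and update arguments."""
    overrides = overrides or {}
    pruned = []
    for entry in process:
        # Each entry is a one-key mapping: {op_name: args or None}.
        name, args = next(iter(entry.items()))
        if name not in keep:
            continue
        # Copy so the original list is left untouched.
        new_args = copy.deepcopy(args) or {}
        new_args.update(overrides.get(name, {}))
        pruned.append({name: new_args or None})
    return pruned


# Stand-in for a parsed OP list (illustrative values only).
full_process = [
    {"whitespace_normalization_mapper": None},
    {"language_id_score_filter": {"lang": "zh", "min_score": 0.8}},
    {"document_deduplicator": {"lowercase": False,
                               "ignore_non_character": False}},
]

my_process = prune_ops(
    full_process,
    keep=["whitespace_normalization_mapper", "language_id_score_filter"],
    overrides={"language_id_score_filter": {"min_score": 0.9}},
)
print(my_process)
```

Editing the YAML file by hand works just as well for a one-off recipe; a helper like this may be more useful when generating several recipe variants from the same base list.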

## Reusable Built-in Recipes

Data-Juicer offers dozens of [built-in data processing recipes](https://github.com/modelscope/data-juicer/blob/main/configs/data_juicer_recipes/README.md) for pre-training, fine-tuning, English, Chinese, and other scenarios.

### Reproduced RedPajama

We have reproduced the processing flow of some RedPajama datasets. Please refer to the [reproduced_redpajama](https://github.com/modelscope/data-juicer/blob/main/configs/reproduced_redpajama/README.md) folder for details.

### Reproduced BLOOM

We have reproduced the processing flow of some BLOOM datasets. Please refer to the [reproduced_bloom](https://github.com/modelscope/data-juicer/blob/main/configs/reproduced_bloom/README.md) folder for details.

### Data-Juicer Recipes

We have refined some open-source datasets (including CFT datasets) using Data-Juicer and provide the configuration files for the refinement flows. Please refer to the [data_juicer_recipes](https://github.com/modelscope/data-juicer/blob/main/configs/data_juicer_recipes/README.md) folder for details.


## Awesome LLM Data

We provide a tag-based categorization to help readers dive into the large body of materials and quickly see the key focus areas of each entry. We also plan to provide a dynamic table of contents with search, filter, and sort features to make the materials easier to navigate.

For more details, please refer to [Awesome LLM Data](https://github.com/modelscope/data-juicer/blob/main/docs/awesome_llm_data.md).



# Conclusion

In this notebook, we learned how to build our own recipes from existing recipes. We also saw that Data-Juicer already provides many built-in data recipes for users to refer to.