# 1. Basics of Data Curation

******

Generative AI development requires a heavy data curation process. The quality of the model largely depends on the quality of the data used for training. NVIDIA NeMo Curator is an open-source framework designed to streamline this process by preparing large-scale, high-quality datasets for pretraining and continuous training.

NeMo Curator offers built-in workflows for curating data from various public sources such as Common Crawl, Wikipedia, and arXiv. At the same time, it provides the flexibility to customize pipelines to suit the specific needs of your project.

This notebook walks through the basic data preparation steps involved in most language model development:

**[1.1 Text Cleaning and Unification](#1.1-Text-Cleaning-and-Unification)<br>**
**[1.2 Document Size Filtering](#1.2-Document-Size-Filtering)<br>**
**[1.3 Filter Personally Identifiable Information (PII)](#1.3-Filter-Personally-Identifiable-Information-(PII))<br>**


***************
### Environment Setup

For large-scale data processing, NeMo Curator provides both GPU- and CPU-based modules. Understanding how these modules interact and how to configure your environment is key to optimizing performance.

CPU-based modules rely on [Dask](https://www.dask.org/) to distribute computations across multi-node clusters, while GPU-accelerated modules use [RAPIDS](https://rapids.ai/) to handle large-scale datasets efficiently.

Let's first check your current environment.

In [1]:
# check CPU details
! lscpu

Architecture:             x86_64
  CPU op-mode(s):         32-bit, 64-bit
  Address sizes:          46 bits physical, 57 bits virtual
  Byte Order:             Little Endian
CPU(s):                   26
  On-line CPU(s) list:    0-25
Vendor ID:                GenuineIntel
  Model name:             Intel(R) Xeon(R) Platinum 8480+
    CPU family:           6
    Model:                143
    Thread(s) per core:   1
    Core(s) per socket:   1
    Socket(s):            26
    Stepping:             8
    BogoMIPS:             4000.00
    Flags:                fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge m
                          ca cmov pat pse36 clflush mmx fxsr sse sse2 ss syscall
                           nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_go
                          od nopl xtopology cpuid tsc_known_freq pni pclmulqdq v
                          mx ssse3 fma cx16 pdcm pcid sse4_1 sse4_2 x2apic movbe
                           popcnt tsc_deadline_timer aes xs

In [2]:
# check GPU details
! nvidia-smi

Sun Dec  7 10:56:01 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.148.08             Driver Version: 570.148.08     CUDA Version: 12.8     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  NVIDIA H100 PCIe               On  |   00000000:06:00.0 Off |                    0 |
| N/A   32C    P0             48W /  350W |       0MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
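
The same information can be gathered from Python, which is convenient when a notebook needs to branch on the available hardware. The snippet below is a minimal sketch using only the standard library. The helper name `describe_environment` is hypothetical, and it only checks whether the `nvidia-smi` executable is on the `PATH`, not whether a GPU is actually usable.

```python
import os
import shutil


def describe_environment() -> dict:
    """Summarize local CPU/GPU tooling (hypothetical helper, not part of NeMo Curator)."""
    return {
        # Number of logical CPUs visible to Python (may be None on some platforms)
        "cpu_count": os.cpu_count(),
        # True if the nvidia-smi executable can be found on the PATH
        "has_nvidia_smi": shutil.which("nvidia-smi") is not None,
    }


info = describe_environment()
print(info)
```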
                                                

NeMo Curator provides a simple function `get_client` that can be used to start a local Dask cluster or connect to an existing one.  Let's initialize the Dask Cluster. 

The next cell starts a Dask `LocalCluster` on your CPU. It can be reused for all modules except for deduplication, which requires a GPU cluster.

In [None]:
from nemo_curator.utils.distributed_utils import get_client

client = get_client(cluster_type="cpu")

Learn more about NeMo Curator's CPU and GPU modules with Dask in the dedicated [documentation](https://docs.nvidia.com/nemo-framework/user-guide/latest/datacuration/cpuvsgpu.html).

## 1.1 Multilingual Datasets

In this notebook, we will use a subset of [mC4](https://huggingface.co/datasets/allenai/c4), the multilingual variant of the C4 dataset.

For the sake of this exercise, to create a more diverse dataset:
- We merged Spanish and French samples (100 per language)
- We duplicated all samples (making 200 samples per language)
- We shuffled the samples

So, we have 400 samples, 200 from each language. The data is stored with one JSON object per line, each with three fields: `text`, `timestamp`, and `url`.
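
The merge, duplicate, and shuffle steps described above can be sketched in plain Python. This is only an illustration under stated assumptions: the records are synthetic placeholders (the texts, timestamps, and URLs are made up), the function name `build_toy_corpus` is hypothetical, and the actual dataset was not necessarily produced this way.

```python
import json
import random


def build_toy_corpus(samples_by_language: dict, seed: int = 0) -> list:
    """Merge per-language samples, duplicate every sample once, then shuffle."""
    merged = [s for samples in samples_by_language.values() for s in samples]
    duplicated = merged + [dict(s) for s in merged]  # exact copies of every record
    random.Random(seed).shuffle(duplicated)  # seeded for reproducibility
    return duplicated


# Synthetic placeholder records: 100 per language, same three fields as the dataset
samples_by_language = {
    lang: [
        {
            "text": f"{lang} sample {i}",
            "timestamp": "2018-01-01T00:00:00Z",
            "url": f"https://example.com/{lang}/{i}",
        }
        for i in range(100)
    ]
    for lang in ("es", "fr")
}

corpus = build_toy_corpus(samples_by_language)

# Serialize as one JSON object per line
jsonl = "\n".join(json.dumps(record, ensure_ascii=False) for record in corpus)
print(len(corpus))
```

Because every record appears exactly twice, a dataset built this way is a natural test case for deduplication later in the workflow.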

Let's have a look at the dataset:

In [3]:
# set dataset file path
multilingual_data = "./datasets/file.json"

In [4]:
# check number of samples
! wc -l {multilingual_data}

400 ./datasets/file.json


In [4]:
# show the 3 first samples
! head -n 3 {multilingual_data} | jq

{
  "text": "Dragon Ball: Le 20e film de la sage sortira le 14 décembre, première image teaser sur Buzz, insolite et culture\nDragon Ball: Le 20e film de la sage sortira le 14 décembre, première image teaser\nLe 20e film Dragon Ball sortira le vendredi 14 décembre 2018. La première affiche teaser montre un Gokû jeune adulte, environ celui de la fin de Dragon Ball et le début de Dragon Ball Z. À lire aussi >>> Le gouvernement mexicain prévoit la diffusion sur place publique des épisodes 130 et 131 de Dragon […]...\nLire la suite du buzz sur bleachmx Source : bleachmx - 12/03/2018 22:31 - trending_up142\nfilm commémoration Akira Toriyama dbz dragon ball Dragon Ball Super dragon ball z affiche Dragon Ball Super Anime V Jump Décembre 2018 Dragon Ball Z Battle of Gods Dragon Ball Z Fukkatsu No [F] Dragon Ball Z La Résurrection de [F] Potins Films\nLe site Deadline indique ce Jeudi que le film d’animation Dragon Ball Super – Broly a rapporté, selon les

Notice that **languages are not annotated in the dataset**. This lets us apply AI-based language separation later in the workflow.

A `DocumentDataset` can also be built from a pandas DataFrame; for more information on the arguments, see Dask's `from_pandas` documentation.

NeMo Curator's `DocumentDataset` employs Dask's distributed dataframes to manage large datasets across multiple nodes and allow for easy restarting of interrupted curation. `DocumentDataset` supports reading and writing sharded *jsonl* and *parquet* files, both on local disk and on remote sources such as S3 buckets.

Let's load our multilingual dataset with NeMo Curator.
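
To make "sharded *jsonl*" concrete, the following sketch writes a handful of records across several JSON Lines files and reads them back, using only the standard library. The helper names `write_shards` and `read_shards` and the `shard_NNN.jsonl` naming scheme are assumptions made for illustration; they are not how `DocumentDataset` itself is implemented.

```python
import json
import tempfile
from pathlib import Path


def write_shards(records: list, out_dir, shard_size: int) -> list:
    """Write records to numbered .jsonl files, shard_size records per file."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for start in range(0, len(records), shard_size):
        path = out_dir / f"shard_{start // shard_size:03d}.jsonl"
        with path.open("w", encoding="utf-8") as f:
            for record in records[start:start + shard_size]:
                f.write(json.dumps(record, ensure_ascii=False) + "\n")
        paths.append(path)
    return paths


def read_shards(in_dir) -> list:
    """Read every .jsonl file in a directory, in sorted filename order."""
    records = []
    for path in sorted(Path(in_dir).glob("*.jsonl")):
        with path.open(encoding="utf-8") as f:
            records.extend(json.loads(line) for line in f if line.strip())
    return records


records = [{"text": f"document {i}"} for i in range(5)]
with tempfile.TemporaryDirectory() as tmp:
    shard_paths = write_shards(records, tmp, shard_size=2)
    num_shards = len(shard_paths)
    restored = read_shards(tmp)

print(num_shards, len(restored))
```

Since each shard is an independent file, a distributed framework can assign shards to different workers and process them in parallel.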

In [5]:
import warnings

warnings.filterwarnings("ignore")

In [7]:
from nemo_curator.datasets import DocumentDataset

multilingual_data_path = "./datasets/multilingual"
multilingual_dataset = DocumentDataset.read_json(
    multilingual_data_path, add_filename=True
)

Reading 1 files


In [8]:
# DocumentDataset wraps a Dask dataframe, exposed as `.df`
multilingual_dataset.df.head(3)

## 1.2 Basic Text Cleaning and Unification

NeMo Curator provides various `DocumentModifier` implementations, such as the `UnicodeReformatter`, which uses [ftfy](https://pypi.org/project/ftfy/) (fixes text for you) to resolve Unicode issues in the dataset.

It is also possible to implement your own text cleaner by subclassing `DocumentModifier`. For instance, we can standardize the inconsistent quotation marks that appear very often in large curated datasets, or remove HTML tags, URLs, email addresses, and so on.


Let's first create the output folders for the cleaning step:

In [12]:
import os

# Set dataset file path
curated_data_path = "./curated"
clean_and_unify_data_path = os.path.join(curated_data_path, "01_clean_and_unify")

# Create directories
os.makedirs(curated_data_path, exist_ok=True)
os.makedirs(clean_and_unify_data_path, exist_ok=True)

Let's now implement a custom text cleaner `QuotationTagUnifier`.

It is designed to modify text documents by normalizing quotation marks and removing unwanted elements. 

The result is a cleaned and standardized text output.

In [13]:
import re

import dask
import pandas as pd
from nemo_curator.modifiers import DocumentModifier, UnicodeReformatter
from nemo_curator.modules.modify import Modify


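class QuotationTagUnifier(DocumentModifier):
    """Normalize quotation marks and tabs, and strip HTML tags and URLs."""

    def modify_document(self, text: str) -> str:
        # Replace curly single and double quotes with straight ones
        text = text.replace("‘", "'").replace("’", "'")
        text = text.replace("“", '"').replace("”", '"')
        # Replace each tab with a single space
        text = text.replace("\t", " ")
        # Remove HTML tags (<...>) and http/https URLs
        text = re.sub(
            r"(<[^>]+>)|(http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+)",
            "",
            text,
        )

        return text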
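
To see what these substitutions do in isolation, the same logic can be run as a plain function, without Dask or NeMo Curator. This is a sketch for experimentation: the function name `unify_quotes_and_strip_tags` and the sample strings are made up for illustration, while the regular expression is copied from the class above.

```python
import re

# Same pattern as in QuotationTagUnifier: HTML tags or http/https URLs
TAG_OR_URL = re.compile(
    r"(<[^>]+>)|(http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+)"
)


def unify_quotes_and_strip_tags(text: str) -> str:
    """Standalone version of the cleaning steps, for quick experiments."""
    text = text.replace("‘", "'").replace("’", "'")
    text = text.replace("“", '"').replace("”", '"')
    text = text.replace("\t", " ")
    return TAG_OR_URL.sub("", text)


samples = [
    "She said “hello” and left",
    "Visit <a href='www.example.com'>example.com</a> for more information",
    "See https://example.com/page for details",
    "Contact us at info@example.com",
]
cleaned = [unify_quotes_and_strip_tags(s) for s in samples]
for before, after in zip(samples, cleaned):
    print(repr(before), "->", repr(after))
```

Two details are worth noticing when running it: removing a URL leaves the surrounding spaces in place, and the pattern does not match plain email addresses, so the last sample comes back unchanged.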

Next, we can chain modifiers together using the `Sequential` class, which takes a list of operations and applies them in order to a given `DocumentDataset`.
 
Let's call this sequence the `cleaners`:

In [14]:
from nemo_curator import Sequential

cleaners = Sequential(
    [
        # Apply: Unify all the quotation marks and remove tags
        Modify(QuotationTagUnifier()),
        # Apply: Unify all unicode
        Modify(UnicodeReformatter()),
    ]
)
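
Conceptually, chaining steps like this is function composition: the output of each step becomes the input of the next. The sketch below illustrates the idea on plain lists of strings. It is not NeMo Curator's implementation, and `run_sequentially`, `strip_whitespace`, and `drop_empty` are hypothetical names.

```python
from functools import reduce
from typing import Callable, Iterable, List

Step = Callable[[List[str]], List[str]]


def run_sequentially(steps: Iterable[Step], documents: List[str]) -> List[str]:
    """Apply each step in order, feeding the result of one into the next."""
    return reduce(lambda docs, step: step(docs), steps, documents)


def strip_whitespace(docs: List[str]) -> List[str]:
    return [d.strip() for d in docs]


def drop_empty(docs: List[str]) -> List[str]:
    return [d for d in docs if d]


raw = ["  hello ", "   ", "world"]
result = run_sequentially([strip_whitespace, drop_empty], raw)
reordered = run_sequentially([drop_empty, strip_whitespace], raw)
print(result)
print(reordered)
```

The two runs differ because order matters: dropping empty documents before stripping whitespace lets the whitespace-only document through.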

Let's run that on a toy example with a few sentences:

In [15]:
# create the toy samples
dataframe_toy = pd.DataFrame(
    {
        "text": [
            "Ryan went out to play ‘footbal’",
            "He is very  \t  happy.",
            "Visit <a href='www.example.com'>example.com</a> for more information or contact us at info@example.com",
        ]
    }
)

dataset_toy = DocumentDataset(dask.dataframe.from_pandas(dataframe_toy, npartitions=1))

# check the samples
dataset_toy.df.head()

Now, let's apply our sequence of cleaners to the toy samples. To execute this sequence on the Dask cluster, we use `.persist()`, which keeps the transformed data in memory for optimized processing. 

In [None]:
dataset_test_clean_and_unify = cleaners(dataset_toy).persist()

Let's check the output.

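The expected output is samples with normalized quotation marks, tabs replaced by spaces, and HTML tags and URLs removed. Note that the regular expression in `QuotationTagUnifier` does not match plain email addresses, so `info@example.com` is left in place.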

In [None]:
# check cleaned toy samples
dataset_test_clean_and_unify.df.head()

Now, let's apply this cleaning step to our multilingual dataset. We can achieve this by creating a sequence of curation steps, starting with the cleaning sequence as the first function in our data curation pipeline.

Run the next cell to create the cleaning step as a function that would be the first curation step.

In [None]:
# define the sequence of cleaning operations as a function
def clean_and_unify(dataset: DocumentDataset) -> DocumentDataset:
    cleaners = Sequential(
        [
            # Apply: Unify all the quotation marks and remove tags
            Modify(QuotationTagUnifier()),
            # Apply: Unify all unicode
            Modify(UnicodeReformatter()),
        ]
    )
    return cleaners(dataset)


# sequence of data curation steps; so far, only clean_and_unify is defined
curation_steps = Sequential([clean_and_unify])

Let's now execute this step on our multilingual dataset:

In [None]:
%%time
print("Executing the pipeline...")

dataset_clean_and_unify = curation_steps(multilingual_dataset).persist()

Let's check some outputs:

In [None]:
dataset_clean_and_unify.df.head()

We can save the resulting `DocumentDataset` to a JSON Lines file.

In [None]:
# save output to json
dataset_clean_and_unify.to_json(clean_and_unify_data_path, write_to_filename=True)

In [None]:
! head -n 1 {clean_and_unify_data_path}/file.jsonl | jq

## Document Size Filtering

Extremely short documents may lack sufficient context or information for the model to learn meaningful concepts. By filtering out such documents, we can ensure that the data used for training is sufficiently informative and balanced.

Let's explore how to apply word counts and filtering using NeMo Curator.

In [None]:
# import relevant libraries
from nemo_curator import ScoreFilter
from nemo_curator.datasets import DocumentDataset
from nemo_curator.filters import (
    DocumentFilter,
    RepeatingTopNGramsFilter,
    WordCountFilter,
)

In [None]:
class IncompleteDocumentFilter(DocumentFilter):
    """
    If the document doesn't end with a terminating punctuation mark, then discard.
    """

    def __init__(self):
        super().__init__()
        # Accepted document terminators.
        self._story_terminators = {".", "!", "?", '"', "”"}

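    def score_document(self, text: str) -> bool:
        """
        Checks whether the document ends with an accepted terminator.
        Args:
            text (str): The document text.
        Returns:
            bool: True if the last non-whitespace character is an accepted
                terminator, False otherwise (including for empty documents).
        """
        stripped = text.strip()
        # Guard against empty documents, where indexing [-1] would raise IndexError
        return bool(stripped) and stripped[-1] in self._story_terminators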

    def keep_document(self, score) -> bool:
        return score

The following code defines a function, `filter_dataset`, that cleans a `DocumentDataset` by applying several filters:

- **Word Count Filter**: Removes documents with fewer than 80 words (the `min_words=80` setting used here).
- **Incomplete Document Filter**: Removes incomplete documents.
- **Repeating N-Grams Filters**: Removes documents with excessive repetition of word sequences (2-grams, 3-grams, 4-grams) above certain thresholds (20%, 18%, 16% respectively).

These filters are applied sequentially to refine the dataset.
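
The logic behind these three checks can be sketched in plain Python. This is a simplified illustration, not NeMo Curator's implementation: words are split on whitespace, and the repetition score is the share of n-grams taken up by the single most frequent n-gram, whereas the library's filter may compute its ratio differently (for example, by characters). All function names here are hypothetical.

```python
from collections import Counter

TERMINATORS = (".", "!", "?", '"', "”")


def word_count(text: str) -> int:
    """Number of whitespace-separated words."""
    return len(text.split())


def ends_with_terminator(text: str) -> bool:
    """True if the last non-whitespace character is an accepted terminator."""
    return text.strip().endswith(TERMINATORS)


def top_ngram_ratio(text: str, n: int) -> float:
    """Share of all word n-grams taken up by the most frequent n-gram."""
    words = text.split()
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    if not ngrams:
        return 0.0
    _, count = Counter(ngrams).most_common(1)[0]
    return count / len(ngrams)


def keep(text: str, min_words: int = 80,
         max_ratios=((2, 0.2), (3, 0.18), (4, 0.16))) -> bool:
    """Apply the three checks in order; a document must pass all of them."""
    if word_count(text) < min_words:
        return False
    if not ends_with_terminator(text):
        return False
    return all(top_ngram_ratio(text, n) <= limit for n, limit in max_ratios)


varied = " ".join(f"word{i}" for i in range(100)) + "."
repetitive = "spam " * 100 + "."
print(keep(varied), keep("Too short."), keep(repetitive))
```

Ordering the cheap checks first means the more expensive n-gram counting only runs on documents that already passed the length and completeness tests.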

In [None]:
def filter_dataset(dataset: DocumentDataset) -> DocumentDataset:
    filters = Sequential(
        [
            ScoreFilter(
                WordCountFilter(min_words=80),
                text_field="text",
                score_field="word_count",
            ),
            ScoreFilter(IncompleteDocumentFilter(), text_field="text"),
            ScoreFilter(
                RepeatingTopNGramsFilter(n=2, max_repeating_ngram_ratio=0.2),
                text_field="text",
            ),
            ScoreFilter(
                RepeatingTopNGramsFilter(n=3, max_repeating_ngram_ratio=0.18),
                text_field="text",
            ),
            ScoreFilter(
                RepeatingTopNGramsFilter(n=4, max_repeating_ngram_ratio=0.16),
                text_field="text",
            ),
        ]
    )

    return filters(dataset)

Let's now apply that on our multilingual dataset:

In [None]:
%%time
curation_steps = Sequential([clean_and_unify, filter_dataset])

print("Executing the pipeline...")
filtered_dataset = curation_steps(multilingual_dataset).persist()

We can check the outputs. Notice that a new field named `word_count` has been added:

In [None]:
filtered_dataset.df.head()

Let's save the output. We need to create the folder first.

In [None]:
filtered_data_path = os.path.join(curated_data_path, "02_filter_dataset")

! mkdir -p {filtered_data_path}

In [None]:
# save output to json
filtered_dataset.to_json(filtered_data_path, write_to_filename=True)

Check the saved file by running the next cell.

In [None]:
! head -n 1 {filtered_data_path}/file.jsonl | jq

## 1.3 PII Identification and Removal

The Personally Identifiable Information (PII) identification tool is designed to detect and remove sensitive data from datasets.

The identification leverages [presidio_analyzer](https://pypi.org/project/presidio-analyzer/), a Python-based service for detecting PII entities in text.

Let's try to analyze a toy sample: *My name is Dana and my number is 212-555-5555*

The expected output lists the `PERSON` and `PHONE_NUMBER` entity types, each with its start and end character positions and a confidence score.

```
 type: PERSON, start: 11, end: 15, score: 0.85,
 type: PHONE_NUMBER, start: 33, end: 45, score: 0.75
```
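
The `start` and `end` values are character offsets into the input string, with `end` exclusive, so they can be used directly as a Python slice. The short check below uses only the offsets listed above and does not require Presidio; the variable names are illustrative.

```python
text = "My name is Dana and my number is 212-555-5555"

# (entity type, start, end) taken from the expected output above
expected = [("PERSON", 11, 15), ("PHONE_NUMBER", 33, 45)]

# Slice the original text with each pair of offsets
spans = {entity_type: text[start:end] for entity_type, start, end in expected}
print(spans)
```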

In [None]:
import warnings

from presidio_analyzer import AnalyzerEngine
from presidio_analyzer.nlp_engine import NlpEngineProvider

# Hide deprecation warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)

# Set up the engine, loads the NLP module (spaCy model by default) and other PII recognizers
LANGUAGES_CONFIG_FILE = "./languages-config.yml"
# Create NLP engine based on configuration file
provider = NlpEngineProvider(conf_file=LANGUAGES_CONFIG_FILE)
nlp_engine_with_spanish = provider.create_engine()

analyzer = AnalyzerEngine(
    supported_languages=["en", "es", "fr"], nlp_engine=nlp_engine_with_spanish
)

results = analyzer.analyze(
    text="My name is Dana and my number is 212-555-5555",
    entities=["PHONE_NUMBER", "PERSON"],
    language="en",
)
print(results)

Run the analyzer on a French sample:

In [None]:
analyzer.analyze(text="Mon email est mon@example.com", language="fr")

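Try your own examples in any of these three languages (English, Spanish, or French), setting `language` to match the text for accurate results.

In [None]:
sample_text = "My name is Dana and my number is 212-555-5555"  # replace with your own example
analyzer.analyze(text=sample_text, language="en")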

NeMo Curator integrates PII identification and removal, leveraging Dask for parallelization. The tool currently supports the identification and removal of the following sensitive data types:

`ADDRESS`, `CREDIT_CARD`, `EMAIL_ADDRESS`, `DATE_TIME`, `IP_ADDRESS`, `LOCATION`, `PERSON`, `URL`, `US_SSN`, `US_PASSPORT`, `US_DRIVER_LICENSE`, `PHONE_NUMBER`

Let's run the NeMo Curator PII modifier, `PiiModifier`, on a toy sample.

In [None]:
# create toy samples with PII data
dataframe_toy = pd.DataFrame(
    {
        "text": [
            "Ryan went out to play football",
            "His email is ryan@example.com and phone is 212-555-5555",
        ]
    }
)
dataset_toy = DocumentDataset(dask.dataframe.from_pandas(dataframe_toy, npartitions=1))

dataset_toy.df.head()

Let's build and apply the `PiiModifier` on the toy sample. 

In [None]:
from nemo_curator.modifiers.pii_modifier import PiiModifier

modifier = PiiModifier(
    batch_size=2000,
    language="en",
    supported_entities=["PERSON", "EMAIL_ADDRESS", "PHONE_NUMBER"],
    anonymize_action="replace",
)

modify = Modify(modifier)
modified_dataset = modify(dataset_toy)
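
As a mental model for a "replace" style of anonymization, the sketch below swaps each detected span for a placeholder naming its entity type. It is deliberately naive: it uses two hand-written regular expressions instead of Presidio's recognizers, the `<ENTITY_TYPE>` placeholder format is an assumption, and the names `PATTERNS` and `redact` are hypothetical. It is not how `PiiModifier` works internally.

```python
import re

# Simplified patterns for illustration only; real PII detection is far more involved
PATTERNS = {
    "EMAIL_ADDRESS": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "PHONE_NUMBER": re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),
}


def redact(text: str) -> str:
    """Replace every match of each pattern with a <ENTITY_TYPE> placeholder."""
    for entity_type, pattern in PATTERNS.items():
        text = pattern.sub(f"<{entity_type}>", text)
    return text


toy_samples = [
    "Ryan went out to play football",
    "His email is ryan@example.com and phone is 212-555-5555",
]
redacted = [redact(s) for s in toy_samples]
for line in redacted:
    print(line)
```

Names such as "Ryan" are left untouched here because regular expressions cannot reliably recognize them, which is why entity types like `PERSON` depend on an NLP model.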

In [None]:
# check modified data
modified_dataset.df.head()

Now, let's integrate this PII identification step into our curation sequence and apply it to the multilingual dataset. This will ensure that sensitive data is properly detected and removed while maintaining data quality. 

Let's first create the `redact_pii` function for PII identification and removal.

In [None]:
from nemo_curator.modifiers.pii_modifier import PiiModifier


def redact_pii(dataset: DocumentDataset) -> DocumentDataset:
    redactor = Modify(
        PiiModifier(
            supported_entities=["PERSON"],
            anonymize_action="replace",
            device="cpu",
        ),
    )

    return redactor(dataset)

Let's now run the sequence of curation steps, including the PII removal function.


In [None]:
%%time
curation_steps = Sequential([clean_and_unify, filter_dataset, redact_pii])

print("Executing the pipeline...")
redact_pii_dataset = curation_steps(multilingual_dataset).persist()

In [None]:
# check the redacted data
redact_pii_dataset.df.head()

Let's now save the redacted data. We first need to create the folder for the output.

In [None]:
redact_pii_data_path = os.path.join(curated_data_path, "03_redact_pii_data_path")

! mkdir -p {redact_pii_data_path}

In [None]:
# save
redact_pii_dataset.to_json(redact_pii_data_path, write_to_filename=True)

In [None]:
# check the saved file
! head -n 1 {redact_pii_data_path}/file.jsonl |jq

The current NeMo Curator PII removal implementation is limited to HPC clusters that use Slurm as the resource manager. Check the [documentation](https://github.com/NVIDIA/NeMo-Curator/blob/main/docs/user-guide/personalidentifiableinformationidentificationandremoval.rst) for more details.

---
<h2 style="color:green;">Congratulations!</h2>

In this notebook, you have used NeMo Curator to apply a sequence of basic text curation steps designed to clean and preprocess the dataset.

Before moving on to the next notebook, make sure to stop the Dask cluster. Please run the next cell.

In [None]:
import IPython

IPython.Application.instance().kernel.do_shutdown(True)  # automatically restarts kernel


We are now ready to move to the next notebook to explore advanced data preparation steps. 

Let's move to the [02_advanced_data_processing](02_advanced_data_processing.ipynb) 

<img src="./images/DLI_Header.png" style="width: 400px;">
