In [None]:
"""
You can either run this notebook locally (if you have all the dependencies and a GPU) or on Google Colab.

Instructions for setting up Colab are as follows:
1. Open a new Python 3 notebook.
2. Import this notebook from GitHub (File -> Upload Notebook -> "GITHUB" tab -> copy/paste GitHub URL)
3. Connect to an instance with a GPU (Runtime -> Change runtime type -> select "GPU" for hardware accelerator)
4. Run this cell to set up dependencies.
"""
# If you're using Google Colab and not running locally, run this cell

# install NeMo
BRANCH = 'main'
!python -m pip install git+https://github.com/NVIDIA/NeMo.git@$BRANCH#egg=nemo_toolkit[nlp]

In [None]:
import os
import wget
import nemo_text_processing


# Task Description

Text normalization (TN) is a part of the Text-To-Speech (TTS) pre-processing pipeline. It can also be used to pre-process Automatic Speech Recognition (ASR) training transcripts.

TN is the task of converting text in written form to its spoken form to improve TTS. For example, `10:00` should be changed to `ten o'clock` and `10kg` to `ten kilograms`.

# NeMo Text Normalization

NeMo TN is based on Python regular expressions.

Currently, NeMo TN provides support for English and the following semiotic classes from the [Google Text normalization dataset](https://www.kaggle.com/richardwilliamsproat/text-normalization-for-english-russian-and-polish):
DATE, CARDINAL, MEASURE, DECIMAL, ORDINAL, MONEY, TIME, PLAIN. We additionally added the class `WHITELIST` for all whitelisted tokens whose verbalizations are directly looked up from a user-defined list.

The toolkit is modular. The rule-based system is divided into a tagger and a verbalizer, following [Google's Kestrel](https://www.researchgate.net/profile/Richard_Sproat/publication/277932107_The_Kestrel_TTS_text_normalization_system/links/57308b1108aeaae23f5cc8c4/The-Kestrel-TTS-text-normalization-system.pdf) design: the tagger detects and classifies semiotic classes in the input text, and the verbalizer takes the tagger's output and carries out the normalization.
In the example `The alarm goes off at 10:30 a.m.`, the TIME tagger detects `10:30 a.m.` as a valid time with `hour=10`, `minutes=30`, `suffix=a.m.`, and the verbalizer then turns this into `ten thirty a m`.
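
The tagger/verbalizer split can be illustrated with a small, self-contained sketch. This is not the NeMo implementation: the pattern, the number tables, and the function names `tag_time`, `verbalize_time`, and `normalize_time` are all hypothetical and only cover simple clock times.

```python
import re

# Hypothetical toy number tables; they only cover 0-59, enough for clock times.
_ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
         "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
         "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
_TENS = {2: "twenty", 3: "thirty", 4: "forty", 5: "fifty"}


def number_to_words(n):
    """Spell out an integer between 0 and 59."""
    if n < 20:
        return _ONES[n]
    tens, ones = divmod(n, 10)
    return _TENS[tens] if ones == 0 else f"{_TENS[tens]} {_ONES[ones]}"


# Assumed pattern for this sketch: H:MM or HH:MM with an optional a.m./p.m. suffix.
TIME_PATTERN = re.compile(
    r"\b(?P<hour>[01]?\d|2[0-3]):(?P<minutes>[0-5]\d)(?:\s*(?P<suffix>[ap]\.m\.))?"
)


def tag_time(text):
    """Tagger: detect TIME tokens and return their structured fields."""
    return [
        {
            "span": m.span(),
            "hour": int(m["hour"]),
            "minutes": int(m["minutes"]),
            "suffix": m["suffix"],  # None when no a.m./p.m. follows
        }
        for m in TIME_PATTERN.finditer(text)
    ]


def verbalize_time(tag):
    """Verbalizer: turn one tagged TIME token into its spoken form."""
    words = [number_to_words(tag["hour"])]
    minutes = tag["minutes"]
    if minutes:
        prefix = "oh " if minutes < 10 else ""
        words.append(prefix + number_to_words(minutes))
    elif tag["suffix"] is None:
        words.append("o'clock")
    if tag["suffix"]:
        # "a.m." -> "a m"
        words.append(" ".join(tag["suffix"].replace(".", " ").split()))
    return " ".join(words)


def normalize_time(text):
    """Run the tagger, then replace each tagged span with its verbalization."""
    pieces, last = [], 0
    for tag in tag_time(text):
        start, end = tag["span"]
        pieces.append(text[last:start])
        pieces.append(verbalize_time(tag))
        last = end
    pieces.append(text[last:])
    return "".join(pieces)


print(normalize_time("The alarm goes off at 10:30 a.m."))
```

Keeping detection and verbalization in separate functions means a new spoken style (for example, a different reading of `a.m.`) only touches the verbalizer, while the tagger's structured output stays the same.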


This tool supports prediction on text files and evaluation on the [Google Text normalization dataset](https://www.kaggle.com/richardwilliamsproat/text-normalization-for-english-russian-and-polish). On the first file of that dataset, `output-00001-of-00100`, it reaches 81% sentence accuracy and 97.4% token accuracy.



# Quick Start

## Add TN to your Python TTS pre-processing workflow

TN is a part of the `nemo_text_processing` package, which is installed with `nemo_toolkit`. Installation instructions can be found [here](https://github.com/NVIDIA/NeMo/tree/main/README.rst).

In [None]:
from nemo_text_processing.text_normalization.normalize import normalize

raw_text = "we paid $123 for this desk."
normalize(raw_text, verbose=False)

In the above cell, `$123` is converted to `one hundred twenty three dollars`, and the rest of the words remain the same.

## Run Text Normalization on an input from a file

Use `run_predict.py` to convert written text from a file `INPUT_FILE` to its spoken form and save the output to `OUTPUT_FILE`. Under the hood, `run_predict.py` calls `normalize()` (see the section above).

In [None]:
# If you're running the notebook locally, update NEMO_TOOLS_PATH below
# In Colab, a few required scripts will be downloaded from the NeMo GitHub repository

NEMO_TOOLS_PATH = '<UPDATE_PATH_TO_NeMo_root>/nemo_text_processing/text_normalization'
DATA_DIR = 'data_dir'
os.makedirs(DATA_DIR, exist_ok=True)

if 'google.colab' in str(get_ipython()):
    NEMO_TOOLS_PATH = '.'

    required_files = ['run_predict.py',
                      'run_evaluate.py']
    for file in required_files:
        if not os.path.exists(file):
            file_path = 'https://raw.githubusercontent.com/NVIDIA/NeMo/' + BRANCH + '/nemo_text_processing/text_normalization/' + file
            print(file_path)
            wget.download(file_path)
elif not os.path.exists(NEMO_TOOLS_PATH):
    raise ValueError('Update NEMO_TOOLS_PATH to point to the NeMo root directory')

In [None]:
INPUT_FILE = f'{DATA_DIR}/test.txt'
OUTPUT_FILE = f'{DATA_DIR}/test_tn.txt'

! echo "The alarm went off at 10:00." > $INPUT_FILE
! cat $INPUT_FILE
! python $NEMO_TOOLS_PATH/run_predict.py --input=$INPUT_FILE --output=$OUTPUT_FILE

In [None]:
# check that the raw text was converted to the spoken form
! cat $OUTPUT_FILE

## Run evaluation
The evaluation data must be segmented and labeled by semiotic class, following the format of the [Google Text normalization dataset](https://www.kaggle.com/richardwilliamsproat/text-normalization-for-english-russian-and-polish).
That is, every line of the file must have the format `<semiotic class>\t<unnormalized text>\t<self>` for a trivial class (the token is left unchanged) or `<semiotic class>\t<unnormalized text>\t<normalized text>` for a semiotic class.
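
To make the format concrete, here is a minimal sketch of a reader for such a file. It is not the loader used by `run_evaluate.py`; the function name `parse_eval_lines` is hypothetical, and two behaviors are assumptions of this sketch: `<self>` is resolved to the unnormalized text, and tokens after the last `<eos>` are kept as a final sentence.

```python
def parse_eval_lines(lines):
    """Group tab-separated (class, written, spoken) tokens into sentences.

    A line whose first field is <eos> closes the current sentence.
    """
    sentences, current = [], []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # ignore blank lines
        fields = line.split("\t")
        if fields[0] == "<eos>":
            if current:
                sentences.append(current)
                current = []
            continue
        if len(fields) != 3:
            raise ValueError(f"expected 3 tab-separated fields, got: {line!r}")
        semiotic_class, written, spoken = fields
        if spoken == "<self>":
            # Assumption of this sketch: <self> means the token is unchanged.
            spoken = written
        current.append((semiotic_class, written, spoken))
    if current:
        # Assumption of this sketch: keep a trailing sentence without <eos>.
        sentences.append(current)
    return sentences


sample = (
    "PLAIN\tRetrieved\t<self>\n"
    "DATE\t2006-08-05\tthe fifth of august two thousand six\n"
    "PUNCT\t.\tsil\n"
    "<eos>\t<eos>\n"
    "PLAIN\ton\t<self>\n"
    "<eos>\t<eos>\n"
)
for sentence in parse_eval_lines(sample.splitlines()):
    print(sentence)
```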


We will create a simple example file to show how evaluation works:





In [None]:
eval_text =  """LETTERS\tA & E\ta and e
PUNCT\t.\tsil
PLAIN\tRetrieved\t<self>
DATE\t2006-08-05\tthe fifth of august two thousand six
PUNCT\t.\tsil
<eos>\t<eos>
PLAIN\tDownloaded\t<self>
PLAIN\ton\t<self>
DATE\t7 August 2007\tthe seventh of august two thousand seven
PUNCT\t.\tsil"""
INPUT_FILE_EVAL = f"{DATA_DIR}/test_eval.txt"
with open(INPUT_FILE_EVAL, 'w') as fp:
    fp.write(eval_text)
! cat $INPUT_FILE_EVAL

In [None]:
! python $NEMO_TOOLS_PATH/run_evaluate.py --input=$INPUT_FILE_EVAL

The `run_evaluate.py` call outputs both **sentence level** and **token level** accuracies.
For our example, the expected output is the following:

```
Loading training data: test_eval.text
Sentence level evaluation...
- Data: 1 sentences
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 578.76it/s]
- Normalized. Evaluating...
- Accuracy: 1.0
Token level evaluation...
- Token type: LETTERS
  - Data: 1 tokens
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 2549.73it/s]
  - Normalized. Evaluating...
  - Accuracy: 1.0
- Token type: PUNCT
  - Data: 3 tokens
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 3933.39it/s]
  - Normalized. Evaluating...
  - Accuracy: 1.0
- Token type: PLAIN
  - Data: 3 tokens
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 3576.72it/s]
  - Normalized. Evaluating...
  - Accuracy: 1.0
- Token type: DATE
  - Data: 2 tokens
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 1891.46it/s]
  - Normalized. Evaluating...
  - Accuracy: 1.0
- Accuracy: 1.0
 - Total: 9 

Class      | Num Tokens | nemo 
sent level | 1          | 1.0  
PLAIN      | 3          | 1.0  
PUNCT      | 3          | 1.0  
DATE       | 2          | 1.0  
CARDINAL   | 0          | 0    
LETTERS    | 1          | 1.0  
VERBATIM   | 0          | 0    
MEASURE    | 0          | 0    
DECIMAL    | 0          | 0    
ORDINAL    | 0          | 0    
DIGIT      | 0          | 0    
MONEY      | 0          | 0    
TELEPHONE  | 0          | 0    
ELECTRONIC | 0          | 0    
FRACTION   | 0          | 0    
TIME       | 0          | 0    
ADDRESS    | 0          | 0 

```
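
The difference between the two metrics can be sketched as follows. This is an illustration under assumed definitions, not the code in `run_evaluate.py`: sentence accuracy is taken to be the fraction of sentences whose whole normalized string matches the reference, and token accuracy is computed per semiotic class. The helper names and the lookup-table normalizer `toy_normalize` are hypothetical.

```python
from collections import defaultdict


def sentence_accuracy(predicted, expected):
    """Fraction of sentences whose full normalized string matches exactly."""
    if not expected:
        return 0.0
    return sum(p == e for p, e in zip(predicted, expected)) / len(expected)


def token_accuracy_by_class(tokens, normalize_fn):
    """Per-class fraction of tokens that normalize_fn gets exactly right."""
    totals, hits = defaultdict(int), defaultdict(int)
    for semiotic_class, written, spoken in tokens:
        totals[semiotic_class] += 1
        hits[semiotic_class] += int(normalize_fn(written) == spoken)
    return {c: hits[c] / totals[c] for c in totals}


# Hypothetical stand-in normalizer: a lookup table that knows a single date.
_TOY_TABLE = {"2006-08-05": "the fifth of august two thousand six"}


def toy_normalize(text):
    return _TOY_TABLE.get(text, text)


tokens = [
    ("PLAIN", "Retrieved", "Retrieved"),
    ("DATE", "2006-08-05", "the fifth of august two thousand six"),
    ("DATE", "7 August 2007", "the seventh of august two thousand seven"),
]
print(token_accuracy_by_class(tokens, toy_normalize))
print(sentence_accuracy(["a b", "c"], ["a b", "d"]))
```

Under these definitions a single wrong token makes its whole sentence count as wrong, which is why sentence accuracy is typically lower than token accuracy.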


# Notes

The current system expects well-formed sentences and word boundaries. By default, a semiotic token must be surrounded by non-word tokens. For example, `A & E` is detected as `VERBATIM`, but `A&E` is not detected because there are no spaces around `&`. As an exercise, adjust the word boundary definition in [nemo_text_processing/text_normalization/tagger.py](https://github.com/NVIDIA/NeMo/blob/main/nemo_text_processing/text_normalization/tagger.py) to handle this case too.
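
The effect of the boundary definition can be seen in a toy example that is independent of `tagger.py`. The two patterns below are assumptions made for illustration only: `STRICT` requires whitespace (or the string edge) on both sides of `&`, while `RELAXED` accepts `&` with or without surrounding spaces.

```python
import re

# Strict boundary: "&" must not touch a non-whitespace character on either side.
STRICT = re.compile(r"(?<!\S)&(?!\S)")
# Relaxed boundary: "&" matches anywhere; surrounding spaces are absorbed.
RELAXED = re.compile(r"\s*&\s*")


def verbalize_ampersand(text, pattern):
    """Replace every '&' matched by the given pattern with the word 'and'."""
    if pattern is STRICT:
        return pattern.sub("and", text)
    return pattern.sub(" and ", text)


for raw in ["A & E", "A&E"]:
    print(raw, "->", verbalize_ampersand(raw, STRICT), "|", verbalize_ampersand(raw, RELAXED))
```

With the strict pattern only `A & E` is rewritten; the relaxed pattern handles both inputs, at the cost of also matching `&` inside strings where it may not be a standalone token (for example, in URLs).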