In [None]:
"""
You can either run this notebook locally (if you have all the dependencies and a GPU) or on Google Colab.

Instructions for setting up Colab are as follows:
1. Open a new Python 3 notebook.
2. Import this notebook from GitHub (File -> Upload Notebook -> "GITHUB" tab -> copy/paste GitHub URL)
3. Connect to an instance with a GPU (Runtime -> Change runtime type -> select "GPU" for hardware accelerator)
4. Run this cell to set up dependencies.
"""
# If you're using Google Colab and not running locally, run this cell

# install NeMo
BRANCH = 'r1.0.0rc1'
!python -m pip install git+https://github.com/NVIDIA/NeMo.git@$BRANCH#egg=nemo_toolkit[nlp]

In [None]:
import json
import os
import wget
import numpy as np
import inspect
import regex as re


# Introduction
Text normalization for Text to Speech (TTS) converts text into its verbalized form. It targets tokens that belong to special semiotic classes, such as numbers,
times, dates, and monetary amounts, which are often written in a way that differs from the
way they are verbalized. For example, "10:00" -> "ten o'clock", "10:00 a.m." -> "ten a m", "10kg" -> "ten kilograms".

We use the same semiotic classes as in the [Google Text normalization dataset](https://www.kaggle.com/richardwilliamsproat/text-normalization-for-english-russian-and-polish):
PLAIN, PUNCT, DATE, CARDINAL, LETTERS, VERBATIM, MEASURE, DECIMAL, ORDINAL, DIGIT, MONEY, TELEPHONE, ELECTRONIC, FRACTION, TIME, ADDRESS. We additionally add the class `WHITELIST` for all whitelisted tokens, whose verbalizations are looked up directly in a user-defined list.
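
The `WHITELIST` lookup can be pictured as a plain dictionary lookup. The following is a minimal sketch, assuming the user-defined list is a two-column, tab-separated file of key-value pairs. The helper names `load_whitelist` and `verbalize_whitelisted` and the sample entries are hypothetical and are not part of NeMo.

```python
import csv
import io


def load_whitelist(handle):
    """Read tab-separated key/value pairs into a dict (hypothetical helper)."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    # Skip rows that do not have both a key and a value.
    return {row[0]: row[1] for row in reader if len(row) >= 2}


def verbalize_whitelisted(token, whitelist):
    """Return the whitelisted verbalization, or the token itself if unknown."""
    return whitelist.get(token, token)


# Made-up sample entries, used only for illustration.
sample_tsv = io.StringIO("Dr.\tdoctor\nMr.\tmister\n")
whitelist = load_whitelist(sample_tsv)
print(verbalize_whitelisted("Dr.", whitelist))
print(verbalize_whitelisted("hello", whitelist))
```

A dictionary keeps the lookup exact and fast. The sketch matches keys case-sensitively, which may or may not be what the real system does.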

This tutorial shows how to use the NeMo rule-based text normalization system.
Similar to [the Google Kestrel TTS text normalization
system](https://www.researchgate.net/profile/Richard_Sproat/publication/277932107_The_Kestrel_TTS_text_normalization_system/links/57308b1108aeaae23f5cc8c4/The-Kestrel-TTS-text-normalization-system.pdf), the NeMo rule-based system is divided into a tagger and a verbalizer: the tagger detects and classifies semiotic classes in the underlying text, and the verbalizer takes the output of the tagger and carries out the normalization.
In the example 'The alarm goes off at 10:30 a.m.', the time tagger detects `10:30 a.m.` as a valid time with `hour=10`, `minutes=30`, `suffix=a.m.`, and the verbalizer then turns this into `ten thirty a m`.
The system is designed to be easy to debug and to extend with additional rules. We provide both inference for unlabeled data and evaluation for labeled data.
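
To make the tagger/verbalizer split concrete, here is a minimal, self-contained sketch for the time class only. It is a toy illustration and not NeMo's implementation: the regular expression, the helper names (`tag_time`, `verbalize_time`, `normalize_time`), and the number-to-words table are all assumptions made for this example.

```python
import re

# Toy number-to-words helper for 0..59 (enough for hours and minutes).
ONES = [
    "zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
    "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
    "sixteen", "seventeen", "eighteen", "nineteen",
]
TENS = {2: "twenty", 3: "thirty", 4: "forty", 5: "fifty"}


def number_to_words(n):
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] if ones == 0 else f"{TENS[tens]} {ONES[ones]}"


# Hypothetical pattern: H:MM or HH:MM with an optional "a.m."/"p.m." suffix.
TIME_PATTERN = re.compile(
    r"\b(?P<hour>\d{1,2}):(?P<minutes>\d{2})(?:\s*(?P<suffix>[ap]\.m\.))?"
)


def tag_time(text):
    """Tagger step: detect a time expression and return its fields, or None."""
    match = TIME_PATTERN.search(text)
    if match is None:
        return None
    hour, minutes = int(match["hour"]), int(match["minutes"])
    if hour > 23 or minutes > 59:
        return None  # not a valid time
    return {"span": match.span(), "hour": hour, "minutes": minutes,
            "suffix": match["suffix"]}


def verbalize_time(tag):
    """Verbalizer step: turn the tagged fields into words."""
    words = [number_to_words(tag["hour"])]
    if tag["minutes"] != 0:
        # Minutes 01-09 get no special treatment in this toy version.
        words.append(number_to_words(tag["minutes"]))
    elif tag["suffix"] is None:
        words.append("o'clock")
    if tag["suffix"] is not None:
        words.extend(tag["suffix"].replace(".", " ").split())
    return " ".join(words)


def normalize_time(text):
    """Run the tagger, then the verbalizer, and splice the result back in."""
    tag = tag_time(text)
    if tag is None:
        return text
    start, end = tag["span"]
    return text[:start] + verbalize_time(tag) + text[end:]


print(tag_time("The alarm goes off at 10:30 a.m."))
print(normalize_time("The alarm goes off at 10:30 a.m."))
```

Keeping the two steps separate means a wrong output can be traced either to detection (the tag is missing or has wrong fields) or to verbalization (the tag is right but the words are wrong).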

We provide a set of rules that covers the majority of semiotic classes found in the [Google Text normalization dataset](https://www.kaggle.com/richardwilliamsproat/text-normalization-for-english-russian-and-polish) for the English language. As with every language, there is a long tail of special cases.

This tutorial shows how to run prediction on regular text data. It also shows how to run evaluation on a labeled text normalization dataset that follows the format of the [Google Text normalization dataset](https://www.kaggle.com/richardwilliamsproat/text-normalization-for-english-russian-and-polish).


In [None]:
# If you're running the notebook locally, update the TOOLS_DIR path below
# In Colab, a few required scripts will be downloaded from NeMo github

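TOOLS_DIR = '<UPDATE_PATH_TO_NeMo_root>/tools/text_normalization/'
# Define the data directory for local runs too, since later cells use it
TOOLS_DATA_DIR = TOOLS_DIR + "data/"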

if 'google.colab' in str(get_ipython()):
    TOOLS_DIR = 'tools/text_normalization/'
    TOOLS_DATA_DIR = TOOLS_DIR + "data/"
    os.makedirs(TOOLS_DIR, exist_ok=True)
    os.makedirs(TOOLS_DATA_DIR, exist_ok=True)

    required_files = [
      'normalize.py',
      'tagger.py',
      'utils.py',
      'run_evaluate.py',
      'run_predict.py',
      'verbalizer.py',
    ]
    required_data_file = [             
      'currency.tsv',
      'magnitudes.tsv',
      'measurements.tsv',
      'months.tsv',
      'whitelist.tsv'
    ]
    for file in required_files:
        if not os.path.exists(os.path.join(TOOLS_DIR, file)):
            file_path = f'https://raw.githubusercontent.com/NVIDIA/NeMo/{BRANCH}/' + TOOLS_DIR + file
            print(file_path)
            wget.download(file_path, TOOLS_DIR)
    for file in required_data_file:
        if not os.path.exists(os.path.join(TOOLS_DATA_DIR, file)):
            file_path = f'https://raw.githubusercontent.com/NVIDIA/NeMo/{BRANCH}/' + TOOLS_DATA_DIR + file
            print(file_path)
            wget.download(file_path, TOOLS_DATA_DIR)
elif not os.path.exists(TOOLS_DIR):
    raise ValueError('Update TOOLS_DIR to point to the NeMo root directory')

`TOOLS_DIR` should now contain the scripts that we need in the next steps. All necessary scripts can be found [here](https://github.com/NVIDIA/NeMo/tree/main/tools/text_normalization).

In [None]:
print(TOOLS_DIR)
! ls -l $TOOLS_DIR
! ls -l $TOOLS_DATA_DIR

# Data Preparation and Download


## Data for Prediction
For prediction, let's download a text file from [http://www.gutenberg.org/files/48874/48874-0.txt](http://www.gutenberg.org/files/48874/48874-0.txt).

In [None]:
# Create the data directory and download a text file
WORK_DIR = 'WORK_DIR'
DATA_DIR = WORK_DIR + '/DATA'
os.makedirs(DATA_DIR, exist_ok=True)
text_file = '48874-0.txt'
if not os.path.exists(os.path.join(DATA_DIR, text_file)):
    print('Downloading text file')
    wget.download('http://www.gutenberg.org/files/48874/' + text_file, DATA_DIR)

`DATA_DIR` should now contain the text file.

In [None]:
!ls -l $DATA_DIR

Print the first 10 lines of the file:

In [None]:
! head -n 10 $DATA_DIR/$text_file

# Prediction
Here we walk through `$TOOLS_DIR/run_predict.py` step by step.


In [None]:

import tools.text_normalization.verbalizer as verbalizer 
import tools.text_normalization.tagger as tagger 
import tools.text_normalization.normalize as normalize
from tools.text_normalization.run_predict import load_file, write_file

In [None]:
data = load_file(f"{DATA_DIR}/{text_file}")
print(len(data), "sentences") 

If you want to see how things were normalized, set the `verbose=True` flag.



In [None]:
# normalized_sentences = normalize.normalize_nemo(data, verbose=True)
normalized_sentences = normalize.normalize_nemo(data, verbose=False)

In [None]:
# Saves output to file
output_file_path=f"{DATA_DIR}/{text_file}.normalized"
write_file(file_path=output_file_path, data=normalized_sentences)

In [None]:
# Check that the file is stored correctly
! ls -l $output_file_path
! head -n 10 $output_file_path

## Data for Evaluation

The data for Evaluation needs to be segmented and labeled by semiotic class, following the format of [Google Text normalization dataset](https://www.kaggle.com/richardwilliamsproat/text-normalization-for-english-russian-and-polish).
That is, every line of the file must have the format `<semiotic class>\t<unnormalized text>\t<self>` for a trivial class, or `<semiotic class>\t<unnormalized text>\t<normalized text>` for a semiotic class that requires normalization.
`WHITELIST` is the semiotic class for all whitelisted tokens, whose verbalizations are looked up directly in `$TOOLS_DATA_DIR/whitelist.tsv`. To extend the list, simply add further key-value pairs to the file.
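
As a rough sketch of how this format can be read, the snippet below parses labeled lines into sentences of tokens. It assumes that `<self>` stands for "same as the unnormalized text" and that `<eos>` lines separate sentences, as in the Google dataset files. The `Token` type and `parse_labeled_lines` are hypothetical names for this illustration and are not the loaders shipped in `utils.py`.

```python
from typing import List, NamedTuple


class Token(NamedTuple):
    semiotic_class: str
    unnormalized: str
    normalized: str


def parse_labeled_lines(lines) -> List[List[Token]]:
    """Group labeled lines into sentences, resolving <self> to the raw text."""
    sentences, current = [], []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # ignore blank lines
        fields = line.split("\t")
        if fields[0] == "<eos>":
            # End of sentence: flush the tokens collected so far.
            if current:
                sentences.append(current)
            current = []
            continue
        if len(fields) != 3:
            raise ValueError(f"expected 3 tab-separated fields, got: {line!r}")
        semiotic_class, unnormalized, normalized = fields
        if normalized == "<self>":
            normalized = unnormalized
        current.append(Token(semiotic_class, unnormalized, normalized))
    if current:
        sentences.append(current)
    return sentences


sample = (
    "PLAIN\tRetrieved\t<self>\n"
    "DATE\t2006-08-05\tthe fifth of august two thousand six\n"
    "PUNCT\t.\tsil\n"
    "<eos>\t<eos>\n"
    "PLAIN\tDownloaded\t<self>\n"
    "PUNCT\t.\tsil"
)
for sentence in parse_labeled_lines(sample.splitlines()):
    print([token.normalized for token in sentence])
```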


We will create a simple example file to show how evaluation works:





In [None]:
eval_input_data =  """LETTERS\tA & E\ta and e
PUNCT\t.\tsil
PLAIN\tRetrieved\t<self>
DATE\t2006-08-05\tthe fifth of august two thousand six
PUNCT\t.\tsil
<eos>\t<eos>
PLAIN\tDownloaded\t<self>
PLAIN\ton\t<self>
DATE\t7 August 2007\tthe seventh of august two thousand seven
PUNCT\t.\tsil"""
eval_text_file_path = f"{DATA_DIR}/00001-of-00100"
with open(eval_text_file_path, 'w') as fp:
    fp.write(eval_input_data)
! cat $eval_text_file_path


# Evaluation

Here we walk through `$TOOLS_DIR/run_evaluate.py` step by step.
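
The two metrics computed in this section can be sketched independently of NeMo. The snippet below assumes that accuracy is the fraction of predictions that exactly match the reference, and that the overall token accuracy is the per-class accuracy weighted by the number of tokens in each class. The functions `exact_match_accuracy` and `weighted_accuracy` are hypothetical, and the `evaluate` function in `utils.py` may differ in detail. The numbers in the example are made up.

```python
def exact_match_accuracy(predictions, references):
    """Fraction of predictions that are identical to their reference."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    correct = sum(pred == ref for pred, ref in zip(predictions, references))
    return correct / len(references)


def weighted_accuracy(accuracy_per_type, count_per_type):
    """Combine per-class accuracies, weighting each class by its token count."""
    total = sum(count_per_type.values())
    weighted = sum(
        accuracy_per_type[token_type] * count_per_type[token_type]
        for token_type in accuracy_per_type
    )
    return weighted / total


# Made-up example: 2 DATE tokens (one wrong) and 3 PLAIN tokens (all right).
accuracy_per_type = {
    "DATE": exact_match_accuracy(["a", "b"], ["a", "c"]),
    "PLAIN": exact_match_accuracy(["x", "y", "z"], ["x", "y", "z"]),
}
count_per_type = {"DATE": 2, "PLAIN": 3}
print(accuracy_per_type)
print(weighted_accuracy(accuracy_per_type, count_per_type))
```

Weighting by token count makes the overall figure equal to the accuracy over all tokens pooled together, so a rare class with low accuracy does not dominate the result.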


In [None]:

import tools.text_normalization.verbalizer as verbalizer 
import tools.text_normalization.tagger as tagger 
import tools.text_normalization.normalize as normalize
from tools.text_normalization.run_predict import load_file, write_file
from tools.text_normalization.utils import (
    evaluate,
    known_types,
    load_files,
    training_data_to_sentences,
    training_data_to_tokens,
)


In [None]:
eval_input_data = load_files([eval_text_file_path])
print(eval_input_data)

In [None]:
print("Sentence level evaluation...")
sentences_un_normalized, sentences_normalized = training_data_to_sentences(eval_input_data)
print("- Data: " + str(len(sentences_un_normalized)) + " sentences")
sentences_prediction = normalize.normalize_nemo(sentences_un_normalized)
print("- Normalized. Evaluating...")
sentences_accuracy = evaluate(sentences_prediction, sentences_normalized, sentences_un_normalized)
print("- Accuracy: " + str(sentences_accuracy))

In [None]:
print("Token level evaluation...")
tokens_per_type = training_data_to_tokens(eval_input_data)
token_accuracy = {}
for token_type in tokens_per_type:
    print("- Token type: " + token_type)
    tokens_un_normalized, tokens_normalized = tokens_per_type[token_type]
    print("  - Data: " + str(len(tokens_un_normalized)) + " tokens")
    tokens_prediction = normalize.normalize_nemo(tokens_un_normalized)
    print("  - Normalized. Evaluating...")
    token_accuracy[token_type] = evaluate(tokens_prediction, tokens_normalized, tokens_un_normalized)
    print("  - Accuracy: " + str(token_accuracy[token_type]))
token_count_per_type = {token_type: len(tokens_per_type[token_type][0]) for token_type in tokens_per_type}
token_weighted_accuracy = [
    token_count_per_type[token_type] * accuracy for token_type, accuracy in token_accuracy.items()
]
print("- Accuracy: " + str(sum(token_weighted_accuracy) / sum(token_count_per_type.values())))

# Notes

The current system expects well-formed sentences and word boundaries. By default, a semiotic token must be surrounded by non-word tokens. For example, `A & E` is detected as `VERBATIM`, whereas `A&E` is not, because there are no spaces around `&`. As an exercise, adjust the word boundary definition in [tools/text_normalization/tagger.py](https://github.com/NVIDIA/NeMo/blob/main/tools/text_normalization/tagger.py) to handle this case as well.
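
The effect of the boundary definition can be illustrated with plain regular expressions. This is only a sketch of the idea, not the pattern used in `tagger.py`: the `STRICT_AMPERSAND` and `RELAXED_AMPERSAND` patterns and the `verbalize_ampersand` helper are assumptions made for this example.

```python
import re

# Strict boundary: "&" counts only when it has whitespace on both sides.
STRICT_AMPERSAND = re.compile(r"(?<=\s)&(?=\s)")
# Relaxed boundary: "&" counts anywhere; surrounding spaces are optional.
RELAXED_AMPERSAND = re.compile(r"\s*&\s*")


def verbalize_ampersand(text, relaxed=False):
    """Replace '&' with 'and' under a strict or relaxed boundary definition."""
    if relaxed:
        # Re-insert the spaces that the relaxed pattern consumed.
        return RELAXED_AMPERSAND.sub(" and ", text)
    return STRICT_AMPERSAND.sub("and", text)


for text in ["A & E", "A&E"]:
    print(repr(text), "->", repr(verbalize_ampersand(text)),
          "| relaxed:", repr(verbalize_ampersand(text, relaxed=True)))
```

The relaxed pattern catches more cases but is also more likely to fire inside strings where `&` should stay as is, such as URLs, which is one reason a strict default can be reasonable.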