# Text Normalization using Memory Augmented Neural Networks
---

This notebook and the accompanying paper, <a href="http://arxiv.org/abs/1806.00044">Text Normalization using Memory Augmented Neural Networks</a>, demonstrate an accuracy of 99.4% (English) and 99.3% (Russian) on the text normalization challenge posed by Richard Sproat and Navdeep Jaitly. The approach used here secured 6th position in the [Kaggle Russian Text Normalization Challenge](https://www.kaggle.com/c/text-normalization-challenge-russian-language) organized by Google's Text Normalization Research Group.

# Table of Contents
---
1. [Import Dependencies](#import)
2. [Global Config](#config)
3. [Load Dataset](#load)
4. [XGBoost Classification](#xgb)
5. [Encode Data](#encode)
6. [DNC Normalization](#dnc)
7. [Data Postprocessing](#post)
8. [Results Analysis](#result)
9. [Comparison](#comparison)  
10. [Conclusion](#conclusion)

___

## 1. Import Dependencies
<a id="import"></a>

### Import Libraries

In [1]:
import os
import gc
import sys

import pickle
import numpy as np
import pandas as pd

import seaborn as sns
import matplotlib.pyplot as plt 
from sklearn.metrics import accuracy_score

from xgboost.sklearn import XGBClassifier

%matplotlib inline



In [2]:
%load_ext autoreload
%autoreload 2

### Import Utilities

In [3]:
sys.path.append("../src")

from utils import Encoder
from utils import Normalized2String
from XGBclassify import XGB
import DNCnormalize

**System Information**

In [4]:
%load_ext watermark
%watermark -v -n -m -p numpy,pandas,matplotlib,seaborn,sklearn,xgboost,tensorflow

Wed May 30 2018 

CPython 3.6.3
IPython 6.2.1

numpy 1.13.3
pandas 0.21.0
matplotlib 2.1.0
seaborn 0.8.1
sklearn 0.19.1
xgboost 0.6
tensorflow 1.3.0

compiler   : GCC 7.2.0
system     : Linux
release    : 4.13.0-1017-gcp
machine    : x86_64
processor  : x86_64
CPU cores  : 4
interpreter: 64bit


## 2. Global Config
<a id="config"></a>


**Language : English or Russian?**

In [5]:
#lang = 'english'
lang = 'russian'

In [6]:
if lang == 'english':
    # input data
    data_directory = '../data/english/'
    data = data_directory+'output-00099-of-00100_processed.csv'
    vocab = data_directory+'en_vocab.data'
    # interim data
    encoded_file = data_directory+'en_encoded.npy'
    encoded_len_file = data_directory+'en_encoded_len.npy'
    normalized_file = data_directory+'en_normalized.npy'
    # model
    model_directory = '../models/english/'
    xgb_path = model_directory+'en_xgb_tuned-trained.pk'
    dnc_path = model_directory+'dnc_translator/ckpt'
    end_token = -1
    # results
    result_dir = '../results/english/'
    result_csv = 'normalized.csv'

elif lang == 'russian':
    # input data
    data_directory = '../data/russian/'
    data = data_directory+'output-00099-of-00100_processed.csv'
    vocab = data_directory+'ru_vocab.data'
    # interim data
    encoded_file = data_directory+'ru_encoded.npy'
    encoded_len_file = data_directory+'ru_encoded_len.npy'
    normalized_file = data_directory+'ru_normalized.npy'
    # model
    model_directory = '../models/russian/'
    xgb_path = model_directory+'ru_xgb_tuned-trained.pk'
    dnc_path = model_directory+'dnc_translator/ckpt'
    end_token = -1
    # results
    result_dir = '../results/russian/'
    result_csv = 'normalized.csv'
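
The two branches above differ only in the language name and a two-letter file prefix, so the same settings could be derived from one helper. The following is a minimal sketch under that assumption; `build_config` and `PREFIX` are hypothetical names introduced here for illustration and are not part of the repository.

```python
# Hypothetical helper (not part of the repository): derives the same
# settings as the if/elif block above from the language name alone.
PREFIX = {'english': 'en', 'russian': 'ru'}


def build_config(lang):
    """Return the notebook's path settings for the given language."""
    if lang not in PREFIX:
        raise ValueError('unsupported language: {!r}'.format(lang))
    prefix = PREFIX[lang]
    data_directory = '../data/{}/'.format(lang)
    model_directory = '../models/{}/'.format(lang)
    return {
        # input data
        'data': data_directory + 'output-00099-of-00100_processed.csv',
        'vocab': data_directory + prefix + '_vocab.data',
        # interim data
        'encoded_file': data_directory + prefix + '_encoded.npy',
        'encoded_len_file': data_directory + prefix + '_encoded_len.npy',
        'normalized_file': data_directory + prefix + '_normalized.npy',
        # model
        'xgb_path': model_directory + prefix + '_xgb_tuned-trained.pk',
        'dnc_path': model_directory + 'dnc_translator/ckpt',
        'end_token': -1,
        # results
        'result_dir': '../results/{}/'.format(lang),
        'result_csv': 'normalized.csv',
    }


config = build_config('russian')
print(config['vocab'])  # ../data/russian/ru_vocab.data
```

An unsupported language raises a `ValueError` here, whereas the if/elif version silently leaves the variables undefined.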

**Load DNC Configurations**

In [7]:
with open(vocab, 'rb') as vf:
    vocab_load = pickle.load(vf)
start_token = vocab_load['output']['<GO>']
input_vocab_len=len(vocab_load['input'])+1
output_vocab_len=len(vocab_load['output'])+1
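
The cell above implies that the pickled vocabulary is a dictionary with `'input'` and `'output'` mappings, and that the output mapping contains a `'<GO>'` entry. The sketch below uses a made-up toy vocabulary (the symbols, the ids, and the `'<PAD>'` entry are assumptions, since the real contents are not shown in this notebook) to illustrate how the three values are derived.

```python
import io
import pickle

# Toy vocabulary; symbols and ids are invented for illustration only.
toy_vocab = {
    'input': {'<PAD>': 0, 'a': 1, 'b': 2},
    'output': {'<PAD>': 0, '<GO>': 1, 'one': 2, 'two': 3},
}

# Round-trip through pickle, mirroring how the notebook reads the vocab file.
buffer = io.BytesIO(pickle.dumps(toy_vocab))
vocab_load = pickle.load(buffer)

start_token = vocab_load['output']['<GO>']
input_vocab_len = len(vocab_load['input']) + 1
output_vocab_len = len(vocab_load['output']) + 1
print(start_token, input_vocab_len, output_vocab_len)  # 1 4 5
```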

## 3. Load Dataset
<a id="load"></a>

**Dataset by Sproat and Jaitly (2016) - An RNN Model of Text Normalization**  
- English Source: https://storage.googleapis.com/text-normalization/en_with_types.tgz
- Russian Source: https://storage.googleapis.com/text-normalization/ru_with_types.tgz  
*The data is preprocessed so that the results are comparable to those reported in the above-mentioned paper.*

**Read CSV as DataFrame**

In [8]:
raw_data = pd.read_csv(data)
raw_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 93196 entries, 0 to 93195
Data columns (total 5 columns):
sentence_id    93196 non-null int64
token_id       93196 non-null int64
semiotic       93196 non-null object
before         93196 non-null object
after          93196 non-null object
dtypes: int64(2), object(3)
memory usage: 3.6+ MB


In [9]:
raw_data.head()

| | sentence_id | token_id | semiotic | before | after |
|---|---|---|---|---|---|
| 0 | 0 | 0 | PLAIN | Сбор | Сбор |
| 1 | 0 | 1 | PUNCT | ) | ) |
| 2 | 0 | 2 | PUNCT | — | — |
| 3 | 0 | 3 | PLAIN | село | село |
| 4 | 0 | 4 | PLAIN | в | в |


**Dropping the ground truth labels**

In [10]:
raw_data.drop(['after'], axis=1,inplace=True)

## 4. XGBoost Classification  
<a id="xgb"></a>
**ToBeNormalized or RemainSelf**

In [11]:
# instantiate and load trained
# XGBoost Model for classification
xgb = XGB(xgb_path)

In [12]:
# Class of tokens in the data
raw_data['class'] = xgb.predict(data=raw_data)
# Raw to Classified Data
classified_data = raw_data.copy(deep=False)

Processed 100%        

In [13]:
classified_data.sample(n=10)

| | sentence_id | token_id | semiotic | before | class |
|---|---|---|---|---|---|
| 89132 | 6508 | 15 | PLAIN | отсутствует | RemainSelf |
| 10130 | 754 | 6 | PLAIN | театра | RemainSelf |
| 58451 | 4256 | 12 | PUNCT | . | RemainSelf |
| 27642 | 1999 | 18 | PLAIN | Фильтровую | RemainSelf |
| 40546 | 2950 | 21 | PLAIN | методологий | RemainSelf |
| 45070 | 3282 | 6 | PLAIN | музыкант | RemainSelf |
| 64527 | 4699 | 7 | PUNCT | — | RemainSelf |
| 91998 | 6722 | 5 | PUNCT | . | RemainSelf |
| 74773 | 5462 | 18 | PLAIN | исторических | RemainSelf |
| 34650 | 2526 | 9 | PUNCT | ( | RemainSelf |


In [14]:
id_tobenormalized = classified_data.index[classified_data['class']=='ToBeNormalized'].tolist()
id_remainself = classified_data.index[classified_data['class']=='RemainSelf'].tolist()

Sanity Check...

In [15]:
print('Tokens to be normalized : {}'.format(len(id_tobenormalized)))
print('Tokens to remain self : {}'.format(len(id_remainself)))

Tokens to be normalized : 11312
Tokens to remain self : 81884
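
In the run above, 11312 + 81884 = 93196, which matches the row count reported by `raw_data.info()`, so every token landed in exactly one of the two lists. A minimal sketch of the same split and check on a toy frame (the rows are invented for illustration):

```python
import pandas as pd

# Toy frame standing in for classified_data; the rows are invented.
toy = pd.DataFrame({
    'before': ['село', '1905', 'в', '12:30'],
    'class': ['RemainSelf', 'ToBeNormalized', 'RemainSelf', 'ToBeNormalized'],
})

# Index labels of each class, as in the notebook cell above.
id_tobenormalized = toy.index[toy['class'] == 'ToBeNormalized'].tolist()
id_remainself = toy.index[toy['class'] == 'RemainSelf'].tolist()

# The two lists should be disjoint and together cover every row.
assert not set(id_tobenormalized) & set(id_remainself)
assert len(id_tobenormalized) + len(id_remainself) == len(toy)
print(id_tobenormalized, id_remainself)  # [1, 3] [0, 2]
```

If the classifier ever emitted a third label, the coverage assertion would fail, which makes it a slightly stronger check than printing the two counts.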


## 5. Encode Data
<a id="encode"></a>

In [16]:
# Instantiate encoder with the vocabulary
encoder = Encoder(vocab_file=vocab)

In [17]:
# use the existing encoder parameters
# to encode the test data
enc_data, enc_len = encoder.encode(classified_data)

In [18]:
# keep the encodings of the 'ToBeNormalized' tokens only
tobenormalized_enc_data = enc_data[id_tobenormalized]
tobenormalized_enc_len = enc_len[id_tobenormalized]
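
The selection above indexes the encoded arrays with a list of DataFrame index labels. This works because the frame has a default `RangeIndex`, so labels and row positions coincide. A small sketch, assuming the encoder returns NumPy arrays with one row per token (the array contents below are invented):

```python
import numpy as np

# Toy stand-ins: 4 tokens, each padded to 3 symbols (values are invented).
enc_data = np.array([[5, 6, 0],
                     [7, 0, 0],
                     [8, 9, 1],
                     [2, 0, 0]])
enc_len = np.array([2, 1, 3, 1])

# Row positions of the tokens to normalize.
ids = [1, 3]

# Integer-list indexing picks whole rows and returns copies.
selected_data = enc_data[ids]
selected_len = enc_len[ids]
print(selected_data.shape, selected_len.tolist())  # (2, 3) [1, 1]
```

If the DataFrame were filtered or re-sorted before encoding, the labels would no longer be valid positions and the index would need to be reset first.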

Sanity check...

In [19]:
print('Tokens to be normalized : {}'.format(len(tobenormalized_enc_data)))

Tokens to be normalized : 11312


Saving encoded data

In [20]:
np.save(encoded_file, tobenormalized_enc_data)
np.save(encoded_len_file, tobenormalized_enc_len)

## 6. DNC Normalization
<a id="dnc"></a>
**Generate Normalized Form**

In [21]:
tobenormalized_enc_data = np.load(encoded_file)
tobenormalized_enc_len = np.load(encoded_len_file)

In [None]:
DNCnormalize.config['num_encoder_symbols'] = input_vocab_len
DNCnormalize.config['num_decoder_symbols'] = output_vocab_len 
DNCnormalize.config['start_token']= start_token
DNCnormalize.config['end_token']= end_token

In [None]:
normalized_data = DNCnormalize.normalize(tobenormalized_enc_data, tobenormalized_enc_len,dnc_path)

Using DNC model at ../models/russian/dnc_translator/ckpt
building model..
building encoder..
building decoder and attention..
building greedy decoder..
Reloading model parameters...
INFO:tensorflow:Restoring parameters from ../models/russian/dnc_translator/ckpt
model restored from ../models/russian/dnc_translator/ckpt
Number of batches: 56
Normalized 200 out of 11200


Sanity Check...

In [None]:
len(tobenormalized_enc_data)

In [None]:
len(normalized_data)

**Saving the normalized form**

In [None]:
np.save(normalized_file, normalized_data)

## 7. Data Postprocessing
<a id="post"></a>

**Load Normalized Data**

In [None]:
normalized_data = np.load(normalized_file)

**Encoded Form to String Form**

In [None]:
# Converting the numpy array to a list form
normalized_data = normalized_data[0:tobenormalized_enc_len.shape[0]]
normalized_data = np.split(normalized_data, normalized_data.shape[0])

Sanity check

In [None]:
print('Total instances : {}'.format(len(normalized_data)))
print('Shape of each instance : {}'.format(normalized_data[0].shape))

In [None]:
# Reshaping the nested numpy arrays
for i in range(len(normalized_data)):
    normalized_data[i] = np.reshape(normalized_data[i],
                                    normalized_data[i].shape[2])
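
The split and reshape above only work if the saved array has a singleton middle axis, i.e. shape `(N, 1, T)` for `N` tokens and `T` decoded symbols. That layout is inferred from the code, not documented here. Under that assumption, the sketch below shows the loop-based conversion next to an equivalent slicing version on a toy array:

```python
import numpy as np

# Toy array standing in for the DNC output: 3 tokens, 4 decoded symbols each.
# The (N, 1, T) layout is inferred from the reshape above, not documented.
normalized = np.arange(12).reshape(3, 1, 4)

# Loop-based version, as in the notebook: N pieces of shape (1, 1, T),
# each flattened to shape (T,).
pieces = np.split(normalized, normalized.shape[0])
pieces = [np.reshape(p, p.shape[2]) for p in pieces]

# Equivalent version: drop the singleton axis and iterate over rows.
rows = list(normalized[:, 0, :])

assert all(np.array_equal(a, b) for a, b in zip(pieces, rows))
print(len(pieces), pieces[0].shape)  # 3 (4,)
```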

In [None]:
# Converting encoded to string format
str_converter = Normalized2String(vocab)
for i in range(len(normalized_data)):
    normalized_data[i]=str_converter.to_str(normalized_data[i])

A sneak peek...

In [None]:
normalized_data[:10]

**Merging Normalized with Remain Self**

In [None]:
classified_data['after'] = ''

Normalized

In [None]:
classified_data.loc[id_tobenormalized, 'after'] = normalized_data


RemainSelf

In [None]:
classified_data.loc[id_remainself, 'after'] = classified_data.loc[id_remainself, 'before'] 
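
The merge fills one `after` column from two sources: model output for the tokens to be normalized, and the unchanged `before` text for everything else. A minimal sketch of the same pattern on a toy frame (the tokens and the normalized string are invented):

```python
import pandas as pd

# Toy frame standing in for classified_data; rows are invented.
toy = pd.DataFrame({
    'before': ['село', '5', 'в'],
    'class': ['RemainSelf', 'ToBeNormalized', 'RemainSelf'],
})
id_tobenormalized = [1]
id_remainself = [0, 2]
normalized = ['пять']  # invented model output, one string per selected row

toy['after'] = ''
# A list assigned through .loc is matched to the selected rows by position,
# so it must be in the same order as id_tobenormalized.
toy.loc[id_tobenormalized, 'after'] = normalized
# A Series assigned through .loc is aligned on index labels.
toy.loc[id_remainself, 'after'] = toy.loc[id_remainself, 'before']

# No row should be left with the empty placeholder.
assert (toy['after'] != '').all()
print(toy['after'].tolist())  # ['село', 'пять', 'в']
```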

A sneak peek into the final results...

In [None]:
classified_data.loc[id_tobenormalized]

Store the results

In [None]:
classified_data.to_csv(result_dir+result_csv, index=False)

## 8. Results Analysis
<a id="result"></a>

In [None]:
results = pd.read_csv(result_dir+result_csv)
truth = pd.read_csv(data)

In [None]:
classified_data.loc[truth['before']==truth['after'],'truth']='RemainSelf'
classified_data.loc[truth['before']!=truth['after'],'truth']='ToBeNormalized'

In [None]:
classified_data[classified_data['class']==classified_data['truth']].shape[0]/classified_data.shape[0]

In [None]:
truth['class']=''
truth.loc[truth['before']!=truth['after'],'class']='ToBeNormalized'
truth.loc[truth['before']==truth['after'],'class']='RemainSelf'

In [None]:
classification_score = np.sum(truth['class']==results['class'])/truth.shape[0]
print('Classification Accuracy on {} language is {:.5f}'.format(lang, classification_score))

In [None]:
np.sum(truth['semiotic']!=results['semiotic'])

**Overall Accuracy**

In [None]:
score = accuracy_score(truth['after'].tolist(),
                       results['after'].tolist())

In [None]:
print('Accuracy on {} language is {:.5f}'.format(lang, score))
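
For two equal-length lists of labels, `accuracy_score` returns the fraction of positions where the prediction matches the truth exactly, so here a token counts as correct only if its whole normalized string is identical. A small sketch of the same quantity computed with NumPy on invented strings:

```python
import numpy as np

# Invented ground truth and predictions; the last prediction is wrong.
truth_after = ['село', 'пять', 'в', 'два']
pred_after = ['село', 'пять', 'в', 'две']

# Element-wise exact string match, then the mean of the boolean mask.
matches = np.array(truth_after) == np.array(pred_after)
accuracy = matches.mean()
print('{:.2f}'.format(accuracy))  # 0.75
```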

**Semiotic class-wise accuracy**

In [None]:
results_group = results.groupby('semiotic')
truth_group = truth.groupby('semiotic')

In [None]:
class_accuracy = pd.DataFrame(columns=['semiotic-class', 'accuracy', 'count', 'correct'])
row = {'semiotic-class': 'ALL',
       'accuracy': score,
       'count': results.shape[0],
       'correct': score*results.shape[0]}
class_accuracy = class_accuracy.append(row, ignore_index=True)

for results_items, truth_items in zip(results_group, truth_group):
    semiotic_class = results_items[0]
    results_items = results_items[1]
    truth_items = truth_items[1]
    score = accuracy_score(truth_items['after'].tolist(),
                          results_items['after'].tolist())
    row = {'semiotic-class': semiotic_class,
           'accuracy': score,
           'count': results_items.shape[0],
           'correct': score*results_items.shape[0]}
    class_accuracy = class_accuracy.append(row, ignore_index=True)
class_accuracy['correct'] = class_accuracy['correct'].astype(int)
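
The loop above pairs the groups of `results` and `truth` by position, which relies on both frames containing the same semiotic classes in the same order. It also uses `DataFrame.append`, which works on the pandas 0.21.0 shown in the watermark but was removed in pandas 2.0. As an alternative sketch (not the notebook's code), the per-class numbers can be computed with a single `groupby` over a row-wise correctness mask; the toy rows below are invented:

```python
import pandas as pd

# Toy rows; the semiotic class names follow the notebook, values are invented.
truth = pd.DataFrame({
    'semiotic': ['PLAIN', 'PLAIN', 'DATE', 'DATE', 'PUNCT'],
    'after': ['село', 'в', 'a', 'b', '.'],
})
results = truth.copy()
results.loc[3, 'after'] = 'x'  # one wrong prediction in the DATE class

# Row-wise correctness, aggregated once per semiotic class.
correct = results['after'] == truth['after']
per_class = (correct.groupby(truth['semiotic'])
                    .agg(['mean', 'count', 'sum'])
                    .reset_index())
per_class.columns = ['semiotic-class', 'accuracy', 'count', 'correct']
per_class['correct'] = per_class['correct'].astype(int)

# Overall row, placed first as in the notebook's table.
overall = pd.DataFrame([{'semiotic-class': 'ALL',
                         'accuracy': correct.mean(),
                         'count': len(correct),
                         'correct': int(correct.sum())}])
class_accuracy = pd.concat([overall, per_class], ignore_index=True)
print(class_accuracy)
```

Counting correct rows directly with `sum` also avoids rounding `score * count` back to an integer.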

In [None]:
class_accuracy

In [None]:
class_accuracy.plot(title='Semiotic Class-wise Accuracy',
                    y=['count', 'correct'], x='semiotic-class',
                    kind='bar', figsize=(20,10), grid=True)
plt.savefig(result_dir+'Semiotic_Class-wise_Accuracy.png')

In [None]:
class_accuracy.to_csv(result_dir+'classwise_accuracy.csv', index=False)

**Normalization Mistakes**

In [None]:
mistake_mask = (results['after'] != truth['after'])
mistakes = results[mistake_mask]
mistakes = mistakes.assign(truth = truth.loc[mistake_mask, 'after'])

In [None]:
mistakes[mistakes['semiotic']=='TIME']

**Class-wise mistakes**

In [None]:
mistakes_grouped = mistakes.groupby('semiotic')

In [None]:
mistakes_grouped.apply(lambda x: x.sample(n=3, replace=True))

In [None]:
mistakes.to_csv(result_dir+'mistakes.csv', index=False)

## 9. Comparison
<a id="comparison"></a>

In [None]:
base = pd.read_csv('../results/base-paper_classwise_accuracy.csv')
base

In [None]:
en_accuracy = pd.read_csv('../results/english/classwise_accuracy.csv')
en_accuracy

In [None]:
ru_accuracy = pd.read_csv('../results/russian/classwise_accuracy.csv')
ru_accuracy

In [None]:
en_base = pd.DataFrame(columns=['semiotic-class', 'base accuracy', 'base count'])
en_base['semiotic-class'] = base['Semiotic Class']
en_base['base accuracy'] = base[' En Accuracy']
en_base['base count'] = base[' En Count']

In [None]:
ru_base = pd.DataFrame(columns=['semiotic-class', 'base accuracy', 'base count'])
ru_base['semiotic-class'] = base['Semiotic Class']
ru_base['base accuracy'] = base[' Ru Accuracy']
ru_base['base count'] = base[' Ru Count']

In [None]:
en_compared = pd.merge(en_base, en_accuracy, on='semiotic-class')
en_compared[['semiotic-class', 'accuracy', 'base accuracy', 'count', 'base count']]

In [None]:
ru_compared = pd.merge(ru_base, ru_accuracy, on='semiotic-class')
ru_compared[['semiotic-class', 'accuracy', 'base accuracy', 'count', 'base count']]
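
Two details of the comparison are easy to miss: the baseline column names carry a leading space (`' En Accuracy'`, `' En Count'`), and `pd.merge` performs an inner join by default, so any class present in only one table is dropped from the comparison. The sketch below illustrates both on invented data; the CSV text and all numbers are made up and do not reproduce the baseline file.

```python
import io
import pandas as pd

# Invented stand-in for the baseline CSV; note the spaces after the commas
# in the header line, which end up inside the column names.
base_csv = io.StringIO(
    'Semiotic Class, En Count, En Accuracy\n'
    'ALL,100,0.9\n'
    'PLAIN,80,0.95\n'
)
base = pd.read_csv(base_csv)

# Strip the stray whitespace once instead of quoting it in every lookup.
base.columns = base.columns.str.strip()
en_base = base.rename(columns={'Semiotic Class': 'semiotic-class',
                               'En Accuracy': 'base accuracy',
                               'En Count': 'base count'})

# Invented results table with one class (TIME) missing from the baseline.
ours = pd.DataFrame({'semiotic-class': ['ALL', 'PLAIN', 'TIME'],
                     'accuracy': [0.92, 0.97, 0.5],
                     'count': [50, 40, 2]})

# Inner join (the default): TIME is dropped because the baseline lacks it.
compared = pd.merge(en_base, ours, on='semiotic-class')
print(compared['semiotic-class'].tolist())  # ['ALL', 'PLAIN']
```

Passing `how='outer'` to `pd.merge` would keep unmatched classes and fill the missing side with `NaN`.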

**LaTeX Output**

In [None]:
en_compared

In [None]:
print(en_compared[['semiotic-class', 'base count','count','base accuracy','accuracy']].to_latex())

In [None]:
print(ru_compared[['semiotic-class', 'base count','count','base accuracy','accuracy']].to_latex())

## 10. Conclusion
<a id="conclusion"></a>

**English Normalization Accuracy: 99.4%**

**Russian Normalization Accuracy: 99.3%**

___