# Data Preprocessing
---

These notebooks and the accompanying paper [https://arxiv.org/?] demonstrate an accuracy of 99.4% (English) and 99.3% (Russian) on the Text Normalization Challenge by Richard Sproat and Navdeep Jaitly. To obtain comparable and objective results, we preprocess the data they provide at [https://github.com/rwsproat/text-normalization-data]. From the README of the dataset:
```
In practice for the results reported in the paper only the first 100,002 lines
of output-00099-of-00100 were used (for English), and the first 100,007 lines of
output-00099-of-00100 for Russian.
```
Hence, the 'output-00099-of-00100' file is extracted for further use. 
This notebook prepares the raw data for the next stage of normalization.

## Import Libraries

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

%matplotlib inline

## Global Config
**Language : English or Russian?**

In [2]:
lang = 'english'
# lang = 'russian'

In [3]:
if lang == 'english':
    # input data
    data_directory = '../data/english/'
    data = 'output-00099-of-00100'
    # output
    out = 'output-00099-of-00100_processed.csv'
    # test size 
    test_rows = 100002
    
elif lang == 'russian':
    # input data
    data_directory = '../data/russian/'
    data = 'output-00099-of-00100'
    # output
    out = 'output-00099-of-00100_processed.csv'
    # test size
    test_rows = 100007

## Load Data

By default, pandas treats a double quote as the start of a quoted field, so it absorbs all tabs and newlines into that field until it reaches the next quote. To disable this behavior, we set the `quoting` argument to `QUOTE_NONE` (the integer 3), as described in the documentation: [https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html]
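
The effect can be seen on a tiny in-memory sample. This is only an illustrative sketch: the `sample` string is made up for the demonstration and is not taken from the dataset, and `csv.QUOTE_NONE` is the standard-library constant whose value is 3.

```python
import csv
import io

import pandas as pd

# Hypothetical sample: five tab-separated rows, two of which hold a lone double quote.
sample = (
    'PLAIN\tan\t<self>\n'
    'PUNCT\t"\tsil\n'
    'PLAIN\terror\t<self>\n'
    'PUNCT\t"\tsil\n'
    'PLAIN\tdriven\t<self>\n'
)
columns = ['semiotic', 'before', 'after']

# Default quoting: everything between the two quotes is merged into a single field.
merged = pd.read_csv(io.StringIO(sample), sep='\t', header=None, names=columns)

# QUOTE_NONE: a double quote is treated as an ordinary character.
literal = pd.read_csv(io.StringIO(sample), sep='\t', header=None, names=columns,
                      quoting=csv.QUOTE_NONE)

# With QUOTE_NONE every input line becomes its own row; the default yields fewer rows.
print(len(merged), len(literal))
```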


In [4]:
raw_data = pd.read_csv(data_directory+data, nrows=test_rows,
                       header=None, sep='\t', quoting = 3,
                       names=['semiotic', 'before', 'after'])
raw_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100002 entries, 0 to 100001
Data columns (total 3 columns):
semiotic    100002 non-null object
before      100002 non-null object
after       92451 non-null object
dtypes: object(3)
memory usage: 2.3+ MB


In [5]:
raw_data.head(10)

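|   | semiotic | before | after |
|---|----------|--------|-------|
| 0 | PLAIN | It | `<self>` |
| 1 | PLAIN | can | `<self>` |
| 2 | PLAIN | be | `<self>` |
| 3 | PLAIN | summarized | `<self>` |
| 4 | PLAIN | as | `<self>` |
| 5 | PLAIN | an | `<self>` |
| 6 | PUNCT | " | sil |
| 7 | PLAIN | error | `<self>` |
| 8 | PLAIN | driven | `<self>` |
| 9 | PLAIN | transformation | `<self>` |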


## Data Analysis

**What are the different types of semiotic classes available?**

In [6]:
raw_data['semiotic'].value_counts()

PLAIN         67894
PUNCT         17746
<eos>          7551
DATE           2832
LETTERS        1409
CARDINAL       1037
VERBATIM       1001
MEASURE         142
ORDINAL         103
DECIMAL          92
ELECTRONIC       49
DIGIT            44
MONEY            37
TELEPHONE        37
FRACTION         16
TIME              8
ADDRESS           4
Name: semiotic, dtype: int64

The semiotic classes mentioned in the paper are:

1. PLAIN
2. PUNCT
3. DATE
4. TRANS
5. LETTERS
6. CARDINAL
7. VERBATIM
8. MEASURE
9. ORDINAL
10. DECIMAL
11. ELECTRONIC
12. DIGIT
13. MONEY
14. FRACTION
15. TIME
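
The two lists can be compared directly. The sketch below simply copies the class names from the paper's list and from the `value_counts()` output for the English sample, so it only reflects what is shown in this notebook; the variable names are illustrative.

```python
# Classes listed in the paper (copied from the list above).
paper_classes = {
    'PLAIN', 'PUNCT', 'DATE', 'TRANS', 'LETTERS', 'CARDINAL', 'VERBATIM',
    'MEASURE', 'ORDINAL', 'DECIMAL', 'ELECTRONIC', 'DIGIT', 'MONEY',
    'FRACTION', 'TIME',
}

# Labels observed in the English sample (copied from the value_counts() output).
data_classes = {
    'PLAIN', 'PUNCT', '<eos>', 'DATE', 'LETTERS', 'CARDINAL', 'VERBATIM',
    'MEASURE', 'ORDINAL', 'DECIMAL', 'ELECTRONIC', 'DIGIT', 'MONEY',
    'TELEPHONE', 'FRACTION', 'TIME', 'ADDRESS',
}

only_in_data = sorted(data_classes - paper_classes)
only_in_paper = sorted(paper_classes - data_classes)

print(only_in_data)   # ['<eos>', 'ADDRESS', 'TELEPHONE']
print(only_in_paper)  # ['TRANS']
```

Note that `<eos>` is a sentence boundary marker rather than a semiotic class; it is removed in the preprocessing step.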


## Data Preprocessing

**Generating sentence and word token ids**

Our text normalization approach requires sentence and token ids for encoding the data and generating batches.

In [7]:
# to avoid modifying something we are iterating over
data = pd.DataFrame(columns=['sentence_id',
                             'token_id',
                             'semiotic',
                             'before',
                             'after'])
# initialize columns and iterator
sentence_id = 0
token_id = -1

In [8]:
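# collect rows in a list; DataFrame.append was removed in pandas 2.0
# and appending row by row is slow on 100k rows
rows = []
for row in raw_data.itertuples():
    # an <eos> marker closes the current sentence and is not kept as a token
    if row.semiotic == '<eos>' and row.before == '<eos>':
        sentence_id += 1
        token_id = -1
        continue
    token_id += 1

    rows.append({'sentence_id': sentence_id,
                 'token_id': token_id,
                 'semiotic': row.semiotic,
                 'before': row.before,
                 'after': row.after})

# build the frame once from the collected rows
data = pd.DataFrame(rows, columns=['sentence_id', 'token_id',
                                   'semiotic', 'before', 'after'])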
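
The same ids can also be derived without an explicit loop. The following is a minimal sketch on a made-up `toy` frame that mimics the columns of `raw_data`; it is an alternative formulation, not the code used for the reported results.

```python
import pandas as pd

# Hypothetical toy input with the same columns as raw_data: two short sentences.
toy = pd.DataFrame({
    'semiotic': ['PLAIN', 'PUNCT', '<eos>', 'PLAIN', 'PLAIN', '<eos>'],
    'before': ['Hi', '.', '<eos>', 'Go', 'on', '<eos>'],
    'after': ['<self>', 'sil', None, '<self>', '<self>', None],
})

is_eos = (toy['semiotic'] == '<eos>') & (toy['before'] == '<eos>')

# The number of <eos> markers seen so far is the index of the current sentence.
sentence_id = is_eos.cumsum()

# Drop the markers, then number the tokens inside each sentence from 0.
tokens = toy.loc[~is_eos].copy()
tokens.insert(0, 'sentence_id', sentence_id[~is_eos])
tokens.insert(1, 'token_id', tokens.groupby('sentence_id').cumcount())
tokens = tokens.reset_index(drop=True)

print(tokens)
```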

In [9]:
data.head(10)

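|   | sentence_id | token_id | semiotic | before | after |
|---|-------------|----------|----------|--------|-------|
| 0 | 0 | 0 | PLAIN | It | `<self>` |
| 1 | 0 | 1 | PLAIN | can | `<self>` |
| 2 | 0 | 2 | PLAIN | be | `<self>` |
| 3 | 0 | 3 | PLAIN | summarized | `<self>` |
| 4 | 0 | 4 | PLAIN | as | `<self>` |
| 5 | 0 | 5 | PLAIN | an | `<self>` |
| 6 | 0 | 6 | PUNCT | " | sil |
| 7 | 0 | 7 | PLAIN | error | `<self>` |
| 8 | 0 | 8 | PLAIN | driven | `<self>` |
| 9 | 0 | 9 | PLAIN | transformation | `<self>` |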


**Transforming 'after' tokens**  
From the above-mentioned paper:
```
Semiotic class instances are verbalized as sequences of fully spelled words,
most ordinary words are left alone (represented here as <self>), and
punctuation symbols are mostly transduced to sil (for “silence”).
```
Hence we transform the 'after' column in two steps:
1. `sil` is replaced with `<self>`
2. `<self>` is replaced with the value of the 'before' column


In [10]:
sil_mask = (data['after'] == 'sil')
data.loc[sil_mask, 'after'] = '<self>' 

In [11]:
self_mask = (data['after'] == '<self>')
data.loc[self_mask, 'after'] = data.loc[self_mask, 'before']
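
Since both markers end up as a copy of the 'before' token, the two steps can be folded into one. This is a sketch on a made-up `toy` frame (its values are illustrative, not dataset rows), assuming the same column names as above.

```python
import pandas as pd

# Hypothetical toy rows: an ordinary word, a punctuation mark, and a verbalized token.
toy = pd.DataFrame({
    'before': ['It', '"', '2010'],
    'after': ['<self>', 'sil', 'twenty ten'],
})

# Tokens marked <self> or sil are passed through unchanged.
keep_as_is = toy['after'].isin(['<self>', 'sil'])
toy['after'] = toy['after'].mask(keep_as_is, toy['before'])

print(toy['after'].tolist())  # ['It', '"', 'twenty ten']
```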

Sanity Check...

In [12]:
data[sil_mask].sample(5)

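|   | sentence_id | token_id | semiotic | before | after |
|---|-------------|----------|----------|--------|-------|
| 27604 | 2255 | 1 | PUNCT | : | : |
| 23472 | 1886 | 3 | PUNCT | : | : |
| 33683 | 2775 | 15 | PUNCT | , | , |
| 69723 | 5727 | 4 | PUNCT | , | , |
| 74352 | 6093 | 11 | PUNCT | . | . |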


In [13]:
data[self_mask].sample(5)

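|   | sentence_id | token_id | semiotic | before | after |
|---|-------------|----------|----------|--------|-------|
| 27460 | 2242 | 11 | PUNCT | . | . |
| 9551 | 759 | 5 | PLAIN | the | the |
| 77947 | 6381 | 11 | PLAIN | far | far |
| 4412 | 348 | 7 | PLAIN | in | in |
| 42046 | 3427 | 7 | PLAIN | Takayama | Takayama |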


## Exporting Data

In [14]:
data[30:40]

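|   | sentence_id | token_id | semiotic | before | after |
|---|-------------|----------|----------|--------|-------|
| 30 | 2 | 0 | PLAIN | She | She |
| 31 | 2 | 1 | PLAIN | then | then |
| 32 | 2 | 2 | PLAIN | compelled | compelled |
| 33 | 2 | 3 | PLAIN | her | her |
| 34 | 2 | 4 | PLAIN | tenants | tenants |
| 35 | 2 | 5 | PLAIN | to | to |
| 36 | 2 | 6 | PLAIN | level | level |
| 37 | 2 | 7 | PLAIN | the | the |
| 38 | 2 | 8 | PLAIN | Royalist | Royalist |
| 39 | 2 | 9 | PLAIN | siege | siege |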


In [15]:
data.to_csv(data_directory+out, index=False)
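
One thing to keep in mind when the exported file is read back: by default `pd.read_csv` parses tokens such as `NA` or `null` as missing values. The sketch below illustrates this with a made-up `toy` frame written to an in-memory buffer instead of a file; passing `keep_default_na=False` is one possible way to keep such tokens as literal text.

```python
import io

import pandas as pd

# Hypothetical processed rows whose tokens look like missing-value markers.
toy = pd.DataFrame({
    'sentence_id': [0, 0, 0],
    'token_id': [0, 1, 2],
    'semiotic': ['PLAIN', 'PUNCT', 'PLAIN'],
    'before': ['NA', '"', 'null'],
    'after': ['NA', '"', 'null'],
})

# Write to an in-memory buffer, mirroring data.to_csv(..., index=False).
buffer = io.StringIO()
toy.to_csv(buffer, index=False)

# Default load: 'NA' and 'null' are converted to NaN.
buffer.seek(0)
default_load = pd.read_csv(buffer)

# Literal load: the default missing-value markers are disabled.
buffer.seek(0)
literal_load = pd.read_csv(buffer, keep_default_na=False)

print(default_load['before'].isna().sum())  # 2
print(literal_load['before'].tolist())      # ['NA', '"', 'null']
```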

___