# Text Normalization with Finite State Machines

- __author__: Evgeny A. Stepanov
- __e-mail__: stepanov.evgeny.a@gmail.com

## Going Character Level

Normalizing the tokens in our input requires moving from the token level to the character level.
An algorithm for this is described in the [OpenFST examples](http://www.openfst.org/twiki/bin/view/FST/FstExamples).
Let's follow the same approach.

- Let's first define functions to write *FST* and *symbol table* specifications in the OpenFST text format.

In [1]:
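def fst_write(fst, fname, fs=" "):
    """Write FST arcs (lists of fields) to fname, one arc per line."""
    with open(fname, 'w') as f:
        for arc in fst:
            f.write(fs.join(map(str, arc)) + "\n")


def st_write(st, fname, fs="\t"):
    """Write a symbol table (dict of symbol -> index) to fname, one pair per line."""
    with open(fname, 'w') as f:
        for symbol, idx in st.items():
            f.write(fs.join([symbol, str(idx)]) + "\n")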
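
For reference, here is a minimal sketch of what these specifications look like as text. The toy arcs and symbols are made up for illustration, and `format_lines` is a hypothetical helper that mirrors the formatting done by `fst_write` and `st_write` without touching the file system.

```python
def format_lines(rows, fs=" "):
    """Hypothetical helper: join the fields of each row into one text line."""
    return [fs.join(map(str, row)) for row in rows]


# Toy transducer for the made-up word "hi".
# Arc rows are [source state, destination state, input label, output label];
# a row holding a single state marks that state as final.
toy_fst = [
    [0, 1, "h", "hi"],
    [1, 2, "i", "<epsilon>"],
    [2],
]
fst_lines = format_lines(toy_fst)

# Toy symbol table: symbol -> integer index, with <epsilon> at index 0.
toy_symbols = {"<epsilon>": 0, "h": 1, "i": 2}
st_lines = format_lines(toy_symbols.items(), fs="\t")

print("\n".join(fst_lines))
print("\n".join(st_lines))
```

Each FST line is either a full arc or a lone final state, and each symbol table line is a symbol followed by its index.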

### Lexicon FST: words-to-chars
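__Algorithm__:
- create an FST that relates each word to its character sequence (similar to converting text to an FSA)
    - the first arc of a word reads its first character and outputs the whole word
    - the remaining arcs read the remaining characters and output `<epsilon>`
- invert the FST to obtain the words-to-characters direction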

__Requirements__:
- *Input*: token symbol table
- character symbol table (for compiling FST)
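
The algorithm above can be sketched for a single word as follows. `word_arcs` is a hypothetical helper written only for illustration, and it numbers states from 0 for that one word. A full lexicon would share the start state across all words.

```python
def word_arcs(word, eps="<epsilon>"):
    """Hypothetical helper: arcs spelling out one word, starting from state 0."""
    arcs = []
    for i, char in enumerate(word):
        # The first arc outputs the whole word; later arcs output epsilon.
        out = word if i == 0 else eps
        arcs.append([i, i + 1, char, out])
    arcs.append([len(word)])  # single-element row marks the final state
    return arcs


who_arcs = word_arcs("who")
for arc in who_arcs:
    print(*arc)
```

Reading the input labels along the path spells the word, while the output labels carry the word itself exactly once.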

#### Character Symbol Tables
Working at the character level means that we need a character symbol table. Let's create one now.

In [2]:
import string

def st_ascii():
    """
    Create an ASCII symbol table mapping each symbol to its character code.
    """
    syms = {"<epsilon>": 0}
    printable = string.ascii_letters + string.punctuation + string.digits
    for i in range(128):
        char_str = chr(i)
        if char_str in string.whitespace:
            # All whitespace characters share one symbol; the last code (32) wins.
            syms['<space>'] = i
        elif char_str in printable:
            syms[char_str] = i
        else:
            # Assume the rest are control characters; the last code (127) wins.
            syms['<ctrl>'] = i
    return syms

In [3]:
st_write(st_ascii(), 'chars.txt')
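
As a quick sanity check, the written table can be parsed back into a dictionary. This is a minimal sketch: `st_parse` is a hypothetical helper, and the sample lines only imitate the tab-separated format produced by `st_write`, using ASCII codes as indices.

```python
def st_parse(lines, fs="\t"):
    """Hypothetical helper: parse 'symbol<fs>index' lines into a dict."""
    table = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        # Split on the last separator so the index is always the final field.
        symbol, idx = line.rsplit(fs, 1)
        table[symbol] = int(idx)
    return table


# Sample lines in the same format as chars.txt (an excerpt, not the full table).
sample = ["<epsilon>\t0\n", "<space>\t32\n", "a\t97\n", "!\t33\n"]
parsed = st_parse(sample)
print(parsed)
```

To check the real file, the same function could be applied to the open file object, for example `st_parse(open('chars.txt'))`.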

In [4]:
def fst_token2char(symbol_table, eps='<epsilon>'):
    """
    Create a character-level lexicon from a token symbol table
    (following the OpenFST examples).
    :param symbol_table: path to the token symbol table file
    :param eps: epsilon transition symbol
    :return: list of arcs, plus a single-element entry for each final state
    """
    special = {'<epsilon>', '<unk>', '<s>', '</s>'}
    s = 0  # last state id used so far; state 0 is the shared start state
    arcs = []
    with open(symbol_table, 'r') as f:
        for line in f:
            cols = line.split()
            if len(cols) < 2:
                continue  # skip blank or malformed lines
            if cols[1] == '0':
                continue  # epsilon
            if cols[0] in special:
                continue  # reserved tokens (automatically added)
            word = cols[0]

            for i, char in enumerate(word):
                if i == 0:
                    # first character of a word: leave the start state and output the word
                    arcs.append([0, s + 1, char, word])
                else:
                    s += 1
                    arcs.append([s, s + 1, char, eps])
            s += 1  # final state
            arcs.append([s])
    return arcs

In [5]:
fst_write(fst_token2char('isyms.txt'), 'lexicon.fst.txt')

In [10]:
%%bash
fstcompile \
    --isymbols=chars.txt \
    --osymbols=isyms.txt \
    --keep_isymbols \
    --keep_osymbols \
    lexicon.fst.txt | fstclosure > lexicon.fst

In [11]:
%%bash
fstinvert lexicon.fst lexicon.inv.fst

In [13]:
%%bash
fstcompose sent.fsa lexicon.inv.fst | fstprint --isymbols=isyms.txt

0	1	<epsilon>	<epsilon>
1	2	who	w
2	3	<epsilon>	h
3	4	<epsilon>	o
4	5	<epsilon>	<epsilon>
5	6	is	i
6	7	<epsilon>	s
7	8	<epsilon>	<epsilon>
8	9	in	i
9	10	<epsilon>	n
10	11	<epsilon>	<epsilon>
11	12	the	t
12	13	<epsilon>	h
13	14	<epsilon>	e
14	15	<epsilon>	<epsilon>
15	16	movie	m
16	17	<epsilon>	o
17	18	<epsilon>	v
18	19	<epsilon>	i
19	20	<epsilon>	e
20	21	<epsilon>	<epsilon>
21	22	the	t
22	23	<epsilon>	h
23	24	<epsilon>	e
24	25	<epsilon>	<epsilon>
25	26	campaign	c
26	27	<epsilon>	a
27	28	<epsilon>	m
28	29	<epsilon>	p
29	30	<epsilon>	a
30	31	<epsilon>	i
31	32	<epsilon>	g
32	33	<epsilon>	n
33
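
To read the result off the printed arcs, the characters can be grouped under the word that introduces them. This is a minimal sketch, assuming the four-column arc format shown above. `chars_per_word` is a hypothetical helper, and the sample reproduces only the first few arcs of that output.

```python
def chars_per_word(lines, eps="<epsilon>"):
    """Hypothetical helper: group output characters under the preceding word label."""
    words = []
    for line in lines:
        cols = line.split()
        if len(cols) < 4:
            continue  # final-state line carries no labels
        word, char = cols[2], cols[3]
        if word != eps:
            words.append([word, ""])  # a word label starts a new group
        if char != eps and words:
            words[-1][1] += char
    return [tuple(w) for w in words]


# First arcs of the fstprint output above (columns: src, dst, input, output).
sample = [
    "0\t1\t<epsilon>\t<epsilon>",
    "1\t2\twho\tw",
    "2\t3\t<epsilon>\th",
    "3\t4\t<epsilon>\to",
    "4\t5\t<epsilon>\t<epsilon>",
    "5\t6\tis\ti",
    "6\t7\t<epsilon>\ts",
]
result = chars_per_word(sample)
print(result)
```

Each pair holds a word from the input side and the characters emitted for it on the output side, so the two should match when the lexicon is built correctly.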
