# Exploring English Lexicon-based Tokenizer

- Top F1=0.99 with delimiting symbols added to the lexicon (same as the Freedom-based top F1=0.99)
- Top F1 is obtained with length-driven parsing, not weight-driven or "hybrid" (length-times-logweight-driven) parsing
- Errors (0.6) are due to unknown words (mostly contractions like "it's" that are missing from the dictionary)
- Not adding delimiters to the lexicon drops the top F1 to 0.94
- Adding a threshold on word frequency does not improve F1
- Lexicon-based tokenization of spaceless text has F1=0.79 (comparable to Chinese F1=0.82), obtained with "hybrid" (length-times-logweight-driven) parsing; the result is explainable by the lack of word stress articulation and speech pauses, and is expected to improve with alternative tokenization trees that maximize the weight across the entire tree
- Precision of word discovery by the Freedom-peak-based tokenizer is 0.99 (after correction for out-of-reference-lexicon words, and apart from a single issue with the question mark not being separated from words), comparable with the delimiter-based tokenizer (1.0)

| Language | Tokenizer | Tokenization F1 | Lexicon Discovery Precision |
|---|---|---|---|
| English | Freedom-based  | **0.99** | **0.99** (vs 1.0) |
| English | Lexicon-based  | 0.99 | - |
| English no spaces | Freedom-based | 0.42 | - |
| English no spaces | Lexicon-based | 0.79 | - |
| Russian | Freedom-based  | **1.0** | **1.0** (vs 1.0) |
| Russian | Lexicon-based  | 0.94 | - |
| Russian no spaces | Freedom-based | 0.26 | - |
| Russian no spaces | Lexicon-based | 0.72 | - |
| Chinese | Freedom-based  | **0.71** | **0.92** (vs 0.94) |
| Chinese | Lexicon-based  | 0.83 | - |
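
The "no spaces" rows are where greedy parsing struggles most, and the summary above suggests scoring whole tokenization trees instead. The sketch below is one way to read that idea, not code from pygents: a dynamic program that picks the segmentation with the highest total log-weight. The function name `segment_max_weight`, the toy lexicon and the penalty for unknown characters are all assumptions made for illustration.

```python
import math

def segment_max_weight(text, lexicon):
    """Split spaceless text so that the sum of log-weights of the tokens is maximal.

    lexicon: dict word -> positive weight (must be non-empty).
    Characters not covered by any word become single-character tokens
    with a heavy (assumed) penalty, so a segmentation always exists.
    """
    total = sum(lexicon.values())
    max_len = max(len(word) for word in lexicon)
    unknown_penalty = math.log(1.0 / (total * 1000))  # assumption, not from pygents
    # best[i] = (best score of text[:i], start index of the last token)
    best = [(0.0, 0)] + [(-math.inf, 0)] * len(text)
    for end in range(1, len(text) + 1):
        for start in range(max(0, end - max_len), end):
            piece = text[start:end]
            if piece in lexicon:
                score = math.log(lexicon[piece] / total)
            elif end - start == 1:
                score = unknown_penalty
            else:
                continue
            candidate = best[start][0] + score
            if candidate > best[end][0]:
                best[end] = (candidate, start)
    # walk back from the end of the text to recover the tokens
    tokens, end = [], len(text)
    while end > 0:
        start = best[end][1]
        tokens.append(text[start:end])
        end = start
    return tokens[::-1]

# hypothetical toy lexicon, weights are made up
toy_lexicon = {"buy": 50, "insurance": 40, "in": 300, "sure": 60,
               "an": 200, "a": 400, "house": 45, "ho": 1, "use": 80}
print(segment_max_weight("buyinsurance", toy_lexicon))
print(segment_max_weight("buyahouse", toy_lexicon))
```

With this toy lexicon the frequent short word "in" does not win over "insurance", because the leftover characters are penalized in the score of the whole segmentation rather than ignored.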



In [2]:
import os, sys
# locate the project root: the path up to and including the first 'pygents'
cwd = os.getcwd()
project_path = cwd[:cwd.find('pygents')+len('pygents')]
if project_path not in sys.path: sys.path.append(project_path)
os.chdir(project_path)

#from importlib import reload  # Python 3.4+

import pickle
import pandas as pd

# force reimport of the pygents modules so that local code changes are picked up
for module in ('pygents.util','pygents.text','pygents.plot','pygents.token','pygents.token_plot'):
    sys.modules.pop(module, None)


from pygents.token import *
from pygents.text import *
from pygents.util import *
from pygents.plot import plot_bars, plot_dict, matrix_plot
from pygents.token_plot import *

In [3]:
path = '../../nlp/corpora/Chinese/'
test_df = pd.read_csv(os.path.join(path,'magicdata/zh_en_ru_100/CORPUS_ZH_EN_RU.txt'),delimiter='\t')
test_texts = list(test_df['en'])
print(len(test_texts))
test_df[['en']]

100


| | en |
|---|---|
| 0 | What about medical insurance? As for my family... |
| 1 | For those who have insurance, according to the... |
| 2 | Need to realize the importance of having insur... |
| 3 | In fact, this phenomenon is indeed very common... |
| 4 | It is really necessary for this generation of ... |
| ... | ... |
| 95 | Ant Insurance does not only offer car insuranc... |
| 96 | However, when buying a house, except for the d... |
| 97 | This kind of financial investment has certain ... |
| 98 | If your investment orientation is right, then ... |


In [4]:
for text in test_texts:
    print(text)

What about medical insurance? As for my family, either an adult or a child will buy insurance.
For those who have insurance, according to the insurance contract, they will get a compensation of 300 thousand yuan.
Need to realize the importance of having insurance.
In fact, this phenomenon is indeed very common, for instance, for personal accident insurance, the more you buy, the more you insure.
It is really necessary for this generation of parents to buy insurance.
Well, right now, it's really advisable to buy insurance.
A car must be bought in full, and a house can be bought with a loan.
You can buy insurance, insurance is of course divided into many categories.
Medical insurance is very important.
It's the insurance company that pays this part of the money.
Xianghubao, I don't know if you ever heard about it, it is insurance in Alipay.
Buying a house is actually an investment.
Have you ever learned about the training of Ping An Insurance?
If it is deposited in the bank, what is the 

In [5]:
del_tokenizer = DelimiterTokenizer()


In [6]:
#get raw lexicon list
en_lex = list(pd.read_csv("https://raw.githubusercontent.com/aigents/aigents-java/master/lexicon_english.txt",sep='\t',header=None,na_filter=False).to_records(index=False))
print(len(en_lex))

#debug raw lexicon
print(max(en_lex,key=lambda item:item[1]))
en_lex_dict = weightedlist2dict(en_lex,lower=False) # no case-insensitive merge
print(len(en_lex_dict))

# merge and get top weight
en_lex_dict = weightedlist2dict(en_lex,lower=True) # with case-insensitive merge
top_weight = max(en_lex_dict.values())
print(top_weight)

# add delimiters to the list
en_lex_delimited = en_lex + [(i, top_weight) for i in list(delimiters)]
print(len(delimiters))
print(len(en_lex_delimited)) 


97565
('the', 53097401)
97565
53097401
34
97599


In [7]:
# no delimiters
filter_thresholds = [0,0.00001,0.0001,0.001,0.01]
for t in filter_thresholds:
    lex = listofpairs_compress_with_loss(en_lex,t) if t > 0 else en_lex
    en_lex0_tokenizer = LexiconIndexedTokenizer(lexicon=lex,sortmode=0,cased=True)
    en_lex1_tokenizer = LexiconIndexedTokenizer(lexicon=lex,sortmode=1,cased=True)
    en_lex2_tokenizer = LexiconIndexedTokenizer(lexicon=lex,sortmode=2,cased=True)
    print(t,en_lex0_tokenizer.count_params())
    print(evaluate_tokenizer_f1(test_texts,del_tokenizer,en_lex0_tokenizer,debug=False))#sort by len
    print(evaluate_tokenizer_f1(test_texts,del_tokenizer,en_lex1_tokenizer,debug=False))#sort by freq
    print(evaluate_tokenizer_f1(test_texts,del_tokenizer,en_lex2_tokenizer,debug=False))#sort by len and freq
    print()


0 97565
0.94
0.48
0.93

1e-05 40382
0.94
0.48
0.93

0.0001 10122
0.92
0.48
0.92

0.001 1570
0.71
0.48
0.71

0.01 118
0.37
0.31
0.37
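
The three `sortmode` values compared above are commented in the cell as sorting by length, by frequency, and by both. A minimal greedy sketch of what such preferences could look like is given below. It is not the `LexiconIndexedTokenizer` implementation; the function `greedy_tokenize`, the toy lexicon and the exact hybrid formula (length times log of weight plus one) are assumptions.

```python
import math

def greedy_tokenize(text, lexicon, sortmode=0):
    """Greedy left-to-right lexicon tokenizer (illustrative sketch only).

    sortmode 0: prefer the longest matching word
    sortmode 1: prefer the heaviest (most frequent) matching word
    sortmode 2: prefer the highest length * log(weight + 1)
    """
    def rank(word):
        if sortmode == 0:
            return len(word)
        if sortmode == 1:
            return lexicon[word]
        return len(word) * math.log(lexicon[word] + 1)

    max_len = max(len(word) for word in lexicon)
    tokens, pos = [], 0
    while pos < len(text):
        longest = min(max_len, len(text) - pos)
        # all lexicon words that start at the current position
        candidates = [text[pos:pos + n] for n in range(1, longest + 1)
                      if text[pos:pos + n] in lexicon]
        # an unknown character becomes a token of its own
        word = max(candidates, key=rank) if candidates else text[pos]
        tokens.append(word)
        pos += len(word)
    return tokens

# hypothetical toy lexicon, weights are made up
toy = {"buy": 50, " ": 1000, "in": 300, "insurance": 40}
for mode in (0, 1, 2):
    print(mode, greedy_tokenize("buy insurance", toy, sortmode=mode))
```

In this toy case the frequency-only mode grabs the frequent prefix "in" and then has to spell out the rest of "insurance" character by character, which is one plausible reason for the much lower scores of the second line in each group above.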



In [8]:
# with delimiters
filter_thresholds = [0,0.00001,0.0001,0.001,0.01]
for t in filter_thresholds:
    lex = listofpairs_compress_with_loss(en_lex_delimited,t) if t > 0 else en_lex_delimited
    en_lex0_tokenizer = LexiconIndexedTokenizer(lexicon=lex,sortmode=0,cased=True)
    en_lex1_tokenizer = LexiconIndexedTokenizer(lexicon=lex,sortmode=1,cased=True)
    en_lex2_tokenizer = LexiconIndexedTokenizer(lexicon=lex,sortmode=2,cased=True)
    print(t,en_lex0_tokenizer.count_params())
    print(evaluate_tokenizer_f1(test_texts,del_tokenizer,en_lex0_tokenizer,debug=False))#sort by len
    print(evaluate_tokenizer_f1(test_texts,del_tokenizer,en_lex1_tokenizer,debug=False))#sort by freq
    print(evaluate_tokenizer_f1(test_texts,del_tokenizer,en_lex2_tokenizer,debug=False))#sort by len and freq
    print()


0 97599
0.99
0.52
0.98

1e-05 40416
0.99
0.52
0.98

0.0001 10156
0.97
0.52
0.97

0.001 1604
0.75
0.52
0.75

0.01 152
0.58
0.54
0.58
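
The thresholds above prune the lexicon before tokenization. How `listofpairs_compress_with_loss` applies the threshold is not shown in this notebook; the sketch below assumes the simplest reading, where a pair is kept if its weight is at least `threshold` times the top weight. The name `filter_by_relative_weight` and the toy data are hypothetical.

```python
def filter_by_relative_weight(pairs, threshold):
    """Keep (word, weight) pairs with weight >= threshold * max weight.

    This is an assumed rule, not necessarily the one used by pygents.
    """
    if not pairs:
        return []
    cutoff = threshold * max(weight for _, weight in pairs)
    return [(word, weight) for word, weight in pairs if weight >= cutoff]

# hypothetical toy lexicon, weights are made up
toy_pairs = [("the", 1000), ("insurance", 40), ("alipay", 1)]
for t in (0.001, 0.01, 0.1):
    print(t, filter_by_relative_weight(toy_pairs, t))
```

In the runs above, a threshold of 1e-05 shrinks the lexicon from 97565 to 40382 entries (97599 to 40416 with delimiters) without changing F1, while larger thresholds only lower it.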



In [9]:
en_lex0_tokenizer = LexiconIndexedTokenizer(lexicon=en_lex_delimited,sortmode=0,cased=True)
for text in test_texts:
    expected = del_tokenizer.tokenize(text)
    actual = en_lex0_tokenizer.tokenize(text)
    f1 = calc_f1(expected,actual)
    if f1 < 1:
        print(expected)
        print(actual)
        print(round(f1,2))


['Well', ',', ' ', 'right', ' ', 'now', ',', ' ', "it's", ' ', 'really', ' ', 'advisable', ' ', 'to', ' ', 'buy', ' ', 'insurance', '.']
['Well', ',', ' ', 'right', ' ', 'now', ',', ' ', 'it', "'", 's', ' ', 'really', ' ', 'advisable', ' ', 'to', ' ', 'buy', ' ', 'insurance', '.']
0.9
["It's", ' ', 'the', ' ', 'insurance', ' ', 'company', ' ', 'that', ' ', 'pays', ' ', 'this', ' ', 'part', ' ', 'of', ' ', 'the', ' ', 'money', '.']
['It', "'", 's', ' ', 'the', ' ', 'insurance', ' ', 'company', ' ', 'that', ' ', 'pays', ' ', 'this', ' ', 'part', ' ', 'of', ' ', 'the', ' ', 'money', '.']
0.91
['Xianghubao', ',', ' ', 'I', ' ', "don't", ' ', 'know', ' ', 'if', ' ', 'you', ' ', 'ever', ' ', 'heard', ' ', 'about', ' ', 'it', ',', ' ', 'it', ' ', 'is', ' ', 'insurance', ' ', 'in', ' ', 'Alipay', '.']
['Xiang', 'hub', 'ao', ',', ' ', 'I', ' ', 'don', "'", 't', ' ', 'know', ' ', 'if', ' ', 'you', ' ', 'ever', ' ', 'heard', ' ', 'about', ' ', 'it', ',', ' ', 'it', ' ', 'is', ' ', 'insurance', ' 
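
The per-sentence scores above can be checked by hand. `calc_f1` itself is not shown in this notebook, so the sketch below assumes an F1 over the multiset overlap of expected and actual tokens; the function name `token_f1` is hypothetical. Under that assumption the first mismatch has 19 shared tokens out of 20 expected and 22 actual, which gives 2*19/(20+22), about 0.905.

```python
from collections import Counter

def token_f1(expected, actual):
    """F1 over the multiset overlap of two token lists (assumed definition)."""
    overlap = sum((Counter(expected) & Counter(actual)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(actual)
    recall = overlap / len(expected)
    return 2 * precision * recall / (precision + recall)

expected_tokens = ['Well', ',', ' ', 'right', ' ', 'now', ',', ' ', "it's", ' ',
                   'really', ' ', 'advisable', ' ', 'to', ' ', 'buy', ' ', 'insurance', '.']
# the lexicon-based tokenizer splits the contraction into three tokens
i = expected_tokens.index("it's")
actual_tokens = expected_tokens[:i] + ['it', "'", 's'] + expected_tokens[i + 1:]
print(round(token_f1(expected_tokens, actual_tokens), 2))
```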

## Explore validity of the discovered lexicon


In [16]:
en_lex_delimited_dict = weightedlist2dict(en_lex_delimited,lower=True)

In [10]:
# use SOTA - I
base = FreedomTokenizer(name='data/models/brown_nolines_chars_7a',max_n=7,mode='chars',debug=False)
model_compress_with_loss(base.model,0.0001)
test_tokenizer = FreedomBasedTokenizer(base,'ddf-','ddf+')
test_tokenizer.set_options(nlist = [1], threshold=0.4) # expected F1=0.99
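
For orientation, the statistic behind a freedom-based tokenizer can be illustrated on a tiny corpus. The sketch below only counts, for single characters (comparable to `nlist = [1]`), how many distinct characters were seen to follow each one. It is an assumption-laden illustration, not the pygents code: the function `forward_freedom` and the corpus are hypothetical, and how the `'ddf-'` and `'ddf+'` metrics turn such counts into token boundaries is not shown in this notebook.

```python
from collections import defaultdict

def forward_freedom(corpus):
    """Map each character to the number of distinct characters seen right after it."""
    successors = defaultdict(set)
    for line in corpus:
        for left, right in zip(line, line[1:]):
            successors[left].add(right)
    return {ch: len(following) for ch, following in successors.items()}

# hypothetical toy corpus
corpus = ["buy a car", "buy a house", "get a loan"]
freedom = forward_freedom(corpus)
for ch in (" ", "a", "u", "b"):
    print(repr(ch), freedom[ch])
```

Even on three short lines the space has the widest choice of successors, which hints at why peaks of such a statistic can mark token boundaries.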


In [12]:
expected = {}
actual = {}
f1 = evaluate_tokenizer_f1(test_texts,del_tokenizer,test_tokenizer,expected_collector=expected,actual_collector=actual)
print(f1)


0.99


In [26]:
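# collected tokens: share of token occurrences that are present in the reference lexicon
# printed columns: total, relevant, irrelevant, precision, corrected precision

print('total relevant false precision')

# reference (delimiter-based) tokens
expected_count = sum([expected[key] for key in expected])
relevant_count = sum([expected[key] for key in expected if key.lower() in en_lex_delimited_dict])
irrelevant_count = sum([expected[key] for key in expected if not key.lower() in en_lex_delimited_dict])
# +21: hand-counted correction for valid tokens missing from the reference lexicon
print(expected_count,relevant_count,irrelevant_count,relevant_count/expected_count,(relevant_count+21)/expected_count)

# tokens produced by the tokenizer under test
actual_count = sum([actual[key] for key in actual])
relevant_count = sum([actual[key] for key in actual if key.lower() in en_lex_delimited_dict])
irrelevant_count = sum([actual[key] for key in actual if not key.lower() in en_lex_delimited_dict])
# +20: hand-counted correction for valid tokens missing from the reference lexicon
print(actual_count,relevant_count,irrelevant_count,relevant_count/actual_count,(relevant_count+20)/actual_count)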


total relevant false precision
2698 2677 21 0.992216456634544 1.0
2694 2662 32 0.9881217520415738 0.9955456570155902
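
The two rows above are plain ratios: for the freedom-based tokenizer 2662/2694 is about 0.988, and adding back the 20 hand-counted tokens that are valid but absent from the reference lexicon gives 2682/2694, about 0.9955. A compact sketch of the same calculation follows; the helper name `lexicon_precision` and the toy counts are hypothetical.

```python
def lexicon_precision(token_counts, lexicon, corrected=0):
    """Return (raw, corrected) share of token occurrences found in the lexicon.

    token_counts: dict token -> number of occurrences
    lexicon: set or dict of lowercase reference words
    corrected: hand-counted occurrences of valid tokens missing from the lexicon
    """
    total = sum(token_counts.values())
    relevant = sum(n for token, n in token_counts.items() if token.lower() in lexicon)
    return relevant / total, (relevant + corrected) / total

# hypothetical toy data: "It's" is valid but not in the lexicon, "full?" is a real error
toy_counts = {"insurance": 3, "It's": 2, "full?": 1}
toy_reference = {"insurance", "full"}
raw, adjusted = lexicon_precision(toy_counts, toy_reference, corrected=2)
print(raw, adjusted)
```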


In [27]:
#delimiter-based tokenizer
misses = sorted([(key,expected[key]) for key in expected if not key.lower() in en_lex_delimited_dict],key = lambda x: x[1],reverse=True)
misses

[("It's", 2),
 ("don't", 2),
 ('500', 2),
 ('300', 1),
 ("it's", 1),
 ('Xianghubao', 1),
 ('Alipay', 1),
 ('high-risk', 1),
 ('2.8', 1),
 ("doesn't", 1),
 ('broker-dealer', 1),
 ('20', 1),
 ('80%', 1),
 ("can't", 1),
 ('150%', 1),
 ('tie-in', 1),
 ('30%', 1),
 ("Apple's", 1)]

In [29]:
#freedom-based tokenizer
misses = sorted([(key,actual[key]) for key in actual if not key.lower() in en_lex_delimited_dict],key = lambda x: x[1],reverse=True)
misses


[("It's", 2),
 ("don't", 2),
 ('500', 2),
 ('insurance?', 1),
 ('300', 1),
 ("it's", 1),
 ('Xianghubao', 1),
 ('Alipay', 1),
 ('Insurance?', 1),
 ('interest?', 1),
 ('full?', 1),
 ('securities?', 1),
 ('investment?', 1),
 ('banking?', 1),
 ('time?', 1),
 ('2', 1),
 ('.8', 1),
 ('right?', 1),
 ("doesn't", 1),
 ('not?', 1),
 ('20', 1),
 ('80%', 1),
 ("can't", 1),
 ('150%', 1),
 ('you?', 1),
 ('30%', 1),
 ("'s", 1),
 ('year?', 1),
 ('have?', 1)]

In [30]:
# use SOTA - II
base = FreedomTokenizer(name='data/models/gutenberg_brown_chars_7a',max_n=7,mode='chars',debug=False)
model_compress_with_loss(base.model,0.0001)
test_tokenizer = FreedomBasedTokenizer(base,'ddf-','ddf+')
test_tokenizer.set_options(nlist = [1], threshold=0.4) # expected F1=0.99


In [31]:
expected = {}
actual = {}
f1 = evaluate_tokenizer_f1(test_texts,del_tokenizer,test_tokenizer,expected_collector=expected,actual_collector=actual)
print(f1)


0.99


In [32]:
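# collected tokens: share of token occurrences that are present in the reference lexicon
# printed columns: total, relevant, irrelevant, precision, corrected precision

print('total relevant false precision')

# reference (delimiter-based) tokens
expected_count = sum([expected[key] for key in expected])
relevant_count = sum([expected[key] for key in expected if key.lower() in en_lex_delimited_dict])
irrelevant_count = sum([expected[key] for key in expected if not key.lower() in en_lex_delimited_dict])
# +21: hand-counted correction for valid tokens missing from the reference lexicon
print(expected_count,relevant_count,irrelevant_count,relevant_count/expected_count,(relevant_count+21)/expected_count)

# tokens produced by the tokenizer under test
actual_count = sum([actual[key] for key in actual])
relevant_count = sum([actual[key] for key in actual if key.lower() in en_lex_delimited_dict])
irrelevant_count = sum([actual[key] for key in actual if not key.lower() in en_lex_delimited_dict])
# +20: hand-counted correction for valid tokens missing from the reference lexicon
print(actual_count,relevant_count,irrelevant_count,relevant_count/actual_count,(relevant_count+20)/actual_count)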


total relevant false precision
2698 2677 21 0.992216456634544 1.0
2694 2662 32 0.9881217520415738 0.9955456570155902


In [33]:
#freedom-based tokenizer
misses = sorted([(key,actual[key]) for key in actual if not key.lower() in en_lex_delimited_dict],key = lambda x: x[1],reverse=True)
misses


[("It's", 2),
 ("don't", 2),
 ('500', 2),
 ('insurance?', 1),
 ('300', 1),
 ("it's", 1),
 ('Xianghubao', 1),
 ('Alipay', 1),
 ('Insurance?', 1),
 ('interest?', 1),
 ('full?', 1),
 ('securities?', 1),
 ('investment?', 1),
 ('banking?', 1),
 ('time?', 1),
 ('2', 1),
 ('.8', 1),
 ('right?', 1),
 ("doesn't", 1),
 ('not?', 1),
 ('20', 1),
 ('80%', 1),
 ("can't", 1),
 ('150%', 1),
 ('you?', 1),
 ('30%', 1),
 ("'s", 1),
 ('year?', 1),
 ('have?', 1)]