This notebook attempts multi-entity extraction from Cantonese text in which the entities of interest are mis-written, using a phonetic matching approach.

For any well-functioning NLU system, it is desirable to identify and correct misspelled or mis-written words. In English, this is usually done with an edit-distance measure (e.g. Levenshtein distance) or a phonetic algorithm (e.g. Soundex).
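As a point of reference, the edit-distance style of correction used for English can be sketched as follows. This is a minimal illustration only: the function names levenshtein and closest_word and the three-word vocabulary are assumptions made for this example, not part of the method developed below.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn string a into string b."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute ca with cb
        prev = curr
    return prev[-1]


def closest_word(word, vocabulary):
    """Return the vocabulary entry with the smallest edit distance to word."""
    return min(vocabulary, key=lambda v: levenshtein(word, v))


vocabulary = ["restaurant", "station", "street"]  # hypothetical vocabulary
print(levenshtein("restarant", "restaurant"))
print(closest_word("restarant", vocabulary))
```

This works because the misspelled English word is still a single token that can be compared against the vocabulary, which is exactly what cannot be relied on in Chinese text.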

While a misspelled word in English does not affect word boundaries, a Chinese word written with the wrong characters can often lead to tokenisation problems. Word boundaries in Chinese text are partly determined by statistical models (e.g. an HMM) that rely on the transition probabilities between states (e.g. the BMES tags) and on the words observed, and these probabilities are usually derived from text in which most words are written correctly.

As a result, mis-written Chinese words are much more troublesome to deal with than misspelled English words: the standard correction techniques used for English, such as Soundex and Levenshtein distance, do not work on incorrectly tokenised words.

The example below illustrates the problem:


In [205]:
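import jieba
# Alternative tokeniser, not used here:
# from nltk.tokenize.stanford import CoreNLPTokenizer
# sttok = CoreNLPTokenizer('http://localhost:9001')
# sttok.tokenize(sent)

# correctly written: 尖沙咀 is kept as one token
sent = "尖沙咀有咩好嘢食"
tok = jieba.cut(sent, cut_all=False)
print (list(tok))

# mis-written (嘴 instead of 咀): the place name is split into two tokens
sent = "尖沙嘴有咩好嘢食"
tok = jieba.cut(sent, cut_all=False)
print (list(tok))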



['尖沙咀', '有', '咩', '好', '嘢', '食']
['尖', '沙嘴', '有', '咩', '好', '嘢', '食']
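To make the role of the BMES states concrete, the sketch below groups characters into words from a tag sequence (B = begin, M = middle, E = end, S = single-character word). The function name bmes_to_words and the two tag strings are written by hand for illustration; they are assumptions chosen to reproduce the two segmentations printed above, not Jieba's internal output.

```python
def bmes_to_words(sentence, tags):
    """Group the characters of sentence into words using one B/M/E/S tag per character."""
    assert len(sentence) == len(tags), "one tag per character is required"
    words, current = [], ""
    for char, tag in zip(sentence, tags):
        current += char
        if tag in ("E", "S"):  # a word is complete on E (end) or S (single)
            words.append(current)
            current = ""
    if current:  # tolerate a tag sequence that stops mid-word
        words.append(current)
    return words


# Hand-written tag sequences (assumed, for illustration only)
print(bmes_to_words("尖沙咀有咩好嘢食", "BMESSSSS"))
print(bmes_to_words("尖沙嘴有咩好嘢食", "SBESSSSS"))
```

A single wrong character is enough to change the most likely state sequence, and with it the word boundaries.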


Had the word "尖沙嘴" been kept as a single token, correcting it would have been easy. Unfortunately, this is very unlikely to happen, especially with named entities. The fact that the text is in Cantonese makes it even worse, since the transition probability table used in Jieba is built from a corpus of Standard Chinese writing.

A machine-learning approach to this problem is discussed at the end. In the following, a rule-based matching approach is attempted instead: location entities are extracted by matching the jyutping romanisation of the entities. This method does not require tokenisation and thus circumvents the issue described above.

First, it is assumed that a list of the necessary entities for the extraction task can be obtained.
In this case, 360 locations in Hong Kong were extracted from a Wikipedia page with the help of BeautifulSoup and some pre-processing.


In [206]:
import requests
from bs4 import BeautifulSoup
import re

In [207]:
r = requests.get("https://zh.wikipedia.org/wiki/%E9%A6%99%E6%B8%AF%E5%9C%B0%E6%96%B9%E5%88%97%E8%A1%A8")
content = r.content
soup = BeautifulSoup(content, "html.parser")

e = soup.find_all("li")
#print (e)
locs = []
for li in e:
    for a in li.find_all("a"):
        if a.get("title"):
            locs.append(a.get("title"))
                
print (len(locs))
locs = list(set(locs))
print (len(locs))

print (locs[:50])

# drop any title containing Latin letters, e.g. 'Wikipedia:CC-BY-SA-3.0协议文本'
en_char = re.compile(r"[A-Za-z]")

locs = [loc for loc in locs if not en_char.search(loc)]
print (len(locs))

# strip bracketed suffixes, e.g. 大浪灣 (香港島) -> 大浪灣
brac1 = re.compile(r"(.*?) *\(.*?\)")
# same for full-width brackets, e.g. 草灣（页面不存在） -> 草灣
brac2 = re.compile(r"(.*?) *（.*?）")

init_len = len(locs)
locs = [loc if not brac1.findall(loc) else brac1.findall(loc)[0] for loc in locs]
locs = [loc if not brac2.findall(loc) else brac2.findall(loc)[0] for loc in locs]
assert init_len == len(locs)

# remove unnecessarily long strings, e.g. '关于本计划、你可以做什么、应该如何做'
locs = [loc for loc in locs if len(loc) < 7]
print (len(locs))
locs = [loc if loc != "山尞" else "山寮" for loc in locs] # writing mistake in Wikipedia
# drop entries that are not actual places in HK (沙咀 is just a short form of 沙咀懲教所)
locs = [loc for loc in locs if loc not in ["寻求帮助", "沙咀", "港鐵東鐵線"]]
print (locs[:50])
print (len(locs))
# more pre-processing and cleaning could be needed



568
408
['昂坪', '林村', '寻求帮助', '太平山 (香港)', 'Category:香港地方', '大浪灣 (香港島)', '对于来自此IP地址编辑的讨论[n]', '牛池灣村', '凹頭', '佐敦谷', '馬頭圍', '和合石', '髻山', '竹角', '葵芳', '赤柱', '牙鷹洲', '筆架山 (香港)', '小瀝源', '壁屋', '孟公屋', '梨木樹', '屏山 (香港)', '沙田圍', '上水', '大上托', '南灣 (香港島)', '觀塘', '牛頭角', '元朗市中心', '洪水橋', '山尞（页面不存在）', '鯉魚門', '長洲 (香港)', '新蒲崗', '蠔涌', '錦田', '打鼓嶺', '黃麻地', '打鼓嶺新村（页面不存在）', '中環', '港鐵東鐵線', '銅鑼灣 (沙田)', '西灣河', '提供当前新闻事件的背景资料', '官涌', '太和 (香港)', '深涌', '鎖羅盆', '油塘']
381
363
['昂坪', '林村', '太平山', '大浪灣', '牛池灣村', '凹頭', '佐敦谷', '馬頭圍', '和合石', '髻山', '竹角', '葵芳', '赤柱', '牙鷹洲', '筆架山', '小瀝源', '壁屋', '孟公屋', '梨木樹', '屏山', '沙田圍', '上水', '大上托', '南灣', '觀塘', '牛頭角', '元朗市中心', '洪水橋', '山寮', '鯉魚門', '長洲', '新蒲崗', '蠔涌', '錦田', '打鼓嶺', '黃麻地', '打鼓嶺新村', '中環', '銅鑼灣', '西灣河', '官涌', '太和', '深涌', '鎖羅盆', '油塘', '大埔', '牛池灣', '石湖墟', '禮頓山', '灣仔']
360


Next, we need the jyutping romanisation of each location name. Existing libraries such as PyCantonese can do this, but many words were found to lack jyutping information. In the end, a vocabulary of around 20,000 Chinese characters with their jyutping romanisation was found online and used to create a jyutping dictionary.

Note that many characters have more than one jyutping reading, so each dictionary value is a list of jyutping.

In [208]:
# jyutping of around 20,000 Chinese characters, including Cantonese characters
import os

dir_ = 'tonghanma_yue'
with open(os.path.join(dir_, "tonghanma_yue-yp.txt"), "r") as fh:
    l = fh.read().split()

l = l[12:]  # drop the first 12 tokens (presumably the file header)

word2jp = {}
for idx in range(len(l)):
    if len(l[idx]) == 1:  # a single-character token is a Chinese character
        i = 1
        jps = []
        try:
            # collect the jyutping tokens that follow, up to the next character
            while len(l[idx+i]) != 1:
                jps.append(l[idx+i])
                i += 1
        except IndexError:
            pass  # reached the end of the file
        word2jp[l[idx]] = jps
word2jp
            

{'㐀': ['jau1'],
 '㐁': ['tim2'],
 '㐅': ['ng5'],
 '㐆': ['zaan2'],
 '㐌': ['zyu4'],
 '㐖': ['zip6'],
 '㐜': ['caau4'],
 '㐡': ['no6'],
 '㐤': ['kaau4'],
 '㐨': ['zeoi6'],
 '㐩': ['zing4'],
 '㐫': ['gun3', 'hung1', 'zung1'],
 '㐬': ['lau4'],
 '㐭': ['lam5'],
 '㐮': ['soeng1'],
 '㐯': ['jung4'],
 '㐰': ['seon3'],
 '㐱': ['caa5', 'caan2', 'dou3', 'zaan2'],
 '㐲': ['daai6'],
 '㐳': ['ngaat6'],
 '㐴': ['paan1'],
 '㐷': ['maa6'],
 '㐸': ['gim1', 'him3'],
 '㐹': ['go1', 'kwok3', 'ngaat6'],
 '㐺': ['zung3'],
 '㐻': ['fu1', 'mou5', 'noi6'],
 '㐼': ['cing2'],
 '㐽': ['fung1'],
 '㑁': ['zyut3', 'zyut6'],
 '㑂': ['fong2', 'pong4'],
 '㑃': ['au2', 'au3', 'paai1'],
 '㑄': ['mou5'],
 '㑅': ['zok3', 'zok6'],
 '㑇': ['zaau3'],
 '㑈': ['dung1'],
 '㑉': ['cuk1'],
 '㑊': ['zik6'],
 '㑋': ['kung4'],
 '㑌': ['hong1'],
 '㑍': ['leoi5'],
 '㑎': ['nou5'],
 '㑏': ['cyu5', 'zyu2'],
 '㑐': ['suk1'],
 '㑔': ['seoi2'],
 '㑗': ['saan1', 'saan6'],
 '㑘': ['gaai3'],
 '㑙': ['dip6'],
 '㑚': ['naa4', 'no4'],
 '㑛': ['cuk1'],
 '㑜': ['zaai6'],
 '㑝': ['lung6'],
 '㑞': ['

In [209]:
# making sure there are no characters without jyutping
for word in word2jp:
    if word2jp[word] == []:
        print (word)

The next step is to compute the jyutping romanisation of the locations. Since some characters have more than one pronunciation, all possible pronunciation sequences need to be computed; this is done with itertools. A jyutping syllable consists of four parts: onset, nucleus, coda and tone. For this purpose the tone is removed, to account for the fact that a mis-used character can have a different tone from the correct character. Removing the tone also reduces the number of possible sequences and therefore improves computation speed.


In [211]:
import itertools
import pycantonese as pc

def sent2jyutping(sent, remove_tone = False):
    """
    sent is a string.
    return all possible jyutping sequences of sent, one parsed syllable per character.
    """
    jps_seq = []
    for char in sent:
        try:
            jps = word2jp[char] # a list of jyutping for that character
            jps = [jp.replace("ngon", "on") for jp in jps] # unified as they are very similar
            jps_seq.append(list(set(jps)))
        except KeyError:
            print (char) # the character is not in the dictionary and is skipped
    # build the sequences once, after all characters have been looked up
    possible_jps = list(itertools.product(*jps_seq))
    possible_jps = ["".join(jp) for jp in possible_jps] # all possible jyutping strings of the sentence
    # assumes parse_jyutping returns (onset, nucleus, coda, tone) tuples, as in the output below
    possible_jps = [tuple(pc.parse_jyutping(jp)) for jp in possible_jps] # a list of tuples of tuples
    if remove_tone:
        possible_jps = [tuple(tup[:3] for tup in jp) for jp in possible_jps]
        possible_jps = set(possible_jps) # sequences differing only in tone collapse into one
    return possible_jps


loc2jp = {}
for loc in locs:
    loc2jp[loc] = sent2jyutping(loc, remove_tone = True)
# some entities have more than one pronunciation
print ("城門谷: ", loc2jp["城門谷"])

城門谷:  {(('s', 'i', 'ng'), ('m', 'u', 'n'), ('j', 'u', 'k')), (('s', 'e', 'ng'), ('m', 'u', 'n'), ('g', 'u', 'k')), (('s', 'i', 'ng'), ('m', 'u', 'n'), ('g', 'u', 'k')), (('s', 'e', 'ng'), ('m', 'u', 'n'), ('j', 'u', 'k'))}
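The expansion step can also be looked at in isolation. The sketch below uses a small hand-written dictionary instead of word2jp and strips the tone digit with a regular expression instead of parsing with PyCantonese, so the syllables stay as plain strings. The names toy_word2jp and toneless_sequences, and the tone numbers given for 城, 門 and 谷, are assumptions made for this illustration; the readings of 㑁 are taken from the dictionary output above.

```python
import itertools
import re

# Hypothetical toy dictionary: character -> list of jyutping readings (with tone)
toy_word2jp = {
    "城": ["sing4", "seng4"],
    "門": ["mun4"],
    "谷": ["guk1", "juk6"],
    "㑁": ["zyut3", "zyut6"],
}


def toneless_sequences(word, readings):
    """Return the set of toneless jyutping sequences for word."""
    per_char = [readings[char] for char in word]
    sequences = set()
    # Cartesian product: pick one reading per character, in every combination
    for combo in itertools.product(*per_char):
        # strip the trailing tone digit from every syllable
        sequences.add(tuple(re.sub(r"\d$", "", syl) for syl in combo))
    return sequences


print(len(toneless_sequences("城門谷", toy_word2jp)))  # 2 * 1 * 2 combinations
print(toneless_sequences("㑁", toy_word2jp))  # two readings that differ only in tone
```

Readings that differ only in tone collapse into a single sequence once the tone is removed, which is where the reduction in the number of sequences comes from.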


In [212]:
# example
sent = "同鑼灣或者尖嘴有咩食"
q = sent2jyutping(sent,remove_tone = True)
print (len(q)) # this sentence has 3 possible ways of pronouncing it (with tone removed).

3


Before entity extraction, the sentence is first corrected. This is done using the correction and match functions below.

Note that occur_table is a fake occurrence frequency table. In practice, it would be computed by counting how often each entity occurs in a large corpus. The table helps decide which entity is more likely when several entities are equally good phonetic matches.
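As a rough sketch of how such a table could be built from real text, the snippet below counts entity occurrences over a list of sentences. The function name build_occur_table, the three-sentence corpus and the entity list are all made up for illustration; a real table would need a much larger corpus.

```python
from collections import Counter


def build_occur_table(corpus, entities):
    """Count how many times each entity string occurs across the corpus sentences."""
    counts = Counter()
    for sentence in corpus:
        for entity in entities:
            # str.count counts non-overlapping substring matches, so an entity
            # embedded in a longer place name is counted as well
            counts[entity] += sentence.count(entity)
    return dict(counts)


toy_corpus = ["我今日去咗尖沙咀", "尖沙咀同中環都好多人", "中環返工"]  # hypothetical sentences
toy_entities = ["尖沙咀", "中環", "中灣"]
table = build_occur_table(toy_corpus, toy_entities)
print(table)
```

Entities that never occur still get an entry with a count of 0, so a later lookup does not fail.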

In [216]:
# fake occurrence frequency table
occur_table = {"尖沙咀": 10,
               "尖鼻咀": 1,
               "中環": 10,
               "中灣": 1,
               "爛角咀": 1,
               "大角咀": 10,
               "黑角頭": 1,
               "牛頭角": 10,
               "馬頭角": 3,
               "沙頭角": 5}

def match(focused_jp, entity_dict):
    """
    focused_jp is a tuple of toneless jyutping syllables taken from the sentence.
    entity_dict is a entity dictionary with values as sets of tuples of tuples.
    return the best-matching entity, or None if no entity has precision > 0.6.
    """
    max_precision = -1
    most_probable_entity = None
    for entity in entity_dict:
        possible_jp_seq = entity_dict[entity] # set of jyutping sequences (tuples of tuples)
        for entity_jp in possible_jp_seq:
            # number of window syllables found in the entity, relative to the entity's length
            precision = sum(1 for tup in focused_jp if tup in entity_jp) / len(entity_jp)
            if precision > 0.6:
                if precision > max_precision:
                    max_precision = precision
                    most_probable_entity = entity
                    break
                elif precision == max_precision:
                    # tie: prefer the more frequent entity (entities missing from the table count as 0)
                    if occur_table.get(entity, 0) > occur_table.get(most_probable_entity, 0):
                        most_probable_entity = entity
                        break
    return most_probable_entity


def correction(sent, entity_dict):
    """
    sent is a string.
    entity_dict is a entity dictionary with values as sets of tuples of tuples.
    return the corrected sentence as string.
    """
    entity_tags = "o" * len(sent) # one tag per character; "o" means untagged
    sent_jps = sent2jyutping(sent, remove_tone = True)
    tag = "T" # an arbitrary tag
    #1
    # tag every entity that appears verbatim in the sentence (first occurrence only)
    for entity in entity_dict:
        if entity in sent:
            start = sent.index(entity)
            end = start + len(entity)
            entity_tags = entity_tags[:start] + len(entity) * tag + entity_tags[end:]
    #2
    # slide a window of the entity's length over the sentence; if precision > 0.6,
    # replace the window with the best-matching entity and tag it
    for entity in entity_dict:
        for entity_jp in entity_dict[entity]: # each jyutping sequence of the location
            len_window = min(len(entity_jp), len(sent)) # number of syllables
            for sent_jp in sent_jps: # each jyutping sequence of the sentence
                for idx in range(len(sent_jp) - len_window + 1):
                    focused_window = sent_jp[idx:idx+len_window]
                    # the window must share its first or last syllable with the entity
                    if focused_window[0] != entity_jp[0] and focused_window[-1] != entity_jp[-1]:
                        continue
                    # only proceed if no character in the window is tagged already
                    if any(t != "o" for t in entity_tags[idx:idx+len_window]):
                        continue
                    precision = sum(1 for syl in focused_window if syl in entity_jp) / len_window
                    if precision <= 0.6:
                        continue
                    start_char = idx
                    end_char = idx + len_window - 1
                    correct_entity = match(focused_window, entity_dict = entity_dict)
                    if correct_entity is not None:
                        entity_tags = entity_tags[:start_char] + len(correct_entity) * tag + entity_tags[end_char + 1:]
                        sent = sent[:start_char] + correct_entity + sent[end_char + 1:]
                        sent_jps = sent2jyutping(sent, remove_tone = True)
    return sent

def entities_extractor(sent, entity_dict, tag, entity_tags = None):
    """
    sent is a string.
    entity_dict is a entity dictionary with values as sets of tuples of tuples.
    tag is a one character string.
    entity_tags (Optional) is a string of length len(sent)
    return the list of entities found and the corresponding tag string.
    """
    assert len(tag) == 1, "tag must be exactly 1 character long"
    if entity_tags is None:
        entity_tags = "o" * len(sent)
    else:
        assert isinstance(entity_tags, str) and len(entity_tags) == len(sent), "entity_tags must be a string of the same length as sent"
    entities = []
    for entity in entity_dict:
        if entity in sent:
            start = sent.index(entity)
            end = start + len(entity)
            # only tag the span if none of its characters is tagged already
            if all(t == "o" for t in entity_tags[start:end]):
                entity_tags = entity_tags[:start] + len(entity) * tag + entity_tags[end:]
                entities.append(entity)
    return entities, entity_tags

sent1 = "點樣可以去士瓜灣"
sent2 = "點樣可以去尖嘴"
sent3 = "點樣可以去同鑼灣"
sent4 = "同鑼灣點樣去"
sent5 = "完朗有無得打邊爐"
sent6 = "點樣可以由火碳去油唐"
sent7 = "天水偉邊度有壽司食"

sent8 = "同鑼灣或者全灣或者筲機灣或者牛投角或者中環有咩食" 

for sent in [sent1, sent2,sent3, sent4, sent5, sent6,sent7,sent8]:
    print ("original: " + sent)
    corrected_sent = correction(sent,loc2jp)
    print ("corrected: " + corrected_sent)
    entities, entity_tags = entities_extractor(corrected_sent, loc2jp, "L")
    print ("entities", "entity_tag:")
    print (entities, entity_tags)


original: 點樣可以去士瓜灣
corrected: 點樣可以去土瓜灣
entities entity_tag:
['土瓜灣'] oooooLLL
original: 點樣可以去尖嘴
corrected: 點樣可以尖沙咀
entities entity_tag:
['尖沙咀'] ooooLLL
original: 點樣可以去同鑼灣
corrected: 點樣可以去銅鑼灣
entities entity_tag:
['銅鑼灣'] oooooLLL
original: 同鑼灣點樣去
corrected: 銅鑼灣點樣去
entities entity_tag:
['銅鑼灣'] LLLooo
original: 完朗有無得打邊爐
corrected: 元朗有無得打邊爐
entities entity_tag:
['元朗'] LLoooooo
original: 點樣可以由火碳去油唐
corrected: 點樣可以由火炭去油塘
entities entity_tag:
['油塘', '火炭'] oooooLLoLL
original: 天水偉邊度有壽司食
corrected: 天水圍邊度有壽司食
entities entity_tag:
['天水圍'] LLLoooooo
original: 同鑼灣或者全灣或者筲機灣或者牛投角或者中環有咩食
corrected: 銅鑼灣或者荃灣或者筲箕灣或者牛頭角或者中環有咩食
entities entity_tag:
['牛頭角', '中環', '銅鑼灣', '筲箕灣', '荃灣'] LLLooLLooLLLooLLLooLLooo


It can be seen that the above code extracts multiple entities from a text even when the entities are mis-written. Each character belonging to an entity is tagged with "L" to indicate that it is part of a location entity.
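To see why 同鑼灣 is picked up as 銅鑼灣, the window scoring can be reproduced on its own. In the sketch below the toneless (onset, nucleus, coda) syllables are written out by hand rather than produced by sent2jyutping, and window_precision is a name introduced for this illustration; it mirrors the precision computed inside correction.

```python
def window_precision(window, entity_jp):
    """Fraction of syllables in window that also occur somewhere in entity_jp."""
    return sum(1 for syl in window if syl in entity_jp) / len(window)


# Hand-written toneless syllables (assumed, for illustration only)
sent_jp = (("t", "u", "ng"), ("l", "o", ""), ("w", "aa", "n"),    # 同鑼灣
           ("d", "i", "m"), ("j", "oe", "ng"), ("h", "eoi", ""))  # 點樣去
entity_jp = (("t", "u", "ng"), ("l", "o", ""), ("w", "aa", "n"))  # 銅鑼灣

for idx in range(len(sent_jp) - len(entity_jp) + 1):
    window = sent_jp[idx:idx + len(entity_jp)]
    # same anchoring rule as in correction: first or last syllable must agree
    anchored = window[0] == entity_jp[0] or window[-1] == entity_jp[-1]
    print(idx, round(window_precision(window, entity_jp), 2), anchored)
```

Only the window starting at index 0 is both anchored and above the 0.6 threshold. The window at index 1 also scores above 0.6, but it is rejected because neither its first nor its last syllable agrees with the entity.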

Note the strong assumption that a mis-written word results solely from a wrong choice of characters, not from a missing character. Specifically, if a word is missing a character, the correction function breaks down (see 點樣可以去尖嘴 above, where 去 is lost from the corrected sentence).

The correction becomes more complicated if it has to allow for words with missing characters (although this does not affect the entity extraction result). In practice, most mis-written words are due to a wrong choice of character rather than a missing one, so I decided to deal only with the former for now. Taking this location extraction task as an example, 尖咀 is strictly speaking not a mis-written word but a colloquial short form, so it might be more appropriate to add it to the list of entities.
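One possible way to handle such short forms, sketched under the assumption that they are collected by hand, is to keep an alias table next to the entity list and expand aliases before correction. The names aliases and expand_aliases are hypothetical and not part of the code above.

```python
# Hypothetical alias table: colloquial short form -> canonical entity name
aliases = {"尖咀": "尖沙咀"}


def expand_aliases(sent, alias_table):
    """Replace every known short form in sent with its canonical entity name."""
    # longest aliases first, so a short alias cannot pre-empt a longer one
    for alias in sorted(alias_table, key=len, reverse=True):
        sent = sent.replace(alias, alias_table[alias])
    return sent


print(expand_aliases("點樣可以去尖咀", aliases))
```

Since the expansion is an exact string replacement, it leaves the surrounding characters untouched.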

In summary, a phonetic matching approach to entity extraction from Cantonese text with mis-written words has been shown. The method circumvents the incorrect tokenisation that arises in Chinese text when wrong characters are used. Judging by observation of examples such as those above, the method has high precision, as is typical of rule-based NLU methods.

A potential problem with this method is scalability. For example, people might ask for restaurants near a very specific location (e.g. by giving a street name) that does not exist in our list of entities.

An alternative approach would be to train a supervised machine learning model to tag entities. This is not straightforward, however, as written Cantonese text is not easy to find. With the ~50,000 Cantonese Wikipedia articles available, together with comments from Cantonese forums, it might be possible to extract an adequate number of Cantonese sentences containing locations. These could then be used as labelled data to train an ML model specifically for location entity tagging.
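A minimal sketch of how such labelled data might be produced from raw sentences is given below: each sentence containing a known location is paired with a character-level tag string in the same "o"/"L" format used above. The function name label_sentence and the toy sentences are assumptions for illustration, and no model training is shown.

```python
def label_sentence(sent, entities, tag="L"):
    """Return a character-level tag string for sent, or None if it contains no known entity."""
    tags = ["o"] * len(sent)
    found = False
    for entity in sorted(entities, key=len, reverse=True):  # prefer longer names
        start = sent.find(entity)
        while start != -1:
            end = start + len(entity)
            if all(t == "o" for t in tags[start:end]):  # do not overwrite earlier tags
                tags[start:end] = [tag] * len(entity)
                found = True
            start = sent.find(entity, start + 1)
    return "".join(tags) if found else None


toy_entities = ["銅鑼灣", "中環"]
toy_sentences = ["銅鑼灣或者中環有咩食", "今日好熱"]  # hypothetical raw sentences
pairs = [(s, label_sentence(s, toy_entities)) for s in toy_sentences]
training_pairs = [(s, t) for s, t in pairs if t is not None]
print(training_pairs)
```

Sentences without any known location are dropped here; whether to keep them as negative examples would be a design decision for the actual model.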
