This repository has been archived by the owner on Jul 14, 2022. It is now read-only.

Chinese romanization in Hong Kong and Macau area #35

Open
c933103 opened this issue Nov 22, 2019 · 15 comments

Comments

@c933103

c933103 commented Nov 22, 2019

Currently, the transliteration process in this project defaults to using Mandarin for all Chinese text. However, in Hong Kong and Macau, Cantonese rather than Mandarin is the language people commonly use to read Chinese, and place names in both places are usually romanized from Cantonese. Please transliterate Chinese names in both places using Cantonese instead of Mandarin.

@giggls
Owner

giggls commented Nov 22, 2019

Shame on me that I was not aware of this issue.
I am already doing something very similar in Japan using libkakasi.
Do you know of a free transliteration library for Cantonese or is it possible to configure libicu in a suitable way?

Currently I use any-latin from libicu by default.

@giggls
Owner

giggls commented Nov 22, 2019

I just had a look at the lat/long to country mapping code and realized that I currently have no way to distinguish between Mandarin- and Cantonese-speaking areas. All I can get at the moment is a country code, which is cn in Hong Kong as well as in mainland China. We need to check with https://github.com/openstreetmap/Nominatim/tree/master/data-sources/country-grid where this mapping table originates.

@giggls
Owner

giggls commented Nov 27, 2019

OK, I fixed the second issue, so all I need to get this up and running is a Cantonese transcription library. Any idea?

@chatelao
Contributor

I'm not a Chinese expert at all, but maybe the last answer (search for "Cantonese") helps:

https://chinese.stackexchange.com/questions/21035/api-for-transliterating-a-traditional-character-writing-the-pinyin

@giggls
Owner

giggls commented Nov 27, 2019

Hm probably this one does what we need:
https://github.com/lucwastiaux/python-pinyin-jyutping-sentence

@c933103
Author

c933103 commented Nov 30, 2019

Sorry for the late response. 1. Hong Kong has the country code HK and Macau has the country code MO. 2. Yes, that Jyutping transliteration tool would work. Note that Jyutping is a bit different from the romanization most commonly used for place names in Hong Kong and Macau. However, that system is not fully established and has quite a bit of arbitrary ambiguity, so I don't think a full one-to-one transliteration tool for it is available on the internet, and I guess this is close enough. On the other hand, I notice the linked tool uses its own custom method to represent Cantonese tones, which isn't widely used, and place name romanization in Hong Kong and Macau usually doesn't show tones anyway, so I would recommend normalizing the output and removing all the diacritics.
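
As a rough illustration of point 1, choosing the Chinese transliteration by country code could look like the sketch below. The function name and the "cantonese"/"mandarin" labels are hypothetical placeholders, not names used by this project; only the HK and MO country codes come from the discussion.

```python
# Hypothetical set of country codes (ISO 3166-1 alpha-2, lowercased)
# where Chinese names should be read as Cantonese.
CANTONESE_COUNTRIES = {"hk", "mo"}


def chinese_method_for(country_code: str) -> str:
    """Return a placeholder label for the transliteration to apply."""
    if country_code.lower() in CANTONESE_COUNTRIES:
        return "cantonese"
    # Everything else keeps the current Mandarin default.
    return "mandarin"
```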

@chatelao
Contributor

What would normalizing the output mean? Just removing the diacritics and using plain ASCII?

@c933103
Author

c933103 commented Nov 30, 2019 via email

@chatelao
Contributor

So "ngǒ déi dongsāt zó" becomes "ngo dei dongsat zo", without expansion to double letters or similar?
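
A minimal Python sketch of that normalization, assuming the tone marks are diacritics that Unicode can decompose into combining characters (which holds for the example above). The helper name `strip_diacritics` is made up for illustration.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Drop combining marks so that only the base letters remain."""
    # NFD splits each accented letter into base letter + combining mark(s).
    decomposed = unicodedata.normalize("NFD", text)
    # unicodedata.combining() returns a non-zero class for combining marks.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(strip_diacritics("ngǒ déi dongsāt zó"))  # ngo dei dongsat zo
```

Letters without a canonical decomposition would pass through unchanged, so this is not a general "convert to ASCII" step.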

@giggls
Owner

giggls commented Nov 30, 2019

In the meantime I got the impression that my current approach does not work very well.
Re-initialising the transcription methods is too expensive to do again and again.

Unfortunately the initialisation sequence of the python-pinyin-jyutping-sentence library seems to be even slower than the Thai transcription library which is already too slow.

I propose a daemon written in Python which does the actual transcription of a string in the requested source language. An advantage of this approach is that the slow constructors of the transcription instances will not matter any more, because they run only once.

I will hopefully have a bit of time to code this around the upcoming holidays.
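
A rough sketch of what such a daemon could look like, not the design that was eventually implemented. Everything here is an assumption for illustration: the line-delimited JSON protocol, the port number, and the names `TranscriberCache`, `make_handler` and `serve`. The real transcription libraries would be plugged in through the `factories` mapping.

```python
import json
import socketserver
from typing import Callable, Dict

Transcriber = Callable[[str], str]


class TranscriberCache:
    """Builds each transcriber once and reuses it for all later requests."""

    def __init__(self, factories: Dict[str, Callable[[], Transcriber]]):
        self._factories = factories
        self._instances: Dict[str, Transcriber] = {}

    def transcribe(self, lang: str, text: str) -> str:
        if lang not in self._instances:
            # The slow constructor runs only on the first request per language.
            self._instances[lang] = self._factories[lang]()
        return self._instances[lang](text)


def make_handler(cache: TranscriberCache):
    class Handler(socketserver.StreamRequestHandler):
        def handle(self):
            # Hypothetical protocol: one JSON object per line, e.g.
            # {"lang": "...", "text": "..."}, answered with {"result": "..."}.
            for line in self.rfile:
                request = json.loads(line)
                result = cache.transcribe(request["lang"], request["text"])
                reply = json.dumps({"result": result}) + "\n"
                self.wfile.write(reply.encode("utf-8"))

    return Handler


def serve(cache: TranscriberCache, host: str = "127.0.0.1", port: int = 8033):
    """Run until interrupted; host and port are arbitrary placeholders."""
    with socketserver.TCPServer((host, port), make_handler(cache)) as server:
        server.serve_forever()
```

The database function would then only open a connection and send the string, so the cost of constructing a transcriber is paid once per daemon start rather than once per call or per session.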

@c933103
Author

c933103 commented Nov 30, 2019

@chatelao correct.

@chatelao
Contributor

chatelao commented Nov 30, 2019

@giggls Maybe this strange GD/SD approach could be good enough?

CREATE FUNCTION test() RETURNS text
LANGUAGE plpythonu
AS $$
if 'json' in SD:
    json = SD['json']
else:
    import json
    SD['json'] = json

return json.dumps(...)
$$;

So maybe something like this helps for "tltk"?

if 'tltk' in SD:
    tltk = SD['tltk']
else:
    import tltk
    SD['tltk'] = tltk

@giggls
Owner

giggls commented Nov 30, 2019

Well, unfortunately I guess you missed my point, at least in some regard.

While the import itself might be slow and your solution might mitigate that, it is in fact only one part of the problem.

The problem is that the instantiation of classes (this is also true for libicu transliteration) is done on every single transliteration call, as the instance gets destroyed after the psql function call is finished.

An already created instance should be re-used instead.

Looks like I need to do some performance tests to find out how expensive these calls really are.

@chatelao
Contributor

chatelao commented Dec 1, 2019

Maybe class instances can be stored in the "SD" space and reused from there? This may depend on the multi-thread capabilities of the libraries.
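
A minimal sketch of that idea in plain Python, where an ordinary dict stands in for PL/Python's `SD` (or `GD`) dictionary. `get_instance` and `SlowTransliterator` are hypothetical names used only for this illustration; whether a given library's objects can safely be kept alive this way is not confirmed here.

```python
def get_instance(store: dict, key: str, factory):
    """Return the cached object for key, building it on first use only."""
    if key not in store:
        store[key] = factory()
    return store[key]


class SlowTransliterator:
    """Hypothetical stand-in for a class with an expensive constructor."""

    constructed = 0

    def __init__(self):
        SlowTransliterator.constructed += 1

    def transliterate(self, text: str) -> str:
        return text  # placeholder: a real library would do the work here


SD = {}  # inside a PL/Python function, the built-in SD dict would be used

first = get_instance(SD, "translit", SlowTransliterator)
second = get_instance(SD, "translit", SlowTransliterator)
```

Both calls return the same object, so the constructor runs once per session instead of once per function call.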

@giggls
Owner

giggls commented Dec 1, 2019

Unfortunately this approach does not seem to change anything regarding the speed of loading pinyin_jyutping_sentence.
On my desktop the import currently takes 20 seconds on the first call in a PostgreSQL session and is fast on subsequent calls, regardless of whether I use this GD/SD stuff or not.

I do not consider 20 seconds acceptable.

CREATE OR REPLACE FUNCTION pyfunc() RETURNS float AS $$
  import time

  # Time only the import of the transliteration library.
  start = time.time()
  import pinyin_jyutping_sentence
  end = time.time()
  return end - start
$$ LANGUAGE plpython3u STABLE;
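
A likely explanation for the fast subsequent calls, stated as an assumption rather than something measured in this thread: Python keeps every imported module in `sys.modules` for the lifetime of the interpreter, so a repeated `import` is only a dictionary lookup. The sketch below shows the effect outside PostgreSQL, with the standard library module `decimal` as an arbitrary stand-in for pinyin_jyutping_sentence; `timed_import` is a made-up helper name.

```python
import importlib
import sys
import time


def timed_import(module_name: str) -> float:
    """Return the wall-clock seconds spent importing module_name."""
    start = time.perf_counter()
    importlib.import_module(module_name)
    return time.perf_counter() - start


first = timed_import("decimal")   # may load the module from disk
second = timed_import("decimal")  # served from the sys.modules cache
print(f"first: {first:.6f}s, second: {second:.6f}s")
print("decimal" in sys.modules)   # True
```

If that explanation holds, caching in GD/SD cannot shorten the first call of a session; only a process that outlives the session can.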
