Chinese romanization in Hong Kong and Macau area #35

c933103 · 2019-11-22T15:14:28Z

Currently, the transliteration process in this project defaults to Mandarin for all Chinese text. However, in Hong Kong and Macau, Cantonese rather than Mandarin is the language people commonly use to read Chinese, and place names in both places are usually romanized from Cantonese. Please transliterate Chinese names in both places using Cantonese instead of Mandarin.

giggls · 2019-11-22T16:25:19Z

Shame on me that I was not aware of this issue.
I am already doing something very similar in Japan using libkakasi.
Do you know of a free transliteration library for Cantonese or is it possible to configure libicu in a suitable way?

Currently I use any-latin from libicu by default.

giggls · 2019-11-22T16:42:30Z

I just had a look at the lat/long to country mapping code and realized that I currently have no way to distinguish between Mandarin-speaking and Cantonese-speaking areas. Currently all I can get is a country code, which is cn in Hong Kong as well as in mainland China. We need to check with https://github.com/openstreetmap/Nominatim/tree/master/data-sources/country-grid where this mapping table originates.

giggls · 2019-11-27T12:26:50Z

OK, fixed the second issue, thus all I need to get this up and running would be a Cantonese transcription library. Any idea?

chatelao · 2019-11-27T19:18:54Z

I'm not a Chinese expert at all, but maybe the last answer (search for "Cantonese") helps:

https://chinese.stackexchange.com/questions/21035/api-for-transliterating-a-traditional-character-writing-the-pinyin

giggls · 2019-11-27T19:25:28Z

Hm probably this one does what we need:
https://github.com/lucwastiaux/python-pinyin-jyutping-sentence

c933103 · 2019-11-30T12:47:11Z

Sorry for the late response. 1. Hong Kong has the country code HK and Macau has the country code MO. 2. Yes, that jyutping transliteration tool would work. Note that its output differs a bit from the place name romanization most commonly used in Hong Kong and Macau. However, that common romanization is not a fully established system and contains quite a bit of arbitrary ambiguity, so I don't think a full one-to-one transliteration tool for it is available on the internet; I guess this is close enough. On the other hand, I notice that the linked tool uses its own custom method to represent Cantonese tones, which isn't widely used, and place name romanization in Hong Kong and Macau usually doesn't show tones anyway, so I would recommend normalizing the output and removing all the diacritics.
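To illustrate point 1, here is a minimal sketch of picking the transliteration by country code. The two `transliterate_*` functions are hypothetical placeholders, not part of this project or of any library; only the HK and MO codes come from this thread.

```python
# Country codes where Cantonese should be used, as discussed above.
CANTONESE_COUNTRY_CODES = {"hk", "mo"}


def transliterate_mandarin(text):
    # Hypothetical placeholder for the existing Mandarin transliteration.
    return "mandarin:" + text


def transliterate_cantonese(text):
    # Hypothetical placeholder for a Cantonese (jyutping) transliteration.
    return "cantonese:" + text


def transliterate_chinese(text, country_code):
    """Choose the transliteration based on the country code of the object."""
    if country_code.lower() in CANTONESE_COUNTRY_CODES:
        return transliterate_cantonese(text)
    return transliterate_mandarin(text)
```

The lookup is case-insensitive so that both "HK" and "hk" select Cantonese.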

chatelao · 2019-11-30T15:27:20Z

What would normalizing the output mean? Just remove the diacritics and use plain ASCII?

c933103 · 2019-11-30T15:37:36Z

Yes

chatelao · 2019-11-30T16:01:52Z

So "ngǒ déi dongsāt zó" becomes "ngo dei dongsat zo", without expansion to double letters or similar?
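A minimal sketch of that normalization, using only Python's standard `unicodedata` module. The helper name `strip_diacritics` is made up for illustration; it decomposes each character and drops the combining marks.

```python
import unicodedata


def strip_diacritics(text):
    """Remove tone marks and other combining diacritics from a string."""
    # NFD splits e.g. "é" into "e" plus a combining acute accent.
    decomposed = unicodedata.normalize("NFD", text)
    # Keep everything except the combining marks themselves.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", stripped)


print(strip_diacritics("ngǒ déi dongsāt zó"))  # ngo dei dongsat zo
```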

giggls · 2019-11-30T17:03:35Z

In the meantime I got the impression that my current approach does not work very well.
Re-initialising the transcription methods again and again is too expensive.

Unfortunately the initialisation sequence of the python-pinyin-jyutping-sentence library seems to be even slower than the Thai transcription library which is already too slow.

I propose a daemon written in Python which will do the actual transcription of a string in the requested source language. An advantage of this approach would be that the slow constructors of transcription instances will not matter any more.

I will hopefully have a bit of time to code this around the upcoming holidays.
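A rough sketch of what such a daemon could look like, not an actual implementation. It assumes a line-based JSON protocol over TCP; `SlowTranscriber`, `ask`, `start_daemon` and the "cantonese" key are all made up for illustration, and the placeholder backend only strips diacritics. The point is that the expensive constructor runs once at start-up, not once per request.

```python
import json
import socket
import socketserver
import threading
import unicodedata


class SlowTranscriber:
    """Stand-in for a transcription class with an expensive constructor."""

    instances_created = 0

    def __init__(self):
        # Imagine several seconds of dictionary loading here.
        SlowTranscriber.instances_created += 1

    def transcribe(self, text):
        # Placeholder logic only: a real backend would romanize the text.
        decomposed = unicodedata.normalize("NFD", text)
        return "".join(c for c in decomposed if not unicodedata.combining(c))


# Built once when the daemon starts and shared by every request.
TRANSCRIBERS = {"cantonese": SlowTranscriber()}


class Handler(socketserver.StreamRequestHandler):
    def handle(self):
        # Protocol assumption: one JSON object per line, {"lang": ..., "text": ...}.
        for line in self.rfile:
            request = json.loads(line)
            backend = TRANSCRIBERS.get(request["lang"])
            if backend is None:
                reply = {"error": "unsupported language"}
            else:
                reply = {"result": backend.transcribe(request["text"])}
            self.wfile.write((json.dumps(reply) + "\n").encode("utf-8"))


def start_daemon(host="127.0.0.1", port=0):
    """Start the server in a background thread; port 0 picks a free port."""
    server = socketserver.ThreadingTCPServer((host, port), Handler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def ask(address, lang, text):
    """Client helper: send one request and return the decoded reply."""
    with socket.create_connection(address) as conn:
        conn.sendall((json.dumps({"lang": lang, "text": text}) + "\n").encode("utf-8"))
        with conn.makefile("r", encoding="utf-8") as reader:
            return json.loads(reader.readline())
```

A database function would then only need the small client part (`ask`), while all slow initialisation stays inside the long-running process.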

c933103 · 2019-11-30T19:39:14Z

@chatelao correct.

chatelao · 2019-11-30T21:05:32Z

@giggls Maybe this strange GD/SD approach could be good enough?

https://stackoverflow.com/questions/15023080/how-are-import-statements-in-plpython-handled

CREATE FUNCTION test() RETURNS text
LANGUAGE plpythonu
AS $$
if 'json' in SD:
    json = SD['json']
else:
    import json
    SD['json'] = json

return json.dumps(...)
$$;

So maybe for "tltk" something like this helps?:

if 'tltk' in SD:
    tltk = SD['tltk']
else:
    import tltk
    SD['tltk'] = tltk

giggls · 2019-11-30T23:47:48Z

Well, unfortunately I guess you missed my point, at least in some regard.

While the import might be slow and your solution might mitigate that, it is in fact only one part of the problem.

The problem is that classes are instantiated (this is also true for libicu transliteration) on every single transliteration call, because the instance gets destroyed once the psql function call has finished.

An already created instance should be re-used instead.

Looks like I need to do some performance tests to find out how expensive these calls really are.

chatelao · 2019-12-01T14:51:45Z

Maybe classes can be stored in the "SD" space and reused from there? This may depend on the multi-threading capabilities of the libraries.
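A sketch of the idea in plain Python, so it can be tried outside the database. Here `SD` is an ordinary dict standing in for the per-function dictionary that PL/Python provides, and `ExpensiveTransliterator` and `get_transliterator` are hypothetical names, not part of any library.

```python
class ExpensiveTransliterator:
    """Stand-in for a class whose constructor is slow."""

    constructions = 0

    def __init__(self, language):
        ExpensiveTransliterator.constructions += 1
        self.language = language

    def transliterate(self, text):
        # Placeholder: a real implementation would romanize the text.
        return text


# Stand-in for PL/Python's SD dictionary, which persists between calls
# of the same function within one database session.
SD = {}


def get_transliterator(language):
    """Create the instance on first use and return the stored one afterwards."""
    key = ("transliterator", language)
    if key not in SD:
        SD[key] = ExpensiveTransliterator(language)
    return SD[key]
```

Note that this only avoids repeated construction within one session; the first call in every new session still pays the full cost.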

giggls · 2019-12-01T14:56:32Z

Unfortunately this approach does not seem to change anything regarding the speed of loading pinyin_jyutping_sentence at all.
On my desktop this currently takes 20 seconds on the first call in a PostgreSQL session and is fast on subsequent calls, regardless of whether I use this GD/SD stuff or not.

I do not consider 20 seconds acceptable.

CREATE OR REPLACE FUNCTION pyfunc() RETURNS float AS $$
  import time

  # Measure how long the import itself takes.
  start = time.time()
  import pinyin_jyutping_sentence
  end = time.time()
  return end - start
$$ LANGUAGE plpython3u STABLE;
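A likely explanation for the fast subsequent calls, sketched in plain Python: the interpreter keeps every imported module in `sys.modules`, so a repeated import in the same interpreter returns the cached module. This assumes the PL/Python interpreter stays alive for the whole session. The standard-library module `decimal` is only a stand-in for pinyin_jyutping_sentence, and `timed_import` is a made-up helper.

```python
import importlib
import sys
import time


def timed_import(module_name):
    """Import a module by name and return it with the elapsed seconds."""
    start = time.perf_counter()
    module = importlib.import_module(module_name)
    return module, time.perf_counter() - start


first_module, first_seconds = timed_import("decimal")
second_module, second_seconds = timed_import("decimal")

# The second import is served from the sys.modules cache.
print("decimal" in sys.modules, first_module is second_module)
print("first: %.6f s, second: %.6f s" % (first_seconds, second_seconds))
```

This would also explain why caching the module in GD/SD makes no measurable difference: the module is already cached by Python itself after the first import.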

giggls mentioned this issue on Feb 6, 2020: osml10n_get_streetname_from_tags seems to be slower since v2.5.7 #40 (Closed)

This was referenced on May 21, 2020:
Optimize SQL functions #43 (Merged)
Some thoughts about the future of this project #53 (Open)
