Regex engine: allow unicode character ranges #332

Open
maxerickson opened this issue Oct 24, 2016 · 10 comments

Comments

@maxerickson

On IRC someone wondered about finding objects with mixed language names and I thought the very simple approach might work:

[out:json][timeout:25];
way["highway"]["name"~"[a-z]",i]["name"~"[\\ue400-\\u9fff]"]({{bbox}});
out center;

The answer is:

Error: line 2: static error: Invalid regular expression: "[-鿿]"

So I guess the regex implementation doesn't support such character ranges?
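One detail worth checking independently of the engine: U+E400 is greater than U+9FFF, so the range as typed runs backwards, and many regex engines reject a reversed range whether or not they understand Unicode. The CJK Unified Ideographs block is U+4E00 to U+9FFF, so `\u4e00` may have been intended. The sketch below illustrates this in Python; the name `has_latin_and_cjk` is made up for illustration, and nothing here claims to reproduce how the Overpass server evaluates the filter.

```python
import re

# The range exactly as written in the query. U+E400 is greater than U+9FFF,
# so Python's re module rejects it as a reversed character range.
try:
    re.compile("[\ue400-\u9fff]")
    reversed_range_compiles = True
except re.error:
    reversed_range_compiles = False

LATIN = re.compile(r"[a-z]", re.IGNORECASE)
# Assumption: the CJK Unified Ideographs block (U+4E00 to U+9FFF) was intended.
CJK = re.compile("[\u4e00-\u9fff]")


def has_latin_and_cjk(name):
    """Return True if the name has a basic Latin letter and a CJK ideograph."""
    return bool(LATIN.search(name)) and bool(CJK.search(name))
```

With these definitions, `has_latin_and_cjk("Main Street 中山路")` is true, while a name in only one of the two scripts is not flagged.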

@maxerickson
Author

A reasonable workaround is to (roughly) invert the ASCII character class:

[out:json][timeout:25];
way["highway"]["name"~"[a-z]",i]["name"~"[^a-z0-9 \.-]",i]({{bbox}});
out center;

So I guess if the issue is in the library, doing nothing is a very reasonable resolution.
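To see why this is only a rough test, here is a small Python sketch of the same two filters applied to plain strings. The function name `looks_mixed` is hypothetical, and Python's `re` module stands in for the server's regex engine, so edge-case behaviour may differ from Overpass.

```python
import re

# Same two patterns as in the query, both case-insensitive.
HAS_LATIN = re.compile(r"[a-z]", re.IGNORECASE)
# Anything that is not a basic letter, digit, space, dot or hyphen.
HAS_OTHER = re.compile(r"[^a-z0-9 \.-]", re.IGNORECASE)


def looks_mixed(name):
    """Return True if the name has a Latin letter and some 'other' character."""
    return bool(HAS_LATIN.search(name)) and bool(HAS_OTHER.search(name))
```

The inverted class catches CJK characters, but it also flags accented letters and punctuation such as the apostrophe in "O'Connell Street", so results need manual review.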

@drolbr
Owner

drolbr commented Oct 27, 2016

The service does indeed delegate the regular expression filtering to the POSIX library. It uses a UTF-8 locale if available. Should we close the issue or mark it as invalid or do something else?

drolbr changed the title from "Unicode character ranges." to "Regex engine: allow unicode character ranges" on Oct 31, 2016

@drolbr
Owner

drolbr commented Oct 31, 2016

It might be worth having our own regex parser. This would allow us to correct behaviour like the above. It would also solve some potential security issues. Hence, I am keeping this issue open as an "enhancement" with that direction.

@mmd-osm
Contributor

mmd-osm commented Jan 2, 2017

I don't see much reason for a home-grown solution, which would have to implement http://www.unicode.org/reports/tr18/ in the end. There are already widely adopted solutions available for this particular requirement:

ICU is a mature, widely used set of C/C++ and Java libraries providing Unicode and Globalization support for software applications. ICU is widely portable and gives applications the same results on all platforms and between C/C++ and Java software.

Regular Expression: ICU's regular expressions fully support Unicode while providing very competitive performance.

ICU Regular Expressions conform to Unicode Technical Standard #18 , Unicode Regular Expressions, level 1, and in addition include Default Word boundaries and Name Properties from level 2.

See:

@mmd-osm
Contributor

mmd-osm commented Jan 3, 2017

On IRC someone wondered about finding objects with mixed language names

It's demo time! http://overpass-turbo.eu/s/kZx

Commit: in feature/regexp_icu branch: master...mmd-osm:feature/regexp_icu

@mmd-osm
Contributor

mmd-osm commented Mar 7, 2017

BTW: Another interesting use case which is currently not supported: find nodes that have addr:* tags only:

[bbox:{{bbox}}][regexp:ICU];

(
   node[~"^addr:.*"~"."]; 
   - 
   node[~"^(?!addr:).*$"~"."];   
);  
out meta;
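The logic of that query is a set difference: nodes with at least one addr:* key, minus nodes with at least one key that does not start with addr:. The negative lookahead `(?!addr:)` is what the second filter relies on. A minimal Python sketch of the same per-node test follows; `has_tag` and `only_addr_tags` are hypothetical helper names, and tags are modelled as a plain dict.

```python
import re

ADDR_KEY = re.compile(r"^addr:.*")
# Negative lookahead: matches any key that does NOT start with "addr:".
NON_ADDR_KEY = re.compile(r"^(?!addr:).*$")
ANY_VALUE = re.compile(r".")


def has_tag(tags, key_re, value_re):
    """Mimic a [~"key regex"~"value regex"] filter on a dict of tags."""
    return any(key_re.search(k) and value_re.search(v) for k, v in tags.items())


def only_addr_tags(tags):
    """True if there is an addr:* tag and no tag outside the addr: namespace."""
    return has_tag(tags, ADDR_KEY, ANY_VALUE) and not has_tag(
        tags, NON_ADDR_KEY, ANY_VALUE
    )
```

For example, a node tagged only with addr:street and addr:housenumber passes, while one that also carries shop=bakery does not.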

@westnordost

A use case for proper Unicode support would be to easily find those streets where the name is not written in the country's script, i.e. in Myanmar, Thailand, Laos, China, Japan, Korea etc. etc. which would qualify at least as a warning in QA tools.
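As a rough offline illustration of such a check, the sketch below approximates a character's script by the first word of its Unicode character name, because Python's standard library does not expose the script property directly. The helper names `scripts_in` and `lacks_script` are made up here, and the name-prefix heuristic is an assumption that will misclassify some characters; a regex engine with script properties would be the cleaner tool.

```python
import unicodedata


def scripts_in(name):
    """Collect a crude script label for each alphabetic character in name."""
    found = set()
    for ch in name:
        if not ch.isalpha():
            continue  # skip spaces, digits, punctuation and combining marks
        char_name = unicodedata.name(ch, "")
        if char_name:
            # e.g. "THAI CHARACTER KO KAI" -> "THAI"
            found.add(char_name.split()[0])
    return found


def lacks_script(name, expected):
    """True if no character in name belongs to the expected script label."""
    return expected not in scripts_in(name)
```

A QA tool could then warn about a street in Thailand whenever `lacks_script(name, "THAI")` is true.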

@SomeoneElseOSM

For info re Myanmar there have been conflicts in OSM re which coding system to use (see https://en.wikipedia.org/wiki/Burmese_Wikipedia#Challenges in wikipedia for some info). This was about 14 months ago. I can't say what the current state of the OSM data is there; just mentioning it to provide a bit of background.

@westnordost

westnordost commented Jan 9, 2018

It's still an issue. I asked the Yangonese OpenStreetMap community about it a month ago. If I understood them correctly, Burmese smartphone vendors even modify Android so that it is compatible with this non-Unicode Zawgyi font, apparently making it incompatible with Unicode.

I am still hoping that the issue will resolve itself over time as they slowly migrate to Unicode. The alternative would be to add an automatic conversion from Zawgyi to Unicode to all the OSM editors (and to be honest, I don't think it can be detected automatically whether the input is Zawgyi).

@maxerickson
Author

Another use case here, analyzing unlikely name:en tags coming from Maps.me users: mapsme/omim#7262


5 participants