Regex engine: allow unicode character ranges #332

maxerickson · 2016-10-24T15:09:43Z

On IRC someone wondered about finding objects with mixed language names and I thought the very simple approach might work:

[out:json][timeout:25];
way["highway"]["name"~"[a-z]",i]["name"~"[\\ue400-\\u9fff]"]({{bbox}});
out center;

The answer is:

Error: line 2: static error: Invalid regular expression: "[-鿿]"

So I guess the regex implementation doesn't support such character ranges?
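For comparison, here is a minimal Python sketch of the check the query is aiming for: a name that contains both a Latin letter and a CJK ideograph. The range U+4E00 to U+9FFF (CJK Unified Ideographs) and the helper names are assumptions for illustration, not part of Overpass. Note that 0xE400 is greater than 0x9FFF, so the range as typed in the query runs backwards. Python's `re` rejects a descending range like that; whether this also explains the Overpass error is not confirmed here.

```python
import re

# Any Latin letter, case-insensitively (mirrors ["name"~"[a-z]",i]).
LATIN = re.compile(r"[a-z]", re.IGNORECASE)

# Assumption: U+4E00..U+9FFF (CJK Unified Ideographs) is the intended block.
CJK = re.compile("[\u4e00-\u9fff]")


def is_mixed_name(name: str) -> bool:
    """True if the name has at least one Latin letter and one CJK ideograph."""
    return bool(LATIN.search(name)) and bool(CJK.search(name))


def range_is_valid(pattern: str) -> bool:
    """True if Python's re module accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False


if __name__ == "__main__":
    print(is_mixed_name("中山路 Zhongshan Rd"))
    print(range_is_valid("[\ue400-\u9fff]"))  # descending range, as typed in the query
    print(range_is_valid("[\u4e00-\u9fff]"))  # ascending range
```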

maxerickson · 2016-10-24T20:36:35Z

A reasonable workaround is to (roughly) invert the ASCII character class:

[out:json][timeout:25];
way["highway"]["name"~"[a-z]",i]["name"~"[^a-z0-9 \.-]",i]({{bbox}});
out center;

So I guess if the issue is in the library, doing nothing is a very reasonable resolution.
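The same two-filter idea can be sketched in Python to see what "roughly" means in practice. The function name is hypothetical, and the character classes are copied from the query above. Any character outside the small ASCII set counts as "foreign", so accented Latin letters trigger a match too.

```python
import re

# First filter: the name contains a Latin letter.
HAS_LATIN = re.compile(r"[a-z]", re.IGNORECASE)

# Second filter: the name contains something outside letters, digits,
# space, dot and hyphen (the inverted ASCII class from the workaround).
HAS_OTHER = re.compile(r"[^a-z0-9 \.-]", re.IGNORECASE)


def looks_mixed(name: str) -> bool:
    """Approximate 'mixed language name' test using only ASCII classes."""
    return bool(HAS_LATIN.search(name)) and bool(HAS_OTHER.search(name))


if __name__ == "__main__":
    for name in ["Main St.", "中山路 Zhongshan Rd", "Café Road", "中山路"]:
        print(name, looks_mixed(name))
```

Plain ASCII names and names without any Latin letter are rejected, while "Café Road" is accepted because "é" falls outside the ASCII class. Punctuation such as apostrophes would be flagged in the same way.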

drolbr · 2016-10-27T16:04:23Z

The service does indeed delegate the regular expression filtering to the POSIX library. It uses a UTF-8 locale if available. Should we close the issue or mark it as invalid or do something else?
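To illustrate why the locale matters, the sketch below compares character-wise and byte-wise matching of the same range. It uses Python's `re` module, not the POSIX library, so it only demonstrates the general pitfall. The range U+4E00 to U+9FFF and the function names are assumptions for illustration. When the pattern is treated as raw UTF-8 bytes, the bracket expression becomes a set of single bytes plus the byte range 0x80 to 0xE9, which matches most non-ASCII text.

```python
import re

LOW, HIGH = "\u4e00", "\u9fff"  # assumed CJK range, for illustration only

# Character-wise: one range between two code points.
UNICODE_CLASS = re.compile(f"[{LOW}-{HIGH}]")

# Byte-wise: the same pattern text, compiled over its UTF-8 bytes.
# b"[\xe4\xb8\x80-\xe9\xbf\xbf]" is read as the bytes E4, B8, BF
# plus the byte range 80..E9.
BYTE_CLASS = re.compile(b"[" + LOW.encode("utf-8") + b"-" + HIGH.encode("utf-8") + b"]")


def matches_as_text(name: str) -> bool:
    return UNICODE_CLASS.search(name) is not None


def matches_as_bytes(name: str) -> bool:
    return BYTE_CLASS.search(name.encode("utf-8")) is not None


if __name__ == "__main__":
    for name in ["中山路", "Café", "Cafe"]:
        print(name, matches_as_text(name), matches_as_bytes(name))
```

In this sketch "Café" is not matched character-wise but is matched byte-wise, because both UTF-8 bytes of "é" (C3 A9) fall inside the byte range.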

drolbr · 2016-10-31T06:26:48Z

It might be worth having our own regex parser. This would allow us to correct behaviour like the above. It would also solve some potential security issues. Hence, I am keeping this issue open as an "enhancement" in that direction.

mmd-osm · 2017-01-02T08:43:54Z

I don't see much reason for a home-grown solution, which would have to implement http://www.unicode.org/reports/tr18/ in the end. Widely adopted solutions are already available for this particular requirement:

ICU is a mature, widely used set of C/C++ and Java libraries providing Unicode and Globalization support for software applications. ICU is widely portable and gives applications the same results on all platforms and between C/C++ and Java software.

Regular Expression: ICU's regular expressions fully support Unicode while providing very competitive performance.

ICU Regular Expressions conform to Unicode Technical Standard #18 , Unicode Regular Expressions, level 1, and in addition include Default Word boundaries and Name Properties from level 2.

See:

mmd-osm · 2017-01-03T12:38:28Z

> On IRC someone wondered about finding objects with mixed language names

It's demo time! http://overpass-turbo.eu/s/kZx

Commit: see the feature/regexp_icu branch (master...mmd-osm:feature/regexp_icu)

mmd-osm · 2017-03-07T14:51:02Z

BTW: Another interesting use case which is currently not supported: find nodes with addr:* tags only:

[bbox:{{bbox}}][regexp:ICU];

(
   node[~"^addr:.*"~"."]; 
   - 
   node[~"^(?!addr:).*$"~"."];   
);  
out meta;
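The query above relies on a negative lookahead, `(?!addr:)`, which POSIX extended regular expressions do not provide. As a rough sketch of the intended logic, the Python below applies the same two key patterns to a plain dictionary of tags. The function names and the dictionary representation are assumptions for illustration. It mirrors the set difference: nodes with some addr:* key, minus nodes with any key that does not start with addr:.

```python
import re

ADDR_KEY = re.compile(r"^addr:.*")
NON_ADDR_KEY = re.compile(r"^(?!addr:).*$")  # negative lookahead
ANY_VALUE = re.compile(r".")                 # value regex "." from the query


def has_tag_matching(tags: dict, key_pattern: re.Pattern) -> bool:
    """True if some key matches key_pattern and its value is non-empty."""
    return any(
        key_pattern.search(key) and ANY_VALUE.search(value)
        for key, value in tags.items()
    )


def has_only_addr_tags(tags: dict) -> bool:
    """Nodes with an addr:* tag, minus nodes with any non-addr tag."""
    return has_tag_matching(tags, ADDR_KEY) and not has_tag_matching(tags, NON_ADDR_KEY)


if __name__ == "__main__":
    print(has_only_addr_tags({"addr:housenumber": "12", "addr:street": "Main St"}))
    print(has_only_addr_tags({"addr:street": "Main St", "shop": "bakery"}))
```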

westnordost · 2018-01-09T13:58:46Z

A use case for proper Unicode support would be to easily find those streets where the name is not written in the country's script, e.g. in Myanmar, Thailand, Laos, China, Japan, Korea etc., which would qualify at least as a warning in QA tools.
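A QA check of this kind could be sketched in Python as follows. Python's `re` has no script properties, so this heuristic looks at Unicode character names instead (for example, names starting with "THAI" or "MYANMAR"). The function names and the prefix approach are assumptions for illustration and are cruder than real script properties.

```python
import unicodedata


def has_script(name: str, script_prefix: str) -> bool:
    """True if any character's Unicode name starts with script_prefix.

    Heuristic only: relies on names such as "THAI CHARACTER KO KAI"
    or "CJK UNIFIED IDEOGRAPH-4E2D".
    """
    return any(
        unicodedata.name(ch, "").startswith(script_prefix) for ch in name
    )


def names_without_script(names, script_prefix: str):
    """Return the names that contain no character of the expected script."""
    return [n for n in names if not has_script(n, script_prefix)]


if __name__ == "__main__":
    streets = ["ถนนสุขุมวิท", "Sukhumvit Road"]
    print(names_without_script(streets, "THAI"))
```

Names returned by `names_without_script` would be the candidates for a warning. A name that mixes scripts passes this check, since a single character of the expected script is enough.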

SomeoneElseOSM · 2018-01-09T14:48:24Z

For info re Myanmar there have been conflicts in OSM re which coding system to use (see https://en.wikipedia.org/wiki/Burmese_Wikipedia#Challenges on Wikipedia for some info). This was about 14 months ago. I can't say what the current state of the OSM data is there; just mentioning it to provide a bit of background.

westnordost · 2018-01-09T18:37:18Z

It's still an issue. I asked the Yangonese OpenStreetMap community about it a month ago. If I understood them correctly, Burmese smartphone vendors even modify Android so that it is compatible with this non-Unicode Zawgyi font, apparently making it incompatible with Unicode.

I am still hoping that the issue will solve itself over time as they slowly migrate to Unicode. The alternative would be to add into all the editors for OSM an automatic conversion from Zawgyi to Unicode (and to be honest, I don't think it can be detected automatically if the input is Zawgyi).

maxerickson · 2018-01-09T19:45:55Z

Another use case here, analyzing unlikely name:en tags coming from Maps.me users: mapsme/omim#7262

drolbr changed the title ~~Unicode character ranges.~~ Regex engine: allow unicode character ranges Oct 31, 2016

drolbr added the enhancement label Oct 31, 2016

mmd-osm mentioned this issue Jul 22, 2017

Could you support PCRE regular expression ? #146

Open

tyrasd mentioned this issue Apr 23, 2018

Unicode support in regualar expression is broken? tyrasd/overpass-turbo#373

Closed

mmd-osm mentioned this issue Apr 15, 2023

Query for unicode range \u036E-\u036F returns non-matching results #688

Open
