XSD regular expression flavor #255

Closed
atomczak opened this issue Mar 6, 2024 · 5 comments
Labels: audit tool, documentation

@atomczak
Contributor

atomczak commented Mar 6, 2024

<xs:pattern value="^(0*(\.\d+)|[1-9]\d*(\.\d+)?)$"/>

I found this in the sample files, and it doesn't look like a correct regular expression in the XSD pattern flavor. XSD patterns always match against the whole value, so the ^...$ anchors are not needed (or even supported).

I'm not sure about the shorthand \d. I think it is supported by XSD and matches all Unicode digits: 0-9¹¾六௰Ⅹ೬Дに... but it would be good if someone could confirm.

@atomczak added the documentation and Question labels on Mar 6, 2024
@CBenghi
Contributor

CBenghi commented Mar 6, 2024

From memory, I think I've probably removed that regex in the current Development branch because it conflicted with the datatype. Anyway:

IDS/Development/IDS_oma.ids

Lines 149 to 161 in 6d71cdf

<ids:property cardinality="optional" dataType="IFCLENGTHMEASURE" instructions="Derived length, for example length of the corridor.">
  <ids:propertySet>
    <ids:simpleValue>Qto_SpaceBaseQuantities</ids:simpleValue>
  </ids:propertySet>
  <ids:baseName>
    <ids:simpleValue>NominalLength</ids:simpleValue>
  </ids:baseName>
  <ids:value>
    <xs:restriction base="xs:string">
      <xs:pattern value="^(0*(\.\d+)|[1-9]\d*(\.\d+)?)$"/>
    </xs:restriction>
  </ids:value>
</ids:property>

My view is that IFCLENGTHMEASURE requires xs:double in the base type, which in turn disallows the pattern node.

Your point is of course still valid with respect to the need for documentation on the regex flavour. My hope is to enforce it appropriately via the audit tool.

@CBenghi added this to the 1.0 milestone on Mar 6, 2024
@janbrouwer
Contributor

I think I made that regex as an experiment to see whether it is possible to validate a positive length measure. I believe the regex validation site mentioned in the IDS docs accepted it, but there are probably better ways to do this.

@gverduci

gverduci commented Mar 7, 2024

> I'm not sure about the shorthand \d. I think it is supported by XSD and matches all Unicode digits: 0-9¹¾六௰Ⅹ೬Дに... but it would be good if someone could confirm.

@atomczak I think the shorthand \d is valid: this link shows all supported multi-character escapes:

https://www.w3.org/TR/2012/REC-xmlschema11-2-20120405/datatypes.html#cces-mce

and it matches only \p{Nd} (general category "Number, decimal digit"; see the General Category property: https://www.unicode.org/reports/tr18/#General_Category_Property).

Using the unicode database it is possible to find all characters in this set:

https://www.unicode.org/Public/UCD/latest/ucd/UnicodeData.txt
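
As a rough cross-check of the point above, here is a small Python sketch that looks up the general category of the characters from the original question. It assumes that the Unicode tables bundled with Python's `unicodedata` module are close enough to whatever Unicode version a given XSD processor uses; `is_xsd_digit` and `SAMPLES` are hypothetical names for illustration, not part of any IDS tooling.

```python
import unicodedata


def is_xsd_digit(ch: str) -> bool:
    # XSD's \d is defined as \p{Nd}, so test the general category directly.
    return unicodedata.category(ch) == "Nd"


# Characters quoted in the original question.
SAMPLES = "09¹¾六௰Ⅹ೬Дに"

for ch in SAMPLES:
    # Print code point, general category, and whether \d would accept it.
    print(f"U+{ord(ch):04X} {unicodedata.category(ch)} {is_xsd_digit(ch)}")
```

Under that assumption, only the characters reported as Nd would be accepted by \d; superscripts, fractions, and letter-like numerals fall into other categories such as No, Nl, or Lo.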

@aothms

aothms commented Mar 7, 2024

Great suggestions @gverduci, this indeed confirms @atomczak's suspicion:

$ grep ';Nd;' UnicodeData.txt | cut -d\; -f1 | xargs -I{} printf \\U000{} 2> /dev/null
𐒠𐒡𐒢𐒣𐒤𐒥𐒦𐒧𐒨𐒩𐴰𐴱𐴲𐴳𐴴𐴵𐴶𐴷𐴸𐴹𑁦𑁧𑁨𑁩𑁪𑁫𑁬𑁭𑁮𑁯𑃰𑃱𑃲𑃳𑃴𑃵𑃶𑃷𑃸𑃹𑄶𑄷𑄸𑄹𑄺𑄻𑄼𑄽𑄾𑄿𑇐𑇑𑇒𑇓𑇔𑇕𑇖𑇗𑇘𑇙𑋰𑋱𑋲𑋳𑋴𑋵𑋶𑋷𑋸𑋹...

(these are just a couple of them, I couldn't quickly figure out how to generically get the hex formatted code points to printable characters)
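
For the generic conversion from hex code points to printable characters, one option is a few lines of Python instead of `printf`. This is only a sketch: `decimal_digits` and `SAMPLE_LINES` are hypothetical names, and the sample lines are shortened to the first three fields of the UnicodeData.txt format (code point; name; general category).

```python
# A few lines in the UnicodeData.txt layout, truncated to three fields
# for illustration (the real file has more fields per line).
SAMPLE_LINES = [
    "0030;DIGIT ZERO;Nd",
    "00BE;VULGAR FRACTION THREE QUARTERS;No",
    "0661;ARABIC-INDIC DIGIT ONE;Nd",
    "104A0;OSMANYA DIGIT ZERO;Nd",
]


def decimal_digits(lines):
    """Yield the character for every line whose general category is Nd."""
    for line in lines:
        fields = line.strip().split(";")
        if len(fields) >= 3 and fields[2] == "Nd":
            # int(..., 16) handles 4-, 5-, and 6-digit hex code points alike.
            yield chr(int(fields[0], 16))


print("".join(decimal_digits(SAMPLE_LINES)))

# With a local copy of the real file (the path is an assumption):
# with open("UnicodeData.txt", encoding="utf-8") as f:
#     print("".join(decimal_digits(f)))
```

A likely reason the shell output above starts in the supplementary planes is that prefixing `000` produces an eight-digit `\U` escape only for five-digit code points, so the four-digit ones (including 0-9) are silently dropped.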

@atomczak
Contributor Author

atomczak commented Mar 7, 2024

Thanks all, I mainly wanted to make sure I wasn't mistaken. And yes, this example has already been removed from the latest Dev branch.

> My hope is to enforce it appropriately via the audit tool.

I see a potential problem with auditing regex: ^ABC$ is not an invalid pattern, but it matches only the literal string that starts with a caret and ends with a dollar sign, whereas the user probably just wanted to allow the value 'ABC'. So it's not an error, but a soft warning :)
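
A minimal sketch of what such a soft warning could look like, in Python. This is not the audit tool's actual implementation; `anchor_warnings` is a hypothetical helper, and it only looks at the very first and very last character of the pattern, so anchors nested inside groups would go unnoticed.

```python
def anchor_warnings(pattern: str) -> list[str]:
    """Return soft warnings for a leading ^ or an unescaped trailing $."""
    warnings = []
    if pattern.startswith("^"):
        warnings.append("pattern starts with '^', which XSD treats as a literal caret")
    if pattern.endswith("$"):
        body = pattern[:-1]
        # Count the backslashes directly before the final '$'; an odd count
        # means the '$' is explicitly escaped and therefore intentional.
        backslashes = len(body) - len(body.rstrip("\\"))
        if backslashes % 2 == 0:
            warnings.append("pattern ends with '$', which XSD treats as a literal dollar sign")
    return warnings


for candidate in ["^ABC$", "ABC", r"ABC\$"]:
    print(candidate, anchor_warnings(candidate))
```

Returning a list of messages rather than raising keeps the pattern valid while still letting the tool point out a probable mistake.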

> Using the unicode database it is possible to find all characters in this set

Thanks! If I read this right, \d in XSD matches several hundred digit characters across many scripts. While this is fine for most cases, for my purpose [0-9] serves better, as I only want those 10.
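
To illustrate the difference, here is a rough Python approximation. Python's `re` is not the XSD flavor, but for `str` patterns its `\d` also matches Unicode decimal digits, and `re.fullmatch` mimics XSD's implicit whole-value matching. Both pattern constants are hypothetical rewrites of the pattern from this issue with the ^...$ anchors dropped.

```python
import re

# Hypothetical rewrites of the pattern from this issue, anchors removed.
UNICODE_DIGITS = r"0*(\.\d+)|[1-9]\d*(\.\d+)?"
ASCII_DIGITS = r"0*(\.[0-9]+)|[1-9][0-9]*(\.[0-9]+)?"


def matches_whole_value(pattern: str, value: str) -> bool:
    # re.fullmatch approximates XSD's rule that a pattern must match the whole value.
    return re.fullmatch(pattern, value) is not None


# The last value ends in U+0662 ARABIC-INDIC DIGIT TWO.
for value in ["1.5", "0.25", "0", "1\u0662"]:
    print(
        repr(value),
        matches_whole_value(UNICODE_DIGITS, value),
        matches_whole_value(ASCII_DIGITS, value),
    )
```

In this approximation the two patterns agree on ASCII input and differ only on values containing non-ASCII digits, which is the case [0-9] is meant to exclude.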

@berlotti closed this as completed on Jun 3, 2024