-
Notifications
You must be signed in to change notification settings - Fork 2.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix XSD regex for fqcn and tablename patterns #7324
Conversation
|
doctrine-mapping.xsd
Outdated
@@ -424,7 +424,7 @@ | |||
|
|||
<xs:simpleType name="fqcn" id="fqcn"> | |||
<xs:restriction base="xs:token"> | |||
<xs:pattern value="[a-zA-Z_u01-uff][a-zA-Z0-9_u01-uff]+" id="fqcn.pattern"> | |||
<xs:pattern value="([a-zA-Z_-][a-zA-Z0-9_-]*\\)*[a-zA-Z_-][a-zA-Z0-9_-]*" id="fqcn.pattern"> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Class names must follow ^[a-zA-Z_\x7f-\xff][a-zA-Z0-9_\x7f-\xff]+$
(https://github.com/Ocramius/ProxyManager/blob/99542a2c0fef18e0b460cbed93c5cde63cc2be91/tests/ProxyManagerTest/Generator/Util/UniqueIdentifierGeneratorTest.php#L42)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
First, they must follow ^[a-zA-Z_\x7f-\xff][a-zA-Z0-9_\x7f-\xff]*$
(with a *
, not a +
, on the character class for after the first char) according to http://php.net/manual/en/language.oop5.basic.php and verified at https://3v4l.org/klVjj.
=> Do you want to consider single-char names as invalid?
Second, that is a mono-byte regex, as explained at https://stackoverflow.com/a/5358944 (modulo the "accidentally") and verified at https://3v4l.org/XqJGf which shows that for UTF-8 encoded strings, the equivalent of
/^[a-zA-Z_\x7f-\xff][a-zA-Z0-9_\x7f-\xff]*$/
is
/^[a-zA-Z_\x7f-\x{10ffff}][a-zA-Z0-9_\x7f-\x{10ffff}]*$/u
.
Now, XML Regex is Unicode mode (and all the files are in UTF-8 encoding), so I translated \x7f-\x{10ffff}
to -
, because -ÿ
(translated from \x7f-\xff
) only matches Unicode characters in the range U+7F..U+FF, and I have indeed tested that it would then consider non-Latin1 characters as invalid (e.g. "The value 'Слон' is not accepted by the pattern ...").
=> Do you want to consider non-Latin1 (or even non-ASCII) names as invalid?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, indeed about *
(probably lucky that nobody hit the pattern before - the library always generates 2+ chars for that particular use-case).
As for the multi-byte, I'm inclined to say that supporting multibyte strings for class names is a no-go for a lot of stuff, because most of the get_class()
-based operations (not just in ORM, but in other libraries) simply don't operate under those circumstances, and you will get someone doing a nice unsafe substr()
somewhere.
I'd say that we should keep those out of support: that would be an opinionated stance (by me), not the "full support for everything".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So the -
part should become just 
(because -ÿ
would still accept e.g. the character é
= é
which is 2-bytes \xC3\xA9
in UTF-8), or even be removed entirely (do you accept the DEL
char)?
And I should remove the test file pattern-fqcn-valid-unicode.xml
, or alternatively rename it to pattern-fqcn-invalid-multibyte.xml
and keep only the "Éléphant"
case?
Edit: And then what about the tablename
pattern, for quoted and for unquoted identifiers?
* @dataProvider dataInvalidSchema | ||
* @group 6389 | ||
*/ | ||
public function testInvalidateXmlSchema($xmlMappingFile) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
: void
Add type declarations too (can remove them in docblock)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Only for the new functions, or also for testValidateXmlSchema
and dataValidSchema
?
*/ | ||
public function testInvalidateXmlSchema($xmlMappingFile) | ||
{ | ||
// Avoid PHP Warning errors for expected invalid mappings |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What errors did come up here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Asking because this part should probably be moved into the XML driver itself
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
$ ./vendor/bin/phpunit --group 6389
PHPUnit 6.5.9 by Sebastian Bergmann and contributors.
Runtime: PHP 7.1.20-1+0~20180725103315.2+stretch~1.gbpd5b650
Configuration: /z/phpunit.xml.dist
...........................................E 44 / 44 (100%)
Time: 314 ms, Memory: 54.00MB
There was 1 error:
1) Doctrine\Tests\ORM\Mapping\XmlMappingDriverTest::testInvalidateXmlSchema with data set #0 ('/z/tests/Doctrine/Tests/ORM/M...cm.xml')
DOMDocument::schemaValidate(): Element '{http://doctrine-project.org/schemas/orm/doctrine-mapping}class': This element is not expected. Expected is one of ( {http://doctrine-project.org/schemas/orm/doctrine-mapping}mapped-superclass, {http://doctrine-project.org/schemas/orm/doctrine-mapping}entity, {http://doctrine-project.org/schemas/orm/doctrine-mapping}embeddable, ##other{http://doctrine-project.org/schemas/orm/doctrine-mapping}* ).
/z/tests/Doctrine/Tests/ORM/Mapping/XmlMappingDriverTest.php:189
/z/tests/Doctrine/Tests/ORM/Mapping/XmlMappingDriverTest.php:170
ERRORS!
Tests: 44, Assertions: 43, Errors: 1.
Alternatively it works with @expectedException \PHPUnit\Framework\Error\Warning
(or do you prefer $this->expectException(\PHPUnit\Framework\Error\Warning::class);
?) and reducing the body to the simple call
$this->doValidateXmlSchema($xmlMappingFile);
but that's depending both on DOMDocument::schemaValidate()
emitting an E_WARNING
and on PHPUnit translating the Warning to an exception...
This is only for testing XSD validation (actually testing that the schema definition is correct), how would you move it into the XML driver (without merging #6728)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@greg0ire ping ^
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@guilliamxavier You could reuse the libxml_use_internal_errors
yadda yadda trick from https://github.com/doctrine/doctrine2/pull/6728/files#diff-f96b6283e2653400c7e5fbb845f50748R867 here.
@Ocramius not 100% sure why you ping me, if it is to know the status of #6728, it is blocked by #7199, which is blocked until further notice
} | ||
} | ||
|
||
private function doValidateXmlSchema($xmlMappingFile) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Type declarations
return self::doDataValidSchema(true); | ||
} | ||
|
||
static public function dataInvalidSchema() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This sort of API is quite confusing: what's going on? Why is doDataValidSchema()
used also for invalid data?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it be enough to rename it, or do you prefer that I factorize the common parts differently (like the test functions)?
BTW, would it be a BC-break to rename an existing public test function?
E.g. testValidateXmlSchema
to, say, testValidateXmlSchemaWithValidMapping
(so that testInvalidateXmlSchema
could be testValidateXmlSchemaWithInvalidMapping
)? (maybe you have better ideas?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Having completely separate methods, even if some code is duplicated, is preferable here
doctrine-mapping.xsd
Outdated
@@ -202,7 +202,7 @@ | |||
|
|||
<xs:simpleType name="tablename" id="tablename"> | |||
<xs:restriction base="xs:token"> | |||
<xs:pattern value="[a-zA-Z_u01-uff.]+" id="tablename.pattern"> | |||
<xs:pattern value="([^`.][^.]*\.)?([a-zA-Z0-9_$#€-]+|`.+`)" id="tablename.pattern"> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note: the proposed part for an unquoted table name, [a-zA-Z0-9_$#€-]+
, is (sort of) the union of formats allowed by different vendors (mainly MySQL, PostgreSQL, Oracle).
Would you on the contrary want the intersection, e.g. [a-zA-Z][a-zA-Z0-9_]*
(very restrictive)?
Do you also want that multibyte characters be considered as invalid (like for fqcn
), even in quoted table names?
And maybe also limit the maximum length? (to e.g. 128 or even 30 chars?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know if we perform binary-unsafe operations on table names in the codebase (specifically: naming strategy, quoting), so being restrictive is OK.
Maximum length is to be avoided, because it is highly platform-dependent, and I've already seen DBs with horrendous generated table names in OracleDB (ascii-only): those would be surely over 64 chars. Unsure if any of the supported platforms goes over 128 chars though.
@mikeSimonson can you take over here? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So my point of view.
We should remove the mutlibyte char entirely I think.
I am not sure a maximum length is necessary as it's probably super vendor/engine dependent.
Rename the test with valid unicode to invalid unicode.
Do we keep the "é" as valid is an open question.
You can add the type delcaration in your function in one commit and if you fell like it you can add them everywhere else in another commit or preferably PR.
#6728 should be finished and merged so that you don't need that logic in the tests.
Thanks for all the tests
@Ocramius Thank you very much for your review and for your well-explained replies. @mikeSimonson Thank you for coming in. About two concerns:
Logically, to "remove the multibyte char entirely" implies that "é" be invalid (in UTF-8). Besides, the current patterns (which, although "semantically wrong", at least accept practically every valid "real-world" value) do not allow any non-ASCII chars. So I think actually the most conservative option is to stay ASCII-only (it will still be possible to add more characters later if wanted).
The referenced PR is targeting Now, I think that the schema definition is mainly for IDE code-completion, and that validation of XML mappings against the XSD is not enforced (yet), so maybe this could wait until 2.7 [or 3.0] (although I would prefer not)... |
doctrine-mapping.xsd
Outdated
@@ -202,7 +202,7 @@ | |||
|
|||
<xs:simpleType name="tablename" id="tablename"> | |||
<xs:restriction base="xs:token"> | |||
<xs:pattern value="[a-zA-Z_u01-uff.]+" id="tablename.pattern"> | |||
<xs:pattern value="([ --[\.]]+\.)?([a-zA-Z][a-zA-Z0-9_]*|`[ -]+`)" id="tablename.pattern"> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note: the backslash before the dot in the subtracted character class is "redundant", but without it PhpStorm flags a syntax error "Unclosed character class"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wait, wouldn't "([a-zA-Z][a-zA-Z0-9_]*\.)?[a-zA-Z][a-zA-Z0-9_]*|([ --[`\.]][ --[\.]]*\.)?`[ -]+`"
be better? (the schema part cannot be backquoted, and will be SQL-quoted only if the table part is backquoted)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have opened a separate #7329 about the peculiar schema part quoting behavior. But for now this PR should be fine as it is.
@Ocramius & @mikeSimonson I have reworked the commits according to your reviews and updated the PR, is it OK now? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Feels over-engineered at first, but valid. Would use more verbose variable names next time.
Sorry, only now do I understand that this is blocked by #6728 because merging both PRs would lead to logic duplication in the XML driver and in the test class ( I also agree that this feels over-engineered (and as for the too-terse variable names, I guess I let myself influenced by But since #6728 is itself blocked by #7199 which in turn is "blocked until further notice", and this PR maybe mixes too many things, I am going to extract independent parts into separate (and hopefully immediately-applicable) PRs 🙂 |
@@ -153,7 +153,7 @@ public function testInvalidMappingFileException() | |||
public function testValidateXmlSchema($xmlMappingFile) | |||
{ | |||
$xsdSchemaFile = __DIR__ . '/../../../../../doctrine-mapping.xsd'; | |||
$dom = new \DOMDocument('UTF-8'); | |||
$dom = new \DOMDocument('1.0', 'UTF-8'); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually better here: just new \DOMDocument();
(done in #7345)
], | ||
]; | ||
|
||
// Convert basic sprintf-style formats to PCRE patterns |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe better: use assertStringMatchesFormat()
instead of assertRegeExp()
at line 180 (and e.g. %S
, %a
or %A
rather than %s
in the format)
Follow-up to #6389's discussion. Mainly: "
[u01-uff]
" is just wrong (in any regex flavor, BTW).Recap of the currently incorrect cases (and inconsistencies):
tablename
:`foo-bar`
[does not allow the hyphen even if quoted] (while accepting`foo;bar`
)foo;bar
[allows the semicolon although unquoted] (while rejectingfoo-bar
)fqcn
:C
[requires at least 2 chars]Foo;Bar
[allows the semicolon] (while rejectingFoo-Bar
)3lephant
[allows digits for the first char]Note: the pattern for
fqcn
is rather straightforward, but not so fortablename
:myschema.mytable
", wheremytable
may be quoted whereasmyschema
may not but will become ifmytable
is (here -> here, then here -> there)...Also, this adds many small XML test files (mostly because invalid cases have to be tested separately [and I split the valid cases a bit], but not all are strictly necessary)...
(But still, I think that at least those "
u01-uff
" should be fixed.)Found related: