Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix XSD regex for fqcn and tablename patterns #7324

Closed
wants to merge 9 commits into from
Closed

Fix XSD regex for fqcn and tablename patterns #7324

wants to merge 9 commits into from

Conversation

guilliamxavier
Copy link
Contributor

@guilliamxavier guilliamxavier commented Jul 26, 2018

Follow-up to #6389's discussion. Mainly: "[u01-uff]" is just wrong (in any regex flavor, BTW).

Recap of the currently incorrect cases (and inconsistencies):

  • tablename:
    • rejects `foo-bar` [does not allow the hyphen even if quoted] (while accepting `foo;bar`)
    • accepts foo;bar [allows the semicolon although unquoted] (while rejecting foo-bar)
  • fqcn:
    • rejects C [requires at least 2 chars]
    • accepts Foo;Bar [allows the semicolon] (while rejecting Foo-Bar)
    • accepts 3lephant [allows digits for the first char]

Note: the pattern for fqcn is rather straightforward, but not so for tablename:

  • different SQL database vendors allow different extra characters for [un]quoted identifiers...
  • it seems that Doctrine supports the form "myschema.mytable", where mytable may be quoted whereas myschema may not but will become if mytable is (here -> here, then here -> there)...
  • cannot be merged into master as-is, because Doctrine 3.x will auto-quote everything...

Also, this adds many small XML test files (mostly because invalid cases have to be tested separately [and I split the valid cases a bit], but not all are strictly necessary)...

(But still, I think that at least those "u01-uff" should be fixed.)

Found related:

@Ocramius
Copy link
Member

Ocramius commented Jul 26, 2018

@guilliamxavier 2.5 is closed for bugfixing - if this applies to 2.6 or newer we can likely merge it, but 2.5 only receives security fixes. not sure why I read 2.5 - sorry, must be the heat.

@Ocramius Ocramius added the Bug label Jul 26, 2018
@@ -424,7 +424,7 @@

<xs:simpleType name="fqcn" id="fqcn">
<xs:restriction base="xs:token">
<xs:pattern value="[a-zA-Z_u01-uff][a-zA-Z0-9_u01-uff]+" id="fqcn.pattern">
<xs:pattern value="([a-zA-Z_&#x7F;-&#x10FFFF;][a-zA-Z0-9_&#x7F;-&#x10FFFF;]*\\)*[a-zA-Z_&#x7F;-&#x10FFFF;][a-zA-Z0-9_&#x7F;-&#x10FFFF;]*" id="fqcn.pattern">
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First, they must follow ^[a-zA-Z_\x7f-\xff][a-zA-Z0-9_\x7f-\xff]*$ (with a *, not a +, on the character class for after the first char) according to http://php.net/manual/en/language.oop5.basic.php and verified at https://3v4l.org/klVjj.
=> Do you want to consider single-char names as invalid?

Second, that is a mono-byte regex, as explained at https://stackoverflow.com/a/5358944 (modulo the "accidentally") and verified at https://3v4l.org/XqJGf which shows that for UTF-8 encoded strings, the equivalent of
/^[a-zA-Z_\x7f-\xff][a-zA-Z0-9_\x7f-\xff]*$/ is
/^[a-zA-Z_\x7f-\x{10ffff}][a-zA-Z0-9_\x7f-\x{10ffff}]*$/u.
Now, XML Regex is Unicode mode (and all the files are in UTF-8 encoding), so I translated \x7f-\x{10ffff} to &#x7F;-&#x10FFFF;, because &#x7F;-&#xFF; (translated from \x7f-\xff) only matches Unicode characters in the range U+7F..U+FF, and I have indeed tested that it would then consider non-Latin1 characters as invalid (e.g. "The value 'Слон' is not accepted by the pattern ...").
=> Do you want to consider non-Latin1 (or even non-ASCII) names as invalid?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, indeed about * (probably lucky that nobody hit the pattern before - the library always generates 2+ chars for that particular use-case).

As for the multi-byte, I'm inclined to say that supporting multibyte strings for class names is a no-go for a lot of stuff, because most of the get_class()-based operations (not just in ORM, but in other libraries) simply don't operate under those circumstances, and you will get someone doing a nice unsafe substr() somewhere.

I'd say that we should keep those out of support: that would be an opinionated stance (by me), not the "full support for everything".

Copy link
Contributor Author

@guilliamxavier guilliamxavier Jul 26, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So the &#x7F;-&#x10FFFF; part should become just &#x7F; (because &#x7F;-&#xFF; would still accept e.g. the character é = &#xE9; which is 2-bytes \xC3\xA9 in UTF-8), or even be removed entirely (do you accept the DEL char)?
And I should remove the test file pattern-fqcn-valid-unicode.xml, or alternatively rename it to pattern-fqcn-invalid-multibyte.xml and keep only the "Éléphant" case?

Edit: And then what about the tablename pattern, for quoted and for unquoted identifiers?

* @dataProvider dataInvalidSchema
* @group 6389
*/
public function testInvalidateXmlSchema($xmlMappingFile)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

: void

Add type declarations too (can remove them in docblock)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only for the new functions, or also for testValidateXmlSchema and dataValidSchema?

*/
public function testInvalidateXmlSchema($xmlMappingFile)
{
// Avoid PHP Warning errors for expected invalid mappings
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What errors did come up here?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Asking because this part should probably be moved into the XML driver itself

Copy link
Contributor Author

@guilliamxavier guilliamxavier Jul 26, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

$ ./vendor/bin/phpunit --group 6389
PHPUnit 6.5.9 by Sebastian Bergmann and contributors.

Runtime:       PHP 7.1.20-1+0~20180725103315.2+stretch~1.gbpd5b650
Configuration: /z/phpunit.xml.dist

...........................................E                      44 / 44 (100%)

Time: 314 ms, Memory: 54.00MB

There was 1 error:

1) Doctrine\Tests\ORM\Mapping\XmlMappingDriverTest::testInvalidateXmlSchema with data set #0 ('/z/tests/Doctrine/Tests/ORM/M...cm.xml')
DOMDocument::schemaValidate(): Element '{http://doctrine-project.org/schemas/orm/doctrine-mapping}class': This element is not expected. Expected is one of ( {http://doctrine-project.org/schemas/orm/doctrine-mapping}mapped-superclass, {http://doctrine-project.org/schemas/orm/doctrine-mapping}entity, {http://doctrine-project.org/schemas/orm/doctrine-mapping}embeddable, ##other{http://doctrine-project.org/schemas/orm/doctrine-mapping}* ).

/z/tests/Doctrine/Tests/ORM/Mapping/XmlMappingDriverTest.php:189
/z/tests/Doctrine/Tests/ORM/Mapping/XmlMappingDriverTest.php:170

ERRORS!
Tests: 44, Assertions: 43, Errors: 1.

Alternatively it works with @expectedException \PHPUnit\Framework\Error\Warning (or do you prefer $this->expectException(\PHPUnit\Framework\Error\Warning::class);?) and reducing the body to the simple call
$this->doValidateXmlSchema($xmlMappingFile);
but that's depending both on DOMDocument::schemaValidate() emitting an E_WARNING and on PHPUnit translating the Warning to an exception...

This is only for testing XSD validation (actually testing that the schema definition is correct), how would you move it into the XML driver (without merging #6728)?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@greg0ire ping ^

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@guilliamxavier You could reuse the libxml_use_internal_errors yadda yadda trick from https://github.com/doctrine/doctrine2/pull/6728/files#diff-f96b6283e2653400c7e5fbb845f50748R867 here.

@Ocramius not 100% sure why you ping me, if it is to know the status of #6728, it is blocked by #7199, which is blocked until further notice

}
}

private function doValidateXmlSchema($xmlMappingFile)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Type declarations

return self::doDataValidSchema(true);
}

static public function dataInvalidSchema()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This sort of API is quite confusing: what's going on? Why is doDataValidSchema() used also for invalid data?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be enough to rename it, or do you prefer that I factorize the common parts differently (like the test functions)?

BTW, would it be a BC-break to rename an existing public test function?
E.g. testValidateXmlSchema to, say, testValidateXmlSchemaWithValidMapping (so that testInvalidateXmlSchema could be testValidateXmlSchemaWithInvalidMapping)? (maybe you have better ideas?)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Having completely separate methods, even if some code is duplicated, is preferable here

@@ -202,7 +202,7 @@

<xs:simpleType name="tablename" id="tablename">
<xs:restriction base="xs:token">
<xs:pattern value="[a-zA-Z_u01-uff.]+" id="tablename.pattern">
<xs:pattern value="([^`.][^.]*\.)?([a-zA-Z0-9_$#&#x80;-&#x10FFFF;]+|`.+`)" id="tablename.pattern">
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: the proposed part for an unquoted table name, [a-zA-Z0-9_$#&#x80;-&#x10FFFF;]+, is (sort of) the union of formats allowed by different vendors (mainly MySQL, PostgreSQL, Oracle).
Would you on the contrary want the intersection, e.g. [a-zA-Z][a-zA-Z0-9_]* (very restrictive)?

Do you also want that multibyte characters be considered as invalid (like for fqcn), even in quoted table names?

And maybe also limit the maximum length? (to e.g. 128 or even 30 chars?)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know if we perform binary-unsafe operations on table names in the codebase (specifically: naming strategy, quoting), so being restrictive is OK.

Maximum length is to be avoided, because it is highly platform-dependent, and I've already seen DBs with horrendous generated table names in OracleDB (ascii-only): those would be surely over 64 chars. Unsure if any of the supported platforms goes over 128 chars though.

@Ocramius
Copy link
Member

@mikeSimonson can you take over here?

Copy link
Contributor

@mikeSimonson mikeSimonson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So my point of view.

We should remove the mutlibyte char entirely I think.

I am not sure a maximum length is necessary as it's probably super vendor/engine dependent.

Rename the test with valid unicode to invalid unicode.

Do we keep the "é" as valid is an open question.

You can add the type delcaration in your function in one commit and if you fell like it you can add them everywhere else in another commit or preferably PR.

#6728 should be finished and merged so that you don't need that logic in the tests.

Thanks for all the tests

@guilliamxavier
Copy link
Contributor Author

guilliamxavier commented Jul 27, 2018

@Ocramius Thank you very much for your review and for your well-explained replies.

@mikeSimonson Thank you for coming in. About two concerns:

Do we keep the "é" as valid is an open question.

Logically, to "remove the multibyte char entirely" implies that "é" be invalid (in UTF-8). Besides, the current patterns (which, although "semantically wrong", at least accept practically every valid "real-world" value) do not allow any non-ASCII chars. So I think actually the most conservative option is to stay ASCII-only (it will still be possible to add more characters later if wanted).

#6728 should be finished and merged so that you don't need that logic in the tests.

The referenced PR is targeting 2.7 (as a new feature) [and it even seems it may have to rather target 3.0 if BC-breaking consequences cannot be avoided], but this PR is targeting 2.6 (as a bug fix). I have added a recap of the currently incorrect cases (excluding Unicode ones) into the description at top.

Now, I think that the schema definition is mainly for IDE code-completion, and that validation of XML mappings against the XSD is not enforced (yet), so maybe this could wait until 2.7 [or 3.0] (although I would prefer not)...

@@ -202,7 +202,7 @@

<xs:simpleType name="tablename" id="tablename">
<xs:restriction base="xs:token">
<xs:pattern value="[a-zA-Z_u01-uff.]+" id="tablename.pattern">
<xs:pattern value="([&#x20;-&#x7F;-[\.]]+\.)?([a-zA-Z][a-zA-Z0-9_]*|`[&#x20;-&#x7F;]+`)" id="tablename.pattern">
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: the backslash before the dot in the subtracted character class is "redundant", but without it PhpStorm flags a syntax error "Unclosed character class"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wait, wouldn't "([a-zA-Z][a-zA-Z0-9_]*\.)?[a-zA-Z][a-zA-Z0-9_]*|([&#x20;-&#x7F;-[`\.]][&#x20;-&#x7F;-[\.]]*\.)?`[&#x20;-&#x7F;]+`" be better? (the schema part cannot be backquoted, and will be SQL-quoted only if the table part is backquoted)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have opened a separate #7329 about the peculiar schema part quoting behavior. But for now this PR should be fine as it is.

@guilliamxavier
Copy link
Contributor Author

@Ocramius & @mikeSimonson I have reworked the commits according to your reviews and updated the PR, is it OK now?

Copy link
Member

@ostrolucky ostrolucky left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Feels over-engineered at first, but valid. Would use more verbose variable names next time.

@guilliamxavier
Copy link
Contributor Author

Sorry, only now do I understand that this is blocked by #6728 because merging both PRs would lead to logic duplication in the XML driver and in the test class (libxml_use_internal_errors() etc.), and since enforced XSD validation is a desirable feature (for 3.0), the tests should wait for it and then probably use XmlDriver::loadMetadataForClass() rather that low-level DOM/libxml stuff (BTW, even the current testValidateXmlSchema() should be refactored then).

I also agree that this feels over-engineered (and as for the too-terse variable names, I guess I let myself influenced by $list/$invalid/$item, and agree should find better).

But since #6728 is itself blocked by #7199 which in turn is "blocked until further notice", and this PR maybe mixes too many things, I am going to extract independent parts into separate (and hopefully immediately-applicable) PRs 🙂

@@ -153,7 +153,7 @@ public function testInvalidMappingFileException()
public function testValidateXmlSchema($xmlMappingFile)
{
$xsdSchemaFile = __DIR__ . '/../../../../../doctrine-mapping.xsd';
$dom = new \DOMDocument('UTF-8');
$dom = new \DOMDocument('1.0', 'UTF-8');
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually better here: just new \DOMDocument(); (done in #7345)

],
];

// Convert basic sprintf-style formats to PCRE patterns
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe better: use assertStringMatchesFormat() instead of assertRegeExp() at line 180 (and e.g. %S, %a or %A rather than %s in the format)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants