Text::tokenize doesn't tokenize correctly when a multi-byte character is specified as the separator #6998

Closed
tt512 opened this issue Jul 12, 2015 · 1 comment

tt512 commented Jul 12, 2015

I used Text::tokenize with a multi-byte character, such as a double-width (full-width) space, as the separator,
but it doesn't tokenize the string correctly.
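
For context, here is a small sketch of why a double-width space behaves differently from an ASCII space. It assumes UTF-8 input and the mbstring extension; the variable names are only illustrative. U+3000 (ideographic space) occupies three bytes, so any code that reads a string one offset at a time sees only a fragment of it.

```php
<?php
// U+3000 IDEOGRAPHIC SPACE, written as explicit UTF-8 bytes (assumes UTF-8 input).
$separator = "\xE3\x80\x80";

$byteLength = strlen($separator);             // strlen() counts bytes
$charLength = mb_strlen($separator, 'UTF-8'); // mb_strlen() counts characters

// Offset access on a PHP string returns a single byte, never a whole
// multi-byte character, so this comparison can never succeed.
$firstByte = $separator[0];
$matchesSeparator = ($firstByte === $separator);

printf(
    "bytes=%d chars=%d matches=%s\n",
    $byteLength,
    $charLength,
    var_export($matchesSeparator, true)
);
// prints "bytes=3 chars=1 matches=false"
```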

I added a test case to the end of TextTest::testTokenize, and the test now fails.

diff --git a/tests/TestCase/Utility/TextTest.php b/tests/TestCase/Utility/TextTest.php
index 1df51bf..cd58880 100644
--- a/tests/TestCase/Utility/TextTest.php
+++ b/tests/TestCase/Utility/TextTest.php
@@ -315,6 +315,10 @@ class TextTest extends TestCase
         $result = Text::tokenize('tagA "single tag" tagB', ' ', '"', '"');
         $expected = ['tagA', '"single tag"', 'tagB'];
         $this->assertEquals($expected, $result);
+
+        $result = Text::tokenize('tagA　"single tag"　tagB', '　', '"', '"');
+        $expected = ['tagA', '"single tag"', 'tagB'];
+        $this->assertEquals($expected, $result);
     }

     public function testReplaceWithQuestionMarkInString()

$ phpunit tests/TestCase/Utility/TextTest.php
PHPUnit 4.7.6 by Sebastian Bergmann and contributors.

.....F...........................................

Time: 275 ms, Memory: 17.25Mb

There was 1 failure:

1) Cake\Test\TestCase\Utility\TextTest::testTokenize
Failed asserting that two arrays are equal.
--- Expected
+++ Actual
@@ @@
 Array (
-    0 => 'tagA'
-    1 => '"single tag"'
-    2 => 'tagB'
+    0 => 'tagA　"single tag"　tagB'
 )

/tmp/workspace/tests/TestCase/Utility/TextTest.php:321

FAILURES!
Tests: 49, Assertions: 735, Failures: 1.
@dereuromark dereuromark added this to the 3.0.9 milestone Jul 12, 2015
@markstory markstory self-assigned this Jul 12, 2015
markstory added a commit that referenced this issue Jul 12, 2015
String offset slicing is done bytewise and not characterwise which is
necessary for multibyte characters to be used as separators.

Refs #6998
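
The difference between bytewise and characterwise slicing can be sketched as follows. This is a hypothetical helper written for illustration, not CakePHP's implementation and not the code from the referenced commit. It assumes UTF-8 input, the mbstring extension, and a separator and bounds that are each a single character; `tokenizeMultibyte` is an invented name.

```php
<?php
/**
 * Hypothetical helper (not CakePHP's Text::tokenize): split $data on
 * $separator, ignoring separators that appear between the bounds.
 */
function tokenizeMultibyte(
    string $data,
    string $separator = ',',
    string $leftBound = '(',
    string $rightBound = ')'
): array {
    $tokens = [];
    $buffer = '';
    $depth = 0;
    $length = mb_strlen($data, 'UTF-8'); // length in characters, not bytes

    for ($i = 0; $i < $length; $i++) {
        // Read one whole character, however many bytes it occupies.
        $char = mb_substr($data, $i, 1, 'UTF-8');

        if ($leftBound === $rightBound) {
            // Identical bounds (such as quotes) toggle the "inside" state.
            if ($char === $leftBound) {
                $depth = $depth === 0 ? 1 : 0;
            }
        } elseif ($char === $leftBound) {
            $depth++;
        } elseif ($char === $rightBound && $depth > 0) {
            $depth--;
        }

        // A separator only ends a token when we are outside all bounds.
        if ($char === $separator && $depth === 0) {
            if ($buffer !== '') {
                $tokens[] = $buffer;
            }
            $buffer = '';
            continue;
        }
        $buffer .= $char;
    }

    if ($buffer !== '') {
        $tokens[] = $buffer;
    }

    return $tokens;
}

// U+3000 IDEOGRAPHIC SPACE as explicit UTF-8 bytes.
$space = "\xE3\x80\x80";
$result = tokenizeMultibyte('tagA' . $space . '"single tag"' . $space . 'tagB', $space, '"', '"');
// $result is ['tagA', '"single tag"', 'tagB']
```

Calling `mb_substr` inside the loop keeps the sketch short but rescans the string on each iteration; for long inputs, splitting the string into characters once up front would be the cheaper choice.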
@markstory
Member

A pull request is open now.
