Text::tokenize doesn't tokenize correctly when a multi-byte character is specified as the separator #6998

tt512 · 2015-07-12T01:30:11Z

I used Text::tokenize to split a string on a multi-byte character such as double-width whitespace,
but it doesn't tokenize correctly.

I added a test case to the end of TextTest::testTokenize, and the test now fails.

diff --git a/tests/TestCase/Utility/TextTest.php b/tests/TestCase/Utility/TextTest.php
index 1df51bf..cd58880 100644
--- a/tests/TestCase/Utility/TextTest.php
+++ b/tests/TestCase/Utility/TextTest.php
@@ -315,6 +315,10 @@ class TextTest extends TestCase
         $result = Text::tokenize('tagA "single tag" tagB', ' ', '"', '"');
         $expected = ['tagA', '"single tag"', 'tagB'];
         $this->assertEquals($expected, $result);
+
+        $result = Text::tokenize('tagA　"single　tag"　tagB', '　', '"', '"');
+        $expected = ['tagA', '"single　tag"', 'tagB'];
+        $this->assertEquals($expected, $result);
     }

     public function testReplaceWithQuestionMarkInString()

 phpunit tests/TestCase/Utility/TextTest.php 
PHPUnit 4.7.6 by Sebastian Bergmann and contributors.

.....F...........................................

Time: 275 ms, Memory: 17.25Mb

There was 1 failure:

1) Cake\Test\TestCase\Utility\TextTest::testTokenize
Failed asserting that two arrays are equal.
--- Expected
+++ Actual
@@ @@
 Array (
-    0 => 'tagA'
-    1 => '"single　tag"'
-    2 => 'tagB'
+    0 => 'tagA　"single　tag"　tagB'
 )

/tmp/workspace/tests/TestCase/Utility/TextTest.php:321

FAILURES!
Tests: 49, Assertions: 735, Failures: 1.

markstory · 2015-07-12T02:25:26Z

Pull request open now.

dereuromark added this to the 3.0.9 milestone Jul 12, 2015

markstory added defect utility labels Jul 12, 2015

markstory self-assigned this Jul 12, 2015

markstory added a commit that referenced this issue Jul 12, 2015

Fix multibyte issues in Text::tokenize()

1e2d1b8

String offset slicing is done bytewise and not characterwise which is necessary for multibyte characters to be used as separators. Refs #6998
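
To illustrate the bytewise versus characterwise distinction the commit message describes, here is a minimal sketch. It is not the code from the commit; it only assumes UTF-8 input and a loaded mbstring extension, and the variable names are made up for the example. It shows why a single-byte offset lookup can never equal a three-byte separator such as U+3000, which is consistent with the whole string coming back as one token in the failing test.

```php
<?php
// Assumption: data is UTF-8 and the mbstring extension is available.
$separator = "\xE3\x80\x80"; // U+3000 IDEOGRAPHIC SPACE, three bytes in UTF-8
$data = 'tagA' . $separator . 'tagB';

// Byte-oriented functions count bytes; mb_* functions count characters.
$byteLength = strlen($data);             // 4 + 3 + 4 bytes
$charLength = mb_strlen($data, 'UTF-8'); // 4 + 1 + 4 characters

// A string offset yields one byte, so it only holds the first byte
// of the separator and the comparison below is always false.
$byteAt4 = $data[4];

// mb_substr() slices by character and returns the full separator.
$charAt4 = mb_substr($data, 4, 1, 'UTF-8');

var_dump($byteAt4 === $separator); // bool(false)
var_dump($charAt4 === $separator); // bool(true)
```

With a single-byte separator such as `' '` both comparisons succeed, which is why the existing test cases passed before the multi-byte case was added.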

markstory mentioned this issue Jul 12, 2015

Fix multibyte issues in Text::tokenize() #7000

Merged

markstory closed this as completed Jul 12, 2015

tt512 mentioned this issue Jul 12, 2015

OSS Hack Weekend: tatarhy: cakephp: php: work log clear-code/sezemi-2015#42

Closed
