CSVDataSet does not read UTF-8 files when file.encoding is UTF-8 #2512

asfimport · 2011-08-10T19:48:30Z

Jacob Zwiers (Bug 51645):
CSV Data Set files encoded in UTF-8 are not read correctly on platforms where the default file.encoding is UTF-8.

UTF-8 is used as the example here, but the problem presumably applies to other non-8-bit default character sets as well.

Reason: The use of ByteArrayOutputStream in the CSVSaveService.csvReadFile() method. Specifically, the baos.write(ch) call is implemented (internally in ByteArrayOutputStream) with a cast to the byte primitive type ( buf[count] = (byte)b; in my JVM), so each character read is truncated to its low 8 bits.

Later, the contents of the ByteArrayOutputStream are converted back to a String via baos.toString(), which decodes the bytes using the platform's default charset. If that charset is 8-bit (e.g. ISO-8859-1), everything is fine. For other charsets (such as UTF-8), the result is unpredictable or contains unmapped characters.

For example, take the character \u00e7 (LATIN SMALL LETTER C WITH CEDILLA), with decimal code point 231. When written to baos, it becomes the signed 8-bit value -25. When converted via toString() with UTF-8 as the default charset, that lone byte is not a valid UTF-8 sequence, so \ufffd (decimal code point 65533, Unicode's "REPLACEMENT CHARACTER" placeholder) is placed in the returned string instead.
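The truncation can be reproduced outside JMeter. The following is a minimal sketch, not the JMeter code: the class and method names are hypothetical, and the charset is passed explicitly to stand in for the platform default that baos.toString() would use.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of JMeter.
class ByteTruncationDemo {

    // Buffers each char as a single byte, then decodes with the given charset.
    // The explicit charset stands in for the platform default.
    static String viaByteBuffer(String input, Charset decodeAs) {
        ByteArrayOutputStream baos = new ByteArrayOutputStream(200);
        for (int i = 0; i < input.length(); i++) {
            baos.write(input.charAt(i)); // write(int) keeps only the low 8 bits
        }
        return new String(baos.toByteArray(), decodeAs);
    }

    public static void main(String[] args) {
        String cedilla = "\u00e7"; // code point 231
        System.out.println((byte) cedilla.charAt(0)); // -25
        System.out.println((int) viaByteBuffer(cedilla, StandardCharsets.ISO_8859_1).charAt(0)); // 231
        System.out.println((int) viaByteBuffer(cedilla, StandardCharsets.UTF_8).charAt(0)); // 65533
    }
}
```

With ISO-8859-1 the single byte 0xE7 maps straight back to \u00e7, which is why the problem stays hidden on platforms with an 8-bit default.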

Fix: patch attached. Simply replace ByteArrayOutputStream with CharArrayWriter, and UTF-8 files work regardless of the value of file.encoding.
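As an illustration of why the replacement works, here is a small sketch with hypothetical class and method names that uses the same read-loop shape as csvReadFile() but buffers chars rather than bytes. It is a simplified stand-in, not the patched method itself.

```java
import java.io.CharArrayWriter;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical demo class, not part of JMeter.
class CharBufferDemo {

    // Copies every char from the reader into a char buffer and returns it.
    static String copyChars(Reader infile) {
        CharArrayWriter buf = new CharArrayWriter(200);
        try {
            int ch;
            while (-1 != (ch = infile.read())) {
                buf.write(ch); // stores the full 16-bit char, no truncation
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buf.toString(); // no charset decoding involved
    }

    public static void main(String[] args) {
        String line = "a,b\u00e7,d,\u00e9";
        System.out.println(copyChars(new StringReader(line)).equals(line)); // true
    }
}
```

Because CharArrayWriter never converts between bytes and chars, the result does not depend on file.encoding at all.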

Created attachment CSVSaveService.java.patch: Patch to fix issue. The variable is deliberately not renamed, to show that the fix is just a matter of replacing the class.

CSVSaveService.java.patch

Index: src/core/org/apache/jmeter/save/CSVSaveService.java
===================================================================
--- src/core/org/apache/jmeter/save/CSVSaveService.java	(revision 1155546)
+++ src/core/org/apache/jmeter/save/CSVSaveService.java	(working copy)
@@ -19,7 +19,7 @@
 package org.apache.jmeter.save;
 
 import java.io.BufferedReader;
-import java.io.ByteArrayOutputStream;
+import java.io.CharArrayWriter;
 import java.io.FileReader;
 import java.io.FileWriter;
 import java.io.IOException;
@@ -937,7 +937,7 @@
         int ch;
         int state = INITIAL;
         List<String> list = new ArrayList<String>();
-        ByteArrayOutputStream baos = new ByteArrayOutputStream(200);
+        CharArrayWriter baos = new CharArrayWriter(200);        
         boolean push = false;
         while(-1 != (ch=infile.read())){
             push = false;

Severity: major
OS: All

asfimport · 2011-08-10T19:56:50Z

Jacob Zwiers (migrated from Bugzilla):
The tests pass when the default file.encoding is ISO-8859-1 (or another 8-bit charset that can represent the characters in the test). However, when run with the VM argument -Dfile.encoding=UTF-8, the tests fail without the fix. They require the new file bin/testfiles/testutf8.csv (attached next).

Created attachment 51645-testcases.patch: Test cases to expose bug. Run with file.encoding=UTF-8

51645-testcases.patch

Index: test/src/org/apache/jmeter/config/TestCVSDataSet.java
===================================================================
--- test/src/org/apache/jmeter/config/TestCVSDataSet.java	(revision 1155546)
+++ test/src/org/apache/jmeter/config/TestCVSDataSet.java	(working copy)
@@ -92,7 +92,41 @@
         assertEquals("b1",threadVars.get("b"));
         assertEquals("c1",threadVars.get("c"));
     }
+    
+    public void testutf8() throws Exception {
+    	
+        CSVDataSet csv = new CSVDataSet();
+        csv.setFilename(findTestPath("testfiles/testutf8.csv"));
+        csv.setVariableNames("a,b,c,d");
+        csv.setDelimiter(",");
+        csv.setQuotedData( true );
+        csv.setFileEncoding( "UTF-8" );
+        
+        csv.iterationStart(null);
+        assertEquals("a1",threadVars.get("a"));
+        assertEquals("b1",threadVars.get("b"));
+        assertEquals("\u00e71",threadVars.get("c"));
+        assertEquals("d1",threadVars.get("d"));
 
+        csv.iterationStart(null);
+        assertEquals("a2",threadVars.get("a"));
+        assertEquals("b2",threadVars.get("b"));
+        assertEquals("\u00e72",threadVars.get("c"));
+        assertEquals("d2",threadVars.get("d"));
+
+        csv.iterationStart(null);
+        assertEquals("a3",threadVars.get("a"));
+        assertEquals("b3",threadVars.get("b"));
+        assertEquals("\u00e73",threadVars.get("c"));
+        assertEquals("d3",threadVars.get("d"));
+
+        csv.iterationStart(null);
+        assertEquals("a4",threadVars.get("a"));
+        assertEquals("b4",threadVars.get("b"));
+        assertEquals("\u00e74",threadVars.get("c"));
+        assertEquals("d4",threadVars.get("d"));
+    }
+
     // Test CSV file with a header line
     public void testHeaderOpen(){
         CSVDataSet csv = new CSVDataSet();
Index: test/src/org/apache/jmeter/save/TestCSVSaveService.java
===================================================================
--- test/src/org/apache/jmeter/save/TestCSVSaveService.java	(revision 1155546)
+++ test/src/org/apache/jmeter/save/TestCSVSaveService.java	(working copy)
@@ -60,6 +60,10 @@
         checkSplitString("a,bc,,",   ',', new String[]{"a","bc","",""});
         checkSplitString("a,,,",     ',', new String[]{"a","","",""});
         checkSplitString("a,bc,d,\n",',', new String[]{"a","bc","d",""});
+        
+        // \u00e7 = LATIN SMALL LETTER C WITH CEDILLA
+        // \u00e9 = LATIN SMALL LETTER E WITH ACUTE
+        checkSplitString("a,b\u00e7,d,\u00e9", ',', new String[]{"a","b\u00e7","d","\u00e9"}); 
     }
 
     public void testSplitQuoted() throws Exception {
@@ -75,6 +79,10 @@
         checkSplitString("a,bc,d,",      ',', new String[]{"a","bc","d",""});
         checkSplitString("a,bc,d,\"\"",  ',', new String[]{"a","bc","d",""});
         checkSplitString("a,bc,d,\"\"\n",',', new String[]{"a","bc","d",""});
+
+        // \u00e7 = LATIN SMALL LETTER C WITH CEDILLA
+        // \u00e9 = LATIN SMALL LETTER E WITH ACUTE
+        checkSplitString("\"a\",\"b\u00e7\",\"d\",\"\u00e9\"", ',', new String[]{"a","b\u00e7","d","\u00e9"}); 
     }
 
     public void testSplitBadQuote() throws Exception {

asfimport · 2011-08-10T19:58:17Z

Jacob Zwiers (migrated from Bugzilla):
Required for one of the previously attached tests. Belongs in bin/testfiles

Created attachment testutf8.csv: .csv file for previous test patch

asfimport · 2011-08-11T00:37:08Z

Sebb (migrated from Bugzilla):
Thanks very much.

Patch applied:

URL: http://svn.apache.org/viewvc?rev=1156416&view=rev
Log:
#2512 - CSVDataSet does not read UTF-8 files when file.encoding is UTF-8

Added:
jakarta/jmeter/trunk/bin/testfiles/testutf8.csv (with props)
Modified:
jakarta/jmeter/trunk/src/core/org/apache/jmeter/save/CSVSaveService.java
jakarta/jmeter/trunk/test/src/org/apache/jmeter/config/TestCVSDataSet.java
jakarta/jmeter/trunk/test/src/org/apache/jmeter/save/TestCSVSaveService.java
jakarta/jmeter/trunk/xdocs/changes.xml

asfimport closed this as completed Aug 11, 2011
