Skip to content
Permalink
Browse files

[SPARK-23649][SQL] Skipping chars disallowed in UTF-8

## What changes were proposed in this pull request?

The mapping of UTF-8 char's first byte to char's size doesn't cover whole range 0-255. It is defined only for 0-253:
https://github.com/apache/spark/blob/master/common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java#L60-L65
https://github.com/apache/spark/blob/master/common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java#L190

If the first byte of a char is 253-255, IndexOutOfBoundsException is thrown. Besides of that values for 244-252 are not correct according to recent unicode standard for UTF-8: http://www.unicode.org/versions/Unicode10.0.0/UnicodeStandard-10.0.pdf

As a consequence of the exception above, the length of input string in UTF-8 encoding cannot be calculated if the string contains chars started from 253 code. It is visible on user's side as for example crashing of schema inferring of csv file which contains such chars but the file can be read if the schema is specified explicitly or if the mode set to multiline.

The proposed changes build correct mapping of first byte of UTF-8 char to its size (now it covers all cases) and skip disallowed chars (counts it as one octet).

## How was this patch tested?

Added a test and a file with a char which is disallowed in UTF-8 - 0xFF.

Author: Maxim Gekk <maxim.gekk@databricks.com>

Closes #20796 from MaxGekk/skip-wrong-utf8-chars.

(cherry picked from commit 5e7bc2a)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
  • Loading branch information...
MaxGekk authored and cloud-fan committed Mar 20, 2018
1 parent c854b6c commit 0b880db65b647e549b78721859b1712dff733ec9
@@ -57,12 +57,43 @@
public Object getBaseObject() { return base; }
public long getBaseOffset() { return offset; }

private static int[] bytesOfCodePointInUTF8 = {2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
4, 4, 4, 4, 4, 4, 4, 4,
5, 5, 5, 5,
6, 6};
/**
* A char in UTF-8 encoding can take 1-4 bytes depending on the first byte which
* indicates the size of the char. See Unicode standard in page 126, Table 3-6:
* http://www.unicode.org/versions/Unicode10.0.0/UnicodeStandard-10.0.pdf
*
* Binary Hex Comments
* 0xxxxxxx 0x00..0x7F Only byte of a 1-byte character encoding
* 10xxxxxx 0x80..0xBF Continuation bytes (1-3 continuation bytes)
* 110xxxxx 0xC0..0xDF First byte of a 2-byte character encoding
* 1110xxxx 0xE0..0xEF First byte of a 3-byte character encoding
* 11110xxx 0xF0..0xF7 First byte of a 4-byte character encoding
*
* As a consequence of the well-formedness conditions specified in
* Table 3-7 (page 126), the following byte values are disallowed in UTF-8:
* C0–C1, F5–FF.
*/
private static byte[] bytesOfCodePointInUTF8 = {
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, // 0x00..0x0F
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, // 0x10..0x1F
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, // 0x20..0x2F
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, // 0x30..0x3F
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, // 0x40..0x4F
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, // 0x50..0x5F
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, // 0x60..0x6F
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, // 0x70..0x7F
// Continuation bytes cannot appear as the first byte
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, // 0x80..0x8F
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, // 0x90..0x9F
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, // 0xA0..0xAF
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, // 0xB0..0xBF
0, 0, // 0xC0..0xC1 - disallowed in UTF-8
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, // 0xC2..0xCF
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, // 0xD0..0xDF
3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, // 0xE0..0xEF
4, 4, 4, 4, 4, // 0xF0..0xF4
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 // 0xF5..0xFF - disallowed in UTF-8
};

private static final boolean IS_LITTLE_ENDIAN =
ByteOrder.nativeOrder() == ByteOrder.LITTLE_ENDIAN;
@@ -187,8 +218,9 @@ public void writeTo(OutputStream out) throws IOException {
* @param b The first byte of a code point
*/
private static int numBytesForFirstByte(final byte b) {
final int offset = (b & 0xFF) - 192;
return (offset >= 0) ? bytesOfCodePointInUTF8[offset] : 1;
final int offset = b & 0xFF;
byte numBytes = bytesOfCodePointInUTF8[offset];
return (numBytes == 0) ? 1: numBytes; // Skip the first byte disallowed in UTF-8
}

/**
@@ -58,8 +58,12 @@ private static void checkBasic(String str, int len) {
@Test
public void basicTest() {
checkBasic("", 0);
checkBasic("hello", 5);
checkBasic("¡", 1); // 2 bytes char
checkBasic("ку", 2); // 2 * 2 bytes chars
checkBasic("hello", 5); // 5 * 1 byte chars
checkBasic("大 千 世 界", 7);
checkBasic("︽﹋%", 3); // 3 * 3 bytes chars
checkBasic("\uD83E\uDD19", 1); // 4 bytes char
}

@Test
@@ -791,4 +795,21 @@ public void trimRightWithTrimString() {
assertEquals(fromString(""), fromString("头a???/").trimRight(fromString("数?/*&^%a")));
assertEquals(fromString(""), fromString("头数b数数 [").trimRight(fromString(" []数b")));
}

@Test
public void skipWrongFirstByte() {
int[] wrongFirstBytes = {
0x80, 0x9F, 0xBF, // Skip Continuation bytes
0xC0, 0xC2, // 0xC0..0xC1 - disallowed in UTF-8
// 0xF5..0xFF - disallowed in UTF-8
0xF5, 0xF6, 0xF7, 0xF8, 0xF9,
0xFA, 0xFB, 0xFC, 0xFD, 0xFE, 0xFF
};
byte[] c = new byte[1];

for (int i = 0; i < wrongFirstBytes.length; ++i) {
c[0] = (byte)wrongFirstBytes[i];
assertEquals(fromBytes(c).numChars(), 1);
}
}
}

0 comments on commit 0b880db

Please sign in to comment.
You can’t perform that action at this time.