Encoding conversion #4

GoogleCodeExporter · 2016-01-31T20:50:52Z

Currently, the input and output of Reader use the same encoding.

It is often necessary to read a stream in one encoding (e.g. UTF-8) and output 
strings in another (e.g. UTF-16), or, conversely, to stringify a DOM from one 
encoding (e.g. UTF-16) to an output stream in another encoding (e.g. UTF-8).

The simplest solution is to convert the stream into a memory buffer of 
another encoding. This requires more memory storage and memory access.

Another solution is to convert the input stream into another encoding before 
sending it to the parser. However, only the characters inside JSON strings 
actually need to be converted. Converting the other characters just wastes 
time.

The third solution is to let the parser distinguish between the input and 
output encodings. It uses an encoding converter to convert only the characters 
of JSON strings. However, since the output may be longer than the original 
input, in situ parsing cannot be permitted.
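
To illustrate why the converted output can outgrow the input, here is a minimal, self-contained sketch of a UTF-8 to UTF-16 converter. The helper names (DecodeUtf8, EncodeUtf16, TranscodeUtf8ToUtf16) are hypothetical and are not the library's API; the decoder assumes well-formed UTF-8 and does no validation.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: decode one code point from well-formed UTF-8 at s[i]
// and advance i past it. Malformed input is not detected.
inline char32_t DecodeUtf8(const std::string& s, std::size_t& i) {
    assert(i < s.size());
    unsigned char c = static_cast<unsigned char>(s[i++]);
    if (c < 0x80) return c;                       // single-byte (ASCII)
    int extra = (c >= 0xF0) ? 3 : (c >= 0xE0) ? 2 : 1;
    char32_t cp = c & (0x3F >> extra);            // payload bits of lead byte
    while (extra-- > 0) {
        assert(i < s.size());
        cp = (cp << 6) | (static_cast<unsigned char>(s[i++]) & 0x3F);
    }
    return cp;
}

// Hypothetical helper: append one code point to a UTF-16 string,
// using a surrogate pair above U+FFFF.
inline void EncodeUtf16(char32_t cp, std::u16string& out) {
    assert(cp <= 0x10FFFF);
    if (cp < 0x10000) {
        out.push_back(static_cast<char16_t>(cp));
    } else {
        cp -= 0x10000;
        out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
        out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
    }
}

// Convert a whole string code point by code point.
inline std::u16string TranscodeUtf8ToUtf16(const std::string& in) {
    std::u16string out;
    for (std::size_t i = 0; i < in.size();) {
        EncodeUtf16(DecodeUtf8(in, i), out);
    }
    return out;
}
```

For ASCII text such as "abc", the 3 input bytes become 3 UTF-16 code units, i.e. 6 bytes, so the result cannot be written back over the input buffer.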

Try to design a mechanism that generalizes encoding conversion. It should 
support UTF-8, UTF-16LE, UTF-16BE, UTF-32LE and UTF-32BE. It can also support 
automatic encoding detection via BOM, at the cost of some overhead from 
dynamic dispatch.
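
As a rough sketch of what BOM-based detection involves, the following self-contained function checks the leading bytes of a buffer against the five BOMs. The names UtfType, BomResult and DetectBom are assumptions made for this example, not the library's API, and falling back to UTF-8 when no BOM is present is likewise an assumption.

```cpp
#include <cstddef>

enum class UtfType { kUTF8, kUTF16LE, kUTF16BE, kUTF32LE, kUTF32BE };

struct BomResult {
    bool found;             // true if a BOM was recognized
    UtfType type;           // detected encoding (kUTF8 if no BOM; assumption)
    std::size_t bomLength;  // number of bytes to skip
};

// Hypothetical helper: inspect the first bytes of a buffer for a BOM.
inline BomResult DetectBom(const unsigned char* p, std::size_t n) {
    // The UTF-32LE BOM (FF FE 00 00) starts with the UTF-16LE BOM (FF FE),
    // so the 4-byte patterns are tested first. A JSON text cannot start
    // with U+0000, so the ambiguity does not matter here.
    if (n >= 4 && p[0] == 0xFF && p[1] == 0xFE && p[2] == 0x00 && p[3] == 0x00)
        return {true, UtfType::kUTF32LE, 4};
    if (n >= 4 && p[0] == 0x00 && p[1] == 0x00 && p[2] == 0xFE && p[3] == 0xFF)
        return {true, UtfType::kUTF32BE, 4};
    if (n >= 3 && p[0] == 0xEF && p[1] == 0xBB && p[2] == 0xBF)
        return {true, UtfType::kUTF8, 3};
    if (n >= 2 && p[0] == 0xFF && p[1] == 0xFE)
        return {true, UtfType::kUTF16LE, 2};
    if (n >= 2 && p[0] == 0xFE && p[1] == 0xFF)
        return {true, UtfType::kUTF16BE, 2};
    return {false, UtfType::kUTF8, 0};
}
```

Once the type is known only at runtime, every later read has to branch or call through a function pointer on it, which is where the dispatch overhead comes from.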

Original issue reported on code.google.com by milo...@gmail.com on 26 Nov 2011 at 4:33


GoogleCodeExporter · 2016-01-31T20:50:53Z

Reader/Writer can now perform transcoding with Transcoder.
New EncodedInputStream can decode characters from a byte input stream.
New EncodedOutputStream can encode characters to a byte output stream.
New AutoUTFInputStream can specify a UTF encoding at runtime, or detect the UTF 
encoding from the beginning of the stream (BOM and RFC4627). It then 
dynamically delegates operations to the actual UTF encoding.
New AutoUTFOutputStream can specify a UTF encoding at runtime and optionally 
writes a BOM.
New AutoUTF can perform operations according to the UTF encoding type of the 
input/output stream.
All AutoXXX classes can handle UTF-8, UTF-16LE, UTF-16BE, UTF-32LE, UTF-32BE.
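
For streams without a BOM, RFC 4627 (section 3) observes that a JSON text starts with two ASCII characters, so the pattern of zero bytes in the first four octets reveals the encoding. A minimal sketch of that heuristic follows; the names DetectedUtf and GuessUtfFromJsonPrefix are hypothetical, and this is not necessarily how AutoUTFInputStream implements it.

```cpp
#include <cstddef>

enum class DetectedUtf { kUTF8, kUTF16LE, kUTF16BE, kUTF32LE, kUTF32BE };

// Hypothetical helper: guess the UTF encoding of a BOM-less JSON text from
// the zero/non-zero pattern of its first four octets (RFC 4627, section 3).
// Assumes the text begins with two ASCII characters; buffers shorter than
// four bytes are treated as UTF-8 (assumption).
inline DetectedUtf GuessUtfFromJsonPrefix(const unsigned char* p, std::size_t n) {
    if (n < 4) return DetectedUtf::kUTF8;
    // Bit k is set when octet k is non-zero.
    unsigned pattern = (p[0] ? 1u : 0u) | (p[1] ? 2u : 0u) |
                       (p[2] ? 4u : 0u) | (p[3] ? 8u : 0u);
    switch (pattern) {
        case 0x8: return DetectedUtf::kUTF32BE;  // 00 00 00 xx
        case 0xA: return DetectedUtf::kUTF16BE;  // 00 xx 00 xx
        case 0x1: return DetectedUtf::kUTF32LE;  // xx 00 00 00
        case 0x5: return DetectedUtf::kUTF16LE;  // xx 00 xx 00
        default:  return DetectedUtf::kUTF8;     // xx xx xx xx and others
    }
}
```

For example, the bytes 7B 00 22 00 (the characters {" in UTF-16LE) match the xx 00 xx 00 pattern.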

Original comment by milo...@gmail.com on 3 Dec 2011 at 4:43

Changed state: Fixed

GoogleCodeExporter added Priority-Medium Type-Enhancement auto-migrated labels Jan 31, 2016

GoogleCodeExporter closed this as completed Jan 31, 2016
