Encoding conversion #4

Closed
GoogleCodeExporter opened this issue Jan 31, 2016 · 1 comment

Comments

@GoogleCodeExporter

Currently, the input and output of Reader use the same encoding.

It is often necessary to read a stream in one encoding (e.g. UTF-8) and output 
strings in another (e.g. UTF-16), or conversely, to stringify a DOM held in one 
encoding (e.g. UTF-16) to an output stream in another (e.g. UTF-8).

The simplest solution is to convert the whole stream into a memory buffer in 
the target encoding. This requires extra memory and extra memory accesses.

Another solution is to convert the input stream into the target encoding before 
sending it to the parser. However, only the characters inside JSON strings 
actually need to be converted; converting the other characters just wastes 
time.

The third solution is to let the parser distinguish between the input and 
output encodings, using an encoding converter only on the characters of JSON 
strings. However, since the converted output may be longer than the original 
input, in situ parsing cannot be supported in this case.
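
To illustrate why the converted output can outgrow the input, here is a minimal sketch of transcoding the contents of a string from UTF-8 to UTF-16. The function names (`DecodeUtf8`, `EncodeUtf16`, `TranscodeUtf8ToUtf16`) are hypothetical and not part of the library, and the decoder assumes well-formed input to stay short.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: decode one UTF-8 code point starting at s[i] and
// advance i past it. Assumes well-formed input (no validation).
inline char32_t DecodeUtf8(const std::string& s, std::size_t& i) {
    unsigned char c = static_cast<unsigned char>(s[i++]);
    if (c < 0x80) return c;                        // single-byte (ASCII)
    int extra = (c >= 0xF0) ? 3 : (c >= 0xE0) ? 2 : 1;
    char32_t cp = c & (0x3F >> extra);             // payload bits of the lead byte
    while (extra-- > 0)
        cp = (cp << 6) | (static_cast<unsigned char>(s[i++]) & 0x3F);
    return cp;
}

// Hypothetical helper: append one code point to a UTF-16 string.
inline void EncodeUtf16(char32_t cp, std::u16string& out) {
    if (cp < 0x10000) {
        out.push_back(static_cast<char16_t>(cp));
    } else {                                       // needs a surrogate pair
        cp -= 0x10000;
        out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
        out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
    }
}

// Transcode a whole string value, one code point at a time.
inline std::u16string TranscodeUtf8ToUtf16(const std::string& in) {
    std::u16string out;
    for (std::size_t i = 0; i < in.size();)
        EncodeUtf16(DecodeUtf8(in, i), out);
    return out;
}
```

An ASCII string such as `"abc"` occupies 3 bytes in UTF-8 but 6 bytes in UTF-16, so the result cannot be written back over the source buffer.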

Try to design a mechanism that generalizes encoding conversion. It should 
support UTF-8, UTF-16LE, UTF-16BE, UTF-32LE and UTF-32BE. It could also support 
automatic encoding detection from the BOM, at the cost of some overhead from 
dynamic dispatch.
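
A rough sketch of what BOM-based detection could look like, for the five encodings listed above. The names `UtfType`, `BomResult` and `DetectBom` are made up for this example and are not the library's API.

```cpp
#include <cstddef>

// Hypothetical names, for illustration only.
enum class UtfType { kUTF8, kUTF16LE, kUTF16BE, kUTF32LE, kUTF32BE, kUnknown };

struct BomResult {
    UtfType type;           // detected encoding, or kUnknown if no BOM found
    std::size_t bomLength;  // number of bytes to skip before the payload
};

inline BomResult DetectBom(const unsigned char* p, std::size_t n) {
    // UTF-32 is checked first: the UTF-32LE BOM (FF FE 00 00) begins with
    // the UTF-16LE BOM (FF FE). A JSON text cannot start with U+0000, so
    // FF FE 00 00 is treated as UTF-32LE here.
    if (n >= 4 && p[0] == 0xFF && p[1] == 0xFE && p[2] == 0x00 && p[3] == 0x00)
        return {UtfType::kUTF32LE, 4};
    if (n >= 4 && p[0] == 0x00 && p[1] == 0x00 && p[2] == 0xFE && p[3] == 0xFF)
        return {UtfType::kUTF32BE, 4};
    if (n >= 3 && p[0] == 0xEF && p[1] == 0xBB && p[2] == 0xBF)
        return {UtfType::kUTF8, 3};
    if (n >= 2 && p[0] == 0xFF && p[1] == 0xFE)
        return {UtfType::kUTF16LE, 2};
    if (n >= 2 && p[0] == 0xFE && p[1] == 0xFF)
        return {UtfType::kUTF16BE, 2};
    return {UtfType::kUnknown, 0};  // no BOM: caller must decide
}
```

The detected type is only known at runtime, which is where the dynamic dispatch overhead comes from: every subsequent read has to branch or call indirectly on it.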

Original issue reported on code.google.com by milo...@gmail.com on 26 Nov 2011 at 4:33

@GoogleCodeExporter

Reader/Writer can now perform transcoding with Transcoder.
New EncodedInputStream decodes characters from a byte input stream.
New EncodedOutputStream encodes characters to a byte output stream.
New AutoUTFInputStream can be given a UTF encoding at runtime, or can detect 
the UTF encoding from the beginning of the stream (BOM and RFC 4627). It then 
dynamically delegates operations to the actual UTF encoding.
New AutoUTFOutputStream can be given a UTF encoding at runtime, and optionally 
writes a BOM.
New AutoUTF performs operations according to the UTF encoding type of the 
input/output stream.
All AutoXXX classes can handle UTF-8, UTF-16LE, UTF-16BE, UTF-32LE and UTF-32BE.
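
For streams without a BOM, RFC 4627 (section 3) notes that the first two characters of a JSON text are always ASCII, so the pattern of zero bytes in the first four octets reveals the encoding. Below is a minimal sketch of that heuristic; `GuessedUtf` and `GuessFromNullPattern` are hypothetical names, and falling back to UTF-8 for short or unrecognized input is an assumption of this example rather than documented library behaviour.

```cpp
#include <cstddef>

// Hypothetical names, for illustration only.
enum class GuessedUtf { UTF8, UTF16LE, UTF16BE, UTF32LE, UTF32BE };

inline GuessedUtf GuessFromNullPattern(const unsigned char* p, std::size_t n) {
    if (n < 4) return GuessedUtf::UTF8;  // assumption: too short to tell

    // Bit k is set when octet k is non-zero.
    unsigned pattern = (p[0] ? 1u : 0u) | (p[1] ? 2u : 0u) |
                       (p[2] ? 4u : 0u) | (p[3] ? 8u : 0u);
    switch (pattern) {
        case 0x8: return GuessedUtf::UTF32BE;  // 00 00 00 xx
        case 0xA: return GuessedUtf::UTF16BE;  // 00 xx 00 xx
        case 0x1: return GuessedUtf::UTF32LE;  // xx 00 00 00
        case 0x5: return GuessedUtf::UTF16LE;  // xx 00 xx 00
        default:  return GuessedUtf::UTF8;     // xx xx xx xx, or unrecognized
    }
}
```

A detector would typically try the BOM first and use this pattern check only when no BOM is present.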

Original comment by milo...@gmail.com on 3 Dec 2011 at 4:43

  • Changed state: Fixed
