Input Method to compose complex characters #2430

Closed
Myriads opened this issue Nov 9, 2014 · 10 comments · Fixed by #5967

@Myriads Myriads commented Nov 9, 2014

There are two bugs when typing Korean characters:

  1. If any ASCII character, such as a space, comma, period, or digit, is typed while a Korean character is being composed, the character under composition disappears from the editor and Serial Monitor windows, as if it had been deleted.
  2. A Korean character usually consists of three components: 1) an initial consonant, 2) a vowel, and 3) a final consonant. Sometimes the final consonant of the character just composed becomes the initial consonant of the next character. When that happens, the previous character should be redrawn in its final form, without that consonant, but it is still displayed with the final consonant that has moved to the next character.
  • Same in 1.5.x
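
To make the second bug concrete, here is a minimal sketch of the arithmetic Unicode uses to compose precomposed Hangul syllables. It is only an illustration, not code from the Arduino IDE or from any input method: the class name `HangulCompose` and its method are hypothetical, and the indices follow the standard Unicode ordering of initial consonants, vowels, and final consonants. It shows the syllable U+D55C (initial HIEUH, vowel A, final NIEUN) turning into U+D558 followed by U+B098 once the final NIEUN moves to the next syllable, which is the redraw the editor has to perform.

```java
class HangulCompose {
  static final int S_BASE = 0xAC00; // first precomposed Hangul syllable
  static final int V_COUNT = 21;    // number of vowels
  static final int T_COUNT = 28;    // number of final consonants, including "none"

  /** Composes a syllable from Unicode indices; fin == 0 means no final consonant. */
  static char compose(int initial, int vowel, int fin) {
    return (char) (S_BASE + (initial * V_COUNT + vowel) * T_COUNT + fin);
  }

  public static void main(String[] args) {
    // Indices: initial HIEUH = 18, vowel A = 0, final NIEUN = 4, initial NIEUN = 2.
    char before = compose(18, 0, 4);   // syllable while NIEUN is still a final consonant
    char redrawn = compose(18, 0, 0);  // same syllable after NIEUN has moved on
    char next = compose(2, 0, 0);      // new syllable that starts with NIEUN
    System.out.printf("U+%04X -> U+%04X U+%04X%n", (int) before, (int) redrawn, (int) next);
    // prints "U+D55C -> U+D558 U+B098"
  }
}
```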
@matthijskooijman

Collaborator

@matthijskooijman matthijskooijman commented Nov 10, 2014

Hmm, I wonder if this is something that the Arduino code can influence, or if we're just dependent on Java to do the right thing here...

@Myriads

Author

@Myriads Myriads commented Nov 10, 2014

I believe both problems are related to the source code that handles the Java Input Method, so they should not affect the Arduino code itself.

@ffissore

Contributor

@ffissore ffissore commented May 12, 2015

This should be fixed with the new editor, available with the latest hourly build http://www.arduino.cc/en/Main/Software#hourly

@ffissore ffissore self-assigned this May 12, 2015
@Myriads

Author

@Myriads Myriads commented May 12, 2015

It works great. Thanks a lot.

However, there is still a bug in the Serial Monitor.

[Screenshot: ide-1 6 5 01]
This sketch prints the Korean characters "한글" directly to the Serial Monitor and then echoes the inputString received from the Serial Monitor.

[Screenshot: ide-1 6 5 02]
The inputString is displayed correctly, but the directly printed Korean characters "한글" are garbled, as shown in the attached picture below.

[Screenshot: ide-1 6 5 03]

@Myriads Myriads closed this May 12, 2015
@Myriads Myriads reopened this May 12, 2015
@cmaglie

Member

@cmaglie cmaglie commented May 12, 2015

Could you copy and paste your sketch here?

@ffissore ffissore assigned cmaglie and unassigned ffissore May 12, 2015
@cmaglie

Member

@cmaglie cmaglie commented May 12, 2015

Never mind, I reproduced (more or less) the issue with this sketch on an Arduino Due:

void setup() {
  Serial.begin(9600);
}

void loop() {
  Serial.println("한글");
  delay(1000);
}

To avoid raising false expectations: the String functions in Arduino are designed to work with plain ASCII characters. UTF-8 characters may work in simple cases, but you may see erratic behaviour in more complex sketches, for example when you concatenate two strings or extract a substring from a larger one.

That said, the above sketch works if I connect to the serial port with an external terminal program like PuTTY, but it prints random garbage in the serial monitor of the Arduino IDE. So my conclusion is that something weird is happening in the Arduino Serial Monitor.

My guess is that the issue is in how the incoming chars are buffered here:
https://github.com/arduino/Arduino/blob/master/arduino-core/src/processing/app/Serial.java#L156

        byte[] buf = port.readBytes(serialEvent.getEventValue());
        if (buf.length > 0) {
          String msg = new String(buf);
          char[] chars = msg.toCharArray();
          message(chars, chars.length);
        }

A UTF-8 char may be composed of several bytes, and the String constructor can produce the correct char only if the complete byte sequence arrives in a single read. If a multi-byte UTF-8 char is fragmented across two reads, the two consecutive calls to the String constructor cannot build the correct character.
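
The fragmentation problem can be reproduced without any serial hardware. The following is a minimal sketch, not the IDE's code: the class `SplitUtf8Demo` is hypothetical, and it passes UTF-8 explicitly to the String constructor, whereas the snippet above relies on the platform default charset. It decodes the six UTF-8 bytes of the two-syllable test string once as a single chunk and once split after the fourth byte, in the middle of the second character.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class SplitUtf8Demo {
  /** Decodes every chunk on its own, mirroring one String constructor call per read. */
  static String decodePerChunk(byte[][] chunks) {
    StringBuilder out = new StringBuilder();
    for (byte[] chunk : chunks) {
      out.append(new String(chunk, StandardCharsets.UTF_8));
    }
    return out.toString();
  }

  public static void main(String[] args) {
    // U+D55C U+AE00 is the test string from the sketch; each syllable takes 3 bytes.
    String text = "\uD55C\uAE00";
    byte[] bytes = text.getBytes(StandardCharsets.UTF_8);

    byte[][] whole = { bytes };
    byte[][] split = {
      Arrays.copyOfRange(bytes, 0, 4),           // first syllable plus 1 byte of the second
      Arrays.copyOfRange(bytes, 4, bytes.length) // remaining 2 bytes of the second syllable
    };

    System.out.println(decodePerChunk(whole).equals(text)); // prints "true"
    System.out.println(decodePerChunk(split).equals(text)); // prints "false"
  }
}
```

In the split case the first syllable still decodes correctly, but the bytes of the second one are replaced by U+FFFD replacement characters, which matches the garbage seen in the serial monitor.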

This is a tricky issue, because JSSC doesn't implement the InputStream interface but, instead, has this weird readBytes() method that returns an array of bytes. See https://github.com/scream3r/java-simple-serial-connector/issues/17

The best fix would be to implement an InputStream interface in JSSC and feed the InputStream into an InputStreamReader or a BufferedReader that will do all the correct buffering and decoding.

An alternative is to write an anonymous InputStream wrapper around JSSC's Serial object to obtain the same result.
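
A rough sketch of the wrapper idea follows. It does not use JSSC at all: `ChunkInputStream` is a hypothetical stand-in that serves queued byte arrays (as `readBytes()` would return them) through the standard `InputStream` API, and it signals end of stream when the queue is empty, where a real serial wrapper would presumably block and wait for more data. Feeding it into an `InputStreamReader` lets the JDK handle the buffering and decoding.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayDeque;
import java.util.Deque;

class ChunkedInputStreamDemo {
  /** Hypothetical adapter: exposes queued byte chunks as one continuous stream. */
  static class ChunkInputStream extends InputStream {
    private final Deque<byte[]> chunks = new ArrayDeque<>();
    private byte[] current = new byte[0];
    private int pos = 0;

    void offer(byte[] chunk) {
      chunks.add(chunk);
    }

    @Override
    public int read() {
      while (pos >= current.length) {
        if (chunks.isEmpty()) {
          return -1; // demo only: a real port wrapper would block here instead
        }
        current = chunks.poll();
        pos = 0;
      }
      return current[pos++] & 0xFF;
    }
  }

  /** Decodes all chunks as UTF-8 through an InputStreamReader. */
  static String decodeChunks(byte[][] chunks) {
    ChunkInputStream in = new ChunkInputStream();
    for (byte[] chunk : chunks) {
      in.offer(chunk);
    }
    try (Reader reader = new InputStreamReader(in, StandardCharsets.UTF_8)) {
      StringBuilder out = new StringBuilder();
      int ch;
      while ((ch = reader.read()) != -1) {
        out.append((char) ch);
      }
      return out.toString();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    byte[] bytes = "\uD55C\uAE00".getBytes(StandardCharsets.UTF_8);
    byte[][] split = {
      java.util.Arrays.copyOfRange(bytes, 0, 4),
      java.util.Arrays.copyOfRange(bytes, 4, bytes.length)
    };
    System.out.println(decodeChunks(split).equals("\uD55C\uAE00")); // prints "true"
  }
}
```

Because the reader sees one continuous byte stream, the position of the chunk boundary no longer matters.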

@Myriads

Author

@Myriads Myriads commented May 12, 2015

Here is the sketch:

/*
  Serial Event example
 
 When new serial data arrives, this sketch adds it to a String.
 When a newline is received, the loop prints the string and 
 clears it.
 
 A good test for this is to try it with a GPS receiver 
 that sends out NMEA 0183 sentences. 
 
 Created 9 May 2011
 by Tom Igoe
 
 This example code is in the public domain.
 
 http://www.arduino.cc/en/Tutorial/SerialEvent
 
 */

String inputString = "";         // a string to hold incoming data
boolean stringComplete = false;  // whether the string is complete

void setup() {
  // initialize serial:
  Serial.begin(115200);
  // reserve 200 bytes for the inputString:
  inputString.reserve(200);
}

void loop() {
  // print the string when a newline arrives:
  if (stringComplete) {
    Serial.print("한글:"); 
    Serial.print(inputString); 
    // clear the string:
    inputString = "";
    stringComplete = false;
  }
}

/*
  SerialEvent occurs whenever a new data comes in the
 hardware serial RX.  This routine is run between each
 time loop() runs, so using delay inside loop can delay
 response.  Multiple bytes of data may be available.
 */
void serialEvent() {
  while (Serial.available()) {
    // get the new byte:
    char inChar = (char)Serial.read(); 
    // add it to the inputString:
    inputString += inChar;
    // if the incoming character is a newline, set a flag
    // so the main loop can do something about it:
    if (inChar == '\n') {
      stringComplete = true;
    } 
  }
}


@cousteaulecommandant

Contributor

@cousteaulecommandant cousteaulecommandant commented Apr 29, 2016

If the multi-byte UTF8 char is fragmented the two consecutive calls to String constructor are not able to build the correct character.

With UTF-8 it is possible to detect whether a "chunk" of bytes ends in a single-byte (ASCII) character or a multi-byte sequence, and it is relatively easy to manually check whether this multi-byte sequence is complete or not (also the number of bytes that are in this chunk and the number of bytes that are missing). Therefore if a chunk ends in an incomplete multi-byte sequence, this sequence could be stripped and "saved for later", either "pushed back" with something like C's ungetc() if available, or by just saving it in an internal variable that will be prepended to the next chunk.

This involves the serial monitor being a bit smart though; plus the fix I'm mentioning is specific to UTF-8. If the InputStream solution is easy to implement and already takes care of this, it's probably a better solution.
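
The manual approach described above could look roughly like this. It is a sketch under assumptions, not a proposed patch: the class `Utf8TailSplitter` and its methods are hypothetical, it handles only UTF-8, and malformed input is simply passed on to the String constructor. Each call holds back an incomplete trailing sequence in an internal variable and prepends it to the next chunk.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class Utf8TailSplitter {
  private byte[] pending = new byte[0]; // bytes held back from the previous chunk

  /** Returns how many bytes at the end of buf form an incomplete UTF-8 sequence. */
  static int incompleteTailLength(byte[] buf) {
    int i = buf.length - 1;
    int back = 0;
    // Step back over at most 3 continuation bytes (10xxxxxx) to find the lead byte.
    while (i >= 0 && back < 3 && (buf[i] & 0xC0) == 0x80) {
      i--;
      back++;
    }
    if (i < 0) {
      return 0; // no lead byte found: leave it to the decoder
    }
    int lead = buf[i] & 0xFF;
    int expected;
    if ((lead & 0xE0) == 0xC0) {
      expected = 2;
    } else if ((lead & 0xF0) == 0xE0) {
      expected = 3;
    } else if ((lead & 0xF8) == 0xF0) {
      expected = 4;
    } else {
      return 0; // ASCII or invalid lead byte: nothing to hold back
    }
    int have = buf.length - i;
    return have < expected ? have : 0;
  }

  /** Decodes the complete part of pending + chunk and saves the rest for later. */
  String feed(byte[] chunk) {
    byte[] all = new byte[pending.length + chunk.length];
    System.arraycopy(pending, 0, all, 0, pending.length);
    System.arraycopy(chunk, 0, all, pending.length, chunk.length);
    int complete = all.length - incompleteTailLength(all);
    pending = Arrays.copyOfRange(all, complete, all.length);
    return new String(all, 0, complete, StandardCharsets.UTF_8);
  }

  public static void main(String[] args) {
    byte[] bytes = "\uD55C\uAE00".getBytes(StandardCharsets.UTF_8);
    Utf8TailSplitter splitter = new Utf8TailSplitter();
    String first = splitter.feed(Arrays.copyOfRange(bytes, 0, 4));
    String second = splitter.feed(Arrays.copyOfRange(bytes, 4, bytes.length));
    System.out.println((first + second).equals("\uD55C\uAE00")); // prints "true"
  }
}
```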

@PaulMurrayCbr


@PaulMurrayCbr PaulMurrayCbr commented May 12, 2016

Yeah - the issue is a design one. The serial monitor uses this "message" interface that works with strings, because when sending stuff via the monitor you type something and then hit return. But this doesn't work for receiving bytes. The "message" model is inappropriate for the serial monitor altogether.

@aknrdureegaesr


@aknrdureegaesr aknrdureegaesr commented Feb 5, 2017

As noted over at #4452:

The String constructor documentation advises using a CharsetDecoder instead when more control over the decoding is needed.

I think that's good advice. This would give control over the encoding used, which is the point of #4452.

Clean UTF-8 decoding even in the split character case is also a feature included in CharsetDecoder. It has the appropriate buffer that holds back the few bytes that belong to a not-yet completed character. See its documentation.

So using this would be an easy fix, with no need to completely redo the "message" model.

(FWIW: I think that model is not that bad a choice, actually.)
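
A minimal sketch of the CharsetDecoder approach might look like the following. The class `StreamingUtf8Decoder` and its `feed` method are hypothetical and only illustrate the idea. One detail worth hedging: as far as I understand the API, when `decode` is called with `endOfInput` set to `false`, the bytes of an unfinished character are left unconsumed in the input `ByteBuffer`, so the caller has to carry them over to the next call, which this sketch does by keeping the buffer around.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class StreamingUtf8Decoder {
  private final CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
      .onMalformedInput(CodingErrorAction.REPLACE)
      .onUnmappableCharacter(CodingErrorAction.REPLACE);

  // Unconsumed bytes from the previous call, between position and limit.
  private ByteBuffer pending = ByteBuffer.allocate(0);

  /** Decodes as much of pending + chunk as possible and keeps the unfinished tail. */
  String feed(byte[] chunk) {
    ByteBuffer in = ByteBuffer.allocate(pending.remaining() + chunk.length);
    in.put(pending);
    in.put(chunk);
    in.flip();

    // UTF-8 never yields more chars than input bytes, so this is large enough.
    CharBuffer out = CharBuffer.allocate(in.remaining());
    decoder.decode(in, out, false); // false: more input may follow later
    pending = in;                   // whatever was not consumed stays for next time

    out.flip();
    return out.toString();
  }

  public static void main(String[] args) {
    byte[] bytes = "\uD55C\uAE00".getBytes(StandardCharsets.UTF_8);
    StreamingUtf8Decoder streaming = new StreamingUtf8Decoder();
    String first = streaming.feed(java.util.Arrays.copyOfRange(bytes, 0, 4));
    String second = streaming.feed(java.util.Arrays.copyOfRange(bytes, 4, bytes.length));
    System.out.println((first + second).equals("\uD55C\uAE00")); // prints "true"
  }
}
```

Choosing a different charset for `newDecoder()` would also cover the configurable-encoding request, with the buffer sizing above revisited for encodings where one byte can expand to more than one char.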
