Splitting text to messages should be UTF-16 aware #95

nijel · 2015-03-03T07:34:57Z

From #8, reported by @browndav:

Preface: This is more of a request for comments than anything else, but you're welcome to do whatever you want with this patch. Saw the mention of UTF-16 and figured I should post.

A while back, I noticed that the multipart splitting code (in the outgoing Unicode-required case) wasn't UTF-16 aware – specifically, it would happily split a UTF-16 surrogate pair in the middle, which means the character is visually destroyed if the recipient is unable to reassemble the message (say, if the UDH was stripped by a telco – I've seen this happen in the wild between the US carriers AT&T and T-Mobile). Most emoji end up as a four-byte UTF-16 surrogate pair, so this issue was pretty easy for us to hit.

There was also a similar set of issues with a subset of joinable characters, specifically where the non-joined components don't make much sense in isolation: for instance, the "regional indicator symbol" flags, or the "combining diacritical marks"; all cases where multiple characters really only make sense when viewed as a single glyph.
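To make the joinable-character cases concrete, here is a minimal Python sketch of a boundary check. It is not taken from the patch: the function names `is_regional_indicator` and `is_safe_boundary` are hypothetical, it works on code points rather than UTF-16 code units, and it only covers combining marks and regional-indicator flags (not, for example, zero-width-joiner sequences).

```python
import unicodedata


def is_regional_indicator(ch):
    # Regional indicator symbols occupy U+1F1E6..U+1F1FF; two of them form a flag.
    return 0x1F1E6 <= ord(ch) <= 0x1F1FF


def is_safe_boundary(text, i):
    """Return True if cutting text into text[:i] and text[i:] keeps glyphs whole."""
    if i <= 0 or i >= len(text):
        return True
    nxt = text[i]
    # A combining mark belongs to the character before it.
    if unicodedata.combining(nxt) != 0:
        return False
    # Inside a run of regional indicators, only even offsets fall between flags.
    if is_regional_indicator(nxt):
        run = 0
        j = i - 1
        while j >= 0 and is_regional_indicator(text[j]):
            run += 1
            j -= 1
        if run % 2 == 1:
            return False
    return True
```

A splitter could step the cut position backwards until `is_safe_boundary` returns True, in the same way as for surrogate pairs.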

This is admittedly a bit of an edge case, but we've been using libgammu in developing-country situations where telco behavior/filtering is highly unpredictable, so it made sense at the time for us to patch it. What we've done is decrease the segment size for a single message segment iff splitting it at exactly 70 bytes would cut a (UTF-16) character in half. It seems to be working well for us so far.

Here's a raw diff against an old version, just so you can see what we're doing:
https://raw.githubusercontent.com/browndav/medic-os/master/platform/source/medic-core-1.6.0/patches/gammu-utf16-sms-multipart.diff
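For readers who don't want to open the diff, the idea described above can be sketched in a few lines of Python. This is an illustration only, not the patch itself and not libgammu's API: the helper names (`to_utf16_units`, `from_utf16_units`, `split_units`) are made up for this example, and the segment limit is a plain parameter rather than the real per-segment size.

```python
def to_utf16_units(text):
    # Encode to big-endian UTF-16 and return the 16-bit code units as integers.
    data = text.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


def from_utf16_units(units):
    # Inverse of to_utf16_units; only valid for segments with intact pairs.
    return b"".join(u.to_bytes(2, "big") for u in units).decode("utf-16-be")


def split_units(units, limit):
    """Split code units into segments of at most `limit` units, shortening a
    segment by one unit when it would otherwise end on a high surrogate."""
    if limit < 2:
        raise ValueError("limit must be at least 2 to hold a surrogate pair")
    segments = []
    start = 0
    while start < len(units):
        end = min(start + limit, len(units))
        # A high surrogate (0xD800..0xDBFF) at the cut means its low half
        # would land in the next segment, so back off by one unit.
        if end < len(units) and 0xD800 <= units[end - 1] <= 0xDBFF:
            end -= 1
        segments.append(units[start:end])
        start = end
    return segments


# "ab" + U+1F618 + "cd" is 6 code units; a naive split at 3 would cut the emoji.
parts = split_units(to_utf16_units("ab\U0001F618cd"), 3)
texts = [from_utf16_units(p) for p in parts]
```

Here `texts` holds three segments ("ab", the emoji plus "c", and "d"), whereas a naive split at 3 units would need only two. That is the trade-off with this approach: shortening a segment can occasionally push the text into one more message.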

I need to port this forward to the latest version for our internal use regardless; if you're at all interested, I'd be happy to create a new issue and send a PR, or fix it up in whatever way's needed to merge.

Thanks!

nijel · 2015-03-03T07:35:23Z

It probably makes sense, but on the other hand it can lead to a situation where one more message is created than actually needed. I'm not sure which approach is better in such a corner case.

codemonkeyforever · 2015-06-16T11:35:10Z

Hello,

I seem to have a problem that might be related to this one, but I am not sure. I am using gammu with an old phone (Samsung qbowl), and sometimes gammu splits a message into multiple messages even though it's a single message. In each case the message contained some Unicode characters, typically a special smiley often used by smartphone users. My phone could not display the character itself and showed a rectangle instead, but it displayed the text as a single message. Gammu instead broke the message into multiple SMS, split at the point where the Unicode character was.

I am not so well informed about Unicode; maybe you can tell me whether this is a well-known problem, or whether I should dig deeper and give more details about it.

Thank you for any comments on what the problem could be :-)

nijel · 2015-06-16T11:51:26Z

This patch has been merged, so try with Gammu 1.36.2: http://wammu.eu/download/gammu/1.36.2/

melones · 2017-02-15T13:09:19Z

Hi nijel,

I'm coming back to this issue, since we are encountering it in Gammu 1.38.0. When gammu-smsd receives a multipart message that includes emoji (UTF8/UTF16 characters), it fails to decode it properly and ends up with:
Error: ERROR: invalid byte sequence for encoding "UTF8": 0xeda0bd

The smsd log:

Wed 2017/02/15 13:55:12 gammu-smsd[17250]: Multipart message 0x86, 2 parts of 2
Wed 2017/02/15 13:55:12 gammu-smsd[17250]: Execute SQL: INSERT INTO inbox ("ReceivingDateTime", "Text", "SenderNumber", "Coding", "SMSCNumber", "UDH", "Class", "TextDecoded", "RecipientID") VALUES (now(), '005400680061006E006B00200079006F0075002E002000570065002000770069006C006C00200063006F006E007400610063007400200079006F007500200073006F006F006E00200078006F0078006F00200078006F0078006F00200061006E0064002000490020D83DDE18D83DDE18D83DDE18D83DDE18D83DDE18D83DDE18D83DDE18D83D', '+48661123456', 'Unicode_No_Compression', '+48790998250', '050003860201', -1, 'Thank you. We will contact you soon xoxo xoxo and I đź��đź��đź��đź��đź��đź��đź��í ˝', 'phone1')
Wed 2017/02/15 13:55:12 gammu-smsd[17250]: Error: ERROR:  invalid byte sequence for encoding "UTF8": 0xeda0bd

Wed 2017/02/15 13:55:12 gammu-smsd[17250]: SQL failure: 79
Wed 2017/02/15 13:55:12 gammu-smsd[17250]: Error writing to database (SMSDSQL_SaveInboxSMS)
Wed 2017/02/15 13:55:12 gammu-smsd[17250]: Error processing SMS: Error in executing SQL query. (SQL[79])

We use the PostgreSQL database backend. The problem is so nasty that gammu goes into an infinite loop and cannot process any incoming messages (even correct ones) after the problem occurs.
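The byte sequence in the error can be tied back to the hex in the "Text" column with a short Python check. This is a sketch for illustration, not gammu code: the helper names are hypothetical and `tail` is a shortened copy of the end of the logged hex string. The logged text ends in D83D, a high surrogate with no low surrogate after it, and encoding that lone surrogate the way UTF-8 would encode an ordinary code point yields exactly ED A0 BD. That suggests this part was cut in the middle of a surrogate pair by the sender.

```python
def utf16_units(hex_text):
    # Parse a hex string of big-endian UTF-16 into 16-bit code units.
    raw = bytes.fromhex(hex_text)
    return [int.from_bytes(raw[i:i + 2], "big") for i in range(0, len(raw), 2)]


def ends_with_lone_high_surrogate(units):
    # A high surrogate (0xD800..0xDBFF) as the last unit has lost its pair.
    return bool(units) and 0xD800 <= units[-1] <= 0xDBFF


# Shortened tail of the logged "Text" value: "I", space, one emoji, then D83D.
tail = "00490020D83DDE18D83D"
units = utf16_units(tail)

# "surrogatepass" lets Python emit the 3-byte form that strict UTF-8 forbids.
lone = chr(units[-1]).encode("utf-8", "surrogatepass")
```

`lone.hex()` is `eda0bd`, the sequence PostgreSQL rejects. A receiver that runs a check like `ends_with_lone_high_surrogate` on each part could hold the dangling unit back until the next part arrives instead of writing it to the database.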

nijel · 2017-03-01T11:56:51Z

@melones I've created a separate issue for that: #281. Next time, please open a new issue yourself instead of commenting on one that has been closed for years, thanks.

nijel added the enhancement label Mar 3, 2015

nijel mentioned this issue Mar 3, 2015

unfinite error loop on inserting non-UTF8 character string to DB #8

Closed

ghost mentioned this issue Mar 28, 2015

Add character or SMS count in Send Message dialogue box medic/cht-core#859

Closed

nijel closed this as completed Nov 24, 2015

nijel mentioned this issue Mar 1, 2017

Invalid UTF-8 for multipart messages in SMSD SQL #281

Closed

melones mentioned this issue Feb 14, 2018

Invalid UTF-8 for multipart messages in SMSD SQL (still exists) #389

Closed
