Splitting text to messages should be UTF-16 aware #95

Closed
nijel opened this issue Mar 3, 2015 · 5 comments

Comments

@nijel
Member

nijel commented Mar 3, 2015

From #8, reported by @browndav:

Hi @nijel,

Preface: This is more of a request for comments than anything else, but you're welcome to do whatever you want with this patch. Saw the mention of UTF-16 and figured I should post.

A while back, I noticed that the multipart splitting code (in the outgoing Unicode-required case) wasn't UTF-16 aware – specifically, it would happily split a UTF-16 surrogate pair in the middle, which means the character is visually destroyed if the recipient is unable to reassemble the message (say, if the UDH was stripped by a telco – I've seen this happen in the wild between the US carriers AT&T and T-Mobile). Most emoji end up as a four-byte UTF-16 surrogate pair, so this issue was pretty easy for us to hit.

There was also a similar set of issues with a subset of joinable characters, specifically where the non-joined components don't make much sense in isolation: for instance, the "regional indicator symbol" flags, or the "combining diacritical marks"; all cases where multiple characters really only make sense when viewed as a single glyph.
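As a rough illustration of the two cases just mentioned, the sketch below checks whether a split point would separate a combining mark from its base character, or cut a regional-indicator flag pair in half. It is not taken from the patch: the function names are hypothetical, it works on Python code points rather than UTF-16 code units, and it covers only these two cases rather than full grapheme cluster rules.

```python
import unicodedata


def is_regional_indicator(ch):
    """True for the regional indicator symbols U+1F1E6..U+1F1FF."""
    return 0x1F1E6 <= ord(ch) <= 0x1F1FF


def safe_to_break_before(text, index):
    """Return True if splitting into text[:index] and text[index:] keeps
    combining sequences and flag pairs intact (hypothetical helper)."""
    if index <= 0 or index >= len(text):
        return True
    nxt = text[index]
    # A combining mark must stay attached to the character before it.
    if unicodedata.combining(nxt) != 0:
        return False
    if is_regional_indicator(nxt):
        # Count the regional indicators immediately before the split point.
        # An odd count means the split would land inside a flag pair.
        run = 0
        i = index - 1
        while i >= 0 and is_regional_indicator(text[i]):
            run += 1
            i -= 1
        if run % 2 == 1:
            return False
    return True
```

For example, `safe_to_break_before("e\u0301x", 1)` is false because the split would strand the combining acute accent, while splitting at index 2 is fine.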

This is admittedly a bit of an edge case, but we've been using libgammu in developing-country situations where telco behavior/filtering is highly unpredictable, so it made sense at the time for us to patch it. What we've done is decrease the segment size for a single message segment iff splitting it at exactly 70 characters (UTF-16 code units) would cut a (UTF-16) character in half. It seems to be working well for us so far.
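The idea of shrinking a segment only when the boundary would fall inside a surrogate pair can be sketched as follows. This is a minimal illustration, not the code from the patch: the function names are hypothetical, and the default limit of 67 code units is an assumption (the usual capacity of one UCS-2 segment once a 6-byte concatenation header is present); the patch may use different constants.

```python
import struct


def utf16_units(text):
    """Return the text as a list of UTF-16 code units (integers)."""
    data = text.encode("utf-16-be")
    return list(struct.unpack(">%dH" % (len(data) // 2), data))


def units_to_text(units):
    """Decode a list of UTF-16 code units; raises on a lone surrogate."""
    return struct.pack(">%dH" % len(units), *units).decode("utf-16-be")


def is_high_surrogate(unit):
    return 0xD800 <= unit <= 0xDBFF


def split_units(units, limit=67):
    """Split code units into segments of at most `limit` units,
    never ending a segment on the first half of a surrogate pair.
    `limit` is an assumed value, see the note above."""
    if limit < 2:
        raise ValueError("limit must leave room for a surrogate pair")
    segments = []
    start = 0
    while start < len(units):
        end = min(start + limit, len(units))
        # If more text follows and this segment would end on a high
        # surrogate, give up one unit so the pair moves to the next segment.
        if end < len(units) and is_high_surrogate(units[end - 1]):
            end -= 1
        segments.append(units[start:end])
        start = end
    return segments
```

With 66 ASCII characters followed by an emoji, the first segment stops at 66 units instead of 67, and every segment decodes cleanly on its own even if the recipient cannot reassemble the message.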

Here's a raw diff against an old version, just so you can see what we're doing:
https://raw.githubusercontent.com/browndav/medic-os/master/platform/source/medic-core-1.6.0/patches/gammu-utf16-sms-multipart.diff

I need to port this forward to the latest version for our internal use regardless; if you're at all interested, I'd be happy to create a new issue and send a PR, or fix it up in whatever way's needed to merge.

Thanks!

@nijel
Member Author

nijel commented Mar 3, 2015

It probably makes sense, but on the other hand it can lead to a situation where one more message is created than is actually needed. I'm not sure what the better approach is in such a corner case.
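The corner case can be made concrete with a small counting sketch. It assumes 67 UTF-16 code units per segment, and `count_segments` is a hypothetical helper, not a Gammu function.

```python
def count_segments(text, limit=67, pair_aware=True):
    """Count the segments needed for `text`, with or without shrinking a
    segment that would otherwise end on a high surrogate.
    `limit` (code units per segment) is an assumed value."""
    if limit < 2:
        raise ValueError("limit must leave room for a surrogate pair")
    data = text.encode("utf-16-be")
    total = len(data) // 2
    count = 0
    start = 0
    while start < total:
        end = min(start + limit, total)
        if pair_aware and end < total:
            last = int.from_bytes(data[2 * (end - 1):2 * end], "big")
            if 0xD800 <= last <= 0xDBFF:
                end -= 1  # keep the surrogate pair together
        count += 1
        start = end
    return count


# 134 code units fill exactly two segments when split blindly, but the
# emoji straddles the first boundary, so the pair-aware split needs three.
text = "a" * 66 + "\U0001F618" + "b" * 66
naive = count_segments(text, pair_aware=False)
aware = count_segments(text, pair_aware=True)
```

The extra segment appears only when the text nearly fills its last segment and a pair lands exactly on a boundary; in exchange, no segment ever carries half a character.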

@codemonkeyforever

Hello,

I seem to have a problem which might be related to this one, but I am not sure. I am using gammu with an old phone (Samsung qbowl), and sometimes gammu splits a message into multiple messages even though it's a single message. In each case the message contained some Unicode characters, namely a special smiley that is often used by smartphone users. My phone could not display the character and showed a rectangle instead, but it displayed the text as a single message. Gammu instead broke the message into multiple SMS, split at the point where the Unicode character was.

I am not so well informed about Unicode. Maybe you can tell me whether this is a well-known problem, or whether I should dig deeper into it and give more details.

Thank you for any comments on what the problem could be :-)

@nijel
Member Author

nijel commented Jun 16, 2015

This patch has been merged, so try with Gammu 1.36.2: http://wammu.eu/download/gammu/1.36.2/

@nijel nijel closed this as completed Nov 24, 2015
@melones

melones commented Feb 15, 2017

Hi nijel,

I'm coming back to this issue, since we are encountering it in Gammu 1.38.0. When gammu-smsd receives a multipart message that includes emoji (UTF-8/UTF-16 characters), it fails to decode the message properly and ends up with:
Error: ERROR: invalid byte sequence for encoding "UTF8": 0xeda0bd

The smsd log:

Wed 2017/02/15 13:55:12 gammu-smsd[17250]: Multipart message 0x86, 2 parts of 2
Wed 2017/02/15 13:55:12 gammu-smsd[17250]: Execute SQL: INSERT INTO inbox ("ReceivingDateTime", "Text", "SenderNumber", "Coding", "SMSCNumber", "UDH", "Class", "TextDecoded", "RecipientID") VALUES (now(), '005400680061006E006B00200079006F0075002E002000570065002000770069006C006C00200063006F006E007400610063007400200079006F007500200073006F006F006E00200078006F0078006F00200078006F0078006F00200061006E0064002000490020D83DDE18D83DDE18D83DDE18D83DDE18D83DDE18D83DDE18D83DDE18D83D', '+48661123456', 'Unicode_No_Compression', '+48790998250', '050003860201', -1, 'Thank you. We will contact you soon xoxo xoxo and I 😘😘😘😘😘😘😘�', 'phone1')
Wed 2017/02/15 13:55:12 gammu-smsd[17250]: Error: ERROR:  invalid byte sequence for encoding "UTF8": 0xeda0bd

Wed 2017/02/15 13:55:12 gammu-smsd[17250]: SQL failure: 79
Wed 2017/02/15 13:55:12 gammu-smsd[17250]: Error writing to database (SMSDSQL_SaveInboxSMS)
Wed 2017/02/15 13:55:12 gammu-smsd[17250]: Error processing SMS: Error in executing SQL query. (SQL[79])

We use the PostgreSQL database backend. The problem is nasty enough that gammu goes into an infinite loop and cannot process any incoming messages (even correct ones) once it occurs.
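The byte sequence in the error can be reproduced outside Gammu. The hex text in the log ends with `D83D`, a high surrogate with no low surrogate after it, and encoding that lone code unit as if it were a character yields exactly `ED A0 BD`, which strict UTF-8 validation rejects. The sketch below shows this, plus one possible way to hold back a dangling high surrogate until the rest arrives. `decode_ucs2_hex` is a hypothetical helper and does not describe how gammu-smsd handles it.

```python
def decode_ucs2_hex(hex_text):
    """Decode hex-encoded UTF-16-BE text. A trailing lone high surrogate
    is withheld and returned as `pending` bytes instead of being decoded.
    Hypothetical helper, not part of Gammu."""
    raw = bytes.fromhex(hex_text)
    pending = b""
    if len(raw) >= 2:
        last = int.from_bytes(raw[-2:], "big")
        if 0xD800 <= last <= 0xDBFF:
            raw, pending = raw[:-2], raw[-2:]
    return raw.decode("utf-16-be"), pending


# A lone U+D83D forced through UTF-8 gives the bytes from the error message.
lone = "\ud83d".encode("utf-8", "surrogatepass")

# Tail of the hex string from the log: "I ", one full emoji, then a lone D83D.
text, pending = decode_ucs2_hex("00490020D83DDE18D83D")

# If the missing low surrogate (DE18 here) arrives later, the pair completes.
completed = (pending + bytes.fromhex("DE18")).decode("utf-16-be")
```

Here `text` contains only complete characters and is safe to store, while `pending` keeps the two bytes that cannot be decoded on their own.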

@nijel
Member Author

nijel commented Mar 1, 2017

@melones I've created a separate issue for that: #281. Next time, please open a new issue right away instead of commenting on one that has been closed for years, thanks.
