unicode (utf8 actually) characters not handled correctly #4

Closed
thommey opened this issue Feb 2, 2010 · 10 comments

@thommey
Member

thommey commented Feb 2, 2010

Trac Data
Ticket 4
Reported by Arkadiusz Miskiewicz
Status assigned
Component Core
Priority blocker
Milestone 1.8.0
Keywords utf unicode
Version 1.8.0 CVS

My setup:

eggdrop 1.6.17
tcl 8.5a6
Linux 2.6 with latest version of libraries and all stuff
freenode network
irssi client

Testing is simple, tcl script that echoes data back to channel:
bind pub - "utf" pub_proc
proc pub_proc { nick idx handle channel szoveg } {
putmsg $channel "$szoveg"
}

Now on original 1.6.17 when entering utf8 characters this happens:
21:39 < arekm8> utf óąłńś
21:39 < utftest> �EBD[
(some crap is echoed)

After patching src/tcl.c utf_convert() with:

- byteptr = (char *) Tcl_GetByteArrayFromObj(objv[i], &len);
+ byteptr = (char *) Tcl_GetStringFromObj(objv[i], &len);

I get:
21:57 < arekm8> utf óąłńś
21:57 < utftest> óąłńś

It works - proper characters are echoed back.

No idea why ByteArray is used in utf_convert() so I'm not sure if the fix is
correct. Is it?
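
For readers wondering why the one-line change matters, here is a minimal C sketch of the two views of a string. It does not call the Tcl API. It assumes, as far as I understand Tcl 8.x behaviour, that the byte-array conversion keeps only the low 8 bits of each character, while the string accessor returns UTF-8. The names `low_bytes`, `utf8_encode` and `SAMPLE` are made up for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* "óąłńś" as Unicode code points. */
const uint32_t SAMPLE[5] = { 0x00F3, 0x0105, 0x0142, 0x0144, 0x015B };

/* Byte-array view (assumed Tcl 8.x behaviour): keep only the low
 * 8 bits of each code point, one output byte per character. */
size_t low_bytes(const uint32_t *cps, size_t n, unsigned char *out)
{
    for (size_t i = 0; i < n; i++)
        out[i] = (unsigned char)(cps[i] & 0xFF);
    return n;
}

/* String view: encode each code point as UTF-8 (1 to 4 bytes).
 * Returns the number of bytes written to out. */
size_t utf8_encode(const uint32_t *cps, size_t n, unsigned char *out)
{
    size_t len = 0;
    for (size_t i = 0; i < n; i++) {
        uint32_t c = cps[i];
        if (c < 0x80) {
            out[len++] = (unsigned char)c;
        } else if (c < 0x800) {
            out[len++] = (unsigned char)(0xC0 | (c >> 6));
            out[len++] = (unsigned char)(0x80 | (c & 0x3F));
        } else if (c < 0x10000) {
            out[len++] = (unsigned char)(0xE0 | (c >> 12));
            out[len++] = (unsigned char)(0x80 | ((c >> 6) & 0x3F));
            out[len++] = (unsigned char)(0x80 | (c & 0x3F));
        } else {
            out[len++] = (unsigned char)(0xF0 | (c >> 18));
            out[len++] = (unsigned char)(0x80 | ((c >> 12) & 0x3F));
            out[len++] = (unsigned char)(0x80 | ((c >> 6) & 0x3F));
            out[len++] = (unsigned char)(0x80 | (c & 0x3F));
        }
    }
    return len;
}
```

Calling `low_bytes(SAMPLE, 5, buf)` gives the bytes F3 05 42 44 5B. The last three are the ASCII characters `B`, `D` and `[`, which would be consistent with the `�EBD[` echoed by the unpatched bot. `utf8_encode` gives C3 B3 C4 85 C5 82 C5 84 C5 9B, the UTF-8 form of óąłńś.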

@thommey
Member Author

thommey commented Mar 30, 2012

Comment by @skralg
putlog may handle utf-8 separately and need additional changes

@thommey
Member Author

thommey commented Feb 5, 2016

todo: fix encoding of language files

@vanosg vanosg modified the milestones: v1.8.1, v1.8.0 Jul 9, 2016
@fred0r

fred0r commented Aug 27, 2016

Just checked out and still needed to edit tcl.c to get utf-8 output.
Please default to utf-8 - it's 2016.

@thommey thommey removed this from the v1.8.1 milestone Dec 15, 2016
@vanosg vanosg added this to the v1.9.0 milestone Mar 4, 2017
@ghost

ghost commented Aug 22, 2018

[Edit//Note: these tests are on TCL 8.6.8 bots. This may be a TCL version-specific issue.]

I can't duplicate this with the current version (1.8.3 / 1080303 [RC?])

<@Ami-Laptop> ## bind pub - utf >utf ; proc >utf [list 1 2 3 4 5] { putquick "PRIVMSG $4 :$5" ; return 1 }
<&Samantha> [TCL (0ms)] 
<@Ami-Laptop> utf 水野 舞
<&Samantha> 水野 舞
<%Agnes> 水野 舞
<@Ami-Laptop> ## encoding system
<&Samantha> [TCL (0ms)] utf-8
<@Ami-Laptop> >> encoding system
<%Agnes> [TCL (0ms)] iso8859-1

The original poster used PUTMSG, so:

<@Ami-Laptop> ## putmsg #StormBot {水野 舞}
<&Samantha> 水野 舞
<@Ami-Laptop> ## putmsg #StormBot "水野 舞"
<&Samantha> 水野 舞

<@Ami-Laptop> >> putmsg #StormBot {水野 舞}
<%Agnes> 水野 舞
<@Ami-Laptop> >> putmsg #StormBot "水野 舞"
<%Agnes> 水野 舞

Seems fine, using braces or double-quotes, raw output or through PUTMSG.

@makk-mma

makk-mma commented Jan 2, 2020

I think I'm having a similar/related issue, and it's easy to reproduce. Try the commands below from an eggdrop console and notice how TCL seems to properly set and return the skull-and-crossbones emoji ( https://www.fileformat.info/info/unicode/char/1f571/index.htm ). However, the moment you try to output it with putlog or putserv, the character becomes mangled.

.tcl set z \xF0\x9F\x95\xB1
Tcl: 🕱
.tcl set y foo${z}bar
Tcl: foo🕱bar
.tcl putlog $z
[1/2 14:17:28] �
Tcl:
.tcl putlog $y
[1/2 14:37:21] foo�bar
Tcl:

I'm guessing that putlog/putserv are mangling the UTF-8 string by encoding it with iso8859-1 or something before output. I'm using the latest stable eggdrop, version 1.8.4, and the tcl version is 8.6.
I also tried recompiling eggdrop by changing all instances of "iso8859-1" in src/tcl.c to "utf-8". I even tried recompiling after changing the shell's locale to different variations (first POSIX, then en_US.utf8, and finally C.UTF-8). None of these attempts fixed the problem. They only seemed to change the default TCL [encoding system], which simply caused it to mangle the emoji character differently.

This might be of interest, too:

.tcl encoding system
Tcl: iso8859-1
.tcl set z
Tcl: 🕱
.tcl encoding system utf-8
Tcl:
.tcl set z
Tcl: �
.tcl encoding system iso8859-1
Tcl:
.tcl set z
Tcl: 🕱
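
One possible reading of the toggle above, offered as an assumption and not a confirmed diagnosis: `\xF0\x9F\x95\xB1` creates four separate characters (U+00F0, U+009F, U+0095, U+00B1) rather than one emoji. Written out as ISO 8859-1 they become the original four bytes again, which a UTF-8 terminal shows as the emoji. Written out as UTF-8 each of them is encoded again, and the result is no longer the emoji. The C sketch below shows only that second step; `latin1_to_utf8` is a hypothetical helper, not Eggdrop or Tcl code.

```c
#include <stddef.h>

/* Treat every input byte as one ISO 8859-1 character (U+0000..U+00FF)
 * and re-encode it as UTF-8. Bytes below 0x80 are copied unchanged;
 * bytes from 0x80 upward become two bytes each.
 * out must have room for 2 * n bytes. Returns the bytes written. */
size_t latin1_to_utf8(const unsigned char *in, size_t n, unsigned char *out)
{
    size_t len = 0;
    for (size_t i = 0; i < n; i++) {
        if (in[i] < 0x80) {
            out[len++] = in[i];
        } else {
            out[len++] = (unsigned char)(0xC0 | (in[i] >> 6));
            out[len++] = (unsigned char)(0x80 | (in[i] & 0x3F));
        }
    }
    return len;
}
```

Feeding it the four bytes F0 9F 95 B1 produces the eight bytes C3 B0 C2 9F C2 95 C2 B1, which is a different byte sequence from the emoji's UTF-8 encoding.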

@vanosg
Member

vanosg commented Jan 5, 2020

Hi @makk-mma thanks for taking the time to ask about this here. Someone with more knowledge on the topic than I should hopefully get a chance to give you a better answer here soon, but I believe it has to do with Tcl itself not supporting emoji-range unicode characters without some special compilation features in it. I don't have a fix for it handy here, but hang on for a little bit and someone else should chime in soon. Thanks!

@vanosg vanosg removed this from the v1.9.0 milestone Mar 14, 2020
@vanosg
Member

vanosg commented Apr 3, 2020

So here's what we've found out, after looking into this (finally!) in a somewhat-robust fashion:

The encoding scheme used by Eggdrop's Tcl interface is set based on the locale settings of the host machine. You can check which locale your host machine is using by running the locale command. Eggdrop takes that locale setting of the host machine and compares it to the locales available within Tcl's installed libraries. If it finds one in Tcl that matches (or is close to matching), that is the encoding scheme that is used. If a matching encoding scheme is not found, only then does eggdrop default to ISO 8859-1 encoding.

So in short, the popular patch at http://eggwiki.org/Bugs/Utf-8 only works if the locale is not set/found on the host machine.

If you want Eggdrop to use a specific encoding scheme that it is not currently using, you can view the available locales on your machine via the locale -a command, and then set the one you want to use for that user with export LANG=en_US.UTF-8 (or whichever scheme you want to use).
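
To make the link between a locale name and an encoding concrete, here is a small C sketch that pulls the codeset out of a POSIX-style locale name such as `en_US.UTF-8`. It only illustrates the name format `language[_territory][.codeset][@modifier]`. It is not how Eggdrop or Tcl actually pick the encoding, and `locale_codeset` is a hypothetical helper.

```c
#include <stddef.h>
#include <string.h>

/* Copy the codeset part of a locale name into out (size outsz).
 * Returns 1 if the name contains a codeset, 0 otherwise
 * (in which case out is set to the empty string). */
int locale_codeset(const char *locale, char *out, size_t outsz)
{
    if (out == NULL || outsz == 0)
        return 0;
    out[0] = '\0';
    if (locale == NULL)
        return 0;

    const char *dot = strchr(locale, '.');
    if (dot == NULL)
        return 0;                          /* e.g. "C" or "POSIX" */

    const char *start = dot + 1;
    size_t len = strcspn(start, "@");      /* stop before any modifier */
    if (len == 0)
        return 0;
    if (len >= outsz)
        len = outsz - 1;                   /* truncate to fit */
    memcpy(out, start, len);
    out[len] = '\0';
    return 1;
}
```

With `LANG=en_US.UTF-8` the helper returns `UTF-8`. With `LANG=C` it reports no codeset, which corresponds to the case described above where no matching encoding is found and ISO 8859-1 is used.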

If your experience with this differs, please don't just flame a response here- find us at #eggdrop on Freenode so we can learn more about your system environment and better address this issue.

Another comment raised by @makk-mma talked about emoji that were not supported; that is a "feature" of Tcl. The very helpful Tcl wiki page on the subject simply states "emoji support isn't enabled by default, recompiling with TCL_UTF_MAX=6 is needed". I am unaware of a package-manager install that would remedy this at this time; if I find one I will update this post, or hopefully one of you geniuses out there will post a better/more thorough set of steps below. Edit: This proposal provides additional information on the 'why' behind this: https://core.tcl-lang.org/tips/doc/trunk/tip/389.md
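
As a rough illustration of why the emoji from the earlier comment is affected, the C sketch below decodes a UTF-8 sequence and checks whether the code point lies above U+FFFF. My understanding, stated as an assumption, is that a Tcl built with TCL_UTF_MAX 3 is limited to characters that fit in three UTF-8 bytes, which means code points up to U+FFFF. The functions `utf8_decode` and `outside_bmp` are illustrative names, and the decoder is simplified (it does not reject overlong forms or surrogates).

```c
#include <stddef.h>
#include <stdint.h>

/* Decode one UTF-8 sequence from s (n bytes available).
 * Stores the code point in *cp and returns the number of bytes
 * consumed (1 to 4), or 0 if the input is malformed or truncated. */
int utf8_decode(const unsigned char *s, size_t n, uint32_t *cp)
{
    int len;
    uint32_t c;

    if (n == 0)
        return 0;
    if (s[0] < 0x80)                { len = 1; c = s[0]; }
    else if ((s[0] & 0xE0) == 0xC0) { len = 2; c = s[0] & 0x1F; }
    else if ((s[0] & 0xF0) == 0xE0) { len = 3; c = s[0] & 0x0F; }
    else if ((s[0] & 0xF8) == 0xF0) { len = 4; c = s[0] & 0x07; }
    else
        return 0;                   /* stray continuation or invalid lead byte */

    if ((size_t)len > n)
        return 0;                   /* sequence is cut short */
    for (int i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80)
            return 0;               /* not a continuation byte */
        c = (c << 6) | (uint32_t)(s[i] & 0x3F);
    }
    *cp = c;
    return len;
}

/* Code points above U+FFFF lie outside the Basic Multilingual Plane
 * and need four bytes in UTF-8. */
int outside_bmp(uint32_t cp)
{
    return cp > 0xFFFF;
}
```

The bytes F0 9F 95 B1 decode to U+1F571 in four bytes, so `outside_bmp` is true for it. The character 水 (E6 B0 B4) decodes to U+6C34 in three bytes and is inside the BMP, which matches the earlier report that CJK text echoed correctly.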

So to summarize that last paragraph: if you think UTF-8 isn't working for you, try some of the lower-numbered characters like 👍, or simply run .tcl encoding system on the partyline. If that returns utf-8, then your UTF-8 encoding is working. It's just that not all Unicode characters are supported by Tcl without a different compile flag set.

I hope this helps, and I am looking forward to feedback that may clarify or enhance this post. I'll add something to the wiki/docs on this subject as well, to help get the word out. If no other comments are raised to the contrary, we'll (finally!) close this issue shortly.

Edit: To recompile Tcl, download the source and edit generic/tcl.h . Look for the line #define TCL_UTF_MAX 3 and change it to #define TCL_UTF_MAX 6. Then follow the instructions to compile and install Tcl. You'll need to recompile Eggdrop, and may need to specify --with-tcllib and --with-tclinc to point to the new location you installed Tcl to.

Anecdotally, a user used the following line to compile Eggdrop and found success:

export LD_LIBRARY_PATH=/usr/local/lib; ./configure --with-tclinc=/usr/local/include/tcl.h --with-tcllib=/usr/local/lib/libtcl8.6.so

@vanosg
Member

vanosg commented Apr 25, 2020

For others stumbling onto this thread, another cause of issues (incorrectly attributed to Eggdrop) was a user connecting to the shell/Eggdrop with PuTTY, using a terminal that did not support UTF-8 (either the terminal or the font; I was unable to determine which from the troubleshooting). Switching to a different terminal program resolved the issue.

There were also issues copy/pasting unicode characters instead of using ctrl-shift u [code] (unix) or [code] alt-X (windows) to create the unicode character.

@tlcu

tlcu commented May 2, 2020

For those interested in emoji support, the Tcl KitCreator can build Tcl, libraries, etc. compiled with TCL_UTF_MAX=6: https://kitcreator.rkeene.org/kitcreator

@vanosg
Member

vanosg commented May 5, 2020

@tlcu Thanks a ton for that discovery. For those who are interested, you can download a compiled library and SDK from rkeene (he's good people). Select your OS (probably Linux/amd64), pick your version of Tcl, and make your selections. For emoji support, definitely choose "TCL_UTF_MAX=6 (incompatibility with standard Tcl)" as an option. You can add in things like TLS and Tcllib if you think you'll need those (good to have, just in case...), but you'll also want to make sure you select the "Build Library (KitDLL)" option. Once that builds for you, grab the .so, and also click the link at the top of the page that says "SDK URL": that will give you things like TclConfig.sh and tcl.h, which you'll want to compile against. Put those up on your shell and use the ./configure options listed above to point Eggdrop at that library, and you should be good to go. Thanks to @rkeene for an awesome build system for those without the means to compile by themselves.
