New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Second try for utf support. #532

Open
wants to merge 2 commits into
base: master
from

Conversation

Projects
None yet
5 participants
@mepeisen

mepeisen commented Mar 18, 2018

No description provided.

@mepeisen

This comment has been minimized.

mepeisen commented Mar 18, 2018

Resource pack for Unicode-Font:
https://github.com/mepeisen/CCraft-Gnu-Unifont/files/1823094/CCraft-Gnu-Unifont.zip

Sample lua script:
test.lua.txt

Custom build including the patch:
ComputerCraft-1.80pr1.jar.zip
build is based on forge 1.12-14.21.1.2387

Extensions:

  • A config Option for enabling utf8 automatically
  • Each Computer has ist own state if utf8 is enabled.
  • New methods: os.isUtf8() and os.setUtf8()
  • String api works on bytes (disabled utf) or on utf characters (utf enabled)
  • Fonts can be loaded via resource packs
  • New methods: term.getFontName() and term.setFontName()
  • Each Computer and Monitor contains ist own font
  • Full working clipboard

Known limitations/ bugs:

  • Switching utf mode on terminals that already contain data may result in errors (window/term.blint may return Errors)
  • Monitors not yet fully tested
  • No valid fallback if resource pack is found but inactive
  • No config option for default font name.

Hope you enjoy

@mepeisen mepeisen referenced this pull request Mar 18, 2018

Closed

Beginning unicode support #531

@SquidDev

This comment has been minimized.

Contributor

SquidDev commented Mar 19, 2018

Finally got round to looking at this, sorry. It looks much nicer. 👍 Just a couple of questions/comments:

New methods: term.getFontName() and term.setFontName()
Each Computer and Monitor contains ist own font

I'm rather confused about the whole font name implementation and the rationale behind it. It seems a little overkill - I'd personally just have two fonts (CC's default one and a unicode supporting one) and a flag to switch between the two. Supporting anything else introduces all sorts of issues if the server is asking for a font the client doesn't have.

I'd personally remove any storing fonts on the tile entity: it should just be stored on the terminal object and configured on computer startup, not persisted across saves.

String api works on bytes (disabled utf) or on utf characters (utf enabled)

I'm a bit flakey about this. Most of Lua's IO system (and thus ComputerCraft's) rather assumes strings are nothing more than byte arrays. For instance, file handles' just return always return strings, even in binary mode. I'd rather expose something like Lua5.3's utf8 library: it means you don't have any inconsistency between the string metatable and string global table (grumbles about LuaJ).

I think it's also worth noting that indexing strings is now O(n) instead of O(1), so we might want to review how the window API is implemented. Not that I think it'd be a major issue given CC's small screen sizes. I'm generally inclined to shift as much work to the Lua side as possible, as it's less finicky then messing with the "beauty" which is LuaJ.


It's worth noting that I'm just spitballing here. It's probably worth waiting for Dan to comment before making any major changes, as he's the only one with authority round here :).

Other people: please chip in! I know I'm absolutely awful at API design.


Edit: And another thing (sorry.

  • It might also be worth thinking about how to handle some of the weirder parts of unicode (and Java's handling of it). IIRC Java doesn't really like codepoints outside the 16 bit range (🙌 is two characters) so we either ignore them or do some funky stuff inside the font renderer.

    Similarly, we should probably think about the immesurable joy which is zero-width-joiners/spaces, skin tone modifiers, etc... I'd argue we can just ignore them (we don't really need a full unicode implementation after all), but still worth bearing in mind.

  • I'm not sure how you built the modified LuaJ file, but I had to apply the following diff to the build.xml file. Grumbles about Ant.

  • Even with UTF support disabled, this reintroduces "the string bug". Namely sending binary data as a string entirely garbles it (os.queueEvent("x", "\200") print(os.pullEvent("x")) will result in "\239\191\189"). I suspect the LuaStringString conversion methods will also have to obey the global utf8 setting.

@mepeisen

This comment has been minimized.

mepeisen commented Mar 19, 2018

I'm rather confused about the whole font name implementation and the rationale behind it. It seems a little overkill

Yes. But that assumes the client has a know fontset we know of. Becaue of licening poblem we do not know the font the client likes to use. And maybe someone likes to use multiple Fonts on different Monitors, some monospaced adventure font set to display a "film" etc.
Yes, a bit overkill,. A flag would do the same. :-)

I'd personally remove any storing fonts on the tile entity: it should just be stored on the terminal object and configured on computer startup, not persisted across saves.

I would have a closer look at it. The saveDescription is used to distribute it to client over Messages. So I have chosen it. I will review it. It is not meant to be stored persistent at all.

For instance, file handles' just return always return strings

This assumes a convention where all api developers and program developers always use this utf library. The main problem are the blint methods in terminals. They assume that the color strings are of same length than the given text.
If a library sees a string of byte length 3. And prints it using color strings of three bytes and the terminal or Java methods see a string of length 1, what should the methods do? How should we Interpret the call? Which color to take?
So either we won't be backward compatible because we say "never use the string-methods again but use utf compatible methods" or we try to find a compromise for older apis etc.

so we either ignore them or do some funky stuff inside the font renderer

Three aspects:

  1. Java always gives us a length of two for characters not Fitting in "char" space. Javaa char are at the Moment two bytes long and encoding in utf-16. On modern operating systems/ fonts it will work because the Java awt library etc. know how to handle it but Java itself will have the same problem than lua only within other code ranges :O)

  2. Yes, having a full unicode implementation for all funny Things would be really a big change. Example: Ḧ̈̈̈äl̈l̈ö
    This is made of classic ASCII characters including combining diaeresis. Characters that influence apperance of previous character. Supporting thonse funny things will be very cool but also a big pain.

  3. Displaying all as characters build from png textures is a bit of pain. I found out that OpenGL or something else has problems if the texture grows too big. Packaging more than 65536 characters won't be possible at all. And still unicode has limitations. If we decide to Support those characters at all we should do a review how the FixedFontRenderer works.

I'm not sure how you built the modified LuaJ file

I modified build.gradle to include the luja source folders. And removed luaj jar from libs and instead add the libraries luaj requires into libs folder. It is not using the ant build at all.
I did not commit this one. But maybe dan can give us the answer how he built it. I see some build scripts but maybe he built it not via this scripts :-)

Even with UTF support disabled, this reintroduces "the string bug"

I will review it. But why this is a problem at all? If a script really likes to send binary values over events in this way it ran into Trouble even with old versions of the lib. They eat the values and send '?' If I remember right. So one always gets not the expected result right?
A better solution would be to send integers over pushEvent...

@dmarcuse

This comment has been minimized.

Contributor

dmarcuse commented Mar 20, 2018

I agree with squid with regards to the Lua 5.3 utf8 library. I think it would be a much better idea to implement a new API for UTF-8 rather than trying to implement it on top of the existing string type, which as we can see, is prone to lots of issues with the way LuaJ works.

@Wilma456

This comment has been minimized.

Contributor

Wilma456 commented Mar 20, 2018

But what's with the existing Input methods like the char or the paste event? They works with a normal String. If utf8 is only a API, it would not possible to integrate new Chars from UTF8 in these events.

@dmarcuse

This comment has been minimized.

Contributor

dmarcuse commented Mar 20, 2018

A utf8_paste event or similar could be added and used when utf8 content is pasted.

@mepeisen

This comment has been minimized.

mepeisen commented Mar 20, 2018

To sum up:

  1. UTF8 escape strings would be possible if we customize lexer; see https://www.lua.org/manual/5.3/manual.html#3.1
    1.1) Since LuaJ supports it out of the box any source file saved as utf8 implicit creates a valid utf byte sequence without need of escape sequences.

  2. Add utf API: https://www.lua.org/manual/5.3/manual.html#6.5
    2.1) For better usage I would like to add more functions (sub, gmatch etc.). Basically everything that is present in default string library.

  3. We add following new functions:
    3.1 printutf8
    3.2 term.writeutf8
    3.3 term.blintutf8

  4. We customize the bios.lua etc. (the CraftOS) to support utf encodings.

  5. pressed keys fire both, a "char"/"key_up"/"key" Event as well as "utf8_char"/"utf8_key_up"/"utf8_key" Event

  6. Clipboard fires both, a "paste" event as well as "utf8_paste" Event.

May work. I will have to think a while if we get into trouble. I am still afraid what happens if an "old script" gets utf8 strings (wherever they are from) and tries to print them. It won't be fully backward compatible.
The old hacks of dan reduce utf8 multibyte characters to one '?'.
Where the new code MUST handle them as multiple '?' (one '?' for each byte).

Still open question what to do for multi-char sequences. Do we depend on java strings or do we "fully support" utf8 and even treat multi-char-sequences as single character?

@mepeisen

This comment has been minimized.

mepeisen commented Mar 20, 2018

  1. Maybe we will need a "readUtf8" too. So that older scripts will continue get non utf strings.
@Wojbie

This comment has been minimized.

Contributor

Wojbie commented Mar 20, 2018

It's starting to sound like those functions would fit in their own api rather in all apis piecemeal.

@mepeisen

This comment has been minimized.

mepeisen commented Mar 20, 2018

Why? At least term/window/monitor are objects. Putting their utf methods to some other place would be confusing.
"read" is a global function. Putting the "readUtf8" somewhere else would be confusing too.

@Wojbie

This comment has been minimized.

Contributor

Wojbie commented Mar 20, 2018

I missed point 2 somehow.

@mepeisen

This comment has been minimized.

mepeisen commented Mar 20, 2018

Ah ok :-) The string and the new utf8 API are defined by lua standard itself. So we should indeed follow it. It's easier to document and easier to adapt existing utf aware scripts and code samples for users ;-)

@SquidDev

This comment has been minimized.

Contributor

SquidDev commented Mar 20, 2018

You could arguably have something like io.utf8 or utf8.io, which acts like the default io library but on utf8 strings: this provides .write and .read functionality. I dunno, if feels nicer than polluting the global table even more.

With respect to events, one shouldn't need special unicode variants of key or key_up, as that doesn't include any character specific information. It might be nicer to just add an additional value to the char and paste events, so you've got local event, latin1_str, utf8_str = os.pullEvent("char") -- or "paste". That way you're not firing two events, though it's a little uglier.

I am still afraid what happens if an "old script" gets utf8 strings (wherever they are from) and tries to print them.

This shouldn't be as much of a problem, as they'll should just be printed using CC's latin1 codepage instead of the full unicode one. It may result in some funky output, but it won't be fundamentally broken per-say.

@mepeisen

This comment has been minimized.

mepeisen commented Mar 20, 2018

This shouldn't be as much of a problem, as they'll should just be printed using CC's latin1 codepage instead of the full unicode one.

Yes, but on Java side we will receive a string with valid utf characters. We still need on java side the "hack" in old terminal api to replace the utf codes by '?'. The newer utf versions of the functions will use the java strings without modification.

@mepeisen

This comment has been minimized.

mepeisen commented Mar 22, 2018

A last decision before I change the code. currently the following charset is used by computer craft:
image
As far as I can see this is a variant of iso-8859 or windows cp-1252.
However: If we have full utf8 support enabled the characters above 0x7F would be considered illegal for old term.print api (because part of utf8 multibyte characters).

iso-8859 and windows cp-1252 are not compatible to utf8. Dan hacked some things into LuaJ (casting integers to characters and back again) instead using the java string getBytes functions with correct charsets etc.

A small example with character 'ü'.

In cp1252 this is made of one byte: 0xFC.
In UTF-8 this is made of two bytes: 0xC3 and 0xBC.

What do you think? Is it ok to break backward compatibility at this point? In the meaning that old print api only supports ASCII (7 bit) and if one wants to use more he/she should change the script to new utf variant?
I currently do not know if we are ever fully backward compatible :-(

@Wojbie

This comment has been minimized.

Contributor

Wojbie commented Mar 22, 2018

Wasn't the whole point of having 2 functions for all utf8 stuff so that current ones would work with this charset and utf8 ones would work with utf8?

@mepeisen

This comment has been minimized.

mepeisen commented Mar 22, 2018

Yes, of course. But think of following simple script:

local s = string.char(252)
print(#s)
print(s)

What do you expect? In current ccraft version you will get a "1" and a "ü". Because of dans hacks.
If we change it we will get a "2" and a "??" because the LuaString will hold two bytes (for utf8).

@SquidDev

This comment has been minimized.

Contributor

SquidDev commented Mar 22, 2018

If we change it we will get a "2" and a "??" because the LuaString will hold two bytes (for utf8).

It shouldn't. The original string library must still funtion as before, otherwise you're going to break any funtion which operates on binary data. Dan has always been verf strict in has backwards compatibility requirements and, whilst I think you can be more flexible here, breaking fundamental Lua semantics is not an option.

@mepeisen

This comment has been minimized.

mepeisen commented Mar 22, 2018

ok, I will try to find a solution. But it may not be very nice.

@mepeisen

This comment has been minimized.

mepeisen commented Mar 24, 2018

And now the third try. Now we are fully backward compatible!
That means: If you do nothing you will have the same behaviour than previous CCraft versions.
To use utf8 support in strings you can use the utf8 library as well as various utf functions (f.e. printutf8).
To use utf8 support for display add a font to resource packs (notice: newer version at https://github.com/mepeisen/CCraft-Gnu-Unifont/releases/tag/v0.0.2 )

Custom build can be found here:
ComputerCraft-1.80pr1.jar.zip
sources in my fork: https://github.com/mepeisen/ComputerCraft

And last but not least a test:
test.lua.txt

Screenshot:
small

As you can see it now even works if one function prints classic ASCII and another one prints utf characters.
We got one problem at all: loading utf files will destroy the utf characters... But I decided to NOT remove the hacks into file loading. So that old scripts behave the same.
However this leaded me to add more utf functions we might to get. And there are still some more Things to do. Will finish for today. If you are happy with it I will start a new pull request and we may qwork together for testing and rewriting the rom programs, shell etc.:

See my current working list:
changes-utf-support.txt

@mepeisen

This comment has been minimized.

mepeisen commented Mar 24, 2018

And yes, in screenshot you see a smiley although not very nicely drawn (U+1F600, http://www.utf8-chartable.de/unicode-utf8-table.pl)
And a line of klingon, see CSUR from https://www.evertype.com/standards/csur/

@SquidDev

This comment has been minimized.

Contributor

SquidDev commented Mar 24, 2018

Judging by the changelog and images that's looking good! I'll have a play with it later tonight. Just a couple of nitpicky things and some questions:

  • I'd chuck Utf8API into the LuaJ sources and rebuild the jar. It's not technically LuaJ code, but it's what I was instructed to do in #298 so probably worth getting ahead of the game.

  • I think it might be nicer to have separate utf8.* (where * is io, textutils) libraries rather than adding new functions to the existing ones. Heck, I'd even go as far as to make them requireable instead of using global APIs. Personally I'd scrap printUtf8/printErrorUtf8 and require people to go through the io library - I don't think we need to duplicate every piece of functionality, as this is rather and advanced and complex feature as-is.

Function "dump" from string api

I'd dump this (ha-ha) - string.dump just operates on raw byte arrays so it shouldn't function any differently whatever encoding you're using.

Function "lower" to lower case all utf characters; depending on java locale

As someone in an English speaking country, I'd prefer to make this Locale.ENGLISH (or Locale.NONE if it exists, I can't remember) as it means code runs consistently on all platforms. I can image non-English people might find this less appealing though - opinions welcome :).

Function "find" a utf variant of string.find, etc...

Blughr. I'm stuck between "this is good for support" and "oh goodness, even more code". I haven't looked at the implementation, but I'd suggest copying the pattern matching and formatting code from Cobalt, as that's much less broken then LuaJ's implementation.

FSAPI: Supporting new modes: "ru", "wu", "au". Similar to original text modes

There's a bit of me which is reluctant to add more complexity to the file system and HTTP APIs, but I guess binary mode doesn't really cut it as it doesn't have readLine. I wonder if it'd be possible to binary handles instead.

HTTPAPI: new variants of arg4

Sadly we still need to handle the old values boolean ones as well (I wasn't imagining ever needing to extend this, sorry). I'd personally use magic strings ("b" for binary, "u" for unicode or something) instead of magic numbers.

TermAPI: new functions "getFontName", "setFontName"

I'm still blurghr about this, but there you go :P. Ideally we could have "one unified font" which looks consistent. It might be possible to fall back to Minecraft's built-in font renderer (or a fixed width variant) for non-latin1 characters.

new global function loadutf8, loadfileutf8, dofileutf8, os.*, etc...

Is loadutf8 needed: I'm not entirely sure how it'd differ from the current implementation - it just delegates straight off to LuaJ. I'd personally switch the file related functions to use binary mode instead: this means we can also use it on binary chunks, which is always a perk.

there seems to be an old copy of HTTPRequest

And this is why you shouldn't rebase when tired! This is entirely my fault, but it didn't seem worth putting a PR together for something so minor.

utf variant for ModemPeripheral.transmit, TileCable.Peripheral.callRemote

Nope. As long as the encoding is consistent between the entry and exit points it shouldn't matter.

@mepeisen

This comment has been minimized.

mepeisen commented Mar 24, 2018

I'd chuck Utf8API into the LuaJ sources and rebuild the jar. It's not technically LuaJ code, but it's what I was instructed to do in #298 so probably worth getting ahead of the game.

Hmmm. That means to put some kind of utf logic and utf string inside the LuaJ Code too. And I don't even know if the LuaJ classes are available at client (for font renderer etc.) I will check this.

I think it might be nicer to have separate utf8.* (where * is io, textutils) libraries rather than adding new functions to the existing ones.

I don't prefer global globals or placing them alongside into utf8 API. However someone should make a decision. :-)

opinions welcome :).

Hmmm. Maybe a solution with optional locale parameter? As "lower" and "upper" are not part of the official lua 5.3 utf8 api we could add a custom function doing whatever we like to do... Again someone has to make a decision.

I'm stuck between "this is good for support" and "oh goodness, even more code"

If there are some other custom APIs for utf8 we should add we can put them into the code. Hmmm. Someone has to make a decision ;-)

There's a bit of me which is reluctant to add more complexity to the file system and HTTP APIs, but I guess binary mode doesn't really cut it as it doesn't have readLine. I wonder if it'd be possible to binary handles instead.

Yeah, that could be an alternative. Treat binary streams with string reading functions. For the Moment to solve the issue with corrupt characters I added those new modes.

Sadly we still need to handle the old values boolean ones as well

I dod not remove that ;-)

I'd personally use magic strings ("b" for binary, "u" for unicode or something) instead of magic numbers.

If thats the decisio I will change it :-)

I'm still blurghr about this, but there you go :P. Ideally we could have "one unified font" which looks consistent.

I did not find some other one. And Gnu Unifont has some incompatible license. So there must be some System where we do not depend on installed Fonts. It already falls back to the old font if the font Name is not known by the client. Resulting in '?' for every Extended utf character...

Is loadutf8 needed:

If we like to be fully backward compatible we need it. Because the dan hacks corrupt the utf characters. You can see it in screenshot. The first and ssecond print gives us 3 '?' characters. The computer craft wiki says that CCraft is compatible with utf8. But actually it isn't.

Either we break compatibility to let "load" Support all utf characters or we have a second function "loadutf8".

You find some changes from dan inside LuaJ that breaks the load function or more concret: That changes unwanted utf multi byte characters to '?'.
Someone has to decide if we want to be compatible to older versions or if we break it and enable utf Support on all core functions like "load" etc.

Nope. As long as the encoding is consistent between the entry and exit points it shouldn't matter.

That was a todo marker for me ;-) Did not yet review it. I only find some string handling inside. Will have to check it the next days.

@SquidDev

This comment has been minimized.

Contributor

SquidDev commented Mar 24, 2018

Is loadutf8 needed:

If we like to be fully backward compatible we need it. Because the dan hacks corrupt the utf characters. Either we break compatibility to let "load" Support all utf characters or we have a second function "loadutf8".

Dan's modifications should only matter when bridging between Java strings and Lua ones. I'm fairly sure the lexer, parser and compiler (and so load) only operate on LuaString and so one doesn't need to worry in the same way about encoding (for instance load("return \"\226\134\146\"") should do what one expects (generates return '→').

I guess the it's possible the parser may break on some multi-byte sequences, but I short of creating an entire unicode-aware copy of the parser, I don't think that's avoidable.

Someone has to make a decision ;-)

Yes, one rather need to get Dan's opinion on this, as it is rather the only opinion which matters. Sadly he's not really had time/willpower to look at the repo lately :(

@mepeisen

This comment has been minimized.

mepeisen commented Mar 25, 2018

or instance load("return "\226\134\146"") should do what one expects

Yes, it works.

But saving files in IDE as UTF8 and using the character unescaped directly does not work (see my example lua script above, which is actually doing this). Maybe this is even the lua standard way to not be able to read utf source files in a proper way. But it's strange for me as a non lua evangelist.

Yes, one rather need to get Dan's opinion on this, as it is rather the only opinion which matters. Sadly he's not really had time/willpower to look at the repo lately

So the community has to decide and to vote. Maybe everyone has to vote for how many utf functions will be needed and where they reside and how they work. We will vote for and hopefully dan shares our opinion at the end. I peronally do not care where functions reside as long as it works.

As far as I can see. We basically require:

  • utf8 api
  • term/monitor/window: writeutf8, blintutf8
  • extension of char/paste events
  • font selection because of unifont license (however this is designed)
    All other methods or APIs are bonus work to make everything nice. Specially:
  • utf file io to prevent getting utf strings corrupt (at least this may even be falled back to binary file handling)
  • Maybe labeling, file system names etc. (I still did not have time to check this out)

Anyone that can contact dan directly to ask him for sharing his opinion with us?

@mepeisen

This comment has been minimized.

mepeisen commented Mar 25, 2018

Fixed base line for smileys (almost) and played around with different font types:

  1. low res non alpha:
    small
  2. low res with alpha channel:
    blending
  3. high definition:
    hd

And I guess there is one big benefit in loading fonts from resource packs: One could choose the font he/she likes and that fits the best in RAM etc.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment