Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add new AtU8 beam chunk #1078

Merged
merged 1 commit into from Feb 1, 2017

Conversation

@josevalim
Copy link
Contributor

commented May 31, 2016

The new chunk stores atoms encoded in UTF-8.

This is still work in progress and the following tasks are still pending:

Feedback request:

  1. beam_lib has been modified to support a new 'utf8_atoms' chunk that maps to the new "AtU8" chunk. This means accessing the old "Atom" chunk can now be missing although it is straight-forward to handle it with a case statement.
  2. list_to_atom has not been modified and it will still fail if given a character more than 255. The reason I have decided to not change list_to_atom is because its current documentation does not say it may support characters more than 255 in the future (while the binary_to_atom has a very explicit warning about such).
@ferd

This comment has been minimized.

Copy link
Contributor

commented May 31, 2016

What is meant by an UTF8 codepoint? I thought that by the time you're in UTF-8 you no longer deal in codepoints, just in bytes?

The Unicode consortium does specify that lengths can be counted in one of 4 ways. From the FAQ using the string aनि亜𐂃 (U+0061, U+0928, U+093F, U+4E9C, U+10083)

  • bytes: UTF-8: 14, UTF-16: 12, UTF-32: 20
  • code units (position in a string with the given size): UTF-8: 14, UTF-16: 6, UTF-32: 5
  • code points (how many unicode code points are in there): 5 for all, since there is a combining mark
  • grapheme clusters: 4 for all of them, since this is a logical grouping for the length based on 'characters' as we people usually think of them, rather than the implementation of them.

The two latter are not encoding-specific. What is meant exactly by UTF-8 code points? Just code points, or it's a bad term for code units (and hence bytes)?

@josevalim

This comment has been minimized.

Copy link
Contributor Author

commented May 31, 2016

Hey @ferd, I meant literally code points as in your definition:

code points (how many unicode code points are in there): 5 for all, since there is a combining mark

It is also worth mentioning the implementation today (prior to this patch) also checks code points. This can be checked in the source code or by trying a quick example:

3> binary_to_atom(binary:copy(<<"é"/utf8>>, 255), utf8).
ééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééé
4> byte_size(binary:copy(<<"é"/utf8>>, 255)).
510
5> binary_to_atom(binary:copy(<<"é"/utf8>>, 256), utf8).
** exception error: a system limit has been reached
     in function  binary_to_atom/2
        called as binary_to_atom(<<"éééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééé"/utf8...>>,
                                 utf8)

So this pull request is currently backwards incompatible (hence the pending tasks). The question is mostly how to implement the new binary_to_atom efficiently.

@@ -6332,7 +6355,7 @@ erts_make_stub_module(Process* p, Eterm Mod, Eterm Beam, Eterm Info)
goto error;
}
define_file(stp, "atom table", ATOM_CHUNK);
if (!load_atom_table(stp)) {
if (!load_atom_table(stp, ERTS_ATOM_ENC_LATIN1)) {

This comment has been minimized.

Copy link
@margnus1

margnus1 May 31, 2016

Contributor

You'll need to go through the same if (stp->chunks[UTF8_ATOM_CHUNK].size > 0) ... charade down here, or code:make_stub_module/3 (and thus HiPE) will break. There's some tests for it in erts/emulator/test/code_SUITE*

This comment has been minimized.

Copy link
@josevalim

josevalim Jun 1, 2016

Author Contributor

Thank you. I have run the code suite and fixed this and the remaining failing tests, I have updated this PR.

@josevalim josevalim force-pushed the josevalim:jv-atu8-chunk branch from 0ad2fdc to 8b5121e Jun 1, 2016
@OTP-Maintainer

This comment has been minimized.

Copy link

commented Jun 1, 2016

Patch has passed first testings and has been assigned to be reviewed


I am a script, I am not human


@bjorng bjorng self-assigned this Jun 3, 2016
@bjorng

This comment has been minimized.

Copy link
Contributor

commented Jun 3, 2016

We will not have time to do a thorough review until OTP 19.0 has been released, but here is some quick feedback:

The test case lc_SUITE:effect/1 fails.

binary_to_atom(): Solution i, that is keep the existing error semantics and implement it efficiently.

Not modifying list_to_atom/1 is an interesting idea that we have not considered.

@nox

This comment has been minimized.

Copy link
Contributor

commented Jun 3, 2016

@bjorng Can we name the chunk 'JOSÉ'? PLS.

@psyeugenic psyeugenic added the kanban label Jun 13, 2016
@bjorng bjorng added the waiting label Sep 22, 2016
@bjorng

This comment has been minimized.

Copy link
Contributor

commented Sep 22, 2016

We will take a closer at this PR request when you have fixed the failed test case mentioned above and rebased it on the latest master.

@nox

This comment has been minimized.

Copy link
Contributor

commented Sep 22, 2016

It is also worth mentioning the implementation today (prior to this patch) also checks code points. This can be checked in the source code or by trying a quick example:

I think you interpreted that incorrectly. Currently only UTF-8 encoded atoms that can be converted to Latin-1 are supported, so it just counts the number of Latin-1 code points, which equates the number of Latin-1 code units, which equates the number of bytes in the result.

@josevalim

This comment has been minimized.

Copy link
Contributor Author

commented Sep 22, 2016

@bjorng I will allocate time in the next weeks to work on all of my pending pull requests. :D

While talking to @nox, we came up with another solution for the second bullet above. @nox commented that it would be much simpler to count bytes instead of counting codepoints. However, if we count bytes and keep the limit of 255 bytes, it would be backwards incompatible because an atom made only of the letter é in latin1 will now take double the space in UTF-8, possibly triggering the system limit.

However, if we double the maximum atom size, we could keep the conceptually simpler and faster code that only counts bytes and remain backwards compatible.

From the Elixir perspective, Increasing the atom limit would be a positive change, since function names must be atoms, sometimes we may reach this limit when building functions dynamically.

@bjorng thoughts on this suggestion?

@bjorng

This comment has been minimized.

Copy link
Contributor

commented Sep 24, 2016

I think that the BIFs that create atoms should count code points, and let the limit be 255 code points. When converting from a list, the number of code points is the number of elements in the list (assuming that we don't try to combine code points that can be combined).

The chunk in the BEAM file can give the size of each atom in bytes, and the loader can trust the compiler not to create overlong atoms. (There could be a sanity check to abort the loading if any atom is longer than 1024 bytes.)

@bjorng

This comment has been minimized.

Copy link
Contributor

commented Sep 24, 2016

Rationale for counting code points: We don't want to handle support request from Klingon users that complain that they can only put 85 Klingon characters into their atoms, while speakers of most west-European languages can have 255 characters in their atoms.

@nox

This comment has been minimized.

Copy link
Contributor

commented Sep 24, 2016

What about following what ROK proposed in EEP20?

2 bytes for byte length, 2 bytes for code point count.

@josevalim josevalim force-pushed the josevalim:jv-atu8-chunk branch from 8b5121e to 7367a1f Sep 26, 2016
Eterm
erts_atom_put(const byte *name, int len, ErtsAtomEncoding enc, int trunc)
int
erts_atom_put_index(const byte *name, int len, ErtsAtomEncoding enc, int trunc)

This comment has been minimized.

Copy link
@josevalim

josevalim Sep 26, 2016

Author Contributor

@bjorng I have introduced this function that returns the atom index instead of the atom so we have control upstream if we should error with badarg or system_limit. I have two questions I would love your input on:

  1. Should the name have the erts_ prefix?
  2. Should we use #define ATOM_BAD_ENCODING and #define ATOM_MAX_CHARS instead of using integers? If so, is it fine to define those on atom.h?

Notice the new function and its return values are used in binary_to_atom (erl_unicode.c). The PR has also been rebased. Once those questions are answered, I will squash everything and we should be good to go.

This comment has been minimized.

Copy link
@psyeugenic

psyeugenic Sep 26, 2016

Contributor

The convention for the erts_ prefix is to use it for global functions. Older function, or some very often used functions, e.g. size_object does not use it though.

This comment has been minimized.

Copy link
@bjorng

bjorng Sep 26, 2016

Contributor

Answer for 2: Yes, and yes.

@josevalim josevalim force-pushed the josevalim:jv-atu8-chunk branch from 7367a1f to e426d18 Sep 26, 2016
@josevalim

This comment has been minimized.

Copy link
Contributor Author

commented Sep 26, 2016

@bjorng This is ready for review. I have rebased, fixed all feedback and features.

I have ran emulator, compiler and stdlib test suites with the following results:

  • emulator: One failure regarding registered sends across nodes (I inspected the code and it seems to not related be to this patch)
  • compiler: No failures
  • stdlib: One failure in the shell suite which seems unrelated

I have also added a test that compiles a module from the Erlang Abstract Format with utf8 atoms, to ensure the whole compilation chain works, and also a test for the r19 option.

@OTP-Maintainer

This comment has been minimized.

Copy link

commented Sep 26, 2016

Patch has passed first testings and has been assigned to be reviewed


I am a script, I am not human


@josevalim josevalim force-pushed the josevalim:jv-atu8-chunk branch from e426d18 to f4424da Sep 26, 2016
@OTP-Maintainer

This comment has been minimized.

Copy link

commented Oct 1, 2016

Patch has passed first testings and has been assigned to be reviewed


I am a script, I am not human


@bjorng bjorng removed the waiting label Oct 5, 2016
@bjorng

This comment has been minimized.

Copy link
Contributor

commented Oct 12, 2016

Three test cases fail in compile_SUITE: core, core_roundtrip, and asm.

I updated the primary bootstrap and compiled from a clean repository.

@bjorng bjorng added the waiting label Oct 12, 2016
@josevalim

This comment has been minimized.

Copy link
Contributor Author

commented Oct 12, 2016

I updated the primary bootstrap and compiled from a clean repository.

How do I update the primary bootstrap? I remember doing a clean build but I did not touch the bootstrap files. Thanks!

@bjorng bjorng added testing and removed waiting labels Dec 7, 2016
@bjorng bjorng removed the testing label Dec 14, 2016
@bjorng

This comment has been minimized.

Copy link
Contributor

commented Dec 19, 2016

I found and fixed another minor bug. Please squash and rebase on latest master.

@bjorng bjorng added the waiting label Dec 19, 2016
@josevalim josevalim force-pushed the josevalim:jv-atu8-chunk branch from 4bb624c to b58cd32 Dec 19, 2016
@josevalim

This comment has been minimized.

Copy link
Contributor Author

commented Dec 19, 2016

Done and done!

Btw the reducing memory during compilation I had to rebase on may help Elixir too. 👍

@bjorng bjorng removed the waiting label Dec 19, 2016
@bjorng bjorng added testing and removed testing labels Jan 9, 2017
@bjorng

This comment has been minimized.

Copy link
Contributor

commented Jan 13, 2017

Can you please rebase again?

We will test the branch in our daily builds again to see that there are no remaining failed test cases.

@josevalim josevalim force-pushed the josevalim:jv-atu8-chunk branch from b58cd32 to d14cae9 Jan 13, 2017
@josevalim

This comment has been minimized.

Copy link
Contributor Author

commented Jan 13, 2017

@bjorng done!

@bjorng bjorng added the testing label Jan 14, 2017
@bjorng

This comment has been minimized.

Copy link
Contributor

commented Jan 25, 2017

This branch no longer seems to cause any problems in our daily builds. We are thinking about merging it soon. More work is probably needed in some applications, for example to ensure that "~ts" is used as format specifier when an atom is displayed, but we can do that in separate branches.

However, I think that the documentation should be updated in this branch to reflect the changes, at least the BIFs for the atom BIFs. Do you think you could update the documentation?

@josevalim

This comment has been minimized.

Copy link
Contributor Author

commented Jan 25, 2017

I will push the docs soon. :) Two questions:

  1. Regarding ~ts, I understand why it is needed for binaries and char lists, but it shouldn't be necessary for atoms assuming that atoms do have the encoding they are written in? So I suggest making ~s work regardless of the atom encoding. Or do you see reasons where this would be a bad idea™?

  2. Should we explicitly document module names can now be UTF-8? If so, I would like to add a test to compile_SUITE where we compile Erlang Forms module with a UTF-8 atom as name.

@josevalim

This comment has been minimized.

Copy link
Contributor Author

commented Jan 25, 2017

I have pushed a separate commit with docs. Once review is done, I can squash it all together.

@bjorng

This comment has been minimized.

Copy link
Contributor

commented Jan 26, 2017

  1. Making ~s work for any atom is an interesting idea. I will discuss it with the team.

  2. We have not reached a final decision yet. Our current thinking is that we will allow Unicode characters in module names, but we will not recommend it because of potential portability problems having to do with the file names (Linux does not enforce any particular encoding for file names, and their could be other potential pitfalls when moving files from one operating system to another). The reason that we will probably allow it is because those potential problem already exists for module names with Latin1 characters.

@josevalim

This comment has been minimized.

Copy link
Contributor Author

commented Jan 26, 2017

Our current thinking is that we will allow Unicode characters in module names, but we will not recommend it because of potential portability problems having to do with the file names

I agree. Let me know what are your decisions regarding 1 and 2 and I will gladly add tests and improve the docs.

@bjorng

This comment has been minimized.

Copy link
Contributor

commented Jan 30, 2017

There may some time before we can reach a decision. We will merge this branch and do further corrections/improvement in separate pull requests.

Please squash the commits. When you have done it, we will run the branch once more in our daily builds and then merge it.

The new chunk stores atoms encoded in UTF-8.

beam_lib has also been modified to handle the new
'utf8_atoms' attribute while the 'atoms' attribute
may be a missing chunk from now on.

The binary_to_atom/2 BIF can now encode any utf8
binary with up to 255 characters.

The list_to_atom/1 BIF can now accept codepoints
higher than 255 with up to 255 characters (thanks
to Björn Gustavsson).
@josevalim josevalim force-pushed the josevalim:jv-atu8-chunk branch from 7257839 to 26b59df Jan 30, 2017
@josevalim

This comment has been minimized.

Copy link
Contributor Author

commented Jan 30, 2017

Squashed and pushed.

@nox

This comment has been minimized.

Copy link
Contributor

commented Jan 30, 2017

do further corrections/improvement in separate pull requests.

Like renaming the chunk to 'JOSÉ', right?

@bjorng bjorng merged commit 26b59df into erlang:master Feb 1, 2017
1 check passed
1 check passed
continuous-integration/travis-ci/pr The Travis CI build passed
Details
@josevalim

This comment has been minimized.

Copy link
Contributor Author

commented Feb 1, 2017

🎉

@weisslj

This comment has been minimized.

Copy link
Contributor

commented on 26b59df Feb 2, 2017

Before this change, this line exited with system_limit:

binary_to_existing_atom(list_to_binary(lists:duplicate(256, $a)), utf8).

after this change, it exits with badarg. I think this is intended, as

list_to_existing_atom(lists:duplicate($a, 256)).

always exited with badarg. Just wanted to note it here as it changes the behavior of the function.

@KronicDeth

This comment has been minimized.

Copy link
Contributor

commented Aug 7, 2017

Support for decompiling AtU8 chunks has been added to IntelliJ Elixir in KronicDeth/intellij-elixir#777, which will be released in version 6.0.0.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
10 participants
You can’t perform that action at this time.