base fork: baryluk/otp
...
head fork: baryluk/otp
compare: source_code_encoding_in_compiler_and_epp
  • 2 commits
  • 7 files changed
  • 0 commit comments
  • 1 contributor
Commits on Nov 15, 2011
@baryluk Add encoding and Unicode support to compiler and epp
This commit adds and documents the {encoding, Encoding} compiler option
(to be used as in compile:file/2). It also adds the
epp:open/4 function, whose 4th argument is an encoding.
The Encoding term is as defined by the unicode:encoding() datatype,
and is passed to file:open/2.
A test for parsing UTF-8 encoded files (with Unicode string literals
and character constants) by epp is added to the test suite.
A new section called "Source files" is added to the "Using Unicode
in Erlang" guide.
7f992fa
@baryluk Small fixes to be merged with unicode support. 2ac9f99
37 lib/compiler/doc/src/compile.xml
@@ -152,6 +152,43 @@
for details.</p>
</item>
+ <tag><c>{encoding, Encoding}</c></tag>
+ <item>
+ <p>The compiler parses the source file using the encoding <c>Encoding</c>,
+ as defined by the <c>unicode:encoding()</c> datatype.
+ The encoding option is passed to the <c>epp</c> module, and from there
+ to the <c>file</c> module, which performs the actual conversion.
+ This option is useful for source files encoded in UTF-8 (or another encoding from the UTF family)
+ that contain string literals, or integer literals written in $char notation,
+ with Unicode characters in them.
+ The default behaviour is to use the <c>latin1</c> encoding.
+ </p>
+ <p><em>Note:</em>
+ This flag is currently
+ only intended to be used with <c>.erl</c> and <c>.hrl</c> files.
+ </p>
+ <p><em>Warning:</em>
+ Using an encoding that differs from the actual encoding of the compiled
+ file may produce unexpected behaviour.
+ All included files are read and parsed using the same encoding <c>Encoding</c>,
+ which may produce unexpected results if they are in fact encoded differently.
+ Remember, however, that latin1 encoded files (latin1 being the default, and the only encoding supported in previous Erlang releases)
+ that contain only characters with code points up to 127 are also valid utf8 encoded files with the same meaning,
+ which makes mixing such latin1 files with utf8 files safe, even if the encoding is forced to utf8 for all of them.
+ To guarantee safe inclusion of other files, latin1 encoded files must use only this safe subset of latin1,
+ namely the characters of the ASCII charset (both latin1 and utf8 are compatible extensions of the 7-bit ASCII encoding).
+ As the Erlang syntax itself only uses characters from the ASCII charset (apart from $char notation, string literals and comments),
+ this limitation is of little practical importance.
+ </p>
+ <p><em>Note:</em>
+ All files on disk will still be written using the standard encodings and formats.
+ Strings and $char integers containing Unicode
+ characters will be interpreted and written as lists of integers
+ and integers respectively, not as quoted string literals or integer literals in $char
+ notation. This can make intermediate files harder to read.
+ </p>
+ </item>
+
<tag><c>encrypt_debug_info</c></tag>
<item>
<marker id="encrypt_debug_info"></marker>
11 lib/compiler/src/compile.erl
@@ -761,7 +761,7 @@ parse_module(St) ->
Opts = St#compile.options,
Cwd = ".",
IncludePath = [Cwd, St#compile.dir|inc_paths(Opts)],
- R = epp:parse_file(St#compile.ifile, IncludePath, pre_defs(Opts)),
+ R = epp:parse_file(St#compile.ifile, IncludePath, pre_defs(Opts), source_encoding(Opts)),
case R of
{ok,Forms} ->
{ok,St#compile{code=Forms}};
@@ -1473,6 +1473,15 @@ pre_defs([]) -> [].
inc_paths(Opts) ->
[ P || {i,P} <- Opts, is_list(P) ].
+source_encoding(Opts) ->
+ case [ Encoding || {encoding, Encoding} <- Opts ] of
+ [Encoding] ->
+ Encoding;
+ [] ->
+ default_encoding
+ %% Only one encoding option is allowed; passing more than one fails with case_clause.
+ end.
+
src_listing(Ext, St) ->
listing(fun (Lf, {_Mod,_Exp,Fs}) -> do_src_listing(Lf, Fs);
(Lf, Fs) -> do_src_listing(Lf, Fs) end,
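Review note on `source_encoding/1`: the commented-out clause suggests that duplicate `{encoding, _}` options should be rejected. The following is a minimal standalone sketch of one way to do that, not the code in this branch. The module name `encoding_opt_sketch` and the `multiple_encodings` error term are assumptions made up for illustration.

```erlang
-module(encoding_opt_sketch).
-export([source_encoding/1]).

%% Returns the single encoding requested in Opts, or default_encoding
%% when no {encoding, _} option is present. More than one such option
%% is rejected with an explicit (hypothetical) error term instead of
%% falling through to a case_clause crash.
source_encoding(Opts) ->
    case [E || {encoding, E} <- Opts] of
        []  -> default_encoding;
        [E] -> E;
        Es  -> erlang:error({multiple_encodings, Es})
    end.
```

With this shape the caller sees which encodings collided, which is easier to report than a bare `case_clause`.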
1  lib/stdlib/doc/src/epp.xml
@@ -51,6 +51,7 @@
<func>
<name name="open" arity="2"/>
<name name="open" arity="3"/>
+ <name name="open" arity="4"/>
<fsummary>Open a file for preprocessing</fsummary>
<desc>
<p>Opens a file for preprocessing.</p>
16 lib/stdlib/doc/src/unicode_usage.xml
@@ -168,6 +168,22 @@ Eshell V5.7 (abort with ^G)
<image file="ushell2.gif"><icaption>Unicode characters in allowed and disallowed context</icaption></image>
</section>
<section>
+<title>Source files</title>
+<p>Source code files can be compiled using a different encoding by passing the <c>{encoding, Encoding}</c> option to the <c>compile:file/2</c> function.
+The encoding can also be passed to the <c>epp:open/4</c> function when using the preprocessor.</p>
+<p>For example, <c>compile:file("somemodule.erl", [{encoding, utf8}])</c> will read,
+parse and compile the file somemodule.erl using the UTF-8 encoding.
+This allows using actual UTF-8 characters in string literals and character constants.
+Binaries containing literal Unicode strings are not supported yet, and will fail at runtime.
+The extended encoding cannot yet be used for atoms or any other language constructs.
+It can, however, be used in comments.
+</p>
+<p>All files included by the compiler and <c>epp</c> (the Erlang preprocessor) will be read using the same encoding.
+To prevent unexpected results, save all files in the same encoding,
+or, when mixing with latin1 files, use utf8 and make sure the latin1 files use only code points
+up to 127 (the ASCII charset), so they can be safely read as UTF-8 files.</p>
+</section>
+<section>
<title>Unicode file names</title>
<p>Most modern operating systems support Unicode file names in some way or another. There are several different ways to do this and Erlang by default treats the different approaches differently:</p>
<taglist>
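Review note on the "Source files" section: the claim that ASCII-only latin1 files can be read safely as UTF-8 can be checked with the existing `unicode` module. This is a small sketch under that assumption. The module name `ascii_safe_sketch` and both function names are hypothetical, and it works on binaries rather than on files.

```erlang
-module(ascii_safe_sketch).
-export([ascii_only/1, same_in_both/1]).

%% True when every byte is below 128, i.e. plain 7-bit ASCII.
ascii_only(Bin) when is_binary(Bin) ->
    lists:all(fun(B) -> B < 128 end, binary_to_list(Bin)).

%% True when decoding the bytes as latin1 and as utf8 yields the same
%% list of code points. unicode:characters_to_list/2 returns a tuple
%% (error or incomplete) for bytes that are not valid UTF-8, so the
%% is_list/1 check covers that case.
same_in_both(Bin) when is_binary(Bin) ->
    AsLatin1 = unicode:characters_to_list(Bin, latin1),
    AsUtf8 = unicode:characters_to_list(Bin, utf8),
    is_list(AsUtf8) andalso AsLatin1 =:= AsUtf8.
```

An ASCII-only binary passes both checks, while a latin1 byte such as 233 (e with acute accent) fails them, which is the situation the documentation warns about.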
58 lib/stdlib/src/epp.erl
@@ -20,9 +20,9 @@
%% An Erlang code preprocessor.
--export([open/2,open/3,open/5,close/1,format_error/1]).
+-export([open/2,open/3,open/4,open/5,close/1,format_error/1]).
-export([scan_erl_form/1,parse_erl_form/1,macro_defs/1]).
--export([parse_file/1, parse_file/3]).
+-export([parse_file/1, parse_file/3, parse_file/4]).
-export([interpret_file_attribute/1]).
-export([normalize_typed_record_fields/1,restore_typed_record_fields/1]).
@@ -30,6 +30,7 @@
-type macros() :: [{atom(), term()}].
-type epp_handle() :: pid().
+-type encodings() :: unicode:encoding() | 'detect_unicode_encoding' | 'default_encoding'.
%% Epp state record.
-record(epp, {file, %Current file
@@ -54,12 +55,14 @@
%% open(FileName, IncludePath)
%% open(FileName, IncludePath, PreDefMacros)
+%% open(FileName, IncludePath, PreDefMacros, Encoding)
%% open(FileName, IoDevice, StartLocation, IncludePath, PreDefMacros)
%% close(Epp)
%% scan_erl_form(Epp)
%% parse_erl_form(Epp)
%% parse_file(Epp)
%% parse_file(FileName, IncludePath, PreDefMacros)
+%% parse_file(FileName, IncludePath, PreDefMacros, Encoding)
%% macro_defs(Epp)
-spec open(FileName, IncludePath) ->
@@ -81,13 +84,26 @@ open(Name, Path) ->
ErrorDescriptor :: term().
open(Name, Path, Pdm) ->
+ open(Name, Path, Pdm, default_encoding).
+
+-spec open(FileName, IncludePath, PredefMacros, Encoding) ->
+ {'ok', Epp} | {'error', ErrorDescriptor} when
+ FileName :: file:name(),
+ IncludePath :: [DirectoryName :: file:name()],
+ PredefMacros :: macros(),
+ Encoding :: encodings(),
+ Epp :: epp_handle(),
+ ErrorDescriptor :: term().
+
+open(Name, Path, Pdm, Encoding) ->
Self = self(),
- Epp = spawn(fun() -> server(Self, Name, Path, Pdm) end),
+ Epp = spawn(fun() -> server(Self, Name, Path, Pdm, Encoding) end),
epp_request(Epp).
+
open(Name, File, StartLocation, Path, Pdm) ->
Self = self(),
- Epp = spawn(fun() -> server(Self, Name, File, StartLocation,Path,Pdm) end),
+ Epp = spawn(fun() -> server(Self, Name, File, StartLocation, Path, Pdm) end),
epp_request(Epp).
-spec close(Epp) -> 'ok' when
@@ -164,6 +180,8 @@ format_error({'NYI',What}) ->
io_lib:format("not yet implemented '~s'", [What]);
format_error(E) -> file:format_error(E).
+%% parse_file(FileName, IncludePath, [PreDefMacro], Encoding) ->
+%% {ok,[Form]} | {error,OpenError}
%% parse_file(FileName, IncludePath, [PreDefMacro]) ->
%% {ok,[Form]} | {error,OpenError}
@@ -178,7 +196,25 @@ format_error(E) -> file:format_error(E).
OpenError :: file:posix() | badarg | system_limit.
parse_file(Ifile, Path, Predefs) ->
- case open(Ifile, Path, Predefs) of
+ parse_file_1(open(Ifile, Path, Predefs)).
+
+-spec parse_file(FileName, IncludePath, PredefMacros, Encoding) ->
+ {'ok', [Form]} | {error, OpenError} when
+ FileName :: file:name(),
+ IncludePath :: [DirectoryName :: file:name()],
+ Form :: erl_parse:abstract_form() | {'error', ErrorInfo} | {'eof',Line},
+ PredefMacros :: macros(),
+ Encoding :: encodings(),
+ Line :: erl_scan:line(),
+ ErrorInfo :: erl_scan:error_info() | erl_parse:error_info(),
+ OpenError :: file:posix() | badarg | system_limit.
+
+
+parse_file(Ifile, Path, Predefs, Encoding) ->
+ parse_file_1(open(Ifile, Path, Predefs, Encoding)).
+
+parse_file_1(OpenResult) ->
+ case OpenResult of
{ok,Epp} ->
Forms = parse_file(Epp),
close(Epp),
@@ -245,12 +281,18 @@ restore_typed_record_fields([{attribute,La,type,{{record,Record},Fields,[]}}|
restore_typed_record_fields([Form|Forms]) ->
[Form|restore_typed_record_fields(Forms)].
-%% server(StarterPid, FileName, Path, PreDefMacros)
+%% server(StarterPid, FileName, Path, PreDefMacros, Encoding)
-server(Pid, Name, Path, Pdm) ->
+server(Pid, Name, Path, Pdm, Encoding0) ->
process_flag(trap_exit, true),
- case file:open(Name, [read]) of
+ EncodingOpts = case Encoding0 of
+ default_encoding -> []; % [{encoding, latin1}];
+ detect_unicode_encoding -> []; % TODO: use BOM detection
+ OtherEncoding -> [{encoding, OtherEncoding}]
+ end,
+ case file:open(Name, [read | EncodingOpts]) of
{ok,File} ->
+ % Perform BOM detection before
Location = 1,
init_server(Pid, Name, File, Location, Path, Pdm, false);
{error,E} ->
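Review note on `server/5`: the effect of the `EncodingOpts` mapping can be tried outside `epp` with plain `file:open/2`. The sketch below mirrors the mapping from the diff and reads one line back. The module name `epp_encoding_sketch`, the helper names and the scratch file name are assumptions for illustration only.

```erlang
-module(epp_encoding_sketch).
-export([encoding_opts/1, first_line/2, demo/0]).

%% Mirrors the mapping in the diff: the two special atoms add no
%% option, anything else is handed to file:open/2 as {encoding, Enc}.
encoding_opts(default_encoding) -> [];
encoding_opts(detect_unicode_encoding) -> [];  % BOM detection is still a TODO in the diff
encoding_opts(Enc) -> [{encoding, Enc}].

%% Opens Path with the derived options and returns the first line
%% as a list of integers.
first_line(Path, Encoding) ->
    {ok, Dev} = file:open(Path, [read | encoding_opts(Encoding)]),
    Line = io:get_line(Dev, ''),
    ok = file:close(Dev),
    Line.

%% Writes the UTF-8 bytes of U+0105 plus a newline to a scratch file
%% (hypothetical name) and reads it back with both settings.
demo() ->
    Path = "epp_encoding_sketch.tmp",
    ok = file:write_file(Path, <<196,133,10>>),
    Utf8 = first_line(Path, utf8),
    Default = first_line(Path, default_encoding),
    ok = file:delete(Path),
    {Utf8, Default}.
```

Calling `epp_encoding_sketch:demo()` should show one code point (261) for the `utf8` read and two separate bytes (196, 133) for the default read, which is why string literals in a UTF-8 source file come out differently depending on the option.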
26 lib/stdlib/test/epp_SUITE.erl
@@ -25,7 +25,8 @@
variable_1/1, otp_4870/1, otp_4871/1, otp_5362/1,
pmod/1, not_circular/1, skip_header/1, otp_6277/1, otp_7702/1,
otp_8130/1, overload_mac/1, otp_8388/1, otp_8470/1, otp_8503/1,
- otp_8562/1, otp_8665/1, otp_8911/1]).
+ otp_8562/1, otp_8665/1, otp_8911/1,
+ utf8_1/1]).
-export([epp_parse_erl_form/2]).
@@ -67,11 +68,12 @@ all() ->
{group, variable}, otp_4870, otp_4871, otp_5362, pmod,
not_circular, skip_header, otp_6277, otp_7702, otp_8130,
overload_mac, otp_8388, otp_8470, otp_8503, otp_8562,
- otp_8665, otp_8911].
+ otp_8665, otp_8911, {group, encoding}].
groups() ->
[{upcase_mac, [], [upcase_mac_1, upcase_mac_2]},
- {variable, [], [variable_1]}].
+ {variable, [], [variable_1]},
+ {encoding, [], [utf8_1]}].
init_per_suite(Config) ->
Config.
@@ -117,8 +119,15 @@ include_local(Config) when is_list(Config) ->
%%% after 4 seconds if the epp server doesn't respond. If we use the
%%% regular epp:parse_file, the test case will time out, and then epp
%%% server will go on growing until we dump core.
+epp_parse_file(File, Inc, Predef, Encoding) ->
+ {ok, Epp} = epp:open(File, Inc, Predef, Encoding),
+ epp_parse_file_1(Epp).
+
epp_parse_file(File, Inc, Predef) ->
{ok, Epp} = epp:open(File, Inc, Predef),
+ epp_parse_file_1(Epp).
+
+epp_parse_file_1(Epp) ->
List = collect_epp_forms(Epp),
epp:close(Epp),
{ok, List}.
@@ -1277,6 +1286,17 @@ otp_8665(Config) when is_list(Config) ->
?line [] = compile(Config, Cs),
ok.
+utf8_1(doc) ->
+ ["Unicode (UTF-8) support in source files."];
+utf8_1(suite) ->
+ [];
+utf8_1(Config) when is_list(Config) ->
+ ?line File = filename:join(?config(data_dir, Config), "utf8_1.erl"),
+ ?line {ok, List} = epp_parse_file(File, [], [], utf8),
+ ?line false = lists:keysearch(error, 1, List),
+ ?line check_errors(List),
+ ok.
+
check(Config, Tests) ->
eval_tests(Config, fun check_test/2, Tests).
96 lib/stdlib/test/epp_SUITE_data/utf8_1.erl
@@ -0,0 +1,96 @@
+%%
+%% %CopyrightBegin%
+%%
+%% Copyright Ericsson AB 2011. All Rights Reserved.
+%%
+%% The contents of this file are subject to the Erlang Public License,
+%% Version 1.1, (the "License"); you may not use this file except in
+%% compliance with the License. You should have received a copy of the
+%% Erlang Public License along with this software. If not, it can be
+%% retrieved online at http://www.erlang.org/.
+%%
+%% Software distributed under the License is distributed on an "AS IS"
+%% basis, WITHOUT WARRANTY OF ANY KIND, either express or implied. See
+%% the License for the specific language governing rights and limitations
+%% under the License.
+%%
+%% %CopyrightEnd%
+%%
+-module(utf8_1).
+
+-export([romanize_char/1]).
+-export([strings_1/0]).
+
+romanize_char(X) ->
+ case X of
+ SLetter when $a =< SLetter, SLetter =< $z ->
+ SLetter;
+ BLetter when $A =< BLetter, BLetter =< $Z ->
+ BLetter - $A + $a; % Change to lower letter
+ Number when $0 =< Number, Number =< $9 ->
+ Number;
+
+ % Polish Unicode
+ -> $a;
+ -> $e;
+ -> $c;
+ -> $s;
+ -> $n;
+ -> $z;
+ -> $z;
+ -> $o;
+ -> $l;
+ -> $a;
+ -> $e;
+ -> $c;
+ -> $s;
+ -> $n;
+ -> $z;
+ -> $z;
+ -> $o;
+ -> $l;
+
+ % Other languages
+ -> $a;
+ -> $a;
+ -> $o;
+ -> $o;
+ -> $u;
+ -> $u;
+ -> $e;
+ -> $e;
+ -> $i; % dosyć rzadki znak (quite rare character)
+ -> [$a, $e];
+ -> [$a, $e];
+
+ -> $a;
+ -> $c;
+ %$ä -> $a;
+ -> $n;
+ -> $s;
+ -> $a;
+ -> $u;
+ %$á -> $a;
+ -> $e;
+ -> $i;
+ %$ó -> $o;
+ -> $u;
+ -> $e;
+ -> $o;
+ -> $u;
+ -> $o;
+ -> $r;
+ -> $a;
+ -> $r;
+ -> $l;
+
+ Other when is_integer(Other), 1 =< Other ->
+ $- % all other to '-' character
+ end.
+
+strings_1() ->
+ [
+ "ąęćśżźńół",
+ "Ala ma kota",
+ "Zażółć gęślą jaźń"
+ ].
