NAME

utf8::all - turn on Unicode - all of it

VERSION

version 0.021_001

SYNOPSIS

use utf8::all;                      # Turn on UTF-8, all of it.

open my $in, '<', 'contains-utf8';  # UTF-8 already turned on here
print length 'føø bār';             # 7 UTF-8 characters
my $utf8_arg = shift @ARGV;         # @ARGV is UTF-8 too (only for main)

DESCRIPTION

The use utf8 pragma tells the Perl parser to allow UTF-8 in the program text in the current lexical scope. This also means that you can now use literal Unicode characters as part of strings, variable names, and regular expressions.
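As a minimal sketch of what the plain use utf8 pragma allows, the following uses only core Perl and assumes the source file itself is saved as UTF-8. The variable names are illustrative only.

```perl
use strict;
use warnings;
use utf8;   # assumption: this source file is saved as UTF-8

# A non-ASCII variable name holding a string with literal Unicode characters.
my $café = 'føø bār';

# With "use utf8" the literal counts as 7 characters, not 10 octets.
my $length = length $café;

# Literal Unicode characters also work inside a regular expression.
my $matches = () = $café =~ /[øā]/g;

print "length=$length matches=$matches\n";
```

Only numbers are printed here, because use utf8 on its own does not set up any output encoding; that is one of the gaps utf8::all fills.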

utf8::all goes further:

  • charnames are imported so \N{...} sequences can be used to compile Unicode characters based on names.
  • On Perl v5.11.0 or higher, the use feature 'unicode_strings' is enabled.
  • use feature fc and use feature unicode_eval are enabled on Perl 5.16.0 and higher.
  • Filehandles are opened with UTF-8 encoding turned on by default (including STDIN, STDOUT, STDERR), meaning that they automatically convert UTF-8 octets to characters and vice versa. If you don't want UTF-8 for a particular filehandle, you'll have to set binmode $filehandle.
  • @ARGV gets converted from UTF-8 octets to Unicode characters (when utf8::all is used from the main package). This is similar to the behaviour of the -CA perl command-line switch (see perlrun).
  • readdir, readlink, readpipe (including the qx// and backtick operators), and glob (including the <> operator) now all work with and return Unicode characters instead of (UTF-8) octets.
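The difference between UTF-8 octets and Unicode characters that several of these items rely on can be illustrated with core modules alone. This is a rough sketch, not utf8::all's implementation: it decodes a hard-coded octet string by hand with Encode, standing in for a value that might arrive through @ARGV or readdir, and it assumes Perl 5.16 or newer for the fc feature.

```perl
use strict;
use warnings;
use feature qw(fc unicode_strings);   # assumption: Perl 5.16 or newer
use charnames ':full';
use Encode qw(decode);

# \N{...} builds characters from their Unicode names.
my $word = "f\N{LATIN SMALL LETTER O WITH STROKE}\N{LATIN SMALL LETTER O WITH STROKE}";

# The same word as raw UTF-8 octets (5 bytes for 3 characters).
my $octets  = "f\xC3\xB8\xC3\xB8";
my $decoded = decode('UTF-8', $octets);

# After decoding, the octets compare equal to the named-character string.
my $same = $decoded eq $word ? 1 : 0;

# fc() folds case by Unicode rules, so the upper-case form matches as well.
my $upper = "F\N{LATIN CAPITAL LETTER O WITH STROKE}\N{LATIN CAPITAL LETTER O WITH STROKE}";
my $fold_equal = fc($upper) eq fc($word) ? 1 : 0;

printf "octets=%d chars=%d same=%d fold_equal=%d\n",
    length($octets), length($decoded), $same, $fold_equal;
```

With utf8::all loaded, the explicit decode step should not be needed for the sources listed above, since the values already arrive as characters.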

Lexical Scope

The pragma is lexically-scoped, so you can do the following if you had some reason to:

{
    use utf8::all;
    open my $out, '>', 'outfile';
    my $utf8_str = 'føø bār';
    print length $utf8_str, "\n"; # 7
    print $out $utf8_str;         # out as utf8
}
open my $in, '<', 'outfile';      # in as raw
my $text = do { local $/; <$in>};
print length $text, "\n";         # 10, not 7!

Instead of lexical scoping, you can also use no utf8::all to turn off the effects.

Note that the effect on @ARGV and the STDIN, STDOUT, and STDERR file handles is always global!

UTF-8 Errors

utf8::all treats invalid code points (i.e., UTF-8 that does not map to a valid Unicode "character") as a fatal error.

For glob, readdir, and readlink, one can change this behaviour by setting the attribute "$utf8::all::UTF8_CHECK".

ATTRIBUTES

$utf8::all::UTF8_CHECK

By default, utf8::all treats decoding errors as fatal (the default value of this setting is Encode::FB_CROAK). You can change this by setting $utf8::all::UTF8_CHECK. The value Encode::FB_WARN reports decoding errors as warnings, and Encode::FB_DEFAULT ignores them completely. Please see Encode for details. Note: Encode::LEAVE_SRC is always enforced.

Important: This setting only controls the handling of decoding errors in glob, readdir, and readlink.
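To get a feel for what these check values mean, the sketch below calls Encode's decode directly on a deliberately malformed octet string. It does not go through utf8::all at all; it only shows how Encode itself behaves with Encode::FB_CROAK and Encode::FB_DEFAULT, each combined with Encode::LEAVE_SRC. The sample input is an assumption chosen for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode);

# "f\xF8\xF8" is "føø" in Latin-1; these octets are not valid UTF-8.
my $bad = "f\xF8\xF8";

# FB_CROAK: decoding malformed input dies.
my $died = 0;
eval {
    decode('UTF-8', $bad, Encode::FB_CROAK() | Encode::LEAVE_SRC());
    1;
} or $died = 1;

# FB_DEFAULT: no error is raised; Encode substitutes U+FFFD for bad data.
my $lenient = decode('UTF-8', $bad, Encode::FB_DEFAULT() | Encode::LEAVE_SRC());
my $has_replacement = $lenient =~ /\x{FFFD}/ ? 1 : 0;

print "FB_CROAK died: $died\n";
print "FB_DEFAULT inserted U+FFFD: $has_replacement\n";

# In a program using utf8::all, the setting would be changed like this:
#   $utf8::all::UTF8_CHECK = Encode::FB_WARN;
```

Because Encode::LEAVE_SRC is set, $bad itself is left unmodified by both calls.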

INTERACTION WITH AUTODIE

If you use autodie, which is a great idea, you need to use at least version 2.12, released on June 26, 2012. Otherwise, autodie obliterates the IO layers set by the open pragma. See RT #54777 and GH #7.
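One way to make sure the minimum version is met is to state it in the use line, which makes compilation fail on an older autodie. The sketch below shows only the autodie side and leaves utf8::all out so that it runs with core modules; the file path is a hypothetical name that is assumed not to exist.

```perl
use strict;
use warnings;

# Request at least autodie 2.12; compilation fails on anything older.
use autodie 2.12;

# With autodie in effect, a failed open dies instead of returning false.
# The path is hypothetical and assumed not to exist.
my $opened = eval {
    open my $fh, '<', '/no/such/dir/for-this-sketch/input.txt';
    1;
} ? 1 : 0;

print "autodie version: ", autodie->VERSION, "\n";
print "open succeeded: $opened\n";
```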

BUGS

Please report any bugs or feature requests on the bugtracker website.

When submitting a bug or request, please include a test-file or a patch to an existing test-file that illustrates the bug or desired feature.

COMPATIBILITY

The filesystems of DOS, Windows, and OS/2 do not (fully) support UTF-8. The readlink and readdir functions and the glob operators are therefore not replaced on these systems.
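A script that depends on decoded file names may want to know which case applies. The helper below is a hypothetical sketch, not part of utf8::all, and the module's own platform test may differ. It assumes the OS names that Perl reports in $^O for these systems (MSWin32, dos, os2).

```perl
use strict;
use warnings;

# Hypothetical helper (not part of utf8::all): true if the given OS name,
# as Perl reports it in $^O, is one of the platforms listed above.
sub keeps_builtin_fs_functions {
    my ($os_name) = @_;
    return $os_name =~ /\A(?:MSWin32|dos|os2)\z/ ? 1 : 0;
}

printf "%s: readdir, readlink, and glob are %s\n", $^O,
    keeps_builtin_fs_functions($^O)
        ? 'left as the built-in versions'
        : 'expected to be replaced by utf8::all';
```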

SEE ALSO

AUTHORS

COPYRIGHT AND LICENSE

This software is copyright (c) 2009 by Michael Schwern <mschwern@cpan.org>; he originated it.

This is free software; you can redistribute it and/or modify it under the same terms as the Perl 5 programming language system itself.