\rSec0[lex]{Lexical conventions}
%gram: \rSec1[gram.lex]{Lexical conventions}
%gram:
\indextext{lexical conventions|see{conventions, lexical}}
\indextext{translation!separate|see{compilation, separate}}
\indextext{separate translation|see{compilation, separate}}
\indextext{separate compilation|see{compilation, separate}}
\indextext{phases of translation|see{translation, phases}}
\indextext{source file character|see{character, source file}}
\indextext{alternative token|see{token, alternative}}
\indextext{digraph|see{token, alternative}}
\indextext{integer literal|see{literal, integer}}
\indextext{character literal|see{literal, character}}
\indextext{floating literal|see{literal, floating}}
\indextext{floating-point literal|see{literal, floating}}
\indextext{string literal|see{literal, string}}
\indextext{boolean literal|see{literal, boolean}}
\indextext{pointer literal|see{literal, pointer}}
\indextext{user-defined literal|see{literal, user defined}}
\indextext{file, source|see{source file}}
\rSec1[lex.separate]{Separate translation}
\pnum
\indextext{conventions!lexical|(}%
\indextext{compilation!separate|(}%
The text of the program is kept in units called
\indextext{source file}\term{source files} in this International
Standard. A source file together with all the headers~(\ref{headers})
and source files included~(\ref{cpp.include}) via the preprocessing
directive \tcode{\#include}, less any source lines skipped by any of the
conditional inclusion~(\ref{cpp.cond}) preprocessing directives, is
called a \defn{translation unit}.
\enternote A \Cpp program need not all be translated at the same time.
\exitnote
\pnum
\enternote Previously translated translation units and instantiation
units can be preserved individually or in libraries. The separate
translation units of a program communicate~(\ref{basic.link}) by (for
example) calls to functions whose identifiers have external linkage,
manipulation of objects whose identifiers have external linkage, or
manipulation of data files. Translation units can be separately
translated and then later linked to produce an executable
program~(\ref{basic.link}). \exitnote%
\indextext{compilation!separate|)}
\rSec1[lex.phases]{Phases of translation}%
\pnum
\indextext{translation!phases|(}
The precedence among the syntax rules of translation is specified by the
following phases.\footnote{Implementations must behave as if these separate phases
occur, although in practice different phases might be folded together.}
\begin{enumerate}
\indextext{source file}%
\indextext{character!source file}%
\indextext{character set!basic source}%
\item Physical source file characters are mapped, in an
\impldef{mapping physical source file characters to basic source character set} manner,
to the basic source character set (introducing new-line characters for end-of-line
indicators) if necessary.
The set of physical source file characters accepted is \impldef{physical source file
characters}.
\indextext{trigraph sequence}Trigraph sequences~(\ref{lex.trigraph}) are
replaced by corresponding single-character internal representations. Any
source file character not in the basic source character
set~(\ref{lex.charset}) is replaced by the
\indextext{universal character name}universal-character-name that
designates that character. (An implementation may use any internal
encoding, so long as an actual extended character encountered in the
source file, and the same extended character expressed in the source
file as a universal-character-name (i.e., using the \tcode{\textbackslash
uXXXX} notation), are handled equivalently
except where this replacement is reverted in a raw string literal.)
\indextext{line splicing}%
\item Each instance of a backslash character (\textbackslash)
immediately followed by a new-line character is deleted, splicing
physical source lines to form logical source lines. Only the last
backslash on any physical source line shall be eligible for being part
of such a splice. If, as a result, a character sequence that matches the
syntax of a universal-character-name is produced, the behavior is
undefined. A source file that is not empty and that does not end in a new-line
character, or that ends in a new-line character immediately preceded by a
backslash character before any such splicing takes place,
shall be processed as if an additional new-line character were appended
to the file.
\item The source file is decomposed into preprocessing
tokens~(\ref{lex.pptoken}) and sequences of white-space characters
(including comments). A source file shall not end in a partial
preprocessing token or in a partial comment.\footnote{A partial preprocessing
token would arise from a source file
ending in the first portion of a multi-character token that requires a
terminating sequence of characters, such as a \grammarterm{header-name}
that is missing the closing \tcode{"}
or \tcode{>}. A partial comment
would arise from a source file ending with an unclosed \tcode{/*}
comment.}
Each comment is replaced by one space character. New-line characters are
retained. Whether each nonempty sequence of white-space characters other
than new-line is retained or replaced by one space character is
unspecified. The process of dividing a source file's
characters into preprocessing tokens is context-dependent.
\enterexample
see the handling of \tcode{<} within a \tcode{\#include} preprocessing
directive.
\exitexample
\item Preprocessing directives are executed, macro invocations are
expanded, and \tcode{_Pragma} unary operator expressions are executed.
If a character sequence that matches the syntax of a
universal-character-name is produced by token
concatenation~(\ref{cpp.concat}), the behavior is undefined. A
\tcode{\#include} preprocessing directive causes the named header or
source file to be processed from phase 1 through phase 4, recursively.
All preprocessing directives are then deleted.
\item Each source character set member in a character literal or a string
literal, as well as each escape sequence and universal-character-name in a
character literal or a non-raw string literal, is converted to the corresponding
member of the execution character set~(\ref{lex.ccon}, \ref{lex.string}); if
there is no corresponding member, it is converted to an \impldef{converting
characters from source character set to execution character set} member other
than the null (wide) character.\footnote{An implementation need not convert all
non-corresponding source characters to the same execution character.}
\item Adjacent string literal tokens are concatenated.
\item White-space characters separating tokens are no longer
significant. Each preprocessing token is converted into a
token~(\ref{lex.token}). The resulting tokens are syntactically and
semantically analyzed and translated as a translation unit. \enternote
The process of analyzing and translating the tokens may occasionally
result in one token being replaced by a sequence of other
tokens~(\ref{temp.names}).\exitnote \enternote Source files, translation
units and translated translation units need not necessarily be stored as
files, nor need there be any one-to-one correspondence between these
entities and any external representation. The description is conceptual
only, and does not specify any particular implementation. \exitnote
\item Translated translation units and instantiation units are combined
as follows: \enternote Some or all of these may be supplied from a
library. \exitnote Each translated translation unit is examined to
produce a list of required instantiations. \enternote This may include
instantiations which have been explicitly
requested~(\ref{temp.explicit}). \exitnote The definitions of the
required templates are located. It is \impldef{whether source of translation units must
be available to locate template definitions} whether the
source of the translation units containing these definitions is required
to be available. \enternote An implementation could encode sufficient
information into the translated translation unit so as to ensure the
source is not required here. \exitnote All the required instantiations
are performed to produce
\defn{instantiation units}. \enternote These are similar
to translated translation units, but contain no references to
uninstantiated templates and no template definitions. \exitnote The
program is ill-formed if any instantiation fails.
\item All external entity references are resolved. Library
components are linked to satisfy external references to
entities not defined in the current translation. All such translator
output is collected into a program image which contains information
needed for execution in its execution environment.%
\indextext{translation!phases|)}
\end{enumerate}
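\pnum
\enterexample
The following is a minimal, non-normative sketch of two of the phases above:
line splicing in phase 2 and string literal concatenation in phase 6.
The names \tcode{spliced} and \tcode{greeting} are chosen here purely for
illustration.
\begin{codeblock}
// Phase 2: the backslash and the new-line after it are deleted,
// so the two physical lines below form the single token 42.
constexpr int spliced = 4\
2;
static_assert(spliced == 42, "line splicing joins 4 and 2");

// Phase 6: adjacent string literal tokens are concatenated.
constexpr char greeting[] = "Hello, " "world";
static_assert(sizeof(greeting) == 13, "12 characters plus the null");
\end{codeblock}
Both checks are evaluated during translation, so a translation unit
containing this fragment is accepted only if the phases behave as described.
\exitexample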
\rSec1[lex.charset]{Character sets}
\pnum
\indextext{character set|(}%
\indextext{character set!basic source}%
The \term{basic source character set} consists of 96 characters: the space character,
the control characters representing horizontal tab, vertical tab, form feed, and
new-line, plus the following 91 graphical characters:\footnote{The glyphs for
the members of the basic source character set are intended to
identify characters from the subset of ISO/IEC 10646 which corresponds to the ASCII
character set. However, because the mapping from source file characters to the source
character set (described in translation phase 1) is specified as implementation-defined,
an implementation is required to document how the basic source characters are
represented in source files.}
\begin{codeblock}
a b c d e f g h i j k l m n o p q r s t u v w x y z
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
0 1 2 3 4 5 6 7 8 9
_ { } [ ] # ( ) < > % : ; . ? * + - / ^ & | @$\sim$@ ! = , @\textbackslash@ " '
\end{codeblock}
\pnum
The \grammarterm{universal-character-name} construct provides a way to name
other characters.
\begin{bnf}
\nontermdef{hex-quad}\br
hexadecimal-digit hexadecimal-digit hexadecimal-digit hexadecimal-digit
\end{bnf}
\begin{bnf}
\nontermdef{universal-character-name}\br
\terminal{\textbackslash u} hex-quad\br
\terminal{\textbackslash U} hex-quad hex-quad
\end{bnf}
The character designated by the universal-character-name \tcode{\textbackslash
UNNNNNNNN} is that character whose character short name in ISO/IEC 10646 is
\tcode{NNNNNNNN}; the character designated by the universal-character-name
\tcode{\textbackslash uNNNN} is that character whose character short name in
ISO/IEC 10646 is \tcode{0000NNNN}. If the hexadecimal value for a
universal-character-name corresponds to a surrogate code point (in the
range 0xD800--0xDFFF, inclusive), the program is ill-formed. Additionally, if
the hexadecimal value for a universal-character-name outside
the \grammarterm{c-char-sequence}, \grammarterm{s-char-sequence}, or
\grammarterm{r-char-sequence} of
a character or
string literal corresponds to a control character (in either of the
ranges 0x00--0x1F or 0x7F--0x9F, both inclusive) or to a character in the basic
source character set, the program is ill-formed.\footnote{A sequence of characters resembling a universal-character-name in an
\grammarterm{r-char-sequence}~(\ref{lex.string}) does not form a
universal-character-name.}
\pnum
The \term{basic execution character set} and the \term{basic
execution wide-character set} shall each contain all the members of the
basic source character set, plus control characters representing alert,
backspace, and carriage return, plus a \term{null character}
(respectively, \term{null wide character}), whose representation has
all zero bits. For each basic execution character set, the values of the
members shall be non-negative and distinct from one another. In both the
source and execution basic character sets, the value of each character
after \tcode{0} in the above list of decimal digits shall be one greater
than the value of the previous. The \term{execution character set}
and the \term{execution wide-character set} are
implementation-defined
\indeximpldef{execution character-set and execution wide-character set}
supersets of the
basic execution character set and the basic execution wide-character
set, respectively. The values of the members of the execution character sets
and the sets of additional members
are locale-specific.%
\indextext{character set|)}
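\pnum
\enterexample
The following non-normative sketch shows both forms of a
universal-character-name and the requirement that the decimal digits have
consecutive values.
The names \tcode{euro} and \tcode{digitValue} are hypothetical and serve
only as illustration.
\begin{codeblock}
// A universal-character-name designates a character by its
// ISO/IEC 10646 short name; here U+20AC, the euro sign.
constexpr char32_t euro = U'\u20AC';
static_assert(euro == 0x20AC, "");

// The long form with eight hexadecimal digits names the same character.
static_assert(U'\U000020AC' == euro, "");

// Decimal digits have consecutive values, so subtraction yields the digit.
constexpr int digitValue(char c) { return c - '0'; }
static_assert(digitValue('7') == 7, "");
\end{codeblock}
\exitexample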
\rSec1[lex.trigraph]{Trigraph sequences}
\pnum
\indextext{trigraph sequence|(}%
Before any other processing takes place, each occurrence of one of the
following sequences of three characters (``\term{trigraph
sequences}'') is replaced by the single character indicated in
Table~\ref{tab:trigraph.sequences}.
\begin{tokentable}{Trigraph sequences}{tab:trigraph.sequences}{Trigraph}{Replacement}
\tcode{??=} & \tcode{\#} &
\tcode{??(} & \tcode{[} &
\tcode{??<} & \tcode{\{} \\ \rowsep
\tcode{??/} & \tcode{\textbackslash} &
\tcode{??)} & \tcode{]} &
\tcode{??>} & \tcode{\}} \\ \rowsep
\tcode{??'} & \tcode{\^{}} &
\tcode{??!} & \tcode{|} &
\tcode{??-} & \tcode{$\sim$} \\
\end{tokentable}
\pnum
\enterexample
\begin{codeblock}
??=define arraycheck(a,b) a??(b??) ??!??! b??(a??)
\end{codeblock}
becomes
\begin{codeblock}
#define arraycheck(a,b) a[b] || b[a]
\end{codeblock}
\exitexample
\pnum
No other trigraph sequence exists. Each \tcode{?} that does not begin
one of the trigraphs listed above is not changed.%
\indextext{trigraph sequence|)}
\rSec1[lex.pptoken]{Preprocessing tokens}
\begin{bnf}
\indextext{token!preprocessing|(}%
\nontermdef{preprocessing-token}\br
header-name\br
identifier\br
pp-number\br
character-literal\br
user-defined-character-literal\br
string-literal\br
user-defined-string-literal\br
preprocessing-op-or-punc\br
\textnormal{each non-white-space character that cannot be one of the above}
\end{bnf}
\pnum
Each preprocessing token that is converted to a token~(\ref{lex.token})
shall have the lexical form of a keyword, an identifier, a literal, an
operator, or a punctuator.
\pnum
A preprocessing token is the minimal lexical element of the language in translation
phases 3 through 6. The categories of preprocessing token are: header names,
identifiers, preprocessing numbers, character literals (including user-defined character
literals), string literals (including user-defined string literals), preprocessing
operators and punctuators, and single non-white-space characters that do not lexically
match the other preprocessing token categories. If a \tcode{'} or a \tcode{"} character
matches the last category, the behavior is undefined. Preprocessing tokens can be
separated by
\indextext{space!white}%
white space;
\indextext{comment}%
this consists of comments~(\ref{lex.comment}), or white-space
characters (space, horizontal tab, new-line, vertical tab, and
form-feed), or both. As described in Clause~\ref{cpp}, in certain
circumstances during translation phase 4, white space (or the absence
thereof) serves as more than preprocessing token separation. White space
can appear within a preprocessing token only as part of a header name or
between the quotation characters in a character literal or string
literal.
\pnum
If the input stream has been parsed into preprocessing tokens up to a
given character:
\begin{itemize}
\item If the next character begins a sequence of characters that could be the prefix
and initial double quote of a raw string literal, such as \tcode{R"}, the next preprocessing
token shall be a raw string literal. Between the initial and final
double quote characters of the raw string, any transformations performed in phases
1 and 2 (trigraphs, universal-character-names, and line splicing) are reverted; this reversion
shall apply before any \grammarterm{d-char}, \grammarterm{r-char}, or delimiting
parenthesis is identified. The raw string literal is defined as the shortest sequence
of characters that matches the raw-string pattern
\begin{ncbnf}
encoding-prefix\opt \terminal{R} raw-string
\end{ncbnf}
\item Otherwise, if the next three characters are \tcode{<::} and the subsequent character
is neither \tcode{:} nor \tcode{>}, the \tcode{<} is treated as a preprocessing token by
itself and not as the first character of the alternative token \tcode{<:}.
\item Otherwise,
the next preprocessing token is the longest sequence of
characters that could constitute a preprocessing token, even if that
would cause further lexical analysis to fail.
\end{itemize}
\enterexample
\begin{codeblock}
#define R "x"
const char* s = R"y"; // ill-formed raw string, not \tcode{"x" "y"}
\end{codeblock}
\exitexample
\pnum
\enterexample The program fragment \tcode{1Ex} is parsed as a
preprocessing number token (one that is not a valid floating or integer
literal token), even though a parse as the pair of preprocessing tokens
\tcode{1} and \tcode{Ex} might produce a valid expression (for example,
if \tcode{Ex} were a macro defined as \tcode{+1}). Similarly, the
program fragment \tcode{1E1} is parsed as a preprocessing number (one
that is a valid floating literal token), whether or not \tcode{E} is a
macro name. \exitexample
\pnum
\enterexample The program fragment \tcode{x+++++y} is parsed as \tcode{x
++ ++ + y}, which, if \tcode{x} and \tcode{y} have integral types,
violates a constraint on increment operators, even though the parse
\tcode{x ++ + ++ y} might yield a correct expression. \exitexample%
\indextext{token!preprocessing|)}
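\pnum
\enterexample
The following non-normative sketch shows the \tcode{<::} rule and the
longest-match rule in a fragment that translates successfully.
The names \tcode{tag}, \tcode{box}, \tcode{boxed}, and \tcode{munch} are
hypothetical.
\begin{codeblock}
struct tag {};
template <class T> struct box { T value; };

// The less-than sign is followed by two colons and then the letter t,
// so it is a token by itself and not the start of a digraph.
box<::tag> boxed;

// Longest match: the initializer below is tokenized as a ++ + b.
inline int munch(int a, int b) {
  int sum = a+++b;       // adds the old value of a to b, then increments a
  return sum * 100 + a;  // packs both results into one value
}
\end{codeblock}
With this definition, \tcode{munch(1, 2)} yields \tcode{302}: the sum is
\tcode{3}, and \tcode{a} has been incremented to \tcode{2}.
\exitexample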
\rSec1[lex.digraph]{Alternative tokens}
\pnum
\indextext{token!alternative|(}%
Alternative token representations are provided for some operators and
punctuators.\footnote{\indextext{digraph}%
These include ``digraphs'' and additional reserved words. The term
``digraph'' (token consisting of two characters) is not perfectly
descriptive, since one of the alternative preprocessing-tokens is
\tcode{\%:\%:} and of course several primary tokens contain two
characters. Nonetheless, those alternative tokens that aren't lexical
keywords are colloquially known as ``digraphs''. }
\pnum
In all respects of the language, each alternative token behaves the
same, respectively, as its primary token, except for its spelling.\footnote{Thus the ``stringized'' values~(\ref{cpp.stringize}) of
\tcode{[} and \tcode{<:} will be different, maintaining the source
spelling, but the tokens can otherwise be freely interchanged. }
The set of alternative tokens is defined in
Table~\ref{tab:alternative.tokens}.
\begin{tokentable}{Alternative tokens}{tab:alternative.tokens}{Alternative}{Primary}
\tcode{<\%} & \tcode{\{} &
\tcode{and} & \tcode{\&\&} &
\tcode{and_eq} & \tcode{\&=} \\ \rowsep
\tcode{\%>} & \tcode{\}} &
\tcode{bitor} & \tcode{|} &
\tcode{or_eq} & \tcode{|=} \\ \rowsep
\tcode{<:} & \tcode{[} &
\tcode{or} & \tcode{||} &
\tcode{xor_eq} & \tcode{\^{}=} \\ \rowsep
\tcode{:>} & \tcode{]} &
\tcode{xor} & \tcode{\^{}} &
\tcode{not} & \tcode{!} \\ \rowsep
\tcode{\%:} & \tcode{\#} &
\tcode{compl} & \tcode{$\sim$} &
\tcode{not_eq} & \tcode{!=} \\ \rowsep
\tcode{\%:\%:} & \tcode{\#\#} &
\tcode{bitand} & \tcode{\&} &
& \\
\end{tokentable}%
\indextext{token!alternative|)}
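\pnum
\enterexample
The following non-normative sketch uses several alternative tokens in place
of their primary tokens.
The names \tcode{onlyFirst}, \tcode{clearLowBit}, and \tcode{firstOf} are
hypothetical.
\begin{codeblock}
// Alternative tokens that are spelled like words.
constexpr bool onlyFirst(bool p, bool q) { return p and not q; }
constexpr int clearLowBit(int v) { return v bitand compl 1; }
static_assert(onlyFirst(true, false), "");
static_assert(clearLowBit(7) == 6, "");

// Digraphs: the brackets and braces below are spelled with two characters.
inline int firstOf(const int (&a)<:2:>) <%
  return a<:0:>;
%>
\end{codeblock}
Apart from spelling, each function is equivalent to the one obtained by
substituting the primary tokens.
\exitexample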
\rSec1[lex.token]{Tokens}
\indextext{token|(}%
\begin{bnf}
\nontermdef{token}\br
identifier\br
keyword\br
literal\br
operator\br
punctuator
\end{bnf}
\pnum
\indextext{\idxgram{token}}%
There are five kinds of tokens: identifiers, keywords, literals,\footnote{Literals include strings and character and numeric literals.
}
operators, and other separators.
\indextext{white~space}%
Blanks, horizontal and vertical tabs, newlines, formfeeds, and comments
(collectively, ``white space''), as described below, are ignored except
as they serve to separate tokens. \enternote Some white space is
required to separate otherwise adjacent identifiers, keywords, numeric
literals, and alternative tokens containing alphabetic characters.
\exitnote%
\indextext{token|)}
\rSec1[lex.comment]{Comments}
\pnum
\indextext{comment|(}%
\indextext{comment!\tcode{/*}~\tcode{*/}}%
\indextext{comment!\tcode{//}}%
The characters \tcode{/*} start a comment, which terminates with the
characters \tcode{*/}. These comments do not nest.
\indextext{comment!\tcode{//}}%
The characters \tcode{//} start a comment, which terminates with the
next new-line character. If there is a form-feed or a vertical-tab
character in such a comment, only white-space characters shall appear
between it and the new-line that terminates the comment; no diagnostic
is required. \enternote The comment characters \tcode{//}, \tcode{/*},
and \tcode{*/} have no special meaning within a \tcode{//} comment and
are treated just like other characters. Similarly, the comment
characters \tcode{//} and \tcode{/*} have no special meaning within a
\tcode{/*} comment. \exitnote%
\indextext{comment|)}
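\pnum
\enterexample
The following non-normative sketch shows that the characters that begin one
kind of comment have no special meaning inside the other kind.
The names \tcode{six} and \tcode{seven} are hypothetical.
\begin{codeblock}
// Inside a block comment, the characters that would start a line comment
// have no special meaning, so the comment below ends on the same line.
constexpr int six = 6 /* a // here does not start a line comment */ ;

// Inside a line comment, the characters that would open a block comment
// have no special meaning either, so the next line is ordinary code.
constexpr int seven = 7; // a /* here does not open a block comment
static_assert(six * seven == 42, "");
\end{codeblock}
\exitexample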
\rSec1[lex.header]{Header names}
\indextext{header!name|(}%
\begin{bnf}
\nontermdef{header-name}\br
\terminal{<} h-char-sequence \terminal{>}\br
\terminal{"} q-char-sequence \terminal{"}
\end{bnf}
\begin{bnf}
\nontermdef{h-char-sequence}\br
h-char\br
h-char-sequence h-char
\end{bnf}
\begin{bnf}
\nontermdef{h-char}\br
\textnormal{any member of the source character set except new-line and \terminal{>}}
\end{bnf}
\begin{bnf}
\nontermdef{q-char-sequence}\br
q-char\br
q-char-sequence q-char
\end{bnf}
\begin{bnf}
\nontermdef{q-char}\br
\textnormal{any member of the source character set except new-line and \terminal{"}}
\end{bnf}
\pnum
Header name preprocessing tokens shall only appear within a
\tcode{\#include} preprocessing directive~(\ref{cpp.include}). The
sequences in both forms of \grammarterm{header-name}{s} are mapped in an
\impldef{mapping header name to header or external source file} manner to headers or to
external source file names as specified in~\ref{cpp.include}.
\pnum
The appearance of either of the characters \tcode{'} or \tcode{\textbackslash} or of
either of the character sequences \tcode{/*} or \tcode{//} in a
\grammarterm{q-char-sequence} or an \grammarterm{h-char-sequence}
is conditionally supported with implementation-defined semantics,
as is the appearance of the
character \tcode{"} in an \grammarterm{h-char-sequence}.\footnote{Thus, a
sequence of characters that resembles an escape sequence might result in an
error, be interpreted as the character corresponding to the escape sequence, or
have a completely different meaning, depending on the implementation.}%
\indextext{header!name|)}
\rSec1[lex.ppnumber]{Preprocessing numbers}
\index{number!preprocessing|(}%
\begin{bnf}
\nontermdef{pp-number}\br
digit\br
\terminal{.} digit\br
pp-number digit\br
pp-number identifier-nondigit\br
pp-number \terminal{e} sign\br
pp-number \terminal{E} sign\br
pp-number \terminal{.}
\end{bnf}
\pnum
Preprocessing number tokens lexically include all integer literal
tokens~(\ref{lex.icon}) and all floating literal
tokens~(\ref{lex.fcon}).
\pnum
A preprocessing number does not have a type or a value; it acquires both
after a successful conversion to an integer literal token or a floating literal
token.%
\index{number!preprocessing|)}
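\pnum
\enterexample
The following non-normative sketch shows why white space can matter around a
preprocessing number.
The names \tcode{spaced} and \tcode{scaled} are hypothetical.
\begin{codeblock}
// With white space, 0xe, + and 1 are three separate tokens.
constexpr int spaced = 0xe + 1;
static_assert(spaced == 15, "");

// Written without white space, the same characters would match the
// pp-number production as a whole, because an e followed by a sign
// continues a pp-number, and could not be converted to a valid literal.

// One pp-number that is also a valid floating literal.
constexpr double scaled = 1E1;
static_assert(scaled == 10.0, "");
\end{codeblock}
\exitexample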
\rSec1[lex.name]{Identifiers}
\indextext{identifier|(}%
\begin{bnf}
\nontermdef{identifier}\br
identifier-nondigit\br
identifier identifier-nondigit\br
identifier digit
\end{bnf}
\begin{bnf}
\nontermdef{identifier-nondigit}\br
nondigit\br
universal-character-name\br
\textnormal{other implementation-defined characters}
\end{bnf}
\begin{bnf}
\nontermdef{nondigit} \textnormal{one of}\br
\terminal{a b c d e f g h i j k l m}\br
\terminal{n o p q r s t u v w x y z}\br
\terminal{A B C D E F G H I J K L M}\br
\terminal{N O P Q R S T U V W X Y Z _}
\end{bnf}
\begin{bnf}
\nontermdef{digit} \textnormal{one of}\br
\terminal{0 1 2 3 4 5 6 7 8 9}
\end{bnf}
\pnum
\indextext{name!length~of}%
\indextext{name}%
An identifier is an arbitrarily long sequence of letters and digits.
Each universal-character-name in an identifier shall designate a
character whose encoding in ISO/IEC 10646 falls into one of the ranges
specified in~\ref{charname.allowed}.
The initial element shall not be a universal-character-name
designating a character whose encoding falls into one of the ranges
specified in~\ref{charname.disallowed}.
Upper- and lower-case letters are
different. All characters are significant.\footnote{On systems in which linkers cannot accept extended
characters, an encoding of the universal-character-name may be used in
forming valid external identifiers. For example, some otherwise unused
character or sequence of characters may be used to encode the
\tcode{\textbackslash u} in a universal-character-name. Extended
characters may produce a long external identifier, but \Cpp does not
place a translation limit on significant characters for external
identifiers. In \Cpp, upper- and lower-case letters are considered
different for all identifiers, including external identifiers. }
\pnum
The identifiers in Table~\ref{tab:identifiers.special} have a special meaning when
appearing in a certain context. When referred to in the grammar, these identifiers
are used explicitly rather than using the \grammarterm{identifier} grammar production.
Any ambiguity as to whether a given \grammarterm{identifier} has a special meaning is
resolved to interpret the token as a regular \grammarterm{identifier}.
\begin{floattable}{Identifiers with special meaning}{tab:identifiers.special}
{ll}
\topline
\tcode{override} &
\tcode{final} \\
\end{floattable}
\pnum
\indextext{\idxcode{_}|see{character, underscore}}%
\indextext{character!underscore!in identifier}%
\indextext{reserved~identifier}%
In addition, some identifiers are reserved for use by \Cpp
implementations and standard libraries~(\ref{global.names}) and shall
not be used otherwise; no diagnostic is required.%
\indextext{identifier|)}
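\pnum
\enterexample
The following non-normative sketch uses \tcode{final} and \tcode{override}
first with their special meaning and then as ordinary identifiers.
The names \tcode{shape}, \tcode{square}, \tcode{sides}, and
\tcode{plainNames} are hypothetical.
\begin{codeblock}
struct shape {
  virtual ~shape() {}
  virtual int sides() const { return 0; }
};

// Here final and override have their special meaning.
struct square final : shape {
  int sides() const override { return 4; }
};

// Here the same spellings are ordinary identifiers naming variables.
inline int plainNames() {
  int final = 3;
  int override = 4;
  return final + override;
}
\end{codeblock}
\exitexample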
\rSec1[lex.key]{Keywords}
\enlargethispage{\baselineskip}%
\pnum
\indextext{keyword|(}%
The identifiers shown in Table~\ref{tab:keywords} are reserved for use
as keywords (that is, they are unconditionally treated as keywords in
phase 7) except in an \grammarterm{attribute-token}~(\ref{dcl.attr.grammar})
\enternote The \tcode{export} keyword is unused but is reserved for future use.\exitnote:
\begin{floattable}{Keywords}{tab:keywords}
{lllll}
\topline
\tcode{alignas} &
\tcode{continue} &
\tcode{friend} &
\tcode{register} &
\tcode{true} \\
\tcode{alignof} &
\tcode{decltype} &
\tcode{goto} &
\tcode{reinterpret_cast} &
\tcode{try} \\
\tcode{asm} &
\tcode{default} &
\tcode{if} &
\tcode{return} &
\tcode{typedef} \\
\tcode{auto} &
\tcode{delete} &
\tcode{inline} &
\tcode{short} &
\tcode{typeid} \\
\tcode{bool} &
\tcode{do} &
\tcode{int} &
\tcode{signed} &
\tcode{typename} \\
\tcode{break} &
\tcode{double} &
\tcode{long} &
\tcode{sizeof} &
\tcode{union} \\
\tcode{case} &
\tcode{dynamic_cast} &
\tcode{mutable} &
\tcode{static} &
\tcode{unsigned} \\
\tcode{catch} &
\tcode{else} &
\tcode{namespace} &
\tcode{static_assert} &
\tcode{using} \\
\tcode{char} &
\tcode{enum} &
\tcode{new} &
\tcode{static_cast} &
\tcode{virtual} \\
\tcode{char16_t} &
\tcode{explicit} &
\tcode{noexcept} &
\tcode{struct} &
\tcode{void} \\
\tcode{char32_t} &
\tcode{export} &
\tcode{nullptr} &
\tcode{switch} &
\tcode{volatile} \\
\tcode{class} &
\tcode{extern} &
\tcode{operator} &
\tcode{template} &
\tcode{wchar_t} \\
\tcode{const} &
\tcode{false} &
\tcode{private} &
\tcode{this} &
\tcode{while} \\
\tcode{constexpr} &
\tcode{float} &
\tcode{protected} &
\tcode{thread_local} & \\
\tcode{const_cast} &
\tcode{for} &
\tcode{public} &
\tcode{throw} & \\
\end{floattable}
\pnum
Furthermore, the alternative representations shown in
Table~\ref{tab:alternative.representations} for certain operators and
punctuators~(\ref{lex.digraph}) are reserved and shall not be used
otherwise:
\begin{floattable}{Alternative representations}{tab:alternative.representations}
{llllll}
\topline
\tcode{and} & \tcode{and_eq} & \tcode{bitand} & \tcode{bitor} & \tcode{compl} & \tcode{not} \\
\tcode{not_eq} & \tcode{or} & \tcode{or_eq} & \tcode{xor} & \tcode{xor_eq} & \\
\end{floattable}%
\indextext{keyword|)}%
\rSec1[lex.operators]{Operators and punctuators}
\pnum
\indextext{operator|(}%
\indextext{punctuator|(}%
The lexical representation of \Cpp programs includes a number of
preprocessing tokens which are used in the syntax of the preprocessor or
are converted into tokens for operators and punctuators:
\begin{bnfkeywordtab}
\nontermdef{preprocessing-op-or-punc} \textnormal{one of}\br
\>\{ \>\} \>[ \>] \>\# \>\#\# \>( \>)\br
\><: \>:> \><\% \>\%> \>\%: \>\%:\%: \>; \>: \>.{..}\br
\>new \>delete \>? \>:: \>. \>.*\br
\>+ \>- \>* \>/ \>\% \>\^{} \>\& \>| \>\tilde\br
\>! \>= \>< \>> \>+= \>-= \>*= \>/= \>\%=\br
\>\^{}= \>\&= \>|= \>\shl \>\shr \>\shr= \>\shl= \>== \>!=\br
\><= \>>= \>\&\& \>|| \>++ \>-{-} \>, \>->* \>->\br
\>and \>and_eq \>bitand \>bitor \>compl \>not \>not_eq\br
\>or \>or_eq \>xor \>xor_eq
\end{bnfkeywordtab}
Each \grammarterm{preprocessing-op-or-punc} is converted to a single token
in translation phase 7~(\ref{lex.phases}).%
\indextext{punctuator|)}%
\indextext{operator|)}
\rSec1[lex.literal]{Literals}%
\indextext{literal|(}
\rSec2[lex.literal.kinds]{Kinds of literals}
\pnum
\indextext{constant}%
\indextext{literal!constant}%
There are several kinds of literals.\footnote{The term ``literal'' generally designates, in this
International Standard, those tokens that are called ``constants'' in
ISO C. }
\begin{bnf}
\nontermdef{literal}\br
integer-literal\br
character-literal\br
floating-literal\br
string-literal\br
boolean-literal\br
pointer-literal\br
user-defined-literal
\end{bnf}
\rSec2[lex.icon]{Integer literals}
\indextext{literal!integer}%
\begin{bnf}
\nontermdef{integer-literal}\br
decimal-literal integer-suffix\opt\br
octal-literal integer-suffix\opt\br
hexadecimal-literal integer-suffix\opt
\end{bnf}
\begin{bnf}
\nontermdef{decimal-literal}\br
nonzero-digit\br
decimal-literal digit
\end{bnf}
\begin{bnf}
\nontermdef{octal-literal}\br
\terminal{0}\br
octal-literal octal-digit
\end{bnf}
\begin{bnf}
\nontermdef{hexadecimal-literal}\br
\terminal{0x} hexadecimal-digit\br
\terminal{0X} hexadecimal-digit\br
hexadecimal-literal hexadecimal-digit
\end{bnf}
\begin{bnf}
\nontermdef{nonzero-digit} \textnormal{one of}\br
\terminal{1 2 3 4 5 6 7 8 9}
\end{bnf}
\begin{bnf}
\nontermdef{octal-digit} \textnormal{one of}\br
\terminal{0 1 2 3 4 5 6 7}
\end{bnf}
\begin{bnf}
\nontermdef{hexadecimal-digit} \textnormal{one of}\br
\terminal{0 1 2 3 4 5 6 7 8 9}\br
\terminal{a b c d e f}\br
\terminal{A B C D E F}
\end{bnf}
\begin{bnf}
\nontermdef{integer-suffix}\br
unsigned-suffix long-suffix\opt \br
unsigned-suffix long-long-suffix\opt \br
long-suffix unsigned-suffix\opt \br
long-long-suffix unsigned-suffix\opt
\end{bnf}
\begin{bnf}
\nontermdef{unsigned-suffix} \textnormal{one of}\br
\terminal{u U}
\end{bnf}
\begin{bnf}
\nontermdef{long-suffix} \textnormal{one of}\br
\terminal{l L}
\end{bnf}
\begin{bnf}
\nontermdef{long-long-suffix} \textnormal{one of}\br
\terminal{ll LL}
\end{bnf}
\pnum
\indextext{literal!\idxcode{unsigned}}%
\indextext{literal!\idxcode{long}}%
\indextext{literal!integer}%
\indextext{literal!hexadecimal}%
\indextext{literal!octal}%
\indextext{literal!decimal}%
\indextext{literal!base~of integer}%
An \term{integer literal} is a sequence of digits that has no period
or exponent part. An integer literal may have a prefix that specifies
its base and a suffix that specifies its type. The lexically first digit
of the sequence of digits is the most significant. A \term{decimal}
integer literal (base ten) begins with a digit other than \tcode{0} and
consists of a sequence of decimal digits. An \term{octal} integer
literal (base eight) begins with the digit \tcode{0} and consists of a
sequence of octal digits.\footnote{The digits \tcode{8} and \tcode{9} are not octal digits. }
A \term{hexadecimal} integer literal (base sixteen) begins with
\tcode{0x} or \tcode{0X} and consists of a sequence of hexadecimal
digits, which include the decimal digits and the letters \tcode{a}
through \tcode{f} and \tcode{A} through \tcode{F} with decimal values
ten through fifteen. \enterexample The number twelve can be written
\tcode{12}, \tcode{014}, or \tcode{0XC}. \exitexample
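The three bases can be checked directly with a short sketch. The helper names \tcode{sum_of_three_twelves} and \tcode{check_integer_bases} are hypothetical and chosen only for illustration; a C++11 (or later) compiler is assumed.

```cpp
#include <cassert>

// The same value, twelve, written in each of the three bases.
static_assert(12 == 014, "014 is the octal spelling of twelve");
static_assert(12 == 0XC, "0XC is the hexadecimal spelling of twelve");
static_assert(0xc == 0XC, "hexadecimal digits and the prefix ignore case");
static_assert(010 == 8, "a leading 0 selects base eight, so 010 is eight");

// Hypothetical helper: adds the three spellings of twelve.
int sum_of_three_twelves() {
    return 12 + 014 + 0XC;
}

// Hypothetical helper: run-time counterpart of the checks above.
void check_integer_bases() {
    assert(sum_of_three_twelves() == 36);
    assert(0x10 == 16);
}
```

Calling \tcode{check_integer_bases()} from \tcode{main} is expected to complete without triggering an assertion.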
\pnum
\indextext{literal!\idxcode{long}}%
\indextext{literal!\idxcode{unsigned}}%
\indextext{suffix!\idxcode{L}}%
\indextext{suffix!\idxcode{U}}%
\indextext{suffix!\idxcode{l}}%
\indextext{suffix!\idxcode{u}}%
\indextext{literal!type~of integer}%
The type of an integer literal is the first of the corresponding list
in Table~\ref{tab:lex.type.integer.constant} in which its value can be
represented.
\enlargethispage{\baselineskip}%
\begin{LongTable}{Types of integer constants}{tab:lex.type.integer.constant}{l|l|l}
\\ \topline
\lhdr{Suffix} & \chdr{Decimal constants} & \rhdr{Octal or hexadecimal constants} \\ \capsep
\endfirsthead
\continuedcaption\\
\hline
\lhdr{Suffix} & \chdr{Decimal constants} & \rhdr{Octal or hexadecimal constants} \\ \capsep
\endhead
none &
\tcode{int} &
\tcode{int}\\
&
\tcode{long int} &
\tcode{unsigned int}\\
&
\tcode{long long int} &
\tcode{long int}\\
&
&
\tcode{unsigned long int}\\
&
&
\tcode{long long int}\\
&
&
\tcode{unsigned long long int}\\\hline
\tcode{u} or \tcode{U} &
\tcode{unsigned int} &
\tcode{unsigned int}\\
&
\tcode{unsigned long int} &
\tcode{unsigned long int}\\
&
\tcode{unsigned long long int} &
\tcode{unsigned long long int}\\\hline
\tcode{l} or \tcode{L} &
\tcode{long int} &
\tcode{long int}\\
&
\tcode{long long int} &
\tcode{unsigned long int}\\
&
&
\tcode{long long int}\\
&
&
\tcode{unsigned long long int}\\\hline
Both \tcode{u} or \tcode{U} &
\tcode{unsigned long int} &
\tcode{unsigned long int}\\
and \tcode{l} or \tcode{L} &
\tcode{unsigned long long int} &
\tcode{unsigned long long int}\\\hline
\tcode{ll} or \tcode{LL} &
\tcode{long long int} &
\tcode{long long int}\\
&
&
\tcode{unsigned long long int}\\\hline
Both \tcode{u} or \tcode{U} &
\tcode{unsigned long long int} &
\tcode{unsigned long long int}\\
and \tcode{ll} or \tcode{LL} &
&
\\
\end{LongTable}
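As an illustration of the first entry in each list of the table, the following sketch checks the type chosen for the value zero, which every listed type can represent. The helper name \tcode{check_integer_literal_types} is hypothetical; which type a larger value selects depends on the implementation's integer widths and is deliberately not shown.

```cpp
#include <cassert>
#include <type_traits>

// Zero fits in the first type of every list, so that first type is chosen.
static_assert(std::is_same<decltype(0), int>::value, "no suffix: int");
static_assert(std::is_same<decltype(0u), unsigned int>::value, "u or U: unsigned int");
static_assert(std::is_same<decltype(0L), long int>::value, "l or L: long int");
static_assert(std::is_same<decltype(0uL), unsigned long int>::value, "u and l: unsigned long int");
static_assert(std::is_same<decltype(0Lu), unsigned long int>::value, "the two suffixes may appear in either order");
static_assert(std::is_same<decltype(0LL), long long int>::value, "ll or LL: long long int");
static_assert(std::is_same<decltype(0ULL), unsigned long long int>::value, "u and ll: unsigned long long int");

// Octal and hexadecimal spellings of small values start from the same first type.
static_assert(std::is_same<decltype(07), int>::value, "octal, no suffix: int");
static_assert(std::is_same<decltype(0x7), int>::value, "hexadecimal, no suffix: int");

// Hypothetical helper: the literal's size follows from its type.
void check_integer_literal_types() {
    assert(sizeof(0) == sizeof(int));
    assert(sizeof(0u) == sizeof(unsigned int));
    assert(sizeof(0LL) == sizeof(long long int));
}
```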
\pnum
If an integer literal cannot be represented by any type in its list and
an extended integer type~(\ref{basic.fundamental}) can represent its value, it may have that
extended integer type. If all of the types in the list for the literal
are signed, the extended integer type shall be signed. If all of the
types in the list for the literal are unsigned, the extended integer
type shall be unsigned. If the list contains both signed and unsigned
types, the extended integer type may be signed or unsigned. A program is
ill-formed if one of its translation units contains an integer literal
that cannot be represented by any of the allowed types.
\rSec2[lex.ccon]{Character literals}
\indextext{literal!character}%
\begin{bnf}
\nontermdef{character-literal}\br
\terminal{'} c-char-sequence \terminal{'}\br
u\terminal{'} c-char-sequence \terminal{'}\br
U\terminal{'} c-char-sequence \terminal{'}\br
L\terminal{'} c-char-sequence \terminal{'}
\end{bnf}
\begin{bnf}
\nontermdef{c-char-sequence}\br
c-char\br
c-char-sequence c-char
\end{bnf}
\begin{bnftab}
\nontermdef{c-char}\br
\>\textnormal{any member of the source character set except}\br
\>\>\textnormal{the single-quote \terminal{'}, backslash \terminal{\textbackslash}, or new-line character}\br
\>escape-sequence\br
\>universal-character-name
\end{bnftab}
\begin{bnf}
\nontermdef{escape-sequence}\br
simple-escape-sequence\br
octal-escape-sequence\br
hexadecimal-escape-sequence
\end{bnf}
\begin{bnf}
\nontermdef{simple-escape-sequence} \textnormal{one of}\br
\terminal{\textbackslash'}\quad\terminal{\textbackslash"}\quad\terminal{\textbackslash ?}\quad\terminal{\textbackslash\textbackslash}\br
\terminal{\textbackslash a}\quad\terminal{\textbackslash b}\quad\terminal{\textbackslash f}\quad\terminal{\textbackslash n}\quad\terminal{\textbackslash r}\quad\terminal{\textbackslash t}\quad\terminal{\textbackslash v}
\end{bnf}
\begin{bnf}
\nontermdef{octal-escape-sequence}\br
\terminal{\textbackslash} octal-digit\br
\terminal{\textbackslash} octal-digit octal-digit\br
\terminal{\textbackslash} octal-digit octal-digit octal-digit
\end{bnf}
\begin{bnf}
\nontermdef{hexadecimal-escape-sequence}\br
\terminal{\textbackslash x} hexadecimal-digit\br
hexadecimal-escape-sequence hexadecimal-digit
\end{bnf}
\pnum
\indextext{literal!character}%
\indextext{literal!narrow-character}%
\indextext{literal!\idxcode{char16_t}}%
\indextext{literal!\idxcode{char32_t}}%
A character literal is one or more characters enclosed in single quotes,
as in \tcode{'x'}, optionally preceded by one of the letters \tcode{u},
\tcode{U}, or \tcode{L}, as in \tcode{u'y'}, \tcode{U'z'}, or
\tcode{L'x'}, respectively.
\indextext{literal!type~of character}%
A character literal that does not begin with \tcode{u}, \tcode{U}, or
\tcode{L} is an ordinary character literal, also referred to as a
narrow-character literal. An ordinary character literal that contains a
single \grammarterm{c-char} has type \tcode{char}, with value equal to the
numerical value of the encoding of the \grammarterm{c-char} in the
execution character set. An ordinary character literal that contains
more than one \grammarterm{c-char} is a
\indextext{multicharacter literal|see{literal, multicharacter}}%
\defnx{multicharacter literal}{literal!multicharacter}.
A multicharacter literal has type \tcode{int}
\indextext{literal!multicharacter!implementation-defined value of}%
and \impldef{value of multicharacter literal} value.
\pnum
\indextext{wide-character}%
\indextext{char16_t character@\tcode{char16_t} character}%
\indextext{char32_t character@\tcode{char32_t} character}%
\indextext{\idxhdr{stddef.h}}%
\indextext{\idxcode{wchar_t}}%
\indextext{\idxcode{char16_t}}%
\indextext{\idxcode{char32_t}}%
A character literal that begins with the letter \tcode{u}, such as
\tcode{u'y'}, is a character literal of type \tcode{char16_t}. The value
of a \tcode{char16_t} literal containing a single \grammarterm{c-char} is
equal to its ISO 10646 code point value, provided that the code point is
representable with a single 16-bit code unit. (That is, provided it is a
basic multi-lingual plane code point.) If the value is not representable
within 16 bits, the program is ill-formed. A \tcode{char16_t} literal
containing multiple \grammarterm{c-char}{s} is ill-formed. A character
literal that begins with the letter \tcode{U}, such as \tcode{U'z'}, is
a character literal of type \tcode{char32_t}. The value of a
\tcode{char32_t} literal containing a single \grammarterm{c-char} is equal
to its ISO 10646 code point value. A \tcode{char32_t} literal containing
multiple \grammarterm{c-char}{s} is ill-formed. A character literal that
begins with the letter \tcode{L}, such as \tcode{L'x'},
\indextext{prefix!\idxcode{L}}%
is a wide-character literal. A wide-character literal has type
\tcode{wchar_t}.\footnote{They are intended for character sets where a character does
not fit into a single byte. }
A wide-character literal containing a single
\grammarterm{c-char} has value equal to the numerical value of the encoding
of the \grammarterm{c-char} in the execution wide-character set, unless the
\grammarterm{c-char} has no representation in the execution wide-character set, in which
case the value is \impldef{value of wide-character literal with single c-char that is
not in execution wide-character set}. \enternote The type \tcode{wchar_t} is able to
represent all members of the execution wide-character set (see~\ref{basic.fundamental}).
\exitnote The value
of a wide-character literal containing multiple \grammarterm{c-char}{s} is
\impldef{value of wide-character literal containing multiple characters}.
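The types selected by each prefix can be illustrated with a short sketch. The helper name \tcode{check_character_literal_types} is hypothetical. Multicharacter literals are left out because their value is implementation-defined and compilers commonly warn about them.

```cpp
#include <cassert>
#include <type_traits>

// The prefix alone determines the type of a single-character literal.
static_assert(std::is_same<decltype('x'), char>::value, "no prefix: char");
static_assert(std::is_same<decltype(u'y'), char16_t>::value, "u prefix: char16_t");
static_assert(std::is_same<decltype(U'z'), char32_t>::value, "U prefix: char32_t");
static_assert(std::is_same<decltype(L'x'), wchar_t>::value, "L prefix: wchar_t");

// char16_t and char32_t literals take the ISO 10646 code point as their value.
static_assert(u'\u00E9' == char16_t(0x00E9), "a code point in the basic multi-lingual plane");
static_assert(U'\U0001F600' == char32_t(0x1F600), "a code point outside the basic multi-lingual plane");

// Hypothetical helper: the size of each literal matches the size of its type.
void check_character_literal_types() {
    assert(sizeof('x') == 1);
    assert(sizeof(u'y') == sizeof(char16_t));
    assert(sizeof(U'z') == sizeof(char32_t));
    assert(sizeof(L'x') == sizeof(wchar_t));
}
```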
\pnum
Certain nongraphic characters, the single quote \tcode{'}, the double quote \tcode{"},
the question mark \tcode{?},\footnote{Using an escape sequence for a question mark can
avoid accidentally creating a trigraph.} and the backslash
\indextext{backslash~character}%
\indextext{\idxcode{\textbackslash}|see{backslash}}%
\indextext{escape~character|see{backslash}}%
\tcode{\textbackslash}, can be represented according to
Table~\ref{tab:escape.sequences}.
\indextext{escape~sequence!undefined}%
The double quote \tcode{"} and the question mark \tcode{?} can be
represented as themselves or by the escape sequences
\tcode{\textbackslash "} and \tcode{\textbackslash ?} respectively, but
the single quote \tcode{'} and the backslash \tcode{\textbackslash}
shall be represented by the escape sequences \tcode{\textbackslash'} and
\tcode{\textbackslash\textbackslash} respectively. Escape sequences in
which the character following the backslash is not listed in
Table~\ref{tab:escape.sequences} are conditionally-supported, with \impldef{semantics of
non-standard escape sequences} semantics. An escape sequence specifies a single
character.
\begin{floattable}{Escape sequences}{tab:escape.sequences}
{llm}
\topline
new-line & NL(LF) & \tcode{\textbackslash n} \\
horizontal tab & HT & \tcode{\textbackslash t} \\
vertical tab & VT & \tcode{\textbackslash v} \\
backspace & BS & \tcode{\textbackslash b} \\
carriage return & CR & \tcode{\textbackslash r} \\
form feed & FF & \tcode{\textbackslash f} \\
alert & BEL & \tcode{\textbackslash a} \\
backslash & \textbackslash & \tcode{\textbackslash\textbackslash} \\
question mark & ? & \tcode{\textbackslash ?} \\
single quote & \tcode{'} & \tcode{\textbackslash\tcode{'}} \\
double quote & \tcode{"} & \tcode{\textbackslash\tcode{"}} \\
octal number & \numconst{ooo} & \tcode{\textbackslash\numconst{ooo}} \\
hex number & \numconst{hhh} & \tcode{\textbackslash x\numconst{hhh}} \\
\end{floattable}
\pnum
The escape
\indextext{number!octal}%
\tcode{\textbackslash\numconst{ooo}} consists of the backslash followed by one,
two, or three octal digits that are taken to specify the value of the
desired character. The escape
\indextext{number!hex}%
\tcode{\textbackslash x\numconst{hhh}}
consists of the backslash followed by \tcode{x} followed by one or more
hexadecimal digits that are taken to specify the value of the desired
character. There is no limit to the number of digits in a hexadecimal
sequence. A sequence of octal or hexadecimal digits is terminated by the
first character that is not an octal digit or a hexadecimal digit,
respectively.
\indextext{literal!implementation-defined value~of \idxcode{char}}%
The value of a character literal is \impldef{value of character literal outside range of
corresponding type} if it falls outside of the implementation-defined range defined for
\tcode{char}
(for literals with no prefix), \tcode{char16_t} (for literals prefixed
by \tcode{'u'}), \tcode{char32_t} (for literals prefixed by
\tcode{'U'}), or \tcode{wchar_t} (for literals prefixed by \tcode{'L'}).
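The termination rules for octal and hexadecimal escapes can be seen in a small sketch. The helper name \tcode{check_escape_sequences} is hypothetical. The checks compare escapes with each other or with plain numbers, so they do not assume any particular execution character set.

```cpp
#include <cassert>
#include <cstring>

// Octal 101 and hexadecimal 41 both specify the value 65.
static_assert('\101' == '\x41', "same value, two spellings");
static_assert('\x41' == 65, "a hexadecimal escape gives the value directly");
static_assert('\0' == 0, "a single octal digit is a complete escape");

// An octal escape takes at most three digits: '\101', then '1', then the null.
static_assert(sizeof("\1011") == 3, "the fourth digit is a separate character");
// The digit 8 is not an octal digit, so it ends the escape '\10'.
static_assert(sizeof("\108") == 3, "escape, the character 8, and the null");
// G is not a hexadecimal digit, so it ends the escape '\x41'.
static_assert(sizeof("\x41G") == 3, "escape, the character G, and the null");

// Hypothetical helper: run-time view of the same rules.
void check_escape_sequences() {
    const char* s = "\1011";
    assert(s[0] == '\101' && s[1] == '1' && s[2] == '\0');
    assert(std::strlen("a\tb") == 3);        // each simple escape is one character
    assert(std::strlen("\\\'\"\?") == 4);    // backslash, single quote, double quote, question mark
}
```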
\pnum
A universal-character-name is translated to the encoding, in the appropriate
execution character set, of the character named. If there is no such
encoding, the universal-character-name is translated to an
\impldef{encoding of universal character name not in execution character set} encoding.
\enternote In translation phase 1, a universal-character-name is introduced whenever an
actual extended
character is encountered in the source text. Therefore, all extended
characters are described in terms of universal-character-names. However,
the actual compiler implementation may use its own native character set,
so long as the same results are obtained. \exitnote
\rSec2[lex.fcon]{Floating literals}
\indextext{literal!floating}%
\begin{bnf}
\nontermdef{floating-literal}\br
fractional-constant exponent-part\opt floating-suffix\opt\br
digit-sequence exponent-part floating-suffix\opt
\end{bnf}
\begin{bnf}
\nontermdef{fractional-constant}\br
digit-sequence\opt \terminal{.} digit-sequence\br
digit-sequence \terminal{.}
\end{bnf}
\begin{bnf}
\nontermdef{exponent-part}\br
\terminal{e} sign\opt digit-sequence\br
\terminal{E} sign\opt digit-sequence
\end{bnf}
\begin{bnf}
\nontermdef{sign} \textnormal{one of}\br
\terminal{+ -}
\end{bnf}
\begin{bnf}
\nontermdef{digit-sequence}\br
digit\br
digit-sequence digit
\end{bnf}
\begin{bnf}
\nontermdef{floating-suffix} \textnormal{one of}\br
\terminal{f l F L}
\end{bnf}
\pnum
\indextext{literal!floating}%
A floating literal consists of an integer part, a decimal point, a
fraction part, an
\indextext{suffix!\idxcode{e}}%
\indextext{suffix!\idxcode{E}}%
\tcode{e} or \tcode{E}, an optionally signed integer exponent, and an
optional type suffix. The integer and fraction parts both consist of a
sequence of decimal (base ten) digits. Either the integer part or the
fraction part (not both) can be omitted; either the decimal point or the
letter \tcode{e} (or \tcode{E}) and the exponent (not both) can be
omitted. The integer part, the optional decimal point and the optional
fraction part form the \term{significant part} of the
floating literal. The exponent, if present, indicates the power of 10 by
which the significant part is to be scaled. If the scaled value is in
the range of representable values for its type, the result is the scaled
value if representable, else the larger or smaller representable value
nearest the scaled value, chosen in an \impldef{choice of larger or smaller value of
floating literal} manner.
\indextext{literal!\idxcode{double}}%
The type of a floating literal is \tcode{double}
\indextext{literal!type~of floating~point}%
unless explicitly specified by a suffix.
\indextext{literal!\idxcode{float}}%
\indextext{suffix!\idxcode{F}}%
\indextext{suffix!\idxcode{f}}%
The suffixes \tcode{f} and \tcode{F} specify \tcode{float};
\indextext{suffix!\idxcode{L}}%
\indextext{suffix!\idxcode{l}}%
\indextext{literal!\idxcode{long double}}%
the suffixes \tcode{l} and \tcode{L} specify \tcode{long}
\tcode{double}. If the scaled value is not in the range of representable
values for its type, the program is ill-formed.
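A brief sketch of the suffix and exponent rules follows. The helper name \tcode{check_floating_literals} is hypothetical. Only values that are exactly representable in binary floating point are compared, so the implementation-defined rounding choice does not affect the result.

```cpp
#include <cassert>
#include <type_traits>

// The suffix, or its absence, selects the type.
static_assert(std::is_same<decltype(1.0), double>::value, "no suffix: double");
static_assert(std::is_same<decltype(1.0f), float>::value, "f: float");
static_assert(std::is_same<decltype(1.0F), float>::value, "F: float");
static_assert(std::is_same<decltype(1.0l), long double>::value, "l: long double");
static_assert(std::is_same<decltype(1.0L), long double>::value, "L: long double");

// Either the decimal point or the exponent may be omitted, but not both.
static_assert(std::is_same<decltype(1e3), double>::value, "exponent without a decimal point");
static_assert(1e3 == 1000.0, "the exponent scales by a power of 10");
static_assert(25e-1 == 2.5, "a negative exponent scales downwards");

// Either the integer part or the fraction part may be omitted, but not both.
static_assert(.5 == 0.5 && 1. == 1.0, "omitted integer part, omitted fraction part");

// Hypothetical helper: run-time counterpart of the checks above.
void check_floating_literals() {
    assert(1.5e2 == 150.0);
    assert(5E-1 == 0.5);
}
```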
\rSec2[lex.string]{String literals}
\indextext{literal!string}%
\begin{bnf}
\nontermdef{string-literal}\br
encoding-prefix\opt \terminal{"} s-char-sequence\opt \terminal{"}\br
encoding-prefix\opt \terminal{R} raw-string
\end{bnf}
\begin{bnf}
\nontermdef{encoding-prefix}\br
\terminal{u8}\br
\terminal{u}\br
\terminal{U}\br
\terminal{L}
\end{bnf}
\begin{bnf}
\nontermdef{s-char-sequence}\br
s-char\br
s-char-sequence s-char
\end{bnf}
\begin{bnftab}
\nontermdef{s-char}\br
\>\textnormal{any member of the source character set except}\br
\>\>\textnormal{the double-quote \terminal{"}, backslash \terminal{\textbackslash}, or new-line character}\br
\>escape-sequence\br
\>universal-character-name
\end{bnftab}
\begin{bnf}
\nontermdef{raw-string}\br
\terminal{"} d-char-sequence\opt \terminal{(} r-char-sequence\opt \terminal{)} d-char-sequence\opt \terminal{"}
\end{bnf}
\begin{bnf}
\nontermdef{r-char-sequence}\br
r-char\br
r-char-sequence r-char
\end{bnf}
\begin{bnftab}
\nontermdef{r-char}\br
\>\textnormal{any member of the source character set, except}\br
\>\>\textnormal{a right parenthesis \terminal{)} followed by the initial \nonterminal{d-char-sequence}}\br
\>\>\textnormal{(which may be empty) followed by a double quote \terminal{"}.}\br
\end{bnftab}
\begin{bnf}
\nontermdef{d-char-sequence}\br
d-char\br
d-char-sequence d-char
\end{bnf}
\begin{bnftab}
\nontermdef{d-char}\br
\>\textnormal{any member of the basic source character set except:}\br
\>\>\textnormal{space, the left parenthesis \terminal{(}, the right parenthesis \terminal{)}, the backslash \terminal{\textbackslash},}\br
\>\>\textnormal{and the control characters representing horizontal tab,}\br
\>\>\textnormal{vertical tab, form feed, and new-line.}
\end{bnftab}
\pnum
\indextext{literal!string}%
\indextext{literal!string!narrow}%
\indextext{literal!string!wide}%
\indextext{literal!string!\idxcode{char16_t}}%
\indextext{literal!string!\idxcode{char32_t}}%
\indextext{character~string}%
A string literal is a sequence of characters (as defined
in~\ref{lex.ccon}) surrounded by double quotes, optionally prefixed by
\tcode{R},
\tcode{u8},
\tcode{u8R},
\tcode{u},
\tcode{uR},
\tcode{U},
\tcode{UR},
\tcode{L},
or \tcode{LR},
as in
\tcode{"..."},
\tcode{R"(...)"},
\tcode{u8"..."},
\tcode{u8R"**(...)**"},
\tcode{u"..."},
\tcode{uR"*\~{}(...)*\~{}"},
\tcode{U"..."},
\tcode{UR"zzz(...)zzz"},
\tcode{L"..."},
or \tcode{LR"(...)"},
respectively.
\pnum
A string literal that has an \tcode{R} in the prefix is a \defn{raw string literal}. The
\grammarterm{d-char-sequence} serves as a delimiter. The terminating
\grammarterm{d-char-sequence} of a \grammarterm{raw-string} is the same sequence of
characters as the initial \grammarterm{d-char-sequence}. A \grammarterm{d-char-sequence}
shall consist of at most 16 characters.
\pnum
\enternote The characters \tcode{'('} and \tcode{')'} are permitted in a
\grammarterm{raw-string}. Thus, \tcode{R"delimiter((a|b))delimiter"} is equivalent to
\tcode{"(a|b)"}. \exitnote
\pnum
\enternote A source-file new-line in a raw string literal results in a new-line in the
resulting execution \term{string-literal}. Assuming no
whitespace at the beginning of lines in the following example, the assert will succeed:
\begin{codeblock}
const char *p = R"(a\
b
c)";
assert(std::strcmp(p, "a\\\nb\nc") == 0);
\end{codeblock}
\exitnote
\pnum
\enterexample The raw string
\begin{codeblock}
R"a(
)\
a"
)a"
\end{codeblock}
is equivalent to \tcode{"\textbackslash n)\textbackslash \textbackslash \textbackslash na\textbackslash"\textbackslash n"}. The raw string
\begin{codeblock}
R"(??)"
\end{codeblock}
is equivalent to \tcode{"\textbackslash?\textbackslash?"}. The raw string
\begin{codeblock}
R"#(
)??="
)#"
\end{codeblock}
is equivalent to \tcode{"\textbackslash n)\textbackslash?\textbackslash?=\textbackslash"\textbackslash n"}. \exitexample
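The raw string rules can also be exercised at run time. In the sketch below the helper name \tcode{check_raw_strings} and the variable names are hypothetical, and trigraph sequences are avoided because their treatment varies with compiler settings.

```cpp
#include <cassert>
#include <cstring>

// Hypothetical helper: compares raw string literals with their ordinary equivalents.
void check_raw_strings() {
    // Parentheses may appear in the body because the delimiter is "delimiter".
    const char* nested = R"delimiter((a|b))delimiter";
    assert(std::strcmp(nested, "(a|b)") == 0);

    // A backslash is an ordinary character in a raw string, not an escape.
    const char* slash = R"(a\nb)";
    assert(std::strlen(slash) == 4);
    assert(std::strcmp(slash, "a\\nb") == 0);

    // Double quotes in the body need no escaping.
    const char* quoted = R"x(say "hi")x";
    assert(std::strcmp(quoted, "say \"hi\"") == 0);

    // A source-file new-line becomes a new-line in the string.
    const char* two_lines = R"(a
b)";
    assert(std::strcmp(two_lines, "a\nb") == 0);
}
```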
\pnum
\indextext{string!type~of}%
\indextext{literal!string!narrow}%
After translation phase 6, a string literal that does not begin with an \grammarterm{encoding-prefix} is an ordinary string literal, and is initialized with the given characters.
\pnum
A string literal that begins with \tcode{u8}, such as \tcode{u8"asdf"}, is a UTF-8 string literal and is initialized with the given characters as encoded in UTF-8.
\pnum
Ordinary string literals and UTF-8 string literals are
also referred to as narrow
string literals. A narrow string literal has type
\indextext{literal!string!type~of}%
``array of \term{n} \tcode{const char}'', where \term{n} is the size of
the string as defined below, and has static storage
duration~(\ref{basic.stc}).
\pnum
\indextext{literal!string!\idxcode{char16_t}}%
A string literal that begins with \tcode{u}, such as \tcode{u"asdf"}, is
a \tcode{char16_t} string literal. A \tcode{char16_t} string literal has
type ``array of \term{n} \tcode{const char16_t}'', where \term{n} is the
size of the string as defined below; it has static storage duration and
is initialized with the given characters. A single \grammarterm{c-char} may
produce more than one \tcode{char16_t} character in the form of
surrogate pairs.
\pnum
\indextext{literal!string!\idxcode{char32_t}}%
A string literal that begins with \tcode{U}, such as \tcode{U"asdf"}, is
a \tcode{char32_t} string literal. A \tcode{char32_t} string literal has
type ``array of \term{n} \tcode{const char32_t}'', where \term{n} is the
size of the string as defined below; it has static storage duration and
is initialized with the given characters.
\pnum
\indextext{literal!string!wide}%
A string literal that begins with
\tcode{L},
such as \tcode{L"asdf"},
is a wide string literal.
\indextext{\idxhdr{stddef.h}}%
\indextext{\idxcode{wchar_t}}%
\indextext{literal!string!wide}%
\indextext{prefix!\idxcode{L}}%
A wide string literal has type ``array of \term{n} \tcode{const
wchar_t}'', where \term{n} is the size of the string as defined below; it
has static storage duration and is initialized with the given
characters.
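The array types described above can be observed with \tcode{decltype}, which yields an lvalue reference to the array type for a string literal. The sketch below is illustrative only, and the helper name \tcode{check_string_literal_types} is hypothetical.

```cpp
#include <cassert>
#include <type_traits>

// "asdf" has four characters plus the terminating null, so n is 5.
static_assert(std::is_same<decltype("asdf"), const char (&)[5]>::value,
              "ordinary string literal: array of const char");
static_assert(std::is_same<decltype(u"asdf"), const char16_t (&)[5]>::value,
              "u prefix: array of const char16_t");
static_assert(std::is_same<decltype(U"asdf"), const char32_t (&)[5]>::value,
              "U prefix: array of const char32_t");
static_assert(std::is_same<decltype(L"asdf"), const wchar_t (&)[5]>::value,
              "L prefix: array of const wchar_t");

// Hypothetical helper: sizeof reports the whole array, including the null.
void check_string_literal_types() {
    assert(sizeof("asdf") == 5);
    assert(sizeof(u"asdf") == 5 * sizeof(char16_t));
    assert(sizeof(U"asdf") == 5 * sizeof(char32_t));
    assert(sizeof(L"asdf") == 5 * sizeof(wchar_t));
}
```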
\pnum
\indextext{literal!string!implementation-defined}%
\indextext{string!distinct}%
Whether all string literals are distinct (that is, are stored in
nonoverlapping objects) is \impldef{distinctness of string literals}.
\indextext{literal!string!undefined change~to}%
The effect of attempting to modify a string literal is undefined.
\pnum
\indextext{concatenation!string}%
In translation phase 6~(\ref{lex.phases}), adjacent string literals are concatenated. If
both string literals have the same \grammarterm{encoding-prefix}, the resulting concatenated string literal has
that \grammarterm{encoding-prefix}. If one string literal has no \grammarterm{encoding-prefix}, it is treated as a string literal of
the same \grammarterm{encoding-prefix} as the other operand. If a UTF-8 string literal token is adjacent to a
wide string literal token, the program is ill-formed. Any other concatenations are
conditionally-supported with \impldef{concatenation of some types of string literals}
behavior. \enternote This
concatenation is an interpretation, not a conversion.
Because the interpretation happens in translation phase 6 (after each character from a
literal has been translated into a value from the appropriate character set), a string
literal's initial rawness has no effect on the interpretation or well-formedness of the
concatenation.
\exitnote
Table~\ref{tab:lex.string.concat} has some examples of valid concatenations.
\begin{floattable}{String literal concatenations}{tab:lex.string.concat}
{lll|lll|lll}
\topline
\multicolumn{2}{|c}{Source} &
Means &
\multicolumn{2}{c}{Source} &
Means &
\multicolumn{2}{c}{Source} &
Means \\
\tcode{u"a"} & \tcode{u"b"} & \tcode{u"ab"} &
\tcode{U"a"} & \tcode{U"b"} & \tcode{U"ab"} &
\tcode{L"a"} & \tcode{L"b"} & \tcode{L"ab"} \\
\tcode{u"a"} & \tcode{"b"} & \tcode{u"ab"} &
\tcode{U"a"} & \tcode{"b"} & \tcode{U"ab"} &
\tcode{L"a"} & \tcode{"b"} & \tcode{L"ab"} \\
\tcode{"a"} & \tcode{u"b"} & \tcode{u"ab"} &
\tcode{"a"} & \tcode{U"b"} & \tcode{U"ab"} &
\tcode{"a"} & \tcode{L"b"} & \tcode{L"ab"} \\
\end{floattable}
Characters in concatenated strings are kept distinct.
\enterexample
\begin{codeblock}
"\xA" "B"
\end{codeblock}
contains the two characters \tcode{'\textbackslash xA'} and \tcode{'B'}
after concatenation (and not the single hexadecimal character
\tcode{'\textbackslash xAB'}).
\exitexample
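The concatenation rules can be checked with a short sketch. The helper name \tcode{check_concatenation} is hypothetical, and only the combinations listed as valid in Table~\ref{tab:lex.string.concat} are used.

```cpp
#include <cassert>
#include <cwchar>

// "\xA" "B" keeps two separate characters, plus the terminating null.
static_assert(sizeof("\xA" "B") == 3, "not the single character with value 0xAB");

// An unprefixed literal takes on the encoding-prefix of its neighbour.
static_assert(sizeof(L"a" "b") == 3 * sizeof(wchar_t), "the result is a wide string literal");
static_assert(sizeof(u"a" "b") == sizeof(u"ab"), "the result is a char16_t string literal");
static_assert(sizeof("a" U"b") == sizeof(U"ab"), "the prefix may come from either operand");

// Hypothetical helper: run-time view of the concatenated contents.
void check_concatenation() {
    const char* s = "\xA" "B";
    assert(s[0] == '\xA' && s[1] == 'B' && s[2] == '\0');
    assert(std::wcscmp(L"a" "b", L"ab") == 0);
    assert(std::wcscmp("a" L"b", L"ab") == 0);
}
```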
\pnum
\indextext{\idxcode{0}|seealso{zero,~null}}%
\indextext{\idxcode{0}!string terminator}%
\indextext{\idxcode{0}!null~character}%
After any necessary concatenation, in translation phase
7~(\ref{lex.phases}), \tcode{'\textbackslash 0'} is appended to every
string literal so that programs that scan a string can find its end.
\pnum
\indextext{encoding!multibyte}%
Escape sequences and universal-character-names in non-raw string literals
have the same meaning as in character literals~(\ref{lex.ccon}), except that
the single quote \tcode{'} is representable either by itself or by the escape sequence
\tcode{\textbackslash'}, and the double quote \tcode{"} shall be preceded by a
\tcode{\textbackslash}.
\indextext{string!\idxcode{sizeof}}%
In a narrow string literal, a universal-character-name may map to more
than one \tcode{char} element due to \term{multibyte encoding}. The
size of a \tcode{char32_t} or wide string literal is the total number of
escape sequences, universal-character-names, and other characters, plus
one for the terminating \tcode{U'\textbackslash 0'} or
\tcode{L'\textbackslash 0'}. The size of a \tcode{char16_t} string
literal is the total number of escape sequences,
universal-character-names, and other characters, plus one for each
character requiring a surrogate pair, plus one for the terminating
\tcode{u'\textbackslash 0'}. \enternote The size of a \tcode{char16_t}
string literal is the number of code units, not the number of
characters. \exitnote Within \tcode{char32_t} and \tcode{char16_t}
literals, any universal-character-names shall be within the range
\tcode{0x0} to \tcode{0x10FFFF}. The size of a narrow string literal is
the total number of escape sequences and other characters, plus at least
one for the multibyte encoding of each universal-character-name, plus
one for the terminating \tcode{'\textbackslash 0'}.
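The effect of surrogate pairs on the size of a \tcode{char16_t} string literal can be sketched as follows. The code point U+1F600 is an arbitrary example outside the basic multi-lingual plane, and the helper name \tcode{check_string_literal_sizes} is hypothetical. Narrow literals containing universal-character-names are not checked because their size depends on the execution character set.

```cpp
#include <cassert>

// One character outside the basic multi-lingual plane:
// two code units in char16_t (a surrogate pair), one in char32_t, plus the null.
static_assert(sizeof(u"\U0001F600") / sizeof(char16_t) == 3, "surrogate pair plus null");
static_assert(sizeof(U"\U0001F600") / sizeof(char32_t) == 2, "one code unit plus null");

// A character inside the basic multi-lingual plane needs one code unit in both.
static_assert(sizeof(u"\u00E9") / sizeof(char16_t) == 2, "one code unit plus null");
static_assert(sizeof(U"\u00E9") / sizeof(char32_t) == 2, "one code unit plus null");

// Escape sequences and other characters each count once in a narrow literal.
static_assert(sizeof("ab\n") == 4, "two characters, one escape, and the null");

// Hypothetical helper: inspects the code units of the surrogate pair.
void check_string_literal_sizes() {
    const char16_t* face = u"\U0001F600";
    assert(face[0] == 0xD83D);  // high surrogate
    assert(face[1] == 0xDE00);  // low surrogate
    assert(face[2] == 0);       // terminating null
}
```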
\rSec2[lex.bool]{Boolean literals}
\indextext{literal!boolean}%
\begin{bnf}
\nontermdef{boolean-literal}\br
\terminal{false}\br
\terminal{true}
\end{bnf}
\pnum
\indextext{Boolean literal}%
The Boolean literals are the keywords \tcode{false} and \tcode{true}.
Such literals are prvalues and have type \tcode{bool}.
\rSec2[lex.nullptr]{Pointer literals}
\indextext{literal!pointer}%
\begin{bnf}
\nontermdef{pointer-literal}\br
\terminal{nullptr}
\end{bnf}
\pnum
The pointer literal is the keyword \tcode{nullptr}. It is a prvalue of type
\tcode{std::nullptr_t}.
\enternote
\tcode{std::nullptr_t} is a distinct type that is neither a pointer type nor a pointer
to member type; rather, a prvalue of this type is a null pointer constant and can be
converted to a null pointer value or null member pointer value. See~\ref{conv.ptr}
and~\ref{conv.mem}.
\exitnote
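The types of the Boolean and pointer literals can be confirmed with a short sketch. The struct \tcode{Widget} and the helper name \tcode{check_pointer_literal} are hypothetical and exist only to give the member pointer something to point into.

```cpp
#include <cassert>
#include <cstddef>
#include <type_traits>

static_assert(std::is_same<decltype(true), bool>::value, "true has type bool");
static_assert(std::is_same<decltype(false), bool>::value, "false has type bool");
static_assert(std::is_same<decltype(nullptr), std::nullptr_t>::value,
              "nullptr has type std::nullptr_t");

// std::nullptr_t is neither a pointer type nor a pointer to member type.
static_assert(!std::is_pointer<std::nullptr_t>::value, "not a pointer type");
static_assert(!std::is_member_pointer<std::nullptr_t>::value, "not a pointer to member type");

// Hypothetical class used only for the member pointer conversion.
struct Widget {
    int value;
};

// Hypothetical helper: nullptr converts to both kinds of null value.
void check_pointer_literal() {
    int* p = nullptr;            // null pointer value
    int Widget::* m = nullptr;   // null member pointer value
    assert(p == nullptr);
    assert(m == nullptr);
    assert(!p);
}
```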
\rSec2[lex.ext]{User-defined literals}
\indextext{literal!user defined}%
\begin{bnf}
\nontermdef{user-defined-literal}\br
user-defined-integer-literal\br
user-defined-floating-literal\br
user-defined-string-literal\br
user-defined-character-literal
\end{bnf}
\begin{bnf}
\nontermdef{user-defined-integer-literal}\br
decimal-literal ud-suffix\br
octal-literal ud-suffix\br
hexadecimal-literal ud-suffix
\end{bnf}
\begin{bnf}
\nontermdef{user-defined-floating-literal}\br
fractional-constant exponent-part\opt ud-suffix\br
digit-sequence exponent-part ud-suffix
\end{bnf}
\begin{bnf}
\nontermdef{user-defined-string-literal}\br
string-literal ud-suffix
\end{bnf}
\begin{bnf}
\nontermdef{user-defined-character-literal}\br
character-literal ud-suffix
\end{bnf}
\begin{bnf}
\nontermdef{ud-suffix}\br
identifier
\end{bnf}
\pnum
If a token matches both \grammarterm{user-defined-literal} and another literal kind, it
is treated as the latter. \enterexample \tcode{123_km}
is a \grammarterm{user-defined-literal}, but \tcode{12LL} is an
\grammarterm{integer-literal}. \exitexample
The syntactic non-terminal preceding the \grammarterm{ud-suffix} in a
\grammarterm{user-defined-literal} is taken to be the longest sequence of
characters that could match that non-terminal.
\pnum
A \grammarterm{user-defined-literal} is treated as a call to a literal operator or
literal operator template~(\ref{over.literal}). To determine the form of this call for a
given \grammarterm{user-defined-literal} \term{L} with \grammarterm{ud-suffix} \term{X},
the \grammarterm{literal-operator-id} whose literal suffix identifier is \term{X} is
looked up in the context of \term{L} using the rules for unqualified name
lookup~(\ref{basic.lookup.unqual}). Let \term{S} be the set of declarations found by
this lookup. \term{S} shall not be empty.
\pnum
If \term{L} is a \grammarterm{user-defined-integer-literal}, let \term{n} be the literal
without its \grammarterm{ud-suffix}. If \term{S} contains a literal operator with
parameter type \tcode{unsigned long long}, the literal \term{L} is treated as a call of
the form
\begin{codeblock}
operator "" @\term{X}@(@\term{n}@ULL)
\end{codeblock}
Otherwise, \term{S} shall contain a raw literal operator or a literal operator
template~(\ref{over.literal}) but not both. If \term{S} contains a raw literal operator,
the literal \term{L} is treated as a call of the form
\begin{codeblock}
operator "" @\term{X}@(@"\term{n}{"}@)
\end{codeblock}
Otherwise (\term{S} contains a literal operator template), \term{L} is treated as a call
of the form
\begin{codeblock}
operator "" @\term{X}@<'@$c_1$@', '@$c_2$@', ... '@$c_k$@'>()
\end{codeblock}
where \term{n} is the source character sequence $c_1c_2...c_k$. \enternote The sequence
$c_1c_2...c_k$ can only contain characters from the basic source character set.
\exitnote
\pnum
If \term{L} is a \grammarterm{user-defined-floating-literal}, let \term{f} be the
literal without its \grammarterm{ud-suffix}. If \term{S} contains a literal operator
with parameter type \tcode{long double}, the literal \term{L} is treated as a call of
the form
\begin{codeblock}
operator "" @\term{X}@(@\term{f}@L)
\end{codeblock}
Otherwise, \term{S} shall contain a raw literal operator or a literal operator
template~(\ref{over.literal}) but not both. If \term{S} contains a raw literal operator,
the literal \term{L} is treated as a call of the form
\begin{codeblock}
operator "" @\term{X}@(@"\term{f}{"}@)
\end{codeblock}
Otherwise (\term{S} contains a literal operator template), \term{L} is treated as a call
of the form
\begin{codeblock}
operator "" @\term{X}@<'@$c_1$@', '@$c_2$@', ... '@$c_k$@'>()
\end{codeblock}
where \term{f} is the source character sequence $c_1c_2...c_k$. \enternote The sequence
$c_1c_2...c_k$ can only contain characters from the basic source character set.
\exitnote
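The three forms of call for numeric literals can be sketched with three separate suffixes, so that each lookup set \term{S} contains exactly one kind of operator. The suffixes \tcode{_km}, \tcode{_raw}, and \tcode{_count} and the helper name \tcode{check_numeric_udls} are hypothetical. The operators are declared without a space after \tcode{""}, a spelling that current compilers are assumed to accept.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Cooked literal operators: receive the value as unsigned long long or long double.
constexpr unsigned long long operator""_km(unsigned long long n) { return n * 1000; }
constexpr long double operator""_km(long double f) { return f * 1000; }

// Raw literal operator: receives the spelling of the literal as a string.
std::size_t operator""_raw(const char* spelling) { return std::strlen(spelling); }

// Literal operator template: receives the spelling as a pack of characters.
template <char... Cs>
constexpr std::size_t operator""_count() { return sizeof...(Cs); }

static_assert(5_km == 5000ULL, "treated as operator\"\"_km(5ULL)");
static_assert(123_count == 3, "treated as operator\"\"_count<'1', '2', '3'>()");

// Hypothetical helper: exercises each form of call.
void check_numeric_udls() {
    assert(1.5_km == 1500.0L);   // cooked floating form
    assert(0x1F_raw == 4);       // raw form sees the four characters "0x1F"
    assert(1.50_raw == 4);       // raw form sees the four characters "1.50"
    assert(1.5_count == 3);      // template form sees '1', '.', '5'
}
```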
\pnum
If \term{L} is a \grammarterm{user-defined-string-literal}, let \term{str} be the
literal without its \grammarterm{ud-suffix} and let \term{len} be
the number of
code units in \term{str} (i.e., its length excluding the terminating
null character).
The literal \term{L} is treated as a call of the form
\begin{codeblock}
operator "" @\term{X}@(@\term{str}{}@, @\term{len}{}@)
\end{codeblock}
\pnum
If \term{L} is a \grammarterm{user-defined-character-literal}, let \term{ch} be the
literal without its \grammarterm{ud-suffix}.
\term{S} shall contain a literal operator~(\ref{over.literal}) whose only parameter has
the type of \term{ch} and the
literal \term{L} is treated as a call
of the form
\begin{codeblock}
operator "" @\term{X}@(@\term{ch}{}@)
\end{codeblock}
\pnum
\enterexample
\begin{codeblock}
long double operator "" _w(long double);
std::string operator "" _w(const char16_t*, size_t);
unsigned operator "" _w(const char*);
int main() {
  1.2_w;     // calls \tcode{operator "" _w(1.2L)}
  u"one"_w;  // calls \tcode{operator "" _w(u"one", 3)}
  12_w;      // calls \tcode{operator "" _w("12")}
  "two"_w;   // error: no applicable literal operator
}
\end{codeblock}
\exitexample
\pnum
In translation phase 6~(\ref{lex.phases}), adjacent string literals are concatenated and
\grammarterm{user-defined-string-literal}{s} are considered string literals for that
purpose. During concatenation, \grammarterm{ud-suffix}{es} are removed and ignored and
the concatenation process occurs as described in~\ref{lex.string}. At the end of phase
6, if a string literal is the result of a concatenation involving at least one
\grammarterm{user-defined-string-literal}, all the participating
\grammarterm{user-defined-string-literal}{s} shall have the same \grammarterm{ud-suffix}
and that suffix is applied to the result of the concatenation.
\pnum
\enterexample
\begin{codeblock}
int main() {
  L"A" "B" "C"_x;  // OK: same as \tcode{L"ABC"_x}
  "P"_x "Q" "R"_y; // error: two different \grammarterm{ud-suffix}{es}
}
\end{codeblock}
\exitexample%
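The well-formed cases of suffix handling during concatenation can be tried with a small sketch. The suffix \tcode{_x} here is a hypothetical literal operator that returns the length it is given, and \tcode{check_udl_concatenation} is likewise a hypothetical helper. The ill-formed case with two different suffixes is omitted because it would not compile.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical literal operators: return the length passed by the compiler,
// which excludes the terminating null character.
constexpr std::size_t operator""_x(const char*, std::size_t len) { return len; }
constexpr std::size_t operator""_x(const wchar_t*, std::size_t len) { return len; }

// The suffix is applied to the result of the concatenation.
static_assert("A" "B" "C"_x == 3, "same as \"ABC\"_x");
static_assert(L"A" "B" "C"_x == 3, "same as L\"ABC\"_x, so the wchar_t overload is called");

// Several participating literals may carry the suffix if it is the same one.
static_assert("P"_x "Q"_x == 2, "same as \"PQ\"_x");

// Hypothetical helper: run-time counterpart of the checks above.
void check_udl_concatenation() {
    assert("two"_x == 3);
    assert("ab" "cd"_x == 4);
    assert(""_x == 0);
}
```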
\pnum
Some \grammarterm{identifier}{s} appearing as \grammarterm{ud-suffix}{es} are
reserved for future standardization~(\ref{usrlit.suffix}). A program containing
such a \grammarterm{ud-suffix} is ill-formed, no diagnostic required.
\indextext{literal|)}%
\indextext{conventions!lexical|)}