Character Literals
cabmeurer committed Aug 10, 2022
1 parent 21d39ef commit 5b22640
Showing 1 changed file with 130 additions and 0 deletions.
130 changes: 130 additions & 0 deletions proposals/p1964.md
@@ -0,0 +1,130 @@
# Character literals

<!--
Part of the Carbon Language project, under the Apache License v2.0 with LLVM
Exceptions. See /LICENSE for license information.
SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
-->

[Pull request](https://github.com/carbon-language/carbon-lang/pull/1964)

<!-- toc -->

## Table of contents

- [Problem](#problem)
- [Background](#background)
- [Proposal](#proposal)
- [Details](#details)
    - [Encoding](#encoding)
- [Rationale](#rationale)
- [Alternatives considered](#alternatives-considered)

<!-- tocstop -->

## Problem

This proposal specifies lexical rules for constant characters in Carbon.

## Background

We wish to provide a lexical syntax for character literals that is distinct
from the syntax for string literals.

In theory, we could reuse string literals to represent single characters.
However, a dedicated character literal syntax would make Carbon code more
readable.

## Proposal

The idea is to create and manage a character literal the same way we would a
string literal, but using the single-quote delimiter (`'`) instead of the
string double quote (`"`).

As with string literals, the type of a character literal would depend on its
contents.

var w: ch8 = 'w';

We will not support:

- Multi-line literals
- "raw" literals (using `#'x'#`)
- Empty character literals (`''`)

## Details

A character literal is a sequence enclosed by single-quote delimiters (`'`),
excluding:

- New line
- Single quote (`'`)
- Back-slash (`\`)
- Escape sequences
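
The lexical rules above can be sketched as a small scanner. This is only an illustration under stated assumptions, not the Carbon toolchain's lexer: the function name `ScanCharLiteral` is hypothetical, and the sketch assumes that a backslash is accepted only as the start of an escape sequence (which is how `'\n'` appears later in this proposal), while a bare newline or an empty literal is rejected.

```cpp
#include <cstddef>
#include <optional>
#include <string_view>

// Hypothetical helper (not part of any Carbon API). Returns the length of the
// character literal at the start of `src`, including both quotes, or
// std::nullopt if no well-formed literal starts there.
std::optional<std::size_t> ScanCharLiteral(std::string_view src) {
  if (src.empty() || src[0] != '\'') return std::nullopt;
  std::size_t i = 1;
  while (i < src.size()) {
    const char c = src[i];
    if (c == '\n') return std::nullopt;  // Multi-line literals are not supported.
    if (c == '\'') {
      if (i == 1) return std::nullopt;   // Empty literals are not supported.
      return i + 1;                      // Closing quote found.
    }
    if (c == '\\') {
      // Assumption: a backslash always introduces an escape sequence, so the
      // character after it is consumed without being interpreted here.
      if (i + 1 >= src.size() || src[i + 1] == '\n') return std::nullopt;
      i += 2;
      continue;
    }
    ++i;
  }
  return std::nullopt;  // Unterminated literal.
}
```

Under these assumptions, `ScanCharLiteral("'w'")` yields a length of 3, while the empty literal and the raw form `#'x'#` (which does not start with a quote) are both rejected.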

The type of a character literal will depend on its contents, so that `'c'`
and `u'b'` would have different types (as would `'b'` and `"b"`). However,
`'\n'` and `'\u{A}'` would have the same type, because both encode the same
Unicode code point, U+000A.

These different types should resemble the different C++ character literal types:

Ordinary (UTF-8) character literals:

- C++ `char`: `char c = 'c';`
- Carbon `ch8`: `var c: ch8 = 'c';`

UTF-16 character literals:

- C++ `char16_t`: `char16_t c = u'c';`
- Carbon `ch16`: `var c: ch16 = u'c';`

UTF-32 character literals:

- C++ `char32_t`: `char32_t c = U'c';`
- Carbon `ch32`: `var c: ch32 = U'c';`

Wide-character literals:

- C++ `wchar_t`: `wchar_t c = L'c';`
- Carbon `wch`: `var c: wch = L'c';`
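
For reference, the C++ side of this mapping can be checked at compile time. The sketch below uses only standard C++11 facilities and says nothing about how Carbon would implement `ch8`, `ch16`, `ch32`, or `wch`; it simply shows that each C++ prefix selects a distinct type, and that two spellings of the same character share both type and value.

```cpp
#include <type_traits>

// Each literal prefix selects a distinct C++ character type.
static_assert(std::is_same<decltype('c'), char>::value, "ordinary literal");
static_assert(std::is_same<decltype(u'c'), char16_t>::value, "UTF-16 literal");
static_assert(std::is_same<decltype(U'c'), char32_t>::value, "UTF-32 literal");
static_assert(std::is_same<decltype(L'c'), wchar_t>::value, "wide literal");

// A character literal and a string literal never share a type.
static_assert(!std::is_same<decltype('b'), decltype("b")>::value,
              "char vs. string literal");

// Two spellings of the same character have the same type and the same value.
static_assert(std::is_same<decltype('\n'), decltype('\x0A')>::value,
              "same type");
static_assert('\n' == '\x0A', "same value on ASCII-compatible systems");
```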

### Encoding

The type of a character literal and the way it is encoded should correspond
directly; that is, both depend on the type being initialized by the literal:

- Ordinary (UTF-8) character literals should use a single UTF-8 code unit.
- Wide-character literals should use a single Unicode code point.
- UTF-16 character literals should use a single Unicode code point.
- UTF-32 character literals should use a single Unicode code point.
- Glyph character literals should use a base character (a single Unicode code
point) plus a sequence of combining characters.

This is experimental, and should be revisited if we find motivation for
expressing character literals in other encodings.
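
To make the distinction between code units and code points in the list above concrete, the following sketch computes how many code units a given code point needs in UTF-8 and UTF-16. The helper names `Utf8CodeUnits` and `Utf16CodeUnits` are hypothetical and not part of this proposal; the ranges follow the standard UTF-8 and UTF-16 encoding forms.

```cpp
// Number of UTF-8 code units needed to encode code point `cp`.
// Returns 0 for surrogates and values above U+10FFFF, which are not
// Unicode scalar values.
constexpr int Utf8CodeUnits(char32_t cp) {
  return cp < 0x80 ? 1
         : cp < 0x800 ? 2
         : (cp >= 0xD800 && cp <= 0xDFFF) ? 0
         : cp < 0x10000 ? 3
         : cp <= 0x10FFFF ? 4
                          : 0;
}

// Number of UTF-16 code units needed to encode code point `cp`.
// Code points above U+FFFF require a surrogate pair (two code units).
constexpr int Utf16CodeUnits(char32_t cp) {
  return (cp >= 0xD800 && cp <= 0xDFFF) ? 0
         : cp < 0x10000 ? 1
         : cp <= 0x10FFFF ? 2
                          : 0;
}
```

Under this sketch, only code points below U+0080 fit in a single UTF-8 code unit, so an ordinary literal such as `'w'` qualifies while U+20AC (the euro sign, three code units) does not. A code point such as U+1F600 needs two UTF-16 code units but still counts as a single code point.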

## Rationale

This proposal supports the goals of making Carbon code
[easy to read, understand, and write](/docs/project/goals.md#code-that-is-easy-to-read-understand-and-write)
and of
[interoperability with and migration from existing C++ code](/docs/project/goals.md#interoperability-with-and-migration-from-existing-c-code)
by ensuring that every kind of character literal that exists in C++ can be
represented by a Carbon character literal. This is done in a way that is natural
to adopt and easy to read and understand: explicit Carbon character types are
mapped to the C++ character types, each with the correct associated encoding.

## Alternatives considered

- No explicit wide-character literal type. Wide characters are primarily used
on Windows systems, where they are encoded as UTF-16, whereas other systems
use UTF-32. For C++ interop, we would then need to import `wchar_t` as the
correct Carbon type based solely on the encoding used by the target system,
leading to further complexity.

- No distinct character literal. In principle, a character literal can be
represented by reusing string literals. However, in terms of readability, a
distinct lexical syntax for character literals versus string literals is
more in line with Carbon's language design goals of self-documenting code
that is easy to read, understand, and write, and of C++ interoperability.
