Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Minimal Unicode grapheme cluster support #49

Closed
lrhn opened this issue Oct 16, 2018 · 9 comments
Closed

Minimal Unicode grapheme cluster support #49

lrhn opened this issue Oct 16, 2018 · 9 comments
Assignees
Labels
feature Proposed language feature that solves one or more problems

Comments

@lrhn
Copy link
Member

lrhn commented Oct 16, 2018

Strawman proposal for #34

As a minimal solution, we add one getter to the String class, get graphemeClusters => GraphemClusters(this);, returning an instance of a new class GrahemeClusters, which is an iterable of GraphemeCluster, which represents an extended grapheme cluster.
The GraphemeCluster and GraphemeClusters classes are defined something like:

abstract class GraphemeCluster {
  /// Whether the cluster represents a sequence of Unicode scalara values.
  /// 
  /// If not, the [runes] will contain only one (invalid) code unit.
  bool get isValid;

  /// The code points making up this grapheme cluster.
  Runes get runes;

  /// Returns a string containing just this grapheme cluster.
  String toString() => string.substring(start, end);

  /// The length of this grapheme cluster.
  /// 
  /// Returns a number that can be added to the start index of this
  /// grapheme (like one returned by [GraphemeClusters.indexOf]), 
  /// to produce an index just after this grapheme cluster.
  int get length;

  /// Whether [other] is another grapheme cluster with the same runes.
  /// 
  /// Returns `true` if [other] is a grapheme cluster, and it has the
  /// same value for [isValid] and the same sequence of [runes].
  bool operator==(Object other);
  int get hashCode;
}


/// A view of a `String` as a sequence of grapheme clusters or invalid code units.
///
/// Many operations are based on "indices". These indices should always be
/// values returned or provided by other operations on this class, with `0` 
/// being the start of the grapheme clusters, and [length] being the end.
/// The [iterator] provides access to start and end indices of the current 
/// grapheme cluster, and `indexOf` or `replaceAllMapped` provides indices.
/// These indices represent *grapheme cluster boundaries*.
/// If an index is used that does not represent a grapheme cluster boundary,
/// then the behavior of the methods are unspecified.
abstract class GraphemeClusters extends Iterable<GraphemeCluster> {
  const factory GrahphemeClusters(String string) = _SomeImplementationClass;

  /// The extended grapheme clusters of this string.
  /// 
  /// The grapheme clusters are found progressively starting at the
  /// beginning of the string. If the string contains invalid encodings,
  /// they are represented by a [GraphemeCluster] with [GraphemeCluster.isValid]
  /// returning false.
  SliceIterator<GraphemeCluster> get iterator;

  /// Whether the grapheme clusters of this contains [other].
  /// 
  /// If [start] and [end] are provided, they must be valid indices,
  /// and then only the slice from start to end is checked for [other].
  /// The default value of [end] is [length].
  bool contains(GraphemeClusters other, [int start = 0, int end])

  /// Finds the first position of [other] in the grapheme clusters of this.
  /// 
  /// Returns an integer the position of the match. This index can be used
  /// as valid arguments to other methods that take indices, including
  /// the [start] and [end] parameters.
  /// 
  /// With a [start], which should be a valid grapheme cluster index,
  /// the search starts at that index instead of at the start of the 
  /// [GraphemeClusters].
  /// With an [end], which should also be a valid grapheme cluster,
  /// the search ends when reaching that position. The default value
  /// for [end] is [length].
  int indexOf(GraphemeClusters other, [int start = 0, int end]);

  /// Whether [other] is a prefix of the grapeheme clusters of this.
  /// 
  /// If [start] is provided, then it must be a valid index, and
  bool startsWith(GrapehemeClusters other, [int start]);
  
  /// Whether [other] is a suffix of the grapeheme clusters of this.
  bool endsWith(GrapehemeClusters other);

  /// Creates a new [GraphemeClusters] containing a slice of this.
  /// 
  /// The returned clusters contain all the clusters between the [start] 
  /// and [end] index positions. Both positions must be valid indices
  /// returned by methods on this class.
  GraphemeClusters getRange(int start, [int end]);

  /// The index position after the last grapheme cluster.
  /// 
  /// This is a valid index position for functions list [subclusters].
  /// It represents the position after the last [GraphemeCluster] of
  /// [iterator].
  int get length;

  bool get isEmpty => length == 0;
  bool get isNotEmpty => length != 0;

  /// Replaces a section of the grapheme clusters with a [replacement].
  GraphemeClusters replaceRange(int start, int end, 
      GraphemeClusters replacement);

  /// Replaces the first occurrence of [pattern] with [replacement].
  GraphemeClusters replaceFirst(
      GraphemeClusters pattern, GraphemeClusters replacement, [int start]);
  /// ...
  GraphemeClusters replaceAll(
      GraphemeClusters pattern, GraphemeClusters replacement);
  /// ...
  GraphemeClusters replaceFirstMapped(
      GraphemeClusters pattern, GraphemeClusters replace(int start, int end));
  /// ...
  GraphemeClusters replaceAllMapped(
      GraphemeClusters pattern, GraphemeClusters replace(int start, int end))

  /// Whether [other] is the same sequence of [GraphemeCluster]s.
  /// 
  /// Returns `true` if [other] is a [GraphemeClusters] and the
  /// [GrapemeClusters.iterator] produces the same number of grapheme
  /// clusters that are pairwise equal according to 
  /// [GrapehmeCluster.operator==].
  bool operator==(Object other);
  int get hashCode;
}

/// An iterator moving over slices of some integer-indexable collection.
abstract class SliceIterator<T> implements Iterator<T> {
  // RuneIterator could implement this interface.

  /// Finds the next slice.
  /// 
  /// Findes the next slice after [end], then moves [start] to the start
  /// of that slice and [end] to its end.
  /// If there is no next slice, [moveNext] returns false and 
  /// then [start] and [end] will have the same value
  bool moveNext();

  /// The start index of [current].
  /// 
  /// Is equal to [end] before the first call to [moveNext] and after
  /// [moveNext] has returned false.
  int get start;

  /// The end index of [current].
  int get end;
}

There are no Patterns on grapheme clusters. We can define a ClusterPattern if necessary, but RegExp won't implement it.

This design has no support for:

  • Normalization
  • Localization

All it needs to be implemented is enough information to recognize Unicode extended grapheme clusters when scanning a string from left to right.

@Hixie
Copy link

Hixie commented Oct 16, 2018

/// If an index is used that does not represent a grapheme cluster boundary,
/// then the behavior of the methods are unspecified.

People will depend on whatever the implementation does. I strongly recommend not leaving details like this unspecified, because what that really means is you're specifying it randomly based on the whim of the first popularly-used implementation.

I think length should be called end (and maybe give the absolute end index, not the length), since it doesn't represent a length in any meaningful sense (in that the number of UTF-16 words needed to describe a grapheme cluster is meaningless).

I would strongly recommend against exposing any ints in this API or having any way to detect that this is based on UTF-16. It is strongly desireable IMHO that string APIs be agnostic to encoding, as it maximises the flexibility of the implementation. It should be a strong goal IMHO to support zero-copy string operations on mmapped files with arbitrary encodings, including ideally being able to concatenate such substrings, still with zero copies, and then iterate over those concatenated strings.

@lrhn lrhn added the feature Proposed language feature that solves one or more problems label Oct 17, 2018
@lrhn
Copy link
Member Author

lrhn commented Oct 17, 2018

Using invalid indices can be specified as just assuming it is a grapheme cluster boundary, and acting as if the string started at that point (including emitting whatever invalid code-unit clusters is needed to get to an actual valid grapheme cluster start). That is likely what will happen anyway, so yes, we could define it.

I considered length being named end, but I feel bad not having a start then, and start would always be zero.
For this API, with its restrictions, I do think length makes sense.

The API is designed so that it will still work if the underlying data is UTF-8 encoded.
The integers are used to be efficient (allocating an object for each operation is expensive).
There are other approaches, which we can also strawman, like returning an index object that is mutable and using it as a kind of iterator. This is not using that approach.

It is highly questionable that a UTF-8 based GraphemeClusters can equal a UTF-16 based GraphemeClusters, but they expose different indices for the same operations. That is one argument against this approach.

@Hixie
Copy link

Hixie commented Oct 17, 2018

I don't think a GraphemeClusters should have an encoding implied by the interface.

Suppose, in some far future, I have a UTF-16-backed string and I concatenate it with a UTF-8-backed string. It's quite possible that a grapheme cluster will cross the boundary from one buffer to the next. IMHO, we should make sure our API doesn't prevent us from implementing such a feature in the future.

@Hixie
Copy link

Hixie commented Oct 17, 2018

(One way to do that would be to have a new type that is an integer internally, but doesn't expose a way to convert it to an int or double.)

@lrhn
Copy link
Member Author

lrhn commented Oct 18, 2018

Sadly, opaque type aliases don't work well with dynamic code.

@leafpetersen
Copy link
Member

Sadly, opaque type aliases don't work well with dynamic code.

Fixed that for you. :)

Seriously - I immediately had the same thought as @Hixie there. Using an opaque type doesn't support dynamism it's true but that feels like the .1% or less use case.

@rakudrama had a different approach to avoiding indices based on slices, I wonder if he has any comment here?

@yjbanov
Copy link

yjbanov commented Oct 20, 2018

Some prior art: https://golang.org/ref/spec#Type_declarations - particularly the distinction between "alias declarations" and "type definitions". Judging from the discussion above, we want "type definitions", which do support dynamic code. This program:

package main

import "fmt"

type foo int

func main() {
	a := 12
	describe(a)

	b := foo(a)
	describe(b)

	var c interface{}
	c = b
	describe(c)
	c = "hello"
	describe(c)
}

func describe(value interface{}) {
	fmt.Printf("%v of type %T\n", value, value)
}

Prints:

12 of type int
12 of type main.foo
12 of type main.foo
hello of type string

Go implements it by representing interface{} using fat pointers. When the type is concrete, its value is represented in the most efficient form: plain int (i.e. 64 bits on 64-bit hardware). Unfortunately, fat pointers make dynamic code more expensive.

@mit-mit
Copy link
Member

mit-mit commented Nov 14, 2019

Our current experiments with support for this are being carried out in this new characters package:
https://github.com/dart-lang/characters

@mit-mit mit-mit moved this from Being discussed to Being implemented in Language funnel Nov 14, 2019
@mit-mit
Copy link
Member

mit-mit commented Nov 14, 2019

Closing the present issue in favor of the more recent, more specific issue #685

@mit-mit mit-mit closed this as completed Nov 14, 2019
@mit-mit mit-mit removed this from Being implemented in Language funnel Nov 14, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature Proposed language feature that solves one or more problems
Projects
None yet
Development

No branches or pull requests

5 participants