Commit

close #421
gagolews committed Apr 30, 2021
1 parent 7d208a5 commit 606296a
Showing 334 changed files with 1,771 additions and 244 deletions.
2 changes: 1 addition & 1 deletion DESCRIPTION
@@ -1,6 +1,6 @@
Package: stringi
Version: 1.6.1
Date: 2021-04-29
Date: 2021-04-30
Title: Character String Processing Facilities
Description: A multitude of character string/text/natural language
processing tools: pattern searching (e.g., with 'Java'-like regular
1 change: 1 addition & 0 deletions NAMESPACE
@@ -172,6 +172,7 @@ export(stri_paste_list)
export(stri_rand_lipsum)
export(stri_rand_shuffle)
export(stri_rand_strings)
export(stri_rank)
export(stri_read_lines)
export(stri_read_raw)
export(stri_remove_empty)
11 changes: 8 additions & 3 deletions NEWS
@@ -18,14 +18,19 @@
The ICU4C bundle has been updated from version 61.1 to 69.1
which features Unicode 13.0 and CLDR 39.

* ...todo... #408 (stri_trans_casefold),
* [NEW FEATURE] #408: ...todo... `stri_trans_casefold()`,

* [INTERNAL] #414: Use `LEVELS(x)` macro instead of accessing `(x)->sxpinfo.gp`
directly (@lukaszdaniel).
* [NEW FEATURE] #421: `stri_rank()` ranks strings in a character vector
(e.g., for ordering data frames with regard to multiple criteria,
the ranks can be passed to `order()`, see #219).

* [BUGFIX] `stri_sort_key()` now outputs `bytes`-encoded strings.

* [BUGFIX] #415: `locale=''` was not equivalent to `locale=NULL`
in `stri_opts_collator()`.

* [INTERNAL] #414: Use `LEVELS(x)` macro instead of accessing `(x)->sxpinfo.gp`
directly (@lukaszdaniel).


## 1.5.3 (2020-09-04) **CRAN**
2 changes: 1 addition & 1 deletion R/encoding.R
@@ -133,7 +133,7 @@
#' is a translation scheme: we need to communicate with \R somehow,
#' relying on how it represents strings.
#'
#' Basically, \R has a very simple encoding marking mechanism,
#' Overall, \R has a very simple encoding marking mechanism,
#' see \code{\link{stri_enc_mark}}. There is an implicit assumption
#' that your platform's default (native) encoding always extends
#' ASCII -- \pkg{stringi} checks that whenever your native encoding
81 changes: 70 additions & 11 deletions R/sort.R
@@ -36,7 +36,7 @@
#'
#'
#' @description
#' This function sorts a character vector according to the locale-dependent
#' This function sorts a character vector according to a locale-dependent
#' lexicographic order.
#'
#'
@@ -45,7 +45,7 @@
#' in \pkg{stringi}, refer to \code{\link{stri_opts_collator}}.
#'
#' As usual in \pkg{stringi}, non-character inputs are coerced to strings,
#' see an example below for a perhaps non-intitive behavior of lexicographic
#' see an example below for a somewhat non-intuitive behavior of lexicographic
#' sorting on numeric inputs.
#'
#' This function uses a stable sort algorithm (\pkg{STL}'s \code{stable_sort}),
@@ -106,16 +106,16 @@ stri_sort <- function(str, decreasing = FALSE, na_last = NA, ..., opts_collator
#' in \pkg{stringi}, refer to \code{\link{stri_opts_collator}}.
#'
#' As usual in \pkg{stringi}, non-character inputs are coerced to strings,
#' see an example below for a perhaps non-intuitive behavior of lexicographic
#' see an example below for a somewhat non-intuitive behavior of lexicographic
#' sorting on numeric inputs.
#'
#'
#'
#'
#' This function uses a stable sort algorithm (\pkg{STL}'s \code{stable_sort}),
#' which performs up to \eqn{N*log^2(N)} element comparisons,
#' where \eqn{N} is the length of \code{str}.
#'
#' For ordering with regard to multiple criteria (such as sorting
#' data frames by more than 1 column), see \code{\link{stri_rank}}.
#'
#' @param str a character vector
#' @param decreasing a single logical value; should the sort order
#' be nondecreasing (\code{FALSE}, default)
@@ -288,16 +288,20 @@ stri_duplicated_any <- function(str, from_last = FALSE, fromLast = from_last, ..
#' Sort Keys
#'
#' @description
#' This function computes a locale-dependent 'sort key', which is an alternative
#' This function computes a locale-dependent sort key, which is an alternative
#' character representation of the string that, when ordered in the C locale
#' (which orders using bytes directly), will give an equivalent ordering to the
#' original string. It is useful for enhancing algorithms that sort only in the
#' C locale with the ability to be locale-aware.
#' (which orders using the underlying bytes directly), will give an equivalent
#' ordering to the original string. It is useful for enhancing algorithms
#' that sort only in the C locale (e.g., the \code{strcmp} function in libc)
#' with the ability to be locale-aware.
#'
#' @details
#' For more information on \pkg{ICU}'s Collator and how to tune it up
#' in \pkg{stringi}, refer to \code{\link{stri_opts_collator}}.
#'
#' See also \code{\link{stri_rank}} for ranking strings within a single character
#' vector, i.e., generating relative sort keys.
#'
#' @param str a character vector
#' @param opts_collator a named list with \pkg{ICU} Collator's options,
#' see \code{\link{stri_opts_collator}}, \code{NULL}
@@ -306,7 +310,7 @@ stri_duplicated_any <- function(str, from_last = FALSE, fromLast = from_last, ..
#'
#' @return
#' The result is a character vector with the same length as \code{str} that
#' contains the sort keys.
#' contains the sort keys. The output is marked as \code{bytes}-encoded.
#'
#' @references
#' \emph{Collation} - ICU User Guide,
@@ -325,3 +329,58 @@ stri_sort_key <- function(str, ..., opts_collator = NULL)
opts_collator <- do.call(stri_opts_collator, as.list(c(opts_collator, ...)))
.Call(C_stri_sort_key, str, opts_collator)
}
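
The sort-key idea documented above can be illustrated in base R without stringi. This is only a sketch under stated assumptions: `tolower()` stands in for a key function and is not how `stri_sort_key()` derives its keys (those come from the ICU Collator), and `order(method = "radix")` is used because it compares character vectors in the C locale.

```r
# Illustration only: a hand-made "sort key" for case-insensitive ordering.
# tolower() is a stand-in key function, not the ICU collation key.
x <- c("banana", "apple", "Cherry")

# method = "radix" orders character vectors in the C locale (bytewise).
bytewise <- order(x, method = "radix")

# Ordering the derived keys bytewise gives the intended ordering of x.
key <- tolower(x)
keyed <- order(key, method = "radix")

x[bytewise]  # uppercase letters sort before lowercase ones
x[keyed]     # case no longer affects the result
```

The same pattern applies to any routine that can only compare bytes: compute the keys once, sort the keys, and carry the original strings along.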



#' @title
#' Ranking
#'
#'
#' @description
#' This function ranks each string in a character vector according to a
#' locale-dependent lexicographic order.
#' It is a portable replacement for the base \code{xtfrm} function.
#'
#'
#' @details
#' Missing values result in missing ranks, and tied observations receive
#' the same rank (the minimum of the positions they occupy).
#'
#' For more information on \pkg{ICU}'s Collator and how to tune it up
#' in \pkg{stringi}, refer to \code{\link{stri_opts_collator}}.
#'
#' @param str a character vector
#' @param opts_collator a named list with \pkg{ICU} Collator's options,
#' see \code{\link{stri_opts_collator}}, \code{NULL}
#' for default collation options
#' @param ... additional settings for \code{opts_collator}
#'
#' @return
#' The result is a vector of ranks corresponding to each
#' string in \code{str}.
#'
#' @references
#' \emph{Collation} - ICU User Guide,
#' \url{http://userguide.icu-project.org/collation}
#'
#' @family locale_sensitive
#' @export
#' @rdname stri_rank
#'
#' @examples
#' stri_rank(c('hladny', 'chladny'), locale='pl_PL')
#' stri_rank(c('hladny', 'chladny'), locale='sk_SK')
#'
#' stri_rank("a" %s+% c(1, 100, 2, 101, 11, 10)) # lexicographic order
#' stri_rank("a" %s+% c(1, 100, 2, 101, 11, 10), numeric=TRUE)
#'
#' # Ordering a data frame with respect to two criteria:
#' X <- data.frame(a=c("b", NA, "b", "b", NA, "a", "a", "c"), b=runif(8))
#' X[order(stri_rank(X$a), X$b), ]
stri_rank <- function(str, ..., opts_collator=NULL)
{
if (!missing(...))
opts_collator <- do.call(stri_opts_collator, as.list(c(opts_collator, ...)))

.Call(C_stri_rank, str, opts_collator)
}
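
The documented tie and missing-value behaviour can be mimicked with base R's `rank()`. This is a sketch, assuming plain ASCII data for which the locale does not matter; it does not call `stri_rank()` itself, and the fixed values in `b` replace the `runif(8)` from the roxygen example so that the result is reproducible.

```r
# Base-R analogue of the documented behaviour (not stri_rank() itself).
a <- c("b", NA, "b", "b", NA, "a", "a", "c")
b <- c(0.8, 0.1, 0.3, 0.5, 0.2, 0.9, 0.4, 0.6)  # fixed stand-in for runif(8)

# Ties share the minimum rank; missing values keep missing ranks.
r <- rank(a, ties.method = "min", na.last = "keep")
r

# Order by `a` first, then by `b` within ties; order() places NA ranks last.
X <- data.frame(a = a, b = b)
idx <- order(r, X$b)
X[idx, ]
```

Passing ranks rather than the strings themselves to `order()` is what lets a string column be combined with other sorting criteria.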
13 changes: 6 additions & 7 deletions R/stringi-package.R
@@ -37,7 +37,7 @@
#' \pkg{stringi} is THE R package for fast, correct, consistent,
#' and convenient string/text manipulation.
#' It gives predictable results on every platform, in each locale,
#' and under any ``native'' character encoding.
#' and under any native character encoding.
#'
#' \bold{Keywords}: R, text processing, character strings,
#' internationalization, localization, ICU, ICU4C, i18n, l10n, Unicode.
@@ -61,7 +61,7 @@
#' locale-sensitive operations. In particular, see
#' \code{\link{stri_opts_collator}} for a description of the string
#' collation algorithm, which is used for string comparing, ordering,
#' sorting, case-folding, and searching.
#' ranking, sorting, case-folding, and searching.
#'
#' \item \link{about_arguments} -- information on how \pkg{stringi}
#' treats its functions' arguments.
@@ -119,8 +119,8 @@
#' text transforms, including transliteration.
#'
#' \item \code{\link{stri_cmp}}, \code{\link{\%s<\%}}, \code{\link{stri_order}},
#' \code{\link{stri_sort}}, \code{\link{stri_unique}}, and
#' \code{\link{stri_duplicated}} for collation-based,
#' \code{\link{stri_sort}}, \code{\link{stri_rank}}, \code{\link{stri_unique}},
#' and \code{\link{stri_duplicated}} for collation-based,
#' locale-aware operations, see also \link{about_locale}.
#'
#' \item \code{\link{stri_split_lines}} (among others)
@@ -147,9 +147,8 @@
#' @docType package
#' @author Marek Gagolewski,
#' with contributions from Bartek Tartanus and many others.
#' ICU4C was developed by IBM and others.
#' The Unicode Character Database is due to Unicode, Inc.;
#' see the COPYRIGHTS file for more details.
#' ICU4C was developed by IBM, Unicode, Inc., and others.
#'
#' @references
#' \emph{\pkg{stringi} Package homepage}, \url{https://stringi.gagolewski.com/}
#'
Binary file modified devel/sphinx/_build/doctrees/environment.pickle
Binary file modified devel/sphinx/_build/doctrees/news.doctree
Binary file modified devel/sphinx/_build/doctrees/rapi.doctree
Binary file modified devel/sphinx/_build/doctrees/rapi/about_encoding.doctree
Binary file modified devel/sphinx/_build/doctrees/rapi/about_locale.doctree
Binary file modified devel/sphinx/_build/doctrees/rapi/about_search_boundaries.doctree
Binary file modified devel/sphinx/_build/doctrees/rapi/about_search_coll.doctree
Binary file modified devel/sphinx/_build/doctrees/rapi/about_stringi.doctree
Binary file modified devel/sphinx/_build/doctrees/rapi/operator_compare.doctree
Binary file modified devel/sphinx/_build/doctrees/rapi/stri_compare.doctree
Binary file modified devel/sphinx/_build/doctrees/rapi/stri_count_boundaries.doctree
Binary file modified devel/sphinx/_build/doctrees/rapi/stri_duplicated.doctree
Binary file modified devel/sphinx/_build/doctrees/rapi/stri_enc_detect2.doctree
Binary file modified devel/sphinx/_build/doctrees/rapi/stri_extract_boundaries.doctree
Binary file modified devel/sphinx/_build/doctrees/rapi/stri_locate_boundaries.doctree
Binary file modified devel/sphinx/_build/doctrees/rapi/stri_opts_collator.doctree
Binary file modified devel/sphinx/_build/doctrees/rapi/stri_order.doctree
Binary file not shown.
Binary file modified devel/sphinx/_build/doctrees/rapi/stri_sort.doctree
Binary file modified devel/sphinx/_build/doctrees/rapi/stri_sort_key.doctree
Binary file modified devel/sphinx/_build/doctrees/rapi/stri_split_boundaries.doctree
Binary file modified devel/sphinx/_build/doctrees/rapi/stri_trans_casemap.doctree
Binary file modified devel/sphinx/_build/doctrees/rapi/stri_unique.doctree
Binary file modified devel/sphinx/_build/doctrees/rapi/stri_wrap.doctree
12 changes: 9 additions & 3 deletions devel/sphinx/_build/html/_sources/news.rst.txt
@@ -15,14 +15,20 @@ What Is New in *stringi*
- …todo… #401 (update ICU4C to 69.1), The ICU4C bundle has been updated
from version 61.1 to 69.1 which features Unicode 13.0 and CLDR 39.

- …todo… #408 (stri_trans_casefold),
- [NEW FEATURE] #408: …todo… ``stri_trans_casefold()``,

- [INTERNAL] #414: Use ``LEVELS(x)`` macro instead of accessing
``(x)->sxpinfo.gp`` directly (@lukaszdaniel).
- [NEW FEATURE] #421: ``stri_rank()`` ranks strings in a character
vector (e.g., for ordering data frames with regard to multiple
criteria, the ranks can be passed to ``order()``, see #219).

- [BUGFIX] ``stri_sort_key()`` now outputs ``bytes``-encoded strings.

- [BUGFIX] #415: ``locale=''`` was not equivalent to ``locale=NULL`` in
``stri_opts_collator()``.

- [INTERNAL] #414: Use ``LEVELS(x)`` macro instead of accessing
``(x)->sxpinfo.gp`` directly (@lukaszdaniel).

1.5.3 (2020-09-04) **CRAN**
---------------------------

1 change: 1 addition & 0 deletions devel/sphinx/_build/html/_sources/rapi.rst.txt
@@ -72,6 +72,7 @@ R Package *stringi* Reference
rapi/stri_rand_lipsum
rapi/stri_rand_shuffle
rapi/stri_rand_strings
rapi/stri_rank
rapi/stri_read_lines
rapi/stri_read_raw
rapi/stri_remove_empty
@@ -43,7 +43,7 @@ Character Encodings in R

Data in memory are just bytes (small integer values) – an en\ *coding* is a way to represent characters with such numbers, it is a semantic 'key' to understand a given byte sequence. For example, in ISO-8859-2 (Central European), the value 177 represents Polish “a with ogonek”, and in ISO-8859-1 (Western European), the same value denotes the “plus-minus” sign. Thus, a character encoding is a translation scheme: we need to communicate with R somehow, relying on how it represents strings.

Basically, R has a very simple encoding marking mechanism, see `stri_enc_mark <stri_enc_mark.html>`__. There is an implicit assumption that your platform's default (native) encoding always extends ASCII – stringi checks that whenever your native encoding is being detected automatically on ICU's initialization and each time when you change it manually by calling `stri_enc_set <stri_enc_set.html>`__.
Overall, R has a very simple encoding marking mechanism, see `stri_enc_mark <stri_enc_mark.html>`__. There is an implicit assumption that your platform's default (native) encoding always extends ASCII – stringi checks that whenever your native encoding is being detected automatically on ICU's initialization and each time when you change it manually by calling `stri_enc_set <stri_enc_set.html>`__.

Character strings in R (internally) can be declared to be in:

@@ -54,6 +54,6 @@ See Also

Other locale_management: `stri_locale_info() <stri_locale_info.html>`__, `stri_locale_list() <stri_locale_list.html>`__, `stri_locale_set() <stri_locale_set.html>`__

Other locale_sensitive: `%s<%() <operator_compare.html>`__, `about_search_boundaries <about_search_boundaries.html>`__, `about_search_coll <about_search_coll.html>`__, `stri_compare() <stri_compare.html>`__, `stri_count_boundaries() <stri_count_boundaries.html>`__, `stri_duplicated() <stri_duplicated.html>`__, `stri_enc_detect2() <stri_enc_detect2.html>`__, `stri_extract_all_boundaries() <stri_extract_boundaries.html>`__, `stri_locate_all_boundaries() <stri_locate_boundaries.html>`__, `stri_opts_collator() <stri_opts_collator.html>`__, `stri_order() <stri_order.html>`__, `stri_sort_key() <stri_sort_key.html>`__, `stri_sort() <stri_sort.html>`__, `stri_split_boundaries() <stri_split_boundaries.html>`__, `stri_trans_tolower() <stri_trans_casemap.html>`__, `stri_unique() <stri_unique.html>`__, `stri_wrap() <stri_wrap.html>`__
Other locale_sensitive: `%s<%() <operator_compare.html>`__, `about_search_boundaries <about_search_boundaries.html>`__, `about_search_coll <about_search_coll.html>`__, `stri_compare() <stri_compare.html>`__, `stri_count_boundaries() <stri_count_boundaries.html>`__, `stri_duplicated() <stri_duplicated.html>`__, `stri_enc_detect2() <stri_enc_detect2.html>`__, `stri_extract_all_boundaries() <stri_extract_boundaries.html>`__, `stri_locate_all_boundaries() <stri_locate_boundaries.html>`__, `stri_opts_collator() <stri_opts_collator.html>`__, `stri_order() <stri_order.html>`__, `stri_rank() <stri_rank.html>`__, `stri_sort_key() <stri_sort_key.html>`__, `stri_sort() <stri_sort.html>`__, `stri_split_boundaries() <stri_split_boundaries.html>`__, `stri_trans_tolower() <stri_trans_casemap.html>`__, `stri_unique() <stri_unique.html>`__, `stri_wrap() <stri_wrap.html>`__

Other stringi_general_topics: `about_arguments <about_arguments.html>`__, `about_encoding <about_encoding.html>`__, `about_search_boundaries <about_search_boundaries.html>`__, `about_search_charclass <about_search_charclass.html>`__, `about_search_coll <about_search_coll.html>`__, `about_search_fixed <about_search_fixed.html>`__, `about_search_regex <about_search_regex.html>`__, `about_search <about_search.html>`__, `about_stringi <about_stringi.html>`__
@@ -43,7 +43,7 @@ References
See Also
~~~~~~~~

Other locale_sensitive: `%s<%() <operator_compare.html>`__, `about_locale <about_locale.html>`__, `about_search_coll <about_search_coll.html>`__, `stri_compare() <stri_compare.html>`__, `stri_count_boundaries() <stri_count_boundaries.html>`__, `stri_duplicated() <stri_duplicated.html>`__, `stri_enc_detect2() <stri_enc_detect2.html>`__, `stri_extract_all_boundaries() <stri_extract_boundaries.html>`__, `stri_locate_all_boundaries() <stri_locate_boundaries.html>`__, `stri_opts_collator() <stri_opts_collator.html>`__, `stri_order() <stri_order.html>`__, `stri_sort_key() <stri_sort_key.html>`__, `stri_sort() <stri_sort.html>`__, `stri_split_boundaries() <stri_split_boundaries.html>`__, `stri_trans_tolower() <stri_trans_casemap.html>`__, `stri_unique() <stri_unique.html>`__, `stri_wrap() <stri_wrap.html>`__
Other locale_sensitive: `%s<%() <operator_compare.html>`__, `about_locale <about_locale.html>`__, `about_search_coll <about_search_coll.html>`__, `stri_compare() <stri_compare.html>`__, `stri_count_boundaries() <stri_count_boundaries.html>`__, `stri_duplicated() <stri_duplicated.html>`__, `stri_enc_detect2() <stri_enc_detect2.html>`__, `stri_extract_all_boundaries() <stri_extract_boundaries.html>`__, `stri_locate_all_boundaries() <stri_locate_boundaries.html>`__, `stri_opts_collator() <stri_opts_collator.html>`__, `stri_order() <stri_order.html>`__, `stri_rank() <stri_rank.html>`__, `stri_sort_key() <stri_sort_key.html>`__, `stri_sort() <stri_sort.html>`__, `stri_split_boundaries() <stri_split_boundaries.html>`__, `stri_trans_tolower() <stri_trans_casemap.html>`__, `stri_unique() <stri_unique.html>`__, `stri_wrap() <stri_wrap.html>`__

Other text_boundaries: `about_search <about_search.html>`__, `stri_count_boundaries() <stri_count_boundaries.html>`__, `stri_extract_all_boundaries() <stri_extract_boundaries.html>`__, `stri_locate_all_boundaries() <stri_locate_boundaries.html>`__, `stri_opts_brkiter() <stri_opts_brkiter.html>`__, `stri_split_boundaries() <stri_split_boundaries.html>`__, `stri_split_lines() <stri_split_lines.html>`__, `stri_trans_tolower() <stri_trans_casemap.html>`__, `stri_wrap() <stri_wrap.html>`__

@@ -29,6 +29,6 @@ See Also

Other search_coll: `about_search <about_search.html>`__, `stri_opts_collator() <stri_opts_collator.html>`__

Other locale_sensitive: `%s<%() <operator_compare.html>`__, `about_locale <about_locale.html>`__, `about_search_boundaries <about_search_boundaries.html>`__, `stri_compare() <stri_compare.html>`__, `stri_count_boundaries() <stri_count_boundaries.html>`__, `stri_duplicated() <stri_duplicated.html>`__, `stri_enc_detect2() <stri_enc_detect2.html>`__, `stri_extract_all_boundaries() <stri_extract_boundaries.html>`__, `stri_locate_all_boundaries() <stri_locate_boundaries.html>`__, `stri_opts_collator() <stri_opts_collator.html>`__, `stri_order() <stri_order.html>`__, `stri_sort_key() <stri_sort_key.html>`__, `stri_sort() <stri_sort.html>`__, `stri_split_boundaries() <stri_split_boundaries.html>`__, `stri_trans_tolower() <stri_trans_casemap.html>`__, `stri_unique() <stri_unique.html>`__, `stri_wrap() <stri_wrap.html>`__
Other locale_sensitive: `%s<%() <operator_compare.html>`__, `about_locale <about_locale.html>`__, `about_search_boundaries <about_search_boundaries.html>`__, `stri_compare() <stri_compare.html>`__, `stri_count_boundaries() <stri_count_boundaries.html>`__, `stri_duplicated() <stri_duplicated.html>`__, `stri_enc_detect2() <stri_enc_detect2.html>`__, `stri_extract_all_boundaries() <stri_extract_boundaries.html>`__, `stri_locate_all_boundaries() <stri_locate_boundaries.html>`__, `stri_opts_collator() <stri_opts_collator.html>`__, `stri_order() <stri_order.html>`__, `stri_rank() <stri_rank.html>`__, `stri_sort_key() <stri_sort_key.html>`__, `stri_sort() <stri_sort.html>`__, `stri_split_boundaries() <stri_split_boundaries.html>`__, `stri_trans_tolower() <stri_trans_casemap.html>`__, `stri_unique() <stri_unique.html>`__, `stri_wrap() <stri_wrap.html>`__

Other stringi_general_topics: `about_arguments <about_arguments.html>`__, `about_encoding <about_encoding.html>`__, `about_locale <about_locale.html>`__, `about_search_boundaries <about_search_boundaries.html>`__, `about_search_charclass <about_search_charclass.html>`__, `about_search_fixed <about_search_fixed.html>`__, `about_search_regex <about_search_regex.html>`__, `about_search <about_search.html>`__, `about_stringi <about_stringi.html>`__