Explore using icu4 #1227

bim9262 · 2023-08-26T20:15:31Z

Relates to #854, but doesn't replace datetime code as a mentioned possible approach.

Uses icu4 for:

short_months
long_months
short_weekdays
long_weekdays
am_pm
decimal_point

Still uses pure_rust_locales for (I couldn't figure out how to get this from icu4 yet...):

d_fmt
d_t_fmt
t_fmt
t_fmt_ampm

Thanks for contributing to chrono!

If your feature is semver-compatible, please target the 0.4.x branch;
the main branch will be used for 0.5.0 development going forward.

Please consider adding a test to ensure your bug fix/feature will not break in the future.

Relates to chronotope#854, but doesn't replace datetime code as a mentioned possible approach. Uses icu4 for: * short_months * long_months * short_weekdays * long_weekdays * am_pm * decimal_point Still uses pure_rust_locales for (I couldn't figure out how to get this from icu4 yet...): * d_fmt * d_t_fmt * t_fmt * t_fmt_ampm

codecov · 2023-08-26T20:22:53Z

Codecov Report

Merging #1227 (c83af79) into 0.4.x (7eb6db1) will decrease coverage by 0.11%.
The diff coverage is 86.69%.

@@            Coverage Diff             @@
##            0.4.x    #1227      +/-   ##
==========================================
- Coverage   86.14%   86.03%   -0.11%     
==========================================
  Files          37       37              
  Lines       13394    13663     +269     
==========================================
+ Hits        11538    11755     +217     
- Misses       1856     1908      +52

Files Changed	Coverage Δ
src/format/locales.rs	`82.31% <81.36%> (-17.69%)`	⬇️
src/format/formatting.rs	`94.78% <96.80%> (+0.05%)`	⬆️
src/datetime/tests.rs	`96.44% <100.00%> (ø)`
src/format/strftime.rs	`98.94% <100.00%> (ø)`

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

djc · 2023-08-26T20:28:19Z

What's the benefit here? How does it impact compile sizes and binary sizes?

bim9262 · 2023-08-26T22:02:55Z

My end goal is to support #281 and to improve localization overall - although I suppose it is subjective if the data from icu is better than the data in pure_rust_locales). In the tests you can see that there's some discrepencies in the localization data between the two.

For the compile size are you referring to the size of libchrono.rlib?

I made a little test wrapper that depends on chrono ( chrono = { version = "0.4.26", features = ["unstable-locales"] }) and compared the size with and without the icu addition.

without icu the bin is 4.6M
with icu the bin is 7.5M.

djc · 2023-08-28T07:55:29Z

So a 3M binary size penalty seems like a big downside...

pitdicker · 2023-08-28T10:18:22Z

My end goal is to support #281 and to improve localization overall - although I suppose it is subjective if the data from icu is better than the data in pure_rust_locales).

Good that you look at localization overall and not just adding "nd", "th" suffixes 👍. That looks to English-focused to me, as there are similar but different shortcomings when formatting in other languages.

But I don't think our strftime-format string is going to be flexible enough to provide optimal localized formatting. And I don't think we have the experience to come up with enough extensions to get this right either. If one project can design a good format and API it would be ICU.

Do we want to combine some localization data from ICU, while lacking their formatting API, with the formatting string format from libc?

Given that we use a strftime formatting string like libc, it makes sense to me that we also use the localization data from libc. Our localization would be 'as good as', and match, libc. That is already a useful thing for the Rust ecosystem.
And when you want more advanced formatting, ICU4X would be the way to go.

pitdicker · 2023-08-28T10:26:21Z

So a 3M binary size penalty seems like a big downside...

We already don't include timezone data but leave it in chrono-tz because it would add 2+ Mb to the binary size (and there is the problem it can get outdated...). I would argue that that data is a lot closer to the core functionality of chrono.

pitdicker · 2023-08-28T11:01:42Z

src/datetime/tests.rs

-    assert_eq!(dt.format_localized("%T%.3f", ar_SY).to_string(), "18:58:00.123");
-    assert_eq!(dt.format_localized("%T%.6f", ar_SY).to_string(), "18:58:00.123456");
-    assert_eq!(dt.format_localized("%T%.9f", ar_SY).to_string(), "18:58:00.123456780");
+    assert_eq!(dt.format_localized("%T%.f", ar_SY).to_string(), "18:58:00٫123456780");


An Arabic comma (or what is it called) in roman numerals doesn't seem right.

Agreed, some locales have outdigits defined as part of LC_CTYPE, but pure-rust-locales doesn't support these. I did check though and there's nothing defined for ar_SY. I beleive that ICU has the numbers, though I'm still researching.

pitdicker · 2023-08-28T11:03:15Z

src/format/formatting.rs

-                    let nano = t.nanosecond() % 1_000_000_000;
-                    if nano == 0 {
+            let ret =
+                match *spec {


Rustfmt is at it again 😆. #1082 (comment)

pitdicker · 2023-08-28T11:11:48Z

Some links for a bit of context/history of our localization support:

Some of the goals where to not generate the data at build time, to not increase the binary size with locale data that is not used, and to work without allocating.

bim9262 · 2023-08-28T22:11:06Z

I'll take a look at generating rust files from the icu data that only has the data we'd need. For example there was a lot of extra information included about non-gregorian calendars. Producing data in a format like pure-rust-locales should allow us to have &'static strs, and not need to alloc. This should also cut down on the size of the additional allocation information. 🤞

bim9262 · 2023-08-29T05:24:28Z

Upon further testing by dumping out some mods from the ICU data I see results like:

#[allow(non_snake_case,non_camel_case_types,dead_code,unused_imports)]
pub mod aa_DJ {
    pub const SHORT_MONTHS: &[&str] = &["M01", "M02", "M03", "M04", "M05", "M06", "M07", "M08", "M09", "M10", "M11", "M12"];
    pub const LONG_MONTHS: &[&str] = &["M01", "M02", "M03", "M04", "M05", "M06", "M07", "M08", "M09", "M10", "M11", "M12"];
    pub const SHORT_WEEKDAYS: &[&str] = &["Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat"];
    pub const LONG_WEEKDAYS: &[&str] = &["Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat"];
    pub const AM_PM: &[&str] = &["AM", "PM"];
    pub const DECIMAL_POINT: &str = ".";
    pub const DIGITS: &[char] = &['0', '1', '2', '3', '4', '5', '6', '7', '8', '9'];
}

Since there the data for months doesn't look good. I don't think that it would be a good to use icu for months (and probably not weekdays either)

The data for decimal and digits looks promising (the digits are displayed 9-0 on github, even though they are 0-9 in my terminal. Probably due to left to right vs right to left. As a side note, although arabic is read right to left, the numbers are constructed left to right):

pub mod ar_SY {
    [...]
    pub const DECIMAL_POINT: &str = "٫";
    pub const DIGITS: &[char] = &['٠', '١', '٢', '٣', '٤', '٥', '٦', '٧', '٨', '٩'];
}

Still need to investigate the ordinal numbers...

bim9262 · 2023-08-29T07:02:33Z

This shows a comparison between the glibc and icu for am_pm and the decimal point

https://gist.github.com/bim9262/920d29bd6155e207239bf11f42fe5da6

There are a lot of interesting cases where glibc has something more specific than am/pm or icu will be more specific.

Some examples to check out:

hsb_DE (glibc blank, icu localized)
ig_NG (glibc am/pm, icu localized)
hif_FJ (glibc localized, icu am/pm)
ks_IN (both localized, but very different)

I'd be tempted to take use glibc unless it's blank or am/pm then try icu as the fallback.

autarch · 2024-02-11T05:41:08Z

But I don't think our strftime-format string is going to be flexible enough to provide optimal localized formatting. And I don't think we have the experience to come up with enough extensions to get this right either. If one project can design a good format and API it would be ICU.

ICU has a set of patterns for date & time components. See https://cldr.unicode.org/translation/date-time/date-time-patterns

It covers a lot more cases than the strftime formatting specifiers. You can use the ICU data with the strftime specifiers as well. You just have to make some decisions about the mapping. I did this with my Perl module DateTime. I could enumerate the choices I made there if there's interest in this.

autarch · 2024-02-11T05:49:43Z

Upon further testing by dumping out some mods from the ICU data I see results like:

#[allow(non_snake_case,non_camel_case_types,dead_code,unused_imports)]
pub mod aa_DJ {
    pub const SHORT_MONTHS: &[&str] = &["M01", "M02", "M03", "M04", "M05", "M06", "M07", "M08", "M09", "M10", "M11", "M12"];
    pub const LONG_MONTHS: &[&str] = &["M01", "M02", "M03", "M04", "M05", "M06", "M07", "M08", "M09", "M10", "M11", "M12"];
    pub const SHORT_WEEKDAYS: &[&str] = &["Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat"];
    pub const LONG_WEEKDAYS: &[&str] = &["Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat"];
    pub const AM_PM: &[&str] = &["AM", "PM"];
    pub const DECIMAL_POINT: &str = ".";
    pub const DIGITS: &[char] = &['0', '1', '2', '3', '4', '5', '6', '7', '8', '9'];
}

Since there the data for months doesn't look good. I don't think that it would be a good to use icu for months (and probably not weekdays either)

This sort of thing happens when the ICU project's data for a locale is incomplete. You can see the underlying data for that locale at https://github.com/unicode-org/cldr/blob/main/common/main/aa_DJ.xml. It's nearly empty. Compare it to a more complete locale at https://github.com/unicode-org/cldr/blob/main/common/main/am.xml.

There's a whole system for locale "inheritance". In the case of aa-DJ it should fall back to aa, which does have data (at least in the CLDR GitHub repo). But the output above looks like it's falling back to the root locale. I'm not sure how you dumped that data. Or maybe it's using an older version of the data where the aa locale was empty.

autarch · 2024-02-11T05:51:16Z

One more thing re: CLDR formatting. If you want to support this I think it's best to come up with new functions that accept these formats, like format_cldr or something along those lines.

pitdicker · 2024-04-10T12:50:58Z

So a 3M binary size penalty seems like a big downside...

I don't think our strftime-format string is going to be flexible enough to provide optimal localized formatting. And I don't think we have the experience to come up with enough extensions to get this right either. If one project can design a good format and API it would be ICU.

Do we want to combine some localization data from ICU, while lacking their formatting API, with the formatting string format from libc?

Given that we use a strftime formatting string like libc, it makes sense to me that we also use the localization data from libc. Our localization would be 'as good as', and match, libc. That is already a useful thing for the Rust ecosystem. And when you want more advanced formatting, ICU4X would be the way to go

In my opinion chrono should focus mostly on being a good date-and-time library. ICU can be better at localization. I propose to close this PR and instead make it easy to convert to/from icu::calendar::{Date, DateTime}. Maybe behind an icu feature?

bim9262 added 2 commits August 26, 2023 18:08

icu prefer stand_alone vs format

df542e7

fallback to pure_rust_locales if icu fails

c83af79

pitdicker reviewed Aug 28, 2023

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Explore using icu4 #1227

Explore using icu4 #1227

bim9262 commented Aug 26, 2023

codecov bot commented Aug 26, 2023 •

edited

Loading

djc commented Aug 26, 2023

bim9262 commented Aug 26, 2023

djc commented Aug 28, 2023

pitdicker commented Aug 28, 2023

pitdicker commented Aug 28, 2023 •

edited

Loading

pitdicker Aug 28, 2023

bim9262 Aug 28, 2023

pitdicker Aug 28, 2023

pitdicker commented Aug 28, 2023

bim9262 commented Aug 28, 2023

bim9262 commented Aug 29, 2023

bim9262 commented Aug 29, 2023

autarch commented Feb 11, 2024

autarch commented Feb 11, 2024

autarch commented Feb 11, 2024

pitdicker commented Apr 10, 2024

Explore using icu4 #1227

Are you sure you want to change the base?

Explore using icu4 #1227

Conversation

bim9262 commented Aug 26, 2023

Thanks for contributing to chrono!

codecov bot commented Aug 26, 2023 • edited Loading

Codecov Report

djc commented Aug 26, 2023

bim9262 commented Aug 26, 2023

djc commented Aug 28, 2023

pitdicker commented Aug 28, 2023

pitdicker commented Aug 28, 2023 • edited Loading

pitdicker Aug 28, 2023

Choose a reason for hiding this comment

bim9262 Aug 28, 2023

Choose a reason for hiding this comment

pitdicker Aug 28, 2023

Choose a reason for hiding this comment

pitdicker commented Aug 28, 2023

bim9262 commented Aug 28, 2023

bim9262 commented Aug 29, 2023

bim9262 commented Aug 29, 2023

autarch commented Feb 11, 2024

autarch commented Feb 11, 2024

autarch commented Feb 11, 2024

pitdicker commented Apr 10, 2024

codecov bot commented Aug 26, 2023 •

edited

Loading

pitdicker commented Aug 28, 2023 •

edited

Loading