Anagram add tests for graphemes #2445

meatball133 · 2024-05-18T15:40:24Z

This has been discussed here: https://forum.exercism.org/t/unicode-testing-for-anagram-doesnt-actually-test-for-grapheme/10906

…icode code points

BethanyG

The instructions are explicit about ASCII characters only. We should have a discussion around whether or not we're internationalizing the Anagrams. Currently, they are all English only, and do not contain umlauts or other accents. I don't think this exercise is a good candidate for Unicode as currently specified.

meatball133 · 2024-05-18T17:06:03Z

The tests already contain non-ASCII characters. I am pretty sure this "β" is the Greek letter beta (a non-ASCII letter). From my perspective, it could be argued that this exercise shouldn't have Unicode characters. But that discussion should have been held when the first Unicode tests were added.

meatball133 · 2024-05-18T17:07:51Z

Here is the PR for reference: #2366

BethanyG · 2024-05-18T17:13:49Z

It could be argued that this exercise shouldn't have Unicode characters. But that discussion should have been held when the first Unicode tests were added.

Just because we didn't have the discussion then doesn't mean we can't have it now. I don't think those test cases should have been added, even with the scenario flag, given the instructions (I know I approved that, but I shouldn't have without looking at the docs).

I think if we go this route, we need to change/clarify the instructions, and we also need to make sure that any test cases form valid words in the target language. For ref, the Wiki article on Anagrams.

senekor · 2024-05-18T18:46:12Z

The reasoning at the time of the original PR was that these tests are added under the unicode scenario. (forum post) Languages are free to exclude these tests and add additional instructions if they choose to include them. Example from the Rust track:

# Instructions append

The Rust track extends the possible letters to be any unicode character, not just ASCII alphabetic ones.

BethanyG · 2024-05-18T19:07:01Z

Yes - I know. 🙂 But at the time, I didn't realize that the Anagram instructions directed students specifically at ASCII. I think that directive should be removed, and the instructions made more general, or there should be explicit mention of Unicode handling. For example, reverse-string doesn't mention ASCII at all, and micro-blog gets very specific about encodings.

I think if we're going to include Unicode here, the instructions should follow one or the other of those exercise examples. I also think that we should make sure that any test cases follow the rules of Anagram formation.

ErikSchierboom · 2024-05-21T11:03:03Z

I think if we're going to include Unicode here, the instructions should follow one or the other of those exercise examples.

I think I would prefer to just not mention ASCII in the instructions, as we already have non-ascii characters.

I also think that we should make sure that any test cases follow the rules of Anagram formation.

What do you mean by this? Just that we double-check if the test cases match the instructions?

BethanyG · 2024-05-21T15:53:35Z

What do you mean by this? Just that we double-check if the test cases match the instructions?

An Anagram needs to be a valid word in a given language (capitalization notwithstanding), so as the instructions are written I think all candidates passed in test cases either need to be valid words, or obviously not valid, if that makes sense.

senekor · 2024-05-21T18:13:03Z

An Anagram needs to be a valid word in a given language

I feel like the only sensible interpretation of the candidates list is that students can assume them to be valid words. Otherwise, every single solution to this exercise would have to include an actual dictionary of the English language.

Wikipedia may have some definition of the word Anagram, but natural language processing is not the goal of this exercise, right?

BethanyG · 2024-05-21T18:51:26Z

I feel like the only sensible interpretation of the candidates list is that students can assume them to be valid words. Otherwise, every single solution to this exercise would have to include an actual dictionary of the English language.

...which is part of the point I am trying to make. In adding in Unicode characters (or any extended ASCII for that matter) we've taken this beyond English. With very few exceptions, English doesn't include any accented characters (nor Greek!).

But that's fine! It just means when we craft test cases, we need to make sure that the candidates are valid words, whatever the language is that the candidates are written in. Conversely, I don't know any languages that use a Euro sign within words, so "€a" (last test case) is "obviously" wrong (and the start word is a single letter followed by a non-word symbol anyways), which I think is fine.

We might want to add "assume all candidates are valid words" to the instructions as well tho, just to be safe.

And for the test case in Greek, we have ["ΒΓΑ", "ΒΓΔ", "γβα", "αβγ"] with the expected result being ["ΒΓΑ", "γβα"]. But why not "αβγ"? Because it's not a valid word - it is "ABC". But I had to go look that up in Google translate to figure that out. So I think that case needs to be changed.

And for this proposed test case, we need to pick a word that's valid, and craft candidates that are also valid words in the same language.

senekor · 2024-05-21T19:00:30Z

But why not "αβγ"?

Because it's the lowercase version of ΑΒΓ.

I don't see how it affects the user experience whether the words are valid in some natural language, but I'm fine with reimplementing the test cases with actual words that test the same thing.
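For reference, the Greek case discussed above can be reproduced with a short sketch. This is only an illustration in Python, not the reference solution of any track: the function name `find_anagrams` and the choice of `str.casefold` for the case-insensitive comparison are assumptions made here. It shows why "αβγ" is excluded: after case folding it is identical to the subject ΑΒΓ, and a word is not treated as its own anagram.

```python
from collections import Counter


def find_anagrams(subject: str, candidates: list[str]) -> list[str]:
    """Return the candidates that are case-insensitive anagrams of subject.

    Hypothetical helper: letters are compared as Unicode code points
    after case folding, and a candidate equal to the subject
    (ignoring case) is rejected.
    """
    folded_subject = subject.casefold()
    subject_counts = Counter(folded_subject)
    result = []
    for candidate in candidates:
        folded = candidate.casefold()
        if folded == folded_subject:
            # Same word in a different case: not an anagram of itself.
            continue
        if Counter(folded) == subject_counts:
            result.append(candidate)
    return result


print(find_anagrams("ΑΒΓ", ["ΒΓΑ", "ΒΓΔ", "γβα", "αβγ"]))
# ['ΒΓΑ', 'γβα']
```

Counting code points with `Counter` avoids sorting and keeps the comparison independent of letter order.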

senekor · 2024-05-21T19:04:55Z

Note that the test case using €a is extremely important for correct Unicode handling. Finding valid words in a natural language that contain multi-byte characters in UTF-8 and have the same number of each byte could prove difficult.
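To illustrate why the €a case matters, here is a minimal Python sketch comparing a byte-based check with a code-point-based check. The subject "a⬂" is an assumption chosen for this illustration (its UTF-8 bytes are a permutation of those of "€a"), and both helper names are hypothetical.

```python
from collections import Counter


def same_bytes(a: str, b: str) -> bool:
    """Compare the multisets of UTF-8 bytes (the buggy approach)."""
    return Counter(a.encode("utf-8")) == Counter(b.encode("utf-8"))


def same_code_points(a: str, b: str) -> bool:
    """Compare the multisets of Unicode code points."""
    return Counter(a) == Counter(b)


subject = "a\u2b02"    # "a⬂": bytes 61 E2 AC 82
candidate = "\u20aca"  # "€a": bytes E2 82 AC 61

print(same_bytes(subject, candidate))        # True  (false positive)
print(same_code_points(subject, candidate))  # False
```

A solution that sorts or counts raw bytes would report these two strings as anagrams, even though they share only the letter "a".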

MatthijsBlom · 2024-05-21T20:54:57Z

I don't think considering languages is feasible. The strings abba and baäb might be anagrams in one 'language', but not in another: various writing systems disagree on whether two characters are the same 'letter' or not.

If Unicode tests are to be included, I think it would be wise to have the instructions explicitly define what they mean by 'letter' for the purposes of this exercise, but not consider applicability to the World's writing systems.
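One possible way to pin down 'letter' is sketched below in Python, comparing raw code points against an approximation of grapheme clusters (a base character plus the combining marks that follow it). The strings and helper names are hypothetical, and this grouping is a simplification based on `unicodedata.combining`, not full Unicode text segmentation.

```python
import unicodedata
from collections import Counter


def clusters(text: str) -> list[str]:
    """Group each base character with the combining marks after it.

    Approximation only: it ignores emoji sequences, Hangul jamo and
    other cases covered by real grapheme segmentation.
    """
    result: list[str] = []
    for ch in text:
        if result and unicodedata.combining(ch):
            result[-1] += ch  # attach the mark to the previous base
        else:
            result.append(ch)
    return result


def same_code_points(a: str, b: str) -> bool:
    return Counter(a) == Counter(b)


def same_clusters(a: str, b: str) -> bool:
    return Counter(clusters(a)) == Counter(clusters(b))


first = "a\u0301b"   # "a" + combining acute accent, then "b"
second = "b\u0301a"  # "b" + combining acute accent, then "a"

print(same_code_points(first, second))  # True
print(same_clusters(first, second))     # False
```

Both strings contain the same three code points, but the accent attaches to a different base letter, so a grapheme-based definition treats them as different letters. Normalizing the input first (for example with `unicodedata.normalize`) would additionally make precomposed and decomposed forms compare equal.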

BethanyG · 2024-05-21T22:01:31Z

Because it's the lowercase version of ΑΒΓ.

D'OH 🤦🏽‍♀️ Never mind, then.

Note that the test case using €a is extremely important for correct Unicode handling. Finding valid words in a natural language that contain multi-byte characters in UTF-8 and have the same number of each byte could prove difficult.

Probably. In any case, I don't have the interest to find any. 🙂

If we change the instructions to remove the reference to ASCII and add in a note that the student should assume all candidate words are valid, then I am fine.

Add test case for anagram with different characters using the same unicode code points

23755bc

IsaacG approved these changes May 18, 2024

BNAndras approved these changes May 18, 2024

BethanyG self-requested a review May 18, 2024 16:39

BethanyG requested changes May 18, 2024
