
Concept & Specification #1

Closed · 3 of 4 tasks
julkue opened this issue Aug 12, 2016 · 67 comments

julkue commented Aug 12, 2016

The purpose of this repository is to collect diacritics with their associated ASCII characters in a structured form. It should be the central place for various projects when it comes to diacritics mapping.

As there is no single, trustworthy and complete source, all information needs to be collected by users manually.

Example mapping:

Schön => Schoen
Schoen => Schön

User Requirements

A user is someone using diacritics mapping information.

It should be possible to:

  1. Output diacritics mapping information in a CLI and web interface
  2. Output diacritics mapping information for various languages, e.g. a JavaScript array/object
  3. Fetch diacritics mapping information in builds
  4. Filter diacritics mapping information by:
    • Diacritic
    • Mapping value
    • Language
    • Continent
    • Alphabet (e.g. Latin)

Contributor Requirements

A contributor is someone providing diacritics mapping information.

We assume every contributor has a GitHub account and is familiar with Git.

Providing information should be:

  1. Easy to collect
  2. Possible without manual invitations
  3. Possible without registration (an exception is: "Register with GitHub")
  4. Done at one place
  5. Easy to check correctness of information structure
  6. Checkable before acceptance by another contributor familiar with the language
  7. Possible without a Git clone

System Specification

There are two possible ways to realize this:

  1. Create a JSON database in this GitHub repository, as this fits user and contributor requirements.

  2. Create a database in a third-party service that fits the user and contributor requirements.

    Tested:

    • Transifex: Doesn't fit requirements. It would allow providing mapping information, but not metadata.
    • Contentful: Doesn't fit requirements. It would require a manual invitation and registration.

Since we're not aware of any other third-party service that fits the user and contributor requirements, we'll proceed with the first option.

System Requirements

See the documentation and pull request.

Build & Distribution

Build

According to the contributor requirements it should be possible to compile source files without making a Git clone necessary. This means we can't require users to run e.g. $ grunt dist at the end, since that would require them to clone the repository, install dependencies and run the build themselves. Instead, we'll implement a build bot that runs our build on Travis CI and commits the changes directly to a dist branch in this repository. Once something is merged or committed, the dist branch will be updated automatically. Some people are already doing this to update their gh-pages branch whenever something changes in the master branch (e.g. this script).

Since we'll use a server-side component to filter and serve the actual mapping information, we just need to generate one diacritics.json file containing all data.

To make parsing easier, we'll also need a build step that minifies the files and encodes diacritics to Unicode escape sequences for production. This should be done using Grunt.
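A rough sketch of what such a Grunt task could look like (the task name, the file layout src/*/*.json and the output path are only assumptions, not the final build):

module.exports = function(grunt) {
    grunt.registerTask("dist", "Combine, encode and minify the database files", function() {
        var combined = {};
        grunt.file.expand("src/*/*.json").forEach(function(file) {
            // readJSON throws if a file contains invalid JSON
            combined[file.split("/")[1]] = grunt.file.readJSON(file);
        });
        // minify (no indentation) and encode non-ASCII characters as \uXXXX escapes
        var json = JSON.stringify(combined).replace(/[\u0080-\uffff]/g, function(chr) {
            return "\\u" + ("000" + chr.charCodeAt(0).toString(16)).slice(-4);
        });
        grunt.file.write("dist/diacritics.json", json);
    });
};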

Integrity

In order to ensure integrity and consistency we need the following in our build process:

  • A JSON validator that validates database files (must work with comments; see the sketch after this list)
  • A code style guideline, e.g. .jsbeautify
  • A linter for JSON files that makes sure the database is formatted according to the code style
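A naive sketch of the first point – stripping comments before parsing (note: this would also mangle "//" inside string values, e.g. URLs, so a real implementation needs a proper comment stripper):

var fs = require("fs");

function validateDatabaseFile(file) {
    var content = fs.readFileSync(file, "utf8")
        .replace(/\/\*[\s\S]*?\*\//g, "") // strip /* block */ comments
        .replace(/\/\/.*$/gm, "");        // strip // line comments (naive)
    JSON.parse(content); // throws with a descriptive error if the JSON is invalid
}

validateDatabaseFile("src/de/de.json"); // hypothetical path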

Distribution

To provide diacritics mapping according to the User Requirements, it's necessary to run a custom server-side component that makes it possible to sort, limit and filter the information and output it in different ways (e.g. as a JS object or array). This component should be built with Node.js, as it's well suited to handling JS/JSON files, whereas PHP would require a lot more serializing/deserializing.
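A minimal sketch of such a component, assuming Express and a pre-built dist/diacritics.json (the route and parameter names are only illustrative):

var express = require("express");
var diacritics = require("./dist/diacritics.json");

var app = express();

// e.g. GET /?language=de returns only the matching language entry
app.get("/", function(req, res) {
    var language = (req.query.language || "").toLowerCase();
    if (!language) {
        return res.json(diacritics); // no filter: return everything
    }
    if (!diacritics[language]) {
        return res.status(404).json({error: "language not found"});
    }
    res.json(diacritics[language]);
});

app.listen(8080);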

Next Steps

  • Finalize system requirements
  • Create a spec .md file that specifies the entire database structure in detail
  • Implement the basics according to the system requirements
  • Once the basics exist, start collecting repositories that use diacritics and invite owners and stargazers to share their country-specific mapping information. It's in their interest to drive development forward.

This comment is updated continuously during the discussion

julkue commented Aug 29, 2016

@Mottie Do you know any other third-party services worth mentioning?
Do you agree with the specified requirements or do you have any other ideas or concerns to share?

Mottie commented Aug 29, 2016

I don't know of any other third-party services, but I'm sure there are more. I'll keep an eye out.

I like what you have so far. I do have a few points I wanted to add:

Also, I would love to hear what ideas @mathiasbynens might have on this subject.

julkue commented Aug 30, 2016

Thanks for sharing your ideas, @Mottie.

The file names should also include the territory, just as the CLDR database is set up. For example, the German language should include all of these files [...]

Why do you think this is necessary? In the case of German there aren't any differences between e.g. the Austrian or Swiss variants.

Normalization of the code is still important as a diacritic can be represented in more than one way

What would be your solution approach here?

Collation rules might help with diacritic mapping for each language

How would you integrate these rules into the creation process of diacritics mapping?

Btw: As a collaborator you're allowed to update the specifications too.

Mottie commented Aug 30, 2016

Why do you think this is necessary? In case of German there aren't any differences between e.g. Austria or Switzerland dialects.

It's more of a "just-in-case there are differences" matter. I'm not saying duplicate the file for all territories.

What would be your solution approach here?

Well, I don't think we'd need to do the normalization ourselves, but we would need to cover all the bases... maybe? If I use the example from this page for the Latin capital letter A with a ring above, the data section would need to look something like this:

"data":{
  // Latin capital letter A with ring above (U+00C5)
  "Å":{
    "mapping":"A",
    "equivalents" : [
      "Å", // Angstrom sign (U+212B)
      "Å", // A (U+0041) + Combining Ring above (U+030A)

      // maybe include the key that wraps this object as well?
      "Å" // Latin capital letter A with ring above (U+00C5)
    ]
  }
}

Btw: As a collaborator you're allowed to update the specifications too.

I know I am, but we're still discussing the layout 😉

julkue commented Aug 31, 2016

It's more of a "just-in-case there are differences" matter. I'm not saying duplicate the file for all territories.

That makes sense. Adding additional language variants would be optional. I've added this to the SysReq.

Well I don't think we'd need to do the normalization ourselves, but we would need to cover all the bases

Good catch! I didn't know what you meant here at first glance – because I'm not familiar with any language that has such equivalents. Added this to the SysReq too.

Mottie commented Aug 31, 2016

I've updated the spec example. A combining diaeresis can be used in combination with just about any letter. It's not based on the language; it's a Unicode mechanism for creating characters that visually appear to be the same character.
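A quick JavaScript illustration of what that means (just an aside, using String.prototype.normalize):

var combined = "u\u0308";   // u (U+0075) + Combining diaeresis (U+0308)
var precomposed = "\u00fc"; // Latin small letter u with diaeresis (U+00FC)

console.log(combined === precomposed);                  // false – different code point sequences
console.log(combined.normalize("NFC") === precomposed); // true – canonically equivalent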

There is another useful site I found, but it's not working for me at the moment - http://shapecatcher.com/

@andrewplummer

Hello,

The Sugar diacritics were based purely on my own research with a healthy dose of consulting with my European friends/followers. The goal was simply to provide an "80% good" sort order for most languages compared to the default Unicode code point sort in raw JavaScript. It's not intended to be anything close to a complete collation algorithm, let alone an authority on diacritic mappings.

I definitely agree that there is a need for this. For my part, I would probably not add it as a dependency to Sugar but instead make sure that a project that adds it could quickly set up the sorting to work with it.

Thanks for working on this!

julkue commented Sep 1, 2016

@Mottie

A combining diaeresis can be used in combination with just about any letter. It's not based on the language, it's a unicode method to create characters that visibly appear as the same character.

Thanks, just learned something new. I think adding those equivalents will be up to the authors most of the time, as users probably don't know much about them.

I've invited the author of shapecatcher to participate in this discussion.

@andrewplummer
Thank you for participating! 👍

I would probably not add it as a dependency to Sugar but instead make sure that a project that adds it could quickly set up the sorting to work with it.

I see two ways of distribution:

  • Offering as dependency, e.g. via Bower. This would allow developers to use diacritics mapping within their applications very easily.
  • Offering as build integration. This would allow developers of libraries like you (and me) to use the mapping inside their builds, e.g. by replacing a placeholder like <% diacritics %>. You could also offer this as an additional add-in that integrates into your library – by creating a separate file that would overwrite the method that maps diacritics. In this case you don't need to have any production dependency, just one build helper to integrate diacritics mapping.

Would the latter be something you'd be interested in using? I'm asking because it's important to know whether that's a way library authors could imagine integrating this. If not, what would be your preferred way?
@Mottie What do you think about this distribution?

Mottie commented Sep 1, 2016

Thanks @andrewplummer! you 🚀!

@julmot

  • Yes, distribution by bower and npm are pretty much given. As long as we provide optimized data (e.g. based on the diacritic, language, etc.) I think we'll be fine. I'm sure the users will let us know if we need to add more.

  • I'm not sure that "continent" is needed in the data, or what should be done if the language isn't associated with one, e.g. Esperanto. Would "null" be appropriate?

  • I think adding "native" (or equivalent) to the metadata would also be beneficial

    "metadata": {
      "alphabet": "Latn",
      "continent": "EU",
      "language": "German",
      "native": "Deutsch"
    }
    

    mostly for selfish reasons as it is easier to search for "Deutsch" than it is to type out "German language" 😁

  • Mapping should be provided with the character with the accent removed and decomposed. If you look at this section on dealing with umlauts, you'll see that there are three ways to deal with them.

  • While attempting to construct a few files, I found that it was difficult to determine if an equivalent was a single unicode character or a combination. I know you like to see the actual character, but maybe for the equivalents it would be best to use the unicode value. I'm just not sure how to make and edit equivalents easier.

  • So this is what I have so far:

    {
      "metadata": {
        "alphabet": "Latn",
        "continent": "EU",
        "language": "German",
        "native": "Deutsch"
        // there could be more
      },
      // Sources:
      // diacritic list: https://en.wikipedia.org/wiki/German_orthography#Special_characters
      // mapping: https://en.wikipedia.org/wiki/German_orthography#Sorting
      "data": {
        "ü": {
          "mapping": {
            "base": "u",
            "decompose": "ue"
          },
          "equivalents": [
            "u\u0308", // u + Combining diaeresis (U+0308)
            "\u00fc"   // Latin small letter u with diaeresis (U+00FC)
          ]
        },
        "ö": {
          "mapping": {
            "base": "o",
            "decompose": "oe"
          },
          "equivalents": [
            "o\u0308", // o + Combining diaeresis (U+0308)
            "\u04e7",  // Cyrillic small letter o with diaeresis (U+04E7)
            "\u00f6"   // Latin small letter o with diaeresis (U+00F6)
          ]
        },
        "ä": {
          "mapping": {
            "base": "a",
            "decompose": "ae"
          },
          "equivalents": [
            "a\u0308", // a + Combining diaeresis (U+0308)
            "\u04d3", // Cyrillic small letter a with diaeresis (U+04D3)
            "\u00e4"  // Latin small letter a with diaeresis (U+00E4)
          ]
        },
        "ß": {
          "mapping": {
            "base": "\u00df",  // unchanged
            "decompose": "ss"
          }
        }
      }
    }

Mottie commented Sep 2, 2016

Next question. To allow adding comments to the source files, would you prefer to make them:

  • Plain .js files (using grunt to convert them to JSON),
  • YAML (using grunt-yaml to convert it to JSON),
  • Hjson (using grunt-hjson to convert it to JSON),
  • or something else?

julkue commented Sep 2, 2016

@Mottie

Yes, distribution by bower and npm are pretty much given. As long as we provide optimized data (e.g. based on the diacritic, language, etc.) I think we'll be fine. I'm sure the users will let us know if we need to add more.

When I was talking about distribution using Bower I didn't mean distributing the actual data. I meant a build helper that then fetches the data from this repository. This way we can have a specific version for our build helper but our users will always get the latest diacritics mapping information.
I see a few ways here:

  • We're going to distribute the actual data like you've mentioned into e.g. a dist folder. This will cause many variants based on the filter criteria in the User Requirements. We could reduce these criteria, but I'm quite sure that users will request them in the future. The build helper could then simply load one existing file from the GitHub server.
  • We're going to build a server side service that fetches the data from this repository and provides them in the requested format. The build helper could then simply send a request and get the result as one file.

While I personally don't like to create a server-side component, I also see that there would be many file variants. If we opt for the former, we'd need to specify a good dist structure to make things easy to find.

What do you think?

I'm not sure that "continent" is needed in the data, or what should be done if the language isn't associated with one, e.g. Esperanto. Would "null" be appropriate?

No, it's not needed, but it would be nice to have. Imagine a customer distributing an application to a continent, e.g. Europe. Then it wouldn't be necessary to select every European country manually just to include all of its mapping information.
In case a country is associated with multiple continents, like Russia, we'd need to specify them inside an array.
I don't know any accepted language that isn't associated with a country. Esperanto seems like an idea of hippies; I'd vote for just ignoring it, as there'll probably be no significant demand. But if we include it, I'd just add every continent inside an array, as it's globally available.

I think adding "native" (or equivalent) to the metadata would also be beneficial

Great idea. It would then be possible to select country-specific diacritic mapping information by native language spellings. But it would be another variant to consider in the distribution (see above).

Mapping should be provided with the character with the accent removed and decomposed. If you look at this section on dealing with umlauts, you'll see that there are three ways to deal with them.

Related to this article, I agree with you and I'd vote for using it like you did, having a base and decompose property when available, otherwise a simple string.

While attempting to construct a few files, I found that it was difficult to determine if an equivalent was a single unicode character or a combination. I know you like to see the actual character, but maybe for the equivalents it would be best to use the unicode value. I'm just not sure how to make and edit equivalents easier.

I agree with you. It's also hard to review when there is no visual difference. Would you mind updating the system requirements with this information?

Another open question about equivalents for me is who will collect them? We can't expect users to do this, so how do we integrate it into the workflow? When a user submits a pull request containing a new language, we'd need to merge it and then add the equivalents in the master branch.

Next question. To allow adding comments to the source files, would you prefer to make them

I'd prefer using strict JSON in .js files, to allow code formatting (it won't work with atom-beautify otherwise) and to avoid text editors treating comments as errors. We'd need to integrate a JSON validator in the build. We'd also need to integrate a component that makes sure all database files are correctly formatted (according to a code style). And finally, we'd need to create a few code style files beforehand (e.g. .jsbeautify).

Mottie commented Sep 2, 2016

I meant a build helper that then fetches the data from this repository.

That sounds like a good idea, there likely will be many variants. But sadly, my knowledge of pretty much all things server-side is severely lacking, so I won't be of much help there.

Esperanto seems like an idea of hippies, I'd vote for just ignoring it as there'll probably be no significant demand.

LOL, that's fine

Another open question about equivalents for me is who will collect them?

That's when shapecatcher and FileFormat.info become useful! I can help work on the initial data collection. And since visual equivalents won't change for a given character, a separate folder that contains the equivalents data would be much easier to maintain. We can then reference these files in the build process.

src/
├── de/
│   ├── de.js
├── equivalents/
│   ├── ü.js
│   ├── ö.js
│   ├── ä.js

I'm not sure if using the actual character would fare well with accessing the file, so maybe using the unicode value would be better (e.g. u-00fc.js instead of ü.js)?

Inside of the file:

/**
 * Visual equivalents of ü
 */
[
    "u\u0308", // u + Combining diaeresis (U+0308)
    "\u00fc"   // Latin small letter u with diaeresis (U+00FC)
]

If, for some unique reason, a character has a different equivalent, we could define it in the language file and then concatenate the equivalents values? Or not concatenate at all, depending on what we discover. Actually, now that I think about it, I remember reading somewhere that some fonts provide custom characters in the Unicode private use areas, but let's not worry about that unless it comes up.

We'd need to integrate a JSON validator in the build.

The Grunt file uses grunt.file.readJSON and within the process of building the files we'll end up using JSON.parse and JSON.stringify which will throw errors if the JSON isn't valid. I think it would be difficult to validate the JSON before the comments are stripped out.

As for beautifying the JSON, JSON.stringify would do that:

JSON.stringify({a:"b"}, null, 4);
/* result:
{
    "a": "b"
}
*/

julkue commented Sep 2, 2016

That sounds like a good idea, there likely will be many variants. But sadly, my knowledge of pretty much all things server-side is severely lacking, so I won't be of much help there.

I'm quite sure you can help here, you just don't know yet 😆 If we decide to implement a server-side component then we'll set it up using Node.js as we're handling only JS/JSON files and using it makes it a lot easier than e.g. PHP. While you might not be familiar with it in detail, if I set up the basics you'll probably understand it quickly.
Anyway, to reach a conclusion at this point, I think we need to implement a server-side component. Otherwise many variants will be necessary and it might be confusing to have that many files in a dist folder.

And since visual equivalents won't change for a given character, a separate folder that contains the equivalents data would be much easier to maintain.

Sorry, I didn't understand the benefit of this when we're going to collect them using the unicode number. Could you help me understand the benefit by explaining it a little more?

JSON.parse and JSON.stringify which will throw errors if the JSON isn't valid.

That would be enough.

As for beautifying the JSON, JSON.stringify would do that

I didn't mean to beautify them in the build, I meant to implement a build integration that checks if they are correctly formatted inside the src folder. Beautifying won't be necessary for the output.

What do you think of my question in your PR?

Wouldn't it make sense to provide the sources in the metadata object instead of comments? If they were entered by users manually without providing sources, we could fill in "provided by users" or something similar.

Mottie commented Sep 2, 2016

Could you help me understand the benefit by explaining it a little more?

Well when it comes to normalization, there are a limited number of visual equivalents for each given character. When we list the equivalents for a ü, we'll be repeating the same values in multiple languages. I was proposing centralizing these values in one place, then adding them to the language during a build, but only if the "equivalents" value is undefined in the language file and there is an existing equivalents file for the character.

Example language file:

"data": {
    "ü": {
        "mapping": {
            "base": "u",
            "decompose": "ue"
        }
        // no equivalents added here
    },
    ...

equivalents file

/**
 * Visual equivalents of ü
 */
[
    "u\u0308", // u + Combining diaeresis (U+0308)
    "\u00fc"   // Latin small letter u with diaeresis (U+00FC)
]

Resulting file:

"data": {
    "ü": {
        "mapping": {
            "base": "u",
            "decompose": "ue"
        },
        "equivalents": [
            "u\u0308",
            "\u00fc"
        ]
    },
    ...

I hope that explains my idea better.
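To make it concrete, the build-time merge could look roughly like this (a sketch only; the helper names are made up and the equivalents files are assumed to be plain JSON here):

var fs = require("fs");

// e.g. "ü" -> "u-00fc", matching the proposed equivalents file naming
function unicodeFileName(chr) {
    return "u-" + ("000" + chr.charCodeAt(0).toString(16)).slice(-4);
}

function addEquivalents(language) {
    Object.keys(language.data).forEach(function(chr) {
        var entry = language.data[chr];
        var file = "src/equivalents/" + unicodeFileName(chr) + ".json";
        // only fill in equivalents the language file doesn't define itself
        if (!entry.equivalents && fs.existsSync(file)) {
            entry.equivalents = JSON.parse(fs.readFileSync(file, "utf8"));
        }
    });
    return language;
}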

provide the sources in the metadata object instead of comments?

Yes, that is a better idea. I guess I missed that question in the PR. I'll update the spec.

julkue commented Sep 2, 2016

@Mottie I understand this. But what I still don't understand is the benefit of saving them in a separate file.

I was proposing centralizing these values in one place

Saving them in the "equivalents" property would be one central place too?

Mottie commented Sep 2, 2016

Saving them in the "equivalents" property would be one central place too?

Yes, that would work too. Where would that value be placed within the file?

julkue commented Sep 2, 2016

Like you've specified, in the equivalents property.

        "ü": {
            "mapping": {
                "base": "u",
                "decompose": "ue"
            },
            "equivalents": [
                "u\u0308", // u + Combining diaeresis (U+0308)
                "\u00fc"   // Latin small letter u with diaeresis (U+00FC)
            ]
        }

I'm quite sure we're misunderstanding each other at some point, but I'm not sure where.

Mottie commented Sep 2, 2016

What I'm saying is, for example, if you look at this list of alphabets, under "Latin", you'll see there are a lot of languages that use the á diacritic. Instead of maintaining a list of visual equivalents for that one diacritic within each language file, we centralize it in one place, but add it to each file during the build process.

Mottie commented Sep 2, 2016

Vietnamese is going to be fun... there are a lot of diacritics with multiple combining marks which may be added in different orders.

ẫ = a + ̃  + ̂  OR a + ̂  + ̃ 

Which means the equivalents array would need to include each order combination.

"ẫ" : [
    "a\u0303\u0302", // a + ̃  + ̂ 
    "a\u0302\u0303", // a + ̂  + ̃ 
    "\u1eab"         // ẫ
]

Mottie commented Sep 2, 2016

Did that clarify things? And what are we going to do about uppercase characters?

julkue commented Sep 3, 2016

@Mottie

Did that clarify things?

Yes, thanks! I think I understand you now. You mean to extract them into separate files, to avoid redundant information in the mapping files.

Seems like a good idea. Let's talk about the filenames. Naming them after the diacritic itself may cause issues on some operating systems, but naming them after the unicode number will make them hard to find quickly. Maybe we could map them by giving them a unique ID? Or do you see any alternatives?
Example:

"data": {
    "ü": {
        "mapping": {
            "base": "u",
            "decompose": "ue"
        },
        "equivalents": 1 // 1 would be a filename: ../equivalents/1.js
    },
    ...

Not the most beautiful variant though.

And what are we going to do about uppercase characters?

I've replied to this question here

There may be diacritics that are only available as upper case characters. To play it safe I'd also include upper case diacritics. Don't you think so?

Mottie commented Sep 3, 2016

Maybe we could map them by giving them a unique ID? Or do you see any alternatives?

Now that I've counted how many diacritics are listed just in the "Latin" table (246), I think it might be a better idea to group them a little LOL. I thought about grouping by the "base" letter (with a fallback to the "decompose" value) so there could be around 26 files (not counting characters that need to be encoded), but we haven't even considered languages like Arabic, Chinese and Japanese of which I have no clue how to begin. Should we even worry about non-Latin based languages at this stage?

If the "base" value was a character that needed encoding (e.g. ß, then I think the unicode value would be the best ID for the file. Something like u-00df.js?.

upper case characters

Including both upper and lower case would be the best idea then.

julkue commented Sep 3, 2016

I'll come back to this tomorrow with a clear head. Good night 🌙

julkue commented Sep 4, 2016

thought about grouping by the "base" letter (with a fallback to the "decompose" value) so there could be around 26 files (not counting characters that need to be encoded)

Could you update the spec with this?

but we haven't even considered languages like Arabic, Chinese and Japanese

Absolutely right. Before we start implementing the database, we should have a layout that works in all languages.
I've tried to find out if there are any cases that wouldn't work with our current schema, but wasn't successful. We'll need someone familiar with these languages...

I'd like to ask @gromo if you can help us out. We'd like to know if the Arabic alphabet contains diacritics like the Latin alphabet does, and if they can be mapped to a "base" character (e.g. "u" when the diacritic is "ü")? Hopefully you're familiar with this alphabet as someone living in Uzbekistan. I'd appreciate your answer!

Mottie commented Sep 4, 2016

Could you update the spec with this?

Done. I've updated the spec (and PR). Let me know what you think.

Also, I think ligatures (æ decomposes to ae) need to be mentioned in the spec since they aren't "officially" named diacritics.


gromo commented Sep 4, 2016

@julmot The Uzbek language uses the Cyrillic / Latin alphabet, so I cannot help you with this.

Mottie commented Sep 4, 2016

@gromo we could still use some feedback 😁

@julmot I forgot to ask, does ß only map to SS (uppercase)? If I use the following JavaScript, it gives interesting results:

console.log('ß'.toLowerCase(), 'ß'.toUpperCase());
// result: ß SS

Mottie commented Nov 13, 2016

I found this valuable resource! http://www.unicode.org/reports/tr30/datafiles/DiacriticFolding.txt (see the "precomposed" section halfway down). The main issue is that it was never approved and was deprecated (ref). So, even though it is directly related to our database, would it still be a good idea to use it?

Secondly, I saw this post in the Elasticsearch blog about ASCII (diacritic) folding... they use the ASCII Folding Filter from the Lucene Java library. I'm not sure where to find that Java code, but I suspect they are doing the same thing as in the DiacriticFolding.txt file... I will search for the resource later (after some sleep).

Update: https://github.com/GVdP-Travix/ASCIIFoldingFilter/blob/master/ASCIIFoldingFilter.js

julkue commented Nov 13, 2016

@Mottie Thanks for this idea.

First off, the deprecated DiacriticFolding won't be something we can use, as we need to guarantee correctness.

I've had a look at the Elasticsearch site you're referring to but wasn't able to find the original "ASCIIFolding" project (mapping database). So I've only had a look at the JS mapping you've provided.

From my point of view this would only be a solution for the base property, as the decomposed value isn't covered. For example, I've searched for "ü" and only found a mapping to "u". On the other hand, "ß" is mapped to "ss", which is contradictory.

Therefore I have the following questions:

  1. Is this a trustworthy source?
  2. Is the data covering all necessary base mappings? (they specify covered Unicode blocks)
  3. Is the data covering the correct base mappings? (for example we've defined ß as the mapping for ß, they are defining ss)

Mottie commented Nov 13, 2016

I don't know the specifics, but the DiacriticFolding was created by a member of the Unicode group. So that may not guarantee correctness, but it might be about as close as we can get right now.

And yeah, I agree that the "ASCIIFolding" should only be used for the base mapping.

Is this a trustworthy source?

I think so. Elasticsearch is an open source RESTful search engine with almost 20k stars. Users are relying on the ASCII folding to perform searches.

Is the data covering all necessary base mappings?

It looks like they are mapping only by unicode blocks and not by language. But in their case, the ASCII folding doesn't differ; it looks like they are basically stripping away any diacritic, which is essentially what the DiacriticFolding file appears to do.
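In JavaScript terms, that kind of blanket stripping is roughly the following (a common approach, not necessarily what the Lucene filter does internally):

var stripped = "Schön".normalize("NFD").replace(/[\u0300-\u036f]/g, "");
console.log(stripped); // "Schon" – note that "ß" stays untouched, as it isn't a combining mark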

Is the data covering the correct base mappings?

I'm not sure how to answer this question. ß isn't really a diacritic, so stripping away any diacritics from the character doesn't apply; I think that's why we chose to leave it unchanged. I guess the real question is: how should we define the base mapping of a character? Should it be the character without any diacritic(s), or, as in the case of the ASCIIFolding, should it convert the character to accommodate searches from U.S. keyboards?

julkue commented Nov 14, 2016

So that may not guarantee correctness, but it might be about as close as we can get right now.

As "ASCIIFolding" seems to contain the same information, I think we should focus on that?

I think so. Elasticsearch is an open source RESTful search engine with almost 20k stars. Users are relying on the ASCII folding to perform searches.

I know Elasticsearch, but as I couldn't find the database they're using, I assume they're getting it from a third party too? In that case, we don't need to find out if Elasticsearch is trustworthy, but whether the third party ("ASCIIFoldingFilter.js") is. We also need to make sure that we can use their database from a legal point of view.

Should it be the character without any diacritic(s), or as in the case of the ASCIIFolding, should it convert the character to accommodate searches from U.S. keyboards?

We can make this decision easy: If we're going to use their database, we need to use what they're providing.

Mottie commented Nov 14, 2016

It looks like they use an apache license (http://www.apache.org/licenses/LICENSE-2.0)

Source: http://svn.apache.org/repos/asf/lucene/java/tags/lucene_solr_4_5_1/lucene/analysis/common/src/java/org/apache/lucene/analysis/miscellaneous/ASCIIFoldingFilter.java

What I'm actually saying is that I think this code is essentially implementing the "DiacriticFolding" data from the Unicode organization; but I can't say for sure until I actually compare the two.


Interestingly, this port to Go (https://github.com/hkulekci/ascii-folding) has an MIT license.

julkue commented Nov 14, 2016

It looks like they use an apache license

I'm not a lawyer, but according to the license it allows usage with a copyright notice. However, in users' end products there can't be a notice, since it'll just be e.g. a regex (no logic). I'm not sure whether providing their copyright notice in our database is enough. To guarantee we're allowed to use it we need to contact them.

Thanks for providing the Java file, that helped.

What I'm actually saying is that I think this code is essentially implementing the "DiacriticFolding" data from the Unicode organization; but I can't say for sure until I actually compare the two.

Interesting. We should investigate this and find out if we can use the database. If so, I'd agree to use it to automatically generate the base property. But we especially need to document what happens with characters like ß that aren't diacritics. The special thing about ß is that when writing something in uppercase it's replaced by SS, otherwise ss.

Interestingly, this port to Go (https://github.com/hkulekci/ascii-folding) has an MIT license.

@hkulekci Can we assume that this is a mistake?


hkulekci commented Nov 14, 2016

@julmot Yeah, you can. I'm not good at licensing. I was only trying to provide an example in Go. :) I guess, in this case, I must choose the Apache license. If you know better, please tell me which license I should choose.

julkue commented Nov 14, 2016

@hkulekci No, sorry, I don't know either. But since this project is released under the MIT license and we'd like to use this database, this is in our interest too.

@Mottie If you have time, could you please find one of the owners of the provided Java library and contact them regarding the usage (and cc me please)? There's another question they can probably answer. I've been asking myself: is the mapping e.g. ü => u common in all languages except German, where it could also be mapped to ue? I mean, if German is the only language that needs the decompose property, and all other languages just have a base, then Houston, we have a problem. The entire database would be pointless, as everything is already covered in the ASCIIFolding project.

Mottie commented Nov 14, 2016

Sorry, I've had a busy day; I'm just now checking GitHub.

I do know there is at least one additional language that needs diacritics decomposed... I found this issue (OpenRefine/OpenRefine#650 (comment)) about the Norwegian language:

  • 'æ' is replaced with 'ae'
  • 'ø' is replaced with 'oe'
  • 'å' is replaced with 'aa'

julkue commented Nov 15, 2016

@Mottie Thanks for this information. We definitely need to investigate this for more languages.

I've thought about this again and came to the conclusion that even if the decompose property is unnecessary in almost every language (except e.g. German and Norwegian), the database still makes sense. Users can't use the ASCIIFolding class directly, as it's a Java class they can't easily integrate. Our project would make that possible. We're also providing metadata for all languages that allows users to filter the data according to their needs, plus processes to integrate it into their projects.

julkue commented Nov 23, 2016

First off, I haven't received an answer from the Lucene team regarding automatically generating the base property yet. Hopefully we'll have an answer soon.

Anyway, as soon as the API is merged, the next step is to implement a process that allows users to integrate the diacritics project. We have several kinds of projects:

  • JavaScript projects that are using a build
  • JavaScript projects that serve source files directly without a build (like e.g. @andrewplummer)
  • Projects with other languages (e.g. C, C++, C#, Java, ...)

I'd like to start discussing JavaScript projects. We need an npm module that replaces placeholders with diacritics data. This module will use the API to fetch live data. There should be two possible placeholder types:

  • Those which will replace the placeholder with an array or object containing the diacritics mapping information in a useful structure. This is helpful for those who want to implement custom iterators over this mapping information
  • Those which will replace the placeholder with a method like this one, where no manual iterator is necessary. The user can then use this method to generate a regex that they can use to compare diacritic strings or replace characters.

While a placeholder syntax like <% diacritics %> would make sense, it's probably not the best idea. Why? Because there might be projects using the source files in development, like mark.js, which tests with the source files and only runs unit tests against compiled files. With the syntax named above, an error would be thrown. To avoid this, we need a placeholder syntax that can simply be replaced but is also valid without the replacement. An example could be:

const x = [/* diacritics: /?language=DE */];

[/* diacritics: /?language=DE */] would be the placeholder. As the actual information is placed within a comment, this would be valid even without the replacement. diacritics would be the actual keyword here; everything following the : would be an optional filter URL that is passed to the API.
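As an illustration, the replacement step in the npm module could look something like this (a sketch; replacePlaceholders and fetchMapping are made-up names, and the regex only covers the syntax shown above):

// matches e.g. [/* diacritics */] or [/* diacritics: /?language=DE */]
var PLACEHOLDER = /\[\s*\/\*\s*diacritics(?::\s*([^*]+?))?\s*\*\/\s*\]/g;

function replacePlaceholders(source, fetchMapping) {
    // fetchMapping(filter) is expected to return the mapping data for the
    // optional filter URL (e.g. "/?language=DE") as a plain JS value
    return source.replace(PLACEHOLDER, function(match, filter) {
        var data = fetchMapping(filter ? filter.trim() : "");
        // the comment-wrapped placeholder stays valid JS until it's swapped for real data
        return JSON.stringify(data);
    });
}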

This is just an idea, not set in stone. I'm open for other ideas. Anyone?

Okay, projects that aren't using a build would need to create a module that overwrites a method in their project by using the npm module in a build. There won't be a way to use the diacritics project without this module (or without a build), as the data is fetched dynamically from the API.

@Mottie What do you think?

Mottie commented Nov 23, 2016

Doesn't the API also need to indicate the type & format of the output?

/?diacritic=%C3%BC&output=base,decompose&format=string
  • output would indicate which data entries to return
  • format should be either a string, array or object

I'm not yet clear on how we would get the API to only return the first equivalent, or a specific equivalent if there is one, along with that equivalent's specific data (e.g. unicode).


In the case of the mark.js repo, if you added a placeholder for say u using /?base=u, we'd need the API to return a string of all equivalents[0].raw to create the desired output of uùúûüůū.

julkue commented Nov 23, 2016

You're right, there should be some parameters to ignore some data, e.g. a value to ignore equivalents or just some of the equivalents (by name, e.g. unicode). Ignoring either base or decompose makes no sense in my opinion, as both are optional and both are mapping information. Some diacritics have a base and no decompose and vice versa.

In the case of the mark.js repo, if you added a placeholder for say u using /?base=u, we'd need the API to return a string of all equivalents[0].raw to create the desired output of uùúûüůū.

Yes, that format parameter wouldn't be part of the API in my opinion. This is something you'd specify in the placeholder, but it would be handled by the npm module.
In the case of mark.js that would be an entire array, not just limited to e.g. u.

julkue commented Nov 24, 2016

I've thought about this again, and making the format parameter part of the API has one benefit: it would allow access to these formats outside the npm module. This is especially helpful if we want to show the code on the website, whether the array or the entire method. Users could then just copy and paste the code into their applications – which would be another good solution for projects without a build. So I'm open to this option.

If we introduce an option to specify the output structure (non-JSON) then this shouldn't be a parameter in my opinion (e.g. ?output=js-array). All the things under the route / are currently generating JSON. So the cleanest thing would be to introduce a new route, e.g. /js-array/?language=DE, where js-array is the output structure and the parameters are just like for the / route.

@Mottie What do you think?

Mottie commented Dec 5, 2016

Sorry for not responding earlier!

Do you still think that this will be necessary? I think the npm module would be able to output any format you need. I am starting to think that adding a new route would probably complicate the use of the API.

As an aside, I have started to work out what the npm module will provide and I've gotten stuck at how to deal with characters that are not going to be included under any language... like what happens when someone tries to remove diacritics from a crazy string like Iлtèrnåtïonɑlíƶatï߀ԉ? So I think the solution would be to create an en entry in the database that covers all the non-language-specific diacritics. It's going to be huge.

julkue commented Dec 5, 2016

Do you still think that this will be necessary? I think the npm module would be able to output any format you need. I am starting to think that adding a new route would probably complicate the use of the API.

I think it has one big advantage: it would allow copy and paste directly from the website. I'm currently imagining how our (future) website could look. I imagine a website with a table full of diacritics, with their metadata and mapping information. You can filter and sort everything, and finally you can just click "get code", select a language and structure (e.g. JavaScript and object or array) and get the code. This could be done using an option of the API. If it were part of the npm module we would have to either create redundant code (npm module and website) or just not implement such a button.

Interesting point in your second paragraph. Why would you call it "en"? And how would you map all kinds of Unicode characters?

Mottie commented Dec 5, 2016

I think it has one big advantage: It would allow copy and paste directly from the website.

Ok, sounds good then!

Why would you call it "en"?

Well, English doesn't really include diacritics, and even when we do, we ignore them all the time. I didn't want to name it something like default and make an exception in the spec. So, the format will follow the spec like all the other languages.

One block of entries will be removing all combining diacritics... so the base would be an empty string:

    "data": {
        // combining diacritics -> convert to empty string
        "\u0301": { "mapping": { "base": "" } },
        "\u0300": { "mapping": { "base": "" }},
        "\u0306": { "mapping": { "base": "" }},
        "\u0302": { "mapping": { "base": "" }},
        "\u030c": { "mapping": { "base": "" }},
        ...
    }

Then we could include the decomposing of other symbols, like ① into (1)...

julkue commented Dec 5, 2016

@Mottie I think that including special characters that aren't diacritics makes sense (e.g. "①"), but we can't call them "diacritics".

I'd say we should decide whether to include them depending on the effort. Is there any existing database, like the one for the HTML entities? If so, then we can continue by creating a new file in the build folder and adding them to the generated diacritics.json. A new API option should also allow excluding them.
If there's no database and it's a lot of effort, I don't think we can continue with it, or at least not at the current time. In my opinion we should focus on creating the npm module, hopefully before the new year. If this takes too much time it may be better to discuss it later, when we have time. In that case I'd personally find it confusing to name it "en" if English doesn't contain diacritics. I think the cleanest solution would be to just create a single JSON file directly in the src folder.

  1. Is there an existing database?
  2. How much time will it take to create that mapping information?
  3. In case there's no existing database: What do you think of the naming?

julkue commented Dec 5, 2016

Btw.: Is the "Write" tab (textarea) also delayed for you while typing?

Mottie commented Dec 5, 2016

I'm not having any issues with the textarea.

I started with a bunch of characters and plugged them into node-unidecode, which stripped out the diacritics, and then added them to the data set... although some results ended up as [?]. The list I was working on is nowhere near complete.
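For reference, that check can be reproduced with the unidecode npm package (assuming that's the package meant by node-unidecode):

var unidecode = require("unidecode");
console.log(unidecode("Iлtèrnåtïonɑlíƶatï߀ԉ"));
// characters without a sensible ASCII mapping come back as "[?]", as noted above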

In the meantime, I'll put this part on hold and continue working on the npm module.

julkue commented Dec 5, 2016

is nowhere near complete

How do you know that? And where did you get the data from?

julkue commented Dec 9, 2016

Ping @Mottie. And what's the current status with the npm module spec?

Mottie commented Dec 9, 2016

Hey @julmot!

I'll clean up what I have in the works and post it in about 4 hours (after I go to the gym)... it is still incomplete, but it'll give you an idea of where things are now.

Mottie commented Dec 9, 2016

As I said, it's still a work-in-progress... https://github.com/diacritics/node-diacritics-transliterator/tree/initial

julkue commented Dec 9, 2016

@Mottie Would you mind submitting a PR? This would allow us to have a conversation about it directly in the repository.

julkue commented Aug 21, 2017

Finally, we're in the final phase and going live soon.
