Concept & Specification #1
Comments
@Mottie Do you know any other third-party services worth mentioning? |
I don't know of any other third-party services, but I'm sure there are more. I'll keep an eye out. I like what you have so far. I do have a few points I wanted to add:
Also, I would love to hear what ideas @mathiasbynens might have on this subject. |
Thanks for sharing your ideas, @Mottie.
Why do you think this is necessary? In the case of German, there aren't any differences between e.g. the Austrian and Swiss dialects.
What would be your solution approach here?
How would you integrate these rules into the creation process of diacritics mapping? Btw: As a collaborator you're allowed to update the specifications too. |
It's more of a "just-in-case there are differences" matter. I'm not saying duplicate the file for all territories.
Well, I don't think we'd need to do the normalization ourselves, but we would need to cover all the bases... maybe? If I use the example from this page for the Latin capital letter A with ring above, the data section would need to look something like this:

"data": {
// Latin capital letter A with ring above (U+00C5)
"Å":{
"mapping":"A",
"equivalents" : [
"Å", // Angstrom sign (U+212B)
"Å", // A (U+0041) + Combining Ring above (U+030A)
// maybe include the key that wraps this object as well?
"Å" // Latin capital letter A with ring above (U+00C5)
]
}
}
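As a quick sanity check, those three forms really are canonically equivalent; a minimal sketch using the standard String.prototype.normalize (no project code assumed):

const precomposed = "\u00C5"; // Latin capital letter A with ring above (U+00C5)
const angstrom = "\u212B"; // Angstrom sign (U+212B)
const combining = "A\u030A"; // A (U+0041) + Combining ring above (U+030A)

console.log(angstrom.normalize("NFC") === precomposed); // true
console.log(combining.normalize("NFC") === precomposed); // true
console.log(precomposed.normalize("NFD") === combining); // true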
I know I am, but we're still discussing the layout 😉 |
That makes sense. Adding additional language variants would be optional. I've added this to the SysReq.
Good catch! I didn't know what you meant here at first glance – because I'm not familiar with any language that has such equivalents. Added this to the SysReq too. |
I've updated the spec example. A combining diaeresis can be used in combination with just about any letter. It's not based on the language, it's a unicode method to create characters that visibly appear as the same character. There is another useful site I found, but it's not working for me at the moment - http://shapecatcher.com/ |
Hello, The Sugar diacritics were based purely on my own research with a healthy dose of consulting with my European friends/followers. The goal was simply to provide an "80% good" sort order for most languages compared to the default Unicode code point sort in raw Javascript. It's not intended to be anything close to a complete collation algorithm, let alone an authority on diacritic mappings. I definitely agree that there is a need for this. For my part, I would probably not add it as a dependency to Sugar but instead make sure that a project that adds it could quickly set up the sorting to work with it. Thanks for working on this! |
Thanks, just learned something new. I think adding those equivalents will be up to the authors most of the time, as users probably don't know much about them. I've invited the author of shapecatcher to participate in this discussion. @andrewplummer
I see two ways of distribution:
Would the latter be something you'd be interested in using? I'm asking because it's important to know whether that would be a way library authors could imagine integrating this. If not, what would be your preferred way? |
Thanks @andrewplummer! You 🚀!
|
Next question. To allow adding comments to the source files, would you prefer to make them:
|
When I was talking about distribution using Bower, I didn't mean to distribute the actual data. I meant a build helper that then fetches the data from this repository. This way we can have a specific version for our build helper, but our users will always get the latest diacritics mapping information.
While I personally don't like to create a server-side component, I also see that there would be many file variants. We'd need to specify a good

What do you think?
No, it's not needed, but it would be a nice-to-have. Imagine a customer is distributing an application to a continent, e.g. Europe. Then it wouldn't be necessary to select every European country manually just to include all the mapping information.
Great idea. It would then be possible to select country-specific diacritic mapping information by native language spellings. But it would be another variant to consider in the distribution (see above).
Related to this article, I agree with you and I'd vote for using it like you did, having a
I agree with you. It's also hard to review when there is no visual difference. Would you mind updating the system requirements with this information? Another open question about equivalents for me is who will collect them. We can't expect users to do this, and in that case, how do we integrate it into the workflow? When a user submits a pull request containing a new language, we'd need to merge it and then add the equivalents in the master branch.
I'd prefer using strict JSON as |
That sounds like a good idea, there likely will be many variants. But sadly, my knowledge of pretty much all things server-side is severely lacking, so I won't be of much help there.
That's when shapecatcher and FileFormat.info become useful! I can help work on the initial data collection. And since visual equivalents won't change for a given character, a separate folder that contains the equivalents data would be much easier to maintain. We can then reference these files in the build process.
I'm not sure if using the actual character would fare well with accessing the file, so maybe using the Unicode value would be better. Inside the file:

/**
* Visual equivalents of ü
*/
[
"u\u0308", // u + Combining diaeresis (U+0308)
"\u00fc" // Latin small letter u with diaeresis (U+00FC)
]

If, for some unique reason, a character has a different equivalent, we could define it in the language file and then concatenate the equivalents values? Or not concatenate at all, depending on what we discover. Actually, now that I think about it, I remember reading somewhere that some fonts provide custom characters in the Unicode private use areas, but let's not worry about that unless it comes up.
The Grunt file uses

As for beautifying the JSON:

JSON.stringify({a:"b"}, null, 4);
/* result:
{
"a": "b"
}
*/ |
I'm quite sure you can help here, you just don't know it yet 😆 If we decide to implement a server-side component then we'll set it up using Node.js, as we're handling only JS/JSON files and using it makes things a lot easier than e.g. PHP. While you might not be familiar with it in detail, if I set up the basics you'll probably understand it quickly.
Sorry, I didn't understand the benefit of this when we're going to collect them using the Unicode number. Could you help me understand the benefit by explaining it a little more?
That would be enough.
I didn't mean to beautify them in the build; I meant to implement a build integration that checks if they are correctly formatted inside the

What do you think of my question in your PR?
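For illustration, such a formatting check could re-serialize each source file and compare it with the original; a minimal Node.js sketch (the file path and 4-space indent are assumptions, nothing is decided yet):

const fs = require("fs");

// Throws if a source file isn't strict JSON or deviates from the
// agreed formatting (a 4-space indent is assumed here).
function checkFormatting(file) {
    const raw = fs.readFileSync(file, "utf8");
    const parsed = JSON.parse(raw); // throws on invalid or non-strict JSON
    const expected = JSON.stringify(parsed, null, 4) + "\n";
    if (raw !== expected) {
        throw new Error(file + " is not formatted as expected");
    }
}

checkFormatting("src/de.json"); // hypothetical source file path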
|
Well, when it comes to normalization, there are a limited number of visual equivalents for any given character. When we list the equivalents for a

Example language file:

"data": {
"ü": {
"mapping": {
"base": "u",
"decompose": "ue"
}
// no equivalents added here
},
...

Equivalents file:

/**
* Visual equivalents of ü
*/
[
"u\u0308", // u + Combining diaeresis (U+0308)
"\u00fc" // Latin small letter u with diaeresis (U+00FC)
]

Resulting file:

"data": {
"ü": {
"mapping": {
"base": "u",
"decompose": "ue"
},
"equivalents": [
"u\u0308",
"\u00fc"
]
},
...

I hope that explains my idea better.
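A rough Node.js sketch of that build step (the file layout and names are assumptions for illustration, and the equivalents files are assumed to be strict JSON without the comment header shown above):

const fs = require("fs");

// Merge shared equivalents files into a language file at build time.
function mergeEquivalents(languageFile, equivalentsDir) {
    const language = JSON.parse(fs.readFileSync(languageFile, "utf8"));
    for (const char of Object.keys(language.data)) {
        // the naming scheme for equivalents files is still an open question
        const file = equivalentsDir + "/" + char + ".json";
        if (fs.existsSync(file)) {
            language.data[char].equivalents =
                JSON.parse(fs.readFileSync(file, "utf8"));
        }
    }
    return language;
}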
Yes, that is a better idea. I guess I missed that question in the PR. I'll update the spec. |
@Mottie I understand this. But what I still don't understand is the benefit of saving them in a separate file
Saving them in the "equivalents" property would be one central place too? |
Yes, that would work too. Where would that value be placed within the file? |
Like you've specified, in the

"ü": {
"mapping": {
"base": "u",
"decompose": "ue"
},
"equivalents": [
"u\u0308", // u + Combining diaeresis (U+0308)
"\u00fc" // Latin small letter u with diaeresis (U+00FC)
]
}

I'm quite sure we're misunderstanding each other at some point, but I'm not sure where. |
What I'm saying is, for example, if you look at this list of alphabets, under "Latin", you'll see there are a lot of languages that use the |
Vietnamese is going to be fun... there are a lot of diacritics with multiple combining marks which may be added in different orders.
Which means the equivalents array would need to include each order combination:

"ẫ": [
"a\u0303\u0302", // a + ̃ + ̂
"a\u0302\u0303", // a + ̂ + ̃
"\u1eab" // ẫ
] |
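If we generate those combinations programmatically instead of typing them out, a small helper could permute the combining marks and add the precomposed (NFC) form of each ordering; a sketch in plain JavaScript (the function name and shape are illustrative):

// All orderings of the combining marks, plus the NFC form of each
// ordering (which yields the precomposed character when one exists).
function visualEquivalents(base, marks) {
    const results = new Set();
    const permute = (prefix, rest) => {
        if (!rest.length) {
            const seq = base + prefix;
            results.add(seq);
            results.add(seq.normalize("NFC"));
            return;
        }
        rest.forEach((mark, i) =>
            permute(prefix + mark, rest.filter((_, j) => j !== i))
        );
    };
    permute("", marks);
    return [...results];
}

visualEquivalents("a", ["\u0303", "\u0302"]);
// → includes "a\u0303\u0302", "a\u0302\u0303" and "\u1eab"; note it also
// surfaces partially composed forms such as "\u00e3\u0302"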
Did that clarify things? And what are we going to do about uppercase characters? |
Yes, thanks! I think I understand you now. You meant to extract them into separate files to avoid redundant information in the mapping files. Seems like a good idea. Let's talk about the filenames. Naming them after the diacritic itself may cause issues on some operating systems, but naming them after the Unicode number would make it impossible to find them quickly. Maybe we could map them by giving them a unique ID? Or do you see any alternatives?

"data": {
"ü": {
"mapping": {
"base": "u",
"decompose": "ue"
},
"equivalents": 1 // 1 would be a filename: ../equivalents/1.js
},
...

Not the most beautiful variant, though.
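As an alternative to a numeric ID, the filename could be derived from the character's code point, which is unique, reversible and file-system safe; a small sketch (the naming scheme is just a suggestion):

// "ü" → "u00fc.json"; the name can be computed from the character
// and turned back into the character, so no ID lookup table is needed.
function equivalentsFilename(char) {
    return "u" + char.codePointAt(0).toString(16).padStart(4, "0") + ".json";
}

equivalentsFilename("ü"); // "u00fc.json"
String.fromCodePoint(0x00fc); // "ü" – the reverse direction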
I've replied to this question here
|
Now that I've counted how many diacritics are listed just in the "Latin" table (246), I think it might be a better idea to group them a little LOL. I thought about grouping by the "base" letter (with a fallback to the "decompose" value), so there could be around 26 files (not counting characters that need to be encoded), but we haven't even considered languages like Arabic, Chinese and Japanese, which I have no clue how to begin with. Should we even worry about non-Latin-based languages at this stage? If the "base" value was a character that needed encoding (e.g.
Including both upper and lower case would be the best idea then. |
I'll come back to this tomorrow with a clear head. GN8 🌙 |
Could you update the spec with this?
Absolutely right. Before we start implementing the database, we should have a layout that works in all languages. I'd like to ask @gromo if you can help us out. We'd like to know if the Arabic alphabet contains diacritics like e.g. Latin does, and whether they can be mapped to a "base" character (e.g. "u" when the diacritic is "ü"). Hopefully you're familiar with this alphabet as someone living in Uzbekistan. I'd appreciate your answer! |
Done. I've updated the spec (and PR). Let me know what you think. Also, I think ligatures ( |
@julmot The Uzbek language uses the Cyrillic / Latin alphabet, so I cannot help you with this
I found this valuable resource! http://www.unicode.org/reports/tr30/datafiles/DiacriticFolding.txt (see the "precomposed" section halfway down). The main issue is that it was never approved and was deprecated (ref). So, even though it is directly related to our database, would it still be a good idea to use it? Secondly, I saw this post in the Elasticsearch blog about ASCII (diacritic) folding... they use the ASCII Folding Filter from the Lucene Java library. I'm not sure where to find that Java code, but I suspect they are doing the same thing as in the DiacriticFolding.txt file... I will search for the resource later (after some sleep). Update: https://github.com/GVdP-Travix/ASCIIFoldingFilter/blob/master/ASCIIFoldingFilter.js |
@Mottie Thanks for this idea. First off, the deprecated DiacriticFolding won't be something we can use, as we need to guarantee correctness. I've had a look at the Elasticsearch site you're referring to but wasn't able to find the original "ASCIIFolding" project (mapping database), so I've only had a look at the JS mapping you've provided. From my point of view this would only be a solution for the

Therefore I have the following questions:
|
I don't know the specifics, but the DiacriticFolding was created by a member of the Unicode group. So that may not guarantee correctness, but it might be about as close as we can get right now. And yeah, I agree that the "ASCIIFolding" should only be used for the base mapping.
I think so. Elasticsearch is an open source RESTful search engine with almost 20k stars. Users are relying on the ASCII folding to perform searches.
It looks like they are mapping only by unicode blocks and not by language. But in their case, the ASCII folding doesn't differ, it looks like they are basically stripping away any diacritic. Which is essentially what the DiacriticFolding file looks like it is doing.
I'm not sure how to answer this question. |
As "ASCIIFolding" seems to contain the same information, I think we should focus on that?
I know Elasticsearch, but as I couldn't find the database they're using, I assume that they're using it from a third party too? In that case, we don't need to find out whether Elasticsearch is trustworthy, but whether the third party ("ASCIIFoldingFilter.js") is. We also need to make sure that we can use their database from a legal point of view.
We can make this decision easy: If we're going to use their database, we need to use what they're providing. |
It looks like they use an Apache license (http://www.apache.org/licenses/LICENSE-2.0). What I'm actually saying is that I think this code essentially implements the "DiacriticFolding" data from the Unicode organization; but I can't say for sure until I actually compare the two. Interestingly, this port to Go (https://github.com/hkulekci/ascii-folding) has an MIT license. |
I'm not a lawyer, but according to the license it allows usage with a copyright notice. However, in users' end products there can't be a notice, since it'll just be e.g. a regex (no logic). I'm not sure whether providing their copyright notice in our database would be sufficient. To guarantee we're allowed to use it, we need to contact them. Thanks for providing the Java file, that helped.
Interesting. We should investigate this and find out if we can use the database. If so, I'd agree to use it to automatically generate the
@hkulekci Can we assume that this is a mistake? |
@julmot Yeah, you can. I'm not good at licensing. I was only trying to give an example of something in golang. :) I guess, in this case, I must choose the Apache license. If you know better, please correct me on which license I should choose. |
@hkulekci No, sorry, I don't know either. But since this project is released under the MIT license and we'd like to use this database, this is of interest to us too. @Mottie If you have time, could you please find one of the owners of the provided Java library and contact them regarding the usage (and cc me, please)? There's another question they can probably answer. I just asked myself: is the mapping e.g. |
Sorry, I've had a busy day; I'm just now checking GitHub. I do know there is at least one additional language that needs diacritics decomposed... I found this issue (OpenRefine/OpenRefine#650 (comment)) about the Norwegian language:
|
@Mottie Thanks for this information. We definitely need to investigate this for more languages. I've thought about this again and came to the conclusion that even if the |
First off, I haven't received an answer from the Lucene team regarding automatically generating the

Anyway, as soon as the API is merged, the next step is to implement a process that allows users to integrate the diacritics project. We have several kinds of projects:
I'd like to start discussing JavaScript projects. We need to have an npm module that replaces placeholders with diacritics data. This module will use the API to fetch live data. There should be two possible placeholder types:
While a placeholder syntax like

const x = [/* diacritics: /?language=DE */];
This is just an idea, not set in stone. I'm open to other ideas. Anyone? Okay, for those projects that aren't using a build, they'd need to create a module that overwrites a method in their project by using the npm module in a build. There won't be a way to use the diacritics project without this module (or without a build), as the data is fetched dynamically from the API. @Mottie What do you think? |
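A first sketch of how the npm module could perform that replacement (the API host is hypothetical and the placeholder syntax follows the example above):

const https = require("https");

// Replace /* diacritics: <query> */ placeholders in a source string
// with the JSON response fetched from the (future) diacritics API.
function replacePlaceholders(source, done) {
    const re = /\/\*\s*diacritics:\s*(\S+)\s*\*\//;
    const match = re.exec(source);
    if (!match) {
        return done(source);
    }
    // "api.diacritics.io" is a placeholder host, not a real endpoint
    https.get("https://api.diacritics.io" + match[1], res => {
        let body = "";
        res.on("data", chunk => (body += chunk));
        res.on("end", () => {
            // recurse to handle any remaining placeholders
            replacePlaceholders(source.replace(re, body), done);
        });
    });
}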
Doesn't the API also need to indicate the type & format of the output?
I'm not yet clear on how we would get the API to only return the first equivalent, or a specific equivalent if there is one. Also that specified equivalent's specific data (e.g. In the case of the mark.js repo, if you added a placeholder for say |
You're right, there should be some parameters to ignore some data, e.g. a value to ignore equivalents or just some of the equivalents (by name, e.g.
Yes, that parameter |
I've thought about this again, and making the

If we introduce an option to specify the output structure (non-JSON), then this shouldn't be a parameter in my opinion (e.g. ?output=js-array). All the things under the route

@Mottie What do you think? |
Sorry for not responding earlier! Do you still think that this will be necessary? I think the npm module would be able to output any format you need. I'm starting to think that adding a new route would probably complicate the use of the API. As an aside, I have started to work out what the npm module will provide, and I've gotten stuck on how to deal with characters that are not going to be included under any language... like what happens when someone tries to remove diacritics from a crazy string like |
I think it has one big advantage: it would allow copy and paste directly from the website. I'm currently imagining how our (later coming) website could look. I imagine a website with a table full of diacritics, with their metadata and mapping information. You can filter and sort everything, and finally you can just click "get code", select a language and structure (e.g. JavaScript and object or array), and get the code. This could be done using an option of the API. If it were part of the npm module, we would have to either create redundant code (npm module and website) or just not implement such a button. Interesting point in your second paragraph. Why would you call it "en"? And how would you map all kinds of Unicode characters? |
Ok, sounds good then!
Well, English doesn't really include diacritics, and even when we do use them, we ignore them all the time. I didn't want to name it something like

One block of entries will be removing all combining diacritics... so the base would be an empty string:

"data": {
// combining diacritics -> convert to empty string
"\u0301": { "mapping": { "base": "" } },
"\u0300": { "mapping": { "base": "" }},
"\u0306": { "mapping": { "base": "" }},
"\u0302": { "mapping": { "base": "" }},
"\u030c": { "mapping": { "base": "" }},
...
}

Then we could include the decomposing of other symbols like |
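For comparison, mapping every combining mark to an empty string is what the well-known NFD-based stripping idiom does; a minimal sketch in plain JavaScript (note it only covers the Combining Diacritical Marks block, U+0300–U+036F):

// Decompose, then remove all combining diacritical marks.
function stripDiacritics(str) {
    return str.normalize("NFD").replace(/[\u0300-\u036f]/g, "");
}

stripDiacritics("déjà vu"); // "deja vu"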
@Mottie I think that including special characters that aren't diacritics makes sense (e.g. "①"), but we can't call them "diacritics". I'd say we should decide whether to include them depending on the effort. Is there any existing database like the one for the HTML entities? If so, then we can continue by creating a new file in the build folder and adding them to the generated diacritics.json. A new API option should also allow excluding them.
|
Btw.: Is the "Write" tab (textarea) also delayed for you while typing? |
I'm not having any issues with the textarea. I started with a bunch of characters and plugged them into

In the meantime, I'll put this part on hold and continue working on the npm module. |
How do you know that? And where did you take the data from? |
Ping @Mottie. And what's the current status with the npm module spec? |
Hey @julmot! I'll clean up what I have in the works and post it in about 4 hours (after I go to the gym)... it is still incomplete, but it'll give you an idea of where things are now. |
As I said, it's still a work-in-progress... https://github.com/diacritics/node-diacritics-transliterator/tree/initial |
@Mottie Would you mind submitting a PR? This would allow us to have a conversation about it directly in the repository. |
Finally, we're in the end-phase and going live soon. |
The idea behind this repository is to collect diacritics with their associated ASCII characters in a structured form. It should be the central place for various projects when it comes to diacritics mapping.
As there is no single, trustworthy and complete source, all the information needs to be collected manually by users.
Example mapping:
User Requirements
Someone using diacritics mapping information.
It should be possible to:
Contributor Requirements
Someone providing diacritics mapping information.
Assuming every contributor has a GitHub account and is familiar with Git.
Providing information should be:
System Specification
There are two ways of realization:
Create a JSON database in this GitHub repository, as this fits user and contributor requirements.
Create a database in a third-party service that fits the user and contributor requirements.
Tested:
Because we're not familiar with any further third-party services that could fit the user and contributor requirements, we'll continue with the first approach.
System Requirements
See the documentation and pull request.
Build & Distribution
Build
According to the contributor requirements it should be possible to compile source files without making a Git clone necessary. This means that we can't require users to run e.g.
$ grunt dist
at the end, since this would require cloning, installing dependencies and running things. What we'll do instead is implement a build bot that runs our build on Travis CI and commits the changes directly to a "dist" branch in this repository. Therefore, once you merge something or commit something yourself, the "dist" branch will be updated automatically. Some people are already doing this to update their "gh-pages" branch when something changes in the "master" branch (e.g. this script).
Since we'll use a server-side component to filter and serve the actual mapping information, we just need to generate one diacritics.json file containing all data.
To make parsing easier and to encode diacritics to Unicode numbers in production, we're going to need a build that minifies the files and encodes diacritics. This should be done using Grunt.
Integrity
In order to ensure integrity and consistency we need the following in our build process:
Distribution
To provide diacritics mapping according to the User Requirements, it's necessary to run a custom server-side component that makes it possible to sort, limit and filter information and output it in different ways (e.g. a JS object or array). This component should be realized using Node.js, as it's made for handling JS/JSON files, whereas PHP would cause a lot more serializing/deserializing.
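A minimal sketch of such a component, assuming the generated diacritics.json keys its data by lowercase language code (the route and parameter names are illustrative, not final):

const http = require("http");
const url = require("url");
const database = require("./dist/diacritics.json"); // assumed build output

// Serve mapping information, filtered by a ?language=<code> parameter.
http.createServer((req, res) => {
    const query = url.parse(req.url, true).query;
    const data = query.language
        ? database[query.language.toLowerCase()]
        : database;
    res.writeHead(data ? 200 : 404, { "Content-Type": "application/json" });
    res.end(JSON.stringify(data || { error: "language not found" }));
}).listen(8080);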
Next Steps
A .md file that specifies the entire database structure in detail

This comment is updated continuously during the discussion