Add CedPane dictionary supplement? #60
Comments
Sorry, I forgot to mention that I wrote this ticket because a user of your extension emailed me asking me to fork your extension into a version with CedPane added. I'd rather avoid creating a fork if it's something that can be done upstream. |
I appreciate that users will no longer spend a great deal of time analyzing enigmatic series of Chinese words when those words are merely people's transliterated names, compound words, company names, colloquial phrases, and idioms. In fact, I included your dictionary in my Chinese Words Separator extension for Chrome: https://chrome.google.com/webstore/detail/chinese-words-separator/gacfacdpfimbkgcnlegknnmcccjgcbnp It will save a lot of Chinese language learners the time they would otherwise spend over-analyzing a series of hanzi. Here's an example of my extension result, before and after I included your CedPane dictionary: However, there are phrases that I feel should not be in the CedPane dictionary, for instance:
Yǔnxǔ ānzhuāng láizì wèizhī láiyuán de yìngyòng
I feel that it's neither a compound word nor a colloquial phrase that Chinese language learners should learn by heart. I therefore want to exclude it, so I limited my code's compound-word look-ahead to a certain length; lengthy phrases of that kind are then excluded from the extension's compound-words mechanism. There are more phrases that I think should not be in the CedPane dictionary. I believe the hesitancy of some Chinese dictionary tool makers to include CedPane in their own dictionaries stems from examples like these. |
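For readers who want to see what a length-limited look-ahead can look like, here is a minimal sketch in Python. It is not the extension's actual code: the function name `segment`, the `max_len` parameter and the tiny `DEMO_DICT` are all made up for illustration, and the hanzi phrase is my own reconstruction of the pinyin quoted above.

```python
def segment(text, dictionary, max_len=4):
    """Greedy longest-match segmentation with a capped look-ahead."""
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, but never longer than max_len.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # A single character is always accepted as the fallback.
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                i += size
                break
    return words


# Hypothetical mini-dictionary; it contains the long phrase as one entry.
PHRASE = "允许安装来自未知来源的应用"
DEMO_DICT = {"允许", "安装", "来自", "未知", "来源", "应用", PHRASE}

# With the cap, the 13-character entry is never tried as a single word.
print(segment(PHRASE, DEMO_DICT, max_len=4))
# Without an effective cap, the whole phrase comes back as one "word".
print(segment(PHRASE, DEMO_DICT, max_len=len(PHRASE)))
```

The cap is a blunt filter: it also hides any legitimate long entry (an idiom, say) that exceeds `max_len`, so a per-entry marker in the dictionary would be more precise if one is available.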
@ssb22 It would be a good idea to add a field to the CedPane dictionary that indicates whether an entry is a name, compound word, colloquial phrase, or idiom. Or for examples such as |
@ssb22 Here's another output of Chinese Words Separator extension with your CedPane dictionary included: Without CedPane: Thanks :) |
Thanks, I think the easiest way to omit the "phrase" entries is simply to omit any entry that has a

Phrases like "allow installation for unknown sources" are included because we occasionally need them for translating English into Chinese. I find people tend not to understand my technical instructions unless I can quote the exact wording that's displayed on their screen, not just a paraphrase of it, so yes we do want to be able to look up how things like that are worded in Chinese. But they are not meant to be displayed without spaces, which is why I include spaces in the pinyin field of

If anyone has code that can process phrases including spaces, I'd rather they include the multi-word phrases because some of these entries are used to "clear up" what would otherwise be a difficult case for a computer to get right. For example, the entry 万国都 is 2 words, and it is meant to clarify that, in the texts I've seen, 万国都 should be written as "wànguó dōu" (all nations + all), rather than "wàn guódū" (myriad + capital cities). Otherwise, software like Wenlin might incorrectly put "wàn guódū" because 国都 has a higher usage frequency than 万国 (usage frequency is the wrong signal to use in this case, so I added an 'override' phrase entry).

CEDICT also has a few 'long phrase' entries (like 金窩銀窩不如自己的狗窩) which I don't think should be written without spaces. Unfortunately, CEDICT doesn't have the |
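As a rough illustration of how code could use such an 'override' phrase entry, the sketch below splits the hanzi of a multi-word entry into its words. It makes several assumptions that are mine, not the dictionary's specification: the pinyin is numbered, syllables are separated by spaces, words are separated by `_` (the convention described for the ChinaScribe file in this thread), and each hanzi corresponds to exactly one syllable. The function name is hypothetical.

```python
def split_by_word_boundaries(hanzi, pinyin):
    """Split hanzi into words using '_' word boundaries in the pinyin field.

    Assumes one pinyin syllable per hanzi character.
    """
    words = []
    pos = 0
    for pinyin_word in pinyin.split("_"):
        syllable_count = len(pinyin_word.split())
        words.append(hanzi[pos:pos + syllable_count])
        pos += syllable_count
    if pos != len(hanzi):
        # Entries with Latin letters, digits or punctuation end up here.
        raise ValueError("syllable count does not match character count")
    return words


# The override entry discussed above: 万国 + 都, not 万 + 国都.
print(split_by_word_boundaries("万国都", "wan4 guo2_dou1"))  # ['万国', '都']
```

A segmenter could look up such splits before falling back to usage frequency, so that the phrase entry wins in exactly the cases it was added for.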
Hi @ssb22 , thanks for getting in touch. You obviously put a lot of work into compiling your dictionary and the result is very impressive. I would actually prefer if you made your work available via publishing it through CC-CEDICT. That dictionary already includes a number of names of famous people and well-known places. I believe this approach would have several benefits:
Anyway, I respect the amount of work you've put into this. By working together with the CC-CEDICT team you could make it available to an even wider audience and it would be a win for everybody. |
I overlooked the file (CedPane-ChinaScribe.txt) that has word boundaries delimited by underscores. Is there a version or fork of CC-CEDICT that is in ChinaScribe format? It's neat when pinyin has word boundaries marked with underscores, not just 'syllable' boundaries. Indeed, the idiom "there's no place like home" is rendered with no spaces, as it is treated as one word because the CC-CEDICT source dictionary has no word boundaries :) |
Thanks @cschiller. I believe I was ostracized by the CEDICT team after an email misunderstanding 4 years ago, and I wouldn't want to annoy them by trying again now.

The CEDICT license doesn't let developers mix CEDICT data with other data, unless that other data also has a CC license. I was in the awkward position of having been given special permission to use certain proprietary data in a zero-cost zero-profit Android app, but I didn't have permission to CC-license that data, therefore I could not mix in CEDICT (unless CEDICT gave me an exception to the "must CC it" rule, which they didn't). I did try Adsotrans data for a while, since Adso's license did allow mixing without a CC requirement. But I found issues with the quality of Adso's data, so ended up going my own way instead.

Pleco has an innovative way of keeping dictionaries separate while still letting you use several, so Pleco is able to use both CC-CEDICT and proprietary dictionaries if you want. But not all apps can be written like Pleco (and I could not figure out how to make my Annotator Generator work anything like Pleco) so I couldn't just do it that way. I felt public-domain data would be least likely to cause problems for developers.

Sure I'm happy for CEDICT to use the data as long as they don't try to stop me from keeping the public-domain version available as well. They would probably want to review everything before inclusion, which could end up being a lot of work. In the short term it may make more sense to have CedPane as a separate source, and perhaps label your entries so everyone knows which of them have been edited by CEDICT versus which of them have only been edited by me. I suppose it's not impossible CEDICT could decide my editing is good enough to import without further review, but that is not my call to make! At the very least, I'd want to draw their attention to:
etc. Marking all entries as "from CedPane" until reviewed could be one way to shift any blame.

@ienablemuch the only other dictionary I know of that uses ChinaScribe format is the one bundled with ChinaScribe itself which is commercial Windows software with free trial (it sometimes works on WINE depending on the version). The License Agreement that pops up when you install it says: "Many ChinaScribe entries are derived from the following sources: CC-CEDICT Chinese-English dictionary. Available free of charged and licensed under a Creative Commons Attribution-Share Alike 3.0 License." So I suppose that means, although ChinaScribe is commercial software, its dictionary is a CC-CEDICT fork and can be used with other programs if you can get the data out of ChinaScribe. (The paid-up version has File / Export dictionary entries, but this refuses to run on the unpaid version. The internal file is typically (Edit: formatting) |
Incidentally ChinaScribe merged in CedPane in 2017, but I haven't checked to what extent they're keeping up since then. (I do keep an entry in CedPane for CedPane itself, with the date on it—see for example Ce.html—I figured that this entry could be used to check when a project that imported CedPane last did so, assuming they kept that entry. ChinaScribe doesn't seem to have it at the moment, which might perhaps mean their last import predates when I first added it.) |
Hi, I maintain a public-domain Chinese-English dictionary supplement (currently about 64,000 entries). It's data that is not usually in "normal" dictionaries but is still useful to have in a reader (mostly names of people and places). If including these extra entries, I would suggest labelling them in some way to differentiate them from the "main" CEDICT, as it seems the CEDICT editors are not sure they want to merge in CedPane entries en masse (and anyway I'm still writing it).
If you do want to merge in, I think the best starting point would be the CedPane ChinaScribe file because the format of that is quite similar to CEDICT. The main difference is that some of the pinyin syllables are separated with `_` instead of space: this indicates a word boundary in a multi-word phrase; if you can't cope with multi-word phrases then these entries are possibly best dropped. And some of the definitions are in `<...>` to indicate an environment (e.g. PRC, TW, netspeak, etc). Other than that it's basically the same apart from the sort order.

I don't know what is your current method of generating `cedict.idx` and your modified `cedict_ts.u8` from upstream (do you have scripts to do this / can you put them in the repo for reference?), but I'd imagine merging in another source (and perhaps labelling every definition with `[CedPane]` or similar so as to differentiate them from mainline CEDICT) shouldn't be difficult.

It's nice that you are able to push out several updates a year. I currently tend to publish my CedPane edits on the last day of each month, although that's not a guarantee. If you do a `git pull` from my repo as part of your normal update script, that should work. Alternatively on the CedPane home page there's always a "Last update" and entry count.

(I used to make a text file of CedPane available for download from the home page, but then a developer in China thought it was a good idea to write a lookup extension that re-downloads CedPane from my server every time it was used, which caused hundreds of gigabytes of traffic; when I say “keep up to date” I don't mean that much ☺ You can still get it from the home page but it's now a ZIP file which I hope will discourage our extension-writing friend from hammering my server. Meanwhile it also lives on the major Git providers which have more bandwidth. Including the data in the extension with periodic updates, as you do, does seem to be a better way.)
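To make the merge step described above a little more concrete, here is a hedged Python sketch. It assumes the ChinaScribe-format lines follow the usual CEDICT layout (`Traditional Simplified [pinyin] /definition/.../`) with `_` marking word boundaries inside the pinyin; I have not checked it against the real file. The function names, the `keep_multiword` switch and the sample lines are invented for illustration, and the `<...>` environment markers are simply left inside the definition text untouched.

```python
import re

# CEDICT-style line: Traditional Simplified [pinyin] /def 1/def 2/
CEDICT_LINE = re.compile(r"^(\S+) (\S+) \[([^\]]*)\] /(.*)/\s*$")


def parse_line(line):
    """Parse one dictionary line; return None for comments and bad lines."""
    if line.startswith("#"):
        return None
    match = CEDICT_LINE.match(line)
    if match is None:
        return None
    trad, simp, pinyin, defs = match.groups()
    return {"trad": trad, "simp": simp, "pinyin": pinyin,
            "defs": defs.split("/")}


def label_entry(entry, label="[CedPane]", keep_multiword=False):
    """Re-serialise an entry with every definition labelled.

    Multi-word entries ('_' in the pinyin) are dropped unless
    keep_multiword is set; if kept, the '_' is passed through unchanged.
    """
    if "_" in entry["pinyin"] and not keep_multiword:
        return None
    defs = "/".join(f"{d} {label}" for d in entry["defs"])
    return f'{entry["trad"]} {entry["simp"]} [{entry["pinyin"]}] /{defs}/'


# Made-up sample lines, not copied from the real CedPane file.
single = parse_line("萬國 万国 [wan4 guo2] /all nations/")
multi = parse_line("萬國都 万国都 [wan4 guo2_dou1] /all nations all/")
print(label_entry(single))
print(label_entry(multi))                       # None: dropped by default
print(label_entry(multi, keep_multiword=True))
```

Putting the label on each definition rather than on the entry as a whole means the tag should survive if a headword from both sources is later combined into one entry.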