State of the cjklib / understanding our datasets #3

Closed
tony opened this Issue Aug 19, 2012 · 2 comments

@tony
Collaborator

tony commented Aug 19, 2012

I think it'd be good to take stock of where we stand on cjklib's current codebase. Do we want to use it? As it stands, I'm not sure whether I'm failing to grasp the complexities of commingling our data, or whether there are architectural mistakes in it that would be best fixed by a rewrite.

If that is the case, I wonder if you could take some time to document what is what from a data perspective. Here are a few questions that would be helpful to have answered:

  • In cjklib.data's CSV and SQL files: what are these datasets? How are they used? Are they all used in the same way? What data do/can they hold?

More specifically, what is the following:

  • edict
  • cedict
  • cedictgr
  • handedict
  • cfdict
  • unihan
  • kanjidic2

and

  • cantoneseipainitialfinal
  • cantoneseyaleinitialnucleuscoda
  • cantoneseyalesyllables
  • characterdecomposition
  • charactershanghaineseipa
  • grabbreviation
  • grrhotacisedfinals
  • grsyllables
  • jyutpinginitialfinal
  • jyutpingipamapping
  • jyutpingsyllables
  • jyutpingyalemapping
  • kangxiradical
  • localecharacterglyph
  • mandarinipainitialfinal
  • pinyinbraillefinalmapping
  • pinyinbrailleinitialmapping
  • pinyingrmapping
  • pinyininitialfinal
  • pinyinipamapping
  • pinyinsyllables
  • radicalequivalentcharacter
  • shanghaineseipasyllables
  • strokeorder
  • strokes
  • Unihan.zip (is this downloaded to here?)
  • wadegilesinitialfinal
  • wadegilespinyinmapping
  • wadegilessyllables

What are the above? Why are some included while others are downloaded remotely? Can we package any/all of the remote data in cjklib? Is it a matter of licensing, or of making sure fresh data is downloaded?

What data in the above datasets intersect, where?

Where the data does intersect, I'm assuming we're massaging it in some way so that we can match it in a lookup? Maybe it'd help to have a spreadsheet / table of this?

I think that if we mapped the data we have to a spreadsheet, it'd give us all a better view of the picture, imo. Then we can take a step back from legacy assumptions and be in a better position to make pull requests for larger architecture changes.
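As a starting point for such a spreadsheet, here is a minimal sketch that inventories a directory of CSV and SQL files and writes one summary row per file. It uses only the Python standard library, not cjklib's API. The function names `inventory` and `write_overview` are hypothetical, and it assumes the CSV files are comma-separated, UTF-8 encoded, and use lines starting with `#` as comments; none of that is confirmed in this thread.

```python
import csv
from pathlib import Path

FIELDS = ["name", "kind", "records", "columns"]


def inventory(data_dir):
    """Return one summary dict per .csv/.sql file found directly in data_dir."""
    rows = []
    for path in sorted(Path(data_dir).iterdir()):
        suffix = path.suffix.lower()
        if suffix not in (".csv", ".sql"):
            continue  # ignore archives, READMEs and anything else
        entry = {"name": path.stem, "kind": suffix[1:], "records": "", "columns": ""}
        if suffix == ".csv":
            with path.open(newline="", encoding="utf-8") as handle:
                # Assumption: blank lines and lines starting with '#' are not data.
                data_lines = (
                    line for line in handle
                    if line.strip() and not line.startswith("#")
                )
                records = list(csv.reader(data_lines))
            entry["records"] = len(records)
            entry["columns"] = max((len(r) for r in records), default=0)
        # .sql files are only listed; their contents are not parsed here.
        rows.append(entry)
    return rows


def write_overview(rows, out_path):
    """Write the inventory as a CSV file that any spreadsheet tool can open."""
    with open(out_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(rows)
```

Calling `write_overview(inventory("path/to/data"), "overview.csv")` (with a placeholder path) would produce a table of file names, kinds, record counts, and column counts. What each column means would still have to be filled in by hand.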

I realize the above is a pretty time-consuming thing, think you could take a bite at it though?

@cburgmer

Owner

cburgmer commented Sep 6, 2012

Tony, sorry for making you wait for so long.

While I feel that your questions are valid, a bug tracker might be the wrong place for discussing those. If we continue discussing, could you please take it to the mailing list? It might even make me respond quicker: https://groups.google.com/forum/?fromgroups#!forum/cjklib-devel

All the data files that live in this project are hand-crafted for use with cjklib. You can use the Python API to access all the data.

So to answer some of your questions:

edict, cedict, cedictgr, handedict, cfdict are all dictionaries. They are downloaded on the fly (so they are up-to-date) and can be queried via cjklib's dictionary API.

The list of files that you mention covers different things: for example, readings of Chinese languages (Mandarin, Cantonese, Shanghainese) in some of their respective romanisation schemes. Some files describe Chinese characters and their composition out of smaller elements, as well as strokes.

I did make sure to document what those lists were, and where the data comes from.

kanjidic2 and Unihan are used to derive information that none of our own sources cover, or that they don't cover to the same extent. However, I can say that Unihan doesn't provide the quality needed for the use case I developed cjklib for, so in general having more of our "own" data would be good.

So the data is captured inside cjklib, where it is not very visible to people from other language backgrounds, or to non-programmers. Ideally the data would live on some sort of web page, independent of cjklib.

@tony

Collaborator

tony commented Sep 6, 2012

@cburgmer: Not at all (哪里哪里)! Thank you for the response. I'll note that the Google Groups discussion list is preferred.

I am not strong enough in Python to write something the Pythonic way myself, but a high-level overview of cjklib's Python code would be nice. Have you ever seen http://www.aosabook.org/en/index.html? If a sage were to write up an overview of cjklib in that style, it'd be cool.

In the meantime, if I delve further into this subject or other things, I will bring it to the list.

@tony tony closed this Sep 6, 2012
