FYI: I'm rewriting the compiling/importing code #11
Comments
Stephen, in case it helps: I wrote a script for converting Perseus XML into plain text for my APA 2014 alliteration poster. The XML files, though almost all orderly TEI/XML, are not consistent, so there are some peculiarities to handle for different authors and works. The script also includes a specific workaround for section breaks (which the alliteration study needed). You might find some of this useful, so here's the code: https://github.com/diyclassics/Alliteration-in-Latin-Literature/blob/master/code/perseusPreprocess.py
Best,
On Oct 4, 2014, at 12:49 PM, Stephen Margheim notifications@github.com wrote:
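For readers following along, here is a minimal sketch of the general idea of pulling plain text out of TEI-style XML. It is not the linked `perseusPreprocess.py` script: the function name `tei_to_plain_text`, the toy `SAMPLE` document, and the assumption that the text lives under an un-namespaced `body` element are all illustrative. Real Perseus files vary by author and work, as noted above, so they would need extra handling.

```python
import xml.etree.ElementTree as ET

# Toy TEI-like document (made up for illustration, not a real Perseus file).
SAMPLE = """
<TEI.2>
  <teiHeader><title>Toy example</title></teiHeader>
  <text>
    <body>
      <div1 type="book" n="1">
        <l>Arma virumque cano,</l>
        <l>Troiae qui primus ab oris</l>
      </div1>
    </body>
  </text>
</TEI.2>
"""


def tei_to_plain_text(xml_string):
    """Return the text inside <body>, with whitespace collapsed.

    Assumes an un-namespaced <body> element; header metadata is skipped.
    """
    root = ET.fromstring(xml_string)
    body = root.find(".//body")
    if body is None:
        return ""
    # itertext() yields every text node under <body>, in document order.
    raw = " ".join(body.itertext())
    # Collapse runs of spaces and newlines left over from the XML layout.
    return " ".join(raw.split())


print(tei_to_plain_text(SAMPLE))
```

A per-corpus workaround (section breaks, editorial notes, and so on) would most likely slot in between locating `body` and joining the text.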
Wow, you guys are awesome.
@smargh You have correctly identified some very redundant code. Your cleanup of this will be a terrific help. That module grew organically as I needed to add access to new corpora, and it is becoming hard to manage. Two tips: (1) Make sure that the corpus importer will be able to grow with other languages. For example, consider some kind of logic, class, or argument to separate the downloading of, say, Hebrew from Greek from Sanskrit. I am 100% open to how this gets done. (2) If you think this revision will become an overwhelming task, try breaking it into two parts. In this case, I see two discrete tasks, (i) improving my spaghetti code for downloading corpora into
@diyclassics Thanks for sharing your Perseus XML parsing. This will surely come in handy sooner rather than later.
@smargh + @diyclassics An update from my end: I too have been struggling with XML lately. In my case, I have been parsing the Perseus treebank data in order to build a POS training set and an automated tagger. I have some early hacking here for Greek: https://github.com/kylepjohnson/treebank_perseus_greek What I have done is made a POS training set (
So far, I have only used the training set with the UnigramTagger(), but it seems to work well for texts that are part of the original training set (which is a good sign). I have yet to test it thoroughly, however. I'm going to do Latin tonight.
Thanks again, both, and please holler if you get badly stuck on a problem. I try not to be touchy about edits to my code, so if you see anything of mine that looks like a bad idea, try your hand at making it better, more intuitive, and/or more Pythonic.
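As a rough illustration of why a unigram tagger does well on text it was trained on, here is a dependency-free sketch of the technique. It is a stand-in for NLTK's `UnigramTagger`, not the code in the linked repository; the helper names `train_unigram_tagger` and `tag`, the tag labels, and the toy training data are all made up for this example.

```python
from collections import Counter, defaultdict

# Toy training data (invented for illustration): lists of (word, tag) pairs.
TOY_TRAINING = [
    [("arma", "N"), ("virumque", "N"), ("cano", "V")],
    [("cano", "V"), ("arma", "N")],
]


def train_unigram_tagger(tagged_sentences):
    """Map each word to the tag it most often carries in the training data."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, pos in sentence:
            counts[word][pos] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}


def tag(model, words, default=None):
    """Tag each word by lookup; unseen words get the default tag."""
    return [(word, model.get(word, default)) for word in words]


model = train_unigram_tagger(TOY_TRAINING)
print(tag(model, ["arma", "cano", "Troiae"]))
```

Because the model is a pure lookup table, every word that appeared in training gets a tag, while an unseen word such as `Troiae` falls through to the default. Good results on the training texts are therefore encouraging but not conclusive, which is why testing on held-out text matters.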
Ok. I've pushed my initial work to my fork at https://github.com/smargh/cltk. You can see the class-based structure that I am taking. I have some initial code for a base CLTK class, which integrates with a
Right now, my plan is to mirror the current API in
How does this sound? Thoughts, comments, concerns?
stephen
Hi Stephen, This is more than I expected and I am very impressed. Concerning the big picture stuff you're talking about, I think you're right on target. Your object-oriented approach to interacting with them is especially apt. A couple nit-picky points/questions:
I would really like to see this in action. What can I do to help it along? From my end, the closer the repository is to the current cltk master the smoother the transition. If you can get this new code into Thanks,
My thought on other languages initially would be Now, we may want to add one level of complexity for flexibility and efficiency and make some intermediate classes. So, for corpus stuff, there are some corpora that only require downloading from the internet ( This would allow the most generic code to go into Anyway, those are my two cents. I'm finishing up the
The Corpus and Document classes sound great. And I love the
basic skeletons of
Kyle,
I have started down the path of writing code to convert Perseus XML into structured, formatted plain text. It is a task, so who knows how long it'll take, but along the way, I have dug into the cltk code, primarily the code under /corpus. I have forked this repo and am working on this fork, but it will probably be a while till I get a working Pull Request. Before then, I wanted to let you know what I'm doing and why I think it will help.

I am restructuring the entire code within this scope to be entirely modular. I am writing classes for each corpus, which are all actually sub-classes of a Corpus class, which itself uses a CLTK class. Once finished, there will be no redundant code, each corpus will be individually accessible (not just the import/compile code, but also future code, like convert to structured plain text), and the code should be easier to adapt over time.

Like I said, a Pull Request is probably a ways out, but you can see where I am heading once I push my initial work (and then all subsequent work) to my fork.
stephen
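To make the proposed structure concrete, here is a minimal sketch of one way it could look. This is not the code on the fork: every attribute and method name below (`root`, `path`, `import_corpus`, `PerseusLatin`, `PerseusGreek`) and the `~/cltk_data` directory are hypothetical, chosen only to show per-corpus subclasses of a `Corpus` class that in turn uses a shared `CLTK` object.

```python
import os


class CLTK:
    """Hypothetical shared base object: holds project-wide settings."""

    def __init__(self, root="~/cltk_data"):
        # Assumed data directory; the real project may use another location.
        self.root = os.path.expanduser(root)


class Corpus:
    """Generic corpus logic lives here; subclasses supply only the details."""

    name = None
    language = None

    def __init__(self, cltk=None):
        # A Corpus *uses* a CLTK object (composition) for shared settings.
        self.cltk = cltk or CLTK()

    @property
    def path(self):
        """Where this corpus would live, grouped by language."""
        return os.path.join(self.cltk.root, self.language, self.name)

    def import_corpus(self):
        raise NotImplementedError("each corpus defines its own import step")


class PerseusLatin(Corpus):
    name = "perseus"
    language = "latin"

    def import_corpus(self):
        # Placeholder: a real implementation would copy or download files.
        return "would import {} into {}".format(self.name, self.path)


class PerseusGreek(Corpus):
    name = "perseus"
    language = "greek"


corpus = PerseusLatin(CLTK(root="/tmp/cltk_data"))
print(corpus.import_corpus())
```

Keeping `language` as a class attribute is one possible way to let the importer grow to other languages: a new corpus then needs only a small subclass rather than another copy of the import code.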