FYI: I'm rewriting the compiling/importing code #11

fractaledmind · 2014-10-04T16:49:25Z

Kyle,

I have started down the path of writing code to convert Perseus XML into structured, formatted plain text. It is a task, so who knows how long it'll take, but along the way, I have dug into the cltk code, primarily the code under /corpus. I have forked this repo and am working on this fork, but it will probably be a while till I get a working Pull Request. Before then, I wanted to let you know what I'm doing and why I think it will help.

I am restructuring the entire code within this scope to be entirely modular. I am writing classes for each corpus, which are all actually sub-classes of a Corpus class, which itself uses a CLTK class. Once finished, there will be no redundant code, each corpus will be individually accessible (not just the import/compile code, but also future code, like convert to structured plain text), and the code should be easier to adapt over time.

Like I said, a Pull Request is probably a ways out, but you can see where I am heading once I push my initial work (and then all subsequent work) to my fork.

stephen


diyclassics · 2014-10-04T17:31:06Z

Stephen, in case it helps: I wrote a script for converting Perseus XML into plain text for my APA 2014 alliteration poster. The XML files, though pretty much all orderly TEI/XML, are not consistent, so there are some peculiarities to deal with for different authors/works. This script also includes a specific workaround for dealing with section breaks (which was needed for the alliteration study). But you might find some of this useful, so here’s the code: https://github.com/diyclassics/Alliteration-in-Latin-Literature/blob/master/code/perseusPreprocess.py
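
For illustration only, here is a minimal sketch of this kind of TEI-to-plain-text conversion. It is not the linked script: the sample XML, the element names (`l`, `note`), and the function names are assumptions made up for the example, and real Perseus files will need per-author adjustments.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical TEI-like sample; real Perseus files vary by author and work.
SAMPLE = """<TEI.2>
  <teiHeader><title>Sample</title></teiHeader>
  <text><body>
    <div1 type="book" n="1">
      <l n="1">Arma virumque cano, <note>editorial note</note>Troiae qui primus ab oris</l>
      <l n="2">Italiam fato profugus</l>
    </div1>
  </body></text>
</TEI.2>"""


def _collect_text(element, skip_tags):
    """Gather an element's text recursively, dropping the content of skipped tags."""
    parts = [element.text or ""]
    for child in element:
        if child.tag not in skip_tags:
            parts.append(_collect_text(child, skip_tags))
        # Text following a child belongs to the parent, so keep it either way.
        parts.append(child.tail or "")
    return re.sub(r"\s+", " ", "".join(parts)).strip()


def tei_to_plain_text(xml_string, skip_tags=("note",)):
    """Return the body text, one output line per <l> element."""
    root = ET.fromstring(xml_string)
    body = root.find(".//body")
    return "\n".join(_collect_text(line, skip_tags) for line in body.iter("l"))


plain = tei_to_plain_text(SAMPLE)
```

Keeping the skipped tags as a parameter is one way to absorb the per-work inconsistencies without forking the whole function.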

Best,
PJB
@diyclassics


kylepjohnson · 2014-10-04T23:06:22Z

Wow, you guys are awesome.

@smargh You have correctly identified some very redundant code. Your cleanup of this will be a terrific help. That module grew organically as I needed to add access to new corpora, and it is becoming hard to manage. Two tips: (1) Make sure that the corpus importer will be able to grow with other languages. For example, consider some kind of logic or class or argument to separate the downloading of, say, Hebrew from Greek from Sanskrit. I am 100% open to how this gets done. (2) If you think this revision will become an overwhelming task, try breaking it into two parts. In this case, I see two discrete tasks: (i) improving my spaghetti code for downloading corpora into ~/cltk_data and (ii) modularizing text manipulation (e.g., XML parsing, code cleanup, Beta Code transliteration) of downloaded data. Based on my experience, I suspect that by first separating out the downloading from the parsing, the text-processing code will be much easier to modularize. With TEI so popular these days, the latter would be an incredible boon!
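
One possible shape for the two tips above, as a rough sketch rather than actual CLTK code: the function names, the `raw.dat` filename, and the `<data_root>/<language>/<corpus>` layout are all assumptions for illustration. The download step takes a `fetch` callable so the sketch runs without network access.

```python
import tempfile
from pathlib import Path


def corpus_dir(data_root, language, corpus_name):
    """Hypothetical layout: one sub-directory per language, then per corpus."""
    return Path(data_root) / language / corpus_name


def download_corpus(data_root, language, corpus_name, fetch):
    """Task (i): store the raw bytes returned by fetch(); no parsing happens here."""
    target = corpus_dir(data_root, language, corpus_name)
    target.mkdir(parents=True, exist_ok=True)
    raw_path = target / "raw.dat"
    raw_path.write_bytes(fetch())
    return raw_path


def clean_text(raw_path, steps):
    """Task (ii): apply text-processing steps, in order, to downloaded data."""
    text = Path(raw_path).read_text(encoding="utf-8")
    for step in steps:
        text = step(text)
    return text


# Demo with a temporary directory standing in for ~/cltk_data.
root = tempfile.mkdtemp()
raw = download_corpus(root, "greek", "demo", lambda: b"  menin   aeide  ")
cleaned = clean_text(raw, [str.strip, lambda t: " ".join(t.split())])
```

Because the second function only ever sees a path on disk, the cleanup steps can be developed and tested without touching the download code.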

@diyclassics Thanks for sharing your Perseus XML parsing. This will surely come in handy sooner rather than later.

@smargh + @diyclassics An update from my end: I too have been struggling with XML lately. In my case, I have been parsing the Perseus treebank data for the purpose of making a POS training set & automated tagger. I have some early hacking here for Greek: https://github.com/kylepjohnson/treebank_perseus_greek What I have done is made a POS training set (pos_training_set.txt) with make_pos_training_set.py (in which you'll see my use of lxml). You can then generate your own machine learning tagger and then tag untagged text. (Note: This stuff is covered in chapters 3 & 4 of the "Python Text Processing with NLTK 2.0 Cookbook".) You can follow along with the recipe I give in the README.

So far, I have only used the tagger with the UnigramTagger() but it seems to work well for texts which are part of the original training set (which is a good sign). I have yet to test it thoroughly, however. I'm going to do Latin tonight.
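
To make the idea concrete, here is a pure-Python stand-in for what a unigram tagger does: it remembers the most frequent tag seen for each word. This is not NLTK's UnigramTagger and not the code in the repository above; the toy training data and the tag labels are invented for the example.

```python
from collections import Counter, defaultdict


def train_unigram_tagger(tagged_sentences):
    """Map each word to the tag it carried most often in the training data."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}


def tag_tokens(model, tokens, default=None):
    """Look each token up in the model; unseen words get the default tag."""
    return [(token, model.get(token, default)) for token in tokens]


# Invented toy training set with invented tag labels.
TRAIN = [
    [("arma", "N"), ("virumque", "N"), ("cano", "V")],
    [("cano", "V"), ("arma", "N")],
]

model = train_unigram_tagger(TRAIN)
tagged = tag_tokens(model, ["arma", "cano", "Troiae"])
```

The lookup also suggests why results look good on texts drawn from the training set: every word there has been seen, whereas an unseen word such as "Troiae" simply gets the default.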

Thanks again, both, and please holler if you get stuck bad on a problem. I try not to be touchy about edits to my code, so if you see anything of mine that looks like a bad idea, try your hand at making it better, more intuitive, and/or more Pythonic.

fractaledmind · 2014-10-05T17:13:33Z

Ok. I've pushed my initial work to my fork at https://github.com/smargh/cltk. You can see the class-based structure that I am taking. I have some initial code for a base CLTK class, which integrates with a config file (right now, the only thing this is used for is to alter the location of the cltk_data directory). I then have the beginnings of the Corpus class, which is the foundation class for all of the individual corpus classes. I have renamed /corpus to /corpora (more specific, and avoids conflict with corpus.py) and within that dir I have the beginnings of the individual corpus classes. These are all sub-classes of the Corpus base class. You can see that already a lot of redundancy is gone and the individual corpus classes are very streamlined. The idea is to put any corpus-specific code into these classes, while all shared code (I define shared as used by 2 or more corpora) goes in the Corpus base class. You will also see that I am structuring each individual corpus to have two classes: one for the corpus as a whole (where you can download and compile), and one for a specific document in that corpus. Any of the text/format manipulation will go in the document class. You can see some (not very good or directed) examples of this in the perseus_greek.py file.
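
A minimal sketch of the structure described above, not the code on the fork: the config section and option names (`paths`, `cltk_data`), the `document()` helper, and the example paths are assumptions for illustration.

```python
import configparser
import tempfile
from pathlib import Path


class CLTK:
    """Resolves shared settings; here only the location of the data directory."""

    DEFAULT_DATA_DIR = "~/cltk_data"

    def __init__(self, config_path=None):
        parser = configparser.ConfigParser()
        if config_path is not None:
            parser.read(config_path)
        # Hypothetical section/option names; fall back to the default location.
        raw = parser.get("paths", "cltk_data", fallback=self.DEFAULT_DATA_DIR)
        self.data_dir = Path(raw).expanduser()


class Corpus:
    """Shared code for all corpora; uses (rather than inherits from) CLTK."""

    name = None

    def __init__(self, cltk=None):
        self.cltk = cltk if cltk is not None else CLTK()

    @property
    def path(self):
        return self.cltk.data_dir / self.name

    def document(self, filename):
        return Document(self, filename)


class Document:
    """One text within a corpus; text/format manipulation would live here."""

    def __init__(self, corpus, filename):
        self.corpus = corpus
        self.filename = filename

    @property
    def path(self):
        return self.corpus.path / self.filename


class PerseusGreek(Corpus):
    """A corpus-specific sub-class only needs to state what differs."""

    name = "perseus_greek"


# Demo: point the data directory somewhere else through a config file.
with tempfile.TemporaryDirectory() as tmp:
    config_file = Path(tmp) / "cltk.cfg"
    config_file.write_text("[paths]\ncltk_data = /tmp/my_cltk_data\n")
    corpus = PerseusGreek(CLTK(config_file))
doc = corpus.document("homer.xml")
```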

Right now, my plan is to mirror the current api in compiler.py, but have it call to these classes (and their retrieve() methods) for downloading. So the downloading stage will be able to remain the same, but then users can access any corpus or any text in a corpus as an object. I think that on this data side, a fully fledged object-oriented approach will make access more user-friendly as well as flexible. By going fully modular, we ought to enable easier "hacking" of CLTK.
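
Roughly, the delegation could look like the sketch below. The `import_corpus` function and the registry are hypothetical (they are not the actual compiler.py API), and `retrieve()` returns a string here instead of downloading anything.

```python
class Corpus:
    """Base class; each corpus knows how to retrieve itself."""

    name = None

    def retrieve(self):
        raise NotImplementedError


class PerseusGreek(Corpus):
    name = "perseus_greek"

    def retrieve(self):
        # Placeholder for the real download step.
        return f"retrieved {self.name}"


class PerseusLatin(Corpus):
    name = "perseus_latin"

    def retrieve(self):
        return f"retrieved {self.name}"


# Hypothetical registry mapping corpus names to their classes.
CORPORA = {cls.name: cls for cls in (PerseusGreek, PerseusLatin)}


def import_corpus(name):
    """Function-style entry point that only delegates to the corpus object."""
    try:
        corpus_class = CORPORA[name]
    except KeyError:
        raise ValueError(f"unknown corpus: {name!r}") from None
    return corpus_class().retrieve()


result = import_corpus("perseus_greek")
```

Existing callers keep a function-style call, while anyone who wants the object can instantiate `PerseusGreek` directly.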

How does this sound? Thoughts, comments, concerns?

stephen

kylepjohnson · 2014-10-05T20:49:30Z

Hi Stephen,

This is more than I expected and I am very impressed.

Concerning the big picture stuff you're talking about, I think you're right on target. Your object-oriented approach to interacting with the corpora is especially apt.

A couple nit-picky points/questions:

- I have been trying to shadow the nltk's directory structure when possible. This leads me to think that it would be preferable to keep the dir name corpus and to keep most of your new code (corpus.py, soup_utils.py, config.py, main.py) within it.
- Specialized handling for specific languages can wait for now. If we do what you're working on well, we can add this modularly, as you say.
- I have done work here and there for parsing TLG and PHI texts, so I can consolidate and contribute that to tlg.py (and a phi5.py and phi7.py).

I would really like to see this in action. What can I do to help it along? From my end, the closer the repository is to the current cltk master, the smoother the transition. If you can get this new code into corpus/ and write out a few example commands to illustrate usage, I think the corpus imports at least will be close to ready. From there, it's just a matter of writing and improving text cleanup and interaction for specific corpora.

Thanks,
Kyle

fractaledmind · 2014-10-05T21:50:31Z

My thought on other languages initially would be corpus.py can handle that expansion. The Corpus base class should stay as generic as possible, so that ought not be a problem. And, all corpus specific tasks will reside in that corpus' sub-class anyway. Aside from that, we can create a new Document class (I'm actually going to mirror the Corpus -> specific corpus structure) for any new language. So, in the future, there are two base classes: Corpus, for the corpus level functions, and Document, for the document level functions. Each individual corpus will have corresponding classes: e.g. PerseusGreek and PerseusGreekDoc which are sub-classes of the base classes. This would effectively eliminate any language specific problems, as such things would live in the corpus specific code.

Now, we may want to add one level of complexity for flexibility and efficiency and make some intermediate classes. So, for corpus stuff, there are some corpora that only require downloading from the internet (retrieve()), others require a local directory. We could make sub-classes of the Corpus base class for these two basic types (maybe RemoteCorpus and LocalCorpus), and individual corpora sub-class off these depending on what type they are. For documents, we would do similarly. We could have TEIDoc, TXTDoc (I actually have already started these), and even GreekDoc, LatinDoc, and then any other languages. Then, corpus specific classes would be sub-classes of these.
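
A compact sketch of how the three levels might fit together. The method bodies, the constructor arguments, and the placeholder URL are assumptions for illustration, not the classes on the fork; the remote corpus takes a `fetch` callable so the example runs offline.

```python
import tempfile
import xml.etree.ElementTree as ET
from pathlib import Path


class Corpus:
    """Level 1: the most generic corpus behaviour."""

    name = None

    def retrieve(self):
        raise NotImplementedError


class RemoteCorpus(Corpus):
    """Level 2: corpora that only need downloading."""

    url = None

    def __init__(self, fetch):
        self.fetch = fetch  # injected so the sketch needs no network

    def retrieve(self):
        return self.fetch(self.url)


class LocalCorpus(Corpus):
    """Level 2: corpora that are read from a local directory."""

    def __init__(self, source_dir):
        self.source_dir = Path(source_dir)

    def retrieve(self):
        return sorted(p.name for p in self.source_dir.iterdir() if p.is_file())


class PerseusGreek(RemoteCorpus):
    """Level 3: corpus-specific details only."""

    name = "perseus_greek"
    url = "https://example.org/perseus_greek.tar.gz"  # placeholder URL


class TLG(LocalCorpus):
    name = "tlg"


class Document:
    """Level 1 for documents: holds raw content."""

    def __init__(self, raw):
        self.raw = raw

    def text(self):
        return self.raw


class TXTDoc(Document):
    """Level 2: plain text needs no conversion."""


class TEIDoc(Document):
    """Level 2: strip markup from TEI/XML content."""

    def text(self):
        return "".join(ET.fromstring(self.raw).itertext())


class PerseusGreekDoc(TEIDoc):
    """Level 3: would hold Perseus-specific quirks."""


# Demo
local_dir = tempfile.mkdtemp()
for filename in ("b.txt", "a.txt"):
    (Path(local_dir) / filename).write_text("x")
local_files = TLG(local_dir).retrieve()
remote_data = PerseusGreek(fetch=lambda url: b"archive bytes").retrieve()
doc_text = PerseusGreekDoc("<l>abc <w>def</w></l>").text()
```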

This would allow the most generic code to go into Corpus and Document, any redundant code to go into the intermediate classes, and corpus-specific code to go into the corpus sub-sub-class. Thus, when new languages are added, we would definitely add a new corpus sub-sub-class and maybe a new intermediate sub-class, but the base classes would remain the same. Not only would this hopefully make adding more languages and corpora easier, but it should also help direct people's thinking. Whatever divisions we make in the intermediate classes will guide people in considering how to classify the specific corpus they want to add.

Anyway, those are my two cents. I'm finishing up the TLG sub-class now. This is actually what led to the intermediate classes thought. While all the remote corpora were easy to write once I had the Corpus class, I'm putting a lot of code in the TLG class which I think can be more generically useful for the other local corpora. So, once I finish it, and turn to PHI, I ought to have a much clearer picture of what such a tri-leveled setup would look like. The goal, regardless, is to make the corpus-specific classes as lightweight as possible. That will make adding new ones as easy as possible.

kylepjohnson · 2014-10-05T23:48:10Z

The Corpus and Document classes sound great. And I love the RemoteCorpus and LocalCorpus retrieval idea. Can't wait to see it!

fractaledmind · 2014-10-06T18:11:23Z

Basic skeletons of RemoteCorpus and LocalCorpus are up on my fork. For now, I think this discussion can be closed. I want to move to more structured conversations/issues.


fractaledmind closed this as completed Oct 6, 2014

kylepjohnson added the enhancement label Oct 10, 2014

kylepjohnson added this to the backend_rewrite milestone Oct 10, 2014

todd-cook pushed a commit that referenced this issue Apr 27, 2018

Merge pull request #11 from cltk/master

a730bd7

Update master
