Implement NLTK CorpusReader(s) for existing corpora #361

Open
ryanfb opened this Issue Aug 16, 2016 · 13 comments

@ryanfb
Contributor

ryanfb commented Aug 16, 2016

See: #32 #296

NLTK's PlaintextCorpusReader may work for e.g. Lacus Curtius and other plaintext corpora (as is now done for The Latin Library). XMLCorpusReader may work for some XML corpora.

As an example, I'd like to be able to call .words() on all the available CorpusReader instances for a language, so I can programmatically build a comprehensive dictionary of unique words in each language. I can also imagine people wanting to be able to do the same for sentences, and so on. See here for common NLTK corpus reader functions: http://www.nltk.org/api/nltk.corpus.html#module-nltk.corpus
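For instance, a minimal sketch of the unique-words workflow, assuming a plaintext corpus already downloaded to a local directory (the path here is hypothetical):

import os
from nltk.corpus.reader import PlaintextCorpusReader

# Hypothetical local path to a downloaded plaintext corpus
root = os.path.expanduser('~/cltk_data/latin/text/latin_text_latin_library')
reader = PlaintextCorpusReader(root, r'.*\.txt')

# .words() iterates over every word in every fileid of the corpus
unique_words = {w.lower() for w in reader.words()}
print(len(unique_words))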

We might want to use the existing defined lists of corpora and their attributes to do this in some programmatic way as well, i.e. if type is text and markup is plaintext, use PlaintextCorpusReader and name to construct the path. We could load the reader instances into a Python dictionary keyed on name as well; see the sketch below.
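Something along these lines, where the corpus metadata drives reader construction (the attribute names and base directory below are illustrative, not necessarily the actual fields in the corpora definitions):

import os
from nltk.corpus.reader import PlaintextCorpusReader

# Illustrative metadata; the real corpora definitions may use other keys
corpora = [
    {'name': 'latin_text_latin_library', 'type': 'text', 'markup': 'plaintext'},
    {'name': 'latin_text_lacus_curtius', 'type': 'text', 'markup': 'plaintext'},
]

base_dir = os.path.expanduser('~/cltk_data/latin/text')
readers = {}
for corpus in corpora:
    if corpus['type'] == 'text' and corpus['markup'] == 'plaintext':
        root = os.path.join(base_dir, corpus['name'])
        readers[corpus['name']] = PlaintextCorpusReader(root, r'.*\.txt')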

@diyclassics

Contributor

diyclassics commented Aug 17, 2016

@ryanfb I think that these are great ideas and plan to address them in time (though feel free to contribute, if you're so inclined!). I started with the Latin Library for these practical reasons: 1. to demonstrate the usefulness of having access to the corpus reader methods, 2. to make it as easy as possible for people curious about CLTK, especially beginners, to get up and running with something familiar, and 3. to have a common set of texts to base a series of blog posts on. I think this has worked out so far. So, yes, I think extending this functionality is a good idea; testing out XMLCorpusReader on the Perseus corpus might be a good next step.

Also, I like this attributes-based approach—if we experiment with XMLCorpusReader and Perseus, we can test a simple plaintext/xml detection setup with those two corpora.

@ryanfb

Contributor Author

ryanfb commented Aug 17, 2016

FWIW, I experimented the other day with adding an XMLCorpusReader for the Perseus Latin texts and ran into a couple of issues. One was that XMLCorpusReader seems to work with only a single corpus fileid at a time (probably easy enough to write a small wrapper so that it can be used for all fileids in a corpus; see the sketch after the traceback below). The other was that there appears to be some weirdness in the Perseus XML which prevented it from parsing (which might be an upstream issue we need to discuss with Perseus):

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.5/site-packages/nltk/corpus/reader/xmldocs.py", line 65, in words
    elt = self.xml(fileid)
  File "/usr/local/lib/python3.5/site-packages/nltk/corpus/reader/xmldocs.py", line 48, in xml
    elt = ElementTree.parse(self.abspath(fileid).open()).getroot()
  File "/usr/local/Cellar/python3/3.5.2_1/Frameworks/Python.framework/Versions/3.5/lib/python3.5/xml/etree/ElementTree.py", line 1184, in parse
    tree.parse(source, parser)
  File "/usr/local/Cellar/python3/3.5.2_1/Frameworks/Python.framework/Versions/3.5/lib/python3.5/xml/etree/ElementTree.py", line 596, in parse
    self._root = parser._parse_whole(source)
xml.etree.ElementTree.ParseError: undefined entity &responsibility;: line 14, column 0

See here for instances of this in the corpus: https://github.com/cltk/latin_text_perseus/search?utf8=%E2%9C%93&q=%26responsibility%3B
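The single-fileid limitation, at least, looks easy to work around. An untested sketch of the kind of wrapper I have in mind (the corpus path is hypothetical):

import os
from nltk.corpus.reader import XMLCorpusReader

root = os.path.expanduser('~/cltk_data/latin/text/latin_text_perseus')
reader = XMLCorpusReader(root, r'.*\.xml')

def all_words(reader):
    # XMLCorpusReader.words() handles one fileid at a time,
    # so iterate over every fileid ourselves
    for fileid in reader.fileids():
        yield from reader.words(fileid)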

@kylepjohnson

Member

kylepjohnson commented Aug 17, 2016

I'm all in favor of extending the CorpusReader.

@ryanfb The legacy Perseus XML is not uniform and will not be parseable the way you'd like (that's my and some others' experience, at least). However, we can avail ourselves of the new versions of these from the Open Philology project. These are in XML, though their markup is not trivial.

@diyclassics Could you make a JSONCorpusReader? @suheb made a repo of the latest Open Philology texts converted from XML to JSON, using a tool written by @PonteIneptique (https://github.com/cltk/capitains_corpora_converter). Here are the texts: https://github.com/cltk/capitains_text_corpora.

Ryan, if you feel strongly about pitching in on such a reader, Patrick and I are available as support.

Thanks for looking into this!

@PonteIneptique

Member

PonteIneptique commented Sep 21, 2016

@ryanfb Sorry to get back to you this late, but yes, the main issue with the old Perseus XML is what Kyle just said. If you are interested in reading the converted corpus, you should probably have a look at http://ci.perseids.org/repo/PerseusDL/canonical-latinLit or so. I have a corpus reader for the converted Perseus XML (https://github.com/Capitains/Nautilus/blob/master/capitains_nautilus/inventory/local.py), and I assume that the .words() function would be something interesting to add to the Text class (https://github.com/Capitains/MyCapytain/blob/master/MyCapytain/resources/texts/local.py#L27). There is documentation online and I'd be happy to give some feedback.

If you wish to keep working with the original XML (without caring about the construction of the file), then for the entity issue use the lxml parser and a network wrapper for the DTD: https://github.com/PerseusDL/tei-conversion-tools/tree/master/dtd
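Roughly like this, assuming the DOCTYPE of each file points at a DTD that lxml can fetch or find locally (the filename is just an example):

from lxml import etree

# load_dtd + resolve_entities lets lxml expand entities such as
# &responsibility; that are declared in the DTD; no_network=False
# allows fetching the DTD when the DOCTYPE uses a remote URL
parser = etree.XMLParser(load_dtd=True, resolve_entities=True,
                         no_network=False)
tree = etree.parse('some_perseus_file.xml', parser)
root = tree.getroot()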

@kylepjohnson

Member

kylepjohnson commented Jun 4, 2017

Closing this (old) ticket. But I am open to something along these lines still.

@mlj

Member

mlj commented Jun 9, 2017

May I suggest re-opening this issue (and keeping it open) to encourage further discussion and to signpost that this is an open challenge for cltk.

I'm coming to this as an historical linguist, and the attraction for someone like me is that cltk could be a bridge between nltk and a curated collection of corpora. It should be obvious that not having ready-made CorpusReaders for the corpora largely ruins it for me!

Of course, nltk's PlaintextCorpusReader works OK on the plain-text corpora, but this isn't what I would be looking for. Frankly, corpora like the Latin Library are too noisy for me anyway, at least the way they are now. The ones I'm interested in have a bit more structure and/or have been tended to more carefully. Those tend to require parsing XML or some other less trivial format, which may not be too difficult but really shouldn't be necessary for users to deal with.

Doing one thing well is much better than many things badly, so for cltk to be, in essence, a corpus downloader/manager would not be a bad thing. But the ambition seems to be higher than this. The docs today show off several fun things that one might want to do, but almost all of that is in reality only possible if you (1) parse the raw corpus files yourself or (2) bring your own data.

I think not having ready-made CorpusReaders is a serious mistake. It fails to unlock the full potential of cltk and encourages sloppy work on the part of those who add corpora. I know this last part is a bit harsh, but cltk is growing and I think raising the bar a little bit will be in everyone's interest!

(For the record, I will absolutely contribute to this myself, but I'd like to know what the consensus here is before investing too much time writing code.)

@kylepjohnson kylepjohnson reopened this Jun 13, 2017

@kylepjohnson

Member

kylepjohnson commented Jun 13, 2017

Marius, it's good to hear how important a CorpusReader is to your research. I'll share a few thoughts along these lines …

I think not having ready-made CorpusReaders is a serious mistake. It fails to unlock the full potential of cltk and encourages sloppy work on the part of those who add corpora.

The sloppiness of some of the added corpora is very real. How much of a problem this is depends on what your goals are. Having a stable doc format would be a great help.

On the note of making some kind of universal CLTK format, I have given this some thought. Last summer @lukehollis and I worked out this JSON standard for the API and frontend: https://github.com/cltk/cltk_api/wiki/JSON-data-format-specifications. From the perspective of someone doing literature and NLP, this works for me because it (a) gives me plaintext but also (b) preserves traditional chunking schemas. If this sufficed, we could create a corpus reader for it (the NLTK has none for this format, but we could write our own).

@mlj Do you think that the above JSON structure would work for your needs? If not, do you have models of others that we could look at?

@mlj

Member

mlj commented Jun 13, 2017

I'm absolutely in favour of something that isn't too complex or too over-engineered, otherwise this will be just too much work. So the JSON format is definitely along the right lines!

How do you handle headings etc. in this format? And do you normalise the representation of things like line breaks and paragraph breaks?

One possible drawback of JSON is that it can be difficult to diff unless liberally sprinkled with whitespace in the right places. That may be something to keep in mind since it is likely that we will have to regenerate JSON files from original upstream files quite regularly.

@mlj

Member

mlj commented Jun 13, 2017

BTW, for some other corpus-related projects I have worked on we have simply used a restricted form of Markdown with metadata in YAML-style headers. I can't think of any particular advantages of doing it that way compared to using JSON (although it would come with the bonus of pretty rendering on github :) )

@jtauber

jtauber commented Jun 13, 2017

Are there currently any endpoints serving up texts according to that JSON format? I wouldn't mind supporting it in my reader if there are things to consume.

@kylepjohnson

Member

kylepjohnson commented Jun 13, 2017

I'm absolutely in favour of something that isn't too complex or too over-engineered, otherwise this will be just too much work. So the JSON format is definitely along the right lines!

Encouraging : )

How do you handle headings etc. in this format? And do you normalise the representation of things like line breaks and paragraph breaks?

We talked some about this, but I am not sure we came to a final decision. What I remember suggesting was something along the lines of prepending a number and an underscore, so as to keep the texts sortable in the right order. For example, imagine the Γλῶσσαι of Hesychius. I believe the traditional organization of this is by first letter and then the word itself. So the doc would be something like:

{
   "englishTitle": "Glosses",
   "originalTitle": "Glossae",
   "source": "Somewhere",
   "sourceLink": "http://example.com",
   "language": "greek",
   "meta": "letter-word", 
   "author": "Hesychius",
   "text": {
              "Α": {
                       "1_ ἄγγεα": " Arma virumque cano, Troiae qui primus ab oris", 
                       "2_ἀγγελία": "Italiam, fato profugus, Laviniaque venit",
                        ...
              },

              "Β": {
                        ... 
              }

              ...
    }
}

Books of the Bible, etc. could be done the same way (I got this idea from talking with the terrific Sefaria folks). It looks a little awkward, but the initial n_ is easy to strip while parsing; see the sketch below. @lukehollis and @pletcher would know if we are doing anything like this for the in-progress website.
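A quick sketch of what I mean by stripping and sorting (the filename is hypothetical, standing in for a file in the format above):

import json

with open('hesychius_glossae.json') as f:  # hypothetical file in the format above
    doc = json.load(f)

def sort_key(key):
    # '1_ἄγγεα' -> 1, so entries stay in their traditional order
    return int(key.split('_', 1)[0])

for letter, entries in doc['text'].items():
    for key in sorted(entries, key=sort_key):
        headword = key.split('_', 1)[1].strip()  # strip the initial 'n_'
        print(letter, headword, entries[key])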

One possible drawback of JSON is that it can be difficult to diff unless liberally sprinkled with whitespace in the right places. That may be something to keep in mind since it is likely that we will have to regenerate JSON files from original upstream files quite regularly.

Excellent point! I believe that, if we pretty print according to the same standard, diffs will become possible.
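Concretely, that would just mean always serializing with the same settings; a minimal sketch:

import json

doc = {'author': 'Hesychius', 'text': {}}  # stands in for a full document

# Fixed indentation plus sorted keys means regenerated files diff cleanly;
# ensure_ascii=False keeps Greek readable in the diff
with open('hesychius_glossae.json', 'w') as f:
    json.dump(doc, f, ensure_ascii=False, indent=2, sort_keys=True)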

BTW, for some other corpus-related projects I have worked on we have simply used a restricted form of Markdown with metadata in YAML-style headers. I can't think of any particular advantages of doing it that way compared to using JSON

The key issue here is that we need a data model that can handle data of arbitrary subordination. By arbitrary I mean arbitrary depth (e.g., chapter-section-subsection-subsubsection), arbitrary naming conventions (Ruth, Maccabees, 77, etc.), and arbitrary combinations of chunk types (e.g., book-chapter-section, play-line, book-line). One of my guiding design principles: as simple as possible, but no simpler.

(although it would come with the bonus of pretty rendering on github :) )

Readability on GitHub counts for something in my book! In this case I agree that YAML is not robust enough for an official CLTK markup, though I have always thought that GitHub would work as a great, easy interface for non-programming contribs.

Please let me know if I didn't cover anything in all this!

Also, to be honest, I am not sure what a JSONParseCorpusReader would look like, either in user API or internally. What kind of interface would be good for your needs?
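To give the discussion something concrete, here is an entirely hypothetical sketch of a user API that mirrors the NLTK readers; the recursive walk is what would let it cope with arbitrary subordination:

import json
from nltk.corpus.reader.api import CorpusReader
from nltk.tokenize import word_tokenize

class JSONCorpusReader(CorpusReader):
    """Hypothetical reader over the JSON format sketched above."""

    def docs(self, fileids=None):
        for path in self.abspaths(fileids):
            with open(str(path), encoding='utf-8') as f:
                yield json.load(f)

    def words(self, fileids=None):
        for doc in self.docs(fileids):
            for _, passage in self._walk(doc['text']):
                # word_tokenize is a stand-in; a language-specific
                # tokenizer belongs here
                yield from word_tokenize(passage)

    @classmethod
    def _walk(cls, node, path=()):
        # Dicts are subdivisions of arbitrary depth; strings are passages
        if isinstance(node, dict):
            for key, child in node.items():
                yield from cls._walk(child, path + (key,))
        else:
            yield '.'.join(path), node

# e.g. reader = JSONCorpusReader('/path/to/json/corpus', r'.*\.json')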

cc'ing @diyclassics because he knows more about the NLTK's CorpusReader than I do.

@kylepjohnson

Member

kylepjohnson commented Jun 13, 2017

Are there currently any endpoints serving up texts according to that JSON format? I wouldn't mind supporting it in my reader if there are things to consume.

@jtauber My plans have been to serve this format with the cltk_api (a new version is under slow development here). Earlier I thought that we would use this API to serve the website; however, the frontend's exact needs have not been 100% nailed down, so I have decided to wait until this is worked through. @lukehollis and @pletcher are doing the hard work on this, not me.

Nevertheless, here are examples of the text serving I've done so far:

@todd-cook

Contributor

todd-cook commented Jan 9, 2019

#854 implements a JsonFileCorpusReader for the cltk_jsons folders/files
(the existing NLTK JSON corpus reader was tweet-focused, and not sufficient).

Greek and Latin Perseus files are currently supported. Other corpora may need different readers. I tried to work directly with TEI and XML corpora, but wasn't satisfied.
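Basic usage, assuming the corpus has already been downloaded (entry-point name per the PR, and it may still change):

from cltk.corpus.readers import get_corpus_reader

# Returns an NLTK-style corpus reader over the Perseus JSON files
reader = get_corpus_reader(corpus_name='latin_text_perseus', language='latin')
words = list(reader.words())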

kylepjohnson added a commit that referenced this issue Jan 16, 2019

Add CorpusReader for Greek and Latin perseus cltk json files #361 #615 (
#854)

* Initial releases with unit tests and doctests

* Added sections and preliminary documentation for:
Scansion of Poetry
About the use of macrons in poetry
HexameterScanner
Hexameter
ScansionConstants
Syllabifier
Metrical Validator
ScansionFormatter
StringUtils module

Made minor formatting corrections elsewhere to quiet warnings encountered while transpiling the rst file during testing and verification.

* corrected documentation & doctest comments that were causing errors.
doctests run with an added command line switch:
nosetests --no-skip --with-coverage --cover-package=cltk --with-doctest

* fixing broken doctest comment

* correcting documentation comment that causes doctest to err

* Corrections to make the build pass:
1. added a gensim install to the travis build script; its absence was causing an error in word2vec.py during the build.
2. Modified transcription.py so that the macronizer is initialized on instantiation of the Transcriber class and not at the module level; the macronizer file is 32MB, which also seems to cause an error with travis (GitHub does not make large files displayable, so it may not be available for the build). The macronizer object has been made a component of "self."

* moved package import inside of main so that it does not prevent the build from completing;
soon, we should move to update the dependencies of word2vec; gensim pulls in boto which isn't python3 compliant, there is a boto3 version which we may be able to slot in, but perhaps a larger question is boto necessary?

* correcting documentation

* add JsonFile Corpus Reader for Perseus Greek and Latin cltk json corpora
add better corpus reader documentation
correct annotations and package naming
unit tests for JsonFile Corpus Readers

* improved documentation and a fix for tests

* remove unnecessary coerce to int for sorting sections and subsections

* switch print statement to log statement

* corrected JsonFileCorpusReader to work with arbitrary levels of nested dictionaries

* add perseus corpus types file for assemble_corpus functionality
revise assemble_corpus method to just return a CorpusReader instead of a tuple of CorpusReader and input params
correct latin library corpus types
Revised test_corpus.py file to use setUp; removed the download_test_corpora file, changed the travis script
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment