Docs for Accessing Corpora #615

Open · SigmaX opened this issue Dec 12, 2017 · 19 comments

@SigmaX

SigmaX commented Dec 12, 2017

Either I'm missing something obvious (which is likely), or CLTK offers no documentation on how to use the various corpora the project provides.

After importing the greek_text_perseus corpus, for example, its README.md tells me

This repository holds the Greek files made available by the [Perseus Project](http://www.perseus.tufts.edu/hopper/opensource/download). See the CLTK's docs for instructions on how to use these files.

The docs, however, only cover how to download corpora and how to process raw text already stored in a Python variable, omitting the intermediate steps. There is no mention of how one might load a corpus after downloading it (which, I see from this external blog, seems to be a thing?), or of how one might otherwise get hold of a CorpusReader object (assuming such a thing exists, which is not clear from the docs).

From all this I infer that we are intended to:

  1. Use CLTK to conveniently download corpora, but not to load them.
  2. Use NLTK or some other 3rd-party tool to load the corpora directly from the resulting text or XML files.
  3. Proceed as usual with NLP analysis, turning to CLTK only when we need language-specific processing capabilities at a low level.

Am I correct in piecing together this puzzle? If so, I haven't seen such a scheme spelled out anywhere in the docs. Perhaps I am blind to something?
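
For concreteness, here is a minimal sketch of that three-step workflow as I understand it. Step 1 uses the documented CorpusImporter; the reader, root path, and file pattern in steps 2 and 3 are my own guesses, not anything from the docs:

    import os
    from cltk.corpus.utils.importer import CorpusImporter
    from nltk.corpus.reader import PlaintextCorpusReader

    # Step 1: CLTK downloads the corpus into ~/cltk_data/.
    importer = CorpusImporter('greek')
    importer.import_corpus('greek_text_perseus')

    # Step 2: load the downloaded files with a third-party reader (NLTK here).
    # The root path and file pattern are guesses about the on-disk layout.
    root = os.path.expanduser('~/cltk_data/greek/text/greek_text_perseus/')
    reader = PlaintextCorpusReader(root, r'.*\.xml')

    # Step 3: proceed with ordinary NLP analysis, reaching for CLTK's
    # language-specific tools as needed.
    print(reader.fileids()[:5])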

@clemsciences

Member

clemsciences commented Dec 12, 2017

I totally agree. I had the same thoughts when I used CLTK for the first time. My suggestion for explaining this is to write a new tutorial that covers that intermediate part.

@diyclassics

Contributor

diyclassics commented Dec 12, 2017

@SigmaX—thanks for starting this discussion. There is a lot of work that could be (and should be) done to make the corpora easier to work with. This is why I wrote a PlaintextCorpusReader wrapper for the CLTK Latin Library corpus (basically, item 2 from your list); cf. https://disiectamembra.wordpress.com/2016/08/11/working-with-the-latin-library-corpus-in-cltk/. I'll be sure to add this functionality to the docs.
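
In outline, usage of that wrapper looks like this (a sketch only; the import path is taken from the blog post and assumes the corpus has already been downloaded):

    # Sketch: the PlaintextCorpusReader wrapper from the blog post above.
    from cltk.corpus.latin import latinlibrary

    print(latinlibrary.fileids()[:5])  # a few files in the corpus
    print(latinlibrary.words()[:10])   # a lazy stream of tokens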

There has been some discussion here of adding more wrappers like this, esp. XMLCorpusReader wrappers for the Perseus texts (cf. #554). If there is interest, I can revisit this. I'd be happy to hear which other corpora you would like better access to as well.

Also, my guess is that item 3 from your list is how CLTK is used for the most part. But in the interest of a self-contained NLP workflow, I think a better-defined pipeline from corpus/data to analysis would be worth pursuing.

@SigmaX

Author

SigmaX commented Dec 12, 2017

@diyclassics The first thing I did was hack together my own (probably flawed) XML reader for Perseus, so I agree that providing one would be a useful feature!

IMO, though, 80% of the problem can be solved just by pointing out in the docs that new users will have to find a way to parse the corpora themselves. In my case, it took me longer to figure out that I needed to manually parse the corpus I was interested in than it did to actually parse it!

Thanks so much for providing these tools! It's an exciting time to be alive.

@kylepjohnson

Member

kylepjohnson commented Dec 13, 2017

Thank you @SigmaX for raising this valid issue. The challenge with the "toolkit" idea is that things can get messy as more and more contributions come in (a good problem to have, I suppose :)

For tutorials we have a dedicated repo, though it has never been polished enough that I wanted to promote it in the docs: https://github.com/cltk/tutorials. Someone could make some notebooks illustrating how to put the pieces together.

Our Greek corpus reader uses a 3rd party tool (MyCapytain) which currently only works for Greek: http://docs.cltk.org/en/latest/greek.html#tei-xml.

The first thing I did is hack together my own (probably flawed) XML reader for Perseus

We'd love to see it. Could you drop it in a gist, with an example of how to run it, so we can take a look?

@SigmaX

Author

SigmaX commented Dec 13, 2017

@kylepjohnson: well, all I did was write a brittle loop that pulled out the text inside every <p> or <q> element, which seemed adequate for the specific text I was working with (Perseus' Meditations). The TEI DTD is very intricate, however, so it would take some work to determine exactly what is needed to generalize accurately to arbitrary corpora.

I also took a stab at getting NLTK's XMLCorpusReader to work.

  • The first issue I encountered is that it pulls ElementTree's text attribute from every tag (which seemed reasonable), but in ElementTree's (somewhat strange) DOM interpretation, text turns up empty or incomplete if empty tags occur inside the text. So I modified XMLCorpusReader to also pull ElementTree's tail attribute—this way it really does extract all text from every node (see the sketch below).
  • The second is that XMLCorpusReader croaks on entity references defined in the TEI DTD. Trying to load ~/cltk_data/greek/text/greek_text_perseus/Epictetus/opensource/epictetus_gk.xml, for instance, yields ParseError: undefined entity &responsibility;: line 13, column 0.
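
To illustrate both points, here is a minimal sketch (my own code, not the actual XMLCorpusReader patch) of collecting text and tail from every node, plus the entity workaround:

    import xml.etree.ElementTree as ET

    def all_text(elem):
        # Collect .text and .tail from every node, so text interrupted by
        # empty inline tags (e.g. <milestone/>) is not lost.
        parts = [elem.text or '']
        for child in elem:
            parts.append(all_text(child))
            parts.append(child.tail or '')
        return ''.join(parts)

    # Workaround for undefined entities: pre-register them on the parser.
    # The entity name is the one from the error message above.
    parser = ET.XMLParser()
    parser.entity['responsibility'] = ''
    tree = ET.parse('epictetus_gk.xml', parser=parser)  # path shortened
    print(all_text(tree.getroot())[:200])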

Of course, you've already noted (#554) that the XMLCorpusReader strategy, when it does work, pulls in unwanted metadata anyway. And now that I realize CapiTainS/MyCapytain exists (#560), I'll probably try going that route next.

@kylepjohnson

Member

kylepjohnson commented Dec 14, 2017

And now that I realize CapiTainS/MyCapytain exists (#560), I'll probably try going that route next.

Because of the complexity of TEI, I think this is probably the best thing to do.

Since you're clearly an able coder, we'd be interested in seeing how you solve this one, even if you don't think the code is production-ready. And of course we're happy to help with any issues you want an extra pair of eyes on.

Should we close this?

@SigmaX

Author

SigmaX commented Dec 14, 2017

Should we close this?

Personally I'd leave it open until there is at least a sentence in the docs either pointing to a tutorial or saying "you need to figure out how to import the corpora yourself" or such.

@kylepjohnson

Member

kylepjohnson commented Dec 14, 2017

Sure. I'll take care of this and post back here for people to comment on :)

@jtauber


jtauber commented Dec 14, 2017

The helper code that Eldarion is developing on top of MyCapytain for the new Perseus will likely help with this. It will hopefully be open-sourced in the next month or so.

@markomanninen


markomanninen commented Jan 29, 2018

Could someone point me to a tutorial where the Greek Perseus corpora are used to read a text, say Homer's Iliad, in a plain readable format?

@diyclassics

Contributor

diyclassics commented Jan 29, 2018

@markomanninen—a Perseus reader is still an open issue (#361, e.g.).

There are some ways to go about it outside of CLTK, though:

  1. MyCapytain (cc: @PonteIneptique) is one option: http://mycapytain.readthedocs.io/en/latest/
  2. I show in this tutorial/notebook how to get plaintext Perseus texts using requests/lxml: https://github.com/diyclassics/perseus-experiments/blob/master/Perseus%20Plaintext%20Poetry.ipynb

Let me know how these options work for you, and I'll move XML readers up in my CLTK work queue.
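
The second approach boils down to something like this (a sketch only; the URL is a placeholder, and the notebook shows the real Perseus endpoints and the filtering you'd actually want):

    import requests
    from lxml import etree

    # Placeholder URL; see the notebook above for real Perseus endpoints.
    resp = requests.get('https://example.org/some-perseus-text.xml')
    root = etree.fromstring(resp.content)

    # Crude plaintext: join all text nodes. Real code should skip
    # non-text elements such as <teiHeader>.
    plaintext = ' '.join(root.itertext())
    print(plaintext[:200])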

@jtauber


jtauber commented Jan 29, 2018

Let me help a little by adding an issue to Scaife to provide a plain text render of a passage directly on Perseus.

scaife-viewer/scaife-viewer#213

@markomanninen


markomanninen commented Jan 30, 2018

Thanks for the quick feedback. I'll try that notebook; it looks good to me. But could I use the corpora already imported on my local machine? I mean, the CLTK corpus import works fine and I can see the files in my home directory. The problem is that the XML or JSON format needs to be parsed, and some interface for retrieving chapters and verses would be helpful. Anyway, let me try.

@PonteIneptique

Member

PonteIneptique commented Jan 30, 2018

Hi @markomanninen, some context for local corpus use: the CapiTainS.org guidelines are used by Perseus and the OpenGreekAndLatin project to encode their texts. The requirements for the TEI XML are quite small (see the Guidelines), so if you are working with XML files from other providers, you could easily "convert" them to be readable by CapiTainS tools.

Once you have those files, there are a few ways to deal with them:
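
For instance, you can iterate over a local CapiTainS-compliant file with MyCapytain roughly like this (a sketch based on the MyCapytain 2.x README; the file name is a placeholder):

    from MyCapytain.resources.texts.local.capitains.cts import CapitainsCtsText
    from MyCapytain.common.constants import Mimetypes

    with open('my_capitains_text.xml') as f:  # placeholder file name
        text = CapitainsCtsText(resource=f)

    # Iterate over the deepest citation level (e.g. chapter.section)
    # and export each passage as plaintext.
    for ref in text.getReffs(level=len(text.citation)):
        passage = text.getTextualNode(subreference=ref, simple=True)
        print(ref, passage.export(Mimetypes.PLAINTEXT))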

Finally, there has been a course using it in SunoikisisDC

@markomanninen


markomanninen commented Jan 30, 2018

Thanks @PonteIneptique. It looks like MyCapytain requires an lxml.etree version lower than 3.8.0, and those versions are not compatible with my Windows system. I could try to install an earlier version of MyCapytain, or try other ways of parsing the XML data...

@PonteIneptique

Member

PonteIneptique commented Jan 30, 2018

@markomanninen I think it should work with <3.8.0. Maybe we could move the conversation elsewhere, but it seems there is a Windows wheel for 3.8.0: https://pypi.python.org/pypi/lxml/3.8.0 ?

@markomanninen


markomanninen commented Jan 30, 2018

Yeah, this is another issue. My sys.version is:

3.5.4 |Continuum Analytics, Inc.| (default, Aug 14 2017, 13:41:13) [MSC v.1900 64 bit (AMD64)]

but from lxml import etree gives an error. Only this one works:

import xml.etree.cElementTree as etree

which should be a valid import for Python 2.5+ (ref: http://lxml.de/tutorial.html).
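
For reference, the fallback chain suggested in that tutorial is simply:

    # Import fallback from the lxml tutorial linked above.
    try:
        from lxml import etree
    except ImportError:
        import xml.etree.cElementTree as etree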

So to get MyCapytain to work on my system, I would need to fork and modify that module first...

@markomanninen


markomanninen commented Jan 31, 2018

I made a gist for parsing a local file. This is a very raw version, not using any XML parser, so it might have some issues:

https://gist.github.com/markomanninen/a68f200b4e98f018d7618dab0365ffe5#file-perseus_local_file_parser-py

@swasheck


swasheck commented May 28, 2018

This is still an issue. Once the corpus has been downloaded, what can we do? It seems like the primary value is to use all of the sets as training sets and then ... uhh ... analyze something ... somehow. Additionally, the "latest" documentation is broken ("Concordance" doesn't work, as there's no write_concordance_from_file() method anymore, I guess?).

kylepjohnson added a commit that referenced this issue Jan 16, 2019

Add CorpusReader for Greek and Latin Perseus CLTK JSON files #361 #615 (#854)

* Initial releases with unit tests and doctests

* Added sections and preliminary documentation for:
Scansion of Poetry
About the use of macrons in poetry
HexameterScanner
Hexameter
ScansionConstants
Syllabifier
Metrical Validator
ScansionFormatter
StringUtils module

Made minor formatting corrections elsewhere to quiet warnings encountered during transpiling the rst file during testing and verification.

* corrected documentation & doctest comments that were causing errors.
doctests run with an added command line switch:
nosetests --no-skip --with-coverage --cover-package=cltk --with-doctest

* fixing broken doctest comment

* correcting documentation comment that causes doctest to err

* Corrections to make the build pass:
1. Added gensim to the Travis build script; its absence was causing an error in word2vec.py during the build.
2. Modified transcription.py so that the macronizer is initialized on instantiation of the Transcriber class rather than at the module level; the macronizer file is 32 MB, which also seems to cause an error with Travis, since GitHub does not make large files displayable and the file may therefore not be available for the build. The macronizer object has been made a component of "self".

* Moved a package import inside of main so that it does not prevent the build from completing;
soon we should update the dependencies of word2vec: gensim pulls in boto, which isn't Python 3 compliant. There is a boto3 version we may be able to slot in, but perhaps the larger question is whether boto is necessary at all.

* correcting documentation

* add JsonFile Corpus Reader for Perseus Greek and Latin cltk json corpora
add better corpus reader documentation
correct annotations and package naming
unit tests for JsonFile Corpus Readers

* improved documentation and a fix for tests

* remove unnecessary coerce to int for sorting sections and subsections

* switch print statement to log statement

* corrected JsonFileCorpusReader to work with arbitrary levels of nested dictionaries

* add perseus corpus types file for assemble_corpus functionality
revise assemble_corpus method to just return a CorpusReader instead of a tuple of CorpusReader and input params
correct latin library corpus types
Revised test_corpus.py file to use setUp; removed the download_test_corpora file, changed the travis script
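
Illustrative usage of the reader added in this commit (a sketch; the import path and exact signature are assumptions based on the PR description, and the corpus must already be downloaded):

    from cltk.corpus.readers import get_corpus_reader

    # Assumed API per the PR: build a reader over the Perseus CLTK JSON
    # corpus (the corpus name is an example).
    reader = get_corpus_reader(language='latin', corpus_name='latin_text_perseus')
    print(sum(1 for _ in reader.docs()))  # count the JSON documents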