This repository has been archived by the owner on Aug 28, 2020. It is now read-only.

Strategies for parsing / ingesting corpora - latin_text_latin_library #4

Open
lukehollis opened this issue Nov 22, 2015 · 3 comments

@lukehollis
Member

The beginnings of a solution are in https://github.com/cltk/cltk_api/blob/ingest/ingest/learn/latin_library.py, but we've discussed the difficulty of incorporating the TLL files here, and it's very likely that the added benefit at this stage is outweighed by the programming effort of attempting to parse and infer useful metadata.
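
For reference, here is a minimal sketch of the kind of heuristic metadata inference we've been discussing, assuming a local clone of plaintext files where the first non-blank line is usually a heading. The directory path, the .txt extension, and the heading heuristic are illustrative assumptions, not the actual latin_library.py implementation:

```python
# Sketch only: infer rough metadata from the first lines of each plaintext file.
# File layout, extensions, and heading conventions here are assumptions.
import os
import re

def infer_metadata(path):
    """Guess a title for a Latin Library plaintext file.

    Assumes (hypothetically) that the first non-blank line is a short heading;
    inconsistently marked-up files will still need manual review.
    """
    with open(path, encoding="utf-8", errors="replace") as f:
        lines = [line.strip() for line in f if line.strip()]
    if not lines:
        return {"file": path, "title": None}
    # Headings tend to be short and not end in sentence punctuation;
    # anything else is probably body text, so fall back to the filename.
    first = lines[0]
    title = first if len(first) < 80 and not re.search(r"[.;:]$", first) else None
    return {
        "file": path,
        "title": title or os.path.splitext(os.path.basename(path))[0],
    }

def ingest_corpus(root):
    """Walk a local copy of the corpus and collect per-file metadata."""
    records = []
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            if name.endswith(".txt"):
                records.append(infer_metadata(os.path.join(dirpath, name)))
    return records

if __name__ == "__main__":
    # "latin_text_latin_library/" is a placeholder path to a local clone.
    for record in ingest_corpus("latin_text_latin_library")[:10]:
        print(record)
```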

@kylepjohnson
Member

I want both TLL and Lacus Curtius in the API. My perception is that it will be easier to wait on these until we have settled upon a data structure we know will work.

Idea: how about we get through the first milestone of serving the API from texts, and then have you pick it up in the frontend? I say this for three reasons: (1) to avoid duplicating effort; (2) the texts (especially TLL) are so inconsistently marked up that it may be better to find someone to copy-paste them into the form we want; (3) I would like to reach out to Bill Thayer to talk about getting the Greek LC files too, since I never wrote a scraper for them years back. He knows those files well and could be of service for the corpora.

Does this sound logical?

@lukehollis
Member Author

Sounds great to deprioritize these two for right now!

@kylepjohnson
Member

I appreciate you looking ahead. With all these included -- Pers, TLL, and LC -- the site will be a formidable resource.

