Add corpus for classical telugu #220

ghost · 2016-03-31T06:52:01Z

https://te.wikisource.org/wiki contains the classical Telugu ithihasas, puranas, vedas, stothras, etc.
I would like to scrape them and add them as a new corpus.

Thank you.
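One possible way to approach the scraping step proposed above is the MediaWiki API that Wikisource exposes, rather than parsing rendered HTML. The following is a minimal sketch only, not the scraper actually used for this corpus: the helper names `build_query_url` and `extract_wikitext` are hypothetical, the sample response is hand-written, and the network call is left in comments with a placeholder page title.

```python
import json
from urllib.parse import urlencode

# Standard MediaWiki API endpoint for Telugu Wikisource.
API_URL = "https://te.wikisource.org/w/api.php"


def build_query_url(title):
    """Build a URL requesting the latest wikitext revision of one page."""
    params = {
        "action": "query",
        "prop": "revisions",
        "rvprop": "content",
        "rvslots": "main",
        "titles": title,
        "format": "json",
        "formatversion": "2",
    }
    return API_URL + "?" + urlencode(params)


def extract_wikitext(response_text):
    """Return the wikitext from a formatversion=2 JSON response.

    Returns None when the page is missing or has no revisions.
    """
    data = json.loads(response_text)
    pages = data.get("query", {}).get("pages", [])
    if not pages or pages[0].get("missing"):
        return None
    revisions = pages[0].get("revisions", [])
    if not revisions:
        return None
    return revisions[0]["slots"]["main"]["content"]


# Offline demonstration with a hand-written response of the expected shape.
sample = json.dumps(
    {"query": {"pages": [{"title": "Example", "revisions": [
        {"slots": {"main": {"content": "sample wikitext"}}}]}]}}
)
print(extract_wikitext(sample))

# To fetch a real page (requires network access; Wikimedia asks clients
# to send a descriptive User-Agent header):
#   from urllib.request import Request, urlopen
#   req = Request(build_query_url("PAGE_TITLE"),
#                 headers={"User-Agent": "my-corpus-scraper/0.1 (contact info)"})
#   with urlopen(req) as resp:
#       text = extract_wikitext(resp.read().decode("utf-8"))
```

Keeping URL construction and response parsing as separate pure functions means both can be checked without touching the network.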

kylepjohnson · 2016-04-01T00:51:10Z

Hi, wonderful task!

Once you've scraped this, we'll need your help to create an index that includes author names in Latin characters and BC/AD dates.

For transliterating author and text names, consider making a little program for Telugu.
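The "little program" suggested above might look something like the following. This is a minimal sketch under stated assumptions: the function name `transliterate` is hypothetical, the romanization is loosely modelled on ISO 15919, and the character tables are deliberately partial (a real tool would need the full Telugu block). It relies on the fact that Telugu is an abugida: each consonant carries an inherent "a" that is replaced by a following vowel sign or suppressed by the virama.

```python
# Partial tables only; extend them before any real use.
CONSONANTS = {
    "\u0C15": "k",  # TELUGU LETTER KA
    "\u0C17": "g",  # TELUGU LETTER GA
    "\u0C24": "t",  # TELUGU LETTER TA
    "\u0C28": "n",  # TELUGU LETTER NA
    "\u0C2A": "p",  # TELUGU LETTER PA
    "\u0C2E": "m",  # TELUGU LETTER MA
    "\u0C30": "r",  # TELUGU LETTER RA
    "\u0C32": "l",  # TELUGU LETTER LA
    "\u0C35": "v",  # TELUGU LETTER VA
    "\u0C38": "s",  # TELUGU LETTER SA
}
INDEPENDENT_VOWELS = {
    "\u0C05": "a", "\u0C06": "ā", "\u0C07": "i", "\u0C08": "ī",
    "\u0C09": "u", "\u0C0A": "ū", "\u0C0E": "e", "\u0C0F": "ē",
    "\u0C12": "o", "\u0C13": "ō",
}
VOWEL_SIGNS = {
    "\u0C3E": "ā", "\u0C3F": "i", "\u0C40": "ī", "\u0C41": "u",
    "\u0C42": "ū", "\u0C46": "e", "\u0C47": "ē", "\u0C4A": "o",
    "\u0C4B": "ō",
}
OTHER_SIGNS = {"\u0C02": "ṁ"}  # TELUGU SIGN ANUSVARA
VIRAMA = "\u0C4D"  # suppresses the inherent vowel


def transliterate(text):
    """Romanize Telugu text; characters outside the tables pass through."""
    out = []
    pending_a = False  # True while a consonant's inherent "a" is unresolved
    for ch in text:
        if ch in CONSONANTS:
            if pending_a:
                out.append("a")
            out.append(CONSONANTS[ch])
            pending_a = True
        elif ch in VOWEL_SIGNS:
            out.append(VOWEL_SIGNS[ch])  # replaces the inherent "a"
            pending_a = False
        elif ch == VIRAMA:
            pending_a = False  # bare consonant, no vowel
        else:
            if pending_a:
                out.append("a")
                pending_a = False
            out.append(INDEPENDENT_VOWELS.get(ch, OTHER_SIGNS.get(ch, ch)))
    if pending_a:
        out.append("a")
    return "".join(out)


print(transliterate("\u0C2A\u0C4B\u0C24\u0C28"))  # pōtana
print(transliterate("\u0C24\u0C46\u0C32\u0C41\u0C17\u0C41"))  # telugu
```

The output keeps diacritics; conventional English spellings such as "Pothana" would still need to be entered by hand or mapped separately.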

ghost · 2016-04-01T04:00:06Z

Thank you sir, I will get back to you once I have finished extracting the corpus.

ghost · 2016-04-03T17:58:34Z

Hello sir, I have scraped the Pothana Telugu Bhagavatham:
/achaitanyasai/classical-telugu/pothana_telugu_bhagavatham

I will let you know about the other files in 2-3 days.

kylepjohnson · 2016-04-03T18:22:57Z

This is terrific! Before you're done, remember to add a README explaining where you got the texts, and a LICENSE file, too. Also include the scripts you used to scrape the site.

Once this is done, it'd be great to offer an index file giving dates for each of these authors.

ghost · 2016-04-03T18:33:28Z

Thank you for the feedback. Yes, I will add the README, LICENSE and the scraper files.

ghost · 2016-04-03T18:38:50Z

Right now I am putting all the documents in the /classical-telugu repository. Do I need to place each document in a separate repository?
I mean Pothana Bhagavatham in one repository,
Brahma Purana in another separate repository, etc.?

kylepjohnson · 2016-04-03T18:53:54Z

Good question. Generally, we use 1 repo for each language for each website.

"Pothana Bhagavatam" is a translation by Pothana, right? And will there be works by other authors? If so, in the root of the repo you can put Pothana, author2, author3, etc. The work "Pothana Bhagavatam" would then go within the Pothana directory.

Am I understanding the authors and texts right?

ghost · 2016-04-03T18:58:25Z

"Pothana Bhagavatam" is a translation by Pothana, right?

Yes you are right.

And will there be works by other authors?

Yes, there are other authors as well.

The work "Pothana Bhagavatam" would then go within the Pothana directory.

Yeah I got the idea.

Am I understanding the authors and texts right?

Yes you are perfect :)

ghost · 2016-04-04T13:04:12Z

Ping @kylepjohnson : Added 4 more corpora : https://github.com/achaitanyasai/classical-telugu

I will add the README, LICENSE, and scraper files once I am done extracting the documents.

ghost · 2016-04-05T12:11:58Z

Hello sir, I have a small question: some documents are just Sanskrit documents written in Telugu script, for example the Valmiki Ramayana.
Do I need to scrape them?

kylepjohnson · 2016-04-05T17:33:41Z

To be clear: these are transliterations, not translations? If these are Sanskrit texts, just in Telugu characters, then no, they do not belong in your Telugu repo.

Is there anything else special or unique about these texts?

ghost · 2016-04-05T17:38:44Z

Hello sir, firstly thanks for the clarification and yeah I meant transliterations.

Is there anything else special or unique about these texts?

No, nothing special/unique.

kylepjohnson · 2016-04-05T17:50:47Z

Ok, let's skip them, then. Thank you for being so careful.

ghost · 2016-04-05T18:05:54Z

Yeah I will skip them and scrape other documents.

ghost · 2016-04-11T17:42:02Z

Hello sir, I need a few more days to complete this task. I will get back to you once I finish collecting the documents.

kylepjohnson · 2016-04-11T17:42:42Z

Thanks for the update. Good work.

ghost · 2016-04-19T13:12:18Z

Ping @kylepjohnson
Hello sir, I have extracted a few more documents:

I will add the remaining documents in a day or two.
I have also added the LICENSE file, and I will update the README file in a day.
Do I need to transfer ownership to cltk now, or should I transfer it when it's done?

Thank you.

kylepjohnson · 2016-04-19T18:29:07Z

Good work. Let's transfer it now, since it has a good open source license. I will make you leader of the new Telugu group.

Before doing so, rename it to telugu_text_wikisource.

How many more Classical texts remain? Also, because I know nothing about Telugu literature, I will need some help from you documenting the authors and what years these texts were written. One idea: link an author's English-language Wikipedia page in the README. How does that sound?
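One way the author index discussed in this thread could be represented is sketched below. Everything here is an assumption for illustration, not an agreed CLTK format: the `AuthorEntry` fields, the signed-year convention (negative for BC, positive for AD), and the helper names are all hypothetical. The dates and Wikipedia link are deliberately left empty rather than guessed.

```python
import json
from dataclasses import asdict, dataclass
from typing import List, Optional


@dataclass
class AuthorEntry:
    name_telugu: str
    name_latin: str
    # Signed years: negative for BC, positive for AD.
    # None means "not yet documented".
    year_from: Optional[int] = None
    year_to: Optional[int] = None
    wikipedia_url: Optional[str] = None


def format_year(year: Optional[int]) -> str:
    """Render a signed year as a BC/AD string."""
    if year is None:
        return "unknown"
    if year == 0:
        raise ValueError("there is no year 0 in BC/AD reckoning")
    return f"{-year} BC" if year < 0 else f"AD {year}"


def missing_fields(entry: AuthorEntry) -> List[str]:
    """List the fields of an entry that still need to be filled in."""
    return [name for name, value in asdict(entry).items() if value is None]


def dump_index(entries: List[AuthorEntry]) -> str:
    """Serialize the index as JSON, keeping Telugu characters readable."""
    return json.dumps([asdict(e) for e in entries], ensure_ascii=False, indent=2)


# Example entry for the author named in this thread; dates are left unset.
pothana = AuthorEntry(name_telugu="\u0C2A\u0C4B\u0C24\u0C28", name_latin="Pothana")
print(missing_fields(pothana))
print(dump_index([pothana]))
```

A helper like `missing_fields` makes it easy to see which authors still need dates or a Wikipedia link before the index is considered complete.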

Thank you, very exciting! :)

ghost · 2016-04-20T03:12:46Z

Thank you sir for the information. I will transfer it right away.

How many more Classical texts remain?

There are about 20-25 texts, but they won't take much time to scrape. I will do it today.

And for the readme, I will document the authors and the dates in a day.

One idea: link an author's English-language Wikipedia page in the Readme. How does that sound?

Yeah, it sounds good, I will follow this idea if I don't get a better one.

Thank you.

ghost · 2016-04-20T06:22:44Z

Done transferring to cltk : https://github.com/cltk/telugu_text_wikisource
I will add the remaining texts and update the readme very soon.

ghost · 2016-04-20T12:41:39Z

Hello sir, I have added all the remaining texts: https://github.com/cltk/telugu_text_wikisource
I will update the README very soon (in a day).

kylepjohnson · 2016-04-20T18:08:18Z

Wow, such a beautiful script. I wish I could read it!

You're doing a great job. Two things to change before finishing:

Put all of the texts in a directory called texts in the root of the repo.
Move the scrape.py scripts into a directory called scrapers in the root of the repo. Rename each script to explain which text it gets.
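The layout requested above can be sketched as a pair of path helpers plus a check for the expected top-level entries. This is illustrative only: nesting an author directory inside `texts` combines two separate suggestions from this thread and is an assumption, the `scrape_<work>.py` naming is hypothetical, and the function names are invented for the example.

```python
from pathlib import PurePosixPath

# Top-level entries requested at various points in this thread.
REQUIRED_TOP_LEVEL = ("README", "LICENSE", "texts", "scrapers")


def text_path(author, work, filename):
    """Path of one text file: texts/<author>/<work>/<filename> (assumed nesting)."""
    return PurePosixPath("texts") / author / work / filename


def scraper_path(work):
    """Path of the script that fetches one work (hypothetical naming scheme)."""
    return PurePosixPath("scrapers") / f"scrape_{work}.py"


def missing_top_level(names):
    """Return required entries absent from a listing of the repo root.

    File extensions are ignored, so "README.md" counts as "README".
    """
    stems = {name.split(".")[0] for name in names}
    return [required for required in REQUIRED_TOP_LEVEL if required not in stems]


print(text_path("pothana", "pothana_telugu_bhagavatham", "1.txt"))
print(scraper_path("pothana_telugu_bhagavatham"))
print(missing_top_level(["README.md", "LICENSE", "texts"]))
```

Naming each scraper after the work it fetches keeps the provenance of every text traceable from the repository alone.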

Thanks so much. Once you're happy with this task, you can make a PR to add these new languages (https://github.com/cltk/cltk/wiki/How-to-add-a-corpus-to-the-CLTK).

ghost · 2016-04-21T03:05:16Z

Thank you sir for the feedback and I will modify accordingly.

ghost · 2016-04-21T12:17:47Z

I have changed the directory structure. Is this fine: /cltk/telugu_text_wikisource?
I will update the README and make a PR.

kylepjohnson · 2016-04-21T16:41:38Z

This looks fantastic. I just updated the README.

I don't want the PR to wait for the index, so how about I make you a ticket just for adding an index?

Next step: I'd like you to make the PR (and add a new docs/telugu.rst and add telugu to docs/index.rst).

ghost · 2016-04-21T16:46:31Z

Yes, that's fine. I will make the PR right away.
UPD: And I will make a separate ticket for adding docs and adding indices.

ghost · 2016-04-21T17:12:32Z

Ping @kylepjohnson: I have made the PR: #263. You can look at it, and I will update the README and add docs very soon.
Thank you.

kylepjohnson · 2016-04-21T17:17:27Z

See my comments on #263 for why I rejected it.

For telugu.rst, you can add this:

Telugu
********

Corpora
=======

Use ``CorpusImporter()`` or browse the `CLTK GitHub repository <https://github.com/cltk>`_ (anything beginning with ``telugu_``) to discover available Telugu corpora.

.. code-block:: python

   In [1]: from cltk.corpus.utils.importer import CorpusImporter

   In [2]: c = CorpusImporter('telugu')

   In [3]: c.list_corpora
   Out[3]:
   ['telugu_text_wikisource']

ghost · 2016-04-21T17:19:30Z

Yeah I have seen it. Thanks a lot.
UPD: I have updated it and made a new PR. Can you check it?

ghost · 2016-08-23T06:02:38Z

As far as I can see, I have parsed all the texts from https://te.wikisource.org/wiki. If I have missed something or find a new text, I will update the repository accordingly. Now, coming to your second point, "including author names in Latin characters and BC/AD dates": can you elaborate?

kylepjohnson · 2016-09-06T04:10:56Z

Thank you @achaitanyasai for seeing this through.

Let's DM on gitter about what else you want to help out on!

kylepjohnson assigned ghost Apr 1, 2016

kylepjohnson added the new corpus label Apr 1, 2016

kylepjohnson mentioned this issue Apr 21, 2016

Add index to telugu_text_wikisource repo #262

Closed

kylepjohnson closed this as completed Sep 6, 2016
