Add corpus for classical telugu #220

Closed
ghost opened this issue Mar 31, 2016 · 31 comments

@ghost

ghost commented Mar 31, 2016

https://te.wikisource.org/wiki contains the classical Telugu ithihasas, puranas, vedas, stothras, etc.
I would like to scrape them and add them as a new corpus.

Thank you.
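
A minimal sketch of how one page might be fetched for such a scrape, assuming te.wikisource.org exposes the standard MediaWiki Action API at `/w/api.php` and that `action=parse` with `formatversion=2` returns the wikitext as a plain string. The function names and the User-Agent value are hypothetical, and this is not the script actually used for the corpus.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Assumption: the usual MediaWiki API endpoint for this wiki.
API_URL = "https://te.wikisource.org/w/api.php"


def build_query(title):
    """Build an API URL asking for the raw wikitext of one page."""
    params = {
        "action": "parse",
        "page": title,
        "prop": "wikitext",
        "format": "json",
        "formatversion": "2",
    }
    return API_URL + "?" + urlencode(params)


def extract_wikitext(payload):
    """Pull the wikitext string out of a decoded API response."""
    return payload["parse"]["wikitext"]


def fetch_wikitext(title):
    """Download one page; needs network access, so it is not called here."""
    # Hypothetical User-Agent; Wikimedia sites expect a descriptive one.
    request = Request(
        build_query(title),
        headers={"User-Agent": "telugu-corpus-sketch/0.1"},
    )
    with urlopen(request) as response:
        return extract_wikitext(json.load(response))
```

Keeping the URL building and the response parsing in separate functions lets both be checked offline before any request is sent.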

@kylepjohnson
Member

Hi, wonderful task!

Once you've scraped this, we'll need your help to create an index that includes author names in Latin characters and BC/AD dates.

For transliterating author and text names, consider making a little program for Telugu.
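
A "little program" of this kind could look roughly like the sketch below. It assumes a romanization loosely following ISO 15919 conventions, covers only a partial set of Telugu letters, and uses hypothetical names throughout; it is an illustration, not a complete transliterator.

```python
# Partial, illustrative tables (assumption: ISO 15919-style romanization).
CONSONANTS = {
    "క": "k", "ఖ": "kh", "గ": "g", "ఘ": "gh",
    "చ": "c", "ఛ": "ch", "జ": "j", "ఝ": "jh",
    "ట": "ṭ", "ఠ": "ṭh", "డ": "ḍ", "ఢ": "ḍh", "ణ": "ṇ",
    "త": "t", "థ": "th", "ద": "d", "ధ": "dh", "న": "n",
    "ప": "p", "ఫ": "ph", "బ": "b", "భ": "bh", "మ": "m",
    "య": "y", "ర": "r", "ల": "l", "వ": "v",
    "శ": "ś", "ష": "ṣ", "స": "s", "హ": "h", "ళ": "ḷ",
}
INDEPENDENT_VOWELS = {
    "అ": "a", "ఆ": "ā", "ఇ": "i", "ఈ": "ī", "ఉ": "u", "ఊ": "ū",
    "ఎ": "e", "ఏ": "ē", "ఐ": "ai", "ఒ": "o", "ఓ": "ō", "ఔ": "au",
}
# Dependent vowel signs, written as escapes because they are combining marks.
VOWEL_SIGNS = {
    "\u0c3e": "ā", "\u0c3f": "i", "\u0c40": "ī", "\u0c41": "u",
    "\u0c42": "ū", "\u0c46": "e", "\u0c47": "ē", "\u0c48": "ai",
    "\u0c4a": "o", "\u0c4b": "ō", "\u0c4c": "au",
}
VIRAMA = "\u0c4d"  # suppresses the inherent vowel "a"
OTHER_SIGNS = {"\u0c02": "ṁ", "\u0c03": "ḥ"}  # anusvara, visarga


def transliterate(text):
    """Romanize Telugu text; unknown characters pass through unchanged."""
    out = []
    i = 0
    while i < len(text):
        ch = text[i]
        if ch in CONSONANTS:
            out.append(CONSONANTS[ch])
            nxt = text[i + 1] if i + 1 < len(text) else ""
            if nxt in VOWEL_SIGNS:
                out.append(VOWEL_SIGNS[nxt])  # explicit vowel sign
                i += 1
            elif nxt == VIRAMA:
                i += 1  # bare consonant, no vowel
            else:
                out.append("a")  # inherent vowel
        elif ch in INDEPENDENT_VOWELS:
            out.append(INDEPENDENT_VOWELS[ch])
        elif ch in OTHER_SIGNS:
            out.append(OTHER_SIGNS[ch])
        else:
            out.append(ch)
        i += 1
    return "".join(out)


print(transliterate("తెలుగు"))  # prints: telugu
```

A table-driven design keeps the script-specific knowledge in data, so missing letters can be added without touching the loop.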

@ghost
Author

ghost commented Apr 1, 2016

Thank you sir, I will get back once I have done extracting the corpus.

@ghost
Author

ghost commented Apr 3, 2016

Hello sir, I have scraped pothana telugu bhagavatham :
/achaitanyasai/classical-telugu/pothana_telugu_bhagavatham

I will let you know about the other files in 2-3 days.

@kylepjohnson
Member

This is terrific! Before you're done, remember to add a README explaining where you got the texts, and a LICENSE file, too. Also put the scripts you used to scrape the site in here.

Once this is done, it'd be great to offer an index file giving dates for each of these authors.
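
One possible shape for such an index file is sketched below, assuming a JSON list with one entry per work. The field names and the helper `make_index_entry` are hypothetical, and the date is left empty because the thread does not give one.

```python
import json


def make_index_entry(author_latin, work_latin, date_text=None):
    """Describe one work; date_text stays None until a date is confirmed."""
    return {
        "author": author_latin,  # author name in Latin characters
        "work": work_latin,      # title in Latin characters
        "date": date_text,       # e.g. a BC/AD year or range, once known
    }


index = [make_index_entry("Pothana", "Pothana Bhagavatam")]

# ensure_ascii=False keeps any non-ASCII characters readable in the file.
print(json.dumps(index, indent=2, ensure_ascii=False))
```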

@ghost
Author

ghost commented Apr 3, 2016

Thank you for the feedback, and yes, I will add the README, LICENSE and the scraper files.

@ghost
Author

ghost commented Apr 3, 2016

Right now I am putting all the documents in the /classical-telugu repository. Do I need to place each document in a separate repository?
I mean Pothana Bhagavatham in one repository,
Brahma Purana in another separate repository, etc.?

@kylepjohnson
Member

Good question. Generally, we use 1 repo for each language for each website.

"Pothana Bhagavatam" is a translation by Pothana, right? And will there be works by other authors? If so, in the root of the repo you can put Pothana, author2, author3, etc. The work "Pothana Bhagavatam" would then go within the Pothana directory.

Am I understanding the authors and texts right?

@ghost
Author

ghost commented Apr 3, 2016

> "Pothana Bhagavatam" is a translation by Pothana, right?

Yes you are right.

> And will there be works by other authors?

Yes, there are other authors as well.

> The work "Pothana Bhagavatam" would then go within the Pothana directory.

Yeah I got the idea.

> Am I understanding the authors and texts right?

Yes you are perfect :)

@ghost
Author

ghost commented Apr 4, 2016

Ping @kylepjohnson : Added 4 more corpora : https://github.com/achaitanyasai/classical-telugu

I will add the README, LICENSE and scraper files once I am done extracting the documents.

@ghost
Author

ghost commented Apr 5, 2016

Hello sir, I have a small doubt: some documents are just Sanskrit documents written in Telugu script, for example the Valmiki Ramayana.
So do I need to scrape them?

@kylepjohnson
Member

To be clear: These are transliterations, not translations? If these are Sanskrit texts, just in Telugu characters, then no they do not belong in your Telugu repo.

Is there anything else special or unique about these texts?

@ghost
Author

ghost commented Apr 5, 2016

Hello sir, firstly thanks for the clarification and yeah I meant transliterations.

> Is there anything else special or unique about these texts?

No, nothing special/unique.

@kylepjohnson
Member

Ok, let's skip them, then. Thank you for being so careful.

@ghost
Author

ghost commented Apr 5, 2016

Yeah I will skip them and scrape other documents.

@ghost
Author

ghost commented Apr 11, 2016

Hello sir, I need a few more days to complete this task. I will get back to you once I finish collecting the documents.

@kylepjohnson
Member

Thanks for the update. Good work.

@ghost
Author

ghost commented Apr 19, 2016

Ping @kylepjohnson
Hello sir, I have extracted few more documents :

I will add the remaining documents in a day or two.
Also added LICENSE file and I will update the README file in a day.
Do I need to transfer ownership to cltk now, or should I transfer it when it's done?

Thank you.

@kylepjohnson
Member

Good work. Let's transfer it now, since it has a good open source license. I will make you leader of the new Telugu group.

Before doing so, rename it to telugu_text_wikisource.

How many more Classical texts remain? Also, because I know nothing about Telugu literature, I will need some help from you documenting the authors and what years these texts were written. One idea: link an author's English-language Wikipedia page in the README. How does that sound?

Thank you, very exciting! :)

@ghost
Author

ghost commented Apr 20, 2016

Thank you sir for the information and I will transfer now itself.

> How many more Classical texts remain?

There are about 20 - 25 texts. But they won't take much time to scrape. I will do it today itself.

And for the readme, I will document the authors and the dates in a day.

> One idea: link an author's English-language Wikipedia page in the Readme. How does that sound?

Yeah, it sounds good, I will follow this idea if I don't get a better one.

Thank you.

@ghost
Author

ghost commented Apr 20, 2016

Done transferring to cltk : https://github.com/cltk/telugu_text_wikisource
I will add the remaining texts and update the readme very soon.

@ghost
Author

ghost commented Apr 20, 2016

Hello sir, I have added all the remaining texts : https://github.com/cltk/telugu_text_wikisource
I will update README very soon (In a day).

@kylepjohnson
Member

Wow, such a beautiful script. I wish I could read it!

You're doing a great job. Two things to change before finishing:

  • Put all of the texts in a directory called texts in the root of the repo.
  • Move the scrape.py scripts into a directory called scrapers in the root of the repo. Rename each script to explain which text it gets.
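
The requested layout can be sketched as follows, assuming the per-author directories suggested earlier in the thread sit under `texts`. The function `create_layout` and the directory and script names in the usage example are hypothetical placeholders, not the repository's actual contents.

```python
import tempfile
from pathlib import Path


def create_layout(root, works_by_author, scraper_names):
    """Create texts/<author>/<work> and scrapers/<script> under root."""
    root = Path(root)
    for author, works in works_by_author.items():
        for work in works:
            (root / "texts" / author / work).mkdir(parents=True, exist_ok=True)
    scrapers = root / "scrapers"
    scrapers.mkdir(parents=True, exist_ok=True)
    for name in scraper_names:
        (scrapers / name).touch()  # empty placeholder for a scraper script
    # Return every created path, relative to root, for inspection.
    return sorted(p.relative_to(root).as_posix() for p in root.rglob("*"))


if __name__ == "__main__":
    # Build the skeleton in a throwaway directory and list it.
    with tempfile.TemporaryDirectory() as tmp:
        for path in create_layout(
            tmp,
            {"pothana": ["pothana_bhagavatam"]},
            ["scrape_pothana_bhagavatam.py"],
        ):
            print(path)
```

Naming each scraper after the text it fetches, as requested above, makes it clear which script regenerates which directory.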

Thanks so much. Once you're happy with this task, you can make a PR to add these new languages (https://github.com/cltk/cltk/wiki/How-to-add-a-corpus-to-the-CLTK).

@ghost
Author

ghost commented Apr 21, 2016

Thank you sir for the feedback and I will modify accordingly.

@ghost
Author

ghost commented Apr 21, 2016

Done changing the directory structure. Is this fine: /cltk/telugu_text_wikisource?
I will update the readme and make a PR.

@kylepjohnson
Member

This looks fantastic. I just updated the README.

I don't want the PR to wait for the index, so how about I make you a ticket just for adding an index?

Next step: I'd like you to make the PR (and add a new docs/telugu.rst and add telugu to docs/index.rst).

@ghost
Author

ghost commented Apr 21, 2016

Yeah fine, I will make the PR now itself.
UPD: And please make a separate ticket for adding the docs and the indices.

@ghost
Author

ghost commented Apr 21, 2016

Ping @kylepjohnson Done making PR: #263 You can look at it, and I will update readme, add docs very soon.
Thank you.

@kylepjohnson
Member

See my comments on #263 for why I rejected it.

For telugu.rst, you can add this:

Telugu
********

Corpora
=======

Use ``CorpusImporter()`` or browse the `CLTK GitHub repository <https://github.com/cltk>`_ (anything beginning with ``telugu_``) to discover available Telugu corpora.

.. code-block:: python

   In [1]: from cltk.corpus.utils.importer import CorpusImporter

   In [2]: c = CorpusImporter('telugu')

   In [3]: c.list_corpora
   Out[3]:
   ['telugu_text_wikisource']

@ghost
Author

ghost commented Apr 21, 2016

Yeah I have seen it. Thanks a lot.
UPD: I have updated it and made a new PR. Can you check it?

@ghost
Author

ghost commented Aug 23, 2016

As far as I can see, I have parsed all the texts from https://te.wikisource.org/wiki. If I missed something or find a new text, I will update accordingly. Now, coming to your second point, "including author names in Latin characters and BC/AD dates": can you elaborate?

@kylepjohnson
Member

Thank you @achaitanyasai for seeing this through.

Let's DM on gitter about what else you want to help out on!
