Add corpus for classical telugu #220
Comments
Hi, wonderful task! Once you've scraped this, we'll need your help to create an index that includes author names in Latin characters and BC/AD dates. For transliterating author and text names, consider writing a little program for Telugu.
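For reference, a minimal sketch of what such a "little program" might look like in Python. It assumes an ISO 15919-style romanization and covers only the core Telugu vowels, consonants, vowel signs, anusvara and visarga; the function name `transliterate` and the table layout are hypothetical, not something agreed in this thread.

```python
"""Rough Telugu-to-Latin transliteration (assumed ISO 15919-style output)."""


def _table(start, names):
    # Map consecutive code points beginning at `start` to romanizations.
    # A "-" entry marks a code point this sketch deliberately skips.
    return {chr(start + i): r for i, r in enumerate(names.split()) if r != "-"}


# Independent vowels, U+0C05..U+0C14
VOWELS = _table(0x0C05, "a ā i ī u ū ṛ - - e ē ai - o ō au")
# Consonants, U+0C15..U+0C39, romanized without the inherent "a"
CONSONANTS = _table(
    0x0C15,
    "k kh g gh ṅ c ch j jh ñ ṭ ṭh ḍ ḍh ṇ t th d dh n - "
    "p ph b bh m y r ṟ l ḷ ḻ v ś ṣ s h",
)
# Dependent vowel signs, U+0C3E..U+0C4C
VOWEL_SIGNS = _table(0x0C3E, "ā i ī u ū ṛ ṝ - e ē ai - o ō au")
# Anusvara and visarga
SIGNS = {"\u0C02": "ṁ", "\u0C03": "ḥ"}
VIRAMA = "\u0C4D"


def transliterate(text):
    """Romanize Telugu text; characters not in the tables pass through unchanged."""
    out = []
    pending_a = False  # True while a consonant still owes its inherent "a"
    for ch in text:
        if ch in CONSONANTS:
            if pending_a:
                out.append("a")
            out.append(CONSONANTS[ch])
            pending_a = True
        elif ch in VOWEL_SIGNS:
            # A vowel sign replaces the inherent "a" of the preceding consonant.
            out.append(VOWEL_SIGNS[ch])
            pending_a = False
        elif ch == VIRAMA:
            # Virama suppresses the inherent "a" (consonant cluster).
            pending_a = False
        else:
            if pending_a:
                out.append("a")
                pending_a = False
            out.append(VOWELS.get(ch) or SIGNS.get(ch) or ch)
    if pending_a:
        out.append("a")
    return "".join(out)


if __name__ == "__main__":
    print(transliterate("పోతన"))  # pōtana
```

The output keeps diacritics; an index that needs plain ASCII names (for example "Pothana") would need a further, convention-specific mapping step.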
Thank you, sir. I will get back to you once I am done extracting the corpus.
Hello sir, I have scraped Pothana Telugu Bhagavatham. I will let you know about the other files in 2-3 days.
This is terrific! Before you're done, remember to add a README explaining where you got the texts, and a LICENSE file, too. Also put the scripts you used to scrape the site in here. Once this is done, it'd be great to offer an index file giving dates for each of these authors.
Thank you for the feedback. Yes, I will add the README, LICENSE and the scraper files.
Right now I am putting all the documents in the /classical-telugu repository. Do I need to place each document in a separate repository?
Good question. Generally, we use 1 repo for each language for each website. "Pothana Bhagavatam" is a translation by Pothana, right? And will there be works by other authors? If so, in the root of the repo you can put Am I understanding the authors and texts right?
> "Pothana Bhagavatam" is a translation by Pothana, right?
Yes, you are right.
> And will there be works by other authors?
Yes, there are other authors as well.
> The work "Pothana Bhagavatam" would then go within the Pothana directory.
Yeah, I got the idea.
> Am I understanding the authors and texts right?
Yes, perfectly :)
Ping @kylepjohnson: Added 4 more corpora: https://github.com/achaitanyasai/classical-telugu I will add the README, LICENSE, and scraper files once I am done extracting the documents.
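The layout discussed above (one directory per author in the repo root, with each work inside its author's directory) also makes it easy to list what a repo contains. A minimal sketch, assuming exactly that `root/author/work` layout; the function name `build_index` and the directory names in the usage comment are hypothetical and not taken from the actual repository.

```python
from pathlib import Path


def build_index(root):
    """Return {author_directory: [work names]} for a root/author/work layout."""
    index = {}
    for author_dir in sorted(Path(root).iterdir()):
        # Skip top-level files (README, LICENSE) and hidden directories (.git).
        if not author_dir.is_dir() or author_dir.name.startswith("."):
            continue
        index[author_dir.name] = sorted(
            p.name for p in author_dir.iterdir() if not p.name.startswith(".")
        )
    return index


# Hypothetical usage: build_index("classical-telugu") might return something like
# {"pothana": ["pothana_bhagavatam"], ...} depending on how directories are named.
```

Such a listing only records names; the dates and Latin-character author names requested for the index would still have to be added by hand.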
Hello sir, I have a small doubt: some documents are just Sanskrit documents written in Telugu script, for example the Valmiki Ramayana.
To be clear: These are transliterations, not translations? If these are Sanskrit texts, just in Telugu characters, then no, they do not belong in your Telugu repo. Is there anything else special or unique about these texts?
Hello sir, firstly thanks for the clarification, and yes, I meant transliterations.
> Is there anything else special or unique about these texts?
No, nothing special or unique.
Ok, let's skip them, then. Thank you for being so careful.
Yeah I will skip them and scrape other documents.
Hello sir, I need a few more days to complete this task. I will get back to you once I finish collecting the documents.
Thanks for the update. Good work.
Ping @kylepjohnson: I will add the remaining documents in a day or two. Thank you.
Good work. Let's transfer it now, since it has a good open source license. I will make you leader of the new Telugu group. Before doing so, rename it to How many more Classical texts remain? Also, because I know nothing about Telugu literature, I will need some help from you documenting the authors and what years these texts were written. One idea: link an author's English-language Wikipedia page in the README. How does that sound? Thank you, very exciting! :)
Thank you sir for the information. I will transfer it right away.
> How many more texts Classical texts remain?
There are about 20-25 texts, but they won't take much time to scrape. I will do it today. As for the README, I will document the authors and the dates within a day.
> One idea: link an author's English-language Wikipedia page in the Readme. How does that sound?
Yeah, it sounds good. I will follow this idea if I don't come up with a better one. Thank you.
Done transferring to cltk: https://github.com/cltk/telugu_text_wikisource
Hello sir, I have added all the remaining texts: https://github.com/cltk/telugu_text_wikisource
Wow, such a beautiful script. I wish I could read it! You're doing a great job. Two things to change before finishing:
Thanks so much. Once you're happy with this task, you can make a PR to add these new languages (https://github.com/cltk/cltk/wiki/How-to-add-a-corpus-to-the-CLTK).
Thank you sir for the feedback. I will modify it accordingly.
Done changing the directory structure. Is this fine: /cltk/telugu_text_wikisource?
This looks fantastic. I just updated the README. I don't want the PR to wait for the index, so how about I make you a ticket just for adding an index? Next step: I'd like you to make the PR (and add a new
Yeah, fine. I will make the PR right now.
Ping @kylepjohnson: Done making the PR: #263. You can look at it; I will update the README and add docs very soon.
See my comments on #263 for why I rejected it. For
Yeah, I have seen it. Thanks a lot.
As far as I can see, I have parsed all the texts from https://te.wikisource.org/wiki. If I missed something or find a new text, I will update the repo accordingly. Now, coming to your second point, "including author names in Latin characters and BC/AD dates": can you elaborate?
Thank you @achaitanyasai for seeing this through. Let's DM on gitter about what else you want to help out on!
https://te.wikisource.org/wiki contains the classical Telugu ithihasas, puranas, vedas, stothras, etc.
So I would like to scrape them and add them as a new corpus.
Thank you.
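The thread does not say how the site was actually scraped. As one possible starting point, te.wikisource.org is a MediaWiki site, so page text can be requested through the standard MediaWiki API rather than parsed out of HTML. A minimal sketch that only builds the request URL (no network access); the helper name `page_wikitext_url` is hypothetical, and any page title passed to it is up to the caller.

```python
from urllib.parse import urlencode

# Standard MediaWiki API entry point for Telugu Wikisource.
API_ENDPOINT = "https://te.wikisource.org/w/api.php"


def page_wikitext_url(title):
    """Build a MediaWiki API URL that requests the latest wikitext of one page."""
    params = {
        "action": "query",
        "prop": "revisions",
        "rvprop": "content",
        "rvslots": "main",
        "titles": title,
        "format": "json",
    }
    # urlencode percent-encodes non-ASCII titles as UTF-8.
    return f"{API_ENDPOINT}?{urlencode(params)}"
```

The JSON response still contains wiki markup, so a real scraper would need a cleanup step before the text is usable as a corpus.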