Scraping Raw Classical Hindi Data #215
Comments
@Akirato Hindi is not one of the classical languages. Sanskrit and Pali are. |
The texts are from the Old Hindi (Mughal/15th century/Hindustani) period: Kabeerdas, Tulsidas, etc.
This is the repository that I will be working in https://github.com/Akirato/cltk_classical_hindi_corpus |
@Akirato The script is not Devanagari for sure. Is it ITRANS?
@mineshmathew Thanks for the help. I will try those sites too. The text is in the "Shusha" font.
@kylepjohnson Sure, I would like to work on making a converter. |
Hi, this is looking good.
The "COPYING" license file is included in the folders when they are downloaded from the website http://ltrc.iiit.ac.in/showfile.php?filename=downloads/Classical_Hindi_Literature/SHUSHA/index.html
Thanks, that answers my question about licensing. About the transliterators, I'll open another ticket for you. |
@Akirato I'll wait for you to transfer ownership of this repo to |
Hi @Akirato, just checking where we stand on this issue. Thanks!
@kylepjohnson I didn't find any open source Shusha-to-Unicode converters, so I am converting the text manually through online interfaces and would like a little more time. I have pushed the original data, though.
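For anyone picking up the converter idea, here is a minimal sketch of a table-driven, longest-match conversion in Python. The `TOY_MAPPING` table is a hypothetical placeholder and not the real Shusha mapping, and `build_converter` is an illustrative name; a real converter would need the full Shusha table and probably extra reordering rules for vowel signs.

```python
def build_converter(mapping):
    """Return a function that rewrites text using `mapping` (legacy -> Unicode)."""
    # Longest key length, so multi-character sequences are tried first.
    max_len = max(map(len, mapping), default=0)

    def convert(text):
        out = []
        i = 0
        while i < len(text):
            # Try the longest possible chunk first, then shorter ones.
            for length in range(min(max_len, len(text) - i), 0, -1):
                chunk = text[i:i + length]
                if chunk in mapping:
                    out.append(mapping[chunk])
                    i += length
                    break
            else:
                # No mapping entry matched: pass the character through unchanged.
                out.append(text[i])
                i += 1
        return "".join(out)

    return convert


# HYPOTHETICAL placeholder entries for illustration only,
# NOT the actual Shusha font encoding.
TOY_MAPPING = {"k": "क", "kh": "ख", "a": "ा"}

convert = build_converter(TOY_MAPPING)
print(convert("kha k"))
```

Trying longer keys first matters because legacy font encodings often use multi-character sequences whose prefixes are themselves valid single characters.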
OK, thank you for the update! |
@kylepjohnson |
Cool!
@kylepjohnson |
Great work! Repo now at https://github.com/cltk/hindi_text_ltrc and belongs to the Hindi group (of which you're a member). |
I have extracted and cleaned all the data. The repository can be used for any tool now. |
So exciting! Final step is to add Hindi as a corpus download: https://github.com/cltk/cltk/tree/master/cltk/corpus. Add this and I'll merge the PR right away. |
Hi @kylepjohnson , |
I don't see the PR. You've done this before, right? |
Yes, I updated it in my repository, so PR #253 was updated with it.
We are miscommunicating here. This ticket is for adding It is my practice to try for 1 PR per ticket (and never 1 PR for 2 tickets), |
Hi @kylepjohnson , |
@Akirato @kylepjohnson You may check out the parallel corpus here. It is one of the largest parallel corpora available for Hindi, and it is tokenized. I am not sure whether the content is purely classical Hindi, though.
@mineshmathew Thanks for your help. However, the newer version, HindEnCorp 0.5, is maintained by my lab, LTRC, IIIT Hyderabad. You can read that in the description too.
@kylepjohnson |
woo hoo! |
I am scraping Raw Classical Hindi Data from http://ltrc.iiit.ac.in/showfile.php?filename=downloads/Classical_Hindi_Literature/SHUSHA/index.html
@kylepjohnson
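As a rough sketch of the scraping step, the snippet below collects the links from an index page using only the Python standard library. The names `LinkCollector`, `extract_links`, and `fetch_links` are illustrative, and it assumes the index page exposes the text files as ordinary `<a href>` links, which has not been verified against the LTRC site.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class LinkCollector(HTMLParser):
    """Collect absolute URLs from every <a href=...> on a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page URL.
                self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    """Return all link targets found in an HTML string."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    return parser.links


def fetch_links(index_url, encoding="utf-8"):
    """Download an index page and return its links (assumed encoding: UTF-8)."""
    with urlopen(index_url) as response:
        html = response.read().decode(encoding, errors="replace")
    return extract_links(html, index_url)
```

Calling `fetch_links` with the index URL above would then give a list of candidate files to download; the raw Shusha-encoded text would still need converting afterwards.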