Scraping Raw Classical Hindi Data #215

Closed
Akirato opened this issue Mar 28, 2016 · 29 comments

@Akirato
Member

Akirato commented Mar 28, 2016

I am scraping Raw Classical Hindi Data from http://ltrc.iiit.ac.in/showfile.php?filename=downloads/Classical_Hindi_Literature/SHUSHA/index.html
@kylepjohnson

@mineshmathew

@Akirato Hindi is not one of the classical languages. Sanskrit and Pali are.
Also, are the texts at the link in Unicode? I am not able to read anything.

@Akirato
Member Author

Akirato commented Mar 28, 2016

The texts are from the old Hindi (Mughal/15th century/Hindustani) period: Kabeerdas, Tulsidas, etc.
They are in the Shusha font. Converting the font is the main cleaning task.
Is this period OK?

@Akirato
Member Author

Akirato commented Mar 28, 2016

This is the repository that I will be working in https://github.com/Akirato/cltk_classical_hindi_corpus

@mineshmathew

@Akirato The script is definitely not Devanagari. Is it ITRANS?
Please check the wiki library too, and sites like this one: http://www.hindisahitya.org/hindipoet/surdas-poems-in-hindi, where they have Hindustani poems in Devanagari script itself.
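
One quick way to check whether a file is really Unicode Devanagari is to count how many of its characters fall in the Devanagari block (U+0900 to U+097F). The sketch below is illustrative only; the function name `devanagari_ratio` is made up for this example. Text stored in a legacy font encoding would be expected to score near zero, on the assumption that such fonts reuse Latin-range code points.

```python
def devanagari_ratio(text: str) -> float:
    """Return the fraction of non-whitespace characters in the Devanagari block."""
    chars = [ch for ch in text if not ch.isspace()]
    if not chars:
        return 0.0
    # U+0900..U+097F is the main Unicode Devanagari block.
    hits = sum(1 for ch in chars if "\u0900" <= ch <= "\u097f")
    return hits / len(chars)


if __name__ == "__main__":
    print(devanagari_ratio("कबीर"))      # Unicode Devanagari text
    print(devanagari_ratio("plain ascii"))  # Latin-range text
```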

@Akirato
Member Author

Akirato commented Mar 28, 2016

@mineshmathew Thanks for the help. I will try those sites too. And the text is in "Shusha" font.

@kylepjohnson
Member

  • About Hindi as a Classical language: I do know that there are very old texts in what's generally considered Hindi/Hindustani: https://en.wikipedia.org/wiki/History_of_Hindustani#Timeline. As for an end date, I am not the best person to say; however, it seems that the beginning of the Colonial period/first printed book (according to the Wikipedia link, the year 1796) is reasonable.
  • I have added you both to a new Hindi team.
  • @Akirato What website do the texts in https://github.com/Akirato/cltk_classical_hindi_corpus come from? We always need to cite this and the texts' license in the README.md.
  • Is Shusha a font or its own alphabet? Ideally our texts will be in Unicode; however, I think we should always give the texts as they are found in the original corpus, too.
  • About ITRANS: Would either of you be interested in making a converter for this? I bet this work would be pretty easy.

@Akirato
Member Author

Akirato commented Mar 30, 2016

@kylepjohnson
The corpus has the text of Kabir's poems (1398-1518) and a lot of other such old texts by Meera, Suradaas, and Tulasidaas.
I updated the README with the source, and the license is in the file "COPYING" in the repository.
"Shusha" is a font, used because it takes one byte per character, whereas Unicode takes two bytes.
I will push the original corpus too.

Sure, I would like to work on making a converter.
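
As a side note on the storage point, the size of a Devanagari character depends on which Unicode encoding is used. The sketch below simply measures the encoded length of one letter under three standard Python codecs; it says nothing about Shusha itself, whose byte layout is not described in this thread.

```python
# Measure how many bytes one Devanagari letter needs in common Unicode encodings.
letter = "क"  # U+0915 DEVANAGARI LETTER KA

sizes = {
    encoding: len(letter.encode(encoding))
    for encoding in ("utf-8", "utf-16-le", "utf-32-le")
}

if __name__ == "__main__":
    for encoding, size in sizes.items():
        print(f"{encoding}: {size} bytes")
```

The "-le" codec variants are used so that no byte-order mark is added to the count.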

@kylepjohnson
Member

Hi, this is looking good.

  • IMPORTANT: I see you have added GPL v.1 to the LICENSE. However, I do not see this license on the original http://ltrc.iiit.ac.in website. We cannot just add whatever license we want to their data! :) Would you please send any licensing terms stated on their website? If there isn't anything, I would like your help writing to the maintainers for permission to re-host for non-commercial research use.
  • For ITRANS, I'd recommend starting with the Wikipedia page. I don't know if the Harvard-Kyoto system is much used, but it could be done at the same time.

@Akirato
Member Author

Akirato commented Apr 1, 2016

The "COPYING" license can be found in the folders if downloaded from the website http://ltrc.iiit.ac.in/showfile.php?filename=downloads/Classical_Hindi_Literature/SHUSHA/index.html
That is the one I added.
I have finished the basic code, but the Wikipedia mapping is not enough, so I have to extend it.
https://github.com/Akirato/itransConverter
I will do the Harvard-Kyoto System after this.
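
For illustration only, here is a minimal sketch of a table-driven ITRANS-to-Devanagari converter. It is not the code in the linked repository; the function name `itrans_to_devanagari` and the deliberately small mapping tables are assumptions made for this example. It covers only a subset of ITRANS, which is the kind of gap that makes a short mapping table insufficient in practice.

```python
# Partial ITRANS tables (illustrative subset, not a complete scheme).
CONSONANTS = {
    "k": "क", "kh": "ख", "g": "ग", "gh": "घ", "j": "ज",
    "t": "त", "th": "थ", "d": "द", "dh": "ध", "n": "न",
    "p": "प", "b": "ब", "bh": "भ", "m": "म", "y": "य",
    "r": "र", "l": "ल", "v": "व", "s": "स", "h": "ह",
}
# Each vowel maps to (independent form, dependent sign used after a consonant).
VOWELS = {
    "a": ("अ", ""), "aa": ("आ", "ा"), "i": ("इ", "ि"), "ii": ("ई", "ी"),
    "u": ("उ", "ु"), "uu": ("ऊ", "ू"), "e": ("ए", "े"), "ai": ("ऐ", "ै"),
    "o": ("ओ", "ो"), "au": ("औ", "ौ"),
}
VIRAMA = "\u094d"  # suppresses the inherent vowel of a consonant


def itrans_to_devanagari(text: str) -> str:
    out = []
    i = 0
    pending = False  # True while the last consonant still awaits a vowel
    while i < len(text):
        # Greedy longest match: try two-character tokens before one-character ones.
        token = None
        for size in (2, 1):
            candidate = text[i:i + size]
            if candidate in VOWELS or candidate in CONSONANTS:
                token = candidate
                break
        if token is None:
            # Unknown character: close any bare consonant and copy it through.
            if pending:
                out.append(VIRAMA)
                pending = False
            out.append(text[i])
            i += 1
            continue
        if token in VOWELS:
            independent, sign = VOWELS[token]
            out.append(sign if pending else independent)
            pending = False
        else:
            if pending:
                out.append(VIRAMA)  # consonant cluster
            out.append(CONSONANTS[token])
            pending = True
        i += len(token)
    if pending:
        out.append(VIRAMA)
    return "".join(out)


if __name__ == "__main__":
    for word in ("raama", "kabiira", "dharma"):
        print(word, "->", itrans_to_devanagari(word))
```

The greedy two-then-one matching is a design shortcut: it resolves "th" as a single aspirated consonant, so a real converter would need the full ITRANS rules for ambiguous sequences.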

@kylepjohnson
Member

Thanks, that answers my question about licensing.

About the transliterators, I'll open another ticket for you.

@kylepjohnson
Member

@Akirato I'll wait for you to transfer ownership of this repo to cltk org and to the Sanskrit user group, and for the PR to add the corpus to the core software.

@kylepjohnson
Member

Hi @Akirato, just checking where we stand on this issue. Thanks!

@Akirato
Member Author

Akirato commented Apr 8, 2016

@kylepjohnson I didn't find any open-source Shusha-to-Unicode converters, so I am using online interfaces and doing the conversion manually. I would like a little more time. I have pushed the original data, though.

@kylepjohnson
Member

OK, thank you for the update!

@Akirato
Member Author

Akirato commented Apr 16, 2016

@kylepjohnson
I have converted the text to Unicode. Some of it is still left to be converted. You can find the files here: https://github.com/Akirato/cltk_classical_hindi_corpus
Which repository under cltk should I put it in?

@kylepjohnson
Member

Cool!

  • Rename the repo to hindi_text_ltrc
  • Transfer ownership to the cltk organization.
  • Add an # About and a # License section to the README.md. Also say a little more about the LTRC organization under About.
  • Add the files cltk/corpus/hindi/__init__.py and cltk/corpus/hindi/corpora.py. Follow the pattern of the other languages for what to put in corpora.py.
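
As a rough sketch of what such a corpora.py might contain: a module-level list of dictionaries, one per downloadable corpus. The variable name `HINDI_CORPORA`, the dictionary keys, and the ".git" suffix on the URL are assumptions modeled on the general pattern, not confirmed in this thread; the files for the other languages are the authority.

```python
# Hypothetical corpora.py for Hindi. Key names and values are assumptions;
# check the corpora.py files of the other languages for the exact pattern.
HINDI_CORPORA = [
    {
        "name": "hindi_text_ltrc",  # repository name requested in this thread
        "origin": "https://github.com/cltk/hindi_text_ltrc.git",  # assumed clone URL
        "location": "remote",  # assumed: corpus is fetched from a remote repo
        "type": "text",  # assumed: plain text corpus
    },
]

if __name__ == "__main__":
    for corpus in HINDI_CORPORA:
        print(corpus["name"], "<-", corpus["origin"])
```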

@Akirato
Member Author

Akirato commented Apr 17, 2016

@kylepjohnson
Transferred ownership to cltk.
It has parse.py. I will remove it once all of them are done.
That will take like 2-3 more days. 😀

@kylepjohnson
Member

Great work! Repo now at https://github.com/cltk/hindi_text_ltrc and belongs to the Hindi group (of which you're a member).

@Akirato
Member Author

Akirato commented Apr 19, 2016

I have extracted and cleaned all the data. The repository can be used for any tool now.
This should be a good start for classical Hindi 🎉

@kylepjohnson
Member

So exciting!

Final step is to add Hindi as a corpus download: https://github.com/cltk/cltk/tree/master/cltk/corpus.

Add this and I'll merge the PR right away.

@Akirato
Member Author

Akirato commented Apr 19, 2016

Hi @kylepjohnson ,
Done 😄

@kylepjohnson
Member

I don't see the PR. You've done this before, right?

@Akirato
Member Author

Akirato commented Apr 19, 2016

Yes, I updated it in my repository, so PR #253 was updated with it.

@kylepjohnson
Member

We are miscommunicating here. This ticket is for adding cltk/corpus/hindi/corpora.py.

It is my practice to try for 1 PR per ticket (and never 1 PR for 2 tickets).

@Akirato
Member Author

Akirato commented Apr 19, 2016

Hi @kylepjohnson ,
Sorry for all that confusion. The PR for the Hindi corpora is #255.

@mineshmathew

@Akirato @kylepjohnson You may want to check out the parallel corpus here. It is one of the largest parallel corpora available for Hindi, and it is tokenized. I am not sure whether the content is purely classical Hindi, though.
https://lindat.mff.cuni.cz/repository/xmlui/handle/11858/00-097C-0000-0001-BD17-1

@Akirato
Member Author

Akirato commented Apr 20, 2016

@mineshmathew Thanks for your help. The newer version of it, HindEnCorp 0.5, is maintained by my lab, LTRC, IIIT Hyderabad. You can read that in the description too.
We use it for machine translation in Hindi, and I don't think it's a corpus for Classical Hindi.

@Akirato
Member Author

Akirato commented Apr 20, 2016

@kylepjohnson
PR #256 should take care of it.

@kylepjohnson
Member

woo hoo!
