Scraping Raw Classical Hindi Data #215

Akirato · 2016-03-28T14:54:21Z

I am scraping Raw Classical Hindi Data from http://ltrc.iiit.ac.in/showfile.php?filename=downloads/Classical_Hindi_Literature/SHUSHA/index.html
@kylepjohnson
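A minimal sketch of the first scraping step, assuming the index page is ordinary HTML whose `<a href>` links lead to the individual texts (the thread does not describe the page structure). The names `LinkCollector` and `extract_links` are illustrative, not part of any existing tool:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect absolute URLs from every <a href="..."> in an HTML page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            # Anchors without an href (e.g. <a name="...">) are skipped.
            if name == "href" and value:
                self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    """Return all link targets in `html`, resolved against `base_url`."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    return parser.links
```

The page itself could be fetched with `urllib.request.urlopen` and passed to `extract_links` as a decoded string. Because the real URL goes through `showfile.php?filename=...`, relative links may need a different base URL than the one in the address bar.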

mineshmathew · 2016-03-28T15:14:32Z

@Akirato Hindi is not one of the classical languages. Sanskrit and Pali are.
And the texts in the link, are they in Unicode? I am not able to read anything.

Akirato · 2016-03-28T15:34:34Z

The texts are from the old Hindi (Mughal/15th century/Hindustani) period: Kabeerdas, Tulsidas, etc.
It's in the Shusha font. Converting the font is the main cleaning task.
Is this period OK?

Akirato · 2016-03-28T16:22:30Z

This is the repository that I will be working in https://github.com/Akirato/cltk_classical_hindi_corpus

mineshmathew · 2016-03-28T16:31:35Z

@Akirato The script is not Devanagari for sure. Is it ITRANS?
Please check the wiki library too, and sites like http://www.hindisahitya.org/hindipoet/surdas-poems-in-hindi, which have Hindustani poems in Devanagari script itself.

Akirato · 2016-03-28T16:43:38Z

@mineshmathew Thanks for the help. I will try those sites too. And the text is in "Shusha" font.

kylepjohnson · 2016-03-30T17:43:31Z

About Hindi as a Classical language: I do know that there are very old texts in what's generally considered Hindi/Hindustani: https://en.wikipedia.org/wiki/History_of_Hindustani#Timeline. As for an end date, I am not the best person to say; however, the beginning of the Colonial period/first printed book (according to the Wikipedia link, the year 1796) seems reasonable.
I have added you both to a new Hindi team.
@Akirato What website do the texts in https://github.com/Akirato/cltk_classical_hindi_corpus come from? We always need to cite this and the texts' license in the README.md.
Is Shusha a font or its own alphabet? Ideally our texts will be in Unicode; however, I think we should always give the texts as they are found in the original corpus, too.
About ITRANS: Would either of you be interested in making a converter for this? I bet this work would be pretty easy.

Akirato · 2016-03-30T19:10:24Z

@kylepjohnson
The corpus has the text of Kabir's poems (1398-1518) and a lot of other such old texts by Meera, Suradaas, and Tulasidaas.
I updated the README with the source, and the license is in the file "COPYING" in the repository.
"Shusha" is a font, used because it takes one byte per character, whereas Unicode takes 2 bytes.
I will push the original corpus too.

Sure, I would like to work on making a converter.

kylepjohnson · 2016-04-01T01:10:52Z

Hi, this is looking good.

IMPORTANT: I see you have added GPL v.1 to the LICENSE. However, I do not see this license on the original http://ltrc.iiit.ac.in website. We cannot just add whatever license we want to their data! :) Would you please send any licensing explained on their website? If there isn't anything, I would like your help writing to the maintainers for permission to re-host for non-commercial research use.
For ITRANS, I'd recommend starting with the Wikipedia page. I don't know if the Harvard-Kyoto system is much used, but it could be done at the same time.

Akirato · 2016-04-01T08:31:50Z

The "COPYING" license can be found in the folders if downloaded from the website http://ltrc.iiit.ac.in/showfile.php?filename=downloads/Classical_Hindi_Literature/SHUSHA/index.html
That is the one I added.
I have finished the basic code, but the Wikipedia mapping is not enough, so I have to extend it.
https://github.com/Akirato/itransConverter
I will do the Harvard-Kyoto System after this.
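For illustration, a minimal sketch of how an ITRANS-to-Devanagari converter can work. It is not the code in the itransConverter repository; it covers only a subset of the standard ITRANS table, and the function and table names are made up for this example. It assumes the strict reading in which a consonant with no following vowel gets a virama:

```python
VIRAMA = "्"

# Subset of the ITRANS consonant table (nasals such as ~N and ~n are omitted).
CONSONANTS = {
    "k": "क", "kh": "ख", "g": "ग", "gh": "घ",
    "ch": "च", "Ch": "छ", "j": "ज", "jh": "झ",
    "T": "ट", "Th": "ठ", "D": "ड", "Dh": "ढ", "N": "ण",
    "t": "त", "th": "थ", "d": "द", "dh": "ध", "n": "न",
    "p": "प", "ph": "फ", "b": "ब", "bh": "भ", "m": "म",
    "y": "य", "r": "र", "l": "ल", "v": "व",
    "sh": "श", "Sh": "ष", "s": "स", "h": "ह",
}

# Each vowel maps to (independent form, dependent matra form).
VOWELS = {
    "a": ("अ", ""), "aa": ("आ", "ा"), "A": ("आ", "ा"),
    "i": ("इ", "ि"), "ii": ("ई", "ी"), "I": ("ई", "ी"),
    "u": ("उ", "ु"), "uu": ("ऊ", "ू"), "U": ("ऊ", "ू"),
    "e": ("ए", "े"), "ai": ("ऐ", "ै"),
    "o": ("ओ", "ो"), "au": ("औ", "ौ"),
}

SIGNS = {"M": "ं", "H": "ः"}  # anusvara and visarga


def next_token(text, i):
    """Greedy match: prefer a two-character token over a one-character one."""
    for size in (2, 1):
        token = text[i:i + size]
        if token in CONSONANTS or token in VOWELS or token in SIGNS:
            return token
    return text[i]  # unknown character, passed through unchanged


def itrans_to_devanagari(text):
    out = []
    pending = False  # True while the last consonant still awaits its vowel
    i = 0
    while i < len(text):
        token = next_token(text, i)
        i += len(token)
        if token in CONSONANTS:
            if pending:
                out.append(VIRAMA)  # consonant cluster: suppress inherent 'a'
            out.append(CONSONANTS[token])
            pending = True
        elif token in VOWELS:
            independent, matra = VOWELS[token]
            out.append(matra if pending else independent)
            pending = False
        else:
            if pending:
                out.append(VIRAMA)
            out.append(SIGNS.get(token, token))
            pending = False
    if pending:
        out.append(VIRAMA)
    return "".join(out)
```

With these tables, `itrans_to_devanagari("raama")` gives राम and `itrans_to_devanagari("dharma")` gives धर्म. A Harvard-Kyoto converter could reuse the same loop with different tables.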

kylepjohnson · 2016-04-01T15:44:27Z

Thanks, that answers my question about licensing.

About the transliterators, I'll open another ticket for you.

kylepjohnson · 2016-04-01T15:45:25Z

@Akirato I'll wait for you to transfer ownership of this repo to cltk org and to the Sanskrit user group, and for the PR to add the corpus to the core software.

kylepjohnson · 2016-04-08T16:54:25Z

Hi @Akirato, just checking where we stand on this issue. Thanks!

Akirato · 2016-04-08T20:13:06Z

@kylepjohnson I didn't find any open-source Shusha-to-Unicode converters, so I am using online interfaces and doing it manually. I would like a little more time. I have pushed the original data, though.
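As a rough sketch of what such a converter involves: a legacy-font-to-Unicode conversion is at its core a lookup table applied with longest-match-first replacement. The table below is a HYPOTHETICAL placeholder, not the real Shusha mapping, and `build_converter` is an illustrative name:

```python
import re

# HYPOTHETICAL sample entries for illustration only. These are NOT the
# real Shusha code points; a real table has to be built from the font.
SAMPLE_TABLE = {
    "k": "\u0915",   # DEVANAGARI LETTER KA
    "kh": "\u0916",  # DEVANAGARI LETTER KHA
    "a": "\u093E",   # DEVANAGARI VOWEL SIGN AA
}


def build_converter(table):
    """Return a function that rewrites text using `table`.

    Keys are tried longest first, so a multi-character key such as "kh"
    wins over its prefix "k". Unmapped characters pass through unchanged.
    """
    if not table:
        raise ValueError("mapping table must not be empty")
    keys = sorted(table, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(key) for key in keys))

    def convert(text):
        return pattern.sub(lambda match: table[match.group(0)], text)

    return convert
```

A real converter may also need reordering rules after the lookup, since legacy Devanagari fonts often store some vowel signs in visual rather than logical order.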

kylepjohnson · 2016-04-11T17:41:51Z

OK, thank you for the update!

Akirato · 2016-04-16T13:08:59Z

@kylepjohnson
I have converted the text to Unicode. Some of it is still left to be converted. You can find it here: https://github.com/Akirato/cltk_classical_hindi_corpus
Which repository in cltk should I put it in?

kylepjohnson · 2016-04-16T19:38:06Z

Cool!

Rename the repo to hindi_text_ltrc
Transfer ownership to the cltk organization.
Add an # About and a # License section to the README.md. Also say a little more about the LTRC organization under About.
Add the files cltk/corpus/hindi/__init__.py and cltk/corpus/hindi/corpora.py. Follow the pattern of the other languages for what to put in corpora.py.
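For orientation, a sketch of what a corpora.py for Hindi might contain. The thread does not show the file, so the variable name `HINDI_CORPORA`, the dictionary keys, and the `.git` suffix on the URL are assumptions modeled on how other languages were registered at the time:

```python
"""Hindi corpora available for download (illustrative sketch)."""

# Assumed structure: one dict per downloadable corpus.
HINDI_CORPORA = [
    {
        "name": "hindi_text_ltrc",  # repo name requested in the thread
        "origin": "https://github.com/cltk/hindi_text_ltrc.git",
        "location": "remote",       # assumed key: fetched from a remote repo
        "type": "text",             # assumed key: kind of corpus
    },
]
```

The accompanying `__init__.py` would typically be empty and only marks the directory as a package.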

Akirato · 2016-04-17T14:53:11Z

@kylepjohnson
Transferred ownership to cltk.
It has parse.py. I will remove it once all of them are done.
That will take like 2-3 more days. 😀

kylepjohnson · 2016-04-19T18:25:34Z

Great work! Repo now at https://github.com/cltk/hindi_text_ltrc and belongs to the Hindi group (of which you're a member).

Akirato · 2016-04-19T19:21:56Z

I have extracted and cleaned all the data. The repository can be used for any tool now.
This should be a good start for classical Hindi 🎉

kylepjohnson · 2016-04-19T19:49:52Z

So exciting!

Final step is to add Hindi as a corpus download: https://github.com/cltk/cltk/tree/master/cltk/corpus.

Add this and I'll merge the PR right away.

Akirato · 2016-04-19T20:38:44Z

Hi @kylepjohnson ,
Done 😄

kylepjohnson · 2016-04-19T21:09:16Z

I don't see the PR. You've done this before, right?

Akirato · 2016-04-19T21:25:23Z

Yes, I updated it in my repository, so PR #253 was updated with it.

kylepjohnson · 2016-04-19T21:43:19Z

We are miscommunicating here. This ticket is for adding cltk/corpus/hindi/corpora.py.

It is my practice to try for 1 PR per ticket (and never 1 PR for 2 tickets).

Akirato · 2016-04-19T23:43:44Z

Hi @kylepjohnson ,
Sorry for all that confusion. The PR for hindi corpora is #255 .

mineshmathew · 2016-04-20T11:32:54Z

@Akirato @kylepjohnson You may check out the parallel corpus here. It is one of the largest parallel corpora available for Hindi, and it is tokenized. But I am not sure if the content is purely classical Hindi.
https://lindat.mff.cuni.cz/repository/xmlui/handle/11858/00-097C-0000-0001-BD17-1

Akirato · 2016-04-20T12:42:41Z

@mineshmathew Thanks for your help. But the newer version of it, HindEnCorp 0.5, is maintained by my lab, LTRC, IIIT Hyderabad. You can read that in the description too.
We use it for Machine Translation in Hindi, and I don't think it's a corpus for Classical Hindi.

Akirato · 2016-04-20T15:36:40Z

@kylepjohnson
PR #256 should take care of it.

kylepjohnson · 2016-04-20T17:38:57Z

woo hoo!

kylepjohnson added the new corpus label Mar 30, 2016

kylepjohnson assigned Akirato Apr 1, 2016

kylepjohnson closed this as completed Apr 20, 2016
