Added Stemmer for Marathi #719

the-ethan-hunt · 2018-02-27T05:39:13Z

With reference to #697, this PR adds a stemmer based on a suffix-stripping algorithm.

Stemmer added in stem/marathi/stem.py
Stemmer added to the tests
Stemmer documented in marathi.rst
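
For readers unfamiliar with the approach, here is a minimal sketch of suffix stripping. It is not the code from stem/marathi/stem.py, which is not shown in this thread. The suffix list, the function names, and the input sentence are assumptions made for illustration; only the expected output string is taken from the test quoted in the review below.

```python
# Hypothetical suffix list for illustration only; the PR's actual
# verb_endings list is not reproduced in this thread.
SUFFIXES = ["णार", "तो", "ते", "त"]


def strip_suffix(word, suffixes=SUFFIXES, min_stem_len=2):
    """Remove the longest matching suffix, keeping at least min_stem_len code points."""
    for suffix in sorted(suffixes, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    # No suffix matched (or the stem would be too short): leave the word alone.
    return word


def stem_text(text):
    """Stem every whitespace-separated token and re-join with single spaces."""
    return " ".join(strip_suffix(token) for token in text.split())


# Assumed input sentence; the target string is the one used in the PR's test.
print(stem_text("मी वाचत आहे"))
```

Trying longer suffixes first and enforcing a minimum stem length are two common guards against over-stemming; whether they are enough for Marathi would need to be checked against real data.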

The previous PR was closed due to a mistake of branches. 😅

the-ethan-hunt · 2018-02-27T15:00:34Z

@kylepjohnson, Travis CI reports the following error:
ModuleNotFoundError: No module named 'cltk.tokenize.indian_tokenizer'

Is there something I am missing? 😅

inishchith · 2018-02-27T15:11:41Z

@the-ethan-hunt IMO, you haven't synced your fork's master after this merge.

greenat92 · 2018-02-28T14:00:37Z

cltk/tests/test_stem.py

@@ -16,6 +16,8 @@
 from cltk.stem.akkadian.stem import Stemmer as AkkadianStemmer
 from cltk.stem.akkadian.syllabifier import Syllabifier as AkkadianSyllabifier
 from cltk.stem.french.stem import stem
+from cltk.stem.marathi.stem import stem
+


Remove this blank line.

greenat92 · 2018-02-28T14:01:10Z

cltk/tests/test_stem.py

+        target="मी वाच आहे"
+        self.assertEqual(stemmed_text,target)
+
+


The same for this and the following.

greenat92 · 2018-02-28T14:02:27Z

cltk/stem/marathi/stem.py

+
+		return word
+
+


The same for this and the following.

greenat92 · 2018-02-28T14:03:47Z

cltk/corpus/swadesh.py

 swadesh_syc=['ܐܢܐ‎','ܐܢܬ‎, ܐܢܬܝ‎', 'ܗܘ‎', 'ܚܢܢ‎,, ܐܢܚܢܢ‎', 'ܐܢܬܘܢ‎ , ܐܢܬܝܢ‎ ', 'ܗܢܘܢ‎ , ܗܢܝܢ‎', 'ܗܢܐ‎, ܗܕܐ‎', 'ܗܘ‎, ܗܝ‎', 'ܗܪܟܐ‎', 'ܬܡܢ‎', 'ܡܢ‎', 'ܡܐ‎, ܡܢ‎, ܡܢܐ‎, ܡܘܢ‎', 'ܐܝܟܐ‎', 'ܐܡܬܝ‎', 'ܐܝܟܢ‎,, ܐܝܟܢܐ‎', 'ܠܐ‎', 'ܟܠ‎', 'ܣܓܝ‎	', 'ܟܡܐ‎	', 'ܒܨܝܪܐ‎', 'ܐܚܪܢܐ‎, ܐܚܪܬܐ‎', 'ܚܕ‎ , ܚܕܐ‎', 'ܬܪܝܢ‎, ܬܪܬܝܢ‎', 'ܬܠܬܐ‎, ܬܠܬ‎', 'ܐܪܒܥܐ‎, ܐܪܒܥ‎', 'ܚܡܫܐ‎, ܚܡܫ‎', 'ܪܒܐ‎, ܟܒܝܪܐ‎	', 'ܐܪܝܟܐ‎', 'ܪܘܝܚܐ‎, ܦܬܝܐ‎', 'ܥܒܝܛܐ‎', 'ܢܛܝܠܐ‎, ܝܩܘܪܐ‎	', 'ܙܥܘܪܐ‎', 'ܟܪܝܐ‎', 'ܥܝܩܐ‎', 'ܪܩܝܩܐ‎, ܛܠܝܚܐ‎', 'ܐܢܬܬܐ‎', 'ܓܒܪܐ‎', 'ܐܢܫܐ‎', 'ܝܠܘܕܐ‎', 'ܐܢܬܬܐ‎', 'ܒܥܠܐ‎', 'ܐܡܐ‎', 'ܐܒܐ‎', 'ܚܝܘܬܐ‎', 'ܢܘܢܐ‎', 'ܛܝܪܐ‎, ܨܦܪܐ‎', 'ܟܠܒܐ‎', 'ܩܠܡܐ‎', 'ܚܘܝܐ‎', 'ܬܘܠܥܐ‎', 'ܐܝܠܢܐ‎', 'ܥܒܐ‎', 'ܩܝܣܐ‎', 'ܦܐܪܐ‎', 'ܙܪܥܐ‎', 'ܛܪܦܐ‎', 'ܫܪܫܐ‎	', 'ܩܠܦܬܐ‎', 'ܗܒܒܐ‎', 'ܓܠܐ‎', 'ܚܒܠܐ‎', 'ܓܠܕܐ‎	', 'ܒܣܪܐ‎', 'ܕܡܐ‎', 'ܓܪܡܐ‎', 'ܕܗܢܐ‎, ܫܘܡܢܐ‎', 'ܒܝܥܬܐ‎', 'ܩܪܢܐ‎', 'ܕܘܢܒܐ‎', 'ܐܒܪܐ‎', 'ܣܥܪܐ‎', 'ܪܝܫܐ‎', 'ܐܕܢܐ‎', 'ܥܝܢܐ‎', 'ܢܚܝܪܐ‎	', 'ܦܘܡܐ‎', 'ܫܢܐ‎, ܟܟܐ‎', 'ܠܫܢܐ‎', 'ܛܦܪܐ‎	', 'ܥܩܠܐ‎', 'ܪܓܠܐ‎', 'ܒܘܪܟܐ‎', 'ܐܝܕܐ‎', 'ܟܢܦܐ‎	', 'ܒܛܢܐ‎, ܟܪܣܐ‎	', 'ܡܥܝܐ‎, ܓܘܐ‎', 'ܨܘܪܐ‎, ܩܕܠܐ‎', 'ܚܨܐ‎, ܒܣܬܪܐ‎', 'ܚܕܝܐ‎', 'ܠܒܐ‎', 'ܟܒܕܐ‎', 'ܫܬܐ‎', 'ܐܟܠ‎', 'ܢܟܬ‎', 'ܡܨ‎	', 'ܪܩ‎', 'ܓܥܛ‎', 'ܢܦܚ‎', 'ܢܦܫ‎, ܢܫܡ‎', 'ܓܚܟ‎	', 'ܚܙܐ‎', 'ܫܡܥ‎', 'ܝܕܥ‎', 'ܚܫܒ‎', 'ܡܚ‎, ܣܩ‎', 'ܕܚܠ‎, ܟܘܪ‎', 'ܕܡܟ‎', 'ܚܝܐ‎	', 'ܡܝܬ‎', 'ܩܛܠ‎', 'ܟܬܫ‎', 'ܨܝܕ‎	', 'ܡܚܐ‎, ܢܩܫ‎', 'ܓܕܡ‎, ܩܛܥ‎', 'ܫܪܩ‎, ܦܕܥ‎, ܦܪܬ‎', 'ܕܓܫ‎', 'ܚܟ‎, ܣܪܛ‎', 'ܚܦܪ‎', 'ܣܚܐ‎', 'ܦܪܚ‎	', 'ܗܠܟ‎	', 'ܐܬܐ‎	', 'ܫܟܒ‎, ܡܟ‎', 'ܝܬܒ‎', 'ܬܪܨ‎', 'ܦܢܐ‎, ܥܛܦ‎	', 'ܢܦܠ‎', 'ܝܗܒ‎, ܢܬܠ‎', 'ܐܚܕ‎', 'ܩܡܛ‎, ܥܨܪ‎', 'ܫܦ‎, ܚܟ‎', 'ܚܠܠ‎, ܦܝܥ‎', 'ܟܦܪ‎', 'ܓܪܫ‎', 'ܙܥܦ‎	', 'ܪܡܐ‎', 'ܐܣܪ‎, ܩܛܪ‎', 'ܚܝܛ‎', 'ܡܢܐ‎', 'ܐܡܪ‎', 'ܙܡܪ‎', 'ܫܥܐ‎', 'ܛܦ‎', 'ܪܣܡ‎, ܫܚܠ‎', 'ܓܠܕ‎, ܩܪܫ‎', 'ܙܘܐ‎, ܥܒܐ‎', 'ܫܡܫܐ‎', 'ܣܗܪܐ‎', 'ܟܘܟܒܐ‎', 'ܡܝܐ‎	', 'ܡܛܪܐ‎', 'ܢܗܪܐ‎', 'ܝܡܬܐ‎', 'ܝܡܐ‎', 'ܡܠܚܐ‎	', 'ܟܐܦܐ‎, ܐܒܢܐ‎, ܫܘܥܐ‎', 'ܚܠܐ‎', 'ܐܒܩܐ‎, ܕܩܬܐ‎', 'ܐܪܥܐ‎', 'ܥܢܢܐ‎, ܥܝܡܐ‎, ܥܝܒܐ‎', 'ܥܪܦܠܐ‎	', 'ܫܡܝܐ‎', 'ܪܘܚܐ‎	', 'ܬܠܓܐ‎', 'ܓܠܝܕܐ‎', 'ܬܢܢܐ‎	', 'ܢܘܪܐ‎, ܐܫܬܐ‎', 'ܩܛܡܐ‎	', 'ܝܩܕ‎', 'ܐܘܪܚܐ‎', 'ܛܘܪܐ‎', 'ܣܘܡܩܐ‎', 'ܝܘܪܩܐ‎', 'ܫܥܘܬܐ‎', 'ܚܘܪܐ‎', 'ܐܘܟܡܐ‎	', 'ܠܠܝܐ‎	', 'ܝܘܡܐ‎	', 'ܫܢܬܐ‎', 'ܫܚܝܢܐ‎', 'ܩܪܝܪܐ‎', 'ܡܠܝܐ‎', 'ܚܕܬܐ‎', 'ܥܬܝܩܐ‎', 'ܛܒܐ‎', 'ܒܝܫܐ‎', 'ܒܩܝܩܐ‎ 
ܚܪܝܒܐ‎', 'ܫܘܚܬܢܐ‎', 'ܬܪܝܨܐ‎	', 'ܚܘܕܪܢܝܐ‎', 'ܚܪܝܦܐ‎', 'ܩܗܝܐ‎', 'ܦܫܝܩܐ‎', 'ܪܛܝܒܐ‎, ܬܠܝܠܐ‎', 'ܝܒܝܫܐ‎', 'ܬܪܝܨܐ‎	', 'ܩܪܝܒܐ‎', 'ܪܚܝܩܐ‎', 'ܝܡܝܢܐ‎', 'ܣܡܠܐ‎', 'ܒ-‎, ܠܘܬ‎', 'ܥܡ‎', 'ܐܢ‎', '-ܡܛܠ ܕ‎, ܒܥܠܬ‎', 'ܫܡܐ‎']

+


the-ethan-hunt · 2018-03-03T12:11:06Z

@LBenzahia does this look good now? 😄

greenat92 · 2018-03-04T09:59:53Z

What about this Travis error: ERROR: Failure: ModuleNotFoundError (No module named 'cltk.tokenize.indian_tokenizer')? Please check this out. I'll test it locally ASAP 👍.

the-ethan-hunt · 2018-03-04T15:55:22Z

@LBenzahia, I have corrected the error. It was caused by the renaming of a package in a PR pulled after this one. 😅
P.S. Can you also review PR #687? Thank you! 😄
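
Since the CI failure came from a renamed module, a quick local check of whether a dotted module path still resolves can save a round trip through Travis. The helper below is a sketch using only the standard library; the function name module_available is an assumption, not part of CLTK.

```python
import importlib.util


def module_available(dotted_name):
    """Return True if the dotted module path can be located, False otherwise."""
    try:
        # find_spec imports parent packages and raises ModuleNotFoundError
        # when one of them is missing, so that case is handled as well.
        return importlib.util.find_spec(dotted_name) is not None
    except ModuleNotFoundError:
        return False


# Demonstrated on standard-library names so the snippet runs anywhere.
# In a CLTK checkout one would pass 'cltk.tokenize.indian_tokenizer' instead.
print(module_available("json.decoder"))
print(module_available("json.no_such_module"))
```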

the-ethan-hunt · 2018-03-10T07:38:06Z

@kylepjohnson, could you please have a look at this, #687, and #706? 😅

bhosalems · 2018-03-23T04:00:14Z

Hi @the-ethan-hunt, are you still looking at this? Could we also add a lemmatizer after this? Let me know your thoughts.

the-ethan-hunt · 2018-03-23T13:08:07Z

@maheshbhosale, the PR is yet to be reviewed by the maintainers. 😅 We can work on the lemmatizer after that.

bhosalems · 2018-03-25T17:39:40Z

Cool, I will watch for when it gets merged.

kylepjohnson · 2019-04-02T18:44:53Z

@the-ethan-hunt I know this PR is very old -- it never got merged, if I recall correctly, because of several merge conflicts.

For the stemmer, we need to know more about how it works and at least some idea of its accuracy. For example, is it based on a known algorithm for other Indian languages? If it is simply stripping off suffixes (verb_endings) in a linear order -- for every language I have studied, this method would fail very quickly.

Also, the OF swadesh shouldn't be included here. This would need to be reviewed in a separate PR.
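
The two requests above (how the stemmer orders its suffixes, and some idea of its accuracy) can be made concrete with a small experiment. The sketch below uses a toy Latin-script suffix list and a hand-made gold list, both assumptions chosen only to keep the example language-neutral; it makes no claim about Marathi morphology or about the PR's verb_endings.

```python
def first_match_stem(word, suffixes):
    """Strip the first suffix in list order that matches (linear order)."""
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) > len(suffix):
            return word[: -len(suffix)]
    return word


def longest_match_stem(word, suffixes):
    """Same as above, but try longer suffixes before shorter ones."""
    return first_match_stem(word, sorted(suffixes, key=len, reverse=True))


def accuracy(stemmer, gold_pairs, suffixes):
    """Fraction of (word, expected_stem) pairs the stemmer gets right."""
    correct = sum(1 for word, stem in gold_pairs if stemmer(word, suffixes) == stem)
    return correct / len(gold_pairs)


# Toy data, invented for illustration.
TOY_SUFFIXES = ["s", "es", "ing"]
TOY_GOLD = [("boxes", "box"), ("cats", "cat"), ("singing", "sing"), ("bus", "bus")]

print(accuracy(first_match_stem, TOY_GOLD, TOY_SUFFIXES))    # linear order
print(accuracy(longest_match_stem, TOY_GOLD, TOY_SUFFIXES))  # longest first
```

On this toy data, linear order mis-stems "boxes" because the shorter suffix fires first, and both variants over-stem "bus". A gold list of this kind for Marathi, even a few hundred words, would give reviewers the accuracy figure being asked for.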

the-ethan-hunt and others added 16 commits February 9, 2018 10:59

Added Old English swadesh list

3227cae

Merge branch 'master' into master

2c21277

Update swadesh.py

d27e8e7

Added Old French swadesh list

b1e582d

Added Swadesh to docs

0812393

Merge branch 'master' into master

3ff4c32

Update french.rst

26d99f0

Added Stem for Marathi

255e51a

Add files via upload

1fd2eae

Added test for Marathi stem

6e41256

Added verb_endings

6bc5533

Added stemmer to docs

11066d6

Merge branch 'master' into second-pr-for-marathi

377c8b2

Corrected indentation error

bdbd940

Corrected indentation error

93f1cd1

Rewrote file to avoid CI fail

2b83185

greenat92 requested changes Feb 28, 2018

View reviewed changes

the-ethan-hunt added 2 commits February 28, 2018 23:02

Removed blanks

12c4f15

Removed blanks

2fb7aac

Update stem.py

f5de45e

Merge branch 'master' into second-pr-for-marathi

3cc0b00

Replaced Indian tokenization module name

3da672d

the-ethan-hunt mentioned this pull request Mar 16, 2018

Stemmer for Marathi #697

Closed

Merge branch 'master' into second-pr-for-marathi

5fd4b04

kylepjohnson closed this Apr 2, 2019

kylepjohnson mentioned this pull request Apr 2, 2019

Closing miscellaneous old PRs #892

Closed
