Added Stemmer for Marathi #719

Closed

the-ethan-hunt
Contributor

In the context of #697, a stemmer has been added that uses a suffix-stripping algorithm.

  • Stemmer in stem/marathi/stem.py
  • Stemmer has been added to tests
  • Stemmer added in marathi.rst

The previous PR was closed because of a branch mix-up. 😅
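
For readers unfamiliar with the approach, a suffix-stripping stemmer can be sketched roughly as follows. This is a minimal illustration and not the code in stem/marathi/stem.py: the suffix list, the names strip_suffix and stem_text, and the min_stem_len guard are all assumptions made for the example.

```python
# Hypothetical suffix list for illustration only; it is NOT the list from the PR.
SUFFIXES = ("च्या", "ला", "ने", "चा", "ची", "चे", "त")


def strip_suffix(word, suffixes=SUFFIXES, min_stem_len=2):
    """Remove the longest matching suffix, keeping at least min_stem_len code points."""
    # Try longer suffixes first so a short suffix cannot pre-empt a longer one.
    for suffix in sorted(suffixes, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word


def stem_text(text):
    """Stem each whitespace-separated token independently."""
    return " ".join(strip_suffix(token) for token in text.split())


print(stem_text("मी वाचत आहे"))  # prints: मी वाच आहे
```

The input sentence is chosen for the example. Matching longest-first and refusing to strip below a minimum stem length are common guards in this family of stemmers, but they do not by themselves prevent over-stemming.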

@the-ethan-hunt
Contributor Author

@kylepjohnson, Travis CI reports the following error:
ModuleNotFoundError: No module named 'cltk.tokenize.indian_tokenizer'

Is there something I am missing? 😅

@inishchith
Member

@the-ethan-hunt IMO, you haven't synced your fork's master after this merge.
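
One way to check locally whether the module named in the traceback exists in the installed package is a small lookup like the one below. This is only a sketch: module_available is a hypothetical helper name, and it relies on the standard-library importlib.util.find_spec, which imports parent packages as a side effect.

```python
import importlib.util


def module_available(dotted_name):
    """Return True if the dotted module name can be located in this environment."""
    try:
        return importlib.util.find_spec(dotted_name) is not None
    except ImportError:
        # A parent package in the dotted path is missing or failed to import.
        return False


# Module name taken from the Travis error message.
print(module_available("cltk.tokenize.indian_tokenizer"))
```

If this prints False on a branch where upstream master prints True, the branch is probably behind upstream or still uses an old module name.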

@@ -16,6 +16,8 @@
from cltk.stem.akkadian.stem import Stemmer as AkkadianStemmer
from cltk.stem.akkadian.syllabifier import Syllabifier as AkkadianSyllabifier
from cltk.stem.french.stem import stem
from cltk.stem.marathi.stem import stem

Member

remove this blank line

target="मी वाच आहे"
self.assertEqual(stemmed_text,target)


Member

the same for this and the following


return word


Member

the same for this and the following

swadesh_syc=['ܐܢܐ‎','ܐܢܬ‎, ܐܢܬܝ‎', 'ܗܘ‎', 'ܚܢܢ‎,, ܐܢܚܢܢ‎', 'ܐܢܬܘܢ‎ , ܐܢܬܝܢ‎ ', 'ܗܢܘܢ‎ , ܗܢܝܢ‎', 'ܗܢܐ‎, ܗܕܐ‎', 'ܗܘ‎, ܗܝ‎', 'ܗܪܟܐ‎', 'ܬܡܢ‎', 'ܡܢ‎', 'ܡܐ‎, ܡܢ‎, ܡܢܐ‎, ܡܘܢ‎', 'ܐܝܟܐ‎', 'ܐܡܬܝ‎', 'ܐܝܟܢ‎,, ܐܝܟܢܐ‎', 'ܠܐ‎', 'ܟܠ‎', 'ܣܓܝ‎ ', 'ܟܡܐ‎ ', 'ܒܨܝܪܐ‎', 'ܐܚܪܢܐ‎, ܐܚܪܬܐ‎', 'ܚܕ‎ , ܚܕܐ‎', 'ܬܪܝܢ‎, ܬܪܬܝܢ‎', 'ܬܠܬܐ‎, ܬܠܬ‎', 'ܐܪܒܥܐ‎, ܐܪܒܥ‎', 'ܚܡܫܐ‎, ܚܡܫ‎', 'ܪܒܐ‎, ܟܒܝܪܐ‎ ', 'ܐܪܝܟܐ‎', 'ܪܘܝܚܐ‎, ܦܬܝܐ‎', 'ܥܒܝܛܐ‎', 'ܢܛܝܠܐ‎, ܝܩܘܪܐ‎ ', 'ܙܥܘܪܐ‎', 'ܟܪܝܐ‎', 'ܥܝܩܐ‎', 'ܪܩܝܩܐ‎, ܛܠܝܚܐ‎', 'ܐܢܬܬܐ‎', 'ܓܒܪܐ‎', 'ܐܢܫܐ‎', 'ܝܠܘܕܐ‎', 'ܐܢܬܬܐ‎', 'ܒܥܠܐ‎', 'ܐܡܐ‎', 'ܐܒܐ‎', 'ܚܝܘܬܐ‎', 'ܢܘܢܐ‎', 'ܛܝܪܐ‎, ܨܦܪܐ‎', 'ܟܠܒܐ‎', 'ܩܠܡܐ‎', 'ܚܘܝܐ‎', 'ܬܘܠܥܐ‎', 'ܐܝܠܢܐ‎', 'ܥܒܐ‎', 'ܩܝܣܐ‎', 'ܦܐܪܐ‎', 'ܙܪܥܐ‎', 'ܛܪܦܐ‎', 'ܫܪܫܐ‎ ', 'ܩܠܦܬܐ‎', 'ܗܒܒܐ‎', 'ܓܠܐ‎', 'ܚܒܠܐ‎', 'ܓܠܕܐ‎ ', 'ܒܣܪܐ‎', 'ܕܡܐ‎', 'ܓܪܡܐ‎', 'ܕܗܢܐ‎, ܫܘܡܢܐ‎', 'ܒܝܥܬܐ‎', 'ܩܪܢܐ‎', 'ܕܘܢܒܐ‎', 'ܐܒܪܐ‎', 'ܣܥܪܐ‎', 'ܪܝܫܐ‎', 'ܐܕܢܐ‎', 'ܥܝܢܐ‎', 'ܢܚܝܪܐ‎ ', 'ܦܘܡܐ‎', 'ܫܢܐ‎, ܟܟܐ‎', 'ܠܫܢܐ‎', 'ܛܦܪܐ‎ ', 'ܥܩܠܐ‎', 'ܪܓܠܐ‎', 'ܒܘܪܟܐ‎', 'ܐܝܕܐ‎', 'ܟܢܦܐ‎ ', 'ܒܛܢܐ‎, ܟܪܣܐ‎ ', 'ܡܥܝܐ‎, ܓܘܐ‎', 'ܨܘܪܐ‎, ܩܕܠܐ‎', 'ܚܨܐ‎, ܒܣܬܪܐ‎', 'ܚܕܝܐ‎', 'ܠܒܐ‎', 'ܟܒܕܐ‎', 'ܫܬܐ‎', 'ܐܟܠ‎', 'ܢܟܬ‎', 'ܡܨ‎ ', 'ܪܩ‎', 'ܓܥܛ‎', 'ܢܦܚ‎', 'ܢܦܫ‎, ܢܫܡ‎', 'ܓܚܟ‎ ', 'ܚܙܐ‎', 'ܫܡܥ‎', 'ܝܕܥ‎', 'ܚܫܒ‎', 'ܡܚ‎, ܣܩ‎', 'ܕܚܠ‎, ܟܘܪ‎', 'ܕܡܟ‎', 'ܚܝܐ‎ ', 'ܡܝܬ‎', 'ܩܛܠ‎', 'ܟܬܫ‎', 'ܨܝܕ‎ ', 'ܡܚܐ‎, ܢܩܫ‎', 'ܓܕܡ‎, ܩܛܥ‎', 'ܫܪܩ‎, ܦܕܥ‎, ܦܪܬ‎', 'ܕܓܫ‎', 'ܚܟ‎, ܣܪܛ‎', 'ܚܦܪ‎', 'ܣܚܐ‎', 'ܦܪܚ‎ ', 'ܗܠܟ‎ ', 'ܐܬܐ‎ ', 'ܫܟܒ‎, ܡܟ‎', 'ܝܬܒ‎', 'ܬܪܨ‎', 'ܦܢܐ‎, ܥܛܦ‎ ', 'ܢܦܠ‎', 'ܝܗܒ‎, ܢܬܠ‎', 'ܐܚܕ‎', 'ܩܡܛ‎, ܥܨܪ‎', 'ܫܦ‎, ܚܟ‎', 'ܚܠܠ‎, ܦܝܥ‎', 'ܟܦܪ‎', 'ܓܪܫ‎', 'ܙܥܦ‎ ', 'ܪܡܐ‎', 'ܐܣܪ‎, ܩܛܪ‎', 'ܚܝܛ‎', 'ܡܢܐ‎', 'ܐܡܪ‎', 'ܙܡܪ‎', 'ܫܥܐ‎', 'ܛܦ‎', 'ܪܣܡ‎, ܫܚܠ‎', 'ܓܠܕ‎, ܩܪܫ‎', 'ܙܘܐ‎, ܥܒܐ‎', 'ܫܡܫܐ‎', 'ܣܗܪܐ‎', 'ܟܘܟܒܐ‎', 'ܡܝܐ‎ ', 'ܡܛܪܐ‎', 'ܢܗܪܐ‎', 'ܝܡܬܐ‎', 'ܝܡܐ‎', 'ܡܠܚܐ‎ ', 'ܟܐܦܐ‎, ܐܒܢܐ‎, ܫܘܥܐ‎', 'ܚܠܐ‎', 'ܐܒܩܐ‎, ܕܩܬܐ‎', 'ܐܪܥܐ‎', 'ܥܢܢܐ‎, ܥܝܡܐ‎, ܥܝܒܐ‎', 'ܥܪܦܠܐ‎ ', 'ܫܡܝܐ‎', 'ܪܘܚܐ‎ ', 'ܬܠܓܐ‎', 'ܓܠܝܕܐ‎', 'ܬܢܢܐ‎ ', 'ܢܘܪܐ‎, ܐܫܬܐ‎', 'ܩܛܡܐ‎ ', 'ܝܩܕ‎', 'ܐܘܪܚܐ‎', 'ܛܘܪܐ‎', 'ܣܘܡܩܐ‎', 'ܝܘܪܩܐ‎', 'ܫܥܘܬܐ‎', 'ܚܘܪܐ‎', 'ܐܘܟܡܐ‎ ', 'ܠܠܝܐ‎ ', 'ܝܘܡܐ‎ ', 'ܫܢܬܐ‎', 'ܫܚܝܢܐ‎', 'ܩܪܝܪܐ‎', 'ܡܠܝܐ‎', 'ܚܕܬܐ‎', 'ܥܬܝܩܐ‎', 'ܛܒܐ‎', 'ܒܝܫܐ‎', 'ܒܩܝܩܐ‎ 
ܚܪܝܒܐ‎', 'ܫܘܚܬܢܐ‎', 'ܬܪܝܨܐ‎ ', 'ܚܘܕܪܢܝܐ‎', 'ܚܪܝܦܐ‎', 'ܩܗܝܐ‎', 'ܦܫܝܩܐ‎', 'ܪܛܝܒܐ‎, ܬܠܝܠܐ‎', 'ܝܒܝܫܐ‎', 'ܬܪܝܨܐ‎ ', 'ܩܪܝܒܐ‎', 'ܪܚܝܩܐ‎', 'ܝܡܝܢܐ‎', 'ܣܡܠܐ‎', 'ܒ-‎, ܠܘܬ‎', 'ܥܡ‎', 'ܐܢ‎', '-ܡܛܠ ܕ‎, ܒܥܠܬ‎', 'ܫܡܐ‎']


Member

the same

@the-ethan-hunt
Contributor Author

@LBenzahia does this look good now? 😄

@greenat92
Member

greenat92 commented Mar 4, 2018

What about this Travis error: ERROR: Failure: ModuleNotFoundError (No module named 'cltk.tokenize.indian_tokenizer')? Please check it out. I'll test it locally ASAP 👍.

@the-ethan-hunt
Contributor Author

@LBenzahia, I have corrected the error. It was caused by a package being renamed in a PR that was pulled in after this one. 😅
P.S. Can you also review PR #687? Thank you! 😄

@the-ethan-hunt
Contributor Author

the-ethan-hunt commented Mar 10, 2018

@kylepjohnson, could you please have a look at this, #687, and #706? 😅

@bhosalems
Member

Hi @the-ethan-hunt, are you still looking at this? Could we also add a lemmatizer after this? Let me know your thoughts.

@the-ethan-hunt
Contributor Author

@maheshbhosale, the PR is yet to be reviewed by the maintainers. 😅 We can work on the lemmatizer after that.

@bhosalems
Member

Cool, I will keep an eye out for when it gets merged.

@kylepjohnson
Member

@the-ethan-hunt I know this PR is very old. If I recall correctly, it never got merged because of several merge conflicts.

For the stemmer, we need to know more about how it works and have at least some idea of its accuracy. For example, is it based on a known algorithm for other Indian languages? If it simply strips suffixes (verb_endings) in a linear order, then, for every language I have studied, this method would fail very quickly.

Also, the OF swadesh shouldn't be included here. This would need to be reviewed in a separate PR.
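
One way to get a rough accuracy figure of the kind asked for above is to score the stemmer against a small hand-checked list of (word, expected stem) pairs. The sketch below is an illustration under stated assumptions: naive_stem, the single suffix it strips, and the three gold pairs are hypothetical and were not part of the PR. It also shows how unconditional suffix stripping over-stems words such as भारत, where the final त belongs to the word itself.

```python
def naive_stem(word, suffixes=("त",)):
    """Hypothetical stand-in for a stemmer: strip the first matching suffix."""
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) > len(suffix):
            return word[: -len(suffix)]
    return word


def accuracy(stemmer, gold_pairs):
    """Fraction of words whose predicted stem equals the expected stem."""
    correct = sum(1 for word, expected in gold_pairs if stemmer(word) == expected)
    return correct / len(gold_pairs)


# Hypothetical hand-checked pairs (word, expected stem); not data from the PR.
GOLD = [
    ("वाचत", "वाच"),  # participle: stripping the suffix is correct
    ("भारत", "भारत"),  # proper noun: the final consonant is part of the word
    ("हात", "हात"),  # noun meaning "hand": the final consonant is part of the word
]

# Collect (word, predicted, expected) for every mismatch to inspect failure modes.
errors = [(w, naive_stem(w), e) for w, e in GOLD if naive_stem(w) != e]
print(accuracy(naive_stem, GOLD))
print(errors)
```

With a larger gold list, the same error listing makes it easy to see which suffix rules cause the most over-stemming.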
