
Languages supported? #27

Closed
kafechew opened this issue Apr 12, 2015 · 5 comments

Comments

@kafechew

What languages are supported by the limdu classifiers?
I assume all languages that use the a-z alphabet.

What about Hebrew, Chinese, Hindi, Korean... those that don't use the a-z alphabet?

Thanks :-)

@erelsgl
Owner

erelsgl commented Apr 13, 2015

Hi Kai,

limdu should work for any language. If you encounter any specific problem
working with limdu in your language, please report it and we will check.

Erel


@kafechew
Author

Cool~ Noted with thanks~

@kafechew
Author

I'm trying to analyse Chinese.
For English, 1-grams of words are simple and straightforward:
"I am Max" becomes "I", "am", "Max".

The problem with Chinese, unlike English, is that it has no spaces:
我是马氏 (something like "IamMax")
So it becomes "我是马氏" ("IamMax") instead of "我", "是", "马", "氏" ("I", "am", "Max").

My temporary solution:
If the content is in Chinese or Japanese, use n-grams of letters:
[limdu.features.NGramsOfLetters(1), limdu.features.NGramsOfLetters(2)]
p.s.: minor English mixed into mostly-Chinese text will be an issue... Max = "M", "a", "x"

If it's English as usual (or Hebrew, Korean...), n-grams of words will do:
[limdu.features.NGramsOfWords(1), limdu.features.NGramsOfWords(2)]

Do you have any better solution with Limdu for this kind of issue (tokenisation)?
Thanks in advance!
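A minimal sketch of this per-sample switching approach (the isCJK helper and the training data are hypothetical; the classifier setup follows limdu's README, and NGramsOfLetters/NGramsOfWords are the extractors named in this thread):

```js
var limdu = require('limdu');

// Hypothetical helper: does the text contain Chinese or Japanese characters?
// (Korean is left out on purpose: it uses spaces, so word n-grams work.)
function isCJK(text) {
	return /[\u4e00-\u9fff\u3040-\u30ff]/.test(text);
}

// Custom feature extractor: letter n-grams for Chinese/Japanese text
// (no spaces to split on), word n-grams otherwise. limdu feature
// extractors are functions of (sample, features) that fill the map.
function mixedScriptExtractor(sample, features) {
	var extractors = isCJK(sample)
		? [limdu.features.NGramsOfLetters(1), limdu.features.NGramsOfLetters(2)]
		: [limdu.features.NGramsOfWords(1), limdu.features.NGramsOfWords(2)];
	extractors.forEach(function(extract) { extract(sample, features); });
}

// Classifier setup as in limdu's README, with the custom extractor:
var TextClassifier = limdu.classifiers.EnhancedClassifier.bind(0, {
	classifierType: limdu.classifiers.Winnow.bind(0, {retrain_count: 10}),
	featureExtractor: mixedScriptExtractor
});

var classifier = new TextClassifier();
classifier.trainBatch([
	{input: "我是马氏", output: "intro"},  // hypothetical training samples
	{input: "I am Max", output: "intro"}
]);
console.log(classifier.classify("我是马氏"));  // expected: ["intro"]
```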

@kafechew kafechew reopened this Apr 16, 2015
@erelsgl
Owner

erelsgl commented Apr 18, 2015

Hi Kai,

Currently limdu contains only a small number of feature extractors, which
are used mainly as examples. There is a feature extractor that extracts
words, "limdu.features.NGramsOfWords", and one that extracts letters,
"limdu.features.NGramsOfLetters". You can try to use the second one and
see if it works.
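A minimal sketch of trying NGramsOfLetters directly on a Chinese sample (this assumes extractors are plain functions of (sample, features), as in limdu's custom-extractor examples; the exact feature keys it emits may differ):

```js
var limdu = require('limdu');

// Apply the letter-bigram extractor to a sample and inspect
// the resulting feature map (one entry per letter 2-gram).
var features = {};
limdu.features.NGramsOfLetters(2)("我是马氏", features);
console.log(features);
// Roughly: one feature per adjacent letter pair, e.g. 我是, 是马, 马氏
// (the exact key format, e.g. start/end markers, may differ).
```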


@kafechew
Author

Hi erelsgl,

Yep, as mentioned in my previous comment, I'm using both at the moment.
Just looking for better recommendations :-)
Thanks~
