
Languages supported? #27

Closed
kafechew opened this issue Apr 12, 2015 · 5 comments

Comments

@kafechew

What languages are supported by the limdu classifiers?
I assume all languages that use the a-z alphabet.

What about Hebrew, Chinese, Hindi, Korean... those that don't use the a-z alphabet?

Thanks :-)

@erelsgl
Owner

erelsgl commented Apr 13, 2015

Hi Kai,

limdu should work for any language. If you encounter any specific problem
working with limdu in your language, please report it and we will check.

Erel


@kafechew
Author

Cool~ Noted with thanks~

@kafechew
Author

I'm trying to analyse Chinese.
For English, 1-grams of words are simple and straightforward:
"I am Max" becomes "I", "am", "Max".

The problem with Chinese, unlike English, is that it has no spaces:
我是马氏 (something like "IamMax")
So it becomes "我是马氏" ("IamMax") instead of "我", "是", "马", "氏" ("I", "am", "Max").

My temporary solution:
If the content is in Chinese or Japanese, use n-grams of letters:
[limdu.features.NGramsOfLetters(1), limdu.features.NGramsOfLetters(2)]
p.s.: minor English mixed into mostly-Chinese text will be an issue... Max = "M", "a", "x"

If it's English as usual (or Hebrew, Korean...), n-grams of words will do:
[limdu.features.NGramsOfWords(1), limdu.features.NGramsOfWords(2)]

Do you have any better solution with Limdu for this kind of issue (tokenisation)?
Thanks in advance!
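A minimal sketch of this per-sample switching approach (the isCJK helper and the training data are hypothetical; the classifier setup follows limdu's README, and NGramsOfLetters/NGramsOfWords are the extractors named in this thread):

```js
var limdu = require('limdu');

// Hypothetical helper: does the text contain Chinese or Japanese characters?
// (Korean is left out on purpose: it uses spaces, so word n-grams work.)
function isCJK(text) {
	return /[\u4e00-\u9fff\u3040-\u30ff]/.test(text);
}

// Custom feature extractor: letter n-grams for Chinese/Japanese text
// (no spaces to split on), word n-grams otherwise. limdu feature
// extractors are functions of (sample, features) that fill the map.
function mixedScriptExtractor(sample, features) {
	var extractors = isCJK(sample)
		? [limdu.features.NGramsOfLetters(1), limdu.features.NGramsOfLetters(2)]
		: [limdu.features.NGramsOfWords(1), limdu.features.NGramsOfWords(2)];
	extractors.forEach(function(extract) { extract(sample, features); });
}

// Classifier setup as in limdu's README, with the custom extractor:
var TextClassifier = limdu.classifiers.EnhancedClassifier.bind(0, {
	classifierType: limdu.classifiers.Winnow.bind(0, {retrain_count: 10}),
	featureExtractor: mixedScriptExtractor
});

var classifier = new TextClassifier();
classifier.trainBatch([
	{input: "我是马氏", output: "intro"},  // hypothetical training samples
	{input: "I am Max", output: "intro"}
]);
console.log(classifier.classify("我是马氏"));  // expected: ["intro"]
```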

@kafechew kafechew reopened this Apr 16, 2015
@erelsgl
Owner

erelsgl commented Apr 18, 2015

Hi Kai,

Currently limdu contains only a small number of feature extractors, which
are used mainly as examples. There is a feature extractor that extracts
words, "limdu.features.NGramsOfWords", and one that extracts letters,
"limdu.features.NGramsOfLetters". You can try to use the second one and
see if it works.
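A minimal sketch of trying NGramsOfLetters directly on a Chinese sample (this assumes extractors are plain functions of (sample, features), as in limdu's custom-extractor examples; the exact feature keys it emits may differ):

```js
var limdu = require('limdu');

// Apply the letter-bigram extractor to a sample and inspect
// the resulting feature map (one entry per letter 2-gram).
var features = {};
limdu.features.NGramsOfLetters(2)("我是马氏", features);
console.log(features);
// Roughly: one feature per adjacent letter pair, e.g. 我是, 是马, 马氏
// (the exact key format, e.g. start/end markers, may differ).
```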


@kafechew
Author

Hi erelsgl,

Yep, as mentioned in my previous comment, I'm using both at the moment.
Just looking for better recommendations :-)
Thanks~
