
any example for topic modeling using api (not a real issue) #135

Closed
macinux opened this issue Apr 6, 2014 · 12 comments

Comments

@macinux

macinux commented Apr 6, 2014

Hi,

I am a total beginner and am trying to learn topic modeling. The user guide on the website has an example of topic modeling from the command line. I just wonder if you have a quick example of doing topic modeling with API code (using FACTORIE as a library to do topic modeling).

Thanks in advance.

James

@jasonbaldridge
Contributor

I set up a pretty straightforward example here for Mallet and FACTORIE:

https://github.com/jasonbaldridge/maul

It's a bit out of date, but it should either still work or be pretty easy to adapt. (I'd happily accept a pull request to update it to the latest version of FACTORIE.)

@macinux
Author

macinux commented Apr 8, 2014

Thank you! It still works if we just add one more import. I could open a pull request, but it actually works pretty well as is.

@macinux
Author

macinux commented Apr 8, 2014

Jason,

I was studying your example. It will not work if you use Chinese characters in the document or pass a Chinese string. I guess it cannot find the right tokens. Here is the code with a Chinese string as input:

```scala
object WordSeqDomain extends CategoricalSeqDomain[String]
val model = DirectedModel()
val lda = new LDA(WordSeqDomain, opts.numTopics.value, opts.alpha.value,
  opts.beta.value, opts.optimizeBurnIn.value)(model, random)
val mySegmenter = new cc.factorie.app.strings.RegexSegmenter(opts.tokenRegex.value.r)
val cdoc = Document.fromString(WordSeqDomain, null, "我喜欢我的电脑", segmenter = mySegmenter)
lda.addDocument(cdoc, random)
```

Do you have any advice on this? Which segmenter should we use? Is there a segmenter that works with any language?

Thank you very much again!
James

@oskarsinger
Contributor

Hi Jason and Macinux,

I just wrote a Chinese word segmenter that can be integrated into the document pipeline using the "process" method. I have been doing some feature engineering on it because it is just below state-of-the-art for the closed track (no outside knowledge; only information from the training corpus can be used). It's not quite done yet, but it's very usable. I can make a pull request later today.

Unfortunately, you will have to train it yourselves because we don't yet have permission to post trained models with the corpora I used. However, training and serializing a model should take only two method calls:

```
segmenter.train([file path of training corpus])
segmenter.serialize([file path of location to save the serialized model])
```

After that, populating a new instance with the pre-trained model is as easy as:

```
segmenter.deserialize([file path of serialized, trained model])
```
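Put together, the workflow described above might look like the following sketch. Only the `train`/`serialize`/`deserialize` method names come from this thread; the class name `ChineseSegmenter` and the file paths are assumptions for illustration:

```scala
// Hypothetical end-to-end use of the Chinese word segmenter described above.
// The class name and paths are placeholders; only the three method names
// are taken from this thread.
val segmenter = new ChineseSegmenter()

// One-time: train on a SIGHAN bakeoff corpus and save the model.
segmenter.train("corpora/training.utf8")
segmenter.serialize("models/chinese-segmenter.factorie")

// Later, in a new process: load the pre-trained model instead of retraining.
val loaded = new ChineseSegmenter()
loaded.deserialize("models/chinese-segmenter.factorie")
```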

The training and testing corpora can be found at:

http://www.sighan.org/bakeoff2005/

If you are dealing with traditional characters, use the Academia Sinica corpus (from Taiwan) or the City University of Hong Kong corpus (from Hong Kong). If you are dealing with simplified characters, use the Peking University or Microsoft Research Asia corpora (both from Beijing).

Read the licensing on the data!

@macinux
Author

macinux commented Apr 8, 2014

Thank you oskarsinger!

For other languages, do we have to do something similar to Chinese? I wonder if there is a more generic way to handle most languages. Otherwise, it will be complicated to handle other languages.

James

@oskarsinger
Contributor

It depends on the language. The reason it's necessary for Chinese is that Chinese sentences don't have their words separated by spaces, and, because of syntactic and semantic ambiguity, it's not possible to segment a Chinese sentence deterministically.

For other languages, you will need to check whether they are space-delimited on a per-language basis, then modify your pipeline accordingly. If you want to automate this process, you can do it with n-gram tables (for languages that share an alphabet and some vocabulary, e.g. the Romance languages). For languages with unique Unicode character sets, just check the Unicode values of the incoming characters.
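The Unicode check described above can be sketched with the standard `java.lang.Character.UnicodeScript` API (this is an illustrative example, not part of FACTORIE; the function name is made up):

```scala
import java.lang.Character.UnicodeScript

// Guess the dominant Unicode script of a string so that text can be
// routed to an appropriate tokenizer (e.g. Chinese text to a trained
// word segmenter, Latin-script text to whitespace/regex tokenization).
def dominantScript(text: String): UnicodeScript = {
  val scripts = text.codePoints().toArray.toSeq
    .map(cp => UnicodeScript.of(cp))
    // Ignore punctuation, digits, and whitespace, which are script-neutral.
    .filterNot(s => s == UnicodeScript.COMMON || s == UnicodeScript.INHERITED)
  if (scripts.isEmpty) UnicodeScript.UNKNOWN
  else scripts.groupBy(identity).maxBy(_._2.size)._1
}

dominantScript("我喜欢我的电脑") // HAN: send to the Chinese segmenter
dominantScript("hello there")   // LATIN: whitespace tokenization is fine
```

Note that, as discussed below for Japanese, a script check alone does not always identify the language, only the writing system.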

@macinux
Author

macinux commented Apr 8, 2014

Thank you oskarsinger. I am a total beginner at this. So Japanese/Korean will be similar to Chinese, right?

@oskarsinger
Contributor

Yes, you can check for Japanese/Korean using Unicode (although it gets a little complicated because Japanese borrows one of its three writing systems from Chinese, and I am not sure that Unicode distinguishes between the two languages for their shared characters). You can actually do this for any language with a unique writing system that is expressible in Unicode. Here is a link to the comprehensive list of Unicode characters:

http://www.unicode.org/charts/

and a place to look up a Unicode character by hex code, or a hex code by character:

http://unicodelookup.com/

As for tokenization/word segmentation, I am not really sure about Japanese and Korean. You will just have to check out the Wikipedia pages on those languages. I have found Wikipedia to be a good place to start learning about the unique characteristics of a language.

@oskarsinger
Contributor

By the way, the Chinese word segmenter is now available. Make sure to notify us if there are any issues, and I will take care of them as soon as I can!

@macinux
Author

macinux commented Apr 10, 2014

Thank you oskarsinger. We will try that and let you know.

@oskarsinger
Contributor

Great! Glad I could help.

@strubell
Member

This seems to be resolved.
