
any example for topic modeling using api (not a real issue) #135

Closed
macinux opened this issue Apr 6, 2014 · 12 comments

Comments

@macinux

macinux commented Apr 6, 2014

Hi,

I am a total beginner and am trying to learn topic modeling. The user guide on the website has an example of topic modeling from the command line. I just wonder if you have a quick example of doing topic modeling with API code (using FACTORIE as a library to do topic modeling).

Thanks in advance.

James

@jasonbaldridge
Contributor

I set up a pretty straightforward example here for Mallet and FACTORIE:

https://github.com/jasonbaldridge/maul

It's a bit out of date, but it should either still work or be pretty easy to adapt. (I'd happily accept a pull request to update it to the latest version of FACTORIE.)

@macinux
Author

macinux commented Apr 8, 2014

Thank you! It still works if we just add one more import. I could open a pull request, but it actually works pretty well as is.

@macinux
Author

macinux commented Apr 8, 2014

Jason,

I was studying your example. It will not work if you use Chinese characters in the document or pass a Chinese string. I guess it cannot find the right tokens. Here is the code with a Chinese string as input:

```scala
object WordSeqDomain extends CategoricalSeqDomain[String]
val model = DirectedModel()
val lda = new LDA(WordSeqDomain, opts.numTopics.value, opts.alpha.value,
  opts.beta.value, opts.optimizeBurnIn.value)(model, random)
val mySegmenter = new cc.factorie.app.strings.RegexSegmenter(opts.tokenRegex.value.r)
val cdoc = Document.fromString(WordSeqDomain, null, "我喜欢我的电脑", segmenter = mySegmenter)
lda.addDocument(cdoc, random)
```

Do you have any advice on this? Which segmenter should we use? Is there a segmenter that works with any language?

Thank you very much again!
James

@oskarsinger
Contributor

Hi Jason and Macinux,

I just wrote a Chinese word segmenter that can be integrated into the document pipeline using the "process" method. I have been doing some feature engineering on it because it is just below state-of-the-art for the closed track (no outside knowledge; only information from the training corpus can be used). It's not quite done yet, but it's very usable. I can make a pull request later today.

Unfortunately, you will have to train it yourselves because we don't yet have permission to post trained models with the corpora I used. However, training and serializing a model should take only two method calls:

```
segmenter.train([file path of training corpus])
segmenter.serialize([file path of location to save the serialized model])
```

After that, populating a new instance with the pre-trained model is as easy as:

```
segmenter.deserialize([file path of serialized, trained model])
```
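Put together, the workflow described above might look like the following sketch. Only the `train`/`serialize`/`deserialize` method names come from this thread; the class name `ChineseSegmenter` and the file paths are assumptions for illustration:

```scala
// Hypothetical end-to-end use of the Chinese word segmenter described above.
// The class name and paths are placeholders; only the three method names
// are taken from this thread.
val segmenter = new ChineseSegmenter()

// One-time: train on a SIGHAN bakeoff corpus and save the model.
segmenter.train("corpora/training.utf8")
segmenter.serialize("models/chinese-segmenter.factorie")

// Later, in a new process: load the pre-trained model instead of retraining.
val loaded = new ChineseSegmenter()
loaded.deserialize("models/chinese-segmenter.factorie")
```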

The training and testing corpora can be found at:

http://www.sighan.org/bakeoff2005/

If you are dealing with traditional characters, use the Academia Sinica corpus (from Taiwan) or the City University of Hong Kong corpus (from Hong Kong). If you are dealing with simplified characters, use the Peking University or Microsoft Research Asia corpora (both from Beijing).

Read the licensing on the data!

@macinux
Author

macinux commented Apr 8, 2014

Thank you oskarsinger!

For other languages, do we have to do something similar to Chinese? I wonder if there is a more generic way to handle most languages. Otherwise, it will be complicated to handle other languages.

James

@oskarsinger
Contributor

It depends on the language. The reason it's necessary for Chinese is that Chinese sentences don't have their words separated by spaces, and, because of syntactic and semantic ambiguity, it's not possible to segment a Chinese sentence deterministically.

For other languages, you will need to check whether they are space-delimited on a per-language basis, then modify your pipeline accordingly. If you want to automate this process, you can do it with n-gram tables (for languages that share an alphabet and some vocabulary, e.g. the Romance languages). For languages with unique Unicode character sets, just check the Unicode values of the incoming characters.
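The Unicode check described above can be sketched with the standard `java.lang.Character.UnicodeScript` API (this is an illustrative example, not part of FACTORIE; the function name is made up):

```scala
import java.lang.Character.UnicodeScript

// Guess the dominant Unicode script of a string so that text can be
// routed to an appropriate tokenizer (e.g. Chinese text to a trained
// word segmenter, Latin-script text to whitespace/regex tokenization).
def dominantScript(text: String): UnicodeScript = {
  val scripts = text.codePoints().toArray.toSeq
    .map(cp => UnicodeScript.of(cp))
    // Ignore punctuation, digits, and whitespace, which are script-neutral.
    .filterNot(s => s == UnicodeScript.COMMON || s == UnicodeScript.INHERITED)
  if (scripts.isEmpty) UnicodeScript.UNKNOWN
  else scripts.groupBy(identity).maxBy(_._2.size)._1
}

dominantScript("我喜欢我的电脑") // HAN: send to the Chinese segmenter
dominantScript("hello there")   // LATIN: whitespace tokenization is fine
```

Note that, as discussed below for Japanese, a script check alone does not always identify the language, only the writing system.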

@macinux
Author

macinux commented Apr 8, 2014

Thank you oskarsinger. I am a total beginner at this. So Japanese/Korean will be similar to Chinese, right?

@oskarsinger
Contributor

Yes, you can check for Japanese/Korean using Unicode (although it gets a little complicated because Japanese borrows one of its three writing systems from Chinese, and I am not sure that Unicode distinguishes between the two languages for their shared characters). You can actually do this for any language with a unique writing system that is expressible in Unicode. Here is a link to the comprehensive list of Unicode characters:

http://www.unicode.org/charts/

and a place to look up a Unicode character by hex code, or a hex code by character:

http://unicodelookup.com/

As for tokenization/word segmentation, I am not really sure about Japanese and Korean. You will just have to check out the Wikipedia pages on those languages. I have found Wikipedia to be a good place to start learning about the unique characteristics of a language.

@oskarsinger
Contributor

By the way, the Chinese word segmenter is now available. Make sure to notify us if there are any issues, and I will take care of them as soon as I can!

@macinux
Author

macinux commented Apr 10, 2014

Thank you oskarsinger. We will try that and let you know.

@oskarsinger
Contributor

Great! Glad I could help.

@strubell
Member

This seems to be resolved.
