any example for topic modeling using api (not a real issue) #135
Comments
I set up a pretty straightforward example here for Mallet and FACTORIE: https://github.com/jasonbaldridge/maul It's a bit out of date, but it should either still work or be pretty easy to adapt. (I'd happily accept a pull request to update it to the latest version of FACTORIE.)
Thank you! It still works if we just add one more import. I can open a pull request, but it actually works pretty well as-is.
Jason, I was studying your example. It does not work if you use Chinese characters in the document or pass Chinese text in the string. I guess it cannot find the right tokens. Here is the code with a Chinese string as input:
Do you have any advice on this? Which segmenter should we use? Is there a segmenter that works with any language? Thank you very much again!
Hi Jason and Macinux, I just wrote a Chinese word segmenter that can be integrated into the document pipeline using the "process" method. I have been doing some feature engineering on it because it is just below state-of-the-art for the closed track (no outside knowledge; only information from the training corpus can be used). It's not quite done yet, but it's very usable. I can make a pull request later today.

Unfortunately, you will have to train it yourselves because we don't yet have permission to post models trained on the corpora I used. However, training and serializing a model should take only two method calls:

segmenter.train([file path of training corpus])

After that, populating a new instance with the pre-trained model is as easy as:

segmenter.deserialize([file path of serialized, trained model])

The training and testing corpora can be found at: http://www.sighan.org/bakeoff2005/ If you are dealing with traditional characters, use the Academia Sinica corpus (from Taiwan) or the City University of Hong Kong corpus (from Hong Kong). If you are dealing with simplified characters, use the Peking University or Microsoft Research Asia corpora (both from Beijing). Read the licensing on the data!
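A rough sketch of that two-call workflow. The class name, method bodies, and file paths below are all hypothetical stand-ins; only the train/deserialize call shape comes from the comment above, and the real segmenter in the pull request may be named and structured differently.

```java
// Hypothetical stub mirroring the API shape described above; not the
// actual FACTORIE segmenter.
public class ChineseSegmenterSketch {
    private String model = null;

    // Learn a model from a SIGHAN Bakeoff training corpus on disk.
    public void train(String corpusPath) { model = "trained:" + corpusPath; }

    // Write the trained model out so it can be reloaded later.
    public void serialize(String modelPath) { /* write weights to modelPath */ }

    // Populate a fresh instance from a previously serialized model.
    public void deserialize(String modelPath) { model = "loaded:" + modelPath; }

    public boolean isReady() { return model != null; }
}
```

The point of the shape: you pay the training cost once, serialize, and then every later run just deserializes into a fresh instance instead of retraining.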
Thank you, oskarsinger! For other languages, do we have to do something similar to Chinese? I wonder if there is a more generic way to handle most languages; otherwise it will be complicated to support them. James
It depends on the language. The reason it's necessary for Chinese is that Chinese sentences don't have their words separated by spaces, and, because of syntactic and semantic ambiguity, it's not possible to segment a Chinese sentence deterministically. For other languages, you will need to check whether they are space-delimited on a per-language basis, then modify your pipeline accordingly. If you want to automate this process, you can do it with n-gram tables (for languages that share an alphabet and some vocabulary, e.g. the Romance languages). For languages with unique Unicode character sets, just check the Unicode values of the incoming characters.
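To make that Unicode check concrete, here is a minimal sketch using only the standard library's `Character.UnicodeBlock`. The class name is made up for illustration; the routing decision (statistical segmenter vs. whitespace tokenization) is the idea from the comment above.

```java
// Route text by script: CJK ideographs need a statistical segmenter,
// while space-delimited text can use plain whitespace tokenization.
public class PipelineRouter {
    // True if the character falls in a CJK ideograph block.
    static boolean isCjk(char c) {
        Character.UnicodeBlock b = Character.UnicodeBlock.of(c);
        return b == Character.UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS
            || b == Character.UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS_EXTENSION_A;
    }

    // True if any character in the string is a CJK ideograph.
    static boolean needsSegmenter(String text) {
        for (char c : text.toCharArray()) {
            if (isCjk(c)) return true;
        }
        return false;
    }
}
```

For example, `needsSegmenter("你好 world")` is true while `needsSegmenter("hello world")` is false, so mixed pipelines can branch per document.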
Thank you, oskarsinger. I am a total beginner at this. So Japanese/Korean will be similar to Chinese, right?
Yes, you can check for Japanese/Korean using Unicode (although it gets a little complicated because Japanese borrows one of its three writing systems from Chinese, and I am not sure that Unicode distinguishes between the two languages for their shared characters). You can actually do this for any language with a unique writing system that is expressible in Unicode. Here is a link to the comprehensive list of Unicode characters: http://www.unicode.org/charts/ and a place to look up a Unicode character by hex code, or a hex code by character: As for tokenization/word segmentation, I am not really sure about Japanese and Korean. You will just have to check out the Wikipedia pages on those languages. I have found Wikipedia to be a good place to start learning about the unique characteristics of a language.
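The same Unicode-block idea can be extended into a rough script guesser (class and labels are illustrative). Hangul syllables imply Korean and kana imply Japanese, while Han ideographs alone stay ambiguous, exactly because of the Han unification caveat mentioned above.

```java
// Rough per-string script guess via Unicode blocks.
public class ScriptGuess {
    static String guess(String s) {
        boolean hangul = false, kana = false, han = false;
        for (char c : s.toCharArray()) {
            Character.UnicodeBlock b = Character.UnicodeBlock.of(c);
            if (b == Character.UnicodeBlock.HANGUL_SYLLABLES) hangul = true;
            else if (b == Character.UnicodeBlock.HIRAGANA
                  || b == Character.UnicodeBlock.KATAKANA) kana = true;
            else if (b == Character.UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS) han = true;
        }
        if (hangul) return "Korean";
        if (kana) return "Japanese";
        // Unicode unifies the Han characters Chinese and Japanese share,
        // so Han-only text cannot be resolved by code points alone.
        if (han) return "Chinese or Japanese (ambiguous)";
        return "other";
    }
}
```

This is only a heuristic for routing documents; real language identification would combine it with n-gram statistics for scripts that many languages share.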
By the way, the Chinese word segmenter is now available. Make sure to notify us if there are any issues, and I will take care of them as soon as I can!
Thank you, oskarsinger. We will try that and let you know.
Great! Glad I could help.
This seems to be resolved.
Hi,
I am a total beginner trying to learn topic modeling. The user guide on the website has an example of topic modeling from the command line. I just wonder if you have a quick example of doing topic modeling with API code (using FACTORIE as a library to do topic modeling).
Thanks in advance.
James