Bayesian Classifier #172

Merged
merged 58 commits into from Jun 21, 2012

Owner

josh commented Jun 7, 2012

Initial work on a Bayesian classifier.

This includes the custom tokenizer for input into the classifier.

/cc @TwP

This pull request passes (merged e0cbe81 into 4df3199).

This pull request passes (merged f747b49 into 4df3199).

josh and 1 other commented on an outdated diff Jun 7, 2012

lib/linguist/classifier.rb
+      scores.sort { |a, b| b[1] <=> a[1] }
+    end
+
+    def tokens_probability(tokens, language)
+      tokens.inject(1.0) do |sum, token|
+        sum *= token_probability(token, language)
+      end
+    end
+
+    def token_probability(token, language)
+      if @tokens[language][token] == 0
+        1 / @tokens_total.to_f
+      else
+        @tokens[language][token].to_f / @languages[language].to_f
+      end
+    end

josh Jun 7, 2012

Owner

@TwP This needs some tweaking.

The issue is how to weight tokens we haven't seen yet. We don't want the probability to be 0, otherwise it's going to reduce the entire probability of matching the language to 0.
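
One common way to handle the unseen-token problem described above is add-one (Laplace) smoothing, combined with summing log probabilities instead of multiplying raw ones. The following is only a minimal sketch under that assumption, not the code from this pull request: the `SmoothedCounts` class, its methods, and its instance variables are hypothetical names chosen for illustration.

```ruby
# Hypothetical sketch: add-one (Laplace) smoothing with log probabilities.
# SmoothedCounts and its method names are illustrative, not Linguist's API.
class SmoothedCounts
  def initialize
    # language => token => count, defaulting to 0 for unseen tokens
    @tokens = Hash.new { |h, k| h[k] = Hash.new(0) }
    # language => total number of tokens seen for that language
    @totals = Hash.new(0)
    # every distinct token seen during training, across all languages
    @vocabulary = {}
  end

  # Record the tokens of one sample for a language.
  def train(language, tokens)
    tokens.each do |token|
      @tokens[language][token] += 1
      @totals[language] += 1
      @vocabulary[token] = true
    end
  end

  # P(token | language) with add-one smoothing. The extra 1 in the
  # denominator reserves one slot for tokens never seen in training,
  # so the result is never exactly 0.
  def token_probability(token, language)
    (@tokens[language][token] + 1).to_f /
      (@totals[language] + @vocabulary.size + 1)
  end

  # Sum of logs instead of a product of probabilities, which avoids
  # floating point underflow on long token lists.
  def tokens_log_probability(tokens, language)
    tokens.inject(0.0) do |sum, token|
      sum + Math.log(token_probability(token, language))
    end
  end
end
```

With this scheme an unseen token lowers a language's score without forcing it to zero, and a language that has actually seen the token always scores higher for it than one that has not.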

TwP Jun 16, 2012

Owner

During the classification phase, if we find tokens not in the language then we have to score them as zero. It is during the training phase that we build up the probabilities that a token belongs to the language. As new languages come in, we have to train the classifier to recognize them. It is the process of building up the token probabilities.

There are algorithms for continuously training Bayesian classifiers. We should definitely check one of those out.
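
As a rough illustration of continuous training, a naive Bayes classifier can store only raw counts and derive probabilities at classification time, so new samples and new languages can be added at any point without retraining from scratch. This is a sketch under that assumption; `IncrementalClassifier` and everything inside it are hypothetical names, and the add-one smoothing used for unseen tokens is one possible choice rather than what this pull request settled on.

```ruby
# Hypothetical sketch of an incrementally trainable naive Bayes classifier.
# Only counts are stored; probabilities are computed when classifying.
class IncrementalClassifier
  def initialize
    @tokens = Hash.new { |h, k| h[k] = Hash.new(0) } # language => token => count
    @token_totals = Hash.new(0)                      # language => tokens seen
    @samples = Hash.new(0)                           # language => samples seen
    @samples_total = 0
    @vocabulary = {}                                 # distinct tokens seen
  end

  # Add one training sample. Safe to call at any time, including for a
  # language the classifier has never seen before.
  def train(language, tokens)
    @samples[language] += 1
    @samples_total += 1
    tokens.each do |token|
      @tokens[language][token] += 1
      @token_totals[language] += 1
      @vocabulary[token] = true
    end
    self
  end

  # Returns [language, log score] pairs, best match first.
  def classify(tokens)
    scores = @samples.keys.map do |language|
      # Prior: share of training samples belonging to this language.
      prior = Math.log(@samples[language].to_f / @samples_total)
      # Likelihood: add-one smoothed token probabilities, summed as logs.
      denominator = @token_totals[language] + @vocabulary.size + 1
      likelihood = tokens.inject(0.0) do |sum, token|
        sum + Math.log((@tokens[language][token] + 1).to_f / denominator)
      end
      [language, prior + likelihood]
    end
    scores.sort { |a, b| b[1] <=> a[1] }
  end
end
```

For example, after `train("Ruby", %w[def end puts])` the classifier knows one language; a later `train("JavaScript", %w[function var return])` adds a second one without touching the Ruby counts.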

This pull request passes (merged 9ecab36 into 4df3199).

This pull request passes (merged 543922c into 4df3199).

This pull request passes (merged 12cfab6 into 8a9d8a1).

This pull request passes (merged ecb2397 into 55e1259).

This pull request passes (merged ddf3ec4 into 55e1259).

Owner

josh commented Jun 19, 2012

/cc @sbryant

This pull request passes (merged 645a87d into 6113e6d).

Owner

josh commented on 4484011 Jun 19, 2012

@TwP @sbryant this is fishy

This pull request passes (merged 48ecae0 into a10e52a).

This pull request passes (merged 8c83cbe into 1263b4c).

Contributor

michaelficarra commented on 26f9550 Jun 20, 2012

Cool. You can use gists as an unlimited source of prewritten test cases. But you can also use them to train, right?

Owner

josh replied Jun 20, 2012

I'm planning to avoid training against user data for licensing reasons. I want the classifier db to be fully MIT, and I don't think it's right to include users' data in it without their permission. So you can opt in and donate your code to the linguist samples repo.

Contributor

michaelficarra replied Jun 20, 2012

Sorry, my code is all 3-clause BSD. MIT is scary.

Owner

josh replied Jun 20, 2012

It doesn't have to be real code, just a sample file. Hopefully the output of the coffeescript compiler isn't BSD licensed :P

This pull request fails (merged ac23d64 into 37f971a).

This pull request fails (merged 6252f12 into 37f971a).

This pull request passes (merged 5568489 into 37f971a).

This pull request passes (merged 497da86 into 37f971a).

This pull request passes (merged 540f2a0 into 37f971a).

This pull request passes (merged 076bf7d into 37f971a).

This pull request passes (merged 77a6a41 into f90c022).

josh merged commit 2a324c6 into master Jun 21, 2012
