NBC

My Naive Bayes Classifier for ham/spam message classification.

Theory

The probability that a message belongs to a class $Q_k \in \{ham, spam\}$ is estimated with the following formula: $$P(Q_k | x) \propto P(Q_k)\prod\limits_{i = 1}^n P(x_i|Q_k),$$ where $x$ is the message and $x_i$ is the $i$-th word of $x$.
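This score follows from Bayes' theorem combined with the "naive" assumption that the words of a message are conditionally independent given the class; the evidence $P(x)$ is identical for both classes, so it can be dropped when comparing them:

$$P(Q_k | x) = \frac{P(Q_k)P(x | Q_k)}{P(x)} \propto P(Q_k)\prod\limits_{i = 1}^n P(x_i|Q_k).$$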

To avoid zero probabilities for words that never occur in the training data, additive smoothing is used: $$P(x_i|Q_k) = \frac{\alpha + N_{ik}}{\alpha M + N_k},$$ where $N_{ik}$ is the number of occurrences of word $x_i$ in messages of class $Q_k$, $N_k$ is the total number of words in class $Q_k$, $M$ is the vocabulary size, and $\alpha > 0$ is the smoothing parameter ($\alpha = 1$ gives Laplace smoothing).
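As a minimal sketch of this estimate (Java is an assumption based on the Main/resources layout; the class and parameter names below are invented and not taken from the repository's code):

```java
import java.util.Map;

final class Smoothing {
    /**
     * Additive smoothing estimate of P(x_i | Q_k).
     * wordCountsInClass maps each word to N_ik, totalWordsInClass is N_k,
     * vocabularySize is M and alpha is the smoothing parameter.
     */
    static double likelihood(Map<String, Integer> wordCountsInClass, String word,
                             long totalWordsInClass, int vocabularySize, double alpha) {
        int count = wordCountsInClass.getOrDefault(word, 0); // N_ik is 0 for unseen words
        return (alpha + count) / (alpha * vocabularySize + totalWordsInClass);
    }
}
```

With $\alpha > 0$, an unseen word gets a small non-zero probability instead of zeroing out the whole product.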

Because the product of many small probabilities quickly underflows, it is easier to work with logarithms, converting the formula with the rule $\log{ab} = \log{a}+\log{b}$; the message is assigned to the class with the largest score $$\log{P(Q_k)} + \sum\limits_{i = 1}^n \log{P(x_i|Q_k)}.$$
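A correspondingly minimal sketch of the log-space score for one class (again purely illustrative; names are assumptions):

```java
import java.util.Map;

final class Scoring {
    /**
     * Log-space score of a message for one class Q_k:
     * log P(Q_k) plus the sum of log P(x_i | Q_k) with additive smoothing.
     */
    static double score(String[] wordsOfMessage, double logPrior,
                        Map<String, Integer> wordCountsInClass,
                        long totalWordsInClass, int vocabularySize, double alpha) {
        double score = logPrior; // log P(Q_k)
        for (String word : wordsOfMessage) {
            int count = wordCountsInClass.getOrDefault(word, 0); // N_ik
            double likelihood = (alpha + count) / (alpha * vocabularySize + totalWordsInClass);
            score += Math.log(likelihood); // log P(x_i | Q_k)
        }
        return score;
    }
}
```

The message is assigned to whichever class, ham or spam, yields the larger score.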

How to run

You can find the training-set file spam.txt in the resources folder. The data used can be found here. Each line consists of two fields, the type (ham or spam) and the message text, separated by a tab character ('\t').
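For illustration, lines of the training file follow this layout ('\t' stands for the tab separator; the messages below are invented and only show the format):

```
ham\tOk, see you at the station at 5
spam\tWINNER!! You have been selected for a free prize, call now
```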

In the Main file you can change the file path of the training set and the message variable that will be classified.
