## Definitions

**Naive Bayes (NB)** is a family of probabilistic classifiers based on Bayes' theorem, named after Thomas Bayes. It is called **naive** (or **independent**) because it assumes that the **features** (the components of the vector that represents a problem instance) are independent of each other given the class.

*Example:* If an object is red, round, and 10 cm in diameter, each of these features contributes independently to the probability of it being classified as an apple.
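
The apple example can be sketched numerically. The priors and per-feature likelihoods below are invented purely for illustration, and `naive_posteriors` is a hypothetical helper, not part of any library.

```python
from math import prod

# Hypothetical P(feature | class) values, invented for illustration only.
likelihoods = {
    "apple": {"red": 0.7, "round": 0.9, "10cm": 0.6},
    "tomato": {"red": 0.9, "round": 0.8, "10cm": 0.1},
}
# Hypothetical class priors P(class).
priors = {"apple": 0.5, "tomato": 0.5}


def naive_posteriors(features):
    """Return P(class | features) under the naive independence assumption."""
    # Unnormalized score: prior times the product of per-feature likelihoods.
    scores = {
        label: priors[label] * prod(likelihoods[label][f] for f in features)
        for label in priors
    }
    total = sum(scores.values())
    return {label: score / total for label, score in scores.items()}


posteriors = naive_posteriors(["red", "round", "10cm"])
print(posteriors)
```

With these made-up numbers the apple class ends up near 0.84, because each feature multiplies into the score on its own, without any term for how the features relate to each other.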

**Naive Bayes** is a *supervised* ML algorithm: it needs to be trained on labeled data before it can classify anything.
**Naive Bayes** is based on the formula of *Conditional Probability*.

**Kernel Density Estimation (KDE)** is a non-parametric method that estimates a smooth probability density function of a variable from a sample. It is often paired with NB to model continuous features, where it can make NB considerably more competitive.
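
As a rough illustration of the idea, here is a minimal Gaussian KDE using only the standard library. The sample values and the bandwidth of 0.5 are assumptions chosen for the example; in an NB classifier the resulting density would stand in for the probability of a continuous feature value given a class.

```python
import math


def gaussian_kde(samples, bandwidth):
    """Return a function that estimates the density at x from the samples."""
    n = len(samples)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        # Sum one Gaussian bump per sample, centred on that sample.
        return norm * sum(
            math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples
        )

    return density


# Hypothetical feature values (e.g. diameters in cm) observed for one class.
diameters = [9.5, 10.0, 10.2, 10.8, 11.1]
density = gaussian_kde(diameters, bandwidth=0.5)

print(density(10.0))  # relatively high: close to the samples
print(density(14.0))  # close to zero: far from every sample
```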

## Naive Bayes

NB relies on conditional (posterior) probability: the probability of "A given B".

### Formula for conditional probability

$P(A \mid B) = \frac{P(B \mid A) P(A)}{P(B)}$

- $P(A \mid B)$ - Posterior probability: the probability of $A$ occurring given $B$.
- $P(B \mid A)$ - Likelihood: the probability of $B$ occurring if $A$ is true. This is the probability estimated from the training data.
- $P(A)$ and $P(B)$ are the probabilities of observing $A$ and $B$ without any conditions (marginal probabilities).

Thus, in plain words, the posterior probability $P(A \mid B)$ is the probability of $B$ given $A$, multiplied by the probability of $A$ and divided by the probability of $B$. Since $P(B \mid A) P(A) = P(B \cap A)$, this is the same as:

$P(A \mid B) = \frac{P(B \cap A)}{P(B)}$

The conditional probability is the joint probability (intersection) of $A$ and $B$ occurring, divided by the marginal probability of $B$ occurring. Here $B$ is the observed evidence (the features) and $A$ is the hypothesis (the class).
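
A small worked example shows that the two forms of the formula agree. The email counts below are hypothetical and chosen only to keep the arithmetic simple.

```python
# Hypothetical counts for 1000 emails, invented for illustration.
total = 1000
spam = 200             # emails that are spam (event A)
with_word = 150        # emails containing the word "offer" (event B)
spam_with_word = 120   # emails that are spam AND contain the word

p_a = spam / total                    # P(A)
p_b = with_word / total               # P(B)
p_a_and_b = spam_with_word / total    # P(A and B), the joint probability
p_b_given_a = spam_with_word / spam   # P(B | A)

# Bayes' theorem: P(A | B) = P(B | A) P(A) / P(B)
via_bayes = p_b_given_a * p_a / p_b

# Definition of conditional probability: P(A | B) = P(A and B) / P(B)
via_joint = p_a_and_b / p_b

print(via_bayes, via_joint)
```

Both expressions come out at about 0.8: of the 150 emails containing the word, 120 are spam.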

## Pros and cons

Naive Bayes has pros and cons that influence its use.

### Pros

- **Easy** to implement, easy to maintain.
- **Fast** to train and to predict, because each feature is handled independently. Does not require expensive hardware.
- Can work with **categorical inputs**, not only numeric ones.

### Cons

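- **Zero-frequency problem**: without smoothing, a category that never appears in the training dataset gets a probability of zero, which zeroes out the whole product.
- **Not very precise** in its results: the predicted probabilities are often poorly calibrated.
- Because it treats each feature independently, it **misses the connections** between features.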
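
The zero-frequency problem and its usual remedy, additive (Laplace) smoothing, can be sketched as follows. The token list, the vocabulary and the `likelihood` helper are hypothetical and exist only for this illustration.

```python
from collections import Counter

# Hypothetical training tokens for the "spam" class.
spam_tokens = ["offer", "free", "offer", "win"]
counts = Counter(spam_tokens)
vocabulary = {"offer", "free", "win", "meeting"}  # "meeting" never seen in spam


def likelihood(token, alpha):
    """P(token | spam) with additive smoothing; alpha=0 disables smoothing."""
    return (counts[token] + alpha) / (len(spam_tokens) + alpha * len(vocabulary))


unsmoothed = likelihood("offer", 0) * likelihood("meeting", 0)
smoothed = likelihood("offer", 1) * likelihood("meeting", 1)

print(unsmoothed)  # 0.0: one unseen token wipes out the whole product
print(smoothed)    # small but non-zero
```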

## Application

Naive Bayes is used for classification where text is involved, performance is required, and recognizing interdependence between features is not critical. In text classification NB is known to be reliable and fast, and this is its main area of application.

- **Sentiment analysis**. What sentiment does the text reflect?
- **Spam analysis**. Is the text spam or ham?
- **Recommendation systems**. Used with *collaborative filtering* to predict whether a user is likely to use a product, based on the list of products they currently use.

## Naive Bayes Example

In [164]:
import csv
import re
from nltk.corpus import stopwords
from sklearn.model_selection import train_test_split
from functools import reduce
import sklearn.metrics

token_re = re.compile(r"\$?\d*(?:[.,]\d+)+|\w+-\w+|\w+", re.U)

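# Getting stopwords for each use is *very* slow, so prepare them once here.
# A set gives constant-time membership checks in _tokenize.
stopwords = set(stopwords.words('english'))


def _tokenize(text):
    """Lowercase the text; keep tokens longer than 2 characters that are not stop words."""
    return [s for s in token_re.findall(text.lower()) if len(s) > 2 and s not in stopwords]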

# Load the dataset; each row is [text, label]. Tokenization happens after the split.
with open('assets/emails.csv', newline='') as fh:
    data = list(csv.reader(fh))

# Split the data.
x_train, x_test = train_test_split(data[1:], test_size=.2, shuffle=True)

# Tokenize the train dataset in place.
for entry in x_train:
    entry[0] = _tokenize(entry[0])

# Flatten the dataset into [token, label] pairs.
flat_data = [[token, entry[1]] for entry in x_train for token in entry[0]]

# The flat dataset contains tokens with a spam indicator ('1' or '0').
# The same token can occur multiple times in this flat list.
print(flat_data[:10])

[['chapter', '0'], ['vince', '0'], ['things', '0'], ['well', '0'], ['hope', '0'], ['latest', '0'], ['version', '0'], ['part', '0'], ['chapter', '0'], ['think', '0']]
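
To see what the tokenizer pattern keeps, here is a self-contained sketch that reuses the same regular expression. `demo_stopwords` is a small hypothetical stand-in for the NLTK list, and the sample sentence is invented.

```python
import re

# Same pattern as in the cell above.
token_re = re.compile(r"\$?\d*(?:[.,]\d+)+|\w+-\w+|\w+", re.U)

# Small stand-in for the NLTK stop word list (hypothetical, for illustration).
demo_stopwords = {"the", "and", "for", "you", "this"}


def demo_tokenize(text):
    """Lowercase, split with token_re, drop short tokens and stop words."""
    return [
        s for s in token_re.findall(text.lower())
        if len(s) > 2 and s not in demo_stopwords
    ]


tokens = demo_tokenize("Win $1,000.50 for you: a toll-free, well-known OFFER")
print(tokens)
# ['win', '$1,000.50', 'toll-free', 'well-known', 'offer']
```

The first alternative of the pattern keeps amounts such as `$1,000.50` together, the second keeps hyphenated words whole, and the third catches ordinary words.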


In [165]:
# Per token: index 0 - spam count, 1 - ham count, 2 - smoothed spam index (added later).
model = {}
spam_total = 0
ham_total = 0

# Count spam/ham occurrences per token.
for token, label in flat_data:
    if token not in model:
        model[token] = [0, 0]
    if label == '1':
        model[token][0] += 1
        spam_total += 1
    else:
        model[token][1] += 1
        ham_total += 1

# Now the model maps each token to its spam and ham counts.
print(list(model.items())[:5])

[('chapter', [2, 160]), ('vince', [1, 6867]), ('things', [58, 226]), ('well', [127, 758]), ('hope', [25, 544])]


In [166]:
# Add-one (Laplace) smoothed spam index per token:
# (spam count + 1) / (spam count + ham count + 2).
for counts in model.values():
    counts.append((counts[0] + 1) / (counts[0] + counts[1] + 2))

# Share of spam and ham among all training tokens, used as class priors.
prob_spam = spam_total / (spam_total + ham_total)
prob_ham = ham_total / (spam_total + ham_total)

# Each token now has a "spam" index value between 0 and 1.
print(list(model.items())[:5])

[('chapter', [2, 160, 0.018292682926829267]), ('vince', [1, 6867, 0.0002911208151382824]), ('things', [58, 226, 0.2062937062937063]), ('well', [127, 758, 0.14430665163472378]), ('hope', [25, 544, 0.04553415061295972])]
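
The smoothing step can be checked by hand against the printed values. `spam_index` is a hypothetical helper that repeats the same arithmetic on the counts shown above.

```python
def spam_index(spam_count, ham_count):
    """Add-one smoothed share of spam occurrences for a token."""
    return (spam_count + 1) / (spam_count + ham_count + 2)


# Counts taken from the printed model above.
print(spam_index(2, 160))    # 'chapter': 3 / 164
print(spam_index(1, 6867))   # 'vince':   2 / 6870

# A token seen once, only in spam, is pulled toward 0.5 instead of jumping to 1.0.
print(spam_index(1, 0))      # 2 / 3
# A token with no occurrences at all would sit exactly at 0.5.
print(spam_index(0, 0))      # 1 / 2
```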


In [167]:
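# Classification function: spam score of a single token.
def calculate_word(word):
    # Known tokens use their smoothed spam index; unseen tokens get a small fallback value.
    test_si = model[word][2] if word in model else 1 / (spam_total + 2)
    test_hi = (1 - model[word][2]) if word in model else 1 / (ham_total + 2)
    # Weight both sides by the class priors and normalize.
    return (test_si * prob_spam) / ((test_si * prob_spam) + (test_hi * prob_ham))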

# Test some words.
words = ['really', 'identity', 'abracadabra']
for word in words:
    print('"' + word + '": ' + str(calculate_word(word)))

"really": 0.06775089249265955
"identity": 0.8140876430074708
"abracadabra": 0.4999970085028959

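The near-0.5 score for the unseen word "abracadabra" follows from the fallback values in `calculate_word`. Writing $S$ for `spam_total` and $H$ for `ham_total` (symbols introduced here only for brevity), the priors are $\frac{S}{S+H}$ and $\frac{H}{S+H}$, and the common factor $\frac{1}{S+H}$ cancels:

$P(\text{spam} \mid \text{unseen}) = \frac{\frac{1}{S+2} \cdot \frac{S}{S+H}}{\frac{1}{S+2} \cdot \frac{S}{S+H} + \frac{1}{H+2} \cdot \frac{H}{S+H}} = \frac{\frac{S}{S+2}}{\frac{S}{S+2} + \frac{H}{H+2}}$

For large token counts both $\frac{S}{S+2}$ and $\frac{H}{H+2}$ are close to 1, so the score lands right next to 0.5, and slightly below it when $S < H$.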

In [168]:
# Calculate spam probability of a text.
def calculate_text(text):
    words = _tokenize(text)
    ratings = [calculate_word(word) for word in words]

    # Google B https://patents.google.com/patent/US7523168
    # Note: reduce passes its running result back in as x on every step.
    spamminess = (pow(reduce(lambda x, y: (1 - x) * (1 - y), ratings), 1 / len(ratings))).real
    hamminess = (1 - pow(reduce(lambda x, y: x * y, ratings), 1 / len(ratings))).real
    combined = (spamminess - hamminess) / (spamminess + hamminess)
    normalized = (1 + combined) / 2

    return normalized


# Classify text. A wrapper around calculate_text: True means spam.
def classify(text):
    return calculate_text(text) > 0.51

words = ['really', 'identity', 'abracadabra']
for word in words:
    print('"' + word + '": ' + str(calculate_text(word)))


"really": 0.06775089249265953
"identity": 0.8140876430074708
"abracadabra": 0.49999700850289597

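The fold in `calculate_text` deserves a closer look: with three or more ratings, `reduce(lambda x, y: (1 - x) * (1 - y), ratings)` applies `1 - x` to its own running result, so it does not yield the plain product of `(1 - r)`. The sketch below contrasts the two and shows a geometric-mean combination written with explicit products, in the form commonly attributed to Gary Robinson. It is not the code that produced the outputs in this notebook; `combine`, the fallback of 0.5 for empty input, and the sample ratings are assumptions made for illustration.

```python
from functools import reduce
from math import prod

sample = [0.2, 0.5, 0.9]

# Fold as written in calculate_text: (1 - (1 - 0.2) * (1 - 0.5)) * (1 - 0.9)
folded = reduce(lambda x, y: (1 - x) * (1 - y), sample)
# Plain product of (1 - r): 0.8 * 0.5 * 0.1
product = prod(1 - r for r in sample)
print(folded, product)  # about 0.06 versus about 0.04


def combine(ratings):
    """Combine per-token spam scores into one score in [0, 1]."""
    n = len(ratings)
    if n == 0:
        return 0.5  # assumption: no evidence means undecided
    # Geometric-mean style evidence for spam (p) and for ham (q).
    p = 1 - prod(1 - r for r in ratings) ** (1 / n)
    q = 1 - prod(ratings) ** (1 / n)
    s = (p - q) / (p + q)   # in [-1, 1]
    return (1 + s) / 2      # rescaled to [0, 1]


print(combine([0.9, 0.8, 0.95]))  # mostly spammy tokens: above 0.5
print(combine([0.1, 0.2, 0.05]))  # mostly hammy tokens: below 0.5
print(combine([0.3]))             # a single rating comes back as itself, up to rounding
```

For a single rating both versions agree, which is why the one-word outputs above match those of `calculate_word`.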

In [185]:
# Validate with a confusion matrix laid out as [[TP, FP], [FN, TN]],
# where "positive" means spam.
confusion_matrix = [[0, 0], [0, 0]]

for entry in x_test:
    response = classify(entry[0])
    
    # True positive.
    if response and entry[1] == '1':
        confusion_matrix[0][0] += 1
        
    # False positive.
    elif response and entry[1] == '0':
        confusion_matrix[0][1] += 1
        
    # True negative.
    elif not response and entry[1] == '0':
        confusion_matrix[1][1] += 1
    
    # False negative.
    elif not response and entry[1] == '1':
        confusion_matrix[1][0] += 1

print(confusion_matrix)
print('p = ' + str((confusion_matrix[0][0] + confusion_matrix[1][1]) / len(x_test)))

[[258, 29], [15, 844]]
p = 0.9616055846422339
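
The value printed as `p` is the accuracy. Other common metrics can be read off the same matrix; the sketch below assumes the `[[TP, FP], [FN, TN]]` layout that the cell above fills in and reuses the printed counts.

```python
# Confusion matrix printed above, laid out as [[TP, FP], [FN, TN]].
confusion = [[258, 29], [15, 844]]
(tp, fp), (fn, tn) = confusion

accuracy = (tp + tn) / (tp + fp + fn + tn)
precision = tp / (tp + fp)   # share of predicted spam that really is spam
recall = tp / (tp + fn)      # share of real spam that was caught
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.4f} precision={precision:.4f} recall={recall:.4f} f1={f1:.4f}")
```

With these counts, precision is about 0.899 and recall about 0.945, so the classifier misses few spam messages but flags some ham as spam.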
