# Spam Filters, Naive Bayes, and Wrangling
This tutorial (ch.4) discusses how Naive Bayes could be used for spam filtering.
\begin{equation}
    p(y|x)p(x)=p(x,y)=p(x|y)p(y)
\end{equation}
Using this equation:
\begin{equation}
p(y|x) = \frac{p(x|y)p(y)}{p(x)}
\end{equation}

- Naive Bayes is a spam filter that combines words.
- Using a binary vector $x$ to represent an email and $c$ to denote "is spam", Naive Bayes evaluates:
\begin{equation}
p(x|c) = \prod_j\theta_{jc}^{x_j}(1-\theta_{jc})^{1-x_j}
\end{equation}
where $\theta$ is the probability that an individual word is present in a spam email.

We model the words *independently* (aka *independent trials*), which is why we take the product on the righthand side of the preceding formula and don't count how many times they are present. That's why this called *"naive"*, because we know that there are actually certain words that tend to appear together, and we're ignoring this.

We apply log to both sides to transform the preceding equation to summation:
\begin{equation}
log(p(x|c)) = \sum_jx_jlog\left(\frac{\theta_j}{1-\theta_j}\right)+\sum_jlog(1-\theta_j)
\end{equation}

The term $log(\theta_j/(1-\theta_j))$ does not depend on a given email, just the word, so let's rename it $w_j$ and assume we've computed it once and stored it. Same with quantity $\sum_j\log(1-\theta_j)=w_0$. Now  we have:
\begin{equation}
log(p(x|c)) = \sum_jx_jw_j+w_0
\end{equation}

The accompanying code tutorial is from [here](https://dzone.com/articles/naive-bayes-tutorial-naive-bayes-classifier-in-pyt)

Data

## step by step implementation
Stages:
1. Handle data
2. Summarize data
3. Make predictions
4. Evaluate accuracy

**Step 1: Handle data**

In [1]:
import csv
import math
import random

In [2]:
lines = csv.reader(open(r'diabetes.csv'))

In [3]:
dataset = list(lines)

In [4]:
len(dataset)

769

In [5]:
for i in range(1, len(dataset)):
    dataset[i] = [float(x) for x in dataset[i]]

In [6]:
def splitDataset(dataset, splitRatio):
    trainSize = int(len(dataset) * splitRatio)
    trainSet = []
    copy = list(dataset)
    while len(trainSet) < trainSize:
        index = random.randrange(len(copy))
        trainSet.append(copy.pop(index))
    return [trainSet, copy]

**Step 2: Summarize the data**

In [7]:
def separateByClass(dataset):
    separated = {}
    for i in range(len(dataset)):
        vector = dataset[i]
        if (vector[-1] not in separated):
            separated[vector[-1]] = []
        separated[vector[-1]].append(vector)
    return separated

In [8]:
dataset.pop(0)

['Pregnancies',
 'Glucose',
 'BloodPressure',
 'SkinThickness',
 'Insulin',
 'BMI',
 'DiabetesPedigreeFunction',
 'Age',
 'Outcome']

In [9]:
print("classes:{}".format(list(separateByClass(dataset).keys())))

classes:[1.0, 0.0]


In [10]:
def mean(numbers):
    return sum(numbers)/float(len(numbers))

In [11]:
# calculate standard deviation
def stdev(numbers):
    avg = mean(numbers)
    variance = sum([pow(x-avg, 2) for x in numbers])/float(len(numbers)-1)
    return math.sqrt(variance)

In [12]:
def summarize(dataset):
    summaries = [(mean(attribute), stdev(attribute)) for attribute in zip(*dataset)]
    del summaries[-1]
    return summaries

In [13]:
summarize(dataset)

[(3.8450520833333335, 3.3695780626988623),
 (120.89453125, 31.97261819513622),
 (69.10546875, 19.355807170644777),
 (20.536458333333332, 15.952217567727677),
 (79.79947916666667, 115.24400235133837),
 (31.992578124999977, 7.8841603203754405),
 (0.4718763020833327, 0.33132859501277484),
 (33.240885416666664, 11.76023154067868)]

In [14]:
def summarizeByClass(dataset):
    separated = separateByClass(dataset)
    summaries = {}
    for classValue, instances in separated.items():
        summaries[classValue] = summarize(instances)
    return summaries

In [15]:
summarizeByClass(dataset)

{1.0: [(4.865671641791045, 3.741239044041554),
  (141.25746268656715, 31.939622058007195),
  (70.82462686567165, 21.49181165060413),
  (22.16417910447761, 17.67971140046571),
  (100.33582089552239, 138.6891247315351),
  (35.14253731343278, 7.262967242346376),
  (0.5505, 0.372354483554611),
  (37.06716417910448, 10.968253652367915)],
 0.0: [(3.298, 3.01718458262189),
  (109.98, 26.14119975535359),
  (68.184, 18.063075413305828),
  (19.664, 14.889947113744254),
  (68.792, 98.86528929231767),
  (30.30419999999996, 7.689855011650112),
  (0.42973400000000017, 0.29908530435741093),
  (31.19, 11.667654791631156)]}

**Step 3: Making predictions**

In [16]:
# calculate the Gaussian probability density function
def calculateProbability(x, mean, stdev):
    exponent = math.exp(-(math.pow(x-mean, 2)/(2 * math.pow(stdev, 2))))
    return (1/(math.sqrt(2*math.pi)*stdev))*exponent

In [17]:
# calculate class probabilities
def calculateClassProbabilities(summaries, inputVector):
    probabilities = {}
    for classValue, classSummaries in summaries.items():
        probabilities[classValue] = 1
        for i in range(len(classSummaries)):
            mean, stdev = classSummaries[i]
            x = inputVector[i]
            probabilities[classValue] *= calculateProbability(x, mean, stdev)
    return probabilities

In [27]:
def predict(summaries, inputVector):
    probabilities = calculateClassProbabilities(summaries, inputVector)
    bestLabel, bestProb = None, -1
    for classValue, probability in probabilities.items():
        print(classValue, probability)
        if bestLabel is None or probability > bestProb:
            bestProb = probability
            bestLabel = classValue
    return bestLabel

In [28]:
def getPredictions(summaries, testSet):
    predictions = []
    for i in range(len(testSet)):
        result = predict(summaries, testSet[i])
        predictions.append(result)
    return predictions

In [29]:
def getAccuracy(testSet, predictions):
    correct = 0
    for i in range(len(testSet)):
        if testSet[i][-1] == predictions[i]:
            correct += 1
    return (correct/float(len(testSet))) * 100.0

**Putting it all together**

In [30]:
trainingSet, testSet = splitDataset(dataset, 0.67)

In [31]:
len(trainingSet)

514

In [32]:
len(testSet)

254

In [33]:
print('Split {0} rows into train = {1} and test = {2} rows'.format(len(dataset),len(trainingSet),len(testSet)))

Split 768 rows into train = 514 and test = 254 rows


In [34]:
summaries = summarizeByClass(trainingSet)

In [35]:
summaries

{1.0: [(4.728260869565218, 3.7768442051737336),
  (138.71195652173913, 33.25491324853364),
  (71.18478260869566, 21.076197616349393),
  (22.91304347826087, 18.053443535057486),
  (107.68478260869566, 149.06414281868547),
  (35.49945652173913, 7.300134719039922),
  (0.5432119565217396, 0.3662077311716966),
  (36.85326086956522, 11.338082750455008)],
 0.0: [(3.272727272727273, 2.9768379142887422),
  (109.85757575757576, 27.04956084348783),
  (67.64848484848486, 18.363703482687626),
  (20.112121212121213, 14.64128853049784),
  (69.01212121212122, 94.30597475190113),
  (30.270606060606063, 7.474630827977583),
  (0.42626363636363657, 0.2872557721141152),
  (31.3, 12.0118102976254)]}

In [40]:
testSet[0]

[8.0, 183.0, 64.0, 0.0, 0.0, 23.3, 0.672, 32.0, 1.0]

In [41]:
predictions = getPredictions(summaries, [testSet[0]])

1.0 5.800698110864363e-14
0.0 1.1678931189605206e-14


In [45]:
len(predictions)

254

In [52]:
accuracy = getAccuracy(testSet, predictions)

In [53]:
print('Accuracy: {}%'.format(accuracy))

Accuracy: 74.40944881889764%


## Naive Bayes using Scikit-Learn 

In [65]:
from sklearn import datasets
from sklearn import metrics
from sklearn.naive_bayes import GaussianNB, BernoulliNB

In [55]:
dataset = datasets.load_iris()

In [56]:
model = GaussianNB()

In [57]:
model.fit(dataset.data, dataset.target)

GaussianNB(priors=None)

In [58]:
expected = dataset.target

In [59]:
len(expected)

150

In [60]:
predicted = model.predict(dataset.data)

In [61]:
len(predicted)

150

In [63]:
print(metrics.classification_report(expected, predicted))

             precision    recall  f1-score   support

          0       1.00      1.00      1.00        50
          1       0.94      0.94      0.94        50
          2       0.94      0.94      0.94        50

avg / total       0.96      0.96      0.96       150



In [64]:
print(metrics.confusion_matrix(expected, predicted))

[[50  0  0]
 [ 0 47  3]
 [ 0  3 47]]


In [66]:
bernModel = BernoulliNB()

In [67]:
bernModel.fit(dataset.data, dataset.target)

BernoulliNB(alpha=1.0, binarize=0.0, class_prior=None, fit_prior=True)

In [68]:
predicted = bernModel.predict(dataset.data)

In [69]:
print(metrics.classification_report(expected, predicted))

             precision    recall  f1-score   support

          0       0.33      1.00      0.50        50
          1       0.00      0.00      0.00        50
          2       0.00      0.00      0.00        50

avg / total       0.11      0.33      0.17       150



  'precision', 'predicted', average, warn_for)


In [71]:
print(metrics.confusion_matrix(expected, predicted))

[[50  0  0]
 [50  0  0]
 [50  0  0]]
