# Spam filter

Training data is from https://archive.ics.uci.edu/ml/datasets/spambase.

Here, I implement a spam filter using the naive Bayes algorithm.

## 1 Read data

**Data organization in file**

The data is organized as lines of comma-delimited values. Most elements denote the frequency of a certain word or character in an email; the last few are capital-letter run-length statistics and the spam label. Below are the first two lines of the data:
```
0,0.64,0.64,0,0.32,0,0,0,0,0,0,0.64,0,0,0,0.32,0,1.29,1.93,0,0.96,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.778,0,0,3.756,61,278,1
0.21,0.28,0.5,0,0.14,0.28,0.21,0.07,0,0.94,0.21,0.79,0.65,0.21,0.14,0.14,0.07,0.28,3.47,0,1.59,0,0.43,0.43,0,0,0,0,0,0,0,0,0,0,0,0,0.07,0,0,0,0,0,0,0,0,0,0,0,0,0.132,0,0.372,0.18,0.048,5.114,101,1028,1
...
```
The data has 4601 lines in total, each consisting of 58 elements. The last element is either 0 (non-spam) or 1 (spam).
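
As a small illustration of this layout, the sketch below parses one line into its 57 feature values and its label. The line is a made-up placeholder with the same layout, not a row copied from the dataset.

```python
# Hypothetical line in the same layout: 57 feature values, then the label.
line = ",".join(["0.5"] * 57 + ["1"])

# Split on commas and convert every element to a number.
values = [float(v) for v in line.split(",")]

# Everything except the last element is a feature; the last one is the label.
features, is_spam = values[:-1], int(values[-1])

print(len(values), len(features), is_spam)  # prints: 58 57 1
```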

**Data in code**

The data will be organized in two NumPy arrays: a data array of shape (4601, 57) and a label array of shape (4601, 1). For the naive Bayes implementation, we convert every non-zero value in the data array to 1.

In [14]:
import numpy as np
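# A forward slash avoids the invalid '\s' escape and works on every platform.
data_all = np.loadtxt('data/spambase.data', delimiter=',')
# Binarize: any non-zero value becomes 1 (the 0/1 label column is unaffected).
data_all[data_all > 0] = 1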
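
The split into a data array and a label array described above could look roughly like the following. This is a sketch on two made-up rows with only three features each, so it runs without the data file; the names `text` and `data_bin` are introduced here for illustration.

```python
import io
import numpy as np

# Two made-up rows: 3 feature values plus a label (the real file has 57 features).
text = "0,0.64,278,1\n0.21,0,0,0\n"
data_all = np.loadtxt(io.StringIO(text), delimiter=",")

data = data_all[:, :-1]             # features, shape (2, 3)
label = data_all[:, -1:]            # labels, shape (2, 1); the slice keeps it 2-D
data_bin = (data > 0).astype(int)   # 1 where a feature is non-zero, else 0

print(data_bin)
print(label.shape)
```

Binarizing only the feature columns leaves the raw values in `data_all` available for later use.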

## 2 Naive Bayes algorithm
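
This section is still empty in the notebook. As a placeholder, here is a minimal sketch of one common variant, Bernoulli naive Bayes with Laplace smoothing, which fits the 0/1 features produced above. The function names, the smoothing constant `alpha`, and the toy data are all assumptions for illustration, not the final implementation. The sketch also assumes both classes appear in the training labels.

```python
import numpy as np

def fit_bernoulli_nb(X, y, alpha=1.0):
    """Estimate class priors and P(feature = 1 | class) from 0/1 features.

    alpha is a Laplace smoothing constant (assumed; 1.0 is a common choice).
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y).ravel().astype(int)
    classes = (0, 1)
    # P(class), estimated from label frequencies.
    prior = np.array([(y == c).mean() for c in classes])
    # P(feature = 1 | class), smoothed so it is never exactly 0 or 1.
    theta = np.array([
        (X[y == c].sum(axis=0) + alpha) / ((y == c).sum() + 2 * alpha)
        for c in classes
    ])
    return prior, theta

def predict_bernoulli_nb(X, prior, theta):
    """Return the class (0 or 1) with the larger log joint probability."""
    X = np.asarray(X, dtype=float)
    # log P(c) + sum_j [x_j * log(theta_cj) + (1 - x_j) * log(1 - theta_cj)]
    log_joint = (
        np.log(prior)
        + X @ np.log(theta).T
        + (1 - X) @ np.log(1 - theta).T
    )
    return log_joint.argmax(axis=1)

# Toy data: the first feature matches the label, the second is uninformative.
X_toy = np.array([[1, 0], [1, 1], [0, 1], [0, 0]])
y_toy = np.array([1, 1, 0, 0])

prior, theta = fit_bernoulli_nb(X_toy, y_toy)
pred_toy = predict_bernoulli_nb(X_toy, prior, theta)
print(pred_toy)
```

Working in log space avoids underflow when the product runs over all 57 features.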

## 3 Neural network

In [32]:
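# Reload the raw (non-binarized) data for the neural network.
data_all = np.loadtxt('data/spambase.data', delimiter=',')
data = data_all[:, :57]    # features, shape (4601, 57)
label = data_all[:, -1:]   # labels, shape (4601, 1)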

In [90]:
import tensorflow as tf
inputs = tf.keras.Input(shape=(57, ))

In [104]:
x = tf.keras.layers.Dense(10, activation=tf.nn.relu)(inputs)
x1 = tf.keras.layers.Dense(5, activation=tf.nn.relu)(x)
x2 = tf.keras.layers.Dense(3, activation=tf.nn.relu)(x1)
outputs = tf.keras.layers.Dense(1, activation=tf.sigmoid)(x2)
model = tf.keras.Model(inputs=inputs, outputs=outputs)

In [105]:
loss = tf.keras.losses.BinaryCrossentropy()
model.compile(optimizer='adam', loss=loss, metrics=None)

In [127]:
model.fit(x=data, y=label, epochs=5)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<tensorflow.python.keras.callbacks.History at 0x15986b116c8>

In [112]:
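n = 3000
pred = model.predict(data[n, :].reshape(1, 57))
# pred has shape (1, 1); take the scalar and threshold the sigmoid output at 0.5.
if int(pred[0, 0] >= 0.5) == int(label[n, 0]):
    print('correct prediction')
else:
    print('wrong prediction')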

correct prediction


In [128]:
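pred = model.predict(data)
# floor(2 * pred) maps outputs in [0.5, 1) to 1 and outputs below 0.5 to 0.
det = np.floor(2*pred) - label
# Fraction of misclassified examples, measured on the training data itself.
np.count_nonzero(det) / len(det)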

0.06911540969354488
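
One caveat of the `floor(2*pred)` rule is that an output of exactly 1.0 maps to 2 rather than 1, so a confident and correct spam prediction would be counted as an error. The sketch below compares it with an explicit 0.5 threshold on made-up values; the arrays are hypothetical stand-ins for `model.predict(data)` and `label`.

```python
import numpy as np

# Hypothetical sigmoid outputs and true labels, each of shape (4, 1).
pred = np.array([[0.10], [0.70], [0.49], [1.00]])
label = np.array([[0.0], [1.0], [1.0], [1.0]])

# Rule used above: gives 0, 1, 0, 2 for these outputs.
floor_rule = np.floor(2 * pred)
error_floor = np.count_nonzero(floor_rule - label) / len(label)

# Explicit threshold: always 0 or 1, so a saturated output of 1.0 stays 1.
pred_label = (pred >= 0.5).astype(int)
error_threshold = np.mean(pred_label != label)

print(error_floor, error_threshold)  # prints: 0.5 0.25
```

Whether any output actually saturates at 1.0 depends on the trained model, so the two rules may well agree on the real data.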

**Daily summary 10/24/2020**

- The network training is now working. The structure of the network is still primitive and needs to be improved.
- The use of data needs to be improved: the model is currently trained and evaluated on the same data, so the data should be divided into train/dev/test sets.
- Consider constructing an original dataset.
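
For the train/dev/test item, one possible approach is a shuffled index split. The sketch below is an assumption-laden starting point: the helper name `split_train_dev_test`, the 80/10/10 proportions, and the fixed seed are all illustrative choices, and the zero arrays only stand in for the real `data` and `label`.

```python
import numpy as np

def split_train_dev_test(data, label, dev_frac=0.1, test_frac=0.1, seed=0):
    """Shuffle row indices once, then cut them into train/dev/test parts."""
    n = len(data)
    idx = np.random.default_rng(seed).permutation(n)
    n_test = int(n * test_frac)
    n_dev = int(n * dev_frac)
    test_idx = idx[:n_test]
    dev_idx = idx[n_test:n_test + n_dev]
    train_idx = idx[n_test + n_dev:]
    return (
        (data[train_idx], label[train_idx]),
        (data[dev_idx], label[dev_idx]),
        (data[test_idx], label[test_idx]),
    )

# Placeholder arrays with the same shapes as the real data and labels.
data = np.zeros((4601, 57))
label = np.zeros((4601, 1))

train, dev, test = split_train_dev_test(data, label)
print(train[0].shape, dev[0].shape, test[0].shape)  # prints: (3681, 57) (460, 57) (460, 57)
```

The dev pair could then be passed to `model.fit` through its `validation_data` argument, with the test set held back for a final check.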