# Spam filter

Training data is from https://archive.ics.uci.edu/ml/datasets/spambase.

Here, I implement a spam filter using the naive Bayes algorithm.

## 1 Read data

**Data organization in file**

The data is organized as lines of comma-delimited values. The first 57 values on each line are features, most of them the frequency of a particular word or character in an email. Below are the first two lines of the data:
```
0,0.64,0.64,0,0.32,0,0,0,0,0,0,0.64,0,0,0,0.32,0,1.29,1.93,0,0.96,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.778,0,0,3.756,61,278,1
0.21,0.28,0.5,0,0.14,0.28,0.21,0.07,0,0.94,0.21,0.79,0.65,0.21,0.14,0.14,0.07,0.28,3.47,0,1.59,0,0.43,0.43,0,0,0,0,0,0,0,0,0,0,0,0,0.07,0,0,0,0,0,0,0,0,0,0,0,0,0.132,0,0.372,0.18,0.048,5.114,101,1028,1
...
```
The data has 4601 lines in total, each consisting of 58 elements. The last element is the label: either 0 (non-spam) or 1 (spam).

**Data in code**

The data is stored in two NumPy arrays: a data array of shape (4601, 57) and a label array of shape (4601, 1). For the naive Bayes implementation, all non-zero values in the data array are converted to 1.

In [None]:
import numpy as np
import tensorflow as tf

In [1]:
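# A forward slash avoids the invalid '\s' escape and works on every platform.
data_all = np.loadtxt('data/spambase.data', delimiter=',')
# Binarize: every non-zero value becomes 1 (the 0/1 label column is unaffected).
data_all[data_all > 0] = 1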
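
The cell above binarizes the whole array, label column included. A minimal sketch of splitting the features from the labels and binarizing only the features could look like the following. The tiny `sample` array and the names `features` and `labels` are assumptions made for illustration; the real file has 57 feature columns.

```python
import numpy as np

# Tiny stand-in for the real file: 3 rows, 4 feature columns + 1 label column
# (the real Spambase rows have 57 feature columns + 1 label column).
sample = np.array([
    [0.0, 0.64, 0.0, 3.756, 1.0],
    [0.21, 0.0, 0.5, 5.114, 1.0],
    [0.0, 0.0, 0.0, 1.0, 0.0],
])

features = sample[:, :-1].copy()   # every column except the last
labels = sample[:, -1:].copy()     # last column, kept 2-D: shape (n, 1)

# Binarize the features only: any non-zero value becomes 1.
features[features > 0] = 1

print(features.shape, labels.shape)
```

Slicing with `-1:` rather than `-1` keeps the label array two-dimensional, which matches the (4601, 1) shape described above.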

## 2 Naive Bayes algorithm
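
This section has no implementation yet. As a rough sketch of what a naive Bayes classifier on the binarized features might look like, the code below fits a Bernoulli model with Laplace smoothing using only NumPy. The function names, the `alpha` smoothing parameter, and the toy data are assumptions for illustration, and the sketch assumes both classes occur in the training data.

```python
import numpy as np

def fit_bernoulli_nb(X, y, alpha=1.0):
    """Estimate log priors and log P(x_j = 1 | c), log P(x_j = 0 | c) for c in {0, 1}."""
    y = np.asarray(y).ravel()
    n_features = X.shape[1]
    log_prior = np.empty(2)
    log_p = np.empty((2, n_features))   # log P(x_j = 1 | c)
    log_q = np.empty((2, n_features))   # log P(x_j = 0 | c)
    for c in (0, 1):
        Xc = X[y == c]
        log_prior[c] = np.log(len(Xc) / len(X))
        # Laplace smoothing keeps every probability strictly between 0 and 1.
        p = (Xc.sum(axis=0) + alpha) / (len(Xc) + 2 * alpha)
        log_p[c] = np.log(p)
        log_q[c] = np.log(1 - p)
    return log_prior, log_p, log_q

def predict_bernoulli_nb(X, log_prior, log_p, log_q):
    """Return the class (0 or 1) with the larger log-posterior for each row of X."""
    scores = log_prior + X @ log_p.T + (1 - X) @ log_q.T
    return scores.argmax(axis=1)

# Toy binarized data: 4 emails, 3 features, labels of shape (4, 1).
X_toy = np.array([[1, 1, 0], [1, 0, 0], [0, 0, 1], [0, 1, 1]], dtype=float)
y_toy = np.array([[1], [1], [0], [0]])

params = fit_bernoulli_nb(X_toy, y_toy)
nb_pred = predict_bernoulli_nb(X_toy, *params)
print(nb_pred)
```

Working in log space turns the product of per-feature probabilities into a sum, which avoids numerical underflow when there are many features.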

## 3 Neural network

In [2]:
data_all = np.loadtxt('data/spambase.data', delimiter=',')  # reload the unbinarized values
data = data_all[:, :57]    # 57 feature columns
label = data_all[:, -1:]   # label column, kept 2-D with shape (4601, 1)

In [33]:
inputs = tf.keras.Input(shape=(57, ))
x = tf.keras.layers.Dense(20, activation=tf.nn.relu)(inputs)
x1 = tf.keras.layers.Dense(5, activation=tf.nn.relu)(x)
x2 = tf.keras.layers.Dense(3, activation=tf.nn.relu)(x1)
outputs = tf.keras.layers.Dense(1, activation=tf.sigmoid)(x2)
model = tf.keras.Model(inputs=inputs, outputs=outputs)
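
To get a feel for how small this network is, the number of trainable parameters can be counted by hand. The sketch below assumes each `Dense` layer keeps its default bias term; the helper `dense_params` is a hypothetical name used only here.

```python
# Layer widths of the network above: 57 inputs -> 20 -> 5 -> 3 -> 1 output.
layer_sizes = [57, 20, 5, 3, 1]

def dense_params(n_in, n_out):
    # A fully connected layer has an n_in x n_out weight matrix plus n_out biases.
    return n_in * n_out + n_out

per_layer = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]
total_params = sum(per_layer)
print(per_layer, total_params)  # [1160, 105, 18, 4] 1287
```

Almost all of the parameters sit in the first hidden layer, so changing its width has the largest effect on model size.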

In [18]:
loss = tf.keras.losses.BinaryCrossentropy()
metric = tf.keras.metrics.BinaryCrossentropy()
model.compile(optimizer='adam', loss=loss, metrics=None)

In [23]:
model.fit(x=data, y=label, epochs=5)

Train on 4601 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<tensorflow.python.keras.callbacks.History at 0x15322a6f390>

In [112]:
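n = 3000
# Predict a single sample; Keras expects a batch dimension, hence the reshape.
pred = model.predict(data[n, :].reshape(1, 57))
# int(p * 2) turns the sigmoid output into a class: 0 below 0.5, 1 for values in [0.5, 1).
if int(pred[0, 0] * 2) == label[n, 0]:
    print('correct prediction')
else:
    print('wrong prediction')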

correct prediction


In [24]:
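pred = model.predict(data)
# floor(2 * pred) maps probabilities in [0.5, 1) to 1 and those below 0.5 to 0;
# a non-zero difference from the label marks a misclassified sample.
det = np.floor(2*pred) - label
# Fraction of misclassified samples, measured on the training data itself.
np.count_nonzero(det) / len(det)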

0.08606824603347099
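
The figure above is an error rate on the training data. One caveat with `np.floor(2*pred)` is that a prediction of exactly 1.0 becomes 2 and is then counted as an error even when the label is 1. A possible alternative is an explicit threshold, sketched below with hypothetical toy arrays and a hypothetical helper `error_rate`.

```python
import numpy as np

def error_rate(pred, label, threshold=0.5):
    """Fraction of samples whose thresholded prediction differs from the label."""
    decided = (pred >= threshold).astype(int)
    return float(np.mean(decided != label))

# Toy predictions and labels, both of shape (4, 1); the third prediction is exactly 1.0.
pred_toy = np.array([[0.1], [0.7], [1.0], [0.4]])
label_toy = np.array([[0.0], [1.0], [1.0], [1.0]])

threshold_error = error_rate(pred_toy, label_toy)
# The floor-based formula additionally counts the saturated 1.0 prediction as wrong.
floor_error = np.count_nonzero(np.floor(2 * pred_toy) - label_toy) / len(label_toy)
print(threshold_error, floor_error)  # 0.25 0.5
```

Whether the trained model ever outputs exactly 1.0 depends on how far the sigmoid saturates, so the two formulas may or may not agree on the real data.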

In [28]:
sess = tf.compat.v1.Session()

In [40]:
dir(x)

['OVERLOADABLE_OPERATORS',
 '_USE_EQUALITY',
 '__abs__',
 '__add__',
 ...

In [37]:
dir(sess.graph)

['_ControlDependenciesController',
 '__class__',
 '__delattr__',
 '__dict__',
 ...

**Daily summary 10242020**

- The network training is now working. The structure of the network is still primitive and needs to be improved.
- The use of data needs to be improved: divide data into train/dev/test sets.
- Consider constructing an original dataset.
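
The train/dev/test split mentioned in the summary could be sketched as follows. The 80/10/10 proportions, the fixed seed, the function name `split_train_dev_test`, and the toy arrays are all assumptions for illustration, not choices made in this notebook.

```python
import numpy as np

def split_train_dev_test(data, label, dev_frac=0.1, test_frac=0.1, seed=0):
    """Shuffle the rows once, then cut them into train, dev, and test parts."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(data))
    n_dev = int(len(data) * dev_frac)
    n_test = int(len(data) * test_frac)
    dev_idx = idx[:n_dev]
    test_idx = idx[n_dev:n_dev + n_test]
    train_idx = idx[n_dev + n_test:]
    return ((data[train_idx], label[train_idx]),
            (data[dev_idx], label[dev_idx]),
            (data[test_idx], label[test_idx]))

# Toy stand-ins with the same number of rows as the real arrays.
data_toy = np.arange(4601 * 2, dtype=float).reshape(4601, 2)
label_toy = (np.arange(4601) % 2).reshape(-1, 1)

train, dev, test = split_train_dev_test(data_toy, label_toy)
print(len(train[0]), len(dev[0]), len(test[0]))  # 3681 460 460
```

Shuffling before cutting matters if the file groups rows by class, since an unshuffled tail slice could then contain only one class.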