# ANN (TensorFlow, Keras) for Spam Detection.

This is an example of an ANN with Tensorflow + Keras.

Built upon examples from the following sources:
* https://medium.com/analytics-vidhya/spam-classification-with-tensorflow-keras-7e9fb8ace263
* https://www.thepythoncode.com/article/build-spam-classifier-keras-python


## Definitions and early explanations.

**TensorFlow** is a free and open source machine learning library and engine
 written and published by Google. See https://www.tensorflow.org.

**A tensor** is a data array, similar to NumPy arrays. In a wide sense, all
matrices are tensors. But in a narrow sense when used with **TensorFlow**,
tensors are not alterable matrices, formatted for the use with GPUs.

**Keras** is a free and open source API for TensorFlow, developed by the same
team, that makes it faster and easier to use TensorFlow in general common
scenarios. See https://keras.io.

In [12]:
# Basic stuff.
import pandas as pd
import string
import math

# Feature engineering.
from nltk.corpus import stopwords
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

# Tensorflow and Keras.
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Activation, Dropout
from tensorflow.keras.callbacks import EarlyStopping

# Reporting and matrix.
from sklearn.metrics import classification_report, confusion_matrix

## Hyper-parameters.

Hyper-parameters are the settings of our operations, especially as related to
the Neural network.

Changing the hyper-parameters can sometimes significantly influence the resulting
performance and accuracy of results.

**Dimensionality** of each layer is the size of it's output array, the amount of
its artificial neurons.

**DROP_SIZE** Is the severity of the dropout layer. If the drop size is 0.5,
then the next NN layer will be proportionally smaller.

The Dimensionality Base **DIM_BASE** hyper-parameter has started from 8000 and
was gradually decreased until it started to affect the results in a negative
way. It was possible to decrease the parameter from 8000 to 80 - 100 times,
and that also proportionally decreased the training time of each layer from
120s to 1s 5ms.

In [13]:
# Proportional size of the test subset.
TEST_SIZE = 0.2

# Number of training epochs.
EPOCHS = 5

# The severity of the dropout layers.
DROP_SIZE = 0.5

# NN layer output dimensionality base.
# There will be 4 such layers, each 2x smaller then the previous.
DIM_BASE = 80

## Load the dataset.

The source training csv file is a set of labeled emails, with "spam" value of 1
and "ham" value of 0. The dataset was taken from
https://github.com/Balakishan77/Spam-Email-Classifier/blob/master/spamham.csv

In [14]:
df = pd.read_csv('emails.csv')
df = df[['text', 'spam']]
df.head()

Unnamed: 0,text,spam
0,naturally irresistible your corporate identity...,1
1,the stock trading gunslinger fanny is merrill...,1
2,unbelievable new homes made easy im wanting t...,1
3,4 color printing special request additional i...,1
4,"do not have money , get software cds from here...",1


## Preprocess the dataset.

Our dataset needs to be cleaned up - stopwords, punctuation, multiple spaces,
etc.

Pandas dataframe class has an `apply()` method that allows us to apply a custom
function along the axis of a dataframe. See https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.apply.html.

In [15]:
# Preprocess the text - stopwords, punctuation, convert to lowercase.
def text_processing(message):
    stop_words = stopwords.words('english')
    no_punctuation = [char for char in message if char not in string.punctuation]
    no_punctuation = ''.join(no_punctuation)
    return ' '.join([word for word in no_punctuation.split() if word.lower() not in stop_words])

df['text'] = df['text'].apply(text_processing)
df.head()

Unnamed: 0,text,spam
0,naturally irresistible corporate identity lt r...,1
1,stock trading gunslinger fanny merrill muzo co...,1
2,unbelievable new homes made easy im wanting sh...,1
3,4 color printing special request additional in...,1
4,money get software cds software compatibility ...,1


## Split the dataset into training and testing subsets.

In [16]:
X = df['text'].values
y = df['spam'].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=TEST_SIZE, random_state=32)

## Feature Engineering.

### Vectorization.
The text needs to be turned into numbers. `CountVectorizer()` will create a
dictionary from the text (*fit*) and use to to onvert the text into these
numbers (*transform*).

We use the larger text body as a source of *fit*, and then serve that logic to
itself, and the smaller training text body in *transform*.

### Tf-Idf Vectorization.

**TF-IDF** - Term Frequency times Inverse Document Frequency. It is a term
weighing technique, where a given term is weighed higher if it's frequent to a
document, but then weighed lower if it's a common word in other documents.

The **tf-idf** is a probability calculator, that similar to how *Bayes* handles
its probability calculation.

See https://en.wikipedia.org/wiki/Tf-idf
See https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html.

Resulting arrays are tokenized vectorized text with the token probability map.

In [17]:
# Vectorization.
bow = CountVectorizer()
X_train = bow.fit_transform(X_train)
X_test = bow.transform(X_test)

# Term Frequency, Inverse Document Frequency.
tfidf = TfidfTransformer()
X_train = tfidf.fit_transform(X_train)
X_test = tfidf.transform(X_test)
X_train = X_train.toarray()
X_test = X_test.toarray()

## Building the Model.

### Terminology.

**Dense layers** - **linear** layers, a simple linear stack of NN layers.

**Sequential model** - a model where each layer receives and returns 1 tensor.

**Layer output dimensionality** - The size of the array the layer will output.

### Explanation.

The model class for this task is `Sequential()`, which is suitable for simple
NNs where each layer has 1 input and 1 output tensor.

See https://keras.io/guides/sequential_model.

#### The dense (linear) layers.
The model consists of 5 `Dense()` layers with decreasing output space
dimensionality ("units"). 4 of which are activated by the *ReLU*, and the final
one by the *sigmoid* activation function.

Each new layer with smaller dimensionality thus raises the level of learning
abstraction, until the final layer with the dimensionality of 1 remains and the
final abstraction is *spam* or *ham* (1 or 0).

See https://keras.io/api/layers/core_layers/dense.

#### The dropout layers.
To prevent bias and overfitting, we drop neurons at random by a certain rate. The dropout performed by the `Dropout()` layer. Is connected to the
`Dense()` layer in such a way, that half of the neurons are dropped each time,
and the each following layer is half-size smaller.

See https://keras.io/api/layers/regularization_layers/dropout.

In [18]:
model = Sequential()
model.add(Dense(units=DIM_BASE, activation='relu'))
model.add(Dropout(DROP_SIZE))
model.add(Dense(units=math.floor(DIM_BASE * DROP_SIZE), activation='relu'))
model.add(Dropout(DROP_SIZE))
model.add(Dense(units=math.floor(DIM_BASE * (DROP_SIZE * 2)), activation='relu'))
model.add(Dropout(DROP_SIZE))
model.add(Dense(units=math.floor(DIM_BASE* (DROP_SIZE * 4)), activation='relu'))
model.add(Dropout(DROP_SIZE))
model.add(Dense(units=1, activation='sigmoid'))

## Compiling and training the model.

After a model has been built, it needs to be compiled and then trained.

### Compiler.
Compiler prepared the built model for use. Compiler needs to be specified the
parameters for *optimizer* and *loss* algorithms.

**Optimizer** - the learning algorithm which adjusts the NN parameters,
especially the *weights*, so as to achieve learning.

See https://keras.io/api/optimizers.

**Loss** - the loss evaluation algorithm. In our case, the *cross-entropy* loss
algorithm calculates the cross-entropy loss between true labels and predicted
labels.

See https://keras.io/api/losses.

See https://keras.io/api/losses/probabilistic_losses/#binarycrossentropy-class.

### Trainer.

The model is trained with the `fit()` function. The callback parameter contains
the *yearly_stop* instance, which will stop the learning process if the learning
stops to improve the monitored results.

In [19]:
model.compile(loss='binary_crossentropy', optimizer='adam', metrics='accuracy')

# Stop training when the monitored metric has stopped improving.
early_stop = EarlyStopping(monitor='val_loss', mode='min', verbose=1, patience=10)

# Perform the training of the model.
model.fit(x=X_train, y=y_train, epochs=EPOCHS, validation_data=(X_test, y_test), verbose=1, callbacks=[early_stop])

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<tensorflow.python.keras.callbacks.History at 0x7f2fec6eeb20>

## Testing the model.

Predictions are performed on a test class. The *confusion matrix* and the
*predictions classifications report* are calculated.

**Classification report results explained:**
* **Precision** – the accuracy of the predictions.
* **Recall** – the accuracy of the *True Positives* predictions
* **f1-score** – the accuracy of *True Positives* VS *False Positives* success.
* **support** – the amount of the items of this class in the corpus.

See https://en.wikipedia.org/wiki/Confusion_matrix.

See https://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html.

**NOTE:** The model accuracy is reporting as 0.9956, which is an indication that
the model is probably overfit. As I said above, we could lower the amount of
neurons to 40 and still keep the fit of the model good.

In [20]:
model.evaluate(X_test, y_test)

predictions = (model.predict(X_test) > 0.5).astype("int32")

print(confusion_matrix(y_test, predictions))
print(classification_report(y_test, predictions))

[[877   2]
 [  3 264]]
              precision    recall  f1-score   support

           0       1.00      1.00      1.00       879
           1       0.99      0.99      0.99       267

    accuracy                           1.00      1146
   macro avg       0.99      0.99      0.99      1146
weighted avg       1.00      1.00      1.00      1146

