<a href="https://colab.research.google.com/github/claudio-bon/spam-detection-r/blob/main/spam_detection_r.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Spam Detection
This project performs a text classification task on a dataset in which each record has a text field containing a message and a label indicating whether the message is spam or ham (not spam).<br>
Classification is carried out with two machine learning models. For each model a confidence interval on the accuracy is computed, and the two models are then compared.<br>
Finally, an LSA transformation of the feature space is applied and the same models are reassessed on the newly created features.

In [None]:
if(!require("magrittr"))
    install.packages("magrittr")
library(magrittr)

if(!require("tokenizers"))
    install.packages("tokenizers")
library(tokenizers)

if(!require("data.table"))
    install.packages("data.table")
library(data.table)

if(!require("text2vec"))
    install.packages("text2vec")
library(text2vec)

if(!require("qdap"))
    install.packages("qdap")
library(qdap)

if(!require("class"))
    install.packages("class")
library(class)

if(!require("MLmetrics"))
    install.packages("MLmetrics")
library(MLmetrics)

if(!require("glmnet"))
    install.packages("glmnet")
library(glmnet)

if(!require("stats"))
    install.packages("stats")
library(stats)


# Dataset Preparation

### Download
Download and extraction of the dataset.

In [None]:
download.file("https://archive.ics.uci.edu/ml/machine-learning-databases/00228/smsspamcollection.zip",
              "smsspamcollection.zip")
unzip("smsspamcollection.zip")
file.remove("readme")

### Load on Table
The data is loaded into a `data.table`.

In [None]:
table <- fread("SMSSpamCollection" ,sep="\t", header=FALSE, col.names=c("class","text"), quote="")

Dataset exploration.

In [None]:
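# Show the first rows and the number of messages per class
# (R indexing starts at 1, so there is no row 0 to print)
print(head(table, 10))
print(table[, .N, by = class])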

# Text Preprocessing
Data exploration shows that many spam messages share common features: links, email addresses, telephone numbers and amounts of money.<br>
So that the models can capture this pattern, these features are normalized by replacing each occurrence with a shared placeholder token.

In [None]:
lower <- function(text) tolower(text)
tokenize <- function(text) tokenize_words(text)
#substitute link as "<LINK>"
link_sub <- function(text) gsub( "(https?://)?(www\\.)?((\\w|-)+\\.\\s?){1,2}(co\\.\\s?)?(com|net|biz|uk|org|tv|ac)([A-Za-z0-9?&/!\\-\\=]*)", " <LINK>", text)
#substitute emails with "<EMAIL>"
email_sub <- function(text) gsub("[A-Za-z0-9.]+@[A-Za-z0-9]+\\.((co|uk)\\.)?(com|net|biz|uk|org|tv|ac)", "<EMAIL>", text)
#money quantity (e.g. £100 or £1.5) to "<MONEY>"
money_sub <- function(text) gsub("(£[0-9]+((\\.|,)[0-9]+)?)|([0-9]+(p/min|ppm|p\\sper\\sminute))", "<MONEY>", text)
#long number (generally phone numbers)
long_number_sub <- function(text) gsub("(\\+?\\d{5,})|(([0-9]{3,}-)+[0-9]{3,})|(([0-9]{3,}\\s)+[0-9]{3,})", "<LONGNUM>", text)
#\u0092
apostrophe_code_sub <- function(text) gsub("\\\\u0092", "'", text)
#ukn char code
ukn_code_sub <- function (text) gsub("&lt;#&gt;", "<UKNCODE>", text)
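
To illustrate what these substitutions achieve, the following minimal sketch applies two deliberately simplified patterns to made-up messages. The function `normalize_demo`, the sample messages and the shortened regular expressions are assumptions for illustration only; they are not the patterns defined above.

```r
# Hypothetical sample messages (not taken from the dataset).
msgs <- c("call 08001234567 now", "mail me at bob@example.com")

# Simplified stand-ins for long_number_sub and email_sub.
normalize_demo <- function(text) {
  text <- gsub("[0-9]{5,}", "<LONGNUM>", text)
  gsub("[A-Za-z0-9.]+@[A-Za-z0-9]+\\.com", "<EMAIL>", text)
}

normalized <- normalize_demo(msgs)
print(normalized)
```

Every phone number or address collapses to the same token, so the vocabulary sees a single feature instead of thousands of distinct strings.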

All the preprocessing functions defined above are chained to form a preprocessing pipeline.

In [None]:
preprocess_text <- function(text) {
    text %>%
        lower() %>%
        link_sub() %>%
        email_sub() %>%
        money_sub() %>%
        long_number_sub() %>%
        apostrophe_code_sub() %>%
        ukn_code_sub() %>%
        replace_contraction()
}

In [None]:
preprocess_text(table$text)

# Split the Dataset
The dataset is split into a training set and a test set in a 75/25 proportion.

In [None]:
smp_size <- floor(0.75 * nrow(table))
train_idx <- sample(seq_len(nrow(table)), size = smp_size)
train <- table[train_idx, ]
test <- table[-train_idx, ]
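
Because `sample()` is random, each run yields a different split. Below is a minimal sketch of a reproducible split, assuming a fixed seed is acceptable. The seed value `42`, the toy data frame `toy` and the `_toy` variable names are arbitrary choices for illustration.

```r
# Toy data standing in for the real table (hypothetical values).
toy <- data.frame(class = rep(c("ham", "spam"), times = c(16, 4)),
                  text  = paste("message", 1:20))

set.seed(42)  # fix the random number generator so the split is repeatable
smp_size_toy  <- floor(0.75 * nrow(toy))
train_idx_toy <- sample(seq_len(nrow(toy)), size = smp_size_toy)
train_toy <- toy[train_idx_toy, ]
test_toy  <- toy[-train_idx_toy, ]

cat("train rows:", nrow(train_toy), " test rows:", nrow(test_toy), "\n")
```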

# Feature Generation
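The models are trained on TF-IDF features. The vocabulary and the IDF weights are fitted on the training set only and then reused to transform the test set.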

In [None]:
it_train = itoken(train$text,
            preprocessor = preprocess_text,
            tokenizer = word_tokenizer,
            progressbar = FALSE)

vocab = create_vocabulary(it_train)
vectorizer = vocab_vectorizer(vocab)
dtm_train = create_dtm(it_train, vectorizer)


tfidf = TfIdf$new()
# fit model to train data and transform train data with fitted model
dtm_train_tfidf = fit_transform(dtm_train, tfidf)

# apply pre-trained tf-idf transformation to test data
it_test = itoken(test$text,
            preprocessor = preprocess_text,
            tokenizer = word_tokenizer,
            progressbar = FALSE)

dtm_test = create_dtm(it_test, vectorizer)
dtm_test_tfidf = transform(dtm_test, tfidf)
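
As a rough illustration of what a TF-IDF weighting does, the sketch below computes the textbook form (term frequency times the log of the inverse document frequency) by hand in base R. The toy matrix and all variable names are assumptions, and `TfIdf` in text2vec applies its own smoothing and normalization defaults, so its values will generally differ from these.

```r
# Toy document-term counts: 2 documents, 3 terms (hypothetical values).
dtm_toy <- matrix(c(2, 0, 1,
                    0, 1, 1),
                  nrow = 2, byrow = TRUE,
                  dimnames = list(c("d1", "d2"), c("free", "hello", "call")))

tf  <- dtm_toy / rowSums(dtm_toy)      # term frequency within each document
df  <- colSums(dtm_toy > 0)            # number of documents containing each term
idf <- log(nrow(dtm_toy) / df)         # textbook inverse document frequency
tfidf_toy <- sweep(tf, 2, idf, `*`)    # scale each term column by its idf

print(round(tfidf_toy, 3))
```

The term `call` occurs in every document, so its weight drops to zero, while terms specific to one document keep a positive weight.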

# Classification
In this section two models are used to classify messages as ham or spam.<br>
The two models are KNN and Logistic Regression.

### KNN

Since KNN has no training phase, predictions on the test set can be computed directly.

In [None]:
#time >20 minutes
knn_preds <- knn(train = dtm_train_tfidf, test = dtm_test_tfidf, cl = train$class, k=10)
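
The call above takes a long time on the full TF-IDF matrix. To see the mechanics on something small, here is a minimal sketch using the same `knn()` function from the `class` package on hypothetical two-dimensional points. The toy coordinates, labels and `_toy` names are assumptions for illustration.

```r
library(class)

# Two well-separated clusters of hypothetical training points.
train_toy_x <- matrix(c(0, 0,  0, 1,  1, 0,
                        5, 5,  5, 6,  6, 5),
                      ncol = 2, byrow = TRUE)
train_toy_y <- factor(c("ham", "ham", "ham", "spam", "spam", "spam"))

# Each test point is labelled by a majority vote among its k nearest training points.
test_toy_x <- matrix(c(0.5, 0.5,
                       5.5, 5.5),
                     ncol = 2, byrow = TRUE)
pred_toy <- knn(train = train_toy_x, test = test_toy_x, cl = train_toy_y, k = 3)

print(pred_toy)
```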

Evaluation:

In [None]:
cat("Confusion Matrix:")
ConfusionDF(y_pred = knn_preds, y_true = test$class)
cat("\nF1 score:")
F1_Score(y_pred = knn_preds, y_true = test$class)
cat("\nAccuracy:")
Accuracy(y_pred = knn_preds, y_true = test$class)
cat("\nPrecision:")
Precision(y_pred = knn_preds, y_true = test$class)
cat("\nRecall:")
Recall(y_pred = knn_preds, y_true = test$class)

Confusion Matrix:

y_true,y_pred,Freq
<chr>,<chr>,<int>
ham,ham,1212
spam,ham,124
ham,spam,1
spam,spam,57



F1 score:


Accuracy:


Precision:


Recall:
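
The counts in the confusion matrix above are enough to work the metrics out by hand. The sketch below does so treating spam as the positive class, which is an assumption: the MLmetrics functions may use a different positive class by default, so their precision and recall need not match these values.

```r
# Counts copied from the KNN confusion matrix above.
tp <- 57    # spam predicted as spam
fn <- 124   # spam predicted as ham
fp <- 1     # ham predicted as spam
tn <- 1212  # ham predicted as ham

accuracy  <- (tp + tn) / (tp + tn + fp + fn)
precision <- tp / (tp + fp)
recall    <- tp / (tp + fn)
f1        <- 2 * precision * recall / (precision + recall)

cat("accuracy:", accuracy, " precision:", precision,
    " recall:", recall, " F1:", f1, "\n")
```

Accuracy is about 0.91, yet only 57 of the 181 spam messages in the test set are caught, which accuracy alone hides because ham dominates the data.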

#### Confidence
Let's start by defining the functions that compute the confidence interval of a model's accuracy.

In [None]:
compute_z <- function(confidence) {
    alpha <- 1-confidence
    qnorm(1-(alpha/2))
}
compute_pmax <- function(N, acc, Z) (2*N*acc + Z^2 + Z*sqrt(Z^2 + 4*N*acc - 4*N*acc^2))/(2*(N + Z^2))
compute_pmin <- function(N, acc, Z) (2*N*acc + Z^2 - Z*sqrt(Z^2 + 4*N*acc - 4*N*acc^2))/(2*(N + Z^2))
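
The two bound functions above match what is commonly known as the Wilson score interval for a proportion. Written out with $N$ test samples, observed accuracy $acc$ and $Z = Z_{\alpha/2}$ (symbols chosen here to mirror the function arguments), the bounds are:

$$p_{\min},\ p_{\max} = \frac{2N\,acc + Z^2 \mp Z\sqrt{Z^2 + 4N\,acc - 4N\,acc^2}}{2\,(N + Z^2)}$$

The minus sign gives $p_{\min}$ and the plus sign gives $p_{\max}$.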

In [None]:
get_num_pos <- function(y_pred, y_true) {
    confusion_mat <- ConfusionDF(y_pred = y_pred, y_true = y_true)
    pos_matrix <- subset(confusion_mat, y_true == y_pred)
    pos <- sum(pos_matrix$Freq)
}

get_confidence_interval <- function(confidence, y_pred, y_true) {
    pos <- get_num_pos(y_pred = y_pred, y_true = y_true)
    n_trials <- length(y_true)
    acc <- pos/n_trials

    #Compute Z
    Z <- compute_z(confidence = confidence)

    #Compute the lower and upper bounds of the interval
    pmin <- compute_pmin(N = n_trials, acc = acc, Z = Z)
    pmax <- compute_pmax(N = n_trials, acc = acc, Z = Z)

    list("min" = pmin, "max" = pmax)
}

Let's compute the confidence interval of the model's prediction with a confidence level of $0.95$.

In [None]:
p_knn <- get_confidence_interval(confidence = 0.95, y_pred = knn_preds, y_true = test$class)
cat("Confidence interval: ")
cat("(",p_knn$min,", ",p_knn$max,")")

Confidence interval: ( 0.8941824 ,  0.9242223 )
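
As a sanity check, the interval can be recomputed directly from the confusion matrix counts. This is a standalone sketch: the variable names are introduced here for illustration and the counts are copied from the KNN output above.

```r
# Correct KNN predictions and test-set size, read off the confusion matrix.
n_correct <- 1212 + 57
n_test    <- 1212 + 124 + 1 + 57
acc_knn   <- n_correct / n_test

z <- qnorm(1 - (1 - 0.95) / 2)   # about 1.96 for a 0.95 confidence level
spread <- z * sqrt(z^2 + 4 * n_test * acc_knn - 4 * n_test * acc_knn^2)
lower  <- (2 * n_test * acc_knn + z^2 - spread) / (2 * (n_test + z^2))
upper  <- (2 * n_test * acc_knn + z^2 + spread) / (2 * (n_test + z^2))

cat("(", lower, ",", upper, ")\n")
```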

### Logistic Regression

Train the model:

In [None]:
binomial_model <- cv.glmnet(x = dtm_train_tfidf, y = train$class, family = "binomial")

“from glmnet Fortran code (error code -62); Convergence for 62th lambda value not reached after maxit=100000 iterations; solutions for larger lambdas returned”
“from glmnet Fortran code (error code -61); Convergence for 61th lambda value not reached after maxit=100000 iterations; solutions for larger lambdas returned”
“from glmnet Fortran code (error code -60); Convergence for 60th lambda value not reached after maxit=100000 iterations; solutions for larger lambdas returned”
“from glmnet Fortran code (error code -60); Convergence for 60th lambda value not reached after maxit=100000 iterations; solutions for larger lambdas returned”
“from glmnet Fortran code (error code -61); Convergence for 61th lambda value not reached after maxit=100000 iterations; solutions for larger lambdas returned”
“from glmnet Fortran code (error code -61); Convergence for 61th lambda value not reached after maxit=100000 iterations; solutions for larger lambdas returned”


Use the trained model to predict on the test set:

In [None]:
binomial_probabilities <- predict(binomial_model, newx = dtm_test_tfidf, type = "response")
head(binomial_probabilities)

Unnamed: 0,1
1,0.03391054
2,0.03391054
3,0.8453471
4,0.03391054
5,0.32865097
6,0.03391054


Convert probabilities to class predictions:

In [None]:
threshold <- function(x) as.integer(x >=0.5)
idx_to_class <- function(x) if (x==0) "ham" else "spam"
p2c <- function(x) {x %>% threshold %>% idx_to_class}
preds_to_class <- function(preds) sapply(preds, p2c)

binomial_preds <- preds_to_class(binomial_probabilities)
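
The same conversion can also be written without `sapply()`, since comparison and `ifelse()` are vectorized. A minimal sketch follows; the function `to_class`, its `cutoff` argument and the sample probabilities are hypothetical. Because `predict()` returns a one-column matrix here, wrapping the input in `as.vector()` may be needed to get a plain character vector back.

```r
# Hypothetical probabilities of the "spam" class.
probs <- c(0.03, 0.85, 0.50, 0.33)

# ifelse() works on the whole vector at once.
to_class <- function(p, cutoff = 0.5) ifelse(p >= cutoff, "spam", "ham")

classes <- to_class(probs)
print(classes)
```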

Show measures:

In [None]:
cat("Confusion Matrix:")
ConfusionDF(y_pred = binomial_preds, y_true = test$class)
cat("\nF1 score:")
F1_Score(y_pred = binomial_preds, y_true = test$class)
cat("\nAccuracy:")
Accuracy(y_pred = binomial_preds, y_true = test$class)
cat("\nPrecision:")
Precision(y_pred = binomial_preds, y_true = test$class)
cat("\nRecall:")
Recall(y_pred = binomial_preds, y_true = test$class)

Confusion Matrix:

y_true,y_pred,Freq
<chr>,<chr>,<int>
ham,ham,1209
spam,ham,36
ham,spam,4
spam,spam,145



F1 score:


Accuracy:


Precision:


Recall:

#### Confidence
Confidence interval for a confidence level of $0.95$.

In [None]:
p_bin <- get_confidence_interval(confidence = 0.95, y_pred = binomial_preds, y_true = test$class)
cat("Confidence interval: ")
cat("(",p_bin$min,", ",p_bin$max,")")

Confidence interval: ( 0.9611633 ,  0.9788575 )

### Comparing Models

Functions to compute the error rate $e_i$ and the variance $\hat\sigma_i^2$ of each model.

In [None]:
get_e <- function(N, N_pos) (N - N_pos)/N
get_var <- function(N, e) (e*(1-e))/N

Functions to compute the difference $d$ between the models' error rates and the sum of the models' variances $\hat\sigma_t^2$.

In [None]:
get_d <- function(e_1, e_2) abs(e_1 - e_2)
get_var_t <- function(n_trials, e_1, e_2) {
    var_1 <- get_var(n_trials, e_1)
    var_2 <- get_var(n_trials, e_2)
    var_1 + var_2
}

Function to compute the confidence interval for the true difference $d_t$ between the two models' error rates.

In [None]:
get_dt <- function(Z, d, var_t) {
    list("max" = d + Z*sqrt(var_t), "min" = d - Z*sqrt(var_t))
}
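
Collected in one place, the quantities computed by the helper functions above are as follows, where $N$ is the test-set size and $N_{pos,i}$ the number of correct predictions of model $i$ (the subscript notation is introduced here for readability):

$$
\begin{aligned}
e_i &= \frac{N - N_{pos,i}}{N}, &
\hat\sigma_i^2 &= \frac{e_i\,(1 - e_i)}{N}, \\
d &= |e_1 - e_2|, &
\hat\sigma_t^2 &= \hat\sigma_1^2 + \hat\sigma_2^2, \\
d_t &= d \pm Z_{\alpha/2}\,\hat\sigma_t
\end{aligned}
$$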

Now let's compare the KNN and the Logistic Regression models.

In [None]:
n_trials <- length(test$class)
pos_knn <- get_num_pos(knn_preds, test$class)
pos_bin <- get_num_pos(binomial_preds, test$class)

e_knn <- get_e(n_trials, pos_knn)
e_bin <- get_e(n_trials, pos_bin)

d <- get_d(e_knn, e_bin)
var_t <- get_var_t(n_trials, e_knn, e_bin)

In [None]:
Z <- compute_z(confidence = 0.95)
dt <- get_dt(Z, d, var_t)

cat("Confidence interval: ")
cat("(",dt$min,", ",dt$max,")")

Confidence interval: ( 0.04360463 ,  0.07834659 )

Since 0 is not inside the interval, the difference between the two models is statistically significant at a confidence level of 0.95.<br>
To find the confidence level at which the difference stops being significant, find the smallest value of $Z_{\alpha/2}$ for which the interval reaches 0, that is $Z_{\alpha/2}\hat{\sigma}_t\geq d\Rightarrow Z_{\alpha/2}\geq \frac{d}{\hat{\sigma}_t}$, where $\hat{\sigma}_t$ is the square root of the summed variances.

In [None]:
compute_confidence <- function(Z) 1-2*(1-pnorm(Z))

In [None]:
min_z = d/sqrt(var_t)
cat("Z =",min_z,"\n")

negligible_confidence <- compute_confidence(Z = min_z)
cat("Negligible confidence level:",negligible_confidence,"\n")

dt_negl <- get_dt(min_z, d, var_t)
cat("Confidence interval: ")
cat("(",dt_negl$min,", ",dt_negl$max,")")

Z = 6.879863 
Negligible confidence level: 1 
Confidence interval: ( 0 ,  0.1219512 )

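It can indeed be seen that for $Z_{\alpha /2}=6.88$ the lower bound of the interval is $0$. The corresponding confidence level is so close to 1 that it is printed as 1, so the difference is significant at any practical confidence level.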
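
The output above prints the confidence level as 1 because of rounding. A minimal sketch that recomputes $Z$ from the confusion matrix counts and reports $1 - \text{confidence}$ instead, which keeps the tiny value visible. The variable names are assumptions introduced for this example.

```r
n_test  <- 1394                 # size of the test set
e_knn_x <- (124 + 1) / n_test   # KNN error rate from its confusion matrix
e_log_x <- (36 + 4) / n_test    # Logistic Regression error rate

d_x   <- abs(e_knn_x - e_log_x)
sd_x  <- sqrt(e_knn_x * (1 - e_knn_x) / n_test + e_log_x * (1 - e_log_x) / n_test)
z_min <- d_x / sd_x             # Z at which the lower bound d - Z * sd reaches 0

# 2 * pnorm(-z) equals 1 - confidence and does not round to exactly 0.
alpha_min <- 2 * pnorm(-z_min)
cat("Z =", z_min, " 1 - confidence =", alpha_min, "\n")
```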

# LSA
The feature space created by the TF-IDF procedure is both high-dimensional and sparse. LSA addresses both problems by projecting the data into a smaller, dense space, and in doing so it also brings out latent relationships among the TF-IDF features.

Since applying LSA also implies reducing the feature space, it is helpful to know the dimension of the starting feature space (the one generated by the TF-IDF procedure).

In [None]:
dim(vocab)[1]

Let's define an LSA transformation that keeps 300 features.

In [None]:
lsa_m = LatentSemanticAnalysis$new(300)
train_lsa = lsa_m$fit_transform(dtm_train_tfidf)
test_lsa = lsa_m$transform(dtm_test_tfidf)
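
Conceptually, LSA is a truncated singular value decomposition of the document-term matrix. The sketch below shows that idea with base R's `svd()` on a hypothetical 4 x 4 matrix. It is only an illustration of the principle: text2vec may use a different, iterative solver internally, and all names here are assumptions.

```r
# Toy document-term matrix: 4 documents, 4 terms (hypothetical counts).
X <- matrix(c(2, 1, 0, 0,
              1, 1, 0, 0,
              0, 0, 1, 2,
              0, 0, 0, 1),
            nrow = 4, byrow = TRUE)

k <- 2                              # number of latent dimensions to keep
s <- svd(X, nu = k, nv = k)         # first k left/right singular vectors, all singular values
docs_k <- s$u %*% diag(s$d[1:k])    # documents expressed in the k-dimensional latent space

# Unseen documents are projected with the term loadings learned above.
new_doc <- matrix(c(2, 1, 0, 0), nrow = 1)
new_k   <- new_doc %*% s$v

print(dim(docs_k))
```

Here `new_doc` repeats the first row of `X`, so its projection coincides with the first row of `docs_k`. This mirrors the fit on training data followed by a transform of test data used above.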

INFO  [12:57:32.326] soft_als: iter 001, frobenious norm change 11.730 loss NA  
INFO  [12:57:33.292] soft_als: iter 002, frobenious norm change 0.498 loss NA  
INFO  [12:57:34.332] soft_als: iter 003, frobenious norm change 0.065 loss NA  
INFO  [12:57:35.325] soft_als: iter 004, frobenious norm change 0.020 loss NA  
INFO  [12:57:36.243] soft_als: iter 005, frobenious norm change 0.009 loss NA  
INFO  [12:57:37.263] soft_als: iter 006, frobenious norm change 0.004 loss NA  
INFO  [12:57:38.252] soft_als: iter 007, frobenious norm change 0.003 loss NA  
INFO  [12:57:39.207] soft_als: iter 008, frobenious norm change 0.002 loss NA  
INFO  [12:57:40.138] soft_als: iter 009, frobenious norm change 0.001 loss NA  
INFO  [12:57:41.106] soft_als: iter 010, frobenious norm change 0.001 loss NA  
INFO  [12:57:41.109] soft_impute: converged with tol 0.001000 after 10 iter 


### KNN
Let's now repeat the experiment with the KNN classifier using LSA features instead of TF-IDF. Note that this run uses `k=11` rather than `k=10`.

Prediction phase:

In [None]:
knn_preds_lsa <- knn(train = train_lsa, test = test_lsa, cl = train$class, k=11)

Evaluation:

In [None]:
cat("Confusion Matrix:")
ConfusionDF(y_pred = knn_preds_lsa, y_true = test$class)
cat("\nF1 score:")
F1_Score(y_pred = knn_preds_lsa, y_true = test$class)
cat("\nAccuracy:")
Accuracy(y_pred = knn_preds_lsa, y_true = test$class)
cat("\nPrecision:")
Precision(y_pred = knn_preds_lsa, y_true = test$class)
cat("\nRecall:")
Recall(y_pred = knn_preds_lsa, y_true = test$class)

Confusion Matrix:

y_true,y_pred,Freq
<chr>,<chr>,<int>
ham,ham,1184
spam,ham,15
ham,spam,29
spam,spam,166



F1 score:


Accuracy:


Precision:


Recall:

The results are indeed better than those obtained by KNN on the raw TF-IDF features: 44 test messages are misclassified instead of 125.

#### Confidence

In [None]:
p_knn_lsa <- get_confidence_interval(confidence = 0.95, y_pred = knn_preds_lsa, y_true = test$class)
cat("Confidence interval: ")
cat("(",p_knn_lsa$min,", ",p_knn_lsa$max,")")

Confidence interval: ( 0.9578935 ,  0.9764042 )

As expected, the confidence interval shifts upward as well, and it no longer overlaps the TF-IDF one (0.894, 0.924).

### Logistic Regression
Let's repeat the experiment with LSA features for Logistic Regression as well.

Training phase:

In [None]:
binomial_model_lsa <- cv.glmnet(x = train_lsa, y = train$class, family = "binomial")

Prediction:

In [None]:
binomial_probabilities_lsa <- predict(binomial_model_lsa, newx = test_lsa, type = "response")
binomial_preds_lsa <- preds_to_class(binomial_probabilities_lsa)

Evaluation:

In [None]:
cat("Confusion Matrix:")
ConfusionDF(y_pred = binomial_preds_lsa, y_true = test$class)
cat("\nF1 score:")
F1_Score(y_pred = binomial_preds_lsa, y_true = test$class)
cat("\nAccuracy:")
Accuracy(y_pred = binomial_preds_lsa, y_true = test$class)
cat("\nPrecision:")
Precision(y_pred = binomial_preds_lsa, y_true = test$class)
cat("\nRecall:")
Recall(y_pred = binomial_preds_lsa, y_true = test$class)

Confusion Matrix:

y_true,y_pred,Freq
<chr>,<chr>,<int>
ham,ham,1208
spam,ham,23
ham,spam,5
spam,spam,158



F1 score:


Accuracy:


Precision:


Recall:

Logistic Regression also shows a general improvement: 28 test messages are misclassified instead of 40.

#### Confidence

In [None]:
p_bin_lsa <- get_confidence_interval(confidence = 0.95, y_pred = binomial_preds_lsa, y_true = test$class)
cat("Confidence interval: ")
cat("(",p_bin_lsa$min,", ",p_bin_lsa$max,")")

Confidence interval: ( 0.9711231 ,  0.986067 )

The confidence interval shifts upward as well, although it still overlaps the TF-IDF one (0.961, 0.979).

### Comparing Models
Let's now compare the two models trained on LSA features.

In [None]:
pos_knn_lsa <- get_num_pos(knn_preds_lsa, test$class)
pos_bin_lsa <- get_num_pos(binomial_preds_lsa, test$class)

e_knn_lsa <- get_e(n_trials, pos_knn_lsa)
e_bin_lsa <- get_e(n_trials, pos_bin_lsa)

d_lsa <- get_d(e_knn_lsa, e_bin_lsa)
var_t_lsa <- get_var_t(n_trials, e_knn_lsa, e_bin_lsa)

In [None]:
Z_lsa <- compute_z(confidence = 0.95)
dt_lsa <- get_dt(Z_lsa, d_lsa, var_t_lsa)

cat("Confidence interval: ")
cat("(",dt_lsa$min,", ",dt_lsa$max,")")

Confidence interval: ( -0.0002897761 ,  0.0232453 )

Since 0 lies inside the interval, the difference between the two models fed with LSA-transformed TF-IDF features is not statistically significant at a confidence level of $0.95$.<br>
The confidence level at which the lower bound of the interval reaches exactly 0 can be found in the following way:

In [None]:
min_z_lsa = d_lsa/sqrt(var_t_lsa)
cat("Z =",min_z_lsa,"\n")

negligible_confidence_lsa <- compute_confidence(Z = min_z_lsa)
cat("Negligible confidence level:",negligible_confidence_lsa,"\n")

dt_negl_lsa <- get_dt(min_z_lsa, d_lsa, var_t_lsa)
cat("Confidence interval: ")
cat("(",dt_negl_lsa$min,", ",dt_negl_lsa$max,")")

Z = 1.9117 
Negligible confidence level: 0.9440853 
Confidence interval: ( 0 ,  0.02295552 )