# Binary Text Classification - SMS Spam classifier
* Notebook by Adam Lang
* Date: 8/2/2024

# Overview
* In this notebook we will build a binary text classification machine learning model to predict ham vs. spam for SMS messages.
* This is a popular dataset from Kaggle and the UCI Machine Learning Repository.
* The SMS Spam Collection is a set of tagged SMS messages collected for SMS spam research. It contains one set of 5,574 SMS messages in English, each tagged as ham (legitimate) or spam. The CSV file loaded below contains 5,572 rows.
  * dataset link: https://www.kaggle.com/datasets/uciml/sms-spam-collection-dataset

# Machine Learning Model
* We will use a Multinomial Naive Bayes Classifier for binary prediction.
* Multinomial naïve Bayes is commonly used to assign documents to classes based on a **statistical analysis** of their contents.
* It is a lightweight alternative to "heavy" AI-based semantic analysis and greatly simplifies text classification tasks.
   * The method assigns fragments of text (i.e. documents) to classes by estimating the **probability that a document belongs to the same class as other documents on the same subject.**
   * Each document consists of multiple words (i.e. terms) that contribute to an understanding of the document’s contents.
   * A class is a tag shared by one or more documents that refer to the same subject.
* Unlike many other AI and machine learning (ML) methods used for content-based text classification, **multinomial Bayesian classifiers are entirely a data mining approach** that predicts classes for new texts from term statistics, without continuous retraining of the model.
   * However, to avoid the early convergence and cold start issues that multinomial models can encounter, semi-supervised learning algorithms are recommended for training the model to improve prediction.
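
To make the "probability that a document belongs to a class" idea concrete, here is a minimal from-scratch sketch of multinomial naive Bayes with Laplace smoothing. It is not the scikit-learn implementation used later in this notebook; the function names and the toy corpus are made up for illustration.

```python
import math
from collections import Counter, defaultdict


def train_multinomial_nb(docs, labels, alpha=1.0):
    """Estimate log priors and Laplace-smoothed log likelihoods from token lists."""
    class_doc_counts = Counter(labels)
    term_counts = defaultdict(Counter)  # class -> term -> count
    vocab = set()
    for tokens, label in zip(docs, labels):
        term_counts[label].update(tokens)
        vocab.update(tokens)

    n_docs = len(docs)
    log_prior = {c: math.log(n / n_docs) for c, n in class_doc_counts.items()}
    log_likelihood = {}
    for c in class_doc_counts:
        # smoothed denominator: all term occurrences in the class + alpha per vocab term
        total = sum(term_counts[c].values()) + alpha * len(vocab)
        log_likelihood[c] = {
            t: math.log((term_counts[c][t] + alpha) / total) for t in vocab
        }
    return log_prior, log_likelihood


def predict(tokens, log_prior, log_likelihood):
    """Return the class with the highest log posterior; unseen terms are skipped."""
    scores = {}
    for c in log_prior:
        scores[c] = log_prior[c] + sum(
            log_likelihood[c][t] for t in tokens if t in log_likelihood[c]
        )
    return max(scores, key=scores.get)


# toy corpus (made up for illustration, not taken from the dataset)
toy_docs = [
    "win free prize now".split(),
    "free entry win cash".split(),
    "ok see you later".split(),
    "are you home now".split(),
]
toy_labels = ["spam", "spam", "ham", "ham"]

log_prior, log_likelihood = train_multinomial_nb(toy_docs, toy_labels)
print(predict("free cash prize".split(), log_prior, log_likelihood))
print(predict("see you at home".split(), log_prior, log_likelihood))
```

Working in log space avoids numeric underflow when many small probabilities are multiplied, and the smoothing term `alpha` keeps a term that never appeared in a class from forcing that class's probability to zero.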

# Workflow
1. Load Dataset
2. Pre-process data
3. Feature engineering and Model Building
  * a. Create Meta features
  * b. Counting Nouns and Verbs
  * c. Model building for meta features
  * d. Tf-idf Features
  * e. Model Building for Complete Feature Set

## 1. Loading the dataset

In [1]:
# imports
import pandas as pd
import string #python string library

In [2]:
# mount drive
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [3]:
# set data path
data_path = '/content/drive/MyDrive/Colab Notebooks/Classical NLP/spam.csv'

### Determine Encoding of File

In [4]:
#!pip install chardet

In [5]:
# look at the first ten thousand bytes to guess the character encoding
import chardet

with open(data_path, 'rb') as rawdata:
    result = chardet.detect(rawdata.read(10000))

# check what the character encoding might be
print(result)

{'encoding': 'Windows-1252', 'confidence': 0.7261670208776098, 'language': ''}
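
The detector reports a confidence of only about 0.73, so the encoding is a guess rather than a certainty. One way to sanity-check such a guess, sketched here with the standard library only, is to try a strict UTF-8 decode first and fall back to the detected encoding. The helper name and the sample bytes are hypothetical.

```python
def decode_with_fallback(raw_bytes, encodings=("utf-8", "cp1252")):
    """Try each candidate encoding in order; return (text, encoding_used)."""
    for enc in encodings:
        try:
            return raw_bytes.decode(enc), enc
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings could decode the input")


# byte 0xA3 is the pound sign in Windows-1252 but an invalid start byte in UTF-8
sample = b"win \xa3100 cash"
text, used = decode_with_fallback(sample)
print(text, used)
```

`cp1252` is Python's codec name for Windows-1252, so it corresponds to the encoding passed to `pd.read_csv` in the next cell.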


In [6]:
# load dataset
data = pd.read_csv(data_path, encoding='Windows-1252')
#data head
data.head()

Unnamed: 0,v1,v2,Unnamed: 2,Unnamed: 3,Unnamed: 4
0,ham,"Go until jurong point, crazy.. Available only ...",,,
1,ham,Ok lar... Joking wif u oni...,,,
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,,,
3,ham,U dun say so early hor... U c already then say...,,,
4,ham,"Nah I don't think he goes to usf, he lives aro...",,,


In [7]:
## data info
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5572 entries, 0 to 5571
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   v1          5572 non-null   object
 1   v2          5572 non-null   object
 2   Unnamed: 2  50 non-null     object
 3   Unnamed: 3  12 non-null     object
 4   Unnamed: 4  6 non-null      object
dtypes: object(5)
memory usage: 217.8+ KB


In [8]:
print(type(data))

<class 'pandas.core.frame.DataFrame'>


In [9]:
## data columns
data.columns

Index(['v1', 'v2', 'Unnamed: 2', 'Unnamed: 3', 'Unnamed: 4'], dtype='object')

In [10]:
## rename columns
data.columns = ['label','text','3','4','5']
data.head()

Unnamed: 0,label,text,3,4,5
0,ham,"Go until jurong point, crazy.. Available only ...",,,
1,ham,Ok lar... Joking wif u oni...,,,
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,,,
3,ham,U dun say so early hor... U c already then say...,,,
4,ham,"Nah I don't think he goes to usf, he lives aro...",,,


In [11]:
## drop the last 3 columns
data.drop(['3','4','5'],axis=1, inplace=True)

In [12]:
data.columns

Index(['label', 'text'], dtype='object')

In [13]:
data.head()

Unnamed: 0,label,text
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


In [14]:
## get class distribution
data.label.value_counts(normalize=True)

label,proportion
ham,0.865937
spam,0.134063


Summary:
* The target label is imbalanced, a typical problem in spam detection: about 86.6% of messages are ham and only 13.4% are spam.
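
One practical consequence of this imbalance is that accuracy has a high floor: a classifier that always predicts the majority class already scores roughly the majority share. A small sketch with made-up toy labels (not the real data) shows the calculation.

```python
from collections import Counter


def majority_baseline_accuracy(labels):
    """Accuracy of a classifier that always predicts the most frequent label."""
    counts = Counter(labels)
    return counts.most_common(1)[0][1] / len(labels)


# toy labels with roughly the same imbalance as above (87 ham, 13 spam)
toy_labels = ["ham"] * 87 + ["spam"] * 13
print(majority_baseline_accuracy(toy_labels))
```

With the proportions shown above, the same baseline on this dataset would be about 86.6%, which is a useful reference point when reading the accuracy scores later in the notebook.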

## 2. Data pre-processing
* First test these processes on a sample of the data, then implement.

Lowercase text

In [15]:
# sample document - lowercase
cleaned = data['text'][0].lower()

In [16]:
# sample document
cleaned

'go until jurong point, crazy.. available only in bugis n great world la e buffet... cine there got amore wat...'

Punctuation removal

In [17]:
# pre init list of punctuations
punct = string.punctuation

In [18]:
# list of punctuations
punct

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [19]:
# sample doc: remove punctuations --> concatenate
cleaned = "".join(character for character in cleaned if character not in punct)

In [20]:
# sample doc
cleaned

'go until jurong point crazy available only in bugis n great world la e buffet cine there got amore wat'
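
As an alternative to the character-by-character join, the same result can be obtained with `str.translate`, which deletes all punctuation in a single pass. This is only a sketch of an equivalent approach; the names `PUNCT_TABLE` and `strip_punctuation` are hypothetical.

```python
import string

# translation table that deletes every character in string.punctuation
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def strip_punctuation(text):
    """Remove ASCII punctuation in one pass using str.translate."""
    return text.translate(PUNCT_TABLE)


sample = ("go until jurong point, crazy.. available only in bugis n great "
          "world la e buffet... cine there got amore wat...")
print(strip_punctuation(sample))
```

Note that `string.punctuation` covers ASCII punctuation only, so characters such as `£` or curly quotes are left in place by either approach.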

Stopword removal

In [21]:
# import spacy
from spacy.lang.en import English

# create a blank English pipeline (tokenizer only; no tagger, parser, NER or word vectors)
nlp = English()

In [22]:
# spacy doc creation
my_doc = nlp(cleaned)

2 step process to remove stop words

In [23]:
#1. create list of word tokens
token_list = []
for token in my_doc:
  token_list.append(token.text)

In [24]:
# sample the tokens
token_list[0:5]

['go', 'until', 'jurong', 'point', 'crazy']

In [25]:
# import spacy stopwords
from spacy.lang.en.stop_words import STOP_WORDS

In [26]:
#2. iterate and remove stop words
filtered_sent = []

# iterate over tokens
for word in token_list:
  # look up the word's lexeme in the vocabulary
  lexeme = nlp.vocab[word]
  # keep the word only if it is not a stop word
  if not lexeme.is_stop:
    filtered_sent.append(word)

# print tokens and filtered sentence
print(token_list)
print(filtered_sent)
cleaned = filtered_sent

['go', 'until', 'jurong', 'point', 'crazy', 'available', 'only', 'in', 'bugis', 'n', 'great', 'world', 'la', 'e', 'buffet', 'cine', 'there', 'got', 'amore', 'wat']
['jurong', 'point', 'crazy', 'available', 'bugis', 'n', 'great', 'world', 'la', 'e', 'buffet', 'cine', 'got', 'amore', 'wat']


In [27]:
## now join tokenized words in sample doc
cleaned = " ".join(cleaned)
cleaned

'jurong point crazy available bugis n great world la e buffet cine got amore wat'

Now we can create a function that does all of the above

In [28]:
# text preprocessing function
def clean_text(text):
  """Lowercase, strip punctuation, and remove stop words from a message."""
  # lowercase text
  cleaned = text.lower()

  # punctuation removal
  cleaned = "".join(character for character in cleaned if character not in string.punctuation)

  # tokenize with spaCy
  token_list = [token.text for token in nlp(cleaned)]

  # keep only tokens that are not stop words
  filtered_sent = [word for word in token_list if not nlp.vocab[word].is_stop]

  # join the remaining tokens back into one string
  return " ".join(filtered_sent)

In [29]:
## apply clean_text function
data['cleaned'] = data['text'].apply(clean_text)
data.head()

Unnamed: 0,label,text,cleaned
0,ham,"Go until jurong point, crazy.. Available only ...",jurong point crazy available bugis n great wor...
1,ham,Ok lar... Joking wif u oni...,ok lar joking wif u oni
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,free entry 2 wkly comp win fa cup final tkts 2...
3,ham,U dun say so early hor... U c already then say...,u dun early hor u c
4,ham,"Nah I don't think he goes to usf, he lives aro...",nah nt think goes usf lives


## 3. Feature Engineering and Model Building

### a. Create Meta Features
     1. Number of words in original text.
     2. Number of words in cleaned text.
     3. Number of characters including spaces in cleaned text.
     4. Number of characters excluding spaces in cleaned text.
     5. Number of numeric tokens (tokens consisting only of digits) in cleaned text.

In [30]:
## meta feature creation

#1. number of words in original text
data['word_count'] = data['text'].apply(lambda x: len(x.split()))

#2. number of words in cleaned text
data['word_count_cleaned'] = data['cleaned'].apply(lambda x: len(x.split()))

#3. Number of characters including spaces in cleaned text
data['char_count'] = data['cleaned'].apply(lambda x: len(x))

#4. Number of characters excluding spaces in cleaned_text
data['char_count_without_spaces'] = data['cleaned'].apply(lambda x: len(x.replace(" ", "")))

#5. Number of numeric tokens (whitespace-separated tokens made up only of digits) in cleaned text
data['num_dig'] = data['cleaned'].apply(lambda x: sum(w.isdigit() for w in x.split()))

In [31]:
# print dataset
data.head()

Unnamed: 0,label,text,cleaned,word_count,word_count_cleaned,char_count,char_count_without_spaces,num_dig
0,ham,"Go until jurong point, crazy.. Available only ...",jurong point crazy available bugis n great wor...,20,15,79,65,0
1,ham,Ok lar... Joking wif u oni...,ok lar joking wif u oni,6,6,23,18,0
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,free entry 2 wkly comp win fa cup final tkts 2...,28,22,131,110,3
3,ham,U dun say so early hor... U c already then say...,u dun early hor u c,11,6,19,14,0
4,ham,"Nah I don't think he goes to usf, he lives aro...",nah nt think goes usf lives,13,6,27,22,0


In [32]:
## nlp stats
data.describe()

Unnamed: 0,word_count,word_count_cleaned,char_count,char_count_without_spaces,num_dig
count,5572.0,5572.0,5572.0,5572.0,5572.0
mean,15.494436,8.520998,48.82771,40.79397,0.420136
std,11.329427,6.51806,40.006992,33.226067,0.980831
min,1.0,0.0,0.0,0.0,0.0
25%,7.0,4.0,20.0,17.0,0.0
50%,12.0,6.0,35.0,29.0,0.0
75%,23.0,12.0,70.0,58.0,0.0
max,171.0,73.0,460.0,346.0,9.0
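
The meta features can be checked by hand on a single message. The sketch below recomputes them for one cleaned string and also shows the difference between counting numeric tokens (what `num_dig` measures) and counting individual digit characters. The function name and the second sample message are made up for illustration.

```python
def meta_features(cleaned):
    """Compute simple count features for one cleaned message."""
    tokens = cleaned.split()
    return {
        "word_count_cleaned": len(tokens),
        "char_count": len(cleaned),
        "char_count_without_spaces": len(cleaned.replace(" ", "")),
        # tokens made up entirely of digits (what num_dig measures above)
        "numeric_tokens": sum(tok.isdigit() for tok in tokens),
        # individual digit characters, including those inside mixed tokens
        "digit_chars": sum(ch.isdigit() for ch in cleaned),
    }


print(meta_features("ok lar joking wif u oni"))
print(meta_features("call 08001234 now 2day"))
```

For the first message the values agree with row 1 of the table above (6 words, 23 characters, 18 without spaces, 0 numeric tokens). The second message has one numeric token but nine digit characters, since `2day` is not a purely numeric token.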


### b. Count Nouns and Verbs
* Count nouns and verbs using part-of-speech (POS) tags.

In [33]:
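# load spaCy's small English model, which includes the POS tagger
# (this replaces the blank English pipeline bound to nlp earlier)
import spacy
nlp = spacy.load("en_core_web_sm")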

In [34]:
# create spacy document - sample for testing
document = nlp(data['cleaned'][0])

In [35]:
# sample doc for testing
document

jurong point crazy available bugis n great world la e buffet cine got amore wat

In [36]:
## sample POS tags
all_tags = []
for w in document:
  all_tags.append(w.tag_)

In [37]:
# POS tags
all_tags

['NNP',
 'VBP',
 'NNP',
 'JJ',
 'NNP',
 'CC',
 'JJ',
 'NN',
 'NNP',
 'NNP',
 'NNP',
 'NNP',
 'VBD',
 'NNP',
 'NN']

In [38]:
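## dictionary of noun and verb POS tags -- granular tags
## note: VBP (non-3rd-person present verb) is not in the verb list, so tokens tagged VBP are not counted
pos_dict = {"noun": ["NNP", "NN", "NNS", "NNPS"], "verb": ["VBZ", "VB", "VBD", "VBG", "VBN"]}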

In [39]:
## Sample document: Noun count
count = 0
for tag in all_tags:
  if tag in pos_dict['noun']:
    count += 1

In [40]:
## print sample document : NOUN counts
count

10

In [41]:
## now create a function to do this on our dataset
def pos_tag(txt, family):
  """Count tokens in txt whose POS tag belongs to the given family ('noun' or 'verb')."""
  # spacy document creation
  doc = nlp(txt)

  # count tokens whose fine-grained tag is in the requested family
  return sum(1 for token in doc if token.tag_ in pos_dict[family])

In [42]:
## test on sample document: NOUN count
pos_tag("The police station is in New York City.", "noun")

5

In [43]:
## test on sample document: VERB count
pos_tag("They are running, jumping, and throwing in the Olympics.", "verb")

2

In [44]:
%%time
## now apply function to dataset (the %%time cell magic has to be the first line of the cell)
data['noun_count'] = data['cleaned'].apply(lambda x: pos_tag(x, "noun"))
data['verb_count'] = data['cleaned'].apply(lambda x: pos_tag(x, "verb"))

CPU times: user 1min 31s, sys: 551 ms, total: 1min 32s
Wall time: 1min 36s


In [45]:
# print output
data.head()

Unnamed: 0,label,text,cleaned,word_count,word_count_cleaned,char_count,char_count_without_spaces,num_dig,noun_count,verb_count
0,ham,"Go until jurong point, crazy.. Available only ...",jurong point crazy available bugis n great wor...,20,15,79,65,0,10,1
1,ham,Ok lar... Joking wif u oni...,ok lar joking wif u oni,6,6,23,18,0,3,1
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,free entry 2 wkly comp win fa cup final tkts 2...,28,22,131,110,3,12,0
3,ham,U dun say so early hor... U c already then say...,u dun early hor u c,11,6,19,14,0,6,0
4,ham,"Nah I don't think he goes to usf, he lives aro...",nah nt think goes usf lives,13,6,27,22,0,1,1


In [46]:
## analytics on noun and verb counts
data[['noun_count','verb_count']].describe()

Unnamed: 0,noun_count,verb_count
count,5572.0,5572.0
mean,4.589734,1.254128
std,4.06274,1.33912
min,0.0,0.0
25%,2.0,0.0
50%,3.0,1.0
75%,6.0,2.0
max,40.0,13.0
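
Because `pos_tag` is applied once for nouns and once for verbs, every message is parsed twice, which likely accounts for a good part of the runtime reported above. One possible refinement, sketched here without spaCy, is to tag each message once and count both families from the same tag list. The helper name is hypothetical, and the sample tags are copied from the output of the first cleaned message.

```python
from collections import Counter

# same tag families as pos_dict above
POS_FAMILIES = {
    "noun": {"NNP", "NN", "NNS", "NNPS"},
    "verb": {"VBZ", "VB", "VBD", "VBG", "VBN"},
}


def count_pos_families(tags, families=POS_FAMILIES):
    """Count how many tags fall into each family in a single pass."""
    tag_counts = Counter(tags)
    return {
        family: sum(tag_counts[t] for t in members)
        for family, members in families.items()
    }


# tags of the first cleaned message, copied from the output above
sample_tags = ["NNP", "VBP", "NNP", "JJ", "NNP", "CC", "JJ", "NN",
               "NNP", "NNP", "NNP", "NNP", "VBD", "NNP", "NN"]
print(count_pos_families(sample_tags))
```

The counts agree with row 0 of the table above (10 nouns, 1 verb). In the notebook, the tag list for each message could come from a single `nlp(...)` call, or from spaCy's `nlp.pipe`, which processes texts in batches.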


### c. Model Building for Meta Features

In [47]:
# label encoding target variable - convert strings to integers
from sklearn.preprocessing import LabelEncoder

target = data['label'].values
target = LabelEncoder().fit_transform(target)
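
For reference, `LabelEncoder` assigns integers in sorted order of the class names, so here `ham` becomes 0 and `spam` becomes 1. A plain-Python sketch of the same mapping (the function name is hypothetical):

```python
def encode_labels(labels):
    """Map each distinct label to an integer, in sorted order of the labels."""
    classes = sorted(set(labels))
    index = {label: i for i, label in enumerate(classes)}
    return [index[label] for label in labels], classes


encoded, classes = encode_labels(["ham", "ham", "spam", "ham", "spam"])
print(encoded, classes)
```

This ordering is why `target_names=['ham', 'spam']` lines up with classes 0 and 1 in the classification reports further down.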

In [48]:
# meta features only - can the model classify messages from these alone?
train = data[['word_count', 'word_count_cleaned', 'char_count',
              'char_count_without_spaces', 'num_dig', 'noun_count',
              'verb_count']]

In [49]:
## train-val split
from sklearn.model_selection import train_test_split

## split data - without specifying sklearn will default to 75% train/25% test
X_train, X_valid, y_train, y_valid = train_test_split(train, target, random_state=42, stratify=target)

In [50]:
# shape of datasets
print(f"Shape of X_train: {X_train.shape}")
print(f"Shape of y_train: {y_train.shape}")
print(f"Shape of X_valid: {X_valid.shape}")
print(f"Shape of y_valid: {y_valid.shape}")

Shape of X_train: (4179, 7)
Shape of y_train: (4179,)
Shape of X_valid: (1393, 7)
Shape of y_valid: (1393,)


Summary:
* We can see there are 4,179 documents in the train set and 1,393 documents in the validation set.
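
The `stratify=target` argument keeps the ham/spam ratio roughly the same in both splits. The sketch below illustrates the idea with a simplified per-class split in plain Python; it is not scikit-learn's exact algorithm, and the function name and toy labels are made up.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_size=0.25, seed=42):
    """Return (train_idx, valid_idx) with per-class proportions preserved."""
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)

    rng = random.Random(seed)
    train_idx, valid_idx = [], []
    for indices in by_class.values():
        # shuffle within the class, then hold out the same fraction of each class
        rng.shuffle(indices)
        n_valid = round(len(indices) * test_size)
        valid_idx.extend(indices[:n_valid])
        train_idx.extend(indices[n_valid:])
    return sorted(train_idx), sorted(valid_idx)


toy_labels = ["ham"] * 80 + ["spam"] * 20
train_idx, valid_idx = stratified_split(toy_labels)
spam_share = sum(toy_labels[i] == "spam" for i in valid_idx) / len(valid_idx)
print(len(train_idx), len(valid_idx), spam_share)
```

Without stratification, a random split of an imbalanced dataset can leave the validation set with noticeably more or less spam than the training set, which makes the scores harder to compare.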

Build Naive Bayes Model

In [51]:
# Multinomial naive bayes
from sklearn import naive_bayes

# instantiate model - multinomialNB works well with discrete features
model = naive_bayes.MultinomialNB()

In [52]:
# fit model on training data
model.fit(X_train, y_train)

In [53]:
# Make predictions on training data
pred_train = model.predict(X_train)

# Predict on val data
pred_valid = model.predict(X_valid)

In [54]:
## accuracy of model
from sklearn.metrics import accuracy_score

# train accuracy
accuracy_score(y_train, pred_train)

0.9411342426417804

In [55]:
# test or val accuracy
accuracy_score(y_valid, pred_valid)

0.9389806173725772

Classification Report

In [56]:
# import classification report
from sklearn.metrics import classification_report

In [57]:
## print classification report for multinomial naive bayes
target_vals = ['ham', 'spam']
print('Classification Report for Multinomial Naive Bayes Model\n')
## print the classification report
print(classification_report(y_valid, pred_valid, target_names=target_vals))

Classification Report for Multinomial Naive Bayes Model

              precision    recall  f1-score   support

         ham       0.96      0.97      0.96      1206
        spam       0.78      0.76      0.77       187

    accuracy                           0.94      1393
   macro avg       0.87      0.86      0.87      1393
weighted avg       0.94      0.94      0.94      1393



Summary:
* Accuracy was 94.1% on the training data and 93.9% on the validation data.
* The F1 score for spam was 0.77, significantly lower than the 0.96 for ham. This could be attributed to the text preprocessing methods, but we also know the target variable is imbalanced, so going back to upsample or downsample (or use another method such as SMOTE) might help.
* However, in the next example we will show how combining the meta features with TF-IDF vectorization can improve the model's predictions without changing the preprocessing.
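* We should also look further into whether the model is overfitting. Train and validation accuracy are nearly identical here (94.1% vs. 93.9%), so accuracy alone does not show it, but accuracy is dominated by the majority ham class and the weaker spam scores deserve a closer look.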
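
Since the per-class scores carry most of the information here, it helps to see how they are derived. The following sketch computes precision, recall and F1 for one class from two label lists, then recomputes the spam F1 from the rounded precision and recall in the report above. The function name and the toy labels are made up for illustration.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall and F1 for one class from label lists."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# toy labels (1 = spam, 0 = ham), made up for illustration
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 1, 0]
print(precision_recall_f1(y_true, y_pred))

# F1 recomputed from the rounded spam precision/recall in the report above
p, r = 0.78, 0.76
print(round(2 * p * r / (p + r), 2))
```

F1 is the harmonic mean of precision and recall, so it stays low unless both are high, which makes it a more informative number than accuracy for the minority spam class.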

### d. Tf-idf Features
* Creating features using vectorization.

In [58]:
# import tf-idf vectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

# instantiate tfidf vectorizer --> keep only the 500 most frequent terms across the corpus
word_tfidf = TfidfVectorizer(max_features=500)

# fit Tfidf vectorizer
word_tfidf.fit(data['cleaned'].values)


In [59]:
# transform data
word_vectors_tfidf = word_tfidf.transform(data['cleaned'].values)

In [60]:
# print tfidf vectors
word_vectors_tfidf

<5572x500 sparse matrix of type '<class 'numpy.float64'>'
	with 21920 stored elements in Compressed Sparse Row format>

Summary:
* We have 500 terms for all 5,572 documents.
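
To show what the values in this matrix represent, here is a from-scratch sketch of the weighting that `TfidfVectorizer` applies with its default settings: raw term counts multiplied by a smoothed IDF, followed by L2 normalisation of each document vector. The function names and toy corpus are made up, and tokenisation is simplified to whitespace splitting (the default scikit-learn tokenizer also drops single-character tokens such as `u` or `n`).

```python
import math
from collections import Counter


def fit_idf(docs):
    """Smoothed IDF per term: ln((1 + n_docs) / (1 + doc_freq)) + 1."""
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    return {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}


def tfidf_vector(doc, idf):
    """Raw term count times IDF, then L2-normalised; unknown terms are dropped."""
    counts = Counter(t for t in doc if t in idf)
    weights = {t: c * idf[t] for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()} if norm else {}


# toy corpus (made up for illustration)
toy_docs = [
    "free prize call now".split(),
    "call me now".split(),
    "see you now".split(),
]
idf = fit_idf(toy_docs)
vec = tfidf_vector(toy_docs[0], idf)
print({t: round(w, 3) for t, w in vec.items()})
```

Terms that occur in fewer documents (`free`, `prize`) receive larger weights than terms that occur everywhere (`now`). Keeping `fit_idf` separate from `tfidf_vector` also mirrors the `fit`/`transform` split: in this notebook the vectorizer is fitted on all documents before the train/validation split, so a stricter evaluation might fit it on the training documents only.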


Now we combine the tfidf features with the features we created earlier

In [61]:
# combining meta features and Tf-idf features
from scipy.sparse import hstack, csr_matrix

# list of meta features
meta_features = ['word_count', 'word_count_cleaned',
                 'char_count', 'char_count_without_spaces',
                 'num_dig', 'noun_count', 'verb_count']

# meta features
feature_set1 = data[meta_features]

# combined features - train data is tfidf + meta features
# csr - compressed sparse row matrix
train = hstack([word_vectors_tfidf, csr_matrix(feature_set1)], "csr")

### e. Model Building for complete feature set
* Putting it all together.

In [62]:
# train and val datasets - defaulting 75/25 split
X_train, X_valid, y_train, y_valid = train_test_split(train, target, random_state=42, stratify=target)

In [63]:
# train and val dataset shapes
# shape of datasets
print(f"Shape of X_train: {X_train.shape}")
print(f"Shape of y_train: {y_train.shape}")
print(f"Shape of X_valid: {X_valid.shape}")
print(f"Shape of y_valid: {y_valid.shape}")

Shape of X_train: (4179, 507)
Shape of y_train: (4179,)
Shape of X_valid: (1393, 507)
Shape of y_valid: (1393,)


In [64]:
# multinomial naive bayes model
model = naive_bayes.MultinomialNB()

In [65]:
# fit NB model
model.fit(X_train, y_train)

In [66]:
# predict train data
pred_train = model.predict(X_train)

# predict on val data
pred_valid = model.predict(X_valid)

In [67]:
# accuracy on train set
accuracy_score(y_train, pred_train)

0.9660205790859058

In [68]:
# accuracy on validation/test data
accuracy_score(y_valid, pred_valid)

0.964824120603015

In [69]:
## print classification report for multinomial naive bayes
target_vals = ['ham', 'spam']
print('Classification Report for Multinomial Naive Bayes Model with meta features + tfidf features\n')
## print the classification report
print(classification_report(y_valid, pred_valid, target_names=target_vals))

Classification Report for Multinomial Naive Bayes Model with meta features + tfidf features

              precision    recall  f1-score   support

         ham       0.99      0.97      0.98      1206
        spam       0.83      0.93      0.88       187

    accuracy                           0.96      1393
   macro avg       0.91      0.95      0.93      1393
weighted avg       0.97      0.96      0.97      1393



# Summary
* Accuracy on both the train and validation data improved when the meta features were combined with the TF-IDF features.
* The F1 score for spam improved significantly, to 0.88 from 0.77 for the previous model with only the meta features.
* Overall validation accuracy rose to 96.5% from 93.9% for the previous model.
* We may be able to improve this model further by using embeddings instead of TF-IDF vectorization. In that case we may not want to remove the stop words, since sentence transformers use multi-head attention, which takes stop words into account as context. Perhaps another experiment to try.
* We could also consider adding POS tag counts for adjectives and other parts of speech.
* We could also consider lemmatization and expanding abbreviations, although with embeddings that may not be necessary.
* We could also address the significant imbalance in the target variable using sampling techniques, although combining the meta features with TF-IDF already improved the spam F1 score and the overall accuracy.

# References
* Idoko, 2019. "RandomForest Classifier Vs Multinomial Naive Bayes for a multi-output Natural Language classification problem." Retrieved from: https://medium.com/analytics-vidhya/randomforest-classifier-vs-multinomial-naive-bayes-for-a-multi-output-natural-language-2426381a5217
* Ratz, 2021. "Multinomial Naїve Bayes’ For Documents Classification and Natural Language Processing (NLP)". Retrieved from: https://towardsdatascience.com/multinomial-na%C3%AFve-bayes-for-documents-classification-and-natural-language-processing-nlp-e08cc848ce6