# Spam text message classifier
This interactive notebook will explore text message data found on [Kaggle.com](https://www.kaggle.com/team-ai/spam-text-message-classification) and create models to predict whether or not the message is spam. 

### Importing the data
Import the data from `SPAM text message 20170820 - Data.csv`. The file contains a label column, `Category`, which indicates whether the message is spam or ham, and a `Message` column holding the text. Each row corresponds to a text message.

Here we'll import the data into a Pandas dataframe and change the `spam` instances in `Category` to a 1, and replace other instances (the `ham` messages) with 0s. Then we'll compute the percentage of messages that are spam. 

In [2]:
import pandas as pd
import numpy as np

spam_data = pd.read_csv('SPAM text message 20170820 - Data.csv')

spam_data['Category'] = np.where(spam_data['Category']=='spam',1,0)


from sklearn.model_selection import train_test_split


X_train, X_test, y_train, y_test = train_test_split(spam_data['Message'], 
                                                    spam_data['Category'], 
                                                    random_state=0)

spam_data['Category'].mean()*100

13.406317300789663
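As a minimal sketch of the label encoding and the percentage calculation, the same two steps can be run on a tiny hand-made frame. The `toy` DataFrame below is a hypothetical stand-in for the Kaggle CSV, not part of the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the Kaggle CSV.
toy = pd.DataFrame({
    "Category": ["ham", "spam", "ham", "ham"],
    "Message": ["see you at 5", "WIN a FREE prize now", "ok", "call me later"],
})

# Map 'spam' to 1 and every other label to 0, as in the cell above.
toy["Category"] = np.where(toy["Category"] == "spam", 1, 0)

# The mean of a 0/1 column is the fraction of ones, so mean * 100 is a percentage.
spam_percentage = toy["Category"].mean() * 100
print(spam_percentage)
```

Because the classes are imbalanced (about 13% spam in the real data), accuracy alone would be a weak metric, which is one reason to report AUC instead.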

### Vectorization
First, we'll fit the training data X_train using Count Vectorizer with default parameters.
For fun, we print the longest token in the vocabulary.

In [3]:
from sklearn.feature_extraction.text import CountVectorizer

vec = CountVectorizer().fit(X_train)
# get_feature_names_out() requires scikit-learn >= 1.0
max(vec.get_feature_names_out(), key=len)

'hypotheticalhuagauahahuagahyuhagga'
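To make the vocabulary idea concrete, here is a small sketch on three made-up messages (the `docs` list is an assumption for illustration). It reads the tokens from the fitted `vocabulary_` attribute, which maps each token to its column index.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical messages used only for illustration.
docs = ["Free entry now", "see you at home", "free free prizes"]
toy_vec = CountVectorizer().fit(docs)

# vocabulary_ maps each lowercased token to a column index.
tokens = sorted(toy_vec.vocabulary_)
longest = max(tokens, key=len)

# Transforming a message yields per-token counts in those columns.
counts = toy_vec.transform(["free free prizes"]).toarray()[0]
free_count = counts[toy_vec.vocabulary_["free"]]

print(tokens)
print(longest, free_count)
```

With default parameters the vectorizer lowercases the text and keeps only tokens of two or more word characters, so "Free" and "free" share one column.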

### Naive Bayes Classifier
Now we'll fit and transform the training data X_train using a Count Vectorizer with 
default parameters. Next, we'll fit a multinomial Naive Bayes classifier model with smoothing
alpha = 0.1. We compute the area under the curve (AUC) score using the transformed test data.

In [5]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import roc_auc_score

X_train_vectorized = vec.transform(X_train)

model = MultinomialNB(alpha = 0.1)
model.fit(X_train_vectorized,y_train)

predictions = model.predict(vec.transform(X_test))
roc_auc_score(y_test, predictions)

0.9619675138714874
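Note that `roc_auc_score` is given hard 0/1 predictions in the cell above. AUC is usually computed from continuous scores, and the two can differ. The sketch below uses made-up scores (an assumption, not model output) to show the gap; in the notebook, something like `model.predict_proba(vec.transform(X_test))[:, 1]` would supply such scores, and the reported numbers would likely change.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([0, 0, 1, 1])

# Hypothetical spam probabilities; every spam message outranks every ham message.
scores = np.array([0.1, 0.3, 0.4, 0.9])

# Thresholding at 0.5 discards the ranking information.
hard = (scores >= 0.5).astype(int)

auc_from_scores = roc_auc_score(y_true, scores)
auc_from_labels = roc_auc_score(y_true, hard)

print(auc_from_scores, auc_from_labels)
```

The scores rank the classes perfectly, but the thresholded labels lose one spam message, so the label-based AUC is lower.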

### Tfidf vectorization
Now we'll fit and transform the training data X_train using a Tfidf Vectorizer
with default parameters. To get an idea of how the Tfidf Vectorizer differs from the Count Vectorizer, we'll take each feature's maximum tf-idf value over the training messages and print two series: the 20 features with the smallest values and the 20 features with the largest.

In [10]:
from sklearn.feature_extraction.text import TfidfVectorizer

vect = TfidfVectorizer().fit(X_train)

# get_feature_names_out() requires scikit-learn >= 1.0 and returns a NumPy array
feature_names = vect.get_feature_names_out()

X_train_vectorized = vect.transform(X_train)
sorted_tfidf_index = X_train_vectorized.max(0).toarray()[0].argsort()

smallest_tfidfs = pd.Series(X_train_vectorized.max(0).toarray()[0][sorted_tfidf_index[:20]],
index = feature_names[sorted_tfidf_index[:20]])
largest_tfidfs = pd.Series(X_train_vectorized.max(0).toarray()[0][sorted_tfidf_index[:-21:-1]],
index = feature_names[sorted_tfidf_index[:-21:-1]])
print(smallest_tfidfs,'\n',largest_tfidfs)

exterminator    0.075064
psychiatrist    0.075064
psychologist    0.075064
pudunga         0.075064
chef            0.075064
sympathetic     0.075064
healer          0.075064
aaniye          0.075064
venaam          0.075064
companion       0.075064
organizer       0.075064
courageous      0.075064
athletic        0.075064
determined      0.075064
listener        0.075064
pest            0.075064
dependable      0.075064
stylist         0.075064
dasara          0.092161
sankranti       0.092161
dtype: float64 
 ok                1.000000
nite              1.000000
too               1.000000
alrite            1.000000
lei               1.000000
yup               1.000000
error             1.000000
anything          1.000000
146tf150p         1.000000
645               1.000000
anytime           1.000000
home              1.000000
thanx             1.000000
done              1.000000
thank             1.000000
congratulation    1.000000
where             1.000000
beerage           1.0000
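One likely reason so many tokens reach exactly 1.0 is that `TfidfVectorizer` L2-normalizes each row by default, so a message consisting of a single token gives that token a weight of 1. The sketch below checks this on three made-up messages (the `docs` list is an assumption for illustration).

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical messages; the first one contains a single token.
docs = ["ok", "ok see you at home", "free prize waiting at home"]
toy_vect = TfidfVectorizer().fit(docs)
X = toy_vect.transform(docs)

# Column-wise maximum tf-idf, computed as in the cell above.
max_tfidf = X.max(axis=0).toarray()[0]

ok_max = max_tfidf[toy_vect.vocabulary_["ok"]]
prize_max = max_tfidf[toy_vect.vocabulary_["prize"]]
print(ok_max, prize_max)
```

Tokens that only ever appear alongside other tokens, such as "prize" here, share the unit length of their row and therefore stay below 1.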

Now we'll fit and transform the training data X_train using a Tfidf Vectorizer ignoring terms that have a 
document frequency strictly lower than 3.

Then we'll fit a multinomial Naive Bayes classifier model with smoothing alpha=0.1 and compute the area under
the curve (AUC) score using the transformed test data. 

In [11]:
vect = TfidfVectorizer(min_df= 3).fit(X_train)

X_train_vectorized = vect.transform(X_train)
model = MultinomialNB(alpha = 0.1)
model.fit(X_train_vectorized,y_train)

predictions = model.predict(vect.transform(X_test))
roc_auc_score(y_test, predictions)

0.9536401467692859
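To see what `min_df` does to the vocabulary, here is a small sketch on four made-up messages (an assumption for illustration). Only tokens that appear in at least three documents survive.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical messages used only for illustration.
docs = ["free prize now", "free entry now", "call me now", "free call"]

full_vocab = set(TfidfVectorizer().fit(docs).vocabulary_)
pruned_vocab = set(TfidfVectorizer(min_df=3).fit(docs).vocabulary_)

print(sorted(full_vocab))
print(sorted(pruned_vocab))
```

Pruning rare terms shrinks the feature matrix and removes one-off tokens, but as the score above suggests, it can also discard useful signal.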

### Improving our model by adding features
In order to make our TFIDF model better, we will add features. To this end, we define the following function:

In [12]:
def add_feature(X, feature_to_add):
    """
    Returns sparse feature matrix with added feature.
    feature_to_add can also be a list of features.
    """
    from scipy.sparse import csr_matrix, hstack
    return hstack([X, csr_matrix(feature_to_add).T], 'csr')
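A quick way to check what `add_feature` does is to call it on a tiny matrix. The sketch below repeats the function so that it runs on its own; the matrix and feature values are made up for illustration.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack


def add_feature(X, feature_to_add):
    """Return X with the given feature(s) appended as extra columns."""
    return hstack([X, csr_matrix(feature_to_add).T], 'csr')


# Hypothetical 3-document, 2-term matrix.
X = csr_matrix(np.array([[1, 0], [0, 2], [3, 0]]))
lengths = [10, 20, 30]
digits = [1, 2, 3]

# A single feature adds one column.
X_one = add_feature(X, lengths)

# A list of features adds one column per feature.
X_two = add_feature(X, [lengths, digits])

print(X_one.toarray())
print(X_two.shape)
```

Each feature must have one value per document, in the same row order as `X`.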

Now we will start looking for features to add. For instance, we compare the average length of an authentic message with the average length of a spam message:

In [16]:
avg_spam_len = np.mean(list(map(len,spam_data[spam_data['Category']==1]['Message'])))
avg_ham_len = np.mean(list(map(len,spam_data[spam_data['Category']==0]['Message'])))

print(avg_spam_len)
print(avg_ham_len)

137.9892904953146
71.44829015544042
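The same comparison can be written with pandas string methods and a group-by. This is a sketch of an alternative, not the notebook's code; the `toy` frame is a hypothetical stand-in for `spam_data`.

```python
import pandas as pd

# Hypothetical stand-in for spam_data (1 = spam, 0 = ham).
toy = pd.DataFrame({
    "Category": [1, 0, 0, 1],
    "Message": ["WIN a prize now!!", "ok", "see you", "Call 0800 now"],
})

# Character length of each message, averaged within each class.
avg_len = toy["Message"].str.len().groupby(toy["Category"]).mean()

print(avg_len)
```

The result is a Series indexed by class label, so `avg_len.loc[1]` is the spam average and `avg_len.loc[0]` is the ham average.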


So the average spam message is nearly twice as long as the average authentic message. This motivates fitting and transforming the data X_train using a Tfidf Vectorizer, ignoring terms that have
a document frequency strictly lower than 5. Then we'll use this document-term matrix and an additional feature, the length of the document (number of 
characters), to fit a Support Vector Classifier model with regularization C=10000. We compute
the area under the curve (AUC) score using the transformed test data.

In [18]:
from sklearn.svm import SVC

vect = TfidfVectorizer(min_df= 5).fit(X_train)
X_train_vectorized = vect.transform(X_train)

doc_length = np.array(list(map(len, X_train)))

X_train_vectorized = add_feature(X_train_vectorized , doc_length)

model = SVC(C=10000, gamma = 'auto').fit(X_train_vectorized,y_train)

X_test_vectorized = vect.transform(X_test)
test_doc_length = np.array(list(map(len,X_test)))

X_test_vectorized = add_feature(X_test_vectorized, test_doc_length)

predictions = model.predict(X_test_vectorized)
roc_auc_score(y_test, predictions)

0.956562108466082

Next we'll look at the average number of digits per message:

In [19]:
import re

digits_non_spam = spam_data[spam_data['Category']==0]['Message'].apply(lambda x: ''.join(re.findall(r'[0-9]+',x)))
avg_digits_ham = np.mean(list(map(len, digits_non_spam)))

digits_spam = spam_data[spam_data['Category']==1]['Message'].apply(lambda x: ''.join(re.findall(r'[0-9]+',x)))
avg_digits_spam = np.mean(list(map(len, digits_spam)))

print(avg_digits_ham)
print(avg_digits_spam)

0.30528497409326427
15.639892904953147
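The digit count used above joins all digit runs and measures the resulting string. A sketch of an equivalent and somewhat shorter form uses `Series.str.count` with a single-digit pattern; the messages below are made up for illustration.

```python
import re
import pandas as pd

# Hypothetical messages used only for illustration.
msgs = pd.Series(["Call 0800 123 now", "see you at 5", "no digits here"])

# Approach from the cell above: join digit runs, then take the length.
via_findall = msgs.apply(lambda x: len(''.join(re.findall(r'[0-9]+', x))))

# Alternative: count single-digit matches directly.
via_str_count = msgs.str.count(r'[0-9]')

print(via_findall.tolist())
print(via_str_count.tolist())
```

Both approaches count individual digit characters, not numbers, so "0800 123" contributes 7.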


Spam messages contain far more digits on average than authentic messages. With this in mind, we'll fit and transform the training data X_train using a Tfidf Vectorizer, ignoring terms that have
a document frequency strictly lower than 5 and using word n-grams from n=1 to n=3. Using this
document-term matrix and the following additional features: (1) the length of the document (number
of characters) and (2) the number of digits per document, we'll fit a Logistic Regression model with regularization
C=100. Then we'll compute the area under the curve (AUC) score using the transformed test data.

In [23]:
from sklearn.linear_model import LogisticRegression

vect = TfidfVectorizer(min_df= 5, ngram_range = (1,3) ).fit(X_train)
X_train_vectorized = vect.transform(X_train)

doc_length = np.array(list(map(len, X_train)))

X_train_vectorized = add_feature(X_train_vectorized , doc_length)

doc_num_of_digits = np.array(list(map(len, X_train.apply( lambda x: ''.join( re.findall( r'[0-9]+',x ) ) ) ) ) )
X_train_vectorized = add_feature( X_train_vectorized, doc_num_of_digits )

model = LogisticRegression(C=100, solver = 'liblinear')
model.fit(X_train_vectorized, y_train)

X_test_vectorized = vect.transform(X_test)
test_doc_length = np.array(list(map(len,X_test)))
X_test_vectorized = add_feature(X_test_vectorized, test_doc_length)

test_doc_num_of_digits = np.array(list(map(len, X_test.apply( lambda x: ''.join( re.findall( r'[0-9]+',x ) ) ) ) ) )
X_test_vectorized = add_feature(X_test_vectorized, test_doc_num_of_digits)

predictions = model.predict(X_test_vectorized)
roc_auc_score(y_test, predictions)

0.966325845713263
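To illustrate what `ngram_range=(1, 3)` adds to the vocabulary, the sketch below fits a vectorizer on a single made-up message and lists the resulting terms.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# One hypothetical message is enough to show the generated n-grams.
toy_vect = TfidfVectorizer(ngram_range=(1, 3)).fit(["free entry now"])

# Unigrams, bigrams and trigrams all become separate features.
terms = sorted(toy_vect.vocabulary_)
print(terms)
```

Three words yield six features here, so word n-grams can grow the vocabulary quickly; the `min_df=5` setting above helps keep that growth in check.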

Next we'll look at the average number of non-word characters (anything other than a letter, digit or 
underscore) per document for ham and spam documents.

In [25]:
non_word_chars_non_spam = spam_data[spam_data['Category']==0]['Message'].apply(lambda x: ''.join(re.findall(r'\W+',x)))
ham_non_word_avg = np.mean(list(map(len, non_word_chars_non_spam)))

non_word_chars_spam = spam_data[spam_data['Category']==1]['Message'].apply(lambda x: ''.join(re.findall(r'\W+',x)))
spam_non_word_avg = np.mean(list(map(len, non_word_chars_spam)))

print(ham_non_word_avg)
print(spam_non_word_avg)

17.362072538860104
28.9384203480589
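It is worth checking what `\W` actually matches. The sketch below applies it to one made-up message: spaces and punctuation count as non-word characters, while the underscore does not.

```python
import re

# Hypothetical message containing an underscore, an emoticon and punctuation.
msg = "Hi_there :) call 0800!!"

# \W matches any character that is not a letter, digit or underscore.
non_word = re.findall(r'\W', msg)

print(non_word)
print(len(non_word))
```

Since whitespace is included, this feature partly tracks the number of words in a message as well as its punctuation.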


Now we'll put all these features together. That is, we'll fit and transform the training data X_train using a Count Vectorizer, ignoring terms that 
have a document frequency strictly lower than 5 and using character n-grams from n=2 to n=5.

Note: passing analyzer='char_wb' to the Count Vectorizer will create
character n-grams only from text inside word boundaries. This should make the model more robust to spelling mistakes. Using this document-term matrix and the following additional
features: (1) the length of the document (number of characters), (2) the number of digits per document and
(3) the number of non-word characters (anything other than a letter, digit or underscore), we'll fit
a Logistic Regression model with regularization C=100. Then we'll compute the area under the curve
(AUC) score using the transformed test data.

In [27]:
vect = CountVectorizer(min_df= 5, ngram_range = (2,5), analyzer = 'char_wb' ).fit(X_train)
X_train_vectorized = vect.transform(X_train)

doc_length = np.array(list(map(len, X_train)))

X_train_vectorized = add_feature(X_train_vectorized , doc_length)

doc_num_of_digits = np.array(list(map(len, X_train.apply( lambda x: ''.join( re.findall( r'[0-9]+',x ) ) ) ) ) )
X_train_vectorized = add_feature( X_train_vectorized, doc_num_of_digits )

doc_non_word_chars = np.array(list(map(len, X_train.apply( lambda x: ''.join(re.findall(r'\W+',x)) ))))
X_train_vectorized = add_feature( X_train_vectorized, doc_non_word_chars )

model = LogisticRegression(C=100, solver = 'liblinear')
model.fit(X_train_vectorized, y_train)

X_test_vectorized = vect.transform(X_test)
test_doc_length = np.array(list(map(len,X_test)))
X_test_vectorized = add_feature(X_test_vectorized, test_doc_length)


test_doc_num_of_digits = np.array(list(map(len, X_test.apply( lambda x: ''.join( re.findall( r'[0-9]+',x ) ) ) ) ) )
X_test_vectorized = add_feature(X_test_vectorized, test_doc_num_of_digits)

test_doc_non_word_chars = np.array(list(map(len, X_test.apply( lambda x: ''.join(re.findall(r'\W+',x)) ))))
X_test_vectorized = add_feature( X_test_vectorized, test_doc_non_word_chars )

predictions = model.predict(X_test_vectorized)
roc_auc_score(y_test, predictions)

0.9721451584034366
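To show what `analyzer='char_wb'` produces, the sketch below compares it with the plain `'char'` analyzer on a two-word made-up message, using a smaller `ngram_range` than the notebook to keep the output short.

```python
from sklearn.feature_extraction.text import CountVectorizer

text = ["go home"]  # hypothetical message

# 'char_wb' pads each word with spaces and never crosses a word boundary.
wb_vect = CountVectorizer(analyzer='char_wb', ngram_range=(2, 3)).fit(text)
wb_ngrams = sorted(wb_vect.vocabulary_)

# 'char' slides over the raw string, so n-grams can span two words.
char_vect = CountVectorizer(analyzer='char', ngram_range=(2, 3)).fit(text)
char_ngrams = sorted(char_vect.vocabulary_)

print(wb_ngrams)
print(char_ngrams)
```

The padded n-grams such as `' go'` explain why features with leading or trailing spaces appear among the model's features.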

All these new features paid off! It would also be useful to see which features matter most to this classifier. Since we trained a Logistic Regression model, we can do this by inspecting the classifier's coefficients. We'll look at the ten smallest and ten largest weights to see which features the trained model relies on most.

In [28]:
# get the feature names as numpy array
# get_feature_names_out() requires scikit-learn >= 1.0
feature_names = vect.get_feature_names_out()
feature_names = np.append(feature_names,['length_of_doc', 'digit_count', 'non_word_char_count'])

# Sort the coefficients from the model
sorted_coef_index = model.coef_[0].argsort()


# Find the 10 smallest and 10 largest coefficients
# The 10 largest coefficients are being indexed using [:-11:-1] 
# so the list returned is in order of largest to smallest
smallest_coefs = pd.Series(model.coef_[0][sorted_coef_index[:10]],
index = feature_names[sorted_coef_index[:10]])
largest_coefs = pd.Series(model.coef_[0][sorted_coef_index[:-11:-1]],
index = feature_names[sorted_coef_index[:-11:-1]])

print(smallest_coefs)
print(largest_coefs)

..    -1.039249
 m    -0.754072
 y    -0.703007
?     -0.695312
.     -0.690813
 i    -0.625795
go    -0.607371
:)    -0.600318
h     -0.569786
 go   -0.547986
dtype: float64
digit_count    1.188104
r!             1.016638
 #             1.016613
r!             0.970923
erro           0.912133
error          0.912133
rror           0.906075
ror            0.906075
or             0.851032
err            0.811515
dtype: float64


Interestingly, among the smallest (most negative) weights, whose features push a prediction towards ham, we see some familiar text message features, e.g. a smiley emoticon, part of an ellipsis, 'go' or 'i'. Among the largest weights, whose features push a prediction towards spam, we see 'error', some symbols, and the most important feature: the number of digits contained in the message.
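The `argsort` slicing used to pick the extreme coefficients can be checked on a small example. The names and weights below are made up for illustration and are not taken from the trained model.

```python
import numpy as np
import pandas as pd

# Hypothetical feature names and coefficients.
feature_names = np.array(["a", "b", "c", "d", "e"])
coef = np.array([0.5, -1.2, 2.0, 0.0, -0.3])

# Indices that sort the coefficients in ascending order.
order = coef.argsort()

# The first k indices give the most negative weights.
smallest = pd.Series(coef[order[:2]], index=feature_names[order[:2]])

# [:-3:-1] walks backwards from the end, giving the 2 largest, biggest first.
largest = pd.Series(coef[order[:-3:-1]], index=feature_names[order[:-3:-1]])

print(smallest)
print(largest)
```

Keep in mind that raw coefficient size depends on feature scale, so a count feature such as the digit count and an n-gram count are only roughly comparable.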