<h5> <p> This project explores text message data, builds classification models on CountVectorizer and TfidfVectorizer features, and predicts whether a message is spam. </p>

Questions addressed: </h5>

<p> 1) What percentage of the documents in spam_data are spam? </p>
<p> 2) What is the longest token in the vocabulary of the training set? </p>
<p> 3) What is the average length of documents (number of characters) for not spam and spam documents? </p>
<p> 4) What is the average number of digits per document for not spam and spam documents? </p>
<p> 5) What is the average number of non-word characters (anything other than a letter, digit or underscore) per document for not spam and spam documents? </p>

<h4> Import Libraries </h4>

In [1]:
# Data tools
import numpy as np
import pandas as pd

from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

from sklearn.model_selection import cross_val_score
from scipy.sparse import csr_matrix, hstack

import re

import warnings
warnings.filterwarnings("ignore")


<h4> Accessory functions </h4>

In [2]:
def avg_digits_text(data):
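    """
    Purpose:
        - Count the number of digits in each doc in data
        - Compute the average over the docs that contain at least one digit

    Arg:
        data: array of text

    Return:
        the average number of digits per doc, taken over docs with at least
        one digit (docs without digits are skipped, so this is higher than
        the average over all docs)
    """
    digit_counts = []
    num_re = re.compile(r'[0-9]')
    for text in data:
        match = num_re.findall(text)
        # Docs without any digit are left out of the average
        if len(match) > 0:
            digit_counts.append(len(match))

    return np.mean(digit_counts)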


def add_feature(X, feature_to_add):
    """
    Combine new features into the training data
    
    Return:
        sparse feature matrix with added feature.

    Arg:
     - X: sparse feature matrix, for example:
         array([[   0.,    0.,    0., ...,    0.,    0.,   31.],
           [   0.,    0.,    0., ...,    0.,    0.,  130.],
           [   0.,    0.,    0., ...,    0.,    0.,   66.],
           ..., 
           [   0.,    0.,    0., ...,    0.,    0.,  147.],
           [   0.,    0.,    0., ...,    0.,    0.,   62.],
           [   0.,    0.,    0., ...,    0.,    0.,   82.]])
       
     - feature_to_add: list of features, for example:
         [[ 31, 130,  66, ..., 147,  62,  82]]

    """

    return hstack([X, csr_matrix(feature_to_add).T], 'csr')
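
<p> The shape handling in add_feature is easy to get wrong: the extra feature arrives as a single row with one value per document and has to become a column before it is stacked onto the matrix. A minimal dense, pure-Python sketch of the same idea follows. The helper append_column and the toy values are hypothetical and only illustrate the reshaping; they are not part of the notebook or of SciPy. </p>

```python
def append_column(rows, feature_row):
    """Append one value per document as a new last column.

    rows: list of per-document feature lists (dense stand-in for the sparse matrix)
    feature_row: [[v0, v1, ...]], a single row holding one value per document
    """
    values = feature_row[0]
    if len(values) != len(rows):
        raise ValueError("need exactly one feature value per document")
    # Transposing the 1 x n row gives an n x 1 column, stacked on the right
    return [row + [value] for row, value in zip(rows, values)]


X_small = [[0.0, 1.0], [2.0, 0.0], [0.0, 0.0]]  # 3 documents, 2 features
lengths = [[31, 130, 66]]                       # 1 x 3, like reshape(1, -1)
X_small_plus = append_column(X_small, lengths)  # 3 documents, 3 features
```

<p> This mirrors why the later cells call reshape(1, -1) before add_feature: the helper transposes that single row into a column. </p>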


<p> Load spam.csv and print the first 5 lines </p>

In [3]:
spam_data = pd.read_csv('data/spam.csv', sep='\t', 
                        header=None, names=["label", "text"])   


spam_data['label'] = np.where(spam_data['label']=='spam', 1, 0)
                  
X = spam_data['text']
y = spam_data['label']

result = {}  # scores per model

print(spam_data.head())

print(spam_data.info())

   label                                               text
0      0  Go until jurong point, crazy.. Available only ...
1      0                      Ok lar... Joking wif u oni...
2      1  Free entry in 2 a wkly comp to win FA Cup fina...
3      0  U dun say so early hor... U c already then say...
4      0  Nah I don't think he goes to usf, he lives aro...
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5572 entries, 0 to 5571
Data columns (total 2 columns):
label    5572 non-null int32
text     5572 non-null object
dtypes: int32(1), object(1)
memory usage: 65.4+ KB
None


<h4> What percentage of the documents in spam_data are spam? </h4>

In [4]:
(len(spam_data[spam_data['label'] == 1])/len(spam_data))*100

13.406317300789663
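
<p> Because the label is encoded as 0/1, the spam percentage is simply the mean of the label times 100. A small sketch on made-up labels (the list below is illustrative, not taken from spam.csv): </p>

```python
labels = [0, 0, 1, 0, 1, 0, 0, 0]  # toy labels, 1 = spam

# The mean of a 0/1 label is the fraction of positives
spam_pct = 100 * sum(labels) / len(labels)
```

<p> On the real data the same idea would read spam_data['label'].mean() * 100, which should agree with the value computed above. </p>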

<p>Fit the training data using a Count Vectorizer with default parameters.</p>

In [5]:
count_vect = CountVectorizer()
X_vect = count_vect.fit_transform(X)

<h4> What is the longest token in the vocabulary in training set? </h4>

In [6]:
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() there
words_len = {word:len(word) for word in count_vect.get_feature_names()}

sorted_words_len = sorted(words_len.items(), key = lambda item: item[1], reverse=True)

sorted_words_len[0]


('hypotheticalhuagauahahuagahyuhagga', 34)
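
<p> To see where the vocabulary comes from, here is a rough pure-Python approximation of what CountVectorizer does with its default settings: lowercase the text and keep tokens of two or more word characters. The helper build_vocabulary and the toy documents are assumptions for illustration only. </p>

```python
import re

# CountVectorizer's default token_pattern: tokens of 2+ word characters
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def build_vocabulary(docs):
    """Collect the set of lowercase tokens found in docs."""
    vocab = set()
    for doc in docs:
        vocab.update(TOKEN_PATTERN.findall(doc.lower()))
    return vocab


toy_docs = ["Ok lar... Joking wif u oni...", "Free entry in 2 a wkly comp"]
vocab = build_vocabulary(toy_docs)

# Longest token; sorting first makes ties resolve alphabetically
longest = max(sorted(vocab), key=len)
```

<p> Single-character tokens such as "u", "a" and "2" never enter the vocabulary, and max with key=len avoids sorting the whole vocabulary by length. </p>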

<p> Fit a multinomial Naive Bayes classifier with smoothing alpha=0.1 and score it with 10-fold cross-validation </p>

In [7]:
clf = MultinomialNB(alpha=0.1)

scores = cross_val_score(clf, X_vect, y, cv=10)

print(scores)

#result.update({'MNB_vectorizer' : np.max(scores)})

np.mean(scores)

[0.98566308 0.98566308 0.97670251 0.98387097 0.98028674 0.98204668
 0.98204668 0.98561151 0.97661871 0.99100719]


0.9829517147271354
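
<p> The alpha parameter controls additive smoothing of the per-class term probabilities, so a term never seen in a class still gets a small non-zero probability. A minimal sketch of that formula, with a hypothetical helper name and made-up counts: </p>

```python
import math


def smoothed_log_likelihoods(term_counts, alpha=0.1):
    """Return log P(term | class) with additive smoothing.

    term_counts: total count of each vocabulary term within one class
    """
    total = sum(term_counts)
    vocab_size = len(term_counts)
    return [
        math.log((count + alpha) / (total + alpha * vocab_size))
        for count in term_counts
    ]


spam_counts = [3, 0, 1]  # toy counts for a 3-term vocabulary
log_probs = smoothed_log_likelihoods(spam_counts, alpha=0.1)
probs = [math.exp(lp) for lp in log_probs]
```

<p> With alpha=0.1 the unseen second term gets probability 0.1 / 4.3 instead of 0, and the three probabilities still sum to 1. </p>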

<p> Fit and transform the training data using a Tfidf Vectorizer with default parameters </p>

In [8]:
tfidf = TfidfVectorizer()

X_tfidf = tfidf.fit_transform(X)

<p>Which 5 features have the smallest tf-idf and which 5 have the largest? </p>

<p>Put these features into two series, each sorted by tf-idf value and then alphabetically by feature name.
The index of each series should be the feature name, and the data should be the tf-idf.
The series of the 5 smallest tf-idfs should be sorted smallest first; the series of the 5 largest should be sorted largest first.
In the cell below, each feature is scored by the sum of its tf-idf weights over all documents.</p>


In [9]:
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() there
df_X_tfidf = pd.DataFrame(data = X_tfidf.toarray(), columns = tfidf.get_feature_names())

sums = df_X_tfidf.sum(axis=0)

sorted_sums = sums.sort_values(axis=0, ascending=False)

print(sorted_sums.tail(5))

print(sorted_sums.head(5) )

proove        0.074337
praises       0.074337
attraction    0.074337
makiing       0.074337
sorrows       0.074337
dtype: float64
you    246.226975
to     206.863921
the    147.707246
in     122.596296
me     118.430868
dtype: float64
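
<p> The weights summed above follow TfidfVectorizer's defaults: raw term counts, a smoothed idf of ln((1 + n) / (1 + df)) + 1, and L2 normalization of each document row. The sketch below reproduces that computation by hand on a toy corpus; tfidf_rows is a hypothetical helper, not the library implementation. </p>

```python
import math
from collections import Counter


def tfidf_rows(tokenized_docs):
    """Return one {term: weight} dict per document, L2-normalized."""
    n_docs = len(tokenized_docs)
    # Document frequency: in how many documents each term occurs
    df = Counter(term for doc in tokenized_docs for term in set(doc))
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1 for t, d in df.items()}
    rows = []
    for doc in tokenized_docs:
        tf = Counter(doc)
        weights = {t: c * idf[t] for t, c in tf.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({t: w / norm for t, w in weights.items()} if norm else {})
    return rows


docs = [["free", "entry", "free"], ["ok", "entry"], ["ok", "ok"]]
rows = tfidf_rows(docs)
```

<p> Since every row has unit length, a term that occurs in many documents can still accumulate a large sum across rows, which is consistent with common words such as "you" and "to" topping the list above. </p>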


<p> Fit and transform the training data using a Tfidf Vectorizer ignoring terms that have a document frequency strictly lower than 3.</p>

<p> Then fit a multinomial Naive Bayes classifier model with smoothing alpha=0.1 </p>

In [10]:
tfidf = TfidfVectorizer(min_df=3)

X_tfidf = tfidf.fit_transform(X)

clf = MultinomialNB(alpha=0.1)

scores = cross_val_score(clf, X_tfidf, y, cv=10)

print(scores)

#result.update({'MNB_Tfidf' : np.max(scores)})

np.mean(scores)

[0.99462366 0.98387097 0.98566308 0.9874552  0.98387097 0.98204668
 0.98743268 0.98561151 0.97661871 0.98741007]


0.9854603512417957
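
<p> The min_df=3 setting drops every term that appears in fewer than 3 documents. A small sketch of that filter, using a hypothetical helper, toy documents and a threshold of 2 to keep the example short: </p>

```python
from collections import Counter


def filter_by_min_df(tokenized_docs, min_df):
    """Keep terms whose document frequency is at least min_df."""
    # set(doc) makes repeated occurrences within one document count once
    df = Counter(term for doc in tokenized_docs for term in set(doc))
    return {term for term, count in df.items() if count >= min_df}


docs = [["free", "entry"], ["free", "free", "ok"], ["free", "ok"], ["proove"]]
kept = filter_by_min_df(docs, min_df=2)
```

<p> Rare tokens and one-off misspellings such as "proove" are the first to go, which shrinks the vocabulary before the classifier is fitted. </p>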

<h4> TF-IDF (with min_df=3) performs slightly better than the plain bag-of-words counts: mean accuracy 0.9855 vs. 0.9830 </h4>

<h4> What is the average length of documents (number of characters) for not spam and spam documents? </h4>

In [11]:
avg_spam_len = [ len(text) for text in spam_data[spam_data['label'] == 1]['text']]
avg_spam_len = np.mean(avg_spam_len)

avg_ham_len = [ len(text) for text in spam_data[spam_data['label'] == 0]['text']]
avg_ham_len = np.mean(avg_ham_len)

print("({} : {} )".format(avg_spam_len, avg_ham_len))

(138.6706827309237 : 71.48290155440415 )


<p> Add a new feature, the length of each document, to the TF-IDF matrix </p>
<p> Fit and transform the data using a Tfidf Vectorizer ignoring terms that have a document frequency strictly lower than 5. </p>
<p> Fit a Support Vector Classification model with regularization C=10000 and compute the cross-validation score on the combined features </p>

In [12]:
spam_data['len_doc'] = spam_data['text'].apply(lambda text : len(text))

#1D array
len_docs = spam_data['len_doc'].values

len_docs = len_docs.reshape(1, -1)

tfidf = TfidfVectorizer(min_df=5)

X_tfidf = tfidf.fit_transform(X)

X_tfidf = add_feature(X_tfidf, len_docs)

clf = SVC(C=10000)

scores = cross_val_score(clf, X_tfidf, y, cv=10)

#result.update({'SVC_Tfidf' : np.max(scores)})

print(scores)

np.mean(scores)


[0.99462366 0.98028674 0.98207885 0.98924731 0.98207885 0.98025135
 0.98743268 0.98561151 0.98381295 0.99100719]


0.9856431088406625

<h4> What is the average number of digits per document for not spam and spam documents? </h4>

In [13]:
spam_texts = spam_data.loc[spam_data['label']==1, 'text']

ham_texts = spam_data.loc[spam_data['label']==0, 'text']

avg_digits_spam = avg_digits_text(spam_texts)

avg_digits_ham = avg_digits_text(ham_texts)

print("({} : {})".format(avg_digits_spam, avg_digits_ham))

(16.683615819209038 : 1.9509933774834438)
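
<p> Note that avg_digits_text averages only over documents that contain at least one digit, so the figures above are higher than a plain per-document average would be. The toy sketch below (made-up messages, hypothetical helper name) contrasts the two definitions: </p>

```python
import re

DIGIT = re.compile(r"[0-9]")


def digit_counts(texts):
    """Number of digit characters in each text."""
    return [len(DIGIT.findall(text)) for text in texts]


toy_texts = ["call 0800 now", "ok lar", "see u at 5", "no digits here"]
counts = digit_counts(toy_texts)

# Average over every document, including those without digits
mean_all = sum(counts) / len(counts)

# Average over documents with at least one digit (what avg_digits_text does)
nonzero = [c for c in counts if c > 0]
mean_with_digits = sum(nonzero) / len(nonzero)
```

<p> Here the two averages are 1.25 and 2.5. On the real data the gap is likely to be largest for not spam messages, since many of them contain no digits at all. </p>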


<p> Add new feature: number of digits in the document</p>

In [14]:
num_re = re.compile('[0-9]')
spam_data['num_digits'] = [len(num_re.findall(text)) for text in  spam_data['text'] ] 

<p> Fit and transform the training data using a Tfidf Vectorizer ignoring terms that have a document frequency strictly lower than 5 and using word n-grams from n=1 to n=3 (unigrams, bigrams, and trigrams) </p>
<p> Using this document-term matrix and the following additional features: <b>The length of document (number of characters) </b> and <b> number of digits per document </b> </p>
<p> Fit a Logistic Regression model with <b> regularization C=1000000 </b> and compute the cross-validation score on the combined features.</p>

In [15]:
tfidf = TfidfVectorizer(min_df=5, ngram_range=(1,3))

# Transformed train set
X_tfidf = tfidf.fit_transform(X)

# New features added to train set
len_docs = spam_data['len_doc']
num_digits = spam_data['num_digits']

X_tfidf = add_feature(X_tfidf, len_docs.values.reshape(1, -1))
X_tfidf = add_feature(X_tfidf, num_digits.values.reshape(1, -1))

# Fit classifier with train set
clf = LogisticRegression(C=1000000)

scores = cross_val_score(clf, X_tfidf, y, cv=10)

#result.update({'Logistic_Tfidf' : np.max(scores)})

print(scores)

np.mean(scores)


[0.99283154 0.98566308 0.99283154 0.99283154 0.98924731 0.98563734
 0.994614   0.98741007 0.98741007 0.99280576]


0.9901282263700825
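
<p> With ngram_range=(1,3) every run of one, two or three consecutive tokens becomes its own feature. A minimal sketch of that expansion on an already tokenized message (word_ngrams is a hypothetical helper for illustration): </p>

```python
def word_ngrams(tokens, min_n=1, max_n=3):
    """Return all contiguous n-grams of tokens, joined by single spaces."""
    grams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams


grams = word_ngrams(["free", "entry", "to", "win"])
```

<p> Four tokens give 4 unigrams, 3 bigrams and 2 trigrams. Only contiguous runs are produced, so "free entry to" is a feature while "free to" is not. </p>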

<h4>What is the average number of non-word characters (anything other than a letter, digit or underscore) per document for not spam and spam documents?</h4>

In [16]:
# Raw string so that \W is passed to the regex engine unchanged
re_bad_carac = re.compile(r'\W')

spam_data['bad_carac_count'] = [len(re_bad_carac.findall(text)) for text in spam_data['text']]

print("average of number of bad characters in spam message: {}".format(np.mean(spam_data[spam_data['label'] == 1]['bad_carac_count'].values)))

print("average of number of bad characters in ham message: {}".format(np.mean(spam_data[spam_data['label'] == 0]['bad_carac_count'].values)))

spam_data[spam_data['bad_carac_count'] > 0].head()

average of number of bad characters in spam message: 29.104417670682732
average of number of bad characters in ham message: 17.396683937823834


   label                                               text  len_doc  num_digits  bad_carac_count
0      0  Go until jurong point, crazy.. Available only ...      111           0               28
1      0                      Ok lar... Joking wif u oni...       29           0               11
2      1  Free entry in 2 a wkly comp to win FA Cup fina...      155          25               33
3      0  U dun say so early hor... U c already then say...       49           0               16
4      0  Nah I don't think he goes to usf, he lives aro...       61           0               14
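
<p> The pattern \W matches anything that is not a letter, digit or underscore, which means spaces are counted as well as punctuation. A quick check on the second message from the table above: </p>

```python
import re

NON_WORD = re.compile(r"\W")

sample = "Ok lar... Joking wif u oni..."
non_word_chars = NON_WORD.findall(sample)

n_spaces = non_word_chars.count(" ")
n_dots = non_word_chars.count(".")

# Underscores are word characters, so they are not counted
underscore_hits = NON_WORD.findall("a_b")
```

<p> The 5 spaces and 6 dots add up to the count of 11 shown for that message, so a large part of this feature reflects word spacing rather than unusual symbols. </p>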


<p>Add a new feature to spam_data: the number of non-word characters per document (the same column computed in the previous cell)</p>

In [17]:
spam_data['bad_carac_count'] = [len(re_bad_carac.findall(text)) for text in spam_data['text']]


<p>Fit and transform the data using a CountVectorizer ignoring terms that have a document frequency strictly lower than 5 and using character n-grams from n=2 to n=5. To get character n-grams, pass analyzer='char_wb', which creates character n-grams only from text inside word boundaries.</p>
<p>Using this document-term matrix and the following additional features: <b>the length of document (number of characters), number of digits per document, number of non-word characters </b></p>
<p>Fit a Logistic Regression model with regularization C=10000 and compute cross validation score.</p>

In [18]:
vectorizer = CountVectorizer(min_df=5, ngram_range=(2,5), analyzer='char_wb')

# Transformed train set
X_vect = vectorizer.fit_transform(X)

# New features added to train set
len_docs = spam_data['len_doc']
num_digits = spam_data['num_digits']
bad_carac_count = spam_data['bad_carac_count']

X_vect = add_feature(X_vect, len_docs.values.reshape(1, -1))
X_vect = add_feature(X_vect, num_digits.values.reshape(1, -1))
X_vect = add_feature(X_vect, bad_carac_count.values.reshape(1, -1))

# Fit classifier with train set
clf = LogisticRegression(C=10000)

scores = cross_val_score(clf, X_vect, y, cv=10)

print(scores)

np.mean(scores)


[0.99103943 0.98566308 0.99641577 0.99641577 0.99103943 0.98563734
 0.99281867 0.98741007 0.98741007 0.99460432]


0.9908453951496821
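
<p> To make the char_wb analyzer more concrete, the sketch below pads each word with a space on both sides and extracts character n-grams inside each padded word only. It is a simplified approximation with a hypothetical helper name; scikit-learn handles words shorter than n slightly differently. </p>

```python
def char_wb_ngrams(text, min_n=2, max_n=5):
    """Character n-grams that never cross a word boundary (simplified)."""
    grams = []
    for word in text.lower().split():
        padded = " " + word + " "  # the padding marks word start and end
        for n in range(min_n, max_n + 1):
            for start in range(len(padded) - n + 1):
                grams.append(padded[start:start + n])
    return grams


grams = char_wb_ngrams("hi there", min_n=2, max_n=3)
```

<p> The result contains n-grams such as " hi" and "the" but nothing like "i t" that would span two words, which is the difference from the plain 'char' analyzer. </p>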