## Model development 

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
df = pd.read_csv('spam.csv')
# encode the label: 1 = spam, 0 = non-spam
df['target'] = np.where(df['target']=='spam',1,0)
# hand-crafted features: message length, non-word characters and digits
df['msg_len']  = df['text'].str.len()
df['non_word_char'] = df["text"].str.count(r'\W')
df['num_char'] = df["text"].str.count(r'\d')
X_train, X_test, y_train, y_test = train_test_split(
    df[['text','msg_len','non_word_char','num_char']], df['target'],
    random_state=0, train_size=0.7)
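
To make the three hand-crafted columns concrete, the sketch below applies the same `str.len` and `str.count` calls to two made-up messages. The `toy` frame and its contents are purely illustrative and not part of the dataset. Note that `\W` also counts whitespace, and that `num_char` counts digits.

```python
import pandas as pd

# Two invented messages, used only to illustrate the feature definitions
toy = pd.DataFrame({"text": ["WIN 100 now!!", "hi there"]})

toy["msg_len"] = toy["text"].str.len()               # total number of characters
toy["non_word_char"] = toy["text"].str.count(r"\W")  # non-alphanumeric, non-underscore (spaces included)
toy["num_char"] = toy["text"].str.count(r"\d")       # digits

print(toy)
```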

### Impact of different Tokenizing strategies:

Based on the previous text data analysis discussion, let's compare the performance of the `CountVectorizer` and `TfidfVectorizer` vectorizers using the following classifier algorithms:  
[Multinomial Naive Bayes classifier](https://scikit-learn.org/stable/modules/naive_bayes.html)  
[Support Vector classifier](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html)  
[Logistic Regression classifier](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html)

We use the *AUC score* as the evaluation metric.
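
`roc_auc_score` accepts either hard class labels or continuous scores, and the two can give different values because labels collapse the ranking to a single threshold. The toy example below uses invented numbers, not the spam data, to illustrate the difference. With a fitted scikit-learn classifier the scores would typically come from `predict_proba(X)[:, 1]` or `decision_function(X)`.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.2, 0.3, 0.9])  # hypothetical spam scores
labels = (scores >= 0.5).astype(int)     # hard predictions at a 0.5 threshold

auc_from_scores = roc_auc_score(y_true, scores)  # uses the full ranking
auc_from_labels = roc_auc_score(y_true, labels)  # uses a single operating point
print(auc_from_scores, auc_from_labels)
```

Here the scores rank every spam message above every non-spam one, yet the thresholded labels miss one spam message, so the label-based value is lower.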

In [4]:
from sklearn.metrics import roc_auc_score
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression

c1 = MultinomialNB(alpha=0.1)
c2 = SVC(kernel='linear')
c3 = LogisticRegression(C=10,solver='lbfgs')

model = dict([("Bayes",c1),("SVC",c2),("Logit",c3)])

Recall that tokenizing with the default parameters led to a vocabulary of about 7k tokens, but the *document frequency* of most of these tokens is 1, meaning that they appear in only one training example. A set of features much larger than the number of training examples is prone to overfitting.

Let's change the *minimum document frequency* (`min_df`) of the vectorizers and check the training and test performance of the previously defined classifiers.
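
As a small illustration of what `min_df` does, the sketch below fits `CountVectorizer` on an invented three-message corpus (the `docs` list is an assumption made for this example). Tokens that appear in fewer than `min_df` documents are dropped from the vocabulary.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["free prize now", "call me now", "free call"]  # invented mini-corpus

vocab_sizes = {}
for min_df in (1, 2):
    vec = CountVectorizer(min_df=min_df).fit(docs)
    vocab_sizes[min_df] = len(vec.vocabulary_)
    print(min_df, sorted(vec.vocabulary_))
```

With `min_df=2`, the tokens that occur in a single message ("me" and "prize") are removed.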

In [5]:
def vectorizer_performance(v1,v2,classif):
    """Fit model[classif] on the text transformed by each fitted vectorizer
    (v1, then v2) and print the train and test AUC scores for both."""
    auc_train = []
    auc_test  = []
    for vectorizer in [v1,v2]:
        X_train_transf = vectorizer.transform(X_train["text"])
        # fit() refits the shared estimator stored in `model` in place
        clf = model[classif].fit(X_train_transf,y_train)
        # the AUC is computed from hard 0/1 predictions, not from scores
        y_pred_train = clf.predict(X_train_transf)
        auc_train.append(roc_auc_score(y_train,y_pred_train) )
        y_pred = clf.predict(vectorizer.transform(X_test["text"]))
        auc_test.append(roc_auc_score(y_test,y_pred) )
    # first value: v1, second value: v2
    print("*** AUC scores for {} Classifier ***".format(classif))
    print("Train: {:0.2f} , {:0.2f}".format(auc_train[0],auc_train[1]))
    print("Test: {:0.2f} , {:0.2f}".format(auc_test[0],auc_test[1]))

In [6]:
for i in [1,3,6]:
    v1 = CountVectorizer(min_df=i).fit(X_train["text"])
    v2 = TfidfVectorizer(min_df=i).fit(X_train["text"])
    print("######### Vocabulary length for min_df={} -->  {} #########".format(i, len(v1.vocabulary_)))
    %time vectorizer_performance(v1,v2,"Bayes")
    %time vectorizer_performance(v1,v2,"SVC")
    %time vectorizer_performance(v1,v2,"Logit")

######### Vocabulary length for min_df=1 -->  7098 #########
*** AUC scores for Bayes Classifier ***
Train: 0.99 , 0.99
Test: 0.98 , 0.94
CPU times: user 187 ms, sys: 0 ns, total: 187 ms
Wall time: 187 ms
*** AUC scores for SVC Classifier ***
Train: 1.00 , 0.98
Test: 0.94 , 0.96
CPU times: user 1.96 s, sys: 9.18 ms, total: 1.97 s
Wall time: 1.97 s
*** AUC scores for Logit Classifier ***
Train: 1.00 , 0.99
Test: 0.93 , 0.95
CPU times: user 352 ms, sys: 247 µs, total: 353 ms
Wall time: 352 ms
######### Vocabulary length for min_df=3 -->  2199 #########
*** AUC scores for Bayes Classifier ***
Train: 0.98 , 0.97
Test: 0.97 , 0.94
CPU times: user 194 ms, sys: 394 µs, total: 194 ms
Wall time: 194 ms
*** AUC scores for SVC Classifier ***
Train: 1.00 , 0.97
Test: 0.95 , 0.96
CPU times: user 1.46 s, sys: 4.19 ms, total: 1.46 s
Wall time: 1.46 s
*** AUC scores for Logit Classifier ***
Train: 1.00 , 0.98
Test: 0.95 , 0.96
CPU times: user 338 ms, sys: 0 ns, total: 338 ms
Wall time: 337 ms
########

Both vectorizers reach rather similar test AUC scores: `CountVectorizer` does better for the Bayes classifier, while `TfidfVectorizer` is slightly ahead for the SVC and Logit classifiers.  
Increasing `min_df` from 1 to 3 cuts the number of features from 7098 to 2199 without considerably decreasing the performance of any of the 3 classifiers. In addition, the SVC computing time drops from about 2.0 s to about 1.5 s.  
Before conducting a grid search to optimize the parameters and find the best classifier, let's check whether using *character or word n-grams* helps to improve the algorithms' performance:

In [7]:
# tokenize strategy: word n-grams 
for i in [3,4,5]:
    v1 = CountVectorizer(ngram_range=(1,i),min_df=3).fit(X_train["text"])
    v2 = TfidfVectorizer(ngram_range=(1,i),min_df=3).fit(X_train["text"])
    print("######### Vocabulary length for ngram_range=(1,{}) (min_df=3) -->  {} #########".format(i, len(v1.vocabulary_)))
    %time vectorizer_performance(v1,v2,"Bayes")

######### Vocabulary length for ngram_range=(1,3) (min_df=3) -->  7000 #########
*** AUC scores for Bayes Classifier ***
Train: 0.98 , 0.97
Test: 0.96 , 0.95
CPU times: user 402 ms, sys: 3.38 ms, total: 406 ms
Wall time: 405 ms
######### Vocabulary length for ngram_range=(1,4) (min_df=3) -->  8034 #########
*** AUC scores for Bayes Classifier ***
Train: 0.97 , 0.97
Test: 0.94 , 0.94
CPU times: user 474 ms, sys: 162 µs, total: 474 ms
Wall time: 473 ms
######### Vocabulary length for ngram_range=(1,5) (min_df=3) -->  8893 #########
*** AUC scores for Bayes Classifier ***
Train: 0.97 , 0.96
Test: 0.94 , 0.93
CPU times: user 550 ms, sys: 3.85 ms, total: 554 ms
Wall time: 554 ms


In [8]:
# tokenize strategy: character n-grams within word boundaries
for i in [3,4,5]:
    v1 = CountVectorizer(analyzer='char_wb',ngram_range=(2,i),min_df=5).fit(X_train["text"])
    v2 = TfidfVectorizer(analyzer='char_wb',ngram_range=(2,i),min_df=5).fit(X_train["text"])
    print("######### Vocabulary length for ngram_range=(2,{}) (min_df=5) -->  {} #########".format(i, len(v1.vocabulary_)))
    %time vectorizer_performance(v1,v2,"Bayes")

######### Vocabulary length for ngram_range=(2,3) (min_df=5) -->  5059 #########
*** AUC scores for Bayes Classifier ***
Train: 0.97 , 0.97
Test: 0.97 , 0.96
CPU times: user 1.13 s, sys: 3.81 ms, total: 1.13 s
Wall time: 1.13 s
######### Vocabulary length for ngram_range=(2,4) (min_df=5) -->  10654 #########
*** AUC scores for Bayes Classifier ***
Train: 0.98 , 0.97
Test: 0.98 , 0.97
CPU times: user 1.61 s, sys: 0 ns, total: 1.61 s
Wall time: 1.61 s
######### Vocabulary length for ngram_range=(2,5) (min_df=5) -->  15569 #########
*** AUC scores for Bayes Classifier ***
Train: 0.98 , 0.98
Test: 0.98 , 0.97
CPU times: user 1.74 s, sys: 4 ms, total: 1.74 s
Wall time: 1.74 s


Using *word n-grams* does not seem to improve the Bayes classifier's performance and approximately doubles its computing time. However, using **character n-grams within word boundaries** does improve the test AUC (most clearly for `TfidfVectorizer`, from 0.94 to 0.96-0.97), at the cost of a Bayes computing time about three times that of the word n-grams. This last choice should also make the different models more robust to misspelling.
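
The robustness argument can be sketched with the analyzers themselves, without fitting anything. In the example below the n-gram ranges and the misspelled word "freee" are chosen only for illustration. `build_analyzer()` returns the function a vectorizer uses to split a string into tokens.

```python
from sklearn.feature_extraction.text import CountVectorizer

word_analyzer = CountVectorizer(analyzer="word", ngram_range=(1, 2)).build_analyzer()
char_analyzer = CountVectorizer(analyzer="char_wb", ngram_range=(2, 3)).build_analyzer()

print(word_analyzer("call me now"))  # unigrams followed by bigrams
print(char_analyzer("free"))         # the word is padded with a space on each side

# Tokens shared between a word and a misspelled variant of it
shared_words = set(word_analyzer("free")) & set(word_analyzer("freee"))
shared_chars = set(char_analyzer("free")) & set(char_analyzer("freee"))
print(len(shared_words), len(shared_chars))
```

At the word level the two spellings share no token, whereas every character n-gram of "free" also occurs in "freee".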

### Adding extra features (length, non-alphanumeric and numeric characters):

Given the rather good performance of some of the previous classifiers, adding extra features (or even optimizing the regularization parameters) does not seem necessary at first glance. However, recall that the best classifiers so far have a rather large number of features and slow computing times, so faster-training models with equally good performance would be much better.

Let's add the features discussed in the text data analysis:

In [9]:
from scipy.sparse import csr_matrix, hstack
add_sparse_feat = lambda X_sparse_set, new_feat: hstack( [X_sparse_set, csr_matrix(new_feat).T], 'csr')
 
def new_vectorizer_performance(v1,v2,classif):
    """Same as vectorizer_performance, but appends msg_len, non_word_char and
    num_char as three extra columns. Returns the list of fitted classifiers."""
    auc_train = []
    auc_test  = []
    trained_clf = []
    for vectorizer in [v1,v2]:
        # adding new features:
        X_train_transf = add_sparse_feat(vectorizer.transform(X_train["text"]),[ X_train[i].values for i in ["msg_len","non_word_char","num_char"]])
        # model[classif] is one shared estimator that fit() refits in place, so
        # both entries of trained_clf refer to the same object (last fitted on v2)
        clf = model[classif].fit(X_train_transf,y_train)
        trained_clf.append(clf)
        y_pred_train = clf.predict(X_train_transf)
        auc_train.append(roc_auc_score(y_train,y_pred_train) )
        # transforming X_test adding new features
        X_test_transf = add_sparse_feat(vectorizer.transform(X_test["text"]),[ X_test[i].values for i in ["msg_len","non_word_char","num_char"]])
        y_pred = clf.predict(X_test_transf)
        auc_test.append(roc_auc_score(y_test,y_pred) )
    print("*** AUC scores for {} Classifier ***".format(classif))
    print("Train: {:0.2f} , {:0.2f}".format(auc_train[0],auc_train[1]))
    print("Test: {:0.2f} , {:0.2f}".format(auc_test[0],auc_test[1]))
    return trained_clf
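
The stacking step performed by `add_sparse_feat` can be illustrated on a tiny matrix. The sketch below uses invented numbers and a hand-built sparse matrix as a stand-in for the output of `vectorizer.transform`. It shows how a list of per-feature arrays becomes extra columns.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack

# Stand-in for vectorizer.transform(...): 2 messages x 3 tokens
X_tokens = csr_matrix(np.array([[1, 0, 2],
                                [0, 1, 0]]))

# One array per extra feature, each of length n_samples (hypothetical values)
msg_len = np.array([13, 8])
num_char = np.array([3, 0])

# csr_matrix([...]) has shape (n_features, n_samples); .T turns the rows into columns
extra = csr_matrix([msg_len, num_char]).T
X_full = hstack([X_tokens, extra], format="csr")

print(X_full.shape)
print(X_full.toarray())
```

The extra columns are appended after the token columns, in the order in which the arrays are listed.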

In [10]:
# tokenize strategy: character n-grams within word boundaries, max number of features = 500
for i in [3,4,5]:
    v1 = CountVectorizer(analyzer='char_wb',ngram_range=(2,i),min_df=0.01,max_features=500).fit(X_train["text"])
    v2 = TfidfVectorizer(analyzer='char_wb',ngram_range=(2,i),min_df=0.01,max_features=500).fit(X_train["text"])
    print("######### Vocabulary length for ngram_range=(2,{}) (min_df=0.01) -->  {} #########".format(i, len(v1.vocabulary_)))
    %time new_vectorizer_performance(v1,v2,"Bayes")

######### Vocabulary length for ngram_range=(2,3) (min_df=0.01) -->  500 #########
*** AUC scores for Bayes Classifier ***
Train: 0.95 , 0.94
Test: 0.95 , 0.92
CPU times: user 1.12 s, sys: 0 ns, total: 1.12 s
Wall time: 1.12 s
######### Vocabulary length for ngram_range=(2,4) (min_df=0.01) -->  500 #########
*** AUC scores for Bayes Classifier ***
Train: 0.95 , 0.93
Test: 0.94 , 0.92
CPU times: user 1.42 s, sys: 0 ns, total: 1.42 s
Wall time: 1.42 s
######### Vocabulary length for ngram_range=(2,5) (min_df=0.01) -->  500 #########
*** AUC scores for Bayes Classifier ***
Train: 0.95 , 0.93
Test: 0.94 , 0.92
CPU times: user 1.6 s, sys: 8 ms, total: 1.61 s
Wall time: 1.61 s


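With the three extra features added, the Bayes classifier can be trained on a much smaller vocabulary: **the training time decreases after limiting the number of tokenized features to 500** (for example, from 1.61 s to 1.42 s for `ngram_range=(2,4)`), at the cost of a lower test AUC (0.94-0.95 versus 0.97-0.98 for `CountVectorizer`).  
Let's find out which are the most important features of the model using n-grams in different ranges: (3,4), (3,5), (3,6)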

In [11]:
# tokenize strategy: character n-grams within word boundaries, max number of features = 500 + 3 extra
for n in [4,5,6]:
    v1 = CountVectorizer(analyzer='char_wb',ngram_range=(3,n),min_df=0.01,max_features=500).fit(X_train["text"])
    v2 = TfidfVectorizer(analyzer='char_wb',ngram_range=(3,n),min_df=0.01,max_features=500).fit(X_train["text"])
    print("######### ngram_range=(3,{}) - min_df=0.01 - vocab.len = 503 #########".format(n))
    clf_v1, clf_v2 = new_vectorizer_performance(v1,v2,"Bayes")

    for v,clf,name in [(v1,clf_v1,"CountVectorizer"),(v2,clf_v2,"TfidfVectorizer")]:
        # copy the vocabulary so the fitted vectorizer itself is left unchanged
        vv = dict(v.vocabulary_)
        # columns 500-502 hold the extra features, in the order they were appended
        for idx,feat in zip([500,501,502],['msg_len', 'non_wordchar', 'digit']):
            vv[feat] = idx
        extended_vv = dict(zip(vv.values(),vv.keys()))
        # for a binary MultinomialNB, coef_[0] is the log-probability of each feature given the spam class
        idx_sorted = clf.coef_[0].argsort()
        scl= [extended_vv[i] for i in idx_sorted[:10]]
        hcl= [extended_vv[i] for i in idx_sorted[-10:]]
        print("{} features with:".format(name))
        print("highest coefficients:{}".format(hcl))
        print("smallest coefficients:{}".format(scl))

######### ngram_range=(3,4) - min_df=0.01 - vocab.len = 503 #########
*** AUC scores for Bayes Classifier ***
Train: 0.95 , 0.94
Test: 0.95 , 0.92
CountVectorizer features with:
highest coefficients:['ur ', 'cal', 'call', ' ca', ' to ', 'to ', ' to', 'digit', 'non_wordchar', 'msg_len']
smallest coefficients:[' &l', '&lt;', '&lt', '&gt;', '&gt', '#&gt', '#&g', ';#&', 'lt;', 'she']
TfidfVectorizer features with:
highest coefficients:['ur ', 'cal', 'call', ' ca', ' to ', 'to ', ' to', 'digit', 'non_wordchar', 'msg_len']
smallest coefficients:[' &l', '&lt;', '&lt', '&gt;', '&gt', '#&gt', '#&g', ';#&', 'lt;', 'she']
######### ngram_range=(3,5) - min_df=0.01 - vocab.len = 503 #########
*** AUC scores for Bayes Classifier ***
Train: 0.94 , 0.94
Test: 0.95 , 0.93
CountVectorizer features with:
highest coefficients:['ur ', 'cal', 'call', ' ca', ' to ', 'to ', ' to', 'digit', 'non_wordchar', 'msg_len']
smallest coefficients:[' &l', '#&gt', '#&gt;', '&gt', '&gt;', '&gt; ', '&lt', '&lt;', '&lt;#', '
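
For a binary `MultinomialNB`, `coef_` mirrors `feature_log_prob_` for the positive class, so a large value mainly says that a feature is frequent in spam, not necessarily that it separates the two classes. The attribute was also deprecated in scikit-learn 0.24 and later removed. One possible alternative, sketched below on an invented count matrix with hypothetical feature names, ranks features by the difference between the per-class log-probabilities.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Invented counts: rows are messages, columns are three hypothetical features
X = np.array([[3, 0, 1],
              [2, 0, 1],
              [0, 3, 1],
              [0, 2, 1]])
y = np.array([1, 1, 0, 0])  # 1 = spam, 0 = non-spam
feature_names = ["spammy", "hammy", "neutral"]

nb = MultinomialNB(alpha=1.0).fit(X, y)

# log P(feature | spam) - log P(feature | non-spam); rows follow nb.classes_ = [0, 1]
log_odds = nb.feature_log_prob_[1] - nb.feature_log_prob_[0]
order = np.argsort(log_odds)
ranked = [feature_names[i] for i in order]  # most non-spam-like first, most spam-like last
print(ranked)
```

A feature that is equally frequent in both classes, such as "neutral" here, gets a log-odds of zero even though its log-probability in the spam class is not small.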

#### The 3 added features have the highest coefficients in every configuration shown, which suggests they are key for the classifier to distinguish between spam and non-spam messages.
The next section will optimize the model parameters and additional hyper-parameters (regularization, vectorizer parameters, etc.).