# Spam Detection

Explore text message data and build models to predict whether a message is spam.

## Loading data

In [2]:
import pandas as pd
import numpy as np

spam_data = pd.read_csv('spam.csv')

spam_data['target'] = np.where(spam_data['target']=='spam',1,0)
spam_data.head(10)

| | text | target |
|---|---|---|
| 0 | Go until jurong point, crazy.. Available only ... | 0 |
| 1 | Ok lar... Joking wif u oni... | 0 |
| 2 | Free entry in 2 a wkly comp to win FA Cup fina... | 1 |
| 3 | U dun say so early hor... U c already then say... | 0 |
| 4 | Nah I don't think he goes to usf, he lives aro... | 0 |
| 5 | FreeMsg Hey there darling it's been 3 week's n... | 1 |
| 6 | Even my brother is not like to speak with me. ... | 0 |
| 7 | As per your request 'Melle Melle (Oru Minnamin... | 0 |
| 8 | WINNER!! As a valued network customer you have... | 1 |
| 9 | Had your mobile 11 months or more? U R entitle... | 1 |


## Exploring the data
What percentage of the documents in `spam_data` are spam?

In [4]:
np.sum(spam_data.target)*100.0/spam_data.target.count() 

13.406317300789663
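
Because `target` is coded as 0/1, the percentage of spam is the sum of the column divided by the row count, times 100. A minimal sketch on made-up labels (the `labels` list is illustrative and not taken from `spam.csv`):

```python
# Hypothetical 0/1 labels standing in for spam_data['target'] (1 = spam).
labels = [0, 0, 1, 0, 1, 0, 0, 0]

# Sum of the ones divided by the number of rows, scaled to a percentage.
spam_percentage = sum(labels) * 100.0 / len(labels)
print(spam_percentage)  # 25.0
```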

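What is the longest token in the vocabulary of the training split `X_train`? (`X_train` comes from the `train_test_split` call in the next section, so that cell must be run first.)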

In [5]:
from sklearn.feature_extraction.text import CountVectorizer

def longest_token():
    
    # Learn the vocabulary from the training messages
    bag_of_words = CountVectorizer().fit(X_train)
    # get_feature_names() was removed in scikit-learn 1.2
    vocab = bag_of_words.get_feature_names_out()
    
    # max() returns the first token of maximal length
    return max(vocab, key=len)

longest_token()

'com1win150ppmx3age16subscription'
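
A token this long appears because the default tokenizer only splits on non-word characters, so a run of letters and digits with no spaces or punctuation stays in one piece. The sketch below imitates that behaviour with the regular expression that `CountVectorizer` documents as its default `token_pattern`; the `build_vocab` helper and the messages are made up for illustration and are not the library's implementation.

```python
import re

# CountVectorizer's documented default token_pattern:
# runs of two or more word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def build_vocab(docs):
    """Lowercase each document and collect the distinct tokens, sorted."""
    vocab = set()
    for doc in docs:
        vocab.update(TOKEN_PATTERN.findall(doc.lower()))
    return sorted(vocab)

# Made-up messages, not taken from spam.csv.
docs = ["Free entry 2 win", "Call now: www.example.com", "ok c u l8r"]
vocab = build_vocab(docs)
longest = max(vocab, key=len)

print(vocab)
print(longest)  # example
```

Single-character tokens such as `2`, `c` and `u` are dropped by the pattern, which is why they never show up in the vocabulary.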

## Fitting a MultinomialNB classifier using the bag-of-words method

Fit and transform the training data `X_train` using a Count Vectorizer with default parameters.

Next, fit a multinomial Naive Bayes classifier with smoothing `alpha=0.1`, then compute the area under the ROC curve (AUC) on the transformed test data as the evaluation metric.

In [8]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split


X_train, X_test, y_train, y_test = train_test_split(spam_data['text'], 
                                                    spam_data['target'], 
                                                    random_state=0)

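def Bayes_clf():
    
    # Learn the vocabulary on the training split only, then vectorize both splits
    cv = CountVectorizer().fit(X_train)
    X_train_v = cv.transform(X_train)
    X_test_v = cv.transform(X_test)
    
    clf = MultinomialNB(alpha=0.1)
    clf.fit(X_train_v, y_train)
    y_pred = clf.predict(X_test_v)
    
    # AUC is computed here from hard 0/1 predictions, not predicted probabilities
    return roc_auc_score(y_test, y_pred)

Bayes_clf()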

0.97208121827411165
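
The cell above passes the hard 0/1 output of `clf.predict` to `roc_auc_score`. AUC can also be computed from continuous scores such as class probabilities, and the two inputs generally give different values. Here is a minimal pure-Python sketch on made-up numbers; the `auc` helper and the data are illustrative assumptions, not scikit-learn's implementation.

```python
def auc(y_true, scores):
    """AUC as the chance that a random positive outranks a random negative.

    Ties between a positive and a negative count as half a win.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Made-up labels and predicted spam probabilities.
y_true = [0, 0, 0, 1, 1, 1]
probs = [0.1, 0.4, 0.6, 0.45, 0.8, 0.9]

# Thresholding at 0.5 gives the kind of hard labels that predict() returns.
hard = [int(p >= 0.5) for p in probs]

auc_from_probs = auc(y_true, probs)
auc_from_hard = auc(y_true, hard)
print(auc_from_probs)
print(auc_from_hard)
```

With hard labels the value reduces to the average of the true-positive and true-negative rates (here both 2/3), while the probability-based value (8/9 here) reflects the full ranking of the messages.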

## Fitting a MultinomialNB classifier using the tf-idf method

Fit and transform the training data `X_train` using a Tfidf Vectorizer ignoring terms that have a document frequency strictly lower than **3**.

Then fit a multinomial Naive Bayes classifier model with smoothing `alpha=0.1` and compute the area under the curve (AUC) score using the transformed test data.

In [9]:
from sklearn.feature_extraction.text import TfidfVectorizer

def Bayes_clf_tfidf():
    
    # min_df=3 drops terms that appear in fewer than 3 training documents
    tfidf_vec = TfidfVectorizer(min_df=3).fit(X_train)
    X_train_vec = tfidf_vec.transform(X_train)
    X_test_vec = tfidf_vec.transform(X_test)
    
    clf = MultinomialNB(alpha=0.1)
    clf.fit(X_train_vec, y_train)
    y_pred = clf.predict(X_test_vec)
    
    # AUC is computed here from hard 0/1 predictions, not predicted probabilities
    return roc_auc_score(y_test, y_pred)

Bayes_clf_tfidf()

0.94162436548223349
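
To make the tf-idf weighting and the `min_df` cut-off concrete, here is a small hand-rolled sketch. It assumes whitespace tokenization (a simplification) and uses the smoothed idf, `ln((1 + n) / (1 + df)) + 1`, followed by L2 row normalization, which matches the defaults documented for `TfidfVectorizer`. The `tfidf` helper and the messages are made up for illustration.

```python
import math
from collections import Counter

def tfidf(docs, min_df=1):
    """Return (vocab, rows) using smoothed idf and L2-normalized rows."""
    tokenized = [doc.lower().split() for doc in docs]
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Keep only terms that occur in at least min_df documents.
    vocab = sorted(t for t, c in df.items() if c >= min_df)
    n = len(docs)
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1 for t in vocab}
    rows = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights = [tf[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(w * w for w in weights)) or 1.0
        rows.append([w / norm for w in weights])
    return vocab, rows

# Made-up messages, not taken from spam.csv.
docs = ["free prize now", "free call now", "see you now", "free free prize"]
vocab, rows = tfidf(docs, min_df=3)

print(vocab)  # ['free', 'now']
for row in rows:
    print([round(w, 3) for w in row])
```

Only `free` and `now` occur in three or more documents, so every other term is discarded before weighting.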

## Exploring data further

1) What is the average length of documents (number of characters) for not spam and spam documents?

In [12]:
def avg_length():
    
    spam_data['characters'] = spam_data.text.str.len()
    # Average only the numeric column; recent pandas versions reject a mean over the text column
    x = spam_data.groupby('target')['characters'].mean()
    
    # (not spam, spam)
    return (x[0], x[1])

avg_length()

(71.023626943005183, 138.8661311914324)

2) What is the average number of digits per document for not spam and spam documents?

In [14]:
def avg_count_digits():
    
    def count_digits(text):
        # Number of characters in the message that are digits
        return sum(c.isdigit() for c in text)
    
    spam_data['number_of_digits'] = spam_data.text.apply(count_digits)
    # Average only the digit-count column for each class
    x = spam_data.groupby('target')['number_of_digits'].mean()
    
    # (not spam, spam)
    return (x[0], x[1])

avg_count_digits()

(0.29927461139896372, 15.759036144578314)
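
This notebook counts digits with `str.isdigit` here and with `str.isnumeric` in the logistic regression cell further down. The two agree on ordinary digits but can differ on other numeric characters, as this small sketch on a made-up message shows (the helper names are illustrative):

```python
def count_digits(text):
    """Count characters for which str.isdigit() is true."""
    return sum(ch.isdigit() for ch in text)

def count_numeric(text):
    """Count characters for which str.isnumeric() is true."""
    return sum(ch.isnumeric() for ch in text)

# Made-up message; the vulgar fraction is numeric but not a digit.
sample = "Win £1000 now, only ½ price! Call 0800"

print(count_digits(sample))   # 8
print(count_numeric(sample))  # 9
```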

3) What is the average number of non-word characters (anything other than a letter, digit or underscore) per document for not spam and spam documents?

In [16]:
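def avg_nonWordCharacters():
    
    # \W matches anything other than a letter, digit or underscore
    spam_data['nonWord_count'] = spam_data.text.str.count(r'\W')
    # Select nonWord_count explicitly; x.iat[0,0] on the whole frame returns the 'characters' column
    x = spam_data.groupby('target')['nonWord_count'].mean()
    
    # (not spam, spam)
    return (x[0], x[1])

avg_nonWordCharacters()

Note: the output recorded for this cell, `(71.023626943005183, 138.8661311914324)`, is identical to the `avg_length()` result. It came from reading the first aggregated column (`characters`) with `x.iat[0,0]` instead of `nonWord_count`, so it is not the non-word average.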
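
As a quick check of what `\W` counts, the sketch below applies the same pattern with the standard `re` module to a made-up message. Spaces and punctuation are counted, while the underscore is treated as a word character. The `count_non_word` helper is illustrative.

```python
import re

def count_non_word(text):
    """Count characters that are not a letter, digit or underscore."""
    return len(re.findall(r"\W", text))

# Made-up message: 2 '!' + 4 spaces + ':' + ')' = 8 non-word characters.
sample = "WINNER!! Reply to_claim 2day :)"

print(count_non_word(sample))  # 8
```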

## Adding features to the data

The following function appends one or more new feature columns to a sparse feature matrix:

In [19]:
def add_feature(X, feature_to_add):
    """
    Returns sparse feature matrix with added feature.
    feature_to_add can also be a list of features.
    """
    from scipy.sparse import csr_matrix, hstack
    return hstack([X, csr_matrix(feature_to_add).T], 'csr')
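
A small usage sketch may help show the shapes involved. The function is repeated so the example runs on its own; the matrix `X` and the `lengths` and `digits` lists are made-up values.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack

def add_feature(X, feature_to_add):
    """Same helper as above: append feature column(s) to a sparse matrix."""
    return hstack([X, csr_matrix(feature_to_add).T], 'csr')

# Made-up document-term matrix: 3 documents, 2 terms.
X = csr_matrix(np.array([[1, 0], [0, 2], [3, 0]]))
lengths = [10, 20, 30]   # one value per document
digits = [0, 5, 1]

X_one = add_feature(X, lengths)            # a single feature adds one column
X_two = add_feature(X, [lengths, digits])  # a list of features adds one column each

print(X_one.shape)  # (3, 3)
print(X_two.shape)  # (3, 4)
print(X_two.toarray())
```

Each feature is given as a row of per-document values, and the transpose turns it into a column before it is stacked next to the existing matrix.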

## Fitting a logistic regression model with added features

Fit and transform the training data `X_train` using a Count Vectorizer ignoring terms that have a document frequency strictly lower than **5** and using **character n-grams from n=2 to n=5**, built within word boundaries (`analyzer='char_wb'`).

Using this document-term matrix and the following additional features:
* the length of document (number of characters)
* number of digits per document
* number of non-word characters (anything other than a letter, digit or underscore.)

fit a Logistic Regression model with regularization C=100. Then compute the area under the curve (AUC) score using the transformed test data.

Also **find the 10 smallest and 10 largest coefficients from the model** and return them along with the AUC score in a tuple.
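
To give a feel for the character n-gram features, here is a simplified sketch of n-grams taken inside word boundaries, with each word padded by one space on either side. The `char_wb_ngrams` helper is a hypothetical illustration, and scikit-learn's own handling of edge cases, such as words shorter than n, may differ.

```python
def char_wb_ngrams(text, min_n, max_n):
    """Character n-grams taken within each space-padded word (simplified)."""
    ngrams = []
    for word in text.lower().split():
        padded = " " + word + " "
        for n in range(min_n, max_n + 1):
            # Slide a window of width n across the padded word.
            for i in range(len(padded) - n + 1):
                ngrams.append(padded[i:i + n])
    return ngrams

grams = char_wb_ngrams("Go now", 2, 3)
print(grams)
```

The padding is why features such as `' go'` or `' i'` carry a leading space: they mark the start of a word.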

In [20]:
from sklearn.linear_model import LogisticRegression

def log_reg():
    
    # Extra per-document features: length and digit count
    len_train = [len(x) for x in X_train]
    len_test = [len(x) for x in X_test]
    dig_train = [sum(char.isnumeric() for char in x) for x in X_train]
    dig_test = [sum(char.isnumeric() for char in x) for x in X_test]
    
    # Non-word characters (the raw string avoids an invalid-escape warning)
    nan_train = X_train.str.count(r'\W')
    nan_test = X_test.str.count(r'\W')
    
    cv = CountVectorizer(min_df=5, ngram_range=(2, 5), analyzer='char_wb').fit(X_train)
    X_train_cv = cv.transform(X_train)
    X_test_cv = cv.transform(X_test)
    
    X_train_cv = add_feature(X_train_cv, [len_train, dig_train, nan_train])
    X_test_cv = add_feature(X_test_cv, [len_test, dig_test, nan_test])
    
    clf = LogisticRegression(C=100).fit(X_train_cv, y_train)
    pred = clf.predict(X_test_cv)
    
    score = roc_auc_score(y_test, pred)
    
    # Feature names in column order: n-grams first, then the three added features
    # (get_feature_names() was removed in scikit-learn 1.2)
    feature_names = np.array(list(cv.get_feature_names_out()) + ['length_of_doc', 'digit_count', 'non_word_char_count'])
    sorted_coef_index = clf.coef_[0].argsort()
    small_coeffs = list(feature_names[sorted_coef_index[:10]])
    large_coeffs = list(feature_names[sorted_coef_index[:-11:-1]])
    
    return (score, small_coeffs, large_coeffs)

log_reg()

(0.97885931107074342,
 ['. ', '..', '? ', ' i', ' y', ' go', ':)', ' h', 'go', ' m'],
 ['digit_count', 'ne', 'ia', 'co', 'xt', ' ch', 'mob', ' x', 'ww', 'ar'])
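
The slices `[:10]` and `[:-11:-1]` pick the ten smallest coefficients in ascending order and the ten largest in descending order. The same indexing on a toy example, using plain Python in place of `argsort` (the names and coefficient values are made up):

```python
# Made-up feature names and coefficients.
names = ["aa", "bb", "cc", "dd", "ee"]
coefs = [0.3, -1.2, 2.5, 0.0, -0.4]

# Indices that sort the coefficients in ascending order (what argsort returns).
order = sorted(range(len(coefs)), key=lambda i: coefs[i])

# First two indices: the two smallest coefficients, most negative first.
smallest = [names[i] for i in order[:2]]
# Walk backwards from the end: the two largest coefficients, largest first.
largest = [names[i] for i in order[:-3:-1]]

print(smallest)  # ['bb', 'ee']
print(largest)   # ['cc', 'aa']
```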