# Spam detector
We'll build a model that classifies a given SMS message as spam or ham (a legitimate message).

Dataset source: http://dcomp.sor.ufscar.br/talmeida/smspamcollection/

- Almeida, T.A., Gómez Hidalgo, J.M., Yamakami, A. Contributions to the Study of SMS Spam Filtering: New Collection and Results. Proceedings of the 2011 ACM Symposium on Document Engineering (DOCENG'11), Mountain View, CA, USA, 2011. [preprint]

- Gómez Hidalgo, J.M., Almeida, T.A., Yamakami, A. On the Validity of a New SMS Spam Collection. Proceedings of the 11th IEEE International Conference on Machine Learning and Applications (ICMLA'12), Boca Raton, FL, USA, 2012. [preprint]

- Almeida, T.A., Gómez Hidalgo, J.M., Silva, T.P. Towards SMS Spam Filtering: Results under a New Dataset. International Journal of Information Security Science (IJISS), 2(1), 1-18, 2013. [Invited paper - full version]

In [48]:
import pandas as pd
from sklearn.svm import LinearSVC
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score

### Data Processing

In [23]:
# Load the dataset
spam_ham = pd.read_csv("SMSSpamCollection.tsv", sep="\t", names=['label', 'message'])
spam_ham.head()

|   | label | message |
|---|-------|---------|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


In [34]:
len(spam_ham)

5572

In [24]:
# Check whether the DataFrame contains any null values
spam_ham.isnull().sum()

label      0
message    0
dtype: int64

In [26]:
spam_ham['label'].value_counts()

ham     4825
spam     747
Name: label, dtype: int64
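
The counts above show that the two classes are far from balanced. As a quick sketch that uses only the numbers printed above (it does not need the dataset), the class shares can be worked out like this:

```python
# Class counts copied from the value_counts() output above
counts = {"ham": 4825, "spam": 747}
total = sum(counts.values())  # 5572 messages, matching len(spam_ham)

# Share of each class in the whole dataset
proportions = {label: n / total for label, n in counts.items()}

for label, share in proportions.items():
    print(f"{label}: {share:.1%}")
```

Roughly one message in seven is spam, which is worth keeping in mind when reading the accuracy figure later on.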

In [32]:
# Split the data into train (70%) and test (30%) sets
X = spam_ham['message']
y = spam_ham['label']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=44)

In [33]:
X_train.shape

(3900,)

### Model

In [57]:
# Build a pipeline that converts the text to TF-IDF features and trains a linear SVM on them
text_clf = Pipeline([('tfidf', TfidfVectorizer()),
                     ('clf', LinearSVC())])
text_clf.fit(X_train, y_train)

Pipeline(memory=None,
     steps=[('tfidf', TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.float64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True,...ax_iter=1000,
     multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
     verbose=0))])
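
To give an idea of what the `tfidf` step feeds into the classifier, here is a minimal pure-Python sketch of TF-IDF weighting. It assumes the settings visible in the output above (`lowercase=True`, `smooth_idf=True`, `norm='l2'`) together with raw term counts and the smoothed formula idf = ln((1 + n) / (1 + df)) + 1. The three-message corpus and the helper names `tokenize` and `tfidf` are made up for illustration; this is not the library's implementation.

```python
import math
import re
from collections import Counter

# Toy corpus, invented for illustration (not taken from the dataset)
docs = [
    "Free prize, claim your free entry now",
    "Are you free for lunch now",
    "Claim your prize today",
]

def tokenize(text):
    # Lowercase and keep tokens of two or more word characters
    return re.findall(r"\b\w\w+\b", text.lower())

tokenized = [tokenize(d) for d in docs]
n_docs = len(docs)

# Document frequency: in how many documents each term appears
df = Counter(term for tokens in tokenized for term in set(tokens))

def tfidf(tokens):
    tf = Counter(tokens)  # raw term counts
    # Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1
    weights = {t: c * (math.log((1 + n_docs) / (1 + df[t])) + 1)
               for t, c in tf.items()}
    # L2-normalise so every message vector has unit length
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()}

vectors = [tfidf(tokens) for tokens in tokenized]

# Terms of the first message, heaviest first
for term, weight in sorted(vectors[0].items(), key=lambda kv: -kv[1]):
    print(f"{term:>6}  {weight:.3f}")
```

In the first message, "free" gets the largest weight because it occurs twice, and "entry" outranks the remaining words because it appears in no other message.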

In [58]:
# time to make predictions!
predictions = text_clf.predict(X_test)

### Metrics

In [59]:
# confusion matrix
print(confusion_matrix(y_test,predictions))

[[1441    4]
 [  22  205]]
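
To make the matrix easier to read, the sketch below recomputes the spam-class metrics from the four numbers printed above. It relies on scikit-learn's convention that rows are true labels and columns are predicted labels, with labels in sorted order (`ham`, then `spam`); the variable names are only illustrative.

```python
# Confusion matrix copied from the output above.
# Rows: true label (ham, spam). Columns: predicted label (ham, spam).
matrix = [[1441, 4],
          [22, 205]]

tn, fp = matrix[0]  # true ham:  predicted ham, predicted spam
fn, tp = matrix[1]  # true spam: predicted ham, predicted spam

precision = tp / (tp + fp)  # of the messages flagged as spam, how many were spam
recall = tp / (tp + fn)     # of the actual spam messages, how many were caught
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"spam precision: {precision:.2f}")
print(f"spam recall:    {recall:.2f}")
print(f"spam f1-score:  {f1:.2f}")
print(f"accuracy:       {accuracy:.4f}")
```

Only 4 ham messages were flagged as spam, while 22 spam messages slipped through as ham, which is why spam recall is lower than spam precision.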


In [60]:
# classification report
print(classification_report(y_test,predictions))

              precision    recall  f1-score   support

         ham       0.98      1.00      0.99      1445
        spam       0.98      0.90      0.94       227

   micro avg       0.98      0.98      0.98      1672
   macro avg       0.98      0.95      0.97      1672
weighted avg       0.98      0.98      0.98      1672
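
The macro and weighted averages in the report differ most for recall (0.95 versus 0.98). As a small sketch, assuming the confusion-matrix counts printed earlier, the two averages can be reproduced by hand:

```python
# Counts taken from the confusion matrix: true label -> predicted label -> count
matrix = {
    "ham": {"ham": 1441, "spam": 4},
    "spam": {"ham": 22, "spam": 205},
}

# Support is the number of true examples of each class in the test set
support = {label: sum(row.values()) for label, row in matrix.items()}
# Recall is the fraction of each class that was predicted correctly
recall = {label: row[label] / support[label] for label, row in matrix.items()}

total = sum(support.values())

# Macro average: every class counts equally
macro_recall = sum(recall.values()) / len(recall)
# Weighted average: every class counts in proportion to its support
weighted_recall = sum(recall[c] * support[c] for c in recall) / total

print(f"macro recall:    {macro_recall:.2f}")
print(f"weighted recall: {weighted_recall:.2f}")
```

The macro average gives the small spam class the same weight as ham, so it reflects the weaker spam recall more strongly than the weighted average does.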



In [61]:
# accuracy
print(accuracy_score(y_test,predictions))

0.984449760766
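
Because most messages are ham, accuracy is easier to judge next to a trivial baseline. The sketch below is an added comparison, not part of the original analysis: it assumes a hypothetical classifier that labels every test message as ham and uses the support values from the classification report.

```python
# Test-set class sizes from the classification report
support = {"ham": 1445, "spam": 227}

# Accuracy of the trained pipeline, as printed above
model_accuracy = 0.984449760766

# A classifier that always predicts the majority class gets
# every ham message right and every spam message wrong
baseline_accuracy = max(support.values()) / sum(support.values())

print(f"majority-class baseline: {baseline_accuracy:.4f}")
print(f"trained pipeline:        {model_accuracy:.4f}")
```

The baseline already reaches about 86% accuracy, so the per-class precision and recall above say more about the model than the accuracy figure alone.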


### Some tests

In [65]:
print(text_clf.predict(["Hello Peter, can you send me your report?"]))

['ham']


In [66]:
print(text_clf.predict(["Congratulations! You earned a prize of U$100. Please send your password to us."]))

['spam']
