The goal of this project is to implement an efficient, general-purpose spam classifier. For a spam classifier, the most important metrics are overall accuracy and the number of false positives (legitimate messages flagged as spam). For testing, I use email and SMS spam datasets.
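
As a minimal sketch of the two metrics named above, the snippet below computes accuracy and the false-positive count with scikit-learn. The labels are invented for illustration and are not taken from the project's datasets; "spam" is assumed to be the positive class.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Toy labels, invented purely for illustration.
y_true = ["ham", "ham", "ham", "spam", "spam", "ham"]
y_pred = ["ham", "spam", "ham", "spam", "ham", "ham"]

accuracy = accuracy_score(y_true, y_pred)

# With labels=["ham", "spam"], rows are true classes and columns are predictions,
# so ravel() yields (tn, fp, fn, tp) with "spam" treated as the positive class.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=["ham", "spam"]).ravel()

print("accuracy:", accuracy)
print("false positives (ham flagged as spam):", fp)
```

A false positive here is the costly error: a legitimate message that the filter would hide from the user.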

In [1]:
from sklearn import svm, naive_bayes
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score
import pandas as pd

In [2]:
# Load the three datasets (spam.csv is read with latin1 encoding).
data = pd.read_csv("all_data.csv")
data1 = pd.read_csv("spam.csv", encoding='latin1')
data2 = pd.read_csv("spam_ham_dataset.csv")
# Hold out 20% of each dataset for testing.
X_train, X_test, y_train, y_test = train_test_split(data["text"], data["label"], test_size=0.2, random_state=0)
X_train1, X_test1, y_train1, y_test1 = train_test_split(data1["v2"], data1["v1"], test_size=0.2, random_state=0)
X_train2, X_test2, y_train2, y_test2 = train_test_split(data2["text"], data2["label"], test_size=0.2, random_state=0)
# Fit one TF-IDF vocabulary on the combined training text so that
# all datasets share the same feature space.
X_all = pd.concat([X_train, X_train1, X_train2])
Y_all = pd.concat([y_train, y_train1, y_train2])
cv = TfidfVectorizer()
X_all = cv.fit_transform(X_all)
# Transform every split with the shared, already fitted vectorizer.
X_train = cv.transform(X_train)
X_test = cv.transform(X_test)
X_train1 = cv.transform(X_train1)
X_test1 = cv.transform(X_test1)
X_train2 = cv.transform(X_train2)
X_test2 = cv.transform(X_test2)

All three datasets consist of email or SMS messages labeled as ham or spam.

In [3]:
model = naive_bayes.MultinomialNB()
model.fit(X_train, y_train)
model.partial_fit(X_train1,y_train1)
model.partial_fit(X_train2,y_train2)

MultinomialNB()

In [4]:
y_predict = model.predict(X_test)
y_predict1 = model.predict(X_test1)
y_predict2 = model.predict(X_test2)

In [5]:
# Accuracy per test set: count predictions that match the true label.
# Series.iteritems() was removed in pandas 2.0, so compare arrays directly.
for name, y_true, y_pred in [("Dataset1", y_test, y_predict),
                             ("Dataset2", y_test1, y_predict1),
                             ("Dataset3", y_test2, y_predict2)]:
    score = int((y_pred == y_true.to_numpy()).sum())
    idx = len(y_true)
    print(name + " correct guesses:", score, "/", idx, ":", (score/idx)*100, "%")

Dataset1 correct guesses: 6630 / 6744 : 98.30960854092527 %
Dataset2 correct guesses: 975 / 1115 : 87.4439461883408 %
Dataset3 correct guesses: 1014 / 1035 : 97.97101449275362 %


In [6]:
f1 = f1_score(y_test, y_predict, average='macro')  
print('Ds1 f1 macro = %.2f' %(f1))
f1 = f1_score(y_test, y_predict, average='micro')  
print('Ds1 f1 micro = %.2f' %(f1))
f1 = f1_score(y_test, y_predict, average='weighted')  
print('Ds1 f1 weighted = %.2f' %(f1))
f1 = f1_score(y_test1, y_predict1, average='macro')  
print('Ds2 f1 macro = %.2f' %(f1))
f1 = f1_score(y_test1, y_predict1, average='micro')  
print('Ds2 f1 micro = %.2f' %(f1))
f1 = f1_score(y_test1, y_predict1, average='weighted')  
print('Ds2 f1 weighted = %.2f' %(f1))
f1 = f1_score(y_test2, y_predict2, average='macro')  
print('Ds3 f1 macro = %.2f' %(f1))
f1 = f1_score(y_test2, y_predict2, average='micro')  
print('Ds3 f1 micro = %.2f' %(f1))
f1 = f1_score(y_test2, y_predict2, average='weighted')  
print('Ds3 f1 weighted = %.2f' %(f1))

Ds1 f1 macro = 0.98
Ds1 f1 micro = 0.98
Ds1 f1 weighted = 0.98
Ds2 f1 macro = 0.79
Ds2 f1 micro = 0.87
Ds2 f1 weighted = 0.88
Ds3 f1 macro = 0.98
Ds3 f1 micro = 0.98
Ds3 f1 weighted = 0.98


In [7]:
# False positives per test set: wrong predictions where the model said "spam".
# The percentage is relative to all test messages, not only to ham messages.
for name, y_true, y_pred in [("Ds1", y_test, y_predict),
                             ("Ds2", y_test1, y_predict1),
                             ("Ds3", y_test2, y_predict2)]:
    fp = int(((y_pred != y_true.to_numpy()) & (y_pred == "spam")).sum())
    idx = len(y_true)
    print(name + " False positive: ", fp, ";", (fp/idx)*100, "%")

Ds1 False positive:  12 ; 0.1779359430604982 %
Ds2 False positive:  114 ; 10.22421524663677 %
Ds3 False positive:  8 ; 0.7729468599033816 %


In [8]:
C = 1.0
svc = svm.SVC(kernel='linear', C=C)
svc.fit(X_all,Y_all)

SVC(kernel='linear')

In [9]:
y_predict = svc.predict(X_test)
y_predict1 = svc.predict(X_test1)
y_predict2 = svc.predict(X_test2)

In [10]:
# Accuracy per test set for the SVM predictions.
for name, y_true, y_pred in [("Dataset1", y_test, y_predict),
                             ("Dataset2", y_test1, y_predict1),
                             ("Dataset3", y_test2, y_predict2)]:
    score = int((y_pred == y_true.to_numpy()).sum())
    idx = len(y_true)
    print(name + " correct guesses:", score, "/", idx, ":", (score/idx)*100, "%")

Dataset1 correct guesses: 6679 / 6744 : 99.0361803084223 %
Dataset2 correct guesses: 1067 / 1115 : 95.69506726457399 %
Dataset3 correct guesses: 1030 / 1035 : 99.51690821256038 %


In [11]:
f1 = f1_score(y_test, y_predict, average='macro')  
print('Ds1 f1 macro = %.2f' %(f1))
f1 = f1_score(y_test, y_predict, average='micro')  
print('Ds1 f1 micro = %.2f' %(f1))
f1 = f1_score(y_test, y_predict, average='weighted')  
print('Ds1 f1 weighted = %.2f' %(f1))
f1 = f1_score(y_test1, y_predict1, average='macro')  
print('Ds2 f1 macro = %.2f' %(f1))
f1 = f1_score(y_test1, y_predict1, average='micro')  
print('Ds2 f1 micro = %.2f' %(f1))
f1 = f1_score(y_test1, y_predict1, average='weighted')  
print('Ds2 f1 weighted = %.2f' %(f1))
f1 = f1_score(y_test2, y_predict2, average='macro')  
print('Ds3 f1 macro = %.2f' %(f1))
f1 = f1_score(y_test2, y_predict2, average='micro')  
print('Ds3 f1 micro = %.2f' %(f1))
f1 = f1_score(y_test2, y_predict2, average='weighted')  
print('Ds3 f1 weighted = %.2f' %(f1))

Ds1 f1 macro = 0.99
Ds1 f1 micro = 0.99
Ds1 f1 weighted = 0.99
Ds2 f1 macro = 0.92
Ds2 f1 micro = 0.96
Ds2 f1 weighted = 0.96
Ds3 f1 macro = 0.99
Ds3 f1 micro = 1.00
Ds3 f1 weighted = 1.00


In [12]:
# False positives per test set for the SVM: wrong predictions where the model said "spam".
for name, y_true, y_pred in [("Ds1", y_test, y_predict),
                             ("Ds2", y_test1, y_predict1),
                             ("Ds3", y_test2, y_predict2)]:
    fp = int(((y_pred != y_true.to_numpy()) & (y_pred == "spam")).sum())
    idx = len(y_true)
    print(name + " False positive: ", fp, ";", (fp/idx)*100, "%")

Ds1 False positive:  30 ; 0.4448398576512456 %
Ds2 False positive:  28 ; 2.5112107623318383 %
Ds3 False positive:  5 ; 0.4830917874396135 %


Based on these tests, the naive Bayes classifier performs well on Datasets 1 and 3, but noticeably worse on Dataset 2, whose messages are of a different kind. The support vector machine reaches higher accuracy on all three datasets, but it takes much longer to train. However, the Bayes model can be evaluated with different decision thresholds to improve its performance.
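
The training-time difference mentioned above can be measured with a small timing harness. This is only a sketch: the corpus is invented and tiny, `fit_seconds` is a hypothetical helper, and the absolute numbers will not resemble those of the real datasets.

```python
import time
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC

# Invented corpus, repeated so that fitting takes a measurable amount of time.
texts = ["win a free prize now", "are we meeting at noon",
         "free cash click now", "see you at lunch"] * 250
labels = ["spam", "ham", "spam", "ham"] * 250
X = TfidfVectorizer().fit_transform(texts)

def fit_seconds(estimator):
    """Return the wall-clock time in seconds needed to fit the estimator on X."""
    start = time.perf_counter()
    estimator.fit(X, labels)
    return time.perf_counter() - start

timings = {
    "MultinomialNB": fit_seconds(MultinomialNB()),
    "SVC(kernel='linear')": fit_seconds(SVC(kernel="linear", C=1.0)),
}
for name, seconds in timings.items():
    print(f"{name}: {seconds:.4f} s")
```

Running the same harness on `X_all` and `Y_all` would give the comparison for the actual training data.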

Dataset 2 has the highest share of false positives, so for this dataset it is worth testing which threshold gives the fewest false positives and how that affects overall accuracy.
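
The idea of raising the spam cutoff can be sketched as follows. The corpus and the helper `false_positives` are invented for illustration; the sketch looks up the "spam" column through `classes_` rather than assuming it is column 1.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Tiny invented corpus, only to make the sketch runnable.
texts = ["win a free prize now", "free cash click now", "are we meeting at noon",
         "see you at lunch", "claim your free prize", "can you call me later"]
labels = np.array(["spam", "spam", "ham", "ham", "spam", "ham"])

vec = TfidfVectorizer()
X = vec.fit_transform(texts)
nb = MultinomialNB().fit(X, labels)

# predict_proba columns follow nb.classes_, so find the spam column explicitly.
spam_col = list(nb.classes_).index("spam")
spam_proba = nb.predict_proba(X)[:, spam_col]

def false_positives(threshold):
    """Count ham messages whose spam probability reaches the threshold."""
    flagged = spam_proba >= threshold
    return int((flagged & (labels == "ham")).sum())

counts = [false_positives(t) for t in (0.5, 0.6, 0.7, 0.8, 0.9)]
print(counts)
```

Because a higher cutoff can only flag fewer messages, the false-positive count never increases as the cutoff rises; the price is that more spam slips through.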

In [13]:
# Sweep the spam probability threshold from 0.5 to 0.9 on Dataset 2.
# Column 1 of predict_proba is model.classes_[1], i.e. "spam" for ham/spam labels.
y_true = y_test1.to_numpy()
spam_proba = model.predict_proba(X_test1)[:,1]
idx = len(y_true)
for i in range(5,10):
    y_pred_diff = spam_proba >= i/10
    # False positives: ham messages flagged as spam.
    fp = int((y_pred_diff & (y_true == "ham")).sum())
    print(i,":Ds2 False positive: ",fp,";",(fp/idx)*100,"%")
    # Correct guesses: spam flagged as spam plus ham left unflagged.
    score = int(((y_pred_diff & (y_true == "spam")) | (~y_pred_diff & (y_true == "ham"))).sum())
    print(i,":Dataset2 correct guesses:",score,"/",idx,":",(score/idx)*100,"%")

5 :Ds2 False positive:  114 ; 10.22421524663677 %
5 :Dataset2 correct guesses: 975 / 1115 : 87.4439461883408 %
6 :Ds2 False positive:  56 ; 5.0224215246636765 %
6 :Dataset2 correct guesses: 1023 / 1115 : 91.74887892376682 %
7 :Ds2 False positive:  24 ; 2.1524663677130045 %
7 :Dataset2 correct guesses: 1041 / 1115 : 93.36322869955157 %
8 :Ds2 False positive:  9 ; 0.8071748878923767 %
8 :Dataset2 correct guesses: 1030 / 1115 : 92.37668161434978 %
9 :Ds2 False positive:  2 ; 0.17937219730941703 %
9 :Dataset2 correct guesses: 996 / 1115 : 89.32735426008969 %


Testing with a threshold of 0.8 gave a false-positive share below 1% and an overall accuracy of 92.4%, which is better than prediction with the default threshold of 0.5 (87.4%). However, the thresholds should be tested on all datasets.
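
The percentages quoted above follow directly from the counts printed in the Dataset 2 sweep. A short check, using only numbers copied from that output (the variable names are illustrative):

```python
# Counts copied from the Dataset 2 sweep output above.
total = 1115
rows = {0.5: (114, 975), 0.8: (9, 1030)}  # cutoff -> (false positives, correct guesses)

for cutoff, (fp, correct) in rows.items():
    fp_share = fp / total * 100
    accuracy = correct / total * 100
    print(f"cutoff {cutoff}: false positives {fp_share:.2f} %, accuracy {accuracy:.2f} %")

# Accuracy difference between the two cutoffs, in percentage points.
accuracy_gain = (rows[0.8][1] - rows[0.5][1]) / total * 100
print(f"accuracy gain: {accuracy_gain:.2f} percentage points")
```

Note that the false-positive percentage is taken over all test messages, as in the notebook cells, not over ham messages only.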

In [14]:
# Same threshold sweep on Dataset 1.
y_true = y_test.to_numpy()
spam_proba = model.predict_proba(X_test)[:,1]
idx = len(y_true)
for i in range(5,10):
    y_pred_diff = spam_proba >= i/10
    # False positives: ham messages flagged as spam.
    fp = int((y_pred_diff & (y_true == "ham")).sum())
    print(i,":Ds1 False positive: ",fp,";",(fp/idx)*100,"%")
    # Correct guesses: spam flagged as spam plus ham left unflagged.
    score = int(((y_pred_diff & (y_true == "spam")) | (~y_pred_diff & (y_true == "ham"))).sum())
    print(i,":Dataset1 correct guesses:",score,"/",idx,":",(score/idx)*100,"%")

5 :Ds1 False positive:  12 ; 0.1779359430604982 %
5 :Dataset1 correct guesses: 6630 / 6744 : 98.30960854092527 %
6 :Ds1 False positive:  8 ; 0.11862396204033215 %
6 :Dataset1 correct guesses: 6549 / 6744 : 97.10854092526691 %
7 :Ds1 False positive:  6 ; 0.0889679715302491 %
7 :Dataset1 correct guesses: 6445 / 6744 : 95.56642941874259 %
8 :Ds1 False positive:  4 ; 0.05931198102016608 %
8 :Dataset1 correct guesses: 6255 / 6744 : 92.7491103202847 %
9 :Ds1 False positive:  2 ; 0.02965599051008304 %
9 :Dataset1 correct guesses: 5873 / 6744 : 87.08481613285883 %


In [15]:
# Same threshold sweep on Dataset 3.
y_true = y_test2.to_numpy()
spam_proba = model.predict_proba(X_test2)[:,1]
idx = len(y_true)
for i in range(5,10):
    y_pred_diff = spam_proba >= i/10
    # False positives: ham messages flagged as spam.
    fp = int((y_pred_diff & (y_true == "ham")).sum())
    print(i,":Ds3 False positive: ",fp,";",(fp/idx)*100,"%")
    # Correct guesses: spam flagged as spam plus ham left unflagged.
    score = int(((y_pred_diff & (y_true == "spam")) | (~y_pred_diff & (y_true == "ham"))).sum())
    print(i,":Dataset3 correct guesses:",score,"/",idx,":",(score/idx)*100,"%")

5 :Ds3 False positive:  8 ; 0.7729468599033816 %
5 :Dataset3 correct guesses: 1014 / 1035 : 97.97101449275362 %
6 :Ds3 False positive:  6 ; 0.5797101449275363 %
6 :Dataset3 correct guesses: 1007 / 1035 : 97.29468599033815 %
7 :Ds3 False positive:  4 ; 0.3864734299516908 %
7 :Dataset3 correct guesses: 1000 / 1035 : 96.61835748792271 %
8 :Ds3 False positive:  4 ; 0.3864734299516908 %
8 :Dataset3 correct guesses: 983 / 1035 : 94.97584541062803 %
9 :Ds3 False positive:  0 ; 0.0 %
9 :Dataset3 correct guesses: 938 / 1035 : 90.6280193236715 %


For the other datasets, raising the threshold to 0.8 had a smaller effect on the false-positive share, while reducing overall accuracy by about 3 to 6 percentage points. For this classifier, however, avoiding false positives matters more than maximizing accuracy, so the Bayes classifier with a threshold of 0.8 offers the best trade-off. Using the Bayes classifier also makes it possible to train the model incrementally on new data to improve accuracy.
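
Incremental training of the Bayes model can be sketched with `MultinomialNB.partial_fit`. The texts below are invented, and the sketch assumes the vectorizer is fitted once and then reused, so words that first appear in later batches are ignored.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Invented batches: an initial training set and a later batch of new messages.
initial_texts = ["free prize now", "meeting at noon", "win cash now", "lunch at noon"]
initial_labels = ["spam", "ham", "spam", "ham"]
new_texts = ["free cash prize", "call me at noon"]
new_labels = ["spam", "ham"]

# The vocabulary is fixed after this fit; later batches only use transform().
vec = TfidfVectorizer().fit(initial_texts)

nb = MultinomialNB()
# The full list of classes must be given on the first partial_fit call.
nb.partial_fit(vec.transform(initial_texts), initial_labels, classes=["ham", "spam"])
seen_before = nb.class_count_.sum()

# Later batches update the existing counts instead of retraining from scratch.
nb.partial_fit(vec.transform(new_texts), new_labels)
seen_after = nb.class_count_.sum()

print(seen_before, "->", seen_after)
```

If new spam vocabulary matters, the vectorizer would have to be refitted and the model retrained on the full data, since the feature space cannot grow in place.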

As a result of this project, a classifier was implemented that identifies both email and SMS spam with over 90% accuracy while flagging fewer than 1% of messages as false positives. These parameters are a good fit because the classifier filters most spam while minimizing the loss of important messages. Furthermore, it can be trained on more data to stay relevant as spam and legitimate mail change, and thanks to its fast training it can be trained on larger datasets.

Reference to video: [here](https://drive.google.com/file/d/1NXqC4pt1FMOze3zgpDKphU4vVAd5Be58/view?usp=sharing)