<a href="https://colab.research.google.com/github/dvircohen0/NLP/blob/main/simple_spam_filter.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
import pandas as pd
import numpy as np

from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

**SPAM E-MAIL DATABASE ATTRIBUTES** 

---


*   48 continuous real [0,100] attributes of type word_freq_WORD
 = percentage of words in the e-mail that match WORD,
 i.e. 100 * (number of times the WORD appears in the e-mail) /
 total number of words in the e-mail. A "word" in this case is any
 string of alphanumeric characters bounded by non-alphanumeric
 characters or end-of-string.
*   6 continuous real [0,100] attributes of type char_freq_CHAR
 = percentage of characters in the e-mail that match CHAR,
 i.e. 100 * (number of CHAR occurrences) / total characters in the e-mail.
*   1 continuous real [1,...] attribute of type capital_run_length_average
 = average length of uninterrupted sequences of capital letters.
*   1 continuous integer [1,...] attribute of type capital_run_length_longest
 = length of the longest uninterrupted sequence of capital letters.
*   1 continuous integer [1,...] attribute of type capital_run_length_total
 = sum of the lengths of uninterrupted sequences of capital letters
 = total number of capital letters in the e-mail.
*   1 nominal {0,1} class attribute of type spam
 = denotes whether the e-mail was considered spam (1) or not (0),
 i.e. unsolicited commercial e-mail.
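
To make the attribute definitions above concrete, here is a minimal sketch that computes them for a single string. The function names (`word_freq`, `char_freq`, `capital_runs`), the sample text, and the case-insensitive word match are assumptions made for illustration; the extraction code that produced the dataset is not part of this notebook.

```python
import re

def word_freq(text, word):
    """Percentage of words in `text` equal to `word`.
    Words are runs of alphanumeric characters; matching ignores case (an assumption)."""
    words = re.findall(r"[A-Za-z0-9]+", text)
    if not words:
        return 0.0
    matches = sum(1 for w in words if w.lower() == word.lower())
    return 100.0 * matches / len(words)

def char_freq(text, char):
    """Percentage of characters in `text` equal to `char`."""
    if not text:
        return 0.0
    return 100.0 * text.count(char) / len(text)

def capital_runs(text):
    """Return (average, longest, total) length of runs of capital letters."""
    runs = [len(run) for run in re.findall(r"[A-Z]+", text)]
    if not runs:
        return 0.0, 0, 0
    return sum(runs) / len(runs), max(runs), sum(runs)

# Hypothetical e-mail text, not taken from the dataset.
sample = "FREE money!!! Claim your FREE prize now"

print(word_freq(sample, "free"))   # 2 of 7 words match
print(char_freq(sample, "!"))      # 3 of 39 characters match
print(capital_runs(sample))        # runs are "FREE", "C", "FREE"
```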



*  word_freq_make:         continuous.
*  word_freq_address:      continuous.
*  word_freq_all:          continuous.
*  word_freq_3d:           continuous.
*  word_freq_our:          continuous.
*  word_freq_over:         continuous.
*  word_freq_remove:       continuous.
*  word_freq_internet:     continuous.
*  word_freq_order:        continuous.
*  word_freq_mail:         continuous.
*  word_freq_receive:      continuous.
*  word_freq_will:         continuous.
*  word_freq_people:       continuous.
*  word_freq_report:       continuous.
*  word_freq_addresses:    continuous.
*  word_freq_free:         continuous.
*  word_freq_business:     continuous.
*  word_freq_email:        continuous.
*  word_freq_you:          continuous.
*  word_freq_credit:       continuous.
*  word_freq_your:         continuous.
*  word_freq_font:         continuous.
*  word_freq_000:          continuous.
*  word_freq_money:        continuous.
*  word_freq_hp:           continuous.
*  word_freq_hpl:          continuous.
*  word_freq_george:       continuous.
*  word_freq_650:          continuous.
*  word_freq_lab:          continuous.
*  word_freq_labs:         continuous.
*  word_freq_telnet:       continuous.
*  word_freq_857:          continuous.
*  word_freq_data:         continuous.
*  word_freq_415:          continuous.
*  word_freq_85:           continuous.
*  word_freq_technology:   continuous.
*  word_freq_1999:         continuous.
*  word_freq_parts:        continuous.
*  word_freq_pm:           continuous.
*  word_freq_direct:       continuous.
*  word_freq_cs:           continuous.
*  word_freq_meeting:      continuous.
*  word_freq_original:     continuous.
*  word_freq_project:      continuous.
*  word_freq_re:           continuous.
*  word_freq_edu:          continuous.
*  word_freq_table:        continuous.
*  word_freq_conference:   continuous.
*  char_freq_;:            continuous.
*  char_freq_(:            continuous.
*  char_freq_[:            continuous.
*  char_freq_!:            continuous.
*  char_freq_$:            continuous.
*  char_freq_#:            continuous.
*  capital_run_length_average: continuous.
*  capital_run_length_longest: continuous.
*  capital_run_length_total:   continuous.
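
The attribute names listed above can be assembled into a list of column labels. This is a sketch only: the variable names are hypothetical, and `"spam"` as the name of the final class column is an assumption based on the attribute description.

```python
# The 48 words and 6 characters, in the order listed above.
WORDS = (
    "make address all 3d our over remove internet order mail receive will "
    "people report addresses free business email you credit your font 000 "
    "money hp hpl george 650 lab labs telnet 857 data 415 85 technology "
    "1999 parts pm direct cs meeting original project re edu table conference"
).split()
CHARS = [";", "(", "[", "!", "$", "#"]

column_names = (
    [f"word_freq_{w}" for w in WORDS]
    + [f"char_freq_{c}" for c in CHARS]
    + [
        "capital_run_length_average",
        "capital_run_length_longest",
        "capital_run_length_total",
        "spam",  # assumed label for the class column (1 = spam, 0 = not spam)
    ]
)

print(len(column_names))  # 48 + 6 + 3 + 1 columns
```

These names could then be passed to the loader, for example `pd.read_csv("spambase.data", header=None, names=column_names)`, so that columns are addressed by name rather than by position.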









In [10]:
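# Download the Spambase data file from the UCI repository.
!wget https://archive.ics.uci.edu/ml/machine-learning-databases/spambase/spambase.data
# The file has no header row, so pandas numbers the columns 0-57; column 57 is the spam label.
data = pd.read_csv("spambase.data",header=None)
data.head()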

--2021-05-18 16:51:41--  https://archive.ics.uci.edu/ml/machine-learning-databases/spambase/spambase.data
Resolving archive.ics.uci.edu (archive.ics.uci.edu)... 128.195.10.252
Connecting to archive.ics.uci.edu (archive.ics.uci.edu)|128.195.10.252|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 702942 (686K) [application/x-httpd-php]
Saving to: ‘spambase.data.3’


2021-05-18 16:51:42 (2.16 MB/s) - ‘spambase.data.3’ saved [702942/702942]



Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57
0,0.0,0.64,0.64,0.0,0.32,0.0,0.0,0.0,0.0,0.0,0.0,0.64,0.0,0.0,0.0,0.32,0.0,1.29,1.93,0.0,0.96,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.778,0.0,0.0,3.756,61,278,1
1,0.21,0.28,0.5,0.0,0.14,0.28,0.21,0.07,0.0,0.94,0.21,0.79,0.65,0.21,0.14,0.14,0.07,0.28,3.47,0.0,1.59,0.0,0.43,0.43,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.07,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.132,0.0,0.372,0.18,0.048,5.114,101,1028,1
2,0.06,0.0,0.71,0.0,1.23,0.19,0.19,0.12,0.64,0.25,0.38,0.45,0.12,0.0,1.75,0.06,0.06,1.03,1.36,0.32,0.51,0.0,1.16,0.06,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.06,0.0,0.0,0.12,0.0,0.06,0.06,0.0,0.0,0.01,0.143,0.0,0.276,0.184,0.01,9.821,485,2259,1
3,0.0,0.0,0.0,0.0,0.63,0.0,0.31,0.63,0.31,0.63,0.31,0.31,0.31,0.0,0.0,0.31,0.0,0.0,3.18,0.0,0.31,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.137,0.0,0.137,0.0,0.0,3.537,40,191,1
4,0.0,0.0,0.0,0.0,0.63,0.0,0.31,0.63,0.31,0.63,0.31,0.31,0.31,0.0,0.0,0.31,0.0,0.0,3.18,0.0,0.31,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.135,0.0,0.135,0.0,0.0,3.537,40,191,1


In [7]:
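data = data.to_numpy()
# Unseeded in-place row shuffle. Because the row order changes on every run,
# the split below differs between runs even though random_state is fixed.
np.random.shuffle(data)

# The last column is the spam label; all other columns are features.
features = data[:,:-1]
labels = data[:,-1]

# Hold out about 2.2% of the rows as the test set.
X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.02173, random_state=42)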
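
The unusual `test_size=0.02173` is easier to read as a row count. The arithmetic below is a rough check under two assumptions: that the file has 4,601 rows (the size given in the UCI Spambase documentation, not printed in this notebook), and that scikit-learn rounds a fractional test size up to the next whole sample.

```python
import math

n_rows = 4601        # assumed number of rows in spambase.data
test_size = 0.02173  # fraction passed to train_test_split

n_test = math.ceil(test_size * n_rows)  # rows held out for testing
n_train = n_rows - n_test               # rows left for training

print(n_test, n_train)
```

If that is the intent, passing an integer such as `test_size=100` to `train_test_split` would state it more directly.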

In [8]:
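# Multinomial Naive Bayes expects non-negative features, which holds for
# the frequency and run-length columns used here.
NB_model = MultinomialNB()
NB_model.fit(X_train,y_train)

# score() returns the mean accuracy on the held-out test set.
print("Classification rate for Naive Bayes: ", NB_model.score(X_test,y_test))

# AdaBoost with scikit-learn's default settings.
AB_model = AdaBoostClassifier()
AB_model.fit(X_train,y_train)

print("Classification rate for AdaBoost: ", AB_model.score(X_test,y_test))


Classification rate for Naive Bayes:  0.77
Classification rate for AdaBoost:  0.92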
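
Both scores come from a small held-out set, so they are noisy estimates. As a rough illustration, the sketch below puts a normal-approximation 95% interval around each accuracy. The helper name `accuracy_interval` is hypothetical, and the test-set size of 100 is an assumption inferred from `test_size` and the two-decimal scores, not a value printed by the notebook.

```python
import math

def accuracy_interval(accuracy, n_samples, z=1.96):
    """Approximate interval for an accuracy estimate (normal approximation).
    z=1.96 corresponds to roughly 95% coverage."""
    std_err = math.sqrt(accuracy * (1.0 - accuracy) / n_samples)
    low = max(0.0, accuracy - z * std_err)
    high = min(1.0, accuracy + z * std_err)
    return low, high

n_test = 100  # assumed size of the held-out test set

# Accuracies copied from the output above.
for name, acc in [("Naive Bayes", 0.77), ("AdaBoost", 0.92)]:
    low, high = accuracy_interval(acc, n_test)
    print(f"{name}: {acc:.2f} (approx. 95% interval {low:.2f} to {high:.2f})")
```

With so few test rows, each estimate carries an uncertainty of several percentage points, and because the row shuffle is unseeded the scores will also change from run to run.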
