
# NAIVE BAYES CLASSIFIER


EMAIL SPAM DETECTION USING NAIVE BAYES CLASSIFIER

The Naïve Bayes classifier is a supervised machine learning algorithm used for classification tasks such as text classification.
It predicts the class of an object by estimating the probability that the object belongs to each class and choosing the most probable one.

The algorithm is called 'Naïve' because it assumes that, within a given class, the occurrence of one feature is independent of the occurrence of the other features.
The 'Bayes' part comes from Bayes' theorem, which the algorithm uses to turn these per-feature probabilities into a class probability.
Hence, the algorithm is called "Naïve Bayes".
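
As a small illustration of the idea, the sketch below applies Bayes' theorem with the naive independence assumption to a two-word message. All probabilities and names here are made-up values chosen for illustration; they are not estimated from the dataset used later.

```python
# Toy illustration of Bayes' theorem with the naive independence assumption.
# All probabilities below are invented values, not estimates from the dataset.

prior = {"spam": 0.2, "ham": 0.8}  # P(class)

# P(word | class), treated as independent of the other words given the class
likelihood = {
    "spam": {"free": 0.30, "win": 0.20},
    "ham": {"free": 0.02, "win": 0.01},
}

def posterior(words):
    """Return P(class | words) for each class."""
    scores = {}
    for label in prior:
        # Unnormalised score: P(class) * product of P(word | class)
        score = prior[label]
        for word in words:
            score *= likelihood[label][word]
        scores[label] = score
    total = sum(scores.values())  # P(words), the normalising constant
    return {label: score / total for label, score in scores.items()}

result = posterior(["free", "win"])
print(result)
```

With these invented numbers, the message "free win" comes out as spam with a probability of roughly 0.99, even though spam is the less common class a priori.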

First, we import pandas, which we use to load and inspect the dataset.

In [1]:
import pandas as pd

We create a DataFrame from the CSV file that contains the dataset.

In [3]:
df = pd.read_csv("spam email dataset.csv")

The dataframe is,

In [4]:
df

Unnamed: 0,Category,Message
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."
...,...,...
5567,spam,This is the 2nd time we have tried 2 contact u...
5568,ham,Will ü b going to esplanade fr home?
5569,ham,"Pity, * was in mood for that. So...any other s..."
5570,ham,The guy did some bitching but I acted like i'd...


Let's now check the first five rows of the DataFrame.

In [5]:
df.head()

Unnamed: 0,Category,Message
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


In the dataset, we represent the category as 'spam' for a bad mail and 'ham' for a good mail.

Let's now explore the data. We can group the dataset by category and then describe each group.

In [6]:
df.groupby('Category').describe()

Category,Message count,Message unique,Message top,Message freq
ham,4825,4516,"Sorry, I'll call later",30
spam,747,641,Please call our customer service representativ...,4


The description shows 4825 'ham' messages and 747 'spam' messages, of which 4516 and 641 respectively are unique (the rest are duplicates). It also shows the most frequent message in each category and how often it appears.

Let's now convert the category labels to numbers, since the model works with numerical values. Let 'spam' messages be denoted by 1 and 'ham' messages by 0.

In [7]:
df['Spam'] = df['Category'].apply(lambda x: 1 if x == 'spam' else 0)

In [8]:
df.head()

Unnamed: 0,Category,Message,Spam
0,ham,"Go until jurong point, crazy.. Available only ...",0
1,ham,Ok lar... Joking wif u oni...,0
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,1
3,ham,U dun say so early hor... U c already then say...,0
4,ham,"Nah I don't think he goes to usf, he lives aro...",0


Let's now divide the DataFrame into input (x) and output (y).

In [9]:
x = df.Message
y = df.Spam

Let's now split the data into training and test sets. First, we import the function required for that.

In [10]:
from sklearn.model_selection import train_test_split

Splitting the data into training (80%) and testing (20%) sets:

In [11]:
x_train, x_test, y_train, y_test = train_test_split(x,y,test_size=0.2)
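
The split above is random, so the rows shown below will differ from run to run. As a hedged sketch of one way to make it repeatable, `train_test_split` accepts a `random_state` seed, and `stratify` keeps the spam-to-ham ratio the same in both parts. The toy `messages` and `labels` lists are hypothetical stand-ins for `x` and `y`, and the seed value 42 is arbitrary.

```python
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 15 ham (0) and 5 spam (1) messages
messages = [f"message {i}" for i in range(20)]
labels = [0] * 15 + [1] * 5

# random_state fixes the shuffle; stratify keeps the class ratio in both parts
m_train, m_test, l_train, l_test = train_test_split(
    messages, labels, test_size=0.2, random_state=42, stratify=labels
)

print(len(m_train), len(m_test))  # sizes of the two parts
print(sum(l_train), sum(l_test))  # number of spam messages in each part
```

Stratifying matters for this kind of data because spam is the minority class, and an unlucky random split could leave the test set with very few spam examples.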

In [12]:
x_train

1284                            Yes i thought so. Thanks.
5287    Hey ! Don't forget ... You are MINE ... For ME...
4125    Hey sexy buns ! Have I told you ? I adore you,...
4702                               I liked the new mobile
2914    Kindly send some one to our flat before  &lt;D...
                              ...                        
4322    K, jason says he's gonna be around so I'll be ...
2461    i cant talk to you now.i will call when i can....
991                                          26th OF JULY
2245                              No management puzzeles.
225     500 New Mobiles from 2004, MUST GO! Txt: NOKIA...
Name: Message, Length: 4457, dtype: object

In [13]:
x_test

5276    Dunno leh cant remember mayb lor. So wat time ...
1515                         K:)all the best:)congrats...
603                Speaking of does he have any cash yet?
3228         Wife.how she knew the time of murder exactly
1269    Can U get 2 phone NOW? I wanna chat 2 set up m...
                              ...                        
2476                        Mm i am on the way to railway
3864    Oh my god! I've found your number again! I'm s...
676                    I dont knw pa, i just drink milk..
1065    That's fine, I'll bitch at you about it later ...
2929                                          Anything...
Name: Message, Length: 1115, dtype: object

In [14]:
y_train

1284    0
5287    0
4125    0
4702    0
2914    0
       ..
4322    0
2461    0
991     0
2245    0
225     1
Name: Spam, Length: 4457, dtype: int64

In [15]:
y_test

5276    0
1515    0
603     0
3228    0
1269    1
       ..
2476    0
3864    1
676     0
1065    0
2929    0
Name: Spam, Length: 1115, dtype: int64

Let's now convert the texts in the 'Message' column to numerical form. We can use the 'count vectorization' method for this.
Count vectorization transforms each text into a vector whose entries are the number of times each vocabulary word occurs in that text, where the vocabulary is built from all the training texts.
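
To make the idea concrete, here is a minimal hand-written sketch of count vectorization on two toy messages. It is a simplification for illustration only: `CountVectorizer` applies its own tokenization rules, and the names `docs`, `vocab` and `to_counts` are hypothetical.

```python
from collections import Counter

# Two toy messages (hypothetical, not taken from the dataset)
docs = ["free entry free prize", "call me later"]

# Build the vocabulary: every distinct lowercase word, in sorted order
vocab = sorted({word for doc in docs for word in doc.lower().split()})

def to_counts(doc):
    """Map one document to a list of per-word counts over the vocabulary."""
    counts = Counter(doc.lower().split())
    return [counts[word] for word in vocab]

vectors = [to_counts(doc) for doc in docs]
print(vocab)    # ['call', 'entry', 'free', 'later', 'me', 'prize']
print(vectors)  # [[0, 1, 2, 0, 0, 1], [1, 0, 0, 1, 1, 0]]
```

Each message becomes one row, each vocabulary word one column, and most entries are zero, which is why the real vectorizer returns a sparse matrix.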

Let's first import the class required for count vectorization.

In [16]:
from sklearn.feature_extraction.text import CountVectorizer

In [17]:
v = CountVectorizer()

Now let's fit the vectorizer on the training inputs and transform them.

In [18]:
x_train_count = v.fit_transform(x_train.values)

In [19]:
x_train_count

<4457x7668 sparse matrix of type '<class 'numpy.int64'>'
	with 58956 stored elements in Compressed Sparse Row format>

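The result is a sparse matrix with one row per training message and one column per vocabulary word. We can convert it into a dense NumPy array using its 'toarray' method.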
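
A small sketch of that conversion on a toy corpus follows. The two messages and the name `vec` are hypothetical; in the notebook the same call would be `x_train_count.toarray()`.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus (hypothetical, not taken from the dataset)
docs = ["free entry free prize", "call me later"]

vec = CountVectorizer()
counts = vec.fit_transform(docs)  # sparse matrix, one row per message
dense = counts.toarray()          # dense NumPy array with the same values

# Vocabulary words ordered by their column index
columns = sorted(vec.vocabulary_, key=vec.vocabulary_.get)
print(columns)
print(dense)
```

For the full training matrix (4457 rows by 7668 columns) the dense form holds about 34 million entries, so it is usually better to keep the sparse form and inspect only a few rows.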

There are three common types of Naïve Bayes classifier:

    1. Bernoulli Naïve Bayes: used for binary features, such as whether a word is present or absent
    2. Multinomial Naïve Bayes: used for discrete count features, such as word counts
    3. Gaussian Naïve Bayes: used for continuous features, assumed to be normally distributed within each class


Here, our features are word counts, which are discrete and can take many values. Hence, we use the Multinomial Naïve Bayes algorithm.

Importing the class required for this algorithm,

In [20]:
from sklearn.naive_bayes import MultinomialNB

Let's now create a model for this algorithm.

In [21]:
model = MultinomialNB()

Training the model,

In [22]:
model.fit(x_train_count,y_train)

MultinomialNB()
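
To give a feel for what training computes, the following is a minimal from-scratch sketch of multinomial Naïve Bayes on a toy dataset. It is not scikit-learn's implementation: it only mirrors the general idea of class priors plus smoothed word probabilities, with add-one smoothing assumed (`alpha=1.0`). The toy messages and all names are hypothetical.

```python
import math
from collections import Counter

# Hypothetical toy training set: (message, label) with 1 = spam, 0 = ham
train = [
    ("win free prize now", 1),
    ("free entry win cash", 1),
    ("see you at lunch", 0),
    ("call me after lunch", 0),
]

vocab = sorted({w for text, _ in train for w in text.split()})
word_counts = {0: Counter(), 1: Counter()}  # word frequencies per class
class_counts = Counter()                    # number of messages per class
for text, label in train:
    word_counts[label].update(text.split())
    class_counts[label] += 1

def log_score(text, label, alpha=1.0):
    """log P(label) + sum of log P(word | label), with add-alpha smoothing."""
    total_words = sum(word_counts[label].values())
    score = math.log(class_counts[label] / len(train))
    for w in text.split():
        if w not in vocab:
            continue  # words unseen during training are ignored
        p = (word_counts[label][w] + alpha) / (total_words + alpha * len(vocab))
        score += math.log(p)
    return score

def predict(text):
    """Return the label with the higher log score."""
    return max((0, 1), key=lambda label: log_score(text, label))

print(predict("free prize"), predict("lunch later"))
```

Working in log space avoids multiplying many small probabilities together, and the smoothing term keeps a word that never appeared in one class from forcing that class's probability to zero.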

Let's now classify a new email.

In [23]:
email = ['Congratulations, on winning $2000']

Let's now convert this email into numerical form using the vectorizer fitted on the training data.

In [24]:
email_count = v.transform(email)

Let's now predict the output.

In [25]:
model.predict(email_count)

array([1], dtype=int64)

The model predicts 1, so this email is classified as spam.
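
`predict` returns only the most likely label. When the confidence of the prediction matters, `MultinomialNB` also provides `predict_proba`, which returns one probability per class. The sketch below shows this on a hypothetical toy training set so that it runs on its own; in the notebook the equivalent call would be `model.predict_proba(email_count)`.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy training data: 1 = spam, 0 = ham
texts = [
    "win free prize now",
    "free entry win cash",
    "see you at lunch",
    "call me after lunch",
]
labels = [1, 1, 0, 0]

vec = CountVectorizer()
clf = MultinomialNB()
clf.fit(vec.fit_transform(texts), labels)

# One row per input message, one column per class in clf.classes_
proba = clf.predict_proba(vec.transform(["free prize"]))
print(clf.classes_)  # column order of the probabilities
print(proba)
```

The columns follow the order of `clf.classes_`, so with labels 0 and 1 the second column is the estimated probability of spam.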

Let's predict the output for another email.

In [26]:
email2 = ['Heyy lets meet this evening at 7']
email2_count = v.transform(email2)
model.predict(email2_count)

array([0], dtype=int64)

The model predicts 0, so this email is classified as ham.
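
Two hand-picked emails say little about overall quality, and the held-out test set created earlier has not been used yet. The sketch below shows one possible way to measure accuracy on held-out messages. It uses small hypothetical train and test lists so that it runs on its own, and the resulting numbers say nothing about the real dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy data: 1 = spam, 0 = ham
train_texts = [
    "win free prize now",
    "free entry win cash",
    "see you at lunch",
    "call me after lunch",
]
train_labels = [1, 1, 0, 0]
test_texts = ["win cash now", "lunch after call"]
test_labels = [1, 0]

vec = CountVectorizer()
clf = MultinomialNB()
clf.fit(vec.fit_transform(train_texts), train_labels)

# Use transform (not fit_transform) so the test data reuses the training vocabulary
pred = clf.predict(vec.transform(test_texts))

print(accuracy_score(test_labels, pred))
print(confusion_matrix(test_labels, pred))  # rows: true class, columns: predicted class
```

In the notebook, the corresponding check would be `model.score(v.transform(x_test), y_test)`, which reports the fraction of test messages classified correctly.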