# Naive Bayes

Types of Naive Bayes:

1. Gaussian Naive Bayes: GaussianNB is used for classification with continuous features. It assumes that, within each class, the values of every feature follow a Gaussian (normal) distribution.
2. Multinomial Naive Bayes: It is used for discrete counts. In a text classification problem, for example, the feature is no longer “word occurs in the document” (a single Bernoulli trial) but “how often the word occurs in the document”. You can think of each count as “the number of times outcome x_i is observed over the n trials”.
3. Bernoulli Naive Bayes: The Bernoulli model is useful if your feature vectors are Boolean (i.e. zeros and ones). One application is text classification with a ‘bag of words’ model, where 1 means “word occurs in the document” and 0 means “word does not occur in the document”.
4. Complement Naive Bayes: It is an adaptation of Multinomial NB in which the model weights for each class are calculated from the complement of that class, that is, from all samples that do not belong to it. This makes it suitable for imbalanced data sets, and it often outperforms MNB on text classification tasks.
5. Categorical Naive Bayes: It is useful if each feature follows a categorical distribution. Before using this algorithm, the categorical variables have to be encoded as non-negative integers, for example with scikit-learn's OrdinalEncoder.
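To make the "counts over n trials" idea concrete, here is a minimal pure-Python sketch of a multinomial Naive Bayes classifier with add-one (Laplace) smoothing. The function names `train_multinomial_nb` and `predict` and the toy documents are assumptions made for illustration; this is not how scikit-learn implements `MultinomialNB` internally, only the same idea in a few lines.

```python
import math
from collections import Counter, defaultdict

def train_multinomial_nb(docs, labels, alpha=1.0):
    """Estimate log priors and smoothed log likelihoods from word counts."""
    vocab = sorted({word for doc in docs for word in doc.split()})
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)  # class -> word -> count
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc.split())

    log_prior = {c: math.log(n / len(docs)) for c, n in class_counts.items()}
    log_likelihood = {}
    for c in class_counts:
        # Denominator: all word occurrences in the class plus the smoothing mass.
        total = sum(word_counts[c].values()) + alpha * len(vocab)
        log_likelihood[c] = {
            w: math.log((word_counts[c][w] + alpha) / total) for w in vocab
        }
    return log_prior, log_likelihood

def predict(doc, log_prior, log_likelihood):
    """Return the class with the highest log score; unseen words are skipped."""
    scores = {}
    for c in log_prior:
        scores[c] = log_prior[c] + sum(
            log_likelihood[c][w] for w in doc.split() if w in log_likelihood[c]
        )
    return max(scores, key=scores.get)

# Toy data, invented for illustration only.
docs = ["win free prize now", "free cash win",
        "meeting schedule attached", "see attached report"]
labels = ["spam", "spam", "ham", "ham"]

log_prior, log_likelihood = train_multinomial_nb(docs, labels)
print(predict("free prize", log_prior, log_likelihood))       # spam
print(predict("report attached", log_prior, log_likelihood))  # ham
```

Working in log space avoids multiplying many small probabilities together, and the smoothing term keeps a word that never appeared in a class from forcing that class's probability to zero.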

In [12]:
import pandas as pd
import numpy as np
import re
import string
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix


In [50]:
# Load the dataset
df = pd.read_csv("spam.csv")

In [14]:
pd.set_option('display.max_rows', None)

In [15]:
df

Unnamed: 0.1,Unnamed: 0,label,text,label_num
0,605,ham,Subject: enron methanol ; meter # : 988291\r\n...,0
1,2349,ham,"Subject: hpl nom for january 9 , 2001\r\n( see...",0
2,3624,ham,"Subject: neon retreat\r\nho ho ho , we ' re ar...",0
3,4685,spam,"Subject: photoshop , windows , office . cheap ...",1
4,2030,ham,Subject: re : indian springs\r\nthis deal is t...,0
5,2949,ham,Subject: ehronline web address change\r\nthis ...,0
6,2793,ham,Subject: spring savings certificate - take 30 ...,0
7,4185,spam,Subject: looking for medication ? we ` re the ...,1
8,2641,ham,Subject: noms / actual flow for 2 / 26\r\nwe a...,0
9,1870,ham,"Subject: nominations for oct . 21 - 23 , 2000\...",0


In [51]:


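# Keep only the message text and its numeric label; .copy() avoids
# SettingWithCopyWarning when the 'text' column is overwritten later
df = df[['text', 'label_num']].copy()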


In [52]:
df.head()

Unnamed: 0,text,label_num
0,Subject: enron methanol ; meter # : 988291\r\n...,0
1,"Subject: hpl nom for january 9 , 2001\r\n( see...",0
2,"Subject: neon retreat\r\nho ho ho , we ' re ar...",0
3,"Subject: photoshop , windows , office . cheap ...",1
4,Subject: re : indian springs\r\nthis deal is t...,0


In [53]:

# Preprocessing the text data
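def clean_text(text):
    """Lowercase the text and replace line breaks and digits with spaces."""
    text = text.lower()
    text = re.sub(r'\n', ' ', text)  # newlines -> space
    text = re.sub(r'\d', ' ', text)  # digits -> space (raw string avoids an invalid-escape warning)
    text = re.sub(r'\r', ' ', text)  # carriage returns -> space
    # text = text.translate(str.maketrans('', '', string.punctuation))  # optional: strip punctuation
    text = re.sub(r' +', ' ', text)  # collapse runs of spaces
    return text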

df['text'] = df['text'].apply(clean_text)
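To see what this kind of cleaning does to a single message, here is a small standalone sketch. `clean_text_compact` is a hypothetical variant, not the notebook's function: it folds the separate substitutions into one pattern, and unlike the cell above it also collapses tabs. The sample string is made up to resemble the e-mails in the dataset.

```python
import re

def clean_text_compact(text):
    """Lowercase, then replace every run of digits/whitespace with one space."""
    return re.sub(r'[\d\s]+', ' ', text.lower())

sample = "Subject: Meter # : 988291\r\nThis is a follow up"
print(clean_text_compact(sample))
# subject: meter # : this is a follow up
```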


In [54]:
df.head()

Unnamed: 0,text,label_num
0,subject: enron methanol ; meter # : this is a ...,0
1,"subject: hpl nom for january , ( see attached ...",0
2,"subject: neon retreat ho ho ho , we ' re aroun...",0
3,"subject: photoshop , windows , office . cheap ...",1
4,subject: re : indian springs this deal is to b...,0


In [55]:
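import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download('stopwords')
# Note: word_tokenize also needs the Punkt tokenizer data
# ('punkt' or 'punkt_tab', depending on the NLTK version)
stop_words = set(stopwords.words("english"))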

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\AN20259618\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [56]:
def remove_stop(text):  # e.g. "hey hi there i am anil" (already lowercased)
    filtered = []
    tokenized_word = word_tokenize(text)  # ['hey', 'hi', 'there', 'i', 'am', 'anil']
    for each_word in tokenized_word:
        if each_word not in stop_words:
            filtered.append(each_word)  # keeps ['hey', 'hi', 'anil']
    return " ".join(filtered)


In [None]:
df['text'] = df['text'].apply(remove_stop)

In [59]:
from nltk.stem.snowball import SnowballStemmer
snow_stem = SnowballStemmer(language='english')

In [61]:
def stem(text):  # e.g. "windows office cheap"
    filtered = []
    tokenized_word = word_tokenize(text)  # ['windows', 'office', 'cheap']
    for each_word in tokenized_word:
        filtered.append(snow_stem.stem(each_word))  # ['window', 'offic', 'cheap']
    return " ".join(filtered)

In [62]:
df['text'] = df['text'].apply(stem)
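Stop-word removal and stemming each tokenize the text, so every message is tokenized twice. One possible alternative is to do both in a single pass. The sketch below uses a hypothetical helper, `filter_and_stem`, that takes the tokenizer, stop-word set and stemmer as arguments. The stand-ins (`str.split`, a three-word stop list, a toy "strip trailing s" stemmer) are assumptions that let it run without NLTK.

```python
def filter_and_stem(text, tokenize, stop_words, stem_word):
    """Tokenize once, drop stop words, then stem the remaining tokens."""
    tokens = tokenize(text)
    kept = [stem_word(tok) for tok in tokens if tok not in stop_words]
    return " ".join(kept)

# Stand-ins so the sketch runs on its own (not NLTK's real resources).
toy_stop_words = {"there", "i", "am"}

def toy_stem(word):
    """Crude illustrative stemmer: strip one trailing 's'."""
    return word[:-1] if word.endswith("s") else word

print(filter_and_stem("hey hi there i am selling windows",
                      str.split, toy_stop_words, toy_stem))
# hey hi selling window
```

In the notebook, one would presumably pass `word_tokenize`, `stop_words` and `snow_stem.stem` instead. The result may differ slightly from running the two steps separately, because the notebook re-tokenizes the joined text between steps.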

In [64]:
df.head()

Unnamed: 0,text,label_num
0,subject : enron methanol ; meter # : follow no...,0
1,"subject : hpl nom januari , ( see attach file ...",0
2,"subject : neon retreat ho ho ho , ' around won...",0
3,"subject : photoshop , window , offic . cheap ....",1
4,subject : : indian spring deal book teco pvr r...,0


In [65]:
#Creating feature vectors using CountVectorizer
vectorizer = CountVectorizer(stop_words='english')
X = vectorizer.fit_transform(df['text'])
y = df['label_num']

In [66]:
X[0]

<1x37777 sparse matrix of type '<class 'numpy.int64'>'
	with 30 stored elements in Compressed Sparse Row format>
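The sparse row above holds word counts over a 37,777-word vocabulary. As a rough picture of what a count vectorizer computes, here is a pure-Python sketch on two invented documents. It ignores what `CountVectorizer` does on top of this (lowercasing, regex tokenization, stop-word filtering, sparse storage), and the names `count_vector`, `count_rows` and `binary_rows` are made up for the example.

```python
from collections import Counter

docs = ["cheap cheap office software", "office meeting today"]  # toy documents

# Vocabulary: every distinct token, sorted so the column order is stable.
vocab = sorted({tok for doc in docs for tok in doc.split()})

def count_vector(doc):
    """Return one row of the document-term matrix for `doc`."""
    counts = Counter(doc.split())
    return [counts[tok] for tok in vocab]

count_rows = [count_vector(doc) for doc in docs]
# Presence/absence version of the same rows.
binary_rows = [[int(c > 0) for c in row] for row in count_rows]

print(vocab)        # ['cheap', 'meeting', 'office', 'software', 'today']
print(count_rows)   # [[2, 0, 1, 1, 0], [0, 1, 1, 0, 1]]
print(binary_rows)  # [[1, 0, 1, 1, 0], [0, 1, 1, 0, 1]]
```

The count rows are the kind of input the multinomial model expects, while the binary rows correspond to the "word occurs / does not occur" features of the Bernoulli model.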

In [67]:
# Splitting the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Training the Naive Bayes model
nb = MultinomialNB()
nb.fit(X_train, y_train)

# Making predictions on the testing set
y_pred = nb.predict(X_test)
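In the cells above, the vectorizer is fitted on the full dataset before the split, so the vocabulary is learned from rows that later end up in the test set. A common way to avoid this is to split the raw text first and wrap the vectorizer and classifier in a scikit-learn pipeline. The sketch below shows that pattern on a toy corpus invented for illustration. It is an alternative arrangement, not the notebook's own code, and the names `texts`, `labels` and `model` are assumptions.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy corpus, invented for illustration (1 = spam, 0 = ham).
texts = [
    "win free prize now", "free cash offer", "cheap pills free", "win cash now",
    "meeting schedule attached", "see attached report",
    "lunch meeting tomorrow", "project report due",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Split the raw text first, so the vocabulary is learned from training rows only.
train_texts, test_texts, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels
)

# The pipeline fits the vectorizer and the classifier together on the training text.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train_texts, y_train)

print(model.predict(test_texts))
print(model.predict(["free prize offer"]))
```

Because the pipeline accepts raw strings, the same object can be used for both the held-out rows and any new message, without calling `transform` by hand.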


In [68]:
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

In [69]:
# Evaluating the model performance
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)
conf_matrix = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:\n", conf_matrix)
print("Classification Report:\n", classification_report(y_test, y_pred))

Accuracy: 0.9729381443298969
Confusion Matrix:
 [[1099   22]
 [  20  411]]
Classification Report:
               precision    recall  f1-score   support

           0       0.98      0.98      0.98      1121
           1       0.95      0.95      0.95       431

    accuracy                           0.97      1552
   macro avg       0.97      0.97      0.97      1552
weighted avg       0.97      0.97      0.97      1552
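The figures in the classification report can be recomputed by hand from the confusion matrix, which is a useful sanity check. The sketch below hard-codes the matrix printed above (rows are true labels, columns are predicted labels). The variable names are chosen for the example.

```python
# Confusion matrix copied from the output above.
conf_matrix = [[1099, 22],
               [20, 411]]
(tn, fp), (fn, tp) = conf_matrix

accuracy = (tn + tp) / (tn + fp + fn + tp)
spam_precision = tp / (tp + fp)  # of the e-mails flagged as spam, how many were spam
spam_recall = tp / (tp + fn)     # of the real spam e-mails, how many were caught

print(round(accuracy, 4), round(spam_precision, 2), round(spam_recall, 2))
# 0.9729 0.95 0.95
```

So 22 legitimate e-mails were flagged as spam and 20 spam e-mails were missed, out of 1,552 test messages.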



In [77]:
text = [
    'Enter a chance to win $5000, hurry up, offer valid until march 31, 2021',
    'You are awarded a SiPix Digital Camera! call 09061221061 from landline. Delivery within 28days. T Cs Box177. M221BP. 2yr warranty. 150ppm. 16 . p p£3.99',
    'it to 80488. Your 500 free text messages are valid until 31 December 2005.',
    'Hey Sam, Are you coming for a cricket game tomorrow',
    "Why don't you wait 'til at least wednesday to see if you get your laptop."
]

In [75]:
vectorizer.transform(text)

<5x37777 sparse matrix of type '<class 'numpy.int64'>'
	with 19 stored elements in Compressed Sparse Row format>

In [76]:
nb.predict(vectorizer.transform(text))

array([0, 1, 1, 0, 0], dtype=int64)
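Two things are worth keeping in mind when scoring new messages. First, new text should normally go through the same preprocessing as the training text (here: cleaning, stop-word removal and stemming) before it is vectorized, otherwise some words may not match the learned vocabulary. Second, `predict_proba` shows how confident the model is, not just the label. The sketch below illustrates the second point on toy data invented for the example. `toy_vectorizer`, `toy_nb` and `label_names` are assumed names, not objects from the notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy training data, invented for illustration (1 = spam, 0 = ham).
train_texts = ["win free prize now", "free cash offer",
               "meeting schedule attached", "see attached report"]
train_labels = [1, 1, 0, 0]

toy_vectorizer = CountVectorizer()
toy_nb = MultinomialNB().fit(toy_vectorizer.fit_transform(train_texts), train_labels)

new_texts = ["free prize inside", "report for the meeting"]
# One row per message, one column per class (ordered as in toy_nb.classes_).
probabilities = toy_nb.predict_proba(toy_vectorizer.transform(new_texts))

label_names = {0: "ham", 1: "spam"}
for text, row in zip(new_texts, probabilities):
    best = int(toy_nb.classes_[row.argmax()])
    print(f"{text!r}: {label_names[best]} (p={row.max():.2f})")
```

Words that are not in the fitted vocabulary, such as "inside" here, are ignored by `transform`, so they have no effect on the score.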