# Spam Classification Project Using Naive Bayes

1. Objective: Identify whether a message is spam or not based on its content.

2. Dataset: Use a labeled dataset (`spam.csv`) in which each message is tagged "ham" (not spam) or "spam".

3. Preprocessing: Clean the text by converting it to lowercase and removing punctuation and digits; English stop words are dropped later, during vectorization.

4. Feature Extraction: Convert text into numerical features with Bag of Words (BoW) or Term Frequency-Inverse Document Frequency (TF-IDF). This notebook uses BoW word counts via `CountVectorizer`.

5. Algorithm: Apply Multinomial Naive Bayes (`MultinomialNB`), a probabilistic classifier suited to count features such as word frequencies.

6. Data Splitting: Split the dataset into training (80%) and testing (20%) sets for model evaluation.

7. Model Training: Train the Naive Bayes model on the training data to learn patterns in spam and non-spam messages.

8. Evaluation Metrics: Measure performance with accuracy and a confusion matrix, from which precision, recall, and F1 score can be derived.

9. Testing: Test the model on unseen data to validate its ability to classify new messages accurately.

10. Application: Use the trained model to filter spam messages in real-world scenarios like emails or SMS.
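
The steps above can be sketched end to end in a few lines. This is a minimal illustration, not the notebook's own code: the toy messages and labels are hypothetical, and it uses `TfidfVectorizer` (the TF-IDF option from step 4) inside a scikit-learn pipeline, whereas the cells below use `CountVectorizer` on the real dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy messages invented for illustration (1 = spam, 0 = ham).
messages = [
    "win a free prize now",
    "free entry to win cash",
    "claim your free cash prize",
    "are we meeting for lunch today",
    "see you at home tonight",
    "call me when you get home",
]
labels = [1, 1, 1, 0, 0, 0]

# The vectorizer lowercases text by default and drops English stop words;
# the classifier is then trained on the resulting TF-IDF features.
model = make_pipeline(TfidfVectorizer(stop_words="english"), MultinomialNB())
model.fit(messages, labels)

# Classify messages the model has not seen.
new_messages = ["free cash prize", "see you at lunch"]
predictions = model.predict(new_messages)
print(predictions)
```

Bundling the vectorizer and classifier in one pipeline means new messages go through exactly the same transformation as the training data.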

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, confusion_matrix
import warnings
warnings.filterwarnings('ignore')

In [2]:
data = pd.read_csv('spam.csv',encoding='latin-1')

In [3]:
data

| | v1 | v2 | Unnamed: 2 | Unnamed: 3 | Unnamed: 4 |
|---|---|---|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... | | | |
| 1 | ham | Ok lar... Joking wif u oni... | | | |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... | | | |
| 3 | ham | U dun say so early hor... U c already then say... | | | |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... | | | |
| ... | ... | ... | ... | ... | ... |
| 5567 | spam | This is the 2nd time we have tried 2 contact u... | | | |
| 5568 | ham | Will Ì_ b going to esplanade fr home? | | | |
| 5569 | ham | Pity, * was in mood for that. So...any other s... | | | |
| 5570 | ham | The guy did some bitching but I acted like i'd... | | | |


In [4]:
data.isna().sum()

v1               0
v2               0
Unnamed: 2    5522
Unnamed: 3    5560
Unnamed: 4    5566
dtype: int64

In [5]:
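# Keep only the label and text columns; .copy() makes this an independent
# DataFrame so the column assignments below do not act on a slice of the original.
data = data[['v1','v2']].copy()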

In [6]:
data

| | v1 | v2 |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |
| ... | ... | ... |
| 5567 | spam | This is the 2nd time we have tried 2 contact u... |
| 5568 | ham | Will Ì_ b going to esplanade fr home? |
| 5569 | ham | Pity, * was in mood for that. So...any other s... |
| 5570 | ham | The guy did some bitching but I acted like i'd... |


In [7]:
data.columns = ['label','message']

In [8]:
data

| | label | message |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |
| ... | ... | ... |
| 5567 | spam | This is the 2nd time we have tried 2 contact u... |
| 5568 | ham | Will Ì_ b going to esplanade fr home? |
| 5569 | ham | Pity, * was in mood for that. So...any other s... |
| 5570 | ham | The guy did some bitching but I acted like i'd... |


In [9]:
# Encode the labels as integers: ham -> 0, spam -> 1.
data['label'] = data['label'].map({'ham':0,'spam':1})

In [10]:
data

| | label | message |
|---|---|---|
| 0 | 0 | Go until jurong point, crazy.. Available only ... |
| 1 | 0 | Ok lar... Joking wif u oni... |
| 2 | 1 | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | 0 | U dun say so early hor... U c already then say... |
| 4 | 0 | Nah I don't think he goes to usf, he lives aro... |
| ... | ... | ... |
| 5567 | 1 | This is the 2nd time we have tried 2 contact u... |
| 5568 | 0 | Will Ì_ b going to esplanade fr home? |
| 5569 | 0 | Pity, * was in mood for that. So...any other s... |
| 5570 | 0 | The guy did some bitching but I acted like i'd... |


In [11]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5572 entries, 0 to 5571
Data columns (total 2 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   label    5572 non-null   int64 
 1   message  5572 non-null   object
dtypes: int64(1), object(1)
memory usage: 87.2+ KB


In [12]:
import re

In [13]:
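def clean_text(message):
    """Lowercase a message and strip punctuation and digits."""
    message = message.lower()
    # Remove punctuation: anything that is not a word character or whitespace.
    message = re.sub(r'[^\w\s]','',message)
    # Remove runs of digits such as "2" or "2005".
    message = re.sub(r'\d+','',message)
    return message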
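
Because `clean_text` deletes punctuation outright, words joined only by punctuation get merged: "So...any" becomes "soany" in the cleaned output shown below. A possible variant, sketched here under the hypothetical name `clean_text_spaced` (not part of the original notebook), replaces punctuation and digits with a space and then collapses repeated whitespace.

```python
import re

def clean_text_spaced(message):
    """Hypothetical variant of clean_text that preserves word boundaries."""
    message = message.lower()
    # Replace punctuation and underscores with a space instead of deleting them.
    message = re.sub(r"[^\w\s]|_", " ", message)
    # Replace digit runs with a space as well.
    message = re.sub(r"\d+", " ", message)
    # Collapse repeated whitespace and trim both ends.
    return re.sub(r"\s+", " ", message).strip()

# Excerpt of a message that appears in the dataset preview.
sample = "Pity, * was in mood for that. So...any other"
cleaned = clean_text_spaced(sample)
print(cleaned)
```

The trade-off is that contractions split apart ("don't" becomes "don t"), so which version works better for this dataset would need to be checked empirically.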

In [14]:
data['message']= data['message'].apply(clean_text)

In [15]:
data

| | label | message |
|---|---|---|
| 0 | 0 | go until jurong point crazy available only in ... |
| 1 | 0 | ok lar joking wif u oni |
| 2 | 1 | free entry in a wkly comp to win fa cup final... |
| 3 | 0 | u dun say so early hor u c already then say |
| 4 | 0 | nah i dont think he goes to usf he lives aroun... |
| ... | ... | ... |
| 5567 | 1 | this is the nd time we have tried contact u u... |
| 5568 | 0 | will ì_ b going to esplanade fr home |
| 5569 | 0 | pity was in mood for that soany other suggest... |
| 5570 | 0 | the guy did some bitching but i acted like id ... |


In [16]:
vector = CountVectorizer(stop_words='english')

In [17]:
X = vector.fit_transform(data['message'])

In [18]:
X

<5572x8323 sparse matrix of type '<class 'numpy.int64'>'
	with 41188 stored elements in Compressed Sparse Row format>

In [19]:
print(X)

  (0, 3643)	1
  (0, 5341)	1
  (0, 1508)	1
  (0, 459)	1
  (0, 899)	1
  (0, 2856)	1
  (0, 8036)	1
  (0, 3795)	1
  (0, 897)	1
  (0, 1227)	1
  (0, 2817)	1
  (0, 230)	1
  (0, 7802)	1
  (1, 4907)	1
  (1, 3831)	1
  (1, 3611)	1
  (1, 7930)	1
  (1, 4938)	1
  (2, 2556)	1
  (2, 2157)	2
  (2, 7987)	1
  (2, 1342)	1
  (2, 7944)	1
  (2, 2286)	2
  (2, 1564)	1
  :	:
  (5567, 2050)	1
  (5567, 5404)	1
  (5567, 885)	1
  (5568, 3151)	1
  (5568, 2778)	1
  (5568, 8314)	1
  (5568, 2543)	1
  (5568, 2185)	1
  (5569, 4499)	1
  (5569, 5265)	1
  (5569, 6498)	1
  (5569, 6870)	1
  (5570, 2556)	1
  (5570, 3300)	1
  (5570, 3945)	1
  (5570, 7852)	1
  (5570, 1800)	1
  (5570, 2673)	1
  (5570, 933)	1
  (5570, 3431)	1
  (5570, 2920)	1
  (5570, 61)	1
  (5570, 693)	1
  (5571, 7409)	1
  (5571, 5959)	1


In [20]:
y = data['label']

In [21]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
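
Note that the vectorizer above was fitted on every message before the split, so words that occur only in test messages still shape the vocabulary. One common alternative, sketched here with hypothetical toy messages, splits the raw text first and fits the vectorizer on the training portion only. The `stratify` argument is an addition of this sketch, not something the notebook uses; it keeps the spam/ham ratio the same in both parts.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

# Hypothetical toy messages invented for illustration (1 = spam, 0 = ham).
texts = [
    "win free cash now", "free prize entry today",
    "claim cash prize", "urgent free offer",
    "lunch at noon tomorrow", "running late sorry",
    "movie night friday", "home by dinner",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Split the raw text before any feature extraction.
text_train, text_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels
)

vec = CountVectorizer(stop_words="english")
X_train = vec.fit_transform(text_train)  # vocabulary learned from training text only
X_test = vec.transform(text_test)        # unseen words are simply ignored

print(X_train.shape, X_test.shape)
```

Both matrices share the same columns, so a model trained on `X_train` can score `X_test` directly.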

In [22]:
nb = MultinomialNB()

In [23]:
nb.fit(X_train,y_train)
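
For reference, the quantities that fitting estimates can be written out. This is the standard multinomial Naive Bayes formulation as documented for scikit-learn's `MultinomialNB`; the symbols are introduced here for illustration and do not appear in the notebook. With $N_{c,w}$ the total count of word $w$ in training messages of class $c$, $N_c = \sum_{w} N_{c,w}$, $|V|$ the vocabulary size (the 8323 columns of `X`), and $\alpha$ the smoothing parameter (`alpha=1.0` by default):

$$
\hat{\theta}_{c,w} = \frac{N_{c,w} + \alpha}{N_c + \alpha\,|V|}
$$

A message with word counts $x_w$ is then assigned the class with the highest log score, where $\hat{P}(c)$ is the fraction of training messages in class $c$:

$$
\hat{y} = \arg\max_{c} \left[ \log \hat{P}(c) + \sum_{w \in V} x_w \log \hat{\theta}_{c,w} \right]
$$

The smoothing term $\alpha$ keeps a word that never appeared in one class from driving that class's probability to zero.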

In [24]:
y_pred = nb.predict(X_test)

In [25]:
y_pred

array([1, 0, 1, ..., 0, 0, 1], dtype=int64)

In [26]:
accuracy = accuracy_score(y_test,y_pred)

In [27]:
accuracy

0.9704035874439462

In [28]:
cm = confusion_matrix(y_test,y_pred)

In [29]:
cm

array([[945,  20],
       [ 13, 137]], dtype=int64)
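
The precision, recall, and F1 score listed in the plan can be read off this confusion matrix. The sketch below assumes scikit-learn's convention that rows are true classes and columns are predicted classes, with 0 = ham and 1 = spam; the variable names are introduced here for illustration.

```python
# Values copied from the confusion matrix above.
tn, fp = 945, 20   # true ham:  predicted ham, predicted spam
fn, tp = 13, 137   # true spam: predicted ham, predicted spam

total = tn + fp + fn + tp

accuracy = (tp + tn) / total
precision = tp / (tp + fp)   # share of predicted spam that really is spam
recall = tp / (tp + fn)      # share of actual spam that was caught
f1 = 2 * precision * recall / (precision + recall)

# Accuracy of always predicting "ham", for comparison.
baseline = (tn + fp) / total

print(f"accuracy  = {accuracy:.4f}")
print(f"precision = {precision:.4f}")
print(f"recall    = {recall:.4f}")
print(f"f1        = {f1:.4f}")
print(f"baseline  = {baseline:.4f}")
```

Because ham makes up most of the test set, the always-ham baseline already scores about 0.87 on accuracy, so precision and recall on the spam class say more about the filter's quality than accuracy alone.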