# Spam Email Detector

## Obtain the Dataset:
Download the "Spam or Ham" dataset from Kaggle (https://www.kaggle.com/uciml/sms-spam-collection-dataset).
Extract the dataset and load it into your Python environment.

## Data Exploration:
Examine the dataset to understand its structure, such as columns and their meanings. Explore the distribution of spam and ham emails.

## Data Preprocessing:
Clean and preprocess the text data. This includes Removing special characters and punctuation.

## Tokenization: Splitting the text into words or tokens.
Removing stopwords.Lemmatization or stemming to reduce words to their base form.


## Feature Extraction:
Convert the text data into numerical features using TF-IDF (Term Frequency-Inverse Document Frequency) vectorization. You can use the TfidfVectorizer from scikit-learn.


## Split the Data:
Split the dataset into a training set and a testing set. Common ratios are 80% for training and 20% for testing.

## Model Selection:
Choose a machine learning algorithm for classification. You can start with simple models like Multinomial Naive Bayes or try more advanced algorithms like Random Forest or Support Vector Machines (SVM).

## Model Training:
Train the selected model using the training data and the TF-IDF features.

## Model Evaluation:
Evaluate the model's performance using various metrics such as accuracy, precision, recall, F1 score, and ROC AUC. You can use scikit-learn's classification_report and confusion_matrix for this purpose.


## Hyperparameter Tuning:
Fine-tune the model by adjusting hyperparameters for better performance. You can use techniques like grid search or random search.

In [2]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Load the dataset
data = pd.read_csv('spam.csv', encoding='latin-1')
# You may need to adjust the column names based on your dataset's structure
X = data['v2']  # Email text
y = data['v1']  # Spam or ham labels

# Data preprocessing and TF-IDF vectorization
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(X)

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create and train a classifier
classifier = MultinomialNB()
classifier.fit(X_train, y_train)

# Make predictions
y_pred = classifier.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred)
confusion = confusion_matrix(y_test, y_pred)

print("Accuracy:", accuracy)
print(report)
print("Confusion Matrix:")
print(confusion)


Accuracy: 0.9623318385650225
              precision    recall  f1-score   support

         ham       0.96      1.00      0.98       965
        spam       1.00      0.72      0.84       150

    accuracy                           0.96      1115
   macro avg       0.98      0.86      0.91      1115
weighted avg       0.96      0.96      0.96      1115

Confusion Matrix:
[[965   0]
 [ 42 108]]


# Classification Algorithm

## Feature Extraction:
Convert the text data into numerical features using TF-IDF (Term Frequency-Inverse Document Frequency) vectorization. You can use the TfidfVectorizer from scikit-learn.

## Model Selection:
Choose a machine learning algorithm for classification. You can start with simple models like Multinomial Naive Bayes or try more advanced algorithms like Random Forest or Support Vector Machines (SVM).


## Model Training:
Train the selected model using the training data and the TF-IDF features.


## Model Evaluation:
Evaluate the model's performance using various metrics such as accuracy, precision, recall, F1 score, and ROC AUC. You can use scikit-learn's classification_report and confusion_matrix for this purpose.

In [4]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report


# Create and train a classifier
classifier = MultinomialNB()
classifier.fit(X_train, y_train)

# Make predictions
y_pred = classifier.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred)

print("Accuracy:", accuracy)
print(report)


Accuracy: 0.9623318385650225
              precision    recall  f1-score   support

         ham       0.96      1.00      0.98       965
        spam       1.00      0.72      0.84       150

    accuracy                           0.96      1115
   macro avg       0.98      0.86      0.91      1115
weighted avg       0.96      0.96      0.96      1115



In [5]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report



In [6]:

# Naive Bayes classifier
nb_classifier = MultinomialNB()
nb_classifier.fit(X_train, y_train)
nb_pred = nb_classifier.predict(X_test)



In [7]:
# SVM classifier
svm_classifier = SVC(kernel='linear')
svm_classifier.fit(X_train, y_train)
svm_pred = svm_classifier.predict(X_test)

# Evaluate Naive Bayes classifier
nb_accuracy = accuracy_score(y_test, nb_pred)
nb_report = classification_report(y_test, nb_pred)

print("Naive Bayes Classifier Results:")
print("Accuracy:", nb_accuracy)
print(nb_report)

# Evaluate SVM classifier
svm_accuracy = accuracy_score(y_test, svm_pred)
svm_report = classification_report(y_test, svm_pred)

print("\nSVM Classifier Results:")
print("Accuracy:", svm_accuracy)
print(svm_report)


Naive Bayes Classifier Results:
Accuracy: 0.9623318385650225
              precision    recall  f1-score   support

         ham       0.96      1.00      0.98       965
        spam       1.00      0.72      0.84       150

    accuracy                           0.96      1115
   macro avg       0.98      0.86      0.91      1115
weighted avg       0.96      0.96      0.96      1115


SVM Classifier Results:
Accuracy: 0.979372197309417
              precision    recall  f1-score   support

         ham       0.98      1.00      0.99       965
        spam       0.98      0.86      0.92       150

    accuracy                           0.98      1115
   macro avg       0.98      0.93      0.95      1115
weighted avg       0.98      0.98      0.98      1115



# Clasification Using NLP

In [9]:
import numpy as np
import pandas as pd
data.head()

Unnamed: 0,v1,v2,Unnamed: 2,Unnamed: 3,Unnamed: 4
0,ham,"Go until jurong point, crazy.. Available only ...",,,
1,ham,Ok lar... Joking wif u oni...,,,
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,,,
3,ham,U dun say so early hor... U c already then say...,,,
4,ham,"Nah I don't think he goes to usf, he lives aro...",,,


In [10]:
spam_data = data.drop(['Unnamed: 2', 'Unnamed: 3', 'Unnamed: 4'], axis=1)

In [11]:
spam_data.head()

Unnamed: 0,v1,v2
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


In [12]:
#converting the value of ham and spam into 1 or 0
from sklearn.preprocessing import LabelEncoder
encoder = LabelEncoder()
spam_data['v1'] = encoder.fit_transform(spam_data['v1'])

In [13]:
#validation things worked so far
spam_data.head()
spam_data.isnull().sum()

v1    0
v2    0
dtype: int64

In [14]:
#check the duplicate values
spam_data.duplicated().sum()

403

In [15]:
#removing the duplicate value
spam_data = spam_data.drop_duplicates(keep = 'first')
spam_data.duplicated().sum()

0

In [16]:
spam_data.head()

Unnamed: 0,v1,v2
0,0,"Go until jurong point, crazy.. Available only ..."
1,0,Ok lar... Joking wif u oni...
2,1,Free entry in 2 a wkly comp to win FA Cup fina...
3,0,U dun say so early hor... U c already then say...
4,0,"Nah I don't think he goes to usf, he lives aro..."


In [17]:
spam_data['num_character'] = spam_data['v2'].apply(len)
!pip  install nltk
import nltk
nltk.download('punkt')
spam_data['num_words'] = spam_data['v2'].apply(lambda x:len(nltk.word_tokenize(x)))
spam_data['num_sentence'] = spam_data['v2'].apply(lambda x : len(nltk.sent_tokenize(x)))
spam_data.head()





[nltk_data] Downloading package punkt to C:\Users\my
[nltk_data]     pc\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt.zip.


Unnamed: 0,v1,v2,num_character,num_words,num_sentence
0,0,"Go until jurong point, crazy.. Available only ...",111,24,2
1,0,Ok lar... Joking wif u oni...,29,8,2
2,1,Free entry in 2 a wkly comp to win FA Cup fina...,155,37,2
3,0,U dun say so early hor... U c already then say...,49,13,1
4,0,"Nah I don't think he goes to usf, he lives aro...",61,15,1


## DATA PROCESSING

In [18]:
#here we are going to convert the text data into suitable format
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
nltk.download('stopwords')
import string

def transformTexts(text):
    text = text.lower()
    #tokenizing the text based on the words in the text
    text = nltk.word_tokenize(text)
    #including all those characters which are alphabet or number
    y = []
    for i in text:
        if i.isalnum:
            y.append(i)
            
    #storing the value of y into the vaiable text
    text = y[:]
    y.clear()
    
    #Removing the unnecessary words such as 'how are you ' or something which is not importent 
    #down here the stopwords.words('english') is nothing but all the lame set of words as described above
    for i in text:
        if i not in stopwords.words('english') and i not in string.punctuation:
            y.append(i)
            
    #the last step is nothing but steming in the steming we basically convert every verb form into the simple form such as
    #if we have dancing or danced we will convert them into dance and so on for this need to import something  hehehehe
    
    text = y[:]
    y.clear()
    ps = PorterStemmer()
    
    for i in text:
        y.append(ps.stem(i))
    
    text = y[:]
    y.clear()
    
    return " ".join(text)

[nltk_data] Downloading package stopwords to C:\Users\my
[nltk_data]     pc\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.


In [19]:
transformTexts('Hi Shehriar how are you? Did you eat my salad?And if you are dancing tell me I danced very well last time')

'hi shehriar eat salad danc tell danc well last time'

Now I am going to run this function for whole of the text we have in our dataset and then create a new column to add transformed text there

In [20]:
spam_data['transformed'] = spam_data['v2'].apply(transformTexts)

In [21]:
spam_data.head()

Unnamed: 0,v1,v2,num_character,num_words,num_sentence,transformed
0,0,"Go until jurong point, crazy.. Available only ...",111,24,2,go jurong point crazi .. avail bugi n great wo...
1,0,Ok lar... Joking wif u oni...,29,8,2,ok lar ... joke wif u oni ...
2,1,Free entry in 2 a wkly comp to win FA Cup fina...,155,37,2,free entri 2 wkli comp win fa cup final tkt 21...
3,0,U dun say so early hor... U c already then say...,49,13,1,u dun say earli hor ... u c alreadi say ...
4,0,"Nah I don't think he goes to usf, he lives aro...",61,15,1,nah n't think goe usf live around though


now we are going to figure the importent or some of the most used words from the spam text

In [24]:
!pip install wordcloud
from wordcloud import WordCloud




