# Spam Ham Detection Using BERT and TensorFlow

### <u>Project Summary</u>

### <u>GitHub Link</u>
[Click Here](https://github.com/ajitmane36/spam-ham-detection-bert-tensorflow.git)

### <u>Problem Statement</u>

- The dataset consists of emails labeled as spam or ham (non-spam). The goal of this project is to build a model with BERT and TensorFlow that predicts, from the content of an email, whether it is spam. The approach is to fine-tune a pre-trained BERT model so that legitimate emails reach the inbox while spam is filtered out.

### <u>Data Description</u>

- **text**: The content of the email.
- **label**: Indicates whether the email is spam or not. The raw file stores this as the strings `Spam` and `Ham`; the target convention is spam (1), ham (0).
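Since the previews in this notebook show the label stored as text, a numeric target has to be derived before modeling. The following is a minimal sketch of one way to do that, using a toy frame in place of the real data; the column name `label_num` and the dictionary `label_map` are hypothetical names introduced only for illustration.

```python
import pandas as pd

# Toy stand-in for the real dataset (rows are illustrative only)
toy_df = pd.DataFrame({
    "label": ["Spam", "Ham", "Spam"],
    "text": ["win a prize now", "meeting at noon", "cheap loans today"],
})

# Map the string labels to the numeric convention: Spam -> 1, Ham -> 0
label_map = {"Spam": 1, "Ham": 0}

# `label_num` is a hypothetical column name, not one from the original notebook
toy_df["label_num"] = toy_df["label"].map(label_map)

print(toy_df[["label", "label_num"]])
```

`Series.map` returns `NaN` for any value missing from the dictionary, so an unexpected label spelling would show up as a missing value rather than pass silently.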

In [7]:
# Importing necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

# filter warnings
import warnings
warnings.filterwarnings('ignore')

In [8]:
# Dataset Loading
df=pd.read_csv(r"C:\Users\ajitm\Downloads\DS Projects\Deep Larning Projects\1. Text Classification Using BERT & Tensorflow\spam_emails_data.csv")
# Preview with 'label' as the index (the result is not assigned, so df itself is unchanged)
df.set_index('label')

| label | text |
|---|---|
| Spam | viiiiiiagraaaa\nonly for the ones that want to... |
| Ham | got ice thought look az original message ice o... |
| Spam | yo ur wom an ne eds an escapenumber in ch ma n... |
| Spam | start increasing your odds of success & live s... |
| Ham | author jra date escapenumber escapenumber esca... |
| ... | ... |
| Ham | on escapenumber escapenumber escapenumber rob ... |
| Spam | we have everything you need escapelong cialesc... |
| Ham | hi quick question say i have a date variable i... |
| Spam | thank you for your loan request which we recie... |


In [9]:
# First five observations
df.head()

| | label | text |
|---|---|---|
| 0 | Spam | viiiiiiagraaaa\nonly for the ones that want to... |
| 1 | Ham | got ice thought look az original message ice o... |
| 2 | Spam | yo ur wom an ne eds an escapenumber in ch ma n... |
| 3 | Spam | start increasing your odds of success & live s... |
| 4 | Ham | author jra date escapenumber escapenumber esca... |


In [10]:
# Last five observations
df.tail()

| | label | text |
|---|---|---|
| 193847 | Ham | on escapenumber escapenumber escapenumber rob ... |
| 193848 | Spam | we have everything you need escapelong cialesc... |
| 193849 | Ham | hi quick question say i have a date variable i... |
| 193850 | Spam | thank you for your loan request which we recie... |
| 193851 | Ham | this is an automatically generated delivery st... |


#### <u>Data Inspection</u>

In [12]:
# Shape of dataset
print(f'Dataset has {df.shape[0]} observations and {df.shape[1]} columns.')

Dataset has 193852 observations and 2 columns.


In [13]:
# Dataset columns
print(df.columns.tolist())

['label', 'text']


In [14]:
# Basic information of dataset
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 193852 entries, 0 to 193851
Data columns (total 2 columns):
 #   Column  Non-Null Count   Dtype 
---  ------  --------------   ----- 
 0   label   193852 non-null  object
 1   text    193850 non-null  object
dtypes: object(2)
memory usage: 3.0+ MB


In [15]:
# Basic description of dataset
df.groupby('label').describe()

| label | text count | text unique | text top | text freq |
|---|---|---|---|---|
| Ham | 102159 | 102159 | got ice thought look az original message ice o... | 1 |
| Spam | 91691 | 91691 | viiiiiiagraaaa\nonly for the ones that want to... | 1 |


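- The dataset has 102159 Ham observations and 91691 Spam observations with non-null text. Together these account for 193850 of the 193852 rows; the remaining two rows have missing text and are not counted by `describe()`.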
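The class balance follows directly from these two counts. As a small worked calculation (plain Python, with the counts copied from the `groupby` output above and variable names chosen only for illustration):

```python
# Counts copied from the groupby('label').describe() output
ham_count = 102159
spam_count = 91691

# Total number of labeled emails with non-null text
total = ham_count + spam_count

# Share of each class in the labeled data
ham_share = ham_count / total
spam_share = spam_count / total

print(f"Total labeled texts: {total}")
print(f"Ham share:  {ham_share:.2%}")
print(f"Spam share: {spam_share:.2%}")
```

The two classes are close to evenly split, so plain accuracy is a reasonable first metric for this data.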

In [17]:
# Checking for duplicate rows (df.duplicated() compares whole rows, so the same count is repeated for every column)
duplicates_df=pd.DataFrame({'columns':df.columns, 'number_of_duplicates': df.duplicated().sum()}).sort_values(by='number_of_duplicates', ascending=False)
print(duplicates_df)
print(f'Dataset has {df.duplicated().sum()} duplicate rows.')

  columns  number_of_duplicates
0   label                     0
1    text                     0
Dataset has 0 duplicate rows.
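Note that `df.duplicated()` flags rows that repeat an earlier row across all columns, which is not the same as counting repeated values inside a single column. The sketch below contrasts the two on a toy frame; the data and variable names are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

# Toy frame: the third row repeats the first row exactly
toy_df = pd.DataFrame({
    "label": ["Spam", "Ham", "Spam", "Spam"],
    "text": ["win a prize", "see you at noon", "win a prize", "cheap loans"],
})

# Row-level duplicates: rows matching an earlier row in every column
row_duplicates = int(toy_df.duplicated().sum())

# Per-column duplicates: values repeating an earlier value within one column
per_column_duplicates = {
    col: int(toy_df[col].duplicated().sum()) for col in toy_df.columns
}

print(f"Duplicate rows: {row_duplicates}")
print(f"Duplicates per column: {per_column_duplicates}")
```

For a two-class `label` column the per-column count is naturally large, so the row-level count is usually the more meaningful check for this dataset.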


In [18]:
# Checking missing values per column and in total
null_df=pd.DataFrame({'columns': df.columns, 'num_of_nulls': df.isna().sum()})
print(null_df)
print(f'Dataset has {df.isna().sum().sum()} null values.')

      columns  num_of_nulls
label   label             0
text     text             2
Dataset has 2 null values.
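With only two missing `text` values out of 193852 rows, one common option is simply to drop those rows before tokenization. The following is a minimal sketch of that option on a toy frame; it is an assumption about a reasonable next step, not necessarily what the original notebook does, and `cleaned_df` is a hypothetical name.

```python
import pandas as pd

# Toy frame with one missing text value (illustrative only)
toy_df = pd.DataFrame({
    "label": ["Spam", "Ham", "Ham"],
    "text": ["win a prize", None, "see you at noon"],
})

# Total number of missing cells across all columns
total_nulls = int(toy_df.isna().sum().sum())

# Drop rows whose 'text' is missing and rebuild a clean 0..n-1 index
cleaned_df = toy_df.dropna(subset=["text"]).reset_index(drop=True)

print(f"Missing values before: {total_nulls}")
print(f"Rows before: {len(toy_df)}, rows after: {len(cleaned_df)}")
```

Restricting `dropna` to `subset=["text"]` keeps the intent explicit: a row is removed only when the email body itself is missing.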
