
# **E-mail Classification**

**Goal**: Classify emails as either spam or not spam (ham).  
**Importance**: Knowing in advance whether an email is spam protects us from scams carried out over email. If an email is spam, it is likely fraudulent, so we can ignore it.




# **Importing Libraries**

In [6]:
import pandas as pd
import numpy as np
import nltk

# **Data Preprocessing**

1 - In this project we use two datasets: 'combined_data', containing 83448 rows and 2 columns, and 'spam_data', containing 5572 rows and 5 columns.  
2 - In combined_data, the 'label' column uses 0 for ham and 1 for spam.   


In [7]:
# Read combined_data
combined_data = pd.read_csv('/content/combined_data.csv')

In [8]:
#check top 5 rows of combined data
combined_data.head()

Unnamed: 0,label,text
0,1,ounce feather bowl hummingbird opec moment ala...
1,1,wulvob get your medircations online qnb ikud v...
2,0,computer connection from cnn com wednesday es...
3,1,university degree obtain a prosperous future m...
4,0,thanks for all your answers guys i know i shou...


In [9]:
#shape of combined data
combined_data.shape

(83448, 2)

combined_data has 83448 rows and 2 columns.

In [10]:
#counting the frequency of unique value in label column
combined_data.value_counts('label')

label,count
1,43910
0,39538


Out of 83448 rows, 43910 have label 1 (spam) and 39538 have label 0 (ham).

In [11]:
#check the missing value
combined_data.isnull().sum()

Unnamed: 0,0
label,0
text,0


No missing values are present.

In [12]:
# Read Spam-data
spam_data = pd.read_csv('/content/spam.csv',encoding='latin-1') #latin-1 handles the special characters (such as é, ñ, ü) in this file

In [13]:
spam_data.head()

Unnamed: 0,v1,v2,Unnamed: 2,Unnamed: 3,Unnamed: 4
0,ham,"Go until jurong point, crazy.. Available only ...",,,
1,ham,Ok lar... Joking wif u oni...,,,
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,,,
3,ham,U dun say so early hor... U c already then say...,,,
4,ham,"Nah I don't think he goes to usf, he lives aro...",,,


In [14]:
spam_data.isnull().sum()

Unnamed: 0,0
v1,0
v2,0
Unnamed: 2,5522
Unnamed: 3,5560
Unnamed: 4,5566


spam_data has 3 unnecessary columns, 'Unnamed: 2', 'Unnamed: 3' and 'Unnamed: 4', which are almost entirely empty. We drop them because we do not need them.

In [15]:
#Remove the unnecessary column
spam_data = spam_data.drop(['Unnamed: 2', 'Unnamed: 3', 'Unnamed: 4'], axis=1) #axis=1 drops columns rather than rows

In [16]:
spam_data.head()

Unnamed: 0,v1,v2
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


In [17]:
spam_data.shape

(5572, 2)

This dataset now contains 5572 rows and 2 columns.

In [18]:
#Rename column v1 and v2 to label and text
spam_data = spam_data.rename(columns={'v1': 'label', 'v2': 'text'})

Here we rename the columns 'v1' and 'v2' to 'label' and 'text' so that they match combined_data.

In [19]:
spam_data.head()

Unnamed: 0,label,text
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


In [20]:
#Replace word ham and spam with 0 and 1.
spam_data['label'] = spam_data['label'].replace({'ham': 0, 'spam': 1})

Here we convert 'ham' to 0 and 'spam' to 1 so that both datasets use the same label encoding.

In [21]:
spam_data.head()

Unnamed: 0,label,text
0,0,"Go until jurong point, crazy.. Available only ..."
1,0,Ok lar... Joking wif u oni...
2,1,Free entry in 2 a wkly comp to win FA Cup fina...
3,0,U dun say so early hor... U c already then say...
4,0,"Nah I don't think he goes to usf, he lives aro..."


In [22]:
spam_data.value_counts('label')

label,count
0,4825
1,747


spam_data contains 4825 rows with label 0 and 747 rows with label 1.

In [23]:
spam_data.shape

(5572, 2)

In [24]:
#combine both the dataset
final_data = pd.concat([combined_data, spam_data], ignore_index=True, axis=0).drop_duplicates()

In this step we combine both datasets and then remove the duplicate rows from final_data.
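The combine-and-deduplicate step can be tried on a small scale. The sketch below uses two hypothetical toy frames (`first` and `second` are illustrative names, not part of the notebook) that share one identical row.

```python
import pandas as pd

# Two toy frames with the same columns, standing in for the real datasets.
first = pd.DataFrame({"label": [1, 0], "text": ["win a prize now", "see you at lunch"]})
second = pd.DataFrame({"label": [0, 1], "text": ["see you at lunch", "free entry today"]})

# Stack the rows, renumber the index, then drop rows identical in every column.
merged = pd.concat([first, second], ignore_index=True, axis=0).drop_duplicates()

print(merged.shape)
```

Note that `drop_duplicates()` keeps the index labels of the surviving rows, so the index can have gaps afterwards. Calling `reset_index(drop=True)` is one way to renumber it if a contiguous index is needed.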

In [25]:
final_data.shape

(88617, 2)

The final dataset has 88617 rows and 2 columns; 403 of the 89020 combined rows were dropped as duplicates.

In [26]:
final_data.value_counts('label')

label,count
1,44563
0,44054


final_data contains 44563 rows with label 1 and 44054 rows with label 0, so the two classes are nearly balanced.

# **Now we perform NLP preprocessing so that the model can read the data more easily and perform better.**

# **Importing necessary libraries**

In [27]:
import re
import nltk
from nltk.tokenize import word_tokenize
nltk.download('stopwords')
nltk.download('punkt')
from nltk.corpus import stopwords

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


# **Tokenization**

In tokenization we split paragraphs or sentences into individual tokens (words).

In [28]:
final_data['text'] = final_data['text'].astype(str)
tokens=[word_tokenize(text) for text in final_data['text']]

In [29]:
final_data.head()

Unnamed: 0,label,text
0,1,ounce feather bowl hummingbird opec moment ala...
1,1,wulvob get your medircations online qnb ikud v...
2,0,computer connection from cnn com wednesday es...
3,1,university degree obtain a prosperous future m...
4,0,thanks for all your answers guys i know i shou...


In [30]:
#remove html tags
final_data['text'] = final_data['text'].apply(lambda x:re.sub(r'<.*?>','',x))

In the step above we remove HTML tags so that only the plain text remains.

In [31]:
#lowercase the text
final_data['text'] = final_data['text'].str.lower()

In the step above we convert all text to lowercase so that the model does not treat the same word in different cases as different words.

In [32]:
#remove stopwords
stop_words = set(stopwords.words('english'))
final_data['text'] = final_data['text'].apply(lambda x: ' '.join([word for word in x.split() if word not in (stop_words)]))

In the step above we remove stopwords, which are common grammatical words such as 'the', 'of' and 'is' that carry little meaning on their own.

In [33]:
final_data.head()

Unnamed: 0,label,text
0,1,ounce feather bowl hummingbird opec moment ala...
1,1,wulvob get medircations online qnb ikud viagra...
2,0,computer connection cnn com wednesday escapenu...
3,1,university degree obtain prosperous future mon...
4,0,thanks answers guys know checked rsync manual ...
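The three cleaning steps above (tag removal, lowercasing, stopword removal) can be sketched on a single string. This is a minimal illustration: `clean_text` and `DEMO_STOP_WORDS` are hypothetical names, and the tiny hand-picked stopword set stands in for NLTK's full English list.

```python
import re

# A tiny hand-picked stopword set, for illustration only;
# the notebook uses NLTK's full English list instead.
DEMO_STOP_WORDS = {"the", "of", "is", "a", "to", "your", "for"}

def clean_text(text):
    """Strip HTML tags, lowercase, and drop stopwords."""
    text = re.sub(r"<.*?>", "", text)  # remove anything between < and >
    text = text.lower()
    kept = [word for word in text.split() if word not in DEMO_STOP_WORDS]
    return " ".join(kept)

cleaned = clean_text("<p>Claim YOUR prize</p> for the <b>Free</b> entry")
print(cleaned)  # claim prize free entry
```

Lowercasing has to come before the stopword filter, because the stopword list is lowercase. Since the filter splits on whitespace only, a stopword with attached punctuation (for example `about.`) is not removed.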


# **Stemming**

In this step we reduce each word to its base form, so that related words map to the same stem. This helps reduce the number of unique words.

In [34]:
#Convert each word to its root form with stemming
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()
final_data['text'] = final_data['text'].apply(lambda x: ' '.join([stemmer.stem(word) for word in x.split()]))


In [35]:
final_data.head()

Unnamed: 0,label,text
0,1,ounc feather bowl hummingbird opec moment alab...
1,1,wulvob get medirc onlin qnb ikud viagra escape...
2,0,comput connect cnn com wednesday escapenumb ma...
3,1,univers degre obtain prosper futur money earn ...
4,0,thank answer guy know check rsync manual would...
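To see the stemmer on individual words, the sketch below applies the same `PorterStemmer` to a few words taken from the rows shown above. The word list is chosen for illustration only.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Words that appear in the first rows of the dataset before stemming.
words = ["university", "computer", "connection", "thanks"]
stems = [stemmer.stem(word) for word in words]

print(dict(zip(words, stems)))
```

The stems need not be dictionary words ('univers', 'comput'). What matters is that different forms of a word collapse to the same string.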


In [36]:
final_data.shape

(88617, 2)

# **TF-IDF**

TF-IDF stands for 'Term Frequency-Inverse Document Frequency'. It converts text into numeric vectors in which every word receives a weight: words that appear in only a few documents get more importance, and words that appear in most documents get less.

In [37]:
#Term Frequency-Inverse Document Frequency
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer(max_features=1000) #keep only the 1000 most frequent terms
X = tfidf.fit_transform(final_data['text']).toarray()
y = final_data['label']
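The weighting can be worked through by hand on a toy corpus. The sketch below assumes the smoothed formula that scikit-learn documents as the `TfidfVectorizer` default, `idf = ln((1 + n) / (1 + df)) + 1` followed by L2 normalisation. The three documents and the helper names `idf` and `tfidf_vector` are hypothetical.

```python
import math

# Toy corpus (hypothetical), already cleaned and lowercased.
docs = [
    "free prize claim free",
    "meeting notes attached",
    "claim your free prize",
]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

def idf(term):
    """Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1."""
    df = sum(1 for tokens in tokenized if term in tokens)
    return math.log((1 + n_docs) / (1 + df)) + 1

def tfidf_vector(tokens):
    """Raw term count times IDF, scaled so the vector has unit L2 length."""
    weights = {term: tokens.count(term) * idf(term) for term in set(tokens)}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}

vector = tfidf_vector(tokenized[2])
print(vector)
```

In the third document every word occurs once, yet 'your' receives the largest weight because it appears in only one document, while 'free', 'prize' and 'claim' each appear in two.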

# **Splitting the data into Training and Testing**

In [38]:
from sklearn.model_selection import train_test_split

In [39]:
# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# **Model Implementation**

**With data exploration and preprocessing complete, we now apply classification models to the dataset for prediction.**

**The list below describes the classification algorithms we used for prediction.**

**1-Logistic Regression:** Predicts the probability of a class (yes/no) based on input features.

**2-SVC (Support Vector Classifier):** Finds the best line (or boundary) to separate different classes in the data.

**3-Gaussian Naive Bayes:** Assumes that features follow a normal distribution and uses that to classify data.

**4-Bernoulli Naive Bayes:** Classifies data where features are binary (0 or 1).

**5-Multinomial Naive Bayes:** Used for data like word counts (common in text classification).

**6-Decision Tree:** Makes decisions by splitting the data into branches based on feature values.

**7-Random Forest:** Combines multiple decision trees to make more accurate predictions.

**8-K-Nearest Neighbors:** Classifies data by looking at the closest (most similar) points.

**9-XGBoost:** A powerful method that combines many small decision trees to improve performance.

**Evaluation Metrics:**

**Accuracy:** How many predictions were correct.  
**Precision:** How many positive predictions were actually correct.  
**Recall:** How many actual positives were correctly predicted.  
**F1 Score:** A balance between precision and recall.  
**Confusion Matrix:** Shows how many predictions were correct or wrong for each class.

In [40]:
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.naive_bayes import BernoulliNB
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score,confusion_matrix

**Logistic Regression**

In [41]:
model=LogisticRegression()
model.fit(X_train,y_train)
y_pred=model.predict(X_test)
print(accuracy_score(y_test,y_pred))
print(confusion_matrix(y_test,y_pred))
print(precision_score(y_test,y_pred))
print(recall_score(y_test,y_pred))
print(f1_score(y_test,y_pred))

0.9453847889866848
[[8251  504]
 [ 464 8505]]
0.9440559440559441
0.9482662504181069
0.9461564133941484
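The four scores printed above can be reproduced from the confusion matrix alone. The sketch below assumes scikit-learn's layout for `confusion_matrix`, where rows are true labels and columns are predicted labels with label 0 first, so the matrix reads `[[TN, FP], [FN, TP]]`.

```python
# Confusion matrix printed for logistic regression: [[TN, FP], [FN, TP]].
tn, fp = 8251, 504
fn, tp = 464, 8505

accuracy = (tp + tn) / (tp + tn + fp + fn)  # share of all predictions that are correct
precision = tp / (tp + fp)                  # share of predicted spam that is really spam
recall = tp / (tp + fn)                     # share of real spam that was caught
f1 = 2 * precision * recall / (precision + recall)

print(accuracy, precision, recall, f1)
```

For a spam filter, the 504 false positives are legitimate emails flagged as spam and the 464 false negatives are spam that slipped through. Which error matters more depends on the application.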


# **Naive Bayes**

**1-GaussianNB**

In [42]:
gnb=GaussianNB()
gnb.fit(X_train,y_train)
y_pred=gnb.predict(X_test)
print(accuracy_score(y_test,y_pred))
print(confusion_matrix(y_test,y_pred))
print(precision_score(y_test,y_pred))
print(recall_score(y_test,y_pred))
print(f1_score(y_test,y_pred))

0.9065109456104716
[[7543 1212]
 [ 445 8524]]
0.8755135579293344
0.9503846582673654
0.9114140604116546


**2-BernoulliNB**

In [43]:
#BernoulliNB
bnb=BernoulliNB()
bnb.fit(X_train,y_train)
y_pred=bnb.predict(X_test)
print(accuracy_score(y_test,y_pred))
print(confusion_matrix(y_test,y_pred))
print(precision_score(y_test,y_pred))
print(recall_score(y_test,y_pred))
print(f1_score(y_test,y_pred))

0.8490747009704356
[[6679 2076]
 [ 599 8370]]
0.8012636415852958
0.933214405173375
0.8622199330414628


**3-MultinomialNB**

In [44]:
#MultinomialNB
mnb=MultinomialNB()
mnb.fit(X_train,y_train)
y_pred=mnb.predict(X_test)
print(accuracy_score(y_test,y_pred))
print(confusion_matrix(y_test,y_pred))
print(precision_score(y_test,y_pred))
print(recall_score(y_test,y_pred))
print(f1_score(y_test,y_pred))

0.9050440081245769
[[7610 1145]
 [ 538 8431]]
0.8804302422723476
0.9400156093209945
0.9092477756807765


**Support Vector Machine**

**Decision Tree Classifier**

In [46]:
model=DecisionTreeClassifier()
model.fit(X_train,y_train)
y_pred=model.predict(X_test)
print(accuracy_score(y_test,y_pred))
print(confusion_matrix(y_test, y_pred))
print(precision_score(y_test, y_pred))
print(recall_score(y_test, y_pred))
print(f1_score(y_test, y_pred))

0.9344391785150079
[[8204  551]
 [ 611 8358]]
0.938152430126838
0.9318764633738432
0.9350039154267815


**Random Forest Classifier**

In [47]:
model=RandomForestClassifier()
model.fit(X_train,y_train)
y_pred=model.predict(X_test)
print(accuracy_score(y_test,y_pred))
print(confusion_matrix(y_test, y_pred))
print(precision_score(y_test, y_pred))
print(recall_score(y_test, y_pred))
print(f1_score(y_test, y_pred))

0.9665425411870909
[[8463  292]
 [ 301 8668]]
0.9674107142857142
0.966439959861746
0.9669250934240616


**KNeighbors Classifier**

In [48]:
model=KNeighborsClassifier()
model.fit(X_train,y_train)
y_pred=model.predict(X_test)
print(accuracy_score(y_test,y_pred))
print(confusion_matrix(y_test, y_pred))
print(precision_score(y_test, y_pred))
print(recall_score(y_test, y_pred))
print(f1_score(y_test, y_pred))

0.8972015346422929
[[8579  176]
 [1646 7323]]
0.9765302040272036
0.8164789831642324
0.8893611853291232


**XGBOOST Classifier**

In [49]:
model=XGBClassifier()
model.fit(X_train,y_train)
y_pred=model.predict(X_test)
print(accuracy_score(y_test,y_pred))
print(confusion_matrix(y_test, y_pred))
print(precision_score(y_test, y_pred))
print(recall_score(y_test, y_pred))
print(f1_score(y_test, y_pred))

0.9608440532611149
[[8455  300]
 [ 394 8575]]
0.9661971830985916
0.9560709109153752
0.9611073750280208


# **Among all these classifiers, "XGBOOST Classifier" and "Random Forest Classifier" perform best, with 96.08% and 96.65% accuracy respectively.**
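For a side-by-side view, the accuracies printed in the cells above can be collected and sorted. The dictionary below simply copies those printed values; `reported_accuracy` and `ranking` are illustrative names.

```python
# Test-set accuracies copied from the outputs printed in this notebook.
reported_accuracy = {
    "Logistic Regression": 0.9453847889866848,
    "GaussianNB": 0.9065109456104716,
    "BernoulliNB": 0.8490747009704356,
    "MultinomialNB": 0.9050440081245769,
    "Decision Tree": 0.9344391785150079,
    "Random Forest": 0.9665425411870909,
    "K-Nearest Neighbors": 0.8972015346422929,
    "XGBoost": 0.9608440532611149,
}

# Sort from best to worst accuracy.
ranking = sorted(reported_accuracy.items(), key=lambda item: item[1], reverse=True)
for name, accuracy in ranking:
    print(f"{name:<22}{accuracy:.2%}")
```

Accuracy is only one view. K-Nearest Neighbors, for example, has the highest precision of all models (about 0.977) but the lowest recall (about 0.816), so it misses more spam than the others.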

In [67]:
from xgboost import XGBClassifier
import re
from nltk.corpus import stopwords

# Assuming 'X_train' and 'y_train' are defined

model=XGBClassifier()

sample_email = [
    "Congratulations! You've won $1 million. Please contact us to claim your prize: [Phone Number]",
    "Hi, this is a legitimate email from your bank. There's nothing to worry about.",
    "Your order has been shipped! Track it here: [Tracking Link]",
    "URGENT! Your account is about to expire. Click here to renew: [Suspicious Link]",
    "Hey there! Just wanted to say hi and check in. How's it going?"
]


# Preprocess each email individually within the list
processed_emails = []
for email in sample_email:
    #remove html tags
    email = re.sub(r'<.*?>','',email)

    #lowercase the text
    email = email.lower()

    #remove stopwords
    stop_words = set(stopwords.words('english'))
    email = ' '.join([word for word in email.split() if word not in stop_words])

    processed_emails.append(email)

# Transform the processed emails using the same TF-IDF vectorizer
sample_email_vector = tfidf.transform(processed_emails)

# Train the model, then predict on the sample emails
model.fit(X_train, y_train)
predictions = model.predict(sample_email_vector)

# Print results with labels
for i, email in enumerate(processed_emails):
    if predictions[i] == 1:
        print(f"Email {i+1}: Classified as SPAM - {email}")
    else:
        print(f"Email {i+1}: Classified as NOT SPAM - {email}")

Email 1: Classified as NOT SPAM - congratulations! $1 million. please contact us claim prize: [phone number]
Email 2: Classified as NOT SPAM - hi, legitimate email bank. there's nothing worry about.
Email 3: Classified as NOT SPAM - order shipped! track here: [tracking link]
Email 4: Classified as NOT SPAM - urgent! account expire. click renew: [suspicious link]
Email 5: Classified as NOT SPAM - hey there! wanted say hi check in. how's going?


In [68]:
from sklearn.ensemble import RandomForestClassifier
import re
from nltk.corpus import stopwords

# Assuming 'X_train' and 'y_train' are defined

model=RandomForestClassifier()

sample_email = [
    "Congratulations! You've won $1 million. Please contact us to claim your prize: [Phone Number]",
    "Hi, this is a legitimate email from your bank. There's nothing to worry about.",
    "Your order has been shipped! Track it here: [Tracking Link]",
    "URGENT! Your account is about to expire. Click here to renew: [Suspicious Link]",
    "Hey there! Just wanted to say hi and check in. How's it going?"
]


# Preprocess each email individually within the list
processed_emails = []
for email in sample_email:
    #remove html tags
    email = re.sub(r'<.*?>','',email)

    #lowercase the text
    email = email.lower()

    #remove stopwords
    stop_words = set(stopwords.words('english'))
    email = ' '.join([word for word in email.split() if word not in stop_words])

    processed_emails.append(email)

# Transform the processed emails using the same TF-IDF vectorizer
sample_email_vector = tfidf.transform(processed_emails)

# Train the model, then predict on the sample emails
model.fit(X_train, y_train)
predictions = model.predict(sample_email_vector)

# Print results with labels
for i, email in enumerate(processed_emails):
    if predictions[i] == 1:
        print(f"Email {i+1}: Classified as SPAM - {email}")
    else:
        print(f"Email {i+1}: Classified as NOT SPAM - {email}")

Email 1: Classified as SPAM - congratulations! $1 million. please contact us claim prize: [phone number]
Email 2: Classified as SPAM - hi, legitimate email bank. there's nothing worry about.
Email 3: Classified as SPAM - order shipped! track here: [tracking link]
Email 4: Classified as SPAM - urgent! account expire. click renew: [suspicious link]
Email 5: Classified as NOT SPAM - hey there! wanted say hi check in. how's going?


In [72]:
from sklearn.linear_model import LogisticRegression
model=LogisticRegression()

sample_email =  "Hi, this is a legitimate email from your bank. There's nothing to worry about."

# Preprocess the sample email
#remove html tags
sample_email = re.sub(r'<.*?>','',sample_email)

#lowercase the text
sample_email = sample_email.lower()

#remove stopwords
stop_words = set(stopwords.words('english'))
sample_email = ' '.join([word for word in sample_email.split() if word not in stop_words])

# Transform the sample email using the same TF-IDF vectorizer
sample_email_vector = tfidf.transform([sample_email])

# Train the model, then predict on the sample email
model.fit(X_train, y_train)
prediction = model.predict(sample_email_vector)

if prediction[0] == 1:
    print("The sample email is classified as SPAM.")
else:
    print("The sample email is classified as NOT SPAM.")


The sample email is classified as SPAM.
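The prediction cells above clean the sample emails without the stemming step that was applied to the training text, so some words may not match the stemmed vocabulary learned by `tfidf`. This may contribute to the inconsistent predictions. One possible way to keep both paths aligned is to put all cleaning in a single function and hand it to the vectorizer inside a scikit-learn pipeline. The sketch below is only an illustration under stated assumptions: `preprocess`, `DEMO_STOP_WORDS` and the four training texts are hypothetical, and the small stopword set stands in for NLTK's English list.

```python
import re

from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Small stand-in stopword set so the sketch runs without NLTK downloads;
# the notebook itself uses stopwords.words('english').
DEMO_STOP_WORDS = {"the", "a", "to", "is", "your", "you", "at", "for", "and"}
stemmer = PorterStemmer()

def preprocess(text):
    """Same steps for training and new text: tags, case, stopwords, stems."""
    text = re.sub(r"<.*?>", "", text).lower()
    words = [w for w in text.split() if w not in DEMO_STOP_WORDS]
    return " ".join(stemmer.stem(w) for w in words)

# Toy training data (hypothetical), 1 = spam and 0 = ham.
train_texts = [
    "Claim your free prize now",
    "Winner! Free entry to claim cash prizes",
    "Are we meeting for lunch today",
    "Please review the attached meeting notes",
]
train_labels = [1, 1, 0, 0]

# Passing preprocess to the vectorizer means fit and predict share one code path.
pipeline = make_pipeline(
    TfidfVectorizer(preprocessor=preprocess),
    LogisticRegression(),
)
pipeline.fit(train_texts, train_labels)

predictions = pipeline.predict(["Claiming free prizes", "Lunch meeting notes"])
print(predictions)
```

With this arrangement the raw email text goes straight into `pipeline.predict`, so a cleaning step cannot be forgotten at prediction time. Four training sentences are far too few for a useful model; the point is only the shared preprocessing path.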
