
# **E-mail Classification**

**Goal**: Classify emails as either spam or not spam (ham).  
**Importance**: If we know in advance whether an email is spam, we can avoid being trapped by email scams. A message flagged as spam is likely to be fraudulent, so we can ignore it.




# **Importing Libraries**

In [None]:
import pandas as pd
import numpy as np
import nltk

# **Data Preprocessing**

1 - In this project we use two datasets: 'combined_data', containing 83448 rows and 2 columns, and 'spam_data', containing 5572 rows and 5 columns.  
2 - In combined_data, the 'label' column uses 0 for ham and 1 for spam.   


In [None]:
# Read combined_data
combined_data = pd.read_csv('/content/combined_data.csv')

In [None]:
#check top 5 rows of combined data
combined_data.head()

Unnamed: 0,label,text
0,1,ounce feather bowl hummingbird opec moment ala...
1,1,wulvob get your medircations online qnb ikud v...
2,0,computer connection from cnn com wednesday es...
3,1,university degree obtain a prosperous future m...
4,0,thanks for all your answers guys i know i shou...


In [None]:
#shape of combined data
combined_data.shape

(83448, 2)

combined_data has 83448 rows and 2 columns.

In [None]:
#counting the frequency of unique value in label column
combined_data.value_counts('label')

label,count
1,43910
0,39538


Out of 83448 rows, 43910 have label 1 (spam) and 39538 have label 0 (ham).
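The counts above can also be read as proportions, which makes the class balance easier to judge. A minimal sketch that rebuilds the two reported counts by hand (the `counts` Series is constructed here for illustration and is not read from the CSV):

```python
import pandas as pd

# Label counts as reported above for combined_data (1 = spam, 0 = ham).
counts = pd.Series({1: 43910, 0: 39538}, name="count")

# Divide by the total to turn counts into fractions of the dataset.
proportions = counts / counts.sum()
print(proportions.round(4))
```

Roughly 52.6% of the rows are spam, so combined_data is close to balanced.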

In [None]:
#check the missing value
combined_data.isnull().sum()

Unnamed: 0,0
label,0
text,0


No missing values are present in combined_data.

In [None]:
# Read Spam-data
spam_data = pd.read_csv('/content/spam.csv',encoding='latin-1')

In [None]:
spam_data.head()

Unnamed: 0,v1,v2,Unnamed: 2,Unnamed: 3,Unnamed: 4
0,ham,"Go until jurong point, crazy.. Available only ...",,,
1,ham,Ok lar... Joking wif u oni...,,,
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,,,
3,ham,U dun say so early hor... U c already then say...,,,
4,ham,"Nah I don't think he goes to usf, he lives aro...",,,


In [None]:
spam_data.isnull().sum()

Unnamed: 0,0
v1,0
v2,0
Unnamed: 2,5522
Unnamed: 3,5560
Unnamed: 4,5566


spam_data has three unnecessary columns, 'Unnamed: 2', 'Unnamed: 3' and 'Unnamed: 4'. They are almost entirely empty, so we drop them because we do not need them.

In [None]:
#Remove the unnecessary column
spam_data = spam_data.drop(['Unnamed: 2', 'Unnamed: 3', 'Unnamed: 4'], axis=1)
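Instead of naming the columns by hand, mostly empty columns can be found from their share of missing values. The sketch below uses a hypothetical miniature frame (`toy`) and an assumed threshold of 50%. Neither comes from the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of spam_data: two real columns plus one nearly empty column.
toy = pd.DataFrame({
    "v1": ["ham", "spam", "ham", "ham"],
    "v2": ["ok", "win cash", "see you", "call me"],
    "Unnamed: 2": [np.nan, np.nan, "stray", np.nan],
})

# Fraction of missing values in each column.
null_fraction = toy.isnull().mean()

# Drop every column where more than half of the values are missing
# (the 0.5 threshold is an assumption chosen for this example).
mostly_empty = null_fraction[null_fraction > 0.5].index
cleaned = toy.drop(columns=mostly_empty)
print(list(cleaned.columns))
```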

In [None]:
spam_data.head()

Unnamed: 0,v1,v2
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


In [None]:
spam_data.shape

(5572, 2)

This dataset now contains 5572 rows and 2 columns.

In [None]:
#Rename column v1 and v2 to label and text
spam_data = spam_data.rename(columns={'v1': 'label', 'v2': 'text'})

Here we rename the columns 'v1' and 'v2' to 'label' and 'text', matching the column names in combined_data.

In [None]:
spam_data.head()

Unnamed: 0,label,text
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


In [None]:
#Replace word ham and spam with 0 and 1.
spam_data['label'] = spam_data['label'].replace({'ham': 0, 'spam': 1})

Here we convert 'ham' to 0 and 'spam' to 1, so that both datasets use the same label encoding.
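An alternative way to do the same encoding is `Series.map`, which turns any value missing from the mapping into NaN and so makes unexpected labels easy to detect. A small sketch on made-up labels (the `labels` Series is hypothetical):

```python
import pandas as pd

# Hypothetical label column with the same two string values as spam_data.
labels = pd.Series(["ham", "spam", "ham"])
mapping = {"ham": 0, "spam": 1}

# map() returns NaN for any value that is not a key of the mapping.
encoded = labels.map(mapping)

# Fail early if an unexpected label slipped through.
if encoded.isna().any():
    raise ValueError("found labels outside the mapping")

encoded = encoded.astype(int)
print(encoded.tolist())
```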

In [None]:
spam_data.head()

Unnamed: 0,label,text
0,0,"Go until jurong point, crazy.. Available only ..."
1,0,Ok lar... Joking wif u oni...
2,1,Free entry in 2 a wkly comp to win FA Cup fina...
3,0,U dun say so early hor... U c already then say...
4,0,"Nah I don't think he goes to usf, he lives aro..."


In [None]:
spam_data.value_counts('label')

label,count
0,4825
1,747


spam_data contains 4825 rows with label 0 and 747 rows with label 1.

In [None]:
spam_data.shape

(5572, 2)

In [None]:
# Combine both datasets and drop duplicate rows
final_data = pd.concat([combined_data, spam_data], ignore_index=True, axis=0).drop_duplicates()

In this step we combine both datasets and then remove the duplicate rows from final_data.
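To see what the concatenation and de-duplication do, here is a minimal sketch on two tiny made-up frames (`a` and `b` are hypothetical). `drop_duplicates()` removes a row only when every column matches an earlier row.

```python
import pandas as pd

# Two hypothetical frames that share one identical (label, text) row.
a = pd.DataFrame({"label": [0, 1], "text": ["hello", "win money"]})
b = pd.DataFrame({"label": [1, 0], "text": ["win money", "see you"]})

# Stack the rows and renumber the index from 0.
stacked = pd.concat([a, b], ignore_index=True, axis=0)

# Keep the first occurrence of each fully identical row.
deduped = stacked.drop_duplicates()

removed = len(stacked) - len(deduped)
print(removed)
```

In the notebook the two datasets have 83448 + 5572 = 89020 rows together, and 88617 remain after de-duplication, so 403 duplicate rows were dropped.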

In [None]:
final_data.shape

(88617, 2)

The final dataset has 88617 rows and 2 columns.

In [None]:
final_data.value_counts('label')

label,count
1,44563
0,44054


final_data contains 44563 rows with label 1 and 44054 rows with label 0.

# **Next, we apply NLP preprocessing to clean the text so that a model can interpret it more easily and perform better**

# **Importing necessary libraries**

In [None]:
import re
import nltk
from nltk.tokenize import word_tokenize
nltk.download('stopwords')
nltk.download('punkt')
from nltk.corpus import stopwords

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


# **Tokenization**

In tokenization, we split paragraphs or sentences into individual tokens (words).

In [None]:
final_data['text'] = final_data['text'].astype(str)
tokens=[word_tokenize(text) for text in final_data['text']]
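To illustrate what splitting text into tokens looks like, the sketch below uses a simple regular expression as a stand-in. It is not NLTK's `word_tokenize`, which applies more elaborate rules, so real output can differ. The `simple_tokenize` helper is hypothetical.

```python
import re

def simple_tokenize(text: str) -> list[str]:
    """Split text into runs of word characters and runs of punctuation."""
    return re.findall(r"\w+|[^\w\s]+", text)

# One of the SMS messages shown earlier in spam_data.
tokens_demo = simple_tokenize("Ok lar... Joking wif u oni...")
print(tokens_demo)
```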

In [None]:
final_data.head()

Unnamed: 0,label,text
0,1,ounce feather bowl hummingbird opec moment ala...
1,1,wulvob get your medircations online qnb ikud v...
2,0,computer connection from cnn com wednesday es...
3,1,university degree obtain a prosperous future m...
4,0,thanks for all your answers guys i know i shou...


In [None]:
#remove html tags
final_data['text'] = final_data['text'].apply(lambda x:re.sub(r'<.*?>','',x))

Above, we strip HTML tags so that only the plain text remains.
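The pattern `<.*?>` matches the shortest run from `<` to the next `>`. A short sketch on made-up strings shows its effect, and one limitation: by default `.` does not match a newline, so a tag that spans two lines is left in place unless `re.DOTALL` is passed.

```python
import re

# A tag on one line is removed, leaving the text between tags.
raw = "<p>Click <b>here</b> to win</p>"
clean = re.sub(r"<.*?>", "", raw)
print(clean)

# A tag broken across lines is not matched by default.
multiline = "<a\nhref='x'>link</a>"
partial = re.sub(r"<.*?>", "", multiline)

# With re.DOTALL, '.' also matches the newline and the whole tag is removed.
full = re.sub(r"<.*?>", "", multiline, flags=re.DOTALL)
print(repr(partial), repr(full))
```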

In [None]:
#lowercase the text
final_data['text'] = final_data['text'].str.lower()

In the step above we convert all text to lowercase, so that the model does not treat differently cased forms of the same word as different words.

In [None]:
#remove stopwords
stop_words = set(stopwords.words('english'))
final_data['text'] = final_data['text'].apply(lambda x: ' '.join([word for word in x.split() if word not in (stop_words)]))

In the step above we remove stopwords, that is, common grammatical words such as 'the', 'of', 'is', etc.
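The filtering can be sketched with a small hand-written stopword set standing in for NLTK's English list (the `demo_stop_words` set and the `remove_stopwords` helper are hypothetical). The match is exact, so text should be lowercased first, as the notebook does.

```python
# Small hypothetical stand-in for stopwords.words('english').
demo_stop_words = {"the", "of", "is", "a", "to", "for", "all", "your", "i"}

def remove_stopwords(text: str) -> str:
    """Keep only the whitespace-separated words that are not stopwords."""
    return " ".join(w for w in text.split() if w not in demo_stop_words)

filtered = remove_stopwords("thanks for all your answers guys")
print(filtered)

# 'The' with a capital letter is not in the set, so it survives.
not_lowercased = remove_stopwords("The offer")
print(not_lowercased)
```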

In [None]:
final_data.head()

Unnamed: 0,label,text
0,1,ounce feather bowl hummingbird opec moment ala...
1,1,wulvob get medircations online qnb ikud viagra...
2,0,computer connection cnn com wednesday escapenu...
3,1,university degree obtain prosperous future mon...
4,0,thanks answers guys know checked rsync manual ...


# **Stemming**

In this step we reduce each word to its base (stem) form, so that related word forms map to the same token. This helps reduce the number of unique words.

In [None]:
#Convert a word into their root form with the help of Stemming
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()
final_data['text'] = final_data['text'].apply(lambda x: ' '.join([stemmer.stem(word) for word in x.split()]))
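A quick way to see what the Porter stemmer does is to run it on a few words taken from the rows shown earlier. This sketch assumes NLTK is installed. `PorterStemmer` itself needs no extra data download.

```python
from nltk.stem import PorterStemmer

demo_stemmer = PorterStemmer()

# Words taken from the sample rows printed before stemming.
words = ["computer", "connection", "university", "degree", "answers"]
stems = [demo_stemmer.stem(w) for w in words]

for word, stem in zip(words, stems):
    print(f"{word} -> {stem}")
```

Stems such as 'comput' or 'univers' are not dictionary words. The stemmer only strips suffixes by rule, which is enough to merge related forms.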


In [None]:
final_data.head()

Unnamed: 0,label,text
0,1,ounc feather bowl hummingbird opec moment alab...
1,1,wulvob get medirc onlin qnb ikud viagra escape...
2,0,comput connect cnn com wednesday escapenumb ma...
3,1,univers degre obtain prosper futur money earn ...
4,0,thank answer guy know check rsync manual would...


In [None]:
final_data.shape

(88617, 2)

# **TF-IDF**

TF-IDF stands for 'Term Frequency-Inverse Document Frequency'. It converts each document into a numeric vector with one weight per word: a word's weight grows with how often it appears in that document and shrinks with how many documents contain it, so rare words count for more than very common ones.

In [None]:
#Term Frequency-Inverse Document Frequency
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer()
X = tfidf.fit_transform(final_data['text'])
y = final_data['label']
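The weighting can be checked on a tiny made-up corpus (`docs` is hypothetical). With scikit-learn's default settings, a word that occurs in every document gets the lowest IDF, and each document vector is scaled to unit length.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Three hypothetical documents: 'now' is in all of them, 'free' in only one.
docs = ["free prize now", "meeting now", "lunch meeting now"]

vectorizer = TfidfVectorizer()
X_toy = vectorizer.fit_transform(docs)

vocab = vectorizer.vocabulary_   # word -> column index
idf = vectorizer.idf_            # one IDF value per column

print(X_toy.shape)               # (documents, unique words)
print("idf(free):", idf[vocab["free"]])
print("idf(now): ", idf[vocab["now"]])

# Each row is L2-normalized by default, so its length is 1.
row_norms = np.linalg.norm(X_toy.toarray(), axis=1)
```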

# **Splitting the data into Training and Testing**

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
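In the cells above, the vectorizer is fitted on all rows before the split, so the test documents also contribute to the vocabulary and IDF weights. A common alternative is to split the raw text first and fit the vectorizer on the training part only. The sketch below shows this with a scikit-learn pipeline on made-up data. The texts, labels and variable names are hypothetical and are not part of the notebook's workflow.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus: 1 = spam, 0 = ham.
texts = [
    "win free prize now", "free cash offer",
    "claim your free reward", "cheap meds online now",
    "meeting at noon tomorrow", "see you at lunch",
    "project report attached", "call me when you arrive",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Split the raw text first; stratify keeps both classes in each part.
train_texts, test_texts, train_labels, test_labels = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels
)

# The pipeline fits TF-IDF on the training text only, then the classifier.
pipe = make_pipeline(TfidfVectorizer(), LogisticRegression())
pipe.fit(train_texts, train_labels)

# At prediction time the test text is transformed with the training vocabulary.
preds = pipe.predict(test_texts)
print(list(preds))
```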

# **Model Implementation**

**We have completed the preprocessing, and now we apply classification models to the dataset for prediction**

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.naive_bayes import BernoulliNB
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score, f1_score

**Logistic Regression**

In [None]:
model=LogisticRegression()
model.fit(X_train,y_train)
y_pred=model.predict(X_test)
print(accuracy_score(y_test,y_pred))
print(confusion_matrix(y_test,y_pred))
print(precision_score(y_test,y_pred))
print(recall_score(y_test,y_pred))
print(f1_score(y_test,y_pred))
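The five printed values are all derived from the same four counts in the confusion matrix. A small worked sketch on made-up predictions (`y_true_demo` and `y_pred_demo` are hypothetical) shows how they relate, with spam (1) as the positive class.

```python
from sklearn.metrics import (
    accuracy_score, confusion_matrix, f1_score, precision_score, recall_score
)

# Hypothetical true labels and predictions for six messages.
y_true_demo = [0, 0, 1, 1, 1, 0]
y_pred_demo = [0, 1, 1, 1, 0, 0]

# For binary labels the matrix is [[TN, FP], [FN, TP]].
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)   # share of correct predictions
precision = tp / (tp + fp)                   # flagged spam that is really spam
recall = tp / (tp + fn)                      # real spam that was flagged
f1 = 2 * precision * recall / (precision + recall)

# The hand-computed values agree with scikit-learn's functions.
print(accuracy, accuracy_score(y_true_demo, y_pred_demo))
print(precision, precision_score(y_true_demo, y_pred_demo))
print(recall, recall_score(y_true_demo, y_pred_demo))
print(f1, f1_score(y_true_demo, y_pred_demo))
```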

# **Naive Bayes**

**1-GaussianNB**

In [None]:
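# GaussianNB does not accept sparse input, so the TF-IDF matrices are converted
# to dense arrays first; on the full vocabulary this may need a lot of memory.
gnb=GaussianNB()
gnb.fit(X_train.toarray(),y_train)
y_pred=gnb.predict(X_test.toarray())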
print(accuracy_score(y_test,y_pred))
print(confusion_matrix(y_test,y_pred))
print(precision_score(y_test,y_pred))
print(recall_score(y_test,y_pred))
print(f1_score(y_test,y_pred))

**2-BernoulliNB**

In [None]:
#BernoulliNB
bnb=BernoulliNB()
bnb.fit(X_train,y_train)
y_pred=bnb.predict(X_test)
print(accuracy_score(y_test,y_pred))
print(confusion_matrix(y_test,y_pred))
print(precision_score(y_test,y_pred))
print(recall_score(y_test,y_pred))
print(f1_score(y_test,y_pred))

**3-MultinomialNB**

In [None]:
#MultinomialNB
mnb=MultinomialNB()
mnb.fit(X_train,y_train)
y_pred=mnb.predict(X_test)
print(accuracy_score(y_test,y_pred))
print(confusion_matrix(y_test,y_pred))
print(precision_score(y_test,y_pred))
print(recall_score(y_test,y_pred))
print(f1_score(y_test,y_pred))

**Support Vector Machine**

In [None]:
model = SVC(kernel='linear')
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print(accuracy_score(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
print(precision_score(y_test, y_pred))
print(recall_score(y_test, y_pred))
print(f1_score(y_test, y_pred))

**Decision Tree Classifier**

In [None]:
model=DecisionTreeClassifier()
model.fit(X_train,y_train)
y_pred=model.predict(X_test)
print(accuracy_score(y_test,y_pred))
print(confusion_matrix(y_test, y_pred))
print(precision_score(y_test, y_pred))
print(recall_score(y_test, y_pred))
print(f1_score(y_test, y_pred))

**Random Forest Classifier**

In [None]:
model=RandomForestClassifier()
model.fit(X_train,y_train)
y_pred=model.predict(X_test)
print(accuracy_score(y_test,y_pred))
print(confusion_matrix(y_test, y_pred))
print(precision_score(y_test, y_pred))
print(recall_score(y_test, y_pred))
print(f1_score(y_test, y_pred))

**KNeighbors Classifier**

In [None]:
model=KNeighborsClassifier()
model.fit(X_train,y_train)
y_pred=model.predict(X_test)
print(accuracy_score(y_test,y_pred))
print(confusion_matrix(y_test, y_pred))
print(precision_score(y_test, y_pred))
print(recall_score(y_test, y_pred))
print(f1_score(y_test, y_pred))

**XGBOOST Classifier**

In [None]:
model=XGBClassifier()
model.fit(X_train,y_train)
y_pred=model.predict(X_test)
print(accuracy_score(y_test,y_pred))
print(confusion_matrix(y_test, y_pred))
print(precision_score(y_test, y_pred))
print(recall_score(y_test, y_pred))
print(f1_score(y_test, y_pred))
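Since every model above is fitted and scored in the same way, the repeated cells can be folded into one loop that collects the metrics in a table. The sketch below runs on made-up data with three of the classifiers. The texts, labels and the `results` table are hypothetical and say nothing about the scores on the real dataset.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data: 1 = spam, 0 = ham.
demo_train_texts = [
    "win free prize now", "free cash offer",
    "claim your free reward", "cheap meds online now",
    "meeting at noon tomorrow", "see you at lunch",
    "project report attached", "call me when you arrive",
]
demo_train_labels = [1, 1, 1, 1, 0, 0, 0, 0]
demo_test_texts = ["free prize offer", "lunch meeting tomorrow",
                   "claim cash now", "report for the project"]
demo_test_labels = [1, 0, 1, 0]

# Fit TF-IDF on the training text and reuse it for the test text.
demo_vectorizer = TfidfVectorizer()
Xtr = demo_vectorizer.fit_transform(demo_train_texts)
Xte = demo_vectorizer.transform(demo_test_texts)

models = {
    "LogisticRegression": LogisticRegression(),
    "MultinomialNB": MultinomialNB(),
    "DecisionTree": DecisionTreeClassifier(random_state=0),
}

# Train each model and record its scores in one row of the table.
rows = []
for name, clf in models.items():
    clf.fit(Xtr, demo_train_labels)
    pred = clf.predict(Xte)
    rows.append({
        "model": name,
        "accuracy": accuracy_score(demo_test_labels, pred),
        "f1": f1_score(demo_test_labels, pred, zero_division=0),
    })

results = pd.DataFrame(rows).sort_values("f1", ascending=False).reset_index(drop=True)
print(results)
```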