# Text Classification using RNN

## Business Problem

Classifying messages as spam or ham (legitimate) is central to electronic communication channels such as email and text messaging. As the volume of digital communication grows, accurate classification matters for several reasons:

1. **User Experience:** Efficient spam filtering ensures a better user experience by preventing unwanted and irrelevant content from reaching users' inboxes. Users are less likely to be inundated with unsolicited messages, leading to a cleaner and more organized communication environment.

2. **Productivity:** Spam filtering contributes to increased productivity as users spend less time sifting through irrelevant or potentially harmful messages. It allows individuals and organizations to focus on legitimate and valuable communication, improving overall workflow efficiency.

3. **Security:** Many spam messages are associated with phishing attempts, scams, and malware distribution. Effective spam filtering acts as a frontline defense against cyber threats by identifying and isolating malicious content, thereby enhancing the security posture of individuals and organizations.

4. **Resource Optimization:** By reducing the volume of spam that enters email servers and messaging platforms, resources such as storage, bandwidth, and processing power can be optimized. This is particularly crucial for large-scale email providers and enterprises dealing with vast amounts of communication data.

5. **Brand Reputation:** For businesses and organizations, spam filtering is essential for maintaining a positive brand reputation. Preventing spam from reaching customers' inboxes ensures that legitimate messages are not overlooked or associated with undesirable content, preserving trust and credibility.

6. **Regulatory Compliance:** Compliance with data protection and privacy regulations often requires organizations to implement measures to protect users from unwanted communication. Adequate spam and ham classification helps in meeting regulatory requirements and avoiding potential legal implications.

## Data  
**Context**  
The SMS Spam Collection is a set of tagged SMS messages collected for SMS spam research. It contains 5,572 English SMS messages, each tagged as either ham (legitimate) or spam.  

**Acknowledgement**  
The original dataset is available from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/dataset/228/sms+spam+collection).

# Importing Dependencies

In [41]:
import pandas as pd
import numpy as np
import os
import re  # used below for the regex-based text cleaning
os.chdir("/content/drive/MyDrive/Colab Notebooks/Text Classification")
from nltk.corpus import stopwords
from nltk import *
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.stem import WordNetLemmatizer
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from tensorflow.keras.models import load_model
from sklearn.metrics import precision_recall_fscore_support, classification_report
from tensorflow.keras.preprocessing.text import Tokenizer
import tensorflow as tf
import pickle
from tensorflow.keras.layers import Embedding, SimpleRNN, Dense, LSTM, Dropout
from tensorflow.keras.models import Sequential
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical
import nltk
from sklearn.preprocessing import LabelEncoder
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

# Preprocessing the Data

In [42]:
# reading the file
file_content = pd.read_csv("spam.csv", encoding = "ISO-8859-1")
file_content

| | v1 | v2 | Unnamed: 2 | Unnamed: 3 | Unnamed: 4 |
|---|---|---|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... | | | |
| 1 | ham | Ok lar... Joking wif u oni... | | | |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... | | | |
| 3 | ham | U dun say so early hor... U c already then say... | | | |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... | | | |
| ... | ... | ... | ... | ... | ... |
| 5567 | spam | This is the 2nd time we have tried 2 contact u... | | | |
| 5568 | ham | Will Ì_ b going to esplanade fr home? | | | |
| 5569 | ham | Pity, * was in mood for that. So...any other s... | | | |
| 5570 | ham | The guy did some bitching but I acted like i'd... | | | |


In [43]:
# checking only the email column
file_content["v2"]

0       Go until jurong point, crazy.. Available only ...
1                           Ok lar... Joking wif u oni...
2       Free entry in 2 a wkly comp to win FA Cup fina...
3       U dun say so early hor... U c already then say...
4       Nah I don't think he goes to usf, he lives aro...
                              ...                        
5567    This is the 2nd time we have tried 2 contact u...
5568                Will Ì_ b going to esplanade fr home?
5569    Pity, * was in mood for that. So...any other s...
5570    The guy did some bitching but I acted like i'd...
5571                           Rofl. Its true to its name
Name: v2, Length: 5572, dtype: object

In [44]:
# removing the stop words
# note: the match is case-sensitive, so capitalized words such as "I" or "The" are kept
stop = stopwords.words("english")
file_content["v2"] = file_content["v2"].apply(
    lambda x: " ".join(x for x in x.split() if x not in stop))

# delete unwanted columns
Email_Data = file_content[['v1', 'v2']]

#rename column names
Email_Data = Email_Data.rename(columns={"v1":"Target", "v2":"Email"})
Email_Data


| | Target | Email |
|---|---|---|
| 0 | ham | Go jurong point, crazy.. Available bugis n gre... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry 2 wkly comp win FA Cup final tkts 2... |
| 3 | ham | U dun say early hor... U c already say... |
| 4 | ham | Nah I think goes usf, lives around though |
| ... | ... | ... |
| 5567 | spam | This 2nd time tried 2 contact u. U å£750 Pound... |
| 5568 | ham | Will Ì_ b going esplanade fr home? |
| 5569 | ham | Pity, * mood that. So...any suggestions? |
| 5570 | ham | The guy bitching I acted like i'd interested b... |


In [45]:
Email_Data.Target.value_counts()

ham     4825
spam     747
Name: Target, dtype: int64
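
The counts above show a clear class imbalance, which affects how accuracy should be read later on. As a rough illustration (plain Python, using only the two counts printed above), the sketch below computes each class share and the accuracy of a trivial classifier that always predicts the majority class. The variable names are illustrative and not part of the notebook.

```python
# Class counts reported by value_counts() above
counts = {"ham": 4825, "spam": 747}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}

# Accuracy of a trivial classifier that always predicts the majority class
majority_label = max(counts, key=counts.get)
baseline_accuracy = counts[majority_label] / total

for label, share in shares.items():
    print(f"{label}: {share:.1%}")
print(f"Always-'{majority_label}' baseline accuracy: {baseline_accuracy:.1%}")
```

Any model trained on this data should be compared against that baseline rather than against 50%, and per-class recall for spam is more informative than overall accuracy.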

In [46]:
# Delete punctuation, convert text to lowercase, and collapse repeated spaces

Email_Data['Email'] = Email_Data['Email'].apply(
    lambda x: re.sub('[!@#$:).;,?&]', '', x.lower()))
Email_Data['Email'] = Email_Data['Email'].apply(
    lambda x: re.sub(' +', ' ', x))
Email_Data['Email'].head(5)


0    go jurong point crazy available bugis n great ...
1                              ok lar joking wif u oni
2    free entry 2 wkly comp win fa cup final tkts 2...
3                  u dun say early hor u c already say
4             nah i think goes usf lives around though
Name: Email, dtype: object
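
The cleaning steps so far (stop-word removal, lowercasing, punctuation stripping, space collapsing) can be bundled into one function so that training and inference share exactly the same logic. This is a minimal sketch: `clean_message` and the tiny `STOP_WORDS` set are hypothetical stand-ins, and the notebook itself uses NLTK's full English stop-word list.

```python
import re

# Tiny stand-in stop-word list (assumption); the notebook uses NLTK's English list
STOP_WORDS = {"until", "only", "in", "a", "to", "so", "the"}

def clean_message(text, stop_words=STOP_WORDS):
    """Apply the notebook's cleaning steps to a single message."""
    # 1. Drop stop words (case-sensitive match, as in the notebook)
    text = " ".join(w for w in text.split() if w not in stop_words)
    # 2. Lowercase and strip the listed punctuation characters
    text = re.sub(r"[!@#$:).;,?&]", "", text.lower())
    # 3. Collapse runs of spaces into a single space
    return re.sub(r" +", " ", text)

print(clean_message("Ok lar... Joking wif u oni..."))
print(clean_message("So...any other suggestions?"))
```

Note that deleting periods without inserting a space merges words joined by an ellipsis ("So...any" becomes "soany"), which matches the behaviour visible in the notebook output.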

In [47]:
# Separating text(input) and target classes
list_sentences_rawdata = Email_Data["Email"].fillna("_na_").values
list_classes = ["Target"]
target = Email_Data[list_classes].values
To_Process = Email_Data[["Email", "Target"]]

In [48]:
target

array([['ham'],
       ['ham'],
       ['spam'],
       ...,
       ['ham'],
       ['ham'],
       ['ham']], dtype=object)

In [49]:
To_Process

| | Email | Target |
|---|---|---|
| 0 | go jurong point crazy available bugis n great ... | ham |
| 1 | ok lar joking wif u oni | ham |
| 2 | free entry 2 wkly comp win fa cup final tkts 2... | spam |
| 3 | u dun say early hor u c already say | ham |
| 4 | nah i think goes usf lives around though | ham |
| ... | ... | ... |
| 5567 | this 2nd time tried 2 contact u u å£750 pound ... | spam |
| 5568 | will ì_ b going esplanade fr home | ham |
| 5569 | pity * mood that soany suggestions | ham |
| 5570 | the guy bitching i acted like i'd interested b... | ham |


In [50]:
# Preparing data for model building
train, test = train_test_split(To_Process, test_size=0.3)

# Define the sequence length and the maximum vocabulary size
# Each message is truncated to this length if longer, or zero-padded if shorter
MAX_SEQUENCE_LENGTH = 300

# Keep only the 20,000 most frequently occurring words
MAX_NB_WORDS = 20000
tokenizer = Tokenizer(num_words=MAX_NB_WORDS)
tokenizer.fit_on_texts(train.Email)
train_sequences = tokenizer.texts_to_sequences(train.Email)
test_sequences = tokenizer.texts_to_sequences(test.Email)

# dictionary containing words and their index
word_index = tokenizer.word_index

# number of unique tokens found in the training messages
print("Found %s unique tokens." % len(word_index))

# pad or truncate the training sequences to MAX_SEQUENCE_LENGTH
train_data = pad_sequences(train_sequences, maxlen=MAX_SEQUENCE_LENGTH)

# pad or truncate the test sequences to MAX_SEQUENCE_LENGTH
test_data = pad_sequences(test_sequences, maxlen=MAX_SEQUENCE_LENGTH)
print(train_data.shape)
print(test_data.shape)

Found 7891 unique tokens.
(3900, 300)
(1672, 300)
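
To make the padding step concrete, here is a small pure-Python sketch of what `pad_sequences` does with its default settings (`padding='pre'`, `truncating='pre'`): zeros are added on the left, and sequences that are too long lose tokens from the front. The helper `pad_pre` is hypothetical and only mimics the default behaviour; it is not the Keras implementation.

```python
def pad_pre(sequences, maxlen, value=0):
    """Left-pad or left-truncate each sequence to exactly maxlen items."""
    padded = []
    for seq in sequences:
        seq = list(seq)
        # Keep only the last maxlen tokens when the sequence is too long
        seq = seq[max(len(seq) - maxlen, 0):]
        # Fill the remaining positions on the left with the padding value
        padded.append([value] * (maxlen - len(seq)) + seq)
    return padded

print(pad_pre([[5, 8], [1, 2, 3, 4, 5, 6]], maxlen=4))
# [[0, 0, 5, 8], [3, 4, 5, 6]]
```

With `MAX_SEQUENCE_LENGTH = 300` and short SMS messages, most rows consist largely of leading zeros, which is why both arrays have exactly 300 columns.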


In [51]:
train_labels = train["Target"]
test_labels = test["Target"]

# Encode the string labels as integers (ham -> 0, spam -> 1)
le = LabelEncoder()
le.fit(train_labels)
train_labels = le.transform(train_labels)
test_labels = le.transform(test_labels)
print(le.classes_)
print(np.unique(train_labels, return_counts=True))
print(np.unique(test_labels, return_counts=True))

['ham' 'spam']
(array([0, 1]), array([3350,  550]))
(array([0, 1]), array([1475,  197]))


In [52]:
# One-hot encode the integer labels (one column per class)
labels_train = to_categorical(np.asarray(train_labels))
labels_test = to_categorical(np.asarray(test_labels))
print('Shape of data tensor:', train_data.shape)
print('Shape of label tensor:', labels_train.shape)
print('Shape of label tensor:', labels_test.shape)

Shape of data tensor: (3900, 300)
Shape of label tensor: (3900, 2)
Shape of label tensor: (1672, 2)
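
The two label-preparation steps can be illustrated without any library: sort the class names to assign integer codes (which is how `LabelEncoder` orders its classes), then turn each code into a one-hot row as `to_categorical` does. The sample labels below are made up for illustration.

```python
# Made-up sample labels, for illustration only
labels = ["ham", "spam", "ham"]

# Step 1: integer-encode, with classes in sorted order
classes = sorted(set(labels))                      # ['ham', 'spam']
class_to_index = {name: i for i, name in enumerate(classes)}
encoded = [class_to_index[name] for name in labels]

# Step 2: one-hot encode, one column per class
one_hot = [[1.0 if i == code else 0.0 for i in range(len(classes))]
           for code in encoded]

print(encoded)   # [0, 1, 0]
print(one_hot)   # [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
```

This is why the label tensors above have two columns: column 0 corresponds to ham and column 1 to spam.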


In [53]:
EMBEDDING_DIM = 100
print(MAX_SEQUENCE_LENGTH)

300


# Building the Model

In [54]:
# model building
print("Training Simple RNN")
model = Sequential()
model.add(Embedding(MAX_NB_WORDS, EMBEDDING_DIM, input_length=MAX_SEQUENCE_LENGTH))
model.add(SimpleRNN(2))
model.add(Dense(2, activation="softmax"))

# categorical cross-entropy matches the one-hot labels and the 2-unit softmax output
model.compile(loss="categorical_crossentropy", optimizer="adam", metrics=["accuracy"])

Training Simple RNN
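
As a sanity check on the architecture, the number of trainable parameters can be worked out by hand. The sketch below assumes the usual Keras parameter counts for these layers (embedding: vocabulary size times dimension; `SimpleRNN`: input kernel plus recurrent kernel plus bias; `Dense`: weights plus bias) and plugs in the values used in this notebook.

```python
# Hyperparameters taken from the notebook cells
vocab_size = 20000      # MAX_NB_WORDS
embedding_dim = 100     # EMBEDDING_DIM
rnn_units = 2           # SimpleRNN(2)
num_classes = 2         # Dense(2)

# Embedding: one vector of length embedding_dim per vocabulary slot
embedding_params = vocab_size * embedding_dim

# SimpleRNN: input kernel + recurrent kernel + bias, per unit
rnn_params = rnn_units * (embedding_dim + rnn_units + 1)

# Dense: one weight per (input, output) pair, plus one bias per output
dense_params = num_classes * (rnn_units + 1)

total_params = embedding_params + rnn_params + dense_params
print(embedding_params, rnn_params, dense_params, total_params)
# 2000000 206 6 2000212
```

Almost all of the capacity sits in the embedding table. Since the tokenizer found only 7,891 unique tokens, a large share of those 20,000 embedding rows is never looked up during training.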


In [55]:
model.fit(train_data, labels_train, batch_size=16, epochs=5, validation_data=(test_data, labels_test))

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.src.callbacks.History at 0x7c9e424d0f10>

# Model Evaluation

In [56]:
# Define class names
class_names = ['ham', 'spam']

# Predictions
predicted_Srnn = model.predict(test_data)

# Converting probabilities to binary predictions
binary_predictions = np.round(predicted_Srnn)

# Calculating precision, recall, fscore, and support
precision, recall, fscore, support = precision_recall_fscore_support(
                              labels_test,
                              binary_predictions,
                              average='weighted')

print('Precision: {}'.format(precision))
print('Recall: {}'.format(recall))
print('F-score: {}'.format(fscore))
print('Support: {}'.format(support))

print("############################")

# Classification Report with class names
report = classification_report(labels_test, binary_predictions, target_names=class_names)
print(report)


Precision: 0.9766587700809911
Recall: 0.9766746411483254
F-score: 0.9757999372274953
Support: None
############################
              precision    recall  f1-score   support

         ham       0.98      1.00      0.99      1475
        spam       0.98      0.82      0.89       197

   micro avg       0.98      0.98      0.98      1672
   macro avg       0.98      0.91      0.94      1672
weighted avg       0.98      0.98      0.98      1672
 samples avg       0.98      0.98      0.98      1672
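
Because the rounded one-hot arrays are scored as a multilabel problem, the report includes `micro avg` and `samples avg` rows. An alternative way to score a two-class softmax model is to take the arg max of each probability row and compare class indices directly. The sketch below shows that idea in plain Python on made-up probabilities; `argmax_labels` and `precision_recall` are hypothetical helpers, not functions from the notebook or from scikit-learn.

```python
def argmax_labels(prob_rows):
    """Return the index of the largest probability in each row."""
    return [max(range(len(row)), key=lambda i: row[i]) for row in prob_rows]

def precision_recall(y_true, y_pred, positive=1):
    """Precision and recall for one class, from true/predicted class indices."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Made-up softmax outputs in [p_ham, p_spam] order, for illustration only
probs = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4], [0.3, 0.7]]
y_true = [0, 1, 1, 1]

y_pred = argmax_labels(probs)
precision, recall = precision_recall(y_true, y_pred, positive=1)
print(y_pred, precision, round(recall, 2))
# [0, 1, 0, 1] 1.0 0.67
```

In this toy case one spam message is missed, so spam precision stays at 1.0 while recall drops, the same pattern as in the report above, where spam recall (0.82) is lower than spam precision (0.98).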



# Deploying the Model on Gradio

In [58]:
model.save('model')

In [59]:
import pickle
# Save the tokenizer to a file
with open('tokenizer.pkl', 'wb') as tokenizer_file:
    pickle.dump(tokenizer, tokenizer_file)


In [None]:
!pip install gradio

In [60]:
import gradio as gr

In [61]:
# Loading the trained model
model = load_model('model')
# Loading the tokenizer from the file
with open('tokenizer.pkl', 'rb') as tokenizer_file:
    tokenizer = pickle.load(tokenizer_file)

def preprocess_input_sequence(input_sequence):
    # preprocessing the input text.
    stop = stopwords.words("english")
    input_sequence = " ".join(x for x in input_sequence.split() if x not in stop)
    input_sequence = re.sub('[!@#$:).;,?&]', '', input_sequence.lower())
    input_sequence = re.sub(' +', ' ', input_sequence)
    tokenized_sequence = tokenizer.texts_to_sequences([input_sequence])
    processed_input_sequence = pad_sequences(tokenized_sequence,
                                    maxlen=300)[0]
    return processed_input_sequence

def predict_sequence(input_sequence):
    # Preprocess the input_sequence
    processed_input = preprocess_input_sequence(input_sequence)

    # Model prediction
    prediction = model.predict(np.array([processed_input]))

    # Converting class probabilities to a class index (0 = ham, 1 = spam)
    predicted_class = int(np.argmax(prediction, axis=1)[0])

    # Mapping class to 'ham' or 'spam'
    result = 'ham' if predicted_class == 0 else 'spam'

    return result

iface = gr.Interface(
    fn=predict_sequence,
    inputs="text",
    outputs="text",  # Output as text for displaying 'ham' or 'spam'
    live=True
)

iface.launch()


Setting queue=True in a Colab notebook requires sharing enabled. Setting `share=True` (you can turn this off by setting `share=False` in `launch()` explicitly).

Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
Running on public URL: https://ed1b69b653648f113f.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from Terminal to deploy to Spaces (https://huggingface.co/spaces)
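
The model returns one probability per class, in the order `['ham', 'spam']` given by `le.classes_`, so the label is the class with the highest probability. As a hedged sketch, the hypothetical helper below turns one probability row into both a label and a per-class confidence dictionary. A dictionary of this shape is what Gradio's `Label` output component accepts, should one want to display confidences instead of plain text; that swap is an assumption here, not something the notebook does.

```python
CLASS_NAMES = ["ham", "spam"]  # same order as le.classes_ in the notebook

def label_with_confidences(prob_row, class_names=CLASS_NAMES):
    """Map one row of class probabilities to (top label, {label: probability})."""
    confidences = {name: float(p) for name, p in zip(class_names, prob_row)}
    top_label = max(confidences, key=confidences.get)
    return top_label, confidences

# Made-up probability row, for illustration only
label, confidences = label_with_confidences([0.03, 0.97])
print(label, confidences)
# spam {'ham': 0.03, 'spam': 0.97}
```

Keeping the class order in a single constant avoids mismatches between the encoder used in training and the mapping used at inference time.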




# End of the Project