**LSTM-Based Multi-Class Email Classification Using TensorFlow and NLTK**

You can find the Dataset at `https://catalog.data.gov/dataset/consumer-complaint-database`

**Importing the Dataset**

The dataset is very large, so it is convenient to store it in Google Drive and mount the drive in Colab.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


**Importing the Required Libraries**

In [None]:
import tensorflow as tf
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
import nltk
from nltk.corpus import stopwords

In [None]:
data_set=pd.read_csv('/content/drive/MyDrive/complaints.csv')



In [None]:
data_set.head()

Unnamed: 0,Date received,Product,Sub-product,Issue,Sub-issue,Consumer complaint narrative,Company public response,Company,State,ZIP code,Tags,Consumer consent provided?,Submitted via,Date sent to company,Company response to consumer,Timely response?,Consumer disputed?,Complaint ID
0,2022-06-01,"Credit reporting, credit repair services, or o...",Credit reporting,Incorrect information on your report,Public record information inaccurate,,,Experian Information Solutions Inc.,NY,13027.0,Servicemember,,Web,2022-06-01,In progress,Yes,,5623393
1,2022-06-01,"Credit reporting, credit repair services, or o...",Credit reporting,Improper use of your report,Credit inquiries on your report that you don't...,,Company has responded to the consumer and the ...,Experian Information Solutions Inc.,FL,33990.0,,Consent not provided,Web,2022-06-01,Closed with non-monetary relief,Yes,,5621702
2,2022-06-08,"Credit reporting, credit repair services, or o...",Credit reporting,Incorrect information on your report,Information belongs to someone else,,,"TRANSUNION INTERMEDIATE HOLDINGS, INC.",FL,33403.0,,,Web,2022-06-08,In progress,Yes,,5645514
3,2022-06-07,"Credit reporting, credit repair services, or o...",Credit reporting,Improper use of your report,Credit inquiries on your report that you don't...,,,"Rocket Mortgage, LLC",GA,31008.0,,,Web,2022-06-08,Closed with explanation,Yes,,5643448
4,2022-06-08,"Credit reporting, credit repair services, or o...",Credit reporting,Incorrect information on your report,Information belongs to someone else,,,"EQUIFAX, INC.",AL,35473.0,,,Web,2022-06-08,In progress,Yes,,5644133


We will use only two columns, **Product** and **Consumer complaint narrative**, so let's extract them from the dataframe.

In [None]:
df1 = data_set[['Product', 'Consumer complaint narrative']].copy()

Next, let's keep only the rows where **Consumer complaint narrative** is not null, and give the two columns shorter names.

In [None]:
df1 = df1[pd.notnull(df1['Consumer complaint narrative'])]
df1.columns = ['Product', 'Consumer_complaint']

Let's take a look at the different categories that we are going to classify our data into

In [None]:
pd.DataFrame(df1.Product.unique()).values

array([['Credit reporting, credit repair services, or other personal consumer reports'],
       ['Credit card or prepaid card'],
       ['Debt collection'],
       ['Mortgage'],
       ['Vehicle loan or lease'],
       ['Checking or savings account'],
       ['Payday loan, title loan, or personal loan'],
       ['Student loan'],
       ['Money transfer, virtual currency, or money service'],
       ['Consumer Loan'],
       ['Bank account or service'],
       ['Payday loan'],
       ['Credit card'],
       ['Credit reporting'],
       ['Other financial service'],
       ['Money transfers'],
       ['Prepaid card'],
       ['Virtual currency']], dtype=object)

Let's make our lives easier by merging older category names into their current equivalents and shortening the longest one.

In [None]:
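# NOTE: the original notebook never defines df2. The outputs below show 10,000 rows,
# so we ASSUME it is a 10,000-row random sample of df1 (the seed is arbitrary).
df2 = df1.sample(n=10000, random_state=0).reset_index(drop=True)
df2.replace({'Product': 
             {'Credit reporting, credit repair services, or other personal consumer reports': 
              'Credit reporting, repair, or other', 
              'Credit reporting': 'Credit reporting, repair, or other',
             'Credit card': 'Credit card or prepaid card',
             'Prepaid card': 'Credit card or prepaid card',
             'Payday loan': 'Payday loan, title loan, or personal loan',
             # the dataset spells this category 'Money transfers' (see the list above)
             'Money transfers': 'Money transfer, virtual currency, or money service',
             'Virtual currency': 'Money transfer, virtual currency, or money service'}}, 
            inplace=True)
pd.DataFrame(df2.Product.unique())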

Unnamed: 0,0
0,Credit card or prepaid card
1,Debt collection
2,"Credit reporting, repair, or other"
3,Vehicle loan or lease
4,Checking or savings account
5,Mortgage
6,Student loan
7,Bank account or service
8,"Money transfer, virtual currency, or money ser..."
9,"Payday loan, title loan, or personal loan"


In [None]:
df2['category_id'] = df2['Product'].factorize()[0]
category_id_df = df2[['Product', 'category_id']].drop_duplicates()
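
`factorize` assigns integer ids in order of first appearance, so it can help to keep lookup tables in both directions for decoding predictions later. The following is a minimal sketch on a toy frame; the `toy` frame and the names `id_to_category` and `category_to_id` are illustrative and not part of the notebook.

```python
import pandas as pd

# Toy stand-in for df2; the real notebook uses the complaint data.
toy = pd.DataFrame({"Product": ["Mortgage", "Debt collection", "Mortgage", "Student loan"]})

# factorize returns (codes, uniques); codes follow order of first appearance.
codes, uniques = pd.factorize(toy["Product"])
toy["category_id"] = codes

# Lookup tables in both directions (hypothetical helper names).
id_to_category = dict(enumerate(uniques))
category_to_id = {name: idx for idx, name in id_to_category.items()}

print(toy)
print(id_to_category)
```

With such a table, a predicted class id can be turned back into a readable product name.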

Stop words are very common words (such as "the" or "is") that search engines typically ignore when indexing or querying. Removing them from our mails shrinks the vocabulary without losing much meaning.

In [None]:
nltk.download('stopwords')
stop=set(stopwords.words('english'))


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [None]:
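def remove_stopwords(text):
  # lower-case first: the NLTK stop-word list is lower-case, so "The" must become "the" to match
  filtered_words=[word.lower() for word in text.split() if word.lower() not in stop]
  return " ".join(filtered_words)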

In [None]:
df2['Consumer_complaint']=df2['Consumer_complaint'].map(remove_stopwords)
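
Stop-word lists such as NLTK's are lower-case, so a token should be lower-cased before the membership test. The sketch below shows the idea on one sentence. To stay runnable without a download it uses a tiny hand-picked set, `STOP_WORDS`, as a stand-in for the NLTK list; that set and the function name `strip_stop_words` are assumptions made for illustration.

```python
# Tiny hand-picked stop-word set standing in for NLTK's English list (assumption).
STOP_WORDS = {"the", "is", "a", "on", "my", "i", "and"}

def strip_stop_words(text, stop_words=STOP_WORDS):
    """Lower-case each token, then drop it if it is a stop word."""
    tokens = (word.lower() for word in text.split())
    return " ".join(word for word in tokens if word not in stop_words)

sample = "The balance on my Experian report is wrong"
cleaned = strip_stop_words(sample)
print(cleaned)
```

Note that splitting on whitespace leaves punctuation attached, so a token like `the,` would not match the stop-word `the`.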

In [None]:
df2

Unnamed: 0.1,Unnamed: 0,Product,Consumer_complaint,category_id
0,1885501,Credit card or prepaid card,received letter mail american express attempti...,0
1,2183466,Debt collection,veteran served twice xxxx xxxx xxxx. medical c...,1
2,655475,Debt collection,portofolio recovery submitted collection repor...,1
3,2527140,"Credit reporting, repair, or other",equifax xxxx xxxx still reporting past due bal...,2
4,1893551,Debt collection,getting notification xxxx debt approximately x...,1
...,...,...,...,...
9995,1330494,"Credit reporting, repair, or other","dear might concern, name "" xxxx xxxx xxxx '', ...",2
9996,1626685,Debt collection,bought car friend back xxxx title transferred ...,1
9997,2065636,Consumer Loan,keep receiving harassing phone calls personal ...,11
9998,1276903,Debt collection,collection acount showing erc. supposing acvou...,1


Let's now count the number of unique words found in the mails, which will help us size the embedding layer of our LSTM model.

For this we will use **Counter** from the **collections** module.

In [None]:
#counting the frequency
from collections import Counter

def count_words(text_col):
  count=Counter()
  for text in text_col.values:
    for word in text.split():
      count[word]+=1
  return count

counted_words=count_words(df2.Consumer_complaint)

In [None]:
uniqueWordCount=len(counted_words)
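
Beyond the raw vocabulary size, it can be useful to look at the most frequent words and at how many words occur only once, since rare words add embedding rows without much training signal. A small sketch on made-up texts (the `texts` list is illustrative data, not from the dataset):

```python
from collections import Counter

# Toy complaints standing in for df2.Consumer_complaint (illustrative data).
texts = [
    "credit report shows wrong balance",
    "debt collector called about old debt",
    "credit card balance wrong",
]

# Counter.update accepts any iterable of tokens, so no inner loop is needed.
word_counts = Counter()
for text in texts:
    word_counts.update(text.split())

vocab_size = len(word_counts)
top_three = word_counts.most_common(3)
singletons = sum(1 for n in word_counts.values() if n == 1)

print(vocab_size, top_three, singletons)
```

If most of the vocabulary turns out to be singletons, passing a smaller `num_words` to the tokenizer is one option to consider.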

Now let's split our data into training and validation sets (80% / 20%).

In [None]:
train_size=int(df2.shape[0]*0.8)

In [None]:
train_data=df2[:train_size]
validate_data=df2[train_size:]

In [None]:
train_sentences=train_data.Consumer_complaint.to_numpy()
train_labels=train_data.category_id.to_numpy()
validate_sentences=validate_data.Consumer_complaint.to_numpy()
validate_labels=validate_data.category_id.to_numpy()
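
The slice above takes the first 80% of rows in their stored order. If the rows happen to be ordered (for example by date or product), the validation set may not resemble the training set, so shuffling before the cut can be safer. A minimal sketch of a reproducible shuffled split; the helper `shuffled_split` and its defaults are hypothetical, and it may be unnecessary if the rows are already in random order.

```python
import random

def shuffled_split(items, train_fraction=0.8, seed=0):
    """Shuffle a copy of `items` reproducibly, then cut it at `train_fraction`."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

rows = list(range(10))  # stand-in for row positions of df2
train_rows, validate_rows = shuffled_split(rows)
print(len(train_rows), len(validate_rows))
```

With pandas, calling `df2.sample(frac=1, random_state=0)` before slicing has a similar effect.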

Now let's tokenize our data, which means assigning an integer ID to every word found in the training mails.

In [None]:
tokenizer=tf.keras.preprocessing.text.Tokenizer(num_words=uniqueWordCount)
tokenizer.fit_on_texts(train_sentences)


In [None]:
word_index=tokenizer.word_index

In [None]:
train_sequence=tokenizer.texts_to_sequences(train_sentences)
validate_sequence=tokenizer.texts_to_sequences(validate_sentences)
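
To make the tokenization step concrete, here is a simplified pure-Python sketch of the idea: ids start at 1 and go to the most frequent words first (0 is left free for padding), and words never seen during fitting are skipped when no out-of-vocabulary token is configured. This is an illustration, not the Keras implementation, which also lower-cases text and strips punctuation; the function names are made up.

```python
from collections import Counter

def build_word_index(texts):
    """Map each word to an id, most frequent first; id 0 is left for padding."""
    counts = Counter(word for text in texts for word in text.split())
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

def to_sequences(texts, word_index):
    """Replace words by ids; words missing from the index are skipped."""
    return [[word_index[w] for w in text.split() if w in word_index] for text in texts]

train_texts = ["debt collector called", "debt never owed"]
toy_index = build_word_index(train_texts)
sequences = to_sequences(["debt owed twice"], toy_index)
print(toy_index, sequences)
```

Here `twice` never appeared in the training texts, so it silently disappears from the validation sequence.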

Since the mails do not all have the same number of words, let's use the `pad_sequences` function to bring them to a common length. (*Note:* we pad with zeros at the end and truncate longer mails to `max_length` tokens.)

In [None]:
max_length=50

In [None]:
train_padded=tf.keras.preprocessing.sequence.pad_sequences(train_sequence,maxlen=max_length,padding='post',truncating='post')
validate_padded=tf.keras.preprocessing.sequence.pad_sequences(validate_sequence,maxlen=max_length,padding='post',truncating='post')
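
The effect of `padding='post'` and `truncating='post'` can be sketched for a single sequence in plain Python. The helper `pad_post` is hypothetical and only mirrors those two settings:

```python
def pad_post(sequence, max_length, pad_value=0):
    """Truncate from the end, then right-pad with `pad_value` to `max_length`."""
    clipped = sequence[:max_length]
    return clipped + [pad_value] * (max_length - len(clipped))

short = pad_post([7, 3], max_length=5)
long = pad_post([1, 2, 3, 4, 5, 6, 7], max_length=5)
print(short, long)
```

With `max_length=50`, only the first 50 tokens of each mail reach the model, so anything later in a long complaint is discarded.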

Now let's build our LSTM model and train it. The output layer has 13 neurons, one per class, with a softmax activation function.

In [None]:
model=tf.keras.models.Sequential()
model.add(tf.keras.layers.Embedding(uniqueWordCount,128,input_length=max_length))
model.add(tf.keras.layers.LSTM(128,dropout=0.1))
model.add(tf.keras.layers.Dense(13,activation='softmax'))
model.summary()

Model: "sequential_2"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_2 (Embedding)     (None, 50, 128)           4499328   
                                                                 
 lstm_2 (LSTM)               (None, 128)               131584    
                                                                 
 dense_2 (Dense)             (None, 13)                1677      
                                                                 
Total params: 4,632,589
Trainable params: 4,632,589
Non-trainable params: 0
_________________________________________________________________
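
The parameter counts in the summary can be reproduced by hand. The sketch below uses the standard formulas for embedding, LSTM, and dense layers; the vocabulary size is not printed in the notebook, so it is inferred here from the embedding parameter count.

```python
embedding_dim = 128
lstm_units = 128
num_classes = 13

# Inferred from the summary: 4,499,328 embedding weights / 128 dimensions.
vocab_size = 4_499_328 // embedding_dim

embedding_params = vocab_size * embedding_dim
# Four gates, each with input weights, recurrent weights, and a bias vector.
lstm_params = 4 * (lstm_units * (embedding_dim + lstm_units) + lstm_units)
# One weight per (LSTM unit, class) pair plus one bias per class.
dense_params = lstm_units * num_classes + num_classes

total_params = embedding_params + lstm_params + dense_params
print(vocab_size, lstm_params, dense_params, total_params)
```

Almost all of the parameters sit in the embedding layer, so the vocabulary size dominates the model size.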


In [None]:
loss='sparse_categorical_crossentropy'
opt=tf.keras.optimizers.Adam(learning_rate=0.001)  # `lr` is a deprecated alias of `learning_rate`
model.compile(loss=loss,optimizer=opt,metrics=['accuracy'])
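
Softmax turns the raw output scores into class probabilities, and sparse categorical cross-entropy is the negative log of the probability given to the true class id. That is why the integer `category_id` labels can be used directly, without one-hot encoding. A small pure-Python sketch with three classes instead of 13 (the scores are made up):

```python
import math

def softmax(logits):
    """Convert raw scores to probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sparse_categorical_crossentropy(true_class, probabilities):
    """Loss for one example: negative log-probability of the true class id."""
    return -math.log(probabilities[true_class])

probs = softmax([2.0, 1.0, 0.1])
loss_if_correct = sparse_categorical_crossentropy(0, probs)
loss_if_wrong = sparse_categorical_crossentropy(2, probs)
print(probs, loss_if_correct, loss_if_wrong)
```

The loss is small when the true class receives a high probability and grows as that probability shrinks.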



In [None]:
model.fit(train_padded,train_labels,epochs=20,validation_data=(validate_padded,validate_labels),verbose=1)

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


<keras.callbacks.History at 0x7f5c92dbad90>

In [None]:
predictions=model.predict(train_padded)

As we can see, the model reaches about 98% accuracy on the training set but only 71% on the validation set, which suggests it is overfitting.

In [None]:
# pick the class with the highest probability for every mail
prediction=np.argmax(predictions,axis=1)

In [None]:
mismatch_count=0
total_count=len(prediction)
for i in range(len(prediction)):
  if prediction[i]!=train_labels[i]:
    mismatch_count+=1
accuracy=((total_count-mismatch_count)/total_count)*100
print("Total mails :{} \n misclassified mails :{} \n Accuracy of Prediction : {}".format(total_count,mismatch_count,accuracy))

Total mails :8000 
 misclassified mails :85 
 Accuracy of Prediction : 98.9375
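
The check above scores mails the model has already seen during training. The same comparison on held-out predictions, broken down per class, is usually more informative. Below is a minimal sketch of such a report; the function `accuracy_report` is hypothetical, and in the notebook its inputs would be `validate_labels` and the arg-max of `model.predict(validate_padded)`.

```python
from collections import defaultdict

def accuracy_report(true_labels, predicted_labels):
    """Return overall accuracy and a per-class accuracy dict (both as fractions)."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for truth, guess in zip(true_labels, predicted_labels):
        totals[truth] += 1
        hits[truth] += int(truth == guess)
    overall = sum(hits.values()) / sum(totals.values())
    per_class = {label: hits[label] / totals[label] for label in totals}
    return overall, per_class

# Made-up labels purely to exercise the function.
overall, per_class = accuracy_report([0, 0, 1, 1, 2], [0, 1, 1, 1, 0])
print(overall, per_class)
```

A class with low per-class accuracy points to categories the model confuses, which the overall figure alone hides.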
