<a href="https://colab.research.google.com/github/gdg-ml-team/DevFest19/blob/master/spam_detector_.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Deep learning for Text Classification

In this notebook I'll use a neural network to build a model that classifies SMS messages as spam or not spam, based on the dataset we give it. If you don't know what a spammy message looks like: it usually contains words like 'win', 'cash', 'money', 'winner', 'free', etc., and it is designed to be noticed and to tempt you into opening it. Sometimes it also contains CAPITAL WORDS and a lot of exclamation marks!!! Our mission here is to train a model to detect spammy messages for us!

Identifying spam messages is **a binary classification problem**, as messages are classified as either 'Spam' or 'Not Spam' and nothing else. It is also **a supervised learning problem**, as we will feed a labelled dataset into the model for it to learn from in order to make future predictions.

## Step 1.1: Understanding our dataset


Import the dataset into a **pandas** dataframe using the `.read_csv()` method. You can access it using the filepath `"spam.csv"`.

The SMS Spam Collection is a set of tagged SMS messages collected for SMS spam research. It contains one set of 5,574 SMS messages in English, each tagged as either ham (legitimate) or spam.

In [0]:
import pandas as pd

In [0]:
df = pd.read_csv("spam.csv", encoding="iso-8859-1")
df.head(10)

Unnamed: 0,v1,v2,Unnamed: 2,Unnamed: 3,Unnamed: 4
0,ham,"Go until jurong point, crazy.. Available only ...",,,
1,ham,Ok lar... Joking wif u oni...,,,
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,,,
3,ham,U dun say so early hor... U c already then say...,,,
4,ham,"Nah I don't think he goes to usf, he lives aro...",,,
5,spam,FreeMsg Hey there darling it's been 3 week's n...,,,
6,ham,Even my brother is not like to speak with me. ...,,,
7,ham,As per your request 'Melle Melle (Oru Minnamin...,,,
8,spam,WINNER!! As a valued network customer you have...,,,
9,spam,Had your mobile 11 months or more? U R entitle...,,,


As we see above, there are five columns, and only two of them hold values: the first is called `v1` and the second `v2`.
So first we keep only the two columns that contain data.

In [0]:
df = df[["v1","v2"]]
df.head()

Unnamed: 0,v1,v2
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


Good, now we have two columns, `v1` and `v2`, but their names are meaningless, so let's rename them to `label` and `message`.

In [0]:
df.rename(columns={"v1":"label","v2":"message"},inplace=True)
df.head()

Unnamed: 0,label,message
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


## Step 1.2: Data Preprocessing
Now that we have a basic understanding of what our dataset looks like, let's convert our labels to binary variables (this is binary classification): 0 to represent 'ham' (not spam) and 1 to represent 'spam', for ease of computation.

A Keras model trained with binary cross-entropy needs numeric targets, and numeric labels also keep performance metrics straightforward to compute later.

In [0]:
df['label'] = df.label.map({"ham":0,"spam":1})
df.shape

(5572, 2)
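One thing worth checking after a mapping like this is that every label was actually covered, because pandas' `map` turns any value missing from the dictionary into `NaN`. The following is a minimal plain-Python sketch of that check on a few made-up labels; `LABEL_TO_ID`, `encode_labels` and `raw_labels` are illustrative names, not part of the notebook.

```python
LABEL_TO_ID = {"ham": 0, "spam": 1}

def encode_labels(labels):
    """Map string labels to 0/1 and fail loudly on anything unexpected."""
    unknown = sorted(set(labels) - set(LABEL_TO_ID))
    if unknown:
        raise ValueError(f"Unmapped labels: {unknown}")
    return [LABEL_TO_ID[label] for label in labels]

raw_labels = ["ham", "ham", "spam", "ham", "spam"]  # made-up sample
encoded = encode_labels(raw_labels)
print(encoded)                      # [0, 0, 1, 0, 1]
print(sum(encoded) / len(encoded))  # share of spam in this sample: 0.4
```

The same idea on the real dataframe would be to count missing values in the `label` column right after the mapping.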

## Step 2.1: Bag of words

What we have in our dataset is a large collection of text data (5,572 rows). Most ML algorithms expect **numerical data** as input, while **our messages are text**.

Here we'd like to introduce the **Bag of Words (BoW)** concept, a term for problems that involve a collection of text data that needs to be worked with. The basic idea of BoW is to take a piece of text and count the frequency of the words in that text. It is important to note that BoW treats each word individually, so the order in which the words occur does not matter.

Using a process which we will go through now, we can convert a collection of documents to a matrix, with each document being a row and each word(token) being the column, and the corresponding (row,column) values being the frequency of occurrence of each word or token in that document.

For example:

Let's say we have 4 documents as follows:


[

'Hello, how are you!',

'Win money, win from home.',

'Call me now',

'Hello, Call you tomorrow?'

]


In [0]:
documents = ['Hello, how are you!',
                'Win money, win from home.',
                'Call me now.',
                'Hello, Call you tomorrow?']

from sklearn.feature_extraction.text import CountVectorizer

count_vector = CountVectorizer()
count_vector.fit(documents) # Learn the vocabulary from the documents (no matrix is returned yet)
count_vector.get_feature_names() # The words that were found (scikit-learn 1.2+ replaces this method with get_feature_names_out())

['are',
 'call',
 'from',
 'hello',
 'home',
 'how',
 'me',
 'money',
 'now',
 'tomorrow',
 'win',
 'you']

In [0]:
doc_array = count_vector.transform(documents) # Transform the text into its Bag of Words matrix
doc_array.toarray()

array([[1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1],
       [0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 2, 0],
       [0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0],
       [0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1]], dtype=int64)

Let's convert it to a more readable form:

In [0]:
frequency_matrix = pd.DataFrame(doc_array.toarray(), columns=count_vector.get_feature_names())
frequency_matrix

Unnamed: 0,are,call,from,hello,home,how,me,money,now,tomorrow,win,you
0,1,0,0,1,0,1,0,0,0,0,0,1
1,0,0,1,0,1,0,0,1,0,0,2,0
2,0,1,0,0,0,0,1,0,1,0,0,0
3,0,1,0,1,0,0,0,0,0,1,0,1


Congratulations! You have successfully implemented Bag of Words for a document dataset that we created.
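To make the counting less of a black box, here is a from-scratch sketch of the same idea in plain Python. It assumes the default behaviour documented for `CountVectorizer` (lowercasing, and tokens made of two or more word characters); `tokenize` and `bag_of_words` are illustrative helpers, not scikit-learn functions.

```python
import re
from collections import Counter

# Assumed to mirror CountVectorizer's defaults: lowercase the text, then keep
# runs of two or more word characters (punctuation is dropped).
TOKEN_PATTERN = re.compile(r"\b\w\w+\b")

def tokenize(text):
    return TOKEN_PATTERN.findall(text.lower())

def bag_of_words(docs):
    """Return (sorted vocabulary, list of count rows) for the given documents."""
    counts = [Counter(tokenize(doc)) for doc in docs]
    vocabulary = sorted(set().union(*counts))
    matrix = [[c[word] for word in vocabulary] for c in counts]
    return vocabulary, matrix

documents = ['Hello, how are you!',
             'Win money, win from home.',
             'Call me now.',
             'Hello, Call you tomorrow?']

vocabulary, matrix = bag_of_words(documents)
print(vocabulary)
for row in matrix:
    print(row)
```

On these four documents the sketch yields the same twelve words and the same counts as the frequency matrix above, including the 2 for 'win' in the second document.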

## Step 3.1: Training and Testing sets

Now that we understand how to deal with the BoW problem, we can get back to our dataset and split it into training and test sets to use with our model.

   **TODO:**  Split the dataset into a training and testing set using the `train_test_split` function in sklearn, producing the following variables:

   * `X_train` is our training data for the 'message' column.
   * `y_train` is our training data for the 'label' column.
   * `X_test` is our testing data for the 'message' column.
   * `y_test` is our testing data for the 'label' column.

   Then print out the number of rows in each of the training and testing sets.

In [0]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df['message'], 
                                                    df['label'], 
                                                    random_state=1)

print('Number of rows in the total set: {}'.format(df.shape[0]))
print('Number of rows in the training set: {}'.format(X_train.shape[0]))
print('Number of rows in the test set: {}'.format(X_test.shape[0]))

Number of rows in the total set: 5572
Number of rows in the training set: 4179
Number of rows in the test set: 1393
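The 4179 / 1393 split follows from `train_test_split` holding out 25% of the rows by default when no size is given. The sketch below shows the underlying idea (shuffle, then cut) in plain Python. `simple_split` is a hypothetical helper, and its shuffle order will not match scikit-learn's even with the same seed.

```python
import math
import random

def simple_split(items, test_fraction=0.25, seed=1):
    """Shuffle a copy of items and cut it into (train, test) lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = math.ceil(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

row_ids = range(5572)  # stand-in for the 5,572 rows of the dataframe
train_ids, test_ids = simple_split(row_ids)
print(len(train_ids), len(test_ids))  # 4179 1393
```

Fixing the seed keeps the split reproducible between runs, which is what `random_state=1` does in the cell above.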


## Step 3.2: Applying Bag of Words processing to our dataset

Our mission now is to apply BoW to our dataset as we did before, using `CountVectorizer()`.

**TODO:**

* Fit `CountVectorizer()` on our training data (`X_train`) and return the matrix.
* Transform our testing data (`X_test`) with the fitted vectorizer to return the matrix.

In [0]:
# Instantiate the CountVectorizer class
count_vector = CountVectorizer()

# Fit on the training data and return the matrix
training_data = count_vector.fit_transform(X_train) # Fitting builds the dictionary of words (the vocabulary)

# Transform the testing data and return the matrix. Note that we do not fit the CountVectorizer on the testing data
testing_data = count_vector.transform(X_test)

In [0]:
print("The shape of the BoW is {} rows(sentences) and {} columns(features)".format(training_data.shape[0],training_data.shape[1]))

# we have to pass the features into the neural network so let's get the number of features
input_shape = training_data.shape[1] 
print("The input shape is the features number which is ",input_shape)

The shape of the BoW is 4179 rows(sentences) and 7496 columns(features)
The input shape is the features number which is  7496
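The reason for fitting on the training data only is that the vocabulary fixes the columns of the matrix, and the model's input size depends on those columns. A minimal plain-Python sketch of that behaviour follows, with made-up messages and illustrative helpers (`fit_vocabulary`, `transform`), assuming the same tokenization defaults as before.

```python
import re

def tokenize(text):
    return re.findall(r"\b\w\w+\b", text.lower())

def fit_vocabulary(train_docs):
    """Assign a column index to every word seen in the training documents."""
    words = sorted({word for doc in train_docs for word in tokenize(doc)})
    return {word: index for index, word in enumerate(words)}

def transform(docs, vocabulary):
    """Count words per document, silently skipping words outside the vocabulary."""
    rows = []
    for doc in docs:
        row = [0] * len(vocabulary)
        for word in tokenize(doc):
            if word in vocabulary:
                row[vocabulary[word]] += 1
        rows.append(row)
    return rows

train_docs = ["win cash now", "call me now"]  # made-up examples
test_docs = ["win a free holiday now"]        # "free" and "holiday" are unseen

vocabulary = fit_vocabulary(train_docs)
print(vocabulary)                        # {'call': 0, 'cash': 1, 'me': 2, 'now': 3, 'win': 4}
print(transform(test_docs, vocabulary))  # [[0, 0, 0, 1, 1]]
```

Words that never appeared in training simply have no column, so the test matrix keeps the same width as the training matrix.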


## Step 4: Deep learning implementation using Keras
We will use a deep neural network to solve this binary classification challenge.

In [0]:
import keras

Using TensorFlow backend.


## Build The model

In [0]:
print("The training data shape = ",training_data.shape)
input_shape = training_data.shape[1] # The columns are the number of features
print("The number of features in our dataset = ",input_shape)

The training data shape =  (4179, 7496)
The number of features in our dataset =  7496


In [0]:
# Build the structure of the model
model = keras.Sequential()
# A hidden layer of 20 nodes that takes the 7496 input features and uses the ReLU activation
model.add(keras.layers.Dense(20, input_shape=(input_shape,), activation='relu'))
# The output layer has 1 node because this is binary classification (spam / not spam); it uses the sigmoid activation
model.add(keras.layers.Dense(1, activation='sigmoid'))
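To get a feel for what these two layers compute, here is a toy forward pass in plain Python. The sizes (3 inputs, 2 hidden units) and all weights are hand-picked for illustration only and have nothing to do with the trained model; `dense` is a hypothetical helper, not the Keras layer.

```python
import math

def relu(x):
    return max(0.0, x)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dense(inputs, weights, biases, activation):
    """One fully connected layer: activation(sum_i inputs[i] * weights[i][j] + biases[j])."""
    outputs = []
    for j, bias in enumerate(biases):
        total = bias + sum(x * row[j] for x, row in zip(inputs, weights))
        outputs.append(activation(total))
    return outputs

# Toy sizes and hand-picked weights, purely for illustration:
# 3 input features -> 2 hidden units -> 1 output probability.
hidden_weights = [[0.5, -1.0], [1.0, 0.5], [-0.5, 2.0]]  # shape (3, 2)
hidden_biases = [0.0, -1.0]
output_weights = [[1.0], [-2.0]]                         # shape (2, 1)
output_biases = [0.5]

features = [1, 0, 2]  # word counts for one message
hidden = dense(features, hidden_weights, hidden_biases, relu)
probability = dense(hidden, output_weights, output_biases, sigmoid)[0]
print(hidden)       # [0.0, 2.0]
print(probability)  # a value between 0 and 1
```

ReLU zeroes out negative sums in the hidden layer, and the sigmoid squashes the final sum into the range (0, 1) so it can be read as a spam probability.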






In [0]:
# Compile the model
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy']) # define the loss and the optimizer

model.summary()



_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_1 (Dense)              (None, 20)                149940    
_________________________________________________________________
dense_2 (Dense)              (None, 1)                 21        
Total params: 149,961
Trainable params: 149,961
Non-trainable params: 0
_________________________________________________________________
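The parameter counts in the summary can be checked by hand: a dense layer has one weight per (input, unit) pair plus one bias per unit. A short sketch of that arithmetic, with `dense_param_count` as an illustrative helper:

```python
def dense_param_count(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units

n_features = 7496  # vocabulary size reported earlier in the notebook
hidden_params = dense_param_count(n_features, 20)
output_params = dense_param_count(20, 1)
print(hidden_params, output_params, hidden_params + output_params)  # 149940 21 149961
```

Almost all of the parameters sit in the first layer, because its size scales with the vocabulary.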


## Train the Model

In [0]:
# Pass the training data, the training labels (y_train), the number of epochs, and the validation data
model.fit(training_data, y_train, epochs=10, 
          validation_data=(testing_data, y_test),
          verbose=1) 


Train on 4179 samples, validate on 1393 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x26577876b70>

## Evaluate

In [0]:
training_loss, accuracy1 = model.evaluate(training_data, y_train)
testing_loss, accuracy2 = model.evaluate(testing_data, y_test)
print("The training loss = {:.6f},  and the accuracy = {:.6f}".format(training_loss, accuracy1))
print("The testing loss = {:.6f},  and the accuracy = {:.6f}".format(testing_loss, accuracy2))

The training loss = 0.001443,  and the accuracy = 0.999521
The testing loss = 0.058221,  and the accuracy = 0.988514
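For reference, the two numbers reported here are the binary cross-entropy loss and the accuracy. The sketch below computes both on made-up labels and probabilities; it assumes the usual 0.5 cutoff for accuracy and clips probabilities to avoid `log(0)`, and it is not Keras's actual implementation.

```python
import math

def binary_cross_entropy(y_true, y_prob, eps=1e-7):
    """Mean of -[y*log(p) + (1-y)*log(1-p)], with p clipped away from 0 and 1."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)

def accuracy(y_true, y_prob, threshold=0.5):
    """Share of examples whose thresholded probability matches the label."""
    hits = sum((p > threshold) == bool(y) for y, p in zip(y_true, y_prob))
    return hits / len(y_true)

labels = [0, 0, 1, 1]                 # made-up labels
probabilities = [0.1, 0.6, 0.8, 0.9]  # made-up model outputs
print(binary_cross_entropy(labels, probabilities))
print(accuracy(labels, probabilities))  # 0.75
```

The loss penalises confident mistakes heavily, while accuracy only counts which side of the cutoff each prediction falls on, which is why the two can move differently between the training and testing sets.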


## Save the model and the BoW
 We'll save the model in **h5** format, which contains the model architecture and the weights of the model.

In [0]:
model.save("SpamDetector.h5")

## Load the model and use it

In [0]:
from keras.models import load_model

In [0]:
loaded_model = load_model("SpamDetector.h5")

### To reuse the BoW we need to save the feature names so we can fit them again when we use the model

We can use the `pickle` module for that. It offers two operations:

Pickling (dump): convert a Python object into a byte stream.

Unpickling (load): retrieve the original object from the stored byte stream.
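A minimal round trip, kept in memory with `pickle.dumps` / `pickle.loads` instead of a file, shows both operations. The short `feature_names` list is just a stand-in for the real vocabulary.

```python
import pickle

feature_names = ["abj", "able", "abnormally", "about"]  # a few stand-in names

payload = pickle.dumps(feature_names)  # pickling: Python object -> bytes
restored = pickle.loads(payload)       # unpickling: bytes -> Python object

print(type(payload).__name__)     # bytes
print(restored == feature_names)  # True
print(restored is feature_names)  # False: unpickling builds a new object
```

Because the payload is bytes, files holding pickled data need to be opened in binary mode (`"wb"` / `"rb"`). Only unpickle data from a source you trust.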

In [0]:
print("The length of the bow = ",len(count_vector.get_feature_names()))
print("-"*30)
print("some samples of the data : ")
count_vector.get_feature_names()[720:730]

The length of the bow =  7496
------------------------------
some samples of the data : 


['abj',
 'able',
 'abnormally',
 'about',
 'aboutas',
 'abroad',
 'absolutely',
 'absolutly',
 'abstract',
 'abt']

Let's save the feature names to a file so that we can apply BoW again without loading the data.

In [0]:
import pickle

with open("bow_featureNames.txt", "wb") as fp:   #Pickling
    pickle.dump(count_vector.get_feature_names(), fp)


In [0]:
def fit_bow(message):
    # Unpickle the saved feature names
    with open("bow_featureNames.txt", "rb") as fp:
        loaded_names = pickle.load(fp)
    # Refit a vectorizer on the saved names to rebuild the vocabulary
    count_vector = CountVectorizer()
    count_vector.fit(loaded_names)
    return count_vector.transform(message)
        

## Build code to predict

In [0]:
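def spam_detector(message):
    # The vectorizer expects a list of documents, so wrap the single message
    bow = fit_bow([message])
    # predict returns an array of shape (1, 1); take the single probability
    probability = float(model.predict(bow)[0][0])
    # Note: probability*100 > 0.4 flags anything above 0.4% as spam
    if probability * 100 > 0.4:
        return "Spam"
    else:
        return "Not Spam"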

In [0]:
x = "free cash just sign in"
spam_detector(x)

'Spam'

In [0]:
y = "take this cash and buy a dinner"
spam_detector(y)

'Not Spam'
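A note on the cutoff: the detector multiplies the predicted probability by 100 and compares it with 0.4, which is the same as a threshold of 0.004 on the raw probability. The sketch below contrasts that with the more conventional 0.5 cutoff, using made-up probabilities and a hypothetical helper `label_from_probability`.

```python
def label_from_probability(probability, threshold=0.5):
    """Turn a sigmoid output in [0, 1] into a class label."""
    return "Spam" if probability > threshold else "Not Spam"

# Compare the notebook's effective cutoff (0.004) with the default of 0.5.
for p in (0.003, 0.02, 0.6):  # made-up probabilities
    print(p, label_from_probability(p, threshold=0.004), label_from_probability(p))
```

A lower threshold catches more spam at the cost of flagging more legitimate messages, so the right value depends on which mistake is more costly.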

##### Congratulations! You have successfully designed a model that can predict whether a message is spam or not!
##### Thanks for reaching the end!