## Zhengxu Wang 
zhengxu@bu.edu  
cs505 hw5

# Programming Assignment Five: Spam Detection with a Neural Network

In this assignment, you are asked to build a neural network that can detect spam from a given SMS message.

The provided files are:
1. `spam_train.csv`: a CSV file containing the training data, where the 'text' column holds the SMS messages and the 'label' column indicates whether each message is 'ham' (0) or 'spam' (1).
2. `spam_test.csv`: a CSV file containing the testing data, in the same format as `spam_train.csv`.

**Step 1: Compute each SMS message vector as the average of the word vectors of the words in it.**

As in the last assignment, we compute the 'representation' of each message, i.e., its vector, by averaging word vectors. Last time the word vectors came from Word2Vec; this time we use pre-trained [GloVe word embeddings](https://nlp.stanford.edu/projects/glove/) instead. Specifically, we use the `glove.6B.100d` embeddings to look up the vector of each word in a message, skipping any word that is not in the `glove.6B.100d` vocabulary.

In other words, you need to:
1. Have a [basic idea](https://nlp.stanford.edu/pubs/glove.pdf) of how GloVe provides pre-trained word embeddings (vectors).
2. Download and extract word vectors from `glove.6B.100d`, contained in `glove.6B.zip`.
3. Compute the message vectors by averaging the vectors of words in the message.
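
The steps above can be sketched on toy data. This is a minimal illustration, not the assignment solution: the two-dimensional vectors in `toy_glove` are made up, and `load_glove` and `message_vector` are hypothetical helper names. The real `glove.6B.100d.txt` uses the same one-word-per-line layout, with 100 values per word.

```python
import io

# Made-up 2-d vectors in the GloVe text layout: "<word> <v1> <v2> ..."
toy_glove = "free 1.0 0.0\nprize 0.0 1.0\nhello 0.5 0.5\n"

def load_glove(lines):
    """Map each word to its vector (a list of floats)."""
    table = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        table[parts[0]] = [float(v) for v in parts[1:]]
    return table

def message_vector(tokens, table, size):
    """Average the vectors of in-vocabulary tokens; all zeros if none match."""
    hits = [table[t] for t in tokens if t in table]
    if not hits:
        return [0.0] * size
    return [sum(column) / len(hits) for column in zip(*hits)]

table = load_glove(io.StringIO(toy_glove))
vec = message_vector(["free", "prize", "unknownword"], table, size=2)
print(vec)  # [0.5, 0.5]
```

The out-of-vocabulary token is skipped, so the average is taken over the two known words only.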

In [1]:
import pandas as pd
df_train = pd.read_csv('./spam_train.csv')
df_test = pd.read_csv('./spam_test.csv')

In [2]:
df_train['text']

| Unnamed: 0.1 | Unnamed: 0 | text | label |
|---|---|---|---|
| 0 | 0 | One small prestige problem now. | 0 |
| 1 | 1 | Hey babe! I saw you came online for a second a... | 0 |
| 2 | 2 | You should get more chicken broth if you want ... | 0 |
| 3 | 3 | My fri ah... Okie lor,goin 4 my drivin den go ... | 0 |
| 4 | 4 | That's cool, I'll come by like &lt;#&gt; ish | 0 |
| ... | ... | ... | ... |
| 995 | 995 | Natalja (25/F) is inviting you to be her frien... | 1 |
| 996 | 996 | Do you want a new Video handset? 750 any time ... | 1 |
| 997 | 997 | Congratulations ur awarded either a yrs supply... | 1 |
| 998 | 998 | TheMob>Yo yo yo-Here comes a new selection of ... | 1 |


In [3]:
# Load the GloVe vocabulary and vectors: each line is "<word> <v1> ... <v100>"
vocab, embeddings = [], []
with open('./glove.6B/glove.6B.100d.txt', 'rt', encoding='utf-8') as fi:
    full_content = fi.read().strip().split('\n')
for line in full_content:
    parts = line.split(' ')
    vocab.append(parts[0])                                # the word itself
    embeddings.append([float(val) for val in parts[1:]])  # its vector

In [5]:
len(vocab)

400001

In [9]:
len(embeddings[0])

100

In [10]:
len(embeddings)

400001

In [17]:
import re
import spacy

def cutWord(df):
    """Return one list of lowercased lemmas per message in df['text']."""
    allsents = []
    nlp = spacy.load("en_core_web_sm")
    for text in df['text']:
        # Remove bracketed spans such as "[...]"; work on a local copy so the
        # DataFrame is not modified (avoids pandas' SettingWithCopyWarning)
        text = re.sub(r'\[.*?\]', '', text)
        # Collect the lowercased lemma of every token in the message
        allsents.append([token.lemma_.lower() for token in nlp(text)])
    return allsents

In [18]:
trainSents = cutWord(df_train)

In [23]:
trainSents[0]

['one', 'small', 'prestige', 'problem', 'now', '.']

In [24]:
testSents = cutWord(df_test)


In [60]:
# Average a list of word vectors into a single message vector
import numpy as np
import torch

def build_sentence_vector(word_vectors, size):
    # A message with no in-vocabulary words maps to the zero vector
    if len(word_vectors) == 0:
        return torch.zeros(size, dtype=torch.float32)
    # Element-wise mean over the word vectors, kept in float32
    sen_vec = np.mean(np.asarray(word_vectors, dtype=np.float32), axis=0)
    return torch.from_numpy(sen_vec)

In [61]:
# Map each word to its first position in `vocab` for constant-time lookup
# (vocab.index scans the whole list on every call)
word2idx = {}
for idx, word in enumerate(vocab):
    word2idx.setdefault(word, idx)

trainVecs = []
for sent in trainSents:
    # Keep only the tokens that have a GloVe vector
    sentVecs = [embeddings[word2idx[token]] for token in sent if token in word2idx]
    trainVecs.append(build_sentence_vector(sentVecs, 100))

In [62]:
len(trainVecs[0])

100

In [63]:
len(trainVecs)

1000

In [64]:
testVecs = []
for sent in testSents:
    # Keep only the tokens that have a GloVe vector
    sentVecs = [embeddings[word2idx[token]] for token in sent if token in word2idx]
    testVecs.append(build_sentence_vector(sentVecs, 100))

**Step 2: Build a 'dataset + data loader' that can feed data to train your model with PyTorch.**

Our goal is to train a spam detection model (classification). Here's an [example](https://pytorch.org/tutorials/beginner/blitz/cifar10_tutorial.html) of how a classifier is trained. Although it is for image classification, the idea is very similar:

1. Prepare/build a dataset and load it with data loader;
2. Prepare/build a model that takes the data input and predicts; and 
3. Prepare/build the optimizer and loss functions to train the model with the dataset.

Naturally, the next thing to do is to prepare the data. We do it by building a `Dataset` and a `DataLoader` with PyTorch.

You may refer to [this page](https://pytorch.org/tutorials/beginner/basics/data_tutorial.html) to get an idea of how to make a `Dataset` and a `DataLoader`.

Hints:
1. Make sure `__init__`, `__len__`, and `__getitem__` of your dataset are implemented properly. In particular, `__getitem__` should return the message vector at the given index together with its label.
2. Don't compute the message vector inside `__getitem__`; otherwise the training process will slow down A LOT. Compute all vectors once beforehand.
3. Make sure shuffling is turned on in your data loader setup, as the data in the CSV file is not shuffled.
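
The hints above can be sketched as follows. This is an illustrative example under stated assumptions, not the required implementation: `MessageDataset` is a hypothetical class name, and random tensors stand in for the precomputed message vectors.

```python
import torch
from torch.utils.data import Dataset, DataLoader

class MessageDataset(Dataset):
    """Holds message vectors that were computed once, before training."""

    def __init__(self, vectors, labels):
        self.vectors = torch.as_tensor(vectors, dtype=torch.float32)
        self.labels = torch.as_tensor(labels, dtype=torch.long)

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        # Only an index lookup here; no embedding work per call
        return self.vectors[idx], self.labels[idx]

# Stand-in data: 10 random 100-d "message vectors", ham first, then spam
vectors = torch.randn(10, 100)
labels = [0] * 5 + [1] * 5

dataset = MessageDataset(vectors, labels)
loader = DataLoader(dataset, batch_size=4, shuffle=True)

for batch_vectors, batch_labels in loader:
    print(batch_vectors.shape, batch_labels.tolist())
```

Because the stand-in labels are sorted by class, `shuffle=True` is what mixes ham and spam within each batch.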



In [106]:
from torch.utils.data import Dataset

class CustomDataset(Dataset):
    """Pairs precomputed message vectors with their labels."""

    def __init__(self, data, labels):
        self.data = data
        self.labels = labels

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        # Return the message vector and its label (0 = ham, 1 = spam)
        return self.data[idx], self.labels[idx]

In [107]:
trainDataset = CustomDataset(trainVecs,df_train['label'])

In [108]:
testDataset = CustomDataset(testVecs,df_test['label'])

In [109]:
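from torch.utils.data import DataLoader
# Shuffle the training data, since the CSV rows are ordered by label
trainDataloader = DataLoader(trainDataset, batch_size=64, shuffle=True)
# Accuracy does not depend on batch order, so the test data need not be shuffled
testDataloader = DataLoader(testDataset, batch_size=64, shuffle=False)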

In [110]:
import torch.nn as nn

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear1 = nn.Linear(100, 15)   # 100-d message vector -> 15 hidden units
        self.activation = nn.ReLU()
        self.linear2 = nn.Linear(15, 2)     # one output per class: ham, spam
        # Normalize over the class dimension; passing dim explicitly avoids
        # the implicit-dimension deprecation warning
        self.softmax = nn.Softmax(dim=-1)

    def forward(self, x):
        x = self.linear1(x)
        x = self.activation(x)
        x = self.linear2(x)
        x = self.softmax(x)
        return x


net = Net()

In [113]:
import torch.optim as optim

criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(net.parameters(), lr=0.0005, momentum=0.9)

**Step 3: Build the neural net model.** 

Once the data is ready, we need to design and implement our neural network model.

You should look [here](https://pytorch.org/tutorials/beginner/introyt/modelsyt_tutorial.html) to see how a model can be defined.

The model does not need to be complicated. An example structure could be:

1. linear layer 100 x 15
2. ReLU activation layer
3. linear layer 15 x 2 (think about why the output size is 2 rather than 1)
4. Softmax activation layer

But feel free to try other combinations of linear layers and activation functions, and check later whether they make a significant difference to the model's performance.
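
To see what the last two layers of the example structure produce, here is a small plain-Python sketch (not PyTorch code) of softmax over two output scores and the cross-entropy of the result. The scores are made up for illustration. With one output per class, the prediction is the index of the larger score, and the loss is the negative log-probability of the true class.

```python
import math

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    shifted = [s - max(scores) for s in scores]  # subtract max for numerical stability
    exps = [math.exp(s) for s in shifted]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, label):
    """Negative log-probability assigned to the true class."""
    return -math.log(probs[label])

scores = [0.2, 1.7]          # made-up outputs of the last linear layer: [ham, spam]
probs = softmax(scores)
predicted = probs.index(max(probs))

print(probs, predicted)
print(cross_entropy(probs, label=1), cross_entropy(probs, label=0))
```

Note that PyTorch's `nn.CrossEntropyLoss` expects unnormalized scores and applies log-softmax internally, so a network that also ends in a softmax layer normalizes twice. It can still train, but that is worth keeping in mind when comparing model structures.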

**Step 4: Train the model with optimizer and loss function.**

Lastly, we need to set up the [optimizer](https://pytorch.org/docs/stable/optim.html) and [loss function](https://pytorch.org/docs/stable/nn.html#loss-functions) to train the model. You may refer to the links for more details. Specifically, we need Stochastic Gradient Descent (SGD) as the optimizer and CrossEntropyLoss as the loss function.

The last thing to do is to train the model for several epochs and evaluate its performance periodically. For example, train the model for 5000 epochs, evaluating it every 100 epochs. If you are not sure how the training works, you may refer to the [classification model tutorial](https://pytorch.org/tutorials/beginner/blitz/cifar10_tutorial.html) to see how it is typically done. Don't forget to print the average loss of each epoch to check that the model is being optimized properly.
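
A minimal sketch of such a loop is shown here. The assumptions are stated loudly: the data are synthetic stand-ins rather than message vectors, the layer sizes and hyperparameters are arbitrary, the model feeds raw scores to `CrossEntropyLoss` (no softmax layer), and accuracy is checked on the training data only to keep the example short.

```python
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Synthetic stand-in data: class 1 points are shifted away from class 0
x = torch.cat([torch.randn(50, 4) - 2.0, torch.randn(50, 4) + 2.0])
y = torch.cat([torch.zeros(50, dtype=torch.long), torch.ones(50, dtype=torch.long)])
loader = DataLoader(TensorDataset(x, y), batch_size=16, shuffle=True)

model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=0.05, momentum=0.9)

epoch_losses = []
for epoch in range(20):
    running_loss = 0.0
    for inputs, labels in loader:
        optimizer.zero_grad()
        loss = criterion(model(inputs), labels)
        loss.backward()
        optimizer.step()
        running_loss += loss.item()
    # Average over the number of batches seen in this epoch
    epoch_losses.append(running_loss / len(loader))
    # Evaluate every 5 epochs (on the training data, for brevity)
    if epoch % 5 == 4:
        with torch.no_grad():
            accuracy = (model(x).argmax(dim=1) == y).float().mean().item()
        print(f"epoch {epoch + 1}: avg loss {epoch_losses[-1]:.3f}, accuracy {accuracy:.2f}")
```

Dividing the summed loss by `len(loader)` gives the per-batch average for the epoch, which is the quantity to watch for a steady decrease.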

The evaluation metric should be the [**accuracy**](https://en.wikipedia.org/wiki/Confusion_matrix) of predicting ham/spam on the testing data, i.e., (TP+TN)/(TP+TN+FP+FN). The highest accuracy should be at least **90%**. Try different settings of model structure, learning rate, and number of training epochs to achieve that level of accuracy.
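
The accuracy metric can be written out with explicit parentheses as in the sketch below. The predictions are made up, spam (1) is treated as the positive class, and `confusion_counts` is a hypothetical helper name.

```python
def accuracy(tp, tn, fp, fn):
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)

def confusion_counts(predicted, actual):
    """Count TP, TN, FP, FN, treating spam (1) as the positive class."""
    tp = sum(p == 1 and a == 1 for p, a in zip(predicted, actual))
    tn = sum(p == 0 and a == 0 for p, a in zip(predicted, actual))
    fp = sum(p == 1 and a == 0 for p, a in zip(predicted, actual))
    fn = sum(p == 0 and a == 1 for p, a in zip(predicted, actual))
    return tp, tn, fp, fn

# Made-up predictions for eight messages
actual    = [0, 0, 0, 0, 1, 1, 1, 1]
predicted = [0, 0, 0, 1, 1, 1, 0, 1]

tp, tn, fp, fn = confusion_counts(predicted, actual)
print(tp, tn, fp, fn)            # 3 3 1 1
print(accuracy(tp, tn, fp, fn))  # 0.75
```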

In [112]:
import torch

def eval(testloader, net):
    """Return the accuracy of `net` on `testloader` as a whole-number percentage."""
    correct = 0
    total = 0
    # Gradients are not needed for evaluation
    with torch.no_grad():
        for inputs, labels in testloader:
            # Run the message vectors through the network
            outputs = net(inputs)
            # Predict the class with the highest output score
            predicted = outputs.argmax(dim=1)
            total += labels.size(0)
            correct += (predicted == labels).sum().item()

    return 100 * correct // total


In [114]:
for epoch in range(5000):  # loop over the dataset multiple times
    running_loss = 0.0
    for inputs, labels in trainDataloader:
        # zero the parameter gradients
        optimizer.zero_grad()

        # forward + backward + optimize
        outputs = net(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()

        # accumulate the batch losses for this epoch
        running_loss += loss.item()

    # evaluate on the test set every 100 epochs
    if epoch % 100 == 99:
        acc = eval(testDataloader, net)
        # Note: 2000 is a fixed scaling constant, not the number of batches;
        # divide by len(trainDataloader) instead for the per-batch average loss
        print(f'Epoch {epoch} , Accuracy of the network on the 500 test texts: {acc} % , loss: {running_loss / 2000:.3f}')
        # stop early once test accuracy exceeds 90%
        if acc > 90:
            break

print('Finished Training')


Epoch 99 , Accuracy of the network on the 500 test texts: 76 % , loss: 0.005
Epoch 199 , Accuracy of the network on the 500 test texts: 89 % , loss: 0.004
Epoch 299 , Accuracy of the network on the 500 test texts: 89 % , loss: 0.004
Epoch 399 , Accuracy of the network on the 500 test texts: 89 % , loss: 0.004
Epoch 499 , Accuracy of the network on the 500 test texts: 90 % , loss: 0.003
Epoch 599 , Accuracy of the network on the 500 test texts: 90 % , loss: 0.003
Epoch 699 , Accuracy of the network on the 500 test texts: 91 % , loss: 0.003
Finished Training
