# Universidad Autónoma de Yucatán

## Facultad de Matemáticas

### Natural Language Processing

**Teacher:** Dr. Jorge Carlos Reyes

**Student:** Dayan Bravo Fraga

# Doc2Vec

## Document to Vector

# Practice: "Spam Detection"

## Download Corpus from GitHub (only for Colab)

In [1]:
import sys

in_colab: bool = 'google.colab' in sys.modules
if in_colab:
    print('Is running in Colab')
    import os

    if not os.path.isfile('spam.csv'):
        import gdown

        print("Downloading Data")
        url = "https://raw.githubusercontent.com/dayan3847/natural_language_processing/dayan3847/task_final/spam.csv"
        gdown.download(url, quiet=False)
else:
    print('Is not running in Colab')

Is not running in Colab


## Import Libraries

In [2]:
import pandas as pd
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from nltk.tokenize import word_tokenize
from sklearn.model_selection import train_test_split

## Import Data

In [3]:
spam_data = pd.read_csv('spam.csv')
spam_data.head(10)

| Unnamed: 0 | text | target |
|---|---|---|
| 0 | Go until jurong point, crazy.. Available only ... | ham |
| 1 | Ok lar... Joking wif u oni... | ham |
| 2 | Free entry in 2 a wkly comp to win FA Cup fina... | spam |
| 3 | U dun say so early hor... U c already then say... | ham |
| 4 | Nah I don't think he goes to usf, he lives aro... | ham |
| 5 | FreeMsg Hey there darling it's been 3 week's n... | spam |
| 6 | Even my brother is not like to speak with me. ... | ham |
| 7 | As per your request 'Melle Melle (Oru Minnamin... | ham |
| 8 | WINNER!! As a valued network customer you have... | spam |
| 9 | Had your mobile 11 months or more? U R entitle... | spam |


## Split Data into Train and Test

In [4]:
x_train, x_test, y_train, y_test = train_test_split(
    spam_data['text'],
    spam_data['target'],
    random_state=0,
)

## Training data tokenization

In [5]:
x_train_tk = [word_tokenize(_x.lower()) for _x in x_train]
y_train_tk = [str(_y) for _y in y_train]

## Test data tokenization

In [6]:
x_test_tk = [word_tokenize(_x.lower()) for _x in x_test]
y_test_tk = [str(_y) for _y in y_test]

## Tag documents with their class labels

In [7]:
# Each document is tagged with its class label, so the model learns one vector per distinct label.
tagged_data = [TaggedDocument(words=_x, tags=[_y]) for _x, _y in zip(x_train_tk, y_train_tk)]
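
Because the tags are class labels rather than per-document ids, many documents share the same tag. The sketch below illustrates this with invented toy data. To stay dependency-free it uses a stand-in named tuple, `TaggedDoc`, with the same `words` and `tags` fields as gensim's `TaggedDocument`; the stand-in and the toy messages are assumptions for illustration only.

```python
from collections import namedtuple

# Stand-in with the same fields as gensim's TaggedDocument (assumption for this sketch).
TaggedDoc = namedtuple('TaggedDoc', 'words tags')

# Invented toy documents and labels.
toy_docs = [['free', 'entry', 'win'], ['ok', 'lar'], ['winner', 'prize']]
toy_labels = ['spam', 'ham', 'spam']

toy_tagged = [TaggedDoc(words=w, tags=[t]) for w, t in zip(toy_docs, toy_labels)]

# Tags are shared across documents, so there are fewer distinct tags than documents.
distinct_tags = sorted({tag for doc in toy_tagged for tag in doc.tags})
print(len(toy_tagged), distinct_tags)
```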

## Doc2Vec model training

In [8]:
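# Passing the corpus here builds the vocabulary and trains in one step; min_count=1 keeps tokens seen only once.
model = Doc2Vec(tagged_data, vector_size=200, min_count=1, epochs=20)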

## Calculate Accuracy Function

In [9]:
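def calculate_accuracy(x, y):
    """Return the fractions of correctly and incorrectly classified documents.

    Each tokenized document in `x` is embedded with `infer_vector`; the
    predicted label is the tag whose learned vector is most similar to it.
    Relies on the global `model` trained above.
    """
    _correct = 0
    _incorrect = 0
    for _x, _y in zip(x, y):
        _xv = model.infer_vector(_x)
        # most_similar returns (tag, similarity) pairs; keep only the top tag.
        _h = model.dv.most_similar([_xv], topn=1)[0][0]
        if _h == _y:
            _correct += 1
        else:
            _incorrect += 1

    # Convert the counts to fractions of the total number of documents.
    total = _correct + _incorrect
    _correct /= total
    _incorrect /= total

    return _correct, _incorrect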
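
To make the prediction step in `calculate_accuracy` concrete: `most_similar` ranks the learned tag vectors by cosine similarity to the inferred vector, and the top-ranked tag is taken as the predicted label. The following is a minimal pure-Python sketch of that idea. The two-dimensional vectors and the helpers `cosine` and `nearest_tag` are made up for illustration and are not part of gensim.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest_tag(vector, tag_vectors):
    """Return the tag whose vector has the highest cosine similarity to `vector`."""
    return max(tag_vectors, key=lambda tag: cosine(vector, tag_vectors[tag]))


# Toy stand-ins for the learned 'ham' and 'spam' vectors (hypothetical values).
toy_tag_vectors = {'ham': [1.0, 0.2], 'spam': [0.1, 1.0]}
toy_inferred = [0.3, 0.9]  # hypothetical inferred vector for one message

predicted = nearest_tag(toy_inferred, toy_tag_vectors)
print(predicted)
```

With only two tags, the classifier reduces to asking which of the two class vectors points in a direction closer to the inferred document vector.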

## Train Accuracy

In [10]:
correct, incorrect = calculate_accuracy(x_train_tk, y_train_tk)
print('Train Accuracy')
print('Correct:', correct)
print('Incorrect:', incorrect)

Train Accuracy
Correct: 0.8255563531945441
Incorrect: 0.17444364680545585


## Test Accuracy

In [11]:
correct, incorrect = calculate_accuracy(x_test_tk, y_test_tk)
print('Test Accuracy')
print('Correct:', correct)
print('Incorrect:', incorrect)

Test Accuracy
Correct: 0.8183776022972002
Incorrect: 0.18162239770279973
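
An accuracy figure is easier to interpret next to a trivial baseline. The sketch below computes the accuracy of always predicting the most frequent label. `majority_baseline` is a hypothetical helper and the toy label list is invented; in this notebook it would be called on `y_test_tk` instead.

```python
from collections import Counter


def majority_baseline(labels):
    """Return (most frequent label, accuracy of always predicting it)."""
    counts = Counter(labels)
    label, count = counts.most_common(1)[0]
    return label, count / len(labels)


# Invented toy labels; in the notebook this would be y_test_tk.
toy_labels = ['ham', 'ham', 'ham', 'spam', 'ham', 'spam', 'ham', 'ham']
baseline_label, baseline_accuracy = majority_baseline(toy_labels)
print(baseline_label, baseline_accuracy)
```

If the classes are imbalanced, a model is only informative when its test accuracy exceeds this baseline.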


# Results

## Experiment 1

### Config
- Vector Size: 100
- Epochs: 5

### Train Accuracy
- Correct: <span style="color: green;">0.7659727207465901</span>
- Incorrect: <span style="color: red;">0.2340272792534099</span>

### Test Accuracy
- Correct: <span style="color: green;">0.7430007178750897</span>
- Incorrect: <span style="color: red;">0.25699928212491024</span>

## Experiment 2

### Config
- Vector Size: 100
- Epochs: 20

### Train Accuracy
- Correct: <span style="color: green;">0.7695620961952621</span>
- Incorrect: <span style="color: red;">0.23043790380473797</span>

### Test Accuracy
- Correct: <span style="color: green;">0.7989949748743719</span>
- Incorrect: <span style="color: red;">0.20100502512562815</span>

## Experiment 3

### Config
- Vector Size: 150
- Epochs: 20

### Train Accuracy
- Correct: <span style="color: green;">0.8054558506819813</span>
- Incorrect: <span style="color: red;">0.19454414931801867</span>

### Test Accuracy
- Correct: <span style="color: green;">0.815984685331419</span>
- Incorrect: <span style="color: red;">0.184015314668581</span>
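
The runs can also be compared side by side. The sketch below gathers the three experiments above, plus the run shown in the notebook cells (vector size 200, 20 epochs), and picks the configuration with the highest test accuracy. The accuracies are copied from this document and rounded to four decimals; the `runs` list and its tuple layout are assumptions for illustration.

```python
# (vector_size, epochs, train_accuracy, test_accuracy), rounded to 4 decimals.
runs = [
    (100, 5, 0.7660, 0.7430),
    (100, 20, 0.7696, 0.7990),
    (150, 20, 0.8055, 0.8160),
    (200, 20, 0.8256, 0.8184),  # run shown in the notebook cells
]

# Print a small fixed-width comparison table.
print(f"{'size':>4} {'epochs':>6} {'train':>7} {'test':>7}")
for size, epochs, train_acc, test_acc in runs:
    print(f"{size:>4} {epochs:>6} {train_acc:>7.4f} {test_acc:>7.4f}")

# Select the run with the highest test accuracy.
best = max(runs, key=lambda run: run[3])
print('Best test accuracy:', best)
```

Each configuration was run once, so small differences between rows may partly reflect run-to-run variation rather than the settings themselves.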