# Classification with Doc2Vec and SVM

This is a binary classification task with two labels: {'no Asshole': 0, 'Asshole': 1}.

The training and testing process has four steps:

1. Create a feature representation (a Doc2Vec vector) for each post

2. Oversample the training feature vectors to balance the two classes

3. Train an SVM model on the balanced x_train and y_train

4. Evaluate the trained model on x_test and report the results

Download the dataset

In [1]:
!pip install dvc

In [2]:
!dvc get https://github.com/iterative/aita_dataset aita_clean.csv

Load necessary libraries

In [3]:
import pandas as pd
from sklearn.model_selection import train_test_split
from gensim.models import doc2vec
from tqdm import tqdm
from sklearn import utils
from imblearn.over_sampling import SMOTE
from sklearn.svm import SVC
import numpy as np
from sklearn.metrics import accuracy_score, classification_report

Data processing
* Keep rows with `score >= 10`, meaning at least ten people have judged the post
* Combine `title` and `body`, filling a missing `body` with an empty string

In [4]:
df = pd.read_csv('aita_clean.csv')
df = df[df['score'] >= 10]
df['text'] = df['title'] + df['body'].fillna('')
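
The filtering and concatenation in the cell above can be checked on a small toy frame. This is only a sketch: the three rows are invented for illustration and merely reuse the `title`, `body`, and `score` column names from `aita_clean.csv`.

```python
import pandas as pd

# Toy rows standing in for aita_clean.csv (values invented for illustration).
toy = pd.DataFrame({
    "title": ["AITA for leaving early?", "AITA for saying no?", "WIBTA if I left?"],
    "body": ["I left at nine.", None, "Long story."],
    "score": [25, 10, 3],
})

# Same filter as the cell above: keep posts whose score is at least 10.
kept = toy[toy["score"] >= 10].copy()

# A missing body becomes '' so the concatenation does not produce NaN.
kept["text"] = kept["title"] + kept["body"].fillna("")

print(kept["text"].tolist())
```

Note that plain `+` inserts no separator, so the last word of the title and the first word of the body end up as a single token once the text is split on whitespace.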

In [5]:
def label_sentences(corpus, label_type):
    labeled = []
    for i, v in enumerate(corpus):
        label = label_type + '_' + str(i)
        labeled.append(doc2vec.TaggedDocument(str(v).split(), [label]))
    return labeled
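
To see what `label_sentences` produces, the sketch below reruns the same logic on two invented sentences. So that it runs without gensim, it swaps in a stand-in `namedtuple` with the same two fields (`words`, `tags`) as `doc2vec.TaggedDocument`; the stand-in is an assumption made for illustration only.

```python
from collections import namedtuple

# Stand-in with the same two fields (words, tags) as gensim's TaggedDocument,
# so the sketch runs without gensim installed.
TaggedDocument = namedtuple("TaggedDocument", ["words", "tags"])

def label_sentences(corpus, label_type):
    """Tag each document as '<label_type>_<position in corpus>'."""
    return [
        TaggedDocument(str(text).split(), [f"{label_type}_{i}"])
        for i, text in enumerate(corpus)
    ]

docs = label_sentences(["AITA for leaving early?", "WIBTA if I left?"], "Train")
for doc in docs:
    print(doc.tags, doc.words)
```

The tags are built from the position in the corpus rather than the pandas index, which is what later allows the vectors to be looked up with a simple `range`.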

In [6]:
x_train, x_test, y_train, y_test = train_test_split(df.text, df.is_asshole, random_state=0, 
                                                    test_size=0.3)
x_train = label_sentences(x_train, 'Train')
x_test = label_sentences(x_test, 'Test')
all_data = x_train + x_test

In [7]:
model_dbow = doc2vec.Doc2Vec(dm=0, vector_size=300, negative=5, min_count=1, alpha=0.065, 
                     min_alpha=0.065)
model_dbow.build_vocab([x for x in tqdm(all_data)])

100%|██████████| 48853/48853 [00:00<00:00, 1709459.25it/s]


In [8]:
for epoch in range(30):
    model_dbow.train(utils.shuffle([x for x in tqdm(all_data)]), 
                     total_examples=len(all_data), 
                     epochs=1)
    model_dbow.alpha -= 0.002
    model_dbow.min_alpha = model_dbow.alpha
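
The loop above lowers the learning rate by hand after every pass. The arithmetic of that schedule can be sketched without gensim, assuming each `train` call runs at a constant rate because `alpha` and `min_alpha` are kept equal; the variable names here are illustrative.

```python
start_alpha = 0.065   # alpha passed to Doc2Vec in the notebook
step = 0.002          # amount subtracted after each pass
n_passes = 30         # number of loop iterations

# Learning rate in effect during each pass (before the decrement is applied).
alphas = [start_alpha - step * k for k in range(n_passes)]

print(f"first pass: {alphas[0]:.3f}")
print(f"last pass:  {alphas[-1]:.3f}")
print(f"after loop: {start_alpha - step * n_passes:.3f}")
```

With these values the rate stays positive for all 30 passes; a larger step or more passes would eventually push it below zero, which is worth checking whenever the schedule is changed.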

Create x_train and x_test vectors

In [9]:
def get_vectors(model, corpus_size, vectors_size, vectors_type):
    vectors = np.zeros((corpus_size, vectors_size))
    for i in range(0, corpus_size):
        prefix = vectors_type + '_' + str(i)
        vectors[i] = model.docvecs[prefix]
    return vectors

In [10]:
train_vectors_dbow = get_vectors(model_dbow, len(x_train), 300, 'Train')
test_vectors_dbow = get_vectors(model_dbow, len(x_test), 300, 'Test')
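
The lookup performed by `get_vectors` can be illustrated with a tiny stand-in for the trained model. `FakeModel` below is hypothetical: it only exposes a `docvecs` mapping from tag to vector, which is all the function needs, and the 3-dimensional vectors are invented.

```python
import numpy as np

class FakeModel:
    """Hypothetical stand-in exposing a `docvecs` mapping like the trained model."""
    def __init__(self, docvecs):
        self.docvecs = docvecs

def get_vectors(model, corpus_size, vectors_size, vectors_type):
    # One row per document, filled by looking up the tag '<type>_<i>'.
    vectors = np.zeros((corpus_size, vectors_size))
    for i in range(corpus_size):
        vectors[i] = model.docvecs[f"{vectors_type}_{i}"]
    return vectors

fake = FakeModel({
    "Train_0": np.array([1.0, 0.0, 0.0]),
    "Train_1": np.array([0.0, 1.0, 0.0]),
    "Test_0": np.array([0.0, 0.0, 1.0]),
})
train = get_vectors(fake, 2, 3, "Train")
test = get_vectors(fake, 1, 3, "Test")
print(train.shape, test.shape)
```

Row `i` of each matrix corresponds to the `i`-th document passed to `label_sentences`, so the rows stay aligned with `y_train` and `y_test`.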

Use oversampling to eliminate class imbalance

In [11]:
smt = SMOTE(random_state = 42)
X_train, Y_train = smt.fit_resample(train_vectors_dbow, y_train)
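
The core idea behind SMOTE is to create new minority samples by interpolating between an existing minority sample and one of its nearest minority neighbours. The function below is a simplified sketch of that idea in NumPy, not the imbalanced-learn implementation; `smote_like`, its parameters, and the toy data are all illustrative assumptions.

```python
import numpy as np

def smote_like(X_min, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    rng = np.random.default_rng(seed)
    # Pairwise squared distances between minority samples.
    d2 = ((X_min[:, None, :] - X_min[None, :, :]) ** 2).sum(axis=2)
    np.fill_diagonal(d2, np.inf)                 # a point is not its own neighbour
    neighbours = np.argsort(d2, axis=1)[:, :k]   # indices of the k nearest neighbours
    base = rng.integers(0, len(X_min), size=n_new)            # random base sample
    pick = neighbours[base, rng.integers(0, k, size=n_new)]   # one of its neighbours
    gap = rng.random((n_new, 1))                              # position along the segment
    return X_min[base] + gap * (X_min[pick] - X_min[base])

X_major = np.random.default_rng(1).normal(0.0, 1.0, size=(8, 2))
X_minor = np.array([[3.0, 3.0], [3.5, 3.2], [4.0, 2.8]])

# Add enough synthetic minority points to match the majority class size.
X_new = smote_like(X_minor, n_new=len(X_major) - len(X_minor))
X_minor_balanced = np.vstack([X_minor, X_new])
print(len(X_major), len(X_minor_balanced))
```

In the notebook, resampling is applied only to the training vectors, so the test set keeps its original class distribution.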

In [12]:
model = SVC(C=10, gamma='auto', kernel='rbf')
model.fit(X_train, Y_train)

SVC(C=10, gamma='auto')
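
With `kernel='rbf'`, the SVM compares two vectors through exp(-gamma * ||x - y||^2), and `gamma='auto'` resolves to 1 / n_features in scikit-learn, which is 1/300 for the vectors used here. The sketch below computes that kernel by hand; `rbf_kernel` is a helper written for illustration, and the two vectors are arbitrary.

```python
import numpy as np

def rbf_kernel(x, y, gamma):
    """RBF kernel value: exp(-gamma * ||x - y||^2)."""
    diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return float(np.exp(-gamma * np.dot(diff, diff)))

n_features = 300                # vector_size used for Doc2Vec above
gamma_auto = 1.0 / n_features   # what gamma='auto' resolves to in scikit-learn

x = np.zeros(n_features)
y = np.full(n_features, 0.1)    # arbitrary illustrative vector

print(gamma_auto)
print(rbf_kernel(x, x, gamma_auto))   # identical vectors give the maximum similarity
print(rbf_kernel(x, y, gamma_auto))
```

Because gamma is fixed by the dimensionality alone, it does not adapt to the scale of the document vectors; `gamma='scale'` is the scikit-learn option that also takes the feature variance into account.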

Generate classification report

In [13]:
y_pred = model.predict(test_vectors_dbow)

print('accuracy %s' % accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))

accuracy 0.7199099344978166
              precision    recall  f1-score   support

           0       0.73      0.96      0.83     10623
           1       0.45      0.08      0.14      4033

    accuracy                           0.72     14656
   macro avg       0.59      0.52      0.49     14656
weighted avg       0.66      0.72      0.64     14656
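
To put the report in context, the sketch below recomputes a few quantities from the numbers printed above: the accuracy of a trivial classifier that always predicts class 0, and the macro and weighted F1 averages. The inputs are copied from the report; because the per-class scores are already rounded to two decimals, the recomputed averages only match the report approximately.

```python
support = {0: 10623, 1: 4033}          # support column of the report
f1 = {0: 0.83, 1: 0.14}                # f1-score column of the report (rounded)
reported_accuracy = 0.7199099344978166

total = sum(support.values())

# Accuracy of a trivial classifier that always predicts the majority class 0.
majority_baseline = support[0] / total

# Macro average: unweighted mean over classes.
macro_f1 = sum(f1.values()) / len(f1)

# Weighted average: mean over classes weighted by support.
weighted_f1 = sum(f1[c] * support[c] for c in support) / total

print(f"majority baseline: {majority_baseline:.4f}")
print(f"reported accuracy: {reported_accuracy:.4f}")
print(f"macro f1 ~ {macro_f1:.3f}, weighted f1 ~ {weighted_f1:.3f}")
```

The reported accuracy is slightly below the majority-class baseline, and recall for class 1 is only 0.08, so the model rarely identifies the positive class despite the oversampling step.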

