# 25.  Combining Datasets

We would like to see if we can train the model better when combining the Agora and WebIQ Datasets.

## Creating sets of labels and features

### WebIQ
Firstly, we unpack the WebIQ dataset using the stored pickle items we got from TNO.

In [21]:
import pickle

filenameModel = "../darkweb/data/pekelbad/d2v_model_prep1.pkl"
filenameVectors = "../darkweb/data/pekelbad/d2v_vectors_prep1.pkl"
filenameCategories = "../darkweb/data/webiq_mapped_categories.pkl"

d2v_model = pickle.load(open(filenameModel, 'rb'))
features_WebIQ = pickle.load(open(filenameVectors, 'rb'))
labels_WebIQ = pickle.load(open(filenameCategories, 'rb')).Label

type(features_WebIQ)

list

### Agora
Secondly, we preprocess the Agora dataset and vectorize the features with the same fasttext model as the WebIQ dataset, so that we get vectors that are context-aware of each other.

In [22]:
import pandas as pd
from preprocessing import PreProcessor
from sklearn.feature_extraction.text import TfidfVectorizer

pp = PreProcessor()

df_Agora = pd.read_csv('../Data/Structured_DataFrame_Mapped.csv', index_col=0)
df_Agora['Item Description'] = df_Agora['Item Description'].apply(lambda d: pp.preprocess(str(d)))

In [23]:
from gensim.models import FastText
from nltk import word_tokenize
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# def fasttext(corpus, size):
#     tokenized = [word_tokenize(row) for row in corpus]
#     model = ft_model
#     vectors = []
#     for i, row in enumerate(tokenized):
#         sentence_vectors = [model.wv[word] for word in row]
#         if len(sentence_vectors) == 0:
#             vectors.append([0] * size)
#         else:
#             sentence_vector = np.average(sentence_vectors, axis=0)
#             vectors.append(sentence_vector)
#     return vectors, model

# features_Agora, fasttextmodel = fasttext(df_Agora['Item Description'], 128)

def doc2vec(corpus, size):
    documents = [TaggedDocument(words=word_tokenize(doc), tags=[i]) for i, doc in enumerate(corpus)]
    model = d2v_model
    model = Doc2Vec(documents, vector_size=5, window=2, min_count=1, workers=4)
    return [model.docvecs[i] for i, _doc in enumerate(corpus)]

features_Agora = doc2vec(df_Agora['Item Description'], 128)
labels_Agora = df_Agora.Category

type(features_Agora)

list

## Training

In [24]:
from sklearn.svm import LinearSVC

model_Agora = LinearSVC()
model_Agora.fit(features_Agora, labels_Agora)
# y_pred = model.predict(features_Agora)

# features_Agora_Average
# model.fit(features_Agora, labels_Agora)

LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True,
     intercept_scaling=1, loss='squared_hinge', max_iter=1000,
     multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
     verbose=0)

In [26]:
import numpy as np

# features_WebIQ_averaged = [np.average(x, axis=0) for x in features_WebIQ]

model_WebIQ = LinearSVC()
model_WebIQ.fit(features_WebIQ, labels_WebIQ)



LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True,
     intercept_scaling=1, loss='squared_hinge', max_iter=1000,
     multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
     verbose=0)

## Results

In [28]:
from sklearn import metrics

y_pred = model_Agora.predict(features_WebIQ)

print("Accuracy: ", metrics.accuracy_score(labels_WebIQ, y_pred))
print()
print(metrics.classification_report(labels_WebIQ, y_pred))

ValueError: X has 128 features per sample; expecting 5

## Conclusion
We have trained the Agora dataset to a great extend. When we feed the model with the new WebIQ data, the accuracy score is 55%. That doesn't seem like much, but the Agora model doesn't know some of the categorie that are labeled in the WebIQ dataset. For example, the Fraud category is not present in the Agora model. This means all of the Fraud cases in the WebIQ test set need to be categorised in different Agora categories. This logically causes a lot of noise and therefore a worse result. The categories that are present in both sets get predicted pretty well, so the model still ends up at 55%. 

To avoid the problem of missing categories and to hopefully better train the dataset so it works well on both sets of data and thus on more categories, we want to combine the datasets into one and then train and test them as a singular model.

## Combined training
### Creating a combined dataset

In [5]:
from scipy.sparse import csc_matrix
from scipy.sparse import vstack

features = vstack([features_WebIQ, features_Agora])

# def concatenate_csc_matrices_by_columns(matrix1, matrix2):
#     new_data = np.concatenate((matrix1.data, matrix2.data))
#     new_indices = np.concatenate((matrix1.indices, matrix2.indices))
#     new_ind_ptr = matrix2.indptr + len(matrix1.data)
#     new_ind_ptr = new_ind_ptr[1:]
#     new_ind_ptr = np.concatenate((matrix1.indptr, new_ind_ptr))

#     return csc_matrix((new_data, new_indices, new_ind_ptr))

# features = concatenate_csc_matrices_by_columns(features_WebIQ, features_Agora)
# labels = labels_WebIQ.append(labels_Agora)

features

ValueError: incompatible dimensions for axis 1

## Splitting

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(features, labels, test_size=0.33, random_state=0)

In [None]:
from sklearn.svm import LinearSVC

model = LinearSVC()

model.fit(X_train, y_train)
y_pred = model.predict(X_test)