# Collect the Dataset

As an example, we use the dataset from the article. To train a model on your own data, replace it with your dataset. To use the article's dataset with a different model, you only need to change the model (OCSVM).

In [None]:
!gdown --id 1opBT5gZuSplDX2rd34GjubOF4Mk-GosA
!gdown --id 1D8kjCsLi1JteJSGNqwF7sxu13MpPmcn-

Downloading...
From: https://drive.google.com/uc?id=1opBT5gZuSplDX2rd34GjubOF4Mk-GosA
To: /content/bert-large-nli-stsb-mean-tokens.csv
42.0MB [00:00, 159MB/s]
Downloading...
From: https://drive.google.com/uc?id=1D8kjCsLi1JteJSGNqwF7sxu13MpPmcn-
To: /content/RevisoesSoftware.json
4.48MB [00:00, 135MB/s]


In [None]:
import pandas as pd
import json
import csv
import numpy as np

# Load the reviews and their labels.
with open('RevisoesSoftware.json', 'r') as f:
  data = json.load(f)

df_complete = pd.DataFrame(data)

# Load the precomputed embeddings: one document per row, one float per column.
with open('bert-large-nli-stsb-mean-tokens.csv', 'r') as arq:
  reader = csv.reader(arq)
  mat_embedding = [[float(i) for i in doc] for doc in reader]

col = len(mat_embedding[0])
df_bert = pd.DataFrame(mat_embedding, columns=range(col))

# Attach each review's label to its embedding (rows are assumed to be in the same order).
df_bert['class'] = df_complete['label']
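
The embedding-loading code above can also be written with `pandas.read_csv`. The following is a minimal sketch, assuming the embedding file has no header row and holds one document per line. The tiny in-memory CSV and the labels are made up for illustration and stand in for the real files.

```python
import io
import pandas as pd

# Hypothetical stand-in for 'bert-large-nli-stsb-mean-tokens.csv':
# three documents, four embedding dimensions, no header row.
csv_text = "0.1,0.2,0.3,0.4\n0.5,0.6,0.7,0.8\n0.9,1.0,1.1,1.2\n"

# Hypothetical labels, standing in for df_complete['label'].
labels = pd.Series(['Rating', 'Bug', 'Rating'])

# header=None keeps the first line as data and names the columns 0..n-1.
df_demo = pd.read_csv(io.StringIO(csv_text), header=None, dtype=float)
df_demo['class'] = labels

print(df_demo.shape)
```

With the real file, the `io.StringIO(...)` argument would be replaced by the CSV file name.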

# Train the Model



Import the model you want to use. Here, we use OCSVM as an example.

In [104]:
from sklearn.svm import OneClassSVM as OCSVM
clf = OCSVM()
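
Changing the model can be as small as changing the import and the constructor. The sketch below is an illustration only: it uses synthetic data and `IsolationForest`, which is not the model used in this notebook but follows the same scikit-learn convention of fitting on one class and predicting +1 (inlier) or -1 (outlier). The names `X_demo`, `candidates`, and `demo_preds` are hypothetical.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.svm import OneClassSVM

# Synthetic data for illustration only (not the article's dataset).
rng = np.random.RandomState(42)
X_demo = rng.normal(size=(100, 8))

# Both estimators are fitted on examples of a single class and
# predict +1 (inlier) or -1 (outlier).
candidates = {
    'OCSVM': OneClassSVM(),
    'IsolationForest': IsolationForest(random_state=42),
}

demo_preds = {}
for name, model in candidates.items():
    model.fit(X_demo)
    demo_preds[name] = model.predict(X_demo)
    print(name, sorted(set(demo_preds[name].tolist())))
```

Because both models share the `fit` and `predict` interface, the rest of the notebook would not need to change.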

First, define the class of interest. Then build the training set from examples of that class only, and a test set that contains examples of both the class of interest and the other classes. *test_size* sets the fraction of class-of-interest examples held out for testing; the training set gets the remaining 1 - *test_size*.

In [105]:
from sklearn.model_selection import train_test_split

class_interest = 'Rating'

df_train_interest, df_test_interest = train_test_split(df_bert.iloc[:,0:1024][df_bert['class'] == class_interest],test_size=0.25, random_state=42)
df_test_outliers = df_bert.iloc[:,0:1024][df_bert['class'] != class_interest]
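
To make the resulting set sizes concrete, here is a small sketch of the same split on synthetic data. The frame `df_demo` and the label `'Other'` are made up for illustration. The sketch selects the feature columns with `drop(columns='class')` instead of a fixed column range, which is one way to avoid hard-coding the embedding width.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for df_bert: 4 embedding columns plus a 'class' column.
rng = np.random.RandomState(0)
df_demo = pd.DataFrame(rng.normal(size=(20, 4)))
df_demo['class'] = ['Rating'] * 12 + ['Other'] * 8

class_interest = 'Rating'
features = df_demo.drop(columns='class')
is_interest = df_demo['class'] == class_interest

# 25% of the class-of-interest rows are held out for testing.
train_int, test_int = train_test_split(
    features[is_interest], test_size=0.25, random_state=42)
# Every row from the other classes is used only at test time.
test_out = features[~is_interest]

print(len(train_int), len(test_int), len(test_out))
```

With 12 class-of-interest rows and `test_size=0.25`, 9 rows go to training and 3 to testing, while all 8 rows of the other class go to the test set.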

Train the model on the class-of-interest examples:

In [106]:
clf.fit(df_train_interest)

OneClassSVM(cache_size=200, coef0=0.0, degree=3, gamma='scale', kernel='rbf',
            max_iter=-1, nu=0.5, shrinking=True, tol=0.001, verbose=False)

Save the model:

In [110]:
import pickle

pkl_filename = "pickle_OCSVM_BERT.pkl"
with open(pkl_filename, 'wb') as file:
    pickle.dump(clf, file)

To load the model, use:

with open(pkl_filename, 'rb') as file:
    clf = pickle.load(file)
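
A quick way to check that saving and loading worked is to compare predictions before and after the round trip. The sketch below assumes synthetic data and uses an in-memory buffer in place of the `.pkl` file; `clf_demo` and `clf_restored` are hypothetical names.

```python
import io
import pickle
import numpy as np
from sklearn.svm import OneClassSVM

# Synthetic data for illustration only.
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(50, 4))
clf_demo = OneClassSVM().fit(X_demo)

# Serialize to an in-memory buffer instead of a file on disk.
buffer = io.BytesIO()
pickle.dump(clf_demo, buffer)
buffer.seek(0)
clf_restored = pickle.load(buffer)

# The restored model should reproduce the original predictions.
same = np.array_equal(clf_demo.predict(X_demo), clf_restored.predict(X_demo))
print(same)
```

Pickle files should only be loaded from trusted sources, and a pickled model is best loaded with the same scikit-learn version that saved it.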

# Test the Model

In [107]:
y_pred_int = clf.predict(df_test_interest)
y_pred_out = clf.predict(df_test_outliers)

In [108]:
from sklearn.metrics import classification_report

def evaluation_one_class(preds_interest, preds_outliers):
  y_true = [1]*len(preds_interest) + [-1]*len(preds_outliers)
  y_pred = list(preds_interest)+list(preds_outliers)
  return classification_report(y_true, y_pred, output_dict=False)

In [109]:
print(evaluation_one_class(y_pred_int, y_pred_out))

              precision    recall  f1-score   support

          -1       0.75      0.75      0.75      1229
           1       0.50      0.50      0.50       616

    accuracy                           0.66      1845
   macro avg       0.62      0.62      0.62      1845
weighted avg       0.66      0.66      0.66      1845
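
To see how the rows of the report are derived, the sketch below builds `y_true` and `y_pred` in the same way as `evaluation_one_class`, using a handful of made-up predictions rather than the model's real output, and computes precision and recall for the class of interest (label 1).

```python
from sklearn.metrics import precision_score, recall_score

# Made-up predictions for illustration (not the model's real output).
preds_interest = [1, 1, -1, 1]            # true class: interest (+1)
preds_outliers = [-1, -1, 1, -1, -1, -1]  # true class: outlier (-1)

y_true = [1] * len(preds_interest) + [-1] * len(preds_outliers)
y_pred = preds_interest + preds_outliers

# 3 true positives, 1 false positive, 1 false negative for label 1.
precision_int = precision_score(y_true, y_pred, pos_label=1)
recall_int = recall_score(y_true, y_pred, pos_label=1)
print(precision_int, recall_int)
```

Here three of the four examples predicted as 1 are truly from the class of interest (precision 0.75), and three of the four true class-of-interest examples are recovered (recall 0.75).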

