# Collect the Dataset

This example uses the dataset from the article. To train a model on your own data, replace it with your dataset. To use the article's dataset with a different model, you only need to change the model.

In [1]:
!wget https://raw.githubusercontent.com/adailtonaraujo/app_review_analysis/master/Classification/Dataset/RevisoesSoftware.json

--2021-04-23 14:35:03--  https://raw.githubusercontent.com/adailtonaraujo/app_review_analysis/master/Classification/Dataset/RevisoesSoftware.json
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.108.133, 185.199.109.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4475705 (4.3M) [text/plain]
Saving to: ‘RevisoesSoftware.json’


2021-04-23 14:35:04 (22.8 MB/s) - ‘RevisoesSoftware.json’ saved [4475705/4475705]



In [2]:
import pandas as pd
import json

with open('RevisoesSoftware.json', 'r') as f:
  data = json.load(f)

df_complete = pd.DataFrame(data)
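
Before going further, it can help to confirm that the loaded frame has the two columns used later in this notebook (`comment` and `label`) and to see how many examples each label has. The sketch below is only an illustration: it runs on a tiny hypothetical frame rather than on `RevisoesSoftware.json`, the helper name `summarize_labels` is made up for this example, and the labels other than `Rating` are assumptions.

```python
import pandas as pd

# Hypothetical records that mimic the two fields used later in this notebook.
# Only the 'Rating' label is taken from the notebook; the others are invented.
toy_records = [
    {"comment": "Great app, five stars", "label": "Rating"},
    {"comment": "Crashes when I open the camera", "label": "Bug"},
    {"comment": "Please add a dark mode", "label": "Feature"},
    {"comment": "Love it", "label": "Rating"},
]
toy_df = pd.DataFrame(toy_records)


def summarize_labels(df, text_col="comment", label_col="label"):
    """Check that the expected columns exist and count examples per label."""
    missing = [c for c in (text_col, label_col) if c not in df.columns]
    if missing:
        raise KeyError(f"missing columns: {missing}")
    return df[label_col].value_counts().to_dict()


label_counts = summarize_labels(toy_df)
print(label_counts)
```

In the notebook itself, the same helper could be called as `summarize_labels(df_complete)`.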

# Word-Embeddings

## Install

In [3]:
!pip install -U sentence-transformers

Collecting sentence-transformers
  Downloading https://files.pythonhosted.org/packages/c4/87/49dc49e13ac107ce912c2f3f3fd92252c6d4221e88d1e6c16747044a11d8/sentence-transformers-1.1.0.tar.gz (78kB)
Collecting transformers<5.0.0,>=3.1.0
  Downloading https://files.pythonhosted.org/packages/d8/b2/57495b5309f09fa501866e225c84532d1fd89536ea62406b2181933fb418/transformers-4.5.1-py3-none-any.whl (2.1MB)
Collecting sentencepiece
  Downloading https://files.pythonhosted.org/packages/f5/99/e0808cb947ba10f575839c43e8fafc9cc44e4a7a2c8f79c60db48220a577/sentencepiece-0.1.95-cp37-cp37m-manylinux2014_x86_64.whl (1.2MB)
Collecting sacremoses
  Downloading https://files.pythonhosted.org/packages/75/ee/67241dc87f266093c533a2d4d3d69438e57d7a90abb216fa076e7d475d4a/sacremoses-0.0.45-py3-none-any.whl (895kB)


## Import

In [4]:
from sentence_transformers import SentenceTransformer
import numpy as np

## Variations of word embeddings and how to use them

In [5]:
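def WordEmbeddings(texts, model):
  # For a pandas Series, replace tabs and line breaks with spaces first.
  if isinstance(texts, pd.Series):
    sentences = texts.replace(['\t','\n','\r'], [' ',' ',' '], regex=True)
  else:
    # Any other iterable of strings is passed through unchanged.
    sentences = texts

  # Encode every text into one fixed-size sentence embedding.
  sentence_embeddings = model.encode(list(sentences))

  return sentence_embeddings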
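
Note that `WordEmbeddings` only replaces tabs and line breaks when it receives a pandas Series; a plain list is encoded as-is. If list inputs may also contain line breaks, a small helper like the one below could be applied first. This is a sketch, and `normalize_whitespace` is a hypothetical name that does not exist elsewhere in the notebook.

```python
import re

# Matches the same three characters that WordEmbeddings replaces in a Series.
_WHITESPACE = re.compile(r"[\t\n\r]")


def normalize_whitespace(texts):
    """Replace tabs and line breaks with spaces in every string of an iterable."""
    return [_WHITESPACE.sub(" ", str(t)) for t in texts]


cleaned = normalize_whitespace(["first line\nsecond line", "tab\tseparated"])
print(cleaned)
```

The cleaned list can then be passed to `WordEmbeddings` in place of the raw texts.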

In [6]:
dic_word_emb = {
    'BERT' : SentenceTransformer('bert-large-nli-stsb-mean-tokens'),
    'RoBERTa' : SentenceTransformer('roberta-large-nli-stsb-mean-tokens'),
    'DistilBERT' : SentenceTransformer('distilbert-base-nli-stsb-mean-tokens'),
    'DistilBERT ML' : SentenceTransformer('distiluse-base-multilingual-cased')
}
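
The dictionary above instantiates all four models as soon as the cell runs, even though the rest of the notebook only uses `'BERT'`. One possible alternative, sketched below, is to create each model on first access and cache it. `LazyModelRegistry` and `fake_loader` are hypothetical names introduced for this illustration; in the notebook the `loader` argument would presumably be `SentenceTransformer`.

```python
class LazyModelRegistry:
    """Create each model only on first access and cache it afterwards."""

    def __init__(self, model_names, loader):
        self._model_names = dict(model_names)  # key -> model identifier
        self._loader = loader                  # callable: identifier -> model
        self._cache = {}

    def __getitem__(self, key):
        # Load the model the first time this key is requested.
        if key not in self._cache:
            self._cache[key] = self._loader(self._model_names[key])
        return self._cache[key]


# Stand-in loader so the sketch runs without downloading anything.
loaded = []


def fake_loader(name):
    loaded.append(name)
    return f"<model {name}>"


registry = LazyModelRegistry(
    {
        "BERT": "bert-large-nli-stsb-mean-tokens",
        "DistilBERT": "distilbert-base-nli-stsb-mean-tokens",
    },
    loader=fake_loader,
)
model = registry["BERT"]        # triggers the loader once
model_again = registry["BERT"]  # served from the cache
```

Because the registry supports `registry['BERT']`, it could stand in for `dic_word_emb` without changing the cells that index into it.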





# Functions to train the Model



## Import models

In [7]:
from sklearn.svm import OneClassSVM as OCSVM

## Define the algorithm that you will use

In [8]:
clf = OCSVM()

## Train-Test division

First, define the class of interest. Then build the training set from examples of the class of interest only, and a test set that contains examples from both the class of interest and the other classes. *test_size* defines the fraction of class-of-interest examples held out for the test set; consequently, the training set receives the remaining 1 - *test_size*.

In [9]:
from sklearn.model_selection import train_test_split

class_interest = 'Rating'

df_train_interest, df_test_interest = train_test_split(df_complete['comment'][df_complete['label'] == class_interest],test_size=0.25, random_state=42)
df_test_outliers = df_complete['comment'][df_complete['label'] != class_interest]

# Execution

## Pre-processing

In [10]:
x_train = WordEmbeddings(df_train_interest, dic_word_emb['BERT']) 

x_test_interest =  WordEmbeddings(df_test_interest,dic_word_emb['BERT']) 

x_test_outlier = WordEmbeddings(df_test_outliers,dic_word_emb['BERT']) 

## Train

In [12]:
clf.fit(x_train)

OneClassSVM(cache_size=200, coef0=0.0, degree=3, gamma='scale', kernel='rbf',
            max_iter=-1, nu=0.5, shrinking=True, tol=0.001, verbose=False)
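
The fitted model above uses the default `nu=0.5`. According to the scikit-learn documentation, `nu` is an upper bound on the fraction of training errors and a lower bound on the fraction of support vectors, so the default may reject a large share of the class of interest. The sketch below illustrates the effect on synthetic data only; `x_toy` and `rejected_fraction` are hypothetical names, and the values of `nu` are arbitrary choices rather than recommendations.

```python
import numpy as np
from sklearn.svm import OneClassSVM

# Synthetic stand-in for the embedding matrix (300 points, 5 dimensions).
rng = np.random.default_rng(0)
x_toy = rng.normal(size=(300, 5))


def rejected_fraction(nu, x):
    """Fit a one-class SVM and return the share of training points labeled -1."""
    model = OneClassSVM(nu=nu, gamma="scale").fit(x)
    return float(np.mean(model.predict(x) == -1))


frac_default = rejected_fraction(0.5, x_toy)  # same nu as the model above
frac_small = rejected_fraction(0.1, x_toy)    # a smaller, arbitrary nu
print(frac_default, frac_small)
```

A smaller `nu` usually keeps more training points inside the learned boundary, which tends to trade precision for recall on the class of interest. Whether that helps on this dataset would need to be checked on the real embeddings.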

### Saving the model

In [None]:
import pickle

pkl_filename = "pickle_OCSVM_BERT.pkl"
with open(pkl_filename, 'wb') as file:
    pickle.dump(clf, file) 

If you want to load the model, use:

with open(pkl_filename, 'rb') as file:
    clf = pickle.load(file)

## Test

In [13]:
y_pred_int = clf.predict(x_test_interest)
y_pred_out = clf.predict(x_test_outlier)

In [14]:
from sklearn.metrics import classification_report

def evaluation_one_class(preds_interest, preds_outliers):
  y_true = [1]*len(preds_interest) + [-1]*len(preds_outliers)
  y_pred = list(preds_interest)+list(preds_outliers)
  return classification_report(y_true, y_pred, output_dict=False)

In [15]:
print(evaluation_one_class(y_pred_int, y_pred_out))

              precision    recall  f1-score   support

          -1       0.75      0.75      0.75      1229
           1       0.50      0.50      0.50       616

    accuracy                           0.66      1845
   macro avg       0.62      0.62      0.62      1845
weighted avg       0.66      0.66      0.66      1845
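
To make the report easier to interpret, the sketch below computes precision and recall for the class of interest by hand, using the same convention as `evaluation_one_class`: every prediction on the interest set has true label 1, and every prediction on the outlier set has true label -1. The function name `interest_precision_recall` and the toy predictions are hypothetical and are not results from this notebook.

```python
def interest_precision_recall(preds_interest, preds_outliers):
    """Precision and recall for label 1, given predictions on both test sets."""
    # Interest examples predicted as 1 are true positives, the rest are misses.
    tp = sum(1 for p in preds_interest if p == 1)
    fn = len(preds_interest) - tp
    # Outlier examples predicted as 1 are false positives.
    fp = sum(1 for p in preds_outliers if p == 1)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Toy predictions: 4 interest examples and 6 outlier examples.
toy_interest = [1, 1, -1, 1]
toy_outliers = [-1, 1, -1, -1, -1, -1]
precision, recall = interest_precision_recall(toy_interest, toy_outliers)
print(precision, recall)
```

In the notebook, calling `interest_precision_recall(y_pred_int, y_pred_out)` should match the row for label `1` in the report above, up to rounding.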



# Case Study

In [16]:
texts = ['the app always crashes !!!!!!!!!!', 'I loved this app!!']

In [20]:
def Classification(text):
  # Embed the single text with the same model used for training.
  embeddings_test = WordEmbeddings([text], dic_word_emb['BERT'])
  # OneClassSVM.predict returns 1 for inliers and -1 for outliers.
  resp = clf.predict(embeddings_test)
  if resp[0] == 1:
    print('The text: "' + text + '" BELONGS to the class of interest!')
  else:
    print('The text: "' + text + '" DOES NOT belong to the class of interest!')

In [21]:
for text in texts:
  Classification(text)

The text: "the app always crashes !!!!!!!!!!" DOES NOT belong to the class of interest!
The text: "I loved this app!!" BELONGS to the class of interest!
