# Collect the Dataset

As an example, we use the dataset from the article. To train a model on your own data, load your dataset here instead. To try the article's dataset with a different model, you only need to change the model.

In [1]:
!wget https://raw.githubusercontent.com/adailtonaraujo/app_review_analysis/master/Classification/Dataset/RevisoesSoftware.json

--2021-04-23 14:12:09--  https://raw.githubusercontent.com/adailtonaraujo/app_review_analysis/master/Classification/Dataset/RevisoesSoftware.json
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.108.133, 185.199.111.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4475705 (4.3M) [text/plain]
Saving to: ‘RevisoesSoftware.json’


2021-04-23 14:12:09 (23.9 MB/s) - ‘RevisoesSoftware.json’ saved [4475705/4475705]



In [2]:
import pandas as pd
import json

with open('RevisoesSoftware.json', 'r') as f:
  data = json.load(f)

df_complete = pd.DataFrame(data)
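
Before going further, it can help to confirm that each record has the two fields used later in this notebook (`comment` and `label`) and to see how the classes are distributed. The sketch below does this with the standard library on three made-up records. The sample records are hypothetical stand-ins for the contents of `RevisoesSoftware.json`, and the sketch assumes the file holds a list of objects.

```python
import json
from collections import Counter

# Hypothetical records standing in for the real file contents.
sample_json = '''[
  {"comment": "the app crashes on start", "label": "Bug"},
  {"comment": "great app", "label": "Rating"},
  {"comment": "love it", "label": "Rating"}
]'''

records = json.loads(sample_json)

# Every record should carry the two fields the notebook relies on.
missing = [r for r in records if "comment" not in r or "label" not in r]

# Count how many examples each class has.
label_counts = Counter(r["label"] for r in records)

print(len(missing))
print(label_counts.most_common())
```

With the real file, replace `json.loads(sample_json)` with the `json.load(f)` call from the cell above.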

# Word-Embeddings

## Install

In [3]:
!pip install -U sentence-transformers


## Import

In [4]:
from sentence_transformers import SentenceTransformer
import numpy as np

## Variations of word embeddings and how to use them

In [23]:
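def WordEmbeddings(texts, model):
  # Replace tabs and line breaks with spaces when the input is a pandas Series;
  # other iterables (e.g. a list of strings) are passed through unchanged.
  if isinstance(texts, pd.Series):
    sentences = texts.replace(['\t','\n','\r'], [' ',' ',' '], regex=True)
  else:
    sentences = texts

  # Encode every sentence into a fixed-size vector (one row per sentence).
  sentence_embeddings = model.encode(list(sentences))

  return sentence_embeddings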
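
Note that `WordEmbeddings` only cleans tabs and line breaks when it receives a pandas Series; a plain list is encoded as-is. A minimal sketch of the same cleanup for a list of strings follows. The helper name `normalize_whitespace` is hypothetical and not part of the original notebook.

```python
import re

def normalize_whitespace(texts):
    # Replace each tab, newline, or carriage return with a single space,
    # mirroring the Series.replace call in WordEmbeddings.
    return [re.sub(r"[\t\n\r]", " ", t) for t in texts]

cleaned = normalize_whitespace(["crashes\ton start", "love\r\nit"])
print(cleaned)
```

Each control character becomes one space, so a `\r\n` pair turns into two spaces, just as it would with the Series version.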

In [6]:
dic_word_emb = {
    'BERT' : SentenceTransformer('bert-large-nli-stsb-mean-tokens'),
    'RoBERTa' : SentenceTransformer('roberta-large-nli-stsb-mean-tokens'),
    'DistilBERT' : SentenceTransformer('distilbert-base-nli-stsb-mean-tokens'),
    'DistilBERT ML' : SentenceTransformer('distiluse-base-multilingual-cased')
}
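
The dictionary above loads all four models as soon as the cell runs, which can take a while and use a lot of memory. If you only need one of them, a lazy variant is possible. The sketch below is an assumption, not the article's method: `MODEL_NAMES` and `get_model` are hypothetical names, and the helper loads a model the first time it is requested and caches it afterwards.

```python
from functools import lru_cache

# Same model identifiers as in dic_word_emb.
MODEL_NAMES = {
    'BERT': 'bert-large-nli-stsb-mean-tokens',
    'RoBERTa': 'roberta-large-nli-stsb-mean-tokens',
    'DistilBERT': 'distilbert-base-nli-stsb-mean-tokens',
    'DistilBERT ML': 'distiluse-base-multilingual-cased',
}

@lru_cache(maxsize=None)
def get_model(key):
    # Imported here so that nothing is loaded until a model is requested.
    from sentence_transformers import SentenceTransformer
    return SentenceTransformer(MODEL_NAMES[key])
```

With this in place, `get_model('RoBERTa')` could be used wherever the notebook uses `dic_word_emb['RoBERTa']`.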





# Functions to train the Model



## Import models

In [14]:
from scipy.spatial.distance import cosine
from sklearn.neighbors import KNeighborsClassifier as KNN
from sklearn.neural_network import MLPClassifier as MLP
from sklearn.naive_bayes import GaussianNB as NB
from sklearn.naive_bayes import MultinomialNB as MNB
from sklearn.svm import SVC as SVM

If you use KNN, the cosine metric is a good choice because it works well for text data.

In [15]:
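def cosseno(x,y):
  # Cosine distance; the result can be NaN (e.g. when a vector has zero norm),
  # so treat that case as a distance of 1.
  dist = cosine(x,y)
  if np.isnan(dist):
    return 1
  return dist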
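
To see why the NaN check is needed: cosine distance is 1 minus the dot product divided by the product of the two norms, so a zero vector leads to a division by zero. The pure-Python sketch below illustrates the formula and the guard. `cosine_distance` is a hypothetical name, and the code is an illustration rather than a replacement for `scipy.spatial.distance.cosine`.

```python
import math

def cosine_distance(x, y):
    # Dot product and Euclidean norms of the two vectors.
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    # The distance is undefined for a zero vector; fall back to 1,
    # the same value cosseno returns for NaN.
    if norm_x == 0 or norm_y == 0:
        return 1.0
    return 1.0 - dot / (norm_x * norm_y)

same_direction = cosine_distance([1, 0], [2, 0])
orthogonal = cosine_distance([1, 0], [0, 1])
with_zero_vector = cosine_distance([0, 0], [1, 1])
print(same_direction, orthogonal, with_zero_vector)
```

Vectors pointing the same way get distance 0 regardless of length, which is why this metric suits embeddings whose magnitude carries little meaning.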

## Algorithm Variations

You can change each algorithm's parameters here.

In [16]:
algs = {
    "KNN" : KNN(metric=cosseno),
    "MLP" : MLP(),
    "NB" : NB(),
    "MNB" : MNB(alpha=0.4, fit_prior=False),
    "SVM" : SVM()
}

## Define the algorithm that you will use

In [17]:
clf = algs['MNB']

## Train-Test division

First, define the training and test sets. *test_size* sets the fraction of examples placed in the test set; the training set therefore gets the remaining 1 - *test_size*.

In [18]:
from sklearn.model_selection import train_test_split

df_train, df_test, y_train_class, y_test_class = train_test_split(df_complete['comment'], df_complete['label'], test_size=0.25, random_state=42)
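
As a quick illustration of the arithmetic, the sketch below computes the split sizes for a hypothetical dataset of 1,000 examples. It assumes that the test count is rounded up when *test_size* is a fraction; treat that rounding detail as an assumption about scikit-learn rather than something stated in the article. `split_sizes` is a hypothetical helper.

```python
import math

def split_sizes(n_samples, test_size):
    # Test set: the requested fraction of the data, rounded up (assumed).
    n_test = math.ceil(test_size * n_samples)
    # Training set: everything that is left.
    n_train = n_samples - n_test
    return n_train, n_test

sizes = split_sizes(1000, 0.25)
print(sizes)
```

With `test_size=0.25`, three quarters of the examples are used for training and one quarter for testing.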

# Execution

## Pre-processing

In [None]:
x_train = WordEmbeddings(df_train, dic_word_emb['RoBERTa'])

# Shift the values so they are non-negative. Use this only with algorithms
# that do not accept negative input values (e.g. MultinomialNB).
x_train = np.abs(np.min(x_train)) + x_train

x_test = WordEmbeddings(df_test, dic_word_emb['RoBERTa'])

# Same shift for the test set, computed from the test set's own minimum.
x_test = np.abs(np.min(x_test)) + x_test
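
The cell above shifts the training and test embeddings by different amounts, because each uses its own minimum. An alternative, shown here as a sketch rather than the article's method, is to compute the offset once from the training data and reuse it for the test set and for any new text. The arrays are small hypothetical stand-ins for real embeddings, and `offset` is a name introduced for illustration.

```python
import numpy as np

# Hypothetical stand-ins for the embedding matrices.
x_train = np.array([[-0.5, 0.2], [0.1, -0.3]])
x_test = np.array([[-0.7, 0.4]])

# Compute the shift once, from the training data only.
offset = np.abs(np.min(x_train))

x_train_shifted = x_train + offset

# Test values can fall below the training minimum, so clip at zero
# to keep every feature non-negative.
x_test_shifted = np.clip(x_test + offset, 0, None)

print(x_train_shifted.min(), x_test_shifted.min())
```

Using one offset keeps the feature scale identical at training and prediction time. Note that this changes the features slightly, so the scores reported in the Test section would not be reproduced exactly.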

## Train

In [25]:
clf.fit(x_train,y_train_class)

MultinomialNB(alpha=0.4, class_prior=None, fit_prior=False)

### Saving the model

In [None]:
import pickle

pkl_filename = "pickle_MNB_RoBERTa.pkl"
with open(pkl_filename, 'wb') as file:
    pickle.dump(clf, file) 

To load the model later, use:

with open(pkl_filename, 'rb') as file:
    clf = pickle.load(file)
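
A quick way to confirm that a saved model can be restored is a save-and-load round trip. The sketch below uses a plain dictionary as a hypothetical stand-in for `clf` and a temporary directory so that it runs anywhere; with the real classifier you could instead compare predictions before and after loading. Only unpickle files you trust, since loading a pickle can execute arbitrary code.

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for the trained classifier.
stand_in_model = {"name": "MNB", "alpha": 0.4}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "pickle_MNB_RoBERTa.pkl")

    # Save the object to disk.
    with open(path, "wb") as file:
        pickle.dump(stand_in_model, file)

    # Load it back and keep the restored copy.
    with open(path, "rb") as file:
        restored = pickle.load(file)

print(restored == stand_in_model)
```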

## Test

In [26]:
y_pred = clf.predict(x_test)

In [27]:
from sklearn.metrics import classification_report

print(classification_report(y_test_class, y_pred, output_dict=False))

                precision    recall  f1-score   support

           Bug       0.53      0.66      0.59       109
       Feature       0.24      0.41      0.30        58
        Rating       0.91      0.64      0.75       612
UserExperience       0.36      0.63      0.46       144

      accuracy                           0.63       923
     macro avg       0.51      0.59      0.52       923
  weighted avg       0.73      0.63      0.66       923
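
The rows of the report are related by simple formulas: F1 is the harmonic mean of precision and recall, the macro average is the unweighted mean over classes, and the weighted average weights each class by its support. The sketch below recomputes a few values from the rounded numbers printed above. Because the inputs are already rounded, values recomputed this way can differ from the report in the last digit.

```python
# (precision, recall, support) per class, copied from the report above.
report = {
    "Bug": (0.53, 0.66, 109),
    "Feature": (0.24, 0.41, 58),
    "Rating": (0.91, 0.64, 612),
    "UserExperience": (0.36, 0.63, 144),
}

def f1(precision, recall):
    # Harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)

f1_scores = {label: round(f1(p, r), 2) for label, (p, r, _) in report.items()}

total_support = sum(s for _, _, s in report.values())

# Macro average: every class counts equally.
macro_precision = round(sum(p for p, _, _ in report.values()) / len(report), 2)

# Weighted average: classes count in proportion to their support.
weighted_recall = round(
    sum(r * s for _, r, s in report.values()) / total_support, 2
)

print(f1_scores, total_support, macro_precision, weighted_recall)
```

The support-weighted recall equals the overall accuracy, which is why both appear as 0.63 in the report. The gap between the macro and weighted averages reflects how strongly the `Rating` class, with 612 of the 923 test examples, dominates the weighted figures.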



# Case Study

In [28]:
texts = ['the app always crashes !!!!!!!!!!', 'I loved this app!!']

In [31]:
def Classification(text):
  embeddings_test = WordEmbeddings([text], dic_word_emb['RoBERTa'])
  resp = clf.predict(embeddings_test)
  print('The text: "' + text + '" belongs to the '+ str(resp[0]).upper() +' class' ) 

In [32]:
for text in texts:
  Classification(text)

The text: "the app always crashes !!!!!!!!!!" belongs to the BUG class
The text: "I loved this app!!" belongs to the RATING class
