# Collect the Dataset

This example uses the dataset from the article. To train a model on your own data, load your dataset instead. To use the article's dataset with a different model, you only need to replace the model (MultinomialNB).

In [None]:
!gdown --id 1LaU3V8NtKWI3cB-12PjObRMHeYHeqIRL
!gdown --id 1D8kjCsLi1JteJSGNqwF7sxu13MpPmcn-

Downloading...
From: https://drive.google.com/uc?id=1LaU3V8NtKWI3cB-12PjObRMHeYHeqIRL
To: /content/roberta-large-nli-stsb-mean-tokens.csv
41.6MB [00:01, 34.2MB/s]
Downloading...
From: https://drive.google.com/uc?id=1D8kjCsLi1JteJSGNqwF7sxu13MpPmcn-
To: /content/RevisoesSoftware.json
4.48MB [00:00, 16.8MB/s]


In [None]:
import pandas as pd
import json
import csv
import numpy as np

with open('RevisoesSoftware.json', 'r') as f:
  data = json.load(f)

df_complete = pd.DataFrame(data)
 
# Load the precomputed RoBERTa embeddings, one document per CSV row.
with open('roberta-large-nli-stsb-mean-tokens.csv', 'r') as arq:
  reader = csv.reader(arq)
  mat_embedding = [[float(i) for i in doc] for doc in reader]

# Shift every value by the absolute global minimum so that no feature is negative.
# MultinomialNB needs this (other models don't).
mat_embedding = np.abs(np.min(mat_embedding)) + np.asarray(mat_embedding)

col = mat_embedding.shape[1]
df_roberta = pd.DataFrame(mat_embedding, columns=range(col))

# Attach the label of each review as the target column.
df_roberta['class'] = df_complete['label']
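
The shift applied to `mat_embedding` is there because MultinomialNB expects non-negative feature values. A minimal sketch on a toy matrix (the values are illustrative and not taken from the dataset) shows what the shift does:

```python
import numpy as np

# Toy embedding matrix with negative entries (illustrative values only).
toy = np.array([[-0.5, 0.2], [0.1, -0.3]])

# Same operation as in the cell above: add the absolute value of the global minimum.
shifted = np.abs(np.min(toy)) + toy

print(shifted.min())  # the smallest entry is now 0.0
```

Every entry moves by the same constant, so the differences between values are preserved.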

# Train the Model



Import the model you want to use. Here we use MultinomialNB as an example because it obtained the best F1-score.

In [None]:
from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB()
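
To try a different classifier, only this cell needs to change, since scikit-learn estimators share the same `fit`/`predict` interface. As a sketch, assuming `LogisticRegression` as the replacement (an arbitrary choice for illustration, not a model taken from the article):

```python
from sklearn.linear_model import LogisticRegression

# Hypothetical alternative to MultinomialNB, chosen only for illustration.
# max_iter is raised from the default as a precaution against convergence warnings.
clf_alt = LogisticRegression(max_iter=1000)
```

Assigning the estimator to `clf` instead of the hypothetical name `clf_alt` would make the remaining cells use it unchanged.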

First, define the training and test sets. *test_size* sets the fraction of examples assigned to the test set; the training set receives the remaining 1 - *test_size*.

In [None]:
from sklearn.model_selection import train_test_split

# Columns 0-1023 hold the embedding features; 'class' holds the label.
df_train, df_test, df_train_class, df_test_class = train_test_split(
    df_roberta.iloc[:, 0:1024],
    df_roberta['class'],
    test_size=0.25,
    random_state=42,
)
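
As a quick check of how *test_size* divides the data, the sketch below splits a toy frame of 100 rows (made-up data). The `stratify` argument is an optional addition that the cell above does not use; it keeps the class proportions the same in both sets:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data: 100 rows, 80 of class "A" and 20 of class "B" (illustrative only).
toy_X = pd.DataFrame({"f0": range(100), "f1": range(100, 200)})
toy_y = pd.Series(["A"] * 80 + ["B"] * 20)

X_tr, X_te, y_tr, y_te = train_test_split(
    toy_X, toy_y, test_size=0.25, random_state=42, stratify=toy_y
)

print(len(X_tr), len(X_te))  # 75 25
```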

Train the model:

In [None]:
clf.fit(df_train,df_train_class)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

Save the model:

In [None]:
import pickle

pkl_filename = "pickle_MNB.pkl"
with open(pkl_filename, 'wb') as file:
    pickle.dump(clf, file)

To load the model later, use:

with open(pkl_filename, 'rb') as file:
    clf = pickle.load(file)
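
One way to confirm that a saved model was restored correctly is to compare predictions before and after the round trip. The sketch below uses a toy model and an in-memory byte string instead of a file, both assumptions made to keep it self-contained:

```python
import pickle

import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Toy training data (illustrative only).
X_toy = np.array([[1, 0], [0, 1], [2, 0], [0, 3]])
y_toy = np.array(["A", "B", "A", "B"])
model = MultinomialNB().fit(X_toy, y_toy)

# Serialize and restore the fitted model in memory.
restored = pickle.loads(pickle.dumps(model))

# The restored model should predict exactly what the original predicts.
same = np.array_equal(model.predict(X_toy), restored.predict(X_toy))
print(same)
```

Pickle files should only be loaded from sources you trust, since unpickling can execute arbitrary code.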

# Test the Model

In [None]:
y_pred = clf.predict(df_test)

In [76]:
from sklearn.metrics import classification_report

print(classification_report(df_test_class, y_pred, output_dict=False))

                precision    recall  f1-score   support

           Bug       0.52      0.66      0.58       109
       Feature       0.25      0.31      0.27        58
        Rating       0.87      0.75      0.81       612
UserExperience       0.40      0.52      0.45       144

      accuracy                           0.67       923
     macro avg       0.51      0.56      0.53       923
  weighted avg       0.72      0.67      0.69       923
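
If the metrics are needed programmatically, for instance to compare several models, `classification_report` can return a dictionary when called with `output_dict=True`. A minimal sketch on made-up labels (not the results above):

```python
from sklearn.metrics import classification_report

# Made-up labels for illustration only.
y_true = ["Bug", "Bug", "Rating", "Rating"]
y_hat = ["Bug", "Rating", "Rating", "Rating"]

report = classification_report(y_true, y_hat, output_dict=True)

print(report["Bug"]["recall"])  # 0.5: one of the two true "Bug" examples was found
print(report["accuracy"])       # 0.75: three of four predictions are correct
```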

