# Collect the Dataset

We use the dataset from the article as an example. To train a model on your own data, load your dataset instead. To try the article's dataset with a different model, you only need to change the model.

In [1]:
!gdown --id 1hkVuZ7SicPZpR9Pf8R-uSivhhC6TRB0X

Downloading...
From: https://drive.google.com/uc?id=1hkVuZ7SicPZpR9Pf8R-uSivhhC6TRB0X
To: /content/zomato_en_900k.json
1.04GB [00:07, 134MB/s]


In [2]:
import pandas as pd
import json

with open('zomato_en_900k.json') as f:
  data = json.load(f)

df = pd.DataFrame(data)
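
If you swap in your own data, the rest of this notebook only reads two columns: `text` and `thumbsUp`. The sketch below is one way to check that before going further. The `load_reviews` helper, the `REQUIRED_COLUMNS` name, and the two sample records are hypothetical and are not part of the original notebook.

```python
import pandas as pd

# Columns the later cells rely on (df.thumbsUp and df_utility.text).
REQUIRED_COLUMNS = {"text", "thumbsUp"}

def load_reviews(records):
    """Build a DataFrame from a list of dicts and check the expected columns."""
    frame = pd.DataFrame(records)
    missing = REQUIRED_COLUMNS - set(frame.columns)
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    return frame

# Hypothetical records standing in for your own dataset.
my_df = load_reviews([
    {"text": "Great app", "thumbsUp": 12},
    {"text": "Crashes on start", "thumbsUp": 3},
])
print(my_df.shape)
```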

# Pre-processing

## Utility

You must define a utility threshold; in this case we use 5. Furthermore, you must normalize the values.

In [3]:
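# Keep reviews with at least 5 thumbs up; .copy() lets us add a column later
# without triggering pandas' SettingWithCopyWarning
df_utility = df[df.thumbsUp >= 5].copy()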

## Normalize Utility



In [None]:
import numpy as np

# Largest thumbsUp count, used to scale every value into (0, 1]
maior = np.max(df_utility['thumbsUp'])

thumbs_up = list()

for num in list(df_utility['thumbsUp']):
  thumbs_up.append(num/maior)

# Discretize the scaled values into 100 equally spaced bins
bins = np.linspace(0, 1, 100)
thumbs_up_normalized = np.digitize(thumbs_up, bins)

df_utility['TU_normalized'] = thumbs_up_normalized
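
To see what the threshold and normalization do, here is a minimal vectorized sketch on toy data. The `toy` frame and its values are hypothetical; only the threshold of 5 and the 100 bins come from the cells above.

```python
import numpy as np
import pandas as pd

# Toy data standing in for the real reviews (hypothetical values).
toy = pd.DataFrame({"thumbsUp": [1, 5, 10, 50, 100]})

# Keep only rows at or above the utility threshold.
threshold = 5
toy_utility = toy[toy["thumbsUp"] >= threshold].copy()

# Scale by the maximum, then map each value to one of 100 bins.
scaled = toy_utility["thumbsUp"] / toy_utility["thumbsUp"].max()
bins = np.linspace(0, 1, 100)
toy_utility["TU_normalized"] = np.digitize(scaled, bins)

print(toy_utility)
```

Note that the resulting target is an integer bin label rather than a value in [0, 1], so the error metrics reported later are expressed in bin units.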

# Word Embeddings

## Install

In [None]:
!pip install -U sentence-transformers

## Imports

In [6]:
from sentence_transformers import SentenceTransformer
import numpy as np

## Variations of word embeddings and how to use them

In [7]:
def WordEmbeddings(texts, model):

  # Replace tabs and line breaks with spaces when given a pandas Series
  if isinstance(texts, pd.Series):
    sentences = texts.replace(['\t','\n','\r'], [' ',' ',' '], regex=True)
  else:
    sentences = texts

  # Encode every sentence into a fixed-size embedding vector
  sentence_embeddings = model.encode(list(sentences))

  return sentence_embeddings
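
The cleaning step inside `WordEmbeddings` can be checked on its own, without downloading a model. The sketch below applies the same `Series.replace` call to a few hypothetical strings.

```python
import pandas as pd

# Hypothetical review texts containing a line break and a tab.
raw = pd.Series(["line one\nline two", "tab\tseparated", "clean text"])

# Same replacement used in WordEmbeddings: control characters become spaces.
cleaned = raw.replace(['\t', '\n', '\r'], [' ', ' ', ' '], regex=True)

print(list(cleaned))
```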

In [8]:
dic_word_emb = {
    'BERT' : SentenceTransformer('bert-large-nli-stsb-mean-tokens'),
    'RoBERTa' : SentenceTransformer('roberta-large-nli-stsb-mean-tokens'),
    'DistilBERT' : SentenceTransformer('distilbert-base-nli-stsb-mean-tokens'),
    'DistilBERT ML' : SentenceTransformer('distiluse-base-multilingual-cased')
}





# Functions to train the model

## Import models

In [9]:
from scipy.spatial.distance import cosine
from sklearn.neighbors import KNeighborsRegressor as KNR # similar to KNN
from sklearn.svm import SVR # similar to SVM
from sklearn.neural_network import MLPRegressor as MLPR # similar to MLP
from sklearn.linear_model import BayesianRidge as NBR # similar to NB

If you use KNR, it is worth using the cosine metric, which works well for text data.

In [10]:
def cosseno(x,y):
  dist = cosine(x,y)
  # Fall back to a distance of 1 when the cosine distance is NaN
  if np.isnan(dist):
    return 1
  return dist
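
For intuition about the metric, here is a NumPy-only sketch of a cosine distance with the same fallback of 1. It is an illustrative reimplementation, not the SciPy function used above, and the name `cosine_distance` is hypothetical. It assumes the undefined case arises when one of the vectors has zero norm.

```python
import numpy as np

def cosine_distance(x, y):
    """Return 1 - cos(angle between x and y), or 1.0 if either norm is zero."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    denom = np.linalg.norm(x) * np.linalg.norm(y)
    if denom == 0.0:
        # The angle is undefined for a zero vector; use the same fallback of 1.
        return 1.0
    return 1.0 - float(np.dot(x, y) / denom)

print(cosine_distance([1, 0], [0, 1]))  # orthogonal vectors
print(cosine_distance([1, 2], [2, 4]))  # same direction
print(cosine_distance([0, 0], [1, 1]))  # zero vector, fallback
```

The distance depends only on direction, not magnitude, which is one reason it is commonly used with text embeddings.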

## Regressors Variation

In [11]:
rgrs = {
    "KNR" : KNR(metric=cosseno),
    "MLPR" : MLPR(),
    "NBR" : NBR(),
    "SVR" : SVR()
}

## Define the algorithm that you will use

In [12]:
rgs = rgrs['MLPR']
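
Because these regressors share scikit-learn's `fit`/`predict` interface, switching algorithms only means picking a different dictionary key. The sketch below compares three of them on synthetic data. The data, the 45/15 split, and the names `candidates` and `scores` are assumptions for illustration and say nothing about performance on the real dataset.

```python
import numpy as np
from sklearn.linear_model import BayesianRidge
from sklearn.metrics import mean_absolute_error
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR

# Synthetic stand-in for embeddings and targets (hypothetical data).
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 8))
y = X[:, 0] * 2.0 + rng.normal(scale=0.1, size=60)

candidates = {
    "KNR": KNeighborsRegressor(),
    "NBR": BayesianRidge(),
    "SVR": SVR(),
}

# Fit each candidate on the first 45 rows and score it on the remaining 15.
scores = {}
for name, model in candidates.items():
    model.fit(X[:45], y[:45])
    scores[name] = mean_absolute_error(y[45:], model.predict(X[45:]))

for name, score in scores.items():
    print(f"{name}: MAE = {score:.3f}")
```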

## Train-Test division

First, define the training and test sets. *test_size* sets the fraction of examples placed in the test set; consequently, the training set fraction is 1 - *test_size*.

In [13]:
from sklearn.model_selection import train_test_split

df_train, df_test, y_train_utility, y_test_utility = train_test_split(df_utility.text, df_utility['TU_normalized'] ,test_size=0.25, random_state=42)
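
As a quick check of how *test_size* translates into set sizes, the sketch below splits 100 hypothetical examples with the same `test_size=0.25` and `random_state=42` used above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 100 hypothetical examples with one feature each.
X = np.arange(100).reshape(-1, 1)
y = np.arange(100)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

print(len(X_train), len(X_test))  # 75 training examples, 25 test examples
```

Fixing `random_state` makes the split reproducible across runs.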

# Execution

## Pre-processing

In [14]:
x_train = WordEmbeddings(df_train, dic_word_emb['DistilBERT ML']) 

x_test =  WordEmbeddings(df_test,dic_word_emb['DistilBERT ML']) 


## Train

In [15]:
rgs.fit(x_train,y_train_utility)



MLPRegressor(activation='relu', alpha=0.0001, batch_size='auto', beta_1=0.9,
             beta_2=0.999, early_stopping=False, epsilon=1e-08,
             hidden_layer_sizes=(100,), learning_rate='constant',
             learning_rate_init=0.001, max_fun=15000, max_iter=200,
             momentum=0.9, n_iter_no_change=10, nesterovs_momentum=True,
             power_t=0.5, random_state=None, shuffle=True, solver='adam',
             tol=0.0001, validation_fraction=0.1, verbose=False,
             warm_start=False)

### Save the regressor 

In [None]:
import pickle

pkl_filename = "pickle_MLPR_DistilBERTML.pkl"
with open(pkl_filename, 'wb') as file:
    pickle.dump(rgs, file)

If you want to load the model later, use:

with open(pkl_filename, 'rb') as file:
    rgs = pickle.load(file)
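
The save and load steps can be verified with a round trip. The sketch below uses a small `BayesianRidge` model and a temporary file as stand-ins; the data and the file name `demo_regressor.pkl` are hypothetical.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.linear_model import BayesianRidge

# Tiny hypothetical training set, just enough to fit a model.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0.0, 1.0, 2.0, 3.0])
model = BayesianRidge().fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_regressor.pkl")
    # Save the fitted regressor to disk.
    with open(path, "wb") as file:
        pickle.dump(model, file)
    # Load it back into a new object.
    with open(path, "rb") as file:
        restored = pickle.load(file)

# The restored model should give the same predictions as the original.
same = np.allclose(model.predict(X), restored.predict(X))
print(same)
```

Only unpickle files from sources you trust, and prefer loading with the same scikit-learn version that was used for saving.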

# Test the model

## Import metrics

In [16]:
from sklearn.metrics import mean_absolute_error as mae
from sklearn.metrics import mean_squared_error as mse
from sklearn.metrics import r2_score as r2

In [17]:
y_pred = rgs.predict(x_test)

**Mean Absolute Error**

In [18]:
mae(y_test_utility,y_pred)

2.5386814047156423

**Mean Squared Error**


In [19]:
mse(y_test_utility,y_pred)

43.61295392364879

**R^2 (coefficient of determination) regression score function.**

In [20]:
r2(y_test_utility,y_pred)

-0.18620509012970765
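
A negative R^2 means the predictions fit the test targets worse than a constant prediction equal to their mean, which scores exactly 0 by definition. The sketch below illustrates this with hypothetical values.

```python
import numpy as np
from sklearn.metrics import r2_score

# Hypothetical targets and two sets of predictions.
y_true = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
baseline_pred = np.full_like(y_true, y_true.mean())  # always predict the mean
reversed_pred = y_true[::-1]                          # systematically wrong

r2_baseline = r2_score(y_true, baseline_pred)
r2_reversed = r2_score(y_true, reversed_pred)

print(r2_baseline)  # 0: no better and no worse than the mean
print(r2_reversed)  # negative: worse than predicting the mean
```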

# Case Study

In [21]:
texts = ['This coment do not have anything important', 'The app is good and has a lot of functionality, you can easily access the files']

In [22]:
def Regression(text):
  embeddings_test = WordEmbeddings([text], dic_word_emb['DistilBERT ML'])
  resp = rgs.predict(embeddings_test)
  print('The text: "' + text + '" has the utility: '+ str(resp[0])) 

In [23]:
for text in texts:
  Regression(text)

The text: "This coment do not have anything important" has the utility: 3.551284940011397
The text: "The app is good and has a lot of functionality, you can easily access the files" has the utility: 3.7397086342580663
