# Collect the Dataset

As an example, we use the dataset from the article. To train a model on your own data, replace it with your dataset. To use the article's dataset with a different model, you only need to change the model.

In [1]:
!gdown --id 1hkVuZ7SicPZpR9Pf8R-uSivhhC6TRB0X

Downloading...
From: https://drive.google.com/uc?id=1hkVuZ7SicPZpR9Pf8R-uSivhhC6TRB0X
To: /content/zomato_en_900k.json
1.04GB [00:08, 117MB/s]


In [2]:
import pandas as pd
import json

with open('zomato_en_900k.json') as f:
  data = json.load(f)

df = pd.DataFrame(data)

# Pre-processing

## Utility

You must define a utility threshold; here we use 5, so only reviews with at least 5 thumbs-up are kept. The remaining values must then be normalized.

In [3]:
# .copy() lets a new column be added later without a SettingWithCopyWarning
df_utility = df[df.thumbsUp >= 5].copy()

## Normalize Utility



In [None]:
import numpy as np

# Largest thumbs-up count, used to scale every value into (0, 1]
maior = np.max(df_utility['thumbsUp'])

# Scale each count by the maximum
thumbs_up = df_utility['thumbsUp'] / maior

# Discretize the scaled values into integer bins from 1 to 100
bins = np.linspace(0, 1, 100)
thumbs_up_normalized = np.digitize(thumbs_up, bins)

df_utility['TU_normalized'] = thumbs_up_normalized
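To see what this normalization produces, the sketch below applies the same two steps (divide by the maximum, then `np.digitize` over 100 evenly spaced edges) to a few made-up thumbs-up counts. The sample counts are assumptions chosen for illustration, not values from the dataset.

```python
import numpy as np

# Hypothetical thumbs-up counts (illustrative only)
sample_counts = np.array([5, 10, 50, 100])

# Step 1: scale by the maximum so values fall in (0, 1]
scaled = sample_counts / sample_counts.max()

# Step 2: map each scaled value to the index of its bin
bins = np.linspace(0, 1, 100)
binned = np.digitize(scaled, bins)

print(binned.tolist())  # [5, 10, 50, 100]
```

The result is an integer from 1 to 100, so the regression target and the error metrics reported later are expressed on that scale.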

## Bag-of-Words 

### Imports

In [15]:
import re
import nltk
from nltk.tokenize import word_tokenize
from nltk.stem.snowball import SnowballStemmer
nltk.download('stopwords') 
nltk.download('punkt') 
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


### Tokenizer

In [6]:
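stemmer = SnowballStemmer("english")  # created once and reused for every token

def tokenize(text):

  p = re.compile(r'\d')  # raw string; p.match() tests only the start of a token

  tokens = nltk.word_tokenize(text)

  stems  = []
  for item in tokens:
    # keep tokens longer than 2 characters that do not start with a digit
    if len(item) > 2 and not p.match(item):
      stems.append(stemmer.stem(item))
  return stems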
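The filter inside `tokenize` is easy to misread: `p.match` only tests the beginning of a token, so a token is dropped when it starts with a digit, not when it merely contains one. The sketch below isolates that condition on a hand-written token list (the tokens are assumptions for illustration) and leaves out the NLTK tokenization and stemming steps.

```python
import re

p = re.compile(r'\d')

# Hypothetical tokens (illustrative only)
tokens = ["app", "2nd", "v2", "ok", "100", "files", "mp3"]

# Same condition as in tokenize(): longer than 2 characters
# and not starting with a digit
kept = [t for t in tokens if len(t) > 2 and not p.match(t)]

print(kept)  # ['app', 'files', 'mp3']
```

Note that `mp3` survives because its digit is not the first character, while `v2` and `ok` are removed by the length check alone.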

### Bag-of-words term weights

In [7]:
stop_words = nltk.corpus.stopwords.words('english') 

dic_tw = {
    'TF' : CountVectorizer(tokenizer=tokenize, stop_words=stop_words, ngram_range=(1,1)),
    'TF-IDF' : TfidfVectorizer(tokenizer=tokenize, stop_words=stop_words, ngram_range=(1,1)),
    'Binary' : CountVectorizer(tokenizer=tokenize, stop_words=stop_words, ngram_range=(1,1), binary=True),
    'TF-Bigram' : CountVectorizer(tokenizer=tokenize, stop_words=stop_words, ngram_range=(1,2)),
    'TFIDF-Bigram' : TfidfVectorizer(tokenizer=tokenize, stop_words=stop_words, ngram_range=(1,2)),
    'Binary-Bigram' : CountVectorizer(tokenizer=tokenize, stop_words=stop_words, ngram_range=(1,2), binary=True)
}
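The difference between these weighting schemes is easiest to see on a tiny corpus. The sketch below uses scikit-learn's default tokenizer instead of the `tokenize` function above, and a made-up two-sentence corpus, so it runs without NLTK data; both choices are assumptions for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical two-document corpus (illustrative only)
corpus = ["good app good", "bad app"]

# TF: raw term counts
tf = CountVectorizer()
tf_matrix = tf.fit_transform(corpus).toarray()
terms = sorted(tf.vocabulary_, key=tf.vocabulary_.get)  # terms in column order

# Binary: 1 if the term occurs at all, 0 otherwise
binary_matrix = CountVectorizer(binary=True).fit_transform(corpus).toarray()

# Bigram variants: ngram_range=(1, 2) adds word pairs to the vocabulary
bigram = CountVectorizer(ngram_range=(1, 2)).fit(corpus)
bigram_terms = sorted(bigram.vocabulary_, key=bigram.vocabulary_.get)

print(terms)                   # ['app', 'bad', 'good']
print(tf_matrix.tolist())      # [[1, 0, 2], [1, 1, 0]]
print(binary_matrix.tolist())  # [[1, 0, 1], [1, 1, 0]]
print(bigram_terms)            # ['app', 'app good', 'bad', 'bad app', 'good', 'good app']
```

`TfidfVectorizer` starts from the same counts and then down-weights terms that occur in many documents, such as `app` here.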

# Functions to train the model

## Import models

In [8]:
from scipy.spatial.distance import cosine
from sklearn.neighbors import KNeighborsRegressor as KNR # regression counterpart of KNN
from sklearn.svm import SVR # regression counterpart of SVM
from sklearn.neural_network import MLPRegressor as MLPR # regression counterpart of MLP
from sklearn.linear_model import BayesianRidge as NBR # used as the counterpart of NB

If you use KNR, the cosine metric is a good choice because it works well for text data.

In [9]:
def cosseno(x,y):
  dist = cosine(x,y)
  # the distance can come back as NaN (e.g. for an all-zero vector);
  # treat that case as distance 1
  if np.isnan(dist):
    return 1
  return dist
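The guard matters because a review whose words are all outside the vocabulary becomes an all-zero vector, for which cosine similarity is undefined. The sketch below is a NumPy-only stand-in for `cosseno` (it does not call SciPy, and the function name `cosine_distance` is hypothetical) that makes the zero-vector case explicit.

```python
import numpy as np

def cosine_distance(x, y):
    """Cosine distance with an explicit guard for zero-length vectors."""
    denom = np.linalg.norm(x) * np.linalg.norm(y)
    if denom == 0:
        # Undefined similarity: fall back to distance 1, as cosseno() does
        return 1.0
    return 1.0 - float(np.dot(x, y)) / denom

# Parallel vectors: distance close to 0
same = cosine_distance(np.array([1.0, 2.0]), np.array([2.0, 4.0]))
# Orthogonal vectors: distance 1
orthogonal = cosine_distance(np.array([1.0, 0.0]), np.array([0.0, 1.0]))
# All-zero vector: handled by the guard
empty = cosine_distance(np.array([0.0, 0.0]), np.array([1.0, 1.0]))
```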

## Regressors Variation

In [10]:
rgrs = {
    "KNR" : KNR(metric=cosseno),
    "MLPR" : MLPR(),
    "NBR" : NBR(),
    "SVR" : SVR()
}

## Define the algorithm that you will use

In [12]:
rgs = rgrs['MLPR']

## Train-Test division

First, define the training and test sets. *test_size* sets the fraction of examples assigned to the test set; the training set therefore receives the remaining 1 - *test_size*.

In [17]:
from sklearn.model_selection import train_test_split

df_train, df_test, y_train_utility, y_test_utility = train_test_split(
    df_utility.text,
    df_utility['TU_normalized'],
    test_size=0.25,
    random_state=42,
)
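As a quick check of how *test_size* translates into set sizes, the sketch below splits eight made-up examples with the same `test_size` and `random_state` used above. The toy texts and targets are assumptions for illustration.

```python
from sklearn.model_selection import train_test_split

# Hypothetical texts and targets (illustrative only)
toy_texts = [f"review {i}" for i in range(8)]
toy_targets = list(range(8))

x_tr, x_te, y_tr, y_te = train_test_split(
    toy_texts, toy_targets, test_size=0.25, random_state=42
)

print(len(x_tr), len(x_te))  # 6 2
```

Fixing `random_state` makes the split reproducible, so repeated runs compare models on the same test examples.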

# Execution

## Pre-processing

In [16]:
vectorizer = dic_tw['TF-IDF']

vectorizer.fit(df_train)

x_train = vectorizer.transform(df_train).toarray()

x_test = vectorizer.transform(df_test).toarray()
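The vectorizer is fitted on the training texts only, so its vocabulary comes from the training set, and words that appear only in the test set are ignored at transform time. A minimal sketch of that behavior, using a plain `CountVectorizer` and made-up sentences (both are assumptions for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer

train_docs = ["good app", "bad app"]
test_docs = ["good new app"]  # "new" never appears in training

vec = CountVectorizer()
vec.fit(train_docs)  # vocabulary is learned from the training texts only
vocab = sorted(vec.vocabulary_, key=vec.vocabulary_.get)

# The unseen word "new" has no column, so it is silently dropped
test_vector = vec.transform(test_docs).toarray()

print(vocab)                 # ['app', 'bad', 'good']
print(test_vector.tolist())  # [[1, 0, 1]]
```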



## Train

In [18]:
rgs.fit(x_train,y_train_utility)



MLPRegressor(activation='relu', alpha=0.0001, batch_size='auto', beta_1=0.9,
             beta_2=0.999, early_stopping=False, epsilon=1e-08,
             hidden_layer_sizes=(100,), learning_rate='constant',
             learning_rate_init=0.001, max_fun=15000, max_iter=200,
             momentum=0.9, n_iter_no_change=10, nesterovs_momentum=True,
             power_t=0.5, random_state=None, shuffle=True, solver='adam',
             tol=0.0001, validation_fraction=0.1, verbose=False,
             warm_start=False)

Save the model:

In [None]:
import pickle

pkl_filename = "pickle_MLPR_TFIDF.pkl"
with open(pkl_filename, 'wb') as file:
    pickle.dump(rgs, file)

To load the model later, use:

with open(pkl_filename, 'rb') as file:
    rgs = pickle.load(file)
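Predicting on new text also needs the fitted vectorizer, not just the regressor. One option, shown here as a sketch and not as part of the original notebook, is to pickle both objects together. The file name, the dictionary keys, and the toy data are assumptions for illustration.

```python
import os
import pickle
import tempfile

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import BayesianRidge

# Toy stand-ins for the notebook's vectorizer and regressor
docs = ["good app", "bad app", "great app", "awful app"]
y = [80.0, 10.0, 90.0, 5.0]
toy_vectorizer = CountVectorizer().fit(docs)
toy_model = BayesianRidge().fit(toy_vectorizer.transform(docs).toarray(), y)

# Hypothetical file name, written to a temporary directory for this example
path = os.path.join(tempfile.mkdtemp(), "model_and_vectorizer.pkl")
with open(path, "wb") as file:
    pickle.dump({"vectorizer": toy_vectorizer, "model": toy_model}, file)

# Later: restore both objects and predict on new text
with open(path, "rb") as file:
    bundle = pickle.load(file)

new_x = bundle["vectorizer"].transform(["good app"]).toarray()
before = toy_model.predict(toy_vectorizer.transform(["good app"]).toarray())
after = bundle["model"].predict(new_x)
```

Only unpickle files you created or trust, since loading a pickle can execute arbitrary code.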

# Test the model

## Import metrics

In [21]:
from sklearn.metrics import mean_absolute_error as mae
from sklearn.metrics import mean_squared_error as mse
from sklearn.metrics import r2_score as r2

In [19]:
y_pred = rgs.predict(x_test)

**Mean Absolute Error**

In [22]:
mae(y_test_utility,y_pred)

4.417804223737212

**Mean Squared Error**


In [23]:
mse(y_test_utility,y_pred)

61.888286175033414

**R^2 (coefficient of determination) regression score function.**

In [24]:
r2(y_test_utility,y_pred)

-0.6832659445344622
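A negative R² means the predictions fit the test targets worse than always predicting their mean. The sketch below computes the three metrics by hand with NumPy on made-up values to show how that can happen; the arrays and the helper names `mae_np`, `mse_np`, and `r2_np` are assumptions for illustration.

```python
import numpy as np

# Hypothetical targets and predictions (illustrative only)
y_true = np.array([10.0, 20.0, 30.0, 40.0])
y_close = np.array([12.0, 18.0, 33.0, 37.0])  # roughly right
y_bad = np.array([40.0, 30.0, 20.0, 10.0])    # reversed order

def mae_np(y, p):
    # mean of the absolute errors
    return float(np.mean(np.abs(y - p)))

def mse_np(y, p):
    # mean of the squared errors
    return float(np.mean((y - p) ** 2))

def r2_np(y, p):
    # 1 - (residual sum of squares / total sum of squares around the mean)
    ss_res = np.sum((y - p) ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    return float(1.0 - ss_res / ss_tot)

print(mae_np(y_true, y_close))  # 2.5
print(mse_np(y_true, y_close))  # 6.5
print(r2_np(y_true, y_close))   # approximately 0.948
print(r2_np(y_true, y_bad))     # -3.0
```

Because the target was binned to integers from 1 to 100, the MAE and MSE reported above are expressed in bin units.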

# Case Study

In [30]:
texts = ['This coment do not have anything important', 'The app is good and has a lot of functionality, you can easily access the files']

In [31]:
def Regression(text):
  bow_test = vectorizer.transform([text]).toarray()
  resp = rgs.predict(bow_test)
  print('The text: "' + text + '" has the utility: '+ str(resp[0])) 

In [32]:
for text in texts:
  Regression(text)

The text: "This coment do not have anything important" has the utility: -2.756250083506388
The text: "The app is good and has a lot of functionality, you can easily access the files" has the utility: 17.05477111646119
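Because the regressor is unconstrained, predictions can fall outside the 1 to 100 range of the normalized target, as the negative value above shows. One optional post-processing step, which is not part of the original notebook, is to clip predictions to that range. The sketch uses the two predicted values above (rounded) plus one hypothetical overshoot.

```python
import numpy as np

# Two rounded predictions from above plus a hypothetical value above 100
raw_predictions = np.array([-2.76, 17.05, 120.0])

# Force every prediction into the valid target range [1, 100]
clipped = np.clip(raw_predictions, 1, 100)

print(clipped.tolist())  # [1.0, 17.05, 100.0]
```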
