# Review's Stars Prediction



The prediction uses the Naïve Bayes algorithm.
This algorithm assumes that the features are independent, so the features were chosen to be as independent as possible.
The chosen features are:

* **sentiment & magnitude** Before applying the Naïve Bayes classifier, each review is classified into one of 3 classes: positive, negative or neutral. The magnitude indicates the significance of the sentiment.

* **user's stars** The average stars of previous reviews from a given user. Based on *"The rich get richer and the poor get poorer"*, a user who wrote good reviews in the past has a good chance of writing a good review this time.

* **business' stars** The average stars of previous reviews for a given business. 

## prepare the data


The view with the relevant columns is created with:

```SQL
Create view dataset as 
(SELECT f.review_id as id,f.sentimemt as sentiment,f.magnitude as magnitude,
u.average_stars as u_stars,b.b_stars as b_stars,f.r_stars as r_stars
FROM fact_table f
LEFT JOIN users as u
ON f.user_id = u.user_id
LEFT JOIN business as b
ON f.business_id = b.business_id);
```

Then the dataset is split into training_set and test_set at an 80/20 ratio:

```SQL
Create view training_set as 
(SELECT *
 FROM dataset
 ORDER BY id ASC
 LIMIT 143819);

Create view test_set as 
(SELECT *
 FROM dataset
 ORDER BY id DESC
 LIMIT 35954);
```
**Note: the review_id column is encrypted, so its values are effectively random strings; ordering by id therefore acts as a shuffle of the dataset.**
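
The same 80/20 split can be sketched in pandas, assuming the `dataset` view has been exported to a DataFrame. The rows below are made-up illustration data and `split_by_id` is a hypothetical helper, not part of the original pipeline.

```python
import pandas as pd

def split_by_id(df, train_fraction=0.8):
    """Sort by the opaque id and cut the frame into train and test parts."""
    ordered = df.sort_values("id").reset_index(drop=True)
    cut = int(len(ordered) * train_fraction)
    return ordered.iloc[:cut], ordered.iloc[cut:]

# Made-up rows standing in for the exported `dataset` view.
toy = pd.DataFrame({
    "id": ["q7", "a3", "zz", "m1", "c9"],
    "r_stars": [5, 1, 4, 3, 5],
})
train_part, test_part = split_by_id(toy)
print(len(train_part), len(test_part))
```

As in the SQL views, the two parts are taken from opposite ends of the id ordering, so they never overlap.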

## likelihood calculation

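The model is based on the following relation (the posterior is proportional to the product of the likelihoods and the prior):
$$
p(c|X) \propto p(x_{sentiment}|c) \cdot p(x_{user}|c) \cdot p(x_{business}|c) \cdot p(c)
$$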

Where p(c|X) is the posterior probability of class c (the number of stars) given the predictors (the set of features X).
The class with the highest probability is the prediction for the given review.
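
The decision rule can be sketched as below. This is a minimal illustration with made-up probabilities, and `naive_bayes_argmax` is a hypothetical name. It sums logarithms instead of multiplying probabilities directly (as the notebook does) to avoid underflow when many features are combined; the chosen class is the same either way.

```python
import math

def naive_bayes_argmax(likelihoods, prior):
    """Return the class maximizing p(c) * prod_i p(x_i | c), computed in log space.

    Assumes every probability is strictly positive (log(0) is undefined).
    """
    scores = {}
    for c, p_c in prior.items():
        score = math.log(p_c)
        for table in likelihoods:
            score += math.log(table[c])
        scores[c] = score
    return max(scores, key=scores.get)

# Made-up numbers for two classes and two features.
prior = {4: 0.4, 5: 0.6}
likelihoods = [
    {4: 0.7, 5: 0.2},  # p(x_user | c)
    {4: 0.5, 5: 0.5},  # p(x_business | c)
]
predicted = naive_bayes_argmax(likelihoods, prior)
print(predicted)
```

Here class 4 scores 0.4 * 0.7 * 0.5 = 0.14 against 0.6 * 0.2 * 0.5 = 0.06 for class 5, so class 4 is chosen.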

The following blocks present the queries used to extract the likelihood probabilities of each feature.
Each query saves its results in a CSV file, and pandas DataFrames are then used to manipulate the data.

In [11]:
import pandas as pd
import numpy as np
import operator
import tqdm
import os.path
if os.path.isfile("user.csv"):
    path=""
else:
    # MySQL's default data directory on Windows (raw string keeps the backslashes literal)
    path=r"C:\ProgramData\MySQL\MySQL Server 5.7"

### user stars

```SQL
SELECT ROUND(u_stars, 0) AS u,ROUND(r_stars, 0) AS r, COUNT(*) AS COUNT
FROM   dev_local.training_set
GROUP  BY u,r
INTO OUTFILE 'C:/ProgramData/MySQL/MySQL Server 5.7/Uploads/user.csv'
FIELDS TERMINATED BY ',';
```

In [12]:
user_prob = pd.read_csv(path + r'\Uploads\user.csv',
                        names=['u_stars', 'r_stars', 'prob'])
# normalize the counts by the grand total to get probabilities
user_prob["prob"] = user_prob["prob"] / float(user_prob["prob"].sum())
user_prob.head()

|   | u_stars | r_stars | prob |
|---|---|---|---|
| 0 | 1 | 1 | 0.020943 |
| 1 | 1 | 2 | 0.000633 |
| 2 | 1 | 3 | 3.5e-05 |
| 3 | 1 | 4 | 1.4e-05 |
| 4 | 1 | 5 | 4.2e-05 |
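
Dividing each count by the grand total, as in the cell above, gives a joint probability p(x, c). A per-class likelihood p(x|c) would instead divide each count by the total of its `r_stars` class. The sketch below contrasts the two normalizations on made-up counts, not on the real `user.csv`, and is only an illustration of the alternative.

```python
import pandas as pd

# Made-up counts in the same layout as user.csv (u_stars, r_stars, count).
counts = pd.DataFrame({
    "u_stars": [1, 2, 1, 2],
    "r_stars": [1, 1, 5, 5],
    "count":   [30, 10, 5, 55],
})

# Joint probability p(x, c): divide by the grand total.
counts["joint"] = counts["count"] / counts["count"].sum()

# Conditional likelihood p(x | c): divide by the total of each r_stars class.
class_totals = counts.groupby("r_stars")["count"].transform("sum")
counts["likelihood"] = counts["count"] / class_totals
print(counts)
```

With the conditional form, the likelihoods within each `r_stars` class sum to 1, while the joint column sums to 1 over the whole table.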


### business stars

```SQL
SELECT ROUND(b_stars, 0) AS b,ROUND(r_stars, 0) AS r, COUNT(*) AS COUNT
FROM   dev_local.training_set
GROUP  BY b,r
INTO OUTFILE 'C:/ProgramData/MySQL/MySQL Server 5.7/Uploads/business.csv'
FIELDS TERMINATED BY ',';
```

In [13]:
bus_prob = pd.read_csv(path + r'\Uploads\business.csv',
                       names=['b_stars', 'r_stars', 'prob'])
# normalize the counts by the grand total to get probabilities
bus_prob["prob"] = bus_prob["prob"] / float(bus_prob["prob"].sum())
bus_prob.head()

|   | b_stars | r_stars | prob |
|---|---|---|---|
| 0 | 1 | 1 | 0.002955 |
| 1 | 1 | 2 | 0.000167 |
| 2 | 1 | 3 | 1.4e-05 |
| 3 | 1 | 4 | 7e-06 |
| 4 | 1 | 5 | 7e-06 |


### sentiment and magnitude 

```SQL
SELECT (sentiment) AS sent,ROUND(magnitude, 1) AS mag,ROUND(r_stars, 0) AS r, COUNT(*) AS COUNT
FROM   dev_local.training_set
GROUP  BY sent,mag,r
INTO OUTFILE 'C:/ProgramData/MySQL/MySQL Server 5.7/Uploads/sentiment_magnitude.csv'
FIELDS TERMINATED BY ',';
```

In [14]:
sent_mag_prob = pd.read_csv(path + r'\Uploads\sentiment_magnitude.csv',
                            names=['sentiment', 'magnitude', 'r_stars', 'prob'])
# normalize the counts by the grand total to get probabilities
sent_mag_prob["prob"] = sent_mag_prob["prob"] / float(sent_mag_prob["prob"].sum())
sent_mag_prob.head()

|   | sentiment | magnitude | r_stars | prob |
|---|---|---|---|---|
| 0 | na | 0.0 | 1 | 0.001613 |
| 1 | na | 0.0 | 2 | 0.001592 |
| 2 | na | 0.0 | 3 | 0.002559 |
| 3 | na | 0.0 | 4 | 0.006119 |
| 4 | na | 0.0 | 5 | 0.007078 |


### prior probability of class
```SQL
SELECT ROUND(r_stars, 0) AS r, COUNT(*) AS COUNT
FROM   dev_local.training_set
GROUP  BY r
INTO OUTFILE 'C:/ProgramData/MySQL/MySQL Server 5.7/Uploads/prior.csv'
FIELDS TERMINATED BY ',';
```

In [5]:
prior_prob = pd.read_csv(path + r'\Uploads\prior.csv', names=['r_stars', 'prob'])
# normalize the counts by the grand total to get probabilities
prior_prob["prob"] = prior_prob["prob"] / float(prior_prob["prob"].sum())
prior_prob.head()

|   | r_stars | prob |
|---|---|---|
| 0 | 1 | 0.110743 |
| 1 | 2 | 0.093242 |
| 2 | 3 | 0.137409 |
| 3 | 4 | 0.279796 |
| 4 | 5 | 0.378809 |


## Model Evaluation


### Load test set
```SQL
SELECT *
FROM   dev_local.test_set
INTO OUTFILE 'C:/ProgramData/MySQL/MySQL Server 5.7/Uploads/test.csv'
FIELDS TERMINATED BY ',';
```

In [15]:
test = pd.read_csv(path + r'\Uploads\test.csv',
                   names=['id', 'sentiment', 'magnitude', 'u_stars', 'b_stars', 'r_stars'])
test.head()

|   | id | sentiment | magnitude | u_stars | b_stars | r_stars |
|---|---|---|---|---|---|---|
| 0 | `___iDWC9iXf9R6BtxXgNcQ` | pos | 0.470707 | 2.86 | 4.5 | 5 |
| 1 | `___6FK0UHRSWL3kpUvCaEQ` | pos | 0.089197 | 1.0 | 1.5 | 1 |
| 2 | `__zNo9PDQytR7MTcMd4QQw` | neg | 0.490903 | 3.85 | 4.0 | 4 |
| 3 | `__yGuzTDg1gWWptrRLPxsQ` | pos | 0.298717 | 4.33 | 4.0 | 4 |
| 4 | `__XTzfeXw2Bw42-vFDJkRQ` | neg | 0.232015 | 4.13 | 4.5 | 4 |


In [7]:
def predict(x):
    # takes a feature vector x = [u_stars, b_stars, sentiment, magnitude]
    # and returns the predicted number of stars
    p_x_c = {}
    for stars in range(1, 6):
        # user's stars feature
        p_x_c[stars] = user_prob.query(
            'r_stars == %s & u_stars == %s' % (stars, x[0])).iloc[0]['prob']
        # business' stars feature
        p_x_c[stars] *= bus_prob.query(
            'r_stars == %s & b_stars == %s' % (stars, x[1])).iloc[0]['prob']
        # sentiment & magnitude feature
        p_x_c[stars] *= sent_mag_prob.query(
            'r_stars == %s & sentiment == "%s" & magnitude == %s'
            % (stars, x[2], x[3])).iloc[0]['prob']
        # multiply by the prior p(c)
        p_x_c[stars] *= prior_prob.iloc[stars - 1]['prob']
    # return the key (number of stars) associated with the highest probability;
    # items() works on both Python 2 and Python 3, unlike iteritems()
    return max(p_x_c.items(), key=operator.itemgetter(1))[0]
    

In [8]:
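counter=0
for index, row in tqdm.tqdm(test.iterrows()):
    # bin the features like the likelihood tables; note that int() truncates the
    # averages, whereas the SQL queries binned them with ROUND()
    x=[int(row['u_stars']), int(row['b_stars']),row["sentiment"],round(row["magnitude"],1)]
    if predict(x)==row["r_stars"]:
        counter+=1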


35954it [20:19, 29.48it/s]


In [9]:
print ('accuracy: {}%'.format(round(float(counter)/35954,3)*100))

accuracy: 41.8%


One hypothesis for this result is that the business' stars are only weakly related to the stars of an individual review, so the next model ignores this feature:

In [10]:
def predict_without_bus(x):
    # takes a feature vector x = [u_stars, b_stars, sentiment, magnitude]
    # and returns the predicted number of stars, ignoring b_stars
    p_x_c = {}
    for stars in range(1, 6):
        # user's stars feature
        p_x_c[stars] = user_prob.query(
            'r_stars == %s & u_stars == %s' % (stars, x[0])).iloc[0]['prob']
        # sentiment & magnitude feature
        p_x_c[stars] *= sent_mag_prob.query(
            'r_stars == %s & sentiment == "%s" & magnitude == %s'
            % (stars, x[2], x[3])).iloc[0]['prob']
        # multiply by the prior p(c)
        p_x_c[stars] *= prior_prob.iloc[stars - 1]['prob']
    # return the key (number of stars) associated with the highest probability
    return max(p_x_c.items(), key=operator.itemgetter(1))[0]
    

In [11]:
counter_without_bus=0
for index, row in tqdm.tqdm(test.iterrows()):
    x=[int(row['u_stars']), int(row['b_stars']),row["sentiment"],round(row["magnitude"],1)]
    if predict_without_bus(x)==row["r_stars"]:
        counter_without_bus+=1

35954it [14:08, 42.39it/s]


In [12]:
print ('accuracy: {}%'.format(round(float(counter_without_bus)/35954,3)*100))

accuracy: 40.5%


The results got worse.
Then, prediction by the user's stars alone was inspected:

In [13]:
def predict_by_user(x):
    # takes a feature vector x = [u_stars, b_stars, sentiment, magnitude]
    # and returns the predicted number of stars, using only u_stars
    p_x_c = {}
    for stars in range(1, 6):
        # user's stars feature
        p_x_c[stars] = user_prob.query(
            'r_stars == %s & u_stars == %s' % (stars, x[0])).iloc[0]['prob']
        # multiply by the prior p(c)
        p_x_c[stars] *= prior_prob.iloc[stars - 1]['prob']
    # return the key (number of stars) associated with the highest probability
    return max(p_x_c.items(), key=operator.itemgetter(1))[0]
    

In [14]:
counter_by_user=0
for index, row in tqdm.tqdm(test.iterrows()):
    x=[int(row['u_stars']), int(row['b_stars']),row["sentiment"],round(row["magnitude"],1)]
    if predict_by_user(x)==row["r_stars"]:
        counter_by_user+=1

35954it [06:28, 92.47it/s]


In [15]:
print ('accuracy: {}%'.format(round(float(counter_by_user)/35954,3)*100))

accuracy: 41.6%


The result is very close to that of the full model, but each prediction took only about 0.01 seconds, roughly three times faster!
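
The per-prediction times follow from the throughput that tqdm reported for the three runs. The short calculation below only restates that arithmetic; the dictionary keys are labels chosen here for illustration.

```python
# Throughput reported by tqdm in the three evaluation runs (iterations per second).
rates = {"full model": 29.48, "without business": 42.39, "user only": 92.47}

# Seconds per prediction is the reciprocal of the throughput.
seconds_per_prediction = {name: 1.0 / rate for name, rate in rates.items()}
speedup = rates["user only"] / rates["full model"]

for name, secs in seconds_per_prediction.items():
    print("{}: {:.3f} s per prediction".format(name, secs))
print("speedup of user only over full model: {:.1f}x".format(speedup))
```

This gives roughly 0.034 seconds per prediction for the full model against 0.011 seconds for the user-only model, a ratio of about 3.1.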

## conclusion

To sum up, a Naïve Bayes algorithm was proposed to predict the star rating of a given review.
The model correctly predicted about 40% of the test set. Although this sounds like a poor result, it is 2 times better than a random predictor, which would be right about 20% of the time over 5 classes. In addition, predicting a single review takes about 0.01 seconds with the user-only model, which is pretty fast.
To improve performance we suggest the following steps:

* look for better features - After experimenting with different features, the final model is based only on the previous reviews of the user. More features may improve accuracy if they are picked wisely.


* optimize the prediction function - As noted, the more features are used, the more time each iteration takes. There may be a faster implementation of this function using matrix multiplication.
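
One possible direction for the second suggestion is to score all reviews at once instead of calling `query` per review and per class. The sketch below uses pandas merges rather than matrix multiplication, covers only the user-only model, and runs on made-up tables. `predict_all` and `test_keys` are hypothetical names, and the real likelihood tables may need extra handling for missing keys.

```python
import pandas as pd

def predict_all(test_keys, user_prob, prior_prob):
    """Score every (review, class) pair with merges, then keep the best class per review."""
    # One row per (review, candidate class); reviews whose u_stars value
    # is absent from user_prob are dropped by the inner merge.
    scored = test_keys.merge(user_prob, on="u_stars", how="inner")
    scored = scored.merge(prior_prob, on="r_stars", suffixes=("_user", "_prior"))
    scored["score"] = scored["prob_user"] * scored["prob_prior"]
    # Row label of the highest score within each review.
    best = scored.loc[scored.groupby("id")["score"].idxmax()]
    return best.set_index("id")["r_stars"]

# Made-up tables in the same layout as user_prob and prior_prob.
user_prob = pd.DataFrame({
    "u_stars": [1, 1, 5, 5],
    "r_stars": [1, 5, 1, 5],
    "prob":    [0.30, 0.05, 0.05, 0.60],
})
prior_prob = pd.DataFrame({"r_stars": [1, 5], "prob": [0.35, 0.65]})
test_keys = pd.DataFrame({"id": ["a", "b"], "u_stars": [1, 5]})

predictions = predict_all(test_keys, user_prob, prior_prob)
print(predictions)
```

Further features could be added as additional merges on their binned keys, each contributing one more factor to `score`.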