# User Weight
This notebook estimates the 'validity' of users from information such as their average stars, the various compliment counts, the number of reviews, etc.

Yelp awards "elite" status to users who are active in the Yelp community. It is intuitive to regard reviews from elite users as more trustworthy and more important to a business. The main idea of our model is therefore to predict the probability that a user is an elite user from data reflecting well-written reviews, high-quality photos, a detailed personal profile, and a history of interacting with others. In the later analysis, we then increase the weight of the reviews and scores given by likely elite users.

In [1]:
import pandas as pd
import numpy as np
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
import matplotlib as mpl
import collections

## 1. Basic Information

In [2]:
user = pd.read_csv('user.csv')
user.head()

Unnamed: 0,average_stars,compliment_cool,compliment_cute,compliment_hot,compliment_list,compliment_more,compliment_note,compliment_photos,compliment_plain,compliment_profile,...,cool,elite,fans,friends,funny,name,review_count,useful,user_id,yelping_since
0,3.45,64,0,16,0,8,42,10,32,1,...,974,2015201620172018.0,71,"tRC9YLo4LHquMVXZ9VO4Ag, -YpfTgz88rsPwsOvlfKn7w...",1007,Javier,470,1655,pU6GoRTcl1rIOi6zMynjog,2011
1,4.4,1,0,1,0,0,1,0,3,0,...,45,20172018.0,4,"b6JabuZ8sNh91ZGqBJ-JHw, sKL3W6Yy1OJbSrNw7griiQ...",25,Karen,85,63,NcUMNz6tAahD6mDmuKwYIA,2008
2,4.57,1,0,0,0,0,1,2,0,0,...,9,,1,"k9dPWLh91nj46fEsMrPxYA, tFfvKiwajiXmUlY3SfF3KQ...",3,Noah,26,17,1a36KbE7XH31Uo4fkmdkqQ,2012
3,4.37,2,0,1,0,0,2,0,1,1,...,4,,1,"bMTNtbG6QxFNdyrPcx7V2g, 51qkOPAiCREk_2QFjn9Hyg...",3,Franz,27,17,7wFpWiJMxePaaGiL_c-_IQ,2017
4,4.32,3,0,5,0,0,1,1,5,0,...,87,,3,"pxzs-Dy2hXTis-PuNCV37Q, uZCy7wuptQo3arWvhpqZAA...",44,Victoria,95,114,i3dgAM1hWY9UdUCNMDnLXQ,2014


In [3]:
user.columns # column names

Index(['average_stars', 'compliment_cool', 'compliment_cute', 'compliment_hot',
       'compliment_list', 'compliment_more', 'compliment_note',
       'compliment_photos', 'compliment_plain', 'compliment_profile',
       'compliment_writer', 'cool', 'elite', 'fans', 'friends', 'funny',
       'name', 'review_count', 'useful', 'user_id', 'yelping_since'],
      dtype='object')

In [27]:
print('Number of users:',len(user))
print('Missing value of average_stars:',sum(user['average_stars'].isna()))
print('Missing value of compliment_cool:',sum(user['compliment_cool'].isna()))
print('Missing value of cool:',sum(user['cool'].isna()))
print('Missing value of elite:',sum(user['elite'].isna()))
print('Missing value of fans:',sum(user['fans'].isna()))
print('Missing value of friends:',sum(user['friends'].isna()))
print('Missing value of funny:',sum(user['funny'].isna()))
print('Missing value of review_count:',sum(user['review_count'].isna()))
print('Missing value of useful:',sum(user['useful'].isna()))
print('Missing value of yelping_since:',sum(user['yelping_since'].isna()))

Number of users 33564
Missing value of average_stars 0
Missing value of compliment_cool 0
Missing value of cool 0
Missing value of elite 26608
Missing value of fans 0
Missing value of friends 0
Missing value of funny 0
Missing value of review_count 0
Missing value of useful 0
Missing value of yelping_since 0


As we can see, no column other than "elite" has missing values, so we do not need to impute anything in the user dataset. "elite" is NULL if the user has never been an elite user. Hence, we encode NULL as 0 and every other record as 1, so that the column indicates whether the user has ever been elite.
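
The missing-value check and the encoding described above can also be written more compactly. The following is only a sketch on a made-up three-row table (`toy_user` and its values are illustrative, not the real `user.csv`); it relies on the standard pandas methods `isna`, `notna` and `astype`.

```python
import numpy as np
import pandas as pd

# Made-up miniature of the user table; values are illustrative only.
toy_user = pd.DataFrame({
    "average_stars": [3.45, 4.40, 4.57],
    "elite": [2015201620172018.0, np.nan, np.nan],
    "fans": [71, 4, 1],
})

# Count missing values for every column in one pass.
missing_counts = toy_user.isna().sum()
print(missing_counts)

# Encode "has ever been elite" directly from the missingness pattern.
elite_flag = toy_user["elite"].notna().astype(int)
print(elite_flag.tolist())
```

Using `notna()` encodes the flag straight from the missingness, without first filling in a placeholder value.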

In [32]:
# Encode elite
user['elite'] = user['elite'].fillna(0)
user['elite'] = np.where(user['elite']==0,0,1)
print('Proportion of elite user:',round(sum(user['elite'])/len(user),2))

Proportion of elite user: 0.21


In [70]:
# Drop name column
user = user.drop(['name'],axis=1)
print('Number of compliment_list greater than 0:',sum(user['compliment_list']>0))
print('Proportion of nonzero compliment_list:',round(sum(user['compliment_list']>0)/len(user),2))
# Since the proportion is too small, we would like to get rid of this column.
user = user.drop(['compliment_list'],axis=1)

Number of compliment_list greater than 0: 1704
Proportion of nonzero compliment_list: 0.05


## 2. Feature Engineering

(1) Since "funny" and "cool" have similar meanings, we add these two columns to get a new "cool_fun" index. Similarly, we regard "compliment_cool", "compliment_cute", and "compliment_hot" as positive appraisals, so we also combine them into "compliment_pos".

In [57]:
user['cool_fun'] = user['cool'] + user['funny']
user = user.drop(['cool','funny'],axis=1)
user['compliment_pos'] = user['compliment_cool'] + user['compliment_cute'] + user['compliment_hot']
user = user.drop(['compliment_cool','compliment_cute','compliment_hot'],axis=1)

(2) We are only interested in the number of friends a user has rather than their specific ids. So, for the 'friends' column we keep only the count.

In [42]:
user['friends'] = user['friends'].apply(lambda x: len(x.split(',')))
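
One caveat with splitting on commas is that a user without friends would still be counted as having one, because splitting any non-empty string returns at least one piece. The sketch below is a more defensive variant. It assumes that an empty friend list is stored as the string `"None"` or as an empty string, which is an assumption about the data and not something this notebook verifies. The helper name `count_friends` and the toy values are hypothetical.

```python
import pandas as pd

def count_friends(raw):
    """Count comma-separated friend ids, treating empty markers as zero."""
    if pd.isna(raw):
        return 0
    text = str(raw).strip()
    # Assumption: "None" or an empty string marks a user with no friends.
    if text == "" or text == "None":
        return 0
    return len([piece for piece in text.split(",") if piece.strip()])

# Toy values for illustration only.
toy_friends = pd.Series(["a, b, c", "None", "", "x"])
toy_counts = toy_friends.apply(count_friends).tolist()
print(toy_counts)
```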

(3) After these steps, we inspect the resulting table.

In [71]:
user.head()

Unnamed: 0,average_stars,compliment_more,compliment_note,compliment_photos,compliment_plain,compliment_profile,compliment_writer,elite,fans,friends,review_count,useful,user_id,yelping_since,cool_fun,compliment_pos
0,3.45,8,42,10,32,1,35,1,71,694,470,1655,pU6GoRTcl1rIOi6zMynjog,2011,1981,80
1,4.4,0,1,0,3,0,3,1,4,231,85,63,NcUMNz6tAahD6mDmuKwYIA,2008,70,2
2,4.57,0,1,2,0,0,1,0,1,170,26,17,1a36KbE7XH31Uo4fkmdkqQ,2012,12,1
3,4.37,0,2,0,1,1,0,0,1,991,27,17,7wFpWiJMxePaaGiL_c-_IQ,2017,7,3
4,4.32,0,1,1,5,0,1,0,3,80,95,114,i3dgAM1hWY9UdUCNMDnLXQ,2014,131,8


In [74]:
user.columns

Index(['average_stars', 'compliment_more', 'compliment_note',
       'compliment_photos', 'compliment_plain', 'compliment_profile',
       'compliment_writer', 'elite', 'fans', 'friends', 'review_count',
       'useful', 'user_id', 'yelping_since', 'cool_fun', 'compliment_pos'],
      dtype='object')

In [72]:
user['compliment_profile'].describe()

count    33564.000000
mean         1.468925
std         25.082065
min          0.000000
25%          0.000000
50%          0.000000
75%          0.000000
max       2331.000000
Name: compliment_profile, dtype: float64

## 3. Logistic Regression

To begin with, we fit the full model, using all the features we have to predict "elite" status. Since "elite" status is a binary variable, we use logistic regression to predict both the status and its probability.

In [184]:
from sklearn import linear_model
import scipy.stats as stat

class LogisticReg:
    """
    Wrapper Class for Logistic Regression which has the usual sklearn instance 
    in an attribute self.model, and pvalues, z scores and estimated 
    errors for each coefficient in 
    
    self.z_scores
    self.p_values
    self.sigma_estimates
    
    as well as the negative hessian of the log Likelihood (Fisher information)
    
    self.F_ij
    """
    
    def __init__(self,*args,**kwargs):
        self.model = linear_model.LogisticRegression(*args,**kwargs)

    def fit(self,X,y):
        self.model.fit(X,y)
        #### Get p-values for the fitted model ####
        denom = (2.0*(1.0+np.cosh(self.model.decision_function(X))))
        denom = np.tile(denom,(X.shape[1],1)).T
        F_ij = np.dot((X/denom).T,X) ## Fisher Information Matrix
        Cramer_Rao = np.linalg.inv(F_ij) ## Inverse Information Matrix
        sigma_estimates = np.sqrt(np.diagonal(Cramer_Rao))
        z_scores = self.model.coef_[0]/sigma_estimates # z-score for each model coefficient
        p_values = [stat.norm.sf(abs(x))*2 for x in z_scores] ### two tailed test for p-values
        
        self.z_scores = z_scores
        self.p_values = p_values
        self.sigma_estimates = sigma_estimates
        self.F_ij = F_ij
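
A short numerical check may help explain the `cosh` term in `fit`. It is a sketch on made-up inputs, not part of the original notebook: for a linear score z with p = 1/(1+exp(-z)), the quantity 1/(2(1+cosh z)) equals p(1-p), which is the per-sample weight in the Fisher information of logistic regression. The names `z`, `toy_X` and `weights` are illustrative only.

```python
import numpy as np

# Illustrative linear scores (decision_function values); not from the dataset.
z = np.linspace(-5.0, 5.0, 11)
p = 1.0 / (1.0 + np.exp(-z))

# The Bernoulli variance p(1-p), written two ways.
weights = p * (1.0 - p)
weights_cosh = 1.0 / (2.0 * (1.0 + np.cosh(z)))
identity_holds = np.allclose(weights, weights_cosh)

# Fisher information X^T W X on a toy design matrix, computed the way fit() does
# and again with an explicit diagonal weight matrix.
toy_X = np.column_stack([np.ones_like(z), z])
denom = np.tile(2.0 * (1.0 + np.cosh(z)), (toy_X.shape[1], 1)).T
fisher_wrapper = np.dot((toy_X / denom).T, toy_X)
fisher_direct = toy_X.T @ np.diag(weights) @ toy_X
same_matrix = np.allclose(fisher_wrapper, fisher_direct)

print(identity_holds, same_matrix)
```

Two caveats follow from the wrapper as written: the matrix is built from `X` alone, so the fitted intercept is not part of it, and `LogisticRegression` applies L2 regularization by default, so the resulting z scores and p values are best read as approximations.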
        

In [185]:
y = user['elite']
x = user[['average_stars', 'compliment_more', 'compliment_note',
       'compliment_photos', 'compliment_plain', 'compliment_profile',
       'compliment_writer', 'fans', 'friends', 'review_count',
       'useful', 'yelping_since', 'cool_fun', 'compliment_pos']]
model_full = LogisticReg(random_state=123)
model_full.fit(x,y)



In [186]:
index = ['average_stars', 'compliment_more', 'compliment_note',
       'compliment_photos', 'compliment_plain', 'compliment_profile',
       'compliment_writer', 'fans', 'friends', 'review_count',
       'useful', 'yelping_since', 'cool_fun', 'compliment_pos']
table_list = list(zip(model_full.z_scores,model_full.sigma_estimates, model_full.p_values))
table_full = pd.DataFrame(table_list,index=index,columns=['z score','sigma estimate','p value'])
table_full

Unnamed: 0,z score,sigma estimate,p value
average_stars,0.197895,0.028402,0.8431276
compliment_more,0.783548,0.012747,0.4333052
compliment_note,-5.068249,0.002865,4.014929e-07
compliment_photos,-1.497339,0.001538,0.1343052
compliment_plain,-1.300731,0.000532,0.1933505
compliment_profile,-0.401226,0.011367,0.6882534
compliment_writer,6.395535,0.005098,1.599857e-10
fans,13.134345,0.004505,2.092933e-39
friends,11.688878,0.000124,1.45293e-31
review_count,25.61977,0.000365,9.187586e-145


As we can see from the table above, the average score given by a user is not significantly associated with elite status. Also, compliment_more, compliment_photos, compliment_plain, compliment_profile, and compliment_pos are not statistically significant. They may be highly correlated with one another and with the number of fans/friends. For the later analysis, we mainly focus on the significant indexes for modeling.
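
The suspicion that some features are highly correlated could be checked with a correlation matrix. The sketch below shows the idea on synthetic data only: the columns of `toy` reuse names from the notebook, but the values are randomly generated and say nothing about the real dataset.

```python
import numpy as np
import pandas as pd

# Synthetic data for illustration; not drawn from user.csv.
rng = np.random.default_rng(0)
fans = rng.poisson(5, size=200)
toy = pd.DataFrame({
    "fans": fans,
    # Constructed to depend strongly on fans.
    "compliment_pos": 2 * fans + rng.poisson(1, size=200),
    # Constructed independently of the other two columns.
    "average_stars": rng.uniform(1, 5, size=200),
})

# Pairwise Pearson correlations between the columns.
corr = toy.corr()
print(corr.round(2))
```

On the real table, the same call on the feature columns would show which compliment counts move together with `fans` and `friends`.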

In [106]:
import statsmodels.api as sm
y = user['elite']
x = user[['compliment_note','compliment_writer', 'fans', 'friends', 'review_count',
       'useful', 'yelping_since', 'cool_fun']]
logit_model=sm.Logit(y,x)
result=logit_model.fit()
print(result.summary2())

Optimization terminated successfully.
         Current function value: 0.280855
         Iterations 11
                          Results: Logit
Model:                Logit            Pseudo R-squared: 0.450     
Dependent Variable:   elite            AIC:              18869.2320
Date:                 2019-11-10 22:33 BIC:              18936.6017
No. Observations:     33564            Log-Likelihood:   -9426.6   
Df Model:             7                LL-Null:          -17127.   
Df Residuals:         33556            LLR p-value:      0.0000    
Converged:            1.0000           Scale:            1.0000    
No. Iterations:       11.0000                                      
-------------------------------------------------------------------
                   Coef.  Std.Err.     z     P>|z|   [0.025  0.975]
-------------------------------------------------------------------
compliment_note   -0.0539   0.0037  -14.7103 0.0000 -0.0611 -0.0467
compliment_writer  0.0975   0.0072   13.

In [112]:
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
logreg = LogisticRegression()
logreg.fit(x, y)
y_proba = logreg.predict_proba(x)
y_proba

array([[2.76098446e-06, 9.99997239e-01],
       [7.85770889e-01, 2.14229111e-01],
       [9.18898377e-01, 8.11016235e-02],
       ...,
       [9.47890257e-01, 5.21097429e-02],
       [9.48392706e-01, 5.16072936e-02],
       [9.47802704e-01, 5.21972965e-02]])

In [187]:
from sklearn.metrics import roc_auc_score
print('ROC AUC score:',roc_auc_score(y, y_proba[:,1])) # ROC AUC

ROC AUC score: 0.960613479038589


The ROC AUC score of our LR model is about 0.96, which means the model separates positive and negative samples very well. Note that the score is computed on the same data used to fit the model, so it may be optimistic.
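
One way to read the ROC AUC is as the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one, with ties counted as one half. The sketch below computes it that way on a five-point toy example. The function name `pairwise_auc` and the toy labels and scores are made up for illustration.

```python
import numpy as np

def pairwise_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    # Compare every positive score with every negative score.
    diff = pos[:, None] - neg[None, :]
    wins = (diff > 0).sum() + 0.5 * (diff == 0).sum()
    return wins / (len(pos) * len(neg))

# Toy labels and scores for illustration only.
toy_labels = [1, 0, 1, 0, 0]
toy_scores = [0.9, 0.2, 0.6, 0.7, 0.1]
toy_auc = pairwise_auc(toy_labels, toy_scores)
print(toy_auc)
```

Here five of the six positive/negative pairs are ordered correctly, so the toy AUC is 5/6.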

Our goal is to predict elite status; in other words, we want to find the probability that a user is a trustworthy customer. Although some users are not elite users so far, they may also have many high-quality reviews and active interactions with others, and we would also view them as trustworthy users. Hence, for prediction, we intend to use a lower threshold, which may flag more "elite" users than there are in reality. For evaluation, we care most about recall, which indicates what fraction of the real elite users our model detects.

In [195]:
evaluation = []
for i in np.arange(0,1,0.1):
    y_pred = np.where(y_proba[:,1]>i,1,0)
    evaluation.append([sum(y_pred)/len(y_pred), sum(y_pred==user['elite'])/len(y_pred),
                      sum((y_pred==user['elite'])&(user['elite']==1))/sum(user['elite']==1)])
    

In [198]:
pd.DataFrame(evaluation,index=np.arange(0,1,0.1),columns=['Proportion','Accuracy','Recall'])

Unnamed: 0,Proportion,Accuracy,Recall
0.0,1.0,0.207246,1.0
0.1,0.32806,0.858688,0.950546
0.2,0.217703,0.911661,0.812105
0.3,0.181742,0.911036,0.723836
0.4,0.158116,0.90606,0.65483
0.5,0.14301,0.900489,0.604945
0.6,0.128948,0.8931,0.553191
0.7,0.11703,0.886545,0.508626
0.8,0.1055,0.879544,0.463916
0.9,0.091854,0.870605,0.409431


The original proportion of "elite" users is 0.21. From the table above, the 0.2 threshold gives the predicted proportion closest to the real one. Since we hope to capture more potential "elite" users, we prefer a lower threshold with higher recall and acceptable accuracy. Hence, we choose the 0.1 threshold.
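
The trade-off behind this choice can be sketched with a small helper that evaluates one threshold at a time. Besides the proportion, accuracy and recall used in the table, it also reports precision, which is an addition not in the original analysis; it shows what share of the flagged users are truly elite. The function name `threshold_metrics` and the toy labels and probabilities are hypothetical.

```python
import numpy as np

def threshold_metrics(y_true, proba, threshold):
    """Proportion flagged, accuracy, recall and precision at one threshold."""
    y_true = np.asarray(y_true)
    y_hat = (np.asarray(proba) > threshold).astype(int)
    true_pos = np.sum((y_hat == 1) & (y_true == 1))
    flagged = np.sum(y_hat == 1)
    return {
        "proportion": y_hat.mean(),
        "accuracy": np.mean(y_hat == y_true),
        "recall": true_pos / np.sum(y_true == 1),
        # Precision is undefined when nothing is flagged.
        "precision": true_pos / flagged if flagged else float("nan"),
    }

# Toy labels and predicted probabilities for illustration only.
toy_y = [1, 1, 0, 0, 0]
toy_p = [0.9, 0.15, 0.3, 0.05, 0.6]
low = threshold_metrics(toy_y, toy_p, 0.1)
high = threshold_metrics(toy_y, toy_p, 0.5)
print(low)
print(high)
```

In this toy case, lowering the threshold from 0.5 to 0.1 raises recall from 0.5 to 1.0 while twice as many users are flagged, which mirrors the pattern in the table above.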

In [208]:
y_pred = np.where(y_proba[:,1]>0.1,1,0)
print('Proportion of elite users in prediction:',round(sum(y_pred)/len(y_pred),3))
print('Accuracy:',round(sum(y_pred==user['elite'])/len(y_pred),3))
print('Recall rate:',round(sum((y_pred==user['elite'])&(user['elite']==1))/sum(user['elite']==1),3))

Proportion of elite users in prediction: 0.328
Accuracy: 0.859
Recall rate: 0.951


Reviews from potential elite users might be more valuable and trustworthy, so we give them a higher weight in the later analysis. However, the weight should not be too extreme, since reviews from other users are not meaningless. Overall, we decide to give elite users' reviews weight 2 and common users' reviews weight 1.

In [209]:
userweight = np.where(y_pred==1,2,1)

In [215]:
userweight = pd.DataFrame(userweight,columns=['weight'])
userweight['user_id'] = user['user_id']

In [220]:
userweight.head()

Unnamed: 0,weight,user_id
0,2,pU6GoRTcl1rIOi6zMynjog
1,2,NcUMNz6tAahD6mDmuKwYIA
2,1,1a36KbE7XH31Uo4fkmdkqQ
3,2,7wFpWiJMxePaaGiL_c-_IQ
4,2,i3dgAM1hWY9UdUCNMDnLXQ


In [221]:
userweight.to_csv('userweight.csv',index=False)
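
As a rough illustration of how the saved weights might be used in the later analysis, the sketch below joins them to a review table and computes a weighted average rating. The review table, its `business_id` and `stars` columns, and all values are hypothetical; the later analysis may combine the weights differently.

```python
import pandas as pd

# Hypothetical review table; column names and values are assumptions.
toy_reviews = pd.DataFrame({
    "user_id": ["u1", "u2", "u3"],
    "business_id": ["b1", "b1", "b1"],
    "stars": [5, 2, 4],
})
# Toy stand-in for userweight.csv.
toy_weights = pd.DataFrame({"user_id": ["u1", "u2"], "weight": [2, 1]})

# Attach each reviewer's weight; reviewers without a weight default to 1.
merged = toy_reviews.merge(toy_weights, on="user_id", how="left")
merged["weight"] = merged["weight"].fillna(1)

# Weighted versus plain average rating.
weighted_mean = (merged["stars"] * merged["weight"]).sum() / merged["weight"].sum()
plain_mean = merged["stars"].mean()
print(weighted_mean, plain_mean)
```

With weight 2 on the first reviewer, the five-star review counts twice, so the weighted average is higher than the plain one.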