# Yelp Data Analysis
## Wednesday_Group 6

# 1. Introduction
Given the dataset, we first clean the review text and convert it into vectors, then extract features using a sentiment dictionary that we build ourselves. We then fit models such as linear regression, multinomial logistic regression, and XGBoost to find out what makes a review positive or negative and to predict the star ratings in the test dataset.

# 2. Background and Goal
It is easy for a human to tell whether a review is positive or negative by reading the text. For a large set of reviews, however, we need machines to do this for us. The goal is therefore to build models that predict ratings as accurately as possible.

# 3. Data Cleaning
## 3.1 Tokenization
First, every review is split into single words, which are stored in a new dataset; punctuation, whitespace, and special symbols are removed.
## 3.2 Transform to lower case
Second, we convert all upper-case letters to lower case so that the same word is counted once regardless of capitalization when we compute word frequencies.
## 3.3 Word stemming and lemmatization
Third, to recognize the same word in its different forms, we apply stemming and lemmatization. For example, 'playing' and 'played' are both reduced to their shared root 'play'.
## 3.4 Remove stop words 
Finally, stop words such as 'there', 'do', and 'not' are removed, since they carry little meaning on their own and do not help the later analysis.
## 3.5 Output
Both the training dataset and the test dataset go through all of these cleaning steps, which gives us the basis for further analysis and model fitting.
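
The four cleaning steps can be sketched in a few lines. This is a minimal Python illustration, not the R code used for the project: the stop-word list and the `crude_stem` suffix stripper are hypothetical stand-ins for a real stop-word list and a real stemmer or lemmatizer.

```python
import re

# Tiny illustrative stop-word list (hypothetical; a real list is much longer).
STOP_WORDS = {"there", "do", "not", "the", "a", "is", "and", "it"}


def crude_stem(word):
    """Strip a few common suffixes; a rough stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def clean_review(text):
    """Lower-case, tokenize, stem, and drop stop words."""
    # Lower-casing first lets one regex both tokenize and drop
    # punctuation, whitespace, digits, and special symbols.
    tokens = re.findall(r"[a-z]+", text.lower())
    stems = [crude_stem(t) for t in tokens]
    return [s for s in stems if s not in STOP_WORDS]


cleaned = clean_review("There is NOT a dog playing; it played!")
```

In this toy example, 'playing' and 'played' both end up as 'play', and the stop words disappear from the output.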

# 4. From text to variables
## 4.1 Word to vector
First, using the R package 'text2vec', we collect every distinct word used in the training dataset and treat each one as a variable. For each cleaned review, the value of a variable is 0 if the review does not contain the word; otherwise it is the number of times the word occurs in the review. Once every word in every review has a value, we obtain a sparse matrix whose rows are the reviews and whose columns are the words.
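
As a rough illustration of the resulting matrix, the Python sketch below builds a vocabulary and stores each review as a sparse row of non-zero counts. It does not reproduce the 'text2vec' API; the function name `build_dtm` and the toy reviews are assumptions made for the example.

```python
from collections import Counter


def build_dtm(cleaned_reviews):
    """Return (vocab, rows); each row maps column index -> word count."""
    vocab = sorted({w for review in cleaned_reviews for w in review})
    index = {w: j for j, w in enumerate(vocab)}
    rows = []
    for review in cleaned_reviews:
        counts = Counter(review)
        # Only non-zero entries are stored, which is what makes it sparse.
        rows.append({index[w]: c for w, c in counts.items()})
    return vocab, rows


reviews = [["good", "food", "good"], ["bad", "food"]]
vocab, dtm = build_dtm(reviews)
```

Here the matrix has two rows (reviews) and three columns (distinct words), and the first row records that 'good' occurs twice.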
## 4.2 TF-IDF
TF-IDF reweights the sparse matrix according to how frequently each word appears in the whole dataset. Entries equal to 0 stay 0, while each non-zero count is rescaled by a weight that depends on how often the word is used across the dataset, so that words appearing in most reviews count for less than rarer, more distinctive words.
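
The reweighting can be illustrated with the textbook form of TF-IDF, where the term frequency is the count divided by the review length and the inverse document frequency is log(N / df). This Python sketch is an assumption about the general idea only; the exact formula and normalization used by 'text2vec' may differ.

```python
import math


def tfidf(rows, n_terms):
    """Reweight sparse count rows (dicts of column -> count) with tf * idf."""
    n_docs = len(rows)
    # Document frequency: in how many reviews does each word appear?
    df = [0] * n_terms
    for row in rows:
        for j in row:
            df[j] += 1
    idf = [math.log(n_docs / d) if d else 0.0 for d in df]
    weighted = []
    for row in rows:
        total = sum(row.values())
        weighted.append({j: (c / total) * idf[j] for j, c in row.items()})
    return weighted


rows = [{0: 2, 1: 1}, {1: 1, 2: 1}]
weights = tfidf(rows, 3)
```

In this toy input, word 1 appears in both reviews, so its weight drops to zero, while word 0 appears in only one review and keeps a positive weight. Entries that were absent from a row stay absent.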
## 4.3 Sentiment Dictionary
We build a sentiment dictionary for the words selected by the LDA model. First, we use the words with the clearest tendency to build a base for estimation.
Let 

$$\textrm{Positive\%}_{word}=\frac{\textrm{appearances of word in 4- and 5-star reviews}}{\textrm{appearances of word in all reviews}}$$
$$\textrm{Negative\%}_{word}=\frac{\textrm{appearances of word in 1- and 2-star reviews}}{\textrm{appearances of word in all reviews}}$$

Then we set thresholds

$$\textrm{Positive threshold}= mean(\textrm{Positive\%}_{word})+sd(\textrm{Positive\%}_{word})$$
$$\textrm{Negative threshold}= mean(\textrm{Negative\%}_{word})+sd(\textrm{Negative\%}_{word})$$

We then select words with Positive% > Positive threshold as positive words and words with Negative% > Negative threshold as negative words. The intensity of a word's sentiment is its Positive% or Negative%, respectively.
Some words in the base:
<img src="t1.png">
Here Positive threshold = 0.7511 and Negative threshold = 0.3650, which means that the positive words have much higher intensities than the negative words.
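
The base selection can be sketched as follows. This is an illustrative Python version under stated assumptions: the function name `build_base` and the word counts are invented for the example, and the standard deviation is the sample standard deviation (as R's `sd` computes).

```python
from statistics import mean, stdev


def build_base(counts):
    """counts maps word -> (n_in_4_5_star, n_in_1_2_star, n_in_all_reviews)."""
    pos = {w: p / total for w, (p, n, total) in counts.items()}
    neg = {w: n / total for w, (p, n, total) in counts.items()}
    # Threshold = mean + one standard deviation of the share across words.
    pos_thr = mean(list(pos.values())) + stdev(list(pos.values()))
    neg_thr = mean(list(neg.values())) + stdev(list(neg.values()))
    positive = {w: v for w, v in pos.items() if v > pos_thr}
    negative = {w: v for w, v in neg.items() if v > neg_thr}
    return positive, negative, pos_thr, neg_thr


# Invented toy counts, for illustration only.
toy_counts = {
    "delicious": (95, 2, 100),
    "awful": (3, 90, 100),
    "table": (50, 30, 100),
    "menu": (55, 25, 100),
    "okay": (40, 30, 100),
}
positive, negative, pos_thr, neg_thr = build_base(toy_counts)
```

With these toy counts, only 'delicious' clears the positive threshold and only 'awful' clears the negative one; the values stored in the two dictionaries are the intensities.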


Using the Word2vec method, we transform each word into a 100-dimensional vector. We then measure the distance between two words by the cosine similarity of their vectors, so a larger value means the two words are closer:

$$Distance_{(a, b)}=\frac{\langle a, b\rangle}{\lVert a\rVert\cdot\lVert b\rVert}$$
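
For concreteness, here is a minimal Python sketch of this measure on plain lists of numbers. The short vectors in the checks are toy values, not real Word2vec embeddings.

```python
import math


def cosine_similarity(a, b):
    """Inner product of a and b divided by the product of their norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
opposite = cosine_similarity([1.0, 0.0], [-1.0, 0.0])
```

The value lies between -1 and 1: vectors pointing the same way score close to 1, orthogonal vectors score 0, and opposite vectors score -1.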

For a “foreign” word that is not in the base, we select the 20 base words closest to it, compute a weighted average of their sentiment, and assign the result to the “foreign” word. The weights compensate for the different proportions and intensities of positive and negative words. The final formula is

$$Sum_{word, p}=\textrm{sum of the Intensity of all positive words selected}$$
$$Sum_{word, n}=\textrm{sum of the Intensity of all negative words selected}$$
$$mean_p=\textrm{mean Intensity of positive words in the base}$$
$$mean_n=\textrm{mean Intensity of negative words in the base}$$
$$\lambda=\textrm{proportion of positive words in the base}$$


$$Intensity_{word}=\frac{Sum_{word, p}/mean_p/\lambda-Sum_{word, n}/mean_n/(1-\lambda)}{2}$$
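
The formula translates directly into code. The Python sketch below is illustrative: the function name `foreign_intensity` and the neighbor intensities are made up, and the two input lists are assumed to hold the intensities of the positive and negative words among the 20 nearest base words.

```python
def foreign_intensity(pos_neighbors, neg_neighbors, mean_p, mean_n, lam):
    """Signed sentiment of a word outside the base, from its nearest base words.

    pos_neighbors / neg_neighbors: intensities of the selected positive and
    negative base words. lam: proportion of positive words in the base.
    """
    sum_p = sum(pos_neighbors)
    sum_n = sum(neg_neighbors)
    return (sum_p / mean_p / lam - sum_n / mean_n / (1 - lam)) / 2


# Neighbors that mirror the base exactly (60% positive, average intensities).
balanced = foreign_intensity([0.8] * 12, [0.4] * 8, 0.8, 0.4, 0.6)
# All 20 neighbors positive.
all_positive = foreign_intensity([0.8] * 20, [], 0.8, 0.4, 0.6)
```

Dividing by the mean intensities and by the class proportions means that a word whose neighbors simply mirror the composition of the base gets an intensity of about zero, while a word surrounded only by positive base words gets a clearly positive value.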

We then add the “foreign” word to the base and randomly select another “foreign” word. The proportions and mean intensities of positive and negative words are updated at each iteration; in the end, positive words account for about 60% of all the words in the dictionary.

After all the words are added into the base, we again construct thresholds

$$\textrm{Positive threshold}= mean(\textrm{Intensity}_{\textrm{positive word}})+sd(\textrm{Intensity}_{\textrm{positive word}})$$
$$\textrm{Negative threshold}= mean(\textrm{Intensity}_{\textrm{negative word}})-sd(\textrm{Intensity}_{\textrm{negative word}})$$

and select a new base. We then repeat the whole procedure to check that the dictionary is robust. Since the proportions and mean intensities of positive and negative words are very similar in the two versions of the dictionary, we conclude that the algorithm converges.

Some interesting words from the dictionary.
<img src="t2.png">
$$Positive$$
<img src="t3.png">
$$Negative$$

For most people, food that tastes like home is a great attraction, especially for those who live far from their home town. That is why words related to home carry a strong positive sentiment. People also often describe themselves as picky in order to emphasize how much pleasure the restaurant has brought them.

On the other hand, people do care about money: even words like "discount" and "giftcard" carry a strong negative sentiment. It seems that once people start to talk about money, they are unlikely to be satisfied. Surprisingly, most words for materials also appear to be negative, which suggests that people care about what kind of containers are used to serve their meal.

## 4.4 Latent Dirichlet allocation
1. First, we use the "FindTopicsNumber" function in the package "ldatuning" to choose the number of topics. Based on four metrics (Griffiths2004, CaoJuan2009, Arun2010, Deveaud2014), 15 topics is the best choice.

2. Then we use the "LDA" function in the package "topicmodels" to extract 15 topics, with "Gibbs" as the method.

3. LDA gives us a table of the high-frequency words in each topic together with their beta coefficients. For each review and each topic, we multiply the frequency of each of the topic's words in the review by that word's sentiment score in our self-built sentiment dictionary and sum the products. This gives the value of each feature in each review, which we use as an independent variable.

4. With the values of these 15 features for the different reviews, we can then regress the dependent variable y on them.
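
Step 3 can be sketched as below. This Python example rests on one reading of that step, namely that each topic contributes one feature computed over its own high-frequency words; the function name `topic_features`, the topic word lists, and the sentiment scores are all invented for illustration.

```python
from collections import Counter


def topic_features(review_tokens, topic_words, sentiment):
    """One feature per topic: sum of (word count in review * sentiment score)."""
    counts = Counter(review_tokens)
    return [
        # Words missing from the dictionary contribute nothing.
        sum(counts[w] * sentiment.get(w, 0.0) for w in words)
        for words in topic_words
    ]


# Two toy topics instead of 15, with made-up sentiment scores.
topic_words = [["pizza", "crust"], ["service", "wait"]]
sentiment = {"pizza": 0.5, "crust": 0.25, "service": -0.5, "wait": -1.0}
features = topic_features(["pizza", "pizza", "wait", "crust"], topic_words, sentiment)
```

Each review is thereby reduced to one number per topic, and these numbers serve as the predictors in the regression of step 4.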


The following plot shows how the best number of topics is determined. We look for a number that minimizes "CaoJuan2009" and "Arun2010" while maximizing "Griffiths2004" and "Deveaud2014"; according to the plot, we choose 15 as the number of topics.
<img src="figure.jpg">

The following plot shows the words in each topic, ordered by decreasing beta coefficient.
<img src="list.jpg">

# 5. Models Fitting and Prediction
## 5.1 Linear regression
To begin with, we naively take the 1000 most frequently used words as variables and fit a linear regression model.

The figure below shows some words that have a significant effect on a review's rating. Clearly, words like 'worse', 'horrible', and 'awful' hurt a review's rating.

We also consider other factors, such as location and the length of a review. Fitting a linear regression shows that the relationship between stars and the length of the review before cleaning is statistically significant, while the relationship between stars and city is not. We therefore add review length to the model.
With this linear regression model, we predict the stars in the test dataset. The MSE turns out to be over 0.7, which is not good enough.
<img src="words1.jpg">

## 5.2 Multinomial logistic regression
We use the function 'multinom' in the 'nnet' package to fit a multinomial logistic regression model on variables selected by the previous linear regression model, since the response variable 'stars' is not binary but has more than two levels. First, we use only the 10 words with the lowest p-values as predictors, then 20, 50, 100, 150, 180, 200, and so on. We find that the cross-validated MSE on the training dataset no longer decreases much once the number of variables exceeds 180. We therefore keep 180 word variables in our model.

After the multinomial logistic regression model is built, we predict on the training dataset using two prediction types: 'class' and 'probs'. With 'class', the model returns a single predicted rating of 1, 2, 3, 4, or 5. With 'probs', it returns the predicted probability of each rating from 1 to 5, and we take the probability-weighted average of the ratings as our final prediction. The 'probs' predictions turn out to be much more accurate than the 'class' predictions, and their MSE is quite low.
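
The difference between the two prediction types can be illustrated on a single review. This Python sketch uses made-up probabilities, and the helper names `expected_stars` and `most_likely_star` are hypothetical; it does not call 'nnet'.

```python
def expected_stars(probs):
    """Probability-weighted average rating; probs[0] is P(1 star), etc."""
    return sum(star * p for star, p in enumerate(probs, start=1))


def most_likely_star(probs):
    """Rating with the highest probability, like a 'class' prediction."""
    return max(range(1, len(probs) + 1), key=lambda s: probs[s - 1])


# Made-up class probabilities for one review.
probs = [0.05, 0.10, 0.25, 0.35, 0.25]
by_class = most_likely_star(probs)
by_probs = expected_stars(probs)

# Squared error of each prediction if the true rating is 3.
error_class = (by_class - 3) ** 2
error_probs = (by_probs - 3) ** 2
```

Here the 'class' style prediction is 4 while the weighted average is about 3.65. When the true rating is 3, the weighted average has the smaller squared error, which is one plausible reason the 'probs' predictions do better on MSE.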
<img src="words2.jpg">
Above are some of the most significant words (relative to the baseline). For example, the first one, 'gem', has a large effect on the rating, which makes sense given its literal meaning. However, a word like 'ist' seems unlikely to belong here; its presence may be caused by problems in cleaning.

We therefore use this model to predict stars in the training data. The result is acceptable, with the lowest RMSE among all the models we tried, so this is our final model.

## 5.3 XGBoost

In this part, we first set the parameter 'nrounds', the number of boosting iterations, to 200 and 'eta', the learning rate that shrinks the contribution of each iteration, to 0.01. With these settings, the RMSE has not converged by the last iteration. We therefore increase 'nrounds' step by step, with a correspondingly higher 'eta', until the RMSE no longer decreases over the last several iterations, which happens with 'nrounds' set to 500.

## 5.4 Models rebuilding via features from LDA 
We repeat the model-building process using the LDA features instead of words. However, the results are not as good as those from the word-based models.


# 6. Strengths and Weaknesses
Strengths: Our final model is easy to interpret and has higher accuracy in cross-validation on the training dataset. In addition, we build a sentiment dictionary ourselves and discover words that carry strong sentiment but are easy to overlook, such as 'hometown'.

Weaknesses: We do not make full use of the data; for example, other information such as review time and categories goes unused. In addition, our final RMSE is 0.68, which is not low enough compared with other groups. We believe this is because our predictions are skewed toward positive ratings, so for true values of 1 and 2 we have a higher MSE even though our overall accuracy is relatively high.

# 7. Conclusion
Having worked through the whole process of handling the Yelp dataset and selecting a model, we have met the goals of this module. Our final model is multinomial logistic regression with the 'probs' type of prediction, which has good accuracy.

#### Contribution
Data cleaning: Wenqin Xiong, Yiding Ding

Feature extraction and dictionary: Wenqin Xiong, Yiding Ding

Model building: Jianxiong Wang, Jin Tao

Summary: Jianxiong Wang, Jin Tao, Wenqin Xiong, Yiding Ding

Slide: Jianxiong Wang, Jin Tao, Wenqin Xiong, Yiding Ding