# NLP and Machine Learning

## Background

We have a dataset called `review` that holds the information from the review section. In this project, we use the review text, the number of stars the user gave the business, the number of people who found the review useful, and the business ID to perform NLP and machine learning.

Below is a sample review that a user wrote for a restaurant:

```python
test['text'][0] 
```

>"The pizza was okay. Not the best I've had. I prefer Biaggio's on Flamingo / Fort Apache. The chef there can make a MUCH better NY style pizza. The pizzeria @ Cosmo was over priced for the quality and lack of personality in the food. Biaggio's is a much better pick if youre going for italian - family owned, home made recipes, people that actually CARE if you like their food. You dont get that at a pizzeria in a casino. I dont care what you say..."

## NLP
(This part corresponds to Machine Learning NLP.ipynb)

**We perform sentiment extraction on selected review texts. In our case, we choose reviews with more than 8 'useful' upvotes, which we take as a sign of a good review.**
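
The selection rule above can be sketched with pandas as follows. This is only an illustration: the toy rows and the column names (`business_id`, `text`, `stars`, `useful`) are assumptions, not the real dataset schema.

```python
import pandas as pd

# Toy stand-in for the review table; column names and values are assumptions.
reviews = pd.DataFrame({
    "business_id": ["b1", "b1", "b2", "b3"],
    "text": ["Great pizza", "Too salty", "Friendly staff", "Slow service"],
    "stars": [5, 2, 4, 1],
    "useful": [12, 3, 9, 8],
})

USEFUL_THRESHOLD = 8  # "more than 8" means strictly greater than 8

# Keep only reviews whose 'useful' count exceeds the threshold.
selected = reviews[reviews["useful"] > USEFUL_THRESHOLD].reset_index(drop=True)
print(selected[["business_id", "useful"]])
```

Note that a review with exactly 8 upvotes is excluded under this reading of "more than 8".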



### 1. The importance of sentiment analysis
>It's important to use sentiment to evaluate business reviews. Although 'stars' seems to be a sufficient estimator, it lacks objectivity: a person can give different stars depending on his or her mood even though the actual quality of the business is constant. Thus, taking sentiment into account is a good way of measuring the actual quality of the business.

### 2. Model selection

**We experimented with 3 models: TextBlob, SpaCy Text Categorizer and the Google NLP API:**
1. TextBlob:  
TextBlob doesn't have acceptable accuracy, especially for food reviews. Its Naive Bayes analyzer is slow and inaccurate. I did a little research on how TextBlob calculates sentiment. It turns out that it has an XML file containing a polarity score for each word, and the overall polarity score is just the average of the polarity scores of the words (Link: https://planspace.org/20150607-textblob_sentiment/ ). This is a poor way of estimating the sentiment score, so I considered training my own NLP model.
2. SpaCy Text Categorizer:  
SpaCy is a package that allows developers to build their own NLP models. The base model it provides is a CNN.  
We used Amazon's food review dataset to train the model, and the result is as follows:
         Warning: Unnamed vectors -- this won't allow multiple vectors models to be loaded. (Shape: (0, 0))
         Training the model...
         LOSS 	  P  	  R  	  F  
         204.768	0.910	0.983	0.945
         time for one iteration is 155.03080892562866
         123.836	0.925	0.975	0.949
         time for one iteration is 178.083074092865
         94.492	0.930	0.973	0.951
         time for one iteration is 176.36578583717346
         84.068	0.934	0.969	0.951
         time for one iteration is 181.42677283287048
         74.010	0.934	0.969	0.951
         time for one iteration is 185.3597228527069
         68.926	0.933	0.969	0.950
         time for one iteration is 188.23954820632935
         64.172	0.935	0.966	0.950
         time for one iteration is 177.8659210205078
         60.054	0.935	0.968	0.951
         time for one iteration is 180.7558081150055
         63.498	0.935	0.968	0.951
         time for one iteration is 178.3223419189453
         57.024	0.936	0.968	0.951
         time for one iteration is 174.05425381660461
         CPU times: user 48min 14s, sys: 1min 54s, total: 50min 9s
         Wall time: 29min 35s                  
The trained model doesn't perform well either, and training takes 29 minutes for just 10,000 samples.
3. Google NLP API:  
Google's API has the best accuracy. However, it is a paid service.
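
To make the word-averaging idea described for TextBlob above concrete, here is a deliberately simplified sketch. The lexicon and its scores are made up for illustration, and the function is not TextBlob's actual implementation; it only shows what "average the polarity of the known words" means.

```python
import re

# Made-up polarity lexicon for illustration only (not TextBlob's data).
TOY_LEXICON = {
    "good": 0.7,
    "great": 0.8,
    "okay": 0.5,
    "bad": -0.7,
    "overpriced": -0.5,
}

def average_polarity(text, lexicon=TOY_LEXICON):
    """Average the polarity of the words found in the lexicon (0.0 if none match)."""
    words = re.findall(r"[a-z']+", text.lower())
    scores = [lexicon[w] for w in words if w in lexicon]
    return sum(scores) / len(scores) if scores else 0.0

# A plain word average has no notion of negation, so both sentences score the same.
# This is a limitation of this simplified sketch, not a claim about TextBlob's full rules.
print(average_polarity("The pizza was good"))
print(average_polarity("The pizza was not good"))
```

The two calls return the same score, which illustrates why a purely lexicon-based average can misjudge reviews.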

**Conclusion: TextBlob is not accurate, the SpaCy Text Categorizer is slow to train, and Google's API is accurate but expensive. We proceed with Google's API and TextBlob since we don't have time to train the SpaCy model.**
 


### 3. Comparison between TextBlob and Google's NLP API

We ran both TextBlob and Google's API on the text and plotted the distribution of the sentiment scores:    
<img src="textblb_vs_Google.png">

We can see that TextBlob's scores are concentrated in the 0-0.25 range, while the Google API's scores are more spread out, which better reflects the real situation.



**Conclusion: We finally chose Google's API because each of us has $300 in credit. Since we will use these sentiment scores for further training, we need the most accurate model.**
 

### 4. Sentiment analysis

* **Question: What's the relationship between review stars and sentiment?**
<img src= "review_starsvsse.png">
It's not clear how they are related. Ideally, we would expect high sentiment to go with a high review score. However, this is not the case here.

* **Question: Why do some reviews have high stars but low sentiment, and vice versa?**         
Check the reviews where the difference between stars and sentiment is larger than 1.5:  

```python
df[abs(a - df['sentiments']) > 1.5].reset_index().loc[1,'text']
```
>the vacuums suck, that's for sure\nsnagged a twenty, inside door\npocket, swept it, straight away\ncest le vie, most would say\nbut I asked, "could it be found?"\n"sure, we'll shut, the vacuums down"\nowners, no less, had to go\nto a place, only, Mike Rowe\nof Dirty Jobs, would lurk around\nthe filth, from every car in town\ncollecting, in some giant vat\nwhile they looked, there I sat\nadmiring, this great idea\nand how, ideas, are crystal clear\nto some, but not to others\nbuild a better mouse trap, brother\nwhich they have, at Clean Freak\nget your car washed, on the cheap\npseudo self-serve, shoot the tube\nfelt, just like, some surfer dude\nin a blue wave, closing out\nsoap and suds, sloshed about\nspot free rinse, turtle wax\nspit out, spotless, front to back\nya, I bought, the monthly pass\nbefore, my car, it looked liked ass\nfilthy, crusty, dusty, dirty \nah shucks, now, she sure looks purdy\n'course Clean Freak's got, karma's attraction\nfound, returned, my andrew jackson'

```python
df[abs(a - df['sentiments']) > 1.5].reset_index().loc[1,['review_stars','sentiments']]
```
>review_stars      5       
sentiments     -0.7         
Name: 1, dtype: object           


As you can see, some reviews are clearly negative yet have a 5-star rating, and vice versa. This negativity is captured by the sentiment score, and such reviews should be considered "dishonest reviews", which can be detected by the code above.
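
The snippets above use a variable `a` that is not defined in this write-up. A plausible reading, stated here as an assumption, is that `a` holds the star ratings rescaled to the same [-1, 1] range as the sentiment score. Under that assumption, the detection step could be sketched like this (toy data, apart from the 5-star / -0.7 pair shown above):

```python
import pandas as pd

# Toy data; values are made up except the (5 stars, -0.7 sentiment) pair.
df = pd.DataFrame({
    "review_stars": [5, 5, 1, 3, 1],
    "sentiments": [0.8, -0.7, -0.9, 0.1, 0.9],
})

# Assumption: map stars 1..5 linearly onto [-1, 1] so they are comparable
# with the sentiment score (1 -> -1.0, 3 -> 0.0, 5 -> 1.0).
scaled_stars = (df["review_stars"] - 3) / 2

# Flag reviews where rescaled stars and sentiment disagree by more than 1.5.
df["mismatch"] = (scaled_stars - df["sentiments"]).abs() > 1.5
print(df)
```

With this scaling, a 5-star review with sentiment -0.7 has a gap of 1.7 and is flagged, which matches the example above.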

## Machine Learning
(This corresponds to Machine learning.ipynb)

### Goal
We want to predict the number of stars a user will give a restaurant based on the review's sentiment, check-in counts and the restaurant's information.

### 1. Data preprocessing
* Drop reviews that are not honest (difference between sentiment and stars larger than 1.5)
* Turn non-numerical data into strings for encoding later
* Drop NA values in sentiments
* Encode the categorical variables as float arrays using OneHotEncoder
* Split the data into training and testing sets
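
The last four steps of the list above can be sketched with scikit-learn as follows. Every column name and value in the toy table is hypothetical, and the 80/20 split ratio is an assumption, since the write-up does not state one.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder

# Toy table; every column name and value here is hypothetical.
data = pd.DataFrame({
    "city": ["Las Vegas", "Phoenix", "Toronto", "Las Vegas", "Toronto", "Phoenix"],
    "price_range": [2, 1, 3, 2, 4, 1],
    "sentiments": [0.6, -0.2, 0.1, None, 0.9, -0.5],
    "review_stars": [5, 2, 3, 4, 5, 1],
})

# Drop rows with a missing sentiment score.
data = data.dropna(subset=["sentiments"])

# Cast the categorical columns to strings so the encoder sees one consistent type.
X_cat = data[["city", "price_range"]].astype(str)

# One-hot encode; .toarray() turns the sparse result into a dense float array.
encoder = OneHotEncoder(handle_unknown="ignore")
X_encoded = encoder.fit_transform(X_cat).toarray()

# Append the numeric sentiment column to the one-hot block.
X = np.hstack([X_encoded, data[["sentiments"]].to_numpy()])
y = data["review_stars"].to_numpy()

# Assumed 80/20 split with a fixed seed for reproducibility.
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
print(X.shape, x_train.shape, x_test.shape)
```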

### 2. Neural network
#### Architecture
The neural network we chose has 4 layers in total:  
1st layer has 985 nodes with an input dimension of 984 (one node for bias)  
2nd and 3rd layers are hidden layers with 3 nodes each (converges fast)  
4th layer is the output layer with 5 nodes  
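
To get a feel for the layer sizes listed above, here is a NumPy sketch of a forward pass through a network of that shape. It reads the description as 984 inputs feeding layers of 985, 3, 3 and 5 units. The ReLU hidden activation, the softmax output and the random weights are assumptions made for illustration; the project itself trains the network with Keras.

```python
import numpy as np

rng = np.random.default_rng(0)

# Layer widths read from the description: 984 inputs -> 985 -> 3 -> 3 -> 5 outputs.
layer_sizes = [984, 985, 3, 3, 5]

# Random weights and zero biases stand in for trained parameters.
weights = [rng.normal(scale=0.01, size=(n_in, n_out))
           for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]
biases = [np.zeros(n_out) for n_out in layer_sizes[1:]]

def forward(x):
    """Forward pass: ReLU on the first three layers (assumed), softmax on the output."""
    for w, b in zip(weights[:-1], biases[:-1]):
        x = np.maximum(x @ w + b, 0.0)
    logits = x @ weights[-1] + biases[-1]
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    exp = np.exp(logits)
    return exp / exp.sum(axis=1, keepdims=True)

probs = forward(rng.normal(size=(4, 984)))  # 4 fake reviews
n_params = sum(w.size + b.size for w, b in zip(weights, biases))
print(probs.shape, n_params)
```

Almost all of the parameters sit in the first layer, while the two 3-node layers form a very narrow bottleneck.
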
#### Training
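We train the model on the training dataset. The optimizer is SGD (stochastic gradient descent) and the loss is `categorical_crossentropy`.
```python
import keras

# Plain SGD: no momentum, no learning-rate decay, no Nesterov acceleration.
# Note: newer Keras releases name the first argument `learning_rate` instead of `lr`.
sgd = keras.optimizers.SGD(lr=0.1, momentum=0.0, decay=0.0, nesterov=False)
model.compile(optimizer=sgd,
              loss='categorical_crossentropy',
              metrics=['accuracy'])
# categorical_crossentropy expects one-hot encoded labels in y_train.
model.fit(x_train,y_train,epochs=100)
```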
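
For readers unfamiliar with the loss used above, the following NumPy sketch shows what categorical cross-entropy computes on one-hot star labels. The helper names and the mapping "star s goes to column s - 1" are assumptions for illustration; Keras provides its own implementation.

```python
import numpy as np

def one_hot_stars(stars, n_classes=5):
    """Encode star ratings 1..5 as one-hot rows (star s -> column s - 1)."""
    stars = np.asarray(stars)
    encoded = np.zeros((stars.size, n_classes))
    encoded[np.arange(stars.size), stars - 1] = 1.0
    return encoded

def categorical_crossentropy(y_true, y_prob, eps=1e-12):
    """Mean over samples of -sum_k y_true[k] * log(y_prob[k])."""
    y_prob = np.clip(y_prob, eps, 1.0)
    return float(-(y_true * np.log(y_prob)).sum(axis=1).mean())

y_true = one_hot_stars([5, 1])
uniform = np.full((2, 5), 0.2)  # a model that guesses every class equally
print(categorical_crossentropy(y_true, uniform))  # equals ln(5)
```

A uniform guess over five classes gives a loss of ln(5), roughly 1.609, which is a useful baseline when reading the training loss.
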
#### Result
The model ends up with a loss of 0.8438 and a testing accuracy of 0.6278. Due to limited computation power, it hasn't converged yet. Here is the training curve, where the x-axis is epochs and the y-axis is accuracy:
<img src= 'nn.png'>


### 3. Logistic regression
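The logistic model uses an L1 penalty and reaches an accuracy of about 0.614.
```python
from sklearn.linear_model import LogisticRegression
logi = LogisticRegression(penalty='l1', solver='liblinear').fit(x_train,y_train_logis)  # liblinear supports L1
```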
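
An L1 penalty tends to drive some coefficients to exactly zero, which is what makes this model usable as a feature selector. The sketch below shows the idea on synthetic data; the data, the `C=0.05` value and the variable names are assumptions chosen for illustration, not the project's settings.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic data: only the first two of ten features carry signal.
X_demo = rng.normal(size=(300, 10))
y_demo = (X_demo[:, 0] - X_demo[:, 1] > 0).astype(int)

# liblinear is one of the scikit-learn solvers that supports the L1 penalty.
# A small C means strong regularization, which pushes more weights to zero.
clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.05)
clf.fit(X_demo, y_demo)

# Keep the features whose coefficient survived the L1 penalty.
selected = np.flatnonzero(np.abs(clf.coef_).sum(axis=0) > 0)
X_selected = X_demo[:, selected]
print("selected feature indices:", selected)
```
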
### 4. SVM with cross validation
We use 10-fold cross-validation with SVM, then draw the ROC curve to measure the performance.
Due to computational limitations, we can only run this algorithm on 5000 samples. 
Since we have multiple classes, we compute the `micro-average` ROC curve and AUC. We also compare the performance with and without feature selection.      
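
Micro-averaging pools every (sample, class) decision into a single binary problem before computing the curve. The sketch below illustrates this with scikit-learn on made-up scores. It is not the body of `svm_with_cv`, which is not shown in this write-up.

```python
import numpy as np
from sklearn.metrics import auc, roc_curve
from sklearn.preprocessing import label_binarize

classes = [1, 2, 3, 4, 5]

# Hypothetical true stars and per-class scores for four reviews.
y_true = np.array([1, 3, 5, 5])
y_score = np.array([
    [0.70, 0.10, 0.10, 0.05, 0.05],
    [0.10, 0.20, 0.50, 0.10, 0.10],
    [0.05, 0.05, 0.10, 0.20, 0.60],
    [0.10, 0.10, 0.10, 0.40, 0.30],  # the true class (5) is not ranked first here
])

# One indicator column per class, shape (n_samples, n_classes).
y_bin = label_binarize(y_true, classes=classes)

# Micro-average: flatten labels and scores, then treat them as one binary task.
fpr, tpr, _ = roc_curve(y_bin.ravel(), y_score.ravel())
micro_auc = auc(fpr, tpr)
print(micro_auc)
```

Because all classes are pooled, frequent classes influence the micro-average more than rare ones.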

* SVM CV with feature selection (features selected by logistic regression with an L1 penalty)
```python
svm_with_cv(X, Y, 'review_stars', True)
```
<img src='roc1.png'>
<img src='pr1.png'>

* SVM CV without feature selection (all features, no L1-based selection)
```python
svm_with_cv(X, Y, 'review_stars', False)
```
<img src='roc2.png'>
<img src='pr2.png'>

**Conclusion: With feature selection, the ROC AUC is 0.81 and the AURC (area under the PR curve) is 0.56. Without feature selection, the ROC AUC is 0.81 and the AURC is 0.54. Feature selection therefore keeps the ROC AUC unchanged and slightly improves the AURC while using fewer features, so we conclude that it works well.**
### 5. SVD dimension reduction
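We also try to reduce the dimension of the input, using truncated SVD with `n_components=20`.   
```python
from sklearn.decomposition import TruncatedSVD

# Project the feature matrix onto its top 20 singular directions.
svd = TruncatedSVD(n_components=20, random_state=42)
reduced_X = svd.fit_transform(X)
svm_with_cv(reduced_X, Y, 'review_stars', False)
```
The ROC and PR curves are as follows:   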
<img src='rocd.png'>
<img src='prd.png'>

**Conclusion: SVD dimension reduction neither maintains nor improves model accuracy.**
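
One quick diagnostic that may help interpret such a result is the share of variance retained by the 20 components. This check is not part of the original notebook, and the random binary matrix below is only a stand-in for the real one-hot features.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

rng = np.random.default_rng(42)

# Synthetic stand-in for a one-hot feature matrix (200 rows, 100 binary columns).
X_demo = (rng.random((200, 100)) < 0.1).astype(float)

svd = TruncatedSVD(n_components=20, random_state=42)
reduced = svd.fit_transform(X_demo)

# Fraction of the original variance kept by the 20 components.
retained = svd.explained_variance_ratio_.sum()
print(reduced.shape, round(float(retained), 3))
```

A low retained fraction would suggest that 20 components discard much of the information in the input.
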
### 6. Conclusion
We tried different models, and each has its pros and cons. The neural net is more accurate and adjustable, and we can use grid search to tune its hyperparameters in the future. Logistic regression is fast yet less accurate. SVM is accurate but takes a long time to train. We can also explore more dimension reduction methods in the future. For now, we decide to use the trained neural network model for the application.

## Application

Given a user's review and the business's information, we can predict the number of stars the user is going to give. However, some users give inaccurate stars that don't match their reviews. Our algorithm can then alert them so that the rating doesn't distort the overall review of the business.       
Here is an example:
<img src='os1.png'>
<img src='os2.png'>
We can see that before our prediction there are more 2- and 3-star reviews, and after prediction there are fewer. We suspect that unsatisfied users tend to give lower scores even if the actual quality is not as bad as they think, especially in the 2- to 3-star range.
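
The alerting idea described in this section could be sketched as a small helper. The function name `star_alert`, the tolerance of one star and the message wording are all hypothetical; the predicted rating would come from the trained model.

```python
def star_alert(given_stars, predicted_stars, tolerance=1):
    """Return a warning message when the given rating is far from the predicted one."""
    gap = abs(given_stars - predicted_stars)
    if gap > tolerance:
        return (f"You gave {given_stars} star(s), but your review reads more like "
                f"{predicted_stars} star(s). Do you want to double-check your rating?")
    return None  # ratings agree closely enough, no alert

print(star_alert(2, 4))  # gap of 2 stars triggers the alert
print(star_alert(4, 4))  # no gap, no alert
```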