## Classifying and Natural Language Processing with Yelp Reviews Data
#### W207 Section 3, Group - <span style="color:orange"><strong>C</strong></span>olors
#### Summer, 2018
#### Team members:
- Chandra Sekar, chandra-sekar@ischool.berkeley.edu
- Guangyu (Gary) Pei, guangyu.pei@ischool.berkeley.edu
- Jooyeon (Irene) Seo, jooyeon@ischool.berkeley.edu
- Sijie (Anne) Yu, syu.anne@berkeley.edu

In [105]:
import numpy as np
import matplotlib.pyplot as plt

# SK-learn libraries for learning.
from sklearn.pipeline import Pipeline
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import BernoulliNB
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import GridSearchCV

# SK-learn libraries for evaluation.
from sklearn.metrics import confusion_matrix
from sklearn import metrics
from sklearn.metrics import classification_report

# SK-learn libraries for feature extraction from text.
from sklearn.feature_extraction.text import *

# Config Jupyter session
%config IPCompleter.greedy=True

# Set the randomizer seed so results are the same each time.
np.random.seed(0)

# Global configurations
np.set_printoptions(precision=4, suppress=True)

# Config system logs
import logging
logging.basicConfig(level=logging.INFO,
                   format='%(asctime)s %(levelname)s %(message)s')

#### Goals
Our project's primary goal is to use Yelp data (from Kaggle) to rate new businesses. We take Yelp user reviews and use the review text to predict whether each review is **positive** or **negative**. When people talk about a new business, we can capture their words, feed them into the model, and predict a rating that gives a rough sense of the business's quality and potential.

#### The Yelp Review Dataset
We wrote a shell [script](https://github.com/annesjyu/m207_summer_2018) to select $10,000$ reviews each for training, testing, and dev, with each set consisting of 50% negative and 50% positive reviews. We keep only the review text and stars columns, then binarize stars into the target label:
- if stars >= $3.0$, the review is *positive*
- otherwise, it's *negative*.
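
The labeling rule above can be sketched in a few lines. This is only an illustration: the `stars` values and the helper name `binarize_stars` are made up here, and the actual binarization is done by the shell script.

```python
# Hypothetical star ratings; the real values come from the Yelp stars column.
stars = [1.0, 2.5, 3.0, 4.0, 5.0]

def binarize_stars(stars, threshold=3.0):
    """Return 1 (positive) for ratings >= threshold, else 0 (negative)."""
    return [1 if s >= threshold else 0 for s in stars]

labels = binarize_stars(stars)
print(labels)  # [0, 0, 1, 1, 1]
```

Note that a rating of exactly $3.0$ falls on the positive side of the threshold.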

The following code will load the dataset and split it into $3$ sets:

In [103]:
import warnings

with warnings.catch_warnings():
    # Some rows are malformed; suppress the resulting warnings to keep the output readable.
    warnings.filterwarnings('ignore', r'Some errors were detected')
    data = np.genfromtxt('data.csv',dtype='str', delimiter='|', skip_header=1, 
                         usecols = (0,1), invalid_raise=False, loose=True)

    print ("Full data dim: ", data.shape)
    
    # Shuffle the data so that each split has roughly the same number of examples per label.
    shuffle = np.random.permutation(np.arange(data.shape[0]))
    X, Y = data[shuffle, 0], data[shuffle, 1]    
    
    # np.int was removed in recent NumPy releases; the built-in int behaves the same here.
    train_data, train_labels = X[0:12000], Y[:12000].astype(int)
    test_data, test_labels = X[12000:22000], Y[12000:22000].astype(int)
    dev_data, dev_labels = X[22000:-1], Y[22000:-1].astype(int)

    NUM_OF_TRAINING_DATA = len(train_data)
    NUM_OF_TESTING_DATA = len(test_data)
    NUM_OF_DEV_DATA = len(dev_data)

    print ('train data dim: ', NUM_OF_TRAINING_DATA)
    print ('test data size: ', NUM_OF_TESTING_DATA)
    print ('dev data size: ', NUM_OF_DEV_DATA)
    print ('_'*80)
    print ('Training examples:\n', train_data[0:2])
    print (train_labels[0:2])
    print ('_'*80)
    print ('Testing examples:\n', test_data[0:2])
    print (test_labels[0:2])
    print ('_'*80)
    print ('Dev examples:\n', dev_data[0:2])
    print (dev_labels[0:2])

Full data dim:  (29701, 2)
train data dim:  12000
test data size:  10000
dev data size:  7700
________________________________________________________________________________
Training examples:
 ["I was visiting Phoenix and Yelped a few places to try. My friend and I decided to go to Cornish Pasty in Tempe. It was dark and open seating  definitely had a pub vibe inside. We decided to sit outside at the patio to enjoy the Fall weather.   I got the Oggie and my friend got Shepard's pie with a side Oven Chips - added the garlic and jalepenos. I love anything pie crust and these pasties definitely hit the spot.  The minced lamb from the Shepard's pie was  cooked perfectly and not too gamey. The Oggie was a classic and a comfort food. Oven chips were like french fries cooked with minced garlic and jalepenos that weren't spicy at all. I found out there's a location in Vegas and will definitely have to visit there to try the other pasties. "
 "Went inside and ordered a DOUBLE JACK COMBO and s

Check the label distribution in the train, dev, and test sets. Ideally, each label accounts for about 50% of the examples in every set.

In [104]:
print ('positive train data: ', len(np.where(train_labels==1)[0]), 
       ', negative train data: ', len(np.where(train_labels==0)[0]))
print ('positive test data: ', len(np.where(test_labels==1)[0]), 
       ', negative test data: ', len(np.where(test_labels==0)[0]))
print ('positive dev data: ', len(np.where(dev_labels==1)[0]), 
       ', negative dev data: ', len(np.where(dev_labels==0)[0]))

positive train data:  6011 , negative train data:  5989
positive test data:  4990 , negative test data:  5010
positive dev data:  3862 , negative dev data:  3838
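
To read these counts as proportions, a small helper can turn labels into per-class fractions. This is a sketch with a hypothetical helper name (`label_balance`); the example input rebuilds the training counts printed above rather than using the real `train_labels` array.

```python
from collections import Counter

def label_balance(labels):
    """Return {label: fraction of examples} for a sequence of labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}

# Counts taken from the training-set output above.
train_fractions = label_balance([1] * 6011 + [0] * 5989)
print({k: round(v, 3) for k, v in sorted(train_fractions.items())})  # {0: 0.499, 1: 0.501}
```

All three splits are within a fraction of a percent of the 50/50 target.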


#### Natural Language Processing

We will create a document-term matrix from the data so that we can fit all the classifiers. This takes a couple of steps.
...
...
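
As a toy illustration of what a document-term matrix is, the sketch below builds one by hand for two made-up reviews. It is not how `CountVectorizer` is implemented, and the simple `[a-z]+` tokenizer is an assumption that differs from its default token pattern. The function name `build_doc_term_matrix` is hypothetical.

```python
import re

def build_doc_term_matrix(docs):
    """Return (vocabulary, matrix) where matrix[i][j] counts term j in doc i."""
    # Lowercase and split into alphabetic tokens (a simplifying assumption).
    tokenized = [re.findall(r"[a-z]+", doc.lower()) for doc in docs]
    vocabulary = sorted({tok for toks in tokenized for tok in toks})
    index = {term: j for j, term in enumerate(vocabulary)}
    matrix = [[0] * len(vocabulary) for _ in docs]
    for i, toks in enumerate(tokenized):
        for tok in toks:
            matrix[i][index[tok]] += 1
    return vocabulary, matrix

docs = ["Great food, great service", "Bad service"]
vocabulary, dtm = build_doc_term_matrix(docs)
print(vocabulary)  # ['bad', 'food', 'great', 'service']
print(dtm)         # [[0, 1, 2, 1], [1, 0, 0, 1]]
```

Each row is one review and each column is one vocabulary term, which is the shape the classifiers consume.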

##### Create a baseline using CountVectorizer and Bernoulli NB

In [114]:
v = CountVectorizer(strip_accents='ascii', stop_words='english', min_df=0.001)
train_dtm = v.fit_transform(train_data)
print (train_dtm.shape)

bnb = BernoulliNB(alpha=0.01)
bnb.fit(train_dtm, train_labels)
predicted = bnb.predict(v.transform(dev_data))

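# Note: classification_report expects (y_true, y_pred). With the arguments in this
# order, the precision and recall columns are swapped and support counts predictions.
print (classification_report(predicted, dev_labels))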

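# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it.
terms = v.get_feature_names_out()
# feature_log_prob_[1] is log P(term | positive class), the values coef_[0] exposed in older releases.
pos_log_prob = bnb.feature_log_prob_[1]
top100 = np.argsort(pos_log_prob)[-100:]
print ('top 100 terms:\n',[str(terms[int(w)]) for w in top100])
print ('_'*80)
bottom100 = np.argsort(pos_log_prob)[:100]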
print ('least important 100 terms:\n',[terms[int(w)] for w in bottom100])
print ('_'*80)

(12000, 5303)
             precision    recall  f1-score   support

          0       0.76      0.87      0.81      3372
          1       0.89      0.79      0.84      4328

avg / total       0.83      0.83      0.83      7700

top 100 terms:
 ['kind', 'stars', 'different', 'favorite', 'probably', 'took', 'perfect', 'dinner', 'table', 'worth', 'hot', 'sweet', 'vegas', 'inside', 'home', 'bad', 'need', 'meal', 'drinks', 'salad', 'excellent', 'work', 'feel', 'quite', 'awesome', 'tried', 'happy', 'overall', 'cheese', 'long', 'location', 'thing', 'clean', 'looking', 'price', 'prices', 'wasn', 'big', 'lunch', 'sauce', 'night', 'super', 'eat', 'times', 'new', 'bar', 'lot', 'say', 'want', 'bit', 'fresh', 'day', 'll', 'wait', 'going', 'sure', 'small', 'experience', 'know', 'area', 'way', 'think', 'right', 'better', 'chicken', 'went', 'recommend', 'amazing', 'order', 'restaurant', 'didn', 'came', 'people', 'did', 'menu', 'make', 'ordered', 'pretty', 'delicious', 'come', 'got', 'staff', 'try', '