### <center>Classifying and Natural Language Processing with Yelp Reviews</center>
#### W207 Section 3, Group - <span style="color:orange"><strong>C</strong></span><span style="color:purple">olors</color>
#### Summer, 2018
#### Team members:
- Chandra Sekar, chandra-sekar@ischool.berkeley.edu
- Guangyu (Gary) Pei, guangyu.pei@ischool.berkeley.edu
- Jooyeon (Irene) Seo, jooyeon@ischool.berkeley.edu
- Sijie (Anne) Yu, syu.anne@berkeley.edu


In [87]:
import numpy as np
import matplotlib.pyplot as plt

# SK-learn libraries for learning.
from sklearn.pipeline import Pipeline
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import BernoulliNB
from sklearn.naive_bayes import MultinomialNB
from sklearn.grid_search import GridSearchCV

# SK-learn libraries for evaluation.
from sklearn.metrics import confusion_matrix
from sklearn import metrics
from sklearn.metrics import classification_report

# SK-learn libraries for feature extraction from text.
from sklearn.feature_extraction.text import *

# Config Jupyter session
%config IPCompleter.greedy=True

# Set the randomizer seed so results are the same each time.
np.random.seed(0)

# Global configurations
np.set_printoptions(precision=4, suppress=True)

# Config system logs
import logging
logging.basicConfig(level=logging.INFO,
                   format='%(asctime)s %(levelname)s %(message)s')



#### Goals
Our project’s primary concept is to utilize Yelp data (from kaggle) to rate new business. That is, we are going to get Yelp user review data, use review texts to predict review is **positive** or **negative**. When people talk about a new business, we can capture their words, fit into the model, then predict its rating, sort of understand its quality and potential.

#### The Yelp Review Dataset
We write a shell [script](https://github.com/annesjyu/m207_summer_2018) to select $10,000$ reviews for training, testing and dev respectively, each set consisting in 50% negative and 50% positive reviews. We keep only review text and stars columns, then binarize stars into target label: 
- if starts >= $3.0$, review is *positive*
- otherwise, it's *negative*.

The following code will load the dataset and split it into $3$ sets:

In [88]:
data = None
try:
    data = np.genfromtxt('data.csv',dtype='str', delimiter='|', skip_header=1, 
                         #converters = {1: lambda s: int(s)},
                         usecols = (0,1), invalid_raise=False)
except Exception as ex:
    print (ex)
finally:
    print ("Full data dim: ", data.shape)

    # Shuffle the data, each dataset will have roughly the same number of examples for each label.
    shuffle = np.random.permutation(np.arange(data.shape[0]))
    X, Y = data[shuffle, 0], data[shuffle, 1]    
    
    train_data, train_labels = X[0:12000], Y[:12000].astype(np.int)
    test_data, test_labels = X[12000:22000], Y[12000:22000].astype(np.int)
    dev_data, dev_labels = X[22000:-1], Y[22000:-1].astype(np.int)

    NUM_OF_TRAINING_DATA = len(train_data)
    NUM_OF_TESTING_DATA = len(test_data)
    NUM_OF_DEV_DATA = len(dev_data)

    print ('train data dim: ', NUM_OF_TRAINING_DATA)
    print ('test data size: ', NUM_OF_TESTING_DATA)
    print ('dev data size: ', NUM_OF_DEV_DATA)
    print ('_'*80)
    print ('Training examples:\n', train_data[0:2])
    print (train_labels[0:2])
    print ('_'*80)
    print ('Testing examples:\n', test_data[0:2])
    print (test_labels[0:2])
    print ('_'*80)
    print ('Dev examples:\n', dev_data[0:2])
    print (dev_labels[0:2])

    Line #568 (got 1 columns instead of 2)
    Line #597 (got 1 columns instead of 2)
    Line #828 (got 1 columns instead of 2)
    Line #917 (got 1 columns instead of 2)
    Line #1212 (got 1 columns instead of 2)
    Line #1444 (got 1 columns instead of 2)
    Line #1541 (got 1 columns instead of 2)
    Line #1846 (got 1 columns instead of 2)
    Line #1995 (got 1 columns instead of 2)
    Line #2006 (got 1 columns instead of 2)
    Line #2027 (got 1 columns instead of 2)
    Line #2139 (got 1 columns instead of 2)
    Line #2415 (got 1 columns instead of 2)
    Line #2492 (got 1 columns instead of 2)
    Line #2552 (got 1 columns instead of 2)
    Line #2621 (got 1 columns instead of 2)
    Line #2622 (got 1 columns instead of 2)
    Line #2654 (got 1 columns instead of 2)
    Line #2749 (got 1 columns instead of 2)
    Line #2892 (got 1 columns instead of 2)
    Line #2933 (got 1 columns instead of 2)
    Line #2979 (got 1 columns instead of 2)
    Line #3003 (got 1 columns instea

Full data dim:  (29701, 2)
train data dim:  12000
test data size:  10000
dev data size:  7700
________________________________________________________________________________
Training examples:
 ['Orange Blossom Beer is the best  However atmosphere and Service (Female Bartender/Server) were really bad... Go to Chevron get a growler  go home and Enjoy!!! '
 "Tried this place one time shortly after they opened.   I wasn't really impressed with their food. The place did look neat and clean and the staff was very friendly.  Although I wasn't impressed with my one visit based on the customer service I'll probably try it again at some point. "]
[0 1]
________________________________________________________________________________
Testing examples:
 ['Not worth going to. I ordered a ramen and the chicken was very dry and over done  the soup was way too salty and tasted very artificial as well. Definitely way better ramen shops in the gta. The chicken bao was absolutely horrible because  1. It

Analyze train, dev and test datasets to find out data distributions. Ideally we want to have 50% examples for either label.

In [89]:
print ('positive train data: ', len(np.where(train_labels==1)[0]), 
       ', negative train data: ', len(np.where(train_labels==0)[0]))
print ('positive test data: ', len(np.where(test_labels==1)[0]), 
       ', negative test data: ', len(np.where(test_labels==0)[0]))
print ('positive dev data: ', len(np.where(dev_labels==1)[0]), 
       ', negative dev data: ', len(np.where(dev_labels==0)[0]))

positive train data:  6050 , negative train data:  5950
positive test data:  5000 , negative test data:  5000
positive dev data:  3812 , negative dev data:  3888


#### Natural Language Processing

We will create doc-term matrix from data, so can fit all classifiers. There are a couple of steps of doing it.
...
...

##### Create a baseline using default CountVectorizer and NB

In [93]:
v = CountVectorizer(strip_accents='ascii', stop_words='english', min_df=0.001)
train_dtm = v.fit_transform(train_data)

print (train_dtm.shape)
print ('terms:', v.get_feature_names())

pl = Pipeline([('vectorizer', v), 
                     ('classifier', BernoulliNB(alpha=0.01))])
pl.fit(train_data, train_labels)
predicted = pl.predict(dev_data)

print (classification_report(predicted, dev_labels))

(12000, 5402)


             precision    recall  f1-score   support

          0       0.76      0.87      0.81      3422
          1       0.88      0.78      0.83      4278

avg / total       0.83      0.82      0.82      7700

