# Exercise: Predicting Yelp Review Star Ratings

## Overview

Data source: [Yelp Business Rating Prediction](https://www.kaggle.com/c/yelp-recsys-2013).

**Data description:**

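- The dataset is stored at **`./data/yelp.csv`**.
- Each row is a review that a user wrote about a particular business.
    - The **stars** column contains the star rating the user assigned.
    - The **text** column contains the text of the review.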

**Goal:** Predict the star rating using the review text as the feature.

**Tip:** Check the `shape` of the result after each step to confirm that it came out as intended.

## Task 1

Load **`yelp.csv`** into a pandas DataFrame and examine at least three of its characteristics.

In [1]:
# read yelp.csv using a relative path
import pandas as pd
path = 'data/yelp.csv'
yelp = pd.read_csv(path)

In [2]:
# examine the shape
yelp.shape

(10000, 10)

In [3]:
# examine the first row
yelp.head(1)

| | business_id | date | review_id | stars | text | type | user_id | cool | useful | funny |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 9yKzy9PApeiPPOUJEtnvkg | 2011-01-26 | fWKvX83p0-ka4JS3dc6E5A | 5 | My wife took me here on my birthday for breakf... | review | rLtl8ZkDX5vH5nAx9C3q5Q | 2 | 5 | 0 |


In [13]:
# examine the class distribution
yelp.stars.value_counts().sort_index()

1     749
2     927
3    1461
4    3526
5    3337
Name: stars, dtype: int64

## Task 2

Create a new DataFrame that contains only the **5-star** and **1-star** reviews (filtering). This turns the task into a binary classification problem.

- **Hint:** [How do I apply multiple filter criteria to a pandas DataFrame?](http://nbviewer.jupyter.org/github/justmarkham/pandas-videos/blob/master/pandas.ipynb#9.-How-do-I-apply-multiple-filter-criteria-to-a-pandas-DataFrame%3F-%28video%29).

In [5]:
# filter the DataFrame using an OR condition
yelp_best_worst = yelp[(yelp.stars==5) | (yelp.stars==1)]

# equivalently, use the 'loc' indexer
yelp_best_worst = yelp.loc[(yelp.stars==5) | (yelp.stars==1), :]

In [6]:
# examine the shape
yelp_best_worst.shape

(4086, 10)

## Task 3

Create X and y from the DataFrame built in Task 2, then split each into training and testing sets.
- X should contain only the text column, and y only the stars column.
- **Hint:** X must be a Series, not a DataFrame, so that it can be passed to `CountVectorizer`.

In [7]:
# define X and y
X = yelp_best_worst.text
y = yelp_best_worst.stars

In [8]:
# split X and y into training and testing sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

In [9]:
# examine the object shapes
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

(3064,)
(1022,)
(3064,)
(1022,)


## Task 4

Use `CountVectorizer` to create **document-term matrices** from X_train and X_test.

In [11]:
# import and instantiate CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer()

In [12]:
# fit and transform X_train into X_train_dtm
X_train_dtm = vect.fit_transform(X_train)
X_train_dtm.shape

(3064, 16825)

In [13]:
# transform X_test into X_test_dtm
X_test_dtm = vect.transform(X_test)
X_test_dtm.shape

(1022, 16825)
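
To see why `X_test_dtm` ends up with exactly the same number of columns as `X_train_dtm`, it can help to build a tiny document-term matrix by hand. The following is a minimal pure-Python sketch, not scikit-learn's implementation: the helper names `tokenize`, `fit_vocabulary`, and `transform` and the toy reviews are made up for illustration, and the regular expression is assumed to mirror `CountVectorizer`'s default token pattern.

```python
import re

def tokenize(text):
    # lowercase, then keep runs of two or more word characters
    return re.findall(r"(?u)\b\w\w+\b", text.lower())

def fit_vocabulary(docs):
    # learn a sorted vocabulary from the training documents only
    vocab = sorted({tok for doc in docs for tok in tokenize(doc)})
    return {tok: col for col, tok in enumerate(vocab)}

def transform(docs, vocab):
    # count tokens per document; tokens missing from the vocabulary are ignored
    rows = []
    for doc in docs:
        row = [0] * len(vocab)
        for tok in tokenize(doc):
            if tok in vocab:
                row[vocab[tok]] += 1
        rows.append(row)
    return rows

# hypothetical toy data, not taken from yelp.csv
train_docs = ["great food great service", "terrible service"]
test_docs = ["great pizza"]  # "pizza" never appears in the training data

vocab = fit_vocabulary(train_docs)
train_dtm = transform(train_docs, vocab)
test_dtm = transform(test_docs, vocab)

print(sorted(vocab))
print(train_dtm)
print(test_dtm)
```

Because the vocabulary is learned only from the training documents, the unseen token "pizza" is dropped from the test matrix, which is why only `transform` (not `fit_transform`) is applied to `X_test`.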

## Task 5

Use a Multinomial Naive Bayes model to do the following:
1. Train a model that predicts the star rating, then make predictions
1. Calculate the model's accuracy
1. Print the model's confusion matrix


**Hint:** [Evaluating a classification model](https://github.com/justmarkham/scikit-learn-videos/blob/master/09_classification_metrics.ipynb) explains how to interpret both classification accuracy and the confusion matrix.

In [14]:
# import and instantiate MultinomialNB
from sklearn.naive_bayes import MultinomialNB
nb = MultinomialNB()

In [15]:
# train the model using X_train_dtm
nb.fit(X_train_dtm, y_train)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

In [17]:
# make class predictions for X_test_dtm
y_pred_class = nb.predict(X_test_dtm)

In [18]:
y_pred_class

array([5, 5, 5, ..., 5, 1, 5], dtype=int64)

In [19]:
# calculate accuracy of class predictions
from sklearn import metrics
metrics.accuracy_score(y_test, y_pred_class)

0.9187866927592955

In [22]:
nb.classes_

array([1, 5], dtype=int64)

In [23]:
# print the confusion matrix
metrics.confusion_matrix(y_test, y_pred_class)

array([[126,  58],
       [ 25, 813]], dtype=int64)
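
The confusion matrix can be unpacked by hand to check the accuracy and to see where the errors fall. This is a small sketch that copies the numbers from the output above and assumes rows are true classes and columns are predicted classes, both ordered as in `nb.classes_`. Treating 5-star as the "positive" class is a choice made for this sketch.

```python
# confusion matrix from the output above: rows are true classes, columns are
# predicted classes, both ordered as in nb.classes_ (1-star, then 5-star)
confusion = [[126, 58],
             [25, 813]]

# treating 5-star as the "positive" class (an assumption made for this sketch)
tn, fp = confusion[0]
fn, tp = confusion[1]

total = tn + fp + fn + tp
accuracy = (tp + tn) / total
sensitivity = tp / (tp + fn)   # share of true 5-star reviews predicted as 5-star
specificity = tn / (tn + fp)   # share of true 1-star reviews predicted as 1-star

print(total, accuracy)
print(round(sensitivity, 3), round(specificity, 3))
```

The accuracy computed this way matches the `accuracy_score` result, while the gap between sensitivity and specificity shows that the errors are concentrated in the 1-star class.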

## Task 6

Calculate the null accuracy, which is the classification accuracy that could be achieved by always predicting the most frequent class (star rating).

- **Hint:** [Evaluating a classification model](https://github.com/justmarkham/scikit-learn-videos/blob/master/09_classification_metrics.ipynb) presents two ways to calculate null accuracy, but only one of them applies to this task.

In [24]:
# examine the class distribution of the testing set
y_test.value_counts()

5    838
1    184
Name: stars, dtype: int64

In [25]:
# calculate null accuracy
y_test.value_counts().head(1) / len(y_test)

5    0.819961
Name: stars, dtype: float64

In [29]:
# calculate null accuracy manually
838 / float(838 + 184)

0.8199608610567515
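
The same calculation can be written without pandas, which makes the definition explicit. In this minimal sketch the helper name `null_accuracy` is made up for illustration and the label list is rebuilt from the class counts shown above. Counting the most frequent class works for any label values, whereas a shortcut such as `max(mean, 1 - mean)` is only valid when the labels are coded as 0 and 1, which is not the case here.

```python
from collections import Counter

def null_accuracy(labels):
    # accuracy achieved by always predicting the most frequent class
    most_common_class, count = Counter(labels).most_common(1)[0]
    return most_common_class, count / len(labels)

# stand-in for y_test, rebuilt from the class counts shown above
y_test_labels = [5] * 838 + [1] * 184

majority_class, null_acc = null_accuracy(y_test_labels)
print(majority_class, null_acc)
```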

## Task 7

Examine the **false positive** and **false negative** samples among the Naive Bayes model's errors, and add an explanation of why the model may have misclassified them.
* false positive: 

In [26]:
# first 10 false positives (1-star reviews incorrectly classified as 5-star reviews)
X_test[y_test < y_pred_class].head(10)

2175    This has to be the worst restaurant in terms o...
1781    If you like the stuck up Scottsdale vibe this ...
2674    I'm sorry to be what seems to be the lone one ...
9984    Went last night to Whore Foods to get basics t...
3392    I found Lisa G's while driving through phoenix...
8283    Don't know where I should start. Grand opening...
2765    Went last week, and ordered a dozen variety. I...
2839    Never Again,\nI brought my Mountain Bike in (w...
321     My wife and I live around the corner, hadn't e...
1919                                         D-scust-ing.
Name: text, dtype: object

In [27]:
# false positive: model is reacting to the words "good", "impressive", "nice"
X_test[1781]

"If you like the stuck up Scottsdale vibe this is a good place for you. The food isn't impressive. Nice outdoor seating."

In [28]:
# false positive: model does not have enough data to work with
X_test[1919]

'D-scust-ing.'
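
To make the first false positive (review 1781) more concrete, the sketch below tokenizes its text the way a bag-of-words model would see it. The regular expression is assumed to match `CountVectorizer`'s default `token_pattern`, and the set of "positive-sounding" words is simply the three words named in the comment above, not something extracted from the model.

```python
import re

# assumed to match CountVectorizer's default token_pattern
TOKEN_PATTERN = r"(?u)\b\w\w+\b"

review = ("If you like the stuck up Scottsdale vibe this is a good place for you. "
          "The food isn't impressive. Nice outdoor seating.")

# lowercase first, as CountVectorizer does by default
tokens = re.findall(TOKEN_PATTERN, review.lower())
print(tokens)

# the words the comment above blames for the misclassification
positive_sounding = [tok for tok in tokens if tok in {"good", "impressive", "nice"}]
print(positive_sounding)
```

The negation in "isn't impressive" is split into `isn` plus a discarded single-character `t`, and word order is lost, so the model only sees "impressive" as an independent token.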

In [29]:
# first 10 false negatives (5-star reviews incorrectly classified as 1-star reviews)
X_test[y_test > y_pred_class].head(10)

7148    I now consider myself an Arizonian. If you dri...
4963    This is by far my favourite department store, ...
6318    Since I have ranted recently on poor customer ...
380     This is a must try for any Mani Pedi fan. I us...
5565    I`ve had work done by this shop a few times th...
3448    I was there last week with my sisters and whil...
6050    I went to sears today to check on a layaway th...
2504    I've passed by prestige nails in walmart 100s ...
2475    This place is so great! I am a nanny and had t...
241     I was sad to come back to lai lai's and they n...
Name: text, dtype: object

In [30]:
# false negative: model is reacting to the words "complain", "crowds", "rushing", "pricey", "scum"
X_test[4963]

'This is by far my favourite department store, hands down. I have had nothing but perfect experiences in this store, without exception, no matter what department I\'m in. The shoe SA\'s will bend over backwards to help you find a specific shoe, and the staff will even go so far as to send out hand-written thank you cards to your home address after you make a purchase - big or small. Tim & Anthony in the shoe salon are fabulous beyond words! \n\nI am not completely sure that I understand why people complain about the amount of merchandise on the floor or the lack of crowds in this store. Frankly, I would rather not be bombarded with merchandise and other people. One of the things I love the most about Barney\'s is not only the prompt attention of SA\'s, but the fact that they aren\'t rushing around trying to help 35 people at once. The SA\'s at Barney\'s are incredibly friendly and will stop to have an actual conversation, regardless or whether you are purchasing something or not. I hav

## Task 8

Print the top 10 words that most increase the probability that a review is a **5-star review**, and the top 10 words that do the same for a **1-star review**.
- 5-star: [...]
- 1-star: [...]


**Hint:** Use the NB model's `feature_count_` and `class_count_` attributes.

In [31]:
# store the vocabulary of X_train
# note: use get_feature_names() instead for scikit-learn versions before 1.0
X_train_tokens = vect.get_feature_names_out()
len(X_train_tokens)

16825

In [32]:
# first row is one-star reviews, second row is five-star reviews
nb.feature_count_.shape

(2, 16825)

In [33]:
# store the number of times each token appears across each class
one_star_token_count = nb.feature_count_[0, :]
five_star_token_count = nb.feature_count_[1, :]

In [34]:
# create a DataFrame of tokens with their separate one-star and five-star counts
tokens = pd.DataFrame({'token':X_train_tokens, 'one_star':one_star_token_count, 'five_star':five_star_token_count}).set_index('token')

In [35]:
# add 1 to one-star and five-star counts to avoid dividing by 0
tokens['one_star'] = tokens.one_star + 1
tokens['five_star'] = tokens.five_star + 1

In [36]:
# first number is one-star reviews, second number is five-star reviews
nb.class_count_

array([ 565., 2499.])

In [37]:
# convert the one-star and five-star counts into frequencies
tokens['one_star'] = tokens.one_star / nb.class_count_[0]
tokens['five_star'] = tokens.five_star / nb.class_count_[1]

In [38]:
# calculate the ratio of five-star to one-star for each token
tokens['five_star_ratio'] = tokens.five_star / tokens.one_star

In [39]:
# sort the DataFrame by five_star_ratio (descending order), and examine the first 10 rows
# note: use sort() instead of sort_values() for pandas 0.16.2 and earlier
tokens.sort_values('five_star_ratio', ascending=False).head(10)

| token | one_star | five_star | five_star_ratio |
|---|---|---|---|
| fantastic | 0.00354 | 0.077231 | 21.817727 |
| perfect | 0.00531 | 0.098039 | 18.464052 |
| yum | 0.00177 | 0.02481 | 14.017607 |
| favorite | 0.012389 | 0.138055 | 11.143029 |
| outstanding | 0.00177 | 0.019608 | 11.078431 |
| brunch | 0.00177 | 0.016807 | 9.495798 |
| gem | 0.00177 | 0.016006 | 9.043617 |
| mozzarella | 0.00177 | 0.015606 | 8.817527 |
| pasty | 0.00177 | 0.015606 | 8.817527 |
| amazing | 0.021239 | 0.185274 | 8.723323 |


In [40]:
# sort the DataFrame by five_star_ratio (ascending order), and examine the first 10 rows
tokens.sort_values('five_star_ratio', ascending=True).head(10)

| token | one_star | five_star | five_star_ratio |
|---|---|---|---|
| staffperson | 0.030088 | 0.0004 | 0.013299 |
| refused | 0.024779 | 0.0004 | 0.016149 |
| disgusting | 0.042478 | 0.0008 | 0.018841 |
| filthy | 0.019469 | 0.0004 | 0.020554 |
| unprofessional | 0.015929 | 0.0004 | 0.025121 |
| unacceptable | 0.015929 | 0.0004 | 0.025121 |
| acknowledge | 0.015929 | 0.0004 | 0.025121 |
| ugh | 0.030088 | 0.0008 | 0.026599 |
| fuse | 0.014159 | 0.0004 | 0.028261 |
| boca | 0.014159 | 0.0004 | 0.028261 |
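
The ratio used for both rankings can be condensed into a single function, which makes the add-one step and the normalization by class size easier to follow. This is a minimal sketch: the function name `five_star_ratio` is made up for illustration, and the raw counts used for "fantastic" are inferred from the rounded frequencies in the first ranking above (2/565 ≈ 0.00354 and 193/2499 ≈ 0.077231), so treat them as an assumption.

```python
def five_star_ratio(one_star_count, five_star_count, n_one_star, n_five_star):
    # add 1 to each raw count so that no token produces a division by zero
    one_star_freq = (one_star_count + 1) / n_one_star
    five_star_freq = (five_star_count + 1) / n_five_star
    return five_star_freq / one_star_freq

# class sizes from nb.class_count_ above
N_ONE_STAR, N_FIVE_STAR = 565, 2499

# raw counts for "fantastic", inferred from the rounded table values (assumption)
ratio = five_star_ratio(1, 192, N_ONE_STAR, N_FIVE_STAR)
print(round(ratio, 6))

# a token that never appears in 1-star reviews still gets a finite ratio
print(five_star_ratio(0, 10, N_ONE_STAR, N_FIVE_STAR))
```

Dividing each count by the number of reviews in its class matters because the training set contains far more 5-star than 1-star reviews. Without that step, almost every token would appear to favor 5 stars.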


## Task 9 (Challenging)

- So far we have solved a **binary classification** problem, classifying only 5-star versus 1-star reviews.
- Adapt the process above to train and evaluate an NB model that classifies all star ratings.
    - **5-class classification problem**

Steps required:

- Create X and y from the original DataFrame loaded in Task 1 (y has 5 classes)
- Split X and y into training and testing sets
- Create document-term matrices using CountVectorizer
- Evaluate the Multinomial Naive Bayes model on the testing set
- Compare the model's testing accuracy with the null accuracy and comment on the result
- Print the model's confusion matrix and comment on it (this [Stack Overflow answer](http://stackoverflow.com/a/30748053/1636598) explains how to interpret a multi-class confusion matrix)
- Print the model's [classification report](http://scikit-learn.org/stable/modules/model_evaluation.html#classification-report) and comment on it

In [41]:
# define X and y using the original DataFrame
X = yelp.text
y = yelp.stars

In [42]:
# check that y contains 5 different classes
y.value_counts().sort_index()

1     749
2     927
3    1461
4    3526
5    3337
Name: stars, dtype: int64

In [43]:
# split X and y into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

In [44]:
# create document-term matrices using CountVectorizer
X_train_dtm = vect.fit_transform(X_train)
X_test_dtm = vect.transform(X_test)

In [45]:
y_train

651     5
6560    4
8974    5
2348    5
5670    4
       ..
2895    5
7813    5
905     5
5192    4
235     2
Name: stars, Length: 7500, dtype: int64

In [46]:
# fit a Multinomial Naive Bayes model
nb.fit(X_train_dtm, y_train)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

In [47]:
# make class predictions
y_pred_class = nb.predict(X_test_dtm)

In [48]:
# calculate the accuracy
metrics.accuracy_score(y_test, y_pred_class)

0.4712

In [49]:
# calculate the null accuracy
y_test.value_counts().head(1) / len(y_test)

4    0.3536
Name: stars, dtype: float64

**Accuracy comments:** At first glance, 47% accuracy does not seem very good, since it is only about 12 percentage points above the null accuracy of 35%. However, I would consider the 47% accuracy to be quite impressive, given that humans would also have a hard time precisely identifying the star rating for many of these reviews.

In [50]:
# print the confusion matrix
metrics.confusion_matrix(y_test, y_pred_class)

array([[ 55,  14,  24,  65,  27],
       [ 28,  16,  41, 122,  27],
       [  5,   7,  35, 281,  37],
       [  7,   0,  16, 629, 232],
       [  6,   4,   6, 373, 443]], dtype=int64)

**Confusion matrix comments:**

- Nearly all 4-star and 5-star reviews are classified as 4 or 5 stars, but they are hard for the model to distinguish between.
- 1-star, 2-star, and 3-star reviews are most commonly classified as 4 stars, probably because it's the predominant class in the training data.

![Confusion matrix for the 5-class model](./assets/confusion-matrix-5-class.png)
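
Both confusion matrix comments can be checked numerically. The sketch below copies the matrix from the output above, assumes rows are true classes and columns are predicted classes, and reports, for each true class, the most frequently predicted class and the share of reviews predicted as 4 or 5 stars. The variable names are illustrative only.

```python
# 5-class confusion matrix from the output above (rows: true, columns: predicted)
confusion = [
    [55, 14, 24, 65, 27],
    [28, 16, 41, 122, 27],
    [5, 7, 35, 281, 37],
    [7, 0, 16, 629, 232],
    [6, 4, 6, 373, 443],
]
classes = [1, 2, 3, 4, 5]

top_predictions = []
shares_4_or_5 = []
for true_class, row in zip(classes, confusion):
    # most frequently predicted class for reviews of this true class
    top_prediction = classes[row.index(max(row))]
    # share of these reviews predicted as either 4 or 5 stars
    share = (row[3] + row[4]) / sum(row)
    top_predictions.append(top_prediction)
    shares_4_or_5.append(share)
    print(true_class, top_prediction, round(share, 2))
```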

In [51]:
# print the classification report
print(metrics.classification_report(y_test, y_pred_class))

              precision    recall  f1-score   support

           1       0.54      0.30      0.38       185
           2       0.39      0.07      0.12       234
           3       0.29      0.10      0.14       365
           4       0.43      0.71      0.53       884
           5       0.58      0.53      0.55       832

    accuracy                           0.47      2500
   macro avg       0.45      0.34      0.35      2500
weighted avg       0.46      0.47      0.43      2500



**Precision** answers the question: "When a given class is predicted, how often are those predictions correct?" To calculate the precision for class 1, for example, you divide 55 by the sum of the first column of the confusion matrix.

In [55]:
# manually calculate the precision for class 1
precision = 55 / float(55 + 28 + 5 + 7 + 6)
print(precision)

0.5445544554455446


**Recall** answers the question: "When a given class is the true class, how often is that class predicted?" To calculate the recall for class 1, for example, you divide 55 by the sum of the first row of the confusion matrix.

In [56]:
# manually calculate the recall for class 1
recall = 55 / float(55 + 14 + 24 + 65 + 27)
print(recall)

0.2972972972972973


**F1 score** is the harmonic mean of precision and recall, so it is high only when both are high.

In [57]:
# manually calculate the F1 score for class 1
f1 = 2 * (precision * recall) / (precision + recall)
print(f1)

0.38461538461538464


**Support** answers the question: "How many observations exist for which a given class is the true class?" To calculate the support for class 1, for example, you sum the first row of the confusion matrix.

In [58]:
# manually calculate the support for class 1
support = 55 + 14 + 24 + 65 + 27
print(support)

185
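
The four manual calculations above generalize to every class. The following sketch recomputes precision, recall, F1 score, and support for all five classes directly from the confusion matrix. The helper name `class_metrics` is made up for illustration, and the code assumes every class is predicted at least once (otherwise the precision would divide by zero).

```python
# 5-class confusion matrix from the output above (rows: true, columns: predicted)
confusion = [
    [55, 14, 24, 65, 27],
    [28, 16, 41, 122, 27],
    [5, 7, 35, 281, 37],
    [7, 0, 16, 629, 232],
    [6, 4, 6, 373, 443],
]
classes = [1, 2, 3, 4, 5]

def class_metrics(matrix, i):
    true_positives = matrix[i][i]
    predicted_total = sum(row[i] for row in matrix)  # column sum
    support = sum(matrix[i])                         # row sum
    precision = true_positives / predicted_total
    recall = true_positives / support
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1, support

report = {label: class_metrics(confusion, i) for i, label in enumerate(classes)}
for label, (p, r, f1, support) in report.items():
    print(f"{label}  {p:.2f}  {r:.2f}  {f1:.2f}  {support}")

# overall accuracy: correct predictions (the diagonal) over all predictions
accuracy = sum(confusion[i][i] for i in range(len(classes))) / sum(map(sum, confusion))
print(accuracy)
```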


**Classification report comments:**

- Class 1 has low recall (0.30), meaning that the model has a hard time detecting the 1-star reviews, but comparatively high precision (0.54), meaning that when the model predicts a review is 1-star, it is correct slightly more than half the time.
- Class 5 has the best balance of recall (0.53) and precision (0.58), probably because 5-star reviews have polarized language, and because the model has a lot of observations to learn from.