# Tutorial Exercise: Yelp reviews

## Introduction

This exercise uses a small subset of the data from Kaggle's [Yelp Business Rating Prediction](https://www.kaggle.com/c/yelp-recsys-2013) competition.

**Description of the data:**

- **`yelp.csv`** contains the dataset. It is stored in the repository (in the **`data`** directory), so there is no need to download anything from the Kaggle website.
- Each observation (row) in this dataset is a review of a particular business by a particular user.
- The **stars** column is the number of stars (1 through 5) assigned by the reviewer to the business. (More stars is better.) In other words, it is the rating of the business by the person who wrote the review.
- The **text** column is the text of the review.

**Goal:** Predict the star rating of a review using **only** the review text.

**Tip:** After each task, I recommend that you check the shape and the contents of your objects, to confirm that they match your expectations.

## Task 1

Read **`yelp.csv`** into a pandas DataFrame and examine it.

In [1]:
import pandas as pd
import numpy as np
import sklearn as sk

In [2]:
yelp_df=pd.read_csv("data/yelp.csv")

In [6]:
yelp_df.dtypes

business_id    object
date           object
review_id      object
stars           int64
text           object
type           object
user_id        object
cool            int64
useful          int64
funny           int64
dtype: object

In [7]:
yelp_df.columns

Index([u'business_id', u'date', u'review_id', u'stars', u'text', u'type',
       u'user_id', u'cool', u'useful', u'funny'],
      dtype='object')

In [8]:
yelp_df.shape

(10000, 10)

## Task 2

Create a new DataFrame that only contains the **5-star** and **1-star** reviews.

- **Hint:** [How do I apply multiple filter criteria to a pandas DataFrame?](http://nbviewer.jupyter.org/github/justmarkham/pandas-videos/blob/master/pandas.ipynb#9.-How-do-I-apply-multiple-filter-criteria-to-a-pandas-DataFrame%3F-%28video%29) explains how to do this.

In [14]:
yelp_df_1_5=yelp_df[(yelp_df['stars']==1) | (yelp_df['stars']==5)]

In [15]:
yelp_df_1_5.head()

| | business_id | date | review_id | stars | text | type | user_id | cool | useful | funny |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 9yKzy9PApeiPPOUJEtnvkg | 2011-01-26 | fWKvX83p0-ka4JS3dc6E5A | 5 | My wife took me here on my birthday for breakf... | review | rLtl8ZkDX5vH5nAx9C3q5Q | 2 | 5 | 0 |
| 1 | ZRJwVLyzEJq1VAihDhYiow | 2011-07-27 | IjZ33sJrzXqU-0X6U8NwyA | 5 | I have no idea why some people give bad review... | review | 0a2KyEL0d3Yb1V6aivbIuQ | 0 | 0 | 0 |
| 3 | _1QQZuf4zZOyFCvXc0o6Vg | 2010-05-27 | G-WvGaISbqqaMHlNnByodA | 5 | Rosie, Dakota, and I LOVE Chaparral Dog Park!!... | review | uZetl9T0NcROGOyFfughhg | 1 | 2 | 0 |
| 4 | 6ozycU1RpktNG2-1BroVtw | 2012-01-05 | 1uJFq2r5QfJG_6ExMRCaGw | 5 | General Manager Scott Petello is a good egg!!!... | review | vYmM4KTsC8ZfQBg-j5MWkw | 0 | 0 | 0 |
| 6 | zp713qNhx8d9KCJJnrw1xA | 2010-02-12 | riFQ3vxNpP4rWLk_CSri2A | 5 | Drop what you're doing and drive here. After I... | review | wFweIWhv2fREZV_dYkz_1g | 7 | 7 | 4 |
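
The same kind of selection can also be written with `Series.isin`, which tends to read better once more than two ratings are wanted. The sketch below runs on a hypothetical toy frame (`toy_df` and its contents are made up for illustration and are not part of the exercise data):

```python
import pandas as pd

# Hypothetical toy data standing in for yelp_df
toy_df = pd.DataFrame({
    "stars": [5, 3, 1, 4, 5],
    "text": ["great", "okay", "awful", "good", "superb"],
})

# Two comparisons combined with | (each one wrapped in parentheses)
by_or = toy_df[(toy_df["stars"] == 1) | (toy_df["stars"] == 5)]

# Equivalent membership test
by_isin = toy_df[toy_df["stars"].isin([1, 5])]

print(by_isin)
```

Both filters keep the original row labels, which is why the index of the filtered frame has gaps.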


## Task 3

Define X and y from the new DataFrame, and then split X and y into training and testing sets, using the **review text** as the only feature and the **star rating** as the response.

- **Hint:** Keep in mind that X should be a pandas Series (not a DataFrame), since we will pass it to CountVectorizer in the task that follows.
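
One way to see why the hint matters: CountVectorizer simply iterates over whatever it is given, and iterating over a DataFrame yields column labels rather than rows. A minimal sketch on made-up data (the three documents are hypothetical):

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

df = pd.DataFrame({"text": ["good food", "bad service", "good service"]})

# Iterating over a Series yields the documents themselves
series_docs = list(df["text"])
# Iterating over a DataFrame yields its column labels instead
frame_docs = list(df[["text"]])

dtm_series = CountVectorizer().fit_transform(df["text"])
dtm_frame = CountVectorizer().fit_transform(df[["text"]])

print(series_docs, dtm_series.shape)  # one row per review
print(frame_docs, dtm_frame.shape)    # a single "document": the column name
```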

In [18]:
X = yelp_df_1_5["text"]
y = yelp_df_1_5["stars"]
print(X.shape, y.shape)

(4086,) (4086,)

In [24]:
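# sklearn.cross_validation was removed in scikit-learn 0.20; train_test_split now lives in sklearn.model_selection
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, test_size=0.3)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)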

(2860,) (1226,) (2860,) (1226,)
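
The shapes above can be checked by hand. The sketch assumes that `train_test_split` rounds the test share up to a whole number of rows and gives the remainder to the training set, which is consistent with the printed shapes:

```python
import math

n_samples = 4086   # 1-star and 5-star reviews
test_size = 0.3

# 0.3 * 4086 = 1225.8, rounded up to a whole number of test rows
n_test = int(math.ceil(test_size * n_samples))
n_train = n_samples - n_test

print(n_train, n_test)
```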


## Task 4

Use CountVectorizer to create **document-term matrices** from X_train and X_test.

In [28]:
from sklearn.feature_extraction.text import CountVectorizer
Count=CountVectorizer()
X_train_dtm=Count.fit_transform(X_train)
X_test_dtm=Count.transform(X_test)
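
To make the difference between `fit_transform` and `transform` concrete, here is a minimal sketch on a hypothetical three-document corpus. The vocabulary is learned from the training documents only, so a token that first appears in the test documents gets no column:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical documents, not taken from the Yelp data
train_docs = ["good food", "bad food"]
test_docs = ["good pizza"]

vect = CountVectorizer()
train_dtm = vect.fit_transform(train_docs)  # learn the vocabulary, then count
test_dtm = vect.transform(test_docs)        # count using the learned vocabulary only

# Tokens in column order of the document-term matrix
vocab = sorted(vect.vocabulary_, key=vect.vocabulary_.get)

print(vocab)
print(train_dtm.toarray())
print(test_dtm.toarray())  # 'pizza' was never seen during fit, so it is ignored
```

Both matrices have the same number of columns, which is what allows a model fitted on the training matrix to score the testing matrix.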

## Task 5

Use multinomial Naive Bayes to **predict the star rating** for the reviews in the testing set, and then **calculate the accuracy** and **print the confusion matrix**.

- **Hint:** [Evaluating a classification model](https://github.com/justmarkham/scikit-learn-videos/blob/master/09_classification_metrics.ipynb) explains how to interpret both classification accuracy and the confusion matrix.

In [36]:
from sklearn.naive_bayes import MultinomialNB

In [38]:
nb=MultinomialNB()
nb.fit(X_train_dtm,y_train)
y_pred=nb.predict(X_test_dtm)

In [42]:
from sklearn.metrics import confusion_matrix, accuracy_score
print("Accuracy=", accuracy_score(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))

Accuracy= 0.911092985318
[[149  84]
 [ 25 968]]
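
The accuracy can be recovered from the confusion matrix itself: it is the sum of the diagonal divided by the sum of all entries. A short check using the numbers printed above:

```python
# Confusion matrix printed above: rows are true classes (1, 5),
# columns are predicted classes (1, 5)
cm = [[149, 84],
      [25, 968]]

correct = cm[0][0] + cm[1][1]        # diagonal: correctly classified reviews
total = sum(sum(row) for row in cm)  # all reviews in the testing set
accuracy = correct / float(total)

print(correct, total, accuracy)
```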


In [56]:
print(y_test.value_counts().index[0])  # most frequent class in the testing set

5


## Task 6 (Challenge)

Calculate the **null accuracy**, which is the classification accuracy that could be achieved by always predicting the most frequent class.

- **Hint:** [Evaluating a classification model](https://github.com/justmarkham/scikit-learn-videos/blob/master/09_classification_metrics.ipynb) explains null accuracy and demonstrates two ways to calculate it, though only one of those ways will work in this case. Alternatively, you can come up with your own method to calculate null accuracy!

In [134]:
class_counts=y_train.value_counts()
most_freq_class=class_counts.idxmax()


In [137]:
y_dummy=pd.Series(np.array([most_freq_class]*y_test.shape[0]))




In [138]:
print("Accuracy=", accuracy_score(y_test, y_dummy))
print(confusion_matrix(y_test, y_dummy))

Accuracy= 0.809951060359
[[  0 233]
 [  0 993]]
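
The same figure follows directly from the class sizes in the testing set, without building a vector of dummy predictions. A minimal sketch that reads the class sizes off the row sums of the confusion matrix above:

```python
# Testing-set class sizes (row sums of the confusion matrix above)
test_counts = {1: 233, 5: 993}

most_frequent = max(test_counts, key=test_counts.get)
null_accuracy = test_counts[most_frequent] / float(sum(test_counts.values()))

print(most_frequent, null_accuracy)
```

With the actual data, `y_test.value_counts(normalize=True).max()` should give the same number in one step.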


## Task 7 (Challenge)

Browse through the review text of some of the **false positives** and **false negatives**. Based on your knowledge of how Naive Bayes works, do you have any ideas about why the model is incorrectly classifying these reviews?

- **Hint:** [Evaluating a classification model](https://github.com/justmarkham/scikit-learn-videos/blob/master/09_classification_metrics.ipynb) explains the definitions of "false positives" and "false negatives".
- **Hint:** Think about what a false positive means in this context, and what a false negative means in this context. What has scikit-learn defined as the "positive class"?
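
As a nudge for the second hint: `confusion_matrix` orders classes by sorted label, and in the usual binary reading the second (larger) label plays the role of the positive class. A small sketch on made-up labels, where `y_true_toy` and `y_pred_toy` are hypothetical:

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted star ratings
y_true_toy = [1, 1, 1, 1, 5, 5, 5, 5]
y_pred_toy = [1, 1, 5, 5, 5, 5, 5, 1]

cm = confusion_matrix(y_true_toy, y_pred_toy)

# Rows are true labels and columns are predicted labels, both sorted: [1, 5].
# Flattening the 2x2 matrix gives TN, FP, FN, TP with 5 as the positive class.
tn, fp, fn, tp = cm.ravel()

print(cm)
print(tn, fp, fn, tp)
```

Under this reading, a false positive is a 1-star review predicted as 5 stars, and a false negative is a 5-star review predicted as 1 star.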

In [64]:
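# False positives: predicted 5 stars, actually 1 star
X_test_fp = X_test[y_pred > y_test]
# False negatives: predicted 1 star, actually 5 stars
X_test_fn = X_test[y_pred < y_test]

In [75]:
# Print the first three false negatives
for i in range(3):
    print(X_test_fn[X_test_fn.index[i]])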

I've passed by prestige nails in walmart 100s of times but never really thought of having a pedicure there (even though they are always busy!) As I stared at my feet, long overdue for a pedicure, I thought it was about time to try them...since walmart rarely let's me down why should the nail salon inside?

To my surprise I got a wonderful pedicure or $23 not too bad this day in age...my to mention it was just as good as going to the more upscale salon just across the street! 

I'm glad to be the first to review them they deserve it! Now if only they did facials at walmart and hair I'd be set!
We happened upon this location when meeting a friend for dinner.  When I showed up there was a line of about 20 people ahead of me.  i asked them to put me in first available and they said 15-20min and then 45min for a booth.

Ok, I'll wait.

Our friends showed up and my girlfriend had asked if there was a spot at the bar we could sit in the mean time.  We were there for about 10min already.  That

## Task 8 (Challenge)

Calculate which 10 tokens are the most predictive of **5-star reviews**, and which 10 tokens are the most predictive of **1-star reviews**.

- **Hint:** Naive Bayes automatically counts the number of times each token appears in each class, as well as the number of observations in each class. You can access these counts via the `feature_count_` and `class_count_` attributes of the Naive Bayes model object.

In [107]:
# Per-token counts in each class (row 0 of feature_count_ is 1 star, row 1 is 5 stars)
Starryness_df = pd.DataFrame({"Word": Count.get_feature_names(), "1 star appearances": nb.feature_count_[0], "5 star appearances": nb.feature_count_[1]})

In [108]:
Starryness_df=Starryness_df.set_index('Word')


In [109]:
Starryness_df.head(10)

| Word | 1 star appearances | 5 star appearances |
|---|---|---|
| 00 | 27.0 | 36.0 |
| 000 | 4.0 | 5.0 |
| 00a | 1.0 | 0.0 |
| 00am | 3.0 | 1.0 |
| 00pm | 1.0 | 6.0 |
| 01 | 0.0 | 1.0 |
| 02 | 1.0 | 1.0 |
| 03 | 1.0 | 0.0 |
| 03342 | 1.0 | 0.0 |
| 04 | 0.0 | 1.0 |


In [112]:
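# Add 1 to every count so that a token absent from one class does not give a zero frequency,
# then divide by the number of reviews in that class
Starryness_df["1 starryness"] = (nb.feature_count_[0] + 1) / nb.class_count_[0]
Starryness_df["5 starryness"] = (nb.feature_count_[1] + 1) / nb.class_count_[1]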

In [120]:
Starryness_df["Differentiability"]=Starryness_df["5 starryness"]/Starryness_df["1 starryness"]

In [127]:
Starryness_df.sort_values("Differentiability", ascending=False).head(10)

| Word | 1 star appearances | 5 star appearances | 1 starryness | 5 starryness | Differentiability |
|---|---|---|---|---|---|
| fantastic | 0.0 | 183.0 | 0.001938 | 0.078498 | 40.505119 |
| perfect | 2.0 | 232.0 | 0.005814 | 0.099403 | 17.097270 |
| fabulous | 0.0 | 74.0 | 0.001938 | 0.031997 | 16.510239 |
| yum | 0.0 | 60.0 | 0.001938 | 0.026024 | 13.428328 |
| favorite | 4.0 | 289.0 | 0.009690 | 0.123720 | 12.767918 |
| awesome | 5.0 | 278.0 | 0.011628 | 0.119027 | 10.236348 |
| fruit | 0.0 | 43.0 | 0.001938 | 0.018771 | 9.686007 |
| pasty | 0.0 | 40.0 | 0.001938 | 0.017491 | 9.025597 |
| bianco | 0.0 | 39.0 | 0.001938 | 0.017065 | 8.805461 |
| gem | 0.0 | 38.0 | 0.001938 | 0.016638 | 8.585324 |
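
The table above covers only the 5-star half of the task. The tokens most predictive of 1-star reviews are the ones with the smallest ratio, so sorting in the other direction answers the second half. A minimal sketch of both directions on a hypothetical four-review corpus (the documents and the names `ratio`, `top_five_star` and `top_one_star` are made up for illustration):

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy corpus, not the Yelp data
docs = ["great food great service", "great place",
        "terrible food", "terrible rude service"]
stars = [5, 5, 1, 1]

vect = CountVectorizer()
dtm = vect.fit_transform(docs)
model = MultinomialNB().fit(dtm, stars)

# Tokens in column order of the document-term matrix
tokens = sorted(vect.vocabulary_, key=vect.vocabulary_.get)

# Smoothed per-class token frequency; classes_ is sorted, so row 0 is 1 star
one_star = (model.feature_count_[0] + 1) / model.class_count_[0]
five_star = (model.feature_count_[1] + 1) / model.class_count_[1]
ratio = pd.Series(five_star / one_star, index=tokens)

top_five_star = ratio.nlargest(2)   # largest ratios: typical of 5-star reviews
top_one_star = ratio.nsmallest(2)   # smallest ratios: typical of 1-star reviews

print(top_five_star)
print(top_one_star)
```

On the real data, the equivalent would be sorting the `Differentiability` column in ascending order and taking the first 10 rows.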


## Task 9 (Challenge)

Up to this point, we have framed this as a **binary classification problem** by only considering the 5-star and 1-star reviews. Now, let's repeat the model building process using all reviews, which makes this a **5-class classification problem**.

Here are the steps:

- Define X and y using the original DataFrame. (y should contain 5 different classes.)
- Split X and y into training and testing sets.
- Create document-term matrices using CountVectorizer.
- Calculate the testing accuracy of a Multinomial Naive Bayes model.
- Compare the testing accuracy with the null accuracy, and comment on the results.
- Print the confusion matrix, and comment on the results. (This [Stack Overflow answer](http://stackoverflow.com/a/30748053/1636598) explains how to read a multi-class confusion matrix.)
- Print the [classification report](http://scikit-learn.org/stable/modules/model_evaluation.html#classification-report), and comment on the results. If you are unfamiliar with the terminology it uses, research the terms, and then try to figure out how to calculate these metrics manually from the confusion matrix!

In [129]:
Xf=yelp_df["text"]
yf=yelp_df["stars"]
Xf_train,Xf_test,yf_train,yf_test=train_test_split(Xf,yf,random_state=42,test_size=0.3)
Countf=CountVectorizer()
Xf_train_dtm=Countf.fit_transform(Xf_train)
Xf_test_dtm=Countf.transform(Xf_test)
nbf=MultinomialNB()
nbf.fit(Xf_train_dtm,yf_train)
yf_pred=nbf.predict(Xf_test_dtm)

In [131]:
print("Accuracy=", accuracy_score(yf_test, yf_pred))
print(confusion_matrix(yf_test, yf_pred, labels=[1, 2, 3, 4, 5]))

Accuracy= 0.494
[[ 59  23  17  84  35]
 [ 18  24  42 152  29]
 [  3   7  49 333  50]
 [  7   4  11 805 260]
 [  3   0   5 435 545]]


In [133]:
from sklearn.metrics import classification_report
print(classification_report(yf_test, yf_pred))

             precision    recall  f1-score   support

          1       0.66      0.27      0.38       218
          2       0.41      0.09      0.15       265
          3       0.40      0.11      0.17       442
          4       0.44      0.74      0.56      1087
          5       0.59      0.55      0.57       988

avg / total       0.50      0.49      0.46      3000
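
The per-class numbers in the report can be reproduced by hand from the Naive Bayes confusion matrix printed earlier: precision divides each diagonal entry by its column sum, recall divides it by its row sum, and the F1 score is the harmonic mean of the two. A short sketch using that matrix:

```python
# 5-class confusion matrix from the Naive Bayes model: rows are true stars 1-5,
# columns are predicted stars 1-5
cm = [[59, 23, 17, 84, 35],
      [18, 24, 42, 152, 29],
      [3, 7, 49, 333, 50],
      [7, 4, 11, 805, 260],
      [3, 0, 5, 435, 545]]

n = len(cm)
precision, recall, f1 = [], [], []
for k in range(n):
    tp = cm[k][k]
    predicted_k = sum(cm[i][k] for i in range(n))  # column sum
    actual_k = sum(cm[k])                          # row sum (the "support")
    p = tp / float(predicted_k)
    r = tp / float(actual_k)
    precision.append(p)
    recall.append(r)
    f1.append(2 * p * r / (p + r))

for k in range(n):
    print(k + 1, round(precision[k], 2), round(recall[k], 2), round(f1[k], 2))
```

The low recall for 2-star and 3-star reviews shows up in the matrix as large counts in the 4-star column of those rows.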



In [141]:
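from sklearn.dummy import DummyClassifier
# 'stratified' (random guesses following the training class distribution) was the default
# strategy when this cell ran; strategy='most_frequent' would give the null accuracy instead
dumb = DummyClassifier(strategy='stratified')
dumb.fit(Xf_train_dtm, yf_train)
yf_dummy = dumb.predict(Xf_test_dtm)
print(accuracy_score(yf_test, yf_dummy))
print(confusion_matrix(yf_test, yf_dummy))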

0.291
[[ 17  25  28  72  76]
 [ 14  26  37  91  97]
 [ 35  37  67 165 138]
 [ 86 109 140 389 363]
 [ 63  77 130 344 374]]


In [146]:
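# Row 5 of the matrix above sums to the number of 5-star test reviews (988): the accuracy of always predicting 5 stars
print(np.sum([63, 77, 130, 344, 374]) / 3000.0)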

0.329333333333
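
Note that the "support" column of the classification report shows 4 stars, not 5 stars, as the largest class in the testing set. Assuming the training set has the same most frequent class, the null accuracy for the 5-class problem would be a little higher than the figure above. A quick check using the support values:

```python
# Testing-set class sizes from the "support" column of the classification report
support = {1: 218, 2: 265, 3: 442, 4: 1087, 5: 988}

most_frequent = max(support, key=support.get)
null_accuracy = support[most_frequent] / float(sum(support.values()))

print(most_frequent, round(null_accuracy, 4))
```

Either way, the Naive Bayes accuracy of 0.494 is well above what always predicting a single class would achieve.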


## Trying out LinearSVC and random forest

In [148]:
from sklearn.svm import LinearSVC
svc = LinearSVC()
svc.fit(Xf_train_dtm, yf_train)
ysvc_pred = svc.predict(Xf_test_dtm)
print("Accuracy=", accuracy_score(yf_test, ysvc_pred))
print(confusion_matrix(yf_test, ysvc_pred))
print(classification_report(yf_test, ysvc_pred))

Accuracy= 0.459666666667
[[ 80  48  27  29  34]
 [ 44  75  65  48  33]
 [ 18  45 131 164  84]
 [ 18  48 129 504 388]
 [ 12  12  49 326 589]]
             precision    recall  f1-score   support

          1       0.47      0.37      0.41       218
          2       0.33      0.28      0.30       265
          3       0.33      0.30      0.31       442
          4       0.47      0.46      0.47      1087
          5       0.52      0.60      0.56       988

avg / total       0.45      0.46      0.46      3000



In [152]:
from sklearn.ensemble import RandomForestClassifier as Rf
rf = Rf()
rf.fit(Xf_train_dtm, yf_train)
yrf_pred = rf.predict(Xf_test_dtm)
print("Accuracy=", accuracy_score(yf_test, yrf_pred))
print(confusion_matrix(yf_test, yrf_pred))
print(classification_report(yf_test, yrf_pred))

Accuracy= 0.406
[[ 27  15  37  74  65]
 [ 26  18  40 128  53]
 [  8  16  75 249  94]
 [ 17  22  85 637 326]
 [  8  13  41 465 461]]
             precision    recall  f1-score   support

          1       0.31      0.12      0.18       218
          2       0.21      0.07      0.10       265
          3       0.27      0.17      0.21       442
          4       0.41      0.59      0.48      1087
          5       0.46      0.47      0.46       988

avg / total       0.38      0.41      0.38      3000



## Trying out GridSearchCV with LinearSVC and random forest

In [155]:
Xc_train, Xc_cv, yc_train, yc_cv = train_test_split(X_train, y_train, random_state=42)

In [170]:
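from sklearn.pipeline import make_pipeline
# sklearn.grid_search was removed in scikit-learn 0.20; GridSearchCV now lives in sklearn.model_selection
from sklearn.model_selection import GridSearchCV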

In [193]:
from sklearn.pipeline import Pipeline
pipe=Pipeline([('vect', CountVectorizer()),('svc',LinearSVC())])
params={'vect__ngram_range':[(1,1),(1,2),(2,2)],'svc__C':[0.01,0.1,1.0,10.0,100.0]}

In [195]:
best_classifier=GridSearchCV(pipe,params,verbose=4,cv=3)
best_classifier.fit(Xf_train,yf_train)

Fitting 3 folds for each of 15 candidates, totalling 45 fits
[CV] vect__ngram_range=(1, 1), svc__C=0.01 ...........................
[CV] .. vect__ngram_range=(1, 1), svc__C=0.01, score=0.500857 -   2.0s
[CV] vect__ngram_range=(1, 1), svc__C=0.01 ...........................
[CV] .. vect__ngram_range=(1, 1), svc__C=0.01, score=0.500857 -   2.0s
[CV] vect__ngram_range=(1, 1), svc__C=0.01 ...........................
[CV] .. vect__ngram_range=(1, 1), svc__C=0.01, score=0.491424 -   2.0s
[CV] vect__ngram_range=(1, 2), svc__C=0.01 ...........................
[CV] .. vect__ngram_range=(1, 2), svc__C=0.01, score=0.505141 -   4.8s
[CV] vect__ngram_range=(1, 2), svc__C=0.01 ...........................
[CV] .. vect__ngram_range=(1, 2), svc__C=0.01, score=0.520566 -   4.8s
[CV] vect__ngram_range=(1, 2), svc__C=0.01 ...........................
[CV] .. vect__ngram_range=(1, 2), svc__C=0.01, score=0.487136 -   4.8s
[CV] vect__ngram_range=(2, 2), svc__C=0.01 ...........................
[CV] .. vect__ng

[Parallel(n_jobs=1)]: Done  24 tasks       | elapsed:  1.5min


[CV] ... vect__ngram_range=(2, 2), svc__C=1.0, score=0.449871 -   3.4s
[CV] vect__ngram_range=(2, 2), svc__C=1.0 ............................
[CV] ... vect__ngram_range=(2, 2), svc__C=1.0, score=0.455013 -   3.5s
[CV] vect__ngram_range=(2, 2), svc__C=1.0 ............................
[CV] ... vect__ngram_range=(2, 2), svc__C=1.0, score=0.446827 -   3.5s
[CV] vect__ngram_range=(1, 1), svc__C=10.0 ...........................
[CV] .. vect__ngram_range=(1, 1), svc__C=10.0, score=0.424165 -   3.6s
[CV] vect__ngram_range=(1, 1), svc__C=10.0 ...........................
[CV] .. vect__ngram_range=(1, 1), svc__C=10.0, score=0.447301 -   3.7s
[CV] vect__ngram_range=(1, 1), svc__C=10.0 ...........................
[CV] .. vect__ngram_range=(1, 1), svc__C=10.0, score=0.438679 -   3.6s
[CV] vect__ngram_range=(1, 2), svc__C=10.0 ...........................
[CV] .. vect__ngram_range=(1, 2), svc__C=10.0, score=0.467866 -   6.6s
[CV] vect__ngram_range=(1, 2), svc__C=10.0 ...........................
[CV] .

[Parallel(n_jobs=1)]: Done  45 out of  45 | elapsed:  3.8min finished


GridSearchCV(cv=3, error_score='raise',
       estimator=Pipeline(steps=[('vect', CountVectorizer(analyzer=u'word', binary=False, decode_error=u'strict',
        dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        st...ax_iter=1000,
     multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
     verbose=0))]),
       fit_params={}, iid=True, n_jobs=1,
       param_grid={'vect__ngram_range': [(1, 1), (1, 2), (2, 2)], 'svc__C': [0.01, 0.1, 1.0, 10.0, 100.0]},
       pre_dispatch='2*n_jobs', refit=True, scoring=None, verbose=4)

In [196]:
best_classifier.grid_scores_

[mean: 0.49771, std: 0.00445, params: {'vect__ngram_range': (1, 1), 'svc__C': 0.01},
 mean: 0.50429, std: 0.01366, params: {'vect__ngram_range': (1, 2), 'svc__C': 0.01},
 mean: 0.46857, std: 0.01036, params: {'vect__ngram_range': (2, 2), 'svc__C': 0.01},
 mean: 0.47471, std: 0.00948, params: {'vect__ngram_range': (1, 1), 'svc__C': 0.1},
 mean: 0.49057, std: 0.01655, params: {'vect__ngram_range': (1, 2), 'svc__C': 0.1},
 mean: 0.46186, std: 0.01184, params: {'vect__ngram_range': (2, 2), 'svc__C': 0.1},
 mean: 0.44714, std: 0.00801, params: {'vect__ngram_range': (1, 1), 'svc__C': 1.0},
 mean: 0.48129, std: 0.01541, params: {'vect__ngram_range': (1, 2), 'svc__C': 1.0},
 mean: 0.45057, std: 0.00338, params: {'vect__ngram_range': (2, 2), 'svc__C': 1.0},
 mean: 0.43671, std: 0.00955, params: {'vect__ngram_range': (1, 1), 'svc__C': 10.0},
 mean: 0.47443, std: 0.01149, params: {'vect__ngram_range': (1, 2), 'svc__C': 10.0},
 mean: 0.44629, std: 0.00500, params: {'vect__ngram_range': (2, 2), 'sv
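
Two details of the search above are worth spelling out. Parameter names in the grid follow the pattern `<step name>__<parameter>`, which is how `GridSearchCV` routes each value to the right pipeline step. Also, `grid_scores_` is not available in recent scikit-learn releases, where `cv_results_` carries the same information. A minimal sketch on a hypothetical toy corpus (the documents, labels and the reduced grid are made up so that it runs quickly):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy corpus, repeated so that every fold contains both classes
docs = ["great food", "really great service",
        "terrible food", "really terrible service"] * 3
labels = [5, 5, 1, 1] * 3

pipe = Pipeline([("vect", CountVectorizer()), ("svc", LinearSVC())])

# "vect__ngram_range" sets ngram_range on the "vect" step,
# "svc__C" sets C on the "svc" step
params = {"vect__ngram_range": [(1, 1), (1, 2)], "svc__C": [0.1, 1.0]}

search = GridSearchCV(pipe, params, cv=3)
search.fit(docs, labels)

# One entry per parameter combination, similar in spirit to grid_scores_
results = search.cv_results_
for mean, std, p in zip(results["mean_test_score"],
                        results["std_test_score"],
                        results["params"]):
    print("mean: %.5f, std: %.5f, params: %r" % (mean, std, p))

print(search.best_params_)
```

Because the vectorizer sits inside the pipeline, it is refitted on the training part of every fold, so the held-out fold never leaks into the vocabulary.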

In [197]:
best_pred = best_classifier.predict(Xf_test)
print("Accuracy=", accuracy_score(yf_test, best_pred))
print(confusion_matrix(yf_test, best_pred))
print(classification_report(yf_test, best_pred))

Accuracy= 0.530666666667
[[ 99  34  21  30  34]
 [ 49  70  63  48  35]
 [ 13  24 127 193  85]
 [  6  19  79 587 396]
 [  3   3  18 255 709]]
             precision    recall  f1-score   support

          1       0.58      0.45      0.51       218
          2       0.47      0.26      0.34       265
          3       0.41      0.29      0.34       442
          4       0.53      0.54      0.53      1087
          5       0.56      0.72      0.63       988

avg / total       0.52      0.53      0.52      3000



In [201]:
pipe2=Pipeline([('vect', CountVectorizer(ngram_range=(1,2))),('rf',Rf())])
params={'rf__max_depth':[5,6,7,8,None],'rf__n_estimators':[5,10,15,20,25]}

In [207]:
best_classifier2=GridSearchCV(pipe2,params,verbose=4,cv=2)
best_classifier2.fit(Xf_train,yf_train)

Fitting 2 folds for each of 25 candidates, totalling 50 fits
[CV] rf__max_depth=5, rf__n_estimators=5 .............................
[CV] .... rf__max_depth=5, rf__n_estimators=5, score=0.376642 -   3.5s
[CV] rf__max_depth=5, rf__n_estimators=5 .............................
[CV] .... rf__max_depth=5, rf__n_estimators=5, score=0.382504 -   3.4s
[CV] rf__max_depth=5, rf__n_estimators=10 ............................
[CV] ... rf__max_depth=5, rf__n_estimators=10, score=0.389492 -   3.4s
[CV] rf__max_depth=5, rf__n_estimators=10 ............................
[CV] ... rf__max_depth=5, rf__n_estimators=10, score=0.385649 -   3.4s
[CV] rf__max_depth=5, rf__n_estimators=15 ............................
[CV] ... rf__max_depth=5, rf__n_estimators=15, score=0.383210 -   3.5s
[CV] rf__max_depth=5, rf__n_estimators=15 ............................
[CV] ... rf__max_depth=5, rf__n_estimators=15, score=0.401658 -   3.5s
[CV] rf__max_depth=5, rf__n_estimators=20 ............................
[CV] ... rf__max

[Parallel(n_jobs=1)]: Done  24 tasks       | elapsed:  1.5min


[CV] ... rf__max_depth=7, rf__n_estimators=15, score=0.381496 -   4.0s
[CV] rf__max_depth=7, rf__n_estimators=15 ............................
[CV] ... rf__max_depth=7, rf__n_estimators=15, score=0.384505 -   3.8s
[CV] rf__max_depth=7, rf__n_estimators=20 ............................
[CV] ... rf__max_depth=7, rf__n_estimators=20, score=0.391776 -   3.8s
[CV] rf__max_depth=7, rf__n_estimators=20 ............................
[CV] ... rf__max_depth=7, rf__n_estimators=20, score=0.407376 -   3.8s
[CV] rf__max_depth=7, rf__n_estimators=25 ............................
[CV] ... rf__max_depth=7, rf__n_estimators=25, score=0.405197 -   4.1s
[CV] rf__max_depth=7, rf__n_estimators=25 ............................
[CV] ... rf__max_depth=7, rf__n_estimators=25, score=0.400800 -   4.1s
[CV] rf__max_depth=8, rf__n_estimators=5 .............................
[CV] .... rf__max_depth=8, rf__n_estimators=5, score=0.380925 -   3.7s
[CV] rf__max_depth=8, rf__n_estimators=5 .............................
[CV] .

[Parallel(n_jobs=1)]: Done  50 out of  50 | elapsed:  3.7min finished


GridSearchCV(cv=2, error_score='raise',
       estimator=Pipeline(steps=[('vect', CountVectorizer(analyzer=u'word', binary=False, decode_error=u'strict',
        dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 2), preprocessor=None, stop_words=None,
        st...n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False))]),
       fit_params={}, iid=True, n_jobs=1,
       param_grid={'rf__max_depth': [5, 6, 7, 8, None], 'rf__n_estimators': [5, 10, 15, 20, 25]},
       pre_dispatch='2*n_jobs', refit=True, scoring=None, verbose=4)

In [208]:
best_pred2 = best_classifier2.predict(Xf_test)
print("Accuracy=", accuracy_score(yf_test, best_pred2))
print(confusion_matrix(yf_test, best_pred2))
print(classification_report(yf_test, best_pred2))

Accuracy= 0.453
[[ 13   3  15 105  82]
 [  4   4  22 172  63]
 [  1   1  19 331  90]
 [  0   1  14 685 387]
 [  0   2   6 342 638]]
             precision    recall  f1-score   support

          1       0.72      0.06      0.11       218
          2       0.36      0.02      0.03       265
          3       0.25      0.04      0.07       442
          4       0.42      0.63      0.50      1087
          5       0.51      0.65      0.57       988

avg / total       0.44      0.45      0.39      3000



In [209]:
best_classifier2.grid_scores_

[mean: 0.37957, std: 0.00293, params: {'rf__max_depth': 5, 'rf__n_estimators': 5},
 mean: 0.38757, std: 0.00192, params: {'rf__max_depth': 5, 'rf__n_estimators': 10},
 mean: 0.39243, std: 0.00922, params: {'rf__max_depth': 5, 'rf__n_estimators': 15},
 mean: 0.39471, std: 0.00237, params: {'rf__max_depth': 5, 'rf__n_estimators': 20},
 mean: 0.40429, std: 0.01195, params: {'rf__max_depth': 5, 'rf__n_estimators': 25},
 mean: 0.37443, std: 0.00493, params: {'rf__max_depth': 6, 'rf__n_estimators': 5},
 mean: 0.38443, std: 0.00151, params: {'rf__max_depth': 6, 'rf__n_estimators': 10},
 mean: 0.39743, std: 0.00480, params: {'rf__max_depth': 6, 'rf__n_estimators': 15},
 mean: 0.39957, std: 0.00049, params: {'rf__max_depth': 6, 'rf__n_estimators': 20},
 mean: 0.39957, std: 0.00677, params: {'rf__max_depth': 6, 'rf__n_estimators': 25},
 mean: 0.37957, std: 0.00407, params: {'rf__max_depth': 7, 'rf__n_estimators': 5},
 mean: 0.38314, std: 0.00250, params: {'rf__max_depth': 7, 'rf__n_estimators': 