This notebook loads a vectorizer and a logistic regression model from joblib files (the scikit-learn documentation recommends joblib over pickle for objects that hold large NumPy arrays). Both were trained on whole, extreme reviews (with 100+ reviews) from the academic dataset.

It uses the trained vectorizer and model to score each service sentence and appends the prediction and class probabilities as new columns in Francine's file.

#You can read the file this produces with the following:
#pd.read_csv("review_sentences_final_academic_service_words_scored.csv", index_col=0)

In [1]:
import pandas as pd
from sklearn.externals import joblib
from sklearn import metrics
import re

Load the joblib (pickle) files trained on the Extreme Reviews file, which has a balanced number of good and bad reviews (Thursday 2015/07/30 edition).

In [3]:
vc = joblib.load('Yelp2015/pickle/vectorizer_thursday.pkl')
lr = joblib.load('Yelp2015/pickle/logreg_thursday.pkl')
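
For context, the two `.pkl` files would have been written by a matching `joblib.dump` step that is not part of this notebook. Below is a minimal sketch of that round trip. It assumes a `CountVectorizer` (the actual vectorizer type is not stated here) and uses toy reviews and file names invented for illustration. It imports the standalone `joblib` package, which newer scikit-learn releases require because `sklearn.externals.joblib` was removed.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Toy training data, invented for illustration (1 = good, 0 = bad).
toy_reviews = ["great friendly service", "amazing food and great staff",
               "rude slow service", "terrible food and rude staff"]
toy_scores = [1, 1, 0, 0]

# Fit a vectorizer and a classifier (stand-ins for the real ones).
toy_vc = CountVectorizer()
toy_lr = LogisticRegression()
toy_lr.fit(toy_vc.fit_transform(toy_reviews), toy_scores)

# Save both objects, then load them back the way the notebook does.
tmpdir = tempfile.mkdtemp()
vc_path = os.path.join(tmpdir, "vectorizer_toy.pkl")  # hypothetical file name
lr_path = os.path.join(tmpdir, "logreg_toy.pkl")      # hypothetical file name
joblib.dump(toy_vc, vc_path)
joblib.dump(toy_lr, lr_path)
loaded_vc = joblib.load(vc_path)
loaded_lr = joblib.load(lr_path)

# The reloaded pair should reproduce the original predictions.
original = toy_lr.predict(toy_vc.transform(toy_reviews))
reloaded = loaded_lr.predict(loaded_vc.transform(toy_reviews))
print((original == reloaded).all())
```

The vectorizer and the model have to be saved and loaded as a pair, since the model's coefficients only make sense against the vocabulary the vectorizer was fitted with.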

First, confirm that the pickle files load and run.

In [2]:
extreme = pd.read_csv("Yelp2015/yelp_academic_reviews_extreme_partial.csv", index_col = 0)
extreme.head(3)

Unnamed: 0,clean,score
0,i don t care what other people say top dog is ...,1
1,top dog saved the day again with hot wieners a...,1
2,when i m in the area i usually drop in a for q...,1


In [4]:
#Vectorize and do regression on clean whole reviews.
X = extreme.clean
Xvec = vc.transform(X).toarray()
y_pred = lr.predict(Xvec)

In [5]:
# metrics was already imported in the first cell.
print metrics.accuracy_score(extreme.score, y_pred)

0.989644561916


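The above shows that I can load the predictor, run it on the set it was trained on, and recapitulate the scores. Because these are the training reviews, the 0.99 accuracy is a sanity check rather than an estimate of performance on new text. Now let's see how it does with sentences!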

In [6]:
#Try to evaluate sentences from Francine's service words sentences file from today.
df = pd.read_csv("Downloads/review_sentences_final_academic_service_words_NEW.csv", index_col=0)
df.drop('Unnamed: 0.1', axis=1, inplace=True)
print df.shape
df.head(2)

(113278, 8)


Unnamed: 0,business_id,date,review_id,stars,text,type,user_id,votes
0,qHmamQPCAKkia9X0uryA8g,2012-06-15,hIrFN-5jhCo04AvhsNtimg,5,You order in 3 seconds and you're out in 5 min...,review,P-xcy872BvGcClkNpNlPqQ,"{u'funny': 0, u'useful': 1, u'cool': 1}"
1,qHmamQPCAKkia9X0uryA8g,2010-12-10,C3XPzVWPoqK_FpgItJFtjg,4,The guy behind the counter was very friendly a...,review,S6BXOUedzPH58K0rYY1loQ,"{u'funny': 0, u'useful': 0, u'cool': 0}"


In [7]:
#Clean the sentences
def cleantext(text):
    text = re.sub("[^a-zA-Z]", " ", text).lower()  # Replace every non-letter character with a space, then lowercase
    return re.sub(r'\s+', ' ', text).strip()  # Collapse runs of whitespace and trim the ends

clean = df.text.apply(cleantext).values
df['clean'] = clean
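
A few quick checks of what `cleantext` keeps and drops, as a sketch. The function is repeated from the cell above so the block runs on its own, and the second and third sample strings are made up. Because the `[^a-zA-Z]` class matches anything outside the ASCII letters, digits and accented letters are removed as well.

```python
import re

def cleantext(text):
    # Same logic as the notebook cell: keep ASCII letters only, lowercase,
    # then collapse whitespace.
    text = re.sub("[^a-zA-Z]", " ", text).lower()
    return re.sub(r"\s+", " ", text).strip()

samples = [
    "You order in 3 seconds and you're out in 5 minutes!",  # first sentence in the file
    "Café was GREAT!!!",                                     # made-up sample
    "10/10",                                                 # made-up sample
]
cleaned = [cleantext(s) for s in samples]
for raw, out in zip(samples, cleaned):
    print(repr(raw) + " -> " + repr(out))
```

The last sample cleans to an empty string, so a sentence made only of digits and punctuation reaches the vectorizer with no words at all.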

In [8]:
print df.text.iloc[0]   # original sentence
print df.clean.iloc[0]  # cleaned sentence

You order in 3 seconds and you're out in 5 minutes!
you order in seconds and you re out in minutes


In [9]:
#Vectorize sentences
sentencevec = vc.transform(df.clean)  # keep the sparse matrix; lr accepts it directly and it uses far less memory than .toarray()

In [10]:
#Do logistic regression
senpred = lr.predict(sentencevec)
senproba = lr.predict_proba(sentencevec)

In [11]:
# predict_proba columns follow lr.classes_ (assumed here to be [0, 1], i.e. bad then good)
df['senpred'] = senpred
df['proba_bad'] = senproba[:, 0]
df['proba_good'] = senproba[:, 1]
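
The cell above treats column 0 of `predict_proba` as "bad" and column 1 as "good". In scikit-learn the columns follow the order of the model's `classes_` attribute, so this holds if the labels are 0 = bad and 1 = good, which the `score` column suggests but the notebook does not state outright. A small sketch with toy data (invented for illustration) showing how to check the order instead of assuming it:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Toy data, invented for illustration (1 = good, 0 = bad).
toy_text = ["great friendly service", "amazing food and great staff",
            "rude slow service", "terrible food and rude staff"]
toy_y = [1, 1, 0, 0]

toy_vc = CountVectorizer()
toy_lr = LogisticRegression().fit(toy_vc.fit_transform(toy_text), toy_y)

proba = toy_lr.predict_proba(toy_vc.transform(["great staff", "rude staff"]))

# Columns of predict_proba are ordered like classes_, so look the
# column positions up rather than hard-coding 0 and 1.
classes = list(toy_lr.classes_)
bad_col = classes.index(0)
good_col = classes.index(1)
proba_bad = proba[:, bad_col]
proba_good = proba[:, good_col]

print(classes)
print(np.allclose(proba_bad + proba_good, 1.0))
```

Running the same `list(lr.classes_)` check on the loaded model would confirm which column is which before the values are written to the file.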

In [12]:
df.head(2)

In [13]:
print len(df)
print len(df[df.proba_good>0.9]) 
print len(df[df.proba_bad>0.9])


113278
12727
20827
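
As a rough reading of these counts (a worked calculation on the numbers printed above, not a new result): the two thresholds cannot both hold for one sentence, since the two probabilities sum to 1, so the rest of the sentences fall between them.

```python
total = 113278    # all service sentences
n_good = 12727    # proba_good > 0.9
n_bad = 20827     # proba_bad > 0.9

# Sentences where neither class reaches 0.9.
n_middle = total - n_good - n_bad

frac_good = n_good / float(total)
frac_bad = n_bad / float(total)
frac_middle = n_middle / float(total)

print("confident good: {:.1%}".format(frac_good))
print("confident bad:  {:.1%}".format(frac_bad))
print("in between:     {:.1%}".format(frac_middle))
```

So about 11% of sentences are scored as confidently good, about 18% as confidently bad, and roughly 70% sit in between.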


In [14]:
#Save to file
# to_csv writes the index by default; index_col is a read_csv argument, not a to_csv one.
df.to_csv("review_sentences_final_academic_service_words_NEW_sen_proba.csv")

In [19]:
#Now let's test random phrases for fun.

test = ["time", "minutes", "seconds minutes", "three seconds", "ugly", "flkslkfdj", "unicorn", "amazing minutes"]
testvec = vc.transform(test)
testpred = lr.predict(testvec)
testprob = lr.predict_proba(testvec)
for phrase, pred, prob in zip(test, testpred, testprob):
    print pred, prob, phrase


1 [ 0.49654368  0.50345632] time
0 [ 0.6615704  0.3384296] minutes
0 [ 0.70393383  0.29606617] seconds minutes
0 [ 0.5213172  0.4786828] three seconds
1 [ 0.45095211  0.54904789] ugly
1 [ 0.48727826  0.51272174] flkslkfdj
1 [ 0.46116682  0.53883318] unicorn
1 [ 0.15472862  0.84527138] amazing minutes
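
One way to read the `flkslkfdj` row: if that token is not in the vectorizer's vocabulary (an assumption, since the vocabulary is not shown here), its feature vector is all zeros, and a logistic regression then returns sigmoid(intercept) no matter what the coefficients are. Under that assumption the printed value is the model's baseline probability of "good", and the implied intercept can be recovered from it. A sketch using only the standard library:

```python
import math

def sigmoid(z):
    # Logistic function: maps a log-odds score to a probability.
    return 1.0 / (1.0 + math.exp(-z))

def logit(p):
    # Inverse of sigmoid: maps a probability back to log-odds.
    return math.log(p / (1.0 - p))

# Printed proba_good for "flkslkfdj", assumed to be out of vocabulary.
p_good_unknown = 0.51272174

# With an all-zero feature vector, log-odds = intercept.
implied_intercept = logit(p_good_unknown)
print(round(implied_intercept, 4))

# Round trip back to the printed probability.
print(round(sigmoid(implied_intercept), 8))
```

On this reading the baseline is close to 50/50, and phrases such as "minutes" or "amazing minutes" move away from it only through the coefficients of words the model has seen.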
