In [5]:
import pandas as pd
import numpy as np
import os
import matplotlib.pyplot as plt
import seaborn as sns

In [6]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

## Step 1: Build The DataFrame and Define ML Problem

In [7]:
filename = os.path.join(os.getcwd(), "bookReviews.csv")
df = pd.read_csv(filename, header=0)

In [8]:
df.head()

Unnamed: 0,Review,Positive Review
0,This was perhaps the best of Johannes Steinhof...,True
1,This very fascinating book is a story written ...,True
2,The four tales in this collection are beautifu...,True
3,The book contained more profanity than I expec...,False
4,We have now entered a second time of deep conc...,True


## Step 2: Create Labeled Examples from the Data Set
Let's create labeled examples from our dataset. We will have one text feature and one label.
The code cell below carries out the following steps:

* Gets the `Positive Review` column from DataFrame `df` and assigns it to the variable `y`. This will be our label. Note that the label contains `True` or `False` values that indicate whether a given book review is positive.
* Gets the `Review` column from DataFrame `df` and assigns it to the variable `X`. This will be our feature. Note that the `Review` feature contains the text of the book review.

In [9]:
y = df['Positive Review']
X = df['Review']

X.shape

(1973,)

In [10]:
X.head()

Unnamed: 0,Review
0,This was perhaps the best of Johannes Steinhof...
1,This very fascinating book is a story written ...
2,The four tales in this collection are beautifu...
3,The book contained more profanity than I expec...
4,We have now entered a second time of deep conc...


In [11]:
print('A Positive Review: \n\n', X[67])
print('A Negative Review: \n\n', X[85])

A Positive Review: 

 I am not going to go over the contents of the book, or much about Charles Bukowski, because if you are considering this book you must know something about the man and his work. I will just give you my impression of this collection of work.
No collection can ever really be complete, there are always new things to add, new commentary, newly discovered works, transcripts of records and unpublished letters, but this book does an excellent job in its attempt.
To me Charles Bukowski will always be one of the greatest American writers of the twentieth century, because of the sheer brutality and honesty his work emanates. It is funny, sad, sadistic, cruel, scathing, enlightening and thought provoking. Everything I like to read. This is poetry for people who are disgusted by verse of flowers, trees and Greek mythology. This is RAW human emotion and experience smeared out onto paper. It is not perfect, and it is not trying to be. It doesn't always work, but there in lies th

## Step 3: Create Training and Test Data Sets


In [12]:
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=.75, random_state=1234)

X_train.head()

Unnamed: 0,Review
500,"There is a reason this book has sold over 180,..."
1047,There is one thing that every cookbook author ...
1667,Being an engineer in the aerospace industry I ...
1646,I have no idea how this book has received the ...
284,It is almost like dream comes true when I saw ...


## Step 4: Implement TF-IDF Vectorizer to Transform Text

A popular technique for transforming text into numerical feature vectors is the TF-IDF (term frequency-inverse document frequency) statistical measure. TF-IDF scores how relevant a word (token) is to a document relative to a collection of documents. It gives higher weights to the words that are most distinctive of a document, which can therefore be used to characterize that document. For example, the word "the" appears in many documents and is therefore not characteristic of any one document in the collection. On the other hand, if a word appears often in one document and rarely in the other documents in the collection, the word is given a higher importance for that one document.
Because TF-IDF down-weights words that are common across the collection and emphasizes words that are distinctive to a document, using TF-IDF features for sentiment classification can yield more accurate results than using raw word counts. Note that TF-IDF still treats each document as a bag of words; it does not capture word order or context.
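
To make the weighting concrete, here is a minimal sketch on a toy corpus. The three sentences and the `toy_` names are made up for illustration and are not part of `bookReviews.csv`. The word "the" appears in every toy document and "dragon" in only one, so with scikit-learn's default settings "dragon" should receive the larger weight.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus (not from bookReviews.csv): "the" occurs once in
# every document, while "dragon" occurs in only one of them.
toy_corpus = [
    "the dragon guards gold",
    "the cook burns soup",
    "the pilot lands planes",
]

toy_vectorizer = TfidfVectorizer()
toy_tfidf = toy_vectorizer.fit_transform(toy_corpus)

# vocabulary_ maps each word to its column index; idf_ holds one
# inverse-document-frequency weight per column.
vocab = toy_vectorizer.vocabulary_
idf_the = toy_vectorizer.idf_[vocab["the"]]
idf_dragon = toy_vectorizer.idf_[vocab["dragon"]]

print("IDF of 'the':    {:.4f}".format(idf_the))
print("IDF of 'dragon': {:.4f}".format(idf_dragon))

# Within the first document both words occur once, so the rarer word
# ends up with the larger TF-IDF value.
print("TF-IDF of 'the' in doc 0:    {:.4f}".format(toy_tfidf[0, vocab["the"]]))
print("TF-IDF of 'dragon' in doc 0: {:.4f}".format(toy_tfidf[0, vocab["dragon"]]))
```

A word that occurs several times in one document can still outweigh a rarer word there, because the term-frequency part grows with the count; the toy sentences use each word once to isolate the effect of the inverse document frequency.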

In [13]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [14]:
# 1. Create a TfidfVectorizer object
tfidf_vectorizer = TfidfVectorizer()

# 2. Fit the vectorizer to X_train
tfidf_vectorizer.fit(X_train)

# 3. Print the first 50 items in the vocabulary
print("Vocabulary size {0}: ".format(len(tfidf_vectorizer.vocabulary_)))
print(str(list(tfidf_vectorizer.vocabulary_.items())[0:50])+'\n')


# 4. Transform *both* the training and test data using the fitted vectorizer and its 'transform' method
X_train_tfidf = tfidf_vectorizer.transform(X_train)
X_test_tfidf = tfidf_vectorizer.transform(X_test)


# 5. Print the matrix
print(X_train_tfidf.todense())

Vocabulary size 18558: 
[('there', 16673), ('is', 9043), ('reason', 13533), ('this', 16714), ('book', 2189), ('has', 7803), ('sold', 15423), ('over', 11793), ('180', 73), ('000', 1), ('copies', 3867), ('it', 9076), ('gets', 7240), ('right', 14207), ('to', 16835), ('the', 16627), ('point', 12568), ('accompanies', 444), ('each', 5372), ('strategy', 15943), ('with', 18277), ('visual', 17844), ('aid', 750), ('so', 15386), ('you', 18497), ('can', 2604), ('get', 7239), ('mental', 10534), ('picture', 12402), ('in', 8491), ('your', 18501), ('head', 7844), ('further', 7051), ('its', 9088), ('section', 14743), ('on', 11601), ('analyzing', 974), ('stocks', 15886), ('and', 984), ('commentary', 3384), ('state', 15782), ('of', 11543), ('financial', 6568), ('statements', 15786), ('market', 10286), ('are', 1220), ('money', 10863), ('if', 8336), ('just', 9282), ('starting', 15774)]

[[0.         0.16185315 0.         ... 0.         0.         0.        ]
 [0.         0.         0.         ... 0.       
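
The cell above fits the vectorizer on `X_train` only and then transforms both sets with it. The sketch below, on made-up sentences standing in for `X_train` and `X_test`, illustrates what that implies: both matrices share the training vocabulary's columns, and a word that appears only in the test text gets no column. All `toy_` names are hypothetical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy data standing in for X_train and X_test.
toy_train = ["a gripping and moving story", "a dull and tedious story"]
toy_test = ["a gripping but overlong sequel"]

toy_vectorizer = TfidfVectorizer()
toy_vectorizer.fit(toy_train)  # the vocabulary comes from the training text only

toy_train_tfidf = toy_vectorizer.transform(toy_train)
toy_test_tfidf = toy_vectorizer.transform(toy_test)

# Both matrices have one column per training-vocabulary term.
print(toy_train_tfidf.shape, toy_test_tfidf.shape)

# A word seen only in the test text has no column and is dropped.
print("overlong" in toy_vectorizer.vocabulary_)

# The result is a sparse matrix; nnz counts the stored non-zero entries.
print(toy_test_tfidf.nnz)
```

Fitting on the training data alone keeps information about the test reviews out of the feature definition, which is why the test set is only transformed and never used in `fit`.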

## Step 5: Fit a Logistic Regression Model to the Transformed Training Data and Evaluate the Model

In [15]:
# 1. Create a LogisticRegression model object, and fit a Logistic Regression model to the transformed training data
model = LogisticRegression(max_iter=200)
model.fit(X_train_tfidf, y_train)

# 2. Make predictions on the transformed test data using the predict_proba() method and
# save the values of the second column
probability_predictions = model.predict_proba(X_test_tfidf)[:,1]

# 3. Make predictions on the transformed test data using the predict() method
class_label_predictions = model.predict(X_test_tfidf)

# 4. Compute the Area Under the ROC curve (AUC) for the test data. Note that this time we are using one
# function 'roc_auc_score()' to compute the auc rather than using both 'roc_curve()' and 'auc()' as we have
# done in the past
auc = roc_auc_score(y_test, probability_predictions)
print('AUC on the test data: {:.4f}'.format(auc))

# 5. Print out the size of the resulting feature space using the 'vocabulary_' attribute of the vectorizer
len_feature_space = len(tfidf_vectorizer.vocabulary_)
print('The size of the feature space: {0}'.format(len_feature_space))

# 6. Get a glimpse of the features:
first_five = list(tfidf_vectorizer.vocabulary_.items())[0:5]
print('Glimpse of first 5 entries of the mapping of a word to its column/feature index \n{}:'.format(first_five))

AUC on the test data: 0.9147
The size of the feature space: 18558
Glimpse of first 5 entries of the mapping of a word to its column/feature index 
[('there', 16673), ('is', 9043), ('reason', 13533), ('this', 16714), ('book', 2189)]:
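
The comment in step 4 of the cell above mentions that `roc_auc_score()` replaces the two-step `roc_curve()` plus `auc()` route. As a quick check, the sketch below computes the AUC both ways on a handful of made-up labels and scores (the `toy_` arrays are assumptions for illustration, not the model's output) and compares the results.

```python
import numpy as np
from sklearn.metrics import auc, roc_auc_score, roc_curve

# Hypothetical labels (1 = positive review) and predicted probabilities.
toy_labels = np.array([1, 0, 1, 1, 0, 0])
toy_scores = np.array([0.9, 0.4, 0.65, 0.3, 0.2, 0.7])

# Two-step route: compute the points of the ROC curve, then integrate.
fpr, tpr, thresholds = roc_curve(toy_labels, toy_scores)
auc_two_step = auc(fpr, tpr)

# One-step route, as used in the cell above.
auc_one_step = roc_auc_score(toy_labels, toy_scores)

print('Two-step AUC: {:.4f}'.format(auc_two_step))
print('One-step AUC: {:.4f}'.format(auc_one_step))
```

Both routes need scores or probabilities rather than hard class labels, which is why the notebook passes `probability_predictions` and not `class_label_predictions`.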


In [16]:
print('Review #1:\n')
print(X_test.to_numpy()[124])

print('\nPrediction: Is this a good review? {}\n'.format(class_label_predictions[124]))

print('Actual: Is this a good review? {}\n'.format(y_test.to_numpy()[124]))

Review #1:

I've been a fan of Carol Dweck's scholarly work for years. Her work on self-esteem, self-concept, and the incremental vs. entity theories of intelligence provides some of the most powerfully useful tools I've encountered for educators and parents in their work with children, as well as in their own self-awareness and lives. I'm delighted to see this information written here in such a user-friendly conversational tone, rich with stories that illustrate the nuances and complexities of Dweck's research and ideas. I'm recommending this book to all of my graduate students (teachers and principals working with gifted learners), as well as to parents of high-ability children.

Dona Matthews, Ph.D., Director of the Hunter College Center for Gifted Studies and Education, City University of New York


Prediction: Is this a good review? True

Actual: Is this a good review? True
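
The prediction printed above comes from `predict()`, while the AUC was computed from the second column of `predict_proba()`. The sketch below shows how the two relate on a made-up one-feature data set. The `toy_` names are hypothetical. For a binary logistic regression, the columns of `predict_proba()` follow `classes_`, and `predict()` should agree with thresholding the `True` column at 0.5.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical one-feature data set, for illustration only.
toy_X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
toy_y = np.array([False, False, False, True, True, True])

toy_model = LogisticRegression(max_iter=200)
toy_model.fit(toy_X, toy_y)

# classes_ gives the column order of predict_proba(); with boolean labels
# column 0 is False and column 1 is True.
print(toy_model.classes_)

toy_probabilities = toy_model.predict_proba(toy_X)[:, 1]
toy_class_labels = toy_model.predict(toy_X)

# Compare predict() with thresholding the True-class probability at 0.5.
print(np.array_equal(toy_class_labels, toy_probabilities > 0.5))
```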



In [17]:
print('Review #2:\n')
print(X_test.to_numpy()[238])

print('\nPrediction: Is this a good review? {}\n'.format(class_label_predictions[238]))

print('Actual: Is this a good review? {}\n'.format(y_test.to_numpy()[238]))

Review #2:

I have read other books by Alesia Holliday and enjoyed them so I looked forward to reading this book.  Unfortunately, I could not get any farther than the first 25 pages.  I even tried diving in further into the book to see if it got better and I still could not read more than 5 pages without turning away.  The best I can do to pin down why I dislike it so much is to say that it tries too hard.  No character seems to even approach reality.  They are all, including the main character and her love interest, over the top


Prediction: Is this a good review? False

Actual: Is this a good review? False

