# Predicting Book Review Spoilers by Brian Dannenmueller


## Goal: 

Predict whether a review contains a spoiler for the book's plot.


## Data Preparation:

Our dataset comes from the academic paper "Fine-Grained Spoiler Detection from Large-Scale Review Corpora". More than 15 million book review records were scraped off the internet for the original study. Notably, some pre-processing has already been done -- the data contains no HTML markup or null values, for example.
    
The data consists of three attributes: 

1. review_sentences, which is one long string consisting of the entire review. 

2. rating, which is the number of "stars" received by the book, from zero to five.

3. has_spoiler, which is a boolean indicating whether the review has a spoiler.

The data comes in several sizes, containing 1000 (1.1 MB), 5000 (6.2 MB), 10000 (10.8 MB), 20000 (19.5 MB), and 50000 (49.7 MB) reviews. We will be working with the smallest and the largest of these.

## Model Planning and Execution:

We are utilizing Naive Bayes (NB), a standard, fast baseline for text classification on word-count features. We are also using Support Vector Machines (SVM), which are slower on their own, but combining them with Principal Component Analysis (PCA) could speed them up, possibly enough to outperform Naive Bayes, allowing us to test both algorithms on the largest data set.


## Process: 

1. Read in and examined the data to determine what cleaning still needed to be done. There wasn't much to do besides run the starter code for the clean_review function.

2. Pre-processed the small data set to make it readily available to machine learning algorithms. 

3. Measured baseline performance for predictions using vanilla implementations of Naive Bayes (NB), Support Vector Machine (SVM), and Principal Component Analysis (PCA).

4. Experimented with parameters for Naive Bayes, SVC, and PCA to optimize both accuracy and speed. 

5. Implemented these optimized parameters to classify the largest data set with 50,000 reviews.


## Metrics:

We are measuring our machine learning algorithm performance using two metrics: runtime and accuracy. 
Lowering the runtime with PCA without sacrificing accuracy is ideal because we will be trying to scale up to the largest possible data set. The accuracy score is simply the percentage of classifications correctly made by the algorithms. The runtime is measured in seconds.
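As a rough illustration of the two metrics, here is a minimal sketch in plain Python. The helper names `accuracy` and `timed` and the toy labels are hypothetical and not part of the notebook, which uses `accuracy_score` from scikit-learn and `time()` instead.

```python
from time import perf_counter

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("label lists must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

def timed(fn, *args):
    """Run fn(*args) and return (result, elapsed_seconds)."""
    start = perf_counter()
    result = fn(*args)
    return result, perf_counter() - start

# Hypothetical labels: 1 = spoiler, 0 = no spoiler.
y_true = [1, 0, 0, 0, 1, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]

score, seconds = timed(accuracy, y_true, y_pred)
print("accuracy: %0.3f" % score)
print("done in %0.6fs" % seconds)

# Reference point: the accuracy of always predicting the most common label.
majority_score = accuracy(y_true, [0] * len(y_true))
print("majority-class accuracy: %0.3f" % majority_score)
```

Comparing a model's accuracy with the majority-class score is one way to judge how much the model adds when one label is much more common than the other.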


## Baseline Performance:

We started out with the smallest file size -- 1000 reviews. 
SVM was the most accurate algorithm at baseline, correctly classifying 91.5% of the test reviews.
The baseline accuracy score for Naive Bayes was also very good -- 90.5%! It was also the fastest -- about 0.03s versus about 1.4s for SVM.

Naive Bayes does not work with PCA, so I used it with SVM only, boiling the 5000-dimension word-count vectors down to only 100 dimensions. There was zero loss of accuracy, which was still 91.5%, but a massive upgrade to speed: the runtime was only about 0.05s (the timer covers SVM fitting and prediction, not the PCA transform itself).
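The reason the multinomial variant of Naive Bayes and PCA do not combine is that `MultinomialNB` expects non-negative features such as word counts, while PCA centers the data and so produces negative values. The sketch below shows this on a toy count matrix; the random data is a hypothetical stand-in for the review vectors.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.naive_bayes import MultinomialNB

rng = np.random.default_rng(0)
# Toy word-count matrix: 40 "reviews" x 20 "words" (hypothetical data).
counts = rng.poisson(lam=1.0, size=(40, 20))
labels = np.array([0, 1] * 20)

# PCA centers the data, so the projected features include negative values.
projected = PCA(n_components=5).fit_transform(counts)
print("min projected value:", projected.min())

# MultinomialNB accepts the raw counts...
MultinomialNB().fit(counts, labels)

# ...but rejects the PCA output because of the negative entries.
try:
    MultinomialNB().fit(projected, labels)
    nb_accepts_pca = True
except ValueError as err:
    nb_accepts_pca = False
    print("MultinomialNB failed:", err)
```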


## Results:

Naive Bayes: 
On the small data set, adjusting the alpha value (the smoothing parameter) to 10 improved Naive Bayes accuracy to 91.5% without sacrificing speed, which remained about the same.

On the large data set, I used the adjusted alpha value and was able to achieve about 90% accuracy in 1.3s. NB is incredibly fast for this application.
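To make the role of alpha more concrete, here is a small sketch of additive smoothing, where the estimated probability of a word given a class is (count of the word in the class + alpha) / (total word count in the class + alpha * vocabulary size). The toy matrix and the helper `smoothed_word_probs` are hypothetical; the comparison against `MultinomialNB` only checks that the hand computation agrees with the fitted model on this example.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Toy counts: 4 reviews x 3 words (hypothetical data).
X = np.array([[3, 0, 1],
              [2, 0, 0],
              [0, 4, 1],
              [0, 3, 0]])
y = np.array([0, 0, 1, 1])

def smoothed_word_probs(X, y, cls, alpha):
    """P(word | class) with additive smoothing."""
    word_totals = X[y == cls].sum(axis=0)
    return (word_totals + alpha) / (word_totals.sum() + alpha * X.shape[1])

for alpha in (1.0, 10.0):
    manual = smoothed_word_probs(X, y, cls=0, alpha=alpha)
    model = MultinomialNB(alpha=alpha).fit(X, y)
    fitted = np.exp(model.feature_log_prob_[0])
    print("alpha =", alpha, manual.round(3), "matches sklearn:", np.allclose(manual, fitted))
```

A larger alpha pulls the word probabilities toward a uniform distribution, so rare words have less influence on the prediction.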

SVM: 
For the small data set, switching to a polynomial kernel and adjusting the C value downward to 0.01 (increasing regularization) had no effect on accuracy, but greatly enhanced speed. Without PCA, the runtime was cut in half to about 0.7s. Adjusting PCA to n = 10 again had no effect on accuracy, but the algorithm became much faster than even the NB algorithm, with a runtime of about 0.005s.

On the large data set, I attempted to run SVM without PCA, but gave up after several minutes of waiting. I then ran it with the optimized C = 0.01 and PCA n = 10 and was able to achieve 93.6% accuracy in about 38s. Clearly, this algorithm is significantly slower at this task than NB and could not be run without PCA, but it does achieve better accuracy.
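One way to keep the PCA and SVM steps together is a scikit-learn pipeline. The sketch below uses the same settings as the notebook (10 whitened components, polynomial kernel, C = 0.01) but on synthetic data, so the labels and matrix sizes are assumptions and the resulting predictions say nothing about spoilers.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Hypothetical stand-in for the bag-of-words matrix: 200 reviews x 50 words.
X = rng.poisson(lam=1.0, size=(200, 50)).astype(float)
y = (X[:, 0] + X[:, 1] > 2).astype(int)  # synthetic labels, not real spoilers

# PCA is fit on the training rows only, then SVC is trained on the reduced data.
model = make_pipeline(
    PCA(n_components=10, whiten=True),
    SVC(kernel="poly", C=0.01),
)
model.fit(X[:150], y[:150])
preds = model.predict(X[150:])

# The SVC only ever sees 10 columns instead of 50.
reduced = model.named_steps["pca"].transform(X[150:])
print(reduced.shape)
```

Wrapping both steps in one object also makes it easy to time PCA and SVM together, which the notebook cells time separately.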



## Recommendation:

Naive Bayes appears to be ideal for quick and somewhat dirty text classification. It scales extremely well, with an excellent runtime and solid accuracy. If accuracy is more important, optimized SVC outperforms NB (93.6% versus 90.3% on the large data set) at the cost of speed, and with data sets even larger than ours, it may not be feasible without dedicating enormous resources. Using PCA in conjunction with SVC largely offsets this drawback: the runtime is still slower than that of NB, but reasonable.


## Insights:

The use of the English language follows the Pareto Principle -- roughly speaking, 20% of words account for 80% of actual usage. It seems that clearing out obvious structural words like "the", "is", and "and" does not affect this principle -- based on our results, it seems to generalize well to more colorful verbs, adjectives, and nouns. I can say this because our PCA could reduce our dimensionality by more than 2 orders of magnitude (from 5000 to 10) without sacrificing accuracy. This suggests that a small number of words are strongly correlated with spoilers.
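The Pareto claim could be checked directly on the cleaned reviews. The sketch below is one possible way to do that: the helper `top_share` is hypothetical, and the token list is a made-up example rather than real review text.

```python
from collections import Counter

def top_share(tokens, top_fraction=0.2):
    """Share of all token occurrences covered by the most frequent words."""
    counts = Counter(tokens)
    n_top = max(1, round(len(counts) * top_fraction))
    covered = sum(count for _, count in counts.most_common(n_top))
    return covered / len(tokens)

# Made-up tokens standing in for the words of the cleaned reviews.
tokens = "plot twist ending twist dies ending twist plot great twist".split()

# 5 distinct words, so the top 20% is the single most frequent word.
print(top_share(tokens))
```

On the real data, the token list could be built by splitting every string in `cleaned_text` and concatenating the results.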

Attempting classification with a decision tree could, however, give us more insight into why we are able to consistently achieve high accuracy while simplifying the model so much. SVC and Naive Bayes do not appear to be as transparent, and even if decision trees perform less well, they could help provide some mental scaffolding for interpreting the results of the other algorithms.

It might be possible to optimize NB to match the accuracy of SVM, though we do not see evidence of that in this experiment. Future experiments could involve further experimentation with Naive Bayes parameters or finding a way to implement PCA with Naive Bayes -- if it is possible.
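Further experimentation with Naive Bayes parameters could be automated with a cross-validated grid search. The following is only a sketch under assumptions: the data is synthetic, and the grid values for `alpha` and `fit_prior` are arbitrary starting points rather than settings tested in this project.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB

rng = np.random.default_rng(0)
# Synthetic count data standing in for the review vectors.
X = rng.poisson(lam=1.0, size=(120, 30))
y = (X[:, 0] > X[:, 1]).astype(int)

# Try each parameter combination with 5-fold cross-validation.
search = GridSearchCV(
    MultinomialNB(),
    param_grid={"alpha": [0.1, 1.0, 10.0], "fit_prior": [True, False]},
    scoring="accuracy",
    cv=5,
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```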




# Part I: Viewing the Data

In [1]:
import pandas

data = pandas.read_csv("goodreads_reviews1000.csv") 

#Here we take a look at an unmodified review.
print(data["review_sentences"][0])

 This is a special book.  It started slow for about the first third then in the middle third it started to get interesting then the last third blew my mind.  This is what I love about good science fiction  it pushes your thinking about where things can go.  It is a  Hugo winner and translated from its original Chinese which made it interesting in just a different way from most things Ive read.  For instance the intermixing of Chinese revolutionary history  how they kept accusing people of being reactionaries etc.  It is a book about science and aliens.  The science described in the book is impressive  its a book grounded in physics and pretty accurate as far as I could tell.  Though when it got to folding protons into  dimensions I think he was just making stuff up  interesting to think about though.  But what would happen if our SETI stations received a message  if we found someone was out there  and the person monitoring and answering the signal on our side was disillusioned?  That p

In [2]:
#Here is our integer vector (0-5) of the ratings attribute

print(data["rating"])

0      5
1      3
2      3
3      0
4      4
      ..
995    2
996    4
997    3
998    5
999    3
Name: rating, Length: 1000, dtype: int64


In [3]:
#Here is our boolean vector indicating which reviews have spoilers

print(data["has_spoiler"])

0       True
1      False
2       True
3      False
4       True
       ...  
995    False
996    False
997    False
998    False
999    False
Name: has_spoiler, Length: 1000, dtype: bool


# Part II: Pre-Processing the Data

In this section, we define a function clean_review that takes in a review, removes punctuation, converts everything to lowercase, and removes common filler words as defined in the nltk stopwords collection.
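As a quick illustration of that cleaning step, the sketch below applies the same three operations to a single sentence. To stay self-contained it uses a tiny hard-coded stop-word set instead of the nltk list, so `DEMO_STOP_WORDS` and `clean_review_demo` are hypothetical names and the real function will remove more words.

```python
import re

# Small hard-coded stand-in for nltk's English stop-word list (assumption).
DEMO_STOP_WORDS = {"the", "is", "and", "a", "it", "this", "of"}

def clean_review_demo(review):
    # Replace anything that is not a letter with a space.
    letters_only = re.sub("[^a-zA-Z]", " ", review)
    # Lowercase and split on whitespace.
    words = letters_only.lower().split()
    # Drop the stop words and rejoin.
    return " ".join(word for word in words if word not in DEMO_STOP_WORDS)

print(clean_review_demo("This is a special book, and the ending blew my mind!"))
# prints: special book ending blew my mind
```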

In [4]:
import re
from sklearn.model_selection import train_test_split
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer 
from sklearn.svm import SVC
from sklearn import decomposition
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score
from time import time

# Build the stop-word set once; set membership checks are fast.
# (Requires the nltk stopwords corpus: nltk.download("stopwords").)
STOP_WORDS = set(stopwords.words("english"))

def clean_review(review):
    #input is a string review
    #return is the review in lowercase, stripped of punctuation and nltk stopwords
    letters_only = re.sub("[^a-zA-Z]", " ", review)
    words = letters_only.lower().split()
    return " ".join(word for word in words if word not in STOP_WORDS)

#convert the boolean labels to integers (True -> 1, False -> 0)
data["has_spoiler"] = data["has_spoiler"].astype(int)

#process the data
cleaned_text = [clean_review(review) for review in data["review_sentences"]]

#establish training and testing dataset
#note that the target could be data['rating'] or data['has_spoiler']; here we predict has_spoiler
train_data, test_data, train_sln, test_sln = \
    train_test_split(cleaned_text, data['has_spoiler'], test_size = 0.2, random_state=0)


In [5]:
#In this cell, we use Count Vectorizer to find the 5000 most common words that appear in the reviews.

#Bag of Words with 5000 most common words
vectorizer = CountVectorizer(analyzer='word', max_features = 5000)
#find the right 5000 words
vectorizer.fit(train_data)
#let's look at which words it found
#(get_feature_names_out replaces get_feature_names, which was removed in scikit-learn 1.2)
print(vectorizer.get_feature_names_out())

######################

#use the vectorizer to transform review strings into word count vectors 
train_data_vectors = vectorizer.transform(train_data).toarray()
test_data_vectors = vectorizer.transform(test_data).toarray()



# Part III: Baseline Analysis


We are working with the smallest version of the data set, n = 1000.
In the next part, we will tweak parameters to see if we can improve on these baselines.

In [6]:
#MultinomialNB (naive bayes) 

mnb = MultinomialNB()
t0 = time()
mnb.fit(train_data_vectors,train_sln)
preds = mnb.predict(test_data_vectors)
print("done in %0.3fs" % (time() - t0))
print("accuracy:", accuracy_score(preds,test_sln))


done in 0.032s
accuracy: 0.905


In [7]:
#Support Vector Machine

svc = SVC()
t0 = time()
svc.fit(train_data_vectors,train_sln)
preds = svc.predict(test_data_vectors)
print("done in %0.3fs" % (time() - t0))
print("accuracy:",accuracy_score(preds,test_sln))




done in 1.432s
accuracy: 0.915


In [8]:
#Transforming the data with PCA, n = 100

pca = decomposition.PCA(n_components=100, whiten=True)
pca.fit(train_data_vectors)

train_data_pca = pca.transform(train_data_vectors)
test_data_pca = pca.transform(test_data_vectors)
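A quick way to see what this transformation does is to inspect how much variance the kept components retain and what whitening does to the output. The sketch below uses synthetic counts and a separate `demo_pca` object, both of which are assumptions for illustration only.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic count data: 100 "reviews" x 30 "words".
X = rng.poisson(lam=1.0, size=(100, 30)).astype(float)

demo_pca = PCA(n_components=10, whiten=True).fit(X)

# Fraction of the total variance kept by the 10 components.
retained = demo_pca.explained_variance_ratio_.sum()
print("variance retained: %0.3f" % retained)

# With whiten=True, each component of the training data has unit variance.
Z = demo_pca.transform(X)
print(Z.std(axis=0, ddof=1).round(3))
```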

In [9]:
#SVM with PCA


svc2 = SVC()
t0 = time()
svc2.fit(train_data_pca,train_sln)
preds = svc2.predict(test_data_pca)
print("done in %0.3fs" % (time() - t0))
print("accuracy:",accuracy_score(preds,test_sln))

done in 0.053s
accuracy: 0.915




# Part IV: Experimenting with Parameters

In [10]:
#MultinomialNB (naive bayes) 

mnb = MultinomialNB(alpha=10)
t0 = time()
mnb.fit(train_data_vectors,train_sln)
preds = mnb.predict(test_data_vectors)
print("done in %0.3fs" % (time() - t0))
print("accuracy:", accuracy_score(preds,test_sln))

done in 0.021s
accuracy: 0.915


In [11]:
#Support Vector Machine

svc = SVC(kernel='poly', C=0.01)
t0 = time()
svc.fit(train_data_vectors,train_sln)
preds = svc.predict(test_data_vectors)
print("done in %0.3fs" % (time() - t0))
print("accuracy:",accuracy_score(preds,test_sln))



done in 0.708s
accuracy: 0.915


In [12]:
#Transforming the data with PCA, n = 10

pca = decomposition.PCA(n_components=10, whiten=True)
pca.fit(train_data_vectors)

train_data_pca = pca.transform(train_data_vectors)
test_data_pca = pca.transform(test_data_vectors)

In [13]:
#SVM with PCA

svc2 = SVC(kernel='poly', C=0.01)
t0 = time()
svc2.fit(train_data_pca,train_sln)
preds = svc2.predict(test_data_pca)
print("done in %0.3fs" % (time() - t0))
print("accuracy:",accuracy_score(preds,test_sln))

done in 0.004s
accuracy: 0.915




# Part V: Scaling Up

In [14]:
data = pandas.read_csv("goodreads_reviews50000.csv") 

# Build the stop-word set once; set membership checks are fast.
# (Requires the nltk stopwords corpus: nltk.download("stopwords").)
STOP_WORDS = set(stopwords.words("english"))

def clean_review(review):
    #input is a string review
    #return is the review in lowercase, stripped of punctuation and nltk stopwords
    letters_only = re.sub("[^a-zA-Z]", " ", review)
    words = letters_only.lower().split()
    return " ".join(word for word in words if word not in STOP_WORDS)

#convert the boolean labels to integers (True -> 1, False -> 0)
data["has_spoiler"] = data["has_spoiler"].astype(int)

#process the data
cleaned_text = [clean_review(review) for review in data["review_sentences"]]

#establish training and testing dataset
#note that the target could be data['rating'] or data['has_spoiler']; here we predict has_spoiler
train_data, test_data, train_sln, test_sln = \
    train_test_split(cleaned_text, data['has_spoiler'], test_size = 0.2, random_state=0)

#Bag of Words with 5000 most common words
vectorizer = CountVectorizer(analyzer='word', max_features = 5000)
#find the right 5000 words
vectorizer.fit(train_data)
#let's look at which words it found
#(get_feature_names_out replaces get_feature_names, which was removed in scikit-learn 1.2)
print(vectorizer.get_feature_names_out())

#use the vectorizer to transform review strings into word count vectors 
train_data_vectors = vectorizer.transform(train_data).toarray()
test_data_vectors = vectorizer.transform(test_data).toarray()
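At this scale the `.toarray()` calls become expensive: a dense 40,000 x 5,000 matrix of 64-bit integers takes about 1.6 GB. The sketch below estimates that figure and shows, on a small synthetic matrix, that `MultinomialNB` gives the same fit on a sparse matrix as on its dense copy. Whether skipping `.toarray()` helps in practice here is an assumption that was not tested in this project, and the PCA step as written still works on dense input.

```python
import numpy as np
from scipy import sparse
from sklearn.naive_bayes import MultinomialNB

# Training split of the 50,000-review file with 5,000 features, int64 counts.
n_reviews, n_words = 40_000, 5_000
dense_gb = n_reviews * n_words * np.dtype(np.int64).itemsize / 1e9
print("dense training matrix: %0.1f GB" % dense_gb)

rng = np.random.default_rng(0)
# Small synthetic count matrix that is mostly zeros, like real bag-of-words data.
small_dense = rng.poisson(lam=0.05, size=(200, 500))
small_sparse = sparse.csr_matrix(small_dense)
labels = rng.integers(0, 2, size=200)

sparse_bytes = (small_sparse.data.nbytes
                + small_sparse.indices.nbytes
                + small_sparse.indptr.nbytes)
print("dense bytes:", small_dense.nbytes, "sparse bytes:", sparse_bytes)

# MultinomialNB accepts sparse input and learns the same parameters.
dense_model = MultinomialNB(alpha=10).fit(small_dense, labels)
sparse_model = MultinomialNB(alpha=10).fit(small_sparse, labels)
same_fit = np.allclose(dense_model.feature_log_prob_, sparse_model.feature_log_prob_)
print("same fit:", same_fit)
```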






In [15]:
#MultinomialNB (naive bayes) 

mnb = MultinomialNB(alpha=10)
t0 = time()
mnb.fit(train_data_vectors,train_sln)
preds = mnb.predict(test_data_vectors)
print("done in %0.3fs" % (time() - t0))
print("accuracy:", accuracy_score(preds,test_sln))

done in 1.320s
accuracy: 0.9031


In [16]:
#Transforming the data with PCA, n = 10

pca = decomposition.PCA(n_components=10, whiten=True)
pca.fit(train_data_vectors)

train_data_pca = pca.transform(train_data_vectors)
test_data_pca = pca.transform(test_data_vectors)

In [17]:
#SVM with PCA

svc2 = SVC(kernel='poly', C=0.01)
t0 = time()
svc2.fit(train_data_pca,train_sln)
preds = svc2.predict(test_data_pca)
print("done in %0.3fs" % (time() - t0))
print("accuracy:",accuracy_score(preds,test_sln))



done in 38.124s
accuracy: 0.9364
