# Review Sentiment Classifier

This notebook trains and compares several scikit-learn classifiers that predict the sentiment of Amazon book reviews from their text.

**Project Overview:**
- Labels each review as POSITIVE, NEGATIVE, or NEUTRAL from its rating score; the models are trained on POSITIVE and NEGATIVE reviews only
- Uses TF-IDF vectorization for text feature extraction
- Compares multiple ML models (SVM, Decision Tree, Naive Bayes, Logistic Regression)
- Includes hyperparameter tuning with GridSearchCV

**Dataset:** Amazon Book Reviews (Books_small_10000.json)

## 1. Data Structures and Helper Classes

In [5]:
import random

# Sentiment categories as class constants
class Sentiment:
  NEGATIVE = "NEGATIVE"
  NEUTRAL = "NEUTRAL"
  POSITIVE = "POSITIVE"


class Review:
  """Represents a single review with text and rating"""

  def __init__(self, text, score):
    self.text = text
    self.score = score
    self.sentiment = self._classify_sentiment()

  def _classify_sentiment(self):
    """Convert rating to sentiment category"""
    if self.score <= 2:
      return Sentiment.NEGATIVE
    elif self.score == 3:
      return Sentiment.NEUTRAL
    return Sentiment.POSITIVE


class ReviewContainer:
  """Container for managing multiple reviews"""

  def __init__(self, reviews):
    self.reviews = reviews

  def evenly_distribute(self):
    """Downsample to equal POSITIVE and NEGATIVE counts; NEUTRAL reviews are dropped"""
    # Separate reviews by sentiment
    negative_reviews = [r for r in self.reviews if r.sentiment == Sentiment.NEGATIVE]
    positive_reviews = [r for r in self.reviews if r.sentiment == Sentiment.POSITIVE]

    print(f"Negative Reviews: {len(negative_reviews)}")
    print(f"Positive Reviews: {len(positive_reviews)}")

    # Match sample sizes by truncating whichever class is larger
    min_count = min(len(negative_reviews), len(positive_reviews))

    # Combine and shuffle to prevent order bias
    self.reviews = negative_reviews[:min_count] + positive_reviews[:min_count]
    random.shuffle(self.reviews)

    print(f"Evenly Distributed Reviews: {len(self.reviews)}")

  def get_text(self):
    """Extract all review texts"""
    return [review.text for review in self.reviews]

  def get_sentiment(self):
    """Extract all sentiment labels"""
    return [review.sentiment for review in self.reviews]
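
The rating thresholds used by `Review._classify_sentiment` can be checked in isolation. The sketch below is a standalone restatement of that mapping, assuming ratings arrive as whole-number stars from 1 to 5; `score_to_sentiment` is a hypothetical helper, not part of the notebook.

```python
# Hypothetical helper mirroring the thresholds in Review._classify_sentiment.
def score_to_sentiment(score):
    """Map a star rating to a sentiment label."""
    if score <= 2:
        return "NEGATIVE"
    if score == 3:
        return "NEUTRAL"
    return "POSITIVE"


# Whole-star ratings, as assumed for this dataset.
mapping = {score: score_to_sentiment(score) for score in (1.0, 2.0, 3.0, 4.0, 5.0)}
for score, label in mapping.items():
    print(score, label)

# Edge case: a fractional rating is neither <= 2 nor == 3.
fractional = score_to_sentiment(2.5)
print(2.5, fractional)
```

Note that a fractional rating such as 2.5 falls through to POSITIVE under these thresholds, so the mapping is only sensible if scores really are whole numbers.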

## 2. Load Data

In [6]:
import json

# Load reviews from JSON file
file_name = 'Books_small_10000.json'
reviews = []

with open(file_name, encoding='utf-8') as f:
  for line in f:
    try:
      # Parse each JSON line and create Review object
      data = json.loads(line)
      reviews.append(Review(data['reviewText'], data['overall']))
    except (json.JSONDecodeError, KeyError) as e:
      # Skip lines that are not valid JSON or lack the expected fields
      print(f"Skipping malformed line: {line.strip()} - Error: {e}")

# Verify data loaded correctly (guard against an empty list)
if reviews:
  print(f"Successfully loaded {len(reviews)} reviews.")
  print(f"Example sentiment: {reviews[0].sentiment}")
else:
  print("No reviews were loaded.")


Successfully loaded 10000 reviews.
Example sentiment: POSITIVE


## 3. Prepare Data for Training

### 3.1 Train-Test Split

In [7]:
from sklearn.model_selection import train_test_split

# Split: 67% train, 33% test
training, test = train_test_split(reviews, test_size=0.33, random_state=42)

# Wrap in containers
train_container = ReviewContainer(training)
test_container = ReviewContainer(test)

### 3.2 Balance Class Distribution

In [8]:
# Balance training data
train_container.evenly_distribute()
train_x = train_container.get_text()
train_y = train_container.get_sentiment()

# Balance test data as well, so the scores below reflect a 50/50 class mix
test_container.evenly_distribute()
test_x = test_container.get_text()
test_y = test_container.get_sentiment()

# Verify balanced distribution
print(train_y.count(Sentiment.POSITIVE))
print(train_y.count(Sentiment.NEGATIVE))

Negative Reviews: 436
Positive Reviews: 5611
Evenly Distributed Reviews: 872
Negative Reviews: 208
Positive Reviews: 2767
Evenly Distributed Reviews: 416
436
436
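
The balancing step can also be written as a standalone function. The sketch below is one possible variant, assuming a seeded local RNG is acceptable so that the selection and order are repeatable across runs; `balance_binary`, `label_of`, and `seed` are illustrative names, not part of the notebook.

```python
import random


def balance_binary(items, label_of, seed=42):
    """Downsample the larger of the two classes to the size of the smaller one.

    Items whose label is neither NEGATIVE nor POSITIVE are dropped.
    """
    rng = random.Random(seed)  # local RNG, leaves the global random state alone
    negative = [x for x in items if label_of(x) == "NEGATIVE"]
    positive = [x for x in items if label_of(x) == "POSITIVE"]
    n = min(len(negative), len(positive))
    balanced = rng.sample(negative, n) + rng.sample(positive, n)
    rng.shuffle(balanced)
    return balanced


# Toy (text, label) pairs standing in for Review objects.
toy = [("bad", "NEGATIVE")] * 3 + [("good", "POSITIVE")] * 10 + [("ok", "NEUTRAL")] * 2
balanced = balance_binary(toy, label_of=lambda pair: pair[1])
labels = [label for _, label in balanced]
print(labels.count("NEGATIVE"), labels.count("POSITIVE"))  # 3 3
```

With `Review` objects, `label_of=lambda r: r.sentiment` would play the same role.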


## 4. Text Vectorization (TF-IDF)

In [9]:
from sklearn.feature_extraction.text import TfidfVectorizer

# TF-IDF weights each word by how often it appears in a review,
# discounted by how many reviews contain it
vectorizer = TfidfVectorizer()

# Fit the vocabulary on training data only, then transform both sets
train_x_vectors = vectorizer.fit_transform(train_x)
test_x_vectors = vectorizer.transform(test_x)
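
To make the fit/transform split concrete, here is a small sketch on a made-up three-document corpus (the texts are assumptions for illustration only). It shows that the vocabulary is fixed by the training data and that words seen only at transform time are ignored.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

train_docs = ["great book", "terrible book", "great story"]  # toy training texts
test_docs = ["great unseen words"]                            # only "great" is known

vec = TfidfVectorizer()
train_vectors = vec.fit_transform(train_docs)  # learns vocabulary and IDF weights
test_vectors = vec.transform(test_docs)        # reuses them, learns nothing new

print(sorted(vec.vocabulary_))  # ['book', 'great', 'story', 'terrible']
print(train_vectors.shape)      # (3, 4)
print(test_vectors.shape)       # (1, 4)
print(test_vectors.nnz)         # 1 (only "great" maps to a column)
```

Both matrices share the same four columns, which is what lets a model trained on `train_vectors` score `test_vectors`.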

## 5. Model Training and Evaluation

### 5.1 Support Vector Machine (SVM)

In [10]:
from sklearn.svm import SVC

# Linear SVM for text classification
clf_svm = SVC(kernel='linear')
clf_svm.fit(train_x_vectors, train_y)

# Evaluate accuracy
print(clf_svm.score(test_x_vectors, test_y))

0.8076923076923077


### 5.2 Decision Tree Classifier

In [11]:
from sklearn.tree import DecisionTreeClassifier

# Decision tree model
clf_dec = DecisionTreeClassifier()
clf_dec.fit(train_x_vectors, train_y)

print(clf_dec.score(test_x_vectors, test_y))

0.6610576923076923


### 5.3 Naive Bayes Classifier

In [12]:
from sklearn.naive_bayes import GaussianNB

# Gaussian Naive Bayes (requires dense arrays)
clf_gnb = GaussianNB()
clf_gnb.fit(train_x_vectors.toarray(), train_y)

print(clf_gnb.score(test_x_vectors.toarray(), test_y))

0.6610576923076923


### 5.4 Logistic Regression

In [13]:
from sklearn.linear_model import LogisticRegression

# Logistic regression model
clf_log = LogisticRegression(max_iter=1000)
clf_log.fit(train_x_vectors, train_y)

print(clf_log.score(test_x_vectors, test_y))

0.8028846153846154


## 6. Model Evaluation with F1-Score

In [14]:
from sklearn.metrics import f1_score

# Macro-averaged F1 per model, printed in the order:
# SVM, Decision Tree, Naive Bayes, Logistic Regression
sentiment_labels = [Sentiment.POSITIVE, Sentiment.NEGATIVE]

print(f1_score(test_y, clf_svm.predict(test_x_vectors),
               labels=sentiment_labels, average='macro'))
print(f1_score(test_y, clf_dec.predict(test_x_vectors),
               labels=sentiment_labels, average='macro'))
print(f1_score(test_y, clf_gnb.predict(test_x_vectors.toarray()),
               labels=sentiment_labels, average='macro'))
print(f1_score(test_y, clf_log.predict(test_x_vectors),
               labels=sentiment_labels, average='macro'))

0.807674526121128
0.6608989738401503
0.6610087209806335
0.8028663892741563
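
For reference, macro-averaged F1 is the unweighted mean of the per-class F1 scores. The sketch below recomputes it by hand on a four-item toy example; `macro_f1` is a hypothetical helper written for illustration, and the labels are made up.

```python
def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-class F1, where F1 = 2*TP / (2*TP + FP + FN)."""
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)  # 0 when the class never occurs
    return sum(scores) / len(scores)


y_true = ["POSITIVE", "POSITIVE", "NEGATIVE", "NEGATIVE"]
y_pred = ["POSITIVE", "NEGATIVE", "NEGATIVE", "NEGATIVE"]

# POSITIVE: TP=1, FP=0, FN=1 -> F1 = 2/3; NEGATIVE: TP=2, FP=1, FN=0 -> F1 = 4/5
score = macro_f1(y_true, y_pred, ["POSITIVE", "NEGATIVE"])
print(round(score, 4))  # 0.7333
```

On a balanced two-class test set the macro F1 tends to sit close to accuracy, which matches the near-identical numbers reported above.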


## 7. Testing on New Reviews

In [15]:
# Test with custom reviews (the model is binary, so even a lukewarm
# review is assigned either POSITIVE or NEGATIVE)
sample_reviews = ['Very bad book, waste of money',
                  'Best read of the year!',
                  'Meh, could be better']

# Transform and predict
sample_vectors = vectorizer.transform(sample_reviews)
predictions = clf_svm.predict(sample_vectors)

print(predictions)

['NEGATIVE' 'POSITIVE' 'NEGATIVE']
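
Because the classifier only knows two classes, a review like 'Meh, could be better' is forced into one of them. One possible way to surface such borderline cases is to look at the signed distance from the separating hyperplane via `SVC.decision_function`; values near zero suggest a low-confidence prediction. The sketch below uses made-up toy texts and makes no claim about suitable thresholds.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC

# Toy training data, for illustration only.
texts = ["great book loved it", "wonderful story",
         "terrible waste of money", "awful boring book"]
labels = ["POSITIVE", "POSITIVE", "NEGATIVE", "NEGATIVE"]

vec = TfidfVectorizer()
model = SVC(kernel="linear").fit(vec.fit_transform(texts), labels)

samples = ["loved this wonderful story", "book"]
sample_vectors = vec.transform(samples)
preds = model.predict(sample_vectors)
# For a two-class SVC this is a 1-D array; positive values favour model.classes_[1].
margins = model.decision_function(sample_vectors)

for text, pred, margin in zip(samples, preds, margins):
    print(f"{text!r}: {pred} (margin {margin:+.3f})")
```

In the notebook itself the same call would be `clf_svm.decision_function(sample_vectors)`.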


## 8. Hyperparameter Tuning with GridSearchCV

In [16]:
from sklearn.model_selection import GridSearchCV

# Parameter grid for SVM
param_grid = {
    'C': (1, 4, 8, 16, 32),
    'kernel': ('linear', 'rbf')
}

# Grid search with 5-fold cross-validation
svc_base = SVC()
clf = GridSearchCV(svc_base, param_grid, cv=5)
clf.fit(train_x_vectors, train_y)

In [17]:
# Evaluate the best estimator found by the search
# (GridSearchCV refits it on the full training set by default)
print(clf.score(test_x_vectors, test_y))

0.8100961538461539
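
The winning parameter combination is not printed by the cells above; after fitting, it can be read from the search object's `best_params_` and `best_score_` attributes. The sketch below shows this on a tiny made-up corpus with a reduced grid and `cv=2`, so its output says nothing about the real dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Toy data, for illustration only: four reviews per class.
texts = ["great book", "loved this story", "wonderful read",
         "great characters and great plot",
         "terrible book", "hated this story", "boring read",
         "awful characters and awful plot"]
labels = ["POSITIVE"] * 4 + ["NEGATIVE"] * 4

vectors = TfidfVectorizer().fit_transform(texts)

# Smaller grid and fewer folds than the notebook, to suit the toy data.
grid = {"C": (1, 4), "kernel": ("linear", "rbf")}
search = GridSearchCV(SVC(), grid, cv=2)
search.fit(vectors, labels)

print(search.best_params_)  # the parameter combination with the best mean CV score
print(search.best_score_)   # mean cross-validated accuracy of that combination
```

In the notebook, `clf.best_params_` and `clf.best_score_` give the corresponding values for the full grid.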


## 9. Save Model for Future Use

In [18]:
import pickle

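# Save trained model to disk. The fitted `vectorizer` is not included,
# so it must be saved separately to reuse the model on new text.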
model_filename = 'sentiment_classifier.pkl'
with open(model_filename, 'wb') as f:
  pickle.dump(clf, f)
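
A saved classifier can only score text that has been transformed by the same fitted vectorizer. One way to keep the two together is to wrap them in a scikit-learn `Pipeline` and pickle that single object. The sketch below is an alternative to the cell above, not what the notebook does; it uses toy data and an in-memory round trip in place of a file.

```python
import pickle

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Toy training data, for illustration only.
texts = ["great book loved it", "wonderful story",
         "terrible waste of money", "awful boring book"]
labels = ["POSITIVE", "POSITIVE", "NEGATIVE", "NEGATIVE"]

# The pipeline vectorizes raw text and then classifies it.
pipeline = make_pipeline(TfidfVectorizer(), SVC(kernel="linear"))
pipeline.fit(texts, labels)

# Round trip through pickle (bytes in memory stand in for a file on disk).
blob = pickle.dumps(pipeline)
restored = pickle.loads(blob)

samples = ["what a wonderful story", "boring waste of money"]
original_preds = list(pipeline.predict(samples))
restored_preds = list(restored.predict(samples))  # raw strings go straight in
print(restored_preds)
```

As with any pickle, the file should only be loaded from a trusted source and with a compatible scikit-learn version.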

## Summary

**Results:**
- Best Model: Tuned SVM (C=4, rbf kernel)
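- Test Accuracy: ~81% on the balanced test set (0.810, compared with 0.808 for the untuned linear SVM)
- Classifies book reviews as positive or negative; neutral (3-star) reviews are excluded from training and evaluation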

**Future Improvements:**
- Try Word2Vec or BERT embeddings
- Experiment with deep learning (LSTM, Transformers)
- Include neutral sentiment classification
- Collect more training data