# Sentiment Analysis for Product Reviews

## Business Context: Amazon's Review Analysis

A retailer at Amazon's scale receives far more customer reviews than anyone could read by hand. Automated analysis of those reviews can help to:
- Identify products with quality or satisfaction issues
- Highlight top-performing products for promotion
- Track sentiment trends over time for product categories
- Allow customers to filter by review sentiment
- Provide personalized recommendations based on sentiment patterns

In this notebook, we'll build a simple sentiment classifier of the kind Amazon might use as one part of its review-processing pipeline.

## 1. Setup and Data Loading

First, let's import the necessary libraries and load our dataset of product reviews.

In [None]:
# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

# Set random seed for reproducibility
np.random.seed(42)

In [None]:
# Create a sample dataset of product reviews
# In a real application, this would be loaded from a file or database
data = {
    'review_text': [
        "This product exceeded my expectations. The quality is amazing and it works perfectly.",
        "Great value for money, highly recommend this to anyone looking for a reliable option.",
        "Disappointed with the build quality. It broke after just two weeks of normal use.",
        "The instructions were confusing and customer service was unhelpful when I called.",
        "Decent product for the price, but nothing extraordinary. Does what it's supposed to do.",
        "Absolutely love this! Best purchase I've made this year.",
        "Shipping was quick but the product doesn't match the description at all.",
        "Wish I could give zero stars. Complete waste of money and time.",
        "Product is okay but the app that goes with it constantly crashes.",
        "The design is beautiful and it's very user-friendly. Exactly what I needed.",
        "Average performance. Not bad but not great either.",
        "Item arrived on time and was as described in the listing.",
        "This product changed my life! I use it every day and it saves me so much time.",
        "Poor quality materials and it stopped working after a month.",
        "Fantastic customer service and the product is excellent too.",
        "Not worth the premium price. You're paying for the brand name only.",
        "Simple to set up and works reliably. No complaints.",
        "The battery life is terrible, needs charging every few hours.",
        "Perfect size and weight, and the performance is outstanding.",
        "Received a damaged item and return process was a nightmare."
    ],
    'rating': [5, 5, 2, 2, 3, 5, 2, 1, 3, 5, 3, 4, 5, 1, 5, 2, 4, 2, 5, 1]
}

# Create DataFrame
df = pd.DataFrame(data)

# Display the first few rows
df.head()

## 2. Data Exploration and Preparation

Let's explore our dataset and prepare it for sentiment analysis. In a real business context, Amazon would have millions of reviews with star ratings.

In [None]:
# Check the distribution of ratings
plt.figure(figsize=(8, 5))
sns.countplot(x='rating', data=df)
plt.title('Distribution of Review Ratings')
plt.xlabel('Rating (1-5 stars)')
plt.ylabel('Count')
plt.show()

# Display basic statistics
print(f"Total reviews: {len(df)}")
print(f"Average rating: {df['rating'].mean():.2f}")

In [None]:
# Convert ratings to sentiment labels
# 4-5 stars = positive, 1-2 stars = negative, 3 stars = neutral
df['sentiment'] = df['rating'].apply(lambda x: 'positive' if x >= 4 else ('negative' if x <= 2 else 'neutral'))

# For simplicity in this demonstration, we'll drop neutral reviews and classify positive vs. negative
# In a real application, you might want to include neutral or use the full 1-5 scale
# .copy() gives an independent DataFrame, so later changes don't trigger pandas chained-assignment warnings
df_binary = df[df['sentiment'] != 'neutral'].copy()

# Display the binary sentiment distribution
plt.figure(figsize=(8, 5))
sns.countplot(x='sentiment', data=df_binary)
plt.title('Distribution of Binary Sentiment')
plt.xlabel('Sentiment')
plt.ylabel('Count')
plt.show()

## 3. Model Building: Sentiment Classifier

Now we'll build a simple sentiment classifier using TF-IDF for feature extraction and a Naive Bayes classifier. This represents a basic version of what companies like Amazon might use as part of their review analysis systems.

In [None]:
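# Split the data into training and testing sets
# stratify=y keeps the positive/negative balance similar in both splits,
# which matters when the dataset is this small
X = df_binary['review_text']
y = df_binary['sentiment']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)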

print(f"Training set size: {len(X_train)}")
print(f"Testing set size: {len(X_test)}")

In [None]:
# Create a pipeline that first transforms text to TF-IDF vectors and then applies Naive Bayes
sentiment_pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(max_features=1000, stop_words='english')),
    ('classifier', MultinomialNB())
])

# Train the model
sentiment_pipeline.fit(X_train, y_train)

## 4. Model Evaluation

Let's evaluate our model to see how well it performs on the test set.
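
A caveat before reading the numbers: the test split here contains only a handful of reviews, so a single accuracy figure can shift a lot from one split to another. One common hedge is stratified k-fold cross-validation, which averages the score over several splits. The sketch below is a minimal illustration on an invented mini-dataset (the `texts` and `labels` lists are assumptions, not this notebook's data), using a default `TfidfVectorizer` rather than the settings used elsewhere in this notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Invented mini-dataset, for illustration only (4 positive, 4 negative)
texts = [
    "works great and looks amazing",
    "excellent value, highly recommend",
    "love it, best purchase this year",
    "reliable and easy to set up",
    "broke after two weeks",
    "terrible battery and poor quality",
    "waste of money, very disappointed",
    "arrived damaged and support was unhelpful",
]
labels = ["positive"] * 4 + ["negative"] * 4

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("classifier", MultinomialNB()),
])

# Stratified folds keep the class balance the same in every split
cv = StratifiedKFold(n_splits=4, shuffle=True, random_state=42)
scores = cross_val_score(pipeline, texts, labels, cv=cv, scoring="accuracy")

print("Accuracy per fold:", scores.round(2))
print(f"Mean accuracy: {scores.mean():.2f} (std {scores.std():.2f})")
```

The spread across folds gives a rough sense of how much trust a single split deserves.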

In [None]:
# Make predictions on the test set
y_pred = sentiment_pipeline.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")

# Display detailed classification report
print("\nClassification Report:")
print(classification_report(y_test, y_pred))

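# Create confusion matrix
# Passing labels explicitly fixes the row/column order so it matches the tick labels,
# even if one class happens to be missing from the test set
class_labels = ['negative', 'positive']
cm = confusion_matrix(y_test, y_pred, labels=class_labels)
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=['Negative', 'Positive'],
            yticklabels=['Negative', 'Positive'])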
plt.title('Confusion Matrix')
plt.ylabel('True Sentiment')
plt.xlabel('Predicted Sentiment')
plt.show()

## 5. Business Application: Analyzing New Reviews

Now let's see how our model would classify new, unseen product reviews. This is how Amazon might use sentiment analysis to automatically categorize incoming reviews.

In [None]:
# New sample reviews to classify
new_reviews = [
    "The product works as described but shipping took longer than expected.",
    "While the design is beautiful, the functionality is disappointing.",
    "I've had this for six months now and it still works like new. Very happy with my purchase.",
    "It's not terrible, but I wouldn't buy it again or recommend it to friends.",
    "Great customer service helped me resolve an issue quickly."
]

# Predict sentiment for new reviews
predictions = sentiment_pipeline.predict(new_reviews)

# Get probability scores for each class
prediction_proba = sentiment_pipeline.predict_proba(new_reviews)

# Display results with confidence scores
results = pd.DataFrame({
    'Review': new_reviews,
    'Predicted Sentiment': predictions,
    'Confidence': np.max(prediction_proba, axis=1)
})

results

## 6. Understanding Our Model

Let's analyze which words have the strongest influence on sentiment classification. This helps business users understand what aspects of products drive positive or negative sentiment.
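
The idea behind this analysis is to compare, for each word, its smoothed log-probability under the positive class with its log-probability under the negative class. The sketch below works through that comparison by hand on invented word counts. The `counts` table and `log_prob` helper are hypothetical, and the fitted model works on TF-IDF weights rather than raw counts, although it applies the same additive smoothing.

```python
import math

# Hypothetical per-class word counts, invented for illustration
counts = {
    "positive": {"great": 8, "broke": 0, "price": 3},
    "negative": {"great": 1, "broke": 6, "price": 3},
}
vocab = ["great", "broke", "price"]
alpha = 1.0  # add-one (Laplace) smoothing, the MultinomialNB default

def log_prob(word, label):
    """Smoothed log P(word | label)."""
    total = sum(counts[label].values())
    return math.log((counts[label][word] + alpha) / (total + alpha * len(vocab)))

# Score > 0: the word leans positive; score < 0: it leans negative
scores = {w: log_prob(w, "positive") - log_prob(w, "negative") for w in vocab}
for word, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{word:>6}: {score:+.2f}")
```

A word that appears about equally often in both classes, such as `price` here, scores close to zero and tells the classifier little.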

In [None]:
# Extract the feature names (words) from the TF-IDF vectorizer
feature_names = sentiment_pipeline.named_steps['tfidf'].get_feature_names_out()

# Get the per-class word log probabilities from the Naive Bayes classifier
# MultinomialNB has no coefficients as such; feature_log_prob_ holds log P(word | class)
classifier = sentiment_pipeline.named_steps['classifier']
log_probs = classifier.feature_log_prob_

# Calculate the difference between positive and negative class log probabilities
# This gives us a measure of how strongly each word indicates positive or negative sentiment
# Rows follow classifier.classes_ (sorted alphabetically), so look each class up by name
class_order = list(classifier.classes_)
coef_diff = log_probs[class_order.index('positive')] - log_probs[class_order.index('negative')]

# Get the top positive and negative words
top_positive_idx = coef_diff.argsort()[-10:]  # Top 10 positive words
top_negative_idx = coef_diff.argsort()[:10]   # Top 10 negative words

# Display top positive words
print("Top words indicating POSITIVE sentiment:")
for idx in reversed(top_positive_idx):
    print(f"  {feature_names[idx]}")

print("\nTop words indicating NEGATIVE sentiment:")
for idx in top_negative_idx:
    print(f"  {feature_names[idx]}")

## 7. Business Value and Implementation

### Business Benefits of Sentiment Analysis

1. **Early Problem Detection:**
   - Automatically flag negative reviews for immediate attention
   - Identify emerging product issues before they become widespread

2. **Customer Service Prioritization:**
   - Focus support resources on customers with negative experiences
   - Analyze sentiment trends to staff customer service appropriately

3. **Product Development Insights:**
   - Identify which product features generate positive sentiment
   - Understand pain points for future product iterations

4. **Marketing Opportunities:**
   - Highlight products with consistently positive sentiment
   - Extract positive testimonials automatically

### Implementation Considerations

While our simple model demonstrates the concept, a production-scale implementation would include:

- More sophisticated models (BERT, RoBERTa) for better accuracy
- Fine-tuning on domain-specific review data
- Multi-class sentiment (1-5 stars) rather than binary classification
- Aspect-based sentiment analysis to identify which specific product features receive positive/negative comments
- Integration with business intelligence dashboards for real-time monitoring
- Human review for edge cases where confidence is low
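
The last consideration above, human review when confidence is low, can be sketched as a simple threshold on the model's top class probability. This is a minimal illustration rather than a production design: the threshold value, the `route_review` helper, and the scored examples are all hypothetical.

```python
# Hypothetical threshold; in practice it would be tuned on validation data
CONFIDENCE_THRESHOLD = 0.75

def route_review(confidence, threshold=CONFIDENCE_THRESHOLD):
    """Return 'auto' when the model is confident enough, else 'human_review'."""
    return "auto" if confidence >= threshold else "human_review"

# Invented (label, confidence) pairs standing in for predict / predict_proba output
scored = [
    ("positive", 0.93),
    ("negative", 0.58),
    ("negative", 0.81),
    ("positive", 0.51),
]

for label, confidence in scored:
    print(f"{label:<8} {confidence:.2f} -> {route_review(confidence)}")
```

Raising the threshold sends more reviews to people and fewer to automation, so the right value depends on how costly a wrong automatic label is.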

## 8. Learning Challenge

Now it's your turn to experiment with sentiment analysis!

### Challenges:

1. Try writing your own product review and see how the model classifies it
2. Modify a review slightly to see if you can change its predicted sentiment
3. Think about edge cases that might be difficult for our simple model:
   - Reviews with both positive and negative aspects
   - Sarcastic reviews
   - Technical reviews with specialized terminology

Use the cell below to experiment:

In [None]:
# Add your own reviews to test
your_reviews = [
    # Add your test reviews here
    "Write your own review here to test the model"
]

# Predict sentiment
your_predictions = sentiment_pipeline.predict(your_reviews)
your_confidence = sentiment_pipeline.predict_proba(your_reviews).max(axis=1)

# Display results
for review, prediction, confidence in zip(your_reviews, your_predictions, your_confidence):
    print(f"Review: {review}")
    print(f"Predicted sentiment: {prediction} (confidence: {confidence:.2f})\n")

## Conclusion

In this notebook, we've built a simple sentiment analysis model that demonstrates how businesses like Amazon can automatically analyze customer reviews. Our approach uses text vectorization and a Naive Bayes classifier to determine whether a review expresses positive or negative sentiment.

Remember that real-world implementations would be more sophisticated, but the fundamental process remains similar:
1. Collect and preprocess text data
2. Transform text into numerical features
3. Train a classification model
4. Deploy the model to analyze new reviews automatically
5. Extract business insights from the results

By implementing sentiment analysis, businesses can systematically track customer satisfaction, identify product issues, and respond proactively to customer needs, all at a scale that would be impossible with manual review.