# Sentiment Analysis for Product Reviews

This notebook demonstrates a practical implementation of sentiment analysis for product reviews, loosely modeled on the kind of review analysis a large retailer such as Amazon might run. We'll build a simple classifier that labels each review as positive, negative, or neutral.

## Business Context

A retailer at Amazon's scale receives far more customer reviews than anyone could read by hand. Automatically analyzing the sentiment of these reviews makes it possible to:
- Identify products with quality issues
- Highlight highly rated products
- Track customer satisfaction trends
- Feed data into recommendation systems

By the end of this notebook, you'll understand how sentiment analysis works and how to implement a basic version yourself.

## 1. Setup and Data Loading

First, let's import the necessary libraries and load a sample dataset of product reviews.

In [None]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# Download required NLTK data
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('punkt')
nltk.download('punkt_tab')  # newer NLTK releases load the word tokenizer from 'punkt_tab'

# Set plot style
sns.set_style("whitegrid")
plt.rcParams['figure.figsize'] = (12, 8)

In [None]:
# Load sample dataset (in practice, this would be your own data)
# For this demo, we'll use a subset of Amazon product reviews
# The dataset format should have at least 'review_text' and 'rating' columns

# Sample code to load data:
# df = pd.read_csv('amazon_reviews_sample.csv')

# For demonstration, we'll create a small sample dataset
data = {
    'review_text': [
        "This product exceeded my expectations. The quality is outstanding and it works perfectly.",
        "Absolutely terrible. Broke after two days and customer service was unhelpful.",
        "Good value for the price, but the instructions were confusing.",
        "Love this product! Would definitely buy again and recommend to friends.",
        "Mediocre at best. Does the job but nothing special.",
        "Waste of money. Don't buy this product.",
        "Decent quality but shipping took too long.",
        "The perfect solution to my problem. Extremely satisfied with this purchase.",
        "Received a damaged item. Very disappointed.",
        "Not what I expected based on the description, but it's okay."
    ],
    'rating': [5, 1, 3, 5, 3, 1, 3, 5, 1, 2]
}

df = pd.DataFrame(data)

# Display the first few rows
df.head()

## 2. Data Preparation

Now, let's prepare our data for sentiment analysis. We'll:
1. Convert ratings to sentiment categories
2. Create a text preprocessing function
3. Split the data into training and testing sets

In [None]:
# Convert ratings to sentiment categories
# This is a simplified approach - in real applications, you might use the original ratings
def convert_rating_to_sentiment(rating):
    if rating >= 4:  # 4-5 stars
        return 'positive'
    elif rating <= 2:  # 1-2 stars
        return 'negative'
    else:  # 3 stars
        return 'neutral'

df['sentiment'] = df['rating'].apply(convert_rating_to_sentiment)

# Display the sentiment distribution
print("Sentiment distribution:")
print(df['sentiment'].value_counts())

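# Plot the sentiment distribution
# value_counts() sorts by frequency, so fix the bar order to keep each color on its label
sentiment_order = ['positive', 'neutral', 'negative']
plt.figure(figsize=(8, 6))
df['sentiment'].value_counts().reindex(sentiment_order, fill_value=0).plot(kind='bar', color=['green', 'gray', 'red'])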
plt.title('Sentiment Distribution')
plt.ylabel('Count')
plt.xlabel('Sentiment')
plt.show()

In [None]:
# Create a text preprocessing function
# Build the lemmatizer and stopword set once instead of on every call
lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))

def preprocess_text(text):
    """Lowercase, tokenize, drop non-alphabetic tokens and stopwords, then lemmatize."""
    # Tokenize and lowercase; keep alphabetic tokens only (drops punctuation and numbers)
    tokens = nltk.word_tokenize(text.lower())
    tokens = [token for token in tokens if token.isalpha()]
    
    # Remove stopwords and lemmatize
    # Note: the NLTK English stopword list includes negations such as "not",
    # so "not good" is reduced to "good" by this step
    tokens = [lemmatizer.lemmatize(token) for token in tokens if token not in stop_words]
    
    # Join tokens back into a string
    return ' '.join(tokens)

# Apply preprocessing to the review text
df['processed_text'] = df['review_text'].apply(preprocess_text)

# Display original and processed text for comparison
comparison = pd.DataFrame({
    'Original Text': df['review_text'],
    'Processed Text': df['processed_text'],
    'Sentiment': df['sentiment']
})
comparison.head(3)
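One side effect of stopword removal is worth seeing in isolation: NLTK's English stopword list includes negation words such as "not" and "no", so filtering stopwords can erase exactly the words that flip a review's meaning. The sketch below is a simplified stand-in for `preprocess_text`. It assumes a small hand-picked stopword set (`DEMO_STOPWORDS`, a hypothetical subset rather than the full NLTK list) and uses plain string splitting instead of `nltk.word_tokenize`, with no lemmatization.

```python
import string

# Hypothetical subset of an English stopword list, for illustration only
DEMO_STOPWORDS = {"this", "is", "not", "a", "the", "it", "was", "no"}
# Negation words we may want to keep out of the stopword filter
NEGATIONS = {"not", "no"}

def simple_preprocess(text, stop_words):
    """Lowercase, strip punctuation, and drop stopwords (no lemmatization)."""
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(tok for tok in cleaned.split() if tok not in stop_words)

review = "This is not a good product."
with_negation_removed = simple_preprocess(review, DEMO_STOPWORDS)
with_negation_kept = simple_preprocess(review, DEMO_STOPWORDS - NEGATIONS)

print(with_negation_removed)  # good product
print(with_negation_kept)     # not good product
```

If negation matters for your data, one option is to subtract a small negation set from the stopword list before filtering, as the second call does.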

In [None]:
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    df['processed_text'], 
    df['sentiment'], 
    test_size=0.3,  # 30% for testing
    random_state=42,  # For reproducibility
    stratify=df['sentiment']  # Maintain sentiment distribution
)

print(f"Training data size: {X_train.shape[0]} reviews")
print(f"Testing data size: {X_test.shape[0]} reviews")
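To make the effect of `stratify` concrete, here is a minimal hand-rolled sketch of a stratified split. It is not scikit-learn's algorithm: the helper `stratified_split` is hypothetical and simply samples a rounded fraction of each class (at least one review per class). The labels are the ones that follow from the ratings in the sample data.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_fraction, seed=42):
    """Return (train_indices, test_indices), sampling test rows from each class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        # Take a rounded share of each class, and at least one example
        n_test = max(1, round(test_fraction * len(indices)))
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Sentiment labels derived from the ratings [5, 1, 3, 5, 3, 1, 3, 5, 1, 2]
labels = ["positive", "negative", "neutral", "positive", "neutral",
          "negative", "neutral", "positive", "negative", "negative"]

train_idx, test_idx = stratified_split(labels, test_fraction=0.3)
print(len(train_idx), len(test_idx))           # 7 3
print(sorted(labels[i] for i in test_idx))     # ['negative', 'neutral', 'positive']
```

With ten reviews and three classes, each class ends up with a single test example, so any per-class metric computed on this split rests on one review.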

## 3. Feature Engineering

Now we'll convert our text data into numerical features that can be used by machine learning algorithms. We'll use two popular approaches:
1. Bag of Words (CountVectorizer)
2. TF-IDF (Term Frequency-Inverse Document Frequency)

In [None]:
# Bag of Words approach
count_vectorizer = CountVectorizer(ngram_range=(1, 2))  # Include both unigrams and bigrams
X_train_counts = count_vectorizer.fit_transform(X_train)

# Show the vocabulary size
print(f"Vocabulary size: {len(count_vectorizer.vocabulary_)}")

# Show the first feature names (unigrams and bigrams) in column order
print("\nFirst 20 features in the vocabulary:")
print(list(count_vectorizer.get_feature_names_out()[:20]))

# Show the sparse matrix shape
print(f"\nFeature matrix shape: {X_train_counts.shape}")
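For intuition about what the count matrix contains, the following sketch builds a unigram-plus-bigram bag of words using only the standard library. It is an approximation under stated assumptions: tokens come from a plain `split()`, whereas `CountVectorizer` applies its own token pattern, and the three toy documents are made up for illustration.

```python
from collections import Counter

def ngrams(tokens, n_min=1, n_max=2):
    """Yield every n-gram from n_min to n_max words long as a space-joined string."""
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            yield " ".join(tokens[i:i + n])

# Toy documents, already preprocessed
docs = ["love product", "waste money", "love love quality"]

# Vocabulary: every distinct n-gram, sorted, mapped to a column index
vocab = sorted({gram for doc in docs for gram in ngrams(doc.split())})
index = {term: col for col, term in enumerate(vocab)}

# Count matrix: one row per document, one column per vocabulary term
matrix = []
for doc in docs:
    counts = Counter(ngrams(doc.split()))
    matrix.append([counts.get(term, 0) for term in vocab])

print(vocab)
for row in matrix:
    print(row)
```

Each row is mostly zeros, which is why scikit-learn stores the result as a sparse matrix.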

In [None]:
# TF-IDF approach
tfidf_vectorizer = TfidfVectorizer(ngram_range=(1, 2))
X_train_tfidf = tfidf_vectorizer.fit_transform(X_train)

# Convert to DataFrame for better visualization
feature_names = tfidf_vectorizer.get_feature_names_out()
df_tfidf = pd.DataFrame(X_train_tfidf.toarray(), columns=feature_names)

# Show a sample of the TF-IDF matrix
print("Sample of TF-IDF features for the first review:")
# Get non-zero features for the first document
non_zero_cols = df_tfidf.iloc[0].loc[df_tfidf.iloc[0] > 0].sort_values(ascending=False)
print(non_zero_cols.head(10))
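The weights printed above can be reproduced by hand on a toy example. This sketch assumes the settings scikit-learn documents as defaults for `TfidfVectorizer`: raw term counts, smoothed IDF of the form ln((1 + n) / (1 + df)) + 1, and L2 normalization of each row. The documents and helper names are illustrative.

```python
import math
from collections import Counter

# Toy tokenized documents
docs = [["love", "product"], ["waste", "money"], ["love", "quality"]]
n_docs = len(docs)

# Document frequency: in how many documents each term appears
doc_freq = Counter(term for doc in docs for term in set(doc))

def smoothed_idf(term):
    """Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1."""
    return math.log((1 + n_docs) / (1 + doc_freq[term])) + 1

def tfidf_vector(doc):
    """TF-IDF weights for one document, scaled to unit Euclidean length."""
    term_counts = Counter(doc)
    weights = {term: count * smoothed_idf(term) for term, count in term_counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}

first = tfidf_vector(docs[0])
for term, weight in sorted(first.items(), key=lambda kv: -kv[1]):
    print(f"{term}: {weight:.3f}")
```

"product" appears in one document and "love" in two, so "product" receives the larger weight: rarer terms are treated as more informative.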

## 4. Building the Sentiment Analysis Model

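We'll build and compare two models using scikit-learn's Pipeline. With the ten-review demo dataset the test set holds only three reviews, so the scores below illustrate the workflow rather than measure real performance: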
1. Naive Bayes with Count Vectorizer
2. Logistic Regression with TF-IDF

In [None]:
# Model 1: Naive Bayes with Count Vectorizer
nb_pipeline = Pipeline([
    ('vectorizer', CountVectorizer(ngram_range=(1, 2))),
    ('classifier', MultinomialNB())
])

# Train the model
nb_pipeline.fit(X_train, y_train)

# Make predictions on the test set
y_pred_nb = nb_pipeline.predict(X_test)

# Evaluate the model
print("Naive Bayes Model Performance:")
print(f"Accuracy: {accuracy_score(y_test, y_pred_nb):.4f}")
print("\nClassification Report:")
print(classification_report(y_test, y_pred_nb))
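To show what the Naive Bayes classifier computes, here is a compact standard-library sketch of multinomial Naive Bayes with add-one (Laplace) smoothing, which corresponds to `alpha=1.0`. The functions `train_nb` and `predict_nb` and the toy training data are illustrative, not part of scikit-learn.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels, alpha=1.0):
    """Estimate log priors and smoothed log likelihoods from tokenized documents."""
    vocab = sorted({tok for doc in docs for tok in doc})
    class_docs = Counter(labels)
    token_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        token_counts[label].update(doc)

    log_prior = {c: math.log(n / len(docs)) for c, n in class_docs.items()}
    log_likelihood = {}
    for c in class_docs:
        # Smoothing adds alpha to every token count, so no probability is zero
        total = sum(token_counts[c].values()) + alpha * len(vocab)
        log_likelihood[c] = {
            tok: math.log((token_counts[c][tok] + alpha) / total) for tok in vocab
        }
    return log_prior, log_likelihood

def predict_nb(doc, log_prior, log_likelihood):
    """Return the class with the highest posterior log score."""
    scores = {}
    for c in log_prior:
        # Tokens never seen in training are skipped
        scores[c] = log_prior[c] + sum(
            log_likelihood[c][tok] for tok in doc if tok in log_likelihood[c]
        )
    return max(scores, key=scores.get)

# Toy training data, made up for illustration
train_docs = [["love", "product"], ["great", "quality"],
              ["waste", "money"], ["broke", "disappointed"]]
train_labels = ["positive", "positive", "negative", "negative"]

log_prior, log_likelihood = train_nb(train_docs, train_labels)
print(predict_nb(["love", "quality"], log_prior, log_likelihood))  # positive
print(predict_nb(["waste", "broke"], log_prior, log_likelihood))   # negative
```

Working in log space turns the product of word probabilities into a sum, which avoids numerical underflow on longer reviews.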

In [None]:
# Model 2: Logistic Regression with TF-IDF
lr_pipeline = Pipeline([
    ('vectorizer', TfidfVectorizer(ngram_range=(1, 2))),
    ('classifier', LogisticRegression(max_iter=1000))
])

# Train the model
lr_pipeline.fit(X_train, y_train)

# Make predictions on the test set
y_pred_lr = lr_pipeline.predict(X_test)

# Evaluate the model
print("Logistic Regression Model Performance:")
print(f"Accuracy: {accuracy_score(y_test, y_pred_lr):.4f}")
print("\nClassification Report:")
print(classification_report(y_test, y_pred_lr))

In [None]:
# Visualize confusion matrices for both models
fig, axes = plt.subplots(1, 2, figsize=(15, 6))

# Naive Bayes confusion matrix
cm_nb = confusion_matrix(y_test, y_pred_nb)
sns.heatmap(cm_nb, annot=True, fmt='d', cmap='Blues', xticklabels=nb_pipeline.classes_, 
            yticklabels=nb_pipeline.classes_, ax=axes[0])
axes[0].set_ylabel('True Sentiment')
axes[0].set_xlabel('Predicted Sentiment')
axes[0].set_title('Naive Bayes Model')

# Logistic Regression confusion matrix
cm_lr = confusion_matrix(y_test, y_pred_lr)
sns.heatmap(cm_lr, annot=True, fmt='d', cmap='Blues', xticklabels=lr_pipeline.classes_, 
            yticklabels=lr_pipeline.classes_, ax=axes[1])
axes[1].set_ylabel('True Sentiment')
axes[1].set_xlabel('Predicted Sentiment')
axes[1].set_title('Logistic Regression Model')

plt.tight_layout()
plt.show()
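The heatmaps and the classification report are two views of the same counts. As a sketch, the code below builds a confusion matrix by hand (rows are true labels, columns are predicted labels, matching scikit-learn's convention) and derives per-class precision and recall from it. The label lists are made up for illustration and are not output from the models above.

```python
def confusion_counts(y_true, y_pred, labels):
    """Build a matrix where rows are true labels and columns are predicted labels."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for true_label, pred_label in zip(y_true, y_pred):
        matrix[index[true_label]][index[pred_label]] += 1
    return matrix

def precision_recall(matrix, i):
    """Precision and recall for the class at position i."""
    true_pos = matrix[i][i]
    predicted = sum(row[i] for row in matrix)  # column sum: everything predicted as i
    actual = sum(matrix[i])                    # row sum: everything truly i
    precision = true_pos / predicted if predicted else 0.0
    recall = true_pos / actual if actual else 0.0
    return precision, recall

labels = ["negative", "neutral", "positive"]
# Hypothetical true and predicted labels
y_true = ["negative", "negative", "neutral", "positive", "positive", "neutral"]
y_pred = ["negative", "neutral", "neutral", "positive", "negative", "neutral"]

cm = confusion_counts(y_true, y_pred, labels)
for label, row in zip(labels, cm):
    print(f"{label:>8}: {row}")
for i, label in enumerate(labels):
    p, r = precision_recall(cm, i)
    print(f"{label:>8}: precision={p:.2f} recall={r:.2f}")
```

Off-diagonal cells show which classes get confused with each other, which an overall accuracy figure hides.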

## 5. Feature Importance Analysis

Let's examine which words and bigrams the logistic regression model weights most heavily for each sentiment class:

In [None]:
# Extract coefficients from logistic regression model
vectorizer = lr_pipeline.named_steps['vectorizer']
classifier = lr_pipeline.named_steps['classifier']

# Get feature names
feature_names = vectorizer.get_feature_names_out()

# For multiclass, we need to extract coefficients for each class
coefs = classifier.coef_

# Create a DataFrame with coefficients
coefficients_df = pd.DataFrame()

for i, sentiment in enumerate(classifier.classes_):
    coefficients_df[f'{sentiment}_coef'] = coefs[i]

coefficients_df['feature'] = feature_names

# Sort features by their importance for each sentiment
top_features = {}
for sentiment in classifier.classes_:
    # Get top positive coefficients
    top_positive = coefficients_df.sort_values(f'{sentiment}_coef', ascending=False)
    top_features[f'top_{sentiment}'] = top_positive[['feature', f'{sentiment}_coef']].head(10)

# Display top features for each sentiment
for sentiment, features_df in top_features.items():
    print(f"\n{sentiment.replace('top_', 'Top 10 Features for ')} Sentiment:")
    print(features_df)

In [None]:
# Visualize the top features for each sentiment class
fig, axes = plt.subplots(1, len(classifier.classes_), figsize=(18, 6))

for i, sentiment in enumerate(classifier.classes_):
    # Get the top features DataFrame
    top_df = top_features[f'top_{sentiment}']
    
    # Create horizontal bar chart
    ax = axes[i]
    ax.barh(top_df['feature'], top_df[f'{sentiment}_coef'])
    ax.set_title(f'Top 10 Features for {sentiment.capitalize()} Sentiment')
    ax.set_xlabel('Coefficient Value')
    
    # Invert y-axis to have highest coefficient at the top
    ax.invert_yaxis()

plt.tight_layout()
plt.show()

## 6. Making Predictions on New Reviews

Now let's use our model to analyze sentiment in new, unseen reviews:

In [None]:
# Define a function to predict sentiment with probability
def predict_sentiment(review_text, model=lr_pipeline):
    """Predict sentiment of a raw review with probability scores"""
    # Apply the same preprocessing the model saw during training
    processed = preprocess_text(review_text)
    
    # Get the prediction
    sentiment = model.predict([processed])[0]
    
    # Get probability scores for each class
    proba = model.predict_proba([processed])[0]
    proba_dict = dict(zip(model.classes_, proba))
    
    return {
        'text': review_text,
        'predicted_sentiment': sentiment,
        'confidence': proba_dict[sentiment],
        'probability_scores': proba_dict
    }

# Test on new reviews
new_reviews = [
    "This product is amazing! It works exactly as described and the quality is excellent.",
    "I'm very disappointed with this purchase. It broke after just a week of use.",
    "The product is okay. Not great, not terrible. It does what it's supposed to do.",
    "While the shipping was slow, the product itself is pretty good.",
    "This is the worst purchase I've ever made. Avoid at all costs!"
]

# Analyze each review
for review in new_reviews:
    result = predict_sentiment(review)
    print(f"\nReview: {result['text']}")
    print(f"Predicted sentiment: {result['predicted_sentiment']} (Confidence: {result['confidence']:.2f})")
    print("Probability scores:")
    for sentiment, score in result['probability_scores'].items():
        print(f"  {sentiment}: {score:.2f}")
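The probability scores come from the classifier's `predict_proba`. For a multiclass logistic regression trained with a multinomial (softmax) objective, which we assume here and which is our understanding of the default in recent scikit-learn releases, each class gets a linear score from the feature weights, and softmax turns those scores into probabilities. The weights, intercepts, and feature values below are hypothetical, not taken from the trained model.

```python
import math

def softmax(scores):
    """Convert raw class scores into probabilities that sum to 1."""
    shift = max(scores.values())  # subtract the max for numerical stability
    exps = {label: math.exp(s - shift) for label, s in scores.items()}
    total = sum(exps.values())
    return {label: e / total for label, e in exps.items()}

# Hypothetical per-class weights for two features
weights = {
    "negative": {"broke": 1.2, "great": -0.8},
    "neutral":  {"broke": -0.1, "great": -0.2},
    "positive": {"broke": -1.1, "great": 1.0},
}
intercepts = {"negative": 0.1, "neutral": 0.0, "positive": -0.1}
features = {"broke": 0.7, "great": 0.0}  # e.g. TF-IDF values for one review

# Linear score per class: intercept + sum of weight * feature value
scores = {
    label: intercepts[label] + sum(w * features.get(term, 0.0) for term, w in terms.items())
    for label, terms in weights.items()
}
probabilities = softmax(scores)
predicted = max(probabilities, key=probabilities.get)
print(predicted, round(probabilities[predicted], 2))
```

The "confidence" reported by `predict_sentiment` is this highest probability. It reflects how the model ranks the classes, not how likely the prediction is to be correct on data unlike the training set.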

## 7. Business Application Demonstration

Let's simulate how Amazon might use sentiment analysis in their business processes:

In [None]:
# Simulate a batch of new product reviews
product_reviews = [
    {"product_id": "B001", "product_name": "Wireless Headphones", "review": "Excellent sound quality and battery life. Very comfortable to wear."},
    {"product_id": "B001", "product_name": "Wireless Headphones", "review": "Decent headphones but they hurt my ears after an hour."},
    {"product_id": "B001", "product_name": "Wireless Headphones", "review": "Battery drains too quickly. Not worth the money."},
    {"product_id": "B002", "product_name": "Smart Watch", "review": "Love this watch! The fitness tracking is accurate and the battery lasts for days."},
    {"product_id": "B002", "product_name": "Smart Watch", "review": "Screen cracked after just one week. Poor quality."},
    {"product_id": "B002", "product_name": "Smart Watch", "review": "Good features but the app is buggy."},
    {"product_id": "B003", "product_name": "Bluetooth Speaker", "review": "Amazing sound for such a small speaker. Highly recommend!"},
    {"product_id": "B003", "product_name": "Bluetooth Speaker", "review": "Decent speaker but the connection keeps dropping."},
    {"product_id": "B003", "product_name": "Bluetooth Speaker", "review": "Terrible build quality. It stopped working after a month."}
]

# Create a DataFrame
reviews_df = pd.DataFrame(product_reviews)

# Add sentiment predictions (run the model once per review, then unpack the results)
predictions = reviews_df['review'].apply(predict_sentiment)
reviews_df['sentiment'] = predictions.apply(lambda p: p['predicted_sentiment'])
reviews_df['confidence'] = predictions.apply(lambda p: p['confidence'])

# Display the results
print("Product Reviews with Sentiment Analysis:")
reviews_df

In [None]:
# Analyze sentiment by product
product_sentiment = reviews_df.groupby(['product_id', 'product_name', 'sentiment']).size().unstack(fill_value=0)

# Calculate sentiment percentages
product_sentiment_pct = product_sentiment.div(product_sentiment.sum(axis=1), axis=0) * 100

# Create a sentiment score (-100 to +100 scale)
def calculate_sentiment_score(row):
    # If columns don't exist, use 0
    positive = row.get('positive', 0)
    negative = row.get('negative', 0)
    neutral = row.get('neutral', 0)
    
    # Calculate weighted score
    total = positive + negative + neutral
    if total == 0:
        return 0
    return ((positive * 100) - (negative * 100)) / total

# Add sentiment score
product_sentiment['sentiment_score'] = product_sentiment.apply(calculate_sentiment_score, axis=1)

# Display product sentiment analysis
print("Product Sentiment Analysis:")
print(product_sentiment)
print("\nProduct Sentiment Percentages:")
print(product_sentiment_pct)

# Visualize the sentiment distribution by product
# unstack() orders the columns alphabetically (negative, neutral, positive),
# so select and color them explicitly to keep each color on its label
color_map = {'positive': 'green', 'neutral': 'gray', 'negative': 'red'}
plot_cols = [col for col in color_map if col in product_sentiment_pct.columns]
product_sentiment_pct[plot_cols].plot(
    kind='bar', stacked=True, figsize=(12, 6),
    color=[color_map[col] for col in plot_cols]
)
plt.title('Sentiment Distribution by Product')
plt.xlabel('Product')
plt.ylabel('Percentage')
plt.xticks(rotation=45)
plt.legend(title='Sentiment')
plt.tight_layout()
plt.show()
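The sentiment score is the share of positive reviews minus the share of negative reviews, scaled to the range -100 to +100. A standalone sketch with hypothetical review counts shows how neutral reviews pull the score toward zero:

```python
def sentiment_score(positive=0, neutral=0, negative=0):
    """Net sentiment on a -100 to +100 scale; neutral reviews dilute the score."""
    total = positive + neutral + negative
    if total == 0:
        return 0.0
    return 100 * (positive - negative) / total

# Hypothetical review counts per product
print(sentiment_score(positive=3))                         # 100.0
print(sentiment_score(positive=1, neutral=1, negative=1))  # 0.0
print(sentiment_score(positive=1, negative=3))             # -50.0
print(sentiment_score(positive=2, neutral=2))              # 50.0
```

Because the score is a ratio, a product with three reviews and one with three thousand can share the same value, so it is worth reporting the review count alongside it.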

In [None]:
# Identify common themes in negative reviews (simplified version of aspect-based sentiment analysis)
negative_reviews = reviews_df[reviews_df['sentiment'] == 'negative']

print("Negative Reviews Analysis:")
for _, row in negative_reviews.iterrows():
    print(f"\nProduct: {row['product_name']} (ID: {row['product_id']})")
    print(f"Review: {row['review']}")
    
    # In a real system, this would use more sophisticated techniques
    # For simplicity, we'll check for issue keywords as substrings, so 'crack' also
    # matches 'cracked' (and, less helpfully, 'fit' also matches 'fitness')
    review_lower = row['review'].lower()
    issues = []
    if any(word in review_lower for word in ['battery', 'drain', 'charge', 'power']):
        issues.append('Battery Issues')
    if any(word in review_lower for word in ['broke', 'broken', 'crack', 'damage', 'quality']):
        issues.append('Quality/Durability Issues')
    if any(word in review_lower for word in ['app', 'software', 'bug', 'connection', 'bluetooth']):
        issues.append('Software/Connectivity Issues')
    if any(word in review_lower for word in ['uncomfortable', 'pain', 'hurt', 'fit']):
        issues.append('Comfort Issues')
        
    if issues:
        print(f"Potential Issues: {', '.join(issues)}")
    else:
        print("No specific issues identified.")
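Substring checks like the ones above are easy to write but can fire inside unrelated words. The sketch below contrasts them with a regular-expression match on whole words plus a few common suffixes. The suffix list (`s`, `ed`, `ing`) is an assumption chosen for illustration. It will miss forms such as "fitted", and the sample review is made up.

```python
import re

def substring_match(text, keywords):
    """Keywords found anywhere in the text, including inside longer words."""
    lowered = text.lower()
    return [kw for kw in keywords if kw in lowered]

def word_match(text, keywords):
    """Keywords found as whole words, optionally followed by a common suffix."""
    lowered = text.lower()
    return [
        kw for kw in keywords
        if re.search(rf"\b{re.escape(kw)}(s|ed|ing)?\b", lowered)
    ]

comfort_keywords = ["uncomfortable", "pain", "hurt", "fit"]
review = "The fitness tracking is inaccurate and the strap hurts."

print(substring_match(review, comfort_keywords))  # ['hurt', 'fit']
print(word_match(review, comfort_keywords))       # ['hurt']
```

The substring version tags the review with "fit" because of "fitness", while the word-boundary version reports only "hurt".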

## 8. Learning Challenge

Now it's your turn to apply what you've learned! Try to complete the following tasks:

1. Modify the preprocessing function to improve performance (e.g., handle emoticons, exclamation marks)
2. Try a different classifier (e.g., SVM, Random Forest)
3. Implement a function to extract the specific aspect being discussed in a review (aspect-based sentiment analysis)

### Challenge Code Template

In [None]:
# 1. Enhanced preprocessing function
def enhanced_preprocess_text(text):
    """Enhanced preprocessing with emoticon handling and exclamation preservation"""
    # Your implementation here
    processed_text = text  # placeholder so the cell runs; replace with your own logic
    
    return processed_text

# 2. Try a different classifier
# Your implementation here

# 3. Aspect-based sentiment analysis
def extract_aspects(review_text):
    """Extract aspects (product features) from review text and their sentiment"""
    # Your implementation here
    aspects = []  # placeholder so the cell runs; replace with your own logic
    
    return aspects

## 9. Business Implications

Let's discuss how sentiment analysis creates business value:

1. **Product Development**: Identifying product issues from negative reviews helps prioritize improvements
2. **Customer Service**: Automatically flagging strongly negative reviews for response
3. **Marketing**: Extracting positive feedback for testimonials and marketing materials
4. **Competitive Analysis**: Comparing sentiment across competing products
5. **Recommendation Systems**: Incorporating sentiment to improve product recommendations

For a retailer operating at Amazon's scale, sentiment analysis is less a technical exercise than a business capability that can inform decisions across the organization. Automating the reading of millions of customer opinions makes it possible to spot trends, address issues, and improve the customer experience at scale.

## Conclusion

In this notebook, we've built a basic sentiment analysis system for product reviews. We've seen how to:

1. Preprocess text data for sentiment analysis
2. Convert text to numerical features using vectorization
3. Train machine learning models to classify sentiment
4. Evaluate model performance
5. Apply the model to analyze new reviews
6. Extract business insights from sentiment analysis

While our simple model demonstrates the core concepts, production systems at companies like Amazon would typically add more advanced techniques such as:
- Deep learning models (BERT, RoBERTa)
- Aspect-based sentiment analysis
- Multilingual support
- Real-time processing pipelines

However, the fundamental principles remain the same, and you now understand how sentiment analysis works and how it creates business value.