# Sentiment Analysis on Amazon Product Reviews

## 1. Dataset Overview
- **Dataset Description**:
  - Analyze an Amazon product review dataset containing textual reviews (`reviewText`) and corresponding sentiment labels (`Positive`).
  - Sentiment is binary: 1 for positive, 0 for negative.
- **Objective**:
  - Predict the sentiment of a product review based on its textual content.


In [None]:
# --- Import Libraries and Load Data ---
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Load the Amazon product reviews dataset
# Replace 'amazon_reviews.csv' with your actual file path
reviews_df = pd.read_csv('amazon_reviews.csv')

# Preview the first few rows
reviews_df.head()

In [None]:
# --- Data Exploration ---
# Check the shape of the dataset
print(f"Dataset shape: {reviews_df.shape}")

# Check for missing values
print("Missing values per column:")
print(reviews_df.isnull().sum())

# Check class distribution (labels are stored in the 'Positive' column)
if 'Positive' in reviews_df.columns:
    print("\nSentiment class distribution:")
    print(reviews_df['Positive'].value_counts())

| Unnamed: 0 | reviewText | Positive |
|---|---|---|
| 0 | This is a one of the best apps acording to a b... | 1 |
| 1 | This is a pretty good version of the game for ... | 1 |
| 2 | this is a really cool game. there are a bunch ... | 1 |
| 3 | This is a silly game and can be frustrating, b... | 1 |
| 4 | This is a terrific game on any pad. Hrs of fun... | 1 |


## 2. Data Preprocessing
- Handle missing values, if any.
- Perform text preprocessing on the `reviewText` column:
  - Convert text to lowercase.
  - Remove stop words, punctuation, and special characters.
  - Tokenize and lemmatize text data.
- Split the dataset into training and testing sets.
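
One caveat for the stop-word step: common English stop-word lists, including NLTK's, contain negations such as "not" and "no", so filtering them can erase the polarity of a phrase. The sketch below illustrates the effect with a small hand-written stop list. `DEMO_STOP_WORDS`, `NEGATIONS`, and `clean` are hypothetical names introduced for this illustration, not part of the notebook or of NLTK.

```python
import re

# Hypothetical mini stop list for illustration only; the notebook uses NLTK's list.
DEMO_STOP_WORDS = {"this", "is", "a", "the", "and", "not", "no", "it", "was"}
# Negation words we may want to keep for sentiment tasks (an assumption).
NEGATIONS = {"not", "no"}

def clean(text, stop_words):
    """Lowercase, strip non-letters, split on whitespace, drop stop words."""
    text = re.sub(r"[^a-z\s]", "", text.lower())
    return " ".join(tok for tok in text.split() if tok not in stop_words)

review = "This is not a good game."
without_negation = clean(review, DEMO_STOP_WORDS)
with_negation = clean(review, DEMO_STOP_WORDS - NEGATIONS)

print(without_negation)  # the negation is filtered out
print(with_negation)     # the negation survives
```

If negation turns out to matter for this dataset, one option is to subtract a set like `NEGATIONS` from the stop-word set before filtering; whether that helps should be checked on the held-out test set.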


In [None]:
# --- Text Preprocessing ---
import re

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Download NLTK resources if not already present
nltk.download('punkt')
nltk.download('punkt_tab')  # used by word_tokenize in newer NLTK releases
nltk.download('stopwords')

stop_words = set(stopwords.words('english'))

def preprocess_text(text):
    # Treat missing or non-string values (e.g. NaN) as empty text
    if not isinstance(text, str):
        return ''
    # Lowercase
    text = text.lower()
    # Keep only letters and whitespace (drops punctuation, digits, symbols)
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    # Tokenize
    tokens = word_tokenize(text)
    # Remove stopwords
    tokens = [word for word in tokens if word not in stop_words]
    return ' '.join(tokens)

# Apply preprocessing to the 'reviewText' column
reviews_df['clean_review'] = reviews_df['reviewText'].apply(preprocess_text)

# Preview cleaned reviews
reviews_df[['reviewText', 'clean_review']].head()

In [None]:
# --- Exploratory Data Analysis (EDA) ---
from wordcloud import WordCloud

# 1. Word Cloud for positive and negative reviews
plt.figure(figsize=(10,5))
if 'Positive' in reviews_df.columns:
    pos_text = ' '.join(reviews_df[reviews_df['Positive'] == 1]['clean_review'])
    neg_text = ' '.join(reviews_df[reviews_df['Positive'] == 0]['clean_review'])
    wordcloud_pos = WordCloud(width=800, height=400, background_color='white').generate(pos_text)
    wordcloud_neg = WordCloud(width=800, height=400, background_color='black').generate(neg_text)
    plt.subplot(1,2,1)
    plt.imshow(wordcloud_pos, interpolation='bilinear')
    plt.title('Positive Reviews Word Cloud')
    plt.axis('off')
    plt.subplot(1,2,2)
    plt.imshow(wordcloud_neg, interpolation='bilinear')
    plt.title('Negative Reviews Word Cloud')
    plt.axis('off')
    plt.show()

# 2. Sentiment distribution
if 'Positive' in reviews_df.columns:
    sns.countplot(x='Positive', data=reviews_df)
    plt.title('Sentiment Distribution')
    plt.show()

# 3. Review length analysis
reviews_df['review_length'] = reviews_df['clean_review'].apply(lambda x: len(x.split()))
sns.histplot(reviews_df['review_length'], bins=30)
plt.title('Distribution of Review Lengths')
plt.xlabel('Number of Words')
plt.show()

In [None]:
# --- Feature Extraction ---
from sklearn.feature_extraction.text import TfidfVectorizer

# Create TF-IDF features from cleaned reviews
vectorizer = TfidfVectorizer(max_features=5000)
X = vectorizer.fit_transform(reviews_df['clean_review']).toarray()

# Target variable: 'Positive' is already encoded as 1 (positive) / 0 (negative)
y = reviews_df['Positive']

# Preview feature matrix shape
print(f"TF-IDF feature matrix shape: {X.shape}")

## 3. Model Selection
- Choose at least three machine learning models for sentiment classification:
  - Statistical Models:
    - Logistic Regression
    - Random Forest
    - Support Vector Machine (SVM)
    - Naïve Bayes
    - Gradient Boosting (e.g., XGBoost, AdaBoost, CatBoost)
  - Neural Models:
    - LSTM (Long Short-Term Memory)
    - GRUs (Gated Recurrent Units)


In [None]:
# --- Model Training ---
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train Logistic Regression
logreg = LogisticRegression(max_iter=1000)
logreg.fit(X_train, y_train)

# Train Random Forest
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

In [None]:
# --- Model Evaluation ---
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Predict on test set
y_pred_logreg = logreg.predict(X_test)
y_pred_rf = rf.predict(X_test)

# Evaluate Logistic Regression
print("Logistic Regression Results:")
print(f"Accuracy: {accuracy_score(y_test, y_pred_logreg):.2f}")
print(classification_report(y_test, y_pred_logreg))
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred_logreg))

# Evaluate Random Forest
print("\nRandom Forest Results:")
print(f"Accuracy: {accuracy_score(y_test, y_pred_rf):.2f}")
print(classification_report(y_test, y_pred_rf))
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred_rf))

In [None]:
# --- Visualization of Model Performance ---
from sklearn.metrics import roc_curve, auc

# ROC Curve for Random Forest
y_prob_rf = rf.predict_proba(X_test)[:,1]
fpr, tpr, thresholds = roc_curve(y_test, y_prob_rf)
roc_auc = auc(fpr, tpr)
plt.figure(figsize=(8,6))
plt.plot(fpr, tpr, label=f'Random Forest (AUC = {roc_auc:.2f})')
plt.plot([0,1], [0,1], 'k--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.legend()
plt.show()

# Feature Importance for Random Forest
importances = rf.feature_importances_
indices = np.argsort(importances)[-20:][::-1]  # Top 20 features
plt.figure(figsize=(10,6))
plt.bar(range(len(indices)), importances[indices])
plt.xticks(range(len(indices)), [vectorizer.get_feature_names_out()[i] for i in indices], rotation=90)
plt.title('Top 20 Important Features (Random Forest)')
plt.tight_layout()
plt.show()

## 4. Model Training
- Train each selected model on the training dataset.
- Utilize vectorization techniques for text data:
  - TF-IDF (Term Frequency-Inverse Document Frequency)
  - Word embeddings (e.g., Word2Vec, GloVe)


In [None]:
# --- Predict Sentiment on New Reviews ---
# Example: Predict sentiment for new/unseen reviews
new_reviews = [
    "This product is amazing! Highly recommend.",
    "Terrible quality, very disappointed.",
    "Works as expected, good value for money."
]

# Preprocess new reviews
new_clean = [preprocess_text(review) for review in new_reviews]
new_features = vectorizer.transform(new_clean).toarray()

# Predict using Random Forest
new_pred = rf.predict(new_features)

# Map predictions to labels
label_map = {1: 'positive', 0: 'negative'}
for review, pred in zip(new_reviews, new_pred):
    print(f"Review: '{review}'\nPredicted Sentiment: {label_map[pred]}\n")

In [None]:
# --- Insights & Recommendations ---
# Summarize findings and provide actionable recommendations

print("\n--- Insights & Recommendations ---")
if accuracy_score(y_test, y_pred_rf) > 0.85:
    print("The Random Forest model performs well. Consider deploying it for automated sentiment analysis.")
else:
    print("Model accuracy is moderate. Try more advanced NLP models or tune hyperparameters.")

# Common positive and negative words
if 'Positive' in reviews_df.columns:
    from collections import Counter
    pos_words = ' '.join(reviews_df[reviews_df['Positive'] == 1]['clean_review']).split()
    neg_words = ' '.join(reviews_df[reviews_df['Positive'] == 0]['clean_review']).split()
    print("\nTop 10 Positive Words:", Counter(pos_words).most_common(10))
    print("Top 10 Negative Words:", Counter(neg_words).most_common(10))

print("\nRecommendation: Monitor product reviews regularly to identify customer pain points and improve product quality.")

In [None]:
# --- Hyperparameter Tuning ---
from sklearn.model_selection import GridSearchCV

# Random Forest parameter grid
grid_rf = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 10, 20],
    'min_samples_split': [2, 5, 10]
}
rf_search = GridSearchCV(RandomForestClassifier(random_state=42), grid_rf, cv=3, scoring='accuracy', n_jobs=-1)
rf_search.fit(X_train, y_train)
print(f"Best RF parameters: {rf_search.best_params_}")
print(f"Best RF cross-validated accuracy: {rf_search.best_score_:.2f}")

# Logistic Regression parameter grid
grid_lr = {
    'C': [0.01, 0.1, 1, 10],
    'solver': ['liblinear', 'lbfgs']
}
lr_search = GridSearchCV(LogisticRegression(max_iter=1000), grid_lr, cv=3, scoring='accuracy', n_jobs=-1)
lr_search.fit(X_train, y_train)
print(f"Best LR parameters: {lr_search.best_params_}")
print(f"Best LR cross-validated accuracy: {lr_search.best_score_:.2f}")

## 5. Formal Evaluation
- Evaluate the performance of each model on the testing set using the following metrics:
  - Accuracy
  - Precision
  - Recall
  - F1 Score
  - Confusion Matrix
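
The metrics listed above all derive from the four cells of the binary confusion matrix. The sketch below computes them by hand on a small made-up label set, which can help when reading the output of `classification_report`. The function `binary_metrics` and the example labels are hypothetical; the matrix layout (rows are true labels, columns are predicted labels, class 0 first) follows the convention used by `sklearn.metrics.confusion_matrix`.

```python
def binary_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall, F1 and the confusion matrix
    for binary labels, treating 1 as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)

    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        # Rows: true class (0, 1); columns: predicted class (0, 1)
        "confusion_matrix": [[tn, fp], [fn, tp]],
    }

# Made-up labels for illustration only
y_true_demo = [1, 1, 1, 1, 0, 0, 1, 0]
y_pred_demo = [1, 1, 1, 0, 1, 0, 1, 0]

metrics = binary_metrics(y_true_demo, y_pred_demo)
for name, value in metrics.items():
    print(f"{name}: {value}")
```

When one class dominates the dataset, accuracy alone can look high even for a model that rarely predicts the minority class, so precision, recall, and F1 per class are worth reporting alongside it.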


In [None]:
# --- Error Analysis ---
# Find misclassified examples in test set
misclassified_idx = np.where(y_test.to_numpy() != y_pred_rf)[0]
print(f"Number of misclassified reviews: {len(misclassified_idx)}")

# Show a few misclassified reviews. y_test keeps the original row labels,
# so its index maps each test position back to the matching row of reviews_df.
for i in misclassified_idx[:5]:
    row_label = y_test.index[i]
    print(f"Review: {reviews_df.loc[row_label, 'reviewText']}")
    print(f"True Sentiment: {label_map[y_test.iloc[i]]}, Predicted: {label_map[y_pred_rf[i]]}\n")

In [None]:
# --- Save Model and Vectorizer ---
import joblib

# Save Random Forest model
joblib.dump(rf, 'sentiment_rf_model.joblib')

# Save TF-IDF vectorizer
joblib.dump(vectorizer, 'tfidf_vectorizer.joblib')

print("Model and vectorizer saved successfully.")

In [None]:
# --- Load Model and Predict on New Data ---
import joblib

# Load saved Random Forest model and TF-IDF vectorizer
loaded_rf = joblib.load('sentiment_rf_model.joblib')
loaded_vectorizer = joblib.load('tfidf_vectorizer.joblib')

# Example: Predict sentiment for new/unseen reviews
new_reviews = [
    "The product exceeded my expectations!",
    "Not worth the money, very poor quality.",
    "Average experience, nothing special."
]

# Preprocess new reviews
new_clean = [preprocess_text(review) for review in new_reviews]
new_features = loaded_vectorizer.transform(new_clean).toarray()

# Predict using loaded model
new_pred = loaded_rf.predict(new_features)

# Map predictions to labels
label_map = {1: 'positive', 0: 'negative'}
for review, pred in zip(new_reviews, new_pred):
    print(f"Review: '{review}'\nPredicted Sentiment: {label_map[pred]}\n")

## 6. Hyperparameter Tuning
- Perform hyperparameter tuning for selected models using:
  - Grid Search
  - Random Search
- Explain the chosen hyperparameters and justify their selection.
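
The practical difference between the two strategies is how many candidates get evaluated. Grid search tries every combination, while random search evaluates a fixed number of sampled combinations. The sketch below counts candidates for the Random Forest grid used in this notebook; the sample size `n_iter = 5` and the seed are arbitrary assumptions. In scikit-learn, the random variant is available as `RandomizedSearchCV`, which takes an `n_iter` argument.

```python
import itertools
import random

# Same values as the Random Forest grid in the tuning cell
param_grid = {
    "n_estimators": [50, 100, 200],
    "max_depth": [None, 10, 20],
    "min_samples_split": [2, 5, 10],
}

keys = list(param_grid)

# Grid search: the full Cartesian product of all parameter values
all_combos = [
    dict(zip(keys, values))
    for values in itertools.product(*(param_grid[k] for k in keys))
]

# Random search: a fixed-size sample of distinct candidates from that product
n_iter = 5                 # assumed budget, for illustration
rng = random.Random(42)    # seeded for reproducibility
sampled = rng.sample(all_combos, n_iter)

cv_folds = 3
grid_fits = len(all_combos) * cv_folds
random_fits = len(sampled) * cv_folds

print(f"Grid search candidates: {len(all_combos)} ({grid_fits} fits with cv={cv_folds})")
print(f"Random search candidates: {len(sampled)} ({random_fits} fits with cv={cv_folds})")
for params in sampled:
    print(params)
```

Random search trades exhaustiveness for a controllable budget, which tends to matter as grids grow, since each extra parameter multiplies the number of grid candidates.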


In [None]:
# --- Comparative Analysis ---
# Compare performance of Logistic Regression and Random Forest

models = ['Logistic Regression', 'Random Forest']
accuracies = [accuracy_score(y_test, y_pred_logreg), accuracy_score(y_test, y_pred_rf)]

plt.figure(figsize=(6,4))
sns.barplot(x=models, y=accuracies)
plt.ylim(0,1)
plt.ylabel('Accuracy')
plt.title('Model Accuracy Comparison')
plt.show()

print(f"Logistic Regression Accuracy: {accuracies[0]:.2f}")
print(f"Random Forest Accuracy: {accuracies[1]:.2f}")

# Print classification reports for both models
print("\nLogistic Regression Classification Report:")
print(classification_report(y_test, y_pred_logreg))
print("\nRandom Forest Classification Report:")
print(classification_report(y_test, y_pred_rf))

## 7. Comparative Analysis
- Compare the performance of all models based on evaluation metrics.
- Identify strengths and weaknesses of each model (e.g., speed, accuracy, interpretability).


In [None]:
# --- Conclusion & Comments ---
# Summarize findings and provide final comments

print("\n--- Conclusion & Comments ---")
if accuracies[1] > accuracies[0]:
    print("Random Forest outperformed Logistic Regression in sentiment classification.")
else:
    print("Logistic Regression outperformed Random Forest in sentiment classification.")

print("To improve on these results, consider:")
print("- Using advanced NLP models (e.g., BERT, LSTM)")
print("- More feature engineering (e.g., sentiment lexicons, n-grams)")
print("- Hyperparameter tuning and cross-validation")
print("- Addressing class imbalance if present")

print("Regular monitoring of product reviews can help businesses respond to customer feedback and improve products.")

In [None]:
# --- Comments & Suggestions for Future Work ---
# Provide suggestions for further analysis and improvements

print("\n--- Suggestions for Future Work ---")
print("1. Try deep learning models for better accuracy.")
print("2. Use more sophisticated text preprocessing (lemmatization, stemming).")
print("3. Explore unsupervised sentiment clustering.")
print("4. Visualize sentiment trends over time.")
print("5. Integrate with real-time review monitoring systems.")

## 8. Conclusion & Comments
- Summarize the findings of the project.
- Provide insights into the challenges faced during data preprocessing, model training, and evaluation.
- Highlight key lessons learned.
- Add clear and concise comments to the code for each step of the project.
- Highlight key results, visualizations, and model comparisons.
