# Customer Sentiment Analysis Project  

## Project Overview  
This project aims to analyze customer sentiments based on product reviews collected from Flipkart.com. Using natural language processing (NLP) techniques and machine learning models, the project focuses on identifying customer sentiment as **positive**, **neutral**, or **negative**. The insights generated can help businesses understand customer feedback, improve their services, and make data-driven decisions for product development and marketing strategies.

---

## About the Dataset  

### Dataset Description  
The dataset comprises customer reviews of 104 product categories from Flipkart.com, including electronics, clothing, home decor, and automated systems. It contains **205,053 rows** and **6 columns**, providing detailed information about product reviews, ratings, and sentiments.

## Data Dource

This dataset is available on Kaggle in the following link:
> https://www.kaggle.com/datasets/niraliivaghani/flipkart-product-customer-reviews-dataset

### Features  

| Column         | Description                                                                                     |
|-----------------|-------------------------------------------------------------------------------------------------|
| `Product_name` | Name of the product reviewed.                                                                   |
| `Product_price`| Price of the product.                                                                           |
| `Rate`         | Customer's rating of the product (on a scale of 1 to 5).                                        |
| `Review`       | Text of the customer's review for each product.                                                 |
| `Summary`      | Descriptive summary of the customer's thoughts about the product.                               |
| `Sentiment`    | Multiclass label for sentiment: **Positive**, **Negative**, or **Neutral** (derived from Summary). |

### Data Cleaning  
- Missing values in the `Review` column are handled, and `NaN` values are included for products with no review but an available `Summary`.  
- Data cleaning was performed using Python's `NumPy` and `Pandas` libraries.  
- Sentiment labeling was conducted using the **VADER model** and manually validated for accuracy.  

### Data Collection  
The dataset was obtained via **web scraping** using the `BeautifulSoup` library from Flipkart.com in December 2022.

---

## Objectives  

1. **Sentiment Analysis:**  
   - Classify customer reviews as **Positive**, **Negative**, or **Neutral** using NLP models.  

2. **Predictive Modeling:**  
   - Use features like ratings, reviews, and summaries to predict customer behavior and product preferences.  

3. **Text Classification:**  
   - Develop text classification models for tasks such as spam detection, topic classification, and intent recognition.  

4. **NLP Applications:**  
   - Train and evaluate NLP algorithms for sentiment analysis and other domains.  

5. **Customer Insights:**  
   - Extract actionable insights from customer reviews to improve customer service and product offerings.  

---

## Usage  

### Applications of the Dataset  
1. **Sentiment Analysis:**  
   Train models to classify customer sentiments for reviews and summaries.  

2. **Predictive Modeling:**  
   Predict customer behavior, purchase patterns, and product preferences based on reviews.  

3. **Text Classification:**  
   Develop models for spam detection, topic classification, and other text-based tasks.  

4. **Customer Service Insights:**  
   Identify customer complaints, issues, and suggestions to enhance service quality.  

5. **Machine Learning Evaluation:**  
   Evaluate and benchmark sentiment analysis models using this dataset.  

---

## Methodology  

### 1. **Data Understanding**  
   - Perform **Exploratory Data Analysis (EDA)** to uncover patterns and trends in the data.  
   - Assess data quality and identify missing or inconsistent values.  

### 2. **Data Preparation**  
   - Handle missing values, duplicates, and outliers.  
   - Normalize textual data and preprocess reviews for modeling (tokenization, stemming, and lemmatization).  
   - Split data into **training**, **validation**, and **test** sets.  

### 3. **Modeling**  
   - **Baseline Models:** Use models like Naive Bayes for initial benchmarking.  
   - **Advanced Models:** Train machine learning models such as Logistic Regression, SVM, and Random Forest.  
   - **Deep Learning Models:** Implement LSTMs or Transformers (e.g., BERT) for advanced text analysis.  

### 4. **Evaluation**  
   - Evaluate model performance using metrics like accuracy, precision, recall, F1-score, and ROC-AUC.  
   - Visualize confusion matrices and ROC curves for better understanding.  

---

## Tools and Technologies  

- **Programming Language:** Python  
- **Libraries:** pandas, numpy, matplotlib, seaborn, scikit-learn, nltk, spaCy, BeautifulSoup, TensorFlow/Keras  
- **Models:** VADER, Logistic Regression, Naive Bayes, SVM, Random Forest, LSTM, BERT  
- **Visualization Tools:** matplotlib, seaborn, plotly  

---

## Insights  

1. **Sentiment Distribution:** Majority of reviews are positive, with a smaller proportion being neutral or negative.  
2. **Product Insights:** Some categories (e.g., electronics) show higher customer satisfaction compared to others.  
3. **Customer Behavior:** Pricing and ratings significantly influence sentiment.  

---


## Conclusion  

The **Customer Sentiment Analysis Project** provides actionable insights into customer feedback, enabling businesses to refine their strategies, improve product offerings, and enhance customer experiences. By leveraging advanced NLP techniques and machine learning models, this project delivers significant value for customer-centric decision-making.

---


### Import Libraries


In [1]:
import pandas as pd
import numpy as np
import os
import warnings
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import GridSearchCV

### Settings

In [2]:
# Warning
warnings.filterwarnings("ignore")

# Path
data_path = "../data"
csv_path = os.path.join(data_path, "Cleaned_SA.csv")

### Load Dataset

In [3]:
df = pd.read_csv(csv_path)

In [4]:
# Check data
df.head()

Unnamed: 0,product_name,product_price,Rate,Review,Summary,Sentiment
0,Candes 12 L Room/Personal Air Cooler??????(Whi...,3999,5,super!,great cooler excellent air flow and for this p...,positive
1,Candes 12 L Room/Personal Air Cooler??????(Whi...,3999,5,awesome,best budget 2 fit cooler nice cooling,positive
2,Candes 12 L Room/Personal Air Cooler??????(Whi...,3999,3,fair,the quality is good but the power of air is de...,positive
3,Candes 12 L Room/Personal Air Cooler??????(Whi...,3999,1,useless product,very bad product its a only a fan,negative
4,Candes 12 L Room/Personal Air Cooler??????(Whi...,3999,3,fair,ok ok product,neutral


### Natural Language Processing(NLP)

Preprocess the text with NLP to prepare the data for traing the model.

In [5]:
# Download NLTK(Natural Language Toolkit) resources
nltk.download("punkt")
nltk.download("stopwords")

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/arghadeysarkar/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/arghadeysarkar/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [6]:
# Text Preprocessing
def preprocess_text(text):
    stop_words = set(stopwords.words('english'))
    tokens = word_tokenize(text.lower())
    filtered_tokens = [word for word in tokens if word.isalpha() and word not in stop_words]
    return " ".join(filtered_tokens)

In [8]:
# Prepare data for training
def prepare_to_train(data):

    data['cleaned_review'] = data['Review'].apply(preprocess_text)
    X = data['cleaned_review']
    y = data['Sentiment']
    
    # Convert text to numerical data
    vectorizer = CountVectorizer()
    # vectorizer = TfidfVectorizer(max_features= 5000)
    X_vectorized = vectorizer.fit_transform(X)

    return vectorizer, X_vectorized, y

In [9]:
# Prepare text
vectorizer, X, y = prepare_to_train(df)
# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

### Training and Evaluation

In [10]:
def train_evaluate(model):
    # Train the model
    model.fit(X_train, y_train)

    # Evaluate model
    y_pred = model.predict(X_test)
    print("Model Accuracy:", accuracy_score(y_test, y_pred))
    print("Classification Report:")
    print(classification_report(y_test, y_pred))

    return model

In [12]:
model = RandomForestClassifier()
model = train_evaluate(model)

Model Accuracy: 0.8970673917265488
Classification Report:
              precision    recall  f1-score   support

    negative       0.85      0.75      0.79      4734
     neutral       0.00      0.00      0.00      1679
    positive       0.91      0.99      0.94     24481

    accuracy                           0.90     30894
   macro avg       0.58      0.58      0.58     30894
weighted avg       0.85      0.90      0.87     30894



### Model Performance Analysis 

#### 1. Overall Model Performance

- **Accuracy:**

- The model achieves an accuracy of **88.88%**, which is relatively **high**.
- However, accuracy alone is not a reliable metric in cases of imbalanced datasets, as it can be skewed by the dominant class.

#### 2. Class-wise Performance Analysis

- **Positive Sentiment:**

    - **Precision: 0.90** The model is highly precise for positive sentiment, meaning most reviews predicted as positive are indeed positive.
    - **Recall: 0.98** The recall is very high, indicating the model correctly identifies most positive reviews.
    - **F1-Score: 0.94** The F1-score is excellent, reflecting a strong balance between precision and recall for this class.

- **Observation:**

    - The model performs extremely well for positive sentiments because they dominate the dataset, making the classifier biased toward this class.

- **Negative Sentiment:**

    - **Precision: 0.85** The model does well in predicting negative sentiments, meaning that when it predicts "negative," it is correct **85%** of the time.
    - **Recall: 0.71** The recall is lower, indicating the model misses some negative reviews.
    - **F1-Score: 0.77** This score shows decent but not exceptional performance for negative sentiment.
- **Observation:**

    - The model struggles slightly with negative sentiments, possibly due to their smaller proportion in the dataset, leading to underrepresentation in training.

- **Neutral Sentiment:**

    - **Precision: 0.08** The precision is very low, meaning most reviews predicted as neutral are actually misclassified.
    - **Recall: 0.00** The recall of 0 indicates that the model almost never predicts neutral reviews correctly.
    - **F1-Score: 0.01** The F1-score is essentially negligible, highlighting that the model fails to capture neutral sentiments.
- **Observation:**
    - The model performs poorly on neutral sentiments, likely due to their very small representation in the dataset and their inherently ambiguous nature, which makes them harder to distinguish.

#### 3. Macro vs. Weighted Averages

- **Macro Average:**

    - Precision, Recall, and F1-scores for macro averaging are around 0.57–0.61, reflecting poor performance across minority classes (negative and neutral).
    - This suggests that the model does not generalize well to all classes.

- **Weighted Average:**
    - Precision, Recall, and F1-scores are around 0.85–0.89, which are much higher due to the heavy influence of the dominant positive class.

- **Observation:**
    - The weighted average is inflated by the high performance on positive sentiments, masking the poor results for negative and neutral classes.

#### 4. Challenges with Class Imbalance
    
- The dataset is highly imbalanced, with the positive class making up the majority of the data (~80%).

- This imbalance leads to:
    - Excellent performance for the positive class.
    - Decent performance for the negative class.
    - Poor performance for the neutral class, as the model is biased toward the majority class.

#### 5. Recommendations for Improvement

- **Handle Class Imbalance:**

    - Apply oversampling techniques like **SMOTE or ADASYN** for the minority classes (neutral and negative).
    - Alternatively, use undersampling for the majority class (positive) to balance the dataset.

- **Feature Engineering:**

    - Explore additional text preprocessing steps, such as removing stop words, stemming/lemmatization, or using domain-specific dictionaries, to better capture nuances in neutral and negative sentiments.

- Use a Different Algorithm:

    - Consider more sophisticated models like **Logistic Regression with Class Weights, Random Forests, or Transformer-based models (e.g., BERT)**, which might handle imbalanced data better.

### Model Optimization

In [13]:
# Use SMOTE(Synthetic Minority Oversampling Technique) for balacing the dataset
smote = SMOTE(random_state= 42)
X_train, y_train = smote.fit_resample(X_train, y_train)

In [14]:
# Retain the model
model = train_evaluate(model)

Model Accuracy: 0.793001877387195
Classification Report:
              precision    recall  f1-score   support

    negative       0.84      0.75      0.79      4734
     neutral       0.13      0.45      0.21      1679
    positive       0.96      0.82      0.89     24481

    accuracy                           0.79     30894
   macro avg       0.65      0.67      0.63     30894
weighted avg       0.90      0.79      0.84     30894



In [18]:
# Use SMOTE(Synthetic Minority Oversampling Technique) for balacing the dataset
smote = SMOTE(random_state= 42)
X_balanced, y_balanced = smote.fit_resample(X, y)

In [19]:
# Hyperparameter Tuning with GridSearchCV
def tune_hyperparameter(model, param_grid, X, y):
    # Define CV
    gscv = GridSearchCV(model,
                       param_grid= param_grid,
                       cv= 5,
                       verbose= 1)
    # train the model
    gscv.fit(X, y)

    print(f"Best Score: {gscv.best_score_: 0.2f}")
    best_params = gscv.best_params_
    return best_params

In [None]:
# Define heper parameter grid
param_grid = {
    'n_estimators': [100, 200],
    'criterion': ["gini", "entropy", "log_loss"],
    'min_samples_split': [2, 3]
}
model_ht = RandomForestClassifier()
best_params = tune_hyperparameter(model_ht, param_grid, X_balanced, y_balanced)
print(best_params)

Fitting 5 folds for each of 12 candidates, totalling 60 fits


In [17]:
# Retrain the model with best hyperparameter set
model_opt = MultinomialNB(**best_params)
model = train_evaluate(model_opt)

Model Accuracy: 0.7255130446041302
Classification Report:
              precision    recall  f1-score   support

    negative       0.75      0.75      0.75      4734
     neutral       0.11      0.50      0.18      1679
    positive       0.96      0.74      0.83     24481

    accuracy                           0.73     30894
   macro avg       0.61      0.66      0.59     30894
weighted avg       0.88      0.73      0.79     30894

