# Amazon Product Quality Drivers & Customer Sentiment Analysis

---

## Project Overview & Business Value

This project uses Amazon product data, including pricing and customer reviews, to uncover the key factors that **drive high customer ratings (4.0 and above)**. By combining **Natural Language Processing (NLP)** with **Classification Modeling**, the analysis identifies the specific product features and customer sentiments that lead to satisfaction.

**The primary business objective is to provide actionable recommendations for:**
1. **Product Development:** Knowing where to focus quality improvements.
2. **Marketing Strategy:** Understanding which product attributes to highlight.
3. **Pricing/Promotion:** Evaluating the impact of discounts on perceived value.

---

## Key Business Questions Answered

1. Which product categories show the strongest correlation between **price discounts** and high ratings?
2. What are the top 5 most influential product attributes (from the product description) that predict a 5-star rating?
3. Does the sentiment score derived from the *review content* significantly outperform the numerical rating in predicting overall satisfaction?
4. What are the common topics or themes in low-rated reviews that demand immediate business attention?

---

## Dataset Column Description (Data Dictionary)

| Column Name | Data Type (Implied) | Description & Use in Project | 
| :--- | :--- | :--- |
| `product_id` | String | Unique identifier for the product. |
| `product_name` | String | Full name of the product. |
| `category` | String | Pipe-delimited category path whose first segment is the primary category (e.g., Electronics, Fashion). **Used for segmentation.** |
| `discounted_price` | String/Numeric | The price the customer paid. **Used in pricing analysis.** |
| `actual_price` | String/Numeric | The original list price. |
| `discount_percentage` | String/Numeric | The calculated discount given. **Used as a key driver feature.** |
| `rating` | Float | The numerical rating (1.0 to 5.0). **Used to create the binary target variable.** |
| `rating_count` | Integer | The total number of votes for the rating. **Used as a proxy for product popularity/sales volume.** |
| `about_product` | String | Detailed product features/description. **Source for keyword-based feature engineering (e.g., "durable").** |
| `user_id` | String | Comma-separated identifiers of the reviewers whose reviews are included for the product. |
| `user_name` | String | Comma-separated names of those reviewers. |
| `review_id` | String | Comma-separated identifiers of the individual reviews. |
| `review_title` | String | Comma-separated short subject lines of the reviews. **Used in NLP/sentiment analysis.** |
| `review_content` | String | Full text of the review. **Primary source for sentiment scoring and topic modeling (NLP).** |
| `img_link` | String | Link to the product image. |
| `product_link` | String | Link to the product's Amazon page. |
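
The `head()` output later in the notebook shows `category` stored as a pipe-delimited path rather than a single label, so segmentation needs the top-level segment extracted first. A minimal sketch, assuming the first segment is the one of interest; the helper name `primary_category` is illustrative and not part of the original notebook:

```python
def primary_category(category_path: str) -> str:
    """Return the top-level segment of a pipe-delimited category path."""
    # Assumption: segments are separated by '|' and the first one is the primary category.
    return category_path.split("|")[0].strip()


# Sample built from the two leading segments visible in the notebook output.
sample = "Computers&Accessories|Accessories&Peripherals"
print(primary_category(sample))
```

On the DataFrame this could be applied element-wise, for example with `amazon_sales_raw['category'].map(primary_category)`.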

---

## Technical Stack & Methodology

| Component | Techniques / Libraries Used |
| :--- | :--- |
| **Data Cleaning** | Python (Pandas, NumPy) |
| **Feature Engineering** | Keyword Extraction, Text Length Calculation |
| **Sentiment Analysis** | **Natural Language Toolkit (NLTK)** - VADER Sentiment Lexicon |
| **Modeling** | **Scikit-learn (sklearn)** - Random Forest Classifier, Logistic Regression |
| **Evaluation** | Confusion Matrix, Feature Importance, Classification Report (Precision/Recall) |
| **Visualization** | Matplotlib, Seaborn |
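
The feature engineering row can be illustrated with a small standard-library sketch of keyword flags and text-length features. This is only one possible shape for those features, not the notebook's own implementation: the function name `keyword_features` is hypothetical, "durable" is the keyword cited in the data dictionary, and "fast" is an illustrative extra.

```python
import re


def keyword_features(text: str, keywords: list[str]) -> dict[str, int]:
    """Build simple text-length and keyword-presence features for one document."""
    features = {
        "char_count": len(text),
        "word_count": len(text.split()),
    }
    for kw in keywords:
        # Whole-word, case-insensitive match; 1 if the keyword appears, else 0.
        pattern = rf"\b{re.escape(kw)}\b"
        found = re.search(pattern, text, flags=re.IGNORECASE) is not None
        features[f"has_{kw}"] = int(found)
    return features


# "durable" comes from the data dictionary; "fast" is a hypothetical extra keyword.
example = keyword_features("Looks durable Charging is fine", ["durable", "fast"])
print(example)
```

A presence flag ignores negation, so a review beginning "Not quite durable" would also set `has_durable` to 1. That limitation is one reason to pair keyword flags with a sentiment score.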

In [1]:
# Data Manipulation Imports
import pandas as pd
import numpy as np

# Visualization Imports
import matplotlib.pyplot as plt
import seaborn as sns

# NLP and Text Processing Imports
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

# Machine Learning Imports
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

# Model Evaluation Imports
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, f1_score

In [2]:
# NOTE: This cell only needs to be run once per machine
# to download the necessary lexicon for VADER sentiment analysis.
# You can comment it out after the first successful run.

nltk.download('vader_lexicon')

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     C:\Users\rharv\AppData\Roaming\nltk_data...


True

In [3]:
# Load in dataset
amazon_sales_raw = pd.read_csv('amazon_sales_dataset.csv')

In [None]:
# Displays head of the data for inspection
amazon_sales_raw.head()

Unnamed: 0,product_id,product_name,category,discounted_price,actual_price,discount_percentage,rating,rating_count,about_product,user_id,user_name,review_id,review_title,review_content,img_link,product_link
0,B07JW9H4J1,Wayona Nylon Braided USB to Lightning Fast Cha...,Computers&Accessories|Accessories&Peripherals|...,₹399,"₹1,099",64%,4.2,24269,High Compatibility : Compatible With iPhone 12...,"AG3D6O4STAQKAY2UVGEUV46KN35Q,AHMY5CWJMMK5BJRBB...","Manav,Adarsh gupta,Sundeep,S.Sayeed Ahmed,jasp...","R3HXWT0LRP0NMF,R2AJM3LFTLZHFO,R6AQJGUP6P86,R1K...","Satisfied,Charging is really fast,Value for mo...",Looks durable Charging is fine tooNo complains...,https://m.media-amazon.com/images/W/WEBP_40237...,https://www.amazon.in/Wayona-Braided-WN3LG1-Sy...
1,B098NS6PVG,Ambrane Unbreakable 60W / 3A Fast Charging 1.5...,Computers&Accessories|Accessories&Peripherals|...,₹199,₹349,43%,4.0,43994,"Compatible with all Type C enabled devices, be...","AECPFYFQVRUWC3KGNLJIOREFP5LQ,AGYYVPDD7YG7FYNBX...","ArdKn,Nirbhay kumar,Sagar Viswanathan,Asp,Plac...","RGIQEG07R9HS2,R1SMWZQ86XIN8U,R2J3Y1WL29GWDE,RY...","A Good Braided Cable for Your Type C Device,Go...",I ordered this cable to connect my phone to An...,https://m.media-amazon.com/images/W/WEBP_40237...,https://www.amazon.in/Ambrane-Unbreakable-Char...
2,B096MSW6CT,Sounce Fast Phone Charging Cable & Data Sync U...,Computers&Accessories|Accessories&Peripherals|...,₹199,"₹1,899",90%,3.9,7928,【 Fast Charger& Data Sync】-With built-in safet...,"AGU3BBQ2V2DDAMOAKGFAWDDQ6QHA,AESFLDV2PT363T2AQ...","Kunal,Himanshu,viswanath,sai niharka,saqib mal...","R3J3EQQ9TZI5ZJ,R3E7WBGK7ID0KV,RWU79XKQ6I1QF,R2...","Good speed for earlier versions,Good Product,W...","Not quite durable and sturdy,https://m.media-a...",https://m.media-amazon.com/images/W/WEBP_40237...,https://www.amazon.in/Sounce-iPhone-Charging-C...
3,B08HDJ86NZ,boAt Deuce USB 300 2 in 1 Type-C & Micro USB S...,Computers&Accessories|Accessories&Peripherals|...,₹329,₹699,53%,4.2,94363,The boAt Deuce USB 300 2 in 1 cable is compati...,"AEWAZDZZJLQUYVOVGBEUKSLXHQ5A,AG5HTSFRRE6NL3M5S...","Omkar dhale,JD,HEMALATHA,Ajwadh a.,amar singh ...","R3EEUZKKK9J36I,R3HJVYCLYOY554,REDECAZ7AMPQC,R1...","Good product,Good one,Nice,Really nice product...","Good product,long wire,Charges good,Nice,I bou...",https://m.media-amazon.com/images/I/41V5FtEWPk...,https://www.amazon.in/Deuce-300-Resistant-Tang...
4,B08CF3B7N1,Portronics Konnect L 1.2M Fast Charging 3A 8 P...,Computers&Accessories|Accessories&Peripherals|...,₹154,₹399,61%,4.2,16905,[CHARGE & SYNC FUNCTION]- This cable comes wit...,"AE3Q6KSUK5P75D5HFYHCRAOLODSA,AFUGIFH5ZAFXRDSZH...","rahuls6099,Swasat Borah,Ajay Wadke,Pranali,RVK...","R1BP4L2HH9TFUP,R16PVJEXKV6QZS,R2UPDB81N66T4P,R...","As good as original,Decent,Good one for second...","Bought this instead of original apple, does th...",https://m.media-amazon.com/images/W/WEBP_40237...,https://www.amazon.in/Portronics-Konnect-POR-1...


In [11]:
amazon_sales_raw.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1465 entries, 0 to 1464
Data columns (total 16 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   product_id           1465 non-null   object
 1   product_name         1465 non-null   object
 2   category             1465 non-null   object
 3   discounted_price     1465 non-null   object
 4   actual_price         1465 non-null   object
 5   discount_percentage  1465 non-null   object
 6   rating               1465 non-null   object
 7   rating_count         1463 non-null   object
 8   about_product        1465 non-null   object
 9   user_id              1465 non-null   object
 10  user_name            1465 non-null   object
 11  review_id            1465 non-null   object
 12  review_title         1465 non-null   object
 13  review_content       1465 non-null   object
 14  img_link             1465 non-null   object
 15  product_link         1465 non-null   object
dtypes: object(16)

In [6]:
# Checks to see how many null values are in the dataset
amazon_sales_raw.isnull().sum()

product_id             0
product_name           0
category               0
discounted_price       0
actual_price           0
discount_percentage    0
rating                 0
rating_count           2
about_product          0
user_id                0
user_name              0
review_id              0
review_title           0
review_content         0
img_link               0
product_link           0
dtype: int64

In [10]:
# Text columns to check for empty or whitespace-only entries
check_cols = ['review_title', 'review_content', 'about_product']

print('Data Quality Check: Inspecting Text Columns for Empty Strings')

# Iterate through stored columns to check
for col in check_cols:
    # Cast to string, strip surrounding whitespace, and count the entries that are empty
    empty_count = amazon_sales_raw[col].astype(str).str.strip().eq('').sum()

    # Runs if there are any empty entries in the current column
    if empty_count > 0:
        print(f"'{col}': {empty_count} rows contain only whitespace or are empty strings.")

        # Replace these effectively null values with an actual empty string to prevent errors in later NLP steps
        amazon_sales_raw[col] = amazon_sales_raw[col].replace(r'^\s*$', '', regex=True)
        print(f"   -> {empty_count} rows replaced with an empty string ('') for NLP processing.")

    else:
        print(f"'{col}': Clean (0 empty/whitespace rows detected).")


Data Quality Check: Inspecting Text Columns for Empty Strings
'review_title': Clean (0 empty/whitespace rows detected).
'review_content': Clean (0 empty/whitespace rows detected).
'about_product': Clean (0 empty/whitespace rows detected).


## Data Cleaning & Standardization Summary

This section details the transformations required to prepare the raw data for numerical analysis and NLP modeling. They follow from the initial inspection, where all 16 columns loaded with the `object` (string) data type.

| Issue (Column) | Why the Change is Needed or Preferred | How the Change is Implemented |
| :--- | :--- | :--- |
| **Missing Data** (`rating_count`) | Two rows are missing a value in this critical popularity metric. Dropping them is statistically negligible given the dataset size (2 of 1,465 rows). | **Drop** the 2 rows where `rating_count` is null (`NaN`). |
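| **Target Variable** (`rating`) | The model requires a binary target variable to classify success/failure, and `rating` loads as a string (`object`), so it must be numeric before it can be compared. | 1. **Convert** `rating` to `float`. 2. **Create** a new boolean column, `High_Rating`, where **True** if `rating` $\ge 4.0$ (High Satisfaction) and **False** otherwise. |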
| **Currency** (`discounted_price`, `actual_price`) | Columns are strings with the `₹` symbol and commas. Prices need to be numeric and standardized to USD for portfolio clarity. | 1. **Remove** `₹` and `,`. 2. **Convert** to `float` (INR). 3. **Divide** by a fixed exchange rate (e.g., $82.0$) to convert to **USD**. |
| **Percentage** (`discount_percentage`) | Column is a string with the `%` symbol. It must be a decimal for statistical modeling. | 1. **Remove** the `%` symbol. 2. **Convert** to `float`. 3. **Divide** by 100 to yield a decimal (e.g., $0.64$). |
| **Popularity Metric** (`rating_count`) | Column is a string containing commas (e.g., `24,269`). It needs to be a numerical integer type. | 1. **Remove** the comma (`,`). 2. **Convert** the resulting string to an `int`. |
| **Text Data** (`review_title`, `review_content`, `about_product`) | The three text columns were checked and found to be free of empty or whitespace-only strings. | **No action required** on missing text values. All text is retained and ready for direct use in NLP (Sentiment Analysis, Feature Engineering). |
| **Cleanup** (Raw Columns) | Retaining the original string columns (`discounted_price`, `rating_count`, etc.) is redundant after cleaning. | **Drop** the original raw columns once the new, clean numerical columns are created. |
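
The string-to-numeric steps in the table above can be sketched as small element-wise helpers. This is a hedged illustration rather than the notebook's actual cleaning code: the function names are hypothetical, the exchange rate is the fixed example rate of 82.0 from the table, and treating a non-numeric rating as missing is an assumption.

```python
from typing import Optional

INR_PER_USD = 82.0  # fixed illustrative rate from the table above, not a live rate


def parse_inr(price: str) -> float:
    """Strip the rupee symbol and thousands separators, e.g. '₹1,099' -> 1099.0."""
    return float(price.replace("₹", "").replace(",", "").strip())


def inr_to_usd(price: str, rate: float = INR_PER_USD) -> float:
    """Convert an INR price string to USD at a fixed rate."""
    return parse_inr(price) / rate


def parse_percentage(pct: str) -> float:
    """Convert a percentage string such as '64%' to a decimal fraction."""
    return float(pct.replace("%", "").strip()) / 100


def parse_count(count: str) -> int:
    """Convert a comma-formatted count such as '24,269' to an int."""
    return int(count.replace(",", "").strip())


def parse_rating(rating: str) -> Optional[float]:
    """Return the rating as a float, or None when the string is not numeric."""
    # Assumption: a non-numeric rating is treated as missing, not as an error.
    try:
        return float(rating)
    except ValueError:
        return None


def is_high_rating(rating: str, threshold: float = 4.0) -> Optional[bool]:
    """True when rating >= threshold; None when the rating cannot be parsed."""
    value = parse_rating(rating)
    return None if value is None else value >= threshold


# Values taken from the first row of the head() output and the table above.
print(parse_inr("₹1,099"), parse_percentage("64%"), parse_count("24,269"), is_high_rating("4.2"))
```

Each helper works on a single value, so on the DataFrame it could be applied with `Series.map` after dropping the two rows with a null `rating_count` (for example via `dropna(subset=['rating_count'])`). Vectorized `Series.str.replace` calls would be an equally reasonable choice.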