# Sentiment Analysis of Amazon Product Reviews

**Author:** Caleb Nicholas Youhanna  
**Task:** Sentiment Analysis using spaCy  
**Dataset:** Datafiniti Amazon Consumer Reviews of Amazon Products

## Project Overview

This project performs sentiment analysis on Amazon product reviews using
Natural Language Processing (NLP) techniques.

The objectives are to:
- Load and clean product review data
- Preprocess text for NLP tasks
- Implement sentiment analysis using spaCy and TextBlob
- Evaluate results 

In [7]:
# Import Libraries
import pandas as pd
import spacy

from spacytextblob.spacytextblob import SpacyTextBlob  # type: ignore[attr-defined]

## Load Dataset

The dataset contains Amazon product reviews collected by Datafiniti.
We focus on the `reviews.text` column, which contains customer feedback.

In [13]:
# Load the dataset from a relative path, with encoding and zip fallbacks
import io
import zipfile

file_path = "Datafiniti_Amazon_Consumer_Reviews_of_Amazon_Products_May19.csv"
# latin-1 maps every byte to a character, so it always decodes and must come last
encodings = ["utf-8", "cp1252", "latin-1"]


def try_read_from_bytes(raw_bytes):
    """Parse CSV bytes into a DataFrame, trying each encoding in turn."""
    for enc in encodings:
        try:
            return pd.read_csv(io.StringIO(raw_bytes.decode(enc)))
        except (UnicodeDecodeError, pd.errors.ParserError):
            continue
    # Final fallback: latin-1 with the Python engine, skipping malformed rows
    return pd.read_csv(
        io.StringIO(raw_bytes.decode("latin-1")),
        engine="python",
        on_bad_lines="skip",
    )


# The file may actually be a zip archive (some Datafiniti files are zipped)
if zipfile.is_zipfile(file_path):
    with zipfile.ZipFile(file_path) as archive:
        csv_names = [n for n in archive.namelist() if n.lower().endswith(".csv")]
        if not csv_names:
            raise ValueError("No CSV found inside zip archive")
        data = try_read_from_bytes(archive.read(csv_names[0]))
        print(f"Loaded CSV from zip: {csv_names[0]}")
else:
    # Read the raw bytes once, then try each decoding
    with open(file_path, "rb") as f:
        data = try_read_from_bytes(f.read())

# Display dataset shape
print("Dataset Shape:", data.shape)

# Preview dataset
data.head()

Loaded CSV from zip: Datafiniti_Amazon_Consumer_Reviews_of_Amazon_Products_May19.csv
Dataset Shape: (28332, 24)


Unnamed: 0,id,dateAdded,dateUpdated,name,asins,brand,categories,primaryCategories,imageURLs,keys,...,reviews.didPurchase,reviews.doRecommend,reviews.id,reviews.numHelpful,reviews.rating,reviews.sourceURLs,reviews.text,reviews.title,reviews.username,sourceURLs
0,AVpgNzjwLJeJML43Kpxn,2015-10-30T08:59:32Z,2019-04-25T09:08:16Z,AmazonBasics AAA Performance Alkaline Batterie...,"B00QWO9P0O,B00LH3DMUO",Amazonbasics,"AA,AAA,Health,Electronics,Health & Household,C...",Health & Beauty,https://images-na.ssl-images-amazon.com/images...,"amazonbasics/hl002619,amazonbasicsaaaperforman...",...,,,,,3,https://www.amazon.com/product-reviews/B00QWO9...,I order 3 of them and one of the item is bad q...,... 3 of them and one of the item is bad quali...,Byger yang,"https://www.barcodable.com/upc/841710106442,ht..."
1,AVpgNzjwLJeJML43Kpxn,2015-10-30T08:59:32Z,2019-04-25T09:08:16Z,AmazonBasics AAA Performance Alkaline Batterie...,"B00QWO9P0O,B00LH3DMUO",Amazonbasics,"AA,AAA,Health,Electronics,Health & Household,C...",Health & Beauty,https://images-na.ssl-images-amazon.com/images...,"amazonbasics/hl002619,amazonbasicsaaaperforman...",...,,,,,4,https://www.amazon.com/product-reviews/B00QWO9...,Bulk is always the less expensive way to go fo...,... always the less expensive way to go for pr...,ByMG,"https://www.barcodable.com/upc/841710106442,ht..."
2,AVpgNzjwLJeJML43Kpxn,2015-10-30T08:59:32Z,2019-04-25T09:08:16Z,AmazonBasics AAA Performance Alkaline Batterie...,"B00QWO9P0O,B00LH3DMUO",Amazonbasics,"AA,AAA,Health,Electronics,Health & Household,C...",Health & Beauty,https://images-na.ssl-images-amazon.com/images...,"amazonbasics/hl002619,amazonbasicsaaaperforman...",...,,,,,5,https://www.amazon.com/product-reviews/B00QWO9...,Well they are not Duracell but for the price i...,... are not Duracell but for the price i am ha...,BySharon Lambert,"https://www.barcodable.com/upc/841710106442,ht..."
3,AVpgNzjwLJeJML43Kpxn,2015-10-30T08:59:32Z,2019-04-25T09:08:16Z,AmazonBasics AAA Performance Alkaline Batterie...,"B00QWO9P0O,B00LH3DMUO",Amazonbasics,"AA,AAA,Health,Electronics,Health & Household,C...",Health & Beauty,https://images-na.ssl-images-amazon.com/images...,"amazonbasics/hl002619,amazonbasicsaaaperforman...",...,,,,,5,https://www.amazon.com/product-reviews/B00QWO9...,Seem to work as well as name brand batteries a...,... as well as name brand batteries at a much ...,Bymark sexson,"https://www.barcodable.com/upc/841710106442,ht..."
4,AVpgNzjwLJeJML43Kpxn,2015-10-30T08:59:32Z,2019-04-25T09:08:16Z,AmazonBasics AAA Performance Alkaline Batterie...,"B00QWO9P0O,B00LH3DMUO",Amazonbasics,"AA,AAA,Health,Electronics,Health & Household,C...",Health & Beauty,https://images-na.ssl-images-amazon.com/images...,"amazonbasics/hl002619,amazonbasicsaaaperforman...",...,,,,,5,https://www.amazon.com/product-reviews/B00QWO9...,These batteries are very long lasting the pric...,... batteries are very long lasting the price ...,Bylinda,"https://www.barcodable.com/upc/841710106442,ht..."
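
The encoding fallback used by the loader can be illustrated in isolation. The following is a minimal sketch using only the standard library. `decode_with_fallback` is a hypothetical helper name, not part of this notebook, and the sample bytes are made up for the demonstration.

```python
def decode_with_fallback(raw, encodings=("utf-8", "cp1252", "latin-1")):
    """Return (text, encoding) for the first encoding that decodes `raw`.

    Hypothetical helper for illustration. latin-1 maps every byte value to a
    character, so it never raises and acts as the catch-all last entry.
    """
    for enc in encodings:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings could decode the data")


# Made-up sample: "cafe" with an accented e, saved as cp1252, is not valid UTF-8
sample = b"caf\xe9"
text, used = decode_with_fallback(sample)
print(text, used)
```

A successful decode does not prove the encoding is correct: cp1252 and latin-1 accept almost any byte sequence, so mis-decoded characters remain possible.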


## Data Cleaning

Missing values are removed from the `reviews.text` column to ensure
clean input for sentiment analysis.

In [16]:
# Remove rows with missing review text
clean_data = data.dropna(subset=["reviews.text"])

# Extract the reviews column
reviews_data = clean_data["reviews.text"]

print("Number of reviews after cleaning:", len(reviews_data))

Number of reviews after cleaning: 28332


## Load spaCy NLP Model

We load the `en_core_web_md` model and add the `spacytextblob` pipeline
component, which exposes TextBlob's sentiment polarity on each processed document.

In [18]:
# Load spaCy model with robust handling
import spacy
from spacytextblob.spacytextblob import SpacyTextBlob  # type: ignore[attr-defined]
model_name = 'en_core_web_md'
try:
    nlp = spacy.load(model_name)
    print(f'Loaded {model_name}')
except OSError:
    try:
        import spacy.cli
        print(f'Downloading {model_name}...')
        spacy.cli.download(model_name)
        nlp = spacy.load(model_name)
        print(f'Downloaded and loaded {model_name}')
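    except Exception:
        # Note: en_core_web_sm ships without static word vectors, so the
        # similarity scores later in this notebook would be less reliable.
        fallback = 'en_core_web_sm'
        print(f'Falling back to {fallback}. To install {model_name} manually: python -m spacy download {model_name}')
        nlp = spacy.load(fallback)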

# Add TextBlob sentiment component
nlp.add_pipe('spacytextblob')
print('spacytextblob component added')

Loaded en_core_web_md
spacytextblob component added


## Text Preprocessing

The text is cleaned by:
- Converting to lowercase
- Stripping leading and trailing whitespace
- Removing stop words and punctuation

In [19]:
def preprocess_text(text):
    """
    Clean input text by lowercasing it, stripping surrounding
    whitespace, and removing stop words and punctuation.
    """
    # str() guards against non-string values such as numbers
    doc = nlp(str(text).lower().strip())

    cleaned_tokens = [
        token.text for token in doc
        if not token.is_stop and not token.is_punct
    ]

    return " ".join(cleaned_tokens)
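
One side effect of stop-word removal is worth illustrating: stop lists often contain negations such as "not", which carry sentiment. The sketch below uses plain Python and a tiny, hypothetical stop list (`TOY_STOP_WORDS`) instead of spaCy's, so the exact words removed by `preprocess_text` may differ.

```python
import string

# Hypothetical, deliberately tiny stop list for illustration only;
# spaCy's English stop list is much longer.
TOY_STOP_WORDS = {"i", "am", "the", "is", "not", "but", "for", "they", "are"}


def toy_preprocess(text):
    """Lowercase, strip punctuation, and drop words in TOY_STOP_WORDS."""
    lowered = text.lower().strip()
    no_punct = lowered.translate(str.maketrans("", "", string.punctuation))
    return " ".join(w for w in no_punct.split() if w not in TOY_STOP_WORDS)


print(toy_preprocess("I am not happy."))  # the negation is removed
print(toy_preprocess("I am happy."))
```

Both sentences reduce to the same string, so a lexicon-based scorer would see no difference between them. Scoring the raw text alongside the preprocessed text is one way to check whether this affects a given review.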

## Sentiment Analysis Function

This function:
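- Takes a review as input
- Returns a polarity score between -1 (most negative) and 1 (most positive)
- Classifies sentiment as Positive (score above 0), Negative (score below 0), or Neutral (score of exactly 0)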

In [20]:
# Sentiment function
def analyse_sentiment(review):
    """
    Analyses sentiment of a product review.
    Returns polarity score and sentiment label.
    """
    cleaned_review = preprocess_text(review)
    doc = nlp(cleaned_review)
    
    polarity = doc._.blob.polarity
    
    if polarity > 0:
        sentiment = "Positive"
    elif polarity < 0:
        sentiment = "Negative"
    else:
        sentiment = "Neutral"
    
    return polarity, sentiment
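
The labelling rule treats any non-zero polarity as Positive or Negative, so a score of 0.01 counts as Positive. A possible variation, sketched below, adds a neutral band around zero. The function name `label_polarity` and the `threshold` value of 0.05 are assumptions for illustration, not values from this project.

```python
def label_polarity(polarity, threshold=0.05):
    """Map a polarity score in [-1, 1] to a sentiment label.

    Scores whose magnitude is at most `threshold` are labelled Neutral.
    With threshold=0 this matches the strict rule used in the notebook.
    """
    if polarity > threshold:
        return "Positive"
    if polarity < -threshold:
        return "Negative"
    return "Neutral"


for score in (-0.45, 0.01, 0.80):
    print(f"{score:+.2f} -> {label_polarity(score)}")
```

A suitable threshold would need to be chosen by inspecting how the scores are distributed over the reviews.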

## Testing the Sentiment Analysis

The function is applied to the first five reviews in the dataset.

In [21]:
for i in range(5):
    review_text = reviews_data.iloc[i]

    polarity, sentiment = analyse_sentiment(review_text)

    print(f"Review {i + 1}:")
    print(f"Text: {str(review_text)[:100]}...")
    print(f"Polarity: {polarity:.2f} | Sentiment: {sentiment}")
    print("-" * 70)

Review 1:
Text: I order 3 of them and one of the item is bad quality. Is missing backup spring so I have to put a pc...
Polarity: -0.45 | Sentiment: Negative
----------------------------------------------------------------------
Review 2:
Text: Bulk is always the less expensive way to go for products like these...
Polarity: -0.50 | Sentiment: Negative
----------------------------------------------------------------------
Review 3:
Text: Well they are not Duracell but for the price i am happy....
Polarity: 0.80 | Sentiment: Positive
----------------------------------------------------------------------
Review 4:
Text: Seem to work as well as name brand batteries at a much better price...
Polarity: 0.50 | Sentiment: Positive
----------------------------------------------------------------------
Review 5:
Text: These batteries are very long lasting the price is great....
Polarity: 0.25 | Sentiment: Positive
----------------------------------------------------------------------
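
One rough way to evaluate these labels is to compare them with the star ratings in `reviews.rating`. The sketch below does this for the five reviews shown above, using the ratings from the dataset preview (3, 4, 5, 5, 5). The mapping from stars to labels (`rating_to_label`) is an assumption made for illustration; the dataset does not define one.

```python
def rating_to_label(rating):
    """Assumed mapping: 4-5 stars Positive, 3 stars Neutral, 1-2 stars Negative."""
    if rating >= 4:
        return "Positive"
    if rating == 3:
        return "Neutral"
    return "Negative"


# Ratings from the dataset preview and labels printed for the same five reviews
ratings = [3, 4, 5, 5, 5]
predicted = ["Negative", "Negative", "Positive", "Positive", "Positive"]

expected = [rating_to_label(r) for r in ratings]
matches = sum(p == e for p, e in zip(predicted, expected))
agreement = matches / len(ratings)
print(f"Agreement with star ratings: {matches}/{len(ratings)} ({agreement:.0%})")
```

Five reviews are far too few to draw conclusions; running the same comparison over the full dataset would give a more meaningful figure.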


## Review Similarity

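spaCy's `Doc.similarity` measures how similar two reviews are by taking the
cosine similarity of their document vectors, which are built from the model's
word embeddings. A score close to 1 indicates similar vocabulary and topic,
not necessarily similar sentiment.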

In [22]:
# Select two reviews
review_text_1 = reviews_data.iloc[0]
review_text_2 = reviews_data.iloc[1]

doc1 = nlp(str(review_text_1))
doc2 = nlp(str(review_text_2))

similarity_score = doc1.similarity(doc2)

print("Review 1 snippet:", str(review_text_1)[:100], "...")
print("Review 2 snippet:", str(review_text_2)[:100], "...")
print(f"\nSimilarity Score: {similarity_score:.2f}")

Review 1 snippet: I order 3 of them and one of the item is bad quality. Is missing backup spring so I have to put a pc ...
Review 2 snippet: Bulk is always the less expensive way to go for products like these ...

Similarity Score: 0.95
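
The similarity calculation can be reproduced in plain Python. By default, a spaCy document vector is the average of its token vectors, and the score is the cosine of the angle between two such averages. The sketch below applies that calculation to made-up 3-dimensional vectors; the vectors and helper names are hypothetical and unrelated to the real `en_core_web_md` embeddings.

```python
import math


def mean_vector(vectors):
    """Average a list of equal-length vectors component-wise."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]


def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up token vectors: the first two "tokens" are shared by both documents
doc_a_tokens = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
doc_b_tokens = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, -1.0]]

score = cosine_similarity(mean_vector(doc_a_tokens), mean_vector(doc_b_tokens))
print(f"{score:.2f}")
```

Averaging means that tokens shared by both reviews pull the score up, which may partly explain high scores for reviews that express different sentiment.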
