# SENG 550 Final Project
### Use the Amazon Appliances reviews dataset to develop a classifier or sentiment analyzer that can predict whether a given review is favorable or not

## Abstract

Our project uses the Amazon Appliances [reviews dataset](https://amazon-reviews-2023.github.io/) to develop a sentiment analyzer classification model. By combining star ratings and textual content, the model is trained to predict whether a review leans positively or negatively towards a product. This approach offers a way to quickly grasp the general sentiment of a product and assist shoppers in filtering through large volumes of product feedback more efficiently.

## Introduction

### Selected Problem

The problem is to distinguish favourable from unfavourable appliance reviews based on their text and accompanying star ratings.

### Why is it Important?

Reading review after review of a product quickly becomes a burden. It is easy to misinterpret the mood behind a set of online comments, which can lead to poor purchase decisions. A quick, automated sentiment indicator can ease that burden and support a more neutral decision-making process.

### What have Others Done in this Space?

Researchers have performed [sentiment analysis](https://medium.com/@nafisaidris413/a-beginners-guide-for-product-review-sentiment-analysis-0de1f451167d) using machine learning and natural language processing to automatically classify reviews as positive, negative, or neutral. Sentiment analysis has been applied not only to product reviews but also to [social media](https://buffer.com/social-media-terms/sentiment-analysis), to determine how people perceive and talk about products and brands. This suggests that data-driven classifiers can provide sentiment scores that assist people in their daily lives, whether it is to determine how their personal brand is viewed or to make an informed purchase based on product reviews.

### Existing gaps?

Current solutions rely on star ratings to give consumers a sense of trust and quality when making purchasing decisions. This can be seen by opening any Amazon product page and checking the reviews. Some reviews are informative and list many positives, yet give the product fewer than 5 stars, while others give 5 stars without being informative at all. Other times the reviews are clearly biased, or the customer leaving the review is disgruntled, leading to a 1- or 2-star rating. By combining the star rating with the textual review content, we attempt to reveal patterns in product reviews that a rating alone might miss.

### Data Analysis Questions

1. Do text-based features add value beyond just a numerical rating?
2. Are there certain words which portray a stronger positive or negative sentiment?
3. How will adding text preprocessing impact accuracy?
4. Which models work best with this data?

### What is Proposed

We are proposing a text classification pipeline that merges a product's star rating and textual features.

### What are your Main Findings?

To determine customer opinion on various products within the Appliance category in Amazon's online store.

## Methodology

### Exploration of Data Features and Refinement of Feature Space

In this section, we focus on understanding the raw data from the collected [datasets](https://amazon-reviews-2023.github.io/) and transforming it into a format suitable for model training. We begin by loading the Amazon Appliances reviews dataset and its corresponding metadata. We then explore the structure of the data, examine the distribution of the fields we are interested in (such as ratings), and assess the overall quality of the review text. Once we understand the data, we apply a series of preprocessing techniques to clean and refine the text. The ultimate goal is to develop a set of features that can be fed into a machine learning model for sentiment classification.

#### Key Steps

1. **Loading the Data:**
We will load `Appliances.jsonl` (reviews) and `meta_Appliances.jsonl` (metadata) using Apache Spark to avoid memory overload.

2. **Initial Inspection and Basic Statistics:**
We will look at a few sample rows, check data types, count missing values, and examine distributions.

3. **Textual Data Exploration:**
We consider the nature of each review, such as its length, its character composition, and its common words. This should help guide our text cleaning decisions.

4. **Data Cleaning:**
We clean the text with methods such as lowercasing the characters, removing punctuation, and stripping leading and trailing whitespace.

5. **Feature Transformation:**
We will use Spark Machine Learning's feature extraction tools to convert raw text into numeric features that are typically suitable for machine learning models.
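
Steps 4 and 5 can be illustrated without Spark. The following is a minimal plain-Python sketch, not the notebook's Spark ML pipeline: `clean_text` and `bag_of_words` are hypothetical helper names, and the punctuation rule (dropping every character that is not an ASCII letter, a digit, or whitespace) is an assumption about what the cleaning might look like.

```python
import re
from collections import Counter

def clean_text(text):
    """Lowercase, drop punctuation, and strip/collapse whitespace."""
    text = text.lower()
    # Assumption: anything that is not an ASCII letter, digit, or whitespace
    # is treated as punctuation (this also drops emoji and accented letters).
    text = re.sub(r"[^a-z0-9\s]", "", text)
    # Collapse runs of whitespace and strip leading/trailing whitespace.
    return re.sub(r"\s+", " ", text).strip()

def bag_of_words(text):
    """Turn text into word counts, the idea behind a count-based vectorizer."""
    return Counter(clean_text(text).split())

# Made-up review text for illustration only.
sample = "  Works GREAT!! Easy to install, works as described.  "
print(clean_text(sample))
print(bag_of_words(sample))
```

Counting words after cleaning means that `Works` and `works` fall into the same feature, which is the main reason the cleaning step comes before feature extraction.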

#### Load the Data

The datasets are provided in `*.jsonl` format, which means each line is a separate JSON object representing a single review or a single product's metadata. We will use `SparkSession` to read the files, which handles the data in a distributed manner and avoids a potential kernel crash. Because Spark evaluates lazily, transformations are only executed when an action is called, which helps keep the memory usage of the large `.jsonl` files manageable.

The two main data sources:
1. **Review File (`Appliances.jsonl`):**
Contains user-level reviews with fields such as `rating`, `title`, `text`, and `helpful_vote`.

2. **Metadata File (`meta_Appliances.jsonl`):**
Contains product-level information like `main_category`, `average_rating`, and `price`.
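
To make the JSON Lines layout concrete, here is a minimal sketch that parses two made-up review records with Python's standard `json` module. The records are invented for illustration and only use field names listed above; `read_jsonl` is a hypothetical helper, and the notebook itself reads the real files with Spark.

```python
import io
import json

# Two made-up records in JSON Lines form: one JSON object per line.
sample_jsonl = io.StringIO(
    '{"rating": 5.0, "title": "Works great", "text": "Easy to install.", "helpful_vote": 0}\n'
    '{"rating": 1.0, "title": "Broke quickly", "text": "Stopped working after a week.", "helpful_vote": 3}\n'
)

def read_jsonl(handle):
    """Parse one JSON object per non-empty line."""
    return [json.loads(line) for line in handle if line.strip()]

reviews = read_jsonl(sample_jsonl)
for review in reviews:
    print(review["rating"], review["title"])
```

Because every line is an independent object, a reader can process the file line by line instead of loading it all at once, which is what makes the format a good fit for distributed tools.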

In [1]:
import os
import numpy as np
from functools import reduce
from pyspark.sql import SparkSession
from pyspark.ml.feature import Tokenizer, StopWordsRemover, CountVectorizer
from pyspark.sql.functions import col, isnan, when, count, expr, sum, size, lower, regexp_replace, min, avg, max
from pyspark.sql.types import StructType, StructField, StringType, FloatType, BooleanType, IntegerType, ArrayType, MapType

In [2]:
# Start a Spark Session
spark = (
    SparkSession.builder
        .master("local[*]")
        .appName("Amazon Review Analysis") 
        .config("spark.ui.showConsoleProgress", "false")
        .config("spark.executor.memory", "4g")
        .config("spark.driver.memory", "4g")
        .getOrCreate()
)

def create_schema(fields):
    return StructType([StructField(name, dtype, True) for name, dtype in fields])

# Only using columns needed for analysis for Reviews
reviews_schema = create_schema([
    ("rating", FloatType()),
    ("title", StringType()),
    ("text", StringType()),
    ("helpful_vote", IntegerType()),
    ("asin", StringType()),
    ("parent_asin", StringType())
])

# Only using columns needed for analysis for Metadata
meta_schema = create_schema([
    ("main_category", StringType()),
    ("title", StringType()),
    ("average_rating", FloatType()),
    ("rating_number", IntegerType()),
    ("price", FloatType()),
    ("categories", ArrayType(StringType())),
    ("parent_asin", StringType())
])

# Point to the location where the .jsonl files are
data_files = {
    "reviews": "./datasets/Appliances.jsonl",
    "meta": "./datasets/meta_Appliances.jsonl",
}

# Use the schema when reading the JSON file for Reviews
df_reviews = spark.read.schema(reviews_schema).json(data_files["reviews"])

# Use the schema when reading the JSON file for Meta
df_meta = spark.read.schema(meta_schema).json(data_files["meta"])



---

### Initial Inspection & Basic Statistics

First, it is important to understand the size of the datasets and the distribution of ratings. Using Spark actions like `show()` and `count()`, we gather some initial statistics about both datasets. We also want to know how many values in each column are missing (`null`, `None`, or `NaN`), in case we need to backfill them or ignore the affected records entirely. Displaying a few rows of each dataset also confirms that the data loaded successfully.

#### Dataset Structure Inspection

For each dataset, we check its dimensions and schema, and preview a few rows.

In [None]:
# Dataset Structure Inspection Function

def structure_inspection(df, name):
    # Print Dimensions
    print(f"{name} Dimensions: {df.count()} rows, {len(df.columns)} columns")
    
    # Print Schema
    print(f"\n{name} Schema:")
    df.printSchema()
    
    # Preview Data
    print(f"\n{name} Preview:")
    df.show(10, truncate=True)



In [4]:
# Inspect Reviews
structure_inspection(df_reviews, "Appliance Reviews")

Appliance Reviews Dimensions: 2128605 rows, 6 columns

Appliance Reviews Schema:
root
 |-- rating: float (nullable = true)
 |-- title: string (nullable = true)
 |-- text: string (nullable = true)
 |-- helpful_vote: integer (nullable = true)
 |-- asin: string (nullable = true)
 |-- parent_asin: string (nullable = true)


Appliance Reviews Preview:
+------+--------------------+--------------------+------------+----------+-----------+
|rating|               title|                text|helpful_vote|      asin|parent_asin|
+------+--------------------+--------------------+------------+----------+-----------+
|   5.0|          Work great|work great. use a...|           0|B01N0TQ0OH| B01N0TQ0OH|
|   5.0|   excellent product|Little on the thi...|           0|B07DD2DMXB| B07DD37QPZ|
|   5.0|     Happy customer!|Quick delivery, f...|           0|B082W3Z9YK| B082W3Z9YK|
|   5.0|       Amazing value|I wasn't sure whe...|           0|B078W2BJY8| B078W2BJY8|
|   5.0|         Dryer parts|Easy to insta

In [5]:
# Inspect Metadata
structure_inspection(df_meta, "Appliance Metadata")

Appliance Metadata Dimensions: 94327 rows, 7 columns

Appliance Metadata Schema:
root
 |-- main_category: string (nullable = true)
 |-- title: string (nullable = true)
 |-- average_rating: float (nullable = true)
 |-- rating_number: integer (nullable = true)
 |-- price: float (nullable = true)
 |-- categories: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- parent_asin: string (nullable = true)


Appliance Metadata Preview:
+--------------------+--------------------+--------------+-------------+-----+--------------------+-----------+
|       main_category|               title|average_rating|rating_number|price|          categories|parent_asin|
+--------------------+--------------------+--------------+-------------+-----+--------------------+-----------+
|Industrial & Scie...|ROVSUN Ice Maker ...|           3.7|           61| NULL|[Appliances, Refr...| B08Z743RRD|
|Tools & Home Impr...|HANSGO Egg Holder...|           4.2|           75| NULL|[Appliances, Part

#### Missing Values Inspection

Missing values are one of the most common headaches in datasets. We need to check for them in both datasets to ensure that we can confidently use the data; otherwise we have to consider backfilling the missing data or not using it at all.
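
The counting rule can be illustrated in plain Python. This is a sketch on made-up records, not the Spark implementation: `is_missing` and `count_missing` are hypothetical names, and treating `NaN` as missing is an assumption added here for completeness.

```python
import math

def is_missing(value):
    """Treat None, NaN, empty strings, and lists containing empty strings as missing."""
    if value is None:
        return True
    if isinstance(value, float) and math.isnan(value):
        return True
    if isinstance(value, str):
        return value == ""
    if isinstance(value, list):
        return any(item == "" for item in value)
    return False

def count_missing(records, columns):
    """Return {column: number of records whose value is missing}."""
    return {c: sum(is_missing(r.get(c)) for r in records) for c in columns}

# Made-up metadata rows for illustration only.
rows = [
    {"title": "Ice Maker", "price": None, "categories": ["Appliances"]},
    {"title": "", "price": 19.99, "categories": ["Appliances", ""]},
    {"title": "Egg Holder", "price": float("nan"), "categories": []},
]
print(count_missing(rows, ["title", "price", "categories"]))
```

Keeping the per-type rules in one predicate makes it easy to see which kinds of "empty" are counted for each column type.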

In [11]:
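def get_nulls_counter(df, col_dtypes):
    """Count missing values per column and return them as a single-row DataFrame."""
    null_dfs = []
    for col_dtype, cols in col_dtypes.items():
        # Skip types with no columns: an empty select() keeps every row and would inflate the cross join
        if not cols:
            continue
        if col_dtype in ["float", "integer"]:
            # Numeric columns: missing means null
            null_dfs.append(df.select([sum(col(c).isNull().cast("int")).alias(c) for c in cols]))
        elif col_dtype == "string":
            # String columns: missing means null or an empty string
            null_dfs.append(df.select([sum((col(c).isNull() | (col(c) == "")).cast("int")).alias(c) for c in cols]))
        elif col_dtype == "array":
            # Array columns: missing means null or containing an empty-string element
            null_dfs.append(df.select([sum((col(c).isNull() | expr(f"exists({c}, x -> x == '')")).cast("int")).alias(c) for c in cols]))

    # Each per-type result has exactly one row, so cross joins place them side by side
    return reduce(lambda df1, df2: df1.crossJoin(df2), null_dfs)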


In [17]:
def print_missing_values(df, name):
    col_dtypes = {
        "float": [c for c in df.columns if df.schema[c].dataType.simpleString() == "float"],
        "string": [c for c in df.columns if df.schema[c].dataType.simpleString() == "string"],
        "integer": [c for c in df.columns if df.schema[c].dataType.simpleString() == "int"],
        "array": [c for c in df.columns if df.schema[c].dataType.simpleString().startswith("array")]
    }
    
    null_counter = get_nulls_counter(df, col_dtypes)
    
    print(f"{name} Counted Missing Values per Column:")
    null_counter.show(1)

In [18]:
print_missing_values(df_reviews, "Appliance Reviews")

Appliance Reviews Counted Missing Values per Column:
+------+-----+----+----+-----------+------------+
|rating|title|text|asin|parent_asin|helpful_vote|
+------+-----+----+----+-----------+------------+
|     0|    0|  95|   0|          0|           0|
+------+-----+----+----+-----------+------------+
only showing top 1 row



In [19]:
print_missing_values(df_meta, "Appliance Metadata")

Appliance Metadata Counted Missing Values per Column:
+--------------+-----+-------------+-----+-----------+-------------+----------+
|average_rating|price|main_category|title|parent_asin|rating_number|categories|
+--------------+-----+-------------+-----+-----------+-------------+----------+
|             0|47601|         4676|    9|          0|            0|         0|
+--------------+-----+-------------+-----+-----------+-------------+----------+



#### Duplicates

Detecting duplicate rows is important so that repeated reviews do not skew the data. Every row should be unique. Duplicates will be handled during data cleaning.

In [25]:
def duplicate_data(df, name):
    total_count = df.count()
    distinct_count = df.distinct().count()
    duplicate_count = total_count - distinct_count
    print(f"{name} Duplicate Data: {duplicate_count}\n(Total: {total_count}, Distinct: {distinct_count})")       
    

In [21]:
duplicate_data(df_reviews, "Appliance Reviews")


Appliance Reviews Duplicate Data: 29492
(Total: 2128605, Distinct: 2099113)


In [26]:
duplicate_data(df_meta, "Appliance Metadata")

Appliance Metadata Duplicate Data: 0
(Total: 94327, Distinct: 94327)


#### Statistical Data

We examine the statistical summaries for both the Reviews and Metadata. This helps us spot outliers and check if there are unexpected ranges that we have to look out for.

In [27]:
print(f"Appliance Reviews Statistical Summary")
df_reviews.describe().show()

Appliance Reviews Statistical Summary




+-------+------------------+--------------------+--------------------+------------------+--------------------+--------------------+
|summary|            rating|               title|                text|      helpful_vote|                asin|         parent_asin|
+-------+------------------+--------------------+--------------------+------------------+--------------------+--------------------+
|  count|           2128605|             2128605|             2128605|           2128605|             2128605|             2128605|
|   mean| 4.221502345432807|                 NaN|1.0294495574587156E9|0.9288867591685634|1.5550635848728814E9|1.5550635848728814E9|
| stddev|1.3808261737697332|                 NaN|1.064196457532015...|12.526794316769578|1.4548141211071746E9|1.4548141211071746E9|
|    min|               1.0|                   !|                    |                 0|          0967805929|          0967805929|
|    max|               5.0|🧊🥶 AMAZING 🤩  ...|🧐doesn’t filter ...|          

In [28]:
print(f"Appliance Metadata Statistical Summary")
df_meta.describe().show()

Appliance Metadata Statistical Summary
+-------+--------------+--------------------+------------------+------------------+------------------+--------------------+
|summary| main_category|               title|    average_rating|     rating_number|             price|         parent_asin|
+-------+--------------+--------------------+------------------+------------------+------------------+--------------------+
|  count|         89651|               94327|             94327|             94327|             46726|               94327|
|   mean|          NULL|  1.1113368793875E10| 4.118858857941276|136.36790102515715|  86.4799539034291|4.0776468745555553E9|
| stddev|          NULL|3.142601103602334...|0.8640397544170944| 977.5160999553561|325.31839674168526| 3.745278366512328E9|
|    min|AMAZON FASHION|                    |               1.0|                 1|              0.01|          0967805929|
|    max|   Video Games|𝟮𝟬𝟮𝟯𝙪𝙥𝙜𝙧?...|               5.0|             90203|          21095.62

---

### Textual Data Exploration

The main predictive feature of the model will likely be the `text` column in the Appliance Reviews dataset, so it is important to understand its quality. We want to answer questions such as: are the reviews too short or too long? Do they contain descriptive terms or just a few words? We also need to consider whether any of the text is in languages other than English.

We start by analyzing the length of the reviews in terms of word count. This will help guide our approach. If the text is too short, we may need to rely more on ratings or metadata. If the text is rich, a text-based sentiment analysis may work well.

In [None]:
# Replace missing review text with an empty string so every row has a string to tokenize
df_reviews = df_reviews.fillna({"text": ""})
df_reviews = df_reviews.withColumn("text", col("text").cast(StringType()))

# Lowercase the text and split it on whitespace into a list of raw tokens
tokenizer_reviews = Tokenizer(inputCol="text", outputCol="words_raw")
df_tokenizer_reviews = tokenizer_reviews.transform(df_reviews)

# Number of tokens per review, i.e. the review length in words (not characters)
df_tokenizer_reviews = df_tokenizer_reviews.withColumn("word_count", size(col("words_raw")))

print("Appliance Reviews with Raw Words and Word Count included")
df_tokenizer_reviews.show(10, truncate=True)

print("\nMinimum, Average, and Maximum word count of the Appliance Reviews")
df_tokenizer_reviews.select(min("word_count"), avg("word_count"), max("word_count")).show()

---

### Data Cleaning

In [None]:
spark.stop()