# SENG 550 Final Project
### Use the Amazon Appliances reviews dataset to develop a classifier or sentiment analyzer that can predict whether a given review is favorable or not

## Abstract

Our project uses the Amazon Appliances [reviews dataset](https://amazon-reviews-2023.github.io/) to develop a sentiment analyzer classification model. By combining star ratings and textual content, the model is trained to predict whether a review leans positively or negatively towards a product. This approach offers a way to quickly grasp the general sentiment of a product and assist shoppers in filtering through large volumes of product feedback more efficiently.

## Introduction

### Selected Problem

The problem is to distinguish between favourable and unfavourable appliance reviews based on their text and accompanying star ratings.

### Why is it Important?

Reading review after review on a product quickly becomes a burden. It is also easy to misinterpret the mood behind a set of online comments, which can lead to poor purchase decisions. A quick, automated sentiment indicator can ease that burden and support a more neutral decision-making process.

### What have Others Done in this Space?

Researchers have performed [sentiment analysis](https://medium.com/@nafisaidris413/a-beginners-guide-for-product-review-sentiment-analysis-0de1f451167d) using Machine Learning and Natural Language Processing to automatically classify reviews as positive, negative, or neutral. Sentiment analysis has been applied not only to product reviews but also to [social media](https://buffer.com/social-media-terms/sentiment-analysis) to determine how people perceive and talk about products and brands. This suggests that data-driven classifiers can provide sentiment scores that assist people in their daily lives, whether it's to determine how their personal brand is viewed or to make an informed purchase through product reviews.

### Existing gaps?

Current solutions rely on product ratings to give consumers a sense of trust and quality when making purchasing decisions. This can be seen by going to any Amazon product and checking the reviews. Some reviews are informative and list many positives, yet give the product fewer than 5 stars, while others give 5 stars without being informative at all. Other times the reviews are clearly biased or the customer who leaves the review is disgruntled, leading to a 1- or 2-star rating. Using a combination of product star rating and textual review content, we attempt to reveal patterns in product reviews that a rating alone might miss.

### Data Analysis Questions

1. Do text-based features add value beyond just a numerical rating?
2. Are there certain words which portray a stronger positive or negative sentiment?
3. How will adding text preprocessing impact accuracy?
4. Which models work best with this data?

### What is Proposed

We are proposing a text classification pipeline that merges a product's star rating and textual features.

### What are your Main Findings?

Our analysis aims to determine customer opinion on various products within the Appliances category of Amazon's online store.

## Methodology

### Exploration of Data Features and Refinement of Feature Space

In this section, we focus on understanding the raw data from the [datasets](https://amazon-reviews-2023.github.io/) and transforming it into a format suitable for model training. We begin by loading the Amazon Appliances reviews dataset and its corresponding metadata. We then explore the structure of the data, examine the distribution of the fields we are interested in (such as ratings), and assess the overall quality of the text reviews associated with the products. Once we have a thorough understanding, we apply a series of preprocessing techniques to clean and refine the text data. The goal is to develop a set of features that can be fed into a machine learning model for sentiment classification.

#### Key Steps

1. **Loading the Data:**
We will load `Appliances.jsonl` (reviews) and `meta_Appliances.jsonl` (metadata) using Apache Spark to avoid memory overload.

2. **Initial Inspection and Basic Statistics:**
We will look at a few sample rows, check data types, count missing values, and examine distributions.

3. **Textual Data Exploration:**
We consider the nature of each review, such as its length, character composition, and common words. This should help guide our text cleaning decisions.

4. **Data Cleaning:**
We clean the text by lowercasing it, removing punctuation, and stripping leading and trailing whitespace.

5. **Feature Transformation:**
We will use Spark Machine Learning's feature extraction tools to convert raw text into numeric features that are typically suitable for machine learning models.
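
As a rough illustration of the cleaning step (item 4 above), the pure-Python sketch below lowercases text, strips punctuation, and trims whitespace. The helper name `clean_text` and the sample string are hypothetical and are not part of the notebook; in Spark the same idea would typically be expressed with column functions such as `lower` and `regexp_replace`.

```python
import re

def clean_text(text: str) -> str:
    """Lowercase, drop punctuation, and normalise whitespace (illustrative only)."""
    lowered = text.lower()
    # Remove every character that is neither a word character nor whitespace
    no_punct = re.sub(r"[^\w\s]", "", lowered)
    # Collapse runs of whitespace and strip leading/trailing spaces
    return re.sub(r"\s+", " ", no_punct).strip()

# Hypothetical review text used only to demonstrate the helper
sample = "  Fit, look, & WORKS great!  "
print(clean_text(sample))
```

Removing punctuation before tokenizing keeps tokens such as `fit,` and `fit` from being counted as different words.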

#### Load the Data

The datasets are provided in `*.jsonl` format, which means each line is a separate JSON object representing a single review or product's metadata. We will use `SparkSession` to read the files which will handle the data in a distributed manner, effectively avoiding a potential kernel crash. Spark's lazy evaluation, transformations, and actions will manage memory usage of the large `.jsonl` files.

The two main data sources:
1. **Review File (`Appliances.jsonl`):**
Contains user-level reviews with fields such as `rating`, `title`, `text`, and `helpful_vote`.

2. **Metadata File (`meta_Appliances.jsonl`):**
Contains product-level information like `main_category`, `average_rating`, and `price`.
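
To make the `.jsonl` layout concrete, the sketch below parses two review lines with Python's standard `json` module. The field names follow the review file described above, but the values are made up for illustration; the notebook itself reads the files with Spark rather than this way.

```python
import json

# Two hypothetical JSONL lines; field names follow the review file, values are made up
jsonl_text = (
    '{"rating": 5.0, "title": "Works great", "text": "Easy to install.", "helpful_vote": 0}\n'
    '{"rating": 2.0, "title": "Disappointed", "text": "Stopped working.", "helpful_vote": 3}\n'
)

# Each non-empty line is parsed independently as one JSON object
records = [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]

for record in records:
    print(record["rating"], record["title"])
```

Because every line is a complete JSON object, the file can be split by line and processed in parallel, which is what makes the format convenient for a distributed reader.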

In [278]:
import os
import numpy as np
from functools import reduce
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, isnan, when, count, expr, sum, size, lower, regexp_replace, min, avg, max, length, stddev, substring
from pyspark.sql.types import StructType, StructField, StringType, FloatType, IntegerType, ArrayType
from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, StopWordsRemover, CountVectorizer, IDF
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator
from pyspark.ml.evaluation import BinaryClassificationEvaluator, MulticlassClassificationEvaluator


In [279]:
# Start a Spark Session
spark = (
    SparkSession.builder
        .master("local[*]")     
        .appName("Amazon Review Analysis")
        .config("spark.driver.memory", "4g")
        .config("spark.executor.memory", "4g")
        .config("spark.sql.autoBroadcastJoinThreshold", -1)
        .getOrCreate()
)

def create_schema(fields):
    return StructType([StructField(name, dtype, True) for name, dtype in fields])

# Only using columns needed for analysis for Reviews
reviews_schema = create_schema([
    ("rating", FloatType()),
    ("title", StringType()),
    ("text", StringType()),
    ("helpful_vote", IntegerType()),
    ("asin", StringType()),
    ("parent_asin", StringType())
])

# Only using columns needed for analysis for Metadata
meta_schema = create_schema([
    ("main_category", StringType()),
    ("title", StringType()),
    ("average_rating", FloatType()),
    ("rating_number", IntegerType()),
    ("price", FloatType()),
    ("categories", ArrayType(StringType())),
    ("parent_asin", StringType())
])

# Point to the location where the .jsonl files are
data_files = {
    "reviews": "./datasets/Appliances.jsonl",
    "meta": "./datasets/meta_Appliances.jsonl"
    
}

# Use the schema when reading the JSON file for Reviews
df_reviews = spark.read.schema(reviews_schema).json(data_files["reviews"])

# Use the schema when reading the JSON file for Meta
df_meta = spark.read.schema(meta_schema).json(data_files["meta"])

---

### Initial Inspection & Basic Statistics

First, it is important to understand the size of each dataset and the distribution of ratings. Using Spark actions like `show()` and `count()`, we compute some initial statistics about both datasets. We also want to know how many values in each column are `null`, `None`, or `NaN`, in case we need to backfill them or ignore those records entirely. Displaying a few rows of each dataset also confirms that the datasets were loaded successfully.

#### Dataset Structure Inspection

For each dataset, we check its dimensions and schema, and preview the data to understand its structure.

In [280]:
# Dataset Structure Inspection Function

def structure_inspection(df, name):
    # Print Dimensions
    print(f"{name} Dimensions: {df.count()} rows, {len(df.columns)} columns")
    
    # Print Schema
    print(f"\n{name} Schema:")
    df.printSchema()
    
    # Preview Data
    print(f"\n{name} Preview:")
    df.show(10, truncate=True)

In [281]:
# Inspect Reviews
structure_inspection(df_reviews, "Appliance Reviews")

Appliance Reviews Dimensions: 2128605 rows, 6 columns

Appliance Reviews Schema:
root
 |-- rating: float (nullable = true)
 |-- title: string (nullable = true)
 |-- text: string (nullable = true)
 |-- helpful_vote: integer (nullable = true)
 |-- asin: string (nullable = true)
 |-- parent_asin: string (nullable = true)


Appliance Reviews Preview:
+------+--------------------+--------------------+------------+----------+-----------+
|rating|               title|                text|helpful_vote|      asin|parent_asin|
+------+--------------------+--------------------+------------+----------+-----------+
|   5.0|          Work great|work great. use a...|           0|B01N0TQ0OH| B01N0TQ0OH|
|   5.0|   excellent product|Little on the thi...|           0|B07DD2DMXB| B07DD37QPZ|
|   5.0|     Happy customer!|Quick delivery, f...|           0|B082W3Z9YK| B082W3Z9YK|
|   5.0|       Amazing value|I wasn't sure whe...|           0|B078W2BJY8| B078W2BJY8|
|   5.0|         Dryer parts|Easy to insta

In [282]:
# Inspect Metadata
structure_inspection(df_meta, "Appliance Metadata")

Appliance Metadata Dimensions: 94327 rows, 7 columns

Appliance Metadata Schema:
root
 |-- main_category: string (nullable = true)
 |-- title: string (nullable = true)
 |-- average_rating: float (nullable = true)
 |-- rating_number: integer (nullable = true)
 |-- price: float (nullable = true)
 |-- categories: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- parent_asin: string (nullable = true)


Appliance Metadata Preview:
+--------------------+--------------------+--------------+-------------+-----+--------------------+-----------+
|       main_category|               title|average_rating|rating_number|price|          categories|parent_asin|
+--------------------+--------------------+--------------+-------------+-----+--------------------+-----------+
|Industrial & Scie...|ROVSUN Ice Maker ...|           3.7|           61| NULL|[Appliances, Refr...| B08Z743RRD|
|Tools & Home Impr...|HANSGO Egg Holder...|           4.2|           75| NULL|[Appliances, Part

#### Missing Values Inspection

Missing values are one of the most common headaches in datasets. We need to check for them in both datasets to ensure that we can confidently use the data; otherwise we have to consider backfilling the missing data or not using it at all.
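
The counting rule can be sketched in plain Python before looking at the Spark version. This is a simplified illustration, not the notebook's implementation: the helper `count_missing` and the sample rows are hypothetical, and the sketch applies one combined rule to every field (a value is missing if it is `None`, an empty string, or a list containing an empty string).

```python
def count_missing(records, fields):
    """Count values that are None, an empty string, or a list containing an empty string."""
    missing = {field: 0 for field in fields}
    for record in records:
        for field in fields:
            value = record.get(field)
            if value is None or value == "":
                missing[field] += 1
            elif isinstance(value, list) and "" in value:
                missing[field] += 1
    return missing

# Hypothetical rows for illustration only
rows = [
    {"rating": 5.0, "text": "Great", "categories": ["Appliances"]},
    {"rating": None, "text": "", "categories": ["Appliances", ""]},
    {"rating": 3.0, "text": "Okay", "categories": None},
]
print(count_missing(rows, ["rating", "text", "categories"]))
```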

In [283]:
def get_nulls_counter(df, col_dtypes):
    null_dfs = []
    for col_dtype, cols in col_dtypes.items():
        if col_dtype in ["float", "integer"]:
            null_dfs.append(df.select([sum(col(c).isNull().cast("int")).alias(c) for c in cols]))
        elif col_dtype == "string":
            null_dfs.append(df.select([sum((col(c).isNull() | (col(c) == "")).cast("int")).alias(c) for c in cols]))
        elif col_dtype == "array":
            null_dfs.append(df.select([sum((col(c).isNull() | expr(f"exists({c}, x -> x == '')")).cast("int")).alias(c) for c in cols]))

    # Combine all null DataFrames using reduce
    return reduce(lambda df1, df2: df1.crossJoin(df2), null_dfs)


In [284]:
def print_missing_values(df, name):
    col_dtypes = {
        "float": [c for c in df.columns if df.schema[c].dataType.simpleString() == "float"],
        "string": [c for c in df.columns if df.schema[c].dataType.simpleString() == "string"],
        "integer": [c for c in df.columns if df.schema[c].dataType.simpleString() == "int"],
        "array": [c for c in df.columns if df.schema[c].dataType.simpleString().startswith("array")]
    }
    
    null_counter = get_nulls_counter(df, col_dtypes)
    
    print(f"{name} Counted Missing Values per Column:")
    null_counter.show(1)

In [285]:
print_missing_values(df_reviews, "Appliance Reviews")

Appliance Reviews Counted Missing Values per Column:



+------+-----+----+----+-----------+------------+
|rating|title|text|asin|parent_asin|helpful_vote|
+------+-----+----+----+-----------+------------+
|     0|    0|  95|   0|          0|           0|
+------+-----+----+----+-----------+------------+
only showing top 1 row



                                                                                

In [286]:
print_missing_values(df_meta, "Appliance Metadata")

Appliance Metadata Counted Missing Values per Column:
+--------------+-----+-------------+-----+-----------+-------------+----------+
|average_rating|price|main_category|title|parent_asin|rating_number|categories|
+--------------+-----+-------------+-----+-----------+-------------+----------+
|             0|47601|         4676|    9|          0|            0|         0|
+--------------+-----+-------------+-----+-----------+-------------+----------+



#### Duplicates

Checking for duplicate data is important so that repeated rows do not skew the results. Every row should be unique. Duplicate rows are handled during data cleanup.

In [287]:
def duplicate_data(df, name):
    total_count = df.count()
    distinct_count = df.distinct().count()
    duplicate_count = total_count - distinct_count
    print(f"{name} Duplicate Data: {duplicate_count}\n(Total: {total_count}, Distinct: {distinct_count})")       
    

In [288]:
duplicate_data(df_reviews, "Appliance Reviews")



Appliance Reviews Duplicate Data: 29492
(Total: 2128605, Distinct: 2099113)


                                                                                

In [289]:
duplicate_data(df_meta, "Appliance Metadata")


Appliance Metadata Duplicate Data: 0
(Total: 94327, Distinct: 94327)


                                                                                

#### Statistical Data

We examine the statistical summaries for both the Reviews and Metadata. This helps us spot outliers and check if there are unexpected ranges that we have to look out for.

In [290]:
print("Appliance Reviews Statistical Summary")
df_reviews.describe().show(truncate=True)

Appliance Reviews Statistical Summary



+-------+------------------+--------------------+--------------------+------------------+--------------------+--------------------+
|summary|            rating|               title|                text|      helpful_vote|                asin|         parent_asin|
+-------+------------------+--------------------+--------------------+------------------+--------------------+--------------------+
|  count|           2128605|             2128605|             2128605|           2128605|             2128605|             2128605|
|   mean| 4.221502345432807|                 NaN|1.0294495574587156E9|0.9288867591685634|1.5550635848728814E9|1.5550635848728814E9|
| stddev|1.3808261737697285|                 NaN|1.064196457532015...|12.526794316769463|1.4548141211071749E9|1.4548141211071749E9|
|    min|               1.0|                   !|                    |                 0|          0967805929|          0967805929|
|    max|               5.0|🧊🥶 AMAZING 🤩  ...|🧐doesn’t filter ...|          

                                                                                

In [291]:
print("Appliance Metadata Statistical Summary")
df_meta.describe().show(truncate=True)

Appliance Metadata Statistical Summary



+-------+--------------+--------------------+------------------+------------------+------------------+--------------------+
|summary| main_category|               title|    average_rating|     rating_number|             price|         parent_asin|
+-------+--------------+--------------------+------------------+------------------+------------------+--------------------+
|  count|         89651|               94327|             94327|             94327|             46726|               94327|
|   mean|          NULL|  1.1113368793875E10| 4.118858857941276|136.36790102515715|  86.4799539034291|4.0776468745555553E9|
| stddev|          NULL|3.142601103602334E10|0.8640397544170938| 977.5160999553573|325.31839674168475| 3.745278366512328E9|
|    min|AMAZON FASHION|                    |               1.0|                 1|              0.01|          0967805929|
|    max|   Video Games|𝟮𝟬𝟮𝟯𝙪𝙥𝙜𝙧?...|               5.0|             90203|          21095.62|          B0CKR66M1V|
+-------+-------

                                                                                

---

### Textual Data Exploration

The main predictive feature of the model will likely be the `text` column in the Appliance Reviews dataset, so it is important to understand its quality. We look for answers to questions such as: are the reviews too short or too long? Do they contain descriptive terms or just a few words? We also need to consider whether there is text in languages other than English.

We start by analyzing the length of the reviews in terms of character length and word count. If the text is too short, we may need to rely more on ratings or metadata; if it is rich, text-based sentiment analysis is likely to work well.
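
The kind of summary we are after can be sketched in plain Python on a handful of reviews. The review texts below are hypothetical, and splitting on whitespace only approximates Spark's `Tokenizer`; the standard deviation is the sample standard deviation, which is also what Spark's `stddev` reports.

```python
import statistics

# Hypothetical review texts for illustration only
reviews = [
    "Works great, easy to install",
    "Stopped working after two weeks and support never replied",
    "Good value",
]

char_lengths = [len(text) for text in reviews]
# Lowercase and split on whitespace, roughly what a whitespace tokenizer does
word_counts = [len(text.lower().split()) for text in reviews]

summary = {
    "min_word_count": min(word_counts),
    "avg_word_count": statistics.mean(word_counts),
    "stddev_word_count": statistics.stdev(word_counts),  # sample standard deviation
    "max_word_count": max(word_counts),
}
print(char_lengths, summary)
```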

In [292]:
# Fill null text fields and cast text to string
df_reviews = df_reviews.fillna({"text": ""}).withColumn("text", col("text").cast("string"))

# Tokenize text and calculate character length and word count
tokenizer = Tokenizer(inputCol="text", outputCol="words_raw")
df_tokenizer_reviews = (
    tokenizer
    .transform(df_reviews)
    .withColumn("character_length", length(col("text")))
    .withColumn("word_count", size(col("words_raw")))
    .filter(col("character_length") > 9)
)

# Show summary statistics for word count and character length
print("Appliance Reviews with Word Count and Character Length Included")
df_tokenizer_reviews.show(10, truncate=True)

print("\nSummary Statistics for Appliance Reviews:")
df_tokenizer_reviews.select(
    min("character_length").alias("min_char_length"),
    avg("character_length").alias("avg_char_length"),
    stddev("character_length").alias("stddev_char_length"),
    max("character_length").alias("max_char_length"),
    min("word_count").alias("min_word_count"),
    avg("word_count").alias("avg_word_count"),
    stddev("word_count").alias("stddev_word_count"),
    max("word_count").alias("max_word_count")
).show()

Appliance Reviews with Word Count and Character Length Included
+------+--------------------+--------------------+------------+----------+-----------+--------------------+----------------+----------+
|rating|               title|                text|helpful_vote|      asin|parent_asin|           words_raw|character_length|word_count|
+------+--------------------+--------------------+------------+----------+-----------+--------------------+----------------+----------+
|   5.0|          Work great|work great. use a...|           0|B01N0TQ0OH| B01N0TQ0OH|[work, great., us...|              37|         8|
|   5.0|   excellent product|Little on the thi...|           0|B07DD2DMXB| B07DD37QPZ|[little, on, the,...|              23|         5|
|   5.0|     Happy customer!|Quick delivery, f...|           0|B082W3Z9YK| B082W3Z9YK|[quick, delivery,...|              32|         5|
|   5.0|       Amazing value|I wasn't sure whe...|           0|B078W2BJY8| B078W2BJY8|[i, wasn't, sure,...|             



+---------------+------------------+------------------+---------------+--------------+-----------------+-----------------+--------------+
|min_char_length|   avg_char_length|stddev_char_length|max_char_length|min_word_count|   avg_word_count|stddev_word_count|max_word_count|
+---------------+------------------+------------------+---------------+--------------+-----------------+-----------------+--------------+
|             10|175.58352814723935|283.32226996999367|          30004|             1|33.16792796286858|53.12268670108261|          3740|
+---------------+------------------+------------------+---------------+--------------+-----------------+-----------------+--------------+



                                                                                

---

### Data Cleaning

Data cleaning is an important part of the process: it keeps duplicated rows and rows with missing information out of the Machine Learning model. This keeps the training data from being skewed, which in turn helps limit bias and variance in the model.

The `Code Block` below removes duplicate rows and reviews with empty `text` from the `Appliances.jsonl` dataset.

In [293]:
# Removing Duplicates
df_reviews_clean = df_reviews.dropDuplicates()

df_reviews_clean = df_reviews_clean.filter(
    (col("text").isNotNull()) & (col("text") != "")
)

print("Original Row Count:", df_reviews.count())
print("Cleaned Row Count:", df_reviews_clean.count())


Original Row Count: 2128605



Cleaned Row Count: 2099018


                                                                                

The following `Code Block` shows a sample of the cleaned dataset. It likely won't look much different from the previous sample, but we know there are no duplicates in it.

In [294]:
print("Cleaned Reviews Dataset:")
df_reviews_clean.show(10, truncate=True)

Cleaned Reviews Dataset:




+------+--------------------+--------------------+------------+----------+-----------+
|rating|               title|                text|helpful_vote|      asin|parent_asin|
+------+--------------------+--------------------+------------+----------+-----------+
|   5.0|Easy setup and wo...|I love how easy t...|           0|B00UXG4WR8| B00UXG4WR8|
|   5.0|              buy it|fit, look, & work...|           0|B001TH7GZU| B001TH7GZU|
|   5.0|             Filters|Yep, got what I n...|           0|B07CV7VNL8| B07CV7VNL8|
|   5.0|Flawless Version ...|We don't have any...|           2|B09649DDTN| B0C9TTZW3K|
|   5.0|A great dish. A g...|The Apusafe MWF w...|           0|B0892F62TR| B0892F62TR|
|   2.0|I wanted to love ...|I bought this to ...|           2|B08PYPQQ3Z| B08PYPQQ3Z|
|   5.0|   Holds jumbo eggs!|This particular e...|           4|B01EVRIK2C| B07MBQW54M|
|   5.0|    I have ice again|This solved my NO...|           0|B01GXOPMW2| B01GXOPMW2|
|   5.0|Useful to rinse q...|Quinoa is too 

                                                                                

---

### Feature Engineering

This step is similar to the ***Textual Data Exploration*** step, with the exception that it is performed on the cleaned data.

This step is used to transform the `text` column of the reviews dataset into numerical features that a machine learning model can understand. We used ***TF-IDF (Term Frequency-Inverse Document Frequency)*** to represent the importance of words in each review while reducing the influence of commonly used words. This step involves tokenizing the text, removing stopwords, and applying TF-IDF transformation to create a feature vector for each review.
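
To show what the TF-IDF weighting computes, here is a small pure-Python sketch on three hypothetical, already tokenized reviews. It assumes the smoothed inverse document frequency `log((N + 1) / (df + 1))` described in the Spark MLlib documentation, where `N` is the number of reviews and `df` is the number of reviews containing the term; the names `docs`, `doc_freq`, and `tfidf` are illustrative only.

```python
import math
from collections import Counter

# Hypothetical tokenised reviews (stop words already removed), for illustration only
docs = [
    ["love", "easy", "filter"],
    ["filter", "leaks", "leaks"],
    ["easy", "install"],
]

n_docs = len(docs)
# Document frequency: number of reviews in which each term appears at least once
doc_freq = Counter(term for doc in docs for term in set(doc))

def tfidf(doc):
    """Return {term: tf * idf} using the smoothed IDF log((N + 1) / (df + 1))."""
    term_freq = Counter(doc)
    return {
        term: count * math.log((n_docs + 1) / (doc_freq[term] + 1))
        for term, count in term_freq.items()
    }

for doc in docs:
    print(tfidf(doc))
```

In this toy corpus, `leaks` appears twice in one review and nowhere else, so it receives a higher weight than `filter`, which appears in two of the three reviews.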

First we strip HTML tags from the text and then `Tokenize` it, as shown in the following `Code Block`:

In [295]:
# Remove HTML tags from the 'text' column using regexp_replace
df_reviews_clean = df_reviews_clean.withColumn(
    "text", regexp_replace(col("text"), "<[^>]+>", "")  # Remove anything between < and >
)

# Tokenization - Split the 'text' column into individual words
tokenizer = Tokenizer(inputCol="text", outputCol="words_raw")
df_reviews_tokenized = tokenizer.transform(df_reviews_clean)

df_reviews_tokenized.show(10, truncate=True)



+------+--------------------+--------------------+------------+----------+-----------+--------------------+
|rating|               title|                text|helpful_vote|      asin|parent_asin|           words_raw|
+------+--------------------+--------------------+------------+----------+-----------+--------------------+
|   5.0|Easy setup and wo...|I love how easy t...|           0|B00UXG4WR8| B00UXG4WR8|[i, love, how, ea...|
|   5.0|              buy it|fit, look, & work...|           0|B001TH7GZU| B001TH7GZU|[fit,, look,, &, ...|
|   5.0|             Filters|Yep, got what I n...|           0|B07CV7VNL8| B07CV7VNL8|[yep,, got, what,...|
|   5.0|Flawless Version ...|We don't have any...|           2|B09649DDTN| B0C9TTZW3K|[we, don't, have,...|
|   5.0|A great dish. A g...|The Apusafe MWF w...|           0|B0892F62TR| B0892F62TR|[the, apusafe, mw...|
|   2.0|I wanted to love ...|I bought this to ...|           2|B08PYPQQ3Z| B08PYPQQ3Z|[i, bought, this,...|
|   5.0|   Holds jumbo eggs!

                                                                                

After the text has been tokenized (see the `words_raw` column), we want to remove common words like _"the"_, _"and"_, _"I"_, etc. To accomplish this, we use Spark's `StopWordsRemover` transformer, applied in the following `Code Block`:

In [296]:
# Stopwords Removal
stopwords_remover = StopWordsRemover(inputCol="words_raw", outputCol="words_cleaned")
df_reviews_cleaned_words = stopwords_remover.transform(df_reviews_tokenized)

df_reviews_cleaned_words.show(10, truncate=True)



+------+--------------------+--------------------+------------+----------+-----------+--------------------+--------------------+
|rating|               title|                text|helpful_vote|      asin|parent_asin|           words_raw|       words_cleaned|
+------+--------------------+--------------------+------------+----------+-----------+--------------------+--------------------+
|   5.0|Easy setup and wo...|I love how easy t...|           0|B00UXG4WR8| B00UXG4WR8|[i, love, how, ea...|[love, easy, filt...|
|   5.0|              buy it|fit, look, & work...|           0|B001TH7GZU| B001TH7GZU|[fit,, look,, &, ...|[fit,, look,, &, ...|
|   5.0|             Filters|Yep, got what I n...|           0|B07CV7VNL8| B07CV7VNL8|[yep,, got, what,...|[yep,, got, needed.]|
|   5.0|Flawless Version ...|We don't have any...|           2|B09649DDTN| B0C9TTZW3K|[we, don't, have,...|[ice, maker, free...|
|   5.0|A great dish. A g...|The Apusafe MWF w...|           0|B0892F62TR| B0892F62TR|[the, apusa

                                                                                

`StopWordsRemover` adds a new column, `words_cleaned`, that contains the tokens with the common words removed.
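
The effect of stop-word removal can be reproduced with a short list comprehension. The stop-word set below is a tiny hypothetical sample, not Spark's default English list, and `remove_stop_words` is an illustrative helper rather than part of the notebook.

```python
# A tiny hypothetical stop-word list; Spark's default English list is much longer
STOP_WORDS = {"i", "the", "and", "how", "what", "we", "have", "to", "is"}

def remove_stop_words(tokens):
    """Keep only tokens that are not in the stop-word list (case-insensitive)."""
    return [token for token in tokens if token.lower() not in STOP_WORDS]

tokens = ["i", "love", "how", "easy", "the", "filter", "is", "to", "fit,"]
print(remove_stop_words(tokens))
```

Note that a token with attached punctuation, such as `fit,`, passes through unchanged because it is matched as a whole string. The same behaviour is visible in the `words_cleaned` output above.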

Next, we needed to make these words meaningful to the Machine Learning model. To do this, we converted the words into a term frequency vector. Because the `Appliances.jsonl` dataset is so large, we capped the vocabulary size at 5000 to save computing resources. We used Spark's `CountVectorizer` to turn the words into a frequency vector in the following `Code Block`:

In [297]:
# CountVectorizer - Convert words into a term frequency vector
count_vectorizer = CountVectorizer(inputCol="words_cleaned", outputCol="raw_features", vocabSize=5000)
cv_model = count_vectorizer.fit(df_reviews_cleaned_words)
df_reviews_vectorized = cv_model.transform(df_reviews_cleaned_words)

df_reviews_vectorized.show(10, truncate=True)


+------+--------------------+--------------------+------------+----------+-----------+--------------------+--------------------+--------------------+
|rating|               title|                text|helpful_vote|      asin|parent_asin|           words_raw|       words_cleaned|        raw_features|
+------+--------------------+--------------------+------------+----------+-----------+--------------------+--------------------+--------------------+
|   5.0|Easy setup and wo...|I love how easy t...|           0|B00UXG4WR8| B00UXG4WR8|[i, love, how, ea...|[love, easy, filt...|(5000,[2,9,26,31,...|
|   5.0|              buy it|fit, look, & work...|           0|B001TH7GZU| B001TH7GZU|[fit,, look,, &, ...|[fit,, look,, &, ...|(5000,[1,16,155,5...|
|   5.0|             Filters|Yep, got what I n...|           0|B07CV7VNL8| B07CV7VNL8|[yep,, got, what,...|[yep,, got, needed.]|(5000,[41,310],[1...|
|   5.0|Flawless Version ...|We don't have any...|           2|B09649DDTN| B0C9TTZW3K|[we, don't, ha

                                                                                

Now that we have the term frequencies in the `raw_features` column, we can use Spark's `IDF` estimator to apply Inverse Document Frequency weighting. This down-weights words that appear in many reviews, so that the more distinctive words carry more weight when classifying each review as favorable or not.

In [298]:
# TF-IDF - Apply Inverse Document Frequency to weigh word importance
idf = IDF(inputCol="raw_features", outputCol="tfidf_features")
idf_model = idf.fit(df_reviews_vectorized)
df_reviews_final = idf_model.transform(df_reviews_vectorized)

df_reviews_final.show(10, truncate=True)


+------+--------------------+--------------------+------------+----------+-----------+--------------------+--------------------+--------------------+--------------------+
|rating|               title|                text|helpful_vote|      asin|parent_asin|           words_raw|       words_cleaned|        raw_features|      tfidf_features|
+------+--------------------+--------------------+------------+----------+-----------+--------------------+--------------------+--------------------+--------------------+
|   5.0|Easy setup and wo...|I love how easy t...|           0|B00UXG4WR8| B00UXG4WR8|[i, love, how, ea...|[love, easy, filt...|(5000,[2,9,26,31,...|(5000,[2,9,26,31,...|
|   5.0|              buy it|fit, look, & work...|           0|B001TH7GZU| B001TH7GZU|[fit,, look,, &, ...|[fit,, look,, &, ...|(5000,[1,16,155,5...|(5000,[1,16,155,5...|
|   5.0|             Filters|Yep, got what I n...|           0|B07CV7VNL8| B07CV7VNL8|[yep,, got, what,...|[yep,, got, needed.]|(5000,[41,310],[1

                                                                                

Here is an output of the raw text along with the `TF-IDF` features that will be used to classify the sentiment of each review.

In [299]:
# Output Verification
print("Reviews Dataset with TF-IDF Features:")
df_verification = df_reviews_final.withColumn("text", substring("text", 1, 20))
df_verification.select("text", "tfidf_features").show(5, truncate=False)


Reviews Dataset with TF-IDF Features:





                                                                                

---

### Experiment Setup

For the Sentiment Analysis task, we focused on predicting whether a given review is favorable (positive sentiment) or unfavorable (negative sentiment). We began by creating a binary label based on the rating column, where ratings ≥ 4 are considered favorable, and ratings < 4 are unfavorable. For quick experimentation, we’ll use Logistic Regression, a simple and effective model for text classification.

In [300]:
# Define the target label
# Reviews with rating >= 4 are favorable (label=1), otherwise unfavorable (label=0)
df_reviews_labeled = df_reviews_final.withColumn(
    "label", when(col("rating") >= 4, 1).otherwise(0)
)

df_reviews_labeled.show(10, truncate=True)

+------+--------------------+--------------------+------------+----------+-----------+--------------------+--------------------+--------------------+--------------------+-----+
|rating|               title|                text|helpful_vote|      asin|parent_asin|           words_raw|       words_cleaned|        raw_features|      tfidf_features|label|
+------+--------------------+--------------------+------------+----------+-----------+--------------------+--------------------+--------------------+--------------------+-----+
|   5.0|Easy setup and wo...|I love how easy t...|           0|B00UXG4WR8| B00UXG4WR8|[i, love, how, ea...|[love, easy, filt...|(5000,[2,9,26,31,...|(5000,[2,9,26,31,...|    1|
|   5.0|              buy it|fit, look, & work...|           0|B001TH7GZU| B001TH7GZU|[fit,, look,, &, ...|[fit,, look,, &, ...|(5000,[1,16,155,5...|(5000,[1,16,155,5...|    1|
|   5.0|             Filters|Yep, got what I n...|           0|B07CV7VNL8| B07CV7VNL8|[yep,, got, what,...|[yep,, g

                                                                                

We use a typical 80-20 split for train-test data, with a seed of 42 ([The answer to "Life, the universe and everything!"](https://medium.com/geekculture/the-story-behind-random-seed-42-in-machine-learning-b838c4ac290a)) so that the split is reproducible. Spark's `randomSplit` function accomplishes this in the following `Code Block`:

In [301]:
# Train-test split (80-20 split)
train_data, test_data = df_reviews_labeled.randomSplit([0.8, 0.2], seed=42)
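
One detail worth keeping in mind is that `randomSplit` assigns rows by independent random draws, so the resulting split is only approximately 80-20. The following is a minimal pure-Python sketch of that idea, not Spark's implementation; the `random_split` helper and the integer rows are hypothetical stand-ins for the review rows.

```python
import random

def random_split(rows, weights, seed):
    """Assign each row to one split by an independent random draw."""
    total = sum(weights)
    # Cumulative, normalized boundaries, e.g. [0.8, 1.0] for weights [0.8, 0.2].
    bounds = []
    running = 0.0
    for w in weights:
        running += w / total
        bounds.append(running)
    rng = random.Random(seed)
    splits = [[] for _ in weights]
    for row in rows:
        draw = rng.random()
        for i, bound in enumerate(bounds):
            # The last split catches any draw left over by rounding error.
            if draw < bound or i == len(bounds) - 1:
                splits[i].append(row)
                break
    return splits

rows = list(range(10_000))  # hypothetical stand-in for review rows
train, test = random_split(rows, [0.8, 0.2], seed=42)
print(len(train), len(test))  # close to 8000 and 2000, but not exactly
```

Fixing the seed makes the assignment repeatable, which is why the same seed should be reused whenever the baseline and tuned models are compared.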

We use Spark's `LogisticRegression` as our Machine Learning model and define it in the following `Code Block`:

In [302]:
# Define the Logistic Regression model
lr = LogisticRegression(featuresCol="tfidf_features", labelCol="label")

Although this task does not need a complex pipeline, we wrap the model in Spark's `Pipeline` in the `Code Block` below so that additional stages or further tweaks to the Machine Learning model can be added later:

In [303]:
# Build a pipeline (no additional transformations needed here)
pipeline = Pipeline(stages=[lr])

We fit the pipeline on the training data to produce the model that we will use to make predictions on the testing data.

In [304]:
# Train the model
lr_model = pipeline.fit(train_data)



---

### Experimentation Factors and Process

As seen above, the Machine Learning model used for the experiment was Logistic Regression. We chose it because the task is a simple binary classification: deciding whether a given text review is Favourable or Not Favourable. For training, we defined reviews with a rating of 4 or greater as Favourable and all others as Not Favourable. Given the size of the dataset, we did not tune any hyperparameters in the initial (baseline) model. We tune them later with Cross-Validation, where the hyperparameter tuned is the regularization parameter, with candidate values [0.01, 0.1, 1.0]. We then select the best model from this set and compare it to the untuned baseline to determine which produces the better results.
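
To illustrate what the regularization parameter controls, here is a minimal pure-Python sketch of an L2-regularized log loss. It assumes Spark's default `elasticNetParam` of 0.0 (a pure L2 penalty) and ignores Spark's feature standardization; the toy rows, labels, and weights are hypothetical and are not taken from our dataset.

```python
import math

def regularized_log_loss(weights, bias, rows, labels, reg_param):
    """Average log loss plus an L2 penalty of reg_param / 2 * ||w||^2."""
    total = 0.0
    for x, y in zip(rows, labels):
        margin = sum(w * xi for w, xi in zip(weights, x)) + bias
        p = 1.0 / (1.0 + math.exp(-margin))  # P(label = 1)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    penalty = 0.5 * reg_param * sum(w * w for w in weights)
    return total / len(rows) + penalty

# Hypothetical toy data: two features per row.
rows = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
labels = [1, 0, 1]
weights, bias = [2.0, -1.5], 0.1

for reg_param in [0.0, 0.01, 0.1, 1.0]:
    loss = regularized_log_loss(weights, bias, rows, labels, reg_param)
    print(reg_param, round(loss, 4))
```

For the same weights, a larger `reg_param` adds a larger penalty, so the optimizer is pushed toward smaller weights. A value of 0.0 corresponds to the untuned baseline, since that is Spark's default `regParam`.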

With the model trained, we now make predictions on the test data using the Logistic Regression model.

In [305]:
# Make predictions on the test data
predictions = lr_model.transform(test_data)

The predictions of the baseline model will now be evaluated using the Area Under the ROC curve (AUC). The closer the value is to 1.0, the better the model separates Favourable from Not Favourable reviews, while a value of 0.5 corresponds to random guessing. A value extremely close to 1.0 should be treated with caution, as it may indicate overfitting or data leakage rather than genuine predictive power.

In [306]:
# Evaluate the model
evaluator = BinaryClassificationEvaluator(labelCol="label", metricName="areaUnderROC")
roc_auc = evaluator.evaluate(predictions)

                                                                                

In [307]:
# Print the evaluation result
print(f"Test AUC (Area Under ROC): {roc_auc:.4f}")

Test AUC (Area Under ROC): 0.9283


The baseline model achieved a test AUC of **0.9283**, which indicates that it separates Favourable from Not Favourable reviews well on unseen data.
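
One way to read an AUC value is as the fraction of (Favourable, Not Favourable) review pairs in which the Favourable review receives the higher score. The sketch below computes AUC that way in pure Python. It is an equivalent interpretation rather than how Spark computes the metric, and the labels and scores are hypothetical.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one example of each class")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Hypothetical labels and P(label = 1) scores, not taken from the dataset.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.95, 0.80, 0.40, 0.70, 0.30, 0.10]
print(roc_auc(labels, scores))  # 8 of the 9 pairs are ordered correctly
```

Because AUC depends only on the ordering of the scores, it does not change with the decision threshold, unlike accuracy, precision, and recall.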

The `Code Block` below shows a sample of the model's predictions, including the predicted label and the probability assigned to each class.

In [308]:
# Display some sample predictions
print("Sample Predictions:")
df_predictions = predictions.withColumn("text", substring("text", 1, 20))

df_predictions.select("text", "label", "prediction", "probability").show(20, truncate=False)

Sample Predictions:


+--------------------+-----+----------+-----------------------------------------+
|text                |label|prediction|probability                              |
+--------------------+-----+----------+-----------------------------------------+
|Good luck trying to |0    |0.0       |[0.9802595063074832,0.019740493692516847]|
|The filters don't ke|0    |1.0       |[0.2981005455029585,0.7018994544970415]  |
|The Heavy duty filte|0    |0.0       |[0.9999920802142824,7.91978571756946E-6] |
|Lasted maybe 2 weeks|0    |0.0       |[0.9999126699692661,8.733003073391199E-5]|
|It's a piece of plas|0    |1.0       |[0.14236229117230803,0.857637708827692]  |
|They don't work in t|0    |1.0       |[0.48171669748994517,0.5182833025100548] |
|*** THIS ICE MACHINE|0    |0.0       |[0.999996564128657,3.4358713429938348E-6]|
|[[VIDEOID:7101ea2ab5|0    |0.0       |[0.915184742549061,0.08481525745093899]  |
|These shelves are ex|0    |0.0       |[0.5893658951191919,0.4106341048808081]  |
|I just changed 
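
The `probability` column holds [P(label = 0), P(label = 1)], and the `prediction` column follows from comparing the second entry against a threshold. The sketch below assumes Spark's default `threshold` of 0.5, which our code does not override. The `predict` helper is hypothetical, and the probabilities are rounded copies of three rows from the sample output.

```python
def predict(probability, threshold=0.5):
    """Return 1.0 if P(label = 1) exceeds the threshold, else 0.0."""
    return 1.0 if probability[1] > threshold else 0.0

# Probability vectors rounded from three rows of the sample output.
rows = [
    ("Good luck trying to", [0.9803, 0.0197]),
    ("The filters don't ke", [0.2981, 0.7019]),
    ("They don't work in t", [0.4817, 0.5183]),
]

for text, prob in rows:
    # Compare the default threshold with a stricter, hypothetical one.
    print(text, predict(prob), predict(prob, threshold=0.6))
```

The third row sits very close to the boundary, so a slightly stricter threshold would flip its prediction. This is one reason to inspect probabilities and not only the predicted labels.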

                                                                                

---

### Performance Metrics

Performance metrics help us decide whether the Machine Learning model is good enough to use for Sentiment Analysis on the reviews. Below, we calculate and then analyze the following metrics: Accuracy, Precision, Recall, and F1-Score.

In [309]:
# Accuracy
evaluator_acc = MulticlassClassificationEvaluator(labelCol="label", predictionCol="prediction", metricName="accuracy")
accuracy = evaluator_acc.evaluate(predictions)

                                                                                

In [310]:
# Precision (weighted across both classes by label frequency)
evaluator_precision = MulticlassClassificationEvaluator(labelCol="label", predictionCol="prediction", metricName="weightedPrecision")
precision = evaluator_precision.evaluate(predictions)

                                                                                

In [311]:
# Recall (weighted across both classes by label frequency)
evaluator_recall = MulticlassClassificationEvaluator(labelCol="label", predictionCol="prediction", metricName="weightedRecall")
recall = evaluator_recall.evaluate(predictions)

                                                                                

In [312]:
# F1-score (weighted across both classes by label frequency)
evaluator_f1 = MulticlassClassificationEvaluator(labelCol="label", predictionCol="prediction", metricName="f1")
f1 = evaluator_f1.evaluate(predictions)

                                                                                

In [313]:
# Print metrics
print("Model Evaluation Metrics:")
print(f"Accuracy: {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"F1-score: {f1:.4f}")

Model Evaluation Metrics:
Accuracy: 0.8969
Precision: 0.8928
Recall: 0.8969
F1-score: 0.8914


#### Accuracy
Accuracy is the proportion of correctly classified reviews (Favourable or Not Favourable) out of all of the reviews. Our model has an accuracy of 0.8969, meaning that ~89.7% of all reviews were correctly classified as Favourable (Label = 1.0) or Not Favourable (Label = 0.0).

#### Precision
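Precision measures how many of the reviews predicted as a given class actually belong to that class. Because we use Spark's `weightedPrecision` metric, the reported value of 0.8928 is the average of the Favourable and Not Favourable precision, weighted by how often each label occurs; it is not the precision of the Favourable class alone.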

#### Recall
Recall measures how many of the reviews that actually belong to a given class were identified by the model. Our `weightedRecall` of 0.8969 is the per-class recall averaged by label frequency, which is why it matches the accuracy exactly; it is not the recall of the Favourable class alone.

#### F1-Score
The F1-Score is the harmonic mean of precision and recall, balancing both metrics. Spark's `f1` metric computes this per class and averages the results by label frequency, so our F1-Score of 0.8914 summarizes both classes. Its closeness to our precision and recall values suggests that the model keeps both False Positives and False Negatives reasonably low.

#### Overall
These high values for our metrics suggest that the baseline Machine Learning model is performing well overall.
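
Since the evaluators above use Spark's `weightedPrecision` and `weightedRecall`, it helps to see how those differ from the precision and recall of the Favourable class alone. The sketch below computes both from a binary confusion matrix in pure Python; the `weighted_metrics` helper and the counts are hypothetical and chosen only for illustration.

```python
def weighted_metrics(tp, fp, fn, tn):
    """Label-frequency-weighted precision and recall for a binary task.

    Counts treat label 1 (Favourable) as the positive class.
    """
    total = tp + fp + fn + tn
    # Per-class precision and recall.
    prec_1, rec_1 = tp / (tp + fp), tp / (tp + fn)
    prec_0, rec_0 = tn / (tn + fn), tn / (tn + fp)
    # Each class is weighted by its share of the true labels.
    w1, w0 = (tp + fn) / total, (tn + fp) / total
    return {
        "accuracy": (tp + tn) / total,
        "weighted_precision": w1 * prec_1 + w0 * prec_0,
        "weighted_recall": w1 * rec_1 + w0 * rec_0,
        "precision_favourable": prec_1,
        "recall_favourable": rec_1,
    }

# Hypothetical confusion-matrix counts, not results from our model.
m = weighted_metrics(tp=800, fp=70, fn=30, tn=100)
for name, value in m.items():
    print(f"{name}: {value:.4f}")
```

Weighted recall always equals accuracy, because weighting each class's recall by its label share simply counts all correct predictions. Reporting the per-class values as well would show how the model performs on the less frequent Not Favourable class.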

---

### Cross-Validation and Hyperparameter Tuning

We tune the Regularization parameter using Cross-Validation to determine if we can find an optimal setting for the Baseline Machine Learning Model.

In [314]:
# Define the parameter grid
paramGrid = ParamGridBuilder().addGrid(lr.regParam, [0.01, 0.1, 1.0]).build()

In [315]:
# Cross-validation
crossval = CrossValidator(estimator=pipeline,
                          estimatorParamMaps=paramGrid,
                          evaluator=BinaryClassificationEvaluator(labelCol="label"),
                          numFolds=3) 

In [316]:
# Run cross-validation to choose the best model
cv_model = crossval.fit(train_data)



In [317]:
# Evaluate the best model
best_model = cv_model.bestModel
cv_predictions = best_model.transform(test_data)
cv_roc_auc = evaluator.evaluate(cv_predictions)

print(f"Best Model Test AUC (Cross-Validated): {cv_roc_auc:.4f}")

                                                                                

Best Model Test AUC (Cross-Validated): 0.9289


After Hyperparameter Tuning with Cross-Validation, the best model has a test AUC of 0.9289, only 0.0006 higher than the baseline model. A difference this small suggests that tuning the regularization parameter had little effect on the predictive power of the model.

---

### Results

#### Baseline

The baseline machine learning model performed well using logistic regression on the following metrics:

| **Metric**    | **Score**  | 
| ------------- | ---------- |
| Accuracy      | 0.8969     |
| Precision     | 0.8928     |
| Recall        | 0.8969     |     
| F1-Score      | 0.8914     |
| AUC           | 0.9283     |

#### Cross-Validated (After Hyperparameter Tuning)

Best Model Test AUC: 0.9289

Hyperparameter tuning appeared to slightly improve the model's predictive power, increasing the AUC from 0.9283 to 0.9289.

This suggests that tuning the regularization parameter (regParam) gave at most a marginal improvement in how well the model generalizes to unseen test data.

#### Key Findings

The Accuracy, Precision, Recall, and F1-Score metrics all indicated that the Machine Learning Model had high predictive power when it came to labelling a review as Favourable (Label = 1.0) or Not Favourable (Label = 0.0). 

The TF-IDF method worked well and allowed us to create a model with the predictive power we wanted. Because the dataset was large, we used Apache Spark to compute the TF-IDF features and train the model. System constraints required us to limit the vocabulary to 5000 terms, but this did not seem to interfere with the creation of our model.

#### Future Work

In the future, with greater computing resources, it would be ideal to create a model using Random Forests or deep learning models. It would also be ideal to perform more Hyperparameter tuning, though testing multiple hyperparameters on a grid becomes cumbersome: with large datasets and scarce computing resources, a full grid search can take hours or even days.
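
The cost of a grid search can be estimated by counting model fits: one per parameter combination per fold, plus one final refit of the best combination on the full training data. The sketch below does this in pure Python. The `count_fits` helper is hypothetical, and the larger grid (including its `elasticNetParam` and `maxIter` values) is an assumption for illustration, not something we ran.

```python
from itertools import product

def count_fits(grid, num_folds):
    """Number of model fits for an exhaustive grid search with k-fold CV."""
    combos = list(product(*grid.values()))
    # One fit per (combination, fold), plus one refit of the best combination.
    return len(combos) * num_folds + 1

# The grid used in this project.
current = {"regParam": [0.01, 0.1, 1.0]}

# Hypothetical larger grid; these extra values are assumptions.
larger = {
    "regParam": [0.01, 0.1, 1.0],
    "elasticNetParam": [0.0, 0.5, 1.0],
    "maxIter": [50, 100],
}

print(count_fits(current, num_folds=3))  # 3 * 3 + 1 = 10
print(count_fits(larger, num_folds=3))   # 18 * 3 + 1 = 55
```

Each added hyperparameter multiplies the number of fits, which is why exhaustive grids quickly become impractical on large datasets.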

### Conclusion

The Machine Learning model created by our team appears to have been successful in determining whether a given review is Favourable or Not Favourable, with high Accuracy, Precision, Recall, and F1-Score. After tuning the regularization hyperparameter, the AUC score stayed nearly the same, indicating that the baseline was not sensitive to this parameter over the values tested. With greater computing resources, we would like to try different classification models and tune different hyperparameters to fit each model's needs, to see whether we could achieve a higher AUC score and higher metrics overall.


In [318]:
spark.stop()