# Text Analysis & Summarization Dashboard

This notebook provides an end-to-end text-analytics pipeline over product review data, combining Spark NLP, topic modeling, and abstractive summarization. You can interactively explore different categories, products, rating filters, and review-count thresholds to surface dominant themes and generate concise summaries.

---

## Step-by-Step Process

1. **Environment Setup**  
   - Import Spark, Spark NLP, Sumy, spaCy, Gensim, pyLDAvis & Transformers (BART)  
   - Initialize a local Spark session with the JohnSnowLabs Spark NLP package  

2. **Data Load & Caching**  
   - Read the master Parquet dataset of reviews  
   - Cache the DataFrame for fast iterative queries  

3. **Data Preparation**  
   - Aggregate review counts by `main_category` and select the top 10 categories  
   - Filter out missing text/rating records  
   - Clean review text (lowercase, remove non-alphabetic characters)  

4. **Top Products Selection**  
   - Compute review counts per `(main_category, Product_title)`  
   - Rank and keep the top 10 products in each category  
   - Join back to focus downstream analysis on these high-volume items  

5. **Tokenization & Stop-Word Removal**  
   - Use Spark’s `Tokenizer` and `StopWordsRemover` to split and filter tokens  
   - Convert the cleaned, tokenized DataFrame to a Pandas DataFrame for dashboarding  

6. **Interactive Topic Modeling Dashboard**  
   - Build dropdowns & sliders for category, product, star-rating, and minimum reviews  
   - On each selection change, run LDA on the filtered tokens and render with `pyLDAvis`  
   - Visualize topic distributions and term relevance in real time  

7. **Interactive Summarization Dashboard**  
   - Load Hugging Face’s BART summarizer (Sumy’s TextRank is imported but not wired into the dashboard cell)  
   - Provide controls for category, product, star rating, and number of reviews to include  
   - Regenerate the abstractive (BART) summary whenever a control changes  

8. **Clean-Up**  
   - Stop the Spark session once you’re done  

---

### Empowerment & Exploration  
By adjusting the widgets in sections 6 and 7 (category selector, product list, star-rating dropdown, and review-count sliders), you can:

- **Unearth shifting topic themes** across different product segments  
- **Generate abstractive summaries** on the fly for any filtered slice of reviews  
- **Validate hypotheses** about what customers care about by varying your filters  

Use this notebook as a living dashboard to iterate, refine, and derive actionable insights from your review corpus.  


## 1. Environment Setup

In [None]:
# ─── Notebook Setup: Cleaned Imports ───────────────────────────────────────

# Core NLP frameworks
import sparknlp
import spacy
from transformers import pipeline

# Spark & Spark NLP
from pyspark.sql import SparkSession
from pyspark.sql.window import Window
from pyspark.sql.functions import (
    col, count, desc, element_at, explode, from_json,
    lower, regexp_replace, row_number, split, size, when
)
from pyspark.sql.types import ArrayType, StringType
from pyspark.ml import Pipeline as SparkPipeline
from pyspark.ml.feature import (
    Tokenizer, RegexTokenizer, CountVectorizer, IDF, NGram, StopWordsRemover
)
from pyspark.ml.clustering import LDA
from pyspark.ml.functions import vector_to_array
# Alias Spark ML's Tokenizer so the wildcard Spark NLP import below cannot shadow it
from pyspark.ml.feature import Tokenizer as SparkTokenizer


# Spark NLP components
from sparknlp.base import DocumentAssembler, Finisher
from sparknlp.annotator import *

# Data manipulation
import pandas as pd

# Visualization
import plotly.express as px
import plotly.graph_objects as go
import pyLDAvis
import pyLDAvis.gensim_models as gensimvis
import gensim
from gensim import corpora

# Interactive widgets & display
import ipywidgets as widgets
from IPython.display import display, clear_output

# Text summarization (Sumy)
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer as SumyTokenizer
from sumy.summarizers.text_rank import TextRankSummarizer

# Enable LDA visualization in notebook
pyLDAvis.enable_notebook()

In [None]:
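# 1) Start Spark with the Spark NLP package. If a session already exists
#    (as the warning below shows), static settings such as spark.jars.packages are ignored.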
spark = (SparkSession.builder
         .appName("NLPDashboard")
         .master("local[*]")
         .config("spark.jars.packages",
                 "JohnSnowLabs:spark-nlp:4.4.0")  # adjust version
         .getOrCreate())

25/04/28 20:13:10 WARN SparkSession: Using an existing Spark session; only runtime SQL configurations will take effect.


## 2. Data Load & Caching

In [None]:
path = "gs://ba843-group1-project/master_data.parquet"
master_df = spark.read.parquet(path)

                                                                                

In [None]:
master_df.cache()

DataFrame[parent_asin: string, asin: string, helpful_vote: bigint, images: array<struct<attachment_type:string,large_image_url:string,medium_image_url:string,small_image_url:string>>, rating: double, text: string, timestamp: bigint, title: string, user_id: string, verified_purchase: boolean, main_category: string, average_rating: double, rating_number: double, features: string, description: string, price: double, videos: string, store: string, categories: string, bought_together: string, Product_title: string]

----

## 3. Data Preparation

In [None]:
# Count reviews per main_category
top_n = 10
category_counts = master_df.groupBy("main_category") \
    .count()

# Grab the top-10 by that count
top10 = category_counts.orderBy(desc("count")).limit(top_n)

In [None]:
# Option A: collect into a Python list and use isin()
top10_list = [row["main_category"] for row in top10.collect()]
top10_df = master_df.filter(col("main_category").isin(top10_list))

                                                                                

## 4. Top Products Selection

In [None]:
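# Keep only reviews from the top-10 categories by review count
filtered_df = master_df.filter(col("main_category").isin(top10_list))
# Drop rows missing the review text or either rating field
filtered_df = filtered_df.na.drop(subset=["text", "rating_number", "rating"])
# Lowercase the text and strip everything except ASCII letters and whitespace
filtered_df = filtered_df.withColumn("clean_text", regexp_replace(lower(col("text")), "[^a-zA-Z\\s]", ""))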
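
The cleaning rule is easier to reason about on a few strings outside Spark. The sketch below is a plain-Python stand-in, assuming Python's `re` treats this simple character class the same way Spark's regex engine does; `clean_text` and the sample strings are illustrative names, not part of the notebook.

```python
import re

# Same character class as the Spark expression: keep ASCII letters and whitespace only.
NON_ALPHA = re.compile(r"[^a-zA-Z\s]")

def clean_text(text: str) -> str:
    """Lowercase the review, then delete every character that is not a letter or whitespace."""
    return NON_ALPHA.sub("", text.lower())

# Hypothetical review snippets
samples = ["Great product!!! 10/10", "Didn't fit; returned it.", "Wi-Fi drops every 5 min"]
cleaned = [clean_text(s) for s in samples]
for raw, out in zip(samples, cleaned):
    print(f"{raw!r} -> {out!r}")
```

Note that digits, apostrophes, and hyphens disappear entirely, so `Didn't` becomes `didnt` and removed numbers can leave consecutive spaces behind.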

In [None]:
# Top 10 Product_title per category
product_review_counts = filtered_df.groupBy("main_category", "Product_title").agg(count("*").alias("product_review_count"))
windowSpec = Window.partitionBy("main_category").orderBy(desc("product_review_count"))
ranked_products_df = product_review_counts.withColumn("rank", row_number().over(windowSpec))
top_products_df = ranked_products_df.filter(col("rank") <= top_n).drop("rank")

filtered_top_df = filtered_df.join(top_products_df, on=["main_category", "Product_title"], how="inner")
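
To make the ranking step concrete, here is a minimal plain-Python sketch of "top N products per category by review count". It is not the Spark implementation: the function name and toy data are hypothetical, and it adds an explicit alphabetical tie-break, whereas a window ordered only by count resolves ties at the cutoff in no guaranteed order.

```python
from collections import Counter, defaultdict

def top_products_per_category(reviews, top_n):
    """reviews: iterable of (main_category, product_title) pairs, one per review."""
    counts = Counter(reviews)
    by_category = defaultdict(list)
    for (category, title), n in counts.items():
        by_category[category].append((title, n))
    top = {}
    for category, products in by_category.items():
        # Highest count first; the title is an explicit tie-break added here for determinism.
        products.sort(key=lambda item: (-item[1], item[0]))
        top[category] = products[:top_n]
    return top

# Hypothetical toy data: one tuple per review
reviews = [
    ("Toys", "Robot"), ("Toys", "Robot"), ("Toys", "Puzzle"), ("Toys", "Kite"),
    ("Books", "Atlas"), ("Books", "Atlas"), ("Books", "Novel"),
]
top = top_products_per_category(reviews, top_n=2)
print(top)
```

Because each category keeps exactly `top_n` rows, a product tied with the last kept one is dropped; a rank-style function would be needed to keep all ties.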

## 5. Tokenization & Stop-Word Removal

In [None]:
# Tokenization & Stopword Removal
tokenizer = SparkTokenizer(inputCol="clean_text", outputCol="tokens")
tokenized_df = tokenizer.transform(filtered_top_df)
remover = StopWordsRemover(inputCol="tokens", outputCol="filtered_tokens")
filtered_tokens_df = remover.transform(tokenized_df)

# Convert to Pandas
dashboard_df = filtered_tokens_df.select("main_category", "Product_title", "rating_number", "rating", "filtered_tokens").toPandas()
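# Space-joined string per review (used later for length sorting and summarization)
dashboard_df['joined_tokens'] = dashboard_df['filtered_tokens'].apply(lambda x: ' '.join(x))
# Re-splitting on whitespace drops any empty tokens
dashboard_df['tokenized'] = dashboard_df['joined_tokens'].apply(lambda x: x.split())
# Discard reviews with no tokens left after cleaning
dashboard_df = dashboard_df[dashboard_df['joined_tokens'].str.strip() != ""]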
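
The join-then-split round trip above can look redundant, so here is a small sketch of what it does. It assumes the token array may contain empty strings (which can happen when cleaning leaves consecutive spaces); the stop-word set is a tiny illustrative subset, not Spark's default list.

```python
# Illustrative stop-word subset (assumption); Spark's default English list is much longer.
STOP_WORDS = {"the", "a", "is", "it", "and", "this"}

def remove_stop_words(tokens):
    """Drop stop words but keep everything else, including empty strings."""
    return [t for t in tokens if t not in STOP_WORDS]

# Splitting on a single space keeps an empty token where two spaces were adjacent.
tokens = "this  blender is loud and the lid leaks".split(" ")
filtered = remove_stop_words(tokens)

joined = " ".join(filtered)   # counterpart of the joined_tokens column
tokenized = joined.split()    # counterpart of the tokenized column; empties are gone

print(filtered)
print(tokenized)
```

Only the re-split list is free of empty tokens, which is why the topic-model cell reads `tokenized` rather than `filtered_tokens`.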


## 6. Interactive Topic Modeling Dashboard

In [None]:
pyLDAvis.enable_notebook()

# Category Dropdown
category_dropdown = widgets.Dropdown(
    options=dashboard_df["main_category"].unique(),
    description='Category:',
    disabled=False
)

# Product Dropdown
def get_titles_for_category(category):
    titles = dashboard_df[dashboard_df['main_category'] == category]['Product_title'].unique()
    return titles[:10]

title_dropdown = widgets.Dropdown(
    options=[],
    description='Product:',
    disabled=False
)

# Min Reviews Slider
review_count_slider = widgets.IntSlider(
    value=1,
    min=1,
    max=int(dashboard_df["rating_number"].max()),
    step=1,
    description='Min Reviews:',
    continuous_update=False
)

# Star Rating Dropdown (1 to 5 stars)
star_rating_dropdown = widgets.Dropdown(
    options=[1, 2, 3, 4, 5],
    value=5,
    description='Star Rating:',
    disabled=False
)

output = widgets.Output()

# Update Product Titles
def update_titles(*args):
    selected_category = category_dropdown.value
    title_dropdown.options = get_titles_for_category(selected_category)

category_dropdown.observe(update_titles, names='value')

# Update Topic Chart
def update_topic_chart(change):
    with output:
        output.clear_output()
        selected_category = category_dropdown.value
        selected_title = title_dropdown.value
        min_review_count = review_count_slider.value  # threshold on rating_number
        selected_star_rating = star_rating_dropdown.value

        df_filtered = dashboard_df[
            (dashboard_df['main_category'] == selected_category) &
            (dashboard_df['Product_title'] == selected_title) &
            (dashboard_df['rating_number'] >= min_review_count) &
            (dashboard_df['rating'] == selected_star_rating)
        ]

        print(f"Filtered rows: {len(df_filtered)}")

        if len(df_filtered) >= 3:
            try:
                tokenized_data = df_filtered['tokenized'].tolist()
                dictionary = corpora.Dictionary(tokenized_data)
                corpus = [dictionary.doc2bow(text) for text in tokenized_data]

                lda_model = gensim.models.LdaModel(
                    corpus=corpus,
                    id2word=dictionary,
                    num_topics=3,
                    random_state=42,
                    passes=10,
                    iterations=100
                )

                vis_data = gensimvis.prepare(lda_model, corpus, dictionary)
                display(vis_data)

            except Exception as e:
                print(f"Gensim LDA visualization failed: {e}")
        else:
            print("Not enough data for topic modeling (need at least 3 rows).")

# Bind all widgets
category_dropdown.observe(update_topic_chart, names='value')
title_dropdown.observe(update_topic_chart, names='value')
review_count_slider.observe(update_topic_chart, names='value')
star_rating_dropdown.observe(update_topic_chart, names='value')

# Display all widgets
display(widgets.VBox([
    category_dropdown,
    title_dropdown,
    review_count_slider,
    star_rating_dropdown,
    output
]))

# Initialize dropdowns and chart
update_titles()
update_topic_chart(None)

VBox(children=(Dropdown(description='Category:', options=('Health & Personal Care', 'Sports & Outdoors', 'Amaz…
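
Before LDA runs, each review is turned into a bag of words: a list of `(token_id, count)` pairs. The sketch below shows that representation in plain Python. It is a stand-in for what `corpora.Dictionary` and `doc2bow` produce, not Gensim's code, and the IDs Gensim assigns may differ; `build_vocab` and the toy documents are hypothetical.

```python
from collections import Counter

def build_vocab(docs):
    """Assign each distinct token an integer ID in order of first appearance."""
    vocab = {}
    for doc in docs:
        for token in doc:
            vocab.setdefault(token, len(vocab))
    return vocab

def doc2bow(doc, vocab):
    """Return sorted (token_id, count) pairs, skipping tokens missing from the vocabulary."""
    counts = Counter(vocab[t] for t in doc if t in vocab)
    return sorted(counts.items())

# Hypothetical tokenized reviews
docs = [
    ["battery", "life", "great", "battery"],
    ["slow", "charge", "battery"],
    ["great", "sound"],
]
vocab = build_vocab(docs)
corpus = [doc2bow(doc, vocab) for doc in docs]
print(vocab)
print(corpus)
```

Word order is discarded at this point, so the topics LDA finds reflect only which words co-occur within the same review.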

## LDA Dashboard Sample:

## LDA: Topic Model
![LDA: Topic Model](https://github.com/billburr958/images-temp/blob/main/Text%20Analysis/topic_model.png?raw=true)

----

## 7. Interactive Summarization Dashboard

In [None]:
# Load the summarizer pipeline (this may take a minute the first time)
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

Device set to use cpu


In [None]:
nlp = spacy.load("en_core_web_sm")

# Category Dropdown
summary_category_dropdown = widgets.Dropdown(
    options=dashboard_df["main_category"].unique(),
    description='Category:',
    disabled=False
)

# Product Dropdown
def get_titles_for_category(category):
    titles = dashboard_df[dashboard_df['main_category'] == category]['Product_title'].unique()
    return titles[:10]

summary_title_dropdown = widgets.Dropdown(
    options=[],
    description='Product:',
    disabled=False
)

# Star Rating Dropdown (1 to 5 stars)
summary_star_rating_dropdown = widgets.Dropdown(
    options=[1, 2, 3, 4, 5],
    value=5,
    description='Star Rating:',
    disabled=False
)

# Min Reviews Slider (on rating_number column)
summary_review_count_slider = widgets.IntSlider(
    value=1,
    min=1,
    max=int(dashboard_df["rating_number"].max()),
    step=1,
    description='Min Reviews:',
    continuous_update=False
)

# 🆕 Number of Reviews to Include in Summary
num_reviews_slider = widgets.IntSlider(
    value=10,
    min=1,
    max=500,
    step=1,
    description='Reviews to Use:',
    continuous_update=False
)

summary_output = widgets.Output()

# Update Product Titles when Category changes
def update_summary_titles(*args):
    selected_category = summary_category_dropdown.value
    summary_title_dropdown.options = get_titles_for_category(selected_category)

summary_category_dropdown.observe(update_summary_titles, names='value')

def abstractive_summarize_reviews(change):
    with summary_output:
        summary_output.clear_output()
        selected_category = summary_category_dropdown.value
        selected_title = summary_title_dropdown.value
        selected_star_rating = summary_star_rating_dropdown.value
        min_review_count = summary_review_count_slider.value
        num_reviews_to_use = num_reviews_slider.value

        df_filtered = dashboard_df[
            (dashboard_df['main_category'] == selected_category) &
            (dashboard_df['Product_title'] == selected_title) &
            (dashboard_df['rating'] == selected_star_rating) &
            (dashboard_df['rating_number'] >= min_review_count)
        ]

        print(f"Filtered rows available: {len(df_filtered)}")

        if not df_filtered.empty:
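            # Work on a copy so adding a column does not trigger pandas' SettingWithCopyWarning
            df_filtered = df_filtered.copy()
            df_filtered['review_length'] = df_filtered['joined_tokens'].apply(len)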
            df_filtered_sorted = df_filtered.sort_values(by='review_length', ascending=False).head(num_reviews_to_use)

            combined_reviews = ' '.join(df_filtered_sorted['joined_tokens'].tolist())
            print(f"Using top {len(df_filtered_sorted)} reviews for summarization")

            if len(combined_reviews.split()) > 100:
                # facebook/bart-large-cnn accepts at most 1024 input tokens; capping the
                # input at 3000 characters is a rough proxy for staying under that limit
                combined_reviews = combined_reviews[:3000]

                try:
                    summary = summarizer(combined_reviews, max_length=150, min_length=50, do_sample=False)
                    print("\nAbstractive Summary:\n")
                    print(summary[0]['summary_text'])
                except Exception as e:
                    print(f"Summarization failed: {e}")
            else:
                print("Not enough text for summarization.")
        else:
            print("No data for this selection.")


# Re-run the abstractive summarizer whenever any control changes
summary_category_dropdown.observe(abstractive_summarize_reviews, names='value')
summary_title_dropdown.observe(abstractive_summarize_reviews, names='value')
summary_star_rating_dropdown.observe(abstractive_summarize_reviews, names='value')
summary_review_count_slider.observe(abstractive_summarize_reviews, names='value')
num_reviews_slider.observe(abstractive_summarize_reviews, names='value')

# Display widgets
display(widgets.VBox([
    summary_category_dropdown,
    summary_title_dropdown,
    summary_star_rating_dropdown,
    summary_review_count_slider,
    num_reviews_slider,
    summary_output
]))

# Initialize
update_summary_titles()
abstractive_summarize_reviews(None)

VBox(children=(Dropdown(description='Category:', options=('Health & Personal Care', 'Sports & Outdoors', 'Amaz…
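
The summarizer cell caps its input with a plain character slice, which can cut the last word in half. One possible refinement, sketched here under the assumption that words are separated by single spaces, is to back up to the previous space; `truncate_at_word` is a hypothetical helper, not part of the notebook.

```python
def truncate_at_word(text, max_chars):
    """Return at most max_chars characters of text without splitting the final word."""
    if len(text) <= max_chars:
        return text
    # Look for the last space at or before the cut position.
    cut = text.rfind(" ", 0, max_chars + 1)
    # Fall back to a hard cut when the first word alone exceeds the limit.
    return text[:cut] if cut > 0 else text[:max_chars]

demo = "alpha beta gamma"
print(truncate_at_word(demo, 12))
print(truncate_at_word(demo, 3))
```

This still limits characters rather than tokens, so it remains an approximation of the model's real input limit.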

## Summarizer Dashboard Sample:


## Bart: Review Summarizer
![Bart: Review Summarizer](https://github.com/billburr958/images-temp/blob/main/Text%20Analysis/summarizer.png?raw=true)
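
The notebook imports Sumy's `TextRankSummarizer`, but the dashboard cell only calls BART. For readers curious about the extractive side, the following is a simplified TextRank-style sketch in plain Python. It is not Sumy's implementation: the sentence splitter, the word-overlap similarity, and all function names are assumptions made for illustration, and it needs punctuated text rather than the stop-word-filtered tokens used above.

```python
import math
import re
from itertools import combinations

def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def words(sentence):
    return set(re.findall(r"[a-z']+", sentence.lower()))

def similarity(a, b):
    """Shared-word count normalized by the log of each sentence's vocabulary size."""
    wa, wb = words(a), words(b)
    if len(wa) < 2 or len(wb) < 2:
        return 0.0
    return len(wa & wb) / (math.log(len(wa)) + math.log(len(wb)))

def textrank_summary(text, n_sentences=2, damping=0.85, iterations=50):
    sentences = split_sentences(text)
    n = len(sentences)
    if n <= n_sentences:
        return sentences
    # Symmetric similarity graph between sentences.
    weights = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        weights[i][j] = weights[j][i] = similarity(sentences[i], sentences[j])
    out_sum = [sum(row) for row in weights]
    scores = [1.0 / n] * n
    # Weighted PageRank-style power iteration.
    for _ in range(iterations):
        scores = [
            (1 - damping) / n
            + damping * sum(
                weights[j][i] / out_sum[j] * scores[j]
                for j in range(n) if out_sum[j] > 0
            )
            for i in range(n)
        ]
    top = sorted(range(n), key=lambda i: scores[i], reverse=True)[:n_sentences]
    return [sentences[i] for i in sorted(top)]  # keep original sentence order

# Hypothetical review text
sample = (
    "The leggings are soft and stretchy. The fabric is soft and the fit is great. "
    "Shipping took two weeks. The fit is great but the leggings run small."
)
summary = textrank_summary(sample, n_sentences=2)
print(summary)
```

Unlike BART, this approach can only return sentences that already appear in the reviews, which makes it easier to trace but less fluent.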



----

# Review Text Analysis: Generic Insights & Strategic Recommendations

This section illustrates the kinds of insights, and the data-driven actions they support, that a typical Spark NLP + topic-modeling + summarization workflow over product reviews can produce.

---

## 1. Volume & Sentiment Snapshot
- **Top Categories by Review Count**  
  - Apparel, Electronics, Home & Kitchen, Health & Beauty typically drive 70–80% of total review volume.
- **Overall Sentiment Distribution**  
  - **Positive (4–5 stars):** ~60%, with praise for comfort, reliability, design  
  - **Neutral (3 stars):** ~15–25%, with moderate satisfaction and minor usability notes  
  - **Negative (1–2 stars):** ~20–25%, with common complaints around fit, durability, shipping delays  
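
The star-rating buckets above can be computed directly from the `rating` column. The sketch below shows the bucketing in plain Python, assuming whole-star ratings; the function names and sample ratings are hypothetical and the resulting shares are not taken from the dataset.

```python
from collections import Counter

def sentiment_bucket(rating):
    """Map a star rating to the buckets used above: 4-5 positive, 3 neutral, 1-2 negative."""
    if rating >= 4:
        return "positive"
    if rating == 3:
        return "neutral"
    return "negative"

def sentiment_shares(ratings):
    """Return the fraction of ratings in each bucket (all zeros for an empty input)."""
    buckets = ("positive", "neutral", "negative")
    total = len(ratings)
    if total == 0:
        return {b: 0.0 for b in buckets}
    counts = Counter(sentiment_bucket(r) for r in ratings)
    return {b: counts.get(b, 0) / total for b in buckets}

# Hypothetical ratings, stored as floats like the rating column
ratings = [5.0, 4.0, 4.0, 3.0, 1.0, 2.0, 5.0, 5.0, 3.0, 4.0]
shares = sentiment_shares(ratings)
print(shares)
```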

---

## 2. Dominant Themes (LDA Topics)
| Category            | Top Topics                                              |
|---------------------|---------------------------------------------------------|
| **Apparel**         | Fit & sizing, fabric quality, color accuracy            |
| **Electronics**     | Battery life, ease of setup, audio/performance issues   |
| **Home & Kitchen**  | Assembly instructions, material durability, finish      |
| **Health & Beauty** | Texture consistency, ingredient transparency, packaging |

---

## 3. Representative Summaries
- **Abstractive (BART) Example:**  
  > “Customers love the softness and stretch of these leggings, though many recommend sizing up for the perfect fit.”

---

## 4. Actionable Recommendations

### a. Product & Design
- **Size Guidance:**  
  - Add “Runs Small” or “True to Size” badges based on topic-model flags.  
  - Update size charts with real user measurements.
- **Material Upgrades:**  
  - Introduce premium fabric variants for high-volume apparel items to address pilling/stretch concerns.
- **Instructional Content:**  
  - Revise assembly manuals for home goods: include clearer diagrams and step-by-step videos.

### b. Marketing & Merchandising
- **Customer Quotes in Listings:**  
  - Surface high-impact snippets (“I love the comfort and stretch!”) to boost social proof.  
- **Dynamic Bundling:**  
  - Cross-sell complementary SKUs mentioned together in reviews (e.g. leggings + matching sports bra).
- **Targeted Campaigns:**  
  - Segment email lists by sentiment or topic (“Fit Issues” → “Extended-Size Run Preview”).

### c. Customer Experience
- **NLP Chatbot for FAQs:**  
  - Deploy a chatbot trained on cleaned review tokens to answer pre-purchase questions about fit, features, and setup.
- **Review Triage:**  
  - Prioritize customer service outreach on high-lift complaint topics (e.g. “arrived damaged,” “battery won’t hold”).

### d. Roadmap & Inventory
- **Feature Development:**  
  - Invest in faster-charging modules for electronics if “slow charge” topics persist.  
- **SKU Rationalization:**  
  - Consider phasing out products with repeated negative themes (e.g. durability failures).

---

> **Note:** You can refine these insights further by adjusting the LDA topic count, the star-rating and minimum-review filters, and the summary-length parameters, enabling deeper dives into seasonal trends, niche product lines, or evolving customer priorities.  


## 8. Clean-Up

In [None]:
spark.stop()